Reset all


Content Types


AID systems



Data access

Data access restrictions

Database access

Database access restrictions

Database licenses

Data licenses

Data upload

Data upload restrictions

Enhanced publication

Institution responsibility type

Institution type


Metadata standards

PID systems

Provider types

Quality management

Repository languages



Repository types


  • * at the end of a keyword allows wildcard searches
  • " quotes can be used for searching phrases
  • + represents an AND search (default)
  • | represents an OR search
  • - represents a NOT operation
  • ( and ) implies priority
  • ~N after a word specifies the desired edit distance (fuzziness)
  • ~N after a phrase specifies the desired slop amount
Found 33 result(s)
The Language Bank features text and speech corpora with different kinds of annotations in over 60 languages. There is also a selection of tools for working with them, from linguistic analyzers to programming environments. Corpora are also available via web interfaces, and users can be allowed to download some of them. The IP holders can monitor the use of their resources and view user statistics.
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors (many of them the leading authorities on the subject).
Codex Sinaiticus is one of the most important books in the world. Handwritten well over 1600 years ago, the manuscript contains the Christian Bible in Greek, including the oldest complete copy of the New Testament. The Codex Sinaiticus Project is an international collaboration to reunite the entire manuscript in digital form and make it accessible to a global audience for the first time. Drawing on the expertise of leading scholars, conservators and curators, the Project gives everyone the opportunity to connect directly with this famous manuscript.
Currently, the IMS repository focuses on resources provided by the Institute for Natural Language Processing in Stuttgart (IMS) and other CLARIN-D related institutions such as the local Collaborative Research Centre 732 (SFB 732) as well as institutions and/or organizations that belong to the CLARIN-D extended scientific community. Comprehensive guidelines and workflows for submission by external contributors are being compiled based on the experiences in archiving such in-house resources.
OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.
CHILDES is the child language component of the TalkBank system. TalkBank is a system for sharing and studying conversational interactions.
LINDAT/CLARIN is designed as a Czech “node” of Clarin ERIC (Common Language Resources and Technology Infrastructure). It also supports the goals of the META-NET language technology network. Both networks aim at collection, annotation, development and free sharing of language data and basic technologies between institutions and individuals both in science and in all types of research. The Clarin ERIC infrastructural project is more focused on humanities, while META-NET aims at the development of language technologies and applications. The data stored in the repository are already being used in scientific publications in the Czech Republic.
D-PLACE contains cultural, linguistic, environmental and geographic information for over 1400 human ‘societies’. A ‘society’ in D-PLACE represents a group of people in a particular locality, who often share a language and cultural identity. All cultural descriptions are tagged with the date to which they refer and with the ethnographic sources that provided the descriptions. The majority of the cultural descriptions in D-PLACE are based on ethnographic work carried out in the 19th and early-20th centuries (pre-1950).
Content type(s)
MICASE provides a collection of transcripts of academic speech events recorded at the University of Michigan. The original DAT audiotapes are held in the English Language Institute and may be consulted by bona fide researchers under special arrangements
The German Text Archive (Deutsches Textarchiv, DTA) presents online a selection of key German-language works in various disciplines from the 17th to 19th centuries. The electronic full-texts are indexed linguistically and the search facilities tolerate a range of spelling variants. The DTA presents German-language printed works from around 1650 to 1900 as full text and as digital facsimile. The selection of texts was made on the basis of lexicographical criteria and includes scientific or scholarly texts, texts from everyday life, and literary works. The digitalisation was made from the first edition of each work. Using the digital images of these editions, the text was first typed up manually twice (‘double keying’). To represent the structure of the text, the electronic full-text was encoded in conformity with the XML standard TEI P5. The next stages complete the linguistic analysis, i.e. the text is tokenised, lemmatised, and the parts of speech are annotated. The DTA thus presents a linguistically analysed, historical full-text corpus, available for a range of questions in corpus linguistics. Thanks to the interdisciplinary nature of the DTA Corpus, it also offers valuable source-texts for neighbouring disciplines in the humanities, and for scientists, legal scholars and economists.
The Australian National Corpus collates and provides access to assorted examples of Australian English text, transcriptions, audio and audio-visual materials. Text analysis tools are embedded in the interface allowing analysis and downloads in *.CSV format.
>>>>> As of 01/12/2015, deposit of data on SLDR website will be suspended to allow the public opening of Ortolang platform .<<<<<Speech & Language Data Repository (SLDR) is a Trusted Data Repository offering labs and scholars a free-of-charge service for sharing their oral/linguistic data and archiving it with the help of procedures compliant with the OAIS model for long-term preservation. Its entire storage is referenced in international repositories such as OLAC (Open Language Archives Community) and CLARIN Virtual Language Observatory (VLO). Currently, packages are distributed via the TGE-Adonis grid, now integrated in Huma-Num, hosted by CC-IN2P3 and preserved on the platform of CINES, an institutional archive beneficiary of the Data Seal of Approval. is a Danish IT infrastructure intended for use by humanities scholars. The infrastructure includes digitized research material in the form of written and spoken texts, audio and video records, lexical resources and tools. Part of the resources collected or converted to other formats as part of the project. Other resources developed or collected in other projects and made available to researchers through The website is regularly updated with new materials and tools. The vision is to create the humanities researcher's toolbox through the creation of resources with associated tools and integrating resources together in a web-based electronic research environment provided for humanities researchers. Such access to resources and tools will allow scientists unprecedented opportunities and will also help to enhance their ability to participate in European collaborative projects. The Danish CLARIN project will eventually generate better conditions for Danish language technology research and development.
Content type(s)
The Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) is a CLARIN partner institution and has been an officially certified CLARIN service center since June 20th, 2013. The CLARIN center at the BBAW focuses on historical text corpora (predominantly provided by the 'Deutsches Textarchiv'/German Text Archive, DTA) as well as on lexical resources (e.g. dictionaries provided by the 'Digitales Wörterbuch der Deutschen Sprache'/Digital Dictionary of the German Language, DWDS).
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. It was formed in 1992 to address the critical data shortage then facing language technology research and development. Initially, LDC's primary role was as a repository and distribution point for language resources. Since that time, and with the help of its members, LDC has grown into an organization that creates and distributes a wide array of language resources. LDC also supports sponsored research programs and language-based technology evaluations by providing resources and contributing organizational expertise. LDC is hosted by the University of Pennsylvania and is a center within the University’s School of Arts and Sciences.
Språkbanken (the Swedish Language Bank) was established in 1975 as a national center located in the Faculty of Arts, University of Gothenburg. Alléns groundbreaking corpus linguistic research resulted in the creation of one of the first large electronic text corpora in another language than English, with one million words of newspaper text. The task of Språkbanken is to collect, develop, and store (Swedish) text corpora, and to make linguistic data extracted from the corpora available to researchers and to the public.
The Alaska Native Language Archive houses documentation of the various Native languages of Alaska and helps to preserve and cultivate this unique heritage for future generations. As the premier repository worldwide for information relating to the Native languages of Alaska, the Archive serves researchers, teachers and students, as well as members of the broader community. The collection includes both published and unpublished materials in or on all of the Alaska Native languages and related languages. The collection has enduring cultural, historic, and intellectual value, particularly for Alaska Native language speakers and their descendants
The repository of the Hamburg Centre for Speech Corpora is used for archiving, maintenance, distribution and development of spoken language corpora. These usually consist of audio and / or video recordings, transcriptions and other data and structured metadata. The corpora treat the focus on multilingualism and are generally freely available for research and teaching. Most of the measures maintained by the HZSK corpora were created in the years 2000-2011 in the framework of the SFB 538 "Multilingualism" at the University of Hamburg. The HZSK however also strives to take linguistic data from other projects or contexts, and to provide also the scientific community for research and teaching are available, provided that they are compatible with the current focus of HZSK, ie especially spoken language and multilingualism.
Polish CLARIN node – CLARIN-PL Language Technology Centre – is being built at Wrocław University of Technology. The LTC is addressed to scholars in the humanities and social sciences. Registered users are granted free access to digital language resources and advanced tools to explore them. They can also archive and share their own language data (in written, spoken, video or multimodal form).
Content type(s)
CLARIN Centre Vienna (CCV) is Austria’s main connection point to the European network of CLARIN Centres. It is an Austrian contribution to CLARIN-ERIC and being hosted by the Austrian Centre for Digital Humanities of the Austrian Academy of Sciences (ACDH-OEAW). It is jointly funded by the Academy and the Federal Ministry of Science, Research and Economy. If you have language resources, would like to share these with the scientifc community (and/or the public) and want to make sure that the data will be around in the future, contact us. We offer archiving and online availability of the resources. If needed, we will assist you in converting data and metadata into required formats. CCV is embedded in the Digital Humanities Austria (DHA) initiative which has started in January 2014. DHA represents the umbrella under which the DH infrastructure activities CLARIN and DARIAH are conducted in Austria.
Created in 2005 by the CNRS, CNRTL unites in a single portal, a set of linguistic resources and tools for language processing. The CNRTL includes the identification, documentation (metadata), standardization, storage, enhancement and dissemination of resources. The sustainability of the service and the data is guaranteed by the backing of the UMR ATILF (CNRS - Université Nancy), support of the CNRS and its integration in the excellence equipment project ORTOLANG .
The Language Archive is storing a lot of unique material, from a large variety of languages worldwide, which is recorded and analyzed by researchers from different linguistic disciplines. Data creation, management and exploration tools. Archiving and software expertise for the Digital Humanities.
The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. We also give advice on the creation and use of these resources, and are involved in the development of standards and infrastructure for electronic language resources.