  • * at the end of a keyword allows wildcard searches
  • " quotes can be used for searching phrases
  • + represents an AND search (default)
  • | represents an OR search
  • - represents a NOT operation
  • ( and ) imply priority
  • ~N after a word specifies the desired edit distance (fuzziness)
  • ~N after a phrase specifies the desired slop amount
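The operators listed above can be combined into a single query string. The following is a minimal sketch in Python, not part of the search site itself: the helper names (phrase, wildcard, fuzzy, all_of, any_of, none_of) are hypothetical, and how the search box treats whitespace around operators is an assumption.

```python
from typing import Optional


def phrase(text: str, slop: Optional[int] = None) -> str:
    """Quote a phrase; an optional ~N suffix sets the slop amount."""
    quoted = f'"{text}"'
    return quoted if slop is None else f"{quoted}~{slop}"


def wildcard(keyword: str) -> str:
    """Append * to the end of a keyword for a wildcard search."""
    return f"{keyword}*"


def fuzzy(word: str, distance: int) -> str:
    """Append ~N to a word to set the desired edit distance."""
    return f"{word}~{distance}"


def all_of(*terms: str) -> str:
    """Join terms with + (AND, which the list above marks as the default)."""
    return " +".join(terms)


def any_of(*terms: str) -> str:
    """Join terms with | (OR) and group them with parentheses for priority."""
    return "(" + " | ".join(terms) + ")"


def none_of(term: str) -> str:
    """Prefix a term with - (NOT)."""
    return f"-{term}"


# Hypothetical example: the phrase "data mining" with slop 2, AND either a
# keyword starting with "genom" or a near match of "protein", NOT "cancer".
query = all_of(
    phrase("data mining", slop=2),
    any_of(wildcard("genom"), fuzzy("protein", 1)),
) + " " + none_of("cancer")

print(query)  # "data mining"~2 +(genom* | protein~1) -cancer
```

The resulting string is meant to be pasted into the search box; building it from small helpers only keeps quoting and grouping consistent.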
Found 36 result(s)
OpenML is an open ecosystem for machine learning. By organizing all resources and results online, research becomes more efficient, useful and fun. OpenML is a platform to share detailed experimental results with the community at large and organize them for future reuse. Moreover, it will be directly integrated in today’s most popular data mining tools (for now: R, KNIME, RapidMiner and WEKA). Such an easy and free exchange of experiments has tremendous potential to speed up machine learning research, to engender larger, more detailed studies and to offer accurate advice to practitioners. Finally, it will also be a valuable resource for education in machine learning and data mining.
Social Computing Data Repository hosts data from a collection of many different social media sites, most of which have blogging capacity. Some of the prominent social media sites included in this repository are BlogCatalog, Twitter, MyBlogLog, Digg, StumbleUpon, MySpace, LiveJournal, The Unofficial Apple Weblog (TUAW), Reddit, etc. The repository contains various facets of blog data, including blog site metadata such as user-defined tags, predefined categories, and blog site description; blog post level metadata such as user-defined tags and date and time of posting; blog posts; blog post mood (defined as the blogger's emotions when (s)he wrote the blog post); blogger name; blog post comments; and blogger social network.
Archiving data and housing geological collections is an important role the Bureau of Geology plays in improving our understanding of the geology of New Mexico. Aside from our numerous publications, several datasets are available to the public. Data in this repository supplements published papers in our publications. Please refer to both the published material and the repository documentation before using this data. Please cite repository data as shown in each repository listing.
SoyBase is a professionally curated repository for genetics, genomics and related data resources for soybean. It contains current genetic, physical and genomic sequence maps integrated with qualitative and quantitative traits. SoyBase includes annotated "Williams 82" genomic sequence and associated data mining tools. The repository maintains controlled vocabularies for soybean growth, development, and traits that are linked to more general plant ontologies.
OGSEarth provides geoscience data, collected by the Mines and Minerals division, which can be viewed using user-friendly geographic information programs such as Google Earth™. OGSEarth provides data on Mining claims, Geology, Index maps, Administrative boundaries and Abandoned mines.
The University of Pittsburgh English Language Institute Corpus (PELIC) is a 4.2-million-word learner corpus of written texts. These texts were collected in an English for Academic Purposes (EAP) context over seven years in the University of Pittsburgh’s Intensive English Program, and were produced by over 1100 students with a wide range of linguistic backgrounds and proficiency levels. PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting.
Note (as stated 2017-05-23): Cancer GEnome Mine is no longer available. Cancer GEnome Mine is a public database for storing clinical information about tumor samples and microarray data, with emphasis on array comparative genomic hybridization (aCGH) and data mining of gene copy number changes.
The British Geological Survey (BGS), the world’s oldest national geological survey, has over 400 datasets including environmental monitoring data, digital databases, physical collections (borehole core, rocks, minerals and fossils), records and archives.
The India Environment Portal provides open access to information about environmental and developmental issues in India. The Portal aggregates and presents data from research institutions, government bodies, NGOs, universities, the mass media, and experts across various issues of environmental management.
The database aims to bridge the gap between agent repositories and studies documenting the effect of antimicrobial combination therapies. Most notably, our primary aim is to compile data on the combination of antimicrobial agents, namely natural products such as AMP. To meet this purpose, we have developed a data curation workflow that combines text mining, manual expert curation and graph analysis and supports the reconstruction of AMP-Drug combinations.
Repository for New Mexico Experimental Program to Stimulate Competitive Research Data Collection. Provides access to data generated by the Energize New Mexico project as well as data gathered in our previous project that focused on Climate Change Impacts (RII 3). NM EPSCoR contributes its data to the DataONE network as a member node.
AmoebaDB belongs to the EuPathDB family of databases and is an integrated genomic and functional genomic database for Entamoeba and Acanthamoeba parasites. In its first iteration (released in early 2010), AmoebaDB contains the genomes of three Entamoeba species (see below). AmoebaDB integrates whole genome sequence and annotation and will rapidly expand to include experimental data and environmental isolate sequences provided by community researchers. The database includes supplemental bioinformatics analyses and a web interface for data-mining.
The World Bank recognizes that transparency and accountability are essential to the development process and central to achieving the Bank’s mission to alleviate poverty. The Bank’s commitment to openness is also driven by a desire to foster public ownership, partnership and participation in development from a wide range of stakeholders. As a knowledge institution, the World Bank’s first step is to share its knowledge freely and openly.
A machine learning data repository with interactive visual analytic techniques. This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web. State-of-the-art statistical techniques are combined with real-time data visualization, giving researchers the ability to seamlessly find, explore, understand, and discover key insights in a large number of publicly donated data sets. This large, comprehensive collection of data is useful for making significant research findings and provides benchmark data sets for a wide variety of applications and domains. It includes relational, attributed, heterogeneous, streaming, spatial, and time series data as well as non-relational machine learning data. All data sets are easily downloaded in a standard, consistent format. We have also built a multi-level interactive visual analytics engine that allows users to visualize and interactively explore the data in a free-flowing manner.
CryptoDB is an integrated genomic and functional genomic database for the parasite Cryptosporidium and other related genera. CryptoDB integrates whole genome sequence and annotation along with experimental data and environmental isolate sequences provided by community researchers. The database includes supplemental bioinformatics analyses and a web interface for data-mining.
MicrosporidiaDB belongs to the EuPathDB family of databases and is an integrated genomic and functional genomic database for the phylum Microsporidia. In its first iteration (released in early 2010), MicrosporidiaDB contains the genomes of two Encephalitozoon species (see below). MicrosporidiaDB integrates whole genome sequence and annotation and will rapidly expand to include experimental data and environmental isolate sequences provided by community researchers. The database includes supplemental bioinformatics analyses and a web interface for data-mining.
Go-Geo is an online resource discovery tool which allows for the identification and retrieval of records describing the content, quality, condition and other characteristics of geospatial data that exist within UK tertiary education and beyond. The portal supports geospatial searching by interactive map, grid co-ordinates and place name, as well as the more traditional topic or keyword forms of searching. The portal is a key component of the UK academic Spatial Data Infrastructure.
This database serves forest tree scientists by providing online access to hardwood tree genomic and genetic data, including assembled reference genomes, transcriptomes, and genetic mapping information. The web site also provides access to tools for mining and visualization of these data sets, including BLAST for comparing sequences, JBrowse for browsing genomes, Apollo for community annotation and Expression Analysis to build gene expression heatmaps.
The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. Data available from TAIR includes the complete genome sequence along with gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community. Gene product function data is updated every two weeks from the latest published research literature and community data submissions. Gene structures are updated 1-2 times per year using computational and manual methods as well as community submissions of new and updated genes. TAIR also provides extensive linkouts from our data pages to other Arabidopsis resources.
FungiDB belongs to the EuPathDB family of databases and is an integrated genomic and functional genomic database for the kingdom Fungi. FungiDB was first released in early 2011 as a collaborative project between EuPathDB and the group of Jason Stajich (University of California, Riverside). At the end of 2015, FungiDB was integrated into the EuPathDB bioinformatic resource center. FungiDB integrates whole genome sequence and annotation and also includes experimental and environmental isolate sequence data. The database includes comparative genomics, analysis of gene expression, supplemental bioinformatics analyses, and a web interface for data-mining.
Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. SNAP is also available through NodeXL, a graphical front-end that integrates network analysis into Microsoft Office and Excel. The SNAP library has been actively developed since 2004 and is growing organically as a result of our research pursuits in the analysis of large social and information networks. The largest network we have analyzed so far using the library was the Microsoft Instant Messenger network from 2006, with 240 million nodes and 1.3 billion edges. The datasets available on the website were mostly collected (scraped) for the purposes of our research. The website was launched in July 2009.
The National Mine Map Repository (NMMR) collects, maintains, and provides U.S. coal and non-coal mine maps to individuals and to the public and private sectors. NMMR mine maps and data are searchable and indexed by state, county, company name, and mine name. Accessing NMMR mine maps and data requires contacting NMMR. NMMR has a diverse customer population and has provided data to efforts supporting industrial and commercial development, highway construction, and the preservation of public health, safety and welfare.
The focus of PolMine is on texts published by public institutions in Germany. Corpora of parliamentary protocols are at the heart of the project: Parliamentary proceedings are available for long stretches of time, cover a broad set of public policies and are in the public domain, making them a valuable text resource for political science. The project develops repositories of textual data in a sustainable fashion to suit the research needs of political science. Concerning data, the focus is on converting text issued by public institutions into a sustainable digital format (TEI/XML).
The Database explores the interactions of chemicals and proteins. It integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug-target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. STITCH further allows exploring the network of chemical relations, also in the context of associated binding proteins.