DATA 301: Introduction to Data Science

Course Project

Dataset sources

The links below point to places where you can discover a variety of existing datasets.

Machine Learning Dataset Collections

We start with a list of machine learning dataset repositories.

UCI Machine Learning Dataset Repository [HTML] A repository of over 100 datasets used in ML research.
SNAP: Stanford Large Network Dataset Collection [HTML] A curated collection of social network and other graph data
Kaggle [HTML] A collection of open source datasets used for data science
KDNuggets [HTML] An aggregator of dataset repositories
Datasets.co [HTML] A small collection of well-known datasets
KDD Cup Datasets [HTML] Datasets used in competitions run by the KDD (Knowledge Discovery and Data Mining) conference
Data Sets for data mining [HTML] A collection of classical data mining datasets from the University of Edinburgh

Government and Business Datasets
Data.gov [HTML] US Government data portal
US Census Bureau [HTML] Demographic data on the US population
UK Government data [HTML] UK Government Data Portal

Natural Language Processing Datasets
Microsoft NLP data [HTML] Corpora for natural language processing tasks released by Microsoft Research
Stanford NLP data [HTML] Datasets and software released by the Stanford NLP research group
Cornell NLP data [HTML] Collection of NLP corpora from Cornell University