The links below point to places where you can find a variety of existing datasets.
Machine Learning Dataset Collections
We start with a list of machine learning dataset repositories.
| UCI Machine Learning Dataset Repository | [HTML] | A repository of over 100 datasets used in ML research |
| SNAP: Stanford Large Network Dataset Collection | [HTML] | A curated collection of social network and other graph data |
| Kaggle | [HTML] | A collection of open source datasets used for data science |
| KDNuggets | [HTML] | An aggregator of dataset repositories |
| Datasets.co | [HTML] | A small collection of well-known datasets |
| KDD Cup Datasets | [HTML] | Datasets used in competitions run by the KDD (Knowledge Discovery and Data Mining) conference |
| Data Sets for data mining | [HTML] | A collection of classical data mining datasets from the University of Edinburgh |
Government and Business Datasets
| Data.gov | [HTML] | US Government data portal |
| US Census Bureau | [HTML] | Demographic data on US population |
| UK Government data | [HTML] | UK Government data portal |
Natural Language Processing Datasets
| Microsoft NLP data | [HTML] | Corpora for natural language processing tasks released by Microsoft Research |
| Stanford NLP data | [HTML] | Datasets and software released by the Stanford NLP research group |
| Cornell NLP data | [HTML] | Collection of NLP corpora from Cornell University |