The course is assessed through 2 assignments. Each assignment is carried out by a group of 3 to 5 students.
Students can choose any of the projects listed below or propose a new one. Note that these projects are not set in stone; they are meant as a starting point. Students are encouraged to propose new or derivative projects (but let me know beforehand).
- Cryptocurrency price analysis. Obtain historical data through the coinmarketcap API, the blockchain.info API or web scraping, and analyze the resulting dataset. Potential tasks:
  - Identify daily, weekly and yearly trends.
  - Confirm or refute a correlation between the price of Bitcoin and stock market prices.
  - Correlate the price with news sentiment.
- Network analysis in social media. Generate a network graph of Twitter followers. Suggestion: use the Twitter API and NetworkX.
- Cryptocurrency market visualization and clustering. Visualize the influence of different cryptocurrencies. You can use this scikit-learn example on visualizing the market structure as a starting point.
- Stock market and Twitter analysis. Study which terms are associated with which company. Dataset: NASDAQ100
- Analysis of geolocalized tweets. Use a geolocalized social network dataset such as this one to visualize the activity on a map. See also the Facebook friendship map.
- Trust and anonymity in Bitcoin. Identify trends in the Bitcoin OTC social network dataset. See the related publications for tasks that can be explored in this dataset: Signed Networks in Social Media
- Adverse food events. The CFSAN Adverse Event Reporting System (CAERS) is a database that contains information on adverse event and product complaint reports submitted to the FDA for foods, dietary supplements, and cosmetics. Questions: What are the most commonly reported foodstuffs? What are the most commonly reported medical reactions to foods? Where do people in the US most commonly report food-related conditions?
- World Atlas of Language Structures. There are over 7,000 human languages in the world. The World Atlas of Language Structures (WALS) contains information on the structure of 2,679 of them. Goal: make an interactive map that shows different linguistic features.
- San Francisco beach water quality. Analyze and identify patterns in the concentration of bacteria at 15 stations around San Francisco. Dataset: SF Beaches Water Quality
- San Francisco yearly budget. The city of San Francisco has published an open dataset with the annual budget. Using this dataset, describe how the budget is spent, and formulate recommendations for upcoming years.
- Cdiscount image classification challenge. Participate in the Kaggle Cdiscount Image classification challenge. Given its complexity, this project is expected to take the full duration of the course.
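To give an idea of what the trend and correlation tasks in the cryptocurrency price project might look like, here is a minimal sketch using only the Python standard library. The helper names `weekday_means` and `pearson` are hypothetical, and the two price series are made-up placeholder values, not real market data; in an actual project they would come from whichever API or scraper the group chooses.

```python
from collections import defaultdict
from datetime import date, timedelta
from math import sqrt


def weekday_means(prices):
    """Average price per weekday (0 = Monday) from a {date: price} mapping."""
    buckets = defaultdict(list)
    for day, price in prices.items():
        buckets[day.weekday()].append(price)
    return {wd: sum(vals) / len(vals) for wd, vals in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder data (an assumption for illustration): 14 days of invented prices.
start = date(2017, 10, 2)
btc = {start + timedelta(days=i): 4000 + 50 * i for i in range(14)}
index = {start + timedelta(days=i): 6000 + 10 * i for i in range(14)}

# Align the two series on the dates they share before correlating them.
days = sorted(btc.keys() & index.keys())
r = pearson([btc[d] for d in days], [index[d] for d in days])
means = weekday_means(btc)
```

Correlating raw price levels, as done here for brevity, can be misleading when both series trend upward; correlating daily returns is usually a sounder choice.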
First assignment
Deadline: October 10th.
The goal of this assignment is to gather, clean and present a dataset.
Second assignment
Deadline: November 28th.
Goal: Build, deploy and compare machine learning models at scale.
Second assignment grading. The second assignment will be graded out of a total of 5 points and will have double weight in the final grade.
- Quality and clarity of presentation: 1pt.
- Quality of visualization in presentation and report: 1pt.
- Quality of predictive model and soundness of proposed approach: 1pt.
- Quality and clarity of written report: 2pt.
Final grade. min(0.4 * first assignment + 0.6 * second assignment + extra points, 1.0)
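As a worked illustration of the final-grade formula, the sketch below assumes that both assignment grades are first normalized to the range 0 to 1 (for instance by dividing the 5-point score of the second assignment by 5). That normalization and the function name `final_grade` are assumptions made for this example, not something stated above.

```python
def final_grade(first, second, extra=0.0):
    """Weighted final grade, capped at 1.0.

    `first` and `second` are assumed to be normalized to [0, 1];
    `extra` holds any extra points on the same scale.
    """
    return min(0.4 * first + 0.6 * second + extra, 1.0)


# Hypothetical example: 0.8 on the first assignment, 4 out of 5 on the second.
grade = final_grade(0.8, 4 / 5)

# Extra points cannot push the result above the cap.
capped = final_grade(1.0, 1.0, extra=0.2)
```

With these hypothetical inputs the uncapped grade is 0.4 * 0.8 + 0.6 * 0.8 = 0.8, while the second call shows that the `min` cap keeps the result at 1.0.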