Technological advances in data gathering and storage have led to a rapid proliferation of large amounts of data in diverse areas such as climate studies, cosmology, medicine, Web data processing, and engineering. Making sense of this data deluge requires a set of skills that have become fundamental in any major corporation and in almost any scientific discipline.

Learning objectives

From web scraping and data wrangling to advanced topics like machine learning and deep learning, in this course you will learn a variety of skills by working on real examples.

Gather and organize. Learn how to use Python to gather and organize data programmatically and prepare it for deeper analysis. Key technologies: Beautiful Soup, Pandas.
Analyze. Discover patterns and trends lurking in the data and extract conclusions. Key technologies: NumPy, SciPy, Pandas.
Model. From linear regression to deep learning, learn to model complex phenomena and use data to automate decisions. Key technologies: scikit-learn, Keras, PyTorch, TensorFlow.
Report. Display information visually and communicate your findings in a way that is clear and compelling. Key technologies: Matplotlib, Jupyter, Bokeh.

Instructors

Fabian Pedregosa, <f@bianp.net>, Postdoctoral Researcher, UC Berkeley.

Laurent El Ghaoui, Professor, UC Berkeley.

Teaching assistant: Bowen Yin Wang

Invited speaker (September 26th): Nelle Varoquaux, Postdoctoral Researcher, Berkeley Institute for Data Science.

Office Hours: Office hours are available for students who need further clarification of concepts presented in lecture, or have made solid attempts on the homework assignment or other practice problems and require further assistance understanding how to approach such problems.

Office hours are usually held on Fridays from 15h to 17h. The Google Calendar below always has the latest information.

Course requirements

This course requires only undergraduate-level knowledge of statistics, linear algebra, and calculus. I will assume a basic understanding of the Python language.

Students are required to bring a laptop to the lab sessions. Please come with a Python distribution such as Anaconda installed to minimize setup time. If you would like to attend but you don’t have access to computing resources, contact me and we will figure something out.

Course Schedule

The course is organized in sessions of 3 hours, from 14h to 17h. Each session is split into 1h of theory (presentation and whiteboard) and 1h45 of lab practice, with a 15-minute break in between. The course will take place in Sutardja Dai Hall, room 250.

Session 1 (September 12th): Foundations of data science

Pioneers of data science. Introduction to regression models. Dimensionality and structured models. Model selection and bias-variance tradeoff. Classification.

Slides: part 1, part 2.

Practical session: The Jupyter (formerly IPython) interactive environment. NumPy, Python’s array computing library. As material we will use chapters 1 and 2 of Jake VanderPlas’ excellent Python Data Science Handbook.

Session 2 (September 19th): Analysis of a dataset

Analysis of a dataset. Work in pairs (pair programming). Lecture material

Session 3 (September 26th): Visualization with Matplotlib

This session will be given by invited speaker Nelle Varoquaux. Slides

Session 4 (October 3rd): Introduction to supervised learning

Permutation tests. Logistic regression. Slides: part 1, part 2
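
As a rough illustration of the first topic, here is a minimal sketch of a two-sided permutation test for a difference in means. It is not taken from the course slides: the function name `permutation_test`, its parameters, and the measurements are all made up for this example, and it uses only the Python standard library.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an estimated p-value.
    (Hypothetical helper, written for illustration only.)
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so we reassign them at random by shuffling the pooled data.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up measurements, for illustration only.
treated = [12.1, 11.8, 13.0, 12.6, 12.9]
control = [10.2, 10.9, 11.1, 10.5, 10.8]
observed, p_value = permutation_test(treated, control)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")
```

The p-value is the fraction of random relabelings whose difference in means is at least as large, in absolute value, as the observed one. A small value suggests the observed difference would be unusual if the two groups came from the same distribution.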

Session 5 (October 10th): First student presentation

In this session, students will present their first assignment: a 15-minute presentation (10 minutes of presentation plus 5 minutes of questions) on a project of their choice.

Session 6 (October 17th): Supervised learning

Supervised learning models. Overfitting. Model selection. Class material.

Session 7 (October 24th): Unsupervised learning

Clustering, dimensionality reduction, feature extraction. Class material

Session 8 (October 31st): Introduction to deep learning

Deep convolutional networks. Pretrained networks. Library used: Keras. Class material

Session 9 (November 7th): Practical machine learning

Working with text and time series. Class material

Session 10 (November 14th): Generative Adversarial Networks (GANs)

Tutorial on Generative Adversarial Networks (GANs)

Session 11 (November 21st): Coding sprint

We will contribute to scikit-learn by fixing issues and improving the code.

Session 12 (December 1st): Final milestone

Final presentation of projects.

Rules for Success (Student Responsibilities)

This course has three important rules. If you choose to follow these rules, your odds of learning the material and earning a good grade in this class will improve greatly.

Work. To succeed in this class, you must choose to do your very best on all your assignments. See the course Assignments for additional information on completing assignments.
Participate actively. To succeed in this class, you must choose to stay focused and involved, offering your best comments, questions, and answers. This is a seminar class, not a lecture class – active discussion is expected of all students.
Respect. You will be exposed to a variety of viewpoints, values and opinions in college that will differ from your own. All students in this class should feel comfortable expressing their viewpoints and concerns in class. You are an important part of creating an atmosphere that makes this possible. This applies to me too!

Instructor Responsibilities

What you can expect from me:

Attend every class period and arrive to class on time.
Provide access to quality learning material adapted to the level and background of all students.
Use a variety of teaching techniques and modalities to accommodate different learning styles.
Return written assignments in class and online in a timely fashion and provide helpful feedback.
Come to class with a positive and friendly attitude.
Be respectful of your ideas and value the diversity you bring to the classroom.
Be open to dialogue that challenges me.
Answer any appropriate questions you may have.
Be present during my stated office hours.

Bibliography

General Data Science

Ani Adhikari and John DeNero, Inferential thinking

Jake VanderPlas, Python Data Science Handbook

Trevor Hastie and Robert Tibshirani, Statistical Learning, Stanford MOOC on data science and machine learning. I will be reusing some of their slides.

Joel Grus, Data science from scratch: First principles with Python, O’Reilly Media, 2015.

Wes McKinney, Python for data analysis, O’Reilly Media, 2013.

Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning

Machine Learning

Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

Andreas Müller and Sarah Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists

Learn2Launch Data Science

Website for UC Berkeley's Learn2Launch Data Science 2017 course