Data Science Team Companion Course

When: Tuesdays 5-6pm
Where: TBA (Spring 2017)
Professor: Rafael Frongillo
Team website: http://codata.colorado.edu/
Mailing List: subscribe via the team page.
Prerequisites: linear algebra or permission of instructor.

Course Description

Gives students hands-on experience applying data science techniques and machine learning algorithms to real-world problems. Students will work in small teams on internal challenges, many of which will be sponsored by local companies and organizations, and will represent the university in larger teams for external challenges at the national and global level, such as those hosted by Kaggle. Students will be expected to participate in both internal and external challenges, attend meetings, and give short presentations to the group when appropriate.

Motivation

Data science is one of the fastest-growing sectors of our economy, and there is great demand for data scientists with practical experience applying statistical techniques and machine learning algorithms to real data. Several courses in the CS curriculum develop these techniques, in the areas of machine learning, statistical modeling, network science, numerical analysis, and data science more broadly, and these courses often include a hands-on project. However, no course focuses specifically on putting this wide range of tools to work on real data and developing intuition for when to apply certain techniques over others. The present course fills this gap, allowing students to work in teams both small and large to solve real-world prediction challenges and to gain experience that is valuable whether they enter the workforce or remain in academia.

Topics

To accompany the roughly tri-weekly prediction challenges hosted by the team, the instructor and the enrolled students will give short presentations on topics relevant to the current competition or data science more broadly. A non-exhaustive list of topics is as follows.

Basic Concepts (2 hours): classification and regression, prediction vs causation, regularization and overfitting.
Algorithms (4 hours): linear regression, logistic regression, support vector machines, boosting, decision trees and forests, neural networks, gradient and stochastic gradient descent.
Practical Techniques (4 hours): ensemble methods and aggregation, tradeoffs in regularization, parameter and hyperparameter tuning, data imputation techniques, and cross-validation.
Software and Tools (2 hours): tutorials on several modern data science software packages; as of this writing, these include scikit-learn, pandas, Vowpal Wabbit, and XGBoost.
Context and Industry Practice (3 hours): via weekly presentations from practicing data scientists, students will learn about techniques actually used in industry and academia, and which algorithms work well for which problems.

Assessment

The grade in this course will be broken down as follows:

Participation in meetings and competitions (50%)
Short presentation (20%)
Written tutorial (30%)

The participation grade will be based on activity logs and on absolute performance (achieving a prediction accuracy significantly above the baseline) rather than relative performance (a student's rank in the final standings).
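
As a rough illustration of what an absolute performance check could look like, the sketch below compares a submission's accuracy against a majority-class baseline. The function names, the toy labels, and the 0.05 margin are all hypothetical and chosen only for illustration; the course does not specify a baseline model or a threshold.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(train_labels, n_test):
    """Predict the most common training label for every test example."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_test


def beats_baseline(predictions, test_labels, train_labels, margin=0.05):
    """Absolute check: is accuracy at least `margin` above the baseline?

    `margin` is a hypothetical threshold, not a value set by the course.
    """
    baseline_preds = majority_baseline(train_labels, len(test_labels))
    baseline_acc = accuracy(baseline_preds, test_labels)
    return accuracy(predictions, test_labels) >= baseline_acc + margin


# Toy data (made up for illustration only).
train_labels = [0, 0, 0, 1, 1]
test_labels = [0, 1, 0, 1, 0, 0, 1, 0]
submission = [0, 1, 0, 1, 0, 1, 1, 0]

print(accuracy(submission, test_labels))
print(beats_baseline(submission, test_labels, train_labels))
```

Because the check depends only on the baseline and not on other students' scores, every participant can pass it, which is the practical difference from a rank-based criterion.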

Our competitions will either be split into advanced and beginner tracks, or combined. The participation score for students in the graduate-level course will be based on their performance in advanced competitions or, when the tracks are combined, on a higher performance bar than the one applied to students in the undergraduate-level course. The required length of the tutorial will be 2 pages for undergraduate-level students and 4 pages for graduate-level students.

For more information about the team, please visit the team website.