READING-NOTE

Machine Learning Intro

Chapter 1: Bird’s Eye View

First, let’s start with the “80/20” of data science…

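Generally speaking, we can break down applied machine learning into the chunks covered by the remaining chapters: exploratory analysis, data cleaning, feature engineering, algorithm selection, and model training.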

In this first chapter, you’ll see how these moving pieces fit together. To make the most of this primer, we suggest the following two tips:

We’ve seen students master this subject 2X faster by first understanding how all the pieces fit together… and then diving deeper. Our trainings all follow this “top-down” approach.

Again, it’s easy to get lost in the weeds at the beginning… so our goal is to see the forest instead of the trees. Don’t worry, we’ll get to the code later.

Go to Chapter 1: Bird’s Eye View

Chapter 2: Exploratory Analysis

Should you develop your product more? Invest in marketing? Hire an accountant? Etc.

In many ways, training an ML model is like growing a startup. You also have too many tactics to choose from:

Should you clean your data more? Engineer features? Test new algorithms? Etc.

There’s a lot of trial and error, so how do you avoid chasing dead ends? The answer is “Exploratory Analysis.” (Which is just fancy-talk for “getting to know” your data.)

Doing this upfront helps you save time and avoid wild goose chases… As a data scientist, you are a commander with limited resources (i.e. time). Exploratory analysis is like sending scouts to learn where to deploy your forces!
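To make “getting to know” your data a bit more concrete, here is a minimal sketch that uses only the Python standard library. The toy `rows` dataset, its column names, and the `summarize` helper are all made up for illustration; a real project would more likely use a dataframe library.

```python
import statistics

# Hypothetical toy dataset: each row is one house listing.
rows = [
    {"sqft": 1200, "price": 250000},
    {"sqft": 1500, "price": 310000},
    {"sqft": 900, "price": 180000},
    {"sqft": 2000, "price": None},  # a missing value we want to spot early
]

def summarize(rows, column):
    """Return basic summary statistics for one numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }

for column in ("sqft", "price"):
    print(column, summarize(rows, column))
```

Even a quick summary like this can flag missing values or suspicious ranges before you spend time on modeling.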

Go to Chapter 2: Exploratory Analysis

Chapter 3: Data Cleaning

Proper data cleaning is the “secret” sauce behind machine learning… Well, it’s not really a “secret”… It’s just a bit boring, so no one really talks about it. But the truth is:

Better data beats fancier algorithms…

(Even if you forget everything else from this primer, please remember this point)

Garbage in = garbage out… plain and simple! If you have a clean dataset, even simple algorithms can learn impressive insights from it!

Now, as you might imagine, different problems will require different methods… For now though, let’s at least ensure we know how to fix the most common issues. This chapter will give you a reliable starting point, regardless of your dataset.
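As a rough illustration, the sketch below handles two issues that we are assuming count as “common” here: exact duplicate rows and missing numeric values. The `clean` helper, the toy rows, and the choice of the median as a fill value are assumptions for this example, not a prescribed method.

```python
import statistics

def clean(rows, numeric_column):
    """Drop exact duplicate rows, then fill missing numbers with the median."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique_rows.append(dict(row))  # copy so the input stays untouched

    known = [r[numeric_column] for r in unique_rows if r[numeric_column] is not None]
    fill_value = statistics.median(known)
    for row in unique_rows:
        if row[numeric_column] is None:
            row[numeric_column] = fill_value
    return unique_rows

# Hypothetical toy data with one duplicate row and one missing price.
rows = [
    {"sqft": 1200, "price": 250000},
    {"sqft": 1200, "price": 250000},
    {"sqft": 900, "price": None},
    {"sqft": 1500, "price": 310000},
]

cleaned = clean(rows, "price")
print(cleaned)
```

Whether filling with the median is appropriate depends on the dataset; sometimes flagging or dropping missing values is the better call.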

Go to Chapter 3: Data Cleaning

Chapter 4: Feature Engineering

In a nutshell, “feature engineering” is creating new model input features from your existing ones.

That doesn’t sound like much… Yet Andrew Ng, former head of Baidu AI and Google Brain, said:

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”

Wow! No pressure, right?

So why is it so difficult and time-consuming?

To start, feature engineering is very open-ended. There are practically unlimited options for new features to create. Plus, you’ll need domain knowledge to add informative features instead of more noise.

This is a skill that you’ll develop with time and practice, but heuristics will give you a head start. Heuristics help you know where to start looking, spark ideas, and get unstuck.
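To show what “creating new features from existing ones” can look like, here is a small sketch. The column names (`beds`, `baths`, `year_built`, `year_sold`) and the two derived features are hypothetical; which features are actually informative depends on your domain.

```python
def add_features(row):
    """Return a copy of the row with two derived features added."""
    new_row = dict(row)
    # Combine two existing columns into one that may be more informative.
    new_row["property_age"] = row["year_sold"] - row["year_built"]
    # Indicator feature: 1 if the listing has exactly 2 beds and 2 baths.
    new_row["two_and_two"] = int(row["beds"] == 2 and row["baths"] == 2)
    return new_row

# Hypothetical listing used only for illustration.
listing = {"beds": 2, "baths": 2, "year_built": 1990, "year_sold": 2015}
print(add_features(listing))
```

Neither new column adds raw information, but each one hands the model a relationship it would otherwise have to discover on its own.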

Go to Chapter 4: Feature Engineering

Chapter 5: Algorithm Selection

Next, we’ll introduce 5 very effective ML algorithms for regression. They each have classification counterparts as well.

Just 5?

Yes. Instead of giving you a long list of algorithms…

…our goal is to explain a few essential concepts (e.g. regularization, ensembling, automatic feature selection) that will teach you why some algorithms tend to perform better than others.

In applied machine learning, individual algorithms should be swapped in and out depending on which performs best for the problem and the dataset.

Therefore, we will focus on intuition and practical benefits over math and theory.
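For a taste of one of those concepts, here is a minimal sketch of regularization, assuming the simplest possible setting: a one-variable linear model with no intercept and an L2 (ridge-style) penalty on its slope. The `ridge_slope` function and the toy data are illustrative only.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w that minimizes sum((y - w*x)**2) + penalty * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Toy data that follows y = 2x exactly.
xs = [1, 2, 3]
ys = [2, 4, 6]

for penalty in (0, 14, 100):
    print(penalty, ridge_slope(xs, ys, penalty))
```

With no penalty the fitted slope matches the data exactly; as the penalty grows, the slope is pulled toward zero. That shrinkage is the mechanism regularized algorithms use to trade a little fit on the training data for less sensitivity to noise.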

We have two main goals:

  1. To explain powerful mechanisms in modern ML.

  2. To introduce several algorithms that use those mechanisms.

So if you’re ready, then we’re ready. Let’s go!

Go to Chapter 5: Algorithm Selection

Chapter 6: Model Training

At last… it’s time to build our models!

It might seem like it took a while to get here, but data scientists actually do spend most of their time on the earlier steps:

  1. Exploring the data.

  2. Cleaning the data.

  3. Engineering new features.

Again, that’s because better data beats fancier algorithms.

Now you’ll learn how to maximize model performance while safeguarding against overfitting. Plus, you’ll learn how to automatically find the best parameters for each algorithm.

We’ll get an overview of splitting your dataset, deciding on hyperparameters, setting up cross-validation, fitting and tuning models, and finally… selecting a winner!
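The workflow above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the synthetic data, the tiny one-variable ridge model, the candidate penalty values, and the helper names. Real projects would normally lean on a library for these steps.

```python
import random

def ridge_slope(xs, ys, penalty):
    """Fit a no-intercept line with an L2 penalty on the slope."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

def mse(slope, xs, ys):
    """Mean squared error of the line y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_indices(n, k):
    """Assign indices 0..n-1 to k folds in round-robin order."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_mse(xs, ys, penalty, k=3):
    """Average held-out error across k folds for one hyperparameter value."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        fit_x = [x for i, x in enumerate(xs) if i not in held_out]
        fit_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope = ridge_slope(fit_x, fit_y, penalty)
        scores.append(mse(slope, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / len(scores)

# Synthetic data: y is roughly 2x plus noise (seeded for repeatability).
rng = random.Random(0)
xs = [float(x) for x in range(1, 16)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

# 1. Split: set aside a test set that is never used for tuning.
order = list(range(len(xs)))
rng.shuffle(order)
test_idx, train_idx = order[:3], order[3:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# 2. Decide which hyperparameter values to try.
candidates = [0.0, 1.0, 10.0, 100.0]

# 3. Cross-validate each candidate on the training set only.
cv_scores = {p: cross_val_mse(train_x, train_y, p) for p in candidates}

# 4. Select the winner, refit on all training data, and check the test set once.
best_penalty = min(cv_scores, key=cv_scores.get)
final_slope = ridge_slope(train_x, train_y, best_penalty)
test_mse = mse(final_slope, test_x, test_y)
print("best penalty:", best_penalty, "test MSE:", test_mse)
```

The key design choice is that the test set is touched only once, at the very end, so the reported error is not flattered by the tuning process.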

Go to Chapter 6: Model Training