
Concepts in Machine Learning

fredhutch.io's four-class overview of machine learning methods and applications

Concepts in Machine Learning Class 4: Exploratory Data Analysis, Experimental Design, and Ethics in Machine Learning

Welcome to class 4 of Concepts in Machine Learning!

Last class we covered unsupervised learning methods, including k-means clustering and principal component analysis. Today we will cover some common steps in exploratory data analysis and why it's important. We'll also touch on ethics in machine learning.

By the end of this class you should be able to describe the goals of exploratory data analysis, apply a basic EDA checklist (questions, data types, missing data, outliers, and feature engineering) to a new dataset, and discuss ethical considerations in machine learning.

What is exploratory data analysis?

The data that you're using in your machine learning project is as important as the model you choose.

Making sense of high-dimensional data is complicated and error-prone. To have a solid start for a machine learning project, you need to analyze the data up front. This process is called exploratory data analysis (EDA). The goal of EDA is to describe the data through statistical analysis and visualization that highlights important features for further analysis. It's important to gain a deep understanding of what each feature captures, how the data was collected, what's missing, and where the outliers are.

An EDA checklist

EDA is often unstructured and exploratory by nature, so there are many different approaches to it. Below are five prompts to get started with EDA on a new dataset.

  1. What question(s) are you trying to solve (or prove wrong)?
  2. What kind of data do you have and how do you treat different types?
  3. What’s missing from the data and how do you deal with it?
  4. Where are the outliers and why should you care about them?
  5. How can you add, change or remove features to get more out of your data?

This is by no means a complete checklist of EDA items but is a good starting point.

Let’s practice!

Remember the OHSU dataset we looked at earlier in this course? Now, we will practice exploratory data analysis with this dataset.

| patient_Id | age | htn | treat | smoking | race | t2d | gender | numAge | bmi | tchol | sbp | cvd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HHUID00076230 | 20-40 | Y | Y | N | Asian/PI | N | M | 29 | 23 | 189 | 170 | N |
| HHUID00547835 | 70-90 | N | Y | N | White | Y | M | 72 | 35 | 178 | 118 | N |
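If you'd like to follow along in Python, here is a minimal sketch for loading the data with pandas. The filename is an assumption; substitute your local copy of the dataset.

```python
import pandas as pd

# Load the dataset; the filename is an assumption --
# adjust it to match your local copy of the data
df = pd.read_csv("ohsu_cvd_risk.csv")

# Peek at the first few rows and the overall shape
print(df.head())
print(df.shape)
```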

1. What question(s) are you trying to solve?

Start simple with one problem and work out from there to add complexity as needed.

Looking at this dataset I can tell that each row contains a single patient with demographic and health information and whether or not that patient has been diagnosed with cardiovascular disease.

Our problem: Can we predict if someone will get cardiovascular disease based on these attributes?

2. What kind of data do you have?

This step is especially important in cases where you did not create the dataset yourself. You may have been given, found, or inherited datasets to use for analysis. A deep understanding of what each feature captures and how the data was collected is required for a sound analysis.

Interrogate how the data was collected and why.

This preprint paper discusses the “who”, “how”, and “why” of this dataset.

Understand what information each feature captures.

For this dataset I had to Google a few of the column headers to understand what they meant. For example, searching for the abbreviations turned up standard clinical meanings: htn is hypertension, t2d is type 2 diabetes, tchol is total cholesterol, sbp is systolic blood pressure, and cvd is cardiovascular disease.

Investigate the data types in the data set.

Based on the two-row subset of the data above we have the following data types: categorical variables (age, htn, treat, smoking, race, t2d, gender, cvd), numeric variables (numAge, bmi, tchol, sbp), and a unique identifier (patient_Id).

The above data types are based on only two rows of the data. It's good practice to confirm each data type by investigating the entire column. Sometimes a variable might appear to be categorical, but typos and misspellings mean that cleaning is required.

An example of this:

You might assume that a variable designating the state in which a business is located in the Paycheck Protection Program dataset would have only 50 categories (one for each state). However, this is not the case, as this data scientist found out. Datasets are often created by humans manually entering information. Without clear guidelines on what kind of input is accepted, there will usually be quite a bit of variance in the data from human error.
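A quick way to run this kind of audit on our dataset is to list the distinct values of each apparently categorical column; a sketch, assuming the data is loaded as `df` as above:

```python
# Check the type pandas inferred for each column
print(df.dtypes)

# List the distinct values of each apparently categorical column;
# typos and inconsistent capitalization show up here immediately
categorical_cols = ["age", "htn", "treat", "smoking", "race",
                    "t2d", "gender", "cvd"]
for col in categorical_cols:
    print(col, sorted(df[col].dropna().unique()))
```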

3. What’s missing from your data?

After determining what is in your dataset, it’s time to investigate what is missing! It’s crucial to identify features where data is missing as you cannot make accurate predictions using data that is incomplete.

NA is often used to denote missing data, but sometimes you might find missing data as blank cells, NAN, or -. An analyst might visualize missing data like the plot below.

[Figure: visual check of missing data per column]
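In pandas you can tell `read_csv` up front which strings to treat as missing; a minimal sketch (the filename is an assumption, and pandas already treats `NA` and blank cells as missing by default):

```python
import pandas as pd

# Treat these strings as missing when reading the file.
# "NA" and blank cells are already in pandas' default list;
# "NAN" and "-" are added here explicitly.
df = pd.read_csv("ohsu_cvd_risk.csv", na_values=["NAN", "-"])

# Count missing values per column
print(df.isna().sum())

# A quick bar chart of missing counts per column (needs matplotlib)
df.isna().sum().plot(kind="bar")
```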

There are a few things you can do with missing data: drop the rows (or columns) that contain it, impute the missing values (for example with the column mean or median), or choose a model that can handle missingness directly.

Whatever you choose to do with missing data, it's important to clearly note what was done and why, and to take it into account in downstream analysis.
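A sketch of those options in pandas, continuing with `df` from above (the column choices and threshold here are illustrative):

```python
# Option 1: drop any row that contains a missing value
df_complete = df.dropna()

# Option 2: drop columns that are mostly missing
# (the 50% threshold is an arbitrary choice for illustration)
df_trimmed = df.loc[:, df.isna().mean() < 0.5]

# Option 3: impute a numeric column with its median
df["bmi"] = df["bmi"].fillna(df["bmi"].median())
```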

4. Outliers

Depending on your model, outliers can have a drastic effect on the results. In particular, we mentioned last class that ordinary least squares regression is very sensitive to outliers. The definition of an outlier varies by dataset. A general rule of thumb is that a data point more than three standard deviations from the mean is an outlier. When in doubt, it's best to consult a subject matter expert.

The easiest way to find outliers is to look at the distributions of each feature in your data.

How you handle outliers is important when building a statistical model. Keeping drastic outliers can lead to your model overfitting the training dataset, while removing all outliers can lead to an overly generalized model that won't handle any out-of-the-ordinary data points.
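A sketch of both steps in pandas, assuming `df` from above; the three-standard-deviation cutoff is the rule of thumb mentioned earlier, not a universal threshold:

```python
# Plot the distribution of each numeric feature (needs matplotlib)
numeric_cols = ["numAge", "bmi", "tchol", "sbp"]
df[numeric_cols].hist(figsize=(8, 6))

# Flag values more than three standard deviations from the mean
for col in numeric_cols:
    mean, std = df[col].mean(), df[col].std()
    flagged = df[(df[col] - mean).abs() > 3 * std]
    print(f"{col}: {len(flagged)} potential outliers")
```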

5. Getting more out of the data with feature engineering

Feature engineering is the process of using domain knowledge to extract new features from raw data. More features mean more information for your model to use when making predictions, which is generally considered a good thing in machine learning. Which features you create will depend heavily on the dataset and the goals of the analysis, so this process takes a bit of experimentation and creativity. The cardiovascular disease risk dataset that we are using is fairly limited, since it consists of a single table with 13 features. In practice, machine learning projects often draw on multiple tables with hundreds of features from which you can generate new data.

The simplest form of feature engineering is combining categorical features. This new, combined categorical feature can give information about interactions between categorical variables.
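For example, combining the smoking and hypertension columns produces a feature that captures their interaction; a sketch, where the new column name is my own:

```python
# Combine two categorical features into one interaction feature.
# A patient who both smokes and has hypertension may carry a
# different risk than either factor alone would suggest.
df["smoking_htn"] = df["smoking"].astype(str) + "_" + df["htn"].astype(str)
print(df["smoking_htn"].value_counts())
```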

Exploratory data analysis is one step in the cyclical process of model building.

Ethics in machine learning

Replicating biases

What happens when the machine guesses wrong?

Machine learning applications are not naturally transparent

Data security

Wrapping up

Extra materials

If you have some coding experience and want to practice setting up your first model:

https://medium.com/@CoalitionForCriticalTechnology/abolish-the-techtoprisonpipeline-9b5b14366b16

https://www.nature.com/articles/d41586-020-00160-y

Closing

Upcoming R and Python machine learning courses