
Intermediate R, Machine Learning

fredhutch.io's four-class overview of machine learning using R statistical programming

Machine Learning in R, Class 1: Tidyverse and EDA

Objectives

Data Set

Outline

Course intro

Course learning objectives:

  1. Class 1 intro

Class 1 covers:

  1. Course introduction
  2. Overview of tidyverse for EDA
  3. Overview of tidymodels for modeling
  4. Dive into a quick visualization/model

KEY TAKEAWAYS: Develop an understanding of the benefits of the tidyverse and how tidymodels fits in; demonstrate how to explore a dataset and set up a linear model

Introduce dataset for class 1 >FIXME: Need to decide on datasets

Exploratory Data Analysis (EDA) with the Tidyverse
- The tidyverse is a meta-package: a single package that installs many different packages in one command
- Modular: each package addresses a step in the data science modeling process

For a primer on the tidyverse, check out Tidy Modeling with R, Chapter 2: A Tidyverse Primer. For more information on the design principles behind the tidyverse, check out the Tidyverse Design Guide.

Let's build a model! >Notes: Emphasize EDA as a step in model building. We will iteratively add more complexity throughout this course, including splitting data, bootstrapping, preprocessing with recipes, utilizing workflow objects, etc., but class 1 will demonstrate a very simple application of ML and usage of tidymodels

The first step is exploring and visualizing the data; ideally we will have munged the data a bit in class 1. This should be a very quick demonstration with an easy-to-identify application. - Ex. JS quickly plots the mpg distribution
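A quick visualization step like the one described above might look like this sketch. Since the course dataset is still to be decided, the built-in mtcars data stands in here:

```r
library(ggplot2)

# mtcars is a stand-in for the course dataset (still TBD)
cars <- mtcars

# Plot the distribution of the outcome we want to predict (mpg)
p <- ggplot(cars, aes(x = mpg)) +
  geom_histogram(bins = 15) +
  labs(x = "Miles per gallon", y = "Count")

p
```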

A simple linear model using base R
- Before diving into more complex models, let's fit a really simple linear model using base R's lm()
- Really set the scene: look at the dataset and specify the column that we want to predict
- Look at the dataset again: which columns won't be useful for prediction?
  - Ex. JS removes the model and model index (unique identifiers for certain car models); another example would be removing patient ID
- Explain the formula: mpg is the predicted value explained by all predictors (mpg ~ .)

Code: run lm() 
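A minimal base-R sketch of this step, again using mtcars as a stand-in for the course dataset:

```r
# mtcars stands in for the course dataset (still TBD)
cars <- mtcars

# In a dataset with unique identifiers (e.g., car model name or patient ID),
# drop those columns first; mtcars stores car names as row names, so there
# is nothing to remove here

# Predict mpg from all other columns (mpg ~ .)
fit <- lm(mpg ~ ., data = cars)
summary(fit)
```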

What is Tidymodels? >Note: Want to emphasize the major principles behind tidymodels and how it fits into the tidyverse. Each package has a specific usage, and together they create a cohesive ecosystem of packages that play together easily and intentionally.

Parsnip standardizes the interface for fitting models >Note: I think it's important to stress that parsnip is an interface to other modeling 'engines'

| Function | Package | Code |
| --- | --- | --- |
| glm | stats | predict(obj, type = "response") |
| lda | MASS | predict(obj) |
| gbm | gbm | predict(obj, type = "response", n.trees) |
| mda | mda | predict(obj, type = "posterior") |
| rpart | rpart | predict(obj, type = "prob") |
| Weka | RWeka | predict(obj, type = "probability") |
| logitboost | LogitBoost | predict(obj, type = "raw", nIter) |
| pamr.train | pamr | pamr.predict(obj, type = "posterior") |

Table content from this blog post by parsnip's creator, Max Kuhn

For more in-depth information on parsnip, check out Tidy Modeling with R, Chapter 7: Fitting Models with Parsnip

Training and testing
- We don't need to do this for a linear model, but we will anyway (good practice)
- Use rsample
- 80/20 split training/testing
- Balancing the split
- Build with training data
- Choose with validation data or resampled data
- Evaluate with testing data

Code: split the data
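The split described above can be sketched with rsample; mtcars stands in for the course dataset, and the strata argument illustrates balancing the split on the outcome:

```r
library(rsample)

set.seed(123)

# 80/20 train/test split; mtcars is a stand-in for the course dataset.
# Adding strata = mpg would balance the split on the outcome's distribution.
car_split <- initial_split(mtcars, prop = 0.8)

car_train <- training(car_split)
car_test  <- testing(car_split)
```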

Using tidymodels to train a model > Adapted from chapter one of JS's machine learning with tidymodels tutorial

In tidymodels, you specify models using three concepts: the model type (e.g., linear regression or random forest), the engine (the software that actually fits the model), and the mode (regression or classification).

Code: Specify a linear model and a random forest model
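A sketch of model specification with parsnip; the ranger engine shown for the random forest is one common choice, not a requirement of the course:

```r
library(parsnip)

# Linear regression: model type linear_reg(), engine "lm" (base R)
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Random forest: model type rand_forest(), mode regression, engine "ranger"
rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

# Note: nothing has been fit yet; these are just specifications
```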

After a model has been specified, you can fit it, typically using a symbolic description of the model (a formula) and some data.
- SG example fits models with data = car_train [This means we're saying, "Just fit the model one time, on the whole training set"]
- Once you have fit your model, you can evaluate how well it is performing.

Code: Fit the linear model and random forest model
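A sketch of the fitting step; car_train here is a stand-in for the training split (mtcars is used so the example is self-contained, and the ranger engine is assumed for the random forest):

```r
library(parsnip)

# Stand-ins for the training split and the specs from the previous step
car_train <- mtcars
lm_spec <- linear_reg() %>% set_engine("lm")
rf_spec <- rand_forest() %>% set_mode("regression") %>% set_engine("ranger")

# fit() takes a formula and data: fit each model once on the whole training set
lm_fit <- fit(lm_spec, mpg ~ ., data = car_train)
rf_fit <- fit(rf_spec, mpg ~ ., data = car_train)
```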

Evaluate your models
- Root mean squared error (RMSE) metric
- Use the yardstick package
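The evaluation step above can be sketched with yardstick's rmse(); mtcars again stands in for the data, so the numbers are illustrative only:

```r
library(parsnip)
library(yardstick)
library(dplyr)

# Fit a linear model on stand-in data (mtcars)
lm_fit <- fit(linear_reg() %>% set_engine("lm"), mpg ~ ., data = mtcars)

# Attach predictions to the data; parsnip's predict() returns a
# tibble with a .pred column
results <- mtcars %>%
  mutate(.pred = predict(lm_fit, mtcars)$.pred)

# Root mean squared error: truth vs. estimate
rmse(results, truth = mpg, estimate = .pred)
```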

Note: In JS's course she has them evaluate with both training and testing data to show the difference in output (evaluating on training data will always be more optimistic)

Resampling - Why resample?: “The idea of resampling is to create simulated data sets that can be used to estimate the performance of your model, say, because you want to compare models. You can create these resampled data sets instead of using either your training set (which can give overly optimistic results, especially for powerful ML algorithms) or your testing set (which is extremely valuable and can only be used once or at most twice).”

Code:
1. Create bootstrap resamples using rsample::bootstraps()
2. Evaluate linear model and random forests with bootstrap resamples
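The two steps above can be sketched with rsample and tune; mtcars stands in for the training split, and 25 bootstrap resamples is an arbitrary illustrative choice:

```r
library(tidymodels)

set.seed(123)

# 1. Create bootstrap resamples (mtcars stands in for car_train)
car_boot <- bootstraps(mtcars, times = 25)

lm_spec <- linear_reg() %>% set_engine("lm")

# 2. Fit and evaluate the model on each bootstrap resample;
#    save_pred = TRUE keeps the predictions for later visualization
lm_res <- fit_resamples(
  lm_spec,
  mpg ~ .,
  resamples = car_boot,
  control = control_resamples(save_pred = TRUE)
)

# Summarize performance across resamples (RMSE and R-squared by default)
collect_metrics(lm_res)
```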

Visualize model results
- Use tune::collect_predictions()
- Plot predicted vs. truth in ggplot
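A sketch of that plot, rebuilt from the bootstrap results (mtcars stands in for the training data; predictions must have been saved during resampling):

```r
library(tidymodels)

set.seed(123)

# Rebuild a resampling result with saved predictions (mtcars as stand-in)
car_boot <- bootstraps(mtcars, times = 25)
lm_res <- fit_resamples(
  linear_reg() %>% set_engine("lm"),
  mpg ~ .,
  resamples = car_boot,
  control = control_resamples(save_pred = TRUE)
)

# Plot predicted vs. true mpg across all resamples; the dashed line
# marks perfect prediction
p <- collect_predictions(lm_res) %>%
  ggplot(aes(x = mpg, y = .pred)) +
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  labs(x = "Truth (mpg)", y = "Predicted mpg")

p
```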

Resources