
Intermediate R, Machine Learning

fredhutch.io's four-class overview of machine learning using R statistical programming

Machine Learning in R, Class 3: A more complex regression or classification
================

Outline

Intro, course objectives

Review of class 2

Introduce dataset - This is where we will ask questions of the dataset and work through the beginning conceptual steps of exploratory data analysis (EDA), for example:

- What are we hoping to predict?
- What columns should be included in our prediction?
- What questions do we have of the data before we start?

Explore dataset - Take a look at the dataset. What are we aiming to predict? What model should we use?

glimpse(), count(), group_by(), summarise()
Histogram of some interesting feature

Even if the data is balanced, have students make that assessment themselves
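As a rough sketch of the exploration step, the functions listed above could be used as follows. The built-in `iris` data and its `Species` column are stand-ins chosen only for illustration (an assumption); the class dataset and its outcome column would replace them.

```r
library(dplyr)
library(ggplot2)

# iris is a stand-in for the class dataset (assumption)
glimpse(iris)

# Count rows per class to judge whether the outcome is balanced
class_counts <- iris %>% count(Species)
class_counts

# Summarise a feature within each class
class_summary <- iris %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length), n = n())
class_summary

# Histogram of one feature, coloured by class
p <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(bins = 30)
print(p)
```

The `count()` output is the quickest way for students to decide for themselves whether the classes are balanced.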

Training and testing data - Use rsample to split the data

- remove columns deemed unnecessary earlier
- load tidymodels
- split the data with initial_split() from rsample, stratifying on a specific feature so that it is divided evenly between the training and testing sets
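A minimal sketch of these steps, assuming `iris` as a stand-in dataset, `Species` as the feature to stratify on, and `Sepal.Width` as a hypothetical "unnecessary" column. Only rsample is loaded here to keep the example light; `library(tidymodels)` would attach it as well.

```r
library(rsample)

set.seed(123)  # make the random split reproducible

# Hypothetical: pretend Sepal.Width was judged unnecessary during exploration
dat <- dplyr::select(iris, -Sepal.Width)

# Hold out 20% of rows for testing, sampling within each Species level
dat_split <- initial_split(dat, prop = 0.8, strata = Species)
training_dat <- training(dat_split)
testing_dat <- testing(dat_split)

# Class counts should be proportionally similar in both sets
table(training_dat$Species)
table(testing_dat$Species)
```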

Preprocessing - Do we have any preprocessing to do?

- If the data is imbalanced (ideally it will be), we will discuss upsampling here
- Demonstrate using a recipe to preprocess our training data

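## step_upsample() is provided by the themis package; `outcome` is a placeholder for the column to predict
my_recipe <- recipe(outcome ~ ., data = training_dat) %>% step_upsample(outcome)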
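To make the effect of upsampling concrete, here is a small sketch on made-up data. The data frame, its column names (`outcome`, `x1`, `x2`) and the 90/10 class split are all hypothetical. In current tidymodels releases `step_upsample()` comes from the themis package, so that package is loaded explicitly.

```r
library(recipes)
library(themis)  # provides step_upsample()

set.seed(123)

# Hypothetical imbalanced training data: 90 "no" rows and 10 "yes" rows
training_dat <- data.frame(
  outcome = factor(c(rep("no", 90), rep("yes", 10))),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

# over_ratio = 1 asks for the minority class to be sampled up to the
# size of the majority class
my_recipe <- recipe(outcome ~ ., data = training_dat) %>%
  step_upsample(outcome, over_ratio = 1)

# prep() estimates the recipe on the training data;
# bake(new_data = NULL) returns the processed training data
upsampled <- my_recipe %>% prep() %>% bake(new_data = NULL)

table(training_dat$outcome)  # class counts before upsampling
table(upsampled$outcome)     # class counts after upsampling
```

Upsampling is only applied to the training data. By default the step is skipped when the recipe is applied to new data, so the testing set keeps its original class proportions.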

Creating a workflow

- We’ll use a different engine for the random forest model: ranger, from the ranger package
- Combine the model with the preprocessing step (recipe) using workflow()

## specify ranger model
rf_spec <- rand_forest() %>% set_engine('ranger') %>% set_mode('classification')

## Add recipe and model to workflow
wf <- workflow() %>% add_recipe(my_recipe) %>% add_model(rf_spec)

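Resampling by cross validation

- Remember: resampling is a way to get a more reliable estimate of how your model will perform on new data; it does not by itself make the model more accurate
- Maybe find a good resampling primer and link
- Last class we did bootstrap resampling; this class we use cross validation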

Cross validation can take quite a long time - it can be beneficial to use parallel processing
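One possible way to set up parallel processing is sketched below with the future package. This is an assumption about the setup rather than the course's chosen approach: recent tune releases can use a future plan to distribute resamples, while older releases relied on a foreach backend such as doParallel, so check which one your installed version supports. The worker count of 2 is arbitrary.

```r
library(future)

# Start 2 background R sessions; adjust the number to your machine
plan(multisession, workers = 2)

# ... call fit_resamples() here; with a tune release that supports future,
# the resamples are distributed across the workers ...

# Return to ordinary sequential processing when finished
plan(sequential)
```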

Note: How do you choose the number of folds? When do you use cross validation vs bootstrapping?

 folds <- vfold_cv(training_dat, v = 10, repeats = 5)
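To help with the discussion of folds versus bootstrap resamples, the sketch below builds both kinds of resampling object and inspects their sizes. `iris` is a stand-in for `training_dat` (an assumption), and `times = 25` for the bootstrap is chosen only for illustration.

```r
library(rsample)

set.seed(123)

# iris is a stand-in for the training data (assumption)
training_dat <- iris

# 10-fold cross validation repeated 5 times
folds <- vfold_cv(training_dat, v = 10, repeats = 5)

# Bootstrap resampling, as used last class
boots <- bootstraps(training_dat, times = 25)

nrow(folds)  # one row per fold per repeat
nrow(boots)  # one row per bootstrap resample

# Each fold fits on the analysis set and holds out the assessment set
first_fold <- folds$splits[[1]]
nrow(analysis(first_fold))
nrow(assessment(first_fold))

# A bootstrap analysis set is drawn with replacement and has as many
# rows as the original data
nrow(analysis(boots$splits[[1]]))
```

More folds or repeats means more model fits, which is where the running time comes from.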

Evaluation

- At this point we have preprocessed the data, built a workflow for the model, and created cross validation folds
- Now we will evaluate how the model performed
- In our discussion of model performance we will touch on how to set non-default performance metrics and save predictions from resampled data

wf %>%
    fit_resamples(
        folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE))
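The following sketch runs the evaluation step end to end and then pulls out the metrics and saved predictions. It assumes `two_class_dat` from the modeldata package as a stand-in dataset (predictors `A` and `B`, outcome `Class`), uses a formula in place of a recipe to stay short, and picks `v = 5` and `trees = 100` only to keep the run time low. The ranger package must be installed.

```r
library(tidymodels)  # also attaches modeldata, which provides two_class_dat

set.seed(123)

# Stand-in data (assumption): 5-fold cross validation on two_class_dat
folds <- vfold_cv(two_class_dat, v = 5)

rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

wf <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(rf_spec)

rf_res <- wf %>%
  fit_resamples(
    folds,
    metrics = metric_set(roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
  )

# Each requested metric, averaged across the folds
rf_metrics <- collect_metrics(rf_res)
rf_metrics

# Held-out predictions, available because save_pred = TRUE
rf_preds <- collect_predictions(rf_res)

# Confusion matrix built from the held-out predictions
rf_preds %>% conf_mat(truth = Class, estimate = .pred_class)
```

Assigning the result of `fit_resamples()` to a name, here the hypothetical `rf_res`, is what makes the later `collect_metrics()` and `collect_predictions()` calls possible.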