ML Workflow for R

Quickly Find the Best ML Model Using Caret

GitHub here. A typical starting workflow is outlined below.

source('scripts/wrangling/ETL.r')                            # ETL
source('scripts/wrangling/wrangle.r')                        # wrangle
source('scripts/wrangling/missing_data.r')                   # impute missing data
source('scripts/pre-processing/00. transform.r')             # normalize
source('scripts/pre-processing/01. dummyvars.r')             # one-hot encode
source('scripts/pre-processing/03. nzv.r')                   # remove near-zero-variance predictors
source('scripts/pre-processing/04. correlated predictors.r') # drop highly correlated predictors
source('scripts/pre-processing/05. partitioning.r')          # partition data

Caret allows for consistent syntax when building models from many different packages. It covers roughly 250 model types, including several backed by h2o.
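To browse what is available, Caret ships a couple of lookup helpers. A quick sketch, listing the supported model codes and the tunable hyperparameters for the ranger method used later on:

library(caret)

# list every model code caret knows about
names(getModelInfo())

# inspect the tunable hyperparameters for a given method
modelLookup("ranger")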

Simply set some automated training parameters…

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,                       # 10-fold cross-validation
                     repeats = 5,                       # repeated 5 times
                     search = "random",                 # random hyperparameter search
                     classProbs = TRUE,                 # class probabilities, required for ROC
                     summaryFunction = twoClassSummary, # report ROC, sensitivity, specificity
                     sampling = "smote")                # SMOTE resampling for class imbalance

…and execute! The model below fits a random forest via the ranger package.

rfTune <- train(y ~ .,
                data = dfTrain,
                method = "ranger",                        # fast random forest implementation
                preProc = c("center", "scale", "BoxCox"), # pre-processing applied within resampling
                tuneLength = 50,                          # evaluate 50 random hyperparameter combinations
                trControl = ctrl,
                metric = "ROC")                           # optimize area under the ROC curve
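Once trained, the fitted object can be inspected and used for prediction in the usual Caret way. A quick sketch, assuming dfTest is the hold-out set produced by the partitioning script:

rfTune$bestTune    # best hyperparameter combination found
plot(rfTune)       # ROC profile across the sampled hyperparameters

# class probabilities on the hold-out set
probs <- predict(rfTune, newdata = dfTest, type = "prob")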

Some of the more complicated models may require a grid search object to be passed to the tuneGrid argument of the train() function, as sketched below.
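A minimal sketch for the ranger model above, with illustrative (not recommended) parameter values:

rfGrid <- expand.grid(mtry = c(2, 4, 8),
                      splitrule = c("gini", "extratrees"),
                      min.node.size = c(1, 5, 10))

rfTuneGrid <- train(y ~ .,
                    data = dfTrain,
                    method = "ranger",
                    tuneGrid = rfGrid,  # exhaustive search over the grid above
                    trControl = ctrl,
                    metric = "ROC")

Note that when a tuneGrid is supplied, it takes precedence over the search = "random" setting in trainControl().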

There are a number of other analytic functions that can be performed on model objects, including tuning the model’s hyperparameters and refitting the final model (and hence updating its predictions) using Caret’s update() function.
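For example, update() can refit the final model at a hand-picked hyperparameter combination without re-running the full search; the parameter values below are illustrative assumptions:

# refit the final model at a specific hyperparameter combination
rfUpdated <- update(rfTune, param = list(mtry = 4,
                                         splitrule = "gini",
                                         min.node.size = 5))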

This visualization of an XGBoost model’s tuning results measures accuracy across combinations of boosting iterations (i.e., resampling the data) and minimum instance weights per node. A lower instance weight fits the data more closely, but fewer nodes may suggest overfitting at this extreme margin of accuracy.
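A plot like this can be generated directly from a fitted Caret object. A minimal sketch, assuming xgbTune is a model fit via train(..., method = "xgbTree"):

library(ggplot2)

# profile of the tuning results (caret provides a ggplot method for train objects)
ggplot(xgbTune)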