Machine Learing in the Tidyverse

In this document, we will work with a collection of economic and social indicators for 77 countries over a period of 52 years. This data is stored in the gapminder dataframe from dslabs package.

We will first transform our gapminder data into a nested dataframe by using the first tool needed to build the foundation of tidy machine seeing skills: nest().

# Tasks:
- Take a look at the first six rows of gapminder.
- Leverage group_by() and nest() to nest our dataframe by country, save this as gap_nested.
- Explore the first six rows of the newly created dataframe gap_nested.

gapminder <- readRDS('D:/learning/Datacamp/Datasets/gapminder.rds')
str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4004 obs. of  7 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ year            : int  1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
##  $ infant_mortality: num  148 148 148 148 149 ...
##  $ life_expectancy : num  47.5 48 48.5 49.1 49.6 ...
##  $ fertility       : num  7.65 7.65 7.65 7.65 7.65 7.66 7.66 7.66 7.66 7.65 ...
##  $ population      : num  11124892 11404859 11690152 11985130 12295973 ...
##  $ gdpPercap       : int  1242 1047 820 1075 1109 1147 1062 1130 1216 1282 ...

# Explore gapminder
head(gapminder)

# Prepare the nested dataframe gap_nested
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

gap_nested <- gapminder %>% 
  group_by(country) %>% 
  nest(data = !country) %>% ungroup()

# Explore gap_nested
head(gap_nested)

dim(gap_nested)

## [1] 77  2

Notice that each row in gap_nested contains a tibble.

Unnesting our data

As we’ve seen, a nested dataframe is simply a way to shape our data. Essentially taking the group_by() windows and packaging them in corresponding rows.

As a corollary to this, we can use the unnest() function to expand the dataframes that are nested in these chunks.

# Tasks:
- Use unnest() on the gap_nested dataframe to take a nested column and expand it into a new dataframe and save it as gap_unnested.
- Make sure that gapminder and gap_unnested are identical by using the identical() function.

# Create the unnested dataframe called gap_unnnested
gap_unnested <- gap_nested %>% 
  unnest(cols= c(data))
  
# Confirm that our data was not modified  
identical(gapminder, gap_unnested)

## [1] TRUE

Explore a nested cell

The data column in the nested dataframe gap_nested contains tibbles for each country. Let us explore one of these nested chunks.

# Tasks:
- Extract the row index of India
- Extract the nested data for India and store this as india_df.
- Calculate the following summary stats for India's population: min(), max() and mean().

which(gap_nested$country == 'India')

## [1] 34

# Extract the data of India
india_df <- gap_nested$data[[34]]
india_df

# Calculate the minimum of the population vector
min(india_df$population)

## [1] 449661874

# Calculate the maximum of the population vector
max(india_df$population)

## [1] 1247446011

# Calculate the mean of the population vector
mean(india_df$population)

## [1] 811243414

We can see that working with a single chunk in a nested dataframe is identical to working with regular dataframes. Now we will see how to scale this approach to work on a vector of nested dataframes using the map family of functions.

Mapping our data

In combination with mutate(), we can use map() to append the results of our calculation to a dataframe. Since the map() function always returns a vector of lists we must use unnest() to extract this information into a numeric vector.

Here we will explore this functionality by calculating the mean population of each country in the dataset.

# TAsks:
- Use map() to apply the mean() function to calculate the population mean for each country and append this new list column called mean_pop using mutate().
- Explore the first 6 rows of pop_nested.
- Use unnest() to convert the mean_pop list into a numeric column and save this as the pop_mean dataframe.
- Explore pop_mean using head().

# Calculate the mean population for each country
pop_nested <- gap_nested %>%
  mutate(mean_pop = map(data, ~mean(.x$population)))

# Take a look at pop_nested
head(pop_nested)

# Extract the mean_pop value by using unnest
pop_mean <- pop_nested %>% 
  unnest(mean_pop)

# Take a look at pop_mean
head(pop_mean)

Expecting mapped output

When we know that the output of our mapped function is an expected type (here it is a numeric vector) we can leverage the map_*() family of functions to explicitly try to return that object type instead of a list.

Here we will again calculate the mean population of each country, but instead, we will use map_dbl() to explicitly append the numeric vector returned by mean() to our dataframe.

# Tasks:
- Generate the pop_mean dataframe using the map_dbl() function to calculate the population mean for each nested dataframe.
- Explore the pop_mean dataframe using head().

# Calculate mean population and store result as a double
pop_mean <- gap_nested %>%
  mutate(mean_pop = map_dbl(data, ~mean(.x$population)))

# Take a look at pop_mean
head(pop_mean)

With the nest() and map_*() functions in hand we now have the foundation for building multiple models.

Mapping many models

The gap_nested dataframe available in our workspace contains the gapminder dataset nested by country.

we will use this data to build a linear model for each country to predict life expectancy using the year feature.

# Tasks:
- Build a linear model for each country predicting life_expectancy using the year feature.  We usse the lm() function for this and save this new dataframe containing models as gap_models.
- Extract the 34th (corersponding to India) model from this dataframe and save this as india_model.
View the information about the model using summary().

# Build a linear model for each country
gap_models <- gap_nested %>%
    mutate(model = map(data, ~lm(formula = life_expectancy~year, data = .x)))
    
# Extract the model for India    
india_model <- gap_models$model[[34]]

# View the summary for the India model
summary(india_model)

## 
## Call:
## lm(formula = life_expectancy ~ year, data = .x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.90159 -0.83016 -0.05029  1.09813  1.96997 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -834.74522   23.67314  -35.26   <2e-16 ***
## year           0.44842    0.01192   37.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.29 on 50 degrees of freedom
## Multiple R-squared:  0.9659, Adjusted R-squared:  0.9652 
## F-statistic:  1415 on 1 and 50 DF,  p-value: < 2.2e-16

We have just built 77 models for 77 countries with just a few lines of code.

Next, we will see how we can extract information from summary() in a tidy fashion.

The three ways to tidy our model

Below are the descriptions of the three functions in the broom package.

tidy() returns the statistical findings of the model (such as coefficients)
glance() returns a concise one-row summary of the model
augment() adds prediction columns to the data being modeled

Extracting model statistics tidily

First, we will use the tidy() and glance() functions to extract information from india_model in a tidy manner.

For a linear model, tidy() extracts the model coefficients while glance() returns the model statistics such as the

# Tasks:
- Extract the coefficient information as a tidy dataframe of the india_model using tidy().
- Extract the model statistics of india_model using glance().

library(broom)

# Extract the coefficients of the india_model as a dataframe
tidy(india_model)

# Extract the statistics of the india_model as a dataframe
glance(india_model)

As we can see tidy() and glance() both return dataframes, this feature can be very useful for managing the results of many models in one dataframe.

Augmenting our data

From the results of glance(), we saw that using the available features the linear model fits well with an adjusted R^2 value of 0.96. The augment() function can help us to explore this fit by appending the predictions to the original data.

We will leverage this to compare the predicted values of life_expectancy based on the year feature with the original ones.

# Tasks:
- Build the augmented dataframe india_fitted using augment().
- Visualize the fit of the model with respect to year by plotting both life_expectancy as points and .fitted as a line.
- Visualize the fit of the model with respect to teh actual values.

# Build the augmented dataframe
india_fitted <- augment(india_model)
head(india_fitted)

# Compare the predicted values with the actual values of life expectancy
india_fitted %>% 
  ggplot(aes(x = year)) +
  geom_point(aes(y = life_expectancy)) + 
  geom_line(aes(y = .fitted), color = "red")

india_fitted %>% 
  ggplot(aes(x = .fitted, y = life_expectancy)) +
  geom_point() +
  geom_abline(color = 'blue')

Tidy up the coefficients of our models

We will now leverage the list column workflow along with the tidy() function from broom to extract and explore the coefficients for the 77 models we built.

# Tasks:
- Use tidy() to append a column (coef) containing coefficient statistics for each model to the gap_models dataframe and save it as model_coef_nested.
- Simplify this dataframe using unnest() to extract these coefficients in our dataframe.
- Explore the coefficient estimates for the year feature across our 77 models by plotting a histogram of their values.

# Extract the coefficient statistics of each model into nested dataframes
model_coef_nested <- gap_models %>% 
    mutate(coef = map(model, ~tidy(.x)))
head(model_coef_nested)

# Simplify the coef dataframes for each model    
model_coef <- model_coef_nested %>%
    unnest(coef)
head(model_coef)

# Plot a histogram of the coefficient estimates for year         
model_coef %>% 
  filter(term == "year") %>% 
  ggplot(aes(x = estimate)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now that we have the slope for each model let’s explore their distribution.

model_coef %>% filter(term == 'year') %>% select(country, estimate) %>% filter(estimate < 0) %>% nrow()

## [1] 4

model_coef %>% filter(term == 'year') %>% select(country, estimate) %>% arrange(desc(estimate))

model_coef %>% filter(term == 'year') %>% select(country, estimate) %>% summary()

##        country      estimate      
##  Algeria   : 1   Min.   :-0.1984  
##  Argentina : 1   1st Qu.: 0.2185  
##  Australia : 1   Median : 0.2951  
##  Austria   : 1   Mean   : 0.3180  
##  Bangladesh: 1   3rd Qu.: 0.4492  
##  Belgium   : 1   Max.   : 0.6711  
##  (Other)   :71

Glance at the fit of our models

In this exercise we will use glance() to calculate how well the linear models fit the data for each country.

# Tasks:
- Append a column (fit) containing the fit statistics for each model to the gap_models dataframe and save it as model_perf_nested.
- Simplify this dataframe using unnest() to extract these fit statistics of each model and save it as model_perf.
- Finally, use head() to take a peek at model_perf.

# Extract the fit statistics of each model into dataframes
model_perf_nested <- gap_models %>% 
    mutate(fit = map(model, ~glance(.x)))
head(model_perf_nested)

# Simplify the fit dataframes for each model    
model_perf <- model_perf_nested %>% 
    unnest(fit)
    
# Look at the first six rows of model_perf
head(model_perf)

Best and worst fitting models

Let us try to answer the following questions:

- Overall, how well do our models fit our data?
- Which are the best fitting models?
- Which models do not fit the data well?

# Tasks:
- Plot a histogram of the R^2 values of the 77 models
- Extract the 4 best fitting models (based on R^2) and store this dataframe as best_fit
- Extract the 4 worst fitting models (based on R^2) and store this dataframe as worst_fit

# Plot a histogram of rsquared for the 77 models    
model_perf %>% 
  ggplot(aes(x = r.squared)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Extract the 4 best fitting models
best_fit <- model_perf %>% 
  top_n(n=4, wt= r.squared) 
best_fit

# Extract the 4 models with the worst fit
worst_fit <- model_perf %>% 
  top_n(n=4, wt=-r.squared)
worst_fit

we have now prepared two dataframes, one containing the four best fitting models and another the four worst fitting models. Next, we will use the augment() function to explore these fits visually.

Augment the fitted values of each model

In this exercise we will prepare our four best and worst fitting models for further exploration by augmenting our model data with augment().

# Tasks:
- Build the best_augmented dataframe by building augmented dataframes and simplifying them with unnest() using the best_fit dataframe.
- Build the worst_augmented dataframe by building augmented dataframes and simplifying them with unnest() using the worst_fit dataframe.

best_augmented <- best_fit %>% 
  # Build the augmented dataframe for each country model
  mutate(augmented = map(model, ~augment(.x))) %>% 
  # Expand the augmented dataframes
  unnest(augmented)

worst_augmented <- worst_fit %>% 
  # Build the augmented dataframe for each country model
  mutate(augmented = map(model, ~augment(.x))) %>% 
  # Expand the augmented dataframes
  unnest(augmented)

Explore our best and worst fitting models

Let’s explore our four best and worst fitting models by comparing the fitted lines with the actual values.

# TAsks:
- Visualize the fit of our four best fitting models with respect to year by plotting both life_expectancy as points and .fitted as a line.

# Compare the predicted values with the actual values of life expectancy 
# for the top 4 best fitting models
best_augmented %>% 
  ggplot(aes(x = year)) +
  geom_point(aes(y = life_expectancy)) + 
  geom_line(aes(y = .fitted), color = "red") +
  facet_wrap(~country, scales = "free_y")

- Visualize the fit of our four worst fitting models with respect to year by plotting both life_expectancy as points and .fitted as a line.

# Compare the predicted values with the actual values of life expectancy 
# for the top 4 worst fitting models
worst_augmented %>% 
  ggplot(aes(x = year)) +
  geom_point(aes(y = life_expectancy)) + 
  geom_line(aes(y = .fitted), color = "red") +
  facet_wrap(~country, scales = "free_y")

We can see that a linear model does a great job for the best 4 fitting models but the worst 4 fitting models do not seem to have a linear relationship. we will work to improve this fit by incorporating additional features.

Build better models

We will build multiple regression models for each country using all available features. we may be interested in comparing the performance of the four worst fitting models so their adjusted R^2 values generated previously:

# Tasks:
- Build a linear model for each country predicting life_expectancy using every feature in the dataset.
- Append a column (fit) containing fit statistics for each model and simplify this dataframe.
- Print the adjusted R^2 in fullmodel_perf of the four countries from worst_fit dataframe.

# Build a linear model for each country using all features
gap_fullmodel <- gap_nested %>% 
  mutate(model = map(data, ~ lm(formula = life_expectancy~., data = .x)))

fullmodel_perf <- gap_fullmodel %>% 
  # Extract the fit statistics of each model into dataframes
  mutate(fit = map(model, ~glance(.x))) %>% 
  # Simplify the fit dataframes for each model
  unnest(fit)
  
# View the performance for the four countries with the worst fitting 
# four simple models we looked at before
fullmodel_perf %>% 
  filter(country %in% worst_fit$country) %>% 
  select(country, adj.r.squared)

we can see that the performance of each of the four worst performing models based on their adjusted R^2 drastically improved once other features were added to the model.

While the adjusted does tell us how well the model fit our data, it does not give any indication on how it would perform on new data. We will now see how to estimate model performance using data withheld from building the model.

The test-train split

We will use the rsample package to split our data to perform the initial train-test split of our gapminder data.

# Tasks:
- Split our data into 75% training and 25% testing using the initial_split() function and assign it to gap_split.
- Extract the training dataframe from gap_split using the training() function.
- Extract the testing dataframe from gap_split using the testing() function.
- Ensure that the dimensions of our new dataframes are what we expected by using the dim() function on training_data and testing_data.

library(rsample)

set.seed(9)
dim(gapminder)

## [1] 4004    7

# Prepare the initial split object
gap_split <- initial_split(gapminder, prop = .75)
gap_split

## <Analysis/Assess/Total>
## <3003/1001/4004>

# Extract the training dataframe
training_data <- training(gap_split)

# Extract the testing dataframe
testing_data <- testing(gap_split)

# Calculate the dimensions of both training_data and testing_data
dim(training_data)

## [1] 3003    7

dim(testing_data)

## [1] 1001    7

Here on, we will take the steps necessary to identify the best performing model using only the training data. At the end of the chapter we will select the best performing model and measure its performance using the testing data that we created here.

Cross-validation dataframes

First, we will split the training data into a series of 5 train-validate sets using the vfold_cv() function from the rsample package.

# Tasks:
- Build a dataframe for 5-fold cross validation from the training_data using vfold_cv() and assign it to cv_split.
- Prepare cv_data by appending two new columns to cv_split:
    train: containing the train dataframes by mapping training() across the splits column.
    validate: containing the validate dataframes by using mapping testing() across the splits column.

set.seed(9)

# Prepare the dataframe containing the cross validation partitions
cv_split <- vfold_cv(training_data, v = 5)
cv_split

cv_split$splits

## [[1]]
## <Analysis/Assess/Total>
## <2402/601/3003>
## 
## [[2]]
## <Analysis/Assess/Total>
## <2402/601/3003>
## 
## [[3]]
## <Analysis/Assess/Total>
## <2402/601/3003>
## 
## [[4]]
## <Analysis/Assess/Total>
## <2403/600/3003>
## 
## [[5]]
## <Analysis/Assess/Total>
## <2403/600/3003>

cv_data <- cv_split %>% 
  mutate(
    # Extract the train dataframe for each split
    train = map(splits, ~training(.x)), 
    # Extract the validate dataframe for each split
    validate = map(splits, ~testing(.x))
  ) 

# Use head() to preview cv_data
cv_data

We will use this dataframe Here on to measure and compare the performance of the models we create.

Build cross-validated models

First, we will build a linear model predicting life_expectancy using all available features. we will do this for the train data of each cross-validation fold.

# TAsks:
- Build models predicting life_expectancy using all available features with the train data for each fold of the cross validation.

# Build a model using the train data for each fold of the cross validation
cv_models_lm <- cv_data %>% 
  mutate(model = map(train, ~lm(formula = life_expectancy~., data = .x)))

cv_models_lm

Now that we have the models built, let’s prepare the parts we need to evaluate their performance.

Preparing for evaluation

In order to measure the validate performance of our models we need compare the predicted values of life_expectancy for the observations from validate set to the actual values recorded. Here we will prepare both of these vectors for each partition.

# Tasks:
- Extract the actual life_expectancy from the validate dataframes and store these in the column validate_actual.
- Predict the life_expectancy for each validate partition using the map2() and predict() functions in the column validate_predicted.

cv_prep_lm <- cv_models_lm %>% 
  mutate(
    # Extract the recorded life expectancy for the records in the validate dataframes
    validate_actual = map(validate, ~.x$life_expectancy),
    # Predict life expectancy for each validate set using its corresponding model
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
  )

cv_prep_lm

Evaluate model performance

Now that we have both the actual and predicted values of each fold we can compare them to measure performance.

For this regression model, we will measure the Mean Absolute Error (MAE) between these two vectors. This value tells we the average difference between the actual and predicted values.

# Tasks:
- Calculate the MAE by comparing the actual with the predicted values for the validate data and assign it to the validate_mae column.
- Print the validate_mae column (note how they vary).
- Calculate the mean of this column.

library(Metrics)
# Calculate the mean absolute error for each validate fold       
cv_eval_lm <- cv_prep_lm %>% 
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
cv_eval_lm

# Print the validate_mae column
cv_eval_lm$validate_mae

## [1] 1.576854 1.512977 1.572972 1.601739 1.488472

# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)

## [1] 1.550603

we now know that based on 5 train-validate splits, the predictions of the models are on average off by 1.55 years. Let us try to improve this performance by using a more complex model

Build a random forest model

Here we will use the same cross-validation data to build (using train) and evaluate (using validate) random forests for each partition. Since we are using the same cross-validation partitions as our regression models, we are able to directly compare the performance of the two models.

# Tasks:
- Use ranger() to build a random forest predicting life_expectancy using all features in train for each cross validation partition.
- Add a new column validate_predicted predicting the life_expectancy for the observations in validate using the random forest models we just created.

library(ranger)

# Build a random forest model for each fold
cv_models_rf <- cv_data %>% 
  mutate(model = map(train, ~ranger(formula = life_expectancy~., data = .x,
                                    num.trees = 100, seed = 9)))
cv_models_rf

# Generate predictions using the random forest model
cv_prep_rf <- cv_models_rf %>% 
  mutate(validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
cv_prep_rf

Evaluate a random forest model

Similar to the linear regression model, we will use the MAE metric to evaluate the performance of the random forest model.

# Tasks:
- Calculate the MAE by comparing the actual with the predicted values for the validate data and assign it to the validate_mae column.
- Add the actual values of the validate fold (validate_actual) to the data frmae.
- Print the validate_mae column.
- Calculate the mean of this column.

cv_prep_rf <- cv_prep_rf %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))
cv_prep_rf

# Calculate validate MAE for each fold
cv_eval_rf <- cv_prep_rf %>% 
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))

# Print the validate_mae column
cv_eval_rf$validate_mae

## [1] 0.8550312 0.8240130 0.8337605 0.7316634 0.7891819

# Calculate the mean of validate_mae column
mean(cv_eval_rf$validate_mae)

## [1] 0.80673

We have dropped the average error of our predictions from 1.47 to 0.80. That’s quite an improvement! Now, we’ll see if we can squeeze a bit more performance out by tuning a parameter of the random forest model.

Fine tune our model

We will vary the mtry parameter when building our random forest models on our train data.

The default value of mtry for ranger is the rounded down square root of the total number of features (6). This results in a value of 2.

# Tasks
- Add validate_Actual to the cv_data dataframe
- Use crossing() to expand the cross validation data for values of mtry ranging from 2 through 5.
- Build random forest models for each fold/mtry combination.

cv_data <- cv_data %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))

# Prepare for tuning our cross validation folds by varying mtry
cv_tune <- cv_data %>% 
  crossing(mtry = 2:5) 

# Build a model for each fold & mtry combination
cv_model_tunerf <- cv_tune %>% 
  mutate(model = map2(.x = train, .y = mtry, ~ranger(formula = life_expectancy~., 
                                              data = .x, mtry = .y, 
                                              num.trees = 100, seed = 9)))
cv_model_tunerf

We have built a model for each fold/mtry combination. Next, we’ll measure the performance of each to find the best performing value of mtry.

The best performing parameter

The validate MAE of 0.80 we have calculated previously was for the default mtry value of 2.

# Tasks:
- Generate predictions for each mtry/fold combination.
- Calculate the MAE for each mtry/fold combination.
- Calculate the mean MAE for each value of mtry.

# Generate validate predictions for each model
cv_prep_tunerf <- cv_model_tunerf %>% 
  mutate(validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))

# Calculate validate MAE for each fold and mtry combination
cv_eval_tunerf <- cv_prep_tunerf %>% 
  mutate(validate_mae = map2_dbl(.x = validate_actual, .y = validate_predicted, ~mae(actual = .x, predicted = .y)))
cv_eval_tunerf

# Calculate the mean validate_mae for each mtry used  
cv_eval_tunerf %>% 
  group_by(mtry) %>% 
  summarise(mean_mae = mean(validate_mae))

Looks like parameter tuning was able to eke out another slight boost in performance, dropping the mae from 0.80 (mtry = 2) to 0.792 (mtry = 4). Assuming that we’ve finished our model selection we can conclude that our final (best performing) model will be the random forest model built using ranger with an mtry = 4 and num.trees = 100.

Next, we will build this model using all training data and evaluate its expected future performance using the testing data.

Build & evaluate the best model

Using cross-validation we were able to identify the best model for predicting life_expectancy using all the features in gapminder. Now that we’ve selected our model, we can use the independent set of data (testing_data) that we’ve held out to estimate the performance of this model on new data.

we will build this model using all training_data and evaluate using testing_data.

# Tasks:
- Use ranger() to build the best performing model (mtry = 4) using all of the training data. Assign this to best_model.
- Extract the life_expectancy column from testing_data and assign it to test_actual.
Predict life_expectancy using the best_model on the testing data and assign it to test_predicted.
- Calculate the MAE using test_actual and test_predicted vectors.

# Build the model using all training data and the best performing parameter
best_model <- ranger(formula = life_expectancy~., data = training_data,
                     mtry = 4, num.trees = 100, seed = 9)

# Prepare the test_actual vector
test_actual <- testing_data$life_expectancy

# Predict life_expectancy for the testing_data
test_predicted <- predict(best_model, testing_data)$predictions

# Calculate the test MAE
mae(test_actual, test_predicted)

## [1] 0.5994785

we have successfully leveraged the list column workflow to identify and build a model to predict life expectancy. we can claim that based on the test holdout we can expect that our predictions on new data will only be off by a magnitude of 0.599 years.

Prepare train-test-validate parts

Now we will build a classification model to predict employee attrition.

For this, we will work with the attrition dataset, which contains 30 features about employees which we will use to predict if they have left the company.

We will first prepare the training & testing data sets, then we will further split the training data using cross-validation so that we can search for the best performing model for this task.

# Tasks:
- Split our data into 75% training and 25% testing using the initial_split() function.
- Extract the training and testing dataframes from data_split using training() and testing(), respectively.

attrition <- readRDS('D:/learning/Datacamp/Datasets/attrition.rds')
head(attrition)

str(attrition)

## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

glimpse(attrition)

## Rows: 1,470
## Columns: 31
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department               <fct> Sales, Research_Development, Research_Develop~
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education                <ord> College, Below_College, College, Master, Belo~
## $ EducationField           <fct> Life_Sciences, Life_Sciences, Other, Life_Sci~
## $ EnvironmentSatisfaction  <ord> Medium, High, Very_High, Very_High, Low, Very~
## $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement           <ord> High, Medium, Medium, High, High, High, Very_~
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ JobRole                  <fct> Sales_Executive, Research_Scientist, Laborato~
## $ JobSatisfaction          <ord> Very_High, Medium, High, High, Medium, Very_H~
## $ MaritalStatus            <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating        <ord> Excellent, Outstanding, Excellent, Excellent,~
## $ RelationshipSatisfaction <ord> Low, Very_High, Medium, High, Very_High, High~
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance          <ord> Bad, Better, Better, Better, Better, Good, Go~
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~

set.seed(9)

# Prepare the initial split object
data_split <- initial_split(attrition, prop = .75)

# Extract the training dataframe
training_data <- training(data_split)

# Extract the testing dataframe
testing_data <- testing(data_split)
dim(training_data)

## [1] 1103   31

dim(testing_data)

## [1] 367  31

- Build a dataframe for 5-fold cross validation from the training_data using vfold_cv().
- Prepare the cv_data dataframe by extracting the train and validate dataframes for each fold.

set.seed(9)
cv_split <- vfold_cv(training_data, v = 5)

cv_data <- cv_split %>% 
  mutate(
    # Extract the train dataframe for each split
    train = map(splits, ~training(.x)),
    # Extract the validate dataframe for each split
    validate = map(splits, ~testing(.x))
  )
head(cv_data)

dim(cv_data)

## [1] 5 4

Now we have the parts necessary to build & tune our classification models.

Build cross-validated models

We will build logistic regression models for each fold in our cross-validation.

We will build this using the glm() function and by setting the family argument to “binomial”.

# Tasks:
- Build models predicting Attrition using all available features with the train data for each fold of the cross validation.

# Build a model using the train data for each fold of the cross validation
cv_models_lr <- cv_data %>% 
  mutate(model = map(train, ~glm(formula = Attrition~., 
                               data = .x, family = 'binomial')))
cv_models_lr

Predictions of a single model

To calculate the performance of a classification model we need to compare the actual values of Attrition to those predicted by the model. When calculating metrics for binary classification tasks (such as precision and recall), the actual and predicted vectors must be converted to binary values.

We will prepare these vectors using the model and validate dataframes from the first cross-validation fold as an example.

# Tasks:
- Extract the model and the validate dataframe from the first fold of the cross-validation.
- Extract the Attrition column from the validate dataframe and convert the values to binary (TRUE/FALSE).
- Use model to predict the probabilities of attrition for the validate dataframe.
- Convert the predicted probabilities to a binary vector, assume all probabilities greater than 0.5 are TRUE.

# Extract the first model and validate 
model <- cv_models_lr$model[[1]]
validate <- cv_models_lr$validate[[1]]

# Prepare binary vector of actual Attrition values in validate
validate_actual <- validate$Attrition == "Yes"

# Predict the probabilities for the observations in validate
validate_prob <- predict(model, validate, type = "response")

# Prepare binary vector of predicted Attrition values for validate
validate_predicted <- validate_prob >.5
validate_predicted

##    10    11    15    24    26    33    39    57    73    74    91    96    97 
## FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   103   104   107   112   131   138   144   169   178   199   207   211   214 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   218   227   242   247   262   269   284   292   307   309   321   329   331 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 
##   339   359   361   380   390   404   405   408   422   433   434   441   454 
## FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 
##   458   464   466   469   479   483   485   487   494   510   515   523   549 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE 
##   578   581   601   605   621   644   663   664   667   671   675   689   690 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   692   715   716   725   728   763   769   772   775   789   791   802   805 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   809   836   840   844   848   851   867   881   882   904   911   912   913 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE 
##   916   924   926   930   952   956   994   998  1005  1013  1029  1032  1036 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1043  1061  1062  1070  1080  1085  1094  1124  1125  1140  1143  1157  1190 
##  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE 
##  1195  1198  1216  1233  1234  1239  1255  1258  1264  1269  1275  1289  1291 
## FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 
##  1298  1301  1315  1344  1356  1363  1369  1377  1379  1387  1401  1448  1460 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE 
##  1464  1469  1481  1492  1506  1523  1535  1543  1553  1555  1558  1560  1572 
##  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1588  1597  1602  1609  1611  1612  1621  1623  1631  1633  1640  1650  1651 
## FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 
##  1655  1656  1676  1681  1687  1689  1693  1694  1712  1716  1720  1724  1733 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE 
##  1735  1766  1787  1805  1826  1830  1863  1865  1866  1878  1883  1905  1912 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE 
##  1918  1945  1971  2003  2010  2022  2027  2034  2035  2037  2048  2049  2052 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Now we have the actual and predicted vectors. We will use these vectors to calculate some metrics to check the performance of the model.

Performance of a single model

We can calculate many commonly used binary classification metrics. We will focus on:

accuracy: rate of correctly predicted values relative to all predictions.
precision: portion of predictions that the model correctly predicted as TRUE.
recall: portion of actual TRUE values that the model correctly recovered.

# Tasks:
- Use table() to compare the validate_actual and validate_predicted values for the example model and validate dataframe.
- Calculate the accuracy.
- Calculate the precision.
- Calculate the recall.

library(Metrics)

# Compare the actual & predicted performance visually using a table
table(validate_actual, validate_predicted)

##                validate_predicted
## validate_actual FALSE TRUE
##           FALSE   186   10
##           TRUE     12   13

# Calculate the accuracy
accuracy(validate_actual, validate_predicted)

## [1] 0.9004525

# Calculate the precision
precision(validate_actual, validate_predicted)

## [1] 0.5652174

# Calculate the recall
recall(validate_actual, validate_predicted)

## [1] 0.52

The type of metric we use should be informed by the application of our model. We will expand on this example to calculate the recall metric for each of our cross validation folds.

Prepare for cross-validated performance

We will expand this for all the folds in the cross-validation dataframe.

# Tasks:
- Add the validate_actual binary column for each cross-validation fold by converting all "Yes" values to TRUE.
- Use model to predict the probabilities of attrition for each cross-validation fold of validate. Convert the predicted probabilities to a binary vector, treating all probabilities greater than 0.5 as TRUE. Name this column validate_predicted.

cv_prep_lr <- cv_models_lr %>% 
  mutate(
    # Prepare binary vector of actual Attrition values in validate
    validate_actual = map(validate, ~.x$Attrition == "Yes"),
    # Prepare binary vector of predicted Attrition values for validate
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y, type = "response") > .5)
  )
cv_prep_lr

Calculate cross-validated performance

It is crucial to optimize models using a carefully selected metric aimed at achieving the goal of the model.

Imagine that in this case we want to use this model to identify employees that are predicted to leave the company. Ideally, we want a model that can capture as many of the ready-to-leave employees as possible so that we can intervene. The corresponding metric that captures this is the recall metric. As such, we will exclusively use recall to optimize and select our models.

# Tasks:
- Calculate the recall by comparing the actual with the predicted responses for each fold and assign it to the validate_recall column.
- Print the validate_recall column.
- Print the mean of this column.

# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_lr %>% 
  mutate(validate_recall = map2_dbl(validate_actual, validate_predicted, 
                                    ~recall(actual = .x, predicted = .y)))

cv_perf_recall

# Print the validate_recall column
cv_perf_recall$validate_recall

## [1] 0.5200000 0.5609756 0.3939394 0.3255814 0.3333333

# Calculate the average of the validate_recall column
mean(cv_perf_recall$validate_recall
)

## [1] 0.4267659

As we can see the validate recall of the model is 0.43. Now, we shall try to beat this using a more complex model.

Tune random forest models

Now that we have a working logistic regression model we will prepare a random forest model to compare it with.

# Tas:
- Use crossing() to expand the cross-validation data for values of mtry using the values of 2, 4, 8, and 16.
- Build random forest models for each fold/mtry combination.

# Prepare for tuning our cross validation folds by varying mtry
cv_tune <- cv_data %>%
  crossing(mtry = c(2,4,8,16)) 

# Build a cross validation model for each fold & mtry combination
cv_models_rf <- cv_tune %>% 
  mutate(model = map2(train, mtry, ~ranger(formula = Attrition~., 
                                           data = .x, mtry = .y,
                                           num.trees = 100, seed = 9)))

cv_models_rf

Random forest performance

It is now time to see whether the random forests models we built are able to outperform the logistic regression model.

The validate recall achieved for the logistic regression model was 0.43.

# Tasks:
- Prepare the validate_actual and validate_predicted columns for each mtry/fold combination.
- Calculate the recall for each mtry/fold combination.
- Calculate the mean recall for each value of mtry.

cv_prep_rf <- cv_models_rf %>% 
  mutate(
    # Prepare binary vector of actual Attrition values in validate
    validate_actual = map(validate, ~.x$Attrition == "Yes"),
    # Prepare binary vector of predicted Attrition values for validate
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y, type = "response")$predictions == "Yes")
  )

# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_rf %>% 
  mutate(recall = map2_dbl(.x = validate_actual, .y = validate_predicted, ~recall(actual = .x, predicted = .y)))

# Calculate the mean recall for each mtry used  
cv_perf_recall %>% 
  group_by(mtry) %>% 
  summarise(mean_recall = mean(recall))

This time we can see that none of the random forest models were able to outperform the logistic regression model with respect to recall.

Build final classification model

Comparing the recall performance between the logistic regression model (0.4) and the best performing random forest model (0.28), we conclude that the model with the best performance is the logistic regression model. Finally, we will build the logistic regression model using all of the train data and we will prepare the necessary vectors for evaluating this model’s test performance.

# Tasks:
- Build a logistic regression model predicting Attrition using all available features in the training_data.
- Prepare the binary vector of actual test values, test_actual.
- Prepare the binary vector of predicted values where a probability greater than 0.5 indicates TRUE and store this as test_predicted.

# Build the logistic regression model using all training data
best_model <- glm(formula = Attrition ~ ., 
                  data = training_data, family = "binomial")

# Prepare binary vector of actual Attrition values for testing_data
test_actual <- testing_data$Attrition == "Yes"

# Prepare binary vector of predicted Attrition values for testing_data
test_predicted <- predict(best_model, testing_data, type = "response") > .5

test_predicted

##     1     4    18    32    38    40    46    47    53    54    55    64    78 
##  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE 
##    83    85    94   105   110   113   118   124   126   129   137   140   141 
## FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE 
##   143   151   152   153   154   160   165   171   176   197   198   202   206 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   208   215   233   239   248   253   258   261   264   267   274   275   277 
## FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE 
##   282   293   296   297   300   312   314   330   332   334   337   341   343 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   346   373   384   385   396   399   403   410   411   417   423   428   431 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE 
##   436   437   438   444   445   446   447   448   449   456   468   477   481 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   482   491   492   498   499   500   501   508   511   516   517   518   520 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE 
##   525   529   530   532   534   544   565   574   577   580   582   584   587 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE 
##   592   599   602   604   613   616   620   622   623   626   634   641   643 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE 
##   645   653   659   662   666   679   680   699   701   709   710   714   721 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   731   732   742   746   749   754   762   766   783   800   803   811   815 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE 
##   817   820   824   830   832   833   842   845   859   875   880   888   893 
## FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   896   897   899   900   932   940   947   949   950   957   959   961   969 
## FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE 
##   975   976   984   991  1003  1006  1012  1018  1019  1027  1028  1039  1048 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE 
##  1052  1071  1083  1098  1103  1116  1119  1128  1131  1132  1135  1138  1148 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE 
##  1160  1175  1177  1180  1202  1204  1206  1210  1218  1219  1220  1226  1231 
## FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1235  1240  1243  1244  1245  1248  1256  1257  1278  1280  1286  1292  1293 
## FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE 
##  1295  1299  1304  1314  1317  1321  1331  1336  1346  1358  1390  1391  1407 
## FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1423  1427  1439  1441  1446  1458  1461  1466  1473  1474  1477  1480  1482 
## FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1484  1485  1489  1494  1496  1513  1514  1520  1522  1534  1541  1547  1549 
## FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE 
##  1550  1552  1568  1573  1583  1598  1605  1615  1617  1635  1642  1646  1647 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE 
##  1648  1649  1658  1661  1670  1671  1673  1677  1678  1692  1701  1709  1710 
## FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1714  1719  1728  1731  1734  1736  1739  1745  1749  1752  1757  1760  1764 
##  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 
##  1782  1784  1789  1790  1801  1804  1812  1813  1815  1816  1818  1822  1824 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1835  1837  1839  1847  1850  1852  1853  1857  1858  1862  1869  1882  1907 
##  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE 
##  1908  1911  1915  1916  1922  1927  1928  1932  1934  1948  1949  1951  1955 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE 
##  1966  1970  1985  1986  1995  1999  2007  2016  2021  2025  2026  2045  2054 
## FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE 
##  2055  2057  2062 
## FALSE FALSE FALSE

We have now selected & built our best performing model and have prepared the necessary parts to evaluate its performance.

Measure final model performance

Now its time to calculate the test performance of our final model (logistic regression). Here we will use the held out testing data to characterize the performance we would expect from this model when it is applied to new data.

# Tasks:
- Use table() to compare the test_actual and test_predicted vectors.
- Calculate the test accuracy.
- Calculate the test precision.
- Calculate the test recall.

# Compare the actual & predicted performance visually using a table
table(test_actual, test_predicted)

##            test_predicted
## test_actual FALSE TRUE
##       FALSE   288   23
##       TRUE     30   26

# Calculate the test accuracy
accuracy(test_actual, test_predicted)

## [1] 0.8555858

# Calculate the test precision
precision(test_actual, test_predicted)

## [1] 0.5306122

# Calculate the test recall
recall(test_actual, test_predicted)

## [1] 0.4642857

We now have a model that we can expect to identify 46% of employees that are at risk to leave the organization.

————— Thank You —————–

Machine Learing in the Tidyverse

V.Srinivas

06/07/2021

Unnesting our data

Explore a nested cell

Mapping our data

Expecting mapped output

Mapping many models

Extracting model statistics tidily

Augmenting our data

Tidy up the coefficients of our models

Glance at the fit of our models

Best and worst fitting models

Augment the fitted values of each model

Explore our best and worst fitting models

Build better models

The test-train split

Cross-validation dataframes

Build cross-validated models

Preparing for evaluation

Evaluate model performance

Build a random forest model

Evaluate a random forest model

Fine tune our model

The best performing parameter

Build & evaluate the best model

Prepare train-test-validate parts

Build cross-validated models

Predictions of a single model

Performance of a single model

Prepare for cross-validated performance

Calculate cross-validated performance

Tune random forest models

Random forest performance

Build final classification model

Measure final model performance