In this document, we will work with a collection of economic and social indicators for 77 countries over a period of 52 years. This data is stored in the gapminder dataframe from dslabs package.
We will first transform our gapminder data into a nested dataframe by using the first tool needed to build the foundation of tidy machine seeing skills: nest().
# Tasks:
- Take a look at the first six rows of gapminder.
- Leverage group_by() and nest() to nest our dataframe by country, save this as gap_nested.
- Explore the first six rows of the newly created dataframe gap_nested.
gapminder <- readRDS('D:/learning/Datacamp/Datasets/gapminder.rds')
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4004 obs. of 7 variables:
## $ country : Factor w/ 185 levels "Albania","Algeria",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ year : int 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
## $ infant_mortality: num 148 148 148 148 149 ...
## $ life_expectancy : num 47.5 48 48.5 49.1 49.6 ...
## $ fertility : num 7.65 7.65 7.65 7.65 7.65 7.66 7.66 7.66 7.66 7.65 ...
## $ population : num 11124892 11404859 11690152 11985130 12295973 ...
## $ gdpPercap : int 1242 1047 820 1075 1109 1147 1062 1130 1216 1282 ...
# Explore gapminder
head(gapminder)
# Prepare the nested dataframe gap_nested
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
gap_nested <- gapminder %>%
group_by(country) %>%
nest(data = !country) %>% ungroup()
# Explore gap_nested
head(gap_nested)
dim(gap_nested)
## [1] 77 2
Notice that each row in gap_nested contains a tibble.
As we’ve seen, a nested dataframe is simply a way to shape our data. Essentially taking the group_by() windows and packaging them in corresponding rows.
As a corollary to this, we can use the unnest() function to expand the dataframes that are nested in these chunks.
# Tasks:
- Use unnest() on the gap_nested dataframe to take a nested column and expand it into a new dataframe and save it as gap_unnested.
- Make sure that gapminder and gap_unnested are identical by using the identical() function.
# Create the unnested dataframe called gap_unnnested
gap_unnested <- gap_nested %>%
unnest(cols= c(data))
# Confirm that our data was not modified
identical(gapminder, gap_unnested)
## [1] TRUE
The data column in the nested dataframe gap_nested contains tibbles for each country. Let us explore one of these nested chunks.
# Tasks:
- Extract the row index of India
- Extract the nested data for India and store this as india_df.
- Calculate the following summary stats for India's population: min(), max() and mean().
which(gap_nested$country == 'India')
## [1] 34
# Extract the data of India
india_df <- gap_nested$data[[34]]
india_df
# Calculate the minimum of the population vector
min(india_df$population)
## [1] 449661874
# Calculate the maximum of the population vector
max(india_df$population)
## [1] 1247446011
# Calculate the mean of the population vector
mean(india_df$population)
## [1] 811243414
We can see that working with a single chunk in a nested dataframe is identical to working with regular dataframes. Now we will see how to scale this approach to work on a vector of nested dataframes using the map family of functions.
In combination with mutate(), we can use map() to append the results of our calculation to a dataframe. Since the map() function always returns a vector of lists we must use unnest() to extract this information into a numeric vector.
Here we will explore this functionality by calculating the mean population of each country in the dataset.
# TAsks:
- Use map() to apply the mean() function to calculate the population mean for each country and append this new list column called mean_pop using mutate().
- Explore the first 6 rows of pop_nested.
- Use unnest() to convert the mean_pop list into a numeric column and save this as the pop_mean dataframe.
- Explore pop_mean using head().
# Calculate the mean population for each country
pop_nested <- gap_nested %>%
mutate(mean_pop = map(data, ~mean(.x$population)))
# Take a look at pop_nested
head(pop_nested)
# Extract the mean_pop value by using unnest
pop_mean <- pop_nested %>%
unnest(mean_pop)
# Take a look at pop_mean
head(pop_mean)
When we know that the output of our mapped function is an expected type (here it is a numeric vector) we can leverage the map_*() family of functions to explicitly try to return that object type instead of a list.
Here we will again calculate the mean population of each country, but instead, we will use map_dbl() to explicitly append the numeric vector returned by mean() to our dataframe.
# Tasks:
- Generate the pop_mean dataframe using the map_dbl() function to calculate the population mean for each nested dataframe.
- Explore the pop_mean dataframe using head().
# Calculate mean population and store result as a double
pop_mean <- gap_nested %>%
mutate(mean_pop = map_dbl(data, ~mean(.x$population)))
# Take a look at pop_mean
head(pop_mean)
With the nest() and map_*() functions in hand we now have the foundation for building multiple models.
The gap_nested dataframe available in our workspace contains the gapminder dataset nested by country.
we will use this data to build a linear model for each country to predict life expectancy using the year feature.
# Tasks:
- Build a linear model for each country predicting life_expectancy using the year feature. We usse the lm() function for this and save this new dataframe containing models as gap_models.
- Extract the 34th (corersponding to India) model from this dataframe and save this as india_model.
View the information about the model using summary().
# Build a linear model for each country
gap_models <- gap_nested %>%
mutate(model = map(data, ~lm(formula = life_expectancy~year, data = .x)))
# Extract the model for India
india_model <- gap_models$model[[34]]
# View the summary for the India model
summary(india_model)
##
## Call:
## lm(formula = life_expectancy ~ year, data = .x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.90159 -0.83016 -0.05029 1.09813 1.96997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -834.74522 23.67314 -35.26 <2e-16 ***
## year 0.44842 0.01192 37.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.29 on 50 degrees of freedom
## Multiple R-squared: 0.9659, Adjusted R-squared: 0.9652
## F-statistic: 1415 on 1 and 50 DF, p-value: < 2.2e-16
We have just built 77 models for 77 countries with just a few lines of code.
Next, we will see how we can extract information from summary() in a tidy fashion.
The three ways to tidy our model
Below are the descriptions of the three functions in the broom package.
tidy() returns the statistical findings of the model (such as coefficients)
glance() returns a concise one-row summary of the model
augment() adds prediction columns to the data being modeled
First, we will use the tidy() and glance() functions to extract information from india_model in a tidy manner.
For a linear model, tidy() extracts the model coefficients while glance() returns the model statistics such as the
# Tasks:
- Extract the coefficient information as a tidy dataframe of the india_model using tidy().
- Extract the model statistics of india_model using glance().
library(broom)
# Extract the coefficients of the india_model as a dataframe
tidy(india_model)
# Extract the statistics of the india_model as a dataframe
glance(india_model)
As we can see tidy() and glance() both return dataframes, this feature can be very useful for managing the results of many models in one dataframe.
From the results of glance(), we saw that using the available features the linear model fits well with an adjusted R^2 value of 0.96. The augment() function can help us to explore this fit by appending the predictions to the original data.
We will leverage this to compare the predicted values of life_expectancy based on the year feature with the original ones.
# Tasks:
- Build the augmented dataframe india_fitted using augment().
- Visualize the fit of the model with respect to year by plotting both life_expectancy as points and .fitted as a line.
- Visualize the fit of the model with respect to teh actual values.
# Build the augmented dataframe
india_fitted <- augment(india_model)
head(india_fitted)
# Compare the predicted values with the actual values of life expectancy
india_fitted %>%
ggplot(aes(x = year)) +
geom_point(aes(y = life_expectancy)) +
geom_line(aes(y = .fitted), color = "red")
india_fitted %>%
ggplot(aes(x = .fitted, y = life_expectancy)) +
geom_point() +
geom_abline(color = 'blue')
We will now leverage the list column workflow along with the tidy() function from broom to extract and explore the coefficients for the 77 models we built.
# Tasks:
- Use tidy() to append a column (coef) containing coefficient statistics for each model to the gap_models dataframe and save it as model_coef_nested.
- Simplify this dataframe using unnest() to extract these coefficients in our dataframe.
- Explore the coefficient estimates for the year feature across our 77 models by plotting a histogram of their values.
# Extract the coefficient statistics of each model into nested dataframes
model_coef_nested <- gap_models %>%
mutate(coef = map(model, ~tidy(.x)))
head(model_coef_nested)
# Simplify the coef dataframes for each model
model_coef <- model_coef_nested %>%
unnest(coef)
head(model_coef)
# Plot a histogram of the coefficient estimates for year
model_coef %>%
filter(term == "year") %>%
ggplot(aes(x = estimate)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now that we have the slope for each model let’s explore their distribution.
model_coef %>% filter(term == 'year') %>% select(country, estimate) %>% filter(estimate < 0) %>% nrow()
## [1] 4
model_coef %>% filter(term == 'year') %>% select(country, estimate) %>% arrange(desc(estimate))
model_coef %>% filter(term == 'year') %>% select(country, estimate) %>% summary()
## country estimate
## Algeria : 1 Min. :-0.1984
## Argentina : 1 1st Qu.: 0.2185
## Australia : 1 Median : 0.2951
## Austria : 1 Mean : 0.3180
## Bangladesh: 1 3rd Qu.: 0.4492
## Belgium : 1 Max. : 0.6711
## (Other) :71
In this exercise we will use glance() to calculate how well the linear models fit the data for each country.
# Tasks:
- Append a column (fit) containing the fit statistics for each model to the gap_models dataframe and save it as model_perf_nested.
- Simplify this dataframe using unnest() to extract these fit statistics of each model and save it as model_perf.
- Finally, use head() to take a peek at model_perf.
# Extract the fit statistics of each model into dataframes
model_perf_nested <- gap_models %>%
mutate(fit = map(model, ~glance(.x)))
head(model_perf_nested)
# Simplify the fit dataframes for each model
model_perf <- model_perf_nested %>%
unnest(fit)
# Look at the first six rows of model_perf
head(model_perf)
Let us try to answer the following questions:
- Overall, how well do our models fit our data?
- Which are the best fitting models?
- Which models do not fit the data well?
# Tasks:
- Plot a histogram of the R^2 values of the 77 models
- Extract the 4 best fitting models (based on R^2) and store this dataframe as best_fit
- Extract the 4 worst fitting models (based on R^2) and store this dataframe as worst_fit
# Plot a histogram of rsquared for the 77 models
model_perf %>%
ggplot(aes(x = r.squared)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Extract the 4 best fitting models
best_fit <- model_perf %>%
top_n(n=4, wt= r.squared)
best_fit
# Extract the 4 models with the worst fit
worst_fit <- model_perf %>%
top_n(n=4, wt=-r.squared)
worst_fit
we have now prepared two dataframes, one containing the four best fitting models and another the four worst fitting models. Next, we will use the augment() function to explore these fits visually.
In this exercise we will prepare our four best and worst fitting models for further exploration by augmenting our model data with augment().
# Tasks:
- Build the best_augmented dataframe by building augmented dataframes and simplifying them with unnest() using the best_fit dataframe.
- Build the worst_augmented dataframe by building augmented dataframes and simplifying them with unnest() using the worst_fit dataframe.
best_augmented <- best_fit %>%
# Build the augmented dataframe for each country model
mutate(augmented = map(model, ~augment(.x))) %>%
# Expand the augmented dataframes
unnest(augmented)
worst_augmented <- worst_fit %>%
# Build the augmented dataframe for each country model
mutate(augmented = map(model, ~augment(.x))) %>%
# Expand the augmented dataframes
unnest(augmented)
Let’s explore our four best and worst fitting models by comparing the fitted lines with the actual values.
# TAsks:
- Visualize the fit of our four best fitting models with respect to year by plotting both life_expectancy as points and .fitted as a line.
# Compare the predicted values with the actual values of life expectancy
# for the top 4 best fitting models
best_augmented %>%
ggplot(aes(x = year)) +
geom_point(aes(y = life_expectancy)) +
geom_line(aes(y = .fitted), color = "red") +
facet_wrap(~country, scales = "free_y")
- Visualize the fit of our four worst fitting models with respect to year by plotting both life_expectancy as points and .fitted as a line.
# Compare the predicted values with the actual values of life expectancy
# for the top 4 worst fitting models
worst_augmented %>%
ggplot(aes(x = year)) +
geom_point(aes(y = life_expectancy)) +
geom_line(aes(y = .fitted), color = "red") +
facet_wrap(~country, scales = "free_y")
We can see that a linear model does a great job for the best 4 fitting models but the worst 4 fitting models do not seem to have a linear relationship. we will work to improve this fit by incorporating additional features.
We will build multiple regression models for each country using all available features. we may be interested in comparing the performance of the four worst fitting models so their adjusted R^2 values generated previously:
# Tasks:
- Build a linear model for each country predicting life_expectancy using every feature in the dataset.
- Append a column (fit) containing fit statistics for each model and simplify this dataframe.
- Print the adjusted R^2 in fullmodel_perf of the four countries from worst_fit dataframe.
# Build a linear model for each country using all features
gap_fullmodel <- gap_nested %>%
mutate(model = map(data, ~ lm(formula = life_expectancy~., data = .x)))
fullmodel_perf <- gap_fullmodel %>%
# Extract the fit statistics of each model into dataframes
mutate(fit = map(model, ~glance(.x))) %>%
# Simplify the fit dataframes for each model
unnest(fit)
# View the performance for the four countries with the worst fitting
# four simple models we looked at before
fullmodel_perf %>%
filter(country %in% worst_fit$country) %>%
select(country, adj.r.squared)
we can see that the performance of each of the four worst performing models based on their adjusted R^2 drastically improved once other features were added to the model.
While the adjusted does tell us how well the model fit our data, it does not give any indication on how it would perform on new data. We will now see how to estimate model performance using data withheld from building the model.
We will use the rsample package to split our data to perform the initial train-test split of our gapminder data.
# Tasks:
- Split our data into 75% training and 25% testing using the initial_split() function and assign it to gap_split.
- Extract the training dataframe from gap_split using the training() function.
- Extract the testing dataframe from gap_split using the testing() function.
- Ensure that the dimensions of our new dataframes are what we expected by using the dim() function on training_data and testing_data.
library(rsample)
set.seed(9)
dim(gapminder)
## [1] 4004 7
# Prepare the initial split object
gap_split <- initial_split(gapminder, prop = .75)
gap_split
## <Analysis/Assess/Total>
## <3003/1001/4004>
# Extract the training dataframe
training_data <- training(gap_split)
# Extract the testing dataframe
testing_data <- testing(gap_split)
# Calculate the dimensions of both training_data and testing_data
dim(training_data)
## [1] 3003 7
dim(testing_data)
## [1] 1001 7
Here on, we will take the steps necessary to identify the best performing model using only the training data. At the end of the chapter we will select the best performing model and measure its performance using the testing data that we created here.
First, we will split the training data into a series of 5 train-validate sets using the vfold_cv() function from the rsample package.
# Tasks:
- Build a dataframe for 5-fold cross validation from the training_data using vfold_cv() and assign it to cv_split.
- Prepare cv_data by appending two new columns to cv_split:
train: containing the train dataframes by mapping training() across the splits column.
validate: containing the validate dataframes by using mapping testing() across the splits column.
set.seed(9)
# Prepare the dataframe containing the cross validation partitions
cv_split <- vfold_cv(training_data, v = 5)
cv_split
cv_split$splits
## [[1]]
## <Analysis/Assess/Total>
## <2402/601/3003>
##
## [[2]]
## <Analysis/Assess/Total>
## <2402/601/3003>
##
## [[3]]
## <Analysis/Assess/Total>
## <2402/601/3003>
##
## [[4]]
## <Analysis/Assess/Total>
## <2403/600/3003>
##
## [[5]]
## <Analysis/Assess/Total>
## <2403/600/3003>
cv_data <- cv_split %>%
mutate(
# Extract the train dataframe for each split
train = map(splits, ~training(.x)),
# Extract the validate dataframe for each split
validate = map(splits, ~testing(.x))
)
# Use head() to preview cv_data
cv_data
We will use this dataframe Here on to measure and compare the performance of the models we create.
First, we will build a linear model predicting life_expectancy using all available features. we will do this for the train data of each cross-validation fold.
# TAsks:
- Build models predicting life_expectancy using all available features with the train data for each fold of the cross validation.
# Build a model using the train data for each fold of the cross validation
cv_models_lm <- cv_data %>%
mutate(model = map(train, ~lm(formula = life_expectancy~., data = .x)))
cv_models_lm
Now that we have the models built, let’s prepare the parts we need to evaluate their performance.
In order to measure the validate performance of our models we need compare the predicted values of life_expectancy for the observations from validate set to the actual values recorded. Here we will prepare both of these vectors for each partition.
# Tasks:
- Extract the actual life_expectancy from the validate dataframes and store these in the column validate_actual.
- Predict the life_expectancy for each validate partition using the map2() and predict() functions in the column validate_predicted.
cv_prep_lm <- cv_models_lm %>%
mutate(
# Extract the recorded life expectancy for the records in the validate dataframes
validate_actual = map(validate, ~.x$life_expectancy),
# Predict life expectancy for each validate set using its corresponding model
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
)
cv_prep_lm
Now that we have both the actual and predicted values of each fold we can compare them to measure performance.
For this regression model, we will measure the Mean Absolute Error (MAE) between these two vectors. This value tells we the average difference between the actual and predicted values.
# Tasks:
- Calculate the MAE by comparing the actual with the predicted values for the validate data and assign it to the validate_mae column.
- Print the validate_mae column (note how they vary).
- Calculate the mean of this column.
library(Metrics)
# Calculate the mean absolute error for each validate fold
cv_eval_lm <- cv_prep_lm %>%
mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
cv_eval_lm
# Print the validate_mae column
cv_eval_lm$validate_mae
## [1] 1.576854 1.512977 1.572972 1.601739 1.488472
# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)
## [1] 1.550603
we now know that based on 5 train-validate splits, the predictions of the models are on average off by 1.55 years. Let us try to improve this performance by using a more complex model
Here we will use the same cross-validation data to build (using train) and evaluate (using validate) random forests for each partition. Since we are using the same cross-validation partitions as our regression models, we are able to directly compare the performance of the two models.
# Tasks:
- Use ranger() to build a random forest predicting life_expectancy using all features in train for each cross validation partition.
- Add a new column validate_predicted predicting the life_expectancy for the observations in validate using the random forest models we just created.
library(ranger)
# Build a random forest model for each fold
cv_models_rf <- cv_data %>%
mutate(model = map(train, ~ranger(formula = life_expectancy~., data = .x,
num.trees = 100, seed = 9)))
cv_models_rf
# Generate predictions using the random forest model
cv_prep_rf <- cv_models_rf %>%
mutate(validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
cv_prep_rf
Similar to the linear regression model, we will use the MAE metric to evaluate the performance of the random forest model.
# Tasks:
- Calculate the MAE by comparing the actual with the predicted values for the validate data and assign it to the validate_mae column.
- Add the actual values of the validate fold (validate_actual) to the data frmae.
- Print the validate_mae column.
- Calculate the mean of this column.
cv_prep_rf <- cv_prep_rf %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))
cv_prep_rf
# Calculate validate MAE for each fold
cv_eval_rf <- cv_prep_rf %>%
mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
# Print the validate_mae column
cv_eval_rf$validate_mae
## [1] 0.8550312 0.8240130 0.8337605 0.7316634 0.7891819
# Calculate the mean of validate_mae column
mean(cv_eval_rf$validate_mae)
## [1] 0.80673
We have dropped the average error of our predictions from 1.47 to 0.80. That’s quite an improvement! Now, we’ll see if we can squeeze a bit more performance out by tuning a parameter of the random forest model.
We will vary the mtry parameter when building our random forest models on our train data.
The default value of mtry for ranger is the rounded down square root of the total number of features (6). This results in a value of 2.
# Tasks
- Add validate_Actual to the cv_data dataframe
- Use crossing() to expand the cross validation data for values of mtry ranging from 2 through 5.
- Build random forest models for each fold/mtry combination.
cv_data <- cv_data %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))
# Prepare for tuning our cross validation folds by varying mtry
cv_tune <- cv_data %>%
crossing(mtry = 2:5)
# Build a model for each fold & mtry combination
cv_model_tunerf <- cv_tune %>%
mutate(model = map2(.x = train, .y = mtry, ~ranger(formula = life_expectancy~.,
data = .x, mtry = .y,
num.trees = 100, seed = 9)))
cv_model_tunerf
We have built a model for each fold/mtry combination. Next, we’ll measure the performance of each to find the best performing value of mtry.
The validate MAE of 0.80 we have calculated previously was for the default mtry value of 2.
# Tasks:
- Generate predictions for each mtry/fold combination.
- Calculate the MAE for each mtry/fold combination.
- Calculate the mean MAE for each value of mtry.
# Generate validate predictions for each model
cv_prep_tunerf <- cv_model_tunerf %>%
mutate(validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
# Calculate validate MAE for each fold and mtry combination
cv_eval_tunerf <- cv_prep_tunerf %>%
mutate(validate_mae = map2_dbl(.x = validate_actual, .y = validate_predicted, ~mae(actual = .x, predicted = .y)))
cv_eval_tunerf
# Calculate the mean validate_mae for each mtry used
cv_eval_tunerf %>%
group_by(mtry) %>%
summarise(mean_mae = mean(validate_mae))
Looks like parameter tuning was able to eke out another slight boost in performance, dropping the mae from 0.80 (mtry = 2) to 0.792 (mtry = 4). Assuming that we’ve finished our model selection we can conclude that our final (best performing) model will be the random forest model built using ranger with an mtry = 4 and num.trees = 100.
Next, we will build this model using all training data and evaluate its expected future performance using the testing data.
Using cross-validation we were able to identify the best model for predicting life_expectancy using all the features in gapminder. Now that we’ve selected our model, we can use the independent set of data (testing_data) that we’ve held out to estimate the performance of this model on new data.
we will build this model using all training_data and evaluate using testing_data.
# Tasks:
- Use ranger() to build the best performing model (mtry = 4) using all of the training data. Assign this to best_model.
- Extract the life_expectancy column from testing_data and assign it to test_actual.
Predict life_expectancy using the best_model on the testing data and assign it to test_predicted.
- Calculate the MAE using test_actual and test_predicted vectors.
# Build the model using all training data and the best performing parameter
best_model <- ranger(formula = life_expectancy~., data = training_data,
mtry = 4, num.trees = 100, seed = 9)
# Prepare the test_actual vector
test_actual <- testing_data$life_expectancy
# Predict life_expectancy for the testing_data
test_predicted <- predict(best_model, testing_data)$predictions
# Calculate the test MAE
mae(test_actual, test_predicted)
## [1] 0.5994785
we have successfully leveraged the list column workflow to identify and build a model to predict life expectancy. we can claim that based on the test holdout we can expect that our predictions on new data will only be off by a magnitude of 0.599 years.
Now we will build a classification model to predict employee attrition.
For this, we will work with the attrition dataset, which contains 30 features about employees which we will use to predict if they have left the company.
We will first prepare the training & testing data sets, then we will further split the training data using cross-validation so that we can search for the best performing model for this task.
# Tasks:
- Split our data into 75% training and 25% testing using the initial_split() function.
- Extract the training and testing dataframes from data_split using training() and testing(), respectively.
attrition <- readRDS('D:/learning/Datacamp/Datasets/attrition.rds')
head(attrition)
str(attrition)
## 'data.frame': 1470 obs. of 31 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
glimpse(attrition)
## Rows: 1,470
## Columns: 31
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department <fct> Sales, Research_Development, Research_Develop~
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education <ord> College, Below_College, College, Master, Belo~
## $ EducationField <fct> Life_Sciences, Life_Sciences, Other, Life_Sci~
## $ EnvironmentSatisfaction <ord> Medium, High, Very_High, Very_High, Low, Very~
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement <ord> High, Medium, Medium, High, High, High, Very_~
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ JobRole <fct> Sales_Executive, Research_Scientist, Laborato~
## $ JobSatisfaction <ord> Very_High, Medium, High, High, Medium, Very_H~
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating <ord> Excellent, Outstanding, Excellent, Excellent,~
## $ RelationshipSatisfaction <ord> Low, Very_High, Medium, High, Very_High, High~
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance <ord> Bad, Better, Better, Better, Better, Good, Go~
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
set.seed(9)
# Prepare the initial split object
data_split <- initial_split(attrition, prop = .75)
# Extract the training dataframe
training_data <- training(data_split)
# Extract the testing dataframe
testing_data <- testing(data_split)
dim(training_data)
## [1] 1103 31
dim(testing_data)
## [1] 367 31
- Build a dataframe for 5-fold cross validation from the training_data using vfold_cv().
- Prepare the cv_data dataframe by extracting the train and validate dataframes for each fold.
set.seed(9)
cv_split <- vfold_cv(training_data, v = 5)
cv_data <- cv_split %>%
mutate(
# Extract the train dataframe for each split
train = map(splits, ~training(.x)),
# Extract the validate dataframe for each split
validate = map(splits, ~testing(.x))
)
head(cv_data)
dim(cv_data)
## [1] 5 4
Now we have the parts necessary to build & tune our classification models.
We will build logistic regression models for each fold in our cross-validation.
We will build this using the glm() function and by setting the family argument to “binomial”.
# Tasks:
- Build models predicting Attrition using all available features with the train data for each fold of the cross validation.
# Build a model using the train data for each fold of the cross validation
cv_models_lr <- cv_data %>%
mutate(model = map(train, ~glm(formula = Attrition~.,
data = .x, family = 'binomial')))
cv_models_lr
To calculate the performance of a classification model we need to compare the actual values of Attrition to those predicted by the model. When calculating metrics for binary classification tasks (such as precision and recall), the actual and predicted vectors must be converted to binary values.
We will prepare these vectors using the model and validate dataframes from the first cross-validation fold as an example.
# Tasks:
- Extract the model and the validate dataframe from the first fold of the cross-validation.
- Extract the Attrition column from the validate dataframe and convert the values to binary (TRUE/FALSE).
- Use model to predict the probabilities of attrition for the validate dataframe.
- Convert the predicted probabilities to a binary vector, assume all probabilities greater than 0.5 are TRUE.
# Extract the first model and validate
model <- cv_models_lr$model[[1]]
validate <- cv_models_lr$validate[[1]]
# Prepare binary vector of actual Attrition values in validate
validate_actual <- validate$Attrition == "Yes"
# Predict the probabilities for the observations in validate
validate_prob <- predict(model, validate, type = "response")
# Prepare binary vector of predicted Attrition values for validate
validate_predicted <- validate_prob >.5
validate_predicted
## 10 11 15 24 26 33 39 57 73 74 91 96 97
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 103 104 107 112 131 138 144 169 178 199 207 211 214
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 218 227 242 247 262 269 284 292 307 309 321 329 331
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 339 359 361 380 390 404 405 408 422 433 434 441 454
## FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 458 464 466 469 479 483 485 487 494 510 515 523 549
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 578 581 601 605 621 644 663 664 667 671 675 689 690
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 692 715 716 725 728 763 769 772 775 789 791 802 805
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 809 836 840 844 848 851 867 881 882 904 911 912 913
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## 916 924 926 930 952 956 994 998 1005 1013 1029 1032 1036
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1043 1061 1062 1070 1080 1085 1094 1124 1125 1140 1143 1157 1190
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## 1195 1198 1216 1233 1234 1239 1255 1258 1264 1269 1275 1289 1291
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 1298 1301 1315 1344 1356 1363 1369 1377 1379 1387 1401 1448 1460
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## 1464 1469 1481 1492 1506 1523 1535 1543 1553 1555 1558 1560 1572
## TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1588 1597 1602 1609 1611 1612 1621 1623 1631 1633 1640 1650 1651
## FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 1655 1656 1676 1681 1687 1689 1693 1694 1712 1716 1720 1724 1733
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 1735 1766 1787 1805 1826 1830 1863 1865 1866 1878 1883 1905 1912
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## 1918 1945 1971 2003 2010 2022 2027 2034 2035 2037 2048 2049 2052
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Now we have the actual and predicted vectors. We will use these vectors to calculate some metrics to check the performance of the model.
We can calculate many commonly used binary classification metrics. We will focus on:
accuracy: rate of correctly predicted values relative to all predictions.
precision: portion of predictions that the model correctly predicted as TRUE.
recall: portion of actual TRUE values that the model correctly recovered.
# Tasks:
- Use table() to compare the validate_actual and validate_predicted values for the example model and validate dataframe.
- Calculate the accuracy.
- Calculate the precision.
- Calculate the recall.
library(Metrics)
# Compare the actual & predicted performance visually using a table
table(validate_actual, validate_predicted)
## validate_predicted
## validate_actual FALSE TRUE
## FALSE 186 10
## TRUE 12 13
# Calculate the accuracy
accuracy(validate_actual, validate_predicted)
## [1] 0.9004525
# Calculate the precision
precision(validate_actual, validate_predicted)
## [1] 0.5652174
# Calculate the recall
recall(validate_actual, validate_predicted)
## [1] 0.52
The type of metric we use should be informed by the application of our model. We will expand on this example to calculate the recall metric for each of our cross validation folds.
We will expand this for all the folds in the cross-validation dataframe.
# Tasks:
- Add the validate_actual binary column for each cross-validation fold by converting all "Yes" values to TRUE.
- Use model to predict the probabilities of attrition for each cross-validation fold of validate. Convert the predicted probabilities to a binary vector, treating all probabilities greater than 0.5 as TRUE. Name this column validate_predicted.
cv_prep_lr <- cv_models_lr %>%
mutate(
# Prepare binary vector of actual Attrition values in validate
validate_actual = map(validate, ~.x$Attrition == "Yes"),
# Prepare binary vector of predicted Attrition values for validate
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y, type = "response") > .5)
)
cv_prep_lr
It is crucial to optimize models using a carefully selected metric aimed at achieving the goal of the model.
Imagine that in this case we want to use this model to identify employees that are predicted to leave the company. Ideally, we want a model that can capture as many of the ready-to-leave employees as possible so that we can intervene. The corresponding metric that captures this is the recall metric. As such, we will exclusively use recall to optimize and select our models.
# Tasks:
- Calculate the recall by comparing the actual with the predicted responses for each fold and assign it to the validate_recall column.
- Print the validate_recall column.
- Print the mean of this column.
# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_lr %>%
mutate(validate_recall = map2_dbl(validate_actual, validate_predicted,
~recall(actual = .x, predicted = .y)))
cv_perf_recall
# Print the validate_recall column
cv_perf_recall$validate_recall
## [1] 0.5200000 0.5609756 0.3939394 0.3255814 0.3333333
# Calculate the average of the validate_recall column
mean(cv_perf_recall$validate_recall
)
## [1] 0.4267659
As we can see the validate recall of the model is 0.43. Now, we shall try to beat this using a more complex model.
Now that we have a working logistic regression model we will prepare a random forest model to compare it with.
# Tas:
- Use crossing() to expand the cross-validation data for values of mtry using the values of 2, 4, 8, and 16.
- Build random forest models for each fold/mtry combination.
# Prepare for tuning our cross validation folds by varying mtry
cv_tune <- cv_data %>%
crossing(mtry = c(2,4,8,16))
# Build a cross validation model for each fold & mtry combination
cv_models_rf <- cv_tune %>%
mutate(model = map2(train, mtry, ~ranger(formula = Attrition~.,
data = .x, mtry = .y,
num.trees = 100, seed = 9)))
cv_models_rf
It is now time to see whether the random forests models we built are able to outperform the logistic regression model.
The validate recall achieved for the logistic regression model was 0.43.
# Tasks:
- Prepare the validate_actual and validate_predicted columns for each mtry/fold combination.
- Calculate the recall for each mtry/fold combination.
- Calculate the mean recall for each value of mtry.
cv_prep_rf <- cv_models_rf %>%
mutate(
# Prepare binary vector of actual Attrition values in validate
validate_actual = map(validate, ~.x$Attrition == "Yes"),
# Prepare binary vector of predicted Attrition values for validate
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y, type = "response")$predictions == "Yes")
)
# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_rf %>%
mutate(recall = map2_dbl(.x = validate_actual, .y = validate_predicted, ~recall(actual = .x, predicted = .y)))
# Calculate the mean recall for each mtry used
cv_perf_recall %>%
group_by(mtry) %>%
summarise(mean_recall = mean(recall))
This time we can see that none of the random forest models were able to outperform the logistic regression model with respect to recall.
Comparing the recall performance between the logistic regression model (0.4) and the best performing random forest model (0.28), we conclude that the model with the best performance is the logistic regression model. Finally, we will build the logistic regression model using all of the train data and we will prepare the necessary vectors for evaluating this model’s test performance.
# Tasks:
- Build a logistic regression model predicting Attrition using all available features in the training_data.
- Prepare the binary vector of actual test values, test_actual.
- Prepare the binary vector of predicted values where a probability greater than 0.5 indicates TRUE and store this as test_predicted.
# Build the logistic regression model using all training data
best_model <- glm(formula = Attrition ~ .,
data = training_data, family = "binomial")
# Prepare binary vector of actual Attrition values for testing_data
test_actual <- testing_data$Attrition == "Yes"
# Prepare binary vector of predicted Attrition values for testing_data
test_predicted <- predict(best_model, testing_data, type = "response") > .5
test_predicted
## 1 4 18 32 38 40 46 47 53 54 55 64 78
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## 83 85 94 105 110 113 118 124 126 129 137 140 141
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 143 151 152 153 154 160 165 171 176 197 198 202 206
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 208 215 233 239 248 253 258 261 264 267 274 275 277
## FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## 282 293 296 297 300 312 314 330 332 334 337 341 343
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 346 373 384 385 396 399 403 410 411 417 423 428 431
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 436 437 438 444 445 446 447 448 449 456 468 477 481
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 482 491 492 498 499 500 501 508 511 516 517 518 520
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 525 529 530 532 534 544 565 574 577 580 582 584 587
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 592 599 602 604 613 616 620 622 623 626 634 641 643
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 645 653 659 662 666 679 680 699 701 709 710 714 721
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 731 732 742 746 749 754 762 766 783 800 803 811 815
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## 817 820 824 830 832 833 842 845 859 875 880 888 893
## FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 896 897 899 900 932 940 947 949 950 957 959 961 969
## FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 975 976 984 991 1003 1006 1012 1018 1019 1027 1028 1039 1048
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 1052 1071 1083 1098 1103 1116 1119 1128 1131 1132 1135 1138 1148
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 1160 1175 1177 1180 1202 1204 1206 1210 1218 1219 1220 1226 1231
## FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1235 1240 1243 1244 1245 1248 1256 1257 1278 1280 1286 1292 1293
## FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 1295 1299 1304 1314 1317 1321 1331 1336 1346 1358 1390 1391 1407
## FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1423 1427 1439 1441 1446 1458 1461 1466 1473 1474 1477 1480 1482
## FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1484 1485 1489 1494 1496 1513 1514 1520 1522 1534 1541 1547 1549
## FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 1550 1552 1568 1573 1583 1598 1605 1615 1617 1635 1642 1646 1647
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## 1648 1649 1658 1661 1670 1671 1673 1677 1678 1692 1701 1709 1710
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1714 1719 1728 1731 1734 1736 1739 1745 1749 1752 1757 1760 1764
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 1782 1784 1789 1790 1801 1804 1812 1813 1815 1816 1818 1822 1824
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1835 1837 1839 1847 1850 1852 1853 1857 1858 1862 1869 1882 1907
## TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 1908 1911 1915 1916 1922 1927 1928 1932 1934 1948 1949 1951 1955
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1966 1970 1985 1986 1995 1999 2007 2016 2021 2025 2026 2045 2054
## FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2055 2057 2062
## FALSE FALSE FALSE
We have now selected & built our best performing model and have prepared the necessary parts to evaluate its performance.
Now its time to calculate the test performance of our final model (logistic regression). Here we will use the held out testing data to characterize the performance we would expect from this model when it is applied to new data.
# Tasks:
- Use table() to compare the test_actual and test_predicted vectors.
- Calculate the test accuracy.
- Calculate the test precision.
- Calculate the test recall.
# Compare the actual & predicted performance visually using a table
table(test_actual, test_predicted)
## test_predicted
## test_actual FALSE TRUE
## FALSE 288 23
## TRUE 30 26
# Calculate the test accuracy
accuracy(test_actual, test_predicted)
## [1] 0.8555858
# Calculate the test precision
precision(test_actual, test_predicted)
## [1] 0.5306122
# Calculate the test recall
recall(test_actual, test_predicted)
## [1] 0.4642857
We now have a model that we can expect to identify 46% of employees that are at risk to leave the organization.
————— Thank You —————–