Data understanding

In Data Understanding, you:

  • Import data
  • Clean data
  • Format data properly
  • Create new variables
  • Get an overview of the complete data
  • Split data into training and test set using stratified sampling
  • Discover and visualize the training data to gain insights

Setup

If you would like to install all packages at once, use the code below.

#install.packages(c("tidyverse", "skimr", "GGally", "ggmap", "visdat", "corrr", "ggsignif", "gt", "vip", "themis", "purrr", "tidyr", "tidymodels", "keras", "ranger", "xgboost", "kknn", "discrim")) 

library(tidyverse)   # Collection of R packages for data manipulation, visualization, and analysis  
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)       # Provides enhanced summary statistics for data frames  
Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio
library(GGally)      # Extension of ggplot2 for correlation matrices, scatterplot matrices, and more  
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(ggmap)       # Enables working with spatial data and maps using ggplot2  
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
  Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
  OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(visdat)      # Visualizes missing values and data structure for exploratory data analysis  
library(corrr)       # Focused on correlation analysis and visualization  

Attaching package: ‘corrr’

The following object is masked from ‘package:skimr’:

    focus
library(ggsignif)    # Adds significance indicators (e.g., p-values) to ggplot2 plots  
library(gt)          # Creates visually appealing tables in R  
library(vip)         # Generates variable importance plots for machine learning models  

Attaching package: ‘vip’

The following object is masked from ‘package:utils’:

    vi
library(themis)      # Helps with dealing with class imbalance in machine learning datasets  
Loading required package: recipes

Attaching package: ‘recipes’

The following object is masked from ‘package:stringr’:

    fixed

The following object is masked from ‘package:stats’:

    step
library(purrr)       # Provides functional programming tools for iteration and mapping  
library(tidyr)       # Helps reshape and tidy data for easier analysis  
library(tidymodels)  # Framework for modeling and machine learning using tidy principles  
── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.7     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.3.0     ✔ yardstick    1.3.2
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(keras)       # Interface to TensorFlow for deep learning in R  

Attaching package: ‘keras’

The following object is masked from ‘package:yardstick’:

    get_weights
library(ranger)      # Fast implementation of random forests for classification and regression  
ranger 0.17.0 using 2 threads (default). Change with num.threads in ranger() and predict(), options(Ncpus = N), options(ranger.num.threads = N) or environment variable R_RANGER_NUM_THREADS.
library(xgboost)     # Optimized gradient boosting framework for predictive modeling  

Attaching package: ‘xgboost’

The following object is masked from ‘package:dplyr’:

    slice
library(kknn)        # Implements k-Nearest Neighbors (kNN) classification and regression 
library(discrim)     # Provides discriminant analysis model specifications (e.g., discrim_linear, discrim_quad) for use with the MASS engine

Attaching package: ‘discrim’

The following object is masked from ‘package:dials’:

    smoothness

Import Data

First of all, let’s import the data:

LINK <- "https://raw.githubusercontent.com/kirenz/datasets/master/housing_unclean.csv"
housing_df <- read_csv(LINK)
Rows: 20640 Columns: 10
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): housing_median_age, median_house_value, ocean_proximity
dbl (7): longitude, latitude, total_rooms, total_bedrooms, population, households, median_income
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean data

To get a first impression of the data we take a look at the top 4 rows:

housing_df |>
  slice_head(n = 4) |>
  gt() # print output using gt
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
-122.23 37.88 41.0years 880 129 322 126 8.3252 452600.0$ NEAR BAY
-122.22 37.86 21.0 7099 1106 2401 1138 8.3014 358500.0 NEAR BAY
-122.24 37.85 52.0 1467 190 496 177 7.2574 352100.0 NEAR BAY
-122.25 37.85 52.0 1274 235 558 219 5.6431 341300.0 NEAR BAY

Notice the values in the first row of the variables housing_median_age and median_house_value. We need to remove the strings “years” and “$”. Therefore, we use the function str_remove_all from the stringr package. Since there could be multiple wrong entries of the same type, we apply our corrections to all of the rows of the corresponding variable:

housing_df <- 
  housing_df |>
  mutate(
    housing_median_age = str_remove_all(housing_median_age, "[years]"),
    median_house_value = str_remove_all(median_house_value, "[$]")
  )

We don’t cover the phase of data cleaning in detail in this tutorial. However, in a real data science project, data cleaning is usually a very time-consuming process.

Format data

Next, we take a look at the data structure and check whether all data formats are correct:
- Numeric variables should be formatted as integers (int) or double precision floating point numbers (dbl).
- Categorical (nominal and ordinal) variables should usually be formatted as factors (fct) and not characters (chr), especially if they don’t have many levels.

glimpse(housing_df)
Rows: 20,640
Columns: 10
$ longitude          <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, -122.25, -122.25, -122.26, -122.25, -122.26, -122.26, -1…
$ latitude           <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.84, 37.85, 37.85, 37.85, 37.84, 37.85, 37.85…
$ housing_median_age <chr> "41.0", "21.0", "52.0", "52.0", "52.0", "52.0", "52.0", "52.0", "42.0", "52.0", "52.0", "52.0", "52.0", "52.0"…
$ total_rooms        <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 3549, 2202, 3503, 2491, 696, 2643, 1120, 1966, 1228, 2239,…
$ total_bedrooms     <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434, 752, 474, 191, 626, 283, 347, 293, 455, 298, 184, 367,…
$ population         <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551, 910, 1504, 1098, 345, 1212, 697, 793, 648, 990, 690, 40…
$ households         <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, 402, 734, 468, 174, 620, 264, 331, 303, 419, 275, 166, 366,…
$ median_income      <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.1200, 2.0804, 3.6912, 3.2031, 3.2705, 3.0750, 2.6736…
$ median_house_value <chr> "452600.0", "358500.0", "352100.0", "341300.0", "342200.0", "269700.0", "299200.0", "241400.0", "226700.0", "2…
$ ocean_proximity    <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "N…

The package visdat helps us to explore the data class structure visually:

vis_dat(housing_df)

We can observe that the numeric variables housing_median_age and median_house_value are declared as characters (chr) instead of numeric. We choose to format the variables as dbl, since the values could be floating-point numbers.

Furthermore, the categorical variable ocean_proximity is formatted as character instead of factor. Let’s take a look at the levels of the variable:

housing_df |>
  count(ocean_proximity,
        sort = TRUE)

The variable has only 5 levels and therefore should be formatted as a factor.

Note that it is usually a good idea to first take care of the numerical variables. Afterwards, we can easily convert all remaining character variables to factors using the function across from the dplyr package (which is part of the tidyverse).

# convert to numeric
housing_df <- 
  housing_df |>
  mutate(
    housing_median_age = as.numeric(housing_median_age),
    median_house_value = as.numeric(median_house_value)
  )

# convert all remaining character variables to factors 
housing_df <- 
  housing_df |>
  mutate(across(where(is.character), as.factor))

Missing data

Now let’s turn our attention to missing data. Missing data can be viewed with the function vis_miss from the package visdat. We arrange the data by columns with most missingness:

vis_miss(housing_df, sort_miss = TRUE)

Here is an alternative way to count the missing values per column:

is.na(housing_df) |> colSums()
         longitude           latitude housing_median_age        total_rooms     total_bedrooms         population         households 
                 0                  0                  0                  0                207                  0                  0 
     median_income median_house_value    ocean_proximity 
                 0                  0                  0 

We have a missing rate of about 1% (207 cases) in our variable total_bedrooms. This can cause problems for some algorithms. We will take care of this issue during our data preparation phase.
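If you want to compute the missing rate directly, a quick check is:

# Share of missing values in total_bedrooms (207 of 20640 rows, about 1%)
mean(is.na(housing_df$total_bedrooms))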

Create new variables

One very important thing you may want to do at the beginning of your data science project is to create new variable combinations. For example:
- the total number of rooms in a district is not very useful if you don’t know how many households there are. What you really want is the number of rooms per household.
- Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms.
- And the population per household also seems like an interesting attribute combination to look at.

Let’s create these new attributes:

housing_df <- 
  housing_df |>
  mutate(rooms_per_household = total_rooms/households,
        bedrooms_per_room = total_bedrooms/total_rooms,
        population_per_household = population/households)

Furthermore, in our example we need to create our dependent variable and drop the original numeric variable.

housing_df <- 
  housing_df |>
  mutate(price_category = case_when( 
    median_house_value < 150000 ~ "below",
    median_house_value >= 150000 ~ "above"
    )) |>
  mutate(price_category = as.factor(price_category)) |>
  dplyr::select(-median_house_value) # avoid conflicts with other packages like MASS

Since we created the new label price_category from the variable median_house_value it is crucial that we never use the variable median_house_value as a predictor in our models. Therefore we drop it.

Take a look at our dependent variable and create a table with the package gt:

housing_df |>
  count(price_category, # count observations
        name ="districts_total") |> # name the new variable 
  mutate(percent = districts_total/sum(districts_total)) |> # calculate percentages
  gt() # create table
price_category districts_total percent
above 13084 0.6339147
below 7556 0.3660853

Let’s make a nice looking table:

housing_df |>
  count(price_category, 
        name ="districts_total") |>
  mutate(percent = districts_total/sum(districts_total)*100,
         percent = round(percent, 2)) |>
  gt() |>
  tab_header(
    title = "California median house prices",
    subtitle = "Districts above and below 150.000$"
  ) |>
  cols_label(
    price_category = "Price",
    districts_total = "Districts",
    percent = "Percent"
  ) |>
  fmt_number(
    columns = c(districts_total),
    suffixing = TRUE
  ) 
California median house prices
Districts above and below 150.000$
Price Districts Percent
above 13.08K 63.39
below 7.56K 36.61

Data overview

After we took care of our data issues, we can obtain a data summary of all numerical and categorical attributes using a function from the package skimr:

skim(housing_df)
── Data Summary ────────────────────────
                           Values    
Name                       housing_df
Number of rows             20640     
Number of columns          13        
_______________________              
Column type frequency:               
  factor                   2         
  numeric                  11        
________________________             
Group variables            None      

We have 20640 observations and 13 columns in our data.

  • The sd column shows the standard deviation, which measures how dispersed the values are.

  • The p0, p25, p50, p75 and p100 columns show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations fall. For example, 25% of the districts have a housing_median_age lower than 18, while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or first quartile), the median, and the 75th percentile (a quick check of these values follows after this list).

  • Further note that the median income attribute does not look like it is expressed in US dollars (USD). Actually the data has been scaled and capped at 15 (actually, 15.0001) for higher median incomes, and at 0.5 (actually, 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about $30,000).
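As a quick check of the percentile columns reported by skim() (a small verification, not part of the original workflow), we can compute the same quantiles for one variable directly:

# Quantiles of housing_median_age: min, 25th, 50th (median), 75th percentile and max
quantile(housing_df$housing_median_age, probs = c(0, 0.25, 0.5, 0.75, 1))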

Another quick way to get an overview of the type of data you are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can either plot one attribute at a time, or you can use ggscatmat from the package GGally on the whole dataset (as shown in the following code example), and it will plot a histogram for each numerical attribute as well as correlation coefficients (Pearson is the default). We just select the most promising variables for our plot:

housing_df |>
  dplyr::select(
    housing_median_age, 
    median_income, bedrooms_per_room, rooms_per_household, 
    population_per_household) |>
  ggscatmat(alpha = 0.2)

Another option is to use ggpairs, where we even can integrate categorical variables like our dependent variable price_category and ocean proximity in the output:

housing_df |>
  dplyr::select(
    housing_median_age, 
    median_income, bedrooms_per_room, rooms_per_household, 
    population_per_household, ocean_proximity,
    price_category) |>
  ggpairs()

There are a few things you might notice in these histograms:

  • The variables median_income and housing_median_age were capped.

  • Note that our attributes have very different scales. We will take care of this issue later in data preparation, when we use feature scaling (data normalization).

  • Finally, many histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will transform these attributes later on to have more bell-shaped distributions. For our right-skewed data (i.e., tail is on the right, also called positive skew), common transformations include square root and log (we will use the log).
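As a quick preview of that transformation (a small illustration, not part of the original workflow), we can compare the raw and log-transformed distribution of one right-skewed variable:

# Raw (right-skewed) distribution
housing_df |>
  ggplot(aes(x = population_per_household)) +
  geom_histogram(bins = 50)

# Log-transformed distribution is closer to bell-shaped
housing_df |>
  ggplot(aes(x = log(population_per_household))) +
  geom_histogram(bins = 50)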

Data splitting

Before we get started with our in-depth data exploration, let’s split our single dataset into two: a training set and a testing set. The training data will be used to fit models, and the testing set will be used to measure model performance. We perform data exploration only on the training data.

A training dataset is a set of examples used during the learning process to fit the models. A test dataset is independent of the training dataset and is used to evaluate the performance of the final model. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. A model that fits the training data substantially better than the test data usually points to overfitting.

In our data split, we want to ensure that the training and test sets are representative of the categories of our dependent variable.

housing_df |>
  ggplot(aes(price_category)) +
  geom_bar() 

In general, we would like to have instances for each stratum, or else the estimate of a stratum’s importance may be biased. A stratum (plural: strata) refers to a subset (part) of the population from which the sample is drawn. We only have two categories in our data.

To actually split the data, we can use the rsample package (included in tidymodels) to create an object that contains the information on how to split the data (which we call data_split), and then two more rsample functions to create data frames for the training and testing sets:

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible 
set.seed(123)

# Put 3/4 of the data into the training set 
data_split <- initial_split(housing_df, 
                           prop = 3/4, 
                           strata = price_category)

# Create dataframes for the two sets:
train_data <- training(data_split) 
test_data <- testing(data_split)
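To verify that the stratified split preserved the class proportions (a quick sanity check), we can compare the shares of price_category in both sets:

# Class proportions in the training set
train_data |>
  count(price_category) |>
  mutate(prop = n / sum(n))

# Class proportions in the test set
test_data |>
  count(price_category) |>
  mutate(prop = n / sum(n))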

Data exploration

The point of data exploration is to gain insights that will help you select important variables for your model and to get ideas for feature engineering in the data preparation phase. Usually, data exploration is an iterative process: once you get a prototype model up and running, you can analyze its output to gain more insights and come back to this exploration step. It is important to note that we perform data exploration only with our training data.

Create data copy

We first make a copy of the training data since we don’t want to alter our data during data exploration.

data_explore <- train_data

Next, we take a closer look at the relationships between our variables. In particular, we are interested in the relationships between our dependent variable price_category and all other variables. The goal is to identify possible predictor variables which we could use in our models to predict the price_category.

Geographical overview

Since our data includes information about longitude and latitude, we start our data exploration with the creation of a geographical scatterplot of the data to get some first insights:

data_explore |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(color = "cornflowerblue")

A better visualization that highlights high-density areas (with the parameter alpha = 0.1):

data_explore |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(color = "cornflowerblue", alpha = 0.1) 

Overview about California housing prices:
- red is expensive,
- teal is cheap and
- larger circles indicate areas with a larger population.

data_explore |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(size = population, color = price_category), 
             alpha = 0.4)

Numerical variables

We can use boxplots to check, if we actually find differences in our numeric variables for the different levels of our dependent categorical variable:

data_explore |>
  ggplot(aes(x = price_category, y = median_income, 
             fill = price_category, color = price_category)) +
  geom_boxplot(alpha=0.4) 

Let’s define a function for this task that accepts strings as inputs so we don’t have to copy and paste our code for every plot. Note that we only have to change the “y-variable” in every plot.

print_boxplot <- function(.y_var){
  
  # convert strings to variable
  y_var <- sym(.y_var) 
 
  # unquote variables using {{}}
  data_explore |>
  ggplot(aes(x = price_category, y = {{y_var}},
             fill = price_category, color = price_category)) +
  geom_boxplot(alpha=0.4) 
  
}  

Obtain all of the names of the y-variables we want to use for our plots:

y_var <- 
  data_explore |>
  dplyr::select(where(is.numeric), -longitude, - latitude) |>
  variable.names() # obtain name

The map function applies the function print_boxplot to each element of our atomic vector y_var and returns the corresponding plot:

map(y_var, print_boxplot)
(The call produces nine boxplots, one for each variable in y_var.)

Comparing the two levels of price_category, we can observe the following:

  • The differences between our two groups are quite small for housing_median_age, total_rooms, total_bedrooms, population and households.

  • There is a noticeable difference for our variables median_income and bedrooms_per_room.

  • population_per_household and rooms_per_household include some extreme values. We first need to fix this before we can proceed with our interpretations for these variables.

Again, let’s write a short function for this task and filter some of the extreme cases. We call the new function print_boxplot_out:

print_boxplot_out <- function(.y_var_out){
  
  y_var <- sym(.y_var_out) 
 
  data_explore |>
  filter(rooms_per_household < 50, population_per_household < 20) |>
  ggplot(aes(x = price_category, y = {{y_var}},
             fill = price_category, color = price_category)) +
  geom_boxplot(alpha=0.4) 
  
} 

y_var_out <- 
  data_explore |>
  dplyr::select(rooms_per_household, population_per_household) |>
  variable.names() 

map(y_var_out, print_boxplot_out)
(The call produces two boxplots, one for rooms_per_household and one for population_per_household.)

Now we are able to recognize a small difference for population_per_household. rooms_per_household on the other hand is quite similar for both groups.

Additionally, we can use the function ggscatmat to create plots with our dependent variable as color column:

data_explore |>
  dplyr::select(price_category, median_income, bedrooms_per_room, rooms_per_household, 
         population_per_household) |>
  ggscatmat(color="price_category", 
            corMethod = "spearman",
            alpha=0.2)
Warning: Factor variables are omitted in plot

There are a few things you might notice in these histograms:

  • Note that our attributes have very different scales. We will take care of this issue later in data preparation, when we use feature scaling (data normalization).

  • The histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will transform these attributes later on to have more bell-shaped distributions. For our right-skewed data (i.e., tail is on the right, also called positive skew), common transformations include square root and log (we will use the log).

As a result of our data exploration, we will include the numerical variables

  • median_income,
  • bedrooms_per_room and
  • population_per_household

as predictors in our model.

Categorical variables

Now let’s analyze the relationship between our categorical variables ocean proximity and price_category. We start with a simple count.

data_explore |>
  count(price_category, ocean_proximity) |>
  group_by(price_category) |>
  mutate(percent = n / sum(n) *100,
         percent = round(percent, 2)) |>
  gt() |>
    tab_header(
    title = "California median house prices",
    subtitle = "Districts above and below 150.000$"
  ) |>
  cols_label(
    ocean_proximity = "Ocean Proximity",
    n = "Districts",
    percent = "Percent"
  ) |>
  fmt_number(
    columns = vars(n),
    suffixing = TRUE
  ) 
Warning: Since gt v0.3.0, `columns = vars(...)` has been deprecated.
• Please use `columns = c(...)` instead.
California median house prices
Districts above and below 150.000$
Ocean Proximity Districts Percent
above
<1H OCEAN 5.69K 58.00
INLAND 1.24K 12.62
ISLAND 3.00 0.03
NEAR BAY 1.37K 13.99
NEAR OCEAN 1.51K 15.36
below
<1H OCEAN 1.13K 19.98
INLAND 3.71K 65.40
NEAR BAY 345.00 6.09
NEAR OCEAN 484.00 8.54

The function geom_bin2d() creates a heatmap by counting the number of cases in each group, and then mapping the number of cases to each subgroup’s fill.

data_explore %>%
  ggplot(aes(price_category, ocean_proximity)) +
  geom_bin2d() +
  scale_fill_continuous(type = "viridis") 

We can observe that most districts with a median house price above 150,000 are within one hour of the ocean (<1H OCEAN). On the other hand, districts below that threshold are typically inland. Hence, ocean_proximity is indeed a good predictor for our two different median house value categories.

Data preparation

  • Handle missing values
  • Fix or remove outliers
  • Feature selection
  • Feature engineering
  • Feature scaling
  • Create a validation set

Next, we’ll preprocess our data before training the models. We mainly use the tidymodels packages recipes and workflows for these steps. Recipes are built as a series of optional data preparation steps, such as:

  • Data cleaning: Fix or remove outliers, fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).

  • Feature selection: Drop the attributes that provide no useful information for the task.

  • Feature engineering: Discretize continuous features, decompose features (e.g., the weekday from a date variable, etc.), add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.) or aggregate features into promising new features (like we already did).

  • Feature scaling: Standardize or normalize features.

We will want to use our recipe across several steps as we train and test our models. To simplify this process, we can use a model workflow, which pairs a model and recipe together.

Select variables

Before we create our recipes, we first select the variables which we will use in the model. Note that we keep longitude and latitude to be able to map the data in a later stage but we will not use the variables in our model.

housing_df_new <-
  housing_df |>
  dplyr::select( # select our predictors
    longitude, latitude, 
    price_category, 
    median_income, 
    ocean_proximity, 
    bedrooms_per_room, 
    rooms_per_household, 
    population_per_household
         )

glimpse(housing_df_new)
Rows: 20,640
Columns: 8
$ longitude                <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, -122.25, -122.25, -122.26, -122.25, -122.26, -122.…
$ latitude                 <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.84, 37.85, 37.85, 37.85, 37.84, 37.85,…
$ price_category           <fct> above, above, above, above, above, above, above, above, above, above, above, above, above, above, above,…
$ median_income            <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.1200, 2.0804, 3.6912, 3.2031, 3.2705, 3.0750, …
$ ocean_proximity          <fct> NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR…
$ bedrooms_per_room        <dbl> 0.1465909, 0.1557966, 0.1295160, 0.1844584, 0.1720959, 0.2317737, 0.1928994, 0.2213273, 0.2602740, 0.199…
$ rooms_per_household      <dbl> 6.984127, 6.238137, 8.288136, 5.817352, 6.281853, 4.761658, 4.931907, 4.797527, 4.294118, 4.970588, 5.47…
$ population_per_household <dbl> 2.555556, 2.109842, 2.802260, 2.547945, 2.181467, 2.139896, 2.128405, 1.788253, 2.026891, 2.172269, 2.26…

Furthermore, we need to make a new data split since we updated the original data.

set.seed(123)

data_split <- initial_split(housing_df_new, # updated data
                           prop = 3/4, 
                           strata = price_category)

train_data <- training(data_split) 
test_data <- testing(data_split)

Data preprocessing recipe

The type of data preprocessing is dependent on the data and the type of model being fit. The excellent book “Tidy Modeling with R” provides an appendix with recommendations for baseline levels of preprocessing that are needed for various model functions.

Let’s create a base recipe for all of our classification models. Note that the sequence of steps matters:

  • The recipe() function has two arguments:

    1. A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (here, price_category). On the right-hand side of the tilde are the predictors. Variables may be listed by name (separated by a +), or you can use the dot (.) to indicate all other variables as predictors.

    2. The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = train_data here.

  • update_role(): This step of adding roles to a recipe is optional; the purpose of using it here is that the two ID variables (longitude and latitude) can be retained in the data but not included in the model. This can be convenient when, after the model is fit, we want to investigate some poorly predicted value. These ID columns will be available and can be used to try to understand what went wrong.

  • step_naomit() removes observations (rows of data) if they contain NA or NaN values. We use skip = TRUE because we don’t want to apply this step to new data, so that the number of samples in the assessment set is the same as the number of predicted values (even if they are NA).

Note that instead of deleting missing values we could also impute them, e.g. with step_impute_mean(), step_impute_median() or step_impute_knn(), using statistics estimated from the training set.

Take a look at the recipes reference for an overview of all possible imputation methods.
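For illustration, a recipe that imputes instead of dropping rows could start like this (a minimal sketch; housing_rec_impute is just a hypothetical name and we don’t use it below):

# Hypothetical alternative: impute missing numeric values with the training-set median
housing_rec_impute <-
  recipe(price_category ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors())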

  • step_novel() adds a catch-all level to our factors so that categories encountered in new data that were not present during training are assigned to it instead of causing errors.

  • step_log() will log transform data (since some of our numerical variables are right-skewed). Note that this step can not be performed on negative numbers.

  • step_normalize() normalizes (centers and scales) the numeric variables to have a standard deviation of one and a mean of zero (i.e., z-standardization).

  • step_dummy() converts our factor column ocean_proximity into numeric binary (0 and 1) variables.

Note that this step may cause problems if your categorical variable has too many levels, especially if some of the levels are very infrequent. In this case you should either drop the variable or pool infrequently occurring values into an “other” category with step_other (https://recipes.tidymodels.org/reference/step_other.html). This step has to be performed before step_dummy, as sketched below.
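A minimal sketch of that ordering (hypothetical; our ocean_proximity has only five levels, so we skip this step in the actual recipe):

# Pool levels that occur in less than 5% of the rows into "other", then create dummies
recipe(price_category ~ ., data = train_data) |>
  step_other(ocean_proximity, threshold = 0.05) |>
  step_dummy(all_nominal_predictors())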

  • step_zv(): removes any numeric variables that have zero variance.

  • step_corr(): will remove predictor variables that have large correlations with other predictor variables.

Note that the package themis contains extra steps for the recipes package for dealing with imbalanced data. A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes (see Google Developers for more details). Themis provides various methods for over-sampling (e.g. SMOTE) and under-sampling. Our classes are only mildly imbalanced (about 63% vs. 37%), but we include step_smote() in the recipe below to balance them.

housing_rec <-
  recipe(price_category ~ .,
         data = train_data) %>%
  update_role(longitude, latitude, new_role = "ID") %>%
  step_log(median_income,
           bedrooms_per_room, rooms_per_household, 
           population_per_household) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  # Make sure all nominal predictors (including ocean_proximity) are dummied
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes(), -longitude, -latitude) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman") %>%
  step_smote(price_category) 

To view the current set of variables and roles, use the summary() function:

summary(housing_rec)

If we would like to check if all of our preprocessing steps from above actually worked, we can proceed as follows:

prepped_data <- 
  housing_rec |> # use the recipe object
  prep() |> # perform the recipe on the training data
  juice() # extract only the preprocessed data frame 
Warning: !  The following column has zero variance so scaling cannot be used: ocean_proximity_new.
ℹ Consider using step_zv() to remove those columns before normalizing.

Take a look at the data structure:

glimpse(prepped_data)
Rows: 19,414
Columns: 10
$ longitude                  <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, -122.26, -122.25, -122.26, -122.26, -122.26, -12…
$ latitude                   <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.85, 37.85, 37.85, 37.84, 37.85, 37.85, 37.8…
$ median_income              <dbl> 1.842052e+00, 1.836006e+00, 1.552164e+00, 2.112520e-01, 3.133968e-01, 1.059356e-01, -1.086555e+00, 1.2…
$ rooms_per_household        <dbl> 1.06765323, 0.65680009, 1.69028872, 0.68220035, -0.32555744, -0.19778338, -0.70146222, -0.16936755, 0.…
$ population_per_household   <dbl> -0.3913634897, -1.0996745281, -0.0507778478, -0.9762937735, -1.0474005776, -1.0673009190, -1.247910248…
$ price_category             <fct> above, above, above, above, above, above, above, above, above, above, above, above, above, above, abov…
$ ocean_proximity_INLAND     <dbl> -0.6861777, -0.6861777, -0.6861777, -0.6861777, -0.6861777, -0.6861777, -0.6861777, -0.6861777, -0.686…
$ ocean_proximity_ISLAND     <dbl> -0.01399229, -0.01399229, -0.01399229, -0.01399229, -0.01399229, -0.01399229, -0.01399229, -0.01399229…
$ ocean_proximity_NEAR.BAY   <dbl> 2.829997, 2.829997, 2.829997, 2.829997, 2.829997, 2.829997, 2.829997, 2.829997, 2.829997, 2.829997, 2.…
$ ocean_proximity_NEAR.OCEAN <dbl> -0.3834987, -0.3834987, -0.3834987, -0.3834987, -0.3834987, -0.3834987, -0.3834987, -0.3834987, -0.383…

Visualize the numerical data:

prepped_data |>
  dplyr::select(price_category, 
         median_income, 
         rooms_per_household, 
         population_per_household) |>
  ggscatmat(corMethod = "spearman",
            alpha=0.2)
Warning: Factor variables are omitted in plot

You should notice that:

  • the variables longitude and latitude did not change.

  • median_income, rooms_per_household and population_per_household are now z-standardized and the distributions are a bit less right-skewed (due to our log transformation); a quick check follows after this list.

  • ocean_proximity was replaced by dummy variables.
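To verify the standardization, a quick check on the preprocessed data (assuming prepped_data from above) is:

# Means should be close to 0 and standard deviations close to 1
prepped_data |>
  summarise(across(c(median_income, rooms_per_household, population_per_household),
                   list(mean = mean, sd = sd)))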

Validation set

Remember that we already partitioned our data set into a training set and test set. This lets us judge whether a given model will generalize well to new data. However, using only two partitions may be insufficient when doing many rounds of hyperparameter tuning (which we don’t perform in this tutorial but it is always recommended to use a validation set).

Therefore, it is usually a good idea to create a so called validation set. Watch this short video from Google’s Machine Learning crash course to learn more about the value of a validation set.

We use k-fold cross validation to build a set of 5 validation folds with the function vfold_cv. We also use stratified sampling:

set.seed(100)

cv_folds <-
 vfold_cv(train_data, 
          v = 5, 
          strata = price_category) 

We will come back to the validation set after we specified our models.

Model building

Specify models

The process of specifying our models is always as follows:

  1. Pick a model type
  2. Set the engine
  3. Set the mode: regression or classification

You can choose the model type and engine from this list.

Logistic regression

log_spec <- # your model specification
  logistic_reg() |> # model type
  set_engine(engine = "glm") |> # model engine
  set_mode("classification") # model mode

# Show your model specification
log_spec
Logistic Regression Model Specification (classification)

Computational engine: glm 

Linear discriminant

lda_spec <-
  discrim_linear() |>
  set_engine(engine='MASS') |>
  set_mode("classification")

# Show your model specification
lda_spec
Linear Discriminant Model Specification (classification)

Computational engine: MASS 

Quadratic discriminant

qda_spec <-
  discrim_quad() |>
  set_engine(engine='MASS') |>
  set_mode("classification")

# Show your model specification
qda_spec
Quadratic Discriminant Model Specification (classification)

Computational engine: MASS 

K-nearest neighbor

knn_spec <- 
  nearest_neighbor(neighbors = 4) |># we can adjust the number of neighbors 
  set_engine("kknn") |>
  set_mode("classification") 

Create workflows

To combine the data preparation recipe with the model building, we use the package workflows. A workflow is an object that can bundle together your pre-processing recipe, modeling, and even post-processing requests (like calculating the RMSE).

Logistic regression

Bundle recipe and model with workflows:

log_wflow <- # new workflow object
 workflow() |># use workflow function
 add_recipe(housing_rec) |>  # use the new recipe
 add_model(log_spec)   # add your model spec

# show object
log_wflow
══ Workflow ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
8 Recipe Steps

• step_log()
• step_naomit()
• step_novel()
• step_dummy()
• step_normalize()
• step_zv()
• step_corr()
• step_smote()

── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

LDA

Bundle recipe and model:

lda_wflow <- # new workflow object
 workflow() |># use workflow function
 add_recipe(housing_rec) |>  # use the new recipe
 add_model(lda_spec)   # add your model spec

QDA

Bundle recipe and model:

qda_wflow <- # new workflow object
 workflow() |># use workflow function
 add_recipe(housing_rec) |>  # use the new recipe
 add_model(qda_spec)   # add your model spec

K-nearest neighbor

Bundle recipe and model:

knn_wflow <-
 workflow() %>%
 add_recipe(housing_rec) |>
 add_model(knn_spec)

Evaluate models

Now we can use our validation set (cv_folds) to estimate the performance of our models using the fit_resamples() function to fit the models on each of the folds and store the results.

Note that fit_resamples() will fit our model to each resample and evaluate on the holdout set from each resample. The function is usually only used for computing performance metrics across some set of resamples to evaluate our models (like accuracy) - the models are not even stored. However, in our example we save the predictions in order to visualize the model fit and residuals with control_resamples(save_pred = TRUE).

Finally, we collect the performance metrics with collect_metrics() and pick the model that does best on the validation set.

Logistic regression

We use our workflow object to perform resampling. Furthermore, we use metric_set() to choose some common classification performance metrics provided by the yardstick package. Visit the yardstick reference to see the complete list of all possible metrics.

Note that Cohen’s kappa coefficient (κ) is a similar measure to accuracy, but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. The higher the value, the better.
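In terms of a formula (a brief aside), kappa compares the observed agreement p_o (the accuracy) with the agreement p_e expected by chance:

kappa = (p_o - p_e) / (1 - p_e)

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.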

# Ensure that all metric functions are from yardstick:
log_res <- 
  log_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
  )
→ A | warning: !  The following column has zero variance so scaling cannot be used: ocean_proximity_new.
               ℹ Consider using step_zv() to remove those columns before normalizing.

There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x5
Model coefficients

The method described above to obtain log_res is fine if we are not interested in model coefficients. However, if we would like to extract the model coefficients from fit_resamples, we need to proceed as follows:

# save model coefficients for a fitted model object from a workflow

get_model <- function(x) {
  extract_fit_parsnip(x) |> tidy()
}

# same as before with one exception
log_res_2 <- 
  log_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(
      save_pred = TRUE,
      extract = get_model) # use extract and our new function
    ) 
→ A | warning: !  The following column has zero variance so scaling cannot be used: ocean_proximity_new.
               ℹ Consider using step_zv() to remove those columns before normalizing.

There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x5

Now there is a .extracts column with nested tibbles.

log_res_2$.extracts[[1]]

To get the results use:

log_res_2$.extracts[[1]][[1]]
[[1]]
NA

All of the results can be flattened and collected using:

all_coef <- map_dfr(log_res_2$.extracts, ~ .x[[1]][[1]])

Show all of the resample coefficients for a single predictor:

filter(all_coef, term == "median_income")
Performance metrics

Show average performance over all folds (note that we use log_res):

log_res |> collect_metrics(summarize = TRUE)

Show performance for every single fold:

log_res |> collect_metrics(summarize = FALSE)
Collect predictions

To obtain the actual model predictions, we use the function collect_predictions and save the result as log_pred:

log_pred <- 
  log_res %>%
  collect_predictions()
Confusion matrix

Now we can use the predictions to create a confusion matrix with conf_mat():

log_pred |>
  conf_mat(price_category, .pred_class) 
          Truth
Prediction above below
     above  8245   902
     below  1568  4765

Additionally, the confusion matrix can quickly be visualized in different formats using autoplot(). Type mosaic:

log_pred |>
  conf_mat(price_category, .pred_class) |>
  autoplot(type = "mosaic")

Or type heatmap:

log_pred |>
  conf_mat(price_category, .pred_class) |>
  autoplot(type = "heatmap")

ROC-Curve

We can also make an ROC curve for our 5 folds. Since the category we are predicting is the first level in the price_category factor (“above”), we provide roc_curve() with the relevant class probability .pred_above:

log_pred |>
  group_by(id) |># id contains our folds
  roc_curve(price_category, .pred_above) |>
  autoplot()

Visit Google Developers’ Machine Learning Crash Course to learn more about the ROC curve.

Probability distributions

Plot predicted probability distributions for our two classes.

log_pred |>
  ggplot() +
  geom_density(aes(x = .pred_above, 
                   fill = price_category), 
               alpha = 0.5)

LDA

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

lda_res <-
  lda_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
    ) 
→ A | warning: !  The following column has zero variance so scaling cannot be used: ocean_proximity_new.
               ℹ Consider using step_zv() to remove those columns before normalizing.

There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x5
lda_res |> collect_metrics(summarize = TRUE)

QDA

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

qda_res <- 
  qda_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
    ) 

qda_res |> collect_metrics(summarize = TRUE)

#→ A | error:   rank deficiency in group below
#There were issues with some computations   A: x5
#Warning: All models failed. Run `show_notes(.Last.tune.result)` for more information.Error in `estimate_tune_results()`:
#! All models failed. Run `show_notes(.Last.tune.result)` for more information.
#Run `rlang::last_trace()` to see where the error occurred.

The error message suggests that there is a rank deficiency in your qda_res model, meaning that some of the predictor variables are likely collinear (highly correlated) or have near-zero variance, which is problematic for Quadratic Discriminant Analysis (QDA). Here’s how you can troubleshoot and resolve the issue:

1. Check for Near-Zero Variance Predictors
QDA does not handle variables with near-zero variance well. You can check for such predictors using:

library(caret)
Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:kknn’:

    contr.dummy

The following objects are masked from ‘package:yardstick’:

    precision, recall, sensitivity, specificity

The following object is masked from ‘package:purrr’:

    lift
nearZeroVar(train_data, saveMetrics = TRUE)

If any variables have near-zero variance, consider removing them. This is not our problem.

2. Check for Collinearity
Highly correlated predictors can cause rank deficiency. Compute the correlation matrix:

cor(train_data %>% select_if(is.numeric), use = "pairwise.complete.obs")
                             longitude     latitude median_income bedrooms_per_room rooms_per_household population_per_household
longitude                 1.000000e+00 -0.924523028   -0.01507299       0.089800599        -0.030477506             6.477147e-05
latitude                 -9.245230e-01  1.000000000   -0.08151432      -0.108561739         0.109990761             7.726872e-03
median_income            -1.507299e-02 -0.081514323    1.00000000      -0.615391937         0.347294622             2.331952e-02
bedrooms_per_room         8.980060e-02 -0.108561739   -0.61539194       1.000000000        -0.430096710             6.924601e-03
rooms_per_household      -3.047751e-02  0.109990761    0.34729462      -0.430096710         1.000000000            -9.125535e-03
population_per_household  6.477147e-05  0.007726872    0.02331952       0.006924601        -0.009125535             1.000000e+00

You can remove highly correlated features using something like the following (not evaluated here; cor_matrix and train_data_reduced are just illustrative names):

cor_matrix <- cor(train_data %>% select_if(is.numeric), use = "pairwise.complete.obs")

train_data_reduced <- train_data %>%
  dplyr::select(-all_of(findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)))

And try rerunning the model after dropping collinear variables. Our largest correlation is between latitude and longitude which makes sense. This is also not our problem.

3. Ensure Balanced Class Distribution
QDA can struggle if classes are highly imbalanced. Check the class distribution:

table(train_data$price_category)

above below 
 9813  5667 

If imbalance exists, try using SMOTE, downsampling, or upsampling, which we did above with:

step_smote(price_category)

4. Check for Missing Values
Missing values can also cause errors. Check and handle missing data:

sum(is.na(train_data))

There are only a few missing values in the training data (in bedrooms_per_room), and our recipe removes them with step_naomit(), so missing data is not the problem either.

5. Try an Alternative Model
Sometimes models can’t fit a data set despite your best efforts. If the problem persists, try another model.

K-nearest neighbor

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

knn_res <- 
  knn_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
    ) 
→ A | warning: !  The following column has zero variance so scaling cannot be used: ocean_proximity_new.
               ℹ Consider using step_zv() to remove those columns before normalizing.

There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x5
knn_res |> collect_metrics(summarize = TRUE)

Compare models

Extract metrics from our models to compare them:

log_metrics <- 
  log_res |>
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Logistic Regression") # add the name of the model to every row

lda_metrics <- 
  lda_res |>
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Linear Discriminant")

# QDA did not run
# qda_metrics <- 
# qda_res |>
#   collect_metrics(summarise = TRUE) %>%
#   mutate(model = "Quadratic Discriminant")

# knn_metrics <- 
#   knn_res |>
#   collect_metrics(summarise = TRUE) %>%
#   mutate(model = "KNN")

# create dataframe with all models
model_compare <- bind_rows(
                          log_metrics,
                          lda_metrics
                           ) 

# change data structure
model_comp <- 
  model_compare |>
  dplyr::select(model, .metric, mean, std_err) |>
  pivot_wider(names_from = .metric, values_from = c(mean, std_err)) 

# show mean F1-Score for every model
model_comp |>
  arrange(mean_f_meas) |>
  mutate(model = fct_reorder(model, mean_f_meas)) |># order results
  ggplot(aes(model, mean_f_meas, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") +
   geom_text(
     size = 3,
     aes(label = round(mean_f_meas, 2), y = mean_f_meas + 0.08),
     vjust = 1
  )

# show mean area under the curve (auc) per model
model_comp |>
  arrange(mean_roc_auc) |>
  mutate(model = fct_reorder(model, mean_roc_auc)) %>%
  ggplot(aes(model, mean_roc_auc, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") + 
     geom_text(
     size = 3,
     aes(label = round(mean_roc_auc, 2), y = mean_roc_auc + 0.08),
     vjust = 1
  )

Note that the model results are all quite similar. In our example we choose the F1-Score as performance measure to select the best model. Let’s find the maximum mean F1-Score:

model_comp |> slice_max(mean_f_meas)

Last evaluation on test set

Tidymodels provides the function last_fit() which fits a model to the whole training data and evaluates it on the test set. We just need to provide the workflow object of the best model as well as the data split object (not the training data).

last_fit_logit <- last_fit(log_wflow, 
                        split = data_split,
                        metrics = metric_set(
                        yardstick::recall, 
                        yardstick::precision, 
                        yardstick::f_meas, 
                        yardstick::accuracy, 
                        yardstick::kap, 
                        yardstick::roc_auc, 
                        yardstick::sens, 
                        yardstick::spec
                        ))
→ A | warning: !  The following column has zero variance so scaling cannot be used: ocean_proximity_new.
               ℹ Consider using step_zv() to remove those columns before normalizing.

There were issues with some computations   A: x1

There were issues with some computations   A: x1

Show performance metrics

last_fit_logit |>
  collect_metrics()

And these are our final performance metrics. Remember that if a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. This also seems to be the case in our example.
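To put the validation (resampling) estimates and the final test-set metrics side by side, one option (a quick sketch using the objects created above) is:

# Compare cross-validation estimates with the test-set results
log_res |>
  collect_metrics() |>
  dplyr::select(.metric, validation = mean) |>
  left_join(
    last_fit_logit |>
      collect_metrics() |>
      dplyr::select(.metric, test = .estimate),
    by = ".metric"
  )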

To learn more about the model we can access the variable importance scores via the .workflow column. We first need to pluck out the first element in the workflow column, then pull out the fit from the workflow object. Finally, the vip package helps us visualize the variable importance scores for the top features. Note that we can’t create this type of plot for every model engine.

last_fit_logit |>
  pluck(".workflow", 1) |>  
  extract_fit_parsnip() |>
  vip(num_features = 10)

The two most important predictors of whether a district has a median house value above or below $150,000 were the ocean proximity category INLAND and the median income.

Take a look at the confusion matrix:

last_fit_logit %>%
  collect_predictions() |>
  conf_mat(price_category, .pred_class) |>
  autoplot(type = "heatmap")

Let’s create the ROC curve. Again, since the event we are predicting is the first level in the price_category factor (“above”), we provide roc_curve() with the relevant class probability .pred_above:

last_fit_logit |>
  collect_predictions() |>
  roc_curve(price_category, .pred_above) |>
  autoplot()

Based on all of the results, the validation set and test set performance statistics are very close, so we would have pretty high confidence that our logit model would perform well when predicting new data.

---
title: "Classification in R"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

## Data understanding
In Data Understanding, you:  

 - Import data
 - Clean data
 - Format data properly
 - Create new variables
 - Get an overview about the complete data
 - Split data into training and test set using stratified sampling
 - Discover and visualize the training data to gain insights

### Setup
If you like to install all packages at once, use the code below.

```{r}
#install.packages(c("tidyverse", "skimr", "GGally", "ggmap", "visdat", "corrr", "ggsignif", "gt", "vip", "themis", "purrr", "tidyr", "tidymodels", "keras", "ranger", "xgboost", "kknn")) 

library(tidyverse)   # Collection of R packages for data manipulation, visualization, and analysis  
library(skimr)       # Provides enhanced summary statistics for data frames  
library(GGally)      # Extension of ggplot2 for correlation matrices, scatterplot matrices, and more  
library(ggmap)       # Enables working with spatial data and maps using ggplot2  
library(visdat)      # Visualizes missing values and data structure for exploratory data analysis  
library(corrr)       # Focused on correlation analysis and visualization  
library(ggsignif)    # Adds significance indicators (e.g., p-values) to ggplot2 plots  
library(gt)          # Creates visually appealing tables in R  
library(vip)         # Generates variable importance plots for machine learning models  
library(themis)      # Helps with dealing with class imbalance in machine learning datasets  
library(purrr)       # Provides functional programming tools for iteration and mapping  
library(tidyr)       # Helps reshape and tidy data for easier analysis  
library(tidymodels)  # Framework for modeling and machine learning using tidy principles  
library(keras)       # Interface to TensorFlow for deep learning in R  
library(ranger)      # Fast implementation of random forests for classification and regression  
library(xgboost)     # Optimized gradient boosting framework for predictive modeling  
library(kknn)        # Implements k-Nearest Neighbors (kNN) classification and regression 
library(discrim)     # For discrim_linear classification model specifications using the MASS engine.
```
### Import Data
First of all, let’s import the data:


```{r}
LINK <- "https://raw.githubusercontent.com/kirenz/datasets/master/housing_unclean.csv"
housing_df <- read_csv(LINK)
```

### Clean data
To get a first impression of the data we take a look at the top 4 rows:

```{r}
housing_df |>
  slice_head(n = 4) |>
  gt() # print output using gt
```

Notice the values in the first row of the variables `housing_median_age` and `median_house_value`. We need to remove the strings “years” and “$”. Therefore, we use the function `str_remove_all` from the `stringr` package. Since there could be multiple wrong entries of the same type, we apply our corrections to all of the rows of the corresponding variable:

```{r}
housing_df <- 
  housing_df |>
  mutate(
    housing_median_age = str_remove_all(housing_median_age, "[years]"), # character class: removes the letters y, e, a, r, s
    median_house_value = str_remove_all(median_house_value, "[$]")      # removes the dollar sign
  )
```

We don’t cover the phase of data cleaning in detail in this tutorial. However, in a real data science project, data cleaning is usually a very time consuming process.

### Format data
Next, we take a look at the data structure and check whether all data formats are correct:  
 - Numeric variables should be formatted as integers (`int`) or double precision floating point numbers (`dbl`).  
 - Categorical (nominal and ordinal) variables should usually be formatted as factors (`fct`) rather than characters (`chr`), especially if they don’t have many levels.

```{r}
glimpse(housing_df)
```

The package visdat helps us to explore the data class structure visually:
```{r}
vis_dat(housing_df)
```

We can observe that the numeric variables `housing_median_age` and `median_house_value` are declared as characters (`chr`) instead of numeric. We choose to format the variables as `dbl`, since the values could be floating-point numbers.

Furthermore, the categorical variable `ocean_proximity` is formatted as character instead of factor. Let’s take a look at the levels of the variable:

```{r}
housing_df |>
  count(ocean_proximity,
        sort = TRUE)
```
The variable has only 5 levels and therefore should be formatted as a factor.

Note that it is usually a good idea to first take care of the numerical variables. Afterwards, we can easily convert all remaining character variables to factors using the function across from the dplyr package (which is part of the tidyverse).

```{r}
# convert to numeric
housing_df <- 
  housing_df |>
  mutate(
    housing_median_age = as.numeric(housing_median_age),
    median_house_value = as.numeric(median_house_value)
  )

# convert all remaining character variables to factors 
housing_df <- 
  housing_df |>
  mutate(across(where(is.character), as.factor))
```
  
### Missing data  
Now let’s turn our attention to missing data. Missing data can be viewed with the function `vis_miss` from the package `visdat`. We arrange the data by columns with most missingness:

```{r}
vis_miss(housing_df, sort_miss = TRUE)
```

Here is an alternative way to count the missing values per column:

```{r}
is.na(housing_df) |> colSums()
```
We have 207 missing values (about 1% of the districts) in our variable `total_bedrooms`. This can cause problems for some algorithms. We will take care of this issue during our data preparation phase.
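
To get a feel for the affected districts, we could, for example, take a quick look at a few of them (a small sketch):

```{r}
# A quick look at three districts where total_bedrooms is missing
housing_df |>
  filter(is.na(total_bedrooms)) |>
  slice_head(n = 3)
```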

### Create new variables  
One very important thing you may want to do at the beginning of your data science project is to create new variable combinations. For example:  
 - the _total number of rooms_ in a district is not very useful if you don’t know how many households there are. What you really want is the _number of rooms per household_.  
 - Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms.  
 - And the _population per household_ also seems like an interesting attribute combination to look at.  

Let’s create these new attributes:

```{r}
housing_df <- 
  housing_df |>
  mutate(rooms_per_household = total_rooms/households,
        bedrooms_per_room = total_bedrooms/total_rooms,
        population_per_household = population/households)
```

Furthermore, in our example we need to create our dependent variable and drop the original numeric variable.

```{r}
housing_df <- 
  housing_df |>
  mutate(price_category = case_when( 
    median_house_value < 150000 ~ "below",
    median_house_value >= 150000 ~ "above"
    )) |>
  mutate(price_category = as.factor(price_category)) |>
  dplyr::select(-median_house_value) # avoid conflicts with other packages like MASS
```

Since we created the new label `price_category` from the variable `median_house_value` it is crucial that we never use the variable `median_house_value` as a predictor in our models. Therefore we drop it.

Take a look at our dependent variable and create a frequency table with the package `gt`:

```{r}
housing_df |>
  count(price_category, # count observations
        name ="districts_total") |> # name the new variable 
  mutate(percent = districts_total/sum(districts_total)) |> # calculate percentages
  gt() # create table
```

Let’s make a nice looking table:

```{r}
housing_df |>
  count(price_category, 
        name ="districts_total") %>%
  mutate(percent = districts_total/sum(districts_total)*100,
         percent = round(percent, 2)) %>%
 gt() %>%
  tab_header(
    title = "California median house prices",
    subtitle = "Districts above and below 150.000$"
  ) %>%
  cols_label(
    price_category = "Price",
    districts_total = "Districts",
    percent = "Percent"
  ) |>
  fmt_number(
    columns = districts_total,
    suffixing = TRUE
  ) 
```

### Data overview  
After we took care of our data issues, we can obtain a data summary of all numerical and categorical attributes using a function from the package `skimr`:

```{r}
skim(housing_df)
```

We have 20640 observations and 13 columns in our data.

 - The **sd** column shows the standard deviation, which measures how dispersed the values are.

 - The **p0**, **p25**, **p50**, **p75** and **p100** columns show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations fall. For example, 25% of the districts have a `housing_median_age` lower than 18, while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or first quartile), the median, and the 75th percentile.

 - Further note that the **median income** attribute does not look like it is expressed in US dollars (USD). Actually the data has been scaled and capped at 15 (actually, 15.0001) for higher median incomes, and at 0.5 (actually, 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about $30,000).

Another quick way to get an overview of the type of data you are dealing with is to plot a **histogram** for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can either plot one attribute at a time, or you can use ggscatmat from the package GGally on the whole dataset (as shown in the following code example), and it will plot a histogram for each numerical attribute as well as correlation coefficients (Pearson is the default). We just select the most promising variables for our plot:

```{r,cache=TRUE}
housing_df |>
  dplyr::select(
    housing_median_age, 
    median_income, bedrooms_per_room, rooms_per_household, 
    population_per_household) |>
  ggscatmat(alpha = 0.2)
```

Another option is to use ggpairs, where we can even integrate categorical variables like our dependent variable price_category and ocean_proximity in the output:

```{r}
housing_df |>
  dplyr::select(
    housing_median_age, 
    median_income, bedrooms_per_room, rooms_per_household, 
    population_per_household, ocean_proximity,
    price_category) |>
  ggpairs()
```

There are a few things you might notice in these histograms:

 - The variables median income and housing median age were capped.

 - Note that our attributes have very different scales. We will take care of this issue later in data preparation, when we use feature scaling (data normalization).

 - Finally, many histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will transform these attributes later on to have more bell-shaped distributions. For our right-skewed data (i.e., tail is on the right, also called positive skew), common transformations include square root and log (we will use the log).
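
As a quick illustration of the log transformation we will use later, here is a small sketch for one of the skewed variables:

```{r}
# The distribution of log(median_income) is noticeably less right-skewed than the raw values
housing_df |>
  ggplot(aes(x = log(median_income))) +
  geom_histogram(bins = 50)
```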
 
 
### Data splitting
Before we get started with our in-depth data exploration, let’s split our single dataset into two: a training set and a testing set. The training data will be used to fit models, and the testing set will be used to measure model performance. We perform data exploration only on the training data.

A **training dataset** is a dataset of examples used during the learning process and is used to fit the models. A **test dataset** is a dataset that is independent of the training dataset and is used to evaluate the performance of the final model. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. A better fitting of the training dataset as opposed to the test dataset usually points to overfitting.

In our data split, we want to ensure that the training and test set is representative of the categories of our dependent variable.
```{r}
housing_df |>
  ggplot(aes(price_category)) +
  geom_bar() 
```

In general, we would like to have instances for each stratum, or else the estimate of a stratum’s importance may be biased. A stratum (plural: strata) is a subset of the whole data that is sampled from separately. We only have two categories in our data.

To actually split the data, we can use the `rsample` package (included in tidymodels) to create an object that contains the information on how to split the data (which we call data_split), and then two more rsample functions to create data frames for the training and testing sets:

```{r}
# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible 
set.seed(123)

# Put 3/4 of the data into the training set 
data_split <- initial_split(housing_df, 
                           prop = 3/4, 
                           strata = price_category)

# Create dataframes for the two sets:
train_data <- training(data_split) 
test_data <- testing(data_split)
```

### Data exploration
The point of data exploration is to gain insights that will help you select important variables for your model and to get ideas for feature engineering in the data preparation phase. Usually, data exploration is an iterative process: once you get a prototype model up and running, you can analyze its output to gain more insights and come back to this exploration step. It is important to note that we perform data exploration only with our training data.

#### Create data copy
We first make a copy of the training data since we don’t want to alter our data during data exploration.

```{r}
data_explore <- train_data
```

Next, we take a closer look at the relationships between our variables. In particular, we are interested in the relationships between our dependent variable `price_category` and all other variables. The goal is to identify possible predictor variables which we could use in our models to predict `price_category`.

#### Geographical overview
Since our data includes information about longitude and latitude, we start our data exploration with the creation of a geographical scatterplot of the data to get some first insights:

```{r}
data_explore |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(color = "cornflowerblue")
```

A better visualization that highlights high-density areas (with parameter `alpha = 0.1` ):
```{r}
data_explore |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(color = "cornflowerblue", alpha = 0.1) 
```

Overview about California housing prices:  
 - red is expensive,  
 - teal is cheap and  
 - larger circles indicate areas with a larger population.  

```{r}
data_explore |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(size = population, color = price_category), 
             alpha = 0.4)
```
#### Numerical variables
We can use boxplots to check, if we actually find differences in our numeric variables for the different levels of our dependent categorical variable:

```{r}
data_explore |>
  ggplot(aes(x = price_category, y = median_income, 
             fill = price_category, color = price_category)) +
  geom_boxplot(alpha=0.4) 
```

Let’s define a function for this task that accepts strings as inputs so we don’t have to copy and paste our code for every plot. Note that we only have to change the “y-variable” in every plot.

```{r}
print_boxplot <- function(.y_var){
  
  # convert strings to variable
  y_var <- sym(.y_var) 
 
  # unquote variables using {{}}
  data_explore |>
  ggplot(aes(x = price_category, y = {{y_var}},
             fill = price_category, color = price_category)) +
  geom_boxplot(alpha=0.4) 
  
}  
```

Obtain all of the names of the y-variables we want to use for our plots:

```{r}
y_var <- 
  data_explore |>
  dplyr::select(where(is.numeric), -longitude, - latitude) |>
  variable.names() # obtain name
```

The `map` function applies the function `print_boxplot` to each element of our atomic vector `y_var` and returns the corresponding plot:

```{r}
map(y_var, print_boxplot)
```
We can observe a difference in the price_category:

 - The differences between our two groups are quite small for `housing_median_age`, `total_rooms`, `total_bedrooms`, `population` and `households`.

 - We can observe a noticeable difference for our variables `median_income` and `bedrooms_per_room`

 - `population_per_household` and `rooms_per_household` include some extreme values. We first need to fix this before we can proceed with our interpretations for these variables.  

Again, let’s write a short function for this task and filter some of the extreme cases. We call the new function print_boxplot_out:

```{r}
print_boxplot_out <- function(.y_var_out){
  
  y_var <- sym(.y_var_out) 
 
  data_explore |>
  filter(rooms_per_household < 50, population_per_household < 20) |>
  ggplot(aes(x = price_category, y = {{y_var}},
             fill = price_category, color = price_category)) +
  geom_boxplot(alpha=0.4) 
  
} 

y_var_out <- 
  data_explore |>
  dplyr::select(rooms_per_household, population_per_household) |>
  variable.names() 

map(y_var_out, print_boxplot_out)
```

Now we are able to recognize a small difference for `population_per_household`. `rooms_per_household`, on the other hand, is quite similar for both groups.

Additionally, we can use the function ggscatmat to create plots with our dependent variable as color column:

```{r}
data_explore |>
  dplyr::select(price_category, median_income, bedrooms_per_room, rooms_per_household, 
         population_per_household) |>
  ggscatmat(color="price_category", 
            corMethod = "spearman",
            alpha=0.2)
```

There are a few things you might notice in these histograms:

 - Note that our attributes have very different scales. We will take care of this issue later in data preparation, when we use feature scaling (data normalization).

 - The histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will transform these attributes later on to have more bell-shaped distributions. For our right-skewed data (i.e., tail is on the right, also called positive skew), common transformations include square root and log (we will use the log).

As a result of our data exploration, we will include the numerical variables

 - `median_income`,
 - `bedrooms_per_room` and
 - `population_per_household`  
 
as predictors in our model.

#### Categorical variables
Now let’s analyze the relationship between our categorical variables ocean proximity and price_category. We start with a simple count.

```{r}
data_explore |>
  count(price_category, ocean_proximity) |>
  group_by(price_category) |>
  mutate(percent = n / sum(n) *100,
         percent = round(percent, 2)) |>
  gt() |>
    tab_header(
    title = "California median house prices",
    subtitle = "Districts above and below 150.000$"
  ) |>
  cols_label(
    ocean_proximity = "Ocean Proximity",
    n = "Districts",
    percent = "Percent"
  ) |>
  fmt_number(
    columns = n,
    suffixing = TRUE
  ) 
```

The function geom_bin2d() creates a heatmap by counting the number of cases in each group, and then mapping the number of cases to each subgroup’s fill.

```{r}
data_explore %>%
  ggplot(aes(price_category, ocean_proximity)) +
  geom_bin2d() +
  scale_fill_continuous(type = "viridis") 
```
We can observe that most districts with a median house price above $150,000 are located less than one hour from the ocean. On the other hand, districts below that threshold are typically located inland. Hence, ocean proximity is indeed a good predictor for our two different median house value categories.

## Data preparation

 - Handle missing values
 - Fix or remove outliers
 - Feature selection
 - Feature engineering
 - Feature scaling
 - Create a validation set

Next, we’ll preprocess our data before training the models. We mainly use the tidymodels packages `recipes` and `workflows` for these steps. Recipes are built as a series of optional data preparation steps, such as:

 - _Data cleaning_: Fix or remove outliers, fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).

 - _Feature selection_: Drop the attributes that provide no useful information for the task.

 - _Feature engineering_: Discretize continuous features, decompose features (e.g., the weekday from a date variable, etc.), add promising transformations of features (e.g., log(x), sqrt(x), x2 , etc.) or aggregate features into promising new features (like we already did).

 - _Feature scaling_: Standardize or normalize features.

We will want to use our recipe across several steps as we train and test our models. To simplify this process, we can use a `model workflow`, which pairs a model and recipe together.

### Select variables  
Before we create our `recipes`, we first select the variables which we will use in the model. Note that we keep `longitude` and `latitude` to be able to map the data in a later stage but we will not use the variables in our model.

```{r}
housing_df_new <-
  housing_df |>
  dplyr::select( # select our predictors
    longitude, latitude, 
    price_category, 
    median_income, 
    ocean_proximity, 
    bedrooms_per_room, 
    rooms_per_household, 
    population_per_household
         )

glimpse(housing_df_new)
```

Furthermore, we need to make a new data split since we updated the original data.

```{r}
set.seed(123)

data_split <- initial_split(housing_df_new, # updated data
                           prop = 3/4, 
                           strata = price_category)

train_data <- training(data_split) 
test_data <- testing(data_split)
```


### Data preprocessing recipe
The type of data preprocessing is dependent on the data and the type of model being fit. The excellent book “Tidy Modeling with R” provides an [appendix with recommendations for baseline levels of preprocessing](https://www.tmwr.org/pre-proc-table.html) that are needed for various model functions.

Let’s create a base `recipe` for all of our classification models. Note that the sequence of steps matter:

 - The `recipe()` function has two arguments:

    1. A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (here, `price_category`). On the right-hand side of the tilde are the predictors. Variables may be listed by name (separated by a +), or you can use the dot (.) to indicate all other variables as predictors.  

    2. The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so `data = train_data` here.  

 - `update_role()`: This step of adding roles to a recipe is optional; the purpose of using it here is that those two variables can be retained in the data but not included in the model. This can be convenient when, after the model is fit, we want to investigate some poorly predicted value. These ID columns will be available and can be used to try to understand what went wrong.

 - `step_naomit()` removes observations (rows of data) if they contain `NA` or `NaN` values. We use `skip = TRUE` because we don’t want to perform this part to new data so that the number of samples in the assessment set is the same as the number of predicted values (even if they are NA).

Note that instead of deleting missing values we could also easily substitute (i.e., impute) missing values of variables by one of the following methods (using the training set):

 - [median](https://recipes.tidymodels.org/reference/step_medianimpute.html),
 - [mean](https://recipes.tidymodels.org/reference/step_meanimpute.html),
 - [mode](https://recipes.tidymodels.org/reference/step_modeimpute.html),
 - [k-nearest neighbors](https://recipes.tidymodels.org/reference/step_knnimpute.html),
 - [linear model](https://recipes.tidymodels.org/reference/step_impute_linear.html),
 - [bagged tree models](https://recipes.tidymodels.org/reference/step_bagimpute.html)  

Take a look at the [recipes reference](https://recipes.tidymodels.org/reference/index.html) for an overview about all possible imputation methods.
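
As a sketch (not used in the rest of this tutorial), a recipe that imputes instead of dropping rows could start like this:

```{r, eval=FALSE}
# Hypothetical alternative to step_naomit(): impute numeric predictors with the median
# and nominal predictors with the mode
recipe(price_category ~ ., data = train_data) %>%
  update_role(longitude, latitude, new_role = "ID") %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors())
```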

 - `step_novel()` converts all nominal variables to factors and takes care of other issues related to categorical variables.

 - `step_log()` will log transform data (since some of our numerical variables are right-skewed). Note that this step can not be performed on negative numbers.

 - `step_normalize()` normalizes (center and scales) the numeric variables to have a standard deviation of one and a mean of zero. (i.e., z-standardization).

 - `step_dummy()` converts our factor column ocean_proximity into numeric binary (0 and 1) variables.

Note that this step may cause problems if your categorical variable has too many levels - especially if some of the levels are very infrequent. In this case you should either drop the variable or pool infrequently occurring values into an “other” category with [`step_other()`](https://recipes.tidymodels.org/reference/step_other.html). This step has to be performed before `step_dummy()`.
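
A minimal sketch of how that could look (not needed here, since `ocean_proximity` has only five levels):

```{r, eval=FALSE}
# Hypothetical: pool infrequent levels into an "other" category before creating dummies
recipe(price_category ~ ., data = train_data) %>%
  step_other(ocean_proximity, threshold = 0.05) %>%
  step_dummy(all_nominal_predictors())
```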

 - `step_zv()`: removes any numeric variables that have zero variance.

 - `step_corr()`: will remove predictor variables that have large correlations with other predictor variables.

Note that the package [themis](https://themis.tidymodels.org/) contains extra steps for the `recipes` package for dealing with **imbalanced data**. A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes (see [Google Developers](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data) for more details). `Themis` provides various methods for over-sampling (e.g. SMOTE) and under-sampling. However, we don’t have to use these methods since our data is not imbalanced.

```{r}
housing_rec <-
  recipe(price_category ~ .,
         data = train_data) %>%
  update_role(longitude, latitude, new_role = "ID") %>%
  step_log(median_income,
           bedrooms_per_room, rooms_per_household, 
           population_per_household) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  # Make sure all nominal predictors (including ocean_proximity) are dummied
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes(), -longitude, -latitude) %>%
  # note: placing step_zv() before step_normalize() would avoid the zero-variance
  # warning caused later by the all-zero dummy column ocean_proximity_new
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman") %>%
  step_smote(price_category) 
```

To view the current set of variables and roles, use the `summary()` function:

```{r}
summary(housing_rec)
```
If we would like to check if all of our preprocessing steps from above actually worked, we can proceed as follows:

```{r}
prepped_data <- 
  housing_rec |># use the recipe object
  prep() |># perform the recipe on training data
  juice() # extract only the preprocessed dataframe 
```

Take a look at the data structure:

```{r}
glimpse(prepped_data)
```
Visualize the numerical data:

```{r}
prepped_data |>
  dplyr::select(price_category, 
         median_income, 
         rooms_per_household, 
         population_per_household) |>
  ggscatmat(corMethod = "spearman",
            alpha=0.2)
```
You should notice that:

 - the variables `longitude` and `latitude` did not change.

 - `median_income`, `rooms_per_household` and `population_per_household` are now _z_-standardized and the distributions are a bit less right skewed (due to our log transformation)

 - `ocean_proximity` was replaced by dummy variables.

#### Validation set  
Remember that we already partitioned our data set into a training set and test set. This lets us judge whether a given model will generalize well to new data. However, using only two partitions may be insufficient when doing many rounds of hyperparameter tuning (which we don’t perform in this tutorial, although it is always recommended to use a validation set).

Therefore, it is usually a good idea to create a so called validation set. Watch [this short video](https://developers.google.com/machine-learning/crash-course/validation/video-lecture) from Google’s Machine Learning crash course to learn more about the value of a validation set.

We use _k_-fold cross validation to build a set of 5 validation folds with the function `vfold_cv`. We also use stratified sampling:

```{r}
set.seed(100)

cv_folds <-
 vfold_cv(train_data, 
          v = 5, 
          strata = price_category) 
```

We will come back to the validation set after we specified our models.

## Model building  
### Specify models  
The process of specifying our models is always as follows:  

1. Pick a model type  
2. Set the engine  
3. Set the `mode`: regression or classification

You can choose the model type and engine from this list.

#### Logistic regression
```{r}
log_spec <- # your model specification
  logistic_reg() |> # model type
  set_engine(engine = "glm") |> # model engine
  set_mode("classification") # model mode

# Show your model specification
log_spec
```

#### Linear discriminant
```{r}
lda_spec <-
  discrim_linear() |>
  set_engine(engine='MASS') |>
  set_mode("classification")

# Show your model specification
lda_spec
```

#### Quadratic discriminant
```{r}
qda_spec <-
  discrim_quad() |>
  set_engine(engine='MASS') |>
  set_mode("classification")

# Show your model specification
qda_spec
```
#### K-nearest neighbor
```{r}
knn_spec <- 
  nearest_neighbor(neighbors = 4) |># we can adjust the number of neighbors 
  set_engine("kknn") |>
  set_mode("classification") 
```

### Create workflows  
To combine the data preparation recipe with the model building, we use the package [workflows](https://workflows.tidymodels.org/). A workflow is an object that can bundle together your pre-processing recipe, modeling, and even post-processing requests (like calculating the RMSE).

#### Logistic regression  
Bundle recipe and model with workflows:  

```{r}
log_wflow <- # new workflow object
 workflow() |># use workflow function
 add_recipe(housing_rec) |>  # use the new recipe
 add_model(log_spec)   # add your model spec

# show object
log_wflow
```

#### LDA
Bundle recipe and model:

```{r}
lda_wflow <- # new workflow object
 workflow() |># use workflow function
 add_recipe(housing_rec) |>  # use the new recipe
 add_model(lda_spec)   # add your model spec
```
#### QDA
Bundle recipe and model:

```{r}
qda_wflow <- # new workflow object
 workflow() |># use workflow function
 add_recipe(housing_rec) |>  # use the new recipe
 add_model(qda_spec)   # add your model spec
```
#### K-nearest neighbor
Bundle recipe and model:

```{r}
knn_wflow <-
 workflow() %>%
 add_recipe(housing_rec) |>
 add_model(knn_spec)
```

### Evaluate models
Now we can use our validation set (`cv_folds`) to estimate the performance of our models using the `fit_resamples()` function to fit the models on each of the folds and store the results.

Note that `fit_resamples()` will fit our model to each resample and evaluate on the holdout set from each resample. The function is usually only used for computing performance metrics across some set of resamples to evaluate our models (like accuracy) - the models are not even stored. However, in our example we save the predictions in order to visualize the model fit and residuals with `control_resamples(save_pred = TRUE)`.

Finally, we collect the performance metrics with `collect_metrics()` and pick the model that does best on the validation set.

#### Logistic regression
We use our workflow object to perform resampling. Furthermore, we use `metric_set()` to choose some common classification performance metrics provided by the `yardstick` package. Visit [yardsticks reference](https://yardstick.tidymodels.org/reference/index.html) to see the complete list of all possible metrics.

Note that Cohen’s _kappa_ coefficient ($\kappa$) is a similar measure to accuracy, but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. The higher the value, the better.
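
As a toy illustration (not our data): kappa compares the observed accuracy $p_o$ with the accuracy $p_e$ expected by chance,

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

so a classifier with an observed accuracy of $p_o = 0.9$ on a perfectly balanced two-class problem with balanced predictions ($p_e = 0.5$) has $\kappa = (0.9 - 0.5)/(1 - 0.5) = 0.8$.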

```{r}
# Ensure that all metric functions are from yardstick:
log_res <- 
  log_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
  )
```

##### Model coefficients
The method described above to obtain `log_res` is fine if we are not interested in model coefficients. However, if we would like to extract the model coefficients from `fit_resamples()`, we need to proceed as follows:

```{r}
# save model coefficients for a fitted model object from a workflow

get_model <- function(x) {
  extract_fit_parsnip(x) |>tidy()
}

# same as before with one exception
log_res_2 <- 
  log_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(
      save_pred = TRUE,
      extract = get_model) # use extract and our new function
    ) 
```

Now there is a `.extracts` column with nested tibbles.

```{r}
log_res_2$.extracts[[1]]
```

To get the results use:

```{r}
log_res_2$.extracts[[1]][[1]]
```

All of the results can be flattened and collected using:

```{r}
all_coef <- map_dfr(log_res_2$.extracts, ~ .x[[1]][[1]])
```

Show all of the resample coefficients for a single predictor:

```{r}
filter(all_coef, term == "median_income")
```

##### Performance metrics
Show average performance over all folds (note that we use `log_res`):

```{r}
log_res |> collect_metrics(summarize = TRUE)
```
Show performance for every single fold:

```{r}
log_res |> collect_metrics(summarize = FALSE)
```

##### Collect predictions
To obtain the actual model predictions, we use the function collect_predictions and save the result as `log_pred`:

```{r}
log_pred <- 
  log_res %>%
  collect_predictions()
```

##### Confusion matrix
Now we can use the predictions to create a confusion matrix with conf_mat():

```{r}
log_pred |>
  conf_mat(price_category, .pred_class) 
```
Additionally, the confusion matrix can quickly be visualized in different formats using autoplot(). Type mosaic:

```{r}
log_pred |>
  conf_mat(price_category, .pred_class) |>
  autoplot(type = "mosaic")
```
Or type heatmap:

```{r}
log_pred |>
  conf_mat(price_category, .pred_class) |>
  autoplot(type = "heatmap")
```
##### ROC-Curve
We can also make an ROC curve for our 5 folds. Since the category we are predicting is the first level in the `price_category` factor (“above”), we provide `roc_curve()` with the relevant class probability `.pred_above`:

```{r}
log_pred |>
  group_by(id) |># id contains our folds
  roc_curve(price_category, .pred_above) |>
  autoplot()
```
Visit Google developer’s [Machine Learning Crashcourse](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) to learn more about the ROC-Curve.
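
If we also want the area under each of these per-fold curves as a single number, here is a small sketch that applies `roc_auc()` from yardstick to the same saved predictions:

```{r}
log_pred |>
  group_by(id) |>
  roc_auc(price_category, .pred_above)
```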

##### Probability distributions
Plot predicted probability distributions for our two classes.

```{r}
log_pred |>
  ggplot() +
  geom_density(aes(x = .pred_above, 
                   fill = price_category), 
               alpha = 0.5)
```
#### Linear discriminant
We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.
```{r}
lda_res <-
  lda_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
    ) 

lda_res |> collect_metrics(summarize = TRUE)
```

#### QDA
We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

```{r, eval=FALSE}
qda_res <- 
  qda_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
    ) 

qda_res |>collect_metrics(summarize = TRUE)

#→ A | error:   rank deficiency in group below
#There were issues with some computations   A: x5
#Warning: All models failed. Run `show_notes(.Last.tune.result)` for more information.Error in `estimate_tune_results()`:
#! All models failed. Run `show_notes(.Last.tune.result)` for more information.
#Run `rlang::last_trace()` to see where the error occurred.
```

The error message suggests that there is a rank deficiency in your `qda_res` model, meaning that some of the predictor variables are likely collinear (highly correlated) or have near-zero variance, which is problematic for Quadratic Discriminant Analysis (QDA). Here’s how you can troubleshoot and resolve the issue:  

__1. Check for Near-Zero Variance Predictors__  
QDA does not handle variables with near-zero variance well. You can check for such predictors using:  

```{r}
library(caret)
nearZeroVar(train_data, saveMetrics = TRUE)
```

If any variables have near-zero variance, consider removing them. This is not our problem.

__2. Check for Collinearity__
Highly correlated predictors can cause rank deficiency. Compute the correlation matrix:

```{r}
cor(train_data %>% select_if(is.numeric), use = "pairwise.complete.obs")
```
You can remove highly correlated features using something like:

```{r,eval=FALSE}
# compute the correlation matrix of the numeric columns
cor_matrix <- cor(train_data %>% select_if(is.numeric), use = "pairwise.complete.obs")

# drop the columns that caret flags as highly correlated (cutoff of 0.9)
train_data <- train_data %>%
  dplyr::select(-all_of(findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)))
```
And try rerunning the model after dropping collinear variables. Our largest correlation is between `latitude` and `longitude` which makes sense. This is also not our problem.

__3. Ensure Balanced Class Distribution__  
QDA can struggle if classes are highly imbalanced. Check the class distribution:  

```{r}
table(train_data$price_category)
```

If imbalance exists, try using SMOTE, downsampling, or upsampling. We already did this above with `step_smote(price_category)`.
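
As a sketch, under-sampling the majority class instead would look very similar, swapping `step_downsample()` from themis into the recipe:

```{r, eval=FALSE}
# Hypothetical alternative to SMOTE: under-sample the majority class with themis
recipe(price_category ~ ., data = train_data) |>
  step_downsample(price_category)
```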

__4. Check for Missing Values__
Missing values can also cause errors. Check and handle missing data:

```{r}
sum(is.na(train_data))
```

This is not our problem either.

__5. Try an Alternative Model__
Sometimes models can't fit a data set despite your best efforts. If the problem persists, try [another model](https://www.tidymodels.org/find/parsnip/).
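
For example, a random forest with the already loaded ranger engine could be specified and bundled with the same recipe. This is only a sketch and is not evaluated in this tutorial:

```{r, eval=FALSE}
# Hypothetical alternative model: random forest with the ranger engine
rf_spec <- 
  rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wflow <- 
  workflow() |>
  add_recipe(housing_rec) |>
  add_model(rf_spec)
```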



#### K-nearest neighbor  
We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics. 

```{r}
knn_res <- 
  knn_wflow |>
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      yardstick::recall, 
      yardstick::precision, 
      yardstick::f_meas, 
      yardstick::accuracy, 
      yardstick::kap, 
      yardstick::roc_auc, 
      yardstick::sens, 
      yardstick::spec
    ),
    control = control_resamples(save_pred = TRUE)
    ) 

knn_res |> collect_metrics(summarize = TRUE)
```


#### Compare models
Extract metrics from our models to compare them:

```{r}
log_metrics <- 
  log_res |>
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Logistic Regression") # add the name of the model to every row

lda_metrics <- 
  lda_res |>
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Linear Discriminant")

# QDA did not run
# qda_metrics <- 
# qda_res |>
#   collect_metrics(summarise = TRUE) %>%
#   mutate(model = "Quadratic Discriminant")

# knn_metrics <- 
#   knn_res |>
#   collect_metrics(summarise = TRUE) %>%
#   mutate(model = "KNN")

# create dataframe with all models
model_compare <- bind_rows(
                          log_metrics,
                          lda_metrics
                           ) 

# change data structure
model_comp <- 
  model_compare |>
  dplyr::select(model, .metric, mean, std_err) |>
  pivot_wider(names_from = .metric, values_from = c(mean, std_err)) 

# show mean F1-Score for every model
model_comp |>
  arrange(mean_f_meas) |>
  mutate(model = fct_reorder(model, mean_f_meas)) |># order results
  ggplot(aes(model, mean_f_meas, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") +
   geom_text(
     size = 3,
     aes(label = round(mean_f_meas, 2), y = mean_f_meas + 0.08),
     vjust = 1
  )
```

```{r}
# show mean area under the curve (auc) per model
model_comp |>
  arrange(mean_roc_auc) |>
  mutate(model = fct_reorder(model, mean_roc_auc)) %>%
  ggplot(aes(model, mean_roc_auc, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") + 
     geom_text(
     size = 3,
     aes(label = round(mean_roc_auc, 2), y = mean_roc_auc + 0.08),
     vjust = 1
  )
```

Note that the model results are all quite similar. In our example we choose the F1-Score as performance measure to select the best model. Let’s find the maximum mean F1-Score:

```{r}
model_comp |> slice_max(mean_f_meas)
```

### Last evaluation on test set
Tidymodels provides the function `last_fit()` which fits a model to the whole training data and evaluates it on the test set. We just need to provide the workflow object of the best model as well as the data split object (not the training data).

```{r}
last_fit_logit <- last_fit(log_wflow, 
                        split = data_split,
                        metrics = metric_set(
                        yardstick::recall, 
                        yardstick::precision, 
                        yardstick::f_meas, 
                        yardstick::accuracy, 
                        yardstick::kap, 
                        yardstick::roc_auc, 
                        yardstick::sens, 
                        yardstick::spec
                        ))
```

Show performance metrics

```{r}
last_fit_logit |>
  collect_metrics()
```
And these are our final performance metrics. Remember that if a model fit to the training dataset also fits the test dataset well, minimal _overfitting_ has taken place. This also seems to be the case in our example.

To learn more about the model we can access the variable importance scores via the `.workflow` column. We first need to pluck out the first element in the workflow column, then pull out the fit from the workflow object. Finally, the `vip` package helps us visualize the variable importance scores for the top features. Note that we can’t create this type of plot for every model engine.

```{r}
last_fit_logit |>
  pluck(".workflow", 1) |>  
  extract_fit_parsnip() |>
  vip(num_features = 10)
```

The two most important predictors of whether a district has a median house value above or below $150,000 were the ocean proximity category inland and the median income.

Take a look at the confusion matrix:

```{r}
last_fit_logit %>%
  collect_predictions() |>
  conf_mat(price_category, .pred_class) |>
  autoplot(type = "heatmap")
```

Let’s create the ROC curve. Again, since the event we are predicting is the first level in the price_category factor (“above”), we provide `roc_curve()` with the relevant class probability `.pred_above`:

```{r}
last_fit_logit |>
  collect_predictions() |>
  roc_curve(price_category, .pred_above) |>
  autoplot()
```


Based on all of the results, the validation set and test set performance statistics are very close, so we would have pretty high confidence that our logit model would perform well when predicting new data.
