Data mining is the process of exploring large datasets to discover previously unknown:
Patterns
Trends
Relationships
Anomalies
It allows us to turn raw data into meaningful knowledge and insights.
Key Points
Focuses on finding insights in existing data.
Often descriptive: answers questions like “What patterns exist?” or “Which groups are similar?”
Uses statistics, visualization, and simple algorithms to uncover these patterns.
Examples:
Discovering which products are often bought together in a supermarket.
Identifying customer segments based on purchasing behavior.
Detecting unusual patterns in bank transactions.
Data mining is part of the broader Knowledge Discovery in Databases (KDD) process, which includes:
Data Cleaning: Remove noise and inconsistencies.
Data Integration: Combine data from multiple sources.
Data Selection: Retrieve relevant data for analysis.
Data Transformation: Consolidate data into forms suitable for mining (e.g., aggregation, normalization).
Data Mining: Apply intelligent methods to uncover patterns.
Pattern Evaluation: Identify patterns that are truly meaningful.
Knowledge Presentation: Present insights through visualization and reporting.
Examples of Patterns
Emails containing words like “Free,” “Urgent,” or “Act now” are more likely spam.
A gradual rise in blood sugar, combined with weight gain and reduced activity, may indicate early diabetes risk.
Irregular nighttime heartbeats could signal heart rhythm issues.
Types of Data Mining Tasks:
Clustering: Groups similar objects without predefined labels.
Example: Customer segmentation for targeted marketing.
Algorithms: k-Means, Hierarchical Clustering.
Association Rule Learning (Market Basket Analysis): Finds items or events that occur together.
Example: People who buy diapers often also buy beer.
Algorithm: Apriori.
Classification: Predicts a category (e.g., “low,” “medium,” “high” risk loans).
Regression: Predicts a numeric value (e.g., house price, sales revenue).
Machine learning is a set of algorithms that allow computers to learn from data and make predictions or decisions without being explicitly programmed. In traditional programming, a programmer writes step-by-step instructions for the computer. For example:
If you want a program to identify cats in photos, in traditional programming you’d have to tell the computer: “If it has pointy ears and whiskers and fur, then it’s a cat.” This is explicit programming: the computer does exactly what you tell it, nothing more.
With machine learning, you don’t manually tell the computer the rules. Instead, you give it lots of examples and it figures out the patterns by itself.
For the cat example:
You give the computer thousands of photos labeled as “cat” or “not cat.”
The computer learns the patterns that usually indicate a cat (shape, color, features) without a human specifying the rules.
So the computer is learning from data, not following a hard-coded list of instructions.
Key points for machine learning:
Focuses on prediction and automation, not just discovery.
Often predictive: “Given this data, what will happen next?”
Uses algorithms (regression, decision trees, neural networks, etc.) that learn from examples.
Examples:
Predicting which customers may leave (churn).
Filtering spam emails.
Recognizing objects for self-driving cars.
We live in the age of data. Massive amounts of information from the web, sensors, business systems, and social media create opportunities to:
Turn Data into Knowledge: Move from storing data to understanding it.
Support Decision-Making: Make informed, data-driven decisions.
Gain a Competitive Advantage: Optimize processes and understand customer behavior.
Enable Scientific Discovery: Uncover hidden patterns in fields like genomics, astronomy, and climate science.
Data Quality: Noisy, incomplete, or inconsistent data.
Scalability: Algorithms must handle very large datasets efficiently.
Dimensionality: High-dimensional data can be sparse and hard to analyze.
Privacy and Ethics: Ensure responsible use of data and fairness.
Interpretability: Understanding results from complex models.
Retail: Market basket analysis, customer segmentation, sales forecasting.
Finance: Credit scoring, fraud detection, algorithmic trading.
Healthcare: Disease prediction, drug discovery, patient outcome analysis.
Telecommunications: Churn prediction, network fault detection.
Science: Gene sequence analysis, climate pattern recognition, astronomical data analysis.
Web & Social Media: Recommendation systems (Netflix, Amazon), sentiment analysis, targeted advertising.
Classification: Assigns new records to existing categories using known examples.
Prediction: Similar to classification but usually predicts numeric outcomes, though it can also predict categories.
Association / Market Basket Analysis: Finds items or events that frequently occur together (e.g., products bought in the same transaction).
Dimension Reduction: Reduces the number of variables to improve efficiency, predictive power, and interpretability.
Data Exploration & Visualization: Understanding overall patterns and generating hypotheses using charts, dashboards, and visual analytics.
Supervised Learning: Algorithm learns from labeled data (known outputs) and can predict outcomes on new, unseen data.
Unsupervised Learning: Algorithm finds patterns without known outputs, e.g., clustering or association.
Define the goal of the project.
Collect data from one or more sources.
Explore, clean, and process the data.
Reduce data dimensions, if necessary.
Identify the type of task: prediction, classification, clustering, etc.
For supervised learning, split data into training, validation, and test sets.
Choose suitable techniques: regression, neural networks, decision trees, etc.
Execute the data mining task.
Interpret the results.
When building models for prediction or classification, we need to make sure that the chosen model generalizes beyond the datasets we have at hand for building and choosing the model. That is, the chosen model should not be so complicated (for example, with too many terms) that it overfits the data and produces large standard errors.
The mean squared error of a model can be decomposed into two main components: squared bias and variance (plus irreducible noise). If a model underfits the data, the bias component is high but the variance component is low. If a model overfits the data, the bias component is low but the variance component is high. To check the predictive power of a model, we evaluate it on a new set of predictor values that was not used to fit the model (a validation or test set).
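In symbols, for a response \(y=f(x)+\epsilon\) with noise variance \(\sigma^2\) and a fitted model \(\hat{f}(x)\), this standard decomposition (stated here for reference) is
\[E\big[(y-\hat{f}(x))^2\big]=\big(E[\hat{f}(x)]-f(x)\big)^2+Var\big(\hat{f}(x)\big)+\sigma^2,\]
that is, squared bias plus variance plus irreducible noise.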
Here are some simulations of overfitting: https://medium.com/analytics-vidhya/full-demo-of-overfitting-and-underfitting-in-r-70fd9b66a7d1 and https://fengkehh.github.io/post/2017-06-30-overfitting/ and https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765
When fitting a multiple regression model in R, we use the R function lm(). The syntax is lm(formula, data). To get the prediction with new data, we use the predict() function.
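A minimal sketch of this pattern, using the built-in mtcars data purely for illustration (the choice of predictors here is arbitrary):
# Fit a multiple regression and predict for new observations
fit      <- lm(mpg ~ wt + hp, data = mtcars)                 # fit the model
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))   # new predictor values
predict(fit, newdata = new_cars)                             # predicted mpg for the new rows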
When building a predictive model, the typical steps are
Determine the purpose: prediction or classification
Obtain the data: you might need to select a sample from the database.
Explore, clean, and process the data: the description of the variables (AKA data dictionary), error checking, creation of dummy variables
Reduce the data dimension: The reduction of the number of variables
Determine the data mining task: prediction of a value or a label, supervised or unsupervised learning
Partition the data (for supervised tasks): the partition of data into training, validation, and test data.
Choose the best candidate technique.
Perform the task using algorithms: the lm() function for multiple regression, the naiveBayes() function for classification, the rpart(), knn() and svm() functions for both prediction and classification, and the functions neuralnet(), linDA(), lda(), qda(), and mda().
Interpret the results.
Deploy the model: apply the model to new records to predict (or to score) the output values.
Data Quality: Dealing with noisy, incomplete, and inconsistent data.
Scalability: Algorithms must be efficient and scalable to handle enormous datasets.
Dimensionality: The “curse of dimensionality” – high-dimensional data can be sparse and computationally expensive to process.
Privacy and Ethics: Using data responsibly, ensuring individual privacy, and avoiding biased or unethical outcomes.
Interpretability: Understanding and trusting the results of complex models (“black boxes”).
Data Exploration is a crucial initial phase in the data mining process that involves using visual and statistical techniques to understand the general properties of a dataset before applying complex algorithms or models.
Data exploration typically involves three main types of analysis:
For numeric data, this involves calculating key metrics that describe the central tendency, spread, and shape of the data’s distribution.
Central Tendency: Mean, Median, Mode.
Spread: Range, Variance, Standard Deviation, Interquartile Range (IQR).
Shape: Skewness (measure of asymmetry), Kurtosis (measure of “tailedness”).
Example: Calculating the average (mean) customer age and the standard deviation tells you the typical age and how much variation there is in your customer base.
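A quick sketch of these summaries in R, using Sepal.Length from the built-in iris data in place of "customer age":
x <- iris$Sepal.Length
mean(x); median(x)                 # central tendency
range(x); var(x); sd(x); IQR(x)    # spread
# Base R has no skewness/kurtosis functions; the e1071 package provides them:
# e1071::skewness(x); e1071::kurtosis(x)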
For categorical data, this involves frequency counts and proportions for each category (e.g., a frequency table of customers by country).
Histograms & Box Plots: Understand the distribution of a single numerical variable. A box plot quickly shows the median, quartiles, and potential outliers.
Scatter Plots: Visualize the relationship (e.g., correlation) between two numerical variables (e.g., advertising spend vs. sales).
Bar Charts & Pie Charts: Show the frequency or proportion of categories in a categorical variable (e.g., number of customers per country).
Heatmaps & Correlation Matrices: Visualize the correlation coefficients between multiple numerical variables at once. This helps identify highly related variables.
Pair Plots: A grid of scatter plots for multiple variables, providing a comprehensive overview of their relationships.
Missing Values: Identifying columns with blank or NULL entries and understanding the pattern of missingness.
Outliers: Detecting extreme values that could be errors or genuine rare events. A scatter plot or box plot makes outliers obvious.
Inconsistencies: Spotting typos, duplicate records, or incorrect data types (e.g., a date stored as text).
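A small sketch of these checks in R, using the built-in airquality data (chosen because it contains genuine missing values):
colSums(is.na(airquality))                  # missing values per column
boxplot(airquality$Ozone, main = "Ozone")   # outliers appear as points beyond the whiskers
sum(duplicated(airquality))                 # number of exact duplicate rows
str(airquality)                             # confirm each column has the expected type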
While you can use the dplyr package for data management, you can also use the sqldf package. We use the diamonds data from ggplot2 package to demonstrate data management methods with R.
# Load the package
library(ggplot2)
library(sqldf)
df1 <- sqldf("SELECT carat, cut, price FROM diamonds LIMIT 10")
df1
## carat cut price
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
## 7 0.24 Very Good 336
## 8 0.26 Very Good 337
## 9 0.22 Fair 337
## 10 0.23 Very Good 338
df2 <- sqldf("SELECT * FROM diamonds WHERE carat < 1 AND price < 5000 LIMIT 10")
df2
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
df3 <- sqldf("
SELECT cut, AVG(price) AS avg_price, COUNT(*) AS count
FROM diamonds
GROUP BY cut
ORDER BY avg_price DESC
")
df3
## cut avg_price count
## 1 Premium 4584.258 13791
## 2 Fair 4358.758 1610
## 3 Very Good 3981.760 12082
## 4 Good 3928.864 4906
## 5 Ideal 3457.542 21551
# list only top 5 most expensive diamonds
df4 <- sqldf("SELECT * FROM diamonds ORDER BY price DESC LIMIT 5")
df4
## carat cut color clarity depth table price x y z
## 1 2.29 Premium I VS2 60.8 60 18823 8.50 8.47 5.16
## 2 2.00 Very Good G SI1 63.5 56 18818 7.90 7.97 5.04
## 3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
## 4 2.07 Ideal G SI2 62.5 55 18804 8.20 8.13 5.11
## 5 2.00 Very Good H SI1 62.8 57 18803 7.95 8.00 5.01
# Add volume column = x*y*z
df5 <- sqldf("
SELECT *, x*y*z AS volume
FROM diamonds
WHERE x*y*z > 0
LIMIT 10
")
df5
## carat cut color clarity depth table price x y z volume
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 38.20203
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 34.50586
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 38.07688
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 46.72458
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 51.91725
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 38.69395
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 38.83087
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 42.32108
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 36.42521
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39 38.71800
# Create a small lookup table for cut grades
cut_info <- data.frame(
cut = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
rating = c(1, 2, 3, 4, 5)
)
# Join diamonds with cut_info
df6 <- sqldf("
SELECT d.carat, d.cut, d.price, c.rating
FROM diamonds d
JOIN cut_info c
ON d.cut = c.cut
LIMIT 10
")
df6
## carat cut price rating
## 1 0.23 Ideal 326 5
## 2 0.21 Premium 326 4
## 3 0.23 Good 327 2
## 4 0.29 Premium 334 4
## 5 0.31 Good 335 2
## 6 0.24 Very Good 336 3
## 7 0.24 Very Good 336 3
## 8 0.26 Very Good 337 3
## 9 0.22 Fair 337 1
## 10 0.23 Very Good 338 3
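For comparison, here is a dplyr sketch of two of the queries above (df1 and df3); the results should match up to row order:
library(dplyr)
# Equivalent of df1: pick three columns and keep the first 10 rows
diamonds %>% select(carat, cut, price) %>% head(10)
# Equivalent of df3: average price and count by cut, sorted by average price
diamonds %>%
  group_by(cut) %>%
  summarise(avg_price = mean(price), count = n()) %>%
  arrange(desc(avg_price))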
We use the R built-in data “iris” to demonstrate visualization.
# 1. Create a side-by-side box plot (bxp)
p <- ggplot(iris, aes(color = Species, x = Sepal.Length))
bxp <- p + geom_boxplot()
# 2. Create an overlay density plot (dp)
p <- ggplot(iris, aes(color = Species, x = Sepal.Length))
dp <- p + geom_density()
# 3. Create a scatter plot
lp <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
# 4. Combine the plots on one page
library(ggpubr)
ggarrange(bxp, dp, lp,
labels = c("A", "B", "C", "D"),
ncol = 2, nrow = 2)
The dimension of a dataset is the number of variables (quantitative or categorical). The dimensionality of a model is the number of predictors or input variables used by the model. For data mining algorithms to operate efficiently, the dimension often needs to be reduced. This process is part of the pilot phase of data mining and is done before deploying a model. Several dimension reduction approaches are:
Incorporate domain knowledge to remove or combine categories.
Use data visualization (such as scatterplot matrix) or summaries (such as correlation matrix) to detect redundant variables or categories.
Convert ordinal categorical variables to numerical variables.
Employ automated reduction techniques, such as the principal component analysis (PCA), to create a new set of uncorrelated variables and use only a few of them.
Use regression models and classification & regression trees to remove redundant variables or to combine similar categories of categorical variables.
The curse of dimensionality is the affliction caused by adding variables to a model that already involves multiple variables. As variables are added, the feature space becomes increasingly sparse, and classification and prediction models fail because the available data are insufficient to provide a useful model across so many variables. Dimension reduction in the artificial intelligence (AI) literature is often referred to as feature selection (selecting predictor variables or features) or feature extraction (creating more useful variables in place of the original ones).
The handling of a categorical variable in regression is usually through the creation of dummy variables. Many times, software will handle it automatically. If you have \(k\) categories for a variable, you can create \(k\) dummy variables, each taking values of 0 or 1. Since these \(k\) dummy variables are perfectly collinear (they sum to 1), to avoid multicollinearity you should use only \(k-1\) of them. If your data have many categorical variables, then many more dummy variables will be created. If you think some categories of a categorical variable can be combined to form a new categorical variable with a smaller number of categories, do it!
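A small sketch of dummy coding in R: model.matrix() creates the \(k-1\) dummies for a factor automatically, treating the first level as the baseline (the tiny data frame below is made up for illustration).
# Dummy (0/1) coding of a 3-level factor; "Fair" becomes the baseline level
cuts <- data.frame(cut = factor(c("Fair", "Good", "Ideal", "Good")))
model.matrix(~ cut, data = cuts)   # columns: (Intercept), cutGood, cutIdeal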
Another way to achieve data reduction is to use meaningful numerical values for the categories of an ordinal variable. A categorical variable is ordinal (such as a letter grade or a rating on a Likert scale) if there is a natural ordering among its categories. Some domain knowledge might be needed when mapping categories to numerical values.
The principal component analysis is a technique that allows us to create a set of \(p\) uncorrelated quantitative variables (called principal components) out of a set of \(p\) quantitative variables, such that the sum of the variances of the PC’s equals the sum of the variances of the original variables. The PC that has the largest variance among all PC’s is called the first principal component, the PC that has the second largest variance among all PC’s is called the second principal component, and so on.
Here is a math-free introduction to PCA: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
head(iris, n = 10)
D = iris[,-5] # Remove 5th column which is categorical
head(D, n = 10)
# cor = TRUE bases the PCA on the correlation matrix, i.e., the variables are standardized (centered and scaled)
pcs = princomp(D, cor = TRUE, scores = TRUE)
# scores = TRUE returns a new data frame with columns being the principal components
pcs
summary(pcs) # Shows importance of components
plot(pcs, type = "l", main = "A Scree Plot") # Variances of PC's
# The numbers in the above graph are equal to the eigen values of the correlation matrix of D, as shown below.
eigen(cor(D))
eigen(cor(D))$values # These sum to the number of variables (4 here), since the correlation matrix is used
# Get the corresponding eigen vectors
eigen(cor(D))$vectors
# pcs is a list
names(pcs)
pcs$sdev # Gives the standard deviation of each of the principal components.
pcs$center # Gives the mean of each column of the original data D
pcs$scale # Gives the scaling factor (similar to standard deviation) of each column of the original data D
pcs$loadings # Extract coefficients (or vectors); a matrix whose columns contain the eigenvectors
pcs$scores # Extract the data frame with columns that are the principal components variables (called a score matrix).
# The difference between this new data frame and the original data frame is that the new
# one has columns that are all uncorrelated to each other. The total variance remains the same.
biplot(pcs) # A scatterplot of the first two PCs, with original variables (as arrows) and observation labels
# Add lines to the biplot
abline(h=0, v=0, lty = "dashed")
# Calculate the correlation between each of the original variable and each PC
cor(pcs$scores,D)
# Calculate angles between PC1 and original variables
(cor(pcs$scores,D)[1,] %>% acos())*180/pi # acos() is the inverse cosine function
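As a short follow-up sketch, to actually reduce the dimension we keep only the first few principal components (continuing with the pcs object created above):
# Proportion of total variance explained by each principal component
prop_var <- pcs$sdev^2 / sum(pcs$sdev^2)
cumsum(prop_var)             # cumulative proportion; guides how many PCs to keep
reduced <- pcs$scores[, 1:2] # a 2-column replacement for the original 4 predictors
head(reduced)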
Prediction is a central objective in data analysis and machine learning. Among predictive modeling techniques, Multiple Linear Regression (MLR) is one of the most widely used due to its simplicity, interpretability, and strong theoretical foundation. This chapter discusses the principles of multiple linear regression, the importance of data partitioning, and commonly used performance metrics for evaluating predictive models.
Classic regression is a way to understand how different factors are related to an outcome. Imagine you want to know what affects house prices. Classic regression looks at things like size, location, and age of a house and estimates how much each factor matters, while holding the others constant.
The multiple linear regression model has the structure: \[y=\beta_0+\beta_1x_1 + \beta_2x_2 + \beta_3x_3 +\cdots+ \beta_px_p + \epsilon\] where \(y\) is the response variable, the \(x_i\)’s are predictors, \(\beta\)’s are coefficients, and \(\epsilon\) is the error term which is assumed to have a normal distribution with mean 0 and a constant variance \(\sigma^2\).
library(coefplot)
library(broom) # provides the tidy() function used below
m <- lm(mpg ~ wt + hp + cyl + disp + qsec, data = mtcars)
summary(m)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl + disp + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3117 -1.3483 -0.4352 1.2603 5.6094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.87361 9.91809 3.617 0.00126 **
## wt -4.22527 1.25239 -3.374 0.00233 **
## hp -0.01584 0.01527 -1.037 0.30908
## cyl -1.15608 0.71525 -1.616 0.11809
## disp 0.01195 0.01191 1.004 0.32484
## qsec 0.25382 0.48746 0.521 0.60699
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.547 on 26 degrees of freedom
## Multiple R-squared: 0.8502, Adjusted R-squared: 0.8214
## F-statistic: 29.51 on 5 and 26 DF, p-value: 6.182e-10
tidy(m)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 35.9 9.92 3.62 0.00126
## 2 wt -4.23 1.25 -3.37 0.00233
## 3 hp -0.0158 0.0153 -1.04 0.309
## 4 cyl -1.16 0.715 -1.62 0.118
## 5 disp 0.0119 0.0119 1.00 0.325
## 6 qsec 0.254 0.487 0.521 0.607
coefplot(m)
## `height` was translated to `width`.
## `height` was translated to `width`.
We can sort the coefficients by p-value (smallest p-value at the bottom):
library(broom)
library(ggplot2)
tidy(m) |>
filter(term != "(Intercept)") |>
arrange(p.value) |>
mutate(term = factor(term, levels = term)) |>
ggplot(aes(estimate, term)) +
geom_point() +
geom_errorbarh(
aes(xmin = estimate - 1.96 * std.error,
xmax = estimate + 1.96 * std.error),
height = 0.2
) +
labs(x = "Coefficient estimate", y = "")
## Warning: `geom_errobarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `height` was translated to `width`.
Note: the tidy function from the broom package is very neat.
Since some of the x variables (predictors) might be correlated, the model fitting may suffer from the so-called multicollinearity issue, which results in larger standard errors of the estimated regression coefficients. To avoid this issue, we can drop some predictors that are highly correlated with other predictors.
We would like to have parsimonious models which have a small number of predictors.
There are a few criteria that help select predictors: Mallows's \(C_p\), AIC, BIC, and so on.
Mallows's \(C_p\) criterion favors models whose \(C_p\) value is small and close to the number of coefficients, including the intercept. It is defined as
\[C_p=\frac{SSE}{\hat{\sigma}^2_{full}}+2(p+1)-n\] where \(n\) is the number of observations, \(p\) is the number of predictors in the smaller model, and \(SSE\) is the sum of squared residuals of that smaller model. The quantity \(\hat{\sigma}^2_{full}\) is the estimate of the variance of the error term under the full model, the model with all predictors included.
The \(AIC\) (Akaike Information Criterion) is defined as
\[AIC=n\cdot log(SSE/n)+2p\] The \(BIC\) (Bayesian Information Criterion) is defined as
\[BIC=n\cdot log(SSE/n)+p\cdot log(n)\] The smaller the AIC/BIC, the better the model.
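As a quick sketch of how these criteria are used in R: the base functions AIC() and BIC() are log-likelihood based, which differs from the SSE-based formulas above only by an additive constant for a fixed dataset, so model rankings agree. The two candidate models below are illustrative choices, not the "correct" ones.
# Compare two candidate models for mpg by AIC and BIC (smaller is better)
m_small <- lm(mpg ~ wt + hp, data = mtcars)
m_large <- lm(mpg ~ wt + hp + cyl + disp + qsec, data = mtcars)
AIC(m_small, m_large)
BIC(m_small, m_large)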
A reference: https://online.stat.psu.edu/stat462/node/197/
Variable selection can be done using a “backward”, “forward”, or “stepwise” method.
# Fit the full model with all predictors (assumed here; the earlier example used a subset of the mtcars variables)
full_model = lm(mpg ~ ., data = mtcars)
# Backward selection starts with the full model
step = step(full_model, direction = "backward")
summary(step) # Summarize the final model
pred = predict(step) # You can use validation data if you do have (which is recommended)
pred
library(forecast)
accuracy(pred, mtcars$mpg)
# Stepwise selection can start with the full model
step = step(full_model, direction = "both")
summary(step)
intercept.only_model = lm(mpg ~ 1, data = mtcars)
# Forward selection starts with the intercept only model, or other small model
step = step(intercept.only_model,
scope=list(lower=intercept.only_model, upper = full_model),
direction = "forward")
summary(step)
# Stepwise selection can also start with the intercept only model
step = step(intercept.only_model,
scope=list(lower=intercept.only_model, upper = full_model),
direction = "both")
summary(step)
All of these runs end up with the same model here, though that is not necessarily the case in general.
To assess a model’s predictive performance, it is essential to evaluate it on data not used during training. Using the same data for both training and evaluation can lead to overfitting and overly optimistic performance estimates.
The dataset is commonly divided into:
Training set – Used to fit the model.
Testing set – Used to evaluate predictive performance.
A common split ratio is 70:30 or 80:20, depending on dataset size.
In addition to simple train–test splits, a more robust validation technique is k-fold cross-validation (CV), where the data are divided into \(k\) subsets (folds) and the model is trained and tested \(k\) times, each time holding out a different fold. A special case of \(k\)-fold CV is leave-one-out cross-validation (LOOCV), which is useful for small datasets.
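A minimal sketch of 10-fold cross-validation using the caret package (assumed installed), applied to a linear model for mpg in the mtcars data:
library(caret)
set.seed(123)
# Ten-fold CV: the model is repeatedly fit on 9 folds and evaluated on the held-out fold
cv_fit <- train(mpg ~ ., data = mtcars,
                method = "lm",
                trControl = trainControl(method = "cv", number = 10))
cv_fit$results   # cross-validated RMSE, R-squared, and MAE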
We use the Boston data from MASS package to demonstrate data partition into training and test sets.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
set.seed(123) # For reproducibility
n <- nrow(Boston)
train_index <- sample(1:n, size = 0.7 * n)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
head(train_data)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 415 45.74610 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 88.27
## 463 6.65492 0 18.10 0 0.713 6.317 83.0 2.7344 24 666 20.2 396.90
## 179 0.06642 0 4.05 0 0.510 6.860 74.4 2.9153 5 296 16.6 391.27
## 14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21.0 396.90
## 195 0.01439 60 2.93 0 0.401 6.604 18.8 6.2196 1 265 15.6 376.70
## 426 15.86030 0 18.10 0 0.679 5.896 95.4 1.9096 24 666 20.2 7.68
## lstat medv
## 415 36.98 7.0
## 463 13.99 19.5
## 179 6.92 29.9
## 14 8.26 20.4
## 195 4.38 29.1
## 426 24.39 8.3
head(test_data)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## 8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90
## 9 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63
## 12 0.11747 12.5 7.87 0 0.524 6.009 82.9 6.2267 5 311 15.2 396.90
## lstat medv
## 1 4.98 24.0
## 3 4.03 34.7
## 6 5.21 28.7
## 8 19.15 27.1
## 9 29.93 16.5
## 12 13.27 18.9
To evaluate predictive accuracy, several performance metrics are commonly used.
# Train the multiple linear regression model based on training data
mlr_model <- lm(medv ~ ., data = train_data)
# Model summary
summary(mlr_model)
##
## Call:
## lm(formula = medv ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5163 -2.6745 -0.5699 1.5818 24.7767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.852e+01 6.084e+00 6.332 7.66e-10 ***
## crim -1.090e-01 3.539e-02 -3.079 0.002245 **
## zn 5.303e-02 1.668e-02 3.179 0.001614 **
## indus -5.224e-02 7.877e-02 -0.663 0.507669
## chas 4.044e+00 1.025e+00 3.946 9.64e-05 ***
## nox -1.443e+01 4.671e+00 -3.089 0.002171 **
## rm 3.178e+00 4.993e-01 6.365 6.32e-10 ***
## age -5.659e-04 1.618e-02 -0.035 0.972128
## dis -1.541e+00 2.405e-01 -6.406 4.98e-10 ***
## rad 3.023e-01 8.064e-02 3.749 0.000209 ***
## tax -1.049e-02 4.658e-03 -2.252 0.024963 *
## ptratio -8.587e-01 1.599e-01 -5.370 1.46e-07 ***
## black 6.865e-03 3.443e-03 1.994 0.046977 *
## lstat -5.838e-01 5.915e-02 -9.871 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 340 degrees of freedom
## Multiple R-squared: 0.733, Adjusted R-squared: 0.7228
## F-statistic: 71.8 on 13 and 340 DF, p-value: < 2.2e-16
# Generate predictions
predicted_medv <- predict(mlr_model, newdata = test_data)
# Actual MEDV values
actual_medv <- test_data$medv
# Compute Performance Metrics
MAE <- mean(abs(actual_medv - predicted_medv))
MSE <- mean((actual_medv - predicted_medv)^2)
RMSE <- sqrt(MSE)
SST <- sum((actual_medv - mean(actual_medv))^2)
SSE <- sum((actual_medv - predicted_medv)^2)
R2 <- 1 - (SSE / SST)
# Display Performance Results
performance_metrics <- data.frame(
MAE = MAE,
MSE = MSE,
RMSE = RMSE,
R_squared = R2
)
print(performance_metrics)
## MAE MSE RMSE R_squared
## 1 3.531555 23.06699 4.802811 0.7397028
The following is from the above reference, with a minor modification.
The ordinary multiple linear regression model based on the ordinary least squares method performs poorly when the dataset contains more variables than observations.
A better alternative is penalized regression, which creates a linear regression model that is penalized for having too many variables by adding a constraint to the least squares criterion. This is also known as shrinkage or regularization.
The consequence of imposing this penalty is to reduce (i.e., shrink) the coefficient values towards zero, allowing the less contributive variables to have coefficients close or equal to zero.
Note that, the shrinkage requires the selection of a tuning parameter (\(\lambda\)) that determines the amount of shrinkage.
Three most commonly used penalized regression methods are ridge regression, lasso regression and elastic net regression.
Ridge regression shrinks the regression coefficients so that variables with a minor contribution to the outcome have coefficients close to zero.
The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called \(L_2\)-norm, which is the sum of the squared coefficients.
The amount of the penalty can be fine-tuned using a constant called lambda, denoted by Greek letter \(\lambda\). Selecting a good value for \(\lambda\) is critical.
When \(\lambda=0\), the penalty term has no effect, and ridge regression will produce the classical least squares coefficients. However, as \(\lambda\) increases toward infinity, the impact of the shrinkage penalty grows, and the ridge regression coefficients approach zero.
Note that, in contrast to the ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression, so that all the predictors are on the same scale.
The standardization of a predictor x, can be achieved using the formula \(x'=\frac{x-mean(x)}{sd(x)}\), where \(mean(x)\) is the mean of \(x\) and \(sd(x)\) is the standard deviation. The consequence of this is that, all standardized predictors will have a mean of zero and a standard deviation of one allowing the final fit to not depend on the scale on which the predictors are measured.
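A small sketch of this standardization in R using scale(); note that glmnet(), used later in this chapter, standardizes predictors internally by default (standardize = TRUE), so manual scaling is often unnecessary there.
# Standardize three mtcars predictors: each column becomes (x - mean(x)) / sd(x)
X  <- as.matrix(mtcars[, c("wt", "hp", "disp")])
Xs <- scale(X)
round(colMeans(Xs), 10)   # means are (numerically) zero
apply(Xs, 2, sd)          # standard deviations are one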
One important advantage of ridge regression is that it still performs well, compared to the ordinary least squares method, in situations where the number of predictors (\(p\)) is larger than the number of observations (\(n\)).
One disadvantage of ridge regression is that it will include all the predictors in the final model, unlike the stepwise regression methods, which generally select models that involve a reduced set of variables.
Ridge regression shrinks the coefficients towards zero, but it does not set any of them exactly to zero. The lasso regression (introduced in the next section) is an alternative that overcomes this drawback.
Required R packages for implementing the regularization methods are:
tidyverse for easy data manipulation and visualization
glmnet for computing penalized regression
caret for easy machine learning workflow
The key function to fit a penalized generalized linear model is glmnet(), which is used as follows:
\[\text{glmnet(x, y, alpha, lambda)},\] where x is a matrix of predictors, y is the response vector, alpha determines the type of penalty (alpha = 0 for ridge, alpha = 1 for lasso, and values in between for elastic net), and lambda is the tuning parameter controlling the amount of shrinkage.
In penalized regression, you need to specify a constant lambda to adjust the amount of coefficient shrinkage. The best lambda for your data can be defined as the lambda that minimizes the cross-validation prediction error. This can be determined automatically using the function cv.glmnet().
A note: when making predictions, glmnet uses the argument "newx" instead of "newdata" in the predict() function!
A detailed introduction to the glmnet() function is https://glmnet.stanford.edu/articles/glmnet.html.
We use the Boston data to demonstrate the procedure.
We first partition the data:
# Load packages
library(tidyverse)
library(caret)
library(glmnet)
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set.
# Write my own function to partition original dataset into training and validation sets
partition = function(Data, prop.train = 0.8){
n = nrow(Data)
shuffled.idx = sample(1:n)
train.idx = shuffled.idx[1:round(n*prop.train)] # use the requested training proportion, not a hard-coded 0.8
train.data = Data[train.idx, ]
validation.data = Data[-train.idx, ]
return(list(train.data = train.data, validation.data= validation.data))
}
partitioned.data = partition(Boston, 0.8)
train.data <- partitioned.data$train.data
validation.data <- partitioned.data$validation.data
# Textbook partitions data by the function createDataPartition() from package "caret"
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
validation.data <- Boston[-training.samples, ]
We then train an elastic net to predict the outcome variable "medv":
# Predict medv using all other variables
x <- train.data %>% dplyr::select(-medv) %>% as.matrix() # predictors
y <- train.data$medv # response
# alpha = 0.5 for Elastic Net (mix of Lasso and Ridge)
fit <- glmnet(x, y, alpha = 0.5)
plot(fit, xvar = "lambda", label = TRUE, main = "Elastic Net Coefficients vs Lambda")
How do we interpret the plot? At a value on the x-axis, draw a vertical line; it crosses one curve for each coefficient. The y-coordinates of the crossing points are the values of the coefficients at that value of \(\lambda\). The graph shows how the coefficients shrink as \(\lambda\) grows and gives you an idea of how to choose \(\lambda\).
Train a Ridge Regression Model:
# Create a MATRIX containing only predictors and converting any categorical variable to dummy(0/1) variables
x = model.matrix(medv~., train.data)[, -1] # Remove the first column which is the intercept term
# Create the outcome/response variable
y = train.data$medv
# Apply cross-validation to get the best lambda value that minimizes the cross-validation error.
cv <- cv.glmnet(x, y, alpha = 0) # alpha = 0 indicates ridge regression
names(cv) # Check what is in the output
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "call" "name" "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
# Display the best lambda value that minimizes the cross-validation error.
cv$lambda.min
## [1] 0.6940806
# Apply the glmnet() function from the package "glmnet" to fit penalized generalized linear regression model with the best lambda, based on training data
model = glmnet(x, y, alpha = 0, lambda = cv$lambda.min) # alpha = 0 indicates ridge regression
# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 25.686526454
## crim -0.105736648
## zn 0.034824414
## indus -0.050041284
## chas 2.787672194
## nox -11.666759585
## rm 4.263273841
## age -0.007122215
## dis -1.179585998
## rad 0.146192860
## tax -0.005327888
## ptratio -0.844027494
## black 0.010114394
## lstat -0.423980274
# Plot the model
plot(model, label=TRUE)
Make a Prediction:
# Prepare data for predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(newx = x.new) %>% as.vector()
# Model performance metrics (note: RMSE here is computed as the square root of the SUM of
# squared errors; the conventional root mean squared error would use sqrt(mean(...)))
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 48.33444 0.6622679
Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called \(L_1\)-norm, which is the sum of the absolute coefficients.
In the case of lasso regression, the penalty has the effect of forcing some of the coefficient estimates (those with a minor contribution to the model) to be exactly zero. This means that the lasso can also be seen as an alternative to the subset selection methods for performing variable selection and reducing the complexity of the model.
As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical.
A helpful video: https://www.youtube.com/watch?v=ldOqDuXKfL4
One obvious advantage of lasso regression over ridge regression, is that it produces simpler and more interpretable models that incorporate only a reduced set of the predictors. However, neither ridge regression nor the lasso will universally dominate the other.
Generally, lasso might perform better in a situation where some of the predictors have large coefficients, and the remaining predictors have very small coefficients.
Ridge regression will perform better when the outcome is a function of many predictors, all with coefficients of roughly equal size.
To fit a penalized regression using the lasso approach, the only difference from the R code used for ridge regression is that you specify the argument alpha = 1 instead of alpha = 0 (which gives ridge regression), as sketched below.
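A sketch, reusing the x and y objects created above for ridge regression:
# Cross-validate to choose lambda for the lasso, then refit with the chosen value
cv_lasso    <- cv.glmnet(x, y, alpha = 1)
lasso_model <- glmnet(x, y, alpha = 1, lambda = cv_lasso$lambda.min)
coef(lasso_model)   # unlike ridge, some coefficients may be exactly zero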
Elastic Net produces a regression model that is penalized with both the \(L_1\)-norm and the \(L_2\)-norm. The consequence is that it effectively shrinks coefficients (as in ridge regression) and sets some coefficients to zero (as in the lasso).
The elastic net regression can be easily computed using the caret workflow, which invokes the glmnet package.
We use caret to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values and selects the best combination, resulting in a final model that is an elastic net model.
Here, we’ll test the combination of 10 different values for alpha and lambda. This is specified using the option tuneLength.
The best alpha and lambda values are those values that minimize the cross-validation error.
# Build the model using the training set
set.seed(123)
model <- train(
medv ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Best tuning parameter
model$bestTune
# Display the coefficients of the final model. You need
# to specify the best lambda
coef(model$finalModel, model$bestTune$lambda)
# Make predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(x.new)
# Model performance metrics
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
We can easily compute and compare ridge, lasso and elastic net regression using the caret workflow.
The package “caret” will automatically choose the best tuning parameter values, compute the final model and evaluate the model performance using cross-validation techniques.
# Setup a grid range of lambda values:
lambda <- 10^seq(-3, 3, length = 100)
# 0. Build the model based on lm()
set.seed(123)
lm <- train(
medv ~.,
data = train.data,
method = "lm"
)
# Model coefficients
lm$finalModel
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) crim zn indus chas nox
## 33.338585 -0.130009 0.046867 -0.003491 2.526631 -17.299498
## rm age dis rad tax ptratio
## 4.146894 -0.005098 -1.558500 0.282858 -0.010738 -0.941883
## black lstat
## 0.010599 -0.463753
# Make predictions
predictions <- lm %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 47.57402 0.671413
# 1. Build the model based on ridge regression
set.seed(123)
ridge <- train(
medv ~.,
data = train.data,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s=0.6135907
## (Intercept) 25.628551046
## crim -0.105313423
## zn 0.034806751
## indus -0.049483931
## chas 2.786540055
## nox -11.590545351
## rm 4.263277893
## age -0.007093516
## dis -1.177033155
## rad 0.145549051
## tax -0.005338784
## ptratio -0.843474469
## black 0.010122418
## lstat -0.424451227
# Make predictions
predictions <- ridge %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 48.33378 0.6622897
# 2. Build the model based on lasso regression
set.seed(123)
lasso <- train(
medv ~., data = train.data, method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s=0.01873817
## (Intercept) 32.108736488
## crim -0.122739105
## zn 0.044123498
## indus -0.007120235
## chas 2.521672161
## nox -16.631848798
## rm 4.178175582
## age -0.003925512
## dis -1.491630996
## rad 0.250848972
## tax -0.009378908
## ptratio -0.930277791
## black 0.010398671
## lstat -0.465074312
# Make predictions
predictions <- lasso %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 47.57441 0.671666
# 3. Build the model based on Elastic net:
set.seed(123)
elastic <- train(
medv ~., data = train.data, method = "glmnet"
)
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s=0.1388161
## (Intercept) 30.499783816
## crim -0.118653307
## zn 0.041692473
## indus -0.021497279
## chas 2.598502890
## nox -15.419098985
## rm 4.212263502
## age -0.004969490
## dis -1.427926453
## rad 0.221044387
## tax -0.008064518
## ptratio -0.910400787
## black 0.010364181
## lstat -0.456284394
# Make predictions
predictions <- elastic %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 47.72428 0.6697345
# 4. Comparing performances of the models:
# The performance of the different models - ridge, lasso and elastic net - can be easily compared using caret.
# The best model is defined as the one that minimizes the prediction error.
models <- list(lm = lm, ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary( metric = "RMSE")
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: lm, ridge, lasso, elastic
## Number of resamples: 25
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 3.943210 4.833511 5.071545 5.042111 5.352070 5.585943 0
## ridge 3.922317 4.829434 5.054163 5.043316 5.330106 5.661650 0
## lasso 3.925479 4.831975 5.086822 5.039214 5.364852 5.594020 0
## elastic 3.917906 4.818639 5.089301 5.034257 5.372499 5.588938 0
It can be seen from the table that the elastic net model has the lowest mean RMSE, while the ridge model has the lowest median RMSE; all four models perform very similarly here.
Ordinary multiple regression assumes a continuous response variable with normally distributed errors, a mean dependent on a linear combination of predictors, and constant variance. In R, this is implemented using the lm() function. When the response variable deviates from normality, generalized linear models (GLMs) extend this framework to accommodate various distributions. In R, GLMs are fitted using the glm() function.
Binary logistic regression, a special case of GLMs, models a binary categorical response variable assuming a Bernoulli distribution with a probability parameter \(p\) linked to a linear combination of predictors. In R, logistic regression is implemented as:
glm(y ~ x1 + x2 + x3 + ..., data = myData, family = binomial)
We use the Iris dataset to demonstrate logistic regression, transforming the Species variable into a binary outcome: “setosa” (1) versus “non-setosa” (0).
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
D <- iris
# Replace the Species column by a dummy variable indicating Setosa.
D$Species <- ifelse(D$Species == "setosa", 1, 0)
To understand relationships between predictors and the binary response, we visualize the data using boxplots and compute correlations.
library(ggplot2)
library(gridExtra)
p1 <- ggplot(D, aes(x = factor(Species), y = Sepal.Length)) +
geom_boxplot() +
labs(title = "Boxplot of Sepal Length", y = "Sepal Length (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p2 <- ggplot(D, aes(x = factor(Species), y = Sepal.Width)) +
geom_boxplot() +
labs(title = "Boxplot of Sepal Width", y = "Sepal Width (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p3 <- ggplot(D, aes(x = factor(Species), y = Petal.Length)) +
geom_boxplot() +
labs(title = "Boxplot of Petal Length", y = "Petal Length (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p4 <- ggplot(D, aes(x = factor(Species), y = Petal.Width)) +
geom_boxplot() +
labs(title = "Boxplot of Petal Width", y = "Petal Width (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
grid.arrange(p1, p2, p3, p4, ncol = 2)
# Display a correlation matrix
cor(D[, -5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
The boxplots reveal how predictors distinguish species, while the correlation matrix highlights linear relationships among predictors.
We partition the data into training (60%, an arbitrary but common choice) and validation sets, then train a logistic regression model using Sepal.Length as the predictor.
n <- nrow(D)
set.seed(3.14)
idx <- sample(1:n)
n.rows.training <- n * 0.6
trainingData <- D[idx[1:n.rows.training], ]
validationData <- D[-idx[1:n.rows.training], ]
m <- glm(Species ~ Sepal.Length, data = trainingData, family = binomial)
summary(m)
##
## Call:
## glm(formula = Species ~ Sepal.Length, family = binomial, data = trainingData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.3285 5.1970 4.681 2.85e-06 ***
## Sepal.Length -4.5237 0.9561 -4.731 2.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 114.573 on 89 degrees of freedom
## Residual deviance: 50.449 on 88 degrees of freedom
## AIC: 54.449
##
## Number of Fisher Scoring iterations: 6
We predict probabilities on the validation set and assess performance using a confusion matrix.
# Extract the actual response values from the validation dataset
actual <- validationData$Species
# Predict probabilities for the validation data, excluding the response column, with type="response" for probabilities
pred <- predict(m, newdata = validationData[, -5], type = "response")
# Create a data frame to compare actual and predicted values
data.frame(actual, pred)
## actual pred
## 1 1 7.786169e-01
## 13 1 9.318040e-01
## 14 1 9.924347e-01
## 21 1 4.751484e-01
## 23 1 9.712369e-01
## 24 1 7.786169e-01
## 25 1 9.318040e-01
## 27 1 8.468356e-01
## 28 1 6.910971e-01
## 30 1 9.555154e-01
## 31 1 9.318040e-01
## 32 1 4.751484e-01
## 34 1 3.654331e-01
## 35 1 8.968187e-01
## 39 1 9.881584e-01
## 41 1 8.468356e-01
## 42 1 9.815098e-01
## 44 1 8.468356e-01
## 45 1 7.786169e-01
## 49 1 5.873164e-01
## 51 0 6.503024e-04
## 52 0 9.725824e-03
## 53 0 1.021918e-03
## 55 0 6.208747e-03
## 59 0 3.958440e-03
## 60 0 6.910971e-01
## 63 0 5.658760e-02
## 72 0 3.675319e-02
## 79 0 5.658760e-02
## 81 0 3.654331e-01
## 82 0 3.654331e-01
## 85 0 4.751484e-01
## 88 0 1.520474e-02
## 89 0 2.681105e-01
## 91 0 3.654331e-01
## 92 0 3.675319e-02
## 93 0 1.290964e-01
## 99 0 7.786169e-01
## 100 0 1.889877e-01
## 102 0 1.290964e-01
## 105 0 6.208747e-03
## 106 0 4.311279e-05
## 110 0 2.632444e-04
## 111 0 6.208747e-03
## 116 0 9.725824e-03
## 121 0 1.021918e-03
## 123 0 2.742526e-05
## 124 0 1.520474e-02
## 126 0 2.632444e-04
## 127 0 2.369628e-02
## 128 0 3.675319e-02
## 129 0 9.725824e-03
## 131 0 1.065376e-04
## 134 0 1.520474e-02
## 135 0 3.675319e-02
## 138 0 9.725824e-03
## 141 0 2.521668e-03
## 142 0 1.021918e-03
## 145 0 2.521668e-03
## 147 0 1.520474e-02
# Load the caret package for performance evaluation
library(caret)
# Convert predicted probabilities to binary labels (1 if probability > 0.5, else 0) and convert to a factor
pred.label <- factor((as.numeric(pred) > 0.5) * 1)
# Convert actual values to a factor for compatibility with confusionMatrix
act <- factor(actual)
# Generate a confusion matrix to evaluate model performance, setting positive class as "1"
confusionMatrix(pred.label, act, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 3
## 1 2 17
##
## Accuracy : 0.9167
## 95% CI : (0.8161, 0.9724)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 5.6e-06
##
## Kappa : 0.8101
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8500
## Specificity : 0.9500
## Pos Pred Value : 0.8947
## Neg Pred Value : 0.9268
## Prevalence : 0.3333
## Detection Rate : 0.2833
## Detection Prevalence : 0.3167
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
A lift plot measures how effectively the model identifies positive outcomes compared to an intercept-only logistic model (i.e., a baseline model with no predictors). The dataset is sorted by predicted probabilities in descending order and divided into deciles, where each decile represents 10% of the data. Lift is calculated as the ratio of the model’s success rate (proportion of true positives in a decile) to the baseline success rate (proportion of positives in the entire dataset).
Example: Suppose we have 100 observations, with 20 positive cases (a 20% baseline positive rate). We sort the data by predicted probabilities and divide it into 10 deciles (10 observations each). If the model identifies 8 positive cases in the top decile (given a cutoff that defines a positive), the success rate in this decile is \(8/10 = 80\%\). The lift is \(80\% / 20\% = 4\), meaning the model is 4 times as effective at identifying positives in this decile as the baseline.
library(gains)
# Calculate gains table, dividing data into 10 deciles based on predicted probabilities
gain <- gains(actual, pred, groups = 10)
# Compute lift as the ratio of mean response in each decile to the overall mean response
lift <- gain$mean.resp / mean(actual)
# Create a barplot for the decile-wise lift chart, with decile depths as x-axis labels
midpoints <- barplot(lift, names.arg = gain$depth, xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise Lift Chart", ylim = c(0, 3.5))
# Add text labels above bars to display rounded lift values for clarity
text(midpoints, lift + 0.1, labels = round(lift, 1), cex = 0.8)
Interpretation: The model identifies up to three times more positives in the top deciles compared to baseline, indicating strong performance.
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across thresholds. The Area Under the Curve (AUC) summarizes model performance.
library(pROC)
roc <- roc(actual ~ pred)
sensitivity = roc$sensitivities
specificity = roc$specificities
plot(sensitivity~I(1-specificity), type = "l", col = "blue", main = "ROC Curve", xlab = "1-Specificity")
# Calculate AUC
auc(roc)
## Area under the curve: 0.9819
Interpretation: An AUC > 0.5 indicates better-than-random performance, with values closer to 1 reflecting stronger models.
Trade-off: Improving recall (the proportion of actual positives that are correctly identified) often reduces precision (the proportion of predicted positives that are truly positive), and vice versa. The \(F_1\)-score balances these metrics; it is the harmonic mean of the two:
\[F_1=\frac{2}{\frac{1}{Recall}+\frac{1}{Precision}}\]
Choosing Metrics: For rare-event (imbalanced) data, precision, recall, and the \(F_1\)-score are more useful than accuracy for selecting models.
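As a small worked sketch, the counts from the confusion matrix shown earlier (positive class "1": TP = 17, FP = 2, FN = 3) give:
TP <- 17; FP <- 2; FN <- 3
precision <- TP / (TP + FP)       # matches "Pos Pred Value" above
recall    <- TP / (TP + FN)       # matches "Sensitivity" above
F1 <- 2 / (1 / recall + 1 / precision)
c(precision = precision, recall = recall, F1 = F1)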
Read the paper https://pmc.ncbi.nlm.nih.gov/articles/PMC8902098/ (right click the link and choose “open link in new tab”) to learn how a real paper can be drafted. Mimic the paper to do your project on a logistic regression.
Classification and Regression Trees (CART) are decision tree algorithms used for classification (predicting a category) and regression (predicting a numeric value).
CART works by recursively splitting the data into smaller and smaller subsets, at each step choosing the predictor and split point that best separate the outcome, until a stopping rule is met.
To understand how a classification tree model works, watch videos:
Key advantages of CART: the resulting trees are easy to interpret and visualize, they handle both numeric and categorical predictors, and they require little data preprocessing (no scaling needed).
Key disadvantages of CART: a single tree can easily overfit the training data and can be unstable (small changes in the data may produce a very different tree).
We'll build a classification tree on the iris data and a regression tree on the Boston housing data.
We will also cover pruning, tuning, and variable importance.
Example 1: Classification with the Iris Dataset
We’ll use the iris dataset to build a classification tree to predict flower species.
Step 1: Load Libraries and Split Data
library(rpart)
library(rpart.plot)
library(caret) # for splitting data and assessing model
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[trainIndex, ]
validation_data <- iris[-trainIndex, ]
Step 2: Build the Classification Tree
We’ll create a decision tree to classify Species based on the other features.
# Train the classification tree
tree_model <- rpart(Species ~ ., data = train_data, method = "class")
# Plot the tree
rpart.plot(tree_model, main = "Classification Tree for Iris Dataset")
The resulting plot shows the tree structure, with nodes representing decisions and leaves representing predicted classes.
Step 3: Make Predictions
Predict species for the first few rows of the dataset:
# Predict
predictions <- predict(tree_model, newdata = validation_data, type = "class")
Step 4: Evaluate the Model
Use the caret package to compute accuracy:
library(caret)
confusionMatrix(predictions, validation_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.8173, 0.986)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.8667
## Specificity 1.0000 0.9333 0.9667
## Pos Pred Value 1.0000 0.8750 0.9286
## Neg Pred Value 1.0000 0.9655 0.9355
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.2889
## Detection Prevalence 0.3333 0.3556 0.3111
## Balanced Accuracy 1.0000 0.9333 0.9167
This outputs accuracy and other metrics like sensitivity and specificity.
Example 2: Regression with the Boston Housing Dataset
Now, let’s build a regression tree to predict house prices using the Boston dataset from the MASS package.
Step 1: Load and Split Data
library(MASS)
data(Boston)
library(caret)
library(rpart)
library(rpart.plot)
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(Boston$medv, p = 0.7, list = FALSE)
train_data <- Boston[trainIndex, ]
validation_data <- Boston[-trainIndex, ]
Step 2: Build the Regression Tree
# Fit the regression tree
reg_tree <- rpart(medv ~ ., data = train_data, method = "anova")
# Plot the tree
rpart.plot(reg_tree, main = "Regression Tree for Boston Housing")
Here, each node of the plot shows the predicted median home value (medv) for the observations falling in that node and, with rpart.plot's default settings, the percentage of the training data in the node.
Step 3: Make Predictions and calculate the performance metric “root mean square error”.
# Predict
reg_predictions <- predict(reg_tree, validation_data)
# Calculate RMSE
rmse <- sqrt(mean((reg_predictions - validation_data$medv)^2))
cat("Root Mean Squared Error:", rmse, "\n")
## Root Mean Squared Error: 5.568488
To avoid overfitting, we can prune the tree based on the complexity parameter (cp):
# Print complexity parameter table
printcp(reg_tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] lstat nox ptratio rm
##
## Root node error: 28877/356 = 81.115
##
## n= 356
##
## CP nsplit rel error xerror xstd
## 1 0.462980 0 1.00000 1.00335 0.097781
## 2 0.165305 1 0.53702 0.71078 0.074466
## 3 0.080161 2 0.37171 0.52042 0.061138
## 4 0.034497 3 0.29155 0.38648 0.054263
## 5 0.027794 4 0.25706 0.36181 0.052094
## 6 0.021800 5 0.22926 0.33327 0.048749
## 7 0.011952 6 0.20746 0.31856 0.048561
## 8 0.011358 7 0.19551 0.31018 0.048737
## 9 0.010000 8 0.18415 0.31168 0.049371
# Calculate the optimal cp value
optimal.cp = reg_tree$cptable[which.min(reg_tree$cptable[,"xerror"]),"CP"]
cat("The optimal complexity parameter is ", optimal.cp, ", based on the cross validation error.", sep = "")
## The optimal complexity parameter is 0.01135807, based on the cross validation error.
# Prune the tree
pruned_tree <- prune(reg_tree, cp = optimal.cp)
# Plot pruned tree
rpart.plot(pruned_tree, main = "Pruned Regression Tree")
The cp table shows the cross-validation error for different tree sizes. We select the cp value that minimizes the error (xerror).
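A slightly more conservative alternative often used in practice is the one-standard-error (1-SE) rule: choose the simplest tree whose cross-validated error is within one standard error of the minimum. A minimal sketch using the cptable of reg_tree:
# One-standard-error rule for choosing cp
cp_tab   <- reg_tree$cptable
best_row <- which.min(cp_tab[, "xerror"])
thresh   <- cp_tab[best_row, "xerror"] + cp_tab[best_row, "xstd"]
# First (i.e., simplest) tree whose xerror is within one xstd of the minimum
cp_1se   <- cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]
pruned_1se <- prune(reg_tree, cp = cp_1se)
rpart.plot(pruned_1se, main = "Regression Tree Pruned with the 1-SE Rule")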
You can tune a tree by adjusting parameters in rpart.control:
# Example with custom control parameters
control <- rpart.control(minsplit = 20, maxdepth = 10, cp = 0.01)
tuned_tree <- rpart(medv ~ ., data = Boston, method = "anova", control = control)
Here, minsplit = 20 requires at least 20 observations in a node before a split is attempted, maxdepth = 10 limits the depth of the tree, and cp = 0.01 requires each split to improve the overall fit by at least 1% of the root-node error.
Pruning is the process of reducing the size of a decision tree by removing branches (nodes and leaves) that contribute little to predictive accuracy or are likely to cause overfitting. Overfitting occurs when a tree is too complex and captures noise in the training data, leading to poor performance on unseen data (e.g., validation or test sets).
Purpose
Pruning reduces overfitting: a smaller tree generalizes better to new data, is easier to interpret, and is less sensitive to noise in the training set, usually at the cost of a small increase in training error.
How It Works
Pruning in CART typically involves cost-complexity pruning, which balances the tree’s complexity (number of nodes) against its predictive error.
The best tree size minimizes the cost-complexity measure, defined as:
\[R_\alpha(T) = R(T) + \alpha \cdot |T|\]
Where:
\(R(T)\) is the tree's error on the training data (e.g., misclassification rate for classification or sum of squared errors for regression),
\(|T|\) is the number of terminal nodes (leaves), and
\(\alpha \ge 0\) is the complexity parameter that penalizes tree size.
The complexity parameter \(\alpha\) (referred to as \(c_p\) in rpart) determines how much penalty is applied for each additional terminal node. A higher \(\alpha\) favors simpler trees by pruning branches that don’t sufficiently reduce \(R(T)\).
The rpart package uses the complexity parameter (\(c_p\)) to control pruning:
A higher \(c_p\) value leads to more aggressive pruning (simpler tree). A lower \(c_p\) value retains more branches (more complex tree).
The pruning process involves:
Growing a full tree (with a low \(c_p\), allowing many splits).
Evaluating the tree's performance using cross-validation or a validation set.
Selecting the \(c_p\) value that minimizes the cross-validated error (xerror in rpart).
Removing branches that don't justify their complexity at the chosen \(c_p\).
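The same workflow applies to classification trees. As a minimal sketch, pruning the tree_model trained on the iris data earlier would look like this (for such a small tree, pruning may leave it unchanged):
# Inspect the cp table for the classification tree
printcp(tree_model)
# Pick the cp with the smallest cross-validated error (xerror)
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
# Prune and re-plot
pruned_class_tree <- prune(tree_model, cp = best_cp)
rpart.plot(pruned_class_tree, main = "Pruned Classification Tree for Iris")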
Tuning is the process of adjusting hyperparameters of the decision tree algorithm before or during model training to optimize its performance. In the context of rpart, tuning involves setting parameters in the rpart.control function to control how the tree is grown.
Purpose
Tuning controls how the tree is grown in the first place, trading off model complexity against the risk of under- or overfitting, and lets you adapt the algorithm to the size and noise level of your data.
How It Works
Tuning involves specifying parameters in rpart.control to influence how the tree is constructed. Common parameters include:
minsplit: the minimum number of observations a node must contain before a split is attempted.
minbucket: the minimum number of observations allowed in any terminal node (leaf).
maxdepth: the maximum depth of the tree.
cp: the complexity parameter; a split must improve the fit by at least this fraction of the root-node error to be kept.
Tuning typically involves:
Choosing a set of candidate values for one or more of these parameters.
Fitting a tree for each candidate setting.
Comparing the settings with cross-validation or a validation set and keeping the one with the best performance, as in the caret sketch below.
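One convenient way to do this cross-validated comparison is caret's train() with method = "rpart", which tunes cp automatically. A minimal sketch, reusing the Boston train_data from above:
library(caret)
set.seed(123)
cv_tree <- train(
  medv ~ ., data = train_data,
  method = "rpart",                                   # tunes the complexity parameter cp
  tuneLength = 10,                                    # try 10 candidate cp values
  trControl = trainControl(method = "cv", number = 5) # 5-fold cross-validation
)
cv_tree$bestTune  # cp value with the lowest cross-validated RMSE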
We can also check which variables are most important. We continue with the regression tree reg_tree trained above.
# Variable importance for regression tree
var_imp <- reg_tree$variable.importance
barplot(var_imp, main = "Variable Importance", las = 2)
This plots the relative importance of each feature in the tree.
For a regression tree, the importance of a feature (i.e. predictor) sums the reductions in the sum of squared errors (sse) across splits.
For a classification tree, the importance of a feature (i.e. predictor) sums the reductions in impurity (e.g., Gini index) across splits.
For hand calculation, refer to examples in https://scientistcafe.com/ids/splitting-criteria.
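The same attribute is available for the classification tree trained earlier; for example:
# Variable importance for the iris classification tree
tree_model$variable.importance
barplot(tree_model$variable.importance,
        main = "Variable Importance (Classification Tree)", las = 2)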
k-Nearest Neighbors (k-NN) is a simple, intuitive supervised machine learning algorithm used for classification and regression.
How It Works
To choose \(k\), for each candidate value of \(k\):
For each point in the training data,
Compute the distance (usually Euclidean) from this point to all other points in the training data.
Select the \(k\) closest points (neighbors).
Use the majority class among those \(k\) neighbors as the prediction for this point.
Compute the overall performance for this \(k\), then repeat the procedure for the other candidate \(k\)'s.
The \(k\) that achieves the highest performance (e.g., cross-validated accuracy) becomes the optimal \(k\).
How do we test the predictive ability of the learner (the k-NN model with the optimal \(k\))?
For each point in the test data,
Compute the distance (usually Euclidean) from this point to all points in the training data.
Select the \(k\) closest training points (neighbors).
Use the majority class among those \(k\) neighbors as the prediction.
Report the overall performance metric on the test set (accuracy, sensitivity, specificity, \(F_1\), ...), as illustrated in the from-scratch sketch below and in the caret example that follows.
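A minimal from-scratch sketch of the prediction step (purely illustrative; the helper knn_predict_one is hypothetical and not part of any package — in practice we let caret handle this, as in the example below):
# Predict the class of one new point from its k nearest training neighbors
knn_predict_one <- function(x_new, X_train, y_train, k = 5) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  # Majority class among the k closest training points
  nn <- order(dists)[1:k]
  names(which.max(table(y_train[nn])))
}
# Tiny usage example: predict the first iris flower from the other 149
X <- scale(iris[, 1:4])  # k-NN is distance-based, so standardize the features
knn_predict_one(X[1, ], X[-1, ], iris$Species[-1], k = 5)  # should return "setosa"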
An Example:
library(caret)
library(MLmetrics)
# Prepare a binary classification dataset, with setosa as the positive class (first factor level)
binary_iris <- iris
binary_iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "other"),
levels = c("setosa", "other"))
# Train/test split
set.seed(123)
train_index <- createDataPartition(binary_iris$Species, p = 0.7, list = FALSE)
train_data <- binary_iris[train_index, ]
test_data <- binary_iris[-train_index, ]
# Simple train control with cross-validation
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = FALSE,
savePredictions = TRUE
)
# Train k-NN model
set.seed(123)
knn_model <- train(
Species ~ .,
data = train_data,
method = "knn",
trControl = ctrl,
tuneLength = 10,
metric = "Accuracy",
preProcess = c("center", "scale")
)
# Print training results
print(knn_model$results)
## k Accuracy Kappa AccuracySD KappaSD
## 1 5 1 1 0 0
## 2 7 1 1 0 0
## 3 9 1 1 0 0
## 4 11 1 1 0 0
## 5 13 1 1 0 0
## 6 15 1 1 0 0
## 7 17 1 1 0 0
## 8 19 1 1 0 0
## 9 21 1 1 0 0
## 10 23 1 1 0 0
plot(knn_model)
# Predict on test set
test_pred <- predict(knn_model, newdata = test_data)
# Evaluate on test set
conf_mat <- confusionMatrix(test_pred, test_data$Species, positive = "setosa")
f1_test <- F1_Score(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
sens_test <- Sensitivity(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
# Show test results
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa other
## setosa 15 0
## other 0 30
##
## Accuracy : 1
## 95% CI : (0.9213, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 1.191e-08
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : setosa
##
cat("Test F1 Score:", round(f1_test, 3), "\n")
## Test F1 Score: 1
cat("Test Sensitivity:", round(sens_test, 3), "\n")
## Test Sensitivity: 1
Explanation of the code:
trainControl() defines how the training should be controlled.
method = “cv”: Use cross-validation.
number = 5: Use 5-fold cross-validation (the data is split into 5 parts; the model is trained on 4 parts and tested on the remaining part, repeated 5 times).
train() is a function from the caret package that:
Trains the model
Tunes hyperparameters (in this case, the number of neighbors \(k\))
Evaluates performance using the strategy defined in trainControl
Arguments:
Species ~ .: This is a formula. It means to predict Species using all other columns (the . means “all other variables”).
data = train_data: Use the training portion of the binary iris data.
method = “knn”: Use the k-NN algorithm.
trControl = ctrl: Apply the 5-fold cross-validation we set up.
tuneLength = 10: Try 10 candidate values of \(k\) (here k = 5, 7, 9, ..., 23, as shown in the results above).
By default, caret uses accuracy (with kappa also reported) as the performance metric when determining \(k\). Fortunately, the caret package allows users to define their own performance metrics through the summaryFunction argument of trainControl().
The following example shows how a summary function is created and used for the choice of performance metrics.
library(caret)
library(MLmetrics)
# Prepare a binary classification dataset, with setosa as the positive class (first factor level)
binary_iris <- iris
binary_iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "other"),
levels = c("setosa", "other"))
# Train/test split
set.seed(123)
train_index <- createDataPartition(binary_iris$Species, p = 0.8, list = FALSE)
train_data <- binary_iris[train_index, ]
test_data <- binary_iris[-train_index, ]
# Custom summary function for caret: computes Accuracy, Kappa, F1, and Sensitivity
custom_summary <- function(data, lev = NULL, model = NULL) {
# Ensure predictions and observations have the same levels
pred <- factor(data$pred, levels = lev)
obs <- factor(data$obs, levels = lev)
# Calculate metrics
acc <- mean(pred == obs)
kappa <- postResample(pred, obs)  # returns both Accuracy and Kappa, hence the Kappa.Accuracy / Kappa.Kappa columns in the results below
f1 <- F1_Score(y_pred = pred, y_true = obs, positive = lev[1])
sens <- Sensitivity(y_pred = pred, y_true = obs, positive = lev[1])
c(Accuracy = acc, Kappa = kappa, F1 = f1, Sensitivity = sens)
}
# Train control with cross-validation and custom summary function
ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = custom_summary,
classProbs = FALSE,
savePredictions = TRUE
)
# Train k-NN model
set.seed(123)
knn_model <- train(
Species ~ .,
data = train_data,
method = "knn",
trControl = ctrl,
tuneLength = 10,
metric = "F1", # Use F1 to select best model
preProcess = c("center", "scale")
)
# Print training results
print(knn_model)
## k-Nearest Neighbors
##
## 120 samples
## 4 predictor
## 2 classes: 'setosa', 'other'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 96, 96, 96, 96, 96
## Resampling results across tuning parameters:
##
## k Accuracy Kappa.Accuracy Kappa.Kappa F1 Sensitivity
## 5 1 1 1 1 1
## 7 1 1 1 1 1
## 9 1 1 1 1 1
## 11 1 1 1 1 1
## 13 1 1 1 1 1
## 15 1 1 1 1 1
## 17 1 1 1 1 1
## 19 1 1 1 1 1
## 21 1 1 1 1 1
## 23 1 1 1 1 1
##
## F1 was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
plot(knn_model)
# Predict on test set
test_pred <- predict(knn_model, newdata = test_data)
# Ensure predictions and reference have the same factor levels
test_pred <- factor(test_pred, levels = levels(test_data$Species))
# Evaluate on test set
conf_mat <- confusionMatrix(test_pred, test_data$Species, positive = "setosa")
f1_test <- F1_Score(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
sens_test <- Sensitivity(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
# Show test results
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa other
## setosa 10 0
## other 0 20
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 5.215e-06
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : setosa
##
cat("Test F1 Score:", round(f1_test, 3), "\n")
## Test F1 Score: 1
cat("Test Sensitivity:", round(sens_test, 3), "\n")
## Test Sensitivity: 1
k-NN can also be used for regression. The idea is similar to classification, except that the prediction for a point is the average of the target values of its \(k\) nearest neighbors, and performance is judged with regression metrics such as RMSE, as sketched below.
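A minimal sketch of that averaging step (the helper knn_reg_one is hypothetical, for illustration only):
# k-NN regression: predict the average response of the k nearest training points
knn_reg_one <- function(x_new, X_train, y_train, k = 5) {
  dists <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  mean(y_train[order(dists)[1:k]])
}
# Usage example: predict mpg for the first car in mtcars from the remaining cars
X <- scale(mtcars[, -1])  # all predictors, standardized
knn_reg_one(X[1, ], X[-1, ], mtcars$mpg[-1], k = 5)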
An example:
library(caret)
library(MLmetrics)
# Use mtcars dataset for regression example
data(mtcars)
# We'll predict mpg (miles per gallon) based on other features
# Train/test split
set.seed(123)
train_index <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train_data <- mtcars[train_index, ]
test_data <- mtcars[-train_index, ]
# Train control with cross-validation for regression
ctrl <- trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
# Train k-NN regression model
set.seed(123)
knn_regression <- train(
mpg ~ .,
data = train_data,
method = "knn",
trControl = ctrl,
tuneLength = 10,
metric = "RMSE", # Use RMSE to select k
preProcess = c("center", "scale") # Normalize data
)
## + Fold1: k= 5
## - Fold1: k= 5
## + Fold1: k= 7
## - Fold1: k= 7
## + Fold1: k= 9
## - Fold1: k= 9
## + Fold1: k=11
## - Fold1: k=11
## + Fold1: k=13
## - Fold1: k=13
## + Fold1: k=15
## - Fold1: k=15
## + Fold1: k=17
## - Fold1: k=17
## + Fold1: k=19
## - Fold1: k=19
## + Fold1: k=21
## - Fold1: k=21
## + Fold1: k=23
## - Fold1: k=23
## + Fold2: k= 5
## - Fold2: k= 5
## + Fold2: k= 7
## - Fold2: k= 7
## + Fold2: k= 9
## - Fold2: k= 9
## + Fold2: k=11
## - Fold2: k=11
## + Fold2: k=13
## - Fold2: k=13
## + Fold2: k=15
## - Fold2: k=15
## + Fold2: k=17
## - Fold2: k=17
## + Fold2: k=19
## - Fold2: k=19
## + Fold2: k=21
## - Fold2: k=21
## + Fold2: k=23
## - Fold2: k=23
## + Fold3: k= 5
## - Fold3: k= 5
## + Fold3: k= 7
## - Fold3: k= 7
## + Fold3: k= 9
## - Fold3: k= 9
## + Fold3: k=11
## - Fold3: k=11
## + Fold3: k=13
## - Fold3: k=13
## + Fold3: k=15
## - Fold3: k=15
## + Fold3: k=17
## - Fold3: k=17
## + Fold3: k=19
## - Fold3: k=19
## + Fold3: k=21
## - Fold3: k=21
## + Fold3: k=23
## - Fold3: k=23
## + Fold4: k= 5
## - Fold4: k= 5
## + Fold4: k= 7
## - Fold4: k= 7
## + Fold4: k= 9
## - Fold4: k= 9
## + Fold4: k=11
## - Fold4: k=11
## + Fold4: k=13
## - Fold4: k=13
## + Fold4: k=15
## - Fold4: k=15
## + Fold4: k=17
## - Fold4: k=17
## + Fold4: k=19
## - Fold4: k=19
## + Fold4: k=21
## - Fold4: k=21
## + Fold4: k=23
## - Fold4: k=23
## + Fold5: k= 5
## - Fold5: k= 5
## + Fold5: k= 7
## - Fold5: k= 7
## + Fold5: k= 9
## - Fold5: k= 9
## + Fold5: k=11
## - Fold5: k=11
## + Fold5: k=13
## - Fold5: k=13
## + Fold5: k=15
## - Fold5: k=15
## + Fold5: k=17
## - Fold5: k=17
## + Fold5: k=19
## - Fold5: k=19
## + Fold5: k=21
## - Fold5: k=21
## + Fold5: k=23
## - Fold5: k=23
## Aggregating results
## Selecting tuning parameters
## Fitting k = 7 on full training set
# Print the model and k-value results
print(knn_regression)
## k-Nearest Neighbors
##
## 24 samples
## 10 predictors
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 20, 19, 19, 19, 19
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.883708 0.7602765 3.290200
## 7 3.686561 0.7520515 3.116714
## 9 3.976232 0.7694084 3.191778
## 11 3.964159 0.7890296 3.165727
## 13 4.190642 0.8206310 3.268121
## 15 4.609692 0.8076585 3.534733
## 17 5.211151 0.7847071 4.046588
## 19 5.914555 0.4368621 4.676421
## 21 5.919207 0.5900987 4.715763
## 23 5.919207 0.5900987 4.715763
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
# Plot the tuning results
plot(knn_regression)
# Print table of k values and their performance
print(knn_regression$results)
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 3.883708 0.7602765 3.290200 0.9946827 0.2103032 0.9945161
## 2 7 3.686561 0.7520515 3.116714 1.0175383 0.1715952 0.8600984
## 3 9 3.976232 0.7694084 3.191778 1.2834203 0.1382612 1.1485793
## 4 11 3.964159 0.7890296 3.165727 1.6363006 0.1198575 1.3007283
## 5 13 4.190642 0.8206310 3.268121 1.9679972 0.1469771 1.4052232
## 6 15 4.609692 0.8076585 3.534733 2.0436493 0.1809126 1.5147916
## 7 17 5.211151 0.7847071 4.046588 1.9356391 0.1559647 1.5617107
## 8 19 5.914555 0.4368621 4.676421 1.8620472 0.4872509 1.5349162
## 9 21 5.919207 0.5900987 4.715763 1.8539655 0.3948347 1.4700451
## 10 23 5.919207 0.5900987 4.715763 1.8539655 0.3948347 1.4700451
# Predict on test set
test_predictions <- predict(knn_regression, newdata = test_data)
# Evaluate on test set
test_actual <- test_data$mpg
# Calculate regression metrics (RMSE, MAE, MAPE from MLmetrics; R2 from caret)
rmse_test <- RMSE(test_predictions, test_actual)
mae_test <- MAE(test_predictions, test_actual)
r2_test <- R2(test_predictions, test_actual)
mape_test <- MAPE(test_predictions, test_actual)
# Show test results
cat("\n=== Test Set Performance ===\n")
##
## === Test Set Performance ===
cat("RMSE:", round(rmse_test, 3), "\n")
## RMSE: 1.799
cat("MAE:", round(mae_test, 3), "\n")
## MAE: 1.511
cat("R-squared:", round(r2_test, 3), "\n")
## R-squared: 0.91
cat("MAPE:", round(mape_test * 100, 2), "%\n")
## MAPE: 7.57 %
# Plot actual vs predicted
plot(test_actual, test_predictions,
main = "Actual vs Predicted MPG",
xlab = "Actual MPG",
ylab = "Predicted MPG",
pch = 19, col = "blue")
abline(0, 1, col = "red", lwd = 2) # Perfect prediction line
# Add some statistics to the plot
legend("topleft",
legend = c(paste("R² =", round(r2_test, 3)),
paste("RMSE =", round(rmse_test, 3))),
bty = "n")
Deep learning is a subset of machine learning that uses neural networks with multiple layers to model complex patterns in data. It excels in tasks like image classification, natural language processing, and regression. A neural network consists of:
An input layer with one neuron per input feature.
One or more hidden layers, each computing weighted sums of its inputs followed by a nonlinear activation function (e.g., logistic or tanh).
An output layer that produces the prediction (class scores for classification, a numeric value for regression).
Training involves:
A forward pass that computes predictions from the current weights.
An error (loss) function that measures how far the predictions are from the true targets.
Backpropagation with gradient-based weight updates to reduce the error, repeated until the error converges or a maximum number of steps is reached.
We can use the neuralnet() function from the R package neuralnet to train such models.
For classification, the output layer typically has more than one neuron (one per category of the target variable) with a sigmoid activation function. The error (cost or loss) function is either the sum of squared errors (sse), \(err = \frac{1}{2n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2\), or the cross-entropy (ce), which measures the difference between the predicted and true probability distributions.
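A toy illustration of the two error functions for a single observation (the numbers are made up; this only shows how the formulas are evaluated):
# One-hot encoded true label vs. predicted class probabilities
y_true <- c(1, 0, 0)        # the observation belongs to the first class
y_hat  <- c(0.7, 0.2, 0.1)  # predicted probabilities from the network
sse <- 0.5 * sum((y_true - y_hat)^2)  # sse for this single observation (n = 1 above)
ce  <- -sum(y_true * log(y_hat))      # cross-entropy
c(SSE = sse, CrossEntropy = ce)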
Example: We show how to classify Species using iris data.
# Install and load required packages
if (!require(neuralnet)) install.packages("neuralnet")
if (!require(neuralnet)) install.packages("NeuralNetTools")
library(neuralnet)
library(NeuralNetTools) # To produce variable importance
# Set random seed for reproducibility
set.seed(123)
# Standardize features (mean = 0, sd = 1)
features <- iris[, 1:4]
features_scaled <- scale(features)
iris_scaled <- data.frame(features_scaled, Species = iris$Species)
# Split into training and test sets (80% train, 20% test)
n_samples <- nrow(iris_scaled)
train_idx <- sample(1:n_samples, 0.8 * n_samples)
train_data <- iris_scaled[train_idx, ]
test_data <- iris_scaled[-train_idx, ]
# Train neural network with neuralnet
m <- neuralnet(
Species ~ Sepal.Length + Sepal.Width, # for illustration, only two of the four features are used
data = train_data,
hidden = c(2, 3), # Two hidden layers: 2 and 3 nodes
rep = 5, # Train the network 5 times from different random starting weights
act.fct = "logistic", # Logistic activation for hidden layers
linear.output = FALSE, # Apply the activation to the output layer (classification)
stepmax = 20000 # Maximum training steps
)
# Plot the best neural network
plot(m, rep = "best")
# Predict on test set
predictions <- predict(m, test_data[, 1:4])
# Convert predictions to class labels
predicted_labels <- apply(predictions, 1, which.max)
predicted_species <- factor(levels(iris$Species)[predicted_labels], levels = levels(iris$Species))
# Create confusion matrix using caret
caret::confusionMatrix(predicted_species, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 7 0
## virginica 0 8 5
##
## Overall Statistics
##
## Accuracy : 0.7333
## 95% CI : (0.5411, 0.8772)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.008062
##
## Kappa : 0.619
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.4667 1.0000
## Specificity 1.0000 1.0000 0.6800
## Pos Pred Value 1.0000 1.0000 0.3846
## Neg Pred Value 1.0000 0.6522 1.0000
## Prevalence 0.3333 0.5000 0.1667
## Detection Rate 0.3333 0.2333 0.1667
## Detection Prevalence 0.3333 0.2333 0.4333
## Balanced Accuracy 1.0000 0.7333 0.8400
For regression, the output layer typically has one neuron (for a single continuous target) with a linear activation function. The error function is the sum of squared errors (sse), \(err = \frac{1}{2n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2\).
We demonstrate the use of neural network for regression with synthetic data and with the Boston housing data from the mlbench package.
Example 1. We generate a synthetic dataset where \(y = x^2 + \sin(3x) + \epsilon\), with \(\epsilon\) as Gaussian (normal) noise.
Generate data:
# Generate synthetic data
set.seed(123)
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + sin(3 * x) + rnorm(n, 0, 0.2) # Nonlinear relation with noise
data <- data.frame(x = x, y = y)
# Split into training and test sets
train_idx <- sample(1:n, 0.8 * n)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Plot the data (requires ggplot2)
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "#1f77b4", alpha = 0.5) +
labs(title = "Synthetic Regression Dataset") +
theme_minimal()
We train a neural network with one hidden layer (10 nodes) using the neuralnet package. Setting linear.output = TRUE ensures that no activation function is applied to the output neuron, which is appropriate for regression.
# Train neural network
nn_model <- neuralnet(y ~ x,
data = train_data,
hidden = c(10), # One hidden layer with 10 nodes
err.fct = "sse", # Sum of Squared Errors
linear.output = TRUE, # Linear output for regression
act.fct = "tanh", # Tanh activation for hidden layer
stepmax = 1e6) # Max iterations
We evaluate the model on the test set by calculating the SSE and Mean Squared Error (MSE).
# Predict on test set
pred_test <- predict(nn_model, test_data)
# Calculate SSE and MSE
sse_test <- sum((test_data$y - pred_test)^2)
mse_test <- sse_test / nrow(test_data)
cat("Test SSE:", sse_test, "\n")
## Test SSE: 0.9893282
cat("Test MSE:", mse_test, "\n")
## Test MSE: 0.0247332
# Plot predictions vs. actual
test_results <- data.frame(x = test_data$x, y_actual = test_data$y, y_pred = pred_test)
ggplot(test_results, aes(x = x)) +
geom_point(aes(y = y_actual), color = "#1f77b4", alpha = 0.5, shape = 16) +
geom_point(aes(y = y_pred), color = "#ff7f0e", alpha = 0.5, shape = 17) +
labs(title = "Test Set: Actual vs. Predicted") +
theme_minimal() +
scale_color_manual(values = c("#1f77b4", "#ff7f0e"))
We visualize the model’s predictions across the input range to assess how well it captures the nonlinear relationship.
# Generate predictions for a smooth curve
x_grid <- data.frame(x = seq(-2, 2, length.out = 100))
pred_grid <- predict(nn_model, x_grid)
# Plot
ggplot() +
geom_point(data = train_data, aes(x = x, y = y), color = "#1f77b4", alpha = 0.5) +
geom_line(data = data.frame(x = x_grid$x, y = pred_grid), aes(x = x, y = y),
color = "#ff7f0e", size = 1) +
labs(title = "Neural Network Fit (Training Data)", x = "x", y = "y") +
theme_minimal()
Now, we use the Boston housing data to train a neural net for predicting median house price.
Data partition and normalization.
library(neuralnet)
library(mlbench)
library(caret)
data(BostonHousing)
head(BostonHousing)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(BostonHousing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
BostonHousing$chas = as.numeric(BostonHousing$chas) - 1 # Convert factor to numeric
# Split data
set.seed(123)
trainIndex <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
train_data <- BostonHousing[trainIndex, ]
test_data <- BostonHousing[-trainIndex, ]
# Neural networks perform better with scaled data (normalized to [0,1])
# Define preprocessing (normalize to [0,1])
pre_proc <- preProcess(train_data, method = "range")
# Apply to training and test data
train_scaled <- predict(pre_proc, train_data)
test_scaled <- predict(pre_proc, test_data)
Train the model:
# Train neural network
set.seed(123) # For reproducibility
nn <- neuralnet(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + b + lstat,
data = train_scaled,
hidden = c(4, 2), # Two hidden layers: 4 and 2 neurons
act.fct = "tanh", # Tanh activation for hidden layers
linear.output = TRUE, # Linear output for regression
err.fct = "sse", # Sum of squared errors
threshold = 0.01, # Convergence threshold
stepmax = 1e6) # Max iterations
Visualize the network:
plot(nn)
Make Predictions: Predict medv for the test set and rescale the predictions back to the original units ($1000s). Because preProcess(method = "range") mapped each variable to (value - min)/(max - min) using the training-set minimum and maximum, we invert that transformation for medv below.
predictions <- predict(nn, test_scaled[, -which(names(test_scaled) == "medv")])
# Rescale predictions and actual values to original scale
predictions_rescaled <- predictions * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
actual_rescaled <- test_scaled$medv * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
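As a quick sanity check on the rescaling: the back-transformed actual values should match the original test-set medv, since they were scaled with the training-set min and max and the inversion recovers them up to floating-point error.
all.equal(as.numeric(actual_rescaled), test_data$medv)  # should be TRUE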
Evaluate Performance: Calculate Mean Squared Error (MSE) and R-squared.
mse <- mean((actual_rescaled - predictions_rescaled)^2)
cat("Mean Squared Error:", mse, "\n")
## Mean Squared Error: 14.44873
# R-squared
r_squared <- cor(actual_rescaled, predictions_rescaled)^2
cat("R-squared:", r_squared, "\n")
## R-squared: 0.8397848
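To compare with the RMSE of about 5.57 reported for the regression tree earlier, we can convert the MSE to an RMSE:
cat("Root Mean Squared Error:", sqrt(mse), "\n")  # sqrt(14.44873) is about 3.80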
All code together:
library(neuralnet)
library(mlbench)
library(caret)
data(BostonHousing)
BostonHousing$chas = as.numeric(BostonHousing$chas) - 1 # Convert factor to numeric
# Split data
set.seed(123)
trainIndex <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
train_data <- BostonHousing[trainIndex, ]
test_data <- BostonHousing[-trainIndex, ]
# Normalize with preProcess
pre_proc <- preProcess(train_data, method = "range")
train_scaled <- predict(pre_proc, train_data)
test_scaled <- predict(pre_proc, test_data)
# Model Training
set.seed(123)
nn <- neuralnet(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + b + lstat, data = train_scaled, hidden = c(4, 2), act.fct = "tanh",
linear.output = TRUE, err.fct = "sse", threshold = 0.01, stepmax = 1e6)
# Predictions and Evaluation
predictions <- predict(nn, test_scaled[, -which(names(test_scaled) == "medv")])
predictions_rescaled <- predictions * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
actual_rescaled <- test_scaled$medv * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
mse <- mean((actual_rescaled - predictions_rescaled)^2)
r_squared <- cor(actual_rescaled, predictions_rescaled)^2
plot(actual_rescaled, predictions_rescaled, main = "Predicted vs Actual",
xlab = "Actual ($1000s)", ylab = "Predicted ($1000s)")
abline(0, 1, col = "red")