Chapter 1. Data Mining: An Overview

Data Mining is the computational process of discovering

patterns,
correlations,
anomalies, and
statistical structures

within large datasets. It employs methods at the intersection of machine learning, statistics, and database systems to extract useful information and transform it into an understandable structure for further use.

Data mining is often seen as a core step in the broader KDD (Knowledge Discovery in Databases) process:

Data Cleaning: Remove noise and inconsistent data.
Data Integration: Combine data from multiple sources.
Data Selection: Retrieve relevant data from the database.
Data Transformation: Consolidate data into forms appropriate for mining (e.g., aggregation, normalization).
Data Mining: Apply intelligent methods to extract patterns.
Pattern Evaluation: Identify truly interesting patterns representing knowledge.
Knowledge Presentation: Present the mined knowledge to the user using visualization and knowledge representation techniques.

Data mining tasks are broadly categorized into two types:

Predictive Tasks: Infer patterns to predict future outcomes.
- Classification: Predicts a categorical class label. An example is to classify a loan application as “low,” “medium,” or “high” risk. Common classification algorithms include Decision Trees, Naive Bayes, k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), and Neural Networks.
- Regression: Predicts a continuous numeric value. An example is to predict the price of a house or future sales revenue. common algorithms include Linear Regression, Regression Trees, k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), and Neural Networks.
Descriptive Tasks: Discover patterns that describe the data, providing new insights.
- Clustering: Groups similar objects together without predefined labels. The goal is to find inherent groupings in the data. An example is customer segmentation for targeted marketing. Commonly used algorithms for clustering include k-Means and Hierarchical Clustering.
- Association Rule Learning (often called market basket analysis): Discovers interesting relations (affinities) between variables in large databases. An example is the famous rule {Diapers} -> {Beer} which suggests that people who buy diapers are also likely to buy beer. A commonly used algorithm is Apriori.

Why is Data Mining Important?

We live in the age of data. The explosion of data from the web, sensors, business systems, and social media has created immense opportunities. Data mining helps us:

Turn Data into Knowledge: Move from simply storing data to understanding it.
Make Informed Decisions: Support decision-making with data-driven insights, not just intuition.
Gain a Competitive Advantage: Businesses can understand customer behavior, optimize processes, and create new strategies.
Enable Scientific Discovery: Find hidden patterns in scientific data, from genomics to astronomy.

Common Challenges in Data Mining:

Data Quality: Dealing with noisy, incomplete, and inconsistent data.
Scalability: Algorithms must be efficient and scalable to handle enormous datasets.
Dimensionality: The “curse of dimensionality” – high-dimensional data can be sparse and computationally expensive to process.
Privacy and Ethics: Using data responsibly, ensuring individual privacy, and avoiding biased or unethical outcomes.
Interpretability: Understanding and trusting the results of complex models (“black boxes”).

Data mining has ubiquitous applications across industries:

Retail: Market basket analysis, customer segmentation, sales forecasting.
Finance: Credit scoring, fraud detection, algorithmic trading.
Healthcare: Disease prediction, drug discovery, patient outcome analysis.
Telecommunications: Churn prediction, network fault isolation.
Science: Gene sequence analysis, climate pattern recognition, astronomical data analysis.
Web & Social Media: Recommendation systems (Netflix, Amazon), sentiment analysis, targeted advertising.

1.1 Core Ideas in Data Mining

Classification is the process of classifying a new record into one of the existing categories using a rule developed with similar data where individuals are already classified.

An application for a loan can repay on time (category 1), repay late (category 2), or declare bankrupcy (category 3).
A credit card transaction can be normal (category 1) or fraudulent (category 2).
A packet of data traveling on a network can be benigh (category 1) or threatening (category 2).

The victim of an illness can be recovered (category 1), still be ill (category 2), or be deceased (category 3).

Prediction is similar to classification, except that we are trying to predict the value of a numeric output variable. Loosely speaking, we can also predict the class of a categorical output variable, so prediction is a more general term.

In business settings, we may wish to find how likely a customer who purchases item A also purchases item B, or find what items tend to be purchased together. Such analysis of associations patterns is called market basket analysis (or affinity analysis or association rules). The rules are generated at the population level and can be used in a variety of ways, such as product placement, weekly promotional offers, and bundling products. Association rules derived from a hospital database on patients’ symptoms during consecutive hospitalizations can help find “which symptom is followed by what other symptoms” to help predict future symptoms for returning patients.

A dataset may contain hundreds of thousands of variables of which many are simply noise. In this situation, reducing the number of variables in the initial step before deploying data mining methods, called dimension reduction, can improve predictive power, manageability, and interpretability.

Before mining your data, you may need to clean, manipulate, as well as visualize them to generate hypotheses. This aiming at understanding the global landscape of the data is termed data exploration. Exploration by creating charts and dashboards is called data visualization or visual analytics.

learning can be supervised or unsupervised. Supervised learning refers to the process of providing an algorithm with records (called training data) in which an output variable of interest is known and the algorithm “learns” how to predict the value of this output variable with new records where the output is unknown. Once the model is developed based on the training data, it can then be applied to some reserved records to check it’s performance. The reserved data consist of the test data. If a few models are being considered, a third reserved dataset, called the validation data, can be used to select the best model based on some criterion. Simple linear regression is an example of supervised learning algorithm, since the values of the outcome variable (usually denoted by y) are all known. Unsupervised learning, on the other hand, refers to the process where there is no output variable to predict or classify and one simply uses a computer algorithm to learn patterns in the data.

1.2 The Steps in Data Mining

The steps are

Understand the purpose of the data mining project.
Obtain the data from one or more database or resources.
Explore, clean, and process the data.
Reduce the data dimension, if necessary.
Determine the data mining task (prediction, classification, clustering, etc.)
Partition the data for supervised learning into traing, validation, and test datasets.
Choose the data mining techniques to be used (regression, neural nets, decision trees, etc.)
Perform the task.
Interpret the results.
Deploy the model.

1.3 Preliminary Steps

1.3.1 Loading and looking at the data in R

If your data is stored in a CSV (Comma Delimited Values) file, you can use the r function read.csv() or fread() from the data.table package. The syntax is

read.csv(file = “myfile.csv”, header = TRUE, sep = “,”, skip = k)
fread(file = “myfile.csv”, sep = “,”, skip = k, header = TRUE)

1.3.2 Oversampling rare events in classification tasks

If the event we are using for classification is rare, for example, fraudulent credit card transactions, sampling a random subset of records may yield so few events that we have little information on them. In such cases, we would want our sampling procedure to overweight the rare class relative to the majority class so that our sample would end up with a healthy complement of rare events. Capturing events is more valuable than missing events, with price paid due to misclassifying nonevents.

The ROSE package creates possibly balanced samples by random over-sampling minority examples, under-sampling majority examples or combination of over- and under-sampling.

Here is a reference of using ROSE: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

1.3.3 Outliers and missing values

Outliers can be deleted or appropriately handled with domain knowledge. Missing values can be imputed using one of a few methods, or the records with missing values can be deleted if missing is at random.

1.3.4 Normalyzing (Standardizing) and rescaling data

Some algorithms require that the numeric data be normalized (or standardized). To normalize or standardize a variable, we subtract the mean from the corresponding column and then divide by the standard deviation of the column. The R function scale() will do the job. The resulting values are called z-scores of the original values.

scale(mtcars)

##                             mpg        cyl        disp          hp        drat
## Mazda RX4            0.15088482 -0.1049878 -0.57061982 -0.53509284  0.56751369
## Mazda RX4 Wag        0.15088482 -0.1049878 -0.57061982 -0.53509284  0.56751369
## Datsun 710           0.44954345 -1.2248578 -0.99018209 -0.78304046  0.47399959
## Hornet 4 Drive       0.21725341 -0.1049878  0.22009369 -0.53509284 -0.96611753
## Hornet Sportabout   -0.23073453  1.0148821  1.04308123  0.41294217 -0.83519779
## Valiant             -0.33028740 -0.1049878 -0.04616698 -0.60801861 -1.56460776
## Duster 360          -0.96078893  1.0148821  1.04308123  1.43390296 -0.72298087
## Merc 240D            0.71501778 -1.2248578 -0.67793094 -1.23518023  0.17475447
## Merc 230             0.44954345 -1.2248578 -0.72553512 -0.75387015  0.60491932
## Merc 280            -0.14777380 -0.1049878 -0.50929918 -0.34548584  0.60491932
## Merc 280C           -0.38006384 -0.1049878 -0.50929918 -0.34548584  0.60491932
## Merc 450SE          -0.61235388  1.0148821  0.36371309  0.48586794 -0.98482035
## Merc 450SL          -0.46302456  1.0148821  0.36371309  0.48586794 -0.98482035
## Merc 450SLC         -0.81145962  1.0148821  0.36371309  0.48586794 -0.98482035
## Cadillac Fleetwood  -1.60788262  1.0148821  1.94675381  0.85049680 -1.24665983
## Lincoln Continental -1.60788262  1.0148821  1.84993175  0.99634834 -1.11574009
## Chrysler Imperial   -0.89442035  1.0148821  1.68856165  1.21512565 -0.68557523
## Fiat 128             2.04238943 -1.2248578 -1.22658929 -1.17683962  0.90416444
## Honda Civic          1.71054652 -1.2248578 -1.25079481 -1.38103178  2.49390411
## Toyota Corolla       2.29127162 -1.2248578 -1.28790993 -1.19142477  1.16600392
## Toyota Corona        0.23384555 -1.2248578 -0.89255318 -0.72469984  0.19345729
## Dodge Challenger    -0.76168319  1.0148821  0.70420401  0.04831332 -1.56460776
## AMC Javelin         -0.81145962  1.0148821  0.59124494  0.04831332 -0.83519779
## Camaro Z28          -1.12671039  1.0148821  0.96239618  1.43390296  0.24956575
## Pontiac Firebird    -0.14777380  1.0148821  1.36582144  0.41294217 -0.96611753
## Fiat X1-9            1.19619000 -1.2248578 -1.22416874 -1.17683962  0.90416444
## Porsche 914-2        0.98049211 -1.2248578 -0.89093948 -0.81221077  1.55876313
## Lotus Europa         1.71054652 -1.2248578 -1.09426581 -0.49133738  0.32437703
## Ford Pantera L      -0.71190675  1.0148821  0.97046468  1.71102089  1.16600392
## Ferrari Dino        -0.06481307 -0.1049878 -0.69164740  0.41294217  0.04383473
## Maserati Bora       -0.84464392  1.0148821  0.56703942  2.74656682 -0.10578782
## Volvo 142E           0.21725341 -1.2248578 -0.88529152 -0.54967799  0.96027290
##                               wt        qsec         vs         am       gear
## Mazda RX4           -0.610399567 -0.77716515 -0.8680278  1.1899014  0.4235542
## Mazda RX4 Wag       -0.349785269 -0.46378082 -0.8680278  1.1899014  0.4235542
## Datsun 710          -0.917004624  0.42600682  1.1160357  1.1899014  0.4235542
## Hornet 4 Drive      -0.002299538  0.89048716  1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout    0.227654255 -0.46378082 -0.8680278 -0.8141431 -0.9318192
## Valiant              0.248094592  1.32698675  1.1160357 -0.8141431 -0.9318192
## Duster 360           0.360516446 -1.12412636 -0.8680278 -0.8141431 -0.9318192
## Merc 240D           -0.027849959  1.20387148  1.1160357 -0.8141431  0.4235542
## Merc 230            -0.068730634  2.82675459  1.1160357 -0.8141431  0.4235542
## Merc 280             0.227654255  0.25252621  1.1160357 -0.8141431  0.4235542
## Merc 280C            0.227654255  0.58829513  1.1160357 -0.8141431  0.4235542
## Merc 450SE           0.871524874 -0.25112717 -0.8680278 -0.8141431 -0.9318192
## Merc 450SL           0.524039143 -0.13920420 -0.8680278 -0.8141431 -0.9318192
## Merc 450SLC          0.575139986  0.08464175 -0.8680278 -0.8141431 -0.9318192
## Cadillac Fleetwood   2.077504765  0.07344945 -0.8680278 -0.8141431 -0.9318192
## Lincoln Continental  2.255335698 -0.01608893 -0.8680278 -0.8141431 -0.9318192
## Chrysler Imperial    2.174596366 -0.23993487 -0.8680278 -0.8141431 -0.9318192
## Fiat 128            -1.039646647  0.90727560  1.1160357  1.1899014  0.4235542
## Honda Civic         -1.637526508  0.37564148  1.1160357  1.1899014  0.4235542
## Toyota Corolla      -1.412682800  1.14790999  1.1160357  1.1899014  0.4235542
## Toyota Corona       -0.768812180  1.20946763  1.1160357 -0.8141431 -0.9318192
## Dodge Challenger     0.309415603 -0.54772305 -0.8680278 -0.8141431 -0.9318192
## AMC Javelin          0.222544170 -0.30708866 -0.8680278 -0.8141431 -0.9318192
## Camaro Z28           0.636460997 -1.36476075 -0.8680278 -0.8141431 -0.9318192
## Pontiac Firebird     0.641571082 -0.44699237 -0.8680278 -0.8141431 -0.9318192
## Fiat X1-9           -1.310481114  0.58829513  1.1160357  1.1899014  0.4235542
## Porsche 914-2       -1.100967659 -0.64285758 -0.8680278  1.1899014  1.7789276
## Lotus Europa        -1.741772228 -0.53093460  1.1160357  1.1899014  1.7789276
## Ford Pantera L      -0.048290296 -1.87401028 -0.8680278  1.1899014  1.7789276
## Ferrari Dino        -0.457097039 -1.31439542 -0.8680278  1.1899014  1.7789276
## Maserati Bora        0.360516446 -1.81804880 -0.8680278  1.1899014  1.7789276
## Volvo 142E          -0.446876870  0.42041067  1.1160357  1.1899014  0.4235542
##                           carb
## Mazda RX4            0.7352031
## Mazda RX4 Wag        0.7352031
## Datsun 710          -1.1221521
## Hornet 4 Drive      -1.1221521
## Hornet Sportabout   -0.5030337
## Valiant             -1.1221521
## Duster 360           0.7352031
## Merc 240D           -0.5030337
## Merc 230            -0.5030337
## Merc 280             0.7352031
## Merc 280C            0.7352031
## Merc 450SE           0.1160847
## Merc 450SL           0.1160847
## Merc 450SLC          0.1160847
## Cadillac Fleetwood   0.7352031
## Lincoln Continental  0.7352031
## Chrysler Imperial    0.7352031
## Fiat 128            -1.1221521
## Honda Civic         -0.5030337
## Toyota Corolla      -1.1221521
## Toyota Corona       -1.1221521
## Dodge Challenger    -0.5030337
## AMC Javelin         -0.5030337
## Camaro Z28           0.7352031
## Pontiac Firebird    -0.5030337
## Fiat X1-9           -1.1221521
## Porsche 914-2       -0.5030337
## Lotus Europa        -0.5030337
## Ford Pantera L       0.7352031
## Ferrari Dino         1.9734398
## Maserati Bora        3.2116766
## Volvo 142E          -0.5030337
## attr(,"scaled:center")
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500 
## attr(,"scaled:scale")
##         mpg         cyl        disp          hp        drat          wt 
##   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
##        qsec          vs          am        gear        carb 
##   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000

We can also normalize data by rescaling a numeric variable to a scale from 0 to 1, which means to subtract the minimum value of the column and then divide by the range (maximum - minimum). The function rescale() from the “scales” package can do the job.

scale(mtcars$mpg)

##              [,1]
##  [1,]  0.15088482
##  [2,]  0.15088482
##  [3,]  0.44954345
##  [4,]  0.21725341
##  [5,] -0.23073453
##  [6,] -0.33028740
##  [7,] -0.96078893
##  [8,]  0.71501778
##  [9,]  0.44954345
## [10,] -0.14777380
## [11,] -0.38006384
## [12,] -0.61235388
## [13,] -0.46302456
## [14,] -0.81145962
## [15,] -1.60788262
## [16,] -1.60788262
## [17,] -0.89442035
## [18,]  2.04238943
## [19,]  1.71054652
## [20,]  2.29127162
## [21,]  0.23384555
## [22,] -0.76168319
## [23,] -0.81145962
## [24,] -1.12671039
## [25,] -0.14777380
## [26,]  1.19619000
## [27,]  0.98049211
## [28,]  1.71054652
## [29,] -0.71190675
## [30,] -0.06481307
## [31,] -0.84464392
## [32,]  0.21725341
## attr(,"scaled:center")
## [1] 20.09062
## attr(,"scaled:scale")
## [1] 6.026948

1.4 Overfitting

When building models for prediction or classification, we need to make sure that the chosen model generalizes beyond the datasets that we have at hand for building and choosing the model. That is, the chosen model should not be too complicated (for example, with too many terms) so that it overfits the data with large standard errors.

The mean squared error of a model can be decomposed into two components: squared bias and variance. If a model underfits the data, the bias part is high but the variance part is low. If a model overfits the data, the bias part is low but the variance part is high. To check the predictive power of a model, we use a new set of predictor values

Here are some simulations of overfitting: https://medium.com/analytics-vidhya/full-demo-of-overfitting-and-underfitting-in-r-70fd9b66a7d1 and https://fengkehh.github.io/post/2017-06-30-overfitting/ and https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765

When fitting a multiple regression model in R, we use the R function lm(). The syntax is lm(formula, data). To get the prediction with new data, we use the predict() function.

1.5 Building a Predictive Model

When building a predictive model, the typical steps are

Determine the purpose: prediction or classification
Obtain the data: you might need to select a sample from the database.
Explore, clean, and process the data: the description of the variables (AKA data dictionary), error checking, creation of dummy variables
Reduce the data dimension: The reduction of the number of variables
Determine the data mining task: prediction of a value or a label, supervised or unsupervised learning
Partition the data (for supervised tasks): the partition of data into training, validation, and test data.
Choose the best candidate technique.
Perform the task using algorithms: the lm() function for multiple regression, the naiveBayes() function for classification, the rpart(), knn() and svm() functions for both prediction and classification, and the functions neuralnet(), linDA(), and lda(), qda(), or mda()
Interpret the results.
Deploy the model: apply the model to new records to predict (or to score) the output values.

1.6 Common Challenges in Data Mining:

Data Quality: Dealing with noisy, incomplete, and inconsistent data.
Scalability: Algorithms must be efficient and scalable to handle enormous datasets.
Dimensionality: The “curse of dimensionality” – high-dimensional data can be sparse and computationally expensive to process.
Privacy and Ethics: Using data responsibly, ensuring individual privacy, and avoiding biased or unethical outcomes.
Interpretability: Understanding and trusting the results of complex models (“black boxes”).

Chapter 2. Data Explorarion

Data Exploration is a crucial initial phase in the data mining process that involves using visual and statistical techniques to understand the general properties of a dataset before applying complex algorithms or models.

Think of it as a detective first surveying a crime scene—looking for obvious clues, understanding the layout, and identifying potential points of interest before calling in the forensics team for deep analysis.

The primary goals are to:

Understand the data’s structure.
Uncover initial patterns, trends, and relationships.
Detect anomalies and outliers.
Identify important variables.
Formulate hypotheses for more rigorous testing later.
Inform and guide the subsequent steps of data preprocessing and modeling.

2.1 Key Activities in Data Exploration

Data exploration typically involves three main types of analysis:

Summary Statistics (Numerical Summaries)

This involves calculating key metrics that describe the central tendency, spread, and shape of the data’s distribution.

Central Tendency: Mean, Median, Mode.

Spread: Range, Variance, Standard Deviation, Interquartile Range (IQR).

Shape: Skewness (measure of asymmetry), Kurtosis (measure of “tailedness”).

For Categorical Data: Frequency counts and proportions for each category.

Example: Calculating the average (mean) customer age and the standard deviation tells you the typical age and how much variation there is in your customer base.

Visualization (Graphical Representations)

Visualizations are powerful tools for seeing patterns that are difficult to detect in tables of numbers.

Histograms & Box Plots: Understand the distribution of a single numerical variable. A box plot quickly shows the median, quartiles, and potential outliers.

Scatter Plots: Visualize the relationship (e.g., correlation) between two numerical variables (e.g., advertising spend vs. sales).

Bar Charts & Pie Charts: Show the frequency or proportion of categories in a categorical variable (e.g., number of customers per country).

Heatmaps & Correlation Matrices: Visualize the correlation coefficients between multiple numerical variables at once. This helps identify highly related variables.

Pair Plots: A grid of scatter plots for multiple variables, providing a comprehensive overview of their relationships.

Data Quality Assessment

This is an essential part of exploration where you check for issues that need to be addressed in the data preprocessing stage.

Missing Values: Identifying columns with blank or NULL entries and understanding the pattern of missingness.

Outliers: Detecting extreme values that could be errors or genuine rare events. A scatter plot or box plot makes outliers obvious.

Inconsistencies: Spotting typos, duplicate records, or incorrect data types (e.g., a date stored as text).

2.2 Visualization with R

We use the R built-in data “ToothGrowth” and “economics” to demonstrate visualization.

# 1. Create a box plot (bp)
p <- ggplot(ToothGrowth, aes(color = factor(dose), y = len))
bxp <- p + geom_boxplot() 


# 2. Create a histogram (hp)
p <- ggplot(iris, aes(color = Species, x = Sepal.Length))
hp <- p + geom_density() 


# 3. Create a line plot
lp <- ggplot(economics, aes(x = date, y = psavert)) + 
  geom_line() 

# 3. Combine the plots on one page
library(ggpubr)
ggarrange(bxp, hp, lp,
                    labels = c("A", "B", "C"),
                    ncol = 2, nrow = 2)

Chapter 3 Dimension Reduction

3.1 Introduction

The dimension of a dataset is the number of variables (quantitative or categorical). The dimensionality of a model is the number of predictors or input variables used by the model. For data mining algorithms to operate efficiently, the dimension need to be reduced. This process is part of the pilot phase of data mining and is done before deploying a model. Several dimension reduction approaches are:

Incorporating domain knowledge to remove or combine categories.
Use data visualization (such as scatterplot matrix) or summaries (such as correlation matrix) to detect redundant variables or categories.
Convert ordinal categorical variables to numerical variables.
Employ automated reduction techniques, such as the principal component analysis (PCA), to create a new set of uncorrelated variables and use only a few of them.
Use regression models and classification & regression trees to remove redundant variables or combining similar categories of categorical variables.

3.2 Curse of Dimensionality

The curse of dimensionality is the affliction caused by adding variables to a model that already involves multiple variables. As variables are added, the feature space increasingly sparse, and classification and prediction models fail because available data are insufficient to provide a useful model across so many variables. Dimension reduction in artificial intelligence (AI) literature is often referred to as feature selection (selecting predictor variables or features) or feature extraction (creating more useful variables in place of original ones).

3.3 Reducing the Number of Categories in Categorical Variables

The handling of a categorical variable in regression is usually through the creation of dummy variables. Many times, software will handle it automatically. If you have \(k\) categories for a variable, you can create \(k\) dummy variables, each taking values of 0 or 1. Since these dummy variables are perfectly correlated, to reduce multicollinearity, you can only use any \(k-1\) of them. If your data have many categorical variables, then many more dummy variables will be created. If you think some categories of a categorical variable can be combined to form a new categorical variable with a less number of categories, do it!

3.4 Converting Ordinal Categorical Variables to Numerical Variables

Other way to achieve data reduction is to use meaningful numerical values for the categories of a ordinal variable. A categorical variable is ordinal (such as grade, one on a Likert scale), if there is a natural ordering in the categories of it. Some domain knowledge might be needed when mapping categories to numerical values.

3.5 Principal Component Analysis

The principal component analysis is a technique that allows us to create a set of \(p\) uncorrelated quantitative variables (called principal components) out of a set of \(p\) quantitative variables, such that the sum of the variances of the PC’s equals the sum of the variances of the original variables. The PC that has the largest variance among all PC’s is called the first principal component, the PC that has the second largest variance among all PC’s is called the second principal component, and so on.

Here is a math-free introduction to PCA: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

head(iris, n = 10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

D = iris[,-5] # Remove 5th column which is categorical
head(D, n = 10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           5.1         3.5          1.4         0.2
## 2           4.9         3.0          1.4         0.2
## 3           4.7         3.2          1.3         0.2
## 4           4.6         3.1          1.5         0.2
## 5           5.0         3.6          1.4         0.2
## 6           5.4         3.9          1.7         0.4
## 7           4.6         3.4          1.4         0.3
## 8           5.0         3.4          1.5         0.2
## 9           4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1

# No centering, no scaling
pcs = princomp(D, cor = TRUE, scores = TRUE) # Use the correlation matrix to start PCA
# scores = TRUE returns a new data frame with columns being the principal components

pcs

## Call:
## princomp(x = D, cor = TRUE, scores = TRUE)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4 
## 1.7083611 0.9560494 0.3830886 0.1439265 
## 
##  4  variables and  150 observations.

summary(pcs) # Shows importance of components

## Importance of components:
##                           Comp.1    Comp.2     Comp.3      Comp.4
## Standard deviation     1.7083611 0.9560494 0.38308860 0.143926497
## Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
## Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000

plot(pcs, type = "l", main = "A Scree Plot") # Variances of PC's

# The numbers are equal to the eigen values of the correlation matrix of D
eigen(cor(D))

## eigen() decomposition
## $values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
## 
## $vectors
##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.5210659 -0.37741762 -0.7195664  0.2612863
## [2,] -0.2693474 -0.92329566  0.2443818 -0.1235096
## [3,]  0.5804131 -0.02449161  0.1421264 -0.8014492
## [4,]  0.5648565 -0.06694199  0.6342727  0.5235971

eigen(cor(D))$values # Must sum to 1

## [1] 2.91849782 0.91403047 0.14675688 0.02071484

# Get the corresponding eigen vectors
eigen(cor(D))$vectors

##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.5210659 -0.37741762 -0.7195664  0.2612863
## [2,] -0.2693474 -0.92329566  0.2443818 -0.1235096
## [3,]  0.5804131 -0.02449161  0.1421264 -0.8014492
## [4,]  0.5648565 -0.06694199  0.6342727  0.5235971

# pcs is a list
names(pcs)

## [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"

pcs$sdev # Gives the standard deviation of each of the principal components.

##    Comp.1    Comp.2    Comp.3    Comp.4 
## 1.7083611 0.9560494 0.3830886 0.1439265

pcs$center # Gives the mean of each column of the original data D

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

pcs$scale # Gives the scaling factor (similar to standard deviation) of each column of the original data D

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8253013    0.4344110    1.7594041    0.7596926

pcs$loadings # Extract coefficients (or vectors); a matrix whose columns contain the eigenvectors

## 
## Loadings:
##              Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length  0.521  0.377  0.720  0.261
## Sepal.Width  -0.269  0.923 -0.244 -0.124
## Petal.Length  0.580        -0.142 -0.801
## Petal.Width   0.565        -0.634  0.524
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00

pcs$scores # Extract the data frame with columns that are the principal components variables (called a score matrix).

##             Comp.1       Comp.2       Comp.3       Comp.4
##   [1,] -2.26470281  0.480026597  0.127706022  0.024168204
##   [2,] -2.08096115 -0.674133557  0.234608854  0.103006775
##   [3,] -2.36422905 -0.341908024 -0.044201485  0.028377053
##   [4,] -2.29938422 -0.597394508 -0.091290106 -0.065955560
##   [5,] -2.38984217  0.646835383 -0.015738196 -0.035922813
##   [6,] -2.07563095  1.489177523 -0.026968294  0.006608180
##   [7,] -2.44402884  0.047644198 -0.335470401 -0.036775557
##   [8,] -2.23284716  0.223148073  0.088695498 -0.024612096
##   [9,] -2.33464048 -1.115327675 -0.145076864 -0.026859221
##  [10,] -2.18432817 -0.469013561  0.253765567 -0.039899288
##  [11,] -2.16631010  1.043690653  0.268681102  0.016731367
##  [12,] -2.32613087  0.133078335 -0.093759244 -0.133483413
##  [13,] -2.21845090 -0.728676165  0.230911237  0.002425038
##  [14,] -2.63310070 -0.961506729 -0.180796084 -0.019215534
##  [15,] -2.19874060  1.860057113  0.472900998  0.194731769
##  [16,] -2.26221453  2.686284485 -0.030526609  0.050533737
##  [17,] -2.20758770  1.483609363  0.005344094  0.188817432
##  [18,] -2.19034951  0.488838316  0.044215316  0.093090438
##  [19,] -1.89857200  1.405018794  0.374343275  0.061095967
##  [20,] -2.34336905  1.127849382 -0.132630467 -0.037756420
##  [21,] -1.91432300  0.408855708  0.421292594  0.010921286
##  [22,] -2.20701284  0.924121427 -0.159865277  0.059597330
##  [23,] -2.77434470  0.458343668 -0.332179098  0.019648430
##  [24,] -1.81866953  0.085558526 -0.034488596  0.151140999
##  [25,] -2.22716331  0.137254455 -0.117993536 -0.270140352
##  [26,] -1.95184633 -0.625618588  0.305640982  0.043561651
##  [27,] -2.05115137  0.242163553 -0.086364011  0.067680060
##  [28,] -2.16857717  0.527149525  0.206816248  0.010275393
##  [29,] -2.13956345  0.313217810  0.271150240  0.084259221
##  [30,] -2.26526149 -0.337731904 -0.068435776 -0.108279885
##  [31,] -2.14012214 -0.504540690  0.075008442 -0.048188868
##  [32,] -1.83159477  0.423695068  0.270467377  0.239870381
##  [33,] -2.61494794  1.793575856 -0.047228419 -0.229235932
##  [34,] -2.44617739  2.150727877  0.082668045 -0.048214393
##  [35,] -2.10997488 -0.460201841  0.170274861  0.029022947
##  [36,] -2.20780890 -0.206107398  0.225441580  0.168907873
##  [37,] -2.04514621  0.661558111  0.484537410  0.196358525
##  [38,] -2.52733191  0.592292774 -0.019435812 -0.136504550
##  [39,] -2.42963258 -0.904180040 -0.193254662 -0.009738423
##  [40,] -2.16971071  0.268878961  0.175883821  0.007047406
##  [41,] -2.28647514  0.441715388 -0.034894909  0.106983249
##  [42,] -1.85812246 -2.337415158  0.204234223  0.289863919
##  [43,] -2.55363840 -0.479100690 -0.305766453 -0.066601453
##  [44,] -1.96444768  0.472326668 -0.309601318  0.177093014
##  [45,] -2.13705901  1.142229262 -0.248433561 -0.151043437
##  [46,] -2.06974430 -0.711052725  0.063929826  0.140269507
##  [47,] -2.38473317  1.120429702 -0.057217858 -0.152230967
##  [48,] -2.39437631 -0.386246873 -0.139467905 -0.048834762
##  [49,] -2.22944655  0.997959764  0.181492780 -0.014928135
##  [50,] -2.20383344  0.009216358  0.153029490  0.049371732
##  [51,]  1.10178118  0.862972418  0.684586163  0.034833776
##  [52,]  0.73133743  0.594614726  0.094121716  0.004903623
##  [53,]  1.24097932  0.616297654  0.554006835  0.009423397
##  [54,]  0.40748306 -1.754403989  0.023101768  0.065768835
##  [55,]  1.07547470 -0.208421046  0.398255523  0.104736873
##  [56,]  0.38868734 -0.593283636 -0.124191550 -0.240831300
##  [57,]  0.74652974  0.773019312 -0.148969403 -0.077369785
##  [58,] -0.48732274 -1.852429087 -0.249265266 -0.040520205
##  [59,]  0.92790164  0.032226078  0.596169361 -0.029879609
##  [60,]  0.01142619 -1.034018275 -0.538899390 -0.028461184
##  [61,] -0.11019628 -2.654072819  0.046790444  0.013760731
##  [62,]  0.44069345 -0.063295188 -0.205073815  0.040126082
##  [63,]  0.56210831 -1.764724381  0.765771394  0.045731157
##  [64,]  0.71956189 -0.186224606  0.068658945 -0.164807198
##  [65,] -0.03335470 -0.439003210 -0.194932893  0.109048499
##  [66,]  0.87540719  0.509063957  0.503511382  0.104943723
##  [67,]  0.35025167 -0.196311735 -0.490873075 -0.191509364
##  [68,]  0.15881005 -0.792095742  0.302037174 -0.205297735
##  [69,]  1.22509363 -1.622243803  0.482304024  0.225899769
##  [70,]  0.16491790 -1.302609230  0.172837808 -0.051726849
##  [71,]  0.73768265  0.396571562 -0.616526306 -0.083284123
##  [72,]  0.47628719 -0.417320281  0.264952227  0.113568273
##  [73,]  1.23417810 -0.933325729  0.368412272 -0.009944526
##  [74,]  0.63285820 -0.416387721  0.291896252 -0.274220152
##  [75,]  0.70266118 -0.063411820  0.446027008  0.043458325
##  [76,]  0.87427365  0.250793393  0.472578954  0.101715736
##  [77,]  1.25650912 -0.077256020  0.727155002  0.039688518
##  [78,]  1.35840512  0.331311682  0.260826577  0.066828064
##  [79,]  0.66480037 -0.225927855 -0.085863889 -0.036439840
##  [80,] -0.04025861 -1.058718547  0.319573330  0.064788156
##  [81,]  0.13079518 -1.562271834  0.149983478 -0.009402523
##  [82,]  0.02345269 -1.572475594  0.241552281 -0.032772444
##  [83,]  0.24153827 -0.777256383  0.151211957  0.023651360
##  [84,]  1.06109461 -0.633843245 -0.105311387 -0.183968453
##  [85,]  0.22397877 -0.287773512 -0.665249720 -0.254828368
##  [86,]  0.42913912  0.845582241 -0.450634071 -0.109675181
##  [87,]  1.04872805  0.522051797  0.395786384  0.037209019
##  [88,]  1.04453138 -1.382988719  0.688295960  0.136835600
##  [89,]  0.06958832 -0.219503335 -0.291579274 -0.147144581
##  [90,]  0.28347724 -1.329324639 -0.089410023  0.008905805
##  [91,]  0.27907778 -1.120028524 -0.094487601 -0.270657196
##  [92,]  0.62456979  0.024923029  0.020481147 -0.147686401
##  [93,]  0.33653037 -0.988404018  0.199389755  0.006530562
##  [94,] -0.36218338 -2.019237873 -0.105821048  0.019570812
##  [95,]  0.28858624 -0.855730320 -0.130889685 -0.107402349
##  [96,]  0.09136066 -0.181192126 -0.128978343 -0.229959626
##  [97,]  0.22771687 -0.384920081 -0.156213154 -0.132605877
##  [98,]  0.57638829 -0.154873597  0.271650362 -0.019860679
##  [99,] -0.44766702 -1.543792034 -0.190400930  0.199946457
## [100,]  0.25673059 -0.598851796 -0.091879161 -0.058622049
## [101,]  1.84456887  0.870421312 -1.005401018 -0.049249743
## [102,]  1.15788161 -0.698869862 -0.530160149 -0.040520754
## [103,]  2.20526679  0.562010477  0.202914170  0.059184194
## [104,]  1.44015066 -0.046987588 -0.163630107 -0.235770073
## [105,]  1.86781222  0.295044824 -0.395628375 -0.016298272
## [106,]  2.75187334  0.800409201  0.582309103 -0.101384486
## [107,]  0.36701769 -1.561502891 -0.986893267 -0.133123834
## [108,]  2.30243944  0.420065580  0.651706439 -0.238041242
## [109,]  2.00668647 -0.711438654  0.393990571 -0.086510630
## [110,]  2.25977735  1.921010376 -0.397551897  0.104838918
## [111,]  1.36417549  0.692756454 -0.284612074  0.107860420
## [112,]  1.60267867 -0.421700450 -0.023186408  0.058331633
## [113,]  1.88390070  0.419249651 -0.026338410  0.146414939
## [114,]  1.26011510 -1.162260421 -0.580249290  0.099157321
## [115,]  1.46764520 -0.442271587 -1.003869574  0.275658903
## [116,]  1.59007732  0.676244806 -0.638428708  0.191862996
## [117,]  1.47143146  0.255621824 -0.037431260 -0.155330271
## [118,]  2.42632899  2.556661251  0.127881459 -0.273807183
## [119,]  3.31069558  0.017780949  0.703305304  0.045188606
## [120,]  1.26376667 -1.706745380  0.267536893 -0.065180800
## [121,]  2.03771630  0.910467410 -0.234799484  0.167951254
## [122,]  0.97798073 -0.571764325 -0.828127201  0.027755587
## [123,]  2.89765149  0.413641060  0.857421825 -0.127336502
## [124,]  1.33323218 -0.481811219  0.005428364  0.139959148
## [125,]  1.70073390  1.013921867 -0.298450613 -0.061643734
## [126,]  1.95432671  1.007777596  0.419984722 -0.218338351
## [127,]  1.17510363 -0.316394472 -0.129937757  0.125420444
## [128,]  1.02095055  0.064346029 -0.337715967 -0.008654401
## [129,]  1.78834992 -0.187361215 -0.270658006  0.031087648
## [130,]  1.86364755  0.562290726  0.715634119 -0.208215164
## [131,]  2.43595373  0.259284433  0.727816146 -0.017923365
## [132,]  2.30492772  2.626323468  0.493473808 -0.211675709
## [133,]  1.86270322 -0.178549495 -0.354148712  0.100009882
## [134,]  1.11414774 -0.292922623  0.183488392 -0.186343697
## [135,]  1.20247330 -0.811315271  0.164723757 -0.489483470
## [136,]  2.79877045  0.856803329  0.542906499  0.295881050
## [137,]  1.57625591  1.068581107 -0.945853819  0.035605759
## [138,]  1.34629210  0.422430611 -0.180875478 -0.215421288
## [139,]  0.92482492  0.017223100 -0.416826193  0.005238409
## [140,]  1.85204505  0.676128174  0.012672115  0.195195239
## [141,]  2.01481043  0.613885637 -0.428332842  0.247538313
## [142,]  1.90178409  0.689575494 -0.130075005  0.469696647
## [143,]  1.15788161 -0.698869862 -0.530160149 -0.040520754
## [144,]  2.04055823  0.867520601 -0.338144000  0.045187126
## [145,]  1.99814710  1.049168747 -0.632413436  0.214045204
## [146,]  1.87050329  0.386966082 -0.256273852  0.389256845
## [147,]  1.56458048 -0.896686809  0.026371352  0.220192100
## [148,]  1.52117050  0.269069144 -0.180178380  0.119171137
## [149,]  1.37278779  1.011254419 -0.933395241  0.026128648
## [150,]  0.96065603 -0.024331668 -0.528248807 -0.163078032

# The difference between this new data frame and the original data frame is that the new 
# one has columns that are all uncorrelated to each other. The total variance remains the same.

biplot(pcs) # A scatterplot plot of the first two PCs with original variables and observation labels
# Add lines to the biplot
abline(h=0, v=0, lty = "dashed")

# Calculate the correlation between each of the original variable and each PC
cor(pcs$scores,D)

##        Sepal.Length Sepal.Width Petal.Length Petal.Width
## Comp.1   0.89016876 -0.46014271   0.99155518  0.96497896
## Comp.2   0.36082989  0.88271627   0.02341519  0.06399985
## Comp.3   0.27565767 -0.09361987  -0.05444699 -0.24298265
## Comp.4   0.03760602 -0.01777631  -0.11534978  0.07535950

# Calculate angles between PC1 and original variables
(cor(pcs$scores,D)[1,] %>% acos())*180/pi # acos() is the inverse cosine function

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    27.105539   117.396316     7.451417    15.208200

# Centering at zeros and scaling to have standard deviations being 1
pcs2 = princomp(D, cor = TRUE, scores = TRUE, center = c(0, 0, 0, 0), scale. = TRUE)

## Warning: In princomp.default(D, cor = TRUE, scores = TRUE, center = c(0, 
##     0, 0, 0), scale. = TRUE) :
##  extra arguments 'center', 'scale.' will be disregarded

pcs2

## Call:
## princomp(x = D, cor = TRUE, scores = TRUE, center = c(0, 0, 0, 
##     0), scale. = TRUE)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4 
## 1.7083611 0.9560494 0.3830886 0.1439265 
## 
##  4  variables and  150 observations.

biplot(pcs2)

# Q-mode with prcomp()
pcs3 = prcomp(D, center = TRUE, scale. = TRUE)
pcs3$rotation

##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

pcs3$x # Data frame with PC's

##                PC1          PC2          PC3          PC4
##   [1,] -2.25714118 -0.478423832  0.127279624  0.024087508
##   [2,] -2.07401302  0.671882687  0.233825517  0.102662845
##   [3,] -2.35633511  0.340766425 -0.044053900  0.028282305
##   [4,] -2.29170679  0.595399863 -0.090985297 -0.065735340
##   [5,] -2.38186270 -0.644675659 -0.015685647 -0.035802870
##   [6,] -2.06870061 -1.484205297 -0.026878250  0.006586116
##   [7,] -2.43586845 -0.047485118 -0.334350297 -0.036652767
##   [8,] -2.22539189 -0.222403002  0.088399352 -0.024529919
##   [9,] -2.32684533  1.111603700 -0.144592465 -0.026769540
##  [10,] -2.17703491  0.467447569  0.252918268 -0.039766068
##  [11,] -2.15907699 -1.040205867  0.267784001  0.016675503
##  [12,] -2.31836413 -0.132633999 -0.093446191 -0.133037725
##  [13,] -2.21104370  0.726243183  0.230140246  0.002416941
##  [14,] -2.62430902  0.958296347 -0.180192423 -0.019151375
##  [15,] -2.19139921 -1.853846555  0.471322025  0.194081578
##  [16,] -2.25466121 -2.677315230 -0.030424684  0.050365010
##  [17,] -2.20021676 -1.478655729  0.005326251  0.188186988
##  [18,] -2.18303613 -0.487206131  0.044067686  0.092779618
##  [19,] -1.89223284 -1.400327567  0.373093377  0.060891973
##  [20,] -2.33554476 -1.124083597 -0.132187626 -0.037630354
##  [21,] -1.90793125 -0.407490576  0.419885937  0.010884821
##  [22,] -2.19964383 -0.921035871 -0.159331502  0.059398340
##  [23,] -2.76508142 -0.456813301 -0.331069982  0.019582826
##  [24,] -1.81259716 -0.085272854 -0.034373442  0.150636353
##  [25,] -2.21972701 -0.136796175 -0.117599566 -0.269238379
##  [26,] -1.94532930  0.623529705  0.304620475  0.043416203
##  [27,] -2.04430277 -0.241354991 -0.086075649  0.067454082
##  [28,] -2.16133650 -0.525389422  0.206125707  0.010241084
##  [29,] -2.13241965 -0.312172005  0.270244895  0.083977887
##  [30,] -2.25769799  0.336604248 -0.068207276 -0.107918349
##  [31,] -2.13297647  0.502856075  0.074757996 -0.048027970
##  [32,] -1.82547925 -0.422280389  0.269564311  0.239069476
##  [33,] -2.60621687 -1.787587272 -0.047070727 -0.228470534
##  [34,] -2.43800983 -2.143546796  0.082392024 -0.048053409
##  [35,] -2.10292986  0.458665270  0.169706329  0.028926042
##  [36,] -2.20043723  0.205419224  0.224688852  0.168343905
##  [37,] -2.03831765 -0.659349230  0.482919584  0.195702902
##  [38,] -2.51889339 -0.590315163 -0.019370918 -0.136048774
##  [39,] -2.42152026  0.901161067 -0.192609402 -0.009705907
##  [40,] -2.16246625 -0.267981199  0.175296561  0.007023875
##  [41,] -2.27884081 -0.440240541 -0.034778398  0.106626042
##  [42,] -1.85191836  2.329610745  0.203552303  0.288896090
##  [43,] -2.54511203  0.477501017 -0.304745527 -0.066379077
##  [44,] -1.95788857 -0.470749613 -0.308567588  0.176501717
##  [45,] -2.12992356 -1.138415464 -0.247604064 -0.150539117
##  [46,] -2.06283361  0.708678586  0.063716370  0.139801160
##  [47,] -2.37677076 -1.116688691 -0.057026813 -0.151722682
##  [48,] -2.38638171  0.384957230 -0.139002234 -0.048671707
##  [49,] -2.22200263 -0.994627669  0.180886792 -0.014878291
##  [50,] -2.19647504 -0.009185585  0.152518539  0.049206884
##  [51,]  1.09810244 -0.860091033  0.682300393  0.034717469
##  [52,]  0.72889556 -0.592629362  0.093807452  0.004887251
##  [53,]  1.23683580 -0.614239894  0.552157058  0.009391933
##  [54,]  0.40612251  1.748546197  0.023024633  0.065549239
##  [55,]  1.07188379  0.207725147  0.396925784  0.104387166
##  [56,]  0.38738955  0.591302717 -0.123776885 -0.240027187
##  [57,]  0.74403715 -0.770438272 -0.148472007 -0.077111455
##  [58,] -0.48569562  1.846243998 -0.248432992 -0.040384912
##  [59,]  0.92480346 -0.032118478  0.594178807 -0.029779844
##  [60,]  0.01138804  1.030565784 -0.537100055 -0.028366154
##  [61,] -0.10982834  2.645211115  0.046634215  0.013714785
##  [62,]  0.43922201  0.063083852 -0.204389093  0.039992104
##  [63,]  0.56023148  1.758832129  0.763214554  0.045578465
##  [64,]  0.71715934  0.185602819  0.068429700 -0.164256922
##  [65,] -0.03324333  0.437537419 -0.194282030  0.108684396
##  [66,]  0.87248429 -0.507364239  0.501830204  0.104593326
##  [67,]  0.34908221  0.195656268 -0.489234095 -0.190869932
##  [68,]  0.15827980  0.789451008  0.301028700 -0.204612265
##  [69,]  1.22100316  1.616827281  0.480693656  0.225145511
##  [70,]  0.16436725  1.298259939  0.172260719 -0.051554138
##  [71,]  0.73521959 -0.395247446 -0.614467782 -0.083006045
##  [72,]  0.47469691  0.415926887  0.264067576  0.113189079
##  [73,]  1.23005729  0.930209441  0.367182178 -0.009911322
##  [74,]  0.63074514  0.414997441  0.290921638 -0.273304557
##  [75,]  0.70031506  0.063200094  0.444537765  0.043313222
##  [76,]  0.87135454 -0.249956017  0.471001057  0.101376117
##  [77,]  1.25231375  0.076998069  0.724727099  0.039556002
##  [78,]  1.35386953 -0.330205463  0.259955701  0.066604931
##  [79,]  0.66258066  0.225173502 -0.085577197 -0.036318171
##  [80,] -0.04012419  1.055183583  0.318506304  0.064571834
##  [81,]  0.13035846  1.557055553  0.149482697 -0.009371129
##  [82,]  0.02337438  1.567225244  0.240745761 -0.032663020
##  [83,]  0.24073180  0.774661195  0.150707074  0.023572390
##  [84,]  1.05755171  0.631726901 -0.104959762 -0.183354200
##  [85,]  0.22323093  0.286812663 -0.663028512 -0.253977520
##  [86,]  0.42770626 -0.842758920 -0.449129446 -0.109308985
##  [87,]  1.04522645 -0.520308714  0.394464890  0.037084781
##  [88,]  1.04104379  1.378371048  0.685997804  0.136378719
##  [89,]  0.06935597  0.218770433 -0.290605718 -0.146653279
##  [90,]  0.28253073  1.324886147 -0.089111491  0.008876070
##  [91,]  0.27814596  1.116288852 -0.094172116 -0.269753497
##  [92,]  0.62248441 -0.024839814  0.020412763 -0.147193289
##  [93,]  0.33540673  0.985103828  0.198724011  0.006508757
##  [94,] -0.36097409  2.012495825 -0.105467721  0.019505467
##  [95,]  0.28762268  0.852873116 -0.130452657 -0.107043742
##  [96,]  0.09105561  0.180587142 -0.128547696 -0.229191812
##  [97,]  0.22695654  0.383634868 -0.155691572 -0.132163118
##  [98,]  0.57446378  0.154356489  0.270743347 -0.019794366
##  [99,] -0.44617230  1.538637456 -0.189765199  0.199278855
## [100,]  0.25587339  0.596852285 -0.091572385 -0.058426315
## [101,]  1.83841002 -0.867515056 -1.002044077 -0.049085303
## [102,]  1.15401555  0.696536401 -0.528389994 -0.040385459
## [103,]  2.19790361 -0.560133976  0.202236658  0.058986583
## [104,]  1.43534213  0.046830701 -0.163083761 -0.234982858
## [105,]  1.86157577 -0.294059697 -0.394307408 -0.016243853
## [106,]  2.74268509 -0.797736709  0.580364827 -0.101045973
## [107,]  0.36579225  1.556289178 -0.983598122 -0.132679346
## [108,]  2.29475181 -0.418663020  0.649530452 -0.237246445
## [109,]  1.99998633  0.709063226  0.392675073 -0.086221779
## [110,]  2.25223216 -1.914596301 -0.396224508  0.104488870
## [111,]  1.35962064 -0.690443405 -0.283661780  0.107500284
## [112,]  1.59732747  0.420292431 -0.023108991  0.058136869
## [113,]  1.87761053 -0.417849815 -0.026250468  0.145926073
## [114,]  1.25590769  1.158379741 -0.578311891  0.098826244
## [115,]  1.46274487  0.440794883 -1.000517746  0.274738504
## [116,]  1.58476820 -0.673986887 -0.636297054  0.191222383
## [117,]  1.46651849 -0.254768327 -0.037306280 -0.154811637
## [118,]  2.41822770 -2.548124795  0.127454475 -0.272892966
## [119,]  3.29964148 -0.017721580  0.700957033  0.045037725
## [120,]  1.25954707  1.701046715  0.266643612 -0.064963167
## [121,]  2.03091256 -0.907427443 -0.234015510  0.167390481
## [122,]  0.97471535  0.569855257 -0.825362161  0.027662914
## [123,]  2.88797650 -0.412259950  0.854558973 -0.126911337
## [124,]  1.32878064  0.480202496  0.005410239  0.139491837
## [125,]  1.69505530 -1.010536476 -0.297454114 -0.061437911
## [126,]  1.94780139 -1.004412720  0.418582432 -0.217609339
## [127,]  1.17118007  0.315338060 -0.129503907  0.125001677
## [128,]  1.01754169 -0.064131184 -0.336588365 -0.008625505
## [129,]  1.78237879  0.186735633 -0.269754304  0.030983849
## [130,]  1.85742501 -0.560413289  0.713244682 -0.207519953
## [131,]  2.42782030 -0.258418706  0.725386035 -0.017863520
## [132,]  2.29723178 -2.617554417  0.491826144 -0.210968943
## [133,]  1.85648383  0.177953334 -0.352966242  0.099675959
## [134,]  1.11042770  0.291944582  0.182875741 -0.185721512
## [135,]  1.19845835  0.808606364  0.164173760 -0.487849130
## [136,]  2.78942561 -0.853942542  0.541093785  0.294893130
## [137,]  1.57099294 -1.065013214 -0.942695700  0.035486875
## [138,]  1.34179696 -0.421020154 -0.180271551 -0.214702016
## [139,]  0.92173701 -0.017165594 -0.415434449  0.005220919
## [140,]  1.84586124 -0.673870645  0.012629804  0.194543500
## [141,]  2.00808316 -0.611835930 -0.426902678  0.246711805
## [142,]  1.89543421 -0.687273065 -0.129640697  0.468128374
## [143,]  1.15401555  0.696536401 -0.528389994 -0.040385459
## [144,]  2.03374499 -0.864624030 -0.337014969  0.045036251
## [145,]  1.99147547 -1.045665670 -0.630301866  0.213330527
## [146,]  1.86425786 -0.385674038 -0.255418178  0.387957152
## [147,]  1.55935649  0.893692855  0.026283300  0.219456899
## [148,]  1.51609145 -0.268170747 -0.179576781  0.118773236
## [149,]  1.36820418 -1.007877934 -0.930278721  0.026041407
## [150,]  0.95744849  0.024250427 -0.526485033 -0.162533529

biplot(pcs3)

Chapter 4 Regularization Methods in Regression: Ridge Regression, Lasso, and Elastic Net

4.1 Review of Classifical multiple regression

The multiple linear regression model has the structure: \[y=\beta_0+\beta_1x_1 + \beta_2x_2 + \beta_3x_3 +\cdots+ \beta_px_p + \epsilon\] where \(y\) is the response variable, the \(x_i\)’s are predictors, \(\beta\)’s are coefficients, and \(\epsilon\) is the error term which is assumed to have a normal distribution with mean 0 and a constant variance \(\sigma^2\).

The R function that fits a multiple linear regression is lm(). The syntax is

\[lm(formula = y\sim{~}x_1 + x_2 + x_3 + \cdots + x_p, data = myDataFrame)\]

Since some of the x variables (predictors) might be correlated, the model fitting will have the so-called multi-collinearity issue which results in larger standard errors of the estimates of regression coefficients. To avoid such an issue, we can drop some predictors that are highly correlated to other predictors.

We would like to have parsimonious models which have a small number of predictors.

There are a few criteria that help select predictors:

\(R^2\), the ratio of regression sum of squares to the total sum of squares
adjusted \(R^2\), \(R^2_{adj}=1-\frac{n-1}{n-p-1}(1-R^2)\)
Mallow’s \(C_p\),
\(AIC\),
\(BIC\),

and so on.

The Mallow’s \(C_p\) criterion chooses the model with \(C_p\) equal to the number of coefficients including the intercept. It is defined as

\[C_p=\frac{SSE}{\hat{\sigma}^2_{full}}+2(p+1)-n\] where \(n\) is the number of observations, \(p\) is the number of predictors in the smaller model, \(SSE\) is the sum of squares of residuals. The quantity \(\hat{\sigma}^2_{full}\) is the estimate of the variance of the error term under the full model, a model with all predictors in.

The \(AIC\) (Akaike Information Criterion) is defined as

\[AIC=n\cdot log(SSE/n)+2p\] The \(BIC\) (Bayesian Information Criterion) is defined as

\[BIC=n\cdot log(SSE/n)+p\cdot log(n)\] The smaller the AIC/BIC, the better the model.

A reference: https://online.stat.psu.edu/stat462/node/197/

Here we show how to calculate Mallow’s \(C_p\) with R code. You can use the R package “leaps”.

full_model = lm(mpg ~ ., data = mtcars)
sigma.square.full = summary(full_model)$sigma^2 # The MSE of the full model

reduced_model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

p = 4 # 4 predictors used in the reduced model
n = nrow(mtcars) # Number of observations in data

sse.reduced = sum(reduced_model$residuals^2) # The SSE of the reduced model

Cp=sse.reduced/sigma.square.full + 2*(p+1) - n

Cp

## [1] 4.430434

Variable selection can be done using a “backward”, “forward”, or “stepwise” method.

# Backward selection starts with the full model
step = step(full_model, direction = "backward")

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790

summary(step) # Summarize the final model

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

pred = predict(step) # You can use validation data if you do have (which is recommended)
pred

##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##            22.47046            22.15825            26.28107            20.85744 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##            17.00959            20.85409            15.05390            21.64185 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##            25.35358            18.57872            19.31425            15.00803 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##            16.58481            16.87934            11.09757            10.21995 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            10.03900            27.80531            28.93187            29.76196 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##            24.49358            16.51238            17.37242            13.46931 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##            15.46018            28.14443            24.64460            27.34542 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            17.91365            20.70613            16.46963            24.46722

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## 
## Attaching package: 'forecast'

## The following object is masked from 'package:ggpubr':
## 
##     gghistogram

accuracy(pred, mtcars$mpg)

##                    ME    RMSE      MAE       MPE    MAPE
## Test set 3.441691e-15 2.30004 1.931954 -0.916667 9.54988

# Stepwise selection can start with the full model
step = step(full_model, direction = "both")

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## + cyl   1    0.0799 147.49 70.898
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## + vs    1    0.2685 147.57 68.915
## + cyl   1    0.1883 147.66 68.932
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## + carb  1     0.685 147.84 66.973
## + vs    1     0.434 148.09 67.028
## + cyl   1     0.414 148.11 67.032
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## + gear  1     1.565 148.53 65.121
## + cyl   1     1.003 149.09 65.242
## + vs    1     0.645 149.45 65.319
## + carb  1     0.107 149.99 65.434
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## + drat  1     3.345 150.09 63.457
## + gear  1     2.977 150.46 63.535
## + cyl   1     2.447 150.99 63.648
## + vs    1     1.121 152.32 63.927
## + carb  1     0.011 153.43 64.160
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + disp  1     6.629 153.44 62.162
## + carb  1     3.227 156.84 62.864
## + drat  1     1.428 158.64 63.229
## - qsec  1    20.225 180.29 63.323
## + cyl   1     0.249 159.82 63.465
## + vs    1     0.249 159.82 63.466
## + gear  1     0.171 159.90 63.481
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + carb  1     8.036 161.25 61.751
## + disp  1     3.276 166.01 62.682
## + cyl   1     1.501 167.78 63.022
## + drat  1     1.400 167.89 63.042
## + gear  1     0.123 169.16 63.284
## + vs    1     0.000 169.29 63.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790

summary(step)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

intercept.only_model = lm(mpg ~ 1, data = mtcars)

# Forward selection starts with the intercept only model, or other small model
step = step(intercept.only_model, 
            scope=list(lower=intercept.only_model, upper = full_model), 
            direction = "forward")

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + vs    1    54.228 224.09 68.283
## + carb  1    44.602 233.72 69.628
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## + gear  1     1.137 277.19 75.086
## + am    1     0.002 278.32 75.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## + carb  1   13.7724 177.40 62.805
## <none>              191.17 63.198
## + qsec  1   10.5674 180.60 63.378
## + gear  1    3.0281 188.14 64.687
## + disp  1    2.6796 188.49 64.746
## + vs    1    0.7059 190.47 65.080
## + am    1    0.1249 191.05 65.177
## + drat  1    0.0010 191.17 65.198
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + am    1    6.6228 170.00 63.442
## + disp  1    6.1762 170.44 63.526
## + carb  1    2.5187 174.10 64.205
## + drat  1    2.2453 174.38 64.255
## + qsec  1    1.4010 175.22 64.410
## + gear  1    0.8558 175.76 64.509
## + vs    1    0.0599 176.56 64.654

summary(step)

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

# Stepwise selection can also start with the intercept only model
step = step(intercept.only_model, 
            scope=list(lower=intercept.only_model, upper = full_model), 
            direction = "both")

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq     RSS     AIC
## + cyl   1     87.15  191.17  63.198
## + hp    1     83.27  195.05  63.840
## + qsec  1     82.86  195.46  63.908
## + vs    1     54.23  224.09  68.283
## + carb  1     44.60  233.72  69.628
## + disp  1     31.64  246.68  71.356
## <none>               278.32  73.217
## + drat  1      9.08  269.24  74.156
## + gear  1      1.14  277.19  75.086
## + am    1      0.00  278.32  75.217
## - wt    1    847.73 1126.05 115.943
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1    14.551 176.62 62.665
## + carb  1    13.772 177.40 62.805
## <none>              191.17 63.198
## + qsec  1    10.567 180.60 63.378
## + gear  1     3.028 188.14 64.687
## + disp  1     2.680 188.49 64.746
## + vs    1     0.706 190.47 65.080
## + am    1     0.125 191.05 65.177
## + drat  1     0.001 191.17 65.198
## - cyl   1    87.150 278.32 73.217
## - wt    1   117.162 308.33 76.494
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## - hp    1    14.551 191.17 63.198
## + am    1     6.623 170.00 63.442
## + disp  1     6.176 170.44 63.526
## - cyl   1    18.427 195.05 63.840
## + carb  1     2.519 174.10 64.205
## + drat  1     2.245 174.38 64.255
## + qsec  1     1.401 175.22 64.410
## + gear  1     0.856 175.76 64.509
## + vs    1     0.060 176.56 64.654
## - wt    1   115.354 291.98 76.750

summary(step)

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

All end up with the same model, though not necessarily the case in general.

4.2 Regularization Methods in Regression

A reference: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/

The following is from the above reference, with a minor modification.

The ordinary multiple linear regression model based on the ordinary least squares method performs poorly in a situation, where you have a data set containing a number of variables superior to the number of samples.

A better alternative is the penalized regression allowing to create a linear regression model that is penalized, for having too many variables in the model, by adding a constraint when using the least squares approach. This is also known as the shrinkage or regularization method.

The consequence of imposing this penalty is to reduce (i.e. shrink) the coefficient values towards zero, and thus allows the less contributive variables to have a coefficient close or equal to zero.

Note that, the shrinkage requires the selection of a tuning parameter (\(\lambda\)) that determines the amount of shrinkage.

Three most commonly used penalized regression methods are ridge regression, lasso regression and elastic net regression.

4.2.1 Ridge regression

Ridge regression shrinks the regression coefficients, so that variables, with minor contribution to the outcome, have their coefficients close to zero.

The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called \(L_2\)-norm, which is the sum of the squared coefficients.

The amount of the penalty can be fine-tuned using a constant called lambda, denoted by Greek letter \(\lambda\). Selecting a good value for \(\lambda\) is critical.

When \(\lambda=0\), the penalty term has no effect, and ridge regression will produce the classical least square coefficients. However, as \(\lambda\) increases to infinite, the impact of the shrinkage penalty grows, and the ridge regression coefficients will get close to zero.

Note that, in contrast to the ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression, so that all the predictors are on the same scale.

The standardization of a predictor x, can be achieved using the formula \(x'=\frac{x-mean(x)}{sd(x)}\), where \(mean(x)\) is the mean of \(x\) and \(sd(x)\) is the standard deviation. The consequence of this is that, all standardized predictors will have a mean of zero and a standard deviation of one allowing the final fit to not depend on the scale on which the predictors are measured.

One important advantage of the ridge regression, is that it still performs well, compared to the ordinary least squares method, in a situation where you have data with the number of predictors (\(p\)) larger than the number of observations (\(n\)).

One disadvantage of the ridge regression is that, it will include all the predictors in the final model, unlike the stepwise regression methods, which generally select models that involve a reduced set of variables.

Ridge regression shrinks the coefficients towards zero, but it does not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.

Required R packages for implementing the regularization methods are:

tidyverse for easy data manipulation and visualization
glmnet for computing penalized regression
caret for easy machine learning workflow

The key function to fit a penalized generalized linear model is glmnet(), which is used as follows:

\[\text{glmnet(x, y, alpha, lambda)},\] where

x: matrix of predictor variables
y: the response or outcome variable, which is a binary variable.
alpha: the elastic net mixing parameter. Allowed values include:
“1”: for lasso regression
“0”: for ridge regression
a value between 0 and 1 (say 0.3) for elastic net regression.
lamba: a numeric value defining the amount of shrinkage. Should be specify by analyst.

In penalized regression, you need to specify a constant lambda to adjust the amount of the coefficient shrinkage. The best lambda for your data, can be defined as the lambda that minimize the cross-validation prediction error rate. This can be determined automatically using the function cv.glmnet().

A note: when making prediction, glmnet uses the parameter “newx” instead of “newdata” for the predict() function!

# Load packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::ident()  masks dbplyr::ident()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::sql()    masks dbplyr::sql()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-10

# Load the data
data("Boston", package = "MASS")

# Split the data into training and test set.
# Write my own function to partition original dataset into training and validation sets
partition = function(Data, prop.train = 0.8){
  n = nrow(Data)
  shuffled.idx = sample(1:n)
  train.idx = shuffled.idx[1:round(n*0.8)]
  train.data = Data[train.idx, ]
  validation.data = Data[-train.idx, ]
  return(list(train.data = train.data, validation.data= validation.data))
}

partitioned.data = partition(Boston, 0.8)
train.data  <- partitioned.data$train.data
validation.data <- partitioned.data$validation.data

# Textbook partitions data by the function createDataPartition() from package "caret" 
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
validation.data <- Boston[-training.samples, ]

# Create a MATRIX containing only predictors and converting any categorical variable to dummy(0/1) variables
x = model.matrix(medv~., train.data)[, -1] # Remove the first column which is the intercept term
# Create the outcome/response variable
y = train.data$medv

# Apply cross validatation to get the best lambda value
cv <- cv.glmnet(x, y, alpha = 0) # alpha = 0 indicates ridge regression
names(cv) # Check What are in the output

##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "call"       "name"       "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"

# Display the best lambda value
cv$lambda.min

## [1] 0.6517234

# Apply the glmnet() function from the package "glmnet" to fit penalized generalized linear regression model with the best lambda, based on training data
model = glmnet(x, y, alpha = 0, lambda = cv$lambda.min) # alpha = 0 indicates ridge regression

# Display regression coefficients
coef(model)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  34.631640377
## crim         -0.085634685
## zn            0.033025289
## indus        -0.051766503
## chas          3.118054591
## nox         -14.733308405
## rm            3.354140359
## age           0.003372532
## dis          -1.259075878
## rad           0.161043798
## tax          -0.004290355
## ptratio      -0.914296828
## black         0.009535343
## lstat        -0.508521082

# Fit a multiple regression using the the ordinary least squares method
m = lm(medv~.,Boston)
m # The estimated coefficients are close to those based on the ridge regression method.

## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Coefficients:
## (Intercept)         crim           zn        indus         chas          nox  
##   3.646e+01   -1.080e-01    4.642e-02    2.056e-02    2.687e+00   -1.777e+01  
##          rm          age          dis          rad          tax      ptratio  
##   3.810e+00    6.922e-04   -1.476e+00    3.060e-01   -1.233e-02   -9.527e-01  
##       black        lstat  
##   9.312e-03   -5.248e-01

# Prepare data for predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(newx = x.new) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##       RMSE   Rsquare
## 1 4.405931 0.8127145

4.2.2 Lasso regression

Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called \(L_1\)-norm, which is the sum of the absolute coefficients.

In the case of lasso regression, the penalty has the effect of forcing some of the coefficient estimates, with a minor contribution to the model, to be exactly equal to zero. This means that, lasso can be also seen as an alternative to the subset selection methods for performing variable selection in order to reduce the complexity of the model.

As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical.

One obvious advantage of lasso regression over ridge regression, is that it produces simpler and more interpretable models that incorporate only a reduced set of the predictors. However, neither ridge regression nor the lasso will universally dominate the other.

Generally, lasso might perform better in a situation where some of the predictors have large coefficients, and the remaining predictors have very small coefficients.

Ridge regression will perform better when the outcome is a function of many predictors, all with coefficients of roughly equal size.

Cross-validation methods can be used for identifying which of these two techniques is better on a particular data set.

To fit a penalized regression using the lasso approach, the only difference from the R code used for ridge regression is that, for lasso regression you need to specify the argument alpha = 1 instead of alpha = 0 (for ridge regression).

# Find the best lambda using cross-validation
set.seed(123) 
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min

## [1] 0.006670591

# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Dsiplay regression coefficients
coef(model)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  44.844653530
## crim         -0.102419215
## zn            0.047942876
## indus        -0.013252552
## chas          2.947453217
## nox         -21.487992811
## rm            2.984649016
## age           0.009241279
## dis          -1.673374292
## rad           0.282577678
## tax          -0.008719058
## ptratio      -1.032726389
## black         0.009968408
## lstat        -0.573662729

# Make predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(newx = x.new) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##       RMSE   Rsquare
## 1 4.509415 0.7918737

4.2.3 Elastic Net

Elastic Net produces a regression model that is penalized with both the L1-norm and L2-norm. The consequence of this is to effectively shrink coefficients (like in ridge regression) and to set some coefficients to zero (as in LASSO).

The elastic net regression can be easily computed using the caret workflow, which invokes the glmnet package.

We use caret to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values, then selects the best values for lambda and alpha, resulting to a final model that is an elastic net model.

Here, we’ll test the combination of 10 different values for alpha and lambda. This is specified using the option tuneLength.

The best alpha and lambda values are those values that minimize the cross-validation error.

# Build the model using the training set
set.seed(123)
model <- train(
  medv ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Best tuning parameter
model$bestTune

##    alpha      lambda
## 72   0.8 0.006956104

# Display Coefficient of the final model. You need
# to specify the best lambda
coef(model$finalModel, model$bestTune$lambda)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##             s=0.006956104
## (Intercept)  44.726340725
## crim         -0.102058135
## zn            0.047808060
## indus        -0.013402014
## chas          2.946629814
## nox         -21.367234454
## rm            2.985301193
## age           0.009351051
## dis          -1.667687491
## rad           0.281394477
## tax          -0.008695110
## ptratio      -1.031377704
## black         0.009973175
## lstat        -0.574068870

# Make predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(x.new)
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##       RMSE   Rsquare
## 1 4.508176 0.7920561

4.3 Using the Workflow of the “caret” Package

We can easily compute and compare ridge, lasso and elastic net regression using the caret workflow.

The package “caret” will automatically choose the best tuning parameter values, compute the final model and evaluate the model performance using cross-validation techniques.

# Setup a grid range of lambda values:
lambda <- 10^seq(-3, 3, length = 100)

# 0. # Build the model based on lm()
set.seed(123)
lm <- train(
  medv ~., 
  data = train.data, 
  method = "lm"
)
# Model coefficients
lm$finalModel

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
## (Intercept)         crim           zn        indus         chas          nox  
##   45.321907    -0.104520     0.049274    -0.011388     2.945807   -21.773457  
##          rm          age          dis          rad          tax      ptratio  
##    2.958251     0.010386    -1.688717     0.294327    -0.009248    -1.037560  
##       black        lstat  
##    0.010034    -0.576143

# Make predictions
predictions <- lm %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##      RMSE   Rsquare
## 1 4.51628 0.7909431

# 1. # Build the model based on ridge regression
set.seed(123)
ridge <- train(
  medv ~., 
  data = train.data, 
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##               s=0.5336699
## (Intercept)  34.621096335
## crim         -0.085550948
## zn            0.033209753
## indus        -0.051966937
## chas          3.114192240
## nox         -14.666987631
## rm            3.347838529
## age           0.003595232
## dis          -1.258715007
## rad           0.161302070
## tax          -0.004321984
## ptratio      -0.913368038
## black         0.009540003
## lstat        -0.509440000

# Make predictions
predictions <- ridge %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##       RMSE   Rsquare
## 1 4.407198 0.8126339

# 2. Build the model based on lasso regression
set.seed(123)
lasso <- train(
  medv ~., data = train.data, method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = lambda)
  )

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##             s=0.004641589
## (Intercept)  44.770279125
## crim         -0.102144269
## zn            0.047880563
## indus        -0.013158091
## chas          2.945611446
## nox         -21.395017010
## rm            2.983468219
## age           0.009379338
## dis          -1.669226731
## rad           0.282063992
## tax          -0.008723950
## ptratio      -1.031837561
## black         0.009974593
## lstat        -0.574297542

# Make predictions
predictions <- lasso %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##      RMSE   Rsquare
## 1 4.50877 0.7919742

# 3. Build the model based on Elastic net:
set.seed(123)
elastic <- train(
  medv ~., data = train.data, method = "glmnet"
)
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##              s=0.01303447
## (Intercept)  44.586330957
## crim         -0.102702217
## zn            0.047949555
## indus        -0.015295877
## chas          2.958559125
## nox         -21.292591375
## rm            2.989772007
## age           0.009610116
## dis          -1.664543120
## rad           0.281268986
## tax          -0.008682153
## ptratio      -1.029589993
## black         0.009993821
## lstat        -0.572668774

# Make predictions
predictions <- elastic %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, validation.data$medv),
  Rsquare = R2(predictions, validation.data$medv)
)

##       RMSE   Rsquare
## 1 4.506433 0.7922856

# 4. Comparing performances of the models:
# The performance of the different models - ridge, lasso and elastic net - can be easily compared using caret. 
# The best model is defined as the one that minimizes the prediction error.

models <- list(lm = lm, ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary( metric = "RMSE")

## 
## Call:
## summary.resamples(object = ., metric = "RMSE")
## 
## Models: lm, ridge, lasso, elastic 
## Number of resamples: 25 
## 
## RMSE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lm      4.241198 4.858906 5.030561 5.051926 5.293270 5.534099    0
## ridge   4.290261 4.849826 5.121742 5.088865 5.398873 5.598911    0
## lasso   4.242698 4.849701 5.023601 5.052528 5.292296 5.532163    0
## elastic 4.241728 4.848898 5.021566 5.052012 5.292446 5.530059    0

It can be seen that the elastic net model has the lowest mean RMSE and ridge mode has the lowest median RMSE.

Chapter 5 Logistic Regression

The ordinary multiple regression handles a continuous response variable against a set of explanatory variables (or predictors, or features) and the error term is assumed to be normally distributed, with mean 0 and a constant variance. In other words, The distribution of the response variable is normally distributed with mean \(\mu\) depending on a linear combinations of other variables and variance that is a constant. In R, we use the lm() function to fit a multiple linear regression model.

When the response variable is not normally distributed, we use a more general method to fit a model. The model is called a generalized linear model. In R, we use glm() to fit such a model.

The (binary) logistic regression is a special generalized linear model, which handles a response that is a binary categorical variable. It assumes that the response variable has a Bernoulli distribution with a parameter \(p\) that depends on a linear combinations of other variables. To fit a logistic regression model, in R, we use the code: glm(y~x1+x2+x3+…, data = myData, family = something)

Let’s use the “iris” data fit some generalized linear models.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Fit an ordinary multiple regression model
m1 = glm(Sepal.Length ~ ., data = iris, family = "gaussian")
summary(m1)

## 
## Call:
## glm(formula = Sepal.Length ~ ., family = "gaussian", data = iris)
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09414226)
## 
##     Null deviance: 102.168  on 149  degrees of freedom
## Residual deviance:  13.556  on 144  degrees of freedom
## AIC: 79.116
## 
## Number of Fisher Scoring iterations: 2

# It is the same as
m2 = lm(Sepal.Length ~ ., data = iris)
summary(m2)

## 
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79424 -0.21874  0.00899  0.20255  0.73103 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

# Now, let's fit a logistic regression with Species as the response
# Since Species has 3 categories, we combine two categories into one so that we have two categories: "setosa" or not.

D = iris # We make a copy for iris data and modify D later

# Code the Species column as 0/1: 0 = versicolor and 1 = setosa
# The new variable is called Setosa
D$Setosa = ifelse(D$Species == "setosa", 1, 0)
D

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Setosa
## 1            5.1         3.5          1.4         0.2     setosa      1
## 2            4.9         3.0          1.4         0.2     setosa      1
## 3            4.7         3.2          1.3         0.2     setosa      1
## 4            4.6         3.1          1.5         0.2     setosa      1
## 5            5.0         3.6          1.4         0.2     setosa      1
## 6            5.4         3.9          1.7         0.4     setosa      1
## 7            4.6         3.4          1.4         0.3     setosa      1
## 8            5.0         3.4          1.5         0.2     setosa      1
## 9            4.4         2.9          1.4         0.2     setosa      1
## 10           4.9         3.1          1.5         0.1     setosa      1
## 11           5.4         3.7          1.5         0.2     setosa      1
## 12           4.8         3.4          1.6         0.2     setosa      1
## 13           4.8         3.0          1.4         0.1     setosa      1
## 14           4.3         3.0          1.1         0.1     setosa      1
## 15           5.8         4.0          1.2         0.2     setosa      1
## 16           5.7         4.4          1.5         0.4     setosa      1
## 17           5.4         3.9          1.3         0.4     setosa      1
## 18           5.1         3.5          1.4         0.3     setosa      1
## 19           5.7         3.8          1.7         0.3     setosa      1
## 20           5.1         3.8          1.5         0.3     setosa      1
## 21           5.4         3.4          1.7         0.2     setosa      1
## 22           5.1         3.7          1.5         0.4     setosa      1
## 23           4.6         3.6          1.0         0.2     setosa      1
## 24           5.1         3.3          1.7         0.5     setosa      1
## 25           4.8         3.4          1.9         0.2     setosa      1
## 26           5.0         3.0          1.6         0.2     setosa      1
## 27           5.0         3.4          1.6         0.4     setosa      1
## 28           5.2         3.5          1.5         0.2     setosa      1
## 29           5.2         3.4          1.4         0.2     setosa      1
## 30           4.7         3.2          1.6         0.2     setosa      1
## 31           4.8         3.1          1.6         0.2     setosa      1
## 32           5.4         3.4          1.5         0.4     setosa      1
## 33           5.2         4.1          1.5         0.1     setosa      1
## 34           5.5         4.2          1.4         0.2     setosa      1
## 35           4.9         3.1          1.5         0.2     setosa      1
## 36           5.0         3.2          1.2         0.2     setosa      1
## 37           5.5         3.5          1.3         0.2     setosa      1
## 38           4.9         3.6          1.4         0.1     setosa      1
## 39           4.4         3.0          1.3         0.2     setosa      1
## 40           5.1         3.4          1.5         0.2     setosa      1
## 41           5.0         3.5          1.3         0.3     setosa      1
## 42           4.5         2.3          1.3         0.3     setosa      1
## 43           4.4         3.2          1.3         0.2     setosa      1
## 44           5.0         3.5          1.6         0.6     setosa      1
## 45           5.1         3.8          1.9         0.4     setosa      1
## 46           4.8         3.0          1.4         0.3     setosa      1
## 47           5.1         3.8          1.6         0.2     setosa      1
## 48           4.6         3.2          1.4         0.2     setosa      1
## 49           5.3         3.7          1.5         0.2     setosa      1
## 50           5.0         3.3          1.4         0.2     setosa      1
## 51           7.0         3.2          4.7         1.4 versicolor      0
## 52           6.4         3.2          4.5         1.5 versicolor      0
## 53           6.9         3.1          4.9         1.5 versicolor      0
## 54           5.5         2.3          4.0         1.3 versicolor      0
## 55           6.5         2.8          4.6         1.5 versicolor      0
## 56           5.7         2.8          4.5         1.3 versicolor      0
## 57           6.3         3.3          4.7         1.6 versicolor      0
## 58           4.9         2.4          3.3         1.0 versicolor      0
## 59           6.6         2.9          4.6         1.3 versicolor      0
## 60           5.2         2.7          3.9         1.4 versicolor      0
## 61           5.0         2.0          3.5         1.0 versicolor      0
## 62           5.9         3.0          4.2         1.5 versicolor      0
## 63           6.0         2.2          4.0         1.0 versicolor      0
## 64           6.1         2.9          4.7         1.4 versicolor      0
## 65           5.6         2.9          3.6         1.3 versicolor      0
## 66           6.7         3.1          4.4         1.4 versicolor      0
## 67           5.6         3.0          4.5         1.5 versicolor      0
## 68           5.8         2.7          4.1         1.0 versicolor      0
## 69           6.2         2.2          4.5         1.5 versicolor      0
## 70           5.6         2.5          3.9         1.1 versicolor      0
## 71           5.9         3.2          4.8         1.8 versicolor      0
## 72           6.1         2.8          4.0         1.3 versicolor      0
## 73           6.3         2.5          4.9         1.5 versicolor      0
## 74           6.1         2.8          4.7         1.2 versicolor      0
## 75           6.4         2.9          4.3         1.3 versicolor      0
## 76           6.6         3.0          4.4         1.4 versicolor      0
## 77           6.8         2.8          4.8         1.4 versicolor      0
## 78           6.7         3.0          5.0         1.7 versicolor      0
## 79           6.0         2.9          4.5         1.5 versicolor      0
## 80           5.7         2.6          3.5         1.0 versicolor      0
## 81           5.5         2.4          3.8         1.1 versicolor      0
## 82           5.5         2.4          3.7         1.0 versicolor      0
## 83           5.8         2.7          3.9         1.2 versicolor      0
## 84           6.0         2.7          5.1         1.6 versicolor      0
## 85           5.4         3.0          4.5         1.5 versicolor      0
## 86           6.0         3.4          4.5         1.6 versicolor      0
## 87           6.7         3.1          4.7         1.5 versicolor      0
## 88           6.3         2.3          4.4         1.3 versicolor      0
## 89           5.6         3.0          4.1         1.3 versicolor      0
## 90           5.5         2.5          4.0         1.3 versicolor      0
## 91           5.5         2.6          4.4         1.2 versicolor      0
## 92           6.1         3.0          4.6         1.4 versicolor      0
## 93           5.8         2.6          4.0         1.2 versicolor      0
## 94           5.0         2.3          3.3         1.0 versicolor      0
## 95           5.6         2.7          4.2         1.3 versicolor      0
## 96           5.7         3.0          4.2         1.2 versicolor      0
## 97           5.7         2.9          4.2         1.3 versicolor      0
## 98           6.2         2.9          4.3         1.3 versicolor      0
## 99           5.1         2.5          3.0         1.1 versicolor      0
## 100          5.7         2.8          4.1         1.3 versicolor      0
## 101          6.3         3.3          6.0         2.5  virginica      0
## 102          5.8         2.7          5.1         1.9  virginica      0
## 103          7.1         3.0          5.9         2.1  virginica      0
## 104          6.3         2.9          5.6         1.8  virginica      0
## 105          6.5         3.0          5.8         2.2  virginica      0
## 106          7.6         3.0          6.6         2.1  virginica      0
## 107          4.9         2.5          4.5         1.7  virginica      0
## 108          7.3         2.9          6.3         1.8  virginica      0
## 109          6.7         2.5          5.8         1.8  virginica      0
## 110          7.2         3.6          6.1         2.5  virginica      0
## 111          6.5         3.2          5.1         2.0  virginica      0
## 112          6.4         2.7          5.3         1.9  virginica      0
## 113          6.8         3.0          5.5         2.1  virginica      0
## 114          5.7         2.5          5.0         2.0  virginica      0
## 115          5.8         2.8          5.1         2.4  virginica      0
## 116          6.4         3.2          5.3         2.3  virginica      0
## 117          6.5         3.0          5.5         1.8  virginica      0
## 118          7.7         3.8          6.7         2.2  virginica      0
## 119          7.7         2.6          6.9         2.3  virginica      0
## 120          6.0         2.2          5.0         1.5  virginica      0
## 121          6.9         3.2          5.7         2.3  virginica      0
## 122          5.6         2.8          4.9         2.0  virginica      0
## 123          7.7         2.8          6.7         2.0  virginica      0
## 124          6.3         2.7          4.9         1.8  virginica      0
## 125          6.7         3.3          5.7         2.1  virginica      0
## 126          7.2         3.2          6.0         1.8  virginica      0
## 127          6.2         2.8          4.8         1.8  virginica      0
## 128          6.1         3.0          4.9         1.8  virginica      0
## 129          6.4         2.8          5.6         2.1  virginica      0
## 130          7.2         3.0          5.8         1.6  virginica      0
## 131          7.4         2.8          6.1         1.9  virginica      0
## 132          7.9         3.8          6.4         2.0  virginica      0
## 133          6.4         2.8          5.6         2.2  virginica      0
## 134          6.3         2.8          5.1         1.5  virginica      0
## 135          6.1         2.6          5.6         1.4  virginica      0
## 136          7.7         3.0          6.1         2.3  virginica      0
## 137          6.3         3.4          5.6         2.4  virginica      0
## 138          6.4         3.1          5.5         1.8  virginica      0
## 139          6.0         3.0          4.8         1.8  virginica      0
## 140          6.9         3.1          5.4         2.1  virginica      0
## 141          6.7         3.1          5.6         2.4  virginica      0
## 142          6.9         3.1          5.1         2.3  virginica      0
## 143          5.8         2.7          5.1         1.9  virginica      0
## 144          6.8         3.2          5.9         2.3  virginica      0
## 145          6.7         3.3          5.7         2.5  virginica      0
## 146          6.7         3.0          5.2         2.3  virginica      0
## 147          6.3         2.5          5.0         1.9  virginica      0
## 148          6.5         3.0          5.2         2.0  virginica      0
## 149          6.2         3.4          5.4         2.3  virginica      0
## 150          5.9         3.0          5.1         1.8  virginica      0

D = D[, -5] # Drop the original Species column

n = nrow(D)
set.seed(3.14)
idx = sample(1:n)
n.rows.training = n * 0.6  # 60% data are used as training data
trainingData = D[idx[1:n.rows.training], ] # Form the training data
validationData = D[-idx[1:n.rows.training], ] # Form the validation data

library(ggplot2)
ggplot(trainingData, aes(x = Sepal.Length, y = Sepal.Width, color = factor(Setosa, levels = c(1, 0), labels = c("Setosa", "Not")))) +
  geom_point()

# Fit a generalized linear model with binomial family; that is fit a logistic regression model
m3 = glm(Setosa ~ Sepal.Length, data = trainingData, family = binomial)

actual = validationData$Setosa
pred = predict(m3, newdata = validationData[, -5], type = "response") # Predict the probabilities

data.frame(actual, pred) # Compare actual against predicted

##     actual         pred
## 1        1 7.786169e-01
## 13       1 9.318040e-01
## 14       1 9.924347e-01
## 21       1 4.751484e-01
## 23       1 9.712369e-01
## 24       1 7.786169e-01
## 25       1 9.318040e-01
## 27       1 8.468356e-01
## 28       1 6.910971e-01
## 30       1 9.555154e-01
## 31       1 9.318040e-01
## 32       1 4.751484e-01
## 34       1 3.654331e-01
## 35       1 8.968187e-01
## 39       1 9.881584e-01
## 41       1 8.468356e-01
## 42       1 9.815098e-01
## 44       1 8.468356e-01
## 45       1 7.786169e-01
## 49       1 5.873164e-01
## 51       0 6.503024e-04
## 52       0 9.725824e-03
## 53       0 1.021918e-03
## 55       0 6.208747e-03
## 59       0 3.958440e-03
## 60       0 6.910971e-01
## 63       0 5.658760e-02
## 72       0 3.675319e-02
## 79       0 5.658760e-02
## 81       0 3.654331e-01
## 82       0 3.654331e-01
## 85       0 4.751484e-01
## 88       0 1.520474e-02
## 89       0 2.681105e-01
## 91       0 3.654331e-01
## 92       0 3.675319e-02
## 93       0 1.290964e-01
## 99       0 7.786169e-01
## 100      0 1.889877e-01
## 102      0 1.290964e-01
## 105      0 6.208747e-03
## 106      0 4.311279e-05
## 110      0 2.632444e-04
## 111      0 6.208747e-03
## 116      0 9.725824e-03
## 121      0 1.021918e-03
## 123      0 2.742526e-05
## 124      0 1.520474e-02
## 126      0 2.632444e-04
## 127      0 2.369628e-02
## 128      0 3.675319e-02
## 129      0 9.725824e-03
## 131      0 1.065376e-04
## 134      0 1.520474e-02
## 135      0 3.675319e-02
## 138      0 9.725824e-03
## 141      0 2.521668e-03
## 142      0 1.021918e-03
## 145      0 2.521668e-03
## 147      0 1.520474e-02

# Make a confusion matrix and report accuracy
library(caret)
pred.label = factor((as.numeric(pred) > 0.5) * 1, levels = c(1,0)) # Must convert to a factor
# The reason that levels are set to the order 1 and 0 is to match the modeled probability of being "1"
act = actual %>% factor(levels = c(1,0)) # Must convert to a factor
cm = confusionMatrix(pred.label, act) # The output
cm # The confusion matrix is not printed in the way that rows are actual labels.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  0
##          1 17  2
##          0  3 38
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8161, 0.9724)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 5.6e-06         
##                                           
##                   Kappa : 0.8101          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8500          
##             Specificity : 0.9500          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.9268          
##              Prevalence : 0.3333          
##          Detection Rate : 0.2833          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##

t(cm[[2]]) # Transpose the confusion matrix and print it in a nicer way.

##          Prediction
## Reference  1  0
##         1 17  3
##         0  2 38

# Create the gains table using the "gains" package
library(gains)
gain = gains(actual, pred, groups = 10) # Default is 10 groups
gain

## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##   13     8      8      1.00      1.00      40.0%     300    300      0.96
##   20     4     12      1.00      1.00      60.0%     300    300      0.86
##   30     6     18      0.67      0.89      80.0%     200    267      0.75
##   43     8     26      0.50      0.77     100.0%     150    231      0.43
##   50     4     30      0.00      0.67     100.0%       0    200      0.18
##   60     6     36      0.00      0.56     100.0%       0    167      0.04
##   75     9     45      0.00      0.44     100.0%       0    133      0.01
##   80     3     48      0.00      0.42     100.0%       0    125      0.01
##   90     6     54      0.00      0.37     100.0%       0    111      0.00
##  100     6     60      0.00      0.33     100.0%       0    100      0.00

## Plot the lift chart: use of two columns from the "gain" output
plot(c(0, gain$cume.pct.of.total * sum(actual)) ~ c(0, gain$cume.obs), main = "Lift Chart", xlab = "# Cases", ylab = "Cumulative", type = "l")

lines(c(0, sum(actual)) ~ c(0, nrow(validationData)), lty = 2)

# Create a decile-wise chart
heights = gain$mean.resp/mean(actual)
midpoints = barplot(heights, names.arg = gain$depth, xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise Lift Chart", ylim = c(0, 3.5))

## Add labels to columns
text(midpoints, heights + 0.1, labels = round(heights, 1), cex = 0.8)

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc = roc(actual~pred)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc, main = "ROC Curve")

auc(roc) # Calculate the area under ROC curve: It's called AUC

## Area under the curve: 0.9819

## Using package "blorr" to create Decile-wise Lift Charts
## Reference: <http://finzi.psych.upenn.edu/library/blorr/html/blr_decile_lift_chart.html>
library(blorr)
model <- glm(honcomp ~ female + read + science, data = hsb2,
             family = binomial)
gt <- blr_gains_table(model)
blr_decile_lift_chart(gt)

Lecture Notes for Data Mining

Shiju Zhang