Learning Objective 1:
Describe probability as a foundation of statistical modeling, including
inference and maximum likelihood estimation.
Dataset: Auto (from ISLR2)
Goal: Show that you understand probability foundations, inference,
and maximum likelihood through regression.
For my portfolio I am using the Auto dataset; let's start with importing and setting up the data.
Import & Setup
Loading required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.9 ✔ rsample 1.3.1
## ✔ dials 1.4.1 ✔ tune 1.3.0
## ✔ infer 1.0.9 ✔ workflows 1.2.0
## ✔ modeldata 1.5.0 ✔ workflowsets 1.1.1
## ✔ parsnip 1.3.2 ✔ yardstick 1.3.2
## ✔ recipes 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(ISLR2)
library(GGally)
library(ggcorrplot)
library(broom)
library(skimr)
Data Loading & Basic Cleaning / Summary
Load Auto dataset and inspect
data("Auto", package = "ISLR2")
auto <- Auto
Quick structure and head
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Remove 'name' if present, ensure numeric types, and check for missing values
if("name" %in% names(auto)) auto <- auto %>% select(-name)
colSums(is.na(auto))
## mpg cylinders displacement horsepower weight acceleration
## 0 0 0 0 0 0
## year origin
## 0 0
Summary Statistics
# Basic summary statistics
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin
## Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :15.50 Median :76.00 Median :1.000
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
# Optional: a nicer skim of the dataset
skimr::skim(auto)
Data summary

| Name                           | auto |
| Number of rows                 | 392  |
| Number of columns              | 8    |
| Column type frequency: numeric | 8    |
| Group variables                | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean    | sd     | p0   | p25     | p50     | p75     | p100   | hist  |
| mpg           | 0         | 1             | 23.45   | 7.81   | 9    | 17.00   | 22.75   | 29.00   | 46.6   | ▆▇▆▃▁ |
| cylinders     | 0         | 1             | 5.47    | 1.71   | 3    | 4.00    | 4.00    | 8.00    | 8.0    | ▇▁▃▁▅ |
| displacement  | 0         | 1             | 194.41  | 104.64 | 68   | 105.00  | 151.00  | 275.75  | 455.0  | ▇▂▂▃▁ |
| horsepower    | 0         | 1             | 104.47  | 38.49  | 46   | 75.00   | 93.50   | 126.00  | 230.0  | ▆▇▃▁▁ |
| weight        | 0         | 1             | 2977.58 | 849.40 | 1613 | 2225.25 | 2803.50 | 3614.75 | 5140.0 | ▇▇▅▅▂ |
| acceleration  | 0         | 1             | 15.54   | 2.76   | 8    | 13.78   | 15.50   | 17.02   | 24.8   | ▁▆▇▂▁ |
| year          | 0         | 1             | 75.98   | 3.68   | 70   | 73.00   | 76.00   | 79.00   | 82.0   | ▇▆▇▆▇ |
| origin        | 0         | 1             | 1.58    | 0.81   | 1    | 1.00    | 1.00    | 2.00    | 3.0    | ▇▁▂▁▂ |
Exploratory Data Analysis
Distribution plots (histograms + density)
ggplot(auto, aes(x = mpg)) +
geom_histogram(bins = 30, alpha = 0.8) +
geom_density(aes(y = after_stat(count) * diff(range(auto$mpg)) / 30), color = "blue", linewidth = 0.8) +
labs(title = "Distribution of mpg", x = "mpg", y = "count") +
theme_minimal()

ggplot(auto, aes(x = horsepower)) +
geom_histogram(bins = 30, alpha = 0.8) +
geom_density(aes(y = after_stat(count) * diff(range(auto$horsepower)) / 30), color = "blue", linewidth = 0.8) +
labs(title = "Distribution of horsepower", x = "horsepower", y = "count") +
theme_minimal()
Distribution of Key Variables (mpg and horsepower)
The histogram of mpg shows a right-skewed distribution, with most
vehicles achieving between 15–30 miles per gallon and fewer extremely
high-mpg cars. This skewness is expected because very fuel-efficient
cars were relatively rare in the time period represented by the Auto
dataset. The distribution also appears somewhat multimodal, suggesting
that different groups of cars (perhaps from different years or engine
sizes) contribute to the mpg variation.
The histogram of horsepower shows a slightly right-skewed
distribution as well, with most cars falling between 70 and 150
horsepower. There are a handful of extremely high-horsepower vehicles,
which appear as outliers in the boxplot. These outliers could influence
model fit, so identifying them early is helpful for later
diagnostics.
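To put a number on the visual impression of skew, a simple moment-based skewness coefficient can be computed in base R (a minimal sketch; skewness() here is a small helper written for illustration, not a package function):
# Moment-based sample skewness: positive values indicate a right-skewed distribution
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)
skewness(auto$mpg)
skewness(auto$horsepower)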
Boxplots
ggplot(auto, aes(y = mpg)) +
geom_boxplot() +
labs(title = "Boxplot of mpg", y = "mpg") +
theme_minimal()
ggplot(auto, aes(y = horsepower)) +
geom_boxplot() +
labs(title = "Boxplot of horsepower", y = "horsepower") +
theme_minimal()
The boxplot of horsepower reveals several high-end outliers above 200
horsepower. These observations are legitimate but represent vehicles
that are less common in the dataset. While they do not prevent us from
fitting a linear regression model, they could have an impact on residual
patterns and leverage points. It is good practice to keep these in mind
when later interpreting diagnostic plots such as Cook’s distance.
Correlation matrix and heatmap for numeric variables
numeric_vars <- auto %>% select(where(is.numeric))
cor_mat <- cor(numeric_vars, use = "pairwise.complete.obs")
round(cor_mat, 2)
## mpg cylinders displacement horsepower weight acceleration year
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42 0.58
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50 -0.35
## displacement -0.81 0.95 1.00 0.90 0.93 -0.54 -0.37
## horsepower -0.78 0.84 0.90 1.00 0.86 -0.69 -0.42
## weight -0.83 0.90 0.93 0.86 1.00 -0.42 -0.31
## acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00 0.29
## year 0.58 -0.35 -0.37 -0.42 -0.31 0.29 1.00
## origin 0.57 -0.57 -0.61 -0.46 -0.59 0.21 0.18
## origin
## mpg 0.57
## cylinders -0.57
## displacement -0.61
## horsepower -0.46
## weight -0.59
## acceleration 0.21
## year 0.18
## origin 1.00
ggcorrplot::ggcorrplot(cor_mat, hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 3, title = "Correlation heatmap (numeric vars)") +
theme_minimal()
Correlation Heatmap Interpretation
The correlation heatmap clearly shows that mpg is strongly negatively correlated with horsepower (-0.78), displacement (-0.81), weight (-0.83), and cylinders (-0.78). This matches expectations: larger, heavier, more powerful engines generally deliver lower fuel efficiency.
Acceleration and year show moderate positive correlations with mpg (0.42 and 0.58, respectively), suggesting that newer cars and those with better acceleration tend to have somewhat higher fuel efficiency. Most of the engine-related
variables (weight, cylinders, displacement, horsepower) are also highly
correlated with each other, which is typical of automotive data where
engine components scale together.
These patterns support using horsepower as a meaningful predictor of
mpg, since the negative relationship appears both strong and
consistent.
ggpairs for selected quantitative variables
GGally::ggpairs(numeric_vars %>% select(mpg, horsepower, displacement, weight, acceleration, year),
columns = 1:6,
title = "GGpairs: selected quantitative variables")
Pairwise Plots (GGpairs)
The ggpairs matrix reinforces the correlation heatmap. The
scatterplots show a clear downward trend between mpg and horsepower, and
between mpg and displacement or weight. The density plots on the
diagonal again confirm the skewness observed in the histograms.
A notable point from the ggpairs plot is the clustering of observations, which suggests the dataset includes different groups of cars. This clustering helps explain why the year variable is moderately correlated with mpg (0.58): newer vehicles tend to be more fuel-efficient and lower in weight.
The ggpairs plot also highlights multicollinearity among the engine variables; for example, displacement and cylinders have a correlation of about 0.95, which would matter if we were doing multiple regression. For simple regression with a single predictor (horsepower), however, this is not a concern.
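If a multiple regression were fit, that collinearity could be quantified with variance inflation factors. A brief sketch, assuming the car package (used later in this portfolio) is installed:
# VIFs for a hypothetical multiple regression of mpg on the engine variables;
# values well above 5-10 would flag problematic collinearity
mlr_fit <- lm(mpg ~ horsepower + weight + displacement + cylinders, data = auto)
car::vif(mlr_fit)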
Scatterplot: mpg vs horsepower with linear trend
ggplot(auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Scatterplot: mpg vs horsepower", x = "horsepower", y = "mpg") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot with a fitted regression line shows a clear, strong,
negative linear trend between horsepower and mpg. As horsepower
increases, mpg decreases almost linearly. Although the points are
spread, the general pattern is unmistakable and indicates that a linear
model is appropriate for capturing the main trend.
There is some variability at lower horsepower levels, suggesting that
cars with similar horsepower can still have different mpg depending on
weight, year, or engine design. Despite this, the downward slope remains
consistent across the range of horsepower values.
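The strength of this relationship can also be quantified with a correlation test (a quick sketch; the estimate should match the -0.78 seen in the correlation matrix above):
# Pearson correlation between horsepower and mpg, with a 95% CI and p-value
cor.test(auto$horsepower, auto$mpg)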
Overall, the exploratory data analysis strongly supports fitting a
linear regression model using horsepower to predict mpg. The
distributions are reasonable, the correlation is strong and negative,
and the scatterplot confirms a linear pattern. Outliers are present but
do not distort the overall relationship. These visual and numerical
findings provide the foundation for the modeling and inference that
follow, aligning directly with Learning Objective 1.
Regression model: mpg ~ horsepower using tidymodels
Specify the model (linear regression via lm engine)
lm_spec <- linear_reg() %>%
set_engine("lm")
Recipe
# Using a recipe helps demonstrate tidy modeling workflow
rec <- recipe(mpg ~ horsepower, data = auto)
Workflow and fit
wf <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(rec)
mod_slr <- wf %>% fit(data = auto)
Extract tidy coefficients (from underlying lm)
# The parsnip fit stores the underlying lm object at mod_slr$fit$fit
tidy_mod <- broom::tidy(mod_slr$fit$fit, conf.int = TRUE)
tidy_mod
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 39.9 0.717 55.7 1.22e-187 38.5 41.3
## 2 horsepower -0.158 0.00645 -24.5 7.03e- 81 -0.171 -0.145
Model-level metrics (R-squared, residual standard error)
glance_mod <- broom::glance(mod_slr$fit$fit)
glance_mod
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.606 0.605 4.91 600. 7.03e-81 1 -1179. 2363. 2375.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Print a friendly formula-like summary
cat("Fitted model: mpg =", round(tidy_mod$estimate[2], 3), "* horsepower +", round(tidy_mod$estimate[1], 3), "\n")
## Fitted model: mpg = -0.158 * horsepower + 39.936
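Because this learning objective highlights maximum likelihood, it is worth checking that maximizing a Gaussian log-likelihood reproduces the lm() coefficients (a sketch under the usual normal-errors assumption; the starting values passed to optim() are arbitrary):
# Negative Gaussian log-likelihood for mpg = b0 + b1 * horsepower + error
negloglik <- function(par) {
  mu <- par[1] + par[2] * auto$horsepower
  -sum(dnorm(auto$mpg, mean = mu, sd = exp(par[3]), log = TRUE))
}
mle <- optim(c(30, -0.1, log(5)), negloglik)
mle$par[1:2]  # should closely match the intercept and slope reported above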
Model diagnostics
Residuals vs Fitted
train_aug <- augment(mod_slr, new_data = auto)
ggplot(train_aug, aes(x = .pred, y = .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted", x = "Predicted values", y = "Residuals") +
theme_minimal()
Linearity and Heteroscedasticity
The Residuals vs Fitted plot provides insight into the assumptions of
linearity and constant variance in my simple regression model of mpg on
horsepower. Overall, the residuals are centered around zero without any
strong curvature, which suggests that the linear relationship between
horsepower and mpg is reasonable for this dataset. However, the plot
also reveals some mild heteroscedasticity: the spread of the residuals
increases as the predicted mpg values increase. In other words, the
model’s errors are smaller and more tightly clustered for cars with low
mpg, but the variability becomes larger among cars with higher mpg. This
pattern is common in automotive data, as fuel efficiency in lighter or
smaller-engined cars can be influenced by many factors beyond horsepower
alone. Despite this increasing spread, there are no obvious patterns or
clusters of points that would indicate a major violation of the linear
model assumptions, and no extreme residuals suggesting problematic
outliers.
QQ plot
ggplot(train_aug, aes(sample = .resid)) +
stat_qq(alpha = 0.6) +
stat_qq_line() +
labs(title = "QQ Plot of Residuals") +
theme_minimal()
The QQ plot of the residuals further evaluates whether the error
terms follow a normal distribution, which is important for constructing
valid confidence intervals and conducting hypothesis tests. In my model,
the residuals follow the theoretical normal line closely through the
center of the distribution, indicating that the bulk of the residuals
are approximately normal. At the extreme ends of the plot, however, the
points begin to deviate from the reference line, suggesting slightly
heavier tails than would be expected under perfect normality. This means
that there are a few observations with residuals that are more extreme
than a perfectly normal distribution would produce. These slight
departures from normality are not uncommon, especially in simple
regression models with only one predictor, and they do not undermine the
overall interpretation of the relationship between horsepower and
mpg.
In summary, while the diagnostic plots reveal minor violations such
as small deviations from normality and some heteroscedasticity, the
simple linear regression model remains appropriate for capturing the
strong negative relationship between horsepower and fuel efficiency. The
diagnostic results support the validity of the model’s inference and
prediction outputs within the context of Learning Objective 1.
Scale-Location Plot
ggplot(train_aug, aes(x = .pred, y = sqrt(abs(.resid)))) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Scale-Location Plot", x = "Predicted values", y = "Sqrt(|Residuals|)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Histogram of Residuals
ggplot(train_aug, aes(x = .resid)) +
geom_histogram(bins = 30, alpha = 0.8) +
labs(title = "Histogram of Residuals", x = "Residuals", y = "Count") +
theme_minimal()
Predictions & Inference
# Predict mpg at horsepower = 98
new_df <- tibble(horsepower = 98)
# 6A: Point prediction
pred_point <- predict(mod_slr, new_data = new_df)
pred_point
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 24.5
95% confidence and prediction intervals
# 95% confidence interval for the MEAN response
conf_int <- predict(mod_slr, new_data = new_df, type = "conf_int")
conf_int
## # A tibble: 1 × 2
## .pred_lower .pred_upper
## <dbl> <dbl>
## 1 24.0 25.0
# 95% prediction interval for a NEW observation
pred_int <- predict(mod_slr, new_data = new_df, type = "pred_int")
pred_int
## # A tibble: 1 × 2
## .pred_lower .pred_upper
## <dbl> <dbl>
## 1 14.8 34.1
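As a cross-check (a sketch that pulls the underlying lm object out of the workflow with extract_fit_engine()), base R's predict() should return the same intervals:
# Same point estimate, confidence interval, and prediction interval from the plain lm fit
lm_fit <- extract_fit_engine(mod_slr)
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "confidence")
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "prediction")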
Combined prediction table and coefficient table with confidence intervals
# Combine into one table
bind_cols(new_df, pred_point, conf_int, pred_int)
## New names:
## • `.pred_lower` -> `.pred_lower...3`
## • `.pred_upper` -> `.pred_upper...4`
## • `.pred_lower` -> `.pred_lower...5`
## • `.pred_upper` -> `.pred_upper...6`
## # A tibble: 1 × 6
## horsepower .pred .pred_lower...3 .pred_upper...4 .pred_lower...5
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 98 24.5 24.0 25.0 14.8
## # ℹ 1 more variable: .pred_upper...6 <dbl>
# Coefficient table with confidence intervals
coef_tbl <- broom::tidy(mod_slr$fit$fit, conf.int = TRUE)
coef_tbl
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 39.9 0.717 55.7 1.22e-187 38.5 41.3
## 2 horsepower -0.158 0.00645 -24.5 7.03e- 81 -0.171 -0.145
Interpretation of the Regression Coefficients:
The estimated intercept is 39.94, meaning that if horsepower were
equal to zero, the model predicts an mpg of approximately 39.94.
Although horsepower = 0 is not realistic for actual cars, the intercept
provides a baseline from which the effect of horsepower is measured.
The estimated slope for horsepower is –0.1578, with a 95% confidence
interval from –0.1705 to –0.1452. This means that for each one-unit
increase in horsepower, the expected miles per gallon decreases by about
0.158 mpg, on average. The entire confidence interval is negative,
showing a consistently decreasing trend.
The p-value for horsepower is 7.031989e-81, which is essentially
zero. This provides overwhelming statistical evidence that horsepower is
a significant predictor of mpg. In other words, we reject the null
hypothesis that horsepower has no effect on mpg.
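The pieces of this test can be reproduced by hand from the coefficient table (a small sketch; the 390 degrees of freedom come from n = 392 observations minus 2 estimated coefficients):
# t statistic = estimate / standard error; two-sided p-value from the t distribution
t_stat <- -0.1578 / 0.00645
t_stat
2 * pt(-abs(t_stat), df = 392 - 2)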
Learning Objective 1: Reflection
Working through this analysis helped me clearly understand how
probability and inference form the foundation of statistical modeling.
By fitting a simple linear regression model to predict mpg from
horsepower, I was able to see how the assumptions of the model, such as linearity, normality of residuals, and constant variance, are connected to probability concepts we discussed in class. The regression line is estimated by least squares, which coincides with maximum likelihood estimation when the errors are assumed normal, and the slope estimate comes with a standard error, confidence interval, and p-value that are all rooted in probability theory. Seeing these values
produced from my model helped me understand how statistical inference
quantifies uncertainty and supports conclusions about relationships
between variables.
The exploratory data analysis was also an important part of this
learning objective. Before fitting any model, I examined the
distributions of mpg and horsepower, checked for outliers, and
visualized the correlation patterns among variables. This step made me
feel more confident that a linear model was appropriate and helped me
anticipate some of the issues I later observed in the residual plots.
After fitting the model, interpreting the coefficient table and
diagnostic plots allowed me to connect the theoretical ideas from
lectures—like sampling distributions and assumptions of the error
term—to real data. For example, the QQ plot helped me evaluate whether
the residuals followed a normal distribution, and the
residuals-vs-fitted plot helped me assess whether variance was constant.
Even though some assumptions were only partly met, the process gave me a
clearer sense of how a model behaves when applied to real-world
data.
Overall, completing this section strengthened my understanding of how
probability supports all parts of a statistical model—from estimating
parameters, to interpreting uncertainty, to checking assumptions. It
also helped me build confidence in using R and Tidymodels to fit,
assess, and explain a regression model. This hands-on experience
reinforced that meaningful statistical modeling is not just about
running code, but about understanding the reasoning behind the results
and being able to clearly communicate what the model is telling us.
Learning Objective 2:
Apply the appropriate generalized linear model for a specific data
context
Dataset: diamonds (from ggplot2)
Goal: Demonstrate choosing the correct GLM (multiple linear
regression for continuous price), justify model choice using EDA, fit
the model, interpret coefficients, and evaluate diagnostics.
Loading libraries and importing the Diamonds dataset
library(ggplot2)
library(dplyr)
data("diamonds")
Exploratory Data Analysis for the Diamonds dataset
Summary Statistics
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Distribution of price
hist(diamonds$price,
main = "Distribution of Diamond Prices",
xlab = "Price",
col = "lightblue",
border = "white")
Relationship between carat and price
plot(diamonds$carat, diamonds$price,
main = "Price vs Carat",
xlab = "Carat",
ylab = "Price",
pch = 16, col = rgb(0,0,1,0.4))
Boxplots for categorical predictors
boxplot(price ~ cut, data = diamonds,
main = "Price by Cut Quality",
xlab = "Cut", ylab = "Price", col = "orange")
boxplot(price ~ color, data = diamonds,
main = "Price by Color",
xlab = "Color", ylab = "Price", col = "blue")
boxplot(price ~ clarity, data = diamonds,
main = "Price by Clarity",
xlab = "Clarity", ylab = "Price", col = "lightgreen")
To demonstrate my understanding of generalized linear models, I used the
diamonds dataset. My response variable was price, which represents the
cost of a diamond in U.S. dollars. I examined this variable first and
found that price is continuous, strictly positive, and moderately
right-skewed. This type of response variable is commonly modeled with a
linear model using a Gaussian distribution and an identity link, which
is part of the generalized linear model framework.
The predictor variables in this dataset include both numeric (such as
carat) and categorical (such as cut, color, and clarity) variables. I
visualized the relationship between carat and price and observed a clear
positive trend, indicating that larger diamonds are consistently more
expensive. Boxplots of price across cut, color, and clarity levels
showed strong differences between categories, meaning that diamond
quality plays a meaningful role in determining price. This initial EDA
helped me understand the structure of the data and guided my decision
about which generalized linear model was most appropriate.
Choosing the Appropriate GLM
The learning objective emphasizes using the appropriate generalized
linear model for the data context. Based on my EDA, I identified the
following characteristics:
1. The outcome variable (price) is continuous.
2. The distribution is roughly Gaussian after accounting for skewness.
3. The predictors include both quantitative and qualitative variables.
4. The research question focuses on how different factors impact diamond price.
Because of this, the appropriate generalized linear model is a linear regression model with a Gaussian family and an identity link function. This is the standard GLM for a continuous response. Although price is right-skewed, the linear model on the original scale still captures the main relationships; the residual diagnostics later in this section flag some heteroscedasticity, and a log transformation is noted there as a possible refinement. The categorical variables (cut, color, clarity) also fit naturally into the GLM framework through factor contrast coding.
carat weight and diamond quality while meeting the expectations of
Learning Objective 2.
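To make the GLM framing explicit, the chosen model can also be written with glm() using the Gaussian family and identity link; its coefficients match the lm() fit of Model 2 below exactly (a brief sketch):
# Explicit GLM formulation: Gaussian family, identity link (equivalent to lm)
glm_fit <- glm(price ~ carat + cut + color + clarity,
               family = gaussian(link = "identity"),
               data = diamonds)
head(coef(glm_fit))  # identical to the corresponding lm coefficients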
Fitting the Model
I fit two main models:
Model 1: Simple Linear Regression (Price ~ Carat)
model1 <- lm(price ~ carat, data = diamonds)
summary(model1)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
Interpretation of Model 1 (price ~ carat)
The first model used only carat as a predictor of diamond price. The
results show a very strong positive association between carat and price.
The estimated coefficient for carat is 7756.43, meaning that for every
1-carat increase in size, the expected price of the diamond increases by
approximately $7,756, on average. This relationship is highly
statistically significant (p < 2e-16), indicating extremely strong
evidence that carat weight is a major driver of price.
The adjusted R² for this model is 0.8493, meaning that carat alone
explains about 85% of the variability in diamond prices. This is
surprisingly high for a single predictor and shows how influential
diamond size is. However, the residual standard error of 1549 suggests
that while carat explains much of the variation, there is still
substantial unexplained noise—likely due to other factors such as cut,
color, and clarity.
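To make these numbers concrete, the fitted line can be used to price a 1-carat diamond (a small sketch; predict() returns the same value as the hand calculation):
# Predicted price for a 1-carat diamond from Model 1: intercept + slope * carat
-2256.36 + 7756.43 * 1
predict(model1, newdata = data.frame(carat = 1))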
Model 2: Multiple Linear Regression
model2 <- lm(price ~ carat + cut + color + clarity, data = diamonds)
summary(model2)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16813.5 -680.4 -197.6 466.4 10394.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3710.603 13.980 -265.414 < 2e-16 ***
## carat 8886.129 12.034 738.437 < 2e-16 ***
## cut.L 698.907 20.335 34.369 < 2e-16 ***
## cut.Q -327.686 17.911 -18.295 < 2e-16 ***
## cut.C 180.565 15.557 11.607 < 2e-16 ***
## cut^4 -1.207 12.458 -0.097 0.923
## color.L -1910.288 17.712 -107.853 < 2e-16 ***
## color.Q -627.954 16.121 -38.952 < 2e-16 ***
## color.C -171.960 15.070 -11.410 < 2e-16 ***
## color^4 21.678 13.840 1.566 0.117
## color^5 -85.943 13.076 -6.572 5.00e-11 ***
## color^6 -49.986 11.889 -4.205 2.62e-05 ***
## clarity.L 4217.535 30.831 136.794 < 2e-16 ***
## clarity.Q -1832.406 28.827 -63.565 < 2e-16 ***
## clarity.C 923.273 24.679 37.411 < 2e-16 ***
## clarity^4 -361.995 19.739 -18.339 < 2e-16 ***
## clarity^5 216.616 16.109 13.447 < 2e-16 ***
## clarity^6 2.105 14.037 0.150 0.881
## clarity^7 110.340 12.383 8.910 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
## F-statistic: 3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16
The second model included additional categorical predictors: cut,
color, and clarity. This model provides a much more detailed
understanding of diamond pricing.
Carat: The coefficient for carat increased to 8886.13, meaning that once quality characteristics are accounted for, each additional carat increases the expected price by about $8,886. This positive and highly significant coefficient reflects the fact that size continues to be the most important factor in determining price.
Because cut, color, and clarity are ordered factors, R reports orthogonal polynomial contrasts (the .L, .Q, .C, and higher-order terms) rather than level-versus-baseline dummy coefficients, so the individual estimates are not direct dollar differences between named grades.
Cut: The linear contrast for cut is positive (about +699) and highly significant, indicating that predicted price rises as cut quality improves from Fair toward Ideal, holding carat, color, and clarity constant.
Color: The linear contrast for color is large and negative (about -1,910). Since the color grades are ordered from D (best) to J (worst), this means predicted price falls as the color grade worsens; equivalently, nearer-colorless diamonds command higher prices once size is held constant.
Clarity: The linear contrast for clarity is the largest quality effect (about +4,218) and is highly significant, showing that diamonds with fewer inclusions and imperfections are substantially more expensive.
Nearly all of the cut, color, and clarity contrasts are highly significant (p < 0.001), confirming that diamond quality is a key driver of pricing alongside carat weight.
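If per-level dollar comparisons against a baseline grade are wanted, the ordered factors can be refit with ordinary treatment (dummy) contrasts; a sketch (the refit changes only the coefficient parameterization, not the fitted values):
# Refit with unordered factors so each coefficient is a level-vs-baseline dollar difference
diamonds_tc <- diamonds
diamonds_tc$cut     <- factor(diamonds_tc$cut,     ordered = FALSE)
diamonds_tc$color   <- factor(diamonds_tc$color,   ordered = FALSE)
diamonds_tc$clarity <- factor(diamonds_tc$clarity, ordered = FALSE)
model2_tc <- lm(price ~ carat + cut + color + clarity, data = diamonds_tc)
head(coef(model2_tc))  # e.g., cutGood, cutVery Good, ... relative to the baseline Fair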
Model fit: The adjusted R² increased from 0.8493 (Model 1) to 0.9159 (Model 2). This means that including cut, color, and clarity improves the explanatory power of the model by about 7 percentage points, and the model now explains about 91.6% of the variation in prices. The residual standard error drops from 1549 in Model 1 to 1157 in Model 2, showing a substantial improvement in predictive accuracy.
Model Comparison and Coefficient Interpretation Table
summary(model1)$adj.r.squared
## [1] 0.8493277
summary(model2)$adj.r.squared
## [1] 0.9159125
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + color + clarity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 53938 1.2935e+11
## 2 53921 7.2163e+10 17 5.7183e+10 2513.4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(summary(model2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3710.603298 13.98044 -265.41382287 0.000000e+00
## carat 8886.128883 12.03370 738.43710891 0.000000e+00
## cut.L 698.906791 20.33514 34.36940969 4.293856e-256
## cut.Q -327.685863 17.91083 -18.29540212 1.515019e-74
## cut.C 180.565273 15.55655 11.60702279 4.133866e-31
## cut^4 -1.206906 12.45824 -0.09687611 9.228251e-01
## color.L -1910.287923 17.71195 -107.85306516 0.000000e+00
## color.Q -627.953682 16.12110 -38.95229291 0.000000e+00
## color.C -171.960429 15.07039 -11.41047946 4.014211e-30
## color^4 21.678144 13.84005 1.56633380 1.172764e-01
## color^5 -85.943244 13.07634 -6.57242157 4.995316e-11
## color^6 -49.985933 11.88863 -4.20451593 2.620629e-05
## clarity.L 4217.534915 30.83138 136.79360412 0.000000e+00
## clarity.Q -1832.406056 28.82735 -63.56483922 0.000000e+00
## clarity.C 923.272965 24.67944 37.41061889 2.004270e-302
## clarity^4 -361.994646 19.73894 -18.33911034 6.820905e-75
## clarity^5 216.616140 16.10937 13.44659446 3.756440e-41
## clarity^6 2.105166 14.03689 0.14997383 8.807858e-01
## clarity^7 110.340329 12.38335 8.91037691 5.240452e-19
Interpretation of the Model Comparison (ANOVA)
The ANOVA test compares Model 1 (only carat) with Model 2 (carat +
quality features). The F-statistic is extremely large (2513) and the
p-value is < 2.2e-16, meaning Model 2 fits the data significantly
better than Model 1.
This comparison shows three things:
1. Cut, color, and clarity add significant additional explanatory power.
2. Carat alone, although important, does not capture the full complexity of diamond pricing.
3. The categorical predictors strongly improve model performance.
Thus, Model 2 is clearly the more appropriate generalized linear
model for this data context.
Overall, the results show that diamond price is driven by both
physical size (carat) and quality characteristics (cut, color, clarity).
The multiple regression model (Model 2) provides a far more complete and
accurate explanation than the simple model, and the statistical
significance of all predictors demonstrates that each contributes
meaningfully to pricing. Model 2’s much higher adjusted R² and lower
residual error confirm that it is the superior GLM for this diamonds
dataset.
Diagnostic Plots
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
Residuals vs Fitted Plot
In this model, the residuals show a clear
curved pattern rather than being randomly scattered around zero. This
suggests that the relationship between predictors and price may not be
perfectly linear, and that there might be additional nonlinear patterns
not captured by the model. Additionally, the spread of the residuals
increases as fitted values increase. At lower price levels, residuals
are tightly packed, but for higher-value diamonds, the residuals fan out
substantially. This indicates heteroscedasticity, meaning the variance
of errors increases with price. This is common in real-world pricing
data because high-value diamonds vary much more in characteristics than
low-value ones. Despite this, the model still performs well overall but
would benefit from techniques like transformations or weighted
regression if precision were critical.
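One such refinement is modeling log(price), which often stabilizes the variance for strongly right-skewed prices; a brief sketch (shown only as an illustration, not as a replacement for the model interpreted above):
# Log-transformed response: coefficients become approximate proportional effects on price
model2_log <- lm(log(price) ~ carat + cut + color + clarity, data = diamonds)
summary(model2_log)$adj.r.squared
# par(mfrow = c(2, 2)); plot(model2_log)  # rerun the diagnostics on the log scale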
Q–Q Plot of Residuals
This plot examines whether the residuals follow a normal distribution. Here, the residuals deviate from the straight line, especially in the upper tail. This indicates that extremely expensive diamonds have residuals that are larger than what would be expected under a normal distribution. The curvature suggests right-skewness, meaning the model tends to underestimate the price of the most expensive diamonds. While this is a deviation from normality, it is not unusual in models with strongly right-skewed outcomes like price data. The central part of the plot aligns reasonably well with the line, so normality holds approximately for mid-range diamond prices.
Scale–Location Plot
The Scale–Location plot also assesses homoscedasticity. Similar to the first plot, the upward trend in the red line shows that residual variability increases with fitted values. Lower-priced diamonds have more stable and consistent errors, whereas high-priced diamonds display much more spread in residuals. This confirms non-constant variance, which again is typical in datasets where the response variable spans several orders of magnitude. The model captures the general trend but has more difficulty modeling the most expensive diamonds with equal precision.
Residuals vs Leverage Plot
This plot helps identify influential observations, points that have a large impact on the fitted model. A few observations with high leverage and large standardized residuals stand out. These points correspond to very large diamonds that are both rare and highly influential in the model. The Cook's distance contours indicate that while a few observations approach the threshold, none appear dangerously influential. This means no single observation is distorting the entire model, but certain high-value diamonds carry more weight in determining the regression line. This makes sense because extremely large or high-quality diamonds are scarce and naturally exert more influence.
Despite these issues, the model is still appropriate for
demonstrating Learning Objective 2. It explains over 91% of the variance
and provides meaningful insights into how carat, cut, color, and clarity
affect diamond price.
Learning Objective 2: Reflection
Apply the appropriate generalized linear model for a specific data
context
In my work with the diamonds dataset, I demonstrated this learning
objective by carefully selecting a generalized linear model that matched
the type of outcome I wanted to predict. The response variable, price,
is continuous and reasonably modeled with a Gaussian distribution and an
identity link. This made a multiple linear regression model the
appropriate GLM for this context. Before fitting the model, I completed
an exploratory data analysis to better understand the structure of the
dataset. I looked at the distributions of price, examined boxplots for
cut, color, and clarity, and created scatterplots of carat versus price.
These visual patterns gave me confidence that a linear modeling approach
was justified and also helped me anticipate which predictors were likely
to be meaningful.
After choosing the model, I built two versions: a simple regression
using only carat and a multiple regression that incorporated cut, color,
and clarity. Fitting this GLM allowed me to work with both quantitative
and qualitative predictors, which is one of the strengths of generalized
linear modeling. Interpreting the coefficients helped me deepen my
understanding of how each variable contributed to diamond pricing. For
example, the model showed that carat had the strongest effect on price,
but quality categories also played significant roles. Comparing the two
models using adjusted R² and ANOVA taught me how to justify why one
model is more appropriate than another. This experience helped me
connect the idea of “appropriate GLM selection” to actual
decision-making based on data behavior.
Working through the diagnostic plots also strengthened my
understanding of what it means to apply a GLM responsibly. Seeing
patterns like heteroscedasticity and slight deviations from normality
gave me a more realistic perspective on how statistical models behave
with real-world data. Even though the model wasn’t perfect, I learned
how to assess its strengths, limitations, and overall usefulness. This
entire process helped me grow both technically and conceptually because
I was not only fitting a model but understanding why this GLM was suited
for the problem and how to interpret its results in context.
Although I compared two models in Learning Objective 2, that
comparison was only supportive. LO2 focused on choosing the correct GLM
(a linear regression for predicting price), whereas Learning Objective 3
focuses specifically on comparing multiple models and selecting the best
candidate.
Learning Objective 3:
Demonstrate model selection given a set of candidate models
Model: Linear Discriminant Analysis (classification model)
Dataset: Auto (from ISLR2)
Goal: Compare multiple LDA classification models with different
predictors, evaluate accuracy and confusion matrices, select the best
model, and justify model choice with statistical reasoning.
The Auto data were already introduced under Learning Objective 1; here I work with them in a separate object named Auto and convert the origin variable to a factor. The origin variable identifies where each car was manufactured: 1 = American, 2 = European, 3 = Japanese. Because origin is a categorical variable with three classes, this dataset is well suited to classification models such as Linear Discriminant Analysis (LDA).
Load Packages + Convert Data
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
Auto <- ISLR::Auto
# Convert origin to factor for classification
Auto$origin <- factor(Auto$origin,
levels = c(1, 2, 3),
labels = c("American", "European", "Japanese"))
Exploratory Data Analysis for LDA
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : Factor w/ 3 levels "American","European",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 American:245 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 European: 68 ford pinto : 5
## Median :15.50 Median :76.00 Japanese: 79 toyota corolla : 5
## Mean :15.54 Mean :75.98 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 amc hornet : 4
## Max. :24.80 Max. :82.00 chevrolet chevette: 4
## (Other) :365
table(Auto$origin)
##
## American European Japanese
## 245 68 79
pairs(Auto[, c("mpg", "horsepower", "weight", "acceleration", "displacement")],
main = "Scatterplot Matrix of Predictors")
The scatterplot matrix helps visualize the relationships among the key
numeric predictors: mpg, horsepower, weight, acceleration, and
displacement.
I observed a few key points:
1. mpg is strongly negatively correlated with horsepower, displacement, and weight, meaning that heavier cars with larger engines tend to have lower fuel efficiency.
2. Horsepower and displacement are highly positively correlated, reflecting the engineering link between engine size and power.
3. Acceleration shows weak to moderate relationships with the other predictors, so it may contribute some unique information.
Overall, the matrix reveals meaningful structure in the data. These
visible patterns support the idea that Linear Discriminant Analysis
(LDA) may successfully separate car origins using these numeric
characteristics.
MODEL 1 (Simple LDA Model)
Model: origin ~ horsepower + weight
lda_model1 <- lda(origin ~ horsepower + weight, data = Auto)
lda_model1
## Call:
## lda(origin ~ horsepower + weight, data = Auto)
##
## Prior probabilities of groups:
## American European Japanese
## 0.6250000 0.1734694 0.2015306
##
## Group means:
## horsepower weight
## American 119.04898 3372.490
## European 80.55882 2433.471
## Japanese 79.83544 2221.228
##
## Coefficients of linear discriminants:
## LD1 LD2
## horsepower 0.008187225 0.051468802
## weight -0.001791632 -0.001867805
##
## Proportion of trace:
## LD1 LD2
## 0.9796 0.0204
Prior Probabilities
The priors (American = 0.625, European = 0.174, Japanese = 0.201)
reflect the dataset’s class distribution. This means most cars are
American, so the model must avoid simply predicting the majority
class.
Group Means
American cars: highest horsepower and highest weight (means of about 119 hp and 3372 lbs)
European cars: horsepower similar to Japanese cars, with intermediate weight (about 81 hp and 2433 lbs)
Japanese cars: lowest horsepower and lowest weight (about 80 hp and 2221 lbs)
This suggests horsepower and weight do provide meaningful separation, although the European and Japanese groups overlap considerably.
Linear Discriminants (LD1 & LD2)
LD1 explains 97.9% of the separation which is excellent. It means the
first discriminant captures nearly all the classification
information.
The coefficients show:
Horsepower has a small positive loading on LD1
Weight has a small negative loading
Interpretation: because weight is measured in thousands of pounds, its term dominates LD1, so heavy, high-horsepower cars (mostly American) sit at one end of the discriminant while lighter, lower-power cars (mostly Japanese) sit at the other.
Predict using Model 1
pred1 <- predict(lda_model1)$class
Confusion matrix & accuracy for LDA model 1
conf1 <- table(True = Auto$origin, Predicted = pred1)
conf1
## Predicted
## True American European Japanese
## American 215 3 27
## European 29 5 34
## Japanese 21 0 58
accuracy1 <- mean(pred1 == Auto$origin)
accuracy1
## [1] 0.7091837
American cars are classified well (215 correct), but many European cars are misclassified. No Japanese cars are misclassified as European. European cars are the most difficult to classify, with only 5 of 68 correct.
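Per-class accuracy makes this imbalance explicit (a quick sketch computed from the confusion matrix above):
# Class-level recall: proportion of each true origin classified correctly
diag(conf1) / rowSums(conf1)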
MODEL 2 (Full LDA Model)
Model: origin ~ horsepower + weight + displacement + acceleration +
mpg
lda_model2 <- lda(origin ~ horsepower + weight + displacement + acceleration + mpg,
data = Auto)
lda_model2
## Call:
## lda(origin ~ horsepower + weight + displacement + acceleration +
## mpg, data = Auto)
##
## Prior probabilities of groups:
## American European Japanese
## 0.6250000 0.1734694 0.2015306
##
## Group means:
## horsepower weight displacement acceleration mpg
## American 119.04898 3372.490 247.5122 14.99020 20.03347
## European 80.55882 2433.471 109.6324 16.79412 27.60294
## Japanese 79.83544 2221.228 102.7089 16.17215 30.45063
##
## Coefficients of linear discriminants:
## LD1 LD2
## horsepower 0.0303833938 0.012950445
## weight 0.0001451998 -0.001921576
## displacement -0.0198644604 0.018799242
## acceleration 0.0118050767 -0.051837494
## mpg 0.0598010020 0.130294235
##
## Proportion of trace:
## LD1 LD2
## 0.9587 0.0413
This model includes horsepower, weight, displacement, acceleration,
and mpg.
Group Means
Differences become clearer:
Japanese cars: lower displacement and weight
American cars: generally heavier and lower mpg
European cars: in between but with higher mpg than Americans
This richer set of variables supports deeper separation.
Coefficients of LD1 and LD2
LD1 again accounts for about 95.9% of the between-group separation, which is still strong.
Key trends (the signs of discriminant coefficients are only meaningful relative to one another):
mpg has the largest positive loading on LD1, so high-mpg cars score higher on LD1, toward the Japanese and European clusters.
displacement has a negative loading, so cars with large engines score lower on LD1, toward the American cluster.
Together these loadings match the group means: heavy, large-displacement, low-mpg cars are pulled toward the American end of LD1, and light, fuel-efficient cars toward the Japanese end.
Overall, Model 2 captures more characteristics of car design.
Predict using Model 2
pred2 <- predict(lda_model2)$class
Confusion matrix & accuracy for LDA model 2
conf2 <- table(True = Auto$origin, Predicted = pred2)
conf2
## Predicted
## True American European Japanese
## American 205 12 28
## European 12 27 29
## Japanese 3 16 60
accuracy2 <- mean(pred2 == Auto$origin)
accuracy2
## [1] 0.744898
European cars improve dramatically (27 correct vs. only 5 before), and Japanese cars are misclassified slightly less often (60 correct vs. 58). American accuracy drops a little (205 correct vs. 215), but overall Model 2 provides a more balanced and more accurate classification across all three classes.
This is the statistical justification for choosing Model 2.
Model Comparison for LDA
accuracy1
## [1] 0.7091837
accuracy2
## [1] 0.744898
conf1
## Predicted
## True American European Japanese
## American 215 3 27
## European 29 5 34
## Japanese 21 0 58
conf2
## Predicted
## True American European Japanese
## American 205 12 28
## European 12 27 29
## Japanese 3 16 60
accuracy2 - accuracy1
## [1] 0.03571429
Accuracy for LDA
Model 1 accuracy = 0.709; Model 2 accuracy = 0.745.
Model 2 is more accurate, with a meaningful gain of about 3.6 percentage points. Comparing the accuracies and confusion matrices shows that Model 2 gives a more balanced, more accurate classification across all three classes, which is the statistical justification for choosing it and directly reflects Learning Objective 3.
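Both accuracies above are computed on the training data, which can be optimistic. MASS::lda() supports leave-one-out cross-validation, so a fairer comparison is possible (a sketch; the cross-validated accuracies are typically a bit lower than the training accuracies):
# Leave-one-out cross-validated accuracy for both candidate models
cv1 <- lda(origin ~ horsepower + weight, data = Auto, CV = TRUE)
cv2 <- lda(origin ~ horsepower + weight + displacement + acceleration + mpg,
           data = Auto, CV = TRUE)
mean(cv1$class == Auto$origin)
mean(cv2$class == Auto$origin)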
Showing LDA separation for Model 2
# Get LDA scores for Model 2
lda2_scores <- predict(lda_model2)
lda2_df <- data.frame(
LD1 = lda2_scores$x[,1],
LD2 = lda2_scores$x[,2],
origin = Auto$origin
)
ggplot(lda2_df, aes(x = LD1, y = LD2, color = origin)) +
geom_point(alpha = 0.6, size = 2) +
labs(
title = "LDA Model 2: Separation of Car Origins",
x = "Linear Discriminant 1 (LD1)",
y = "Linear Discriminant 2 (LD2)"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
legend.title = element_blank()
)
The LDA Model 2 separation plot shows much clearer grouping compared to
Model 1. American cars (red) are well-separated on the left side of LD1,
reflecting their heavier weight, larger displacement, and lower mpg.
Japanese cars (blue) cluster tightly on the far right side of LD1,
consistent with lighter vehicles and higher fuel efficiency. European
cars (green) fall between these two clusters, but they show noticeably
less overlap than in the simpler model.
By adding more predictors such as mpg, acceleration, and displacement, Model 2 achieves stronger separation along LD1 and LD2. The three
origins form more distinct clusters, visually supporting the higher
accuracy obtained by this model. This plot confirms that Model 2
captures more of the underlying structure in the data and provides
better discrimination between car origins.
head(predict(lda_model2)$x)
## LD1 LD2
## 1 -1.751835 0.9096555
## 2 -1.700451 1.4431466
## 3 -1.384355 1.5579608
## 4 -1.214485 0.9881102
## 5 -1.434173 0.9983125
## 6 -2.190709 2.1882261
The scores show that these first six cars, which are all heavy, high-horsepower American models from 1970, have clearly negative LD1 values, placing them at the American end of the first discriminant and matching the cluster on the left of the separation plot. Their LD2 values vary more, reflecting differences within the American group. These LD scores numerically confirm that the predictors meaningfully separate the classes.
Final Interpretation of LDA
LDA Model 2, which includes horsepower, weight, displacement,
acceleration, and mpg, provided superior classification performance
compared to the simpler Model 1. It improved overall accuracy from 70.9%
to 74.5% and significantly reduced misclassification of European cars.
The discriminant functions also showed clearer separation in LDA space.
Based on these comparisons, Model 2 is the better model for predicting car origin.
Reflection for Learning Objective 3: Model Selection
For Learning Objective 3, I focused on demonstrating my ability to
compare multiple candidate models and make a justified decision about
which model performs best. This objective felt different from the
previous two, because instead of trying to apply the correct type of
model, here I had to think more critically about choosing between models
that could all potentially work. To explore model selection, I used the
Auto dataset and built two Linear Discriminant Analysis (LDA) models
that predicted car origin. Model 1 used only horsepower and weight,
while Model 2 added displacement, acceleration, and mpg. This setup
naturally created a scenario where I could evaluate whether additional
predictors improved the classification task.
Working through these models helped me better understand how model
selection requires a combination of statistical reasoning and practical
judgment. Even before looking at performance metrics, the EDA showed
clear patterns across origins for example, American cars tended to be
heavier and have lower mpg, while Japanese cars tended to be lighter and
more fuel efficient. This suggested that using more predictors might
reveal stronger separation between the classes. Once I compared the
models, I found that Model 2 produced a higher accuracy and reduced
misclassification of European cars. The visualization of LDA separation
confirmed the same finding: the classes formed more distinct clusters in
Model 2. This experience helped me appreciate how model selection is not
just about choosing the model with the highest numbers, but also about
understanding why the improvement happens and whether it is
meaningful.
Personally, I felt that Learning objective 3 pushed me to think more
deeply about the structure of the data rather than only running models.
I had to justify why Model 2 was better not simply because it had a
slightly higher accuracy, but because it provided clearer separation and
a more balanced classification across the three origins. This process
made me more confident in evaluating decisions between simple and
complex models, an important real world statistical skill. Overall,
learning objective 3 helped me connect model performance, visual
interpretation, and practical judgment in a way that felt more like
“actual data science” than simply fitting a model.
Learning Objective 4:
Express the Results of Statistical Models to a General Audience
First, I concentrate on explaining a single model to a general audience, and then I sum everything up at the end. In this section, I explain my linear regression model predicting fuel efficiency (mpg) from horsepower in the Auto dataset.
Overview of the Model
For this objective, I chose the simple linear regression model I built earlier using the Auto dataset. The goal of the model was to understand how a car's engine power, measured by horsepower, is related to its fuel efficiency, measured in miles per gallon (mpg). This is a relationship most people are familiar with in everyday life, which made it a natural opportunity to explain statistical results clearly.
Main Findings Explained Simply
The model found a clear and consistent pattern: cars with higher horsepower tend to get lower gas mileage. To put it simply, stronger engines usually use more fuel, so the car cannot travel as many miles on one gallon of gasoline. This is not surprising from a real-world standpoint, but the model helped me quantify and confirm the relationship using actual data.
The overall trend was strongly negative, meaning that as horsepower
increases, mpg decreases. While not all cars follow the pattern
perfectly, this model shows that the general relationship holds across
many different car types.
Explaining the Coefficient
The model estimated that, for each one-unit increase in horsepower, fuel efficiency tends to decrease by about 0.16 miles per gallon. For instance, if you compare two cars that are identical in every way except that one has slightly more horsepower, the stronger car will likely get slightly worse gas mileage. This helps communicate the idea that even small increases in engine power come with a cost in fuel efficiency. The negative relationship was strong and statistically meaningful, but the key point for a general audience is that bigger engines require more fuel.
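To make this concrete with a worked example (a small sketch using the fitted coefficients; the 100- and 150-horsepower cars are purely illustrative):
# Expected mpg for a 100-hp car and a 150-hp car from the fitted line
39.936 - 0.158 * 100   # about 24 mpg
39.936 - 0.158 * 150   # about 16 mpg, roughly 8 mpg lower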
Reflection for this specific model
Working on this objective helped me shift from thinking statistically
to thinking communicatively. Instead of focusing on p-values and model
assumptions, I had to describe the results in a way that someone with no
statistics background could understand. I realized how important it is
to simplify without oversimplifying. The direction of the relationship
and the practical meaning of the coefficient mattered much more than the
technical details.
This objective also helped me understand the role of storytelling in
data analysis. Even though the model itself was simple, translating it
into everyday language required me to think about the data from the
perspective of a general audience. This reminded me that statistical
modeling is not just about running analyses it is also about helping
people make informed decisions based on clear explanations. I feel more
confident now in my ability to communicate results beyond the
classroom.
Reflection for the Full Project
I wanted this full-portfolio reflection to cover what came easily, what required harder work, and what pushed me to think deeply about answering the questions and giving appropriate examples for the learning objectives, rather than focusing only on model explanation.
Throughout this portfolio, I worked on several statistical modeling
tasks. My main goal was to understand not only how to run models but
also how to think about them and communicate their meaning. Looking
back, I see that each learning objective built on the last, gradually
strengthening my technical skills and my ability to explain results
clearly.
I started by exploring data both visually and numerically. I learned
to recognize patterns and understand what variables might reveal before
fitting a model. This step showed me how important it is to “listen to
the data.” Simple scatterplots and histograms guided my thinking for
later analyses. When I fitted my first linear regression model, I
connected mathematical results to real-life meaning. For example,
observing how horsepower affects mpg wasn’t just a number; it told a
clear story about how car engines and fuel efficiency are linked. This
was my first experience turning statistical output into something
understandable to nearly anyone.
As I progressed, the work became more complex, especially when I
compared multiple models or explored classification with LDA. Model
selection required critical thinking, not just running code. I had to
ask: Why is one model better? Does adding more variables really make a
difference? This pushed me to go beyond the mechanics and really
evaluate models like a data scientist would. Seeing the improvement from
Model 1 to Model 2 in LDA both numerically and visually helped me
appreciate how statistics combine reasoning, visualization, and
interpretation.
By the time I reached the final learning objective, I recognized how
crucial it is to communicate results clearly. Most people do not speak
in terms of coefficients, p-values, or discriminant functions. They want
to know the story: What does the model tell us? Why does it matter?
Practicing this skill helped me view statistics not just as numbers, but
as a means of explaining real-world relationships in a meaningful and
accessible way. I learned that a good analysis is only valuable if
others can understand it.
Overall, this portfolio helped me grow from simply running
statistical methods to interpreting, comparing, and communicating them
effectively. I feel more confident explaining data insights to both
technical and non-technical audiences. I am grateful for how much more comfortable I have become with the entire modeling process, from EDA to final communication.
Learning Objective 5:
Use Programming Software to Fit and Assess Statistical Models
I used the R programming language, RStudio, and RPubs throughout this course and in this final portfolio.
During this course in Statistical Modeling, I have been working with
a wide range of projects and assignments such as a full linear
regression project, classification project using generalized linear
models, and several homework assignments that covered such models as
multiple regression with quantitative and qualitative predictors,
multinomial logistic regression, Linear Discriminant Analysis, Poisson
regression, ridge regression, lasso regression, PCA regression, and
polynomial regression. Correspondingly, in all these analyses, I used R
and RStudio as my main tool for programming, at times integrating the
work done with R Pubs. I have shown code executed in both tidymodels and
base R, which made it possible to compare different styles of
programming and how each approach structures modeling.
The course had me use many packages and functions that made statistical analysis more efficient and organized, such as ggplot2, MASS, car, broom, GGally, dplyr, tidyverse, parsnip, and workflows, and functions such as ggpairs(), tidy(), str(), glimpse(), summary(), lm(), predict(), confint(), augment(), and many others. My workflow includes loading packages, importing datasets, cleaning and preparing data, exploring variables with summaries and plots, creating new or recoded features, fitting a wide range of models, extracting results, generating
predictions, checking residuals and diagnostic plots, calculating
accuracy or RMSE, comparing candidate models, visualizing LDA
separation, and organizing code into clear, well-labeled sections. All
combined, these experiences strengthened my ability to fit and assess
statistical models through programming and helped me to build in R a
complete, repeatable workflow.
Specific to the portfolio models
To meet this objective in this particular project, I demonstrated how
to import data, explore, fit several different types of models using R
and RStudio, and evaluate their performance. Throughout the course, and
in this portfolio, I have worked extensively with three different
datasets and modeling approaches: a simple linear regression model to
predict mpg from horsepower in the Auto dataset, a multiple regression
model from the Diamonds dataset, and a Linear Discriminant Analysis
(LDA) classification model for predicting car origin. Using all three
examples shows the range of statistical programming I practiced, from basic regression to more advanced model comparison and classification.
Using programming software became a central part of my learning
process. I used R to load and clean datasets, create visualizations,
calculate summary statistics, and generate plots that helped me
interpret patterns before fitting any models. I also used R to fit and
assess several types of models (linear regression, multiple regression, and LDA) and to evaluate them using outputs such as coefficients,
confidence intervals, residual plots, accuracy scores, and confusion
matrices. These tasks required me not only to write code but also to
understand what each function was doing and how the results connect back
to the data. Throughout the portfolio, I wrote the code in organized
chunks with comments, which served to help me keep my workflow clean and
understandable.
Working with three different models helped me grow significantly in
my programming skills. On the Auto dataset, I learned how to fit simple
models and interpret their coefficients. The Diamonds dataset pushed me
to work with multiple predictors and understand how regression output
changes both with categorical and numerical variables. The LDA models
required me to step into classification problems, calculate predictions,
evaluate accuracy, and visualize separations in discriminant space.
Also, troubleshooting errors, renaming objects for better clarity, and
structuring code more clearly finally made me more confident and
independent in programming.
Each example has reinforced a different aspect of model assessment,
either checking assumptions in regression or comparing classification
performance across candidate models. Thinking back to this objective, I
found that working with programming software has taught me more than just how to write code; it has shown me how to build an entire analysis pipeline. It has taught
me the importance of keeping code organized, labeling each step, and
ensuring that output matches the question I am trying to answer. This
objective helped me see how programming stitches together the entire
modeling process, from EDA to interpretation. I am way more comfortable
with fitting and assessing models in R now because I can confidently
follow a full workflow. This experience showed me that programming is
not just a tool but an essential part of doing statistics effectively
and communicating results clearly.