1 Learning Objective: Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation.

Dataset: Auto (from ISLR2)

Goal: Show that you understand probability foundations, inference, and maximum likelihood through regression.

For my portfolio I am using the Auto dataset; let's start by importing and setting up the data.

1.1 Import & Setup

Loading required libraries

library(tidyverse)      
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)     
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.9     ✔ rsample      1.3.1
## ✔ dials        1.4.1     ✔ tune         1.3.0
## ✔ infer        1.0.9     ✔ workflows    1.2.0
## ✔ modeldata    1.5.0     ✔ workflowsets 1.1.1
## ✔ parsnip      1.3.2     ✔ yardstick    1.3.2
## ✔ recipes      1.3.1     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(ISLR2)          
library(GGally)         
library(ggcorrplot)     
library(broom)          
library(skimr)          

Data loading & basic cleaning / summary

Load Auto dataset and inspect

data("Auto", package = "ISLR2")
auto <- Auto

Quick structure and head

str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
head(auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Remove ‘name’ if present, ensure numeric types, and check for missing values

if("name" %in% names(auto)) auto <- auto %>% select(-name)


colSums(is.na(auto))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##            0            0            0            0            0            0 
##         year       origin 
##            0            0

Summary Statistics

# Basic summary statistics
summary(auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin     
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :15.50   Median :76.00   Median :1.000  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000
# Optional: a nicer skim of the dataset
skimr::skim(auto)
Data summary
Name auto
Number of rows 392
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 23.45 7.81 9 17.00 22.75 29.00 46.6 ▆▇▆▃▁
cylinders 0 1 5.47 1.71 3 4.00 4.00 8.00 8.0 ▇▁▃▁▅
displacement 0 1 194.41 104.64 68 105.00 151.00 275.75 455.0 ▇▂▂▃▁
horsepower 0 1 104.47 38.49 46 75.00 93.50 126.00 230.0 ▆▇▃▁▁
weight 0 1 2977.58 849.40 1613 2225.25 2803.50 3614.75 5140.0 ▇▇▅▅▂
acceleration 0 1 15.54 2.76 8 13.78 15.50 17.02 24.8 ▁▆▇▂▁
year 0 1 75.98 3.68 70 73.00 76.00 79.00 82.0 ▇▆▇▆▇
origin 0 1 1.58 0.81 1 1.00 1.00 2.00 3.0 ▇▁▂▁▂

1.2 Exploratory Data Analysis

Distribution plots (histograms + density)

ggplot(auto, aes(x = mpg)) +
  geom_histogram(bins = 30, alpha = 0.8) +
  geom_density(aes(y = after_stat(count) * diff(range(auto$mpg)) / 30), color = "blue", linewidth = 0.8) +
  labs(title = "Distribution of mpg", x = "mpg", y = "count") +
  theme_minimal()

ggplot(auto, aes(x = horsepower)) +
  geom_histogram(bins = 30, alpha = 0.8) +
  geom_density(aes(y = after_stat(count) * diff(range(auto$horsepower)) / 30), color = "blue", linewidth = 0.8) +
  labs(title = "Distribution of horsepower", x = "horsepower", y = "count") +
  theme_minimal()

Distribution of Key Variables (mpg and horsepower)

The histogram of mpg shows a right-skewed distribution, with most vehicles achieving between 15–30 miles per gallon and fewer extremely high-mpg cars. This skewness is expected because very fuel-efficient cars were relatively rare in the time period represented by the Auto dataset. The distribution also appears somewhat multimodal, suggesting that different groups of cars (perhaps from different years or engine sizes) contribute to the mpg variation.

The histogram of horsepower shows a slightly right-skewed distribution as well, with most cars falling between 70 and 150 horsepower. There are a handful of extremely high-horsepower vehicles, which appear as outliers in the boxplot. These outliers could influence model fit, so identifying them early is helpful for later diagnostics.

Boxplots

ggplot(auto, aes(y = mpg)) +
  geom_boxplot() +
  labs(title = "Boxplot of mpg", y = "mpg") +
  theme_minimal()

ggplot(auto, aes(y = horsepower)) +
  geom_boxplot() +
  labs(title = "Boxplot of horsepower", y = "horsepower") +
  theme_minimal()

The boxplot of horsepower reveals several high-end outliers above 200 horsepower. These observations are legitimate but represent vehicles that are less common in the dataset. While they do not prevent us from fitting a linear regression model, they could have an impact on residual patterns and leverage points. It is good practice to keep these in mind when later interpreting diagnostic plots such as Cook’s distance.

Correlation matrix and heatmap for numeric variables

numeric_vars <- auto %>% select(where(is.numeric))
cor_mat <- cor(numeric_vars, use = "pairwise.complete.obs")
round(cor_mat, 2)
##                mpg cylinders displacement horsepower weight acceleration  year
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29
## year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00
## origin        0.57     -0.57        -0.61      -0.46  -0.59         0.21  0.18
##              origin
## mpg            0.57
## cylinders     -0.57
## displacement  -0.61
## horsepower    -0.46
## weight        -0.59
## acceleration   0.21
## year           0.18
## origin         1.00
ggcorrplot::ggcorrplot(cor_mat, hc.order = TRUE, type = "lower",
                       lab = TRUE, lab_size = 3, title = "Correlation heatmap (numeric vars)") +
  theme_minimal()

Correlation Heatmap Interpretation

The correlation heatmap clearly shows that mpg is strongly negatively correlated with horsepower (–0.78), displacement (–0.81), weight (–0.83), and cylinders (–0.78). This matches our expectations: larger, heavier, more powerful engines generally produce lower fuel efficiency.

Acceleration and year show mild positive correlations with mpg, suggesting that newer cars and those with better acceleration tend to have slightly higher fuel efficiency. Most of the engine-related variables (weight, cylinders, displacement, horsepower) are also highly correlated with each other, which is typical of automotive data where engine components scale together.

These patterns support using horsepower as a meaningful predictor of mpg, since the negative relationship appears both strong and consistent.

1.3 ggpairs for selected quantitative variables

GGally::ggpairs(numeric_vars %>% select(mpg, horsepower, displacement, weight, acceleration, year),
                columns = 1:6,
                title = "GGpairs: selected quantitative variables")

Pairwise Plots (GGpairs)

The ggpairs matrix reinforces the correlation heatmap. The scatterplots show a clear downward trend between mpg and horsepower, and between mpg and displacement or weight. The density plots on the diagonal again confirm the skewness observed in the histograms.

A notable point from the ggpairs plot is the clustering of observations, which suggests the dataset includes different groups of cars. This clustering helps explain why the year variable is moderately correlated with mpg (0.58): newer vehicles tend to be more fuel-efficient and lower in weight.

The ggpairs plot also highlights multicollinearity among the engine variables; for example, displacement and cylinders have a correlation of about 0.95, which would matter if we were doing multiple regression. However, for simple regression with a single predictor (horsepower), this is not a concern.

1.4 Scatterplot mpg vs horsepower with linear trend

ggplot(auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot: mpg vs horsepower", x = "horsepower", y = "mpg") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot with a fitted regression line shows a clear, strong, negative linear trend between horsepower and mpg. As horsepower increases, mpg decreases almost linearly. Although the points are spread, the general pattern is unmistakable and indicates that a linear model is appropriate for capturing the main trend.

There is some variability at lower horsepower levels, suggesting that cars with similar horsepower can still have different mpg depending on weight, year, or engine design. Despite this, the downward slope remains consistent across the range of horsepower values.

Overall, the exploratory data analysis strongly supports fitting a linear regression model using horsepower to predict mpg. The distributions are reasonable, the correlation is strong and negative, and the scatterplot confirms a linear pattern. Outliers are present but do not distort the overall relationship. These visual and numerical findings provide the foundation for the modeling and inference that follow, aligning directly with Learning Objective 1.

1.5 Regression model

mpg ~ horsepower using tidymodels

Specify the model (linear regression via lm engine)

lm_spec <- linear_reg() %>%
  set_engine("lm")

Recipe

# Using a recipe helps demonstrate tidy modeling workflow
rec <- recipe(mpg ~ horsepower, data = auto)

Workflow and fit

wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec)

mod_slr <- wf %>% fit(data = auto)

Extract tidy coefficients (from underlying lm)

# The parsnip fit stores the underlying lm object at mod_slr$fit$fit
tidy_mod <- broom::tidy(mod_slr$fit$fit, conf.int = TRUE)
tidy_mod
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   39.9     0.717        55.7 1.22e-187   38.5      41.3  
## 2 horsepower    -0.158   0.00645     -24.5 7.03e- 81   -0.171    -0.145

1.6 Model-level metrics (R-squared, RMSE)

glance_mod <- broom::glance(mod_slr$fit$fit)
glance_mod
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.606         0.605  4.91      600. 7.03e-81     1 -1179. 2363. 2375.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Print a friendly formula-like summary

cat("Fitted model: mpg =", round(tidy_mod$estimate[2], 3), "* horsepower +", round(tidy_mod$estimate[1], 3), "\n")
## Fitted model: mpg = -0.158 * horsepower + 39.936
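
Since Learning Objective 1 highlights maximum likelihood, a quick optional check (not part of the original workflow) is to refit the same formula with glm(), whose Gaussian-family fit is a maximum likelihood fit; the coefficients should match the lm()/tidymodels estimates above.

# Optional check: the same model fit as an explicit Gaussian GLM.
# For normal errors, the maximum likelihood estimates coincide with the
# least-squares estimates, so these coefficients should match the fit above.
glm_check <- glm(mpg ~ horsepower, family = gaussian(link = "identity"), data = auto)
coef(glm_check)    # expect about 39.94 (intercept) and -0.158 (horsepower)
logLik(glm_check)  # maximized log-likelihood (compare with glance()'s logLik)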

1.7 Model diagnostics

Residuals vs Fitted

train_aug <- augment(mod_slr, new_data = auto)


ggplot(train_aug, aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted", x = "Predicted values", y = "Residuals") +
  theme_minimal()

Linearity and Heteroscedasticity

The Residuals vs Fitted plot provides insight into the assumptions of linearity and constant variance in my simple regression model of mpg on horsepower. Overall, the residuals are centered around zero without any strong curvature, which suggests that the linear relationship between horsepower and mpg is reasonable for this dataset. However, the plot also reveals some mild heteroscedasticity: the spread of the residuals increases as the predicted mpg values increase. In other words, the model’s errors are smaller and more tightly clustered for cars with low mpg, but the variability becomes larger among cars with higher mpg. This pattern is common in automotive data, as fuel efficiency in lighter or smaller-engined cars can be influenced by many factors beyond horsepower alone. Despite this increasing spread, there are no obvious patterns or clusters of points that would indicate a major violation of the linear model assumptions, and no extreme residuals suggesting problematic outliers.
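
A formal test can back up this visual impression. The sketch below is an optional addition that uses the lmtest package (not among the libraries loaded above) to run a Breusch-Pagan test on the underlying lm object.

# Optional: Breusch-Pagan test for non-constant error variance.
# A small p-value would support the mild heteroscedasticity seen in the plot.
# install.packages("lmtest")  # if needed
library(lmtest)
bptest(extract_fit_engine(mod_slr))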

QQ plot

ggplot(train_aug, aes(sample = .resid)) +
  stat_qq(alpha = 0.6) +
  stat_qq_line() +
  labs(title = "QQ Plot of Residuals") +
  theme_minimal()

The QQ plot of the residuals further evaluates whether the error terms follow a normal distribution, which is important for constructing valid confidence intervals and conducting hypothesis tests. In my model, the residuals follow the theoretical normal line closely through the center of the distribution, indicating that the bulk of the residuals are approximately normal. At the extreme ends of the plot, however, the points begin to deviate from the reference line, suggesting slightly heavier tails than would be expected under perfect normality. This means that there are a few observations with residuals that are more extreme than a perfectly normal distribution would produce. These slight departures from normality are not uncommon, especially in simple regression models with only one predictor, and they do not undermine the overall interpretation of the relationship between horsepower and mpg.
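
Similarly, a formal normality test can supplement the QQ plot; the following is an optional sketch, not part of the original output, applied to the residuals already stored in train_aug.

# Optional: Shapiro-Wilk test of normality on the model residuals.
# A small p-value would echo the heavy tails visible in the QQ plot.
shapiro.test(train_aug$.resid)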

In summary, while the diagnostic plots reveal minor violations such as small deviations from normality and some heteroscedasticity, the simple linear regression model remains appropriate for capturing the strong negative relationship between horsepower and fuel efficiency. The diagnostic results support the validity of the model’s inference and prediction outputs within the context of Learning Objective 1.

Scale-Location Plot

ggplot(train_aug, aes(x = .pred, y = sqrt(abs(.resid)))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Scale-Location Plot", x = "Predicted values", y = "Sqrt(|Residuals|)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Histogram of Residuals

ggplot(train_aug, aes(x = .resid)) +
  geom_histogram(bins = 30, alpha = 0.8) +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Count") +
  theme_minimal()

1.8 Predictions & Inference

# Predict mpg at horsepower = 98
new_df <- tibble(horsepower = 98)

# 6A: Point prediction
pred_point <- predict(mod_slr, new_data = new_df)
pred_point
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  24.5

1.9 95% Confidence and Prediction Intervals

# 95% confidence interval for the MEAN response
conf_int <- predict(mod_slr, new_data = new_df, type = "conf_int")
conf_int
## # A tibble: 1 × 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1        24.0        25.0
# 95% prediction interval for a NEW observation
pred_int <- predict(mod_slr, new_data = new_df, type = "pred_int")
pred_int
## # A tibble: 1 × 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1        14.8        34.1
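
These tidymodels intervals can be cross-checked against base R's predict() on the underlying lm object; the following optional sketch should reproduce the same bounds.

# Optional cross-check with base R at horsepower = 98
lm_fit <- extract_fit_engine(mod_slr)
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "confidence", level = 0.95)
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "prediction", level = 0.95)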

Coefficient table with confidence intervals

# Combine into one table (rename interval columns to avoid duplicated names)
bind_cols(
  new_df,
  pred_point,
  conf_int %>% rename(conf_low = .pred_lower, conf_high = .pred_upper),
  pred_int %>% rename(pred_low = .pred_lower, pred_high = .pred_upper)
)
## # A tibble: 1 × 6
##   horsepower .pred conf_low conf_high pred_low pred_high
##        <dbl> <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1         98  24.5     24.0      25.0     14.8      34.1
#  Coefficient table with confidence intervals
coef_tbl <- broom::tidy(mod_slr$fit$fit, conf.int = TRUE)
coef_tbl
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   39.9     0.717        55.7 1.22e-187   38.5      41.3  
## 2 horsepower    -0.158   0.00645     -24.5 7.03e- 81   -0.171    -0.145

Interpretation of the Regression Coefficients:

The estimated intercept is 39.94, meaning that if horsepower were equal to zero, the model predicts an mpg of approximately 39.94. Although horsepower = 0 is not realistic for actual cars, the intercept provides a baseline from which the effect of horsepower is measured.

The estimated slope for horsepower is –0.1578, with a 95% confidence interval from –0.1705 to –0.1452. This means that for each one-unit increase in horsepower, the expected miles per gallon decreases by about 0.158 mpg, on average. The entire confidence interval is negative, showing a consistently decreasing trend.

The p-value for horsepower is 7.031989e-81, which is essentially zero. This provides overwhelming statistical evidence that horsepower is a significant predictor of mpg. In other words, we reject the null hypothesis that horsepower has no effect on mpg.
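
As a quick sanity check, plugging horsepower = 98 into the fitted equation reproduces the point prediction reported in Section 1.8.

# Hand-check of the fitted line at horsepower = 98
39.936 + (-0.158) * 98   # about 24.45, matching the .pred of roughly 24.5 above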

1.10 Learning Objective 1: Reflection

Working through this analysis helped me clearly understand how probability and inference form the foundation of statistical modeling. By fitting a simple linear regression model to predict mpg from horsepower, I was able to see how the assumptions of the model, such as linearity, normality of residuals, and constant variance, are connected to probability concepts we discussed in class. The regression line itself can be estimated by maximum likelihood, which under normal errors gives the same estimates as least squares, and the slope estimate comes with a standard error, confidence interval, and p-value that are all rooted in probability theory. Seeing these values produced from my model helped me understand how statistical inference quantifies uncertainty and supports conclusions about relationships between variables.

The exploratory data analysis was also an important part of this learning objective. Before fitting any model, I examined the distributions of mpg and horsepower, checked for outliers, and visualized the correlation patterns among variables. This step made me feel more confident that a linear model was appropriate and helped me anticipate some of the issues I later observed in the residual plots. After fitting the model, interpreting the coefficient table and diagnostic plots allowed me to connect the theoretical ideas from lectures—like sampling distributions and assumptions of the error term—to real data. For example, the QQ plot helped me evaluate whether the residuals followed a normal distribution, and the residuals-vs-fitted plot helped me assess whether variance was constant. Even though some assumptions were only partly met, the process gave me a clearer sense of how a model behaves when applied to real-world data.

Overall, completing this section strengthened my understanding of how probability supports all parts of a statistical model—from estimating parameters, to interpreting uncertainty, to checking assumptions. It also helped me build confidence in using R and Tidymodels to fit, assess, and explain a regression model. This hands-on experience reinforced that meaningful statistical modeling is not just about running code, but about understanding the reasoning behind the results and being able to clearly communicate what the model is telling us.

2 Learning Objective: Apply the appropriate generalized linear model for a specific data context

Dataset: diamonds (from ggplot2)

Goal: Demonstrate choosing the correct GLM (multiple linear regression for continuous price), justify model choice using EDA, fit the model, interpret coefficients, and evaluate diagnostics.

2.1 Loading libraries and importing Diamonds Dataset

library(ggplot2)
library(dplyr)


data("diamonds")

2.2 Exploratory Data Analysis for Diamonds Dataset

Summary Statistics

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Distribution of price

hist(diamonds$price,
     main = "Distribution of Diamond Prices",
     xlab = "Price",
     col = "lightblue",
     border = "white")

2.3 Relationship between carat and price

plot(diamonds$carat, diamonds$price,
     main = "Price vs Carat",
     xlab = "Carat",
     ylab = "Price",
     pch = 16, col = rgb(0,0,1,0.4))

Boxplots for categorical predictors

boxplot(price ~ cut, data = diamonds,
        main = "Price by Cut Quality",
        xlab = "Cut", ylab = "Price", col = "orange")

boxplot(price ~ color, data = diamonds,
        main = "Price by Color",
        xlab = "Color", ylab = "Price", col = "blue")

boxplot(price ~ clarity, data = diamonds,
        main = "Price by Clarity",
        xlab = "Clarity", ylab = "Price", col = "lightgreen")

To demonstrate my understanding of generalized linear models, I used the diamonds dataset. My response variable was price, which represents the cost of a diamond in U.S. dollars. I examined this variable first and found that price is continuous, strictly positive, and moderately right-skewed. This type of response variable is commonly modeled with a linear model using a Gaussian distribution and an identity link, which is part of the generalized linear model framework.

The predictor variables in this dataset include both numeric (such as carat) and categorical (such as cut, color, and clarity) variables. I visualized the relationship between carat and price and observed a clear positive trend, indicating that larger diamonds are consistently more expensive. Boxplots of price across cut, color, and clarity levels showed strong differences between categories, meaning that diamond quality plays a meaningful role in determining price. This initial EDA helped me understand the structure of the data and guided my decision about which generalized linear model was most appropriate.

2.4 Choosing the Appropriate GLM

The learning objective emphasizes using the appropriate generalized linear model for the data context. Based on my EDA, I identified the following characteristics:

1. The outcome variable (price) is continuous.

2. The distribution is roughly Gaussian after accounting for skewness.

3. The predictors include both quantitative and qualitative variables.

4. The research question focuses on how different factors impact diamond price.

Because of this, the appropriate generalized linear model is a linear regression with a Gaussian family and an identity link function: the standard GLM for a continuous response. Although price is skewed, the linear model still performed well; a log transformation was not required for this analysis, and the residual diagnostics are examined in Section 2.5.4. The categorical variables (cut, color, clarity) also fit naturally into the GLM framework through factor (contrast) coding.

By selecting this model, I was able to explain price variation using carat weight and diamond quality while meeting the expectations of Learning Objective 2.
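
To make the GLM framing explicit, the same specification can be written with glm() and a Gaussian family. This is only an illustrative sketch (the object name glm_diamonds is mine), and its estimates should coincide with the lm() fits in the next section.

# Sketch: the chosen model written explicitly as a GLM with a Gaussian family
# and identity link; equivalent to lm(price ~ carat + cut + color + clarity).
glm_diamonds <- glm(price ~ carat + cut + color + clarity,
                    family = gaussian(link = "identity"),
                    data = diamonds)
head(coef(glm_diamonds))   # should match the Model 2 coefficients below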

2.5 Fitting the Model

I fit two main models:

2.5.1 Model 1: Simple Linear Regression (Price ~ Carat)

model1 <- lm(price ~ carat, data = diamonds)

summary(model1)
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Interpretation of Model 1 (price ~ carat)

The first model used only carat as a predictor of diamond price. The results show a very strong positive association between carat and price. The estimated coefficient for carat is 7756.43, meaning that for every 1-carat increase in size, the expected price of the diamond increases by approximately $7,756, on average. This relationship is highly statistically significant (p < 2e-16), indicating extremely strong evidence that carat weight is a major driver of price.

The adjusted R² for this model is 0.8493, meaning that carat alone explains about 85% of the variability in diamond prices. This is surprisingly high for a single predictor and shows how influential diamond size is. However, the residual standard error of 1549 suggests that while carat explains much of the variation, there is still substantial unexplained noise—likely due to other factors such as cut, color, and clarity.

2.5.2 Model 2: Multiple Linear Regression

model2 <- lm(price ~ carat + cut + color + clarity, data = diamonds)

summary(model2)
## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16813.5   -680.4   -197.6    466.4  10394.9 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3710.603     13.980 -265.414  < 2e-16 ***
## carat        8886.129     12.034  738.437  < 2e-16 ***
## cut.L         698.907     20.335   34.369  < 2e-16 ***
## cut.Q        -327.686     17.911  -18.295  < 2e-16 ***
## cut.C         180.565     15.557   11.607  < 2e-16 ***
## cut^4          -1.207     12.458   -0.097    0.923    
## color.L     -1910.288     17.712 -107.853  < 2e-16 ***
## color.Q      -627.954     16.121  -38.952  < 2e-16 ***
## color.C      -171.960     15.070  -11.410  < 2e-16 ***
## color^4        21.678     13.840    1.566    0.117    
## color^5       -85.943     13.076   -6.572 5.00e-11 ***
## color^6       -49.986     11.889   -4.205 2.62e-05 ***
## clarity.L    4217.535     30.831  136.794  < 2e-16 ***
## clarity.Q   -1832.406     28.827  -63.565  < 2e-16 ***
## clarity.C     923.273     24.679   37.411  < 2e-16 ***
## clarity^4    -361.995     19.739  -18.339  < 2e-16 ***
## clarity^5     216.616     16.109   13.447  < 2e-16 ***
## clarity^6       2.105     14.037    0.150    0.881    
## clarity^7     110.340     12.383    8.910  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared:  0.9159, Adjusted R-squared:  0.9159 
## F-statistic: 3.264e+04 on 18 and 53921 DF,  p-value: < 2.2e-16

The second model included additional categorical predictors: cut, color, and clarity. This model provides a much more detailed understanding of diamond pricing.

Carat: The coefficient for carat increased to 8886.13, meaning that once quality characteristics are accounted for, each additional carat increases the expected price by about $8,886. This positive and highly significant coefficient reflects the fact that size continues to be the most important factor in determining price.

Cut: Because cut is an ordered factor, R represents it with polynomial contrasts (cut.L, cut.Q, cut.C, cut^4) rather than dummy variables relative to a baseline. The large positive linear term (cut.L ≈ 699) indicates that, holding carat, color, and clarity constant, price tends to rise as cut quality improves from Fair toward Ideal; the significant quadratic and cubic terms show the increase is not perfectly linear across the grades.

Color: Color is also an ordered factor, running from D (most colorless) to J. The strongly negative linear term (color.L ≈ –1910) means that, all else equal, price falls as the color grade moves from D toward J; in other words, diamonds closer to colorless command higher prices once carat is controlled for.

Clarity: Clarity grades (from I1 up to IF) show the largest quality effect. The positive linear term (clarity.L ≈ 4218) indicates that diamonds with fewer inclusions and imperfections are substantially more expensive, holding the other predictors constant.

Most of the cut, color, and clarity contrasts are highly significant (p < 0.001), confirming that all three quality characteristics are key drivers of diamond pricing alongside carat.
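
If baseline-style comparisons (for example, each cut grade versus Fair) are preferred over polynomial contrasts, one option is to refit with unordered factors. The sketch below is an optional illustration; diamonds_unordered and model2_trt are names introduced here, not objects from the analysis above.

# Optional: refit with unordered factors so coefficients are differences from a
# baseline level (Fair cut, D color, I1 clarity) instead of polynomial contrasts.
diamonds_unordered <- diamonds %>%
  mutate(across(c(cut, color, clarity), ~ factor(.x, ordered = FALSE)))

model2_trt <- lm(price ~ carat + cut + color + clarity, data = diamonds_unordered)
head(coef(model2_trt))   # e.g., cutGood, cutVery Good, ... relative to cutFair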

Model fit: The adjusted R² increased from 0.8493 (Model 1) to 0.9159 (Model 2). This means that including cut, color, and clarity improves the explanatory power of the model by about 7 percentage points, so the model now explains over 91% of the variation in prices. The residual standard error drops from 1549 in Model 1 to 1157 in Model 2, showing a substantial improvement in predictive accuracy.

2.5.3 Model Comparison and Coefficient Interpretation Table

summary(model1)$adj.r.squared
## [1] 0.8493277
summary(model2)$adj.r.squared
## [1] 0.9159125
anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + color + clarity
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1  53938 1.2935e+11                                   
## 2  53921 7.2163e+10 17 5.7183e+10 2513.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(summary(model2))
##                 Estimate Std. Error       t value      Pr(>|t|)
## (Intercept) -3710.603298   13.98044 -265.41382287  0.000000e+00
## carat        8886.128883   12.03370  738.43710891  0.000000e+00
## cut.L         698.906791   20.33514   34.36940969 4.293856e-256
## cut.Q        -327.685863   17.91083  -18.29540212  1.515019e-74
## cut.C         180.565273   15.55655   11.60702279  4.133866e-31
## cut^4          -1.206906   12.45824   -0.09687611  9.228251e-01
## color.L     -1910.287923   17.71195 -107.85306516  0.000000e+00
## color.Q      -627.953682   16.12110  -38.95229291  0.000000e+00
## color.C      -171.960429   15.07039  -11.41047946  4.014211e-30
## color^4        21.678144   13.84005    1.56633380  1.172764e-01
## color^5       -85.943244   13.07634   -6.57242157  4.995316e-11
## color^6       -49.985933   11.88863   -4.20451593  2.620629e-05
## clarity.L    4217.534915   30.83138  136.79360412  0.000000e+00
## clarity.Q   -1832.406056   28.82735  -63.56483922  0.000000e+00
## clarity.C     923.272965   24.67944   37.41061889 2.004270e-302
## clarity^4    -361.994646   19.73894  -18.33911034  6.820905e-75
## clarity^5     216.616140   16.10937   13.44659446  3.756440e-41
## clarity^6       2.105166   14.03689    0.14997383  8.807858e-01
## clarity^7     110.340329   12.38335    8.91037691  5.240452e-19

Interpretation of the Model Comparison (ANOVA)

The ANOVA test compares Model 1 (only carat) with Model 2 (carat + quality features). The F-statistic is extremely large (2513) and the p-value is < 2.2e-16, meaning Model 2 fits the data significantly better than Model 1.

This comparison showed me three things: (1) cut, color, and clarity add significant additional explanatory power; (2) carat alone, although important, does not capture the full complexity of diamond pricing; and (3) the categorical predictors strongly improve model performance.

Thus, Model 2 is clearly the more appropriate generalized linear model for this data context.

Overall, the results show that diamond price is driven by both physical size (carat) and quality characteristics (cut, color, clarity). The multiple regression model (Model 2) provides a far more complete and accurate explanation than the simple model, and the statistical significance of all predictors demonstrates that each contributes meaningfully to pricing. Model 2’s much higher adjusted R² and lower residual error confirm that it is the superior GLM for this diamonds dataset.

2.5.4 Diagnostic Plots

par(mfrow = c(2, 2))  
plot(model2)

par(mfrow = c(1, 1))

Residuals vs Fitted Plot In this model, the residuals show a clear curved pattern rather than being randomly scattered around zero. This suggests that the relationship between predictors and price may not be perfectly linear, and that there might be additional nonlinear patterns not captured by the model. Additionally, the spread of the residuals increases as fitted values increase. At lower price levels, residuals are tightly packed, but for higher-value diamonds, the residuals fan out substantially. This indicates heteroscedasticity, meaning the variance of errors increases with price. This is common in real-world pricing data because high-value diamonds vary much more in characteristics than low-value ones. Despite this, the model still performs well overall but would benefit from techniques like transformations or weighted regression if precision were critical.

Q–Q Plot of Residuals This examines whether the residuals follow a normal distribution. In this plot, the residuals deviate from the straight line, especially in the upper tail. This indicates that extremely expensive diamonds have residuals that are larger than what would be expected under a normal distribution. The curvature suggests right-skewness, meaning the model tends to underestimate the price of the most expensive diamonds. While this is a deviation from normality, it is not unusual in models with strongly right-skewed outcomes like price data. The central part of the plot aligns reasonably well with the line, meaning normality holds approximately for mid-range diamond prices.

Scale–Location Plot The Scale–Location plot also assesses homoscedasticity. Similar to the first plot, the upward trend in the red line shows that residual variability increases with fitted values. Lower-priced diamonds have more stable and consistent errors, whereas high-priced diamonds display much more spread in residuals. This confirms non-constant variance, which again is typical in datasets where the response variable spans several orders of magnitude. The model captures the general trend but has more difficulty modeling the most expensive diamonds with equal precision.

Residuals vs Leverage Plot This plot helps identify influential observations, that is, points that have a large impact on the fitted model. A few observations near the right side with high standardized residuals stand out. These points correspond to very large diamonds that are both rare and highly influential in the model. The Cook’s distance line indicates that while a few observations approach the threshold, none appear dangerously influential. This means no single observation is distorting the entire model, but certain high-value diamonds carry more weight in determining the regression line. This makes sense because extremely large or high-quality diamonds are scarce and naturally exert more influence.

Despite these issues, the model is still appropriate for demonstrating Learning Objective 2. It explains over 91% of the variance and provides meaningful insights into how carat, cut, color, and clarity affect diamond price.
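
One common remedy for the fan-shaped residuals noted above is to model log(price) rather than price. The sketch below is an optional illustration of that idea (model2_log is a name introduced here), not part of the models fitted in this section.

# Optional: a log-transformed response often stabilizes the variance for
# strongly right-skewed prices. Not used for the main analysis above.
model2_log <- lm(log(price) ~ carat + cut + color + clarity, data = diamonds)
par(mfrow = c(2, 2))
plot(model2_log)   # compare the residual spread with the diagnostics above
par(mfrow = c(1, 1))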

2.6 Learning Objective 2 Reflection

Apply the appropriate generalized linear model for a specific data context

In my work with the diamonds dataset, I demonstrated this learning objective by carefully selecting a generalized linear model that matched the type of outcome I wanted to predict. The response variable, price, is continuous and reasonably modeled with a Gaussian distribution and an identity link. This made a multiple linear regression model the appropriate GLM for this context. Before fitting the model, I completed an exploratory data analysis to better understand the structure of the dataset. I looked at the distributions of price, examined boxplots for cut, color, and clarity, and created scatterplots of carat versus price. These visual patterns gave me confidence that a linear modeling approach was justified and also helped me anticipate which predictors were likely to be meaningful.

After choosing the model, I built two versions: a simple regression using only carat and a multiple regression that incorporated cut, color, and clarity. Fitting this GLM allowed me to work with both quantitative and qualitative predictors, which is one of the strengths of generalized linear modeling. Interpreting the coefficients helped me deepen my understanding of how each variable contributed to diamond pricing. For example, the model showed that carat had the strongest effect on price, but quality categories also played significant roles. Comparing the two models using adjusted R² and ANOVA taught me how to justify why one model is more appropriate than another. This experience helped me connect the idea of “appropriate GLM selection” to actual decision-making based on data behavior.

Working through the diagnostic plots also strengthened my understanding of what it means to apply a GLM responsibly. Seeing patterns like heteroscedasticity and slight deviations from normality gave me a more realistic perspective on how statistical models behave with real-world data. Even though the model wasn’t perfect, I learned how to assess its strengths, limitations, and overall usefulness. This entire process helped me grow both technically and conceptually because I was not only fitting a model but understanding why this GLM was suited for the problem and how to interpret its results in context.

Although I compared two models in Learning Objective 2, that comparison was only supportive. LO2 focused on choosing the correct GLM (a linear regression for predicting price), whereas Learning Objective 3 focuses specifically on comparing multiple models and selecting the best candidate.

3 Learning Objective 3: Demonstrate model selection given a set of candidate models

Model: Linear Discriminant Analysis (classification model)

Dataset: Auto (from ISLR)

Goal: Compare multiple LDA classification models with different predictors, evaluate accuracy and confusion matrices, select the best model, and justify model choice with statistical reasoning.

The Auto dataset was already introduced for Learning Objective 1, so I will not repeat the full cleaning steps here; instead I reload it and use a different variable as the response.

I also convert the origin variable to a factor, which identifies where each car was manufactured (1 = American, 2 = European, 3 = Japanese). Because origin is a categorical variable with three classes, this dataset is well suited to classification models such as Linear Discriminant Analysis (LDA).

3.1 Load Packages + Convert Data

library(MASS)     
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)     


Auto <- ISLR::Auto

# Convert origin to factor for classification
Auto$origin <- factor(Auto$origin,
                      levels = c(1, 2, 3),
                      labels = c("American", "European", "Japanese"))

3.2 Exploratory Data Analysis for LDA

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : Factor w/ 3 levels "American","European",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year            origin                    name    
##  Min.   : 8.00   Min.   :70.00   American:245   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   European: 68   ford pinto        :  5  
##  Median :15.50   Median :76.00   Japanese: 79   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98                  amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00                  amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00                  chevrolet chevette:  4  
##                                                 (Other)           :365
table(Auto$origin)
## 
## American European Japanese 
##      245       68       79
pairs(Auto[, c("mpg", "horsepower", "weight", "acceleration", "displacement")],
      main = "Scatterplot Matrix of Predictors")

The scatterplot matrix helps visualize the relationships among the key numeric predictors: mpg, horsepower, weight, acceleration, and displacement.

I observed the following points:

1. mpg is strongly negatively correlated with horsepower, displacement, and weight, meaning that heavier cars with larger engines tend to have lower fuel efficiency.

2. Horsepower and displacement are highly positively correlated, reflecting the engineering link between engine size and power.

3. Acceleration shows only weak to moderate relationships with the other predictors, suggesting it may contribute some unique information.

Overall, the matrix reveals meaningful structure in the data. These visible patterns support the idea that Linear Discriminant Analysis (LDA) may successfully separate car origins using these numeric characteristics.
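
The group differences described above can also be summarized numerically. The optional sketch below computes each predictor's mean by origin, which should echo the "Group means" that lda() reports in the models that follow.

# Optional: mean of each predictor by origin (should match lda()'s group means)
Auto %>%
  group_by(origin) %>%
  summarise(across(c(mpg, horsepower, weight, acceleration, displacement), mean))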

3.3 MODEL 1 (Simple LDA Model)

Model: origin ~ horsepower + weight

lda_model1 <- lda(origin ~ horsepower + weight, data = Auto)

lda_model1
## Call:
## lda(origin ~ horsepower + weight, data = Auto)
## 
## Prior probabilities of groups:
##  American  European  Japanese 
## 0.6250000 0.1734694 0.2015306 
## 
## Group means:
##          horsepower   weight
## American  119.04898 3372.490
## European   80.55882 2433.471
## Japanese   79.83544 2221.228
## 
## Coefficients of linear discriminants:
##                     LD1          LD2
## horsepower  0.008187225  0.051468802
## weight     -0.001791632 -0.001867805
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9796 0.0204

Prior Probabilities

The priors (American = 0.625, European = 0.174, Japanese = 0.201) reflect the dataset’s class distribution. This means most cars are American, so the model must avoid simply predicting the majority class.

Group Means

American cars: highest horsepower and highest weight

European cars: horsepower similar to Japanese cars, but somewhat heavier

Japanese cars: lowest horsepower and lowest weight

This suggests horsepower and weight do provide meaningful separation.

Linear Discriminants (LD1 & LD2)

LD1 explains 97.9% of the separation which is excellent. It means the first discriminant captures nearly all the classification information.

The coefficients show:

Horsepower has a positive loading on LD1

Weight has a small negative loading

Interpretation: because weight takes much larger values than horsepower, the weight differences dominate LD1. Heavier, higher-powered cars fall toward the “American” end of LD1, while lighter, lower-power cars shift toward the “Japanese” end.

Predict using Model 1

pred1 <- predict(lda_model1)$class

Confusion matrix & accuracy for LDA model 1

conf1 <- table(True = Auto$origin, Predicted = pred1)
conf1
##           Predicted
## True       American European Japanese
##   American      215        3       27
##   European       29        5       34
##   Japanese       21        0       58
accuracy1 <- mean(pred1 == Auto$origin)
accuracy1
## [1] 0.7091837

American cars are classified well (215 correct out of 245), but European cars are the most difficult to classify, with only 5 correct; most are misclassified as American or Japanese. No Japanese cars are misclassified as European.

3.4 MODEL 2 - Full LDA Model

Model: origin ~ horsepower + weight + displacement + acceleration + mpg

lda_model2 <- lda(origin ~ horsepower + weight + displacement + acceleration + mpg,
                  data = Auto)

lda_model2
## Call:
## lda(origin ~ horsepower + weight + displacement + acceleration + 
##     mpg, data = Auto)
## 
## Prior probabilities of groups:
##  American  European  Japanese 
## 0.6250000 0.1734694 0.2015306 
## 
## Group means:
##          horsepower   weight displacement acceleration      mpg
## American  119.04898 3372.490     247.5122     14.99020 20.03347
## European   80.55882 2433.471     109.6324     16.79412 27.60294
## Japanese   79.83544 2221.228     102.7089     16.17215 30.45063
## 
## Coefficients of linear discriminants:
##                        LD1          LD2
## horsepower    0.0303833938  0.012950445
## weight        0.0001451998 -0.001921576
## displacement -0.0198644604  0.018799242
## acceleration  0.0118050767 -0.051837494
## mpg           0.0598010020  0.130294235
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9587 0.0413

This model includes horsepower, weight, displacement, acceleration, and mpg.

Group Means

Differences become clearer:

Japanese cars: lower displacement and weight

American cars: generally heavier and lower mpg

European cars: in between but with higher mpg than Americans

This richer set of variables supports deeper separation.

Coefficients of LD1 and LD2

LD1 again explains about 96% of the between-group separation, which is still strong.

Key trends:

mpg and horsepower carry positive loadings on LD1, while displacement carries a negative loading; weight and acceleration contribute relatively little.

Taken together with the group means, heavy, large-displacement, low-mpg cars (mostly American) fall at one end of LD1, while lighter, higher-mpg cars (mostly Japanese) fall at the other, with European cars in between. The LD scores and separation plot below confirm this pattern.

Overall, Model 2 captures more characteristics of car design.

Predict using Model 2

pred2 <- predict(lda_model2)$class

Confusion matrix & accuracy for LDA model 2

conf2 <- table(True = Auto$origin, Predicted = pred2)
conf2
##           Predicted
## True       American European Japanese
##   American      205       12       28
##   European       12       27       29
##   Japanese        3       16       60
accuracy2 <- mean(pred2 == Auto$origin)
accuracy2
## [1] 0.744898

European cars improve dramatically (27 correct vs. only 5 before) and Japanese cars improve slightly (60 vs. 58), while American cars are classified correctly slightly less often (205 vs. 215). Overall accuracy still rises.

Model 2 provides more balanced, more accurate classification across all three classes.

This is the statistical justification for choosing Model 2.

3.5 Model Comparison for LDA

accuracy1
## [1] 0.7091837
accuracy2
## [1] 0.744898
conf1
##           Predicted
## True       American European Japanese
##   American      215        3       27
##   European       29        5       34
##   Japanese       21        0       58
conf2
##           Predicted
## True       American European Japanese
##   American      205       12       28
##   European       12       27       29
##   Japanese        3       16       60
accuracy2 - accuracy1
## [1] 0.03571429

Accuracy for LDA

Model 1 accuracy = 0.709; Model 2 accuracy = 0.745.

Model 2 is more accurate, and the gain is meaningful: about 3.6 percentage points.

After comparing the accuracy and the confusion matrices, we can say with confidence that Model 2 provides more balanced and more accurate classification across all three classes.

This is the statistical justification for choosing Model 2, and it directly reflects Learning Objective 3.
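
To back up the class-by-class claims, per-class accuracy can be read directly off the confusion matrices, and a leave-one-out cross-validated comparison (via CV = TRUE in MASS::lda) guards against the optimism of training accuracy. The sketch below is an optional extension; cv2 is a name introduced here.

# Optional: per-class accuracy from the confusion matrices above
# (e.g., European accuracy rises from 5/68 to 27/68).
diag(conf1) / rowSums(conf1)
diag(conf2) / rowSums(conf2)

# Optional: leave-one-out cross-validation as a less optimistic comparison.
cv2 <- lda(origin ~ horsepower + weight + displacement + acceleration + mpg,
           data = Auto, CV = TRUE)
mean(cv2$class == Auto$origin)   # LOO-CV accuracy for Model 2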

3.6 Showing LDA separation for Model 2

# Get LDA scores for Model 2
lda2_scores <- predict(lda_model2)
lda2_df <- data.frame(
  LD1 = lda2_scores$x[,1],
  LD2 = lda2_scores$x[,2],
  origin = Auto$origin
)


ggplot(lda2_df, aes(x = LD1, y = LD2, color = origin)) +
  geom_point(alpha = 0.6, size = 2) +
  labs(
    title = "LDA Model 2: Separation of Car Origins",
    x = "Linear Discriminant 1 (LD1)",
    y = "Linear Discriminant 2 (LD2)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    legend.title = element_blank()
  )

The LDA Model 2 separation plot shows much clearer grouping compared to Model 1. American cars (red) are well-separated on the left side of LD1, reflecting their heavier weight, larger displacement, and lower mpg. Japanese cars (blue) cluster tightly on the far right side of LD1, consistent with lighter vehicles and higher fuel efficiency. European cars (green) fall between these two clusters, but they show noticeably less overlap than in the simpler model.

By adding more predictors such as mpg, acceleration, and displacement, Model 2 achieves stronger separation along LD1 and LD2. The three origins form more distinct clusters, visually supporting the higher accuracy obtained by this model. This plot confirms that Model 2 captures more of the underlying structure in the data and provides better discrimination between car origins.

head(predict(lda_model2)$x)
##         LD1       LD2
## 1 -1.751835 0.9096555
## 2 -1.700451 1.4431466
## 3 -1.384355 1.5579608
## 4 -1.214485 0.9881102
## 5 -1.434173 0.9983125
## 6 -2.190709 2.1882261

The scores show that:

The first six cars (the large American 8-cylinder models at the top of the dataset) have strongly negative LD1 values, placing them firmly on the American side of the discriminant space, consistent with the separation plot above.

Their LD2 values matter much less, since LD2 captures only about 4% of the between-group separation.

These LD scores numerically confirm that the predictors meaningfully separate the classes.

Final Interpretation of LDA

LDA Model 2, which includes horsepower, weight, displacement, acceleration, and mpg, provided superior classification performance compared to the simpler Model 1. It improved overall accuracy from 70.9% to 74.5% and substantially reduced misclassification of European cars. The discriminant functions also showed clearer separation in LDA space. Based on these comparisons, Model 2 is the better model for predicting car origin.

3.7 Reflection for Learning Objective 3: Model Selection

For Learning Objective 3, I focused on demonstrating my ability to compare multiple candidate models and make a justified decision about which model performs best. This objective felt different from the previous two, because instead of trying to apply the correct type of model, here I had to think more critically about choosing between models that could all potentially work. To explore model selection, I used the Auto dataset and built two Linear Discriminant Analysis (LDA) models that predicted car origin. Model 1 used only horsepower and weight, while Model 2 added displacement, acceleration, and mpg. This setup naturally created a scenario where I could evaluate whether additional predictors improved the classification task.

Working through these models helped me better understand how model selection requires a combination of statistical reasoning and practical judgment. Even before looking at performance metrics, the EDA showed clear patterns across origins for example, American cars tended to be heavier and have lower mpg, while Japanese cars tended to be lighter and more fuel efficient. This suggested that using more predictors might reveal stronger separation between the classes. Once I compared the models, I found that Model 2 produced a higher accuracy and reduced misclassification of European cars. The visualization of LDA separation confirmed the same finding: the classes formed more distinct clusters in Model 2. This experience helped me appreciate how model selection is not just about choosing the model with the highest numbers, but also about understanding why the improvement happens and whether it is meaningful.

Personally, I felt that Learning objective 3 pushed me to think more deeply about the structure of the data rather than only running models. I had to justify why Model 2 was better not simply because it had a slightly higher accuracy, but because it provided clearer separation and a more balanced classification across the three origins. This process made me more confident in evaluating decisions between simple and complex models, an important real world statistical skill. Overall, learning objective 3 helped me connect model performance, visual interpretation, and practical judgment in a way that felt more like “actual data science” than simply fitting a model.

4 Learning Objective 4: Express the Results of Statistical Models to a General Audience

First, I wanted to concentrate on explaining a single model to a general audience, and then sum everything up at the end.

In this section, I explain my linear regression model predicting fuel efficiency (mpg) from horsepower in the Auto dataset.

4.1 Overview of the Model

For this objective, I chose the simple linear regression model I built earlier using the Auto dataset. The goal of the model was to understand how a car’s engine power, measured by horsepower, is related to its fuel efficiency, measured in miles per gallon (mpg). This is a relationship most people are familiar with in everyday life, which made the model easier to work with and provided a good opportunity to explain statistical results clearly.

4.2 Main Findings Explained Simply

The model found a clear and consistent pattern: cars with higher horsepower tend to get lower gas mileage. To put it simply, stronger engines usually use more fuel, so the car cannot travel as many miles on one gallon of gasoline. This is not surprising from a real-world standpoint, but the model helped me quantify and confirm the relationship using actual data.

The overall trend was strongly negative, meaning that as horsepower increases, mpg decreases. While not all cars follow the pattern perfectly, this model shows that the general relationship holds across many different car types.
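For readers who prefer a picture, a minimal sketch of the trend is below; a fuller version of this plot likely appears in the earlier EDA, so this is only an illustration of the plain-language claim.

# Sketch: the downward trend in one picture (each point is a car).
ggplot(auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Horsepower", y = "Miles per gallon",
       title = "Higher horsepower, lower fuel efficiency")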

4.3 Explaining the Coefficient

The model estimated that, for each one-unit increase in horsepower, fuel efficiency tends to decrease by about 0.16 miles per gallon. For instance, if you compare two cars that are identical in every way except that one has slightly more horsepower, the stronger car will likely get slightly worse gas mileage. This helps communicate the idea that even small increases in engine power come with a cost in fuel efficiency. The negative relationship was strong and statistically meaningful, but the key point for a general audience is that bigger engines require more fuel.
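A short sketch of where the 0.16 figure comes from, and what it implies for a concrete comparison, is shown below. The original model object from earlier in the portfolio may have a different name, so the model is simply refit here.

# Sketch: refit the simple model and read off the slope.
mpg_hp_fit <- lm(mpg ~ horsepower, data = auto)
coef(mpg_hp_fit)["horsepower"]     # roughly -0.16 mpg per additional horsepower

# Concrete comparison: two otherwise identical cars, 100 hp vs 150 hp.
predict(mpg_hp_fit, newdata = data.frame(horsepower = c(100, 150)))
# The 150-hp car is predicted to get about 50 * 0.16 = 8 mpg less.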

4.4 How Well the Model Performs

The model captures the main pattern well, although some cars get better or worse gas mileage than expected based on their horsepower alone. This is normal, because real-world fuel efficiency also depends on other factors such as car weight, driving speed, and engine design. We can therefore say that the model does a good job of showing the general connection between engine power and fuel efficiency, but it cannot predict mpg perfectly because other factors also matter.
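To back up “good but not perfect” with numbers, here is a minimal sketch, again refitting the model so the chunk stands alone.

# Sketch: simple measures of how well the model fits.
mpg_hp_fit <- lm(mpg ~ horsepower, data = auto)
broom::glance(mpg_hp_fit)$r.squared    # share of mpg variation explained (around 0.6)
sqrt(mean(residuals(mpg_hp_fit)^2))    # typical in-sample error, in mpg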

4.5 Why This Information Matters

Understanding the relationship between horsepower and mpg can help people make better decisions when buying cars. For example, someone who commutes long distances might choose a car with lower horsepower to save fuel, a driver who wants stronger performance can expect to spend more on gasoline, and manufacturers can balance performance and efficiency depending on their customers’ needs.

4.6 Reflection for specific model

Working on this objective helped me shift from thinking statistically to thinking communicatively. Instead of focusing on p-values and model assumptions, I had to describe the results in a way that someone with no statistics background could understand. I realized how important it is to simplify without oversimplifying. The direction of the relationship and the practical meaning of the coefficient mattered much more than the technical details.

This objective also helped me understand the role of storytelling in data analysis. Even though the model itself was simple, translating it into everyday language required me to think about the data from the perspective of a general audience. This reminded me that statistical modeling is not just about running analyses; it is also about helping people make informed decisions based on clear explanations. I feel more confident now in my ability to communicate results beyond the classroom.

4.7 Reflection on the Full Project

For this final reflection, I wanted to cover the whole portfolio: what came easily, what required harder work, and what pushed me to think deeply and find appropriate examples for the learning objectives, rather than focusing only on explaining the models.

Throughout this portfolio, I worked on several statistical modeling tasks. My main goal was to understand not only how to run models but also how to think about them and communicate their meaning. Looking back, I see that each learning objective built on the last, gradually strengthening my technical skills and my ability to explain results clearly.

I started by exploring data both visually and numerically. I learned to recognize patterns and understand what variables might reveal before fitting a model. This step showed me how important it is to “listen to the data.” Simple scatterplots and histograms guided my thinking for later analyses. When I fitted my first linear regression model, I connected mathematical results to real-life meaning. For example, observing how horsepower affects mpg wasn’t just a number; it told a clear story about how car engines and fuel efficiency are linked. This was my first experience turning statistical output into something understandable to nearly anyone.

As I progressed, the work became more complex, especially when I compared multiple models or explored classification with LDA. Model selection required critical thinking, not just running code. I had to ask: Why is one model better? Does adding more variables really make a difference? This pushed me to go beyond the mechanics and really evaluate models like a data scientist would. Seeing the improvement from Model 1 to Model 2 in LDA both numerically and visually helped me appreciate how statistics combine reasoning, visualization, and interpretation.

By the time I reached the final learning objective, I recognized how crucial it is to communicate results clearly. Most people do not speak in terms of coefficients, p-values, or discriminant functions. They want to know the story: What does the model tell us? Why does it matter? Practicing this skill helped me view statistics not just as numbers, but as a means of explaining real-world relationships in a meaningful and accessible way. I learned that a good analysis is only valuable if others can understand it.

Overall, this portfolio helped me grow from simply running statistical methods to interpreting, comparing, and communicating them effectively. I feel more confident explaining data insights to both technical and non-technical audiences, and I am grateful for how much more comfortable I have become with the entire modeling process, from EDA to final communication.

5 Learning Objective 5: Use Programming Software to Fit and Assess Statistical Models

I used the R programming language, RStudio, and RPubs throughout this course and for this final portfolio. To start with the course as a whole:

During this course in Statistical Modeling, I worked on a wide range of projects and assignments, including a full linear regression project, a classification project using generalized linear models, and several homework assignments that covered models such as multiple regression with quantitative and qualitative predictors, multinomial logistic regression, Linear Discriminant Analysis, Poisson regression, ridge regression, lasso regression, principal components regression, and polynomial regression. In all of these analyses, I used R and RStudio as my main programming tools, at times publishing the work on RPubs. I have shown code written in both tidymodels and base R, which made it possible to compare different programming styles and how each approach structures modeling.

The course had me use many packages and functions that made statistical analysis more efficient and organized, such as ggplot2, MASS, car, broom, GGally, dplyr, tidyverse, parsnip, and workflows, and functions such as ggpairs(), tidy(), str(), glimpse(), summary(), lm(), predict(), confint(), and augment(), among many others. My workflow includes loading packages, importing datasets, cleaning and preparing data, exploring variables with summaries and plots, creating new or recoded features, fitting a wide range of models, extracting results, generating predictions, checking residuals and diagnostic plots, calculating accuracy or RMSE, comparing candidate models, visualizing LDA separation, and organizing code into clear, well-labeled sections. Combined, these experiences strengthened my ability to fit and assess statistical models through programming and helped me build a complete, repeatable workflow in R.
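As a compact illustration of that workflow (not the exact code used earlier; the predictors and object names here are chosen only for the example):

# Condensed workflow sketch on the Auto data: fit, summarize, diagnose.
fit <- lm(mpg ~ horsepower + weight, data = auto)

broom::tidy(fit, conf.int = TRUE)      # coefficients with confidence intervals
aug <- broom::augment(fit)             # fitted values and residuals

ggplot(aug, aes(x = .fitted, y = .resid)) +   # residuals vs fitted: look for patterns
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed")

sqrt(mean(aug$.resid^2))               # in-sample RMSE as a quick fit summary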

5.1 Specifically to the portfolio models

To meet this objective in this particular project, I demonstrated how to import data, explore it, fit several different types of models using R and RStudio, and evaluate their performance. Throughout the course, and in this portfolio, I have worked extensively with three different modeling approaches: a simple linear regression model to predict mpg from horsepower in the Auto dataset, a multiple regression model on the Diamonds dataset, and a Linear Discriminant Analysis (LDA) classification model for predicting car origin. Together, these three examples show the range of statistical programming I practiced, from basic regression to more advanced model comparison and classification.

Using programming software became a central part of my learning process. I used R to load and clean datasets, create visualizations, calculate summary statistics, and generate plots that helped me interpret patterns before fitting any models. I also used R to fit and assess several types of models (linear regression, multiple regression, and LDA) and to evaluate them using outputs such as coefficients, confidence intervals, residual plots, accuracy scores, and confusion matrices. These tasks required me not only to write code but also to understand what each function was doing and how the results connect back to the data. Throughout the portfolio, I wrote the code in organized chunks with comments, which helped me keep my workflow clean and understandable.
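As one concrete example of the classification assessment described here, a hedged sketch of the confusion matrix step is shown below; it assumes lda_model2 from the earlier LDA work and that origin is coded consistently with how the model was fit.

# Sketch: confusion matrix and overall accuracy for the LDA model.
lda_pred <- predict(lda_model2)$class
conf_mat <- table(Predicted = lda_pred, Actual = auto$origin)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)    # overall accuracy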

Working with three different models helped me grow significantly in my programming skills. On the Auto dataset, I learned how to fit simple models and interpret their coefficients. The Diamonds dataset pushed me to work with multiple predictors and to understand how regression output changes with both categorical and numerical variables. The LDA models required me to step into classification problems, calculate predictions, evaluate accuracy, and visualize separation in discriminant space. Troubleshooting errors, renaming objects for clarity, and structuring code more cleanly ultimately made me more confident and independent in programming.

Each example reinforced a different aspect of model assessment, whether checking assumptions in regression or comparing classification performance across candidate models. Thinking back on this objective, I found that programming software taught me more than just how to write code: it taught me how to create an entire analysis pipeline. It taught me the importance of keeping code organized, labeling each step, and ensuring that the output matches the question I am trying to answer. This objective helped me see how programming stitches together the entire modeling process, from EDA to interpretation. I am much more comfortable fitting and assessing models in R now because I can confidently follow a full workflow. This experience showed me that programming is not just a tool but an essential part of doing statistics effectively and communicating results clearly.