STA-631-Portfolio-Project
Introduction
This project aims to demonstrate my understanding of the course objectives for the Statistical Modeling and Regression course by analyzing the mtcars dataset. The dataset contains various car performance metrics and provides a rich ground for applying statistical models and making inferences.
Course Objectives
1. Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation
Probability is the cornerstone of statistical modeling, enabling us to make inferences about populations based on sample data. In this project, I will explore the distributions of variables in the mtcars dataset, calculate summary statistics, and visualize the data. These steps help us understand the underlying distributions and relationships, forming the basis for further modeling and inference.
2. Determine and apply the appropriate generalized linear model for a specific data context
I will fit regression models to predict car performance metrics such as miles per gallon (mpg). Additionally, I will apply logistic regression to predict binary outcomes like the transmission type (automatic vs. manual). This will demonstrate my ability to choose and apply appropriate models based on the nature of the data and the research questions.
3. Conduct model selection for a set of candidate models
Model selection is crucial for identifying the most suitable model among several candidates. I will use techniques such as stepwise regression, AIC/BIC, and cross-validation to compare model performance. This process helps ensure that the final model is both accurate and parsimonious.
4. Communicate the results of statistical models to a general audience
Effective communication of results is essential for making informed decisions based on statistical analysis. I will create visualizations and write summaries to explain my findings clearly. This ensures that even those without a strong statistical background can understand the implications of my analysis.
5. Use programming software (i.e., R) to fit and assess statistical models
Throughout this project, I will use R and various packages to manipulate data, fit models, and create visualizations. This demonstrates my proficiency in using R for statistical analysis and model assessment.
Data Analysis
Load necessary packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Warning: package 'MASS' was built under R version 4.4.1
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
Load and explore the dataset
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
Exploratory Data Analysis (EDA)
Visualize relationships between variables
Summary of EDA
The pairwise plots provide an overview of the relationships between all variables. We can see that mpg is negatively correlated with wt and hp, while it is positively correlated with qsec.
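The signs described above can also be checked numerically. The following is a minimal base R sketch, not the code that produced the pairwise plots. It assumes Pearson correlation and reads the built-in copy of the data through datasets::mtcars so that it does not depend on any earlier chunk.

```r
# Pearson correlation of mpg with each of the three predictors named above.
# datasets::mtcars is the untouched built-in data frame.
cars <- datasets::mtcars
predictors <- c("wt", "hp", "qsec")

mpg_cor <- sapply(cars[predictors], function(x) cor(cars$mpg, x))
round(mpg_cor, 2)
```

A negative value for wt and hp and a positive value for qsec would agree with the visual impression from the plots.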
Objective 1: Probability and Inference
Summary statistics and visualizations lay the foundation for probability-based modeling. Understanding the distributions and relationships between variables will help me to make inferences about the population.
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
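The objective also names maximum likelihood estimation, which the summary table does not show directly. As a minimal sketch, assuming mpg is treated as a sample from a normal distribution (an illustrative modeling choice, not a claim about the data), the code below maximizes the log-likelihood numerically with optim() and compares the result with the closed-form estimates. The starting values are arbitrary.

```r
# Maximum likelihood for a normal model of mpg (illustrative assumption).
cars <- datasets::mtcars
mpg <- cars$mpg

# Negative log-likelihood; sigma is optimized on the log scale so it stays positive.
neg_loglik <- function(par) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(mpg, mean = mu, sd = sigma, log = TRUE))
}

# Arbitrary starting values: mu = 15, sigma = 5.
mle <- optim(c(15, log(5)), neg_loglik, method = "BFGS")
mu_hat <- mle$par[1]
sigma_hat <- exp(mle$par[2])

# Closed-form MLEs: the sample mean and the standard deviation with divisor n.
mu_closed <- mean(mpg)
sigma_closed <- sqrt(mean((mpg - mu_closed)^2))

c(mu_hat = mu_hat, mu_closed = mu_closed,
  sigma_hat = sigma_hat, sigma_closed = sigma_closed)
```

The numerical and closed-form estimates should agree closely. Note that the MLE of sigma divides by n, so it is slightly smaller than the value returned by sd(), which divides by n - 1.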
Objective 2: Generalized Linear Models
# Multiple linear regression
mlr_fit <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp + qsec + am, data = mtcars)
# Tidy the model summary
mlr_summary <- tidy(mlr_fit$fit)
mlr_summary
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 17.4 9.32 1.87 0.0721
## 2 wt -3.24 0.890 -3.64 0.00114
## 3 hp -0.0176 0.0142 -1.25 0.223
## 4 qsec 0.811 0.439 1.85 0.0757
## 5 am 2.93 1.40 2.09 0.0458
The intercept represents the estimated value of mpg (miles per gallon) when all predictors (wt, hp, qsec, and am) are equal to zero. While this estimate doesn’t have a meaningful interpretation in the context of cars (since a car cannot have zero weight, horsepower, etc.), it serves as the baseline level of mpg. The p-value of 0.072 indicates that the intercept is not statistically significant at the 0.05 level, suggesting that we can’t be very confident that this baseline value differs from zero.
The coefficient for weight (wt) is -3.24, which means that for each unit increase in the weight of the car (1000 lbs), the mpg decreases by approximately 3.24 units, holding all other variables constant. The p-value is 0.001, indicating that this effect is statistically significant at the 0.05 level. Therefore, weight is a significant predictor of fuel efficiency, with heavier cars generally having lower mpg.
The coefficient for horsepower (hp) is -0.018, suggesting that for each additional unit of horsepower, the mpg decreases by about 0.018 units, holding other variables constant. However, the p-value of 0.223 indicates that this relationship is not statistically significant at the 0.05 level. Hence, there is insufficient evidence to conclude that horsepower significantly impacts mpg in this model.
The coefficient for quarter mile time (qsec) is 0.811, implying that each additional second in the quarter-mile time is associated with an increase in mpg by about 0.811 units, holding other variables constant. The p-value of 0.076 indicates that this relationship is not statistically significant at the 0.05 level, although it is marginally significant and may warrant further investigation.
The coefficient for the transmission (am) is 2.93, suggesting that cars with manual transmission (am = 1) have an mpg that is approximately 2.93 units higher than those with automatic transmission (am = 0), holding other variables constant. The p-value of 0.046 indicates that this effect is statistically significant at the 0.05 level, making transmission type a significant predictor of fuel efficiency.
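The phrase "holding other variables constant" can be made concrete with a small prediction example. The sketch below refits the same formula with base R lm() on the built-in data and predicts mpg for two hypothetical cars that differ only in transmission type. The values wt = 3, hp = 150, and qsec = 18 are made up for illustration.

```r
# Refit the multiple linear regression with base R on the built-in data.
cars <- datasets::mtcars
mlr_lm <- lm(mpg ~ wt + hp + qsec + am, data = cars)

# Two hypothetical cars, identical except for transmission (0 = automatic, 1 = manual).
new_cars <- data.frame(
  wt = c(3, 3),
  hp = c(150, 150),
  qsec = c(18, 18),
  am = c(0, 1)
)

pred <- predict(mlr_lm, newdata = new_cars)
pred_gap <- unname(diff(pred))          # manual minus automatic
am_coef <- unname(coef(mlr_lm)["am"])   # coefficient reported in the table above

c(pred_gap = pred_gap, am_coef = am_coef)
```

Because the model is linear with no interaction terms, the gap between the two predictions equals the am coefficient regardless of the values chosen for the other predictors.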
# Transform 'am' to a factor
mtcars <- mtcars %>%
  mutate(am = as.factor(am))
# Logistic regression model to predict transmission type
# (the glm engine for logistic_reg() already uses the binomial family,
# so no family argument is passed to fit())
logit_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(am ~ wt + hp, data = mtcars)
# Summary of the logistic regression model
summary(logit_fit$fit)
##
## Call:
## stats::glm(formula = am ~ wt + hp, family = stats::binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.86630 7.44356 2.535 0.01126 *
## wt -8.08348 3.06868 -2.634 0.00843 **
## hp 0.03626 0.01773 2.044 0.04091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 10.059 on 29 degrees of freedom
## AIC: 16.059
##
## Number of Fisher Scoring iterations: 8
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.9 7.44 2.53 0.0113
## 2 wt -8.08 3.07 -2.63 0.00843
## 3 hp 0.0363 0.0177 2.04 0.0409
In mtcars, am is coded 0 for automatic and 1 for manual, so the model describes the log-odds of a manual transmission.
Intercept (18.87): When wt and hp are zero, the log-odds of having a manual transmission (am = 1) is 18.87. Since wt and hp cannot realistically be zero, this is mostly a baseline for the model.
Weight (wt): Each unit increase in weight (1000 lbs) decreases the log-odds of having a manual transmission by approximately 8.08, holding horsepower constant. The negative coefficient suggests that heavier cars are less likely to have manual transmissions. The p-value of 0.00843 indicates that this effect is statistically significant at the 0.05 level.
Horsepower (hp): Each unit increase in horsepower increases the log-odds of having a manual transmission by approximately 0.036, holding weight constant. The positive coefficient suggests that, at a given weight, more powerful cars are more likely to have manual transmissions. The p-value of 0.04091 indicates that this effect is statistically significant at the 0.05 level.
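Log-odds are hard to read directly, so it can help to convert them to odds ratios and predicted probabilities. The following is a minimal base R sketch that refits the same formula with glm() on the built-in data, where am = 1 codes a manual transmission. The example car (wt = 3, hp = 150) is hypothetical.

```r
# Refit the logistic regression with base R; am is 0 (automatic) or 1 (manual).
cars <- datasets::mtcars
logit_glm <- glm(am ~ wt + hp, data = cars, family = binomial())

# Exponentiating a log-odds coefficient gives an odds ratio:
# the multiplicative change in the odds of am = 1 per one-unit increase.
odds_ratios <- exp(coef(logit_glm))
odds_ratios

# Predicted probability of a manual transmission for a hypothetical car.
new_car <- data.frame(wt = 3, hp = 150)
p_manual <- unname(predict(logit_glm, newdata = new_car, type = "response"))
p_manual
```

An odds ratio below 1 means the odds of a manual transmission fall as the predictor increases, and a ratio above 1 means they rise.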
Objective 3: Model Selection
Stepwise Regression
# stepwise regression
step_model <- stepAIC(lm(mpg ~ wt + hp + qsec + am, data = mtcars), direction = "both")
## Start: AIC=61.52
## mpg ~ wt + hp + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## + hp 1 9.219 160.07 61.515
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am1 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
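At each step, stepwise selection tries every single-term addition or deletion and keeps the change that lowers the AIC the most. Dropping hp lowers the AIC from 61.515 to 61.307, and no further addition or deletion improves it, so the selected model is mpg ~ wt + qsec + am. The difference of about 0.2 AIC units is small, so the two models fit the data almost equally well and the selected model is preferred mainly because it is more parsimonious.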
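The objective also mentions BIC, which can be compared alongside AIC for the two candidate models. The sketch below uses base R lm(), AIC(), and BIC() on the built-in data. One caveat: the AIC values printed during stepwise selection come from extractAIC(), which omits an additive constant, so they differ from AIC() in level but not in the difference between models fitted to the same data.

```r
# Fit the full and the stepwise-selected model with base R.
cars <- datasets::mtcars
full <- lm(mpg ~ wt + hp + qsec + am, data = cars)
reduced <- lm(mpg ~ wt + qsec + am, data = cars)

# Side-by-side information criteria; lower is better for both.
comparison <- data.frame(
  model = c("full", "reduced"),
  AIC = c(AIC(full), AIC(reduced)),
  BIC = c(BIC(full), BIC(reduced))
)
comparison

# The AIC gap is the same whichever AIC convention is used.
gap_AIC <- AIC(full) - AIC(reduced)
gap_extract <- extractAIC(full)[2] - extractAIC(reduced)[2]
c(gap_AIC = gap_AIC, gap_extract = gap_extract)
```

BIC penalizes each extra parameter by log(n) rather than 2, so with 32 observations it favors the smaller model more strongly than AIC does.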
Cross-validation
# Cross validation
set.seed(123)
cv_results <- vfold_cv(mtcars, v = 10)
cv_model <- function(train_data, test_data) {
  fit <- lm(mpg ~ wt + hp + qsec + am, data = train_data)
  preds <- predict(fit, newdata = test_data)
  rmse <- sqrt(mean((test_data$mpg - preds)^2))
  return(rmse)
}
cv_rmse <- map_dbl(cv_results$splits, ~ cv_model(analysis(.x), assessment(.x)))
mean(cv_rmse)
## [1] 2.66519
Cross-validation is a technique used to assess how well a statistical model generalizes to data it was not fitted on. Here, 10-fold cross-validation of the full model (mpg ~ wt + hp + qsec + am) gives 2.66519, which is the Root Mean Squared Error (RMSE) averaged across the ten held-out folds.
Interpretation: RMSE Value (2.66519): This value summarizes the typical size of the prediction errors, i.e., the differences between the actual and predicted values of mpg on the held-out folds. Because the errors are squared before averaging, large errors weigh more heavily than small ones. The lower the RMSE, the better the model’s predictive accuracy.
Model Performance: An RMSE of 2.66519 suggests that the model’s predictions of mpg are typically off by approximately 2.67 miles per gallon. This provides an indication of how well the model is likely to perform on unseen data.
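Cross-validation can also be used to compare candidate models rather than to assess a single one. The following is a minimal base R sketch that assigns the 32 cars to 10 folds by hand and computes the average held-out RMSE for the full model and for the model without hp. The fold assignment and the helper name cv_rmse are illustrative and do not reproduce the vfold_cv() splits, so the numbers will not match the value reported above.

```r
# Manual 10-fold cross-validation on the built-in data.
cars <- datasets::mtcars
set.seed(123)
k <- 10
# Randomly assign each row to one of k folds of nearly equal size.
folds <- sample(rep(seq_len(k), length.out = nrow(cars)))

# Hypothetical helper: average held-out RMSE of an lm() with the given formula.
cv_rmse <- function(formula) {
  fold_rmse <- sapply(seq_len(k), function(i) {
    train <- cars[folds != i, ]
    test <- cars[folds == i, ]
    fit <- lm(formula, data = train)
    preds <- predict(fit, newdata = test)
    sqrt(mean((test$mpg - preds)^2))
  })
  mean(fold_rmse)
}

rmse_full <- cv_rmse(mpg ~ wt + hp + qsec + am)
rmse_reduced <- cv_rmse(mpg ~ wt + qsec + am)
c(full = rmse_full, reduced = rmse_reduced)
```

With only 32 observations, each fold holds three or four cars, so the estimates vary noticeably with the random seed. Small differences between the two models should be read with that in mind.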
Objective 4: Communicate Results
Throughout this project, I have communicated results and outputs by pairing the analysis with clear, concise explanations and visualizations. I kept explanations precise and focused on the key findings and their implications, so that the results are accessible and actionable for stakeholders.
Visualizations
# MLR Coefficients
tidy(mlr_fit, conf.int = TRUE) %>%
  ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip()
This plot visualizes the estimates of the regression coefficients along with their confidence intervals.
Terms: Each term in the MLR model ((Intercept), wt, hp, qsec, am) is mapped to the x aesthetic. Because of coord_flip(), the terms appear along the vertical axis.
Estimates: The central point for each term represents the estimated coefficient and, after the flip, is read along the horizontal axis. For the predictors, these estimates indicate the expected change in the response variable (mpg) for a one-unit change in the predictor, holding all other predictors constant.
Confidence Intervals: The lines extending from each estimate show the confidence intervals, typically at the 95% level. These intervals provide a range within which we can be reasonably confident the true coefficient lies.
- A confidence interval that does not cross zero suggests that the predictor has a statistically significant effect on the response variable.
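The "does not cross zero" rule can be checked without reading it off the plot. As a minimal base R sketch, the code below refits the same formula with lm() on the built-in data, computes 95% confidence intervals with confint(), and flags the terms whose interval excludes zero.

```r
# 95% confidence intervals for the multiple linear regression coefficients.
cars <- datasets::mtcars
mlr_lm <- lm(mpg ~ wt + hp + qsec + am, data = cars)
ci <- confint(mlr_lm, level = 0.95)

# TRUE when the whole interval lies on one side of zero.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
cbind(round(ci, 3), excludes_zero)
```

For a linear model, a 95% interval excludes zero exactly when the corresponding t-test p-value is below 0.05, so the flags should agree with the coefficient table: wt and am are flagged, while hp, qsec, and the intercept are not.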
# Logistic Regression Coefficients
tidy(logit_fit, conf.int = TRUE) %>%
  ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip()
This plot visualizes the estimates of the logistic regression model coefficients and their confidence intervals.
Terms: Each term in the logistic regression model ((Intercept), wt, hp) is mapped to the x aesthetic. Because of coord_flip(), the terms appear along the vertical axis.
Estimates: The central point for each term represents the estimated log-odds coefficient and, after the flip, is read along the horizontal axis. For the predictors, these estimates indicate the change in the log-odds of the outcome (am being 1, i.e., a manual transmission) for a one-unit change in the predictor, holding the other predictor constant.
Confidence Intervals: The lines extending from each estimate show the confidence intervals, typically at the 95% level. These intervals provide a range within which we can be reasonably confident the true coefficient lies.
- A confidence interval that does not cross zero suggests that the predictor has a statistically significant effect on the response variable.
Summary of Findings
The multiple linear regression model shows that wt and am are significant predictors of mpg at the 0.05 level. The logistic regression model indicates that hp and wt are significant predictors of transmission type (am). These results align with my expectations and demonstrate the effectiveness of the models.
Objective 5: Use R for Analysis
Throughout this project, I have used R for data manipulation, model fitting, and visualization. This showcases my ability to use R programming software to conduct comprehensive statistical analyses.
Conclusion
This project has demonstrated the application of various statistical modeling techniques using the mtcars dataset. By following the course objectives, I have effectively analyzed the data, selected appropriate models, and communicated my findings. This comprehensive analysis highlights my understanding and proficiency in statistical modeling and regression.
Class Activities
In addition to this project, I completed all the class activities for this course. I have put a lot of effort into completing these activities and I have learned a lot in the process. The following is a list of links to the activities I successfully completed that demonstrate my understanding of the course objectives:
Describe your participation in our course community.
(Remember that participation (active) is different from attendance or “turning things in” (passive). How did you contribute to our course learning community?)
The following is a description of my participation in the course learning community:
The most important aspect of my participation in this class is the presentation that I made on the Job Ads assignment. I want to hone my public speaking skills, especially in a professional setting, and this was a perfect opportunity to put that skill to the test by speaking to my classmates and my instructor. I learnt a lot from my classmates’ presentations: how they formatted their slides, how they engaged the audience, and the bits of career information each of us shared. For instance, I discovered that soft skills such as team collaboration, the ability to work independently, and communication skills were required in almost all the jobs that were presented. The presentations were a great way to learn and to practice public speaking, even though that was not the main aim of the assignment.
The Data Feminism channel on Teams was my other avenue for participation in this course. I made an effort to provide my replies to the weekly Data Feminism chapter prompts that the instructor posted. It enabled me to read the Data Feminism book objectively through the guidance of the prompts. Reading replies from my classmates provided me with fresh perspectives on the issues raised in the prompts. I also posted a couple of things that were muddy to me in the Muddy channel and responded to muddy issues raised by classmates that I had a solution for.
I actively posted thoughts about my progress in STA 631 on LinkedIn. The main aim was to test what I had learnt by writing article-like posts for my future reference and for engagement with my LinkedIn community. I made posts about the process of conducting statistical modeling, linear regression, multiple linear regression, logistic regression, generalized linear models, and the Data Feminism course text. I have learned to share knowledge on a social media platform, which is a great way of practicing course objective 4 (communicating the results of a statistical model to a general audience).