STA-631-Portfolio-Project

Introduction

This project aims to demonstrate my understanding of the course objectives for the Statistical Modeling and Regression course by analyzing the mtcars dataset. The dataset contains various car performance metrics and provides a rich ground for applying statistical models and making inferences.

Course Objectives

1. Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation

Probability is the cornerstone of statistical modeling, enabling us to make inferences about populations based on sample data. In this project, I will explore the distributions of variables in the mtcars dataset, calculate summary statistics, and visualize the data. These steps help us understand the underlying distributions and relationships, forming the basis for further modeling and inference.

2. Determine and apply the appropriate generalized linear model for a specific data context

I will fit regression models to predict car performance metrics such as miles per gallon (mpg). I will also apply logistic regression to predict a binary outcome, the transmission type (automatic vs. manual). This will demonstrate my ability to choose and apply appropriate models based on the nature of the data and the research questions.

3. Conduct model selection for a set of candidate models

Model selection is crucial for identifying the most suitable model among several candidates. I will use techniques such as stepwise regression, AIC/BIC, and cross-validation to compare model performance. This process helps ensure that the final model is both accurate and parsimonious.

4. Communicate the results of statistical models to a general audience

Effective communication of results is essential for making informed decisions based on statistical analysis. I will create visualizations and write summaries to explain my findings clearly. This ensures that even those without a strong statistical background can understand the implications of my analysis.

5. Use programming software (i.e., R) to fit and assess statistical models

Throughout this project, I will use R and various packages to manipulate data, fit models, and create visualizations. This demonstrates my proficiency in using R for statistical analysis and model assessment.

Data Analysis

Load necessary packages

# Load necessary packages
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.1 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.1
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(broom)

Load and explore the dataset

# Load and explore data
data(mtcars)
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Exploratory Data Analysis (EDA)

Visualize relationships between variables

# Pairwise plots
ggpairs(mtcars[, c("mpg", "hp", "am", "cyl", "wt")])

Summary of EDA

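The pairwise plots provide an overview of the relationships among the five plotted variables (mpg, hp, am, cyl, and wt). We can see that mpg is negatively correlated with wt, hp, and cyl, and that it tends to be higher for manual cars (am = 1). qsec is not part of this plot, but it is positively correlated with mpg and is used as a predictor in the models below.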
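
The correlations described above can also be checked numerically. The following is a minimal sketch using base R's cor(); the names eda_vars and cor_mat are illustrative and not part of the original analysis, and a fresh copy of the data is read from the datasets package.

```r
# Pearson correlations between mpg and a few candidate predictors
# (eda_vars and cor_mat are illustrative names, not part of the original analysis)
eda_vars <- c("mpg", "wt", "hp", "qsec")
cor_mat <- cor(datasets::mtcars[, eda_vars])

# Correlation of each variable with mpg, rounded for readability
round(cor_mat["mpg", ], 2)
```

A negative value indicates that mpg tends to fall as the other variable rises, and a positive value indicates the opposite.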

Objective 1: Probability and Inference

Summary statistics and visualizations lay the foundation for probability-based modeling. Understanding the distributions and relationships between variables will help me to make inferences about the population.

# Summary statistics
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
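
To connect these summaries to maximum likelihood estimation, here is a minimal sketch that fits a normal distribution to mpg by numerically minimizing the negative log-likelihood. Treating mpg as normally distributed is an assumption made only for illustration, and the names y, neg_loglik, mu_hat, and sigma_hat are hypothetical.

```r
# Maximum likelihood sketch: fit N(mu, sigma^2) to mpg
# (normality of mpg is assumed purely for illustration)
y <- datasets::mtcars$mpg

# Negative log-likelihood; sigma is optimised on the log scale so it stays positive
neg_loglik <- function(par, y) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Numerical optimisation, started from the sample mean and standard deviation
mle <- optim(c(mean(y), log(sd(y))), neg_loglik, y = y)
mu_hat <- mle$par[1]
sigma_hat <- exp(mle$par[2])

# Closed-form MLEs for comparison (note the divisor n, not n - 1)
c(numerical = mu_hat, closed_form = mean(y))
c(numerical = sigma_hat, closed_form = sqrt(mean((y - mean(y))^2)))
```

The numerical and closed-form estimates should agree closely. The MLE of sigma divides by n, so it is slightly smaller than the value returned by sd().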

Objective 2: Generalized Linear Models

# Multiple linear regression
mlr_fit <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp + qsec + am, data = mtcars)

# Tidy the model summary
mlr_summary <- tidy(mlr_fit$fit)
mlr_summary
## # A tibble: 5 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  17.4       9.32        1.87 0.0721 
## 2 wt           -3.24      0.890      -3.64 0.00114
## 3 hp           -0.0176    0.0142     -1.25 0.223  
## 4 qsec          0.811     0.439       1.85 0.0757 
## 5 am            2.93      1.40        2.09 0.0458

The intercept represents the estimated value of mpg (miles per gallon) when all predictors (wt, hp, qsec, and am) are equal to zero. While this estimate doesn’t have a meaningful interpretation in the context of cars (since a car cannot have zero weight, horsepower, etc.), it serves as the baseline level of mpg. The p-value of 0.072 indicates that the intercept is not statistically significant at the 0.05 level, suggesting that we can’t be very confident that this baseline value differs from zero.

The coefficient for weight (wt) is -3.24, which means that for each unit increase in the weight of the car (1000 lbs), the mpg decreases by approximately 3.24 units, holding all other variables constant. The p-value is 0.001, indicating that this effect is statistically significant at the 0.05 level. Therefore, weight is a significant predictor of fuel efficiency, with heavier cars generally having lower mpg.

The coefficient for horsepower (hp) is -0.018, suggesting that for each additional unit of horsepower, the mpg decreases by about 0.018 units, holding other variables constant. However, the p-value of 0.223 indicates that this relationship is not statistically significant at the 0.05 level. Hence, there is insufficient evidence to conclude that horsepower significantly impacts mpg in this model.

The coefficient for quarter mile time (qsec) is 0.811, implying that each additional second in the quarter-mile time is associated with an increase in mpg by about 0.811 units, holding other variables constant. The p-value of 0.076 indicates that this relationship is not statistically significant at the 0.05 level, although it is marginally significant and may warrant further investigation.

The coefficient for the transmission (am) is 2.93, suggesting that cars with manual transmission (am = 1) have an mpg that is approximately 2.93 units higher than those with automatic transmission (am = 0), holding other variables constant. The p-value of 0.046 indicates that this effect is statistically significant at the 0.05 level, making transmission type a significant predictor of fuel efficiency.
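
The significance statements above can be cross-checked with confidence intervals. This is a minimal sketch that refits the same formula with base R's lm() on a fresh copy of the data; the names cars_lm, fit_lm, ci, and excludes_zero are illustrative.

```r
# Refit the same model with base R on a fresh copy of the data
# (cars_lm and fit_lm are illustrative names)
cars_lm <- datasets::mtcars
fit_lm <- lm(mpg ~ wt + hp + qsec + am, data = cars_lm)

# 95% confidence intervals: estimate +/- t quantile * standard error
ci <- confint(fit_lm, level = 0.95)
ci

# Flag the terms whose interval excludes zero
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
excludes_zero
```

A term whose 95% interval excludes zero is significant at the 0.05 level, so this flag should match the p-values reported in the table.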

# Transform 'am' to a factor
mtcars <- mtcars %>%
  mutate(am = as.factor(am))

# Logistic regression model to predict transmission type
# (the "glm" engine fits a binomial model by default, so no family argument is needed)
logit_fit <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(am ~ wt + hp, data = mtcars)

# Summary of the logistic regression model
summary(logit_fit$fit)
## 
## Call:
## stats::glm(formula = am ~ wt + hp, family = stats::binomial, 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 18.86630    7.44356   2.535  0.01126 * 
## wt          -8.08348    3.06868  -2.634  0.00843 **
## hp           0.03626    0.01773   2.044  0.04091 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 10.059  on 29  degrees of freedom
## AIC: 16.059
## 
## Number of Fisher Scoring iterations: 8
# Displaying the coefficients in a tidy format
tidy(logit_fit$fit)
## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  18.9       7.44        2.53 0.0113 
## 2 wt           -8.08      3.07       -2.63 0.00843
## 3 hp            0.0363    0.0177      2.04 0.0409

Intercept (18.87): When wt and hp are zero, the log-odds of having a manual transmission (am = 1) is 18.87. Since wt and hp cannot realistically be zero, this is mostly a baseline for the model.

Weight (wt): Each unit increase in weight (1000 lbs) decreases the log-odds of having a manual transmission by approximately 8.08, holding horsepower constant. The negative coefficient suggests that heavier cars are less likely to have manual transmissions. The p-value of 0.00843 indicates that this effect is statistically significant at the 0.05 level.

Horsepower (hp): Each unit increase in horsepower increases the log-odds of having a manual transmission by approximately 0.036, holding weight constant. The positive coefficient suggests that, for a given weight, more powerful cars are more likely to have manual transmissions. The p-value of 0.04091 indicates that this effect is statistically significant at the 0.05 level.
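
Log-odds are hard to read directly, so it can help to convert them to odds ratios and predicted probabilities. The sketch below refits the same formula with base R's glm() on a fresh copy of the data, where am = 1 denotes a manual transmission. The names cars_glm, fit_glm, and new_car are illustrative, and the car with wt = 3 and hp = 120 is a made-up example.

```r
# Refit the logistic model with base R on a fresh copy of the data
# (cars_glm and fit_glm are illustrative names; am = 1 means manual)
cars_glm <- datasets::mtcars
fit_glm <- glm(am ~ wt + hp, data = cars_glm, family = binomial())

# Odds ratios: multiplicative change in the odds of am = 1 per one-unit increase
odds_ratios <- exp(coef(fit_glm))
odds_ratios

# Predicted probability of a manual transmission for a hypothetical car
# (3,000 lbs and 120 hp are made-up values for illustration)
new_car <- data.frame(wt = 3, hp = 120)
p_manual <- predict(fit_glm, newdata = new_car, type = "response")
p_manual
```

An odds ratio below 1 means the odds of a manual transmission fall as the predictor increases, and a ratio above 1 means they rise.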

Objective 3: Model Selection

Stepwise Regression

# stepwise regression
step_model <- stepAIC(lm(mpg ~ wt + hp + qsec + am, data = mtcars), direction = "both")
## Start:  AIC=61.52
## mpg ~ wt + hp + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
summary(step_model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am1           2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

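stepAIC() adds or drops one predictor at a time and keeps the change that lowers the AIC the most. Dropping hp lowers the AIC from 61.515 to 61.307, and no further change improves it, so the selected model is mpg ~ wt + qsec + am. The improvement is small (about 0.2 AIC units), so the two models fit almost equally well; the selected one is preferred because it is more parsimonious.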
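
The two candidate models can also be compared directly on AIC and BIC. This is a minimal sketch using base R; the names cars_sel, full, reduced, and comparison are illustrative, and BIC is added here as an extra criterion that the stepwise output does not report.

```r
# Fit both candidate models on a fresh copy of the data
# (cars_sel, full, and reduced are illustrative names)
cars_sel <- datasets::mtcars
full <- lm(mpg ~ wt + hp + qsec + am, data = cars_sel)
reduced <- lm(mpg ~ wt + qsec + am, data = cars_sel)

# AIC() and BIC() include constants that extractAIC() (used by stepAIC) omits,
# so the absolute values differ from the stepwise table but the ranking is the same
comparison <- data.frame(
  model = c("full", "reduced"),
  AIC = c(AIC(full), AIC(reduced)),
  BIC = c(BIC(full), BIC(reduced))
)
comparison
```

Lower values are better for both criteria. BIC penalizes each extra parameter more heavily than AIC, so it tends to favor the smaller model more strongly.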

Cross-validation

# Cross validation
set.seed(123)
cv_results <- vfold_cv(mtcars, v = 10)

cv_model <- function(train_data, test_data) {
  fit <- lm(mpg ~ wt + hp + qsec + am, data = train_data)
  preds <- predict(fit, newdata = test_data)
  rmse <- sqrt(mean((test_data$mpg - preds)^2))
  return(rmse)
}

cv_rmse <- map_dbl(cv_results$splits, ~ cv_model(analysis(.x), assessment(.x)))
mean(cv_rmse)
## [1] 2.66519

Cross-validation is a technique used to assess how well a statistical model generalizes to data it was not fitted on. Here the full model (mpg ~ wt + hp + qsec + am) is fitted on nine folds and evaluated on the held-out fold, ten times in total. The output, 2.66519, is the mean of the ten per-fold Root Mean Squared Error (RMSE) values.

Interpretation: RMSE Value (2.66519): RMSE measures the typical size of the prediction errors, i.e., the differences between the actual and predicted values of mpg. The lower the RMSE, the better the model’s predictive accuracy.

Model Performance: An RMSE of 2.66519 suggests that the model’s mpg predictions are typically off by about 2.67 miles per gallon. With only 32 cars, each held-out fold contains just three or four observations, so this estimate is itself fairly noisy.
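
Cross-validation can also be used to compare candidate models on the same folds. The following is a minimal base R sketch that does so for the full and the stepwise-selected model. The fold assignment here is done by hand rather than with vfold_cv(), so the numbers will not match the value above exactly; the names cars_cv, fold_id, and cv_rmse are illustrative.

```r
# Compare two candidate models with 10-fold cross-validation in base R
# (cars_cv, fold_id, and cv_rmse are illustrative names)
cars_cv <- datasets::mtcars
set.seed(123)
k <- 10

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(cars_cv)))

# Mean of the per-fold RMSEs for a model with response mpg
cv_rmse <- function(formula, data, fold_id) {
  per_fold <- sapply(sort(unique(fold_id)), function(f) {
    train <- data[fold_id != f, ]
    test <- data[fold_id == f, ]
    fit <- lm(formula, data = train)
    sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  })
  mean(per_fold)
}

cv_compare <- c(
  full = cv_rmse(mpg ~ wt + hp + qsec + am, cars_cv, fold_id),
  reduced = cv_rmse(mpg ~ wt + qsec + am, cars_cv, fold_id)
)
cv_compare
```

Using the same fold_id for both models means any difference in RMSE reflects the models rather than the random split.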

Objective 4: Communicate Results

Throughout this project, I have communicated results by pairing each output with a clear, concise explanation and, where useful, a visualization. I kept explanations focused on the key findings and their implications so that the results are accessible and actionable for stakeholders.

Visualizations

# MLR Coefficients
tidy(mlr_fit, conf.int = TRUE) %>%
  ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() 

This plot visualizes the estimates of the regression coefficients along with their confidence intervals.

Terms (vertical axis): Because of coord_flip(), each term (or predictor) in the MLR model ((Intercept), wt, hp, qsec, am) appears on the vertical axis.

Estimates (horizontal axis): The central point for each term represents the estimated coefficient. These estimates indicate the expected change in the response variable (mpg) for a one-unit change in the predictor, holding all other predictors constant.

Confidence Intervals: The lines extending from each estimate show the confidence intervals, typically at the 95% level. These intervals provide a range within which we can be reasonably confident the true coefficient lies.

  • A confidence interval that does not cross zero suggests that the predictor has a statistically significant effect on the response variable.

# Logistic Regression Coefficients
tidy(logit_fit, conf.int = TRUE) %>%
  ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip()

This plot visualizes the estimates of the logistic regression model coefficients and their confidence intervals.

Terms (vertical axis): Because of coord_flip(), each term in the logistic regression model ((Intercept), wt, hp) appears on the vertical axis.

Estimates (horizontal axis): The central point for each term represents the estimated log-odds coefficient. These estimates indicate the change in the log-odds of the outcome (am being 1, i.e., a manual transmission) for a one-unit change in the predictor, holding all other predictors constant.

Confidence Intervals: The lines extending from each estimate show the confidence intervals, typically at the 95% level. These intervals provide a range within which we can be reasonably confident the true coefficient lies.

  • A confidence interval that does not cross zero suggests that the predictor has a statistically significant effect on the response variable.

Summary of Findings

The multiple linear regression model shows that wt and am are significant predictors of mpg at the 0.05 level, while hp and qsec are not. The logistic regression model indicates that wt and hp are both significant predictors of transmission type (am). These results align with my expectations.

Objective 5: Use R for Analysis

Throughout this project, I have used R for data manipulation, model fitting, and visualization. This showcases my ability to use R programming software to conduct comprehensive statistical analyses.

Conclusion

This project has demonstrated the application of various statistical modeling techniques using the mtcars dataset. By following the course objectives, I have effectively analyzed the data, selected appropriate models, and communicated my findings. This comprehensive analysis highlights my understanding and proficiency in statistical modeling and regression.

Class Activities

In addition to this project, I completed all the class activities for this course. I have put a lot of effort into completing these activities and I have learned a lot in the process. The following is a list of links to the activities I successfully completed that demonstrate my understanding of the course objectives:

Describe your participation in our course community.

(Remember how participation (active) is different from attendance or “turning things in” (passive). How did you contribute to our course learning community?)

The following is a description of my participation in the course learning community:

The most important aspect of my participation in this class is the presentation that I gave on the Job Ads assignment. I want to hone my public speaking skills, especially in a professional setting, and this was a perfect opportunity to put them to the test by speaking to my classmates and my instructor. I learned a lot from my classmates’ presentations: how they formatted their slides, how they engaged the audience, and the bits of career information each of us shared. For instance, I discovered that soft skills such as team collaboration, the ability to work independently, and communication were required in almost all the jobs that were presented. The presentations were a great way to learn and to practice public speaking, even though that was not the main aim of the assignment.

The Data Feminism channel on Teams was my other avenue for participation in this course. I made an effort to reply to the weekly Data Feminism chapter prompts that the instructor posted. The prompts guided me to read the Data Feminism book objectively. Reading replies from my classmates provided me with fresh perspectives on the issues raised in the prompts. I also posted a couple of things that were muddy to me in the Muddy channel and responded to muddy issues raised by classmates when I had a solution.

I actively posted thoughts about my progress in STA 631 on LinkedIn. The main aim was to test what I have learned by writing article-like posts for my future reference and for engagement with my LinkedIn community. I made posts about the process of statistical modeling, linear regression, multiple linear regression, logistic regression, generalized linear models, and the Data Feminism course text. I have learned to share knowledge on a social media platform, which is a great way of practicing course objective 4 (communicating the results of statistical models to a general audience).