Data Exploration

  1. Load the mtcars dataset, review its codebook, and use summary statistics and visualizations to understand the dataset.
## Loading in the dataset, reviewing the codebook, and using summary statistics
data(mtcars)
?mtcars
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
## Visualizations
# Histogram of MPG
hist(mtcars$mpg, main = "Distribution of Miles Per Gallon", xlab = "MPG")

# Boxplot: MPG by Transmission Type
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic", "Manual"),
        main = "MPG by Transmission Type",
        ylab = "Miles Per Gallon")

# Scatterplot: Weight vs MPG
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon")

  1. Identify at least one trend, correlation, or pattern you find interesting!
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

A correlation of about -0.87 means that heavier cars tend to get fewer miles per gallon, which makes sense because the heavier a car is, the more fuel it needs to move the same distance.
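As a rough check that this correlation is not a sampling accident, `cor.test()` from base R reports a p-value and a confidence interval alongside the estimate. This is only a sketch; the object name `wt_mpg_test` is illustrative and not part of the original analysis.

```r
# Pearson correlation test between weight and mpg (base R, stats package)
data(mtcars)
wt_mpg_test <- cor.test(mtcars$wt, mtcars$mpg)

wt_mpg_test$estimate   # same value as cor(mtcars$wt, mtcars$mpg)
wt_mpg_test$p.value    # p-value for the null hypothesis of zero correlation
wt_mpg_test$conf.int   # 95% confidence interval for the correlation
```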

  1. What variables are most strongly correlated with mpg? Why?
cor(mtcars$mpg, mtcars) 
##      mpg       cyl       disp         hp      drat         wt     qsec
## [1,]   1 -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251

Weight is most strongly correlated with mpg, with a correlation of -0.868: heavier cars need more fuel to travel the same distance. Cylinders and displacement follow, each with a correlation of about -0.85. More cylinders or a larger displacement generally mean a bigger engine, which burns more fuel.
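To make the ranking easier to read than the wide one-row matrix, the correlations can be sorted by absolute value. A minimal sketch, with `mpg_cor` and `mpg_cor_ranked` as illustrative names:

```r
data(mtcars)

# Correlation of every variable with mpg, as a named vector
mpg_cor <- cor(mtcars)[, "mpg"]

# Drop mpg's correlation with itself, then order by absolute strength
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]
mpg_cor_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]

round(mpg_cor_ranked, 3)
```

Sorting on `abs()` keeps strong negative and strong positive relationships together while the printed values keep their signs.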

Data Preprocessing

  1. Check whether there are missing data using R codes. If not, show evidence.
sum(is.na(mtcars))
## [1] 0

sum(is.na(mtcars)) counts the missing values across the whole dataset. It returns 0, so there are no missing values.
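The same evidence can be broken down by variable and by row, which is helpful when a dataset does contain missing values. A short sketch using base R only; `na_per_column` and `complete_rows` are illustrative names:

```r
data(mtcars)

anyNA(mtcars)                                 # TRUE if any value is missing

na_per_column <- colSums(is.na(mtcars))       # missing count for each variable
na_per_column

complete_rows <- sum(complete.cases(mtcars))  # rows with no missing values
complete_rows
```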

  1. Check whether there are inconsistent/invalid data. If not, show evidence.
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
## Confirms that all variables are numeric
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
## All values fall within plausible ranges for their variables (for example, vs and am range only from 0 to 1, and cyl from 4 to 8)
any(duplicated(mtcars))
## [1] FALSE
## The output is FALSE, showing that there are no duplicated rows in the dataset
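The range checks read off the summary can also be written as explicit rules, so that invalid values would be flagged automatically. The rules below are assumptions based on the codebook (binary indicators, positive measurements, whole-number counts), not checks from the original analysis:

```r
data(mtcars)

checks <- c(
  # vs and am are 0/1 indicator variables
  binary_vs_am      = all(mtcars$vs %in% c(0, 1)) && all(mtcars$am %in% c(0, 1)),
  # physical measurements should be strictly positive
  positive_measures = all(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")] > 0),
  # counts of cylinders, gears and carburetors should be whole numbers
  whole_counts      = all(mtcars[, c("cyl", "gear", "carb")] %% 1 == 0),
  # each car (row name) should appear only once
  unique_car_names  = !any(duplicated(rownames(mtcars)))
)
checks
```

Any rule that returns FALSE would point to a column worth inspecting by hand.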

Linear Regression using lm

  1. Create a linear regression model to predict mpg based on all the other variables in the dataset.
linear_model <- lm(mpg ~ ., data = mtcars)
summary(linear_model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
  1. What assumptions are being made when we use linear regression? Are they met in this dataset?

The assumptions made when using linear regression are linearity, independence of the errors, homoscedasticity, normality of the residuals, and no multicollinearity among the predictors.
par(mfrow = c(2,2))
plot(linear_model)

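The linear regression assumptions are mostly met for this dataset, but multicollinearity among the engine-related variables and slight non-linearity are noticeable. The multicollinearity also shows in the model summary: R-squared is 0.869, yet no individual coefficient is significant at the 0.05 level. The model still performs well overall, but using fewer predictors would make the coefficient estimates more reliable.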
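The diagnostic plots do not show multicollinearity directly, so one way to quantify it is the variance inflation factor (VIF). The sketch below computes it in base R from the inverse of the predictors' correlation matrix rather than relying on an add-on package; `predictors` and `vif_values` are illustrative names. A common rule of thumb treats values above roughly 5 to 10 as a concern.

```r
data(mtcars)

# All columns except the response
predictors <- mtcars[, names(mtcars) != "mpg"]

# The VIF of predictor j is the j-th diagonal element of the inverse
# of the predictors' correlation matrix
vif_values <- diag(solve(cor(predictors)))
names(vif_values) <- names(predictors)

round(sort(vif_values, decreasing = TRUE), 2)
```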

  1. Evaluate the model using MSE.
MSE <- mean(linear_model$residuals^2)
MSE
## [1] 4.609201
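This MSE is computed on the same 32 cars used to fit the model, so it is likely to be optimistic. As a hedged comparison, leave-one-out cross-validation can be obtained for a linear model without refitting, using the identity that the held-out residual for observation i equals e_i / (1 - h_ii), where h_ii is its leverage. The names `in_sample_mse`, `loo_residuals`, and `loocv_mse` are illustrative.

```r
data(mtcars)
linear_model <- lm(mpg ~ ., data = mtcars)

# MSE on the training data
in_sample_mse <- mean(residuals(linear_model)^2)

# Leave-one-out residuals via the leverage shortcut e_i / (1 - h_ii)
loo_residuals <- residuals(linear_model) / (1 - hatvalues(linear_model))
loocv_mse <- mean(loo_residuals^2)

c(in_sample = in_sample_mse, loocv = loocv_mse)
```

The gap between the two numbers gives a sense of how much the full model overfits with ten predictors and only 32 observations.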
  1. Adding interaction terms to your linear regression model.
model_interact <- lm(mpg ~ wt * am, data = mtcars)
summary(model_interact)
## 
## Call:
## lm(formula = mpg ~ wt * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6004 -1.5446 -0.5325  0.9012  6.0909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.4161     3.0201  10.402 4.00e-11 ***
## wt           -3.7859     0.7856  -4.819 4.55e-05 ***
## am           14.8784     4.2640   3.489  0.00162 ** 
## wt:am        -5.2984     1.4447  -3.667  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8151 
## F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

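When an interaction term between weight and transmission type is added to the regression model, the interaction is statistically significant (p = 0.00102). This means that the effect of a car’s weight on fuel efficiency (mpg) depends on whether the car has a manual or automatic transmission: each additional 1000 lbs is associated with about 3.79 fewer mpg for automatics and about 9.08 fewer mpg (-3.79 - 5.30) for manuals.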
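Because am is coded 0 for automatic and 1 for manual, the weight slope for each transmission type can be computed from the fitted coefficients, along with the weight at which the two fitted lines cross. A small sketch; `slope_automatic`, `slope_manual`, and `crossover_wt` are illustrative names:

```r
data(mtcars)
model_interact <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(model_interact)

# Change in mpg per additional 1000 lbs, by transmission type
slope_automatic <- unname(b["wt"])               # am = 0
slope_manual    <- unname(b["wt"] + b["wt:am"])  # am = 1

# Weight (in 1000 lbs) at which both transmission types have the same
# predicted mpg: solve b_am + b_wt:am * wt = 0 for wt
crossover_wt <- unname(-b["am"] / b["wt:am"])

c(automatic = slope_automatic, manual = slope_manual, crossover = crossover_wt)
```

Below the crossover weight the model predicts higher mpg for manual cars; above it, the prediction favors automatics.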

  1. Are there any outliers in the data? If so, winsorize the variable with the most outliers.
boxplot(mtcars, las = 2, main = "Boxplots for Outlier Detection")

## Horsepower shows a point above its upper whisker. The variables are on very different scales, so outliers in small-range variables are hard to see in this plot.
boxplot.stats(mtcars$hp)$out
## [1] 335
## boxplot.stats() flags only one hp value (335) as an outlier under the 1.5 * IQR rule; hp is the variable chosen for winsorizing.
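Since the combined boxplot mixes very different scales, a more reliable way to find the variable with the most outliers is to count the flagged points per column. A minimal sketch using the same `boxplot.stats()` rule; `outlier_counts` is an illustrative name:

```r
data(mtcars)

# Number of points beyond the boxplot whiskers for each variable
outlier_counts <- sapply(mtcars, function(x) length(boxplot.stats(x)$out))

sort(outlier_counts, decreasing = TRUE)
```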

I was not sure how to winsorize the variable, so I did not.
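For reference, winsorizing means replacing values beyond chosen cutoffs with the cutoffs themselves instead of deleting them. The sketch below is one possible approach in base R, assuming the 5th and 95th percentiles as cutoffs (other choices, such as the boxplot fences, are equally defensible). The helper `winsorize()` and the name `hp_wins` are hypothetical and not part of the original analysis.

```r
data(mtcars)

# Hypothetical helper: cap values below/above the given quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- unname(quantile(x, probs = probs, na.rm = TRUE))
  pmin(pmax(x, limits[1]), limits[2])
}

hp_wins <- winsorize(mtcars$hp)

# Compare the original and winsorized ranges
range(mtcars$hp)
range(hp_wins)

# After winsorizing, check whether any points are still flagged
boxplot.stats(hp_wins)$out
```

The result could be stored as a new column (for example `mtcars$hp_wins <- hp_wins`) so that the original hp values remain available.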