Homework 7

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

cars_data<-read.csv("cars_data(1).csv")

cars1<-lm(wt~hp+am,data=cars_data)
plot(cars1,which=1)

Here we see that the red line stays mostly close to the horizontal zero line in the residuals vs. fitted plot, which suggests that a linear relationship between weight and the predictors (horsepower and transmission) is a reasonable assumption.

raintest(cars1)

## 
##  Rainbow test
## 
## data:  cars1
## Rain = 0.9008, df1 = 16, df2 = 13, p-value = 0.5844

In the Rainbow test, a low p-value (below 0.05) means that the relationship is NOT linear. Here the p-value is 0.5844, well above 0.05, so there is no evidence against linearity and the assumption holds.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

durbinWatsonTest(cars1)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.4261432      1.101482    0.01
##  Alternative hypothesis: rho != 0

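A p-value greater than 0.05 means there is no evidence against independent errors. In this case, the errors are not independent since the p-value is 0.01. Also, the D-W statistic of 1.10 is well below 2 and the lag-1 autocorrelation is positive (0.43), which points to positive autocorrelation in the residuals.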
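The Durbin-Watson statistic can be computed directly from the residuals, which makes it easier to see why values near 2 indicate no first-order autocorrelation. The sketch below is only an illustration: it uses the built-in `mtcars` dataset as a stand-in for `cars_data`, since the CSV itself is not shown here, and whether the two match is an assumption.

```r
# Stand-in data: mtcars has the same column names (wt, hp, am).
# Whether it matches cars_data(1).csv is an assumption.
fit <- lm(wt ~ hp + am, data = mtcars)
e <- unname(residuals(fit))
n <- length(e)

# Durbin-Watson statistic: squared successive differences of the
# residuals divided by the sum of squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals (cross-product form).
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

# dw is roughly 2 * (1 - r1): r1 near 0 gives dw near 2,
# positive r1 pulls dw below 2.
cat("D-W statistic:", round(dw, 4), "\n")
cat("2 * (1 - r1): ", round(2 * (1 - r1), 4), "\n")
```

The approximation is not exact because the first and last residuals each appear in only one successive difference.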

plot(cars1,which=3)

The red line should be almost totally straight, and the points should cluster evenly around it. This red line starts straight but then gets wavy, so the plot suggests the assumption of homoscedasticity may be violated.

bptest(cars1)

## 
##  studentized Breusch-Pagan test
## 
## data:  cars1
## BP = 5.0175, df = 2, p-value = 0.08137

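The null hypothesis of the Breusch-Pagan test is that the errors are homoscedastic. In this case the p-value is 0.081, which is greater than 0.05, so we fail to reject the null: there is no significant evidence of heteroscedasticity at the 5% level, although the result is borderline.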
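To see what the studentized Breusch-Pagan test measures, the statistic can be rebuilt by hand: regress the squared residuals on the same predictors and multiply the R-squared of that auxiliary regression by the sample size. This is a minimal sketch of the studentized (Koenker) version that `bptest()` uses by default, again assuming `mtcars` as a stand-in for `cars_data`.

```r
# Stand-in data: assumes mtcars matches cars_data (not confirmed).
fit <- lm(wt ~ hp + am, data = mtcars)
n <- nobs(fit)

# Auxiliary regression: squared residuals on the original predictors.
aux_data <- transform(mtcars, e2 = residuals(fit)^2)
aux <- lm(e2 ~ hp + am, data = aux_data)

# Studentized BP statistic = n * R-squared of the auxiliary regression.
bp_stat <- n * summary(aux)$r.squared
df_bp <- length(coef(aux)) - 1  # number of predictors
p_value <- pchisq(bp_stat, df = df_bp, lower.tail = FALSE)

# Null hypothesis: homoscedastic errors. Reject only if p < 0.05.
reject_homoscedasticity <- p_value < 0.05
cat("BP =", round(bp_stat, 4), " df =", df_bp,
    " p-value =", round(p_value, 5), "\n")
```

A large statistic means the predictors explain a lot of the variation in the squared residuals, which is what heteroscedasticity looks like.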

plot(cars1,which=2)

There is some deviation from the dotted line, but not much; many of the points land on the line when checking the Q-Q plot for normality of residuals.

shapiro.test(cars1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  cars1$residuals
## W = 0.92104, p-value = 0.02219

Since the p-value (0.022) is less than 0.05, the residuals are significantly different from a normal distribution, which means they are not normal.

vif(cars1)

##       hp       am 
## 1.062867 1.062867

Since neither value is over 10 (both are about 1.06), neither variable is strongly correlated with the other. There is no multicollinearity.
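The two VIF values are identical because the model has only two predictors. Each VIF is 1 / (1 - R-squared) from regressing one predictor on the other, and with two predictors that R-squared is the same in both directions. A short sketch in base R, assuming `mtcars` as a stand-in for `cars_data`:

```r
# Stand-in data: assumes mtcars matches cars_data (not confirmed).
# R-squared from regressing each predictor on the other one.
r2_hp <- summary(lm(hp ~ am, data = mtcars))$r.squared
r2_am <- summary(lm(am ~ hp, data = mtcars))$r.squared

# Variance inflation factor: 1 / (1 - R-squared).
vif_hp <- 1 / (1 - r2_hp)
vif_am <- 1 / (1 - r2_am)

cat("VIF hp:", round(vif_hp, 6), "\n")
cat("VIF am:", round(vif_am, 6), "\n")
```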

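My model does not fare well. It violated two assumptions: the Durbin-Watson test shows the errors are not independent, and the Shapiro-Wilk test shows the residuals are not normally distributed. The Breusch-Pagan result was also borderline (p = 0.081), and the scale-location plot looked uneven. This requires some mitigation.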

cars1<-lm(wt~hp+am,data=cars_data)
cars1log<-lm(log(wt)~log(hp)+am,data=cars_data)

raintest(cars1log)

## 
##  Rainbow test
## 
## data:  cars1log
## Rain = 1.7267, df1 = 16, df2 = 13, p-value = 0.1628

Logging the variables and re-running the Rainbow test lowered the p-value from 0.5844 to 0.1628. It is still above 0.05, so the logged model also passes the linearity check.
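Putting the test p-values for both models side by side makes the effect of the log transformation easier to judge. The sketch below assumes `mtcars` as a stand-in for `cars_data`; the helper `diagnostic_p()` is a hypothetical name introduced here, and the Shapiro-Wilk check on the logged model is an extra check beyond the tests run on it here.

```r
library(lmtest)  # raintest(), bptest()

# Stand-in data: assumes mtcars matches cars_data (not confirmed).
fit_raw <- lm(wt ~ hp + am, data = mtcars)
fit_log <- lm(log(wt) ~ log(hp) + am, data = mtcars)

# Hypothetical helper: p-value of each assumption test for one model.
diagnostic_p <- function(model) {
  c(
    rainbow       = unname(raintest(model)$p.value),
    breusch_pagan = unname(bptest(model)$p.value),
    shapiro_wilk  = shapiro.test(residuals(model))$p.value
  )
}

# One row per model; a p-value above 0.05 means the assumption
# is not rejected.
comparison <- rbind(raw = diagnostic_p(fit_raw),
                    log = diagnostic_p(fit_log))
print(round(comparison, 4))
```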

cars1_robust<-rlm(wt~hp,data=cars_data)


summary(cars1)

## 
## Call:
## lm(formula = wt ~ hp + am, data = cars_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83338 -0.24390 -0.05175  0.15592  1.24801 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.576956   0.255062  10.103 5.22e-11 ***
## hp           0.007437   0.001406   5.289 1.14e-05 ***
## am          -1.109360   0.193215  -5.742 3.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5207 on 29 degrees of freedom
## Multiple R-squared:  0.7351, Adjusted R-squared:  0.7168 
## F-statistic: 40.24 on 2 and 29 DF,  p-value: 4.315e-09

summary(cars1_robust)

## 
## Call: rlm(formula = wt ~ hp, data = cars_data)
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.3739796 -0.5225638 -0.0004306  0.4365530  1.6001726 
## 
## Coefficients:
##             Value  Std. Error t value
## (Intercept) 1.8491 0.3076     6.0110 
## hp          0.0092 0.0019     4.8210 
## 
## Residual standard error: 0.6837 on 30 degrees of freedom
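Note that the robust model above uses only `hp`, while `cars1` uses `hp` and `am`, so the two summaries are not a like-for-like comparison. One possible way to compare them directly is to fit the robust model with the same formula. This is a sketch, not the fit reported above, and it assumes `mtcars` as a stand-in for `cars_data`.

```r
library(MASS)  # rlm()

# Stand-in data: assumes mtcars matches cars_data (not confirmed).
fit_ols    <- lm(wt ~ hp + am, data = mtcars)
fit_robust <- rlm(wt ~ hp + am, data = mtcars)  # same formula as the OLS fit

# Side-by-side coefficients; large differences would suggest that
# outliers are pulling the ordinary least squares estimates.
coef_table <- cbind(ols = coef(fit_ols), robust = coef(fit_robust))
print(round(coef_table, 4))
```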

bptest(cars1)

## 
##  studentized Breusch-Pagan test
## 
## data:  cars1
## BP = 5.0175, df = 2, p-value = 0.08137

bptest(cars1log)

## 
##  studentized Breusch-Pagan test
## 
## data:  cars1log
## BP = 0.27048, df = 2, p-value = 0.8735

plot(cars1,which=2)

plot(cars1log,which=2)

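The Breusch-Pagan p-value rose from 0.081 to 0.874 after the log transformation, so neither model is significantly heteroscedastic at the 5% level, and the logged model shows even less evidence of it than before.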

vif(cars1)

##       hp       am 
## 1.062867 1.062867

This didn’t change because it is the same model; both values remained below 2.

cars1<-lm(wt~hp+am,data=cars_data)
summary(cars1)

## 
## Call:
## lm(formula = wt ~ hp + am, data = cars_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83338 -0.24390 -0.05175  0.15592  1.24801 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.576956   0.255062  10.103 5.22e-11 ***
## hp           0.007437   0.001406   5.289 1.14e-05 ***
## am          -1.109360   0.193215  -5.742 3.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5207 on 29 degrees of freedom
## Multiple R-squared:  0.7351, Adjusted R-squared:  0.7168 
## F-statistic: 40.24 on 2 and 29 DF,  p-value: 4.315e-09

The mitigation helped some. So yes, weight is associated with both horsepower and transmission, but not in a direct causal way. A heavier vehicle will most likely need more horsepower to perform, and the transmission can be matched to the car's weight to optimize performance. It's accurate to say that horsepower and transmission are significant variables: both have p-values far below 0.05, and together they explain about 74% of the variation in weight (R-squared = 0.7351).
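The coefficients can be read as differences in predicted weight. The sketch below predicts weight for three hypothetical cars (the horsepower values of 150 and 160 are made up for illustration) and assumes `mtcars` as a stand-in for `cars_data`.

```r
# Stand-in data: assumes mtcars matches cars_data (not confirmed).
fit <- lm(wt ~ hp + am, data = mtcars)

# Hypothetical cars: same horsepower with am = 0 and am = 1,
# plus one with 10 more horsepower.
new_cars <- data.frame(hp = c(150, 150, 160), am = c(0, 1, 0))
pred <- unname(predict(fit, newdata = new_cars))

# Difference between am = 1 and am = 0 at equal horsepower:
# this equals the am coefficient.
am_effect <- pred[2] - pred[1]

# Difference for 10 extra horsepower at the same am:
# this equals 10 times the hp coefficient.
hp_effect <- pred[3] - pred[1]

cat("am effect:", round(am_effect, 4), "\n")
cat("10 hp effect:", round(hp_effect, 4), "\n")
```

These differences describe associations in the fitted model, not causal effects.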

A. Franco

2025-04-08