Section I: Introduction

The Auto dataset, taken from the ISLR package, contains 392 observations of 9 variables. I plan to evaluate three of them (miles per gallon, origin, and year) and their relationship to horsepower. I want to find out whether there are trends between horsepower and variables that are usually seen as unrelated to it. I hope that one of the less likely variables turns out to have a strong relationship with horsepower. However, I hypothesize that there will most likely not be any especially strong correlations, with the exception of miles per gallon, for which I expect a rather significant negative relationship.

The four variables are described as follows:
. Miles per Gallon- Miles per gallon. Numeric variable

. Origin- Origin of car (1. American 2. European 3. Japanese). Categorical Variable

. Year- Model year, recorded as two digits (70 to 82). Stored as a number and treated as a categorical variable in the plots

. Horsepower- Engine horsepower. Numerical variable

Each variable was chosen for a specific reason. To find out whether variables that aren’t usually associated with horsepower actually have some correlation with it, I had to choose variables that don’t have an obvious relation to it. For this reason, I selected year and origin. Cars obviously improve over time, and I’m sure the average horsepower rises as time progresses, but I don’t believe the year of a car by itself means it will have higher or lower horsepower; e.g. a minivan today wouldn’t have more hp than a muscle car from the ’80s. Origin was selected because, of the three explanatory variables, I anticipate it to be the least, well, explanatory. The place a car was built shouldn’t determine its horsepower in any way. Manufacturers build powerful cars all over the world, although greater demand for high-hp cars in some regions could produce some correlation, and I’m excited to see whether it does. Miles per gallon and horsepower are fairly related, at least according to my limited mechanical knowledge: higher horsepower generally comes with lower miles per gallon. I wanted to use this variable as a benchmark for the other two, as well as to look for an actual relationship. Miles per gallon should act as a baseline for the other two variables to try to reach; if I’m wrong, I will simply have found another variable that has little correlation with horsepower.

Section II: Exploratory Data Analysis

Horsepower Analysis

summary(Auto$horsepower)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    46.0    75.0    93.5   104.5   126.0   230.0
boxplot(Auto$horsepower, main = "Dispersion of Horsepower", sub = "A dataset of 392 cars from '70-'82", xlab = "horsepower", horizontal = TRUE, col = "skyblue", outpch = 16, outcol = "skyblue")

sd(Auto$horsepower)
## [1] 38.49116
quantile(Auto$horsepower, c(0, .2, .4, .6, .8, 1))
##   0%  20%  40%  60%  80% 100% 
##   46   72   88  100  140  230
zscoresHP<- scale(Auto$horsepower)
summary(zscoresHP)
##        V1         
##  Min.   :-1.5190  
##  1st Qu.:-0.7656  
##  Median :-0.2850  
##  Mean   : 0.0000  
##  3rd Qu.: 0.5594  
##  Max.   : 3.2613

The horsepower values span a large range across the 392 observed cars. The mean and median are on the low side of that range, and the mean (104.5) sits above the median (93.5), so the distribution is skewed to the right. With a third quartile of 126, 75% of the cars have a horsepower of 126 or less, which is to be expected. This matters because it tells us the cars in the study are relatively varied while remaining a good representation of cars in general, since there are more slow ones than fast ones. The z-scores also point to a few possible outliers: the pontiac catalina, buick estate wagon (sw), buick electra 225 custom, and pontiac grand prix all had z-scores more than 3 standard deviations above the mean, outside the band that would hold about 99.7% of observations if horsepower were normally distributed.
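
The z-score check described above can be sketched in a few lines of R. This is only an illustration, not the code used for this report: `flag_outliers` is a hypothetical helper and `toy_hp` is made-up data rather than values from Auto. The cutoff at the end is computed from the mean and standard deviation reported above.

```r
# Hypothetical helper: indices of values more than k standard deviations
# from the mean (uses the sample standard deviation, as scale() does).
flag_outliers <- function(x, k = 3) {
  z <- (x - mean(x)) / sd(x)
  which(abs(z) > k)
}

# Made-up horsepower values: 19 ordinary cars and one very powerful one.
toy_hp <- c(rep(100, 19), 400)
flagged <- flag_outliers(toy_hp)
print(flagged)

# Upper 3-standard-deviation cutoff implied by the summary statistics above
# (the printed mean of 104.5 is rounded, so the cutoff is approximate).
hp_cutoff <- 104.5 + 3 * 38.49116
print(round(hp_cutoff, 2))
```

With the dataset loaded, something like `Auto$name[flag_outliers(Auto$horsepower)]` should list the flagged cars by name.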

Miles per Gallon Analysis

summary(Auto$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60
boxplot(Auto$mpg, main = "Dispersion of Miles per Gallon", sub = "A dataset of 392 cars from '70-'82", xlab = "mpg", horizontal = TRUE, col = "burlywood2")

sd(Auto$mpg)
## [1] 7.805007
quantile(Auto$mpg, c(0, .2, .4, .6, .8, 1))
##    0%   20%   40%   60%   80%  100% 
##  9.00 16.00 20.00 25.00 30.98 46.60
zscores_mpg<- scale(Auto$mpg)
summary(zscores_mpg)
##        V1          
##  Min.   :-1.85085  
##  1st Qu.:-0.82587  
##  Median :-0.08916  
##  Mean   : 0.00000  
##  3rd Qu.: 0.71160  
##  Max.   : 2.96657

Again, we see good variability in the vehicles’ miles per gallon. The standard deviation is much lower than horsepower’s, which isn’t too surprising considering horsepower is measured on a larger scale. The z-scores are spread rather evenly as well, with none beyond 3 in absolute value and so no potential outliers.
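
Because the two variables are on different scales, one way to compare their spread is the coefficient of variation (standard deviation divided by mean). The sketch below is an added illustration using only the summary statistics reported above; the variable names are my own.

```r
# Summary statistics reported above (the means are rounded in the output).
hp_mean  <- 104.5
hp_sd    <- 38.49116
mpg_mean <- 23.45
mpg_sd   <- 7.805007

# Coefficient of variation: standard deviation as a fraction of the mean.
cv_hp  <- hp_sd / hp_mean
cv_mpg <- mpg_sd / mpg_mean

print(round(c(horsepower = cv_hp, mpg = cv_mpg), 3))
```

On this relative scale the two spreads are much closer than the raw standard deviations suggest, with horsepower only slightly more dispersed.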

Model Year Analysis

summary(Auto$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70.00   73.00   76.00   75.98   79.00   82.00
Years<- table(Auto$year)
barplot(Years, main = "Model Year of Cars", sub = "A dataset of 392 cars from '70-'82", xlab = "Years", ylab = "# of Cars", col = "Orange")

sd(Auto$year)
## [1] 3.683737
quantile(Auto$year, c(0, .2, .4, .6, .8, 1))
##   0%  20%  40%  60%  80% 100% 
##   70   72   75   77   80   82
zscoresyr<- scale(Auto$year)
summary(zscoresyr)
##        V1          
##  Min.   :-1.62324  
##  1st Qu.:-0.80885  
##  Median : 0.00554  
##  Mean   : 0.00000  
##  3rd Qu.: 0.81993  
##  Max.   : 1.63432

The number of cars per model year is fairly even, which is good for comparing horsepower by year later on.

Car Origin Analysis

summary(Auto$origin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.577   2.000   3.000
Origins<- table(Auto$origin)
barplot(Origins[order(Origins, decreasing = TRUE)], main = "Manufacturing Origin of Cars", sub = "A dataset of 392 cars from '70-'82", xlab = "Origin", ylab = "# of Cars", names.arg = c("American", "Japanese", "European"), ylim = c(0,250), col = "plum4")

sd(Auto$origin)
## [1] 0.8055182
quantile(Auto$origin, c(0,.2, .4, .6, .8, 1))
##   0%  20%  40%  60%  80% 100% 
##  1.0  1.0  1.0  1.0  2.8  3.0
Ozscores<- scale(Auto$origin)
summary(Ozscores)
##        V1         
##  Min.   :-0.7157  
##  1st Qu.:-0.7157  
##  Median :-0.7157  
##  Mean   : 0.0000  
##  3rd Qu.: 0.5257  
##  Max.   : 1.7671

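The origin of the cars is not as evenly distributed as the years. There are more American cars in the study than European and Japanese cars combined. Because origin is a category code, the mean, standard deviation, and z-scores above are not meaningful in themselves; the counts are what matter. The larger American sample may make the estimates for American cars more reliable than those for the other two regions when relating origin to horsepower.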
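
When a variable is stored as numeric codes, attaching the labels with `factor()` keeps each count tied to its category even after sorting. The sketch below is an added illustration: `toy_origin` is made-up data, not the Auto counts, and the variable names are hypothetical.

```r
# Made-up origin codes (1 = American, 2 = European, 3 = Japanese).
toy_origin <- c(1, 1, 3, 1, 2, 3, 1, 3, 1, 2, 1, 3)

# Attach the labels to the codes once, so they travel with the data.
origin_f <- factor(toy_origin, levels = 1:3,
                   labels = c("American", "European", "Japanese"))

# Count and sort; each name stays attached to its own count.
origin_counts <- sort(table(origin_f), decreasing = TRUE)
print(origin_counts)

# barplot(origin_counts, ylab = "# of Cars")  # bars are labelled automatically
```

Passing the sorted table straight to `barplot()` avoids having to type the category names in the right order by hand.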

Bivariate Data Analysis

Miles per Gallon and Horsepower

plot(Auto$horsepower~Auto$mpg, main = "Relationship of Horsepower and Miles per Gallon", sub = "A dataset of 392 cars from '70-'82", 
     xlab = "Miles per Gallon", ylab = "Horsepower")
abline(lm(Auto$horsepower~Auto$mpg), col = "burlywood2", lwd = 3)

cor(Auto$mpg, Auto$horsepower)
## [1] -0.7784268
cov(Auto$mpg, Auto$horsepower)
## [1] -233.8579
Mpg_HP<- lm(Auto$horsepower~Auto$mpg)
summary(Mpg_HP)
## 
## Call:
## lm(formula = Auto$horsepower ~ Auto$mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.892 -15.716  -2.094  13.108  96.947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 194.4756     3.8732   50.21   <2e-16 ***
## Auto$mpg     -3.8389     0.1568  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

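With a correlation of -.7784 and a covariance of -233.8579, it’s pretty clear I was successful in selecting a good variable to mark what a strong predictor looks like. A correlation of nearly -.8 indicates a strong negative linear relationship: cars with higher miles per gallon tend to have lower horsepower. It does not mean the two variables move together 80% of the time; the share of variance in horsepower explained by miles per gallon is the R-squared, about 61%.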
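
The correlation and covariance reported above are linked through the two standard deviations: the correlation is the covariance divided by the product of the standard deviations. The check below is an added illustration that uses only numbers already printed in this report; the variable names are my own.

```r
# Values reported above.
cov_mpg_hp <- -233.8579
sd_mpg     <- 7.805007
sd_hp      <- 38.49116

# Pearson correlation is the covariance rescaled by both standard deviations.
r <- cov_mpg_hp / (sd_mpg * sd_hp)

# Its square is the share of variance explained in a one-predictor regression.
r_squared <- r^2

print(round(c(r = r, r_squared = r_squared), 4))
```

The result reproduces both the `cor()` output and the Multiple R-squared from the regression summary, which shows that the covariance adds no information beyond the correlation once the standard deviations are known.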

Model Year and Horsepower

boxplot(Auto$horsepower~Auto$year, main="Relationship of Horsepower and Model Year", sub = "A dataset of 392 cars from '70-'82",
        xlab = "Model Year", ylab = "Horsepower", col = "Orange", outpch = 16, 
        outcol = "Orange")

cov(Auto$year, Auto$horsepower)
## [1] -59.03643
cor(Auto$year, Auto$horsepower)
## [1] -0.4163615
Years_HP<- lm(Auto$horsepower~Auto$year)
summary(Years_HP)
## 
## Call:
## lm(formula = Auto$horsepower ~ Auto$year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.484 -25.455  -6.478  23.980 112.568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 435.0215    36.5936  11.888   <2e-16 ***
## Auto$year    -4.3505     0.4811  -9.044   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.04 on 390 degrees of freedom
## Multiple R-squared:  0.1734, Adjusted R-squared:  0.1712 
## F-statistic: 81.79 on 1 and 390 DF,  p-value: < 2.2e-16

The correlation here is only -.4164, barely more than half the strength of the previous pair. The covariance of -59.0364 is also smaller in magnitude, although covariances depend on the units of each variable and so are not directly comparable across pairs. This supports my hypothesis for this variable, as the two are only moderately related. However, I was surprised to see that the relationship was negative.

Vehicle Origin and Horsepower

boxplot(Auto$horsepower~factor(Auto$origin), main = "Relationship of Horsepower and Car Origin", sub = "A dataset of 392 cars from '70-'82", names = c("American", "European", "Japanese"), xlab = "Car Origin", ylab = "Horsepower", col = "plum4", outpch = 16, outcol = "plum4")

cov(Auto$origin, Auto$horsepower)
## [1] -14.11274
cor(Auto$origin, Auto$horsepower)
## [1] -0.4551715

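The correlation here isn’t much stronger: -.45517, meaning the strength with which the two move together is not very pronounced. Because origin is a category code (1, 2, 3) rather than a quantity, this number depends on how the categories happen to be numbered and should be read with caution. A weak result was to be expected, as there are only 3 categories for horsepower to fall into; for the relationship to have any chance of being strong there would have to be significant differences across regions, which just isn’t likely.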
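
Since origin is a set of labels rather than a quantity, one alternative to a correlation coefficient is to compare horsepower group by group. The sketch below is an added illustration on made-up numbers: `toy` is a hypothetical data frame, not Auto, so its output says nothing about the real cars.

```r
# Made-up data: origin codes and horsepower for eight hypothetical cars.
toy <- data.frame(
  origin     = c(1, 1, 1, 1, 2, 2, 3, 3),
  horsepower = c(150, 130, 165, 140, 80, 90, 70, 75)
)
toy$origin <- factor(toy$origin, levels = 1:3,
                     labels = c("American", "European", "Japanese"))

# Mean horsepower within each origin group.
group_means <- tapply(toy$horsepower, toy$origin, mean)
print(group_means)

# Regression on the factor: R-squared is the share of variance in
# horsepower accounted for by differences between the groups.
fit <- lm(horsepower ~ origin, data = toy)
print(summary(fit)$r.squared)
```

Unlike a correlation on the raw codes, neither the group means nor this R-squared changes if the categories are renumbered.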

Section III: Simple Linear Regression

Quantitative Predictor: Miles per Gallon and Horsepower

plot(Auto$horsepower~Auto$mpg, main = "Relationship of Horsepower and Miles per Gallon", sub = "A dataset of 392 cars from '70-'82", 
     xlab = "Miles per Gallon", ylab = "Horsepower")
abline(lm(Auto$horsepower~Auto$mpg), col = "burlywood2", lwd = 3)

summary(lm(Auto$horsepower~Auto$mpg))
## 
## Call:
## lm(formula = Auto$horsepower ~ Auto$mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.892 -15.716  -2.094  13.108  96.947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 194.4756     3.8732   50.21   <2e-16 ***
## Auto$mpg     -3.8389     0.1568  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

The residuals do not appear to be strongly symmetrical: the largest positive residual (96.947) is further from zero than the largest negative one (-64.892), meaning the model’s predictions fall far from the observed values for some cars. The coefficients tell us that the y-intercept is 194.48 and that the fitted line falls by 3.84 horsepower for each additional mile per gallon. The standard error of the slope is about .157 horsepower per mile per gallon, which is a rather small error window. Along with a residual standard error of 24.19, which isn’t too high either, this suggests the fitted line is a fairly accurate representation of the data. Finally, with an R-squared of .6059, we know that about 61% of the variance we see in horsepower is explained by miles per gallon. So does that mean we can say that raising a car’s mpg lowers its horsepower? Not necessarily: although the two variables are clearly related and do tend to move together, that doesn’t automatically guarantee causation.
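
The slope and its standard error can be turned into an approximate 95% confidence interval, and the fitted line can be used to read off a predicted horsepower. The sketch below is an added illustration built only from the coefficients printed in the summary above; the 20 mpg example car is hypothetical.

```r
# Coefficients and standard error from the summary output above.
intercept <- 194.4756
slope     <- -3.8389
slope_se  <- 0.1568
df_resid  <- 390

# Approximate 95% confidence interval for the slope.
t_crit   <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 2))

# Fitted horsepower for a hypothetical car rated at 20 miles per gallon.
hp_at_20 <- intercept + slope * 20
print(round(hp_at_20, 1))
```

The interval stays well below zero, which agrees with the very small p-value for the slope. With the model object available, `confint(Mpg_HP)` should give the same interval without the rounding in the printed coefficients.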

boxplot(Auto$horsepower~Auto$year, main="Relationship of Horsepower and Model Year", sub = "A dataset of 392 cars from '70-'82",
        xlab = "Model Year", ylab = "Horsepower", col = "Orange", outpch = 16, 
        outcol = "Orange")

summary(Years_HP)
## 
## Call:
## lm(formula = Auto$horsepower ~ Auto$year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.484 -25.455  -6.478  23.980 112.568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 435.0215    36.5936  11.888   <2e-16 ***
## Auto$year    -4.3505     0.4811  -9.044   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.04 on 390 degrees of freedom
## Multiple R-squared:  0.1734, Adjusted R-squared:  0.1712 
## F-statistic: 81.79 on 1 and 390 DF,  p-value: < 2.2e-16

Here again the residuals aren’t quite symmetrical, although I wouldn’t call them strongly skewed either; it simply means the model didn’t quite predict where the observations actually fell. The intercept is 435 even though the graph doesn’t surpass 250 because it is the predicted horsepower for model year 0, far outside the observed years of 70 to 82, so it has no practical meaning on its own. The slope, however, is -4.35, meaning that with each additional model year one can expect about a 4.35 horsepower decrease, which I must admit was surprising to me, but that’s what the data tells us. With an R-squared of only .173, year explains about 17% of the variance in horsepower, so although the downward trend is statistically significant, the relationship between the two variables isn’t strong.
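
To see how the intercept relates to the plotted range, the fitted line can be evaluated at year 0 and at a few of the observed model years. The sketch below is an added illustration using only the coefficients printed in the summary above; the variable names are my own.

```r
# Coefficients from the summary output above.
intercept <- 435.0215
slope     <- -4.3505

# Fitted horsepower at year 0 (the intercept) and across the observed years.
years     <- c(0, 70, 76, 82)
fitted_hp <- intercept + slope * years
print(data.frame(year = years, fitted_hp = round(fitted_hp, 1)))
```

Within the observed years the line runs from roughly 130 horsepower down to roughly 78, which sits comfortably inside the plotted range; only the extrapolation back to year 0 produces the value of 435.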

Overall, this section supported my hypothesis that the categorical variables describing where and when a car was made (origin and model year) don’t have much relation to the specifications of the car itself, in this case horsepower. A car’s specifications are more likely to be related to its other specs, in this case miles per gallon, which we saw had a strong correlation with horsepower.

Section IV: Multiple Linear Regression

This section is to be completed for Part 2.

Section V: Hypothesis Testing

This section is to be completed for Part 2.

Section VI: Conclusions

Honor Code

I, Nylan Yancy, pledge to have not cheated in any form or fashion.