hw1.R

suppressPackageStartupMessages(library("tidyr"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("dplyr"))

dat <- read.csv("mileage.csv", header=TRUE)

y: Miles per gallon
x1: Displacement (cubic in.)
x2: Horsepower
x3: Torque (ft-lb)
x4: Compression ratio
x5: Rear axle ratio
x6: Carburetor (# barrels)
x7: # of transmission speeds
x8: Overall length (in.)
x9: Width (in.)
x10: Weight (lb)
x11: Type of transmission (1 = Automatic; 0 = Manual)

Problem 1

Prior to looking at the data, what variables (out of x1 through x11) would you expect to influence the mpg? Specify whether you think they would have a strong, moderate, or little effect on the mpg, and your reasoning.

x1, Displacement, strong negative, engine doesn’t operate efficiently until it is warmed up
x2, Horsepower, strong negative, more horsepower tends to mean more moving parts (cylinders) or faster moving parts which means more heat loss and thus less efficiency
x3, Torque, strong negative, more energy is spent on the “turning power” and less on moving the car forward
x8, Length, moderate negative, seems plausible that smaller cars (being less heavy) are more efficient
x9, Width, moderate negative, seems plausible that smaller cars (being less heavy) are more efficient
x10, Weight, strong negative, heavier cars will take more gas to drive the same distance
x11, Transmission Type, moderate negative, manual transmission will be more efficient

Problem 2

Construct scatter plots of y (on the vertical axis) versus x1 through x11 (horizontal axes). There should be 11 plots total. Based on this, what variables do you believe have a large influence on mpg?

ggplot(gather(dat, var, x, -y), aes(x=x,y=y)) + geom_point(na.rm = TRUE) + 
  facet_wrap(~var, scales="free_x")

Problem 3

Look at the plot of Mpg versus Transmission Type (x11) from Problem 2. To see a slightly different presentation of the same data, use boxplot() to construct a Box and Whisker plot of the Mpg data grouped according to the Transmission Type (two boxes total, one for Automatic and one for Manual). Your first argument to the boxplot() function should be a formula of the form “Mpg ~ . . .” Based on this, does it appear that Transmission Type has a large influence on mpg?

Scatter Plot of Mpg versus Transmission Type (x11)

ggplot(dat, aes(x=x11,y=y)) + geom_point(na.rm = TRUE)

Box and Whisker Plot of the Mpg data grouped according to the Transmission Type

boxplot(y~x11,data=dat)

Yes, Transmission Type has a large influence on Mpg.

Problem 4

dat %>% group_by(x11) %>% summarise(
  min = min(y),
  max = max(y),
  mean = mean(y),
  median = median(y),
  sd = sd(y),
  quantile1 = quantile(y, probs=c(.25)),
  quantile2 = quantile(y, probs=c(.50)),
  quantile3 = quantile(y, probs=c(.75))
)

## Source: local data frame [2 x 9]
## 
##   x11   min   max     mean median       sd quantile1 quantile2 quantile3
## 1   0 20.07 36.50 27.63000   29.4 6.325275     21.50      29.4    31.900
## 2   1 11.20 23.54 17.32478   17.0 3.236905     14.64      17.0    19.715

Also calculate separate 95% confidence intervals for the mean Mpg for each transmission type.

dat %>% group_by(x11) %>% summarise(
  lower = t.test(y)$conf.int[1],
  upper = t.test(y)$conf.int[2]
)

## Source: local data frame [2 x 3]
## 
##   x11    lower    upper
## 1   0 22.76796 32.49204
## 2   1 15.92504 18.72453
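
The intervals above are one-sample t intervals: the sample mean plus or minus a t critical value (n - 1 degrees of freedom) times s / sqrt(n). A minimal sketch of that computation, checked against t.test() on a made-up vector, follows. The helper name ci_mean and the toy values are illustrative assumptions, not part of the assignment data.

```r
# Hypothetical helper: one-sample t confidence interval for a mean.
ci_mean <- function(v, level = 0.95) {
  v <- v[!is.na(v)]                              # drop missing values, as t.test() does
  n <- length(v)
  se <- sd(v) / sqrt(n)                          # standard error of the mean
  tcrit <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean(v) - tcrit * se, upper = mean(v) + tcrit * se)
}

# Toy values (made up for illustration; not taken from mileage.csv)
toy <- c(20.1, 21.5, 29.4, 31.9, 36.5, 30.0, 24.2)
manual <- ci_mean(toy)
builtin <- t.test(toy)$conf.int
print(manual)
print(builtin)
```

On the assignment data, calling ci_mean(y) inside the same group_by(x11) %>% summarise(...) pipeline should reproduce the table above.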

Also plot histograms of Mpg within each transmission type.

ggplot(dat, aes(x=y)) + geom_histogram(na.rm = TRUE, binwidth=1) + facet_wrap(~x11)

Problem 5

In addition, use the t.test() function to find one single 95% confidence interval for the difference in the two mpg means.

t.test(y~x11, data=dat, var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  y by x11
## t = 4.6549, df = 9.687, p-value = 0.0009811
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   5.35080 15.25963
## sample estimates:
## mean in group 0 mean in group 1 
##        27.63000        17.32478

Does this confidence interval include zero? No

What is the P-value for this test of whether the two means are equal? 0.0009811

Based on this, does it appear that Transmission Type has a large influence on mpg? Yes

Problem 6

Give a possible explanation for why it appears that manual transmissions result in significantly better mileage, when they probably really only have a small effect.

The transmission type itself may have only a small effect. People who drive manual cars may differ from those who drive automatics in both driving ability and choice of car. Manual drivers may be significantly better at driving fuel-efficiently, and they may also prefer cars with lower torque or horsepower. These are all confounding factors.
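
The argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, weight stands in for the other car characteristics, and the "true" transmission effect of -0.5 mpg is an arbitrary choice. Nothing here is estimated from mileage.csv.

```r
set.seed(42)
n <- 200
weight <- runif(n, 2000, 5000)   # simulated weight in lb

# Assumption: heavier cars are more likely to be automatic (1 = automatic, 0 = manual)
automatic <- rbinom(n, 1, plogis((weight - 3500) / 300))

# Assumption: the true transmission effect is small (-0.5 mpg); weight does most of the work
mpg <- 45 - 0.007 * weight - 0.5 * automatic + rnorm(n, sd = 1.5)
sim <- data.frame(mpg, weight, automatic)

# Coefficient on transmission type without and with adjustment for weight
unadjusted <- coef(lm(mpg ~ automatic, data = sim))["automatic"]
adjusted   <- coef(lm(mpg ~ automatic + weight, data = sim))["automatic"]
print(c(unadjusted = unname(unadjusted), adjusted = unname(adjusted)))
```

The unadjusted coefficient absorbs the weight difference between the two groups, so it overstates the effect of the transmission itself. The analogous check on the assignment data would be to compare lm(y ~ x11) with a model that also includes x10.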

Problem 7

What are the parameter estimates for the regression line y = B0 + B1x1 + e relating the Mpg (y, the response variable) to just the Displacement (x1, the predictor variable)?

modelx1 <- lm(y ~ x1, dat)
summary(modelx1)

## 
## Call:
## lm(formula = y ~ x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7923 -1.9752  0.0044  1.7677  6.8171 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.722677   1.443903   23.36  < 2e-16 ***
## x1          -0.047360   0.004695  -10.09 3.74e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.065 on 30 degrees of freedom
## Multiple R-squared:  0.7723, Adjusted R-squared:  0.7647 
## F-statistic: 101.7 on 1 and 30 DF,  p-value: 3.743e-11

What is the r^2 value? 0.7723

What is the P-value in the test of whether B1 = 0? 3.74e-11

Construct a scatter plot of y versus x1 with the best fit straight line added.

ggplot(dat, aes(x=x1, y=y)) + geom_point() + 
  geom_abline(intercept=coef(modelx1)[1], slope=coef(modelx1)[2])
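
Beyond the fitted-line plot, one common check on linearity is to inspect the residuals and to test whether adding a quadratic term improves the fit. The sketch below uses simulated data with roughly the intercept and slope reported above. The data frame toy is made up; on the assignment data the same calls would be applied to dat and modelx1.

```r
set.seed(1)
# Simulated stand-in for (x1, y); values are made up, loosely echoing the fit above
toy <- data.frame(x1 = runif(32, 80, 500))
toy$y <- 33.7 - 0.047 * toy$x1 + rnorm(32, sd = 3)

fit_lin  <- lm(y ~ x1, data = toy)
fit_quad <- lm(y ~ x1 + I(x1^2), data = toy)

# Residuals vs fitted values: visible curvature would suggest a non-linear relationship
plot(fitted(fit_lin), resid(fit_lin), xlab = "Fitted mpg", ylab = "Residual")
abline(h = 0, lty = 2)

# F-test for the added quadratic term; a small p-value argues against a straight line
cmp <- anova(fit_lin, fit_quad)
print(cmp)
```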

Based on the preceding, does it appear that Displacement has a significant effect on mpg? Yes

Does the relationship appear to be linear? Yes

Problem 8

What equation(s) did R use when it spit out the confidence intervals in Problem 4?

I used the formula for a One Sample t-test.

What equation(s) did R use when it spit out the P-value in Problem 5?


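I used the formula for Welch’s Two Sample t-test, which does not assume equal variances. Rule of thumb: if the larger sample standard deviation is more than twice the smaller sample standard deviation, perform the analysis using unpooled methods. Here the ratio is about 1.95 (6.325 / 3.237), just under that threshold, so the unpooled test is a cautious choice rather than one the rule strictly requires.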

# sd(filter(dat, x11==0)$y) # 6.325275
# sd(filter(dat, x11==1)$y) # 3.236905
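
Welch's test statistic, its Welch-Satterthwaite degrees of freedom, and the two-sided P-value can be computed by hand and compared with t.test(). The sketch below does this on two made-up groups. The helper name welch and the toy values are illustrative assumptions, not part of the assignment data.

```r
# Hypothetical helper: Welch's two-sample t statistic, degrees of freedom, and p-value
welch <- function(a, b) {
  va <- var(a) / length(a)                 # squared standard error, group a
  vb <- var(b) / length(b)                 # squared standard error, group b
  t  <- (mean(a) - mean(b)) / sqrt(va + vb)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
  p  <- 2 * pt(-abs(t), df)                # two-sided p-value
  c(t = t, df = df, p.value = p)
}

# Toy groups (made up for illustration; not taken from mileage.csv)
g0 <- c(20.1, 21.5, 29.4, 31.9, 36.5, 30.0)
g1 <- c(11.2, 14.6, 17.0, 19.7, 23.5, 16.3, 18.1, 15.9)
by_hand <- welch(g0, g1)
builtin <- t.test(g0, g1, var.equal = FALSE)
print(by_hand)
print(c(t = unname(builtin$statistic), df = unname(builtin$parameter), p.value = builtin$p.value))
```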