This assignment is an introduction to using R. Using the mtcars data set, we will examine the relationship between the quantitative/continuous variable mpg (miles per gallon, a measure of fuel efficiency) and the qualitative/nominal variable am (transmission type, with am = 0 for automatics and am = 1 for manuals).
Calculate summary statistics (mean, standard deviation, median, quartiles, minimum, maximum, and range) separately for each of the two transmission types.
attach (mtcars)
tapply(mpg,am,sample)
## $`0`
## [1] 24.4 15.2 21.4 19.2 15.2 21.5 14.7 18.7 14.3 10.4 16.4 15.5 17.8 13.3
## [15] 17.3 18.1 22.8 10.4 19.2
##
## $`1`
## [1] 15.8 32.4 21.0 27.3 21.4 19.7 33.9 21.0 15.0 22.8 26.0 30.4 30.4
tapply(mpg,am,mean)
## 0 1
## 17.14737 24.39231
tapply(mpg,am,sd)
## 0 1
## 3.833966 6.166504
tapply(mpg,am,median)
## 0 1
## 17.3 22.8
tapply(mpg,am,quantile)
## $`0`
## 0% 25% 50% 75% 100%
## 10.40 14.95 17.30 19.20 24.40
##
## $`1`
## 0% 25% 50% 75% 100%
## 15.0 21.0 22.8 30.4 33.9
tapply(mpg,am,min)
## 0 1
## 10.4 15.0
tapply(mpg,am,max)
## 0 1
## 24.4 33.9
tapply(mpg,am,range)
## $`0`
## [1] 10.4 24.4
##
## $`1`
## [1] 15.0 33.9
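The separate tapply() calls above can also be gathered into a single table. The following is a minimal sketch, assuming a hypothetical helper named summarize_mpg (not part of the assignment) and reading the columns as mtcars$mpg and mtcars$am so that it does not depend on attach(). Here "range" is reported as the width max - min, which is an assumption about what the question intends; range() itself returns the minimum and maximum.

```r
# Hypothetical helper: the requested statistics for one group of mpg values.
summarize_mpg <- function(x) {
  c(mean   = mean(x),
    sd     = sd(x),
    median = median(x),
    q1     = unname(quantile(x, 0.25)),  # first quartile
    q3     = unname(quantile(x, 0.75)),  # third quartile
    min    = min(x),
    max    = max(x),
    range  = max(x) - min(x))            # width of the range (assumption)
}

# One column per transmission type: "0" = automatic, "1" = manual.
stats_by_am <- sapply(split(mtcars$mpg, mtcars$am), summarize_mpg)
round(stats_by_am, 3)
```

Each column of stats_by_am should agree with the corresponding tapply() results above.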
Create box plots of mpg for each transmission type separately, but plot them on the same figure aligned to the same axis.
boxplot(mpg~am,data=mtcars, main="Car Mileage Data",
xlab="Transmission Type", ylab="Miles Per Gallon")
Create a scatter plot of mpg (the y-axis) versus am (the x-axis) using the plot() function.
plot(am, mpg, main="Scatterplot",
xlab="Transmission Type", ylab="Miles Per Gallon", pch=2)
From examining both the boxplot and the scatterplot, manual transmissions appear to have better fuel economy, since they travel more miles per gallon (that is, they use less fuel per mile). The difference looks large enough that it may be statistically significant, but the plots alone cannot tell us whether it is significant at the α = .05 level.
We will now perform a formal hypothesis test of whether the mean gas mileage differs between automatics and manuals. Let µ0 denote the population mean gas mileage of automatics and µ1 denote the population mean gas mileage of manuals. We will test the null hypothesis H0 : µ0 = µ1 that the average mileages are the same versus the alternative hypothesis HA : µ0 ≠ µ1 that the mileages differ on average between transmission types, using a significance level of α = .05.
a) Calculate the difference in sample means x̄0 − x̄1.
x0<-17.14737
x1<-24.39231
x0-x1
## [1] -7.24494
b) Calculate the pooled standard deviation sp = sqrt((s0^2 (n0 − 1) + s1^2 (n1 − 1)) / (n0 + n1 − 2)).
n0<-19
n1<-13
sd0<-3.833966
sd1<-6.166504
sp<-sqrt(((sd0)^2 *(n0-1)+(sd1)^2*(n1-1))/(n0+n1-2))
sp<-sqrt(((3.833966) ^ 2 *(18)+(6.166504) ^ 2 *(12))/(30))
sp<-sqrt(((14.699295*18)+(38.025772*12))/30)
sp<-sqrt((264.5873152+456.309259)/30)
sp
## [1] 4.902029
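The calculation above types in the rounded summary values by hand. As a cross-check, the same pooled standard deviation can be computed directly from the data. This is a minimal sketch; the names mpg0, mpg1, n0_data, n1_data, and sp_data are illustrative and are not part of the assignment.

```r
# Split mpg by transmission type without relying on attach().
mpg0 <- mtcars$mpg[mtcars$am == 0]  # automatics
mpg1 <- mtcars$mpg[mtcars$am == 1]  # manuals

n0_data <- length(mpg0)
n1_data <- length(mpg1)

# Pooled variance: each sample variance is weighted by its degrees of freedom,
# and the sum is divided by the total degrees of freedom n0 + n1 - 2.
sp_data <- sqrt(((n0_data - 1) * var(mpg0) + (n1_data - 1) * var(mpg1)) /
                (n0_data + n1_data - 2))
sp_data
```

Working from the raw data avoids rounding in the intermediate steps, so sp_data may differ from the hand calculation in the last few printed digits.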
c) Calculate the test statistic t∗ = (x̄0 − x̄1) / (sp · sqrt(1/n0 + 1/n1)).
x0-x1
## [1] -7.24494
sp
## [1] 4.902029
ts<-(x0-x1)/(sp*sqrt((1/n0)+(1/n1)))
ts<-(-7.24494)/((4.902029)*sqrt((1/19)+(1/13)))
ts
## [1] -4.106127
abs(ts)
## [1] 4.106127
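The test statistic depends only on the two sample means, the two sample standard deviations, and the two sample sizes. The sketch below wraps that arithmetic in a hypothetical function, t_from_summary (a name introduced here for illustration), and applies it to the rounded summary values reported earlier.

```r
# Hypothetical helper: pooled two-sample t statistic from summary values.
t_from_summary <- function(m0, m1, s0, s1, n0, n1) {
  # Pooled standard deviation
  sp <- sqrt((s0^2 * (n0 - 1) + s1^2 * (n1 - 1)) / (n0 + n1 - 2))
  # Standard error of the difference in means
  se <- sp * sqrt(1 / n0 + 1 / n1)
  (m0 - m1) / se
}

# Rounded summary values taken from the output above.
t_star <- t_from_summary(m0 = 17.14737, m1 = 24.39231,
                         s0 = 3.833966, s1 = 6.166504,
                         n0 = 19, n1 = 13)
t_star
```

The sign is negative because the automatics (group 0) have the lower sample mean; only the absolute value matters for the two-sided test.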
d) If |t∗| is larger than the critical value qt(.975,30) (the .975 quantile of the t distribution with 30 degrees of freedom), then conclude that H0 is probably false and HA is probably true (“reject the null hypothesis”). Otherwise, conclude that we had insufficient evidence to find that HA is true (“fail to reject the null hypothesis”).
qt(.975,30)
## [1] 2.042272
t<-2.042272
Since |t*| = 4.106127 is larger than the critical value 2.042272, we reject the null hypothesis: H0 is probably false and hence HA is probably true.
e) Calculate the p-value p = 2 · P(t > |t∗|), where t follows a t distribution with 30 degrees of freedom.
p<-2*pt(abs(ts),30,lower.tail=FALSE)
p<-2*pt(abs(-4.106127),30,lower.tail=FALSE)
p
## [1] 0.0002850207
f) If p < .05, then conclude that we should reject H0; otherwise, fail to reject H0 (you should always get the same answer as you got with the critical value approach above).
p
## [1] 0.0002850207
Since p = 0.0002850207 < 0.05, we reject H0, which agrees with the conclusion from the critical value approach.
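The p-value approach and the critical value approach always give the same decision because both compare |t*| with the same t distribution. The sketch below illustrates this for the values above; the variable names (t_obs, p_doubled, p_both_tails, same_decision) are illustrative only.

```r
t_obs <- -4.106127  # test statistic reported above
df    <- 30         # n0 + n1 - 2
alpha <- 0.05

# Two-sided p-value as twice the upper-tail area beyond |t*| ...
p_doubled <- 2 * pt(abs(t_obs), df, lower.tail = FALSE)

# ... which equals the sum of both tail areas, since the t distribution
# is symmetric about zero.
p_both_tails <- pt(-abs(t_obs), df) + pt(abs(t_obs), df, lower.tail = FALSE)
all.equal(p_doubled, p_both_tails)

# p < alpha exactly when |t*| exceeds the critical value qt(1 - alpha/2, df).
same_decision <- (p_doubled < alpha) == (abs(t_obs) > qt(1 - alpha / 2, df))
same_decision
```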
g) Let R perform the test for you: t.test(mpg ~ am,var.equal=TRUE); confirm you got the same values for the test statistic and p-value as above.
t.test(mpg ~ am,var.equal=TRUE)
##
## Two Sample t-test
##
## data: mpg by am
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.84837 -3.64151
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
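Rather than comparing the printed output by eye, the values can be pulled out of the object that t.test() returns and checked against the hand calculation. This is a minimal sketch; res is an illustrative name, and the data = mtcars argument is used so the code does not depend on attach().

```r
res <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

t_R  <- unname(res$statistic)  # test statistic
df_R <- unname(res$parameter)  # degrees of freedom
p_R  <- res$p.value            # two-sided p-value

# Compare with the values computed by hand above (loose tolerance because
# the hand calculation used rounded means and standard deviations).
isTRUE(all.equal(t_R, -4.106127, tolerance = 1e-5))
isTRUE(all.equal(p_R, 0.0002850207, tolerance = 1e-4))
```

Both comparisons should return TRUE if the hand calculation matches R's built-in test.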
We can rephrase the hypothesis test from the previous question in terms of a simple linear regression model Yi = α + βXi + εi, where Y = mpg is the response variable and X = am is the predictor variable. In particular, β = µ1 − µ0, so testing whether µ0 and µ1 are equal is equivalent to testing whether β = 0. Using the output of the R code summary(lm(mpg ~ am)), confirm that you got the same point estimate, test statistic, and p-value as with the two-sample t-test.
summary(lm(mpg ~ am))
##
## Call:
## lm(formula = mpg ~ am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
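The agreement between the regression and the two-sample t-test can be checked programmatically. The sketch below compares the am row of the coefficient table with the t.test() result; the names fit, am_row, and tt are illustrative. Note that the slope estimates µ1 − µ0 while t.test() reports group 0 minus group 1, so the estimate and the t value have opposite signs in the two outputs.

```r
fit    <- lm(mpg ~ am, data = mtcars)
am_row <- coef(summary(fit))["am", ]  # Estimate, Std. Error, t value, Pr(>|t|)
tt     <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Slope = mean of manuals minus mean of automatics.
est_match <- all.equal(unname(am_row["Estimate"]),
                       unname(tt$estimate[2] - tt$estimate[1]))

# Same magnitude of t, opposite sign.
t_match <- all.equal(unname(am_row["t value"]), -unname(tt$statistic))

# Same two-sided p-value.
p_match <- all.equal(unname(am_row["Pr(>|t|)"]), tt$p.value,
                     tolerance = 1e-6)

c(estimate = isTRUE(est_match), t = isTRUE(t_match), p = isTRUE(p_match))
```

The residual standard error in the regression output (4.902 on 30 degrees of freedom) also matches the pooled standard deviation sp computed earlier.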