(A) Question 1. Fit a simple linear regression using the number of minutes worked to afford a Big Mac in 2003 to explain the number of minutes worked to afford a Big Mac in 2009.
Solution:
data<-read.csv('C:/Users/LMT/Desktop/CLASS/mas637 Reg/HW/UBSprices.csv')
attach(data)
# Scatterplot of the 2003 vs 2009 Big Mac minutes
plot(bigmac2003,bigmac2009,main="bigmac2003 vs bigmac2009",col="blue")
# Fit the simple linear regression and draw the fitted line
b<-lm(bigmac2009~bigmac2003)
abline(b)
Model b is the fitted simple linear regression.
par(mfrow=c(2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(b)
(B) Question 2. Look at all diagnostic plots and make appropriate adjustments, if possible, to create a model that meets the assumptions of simple linear regression. We will call this model the final model.
Solution: From the Residuals vs Fitted plot, the points are spread roughly evenly about the zero line, so the average residual is close to zero and the linear relationship appears reasonable.
From the Normal Q-Q plot, the errors cannot be considered normal, because the residuals do not follow the straight dashed line closely.
From the Scale-Location plot, the points do not form a horizontal band with equal spread; there is a trend, so the error terms do not have constant variance.
From the Residuals vs Leverage plot, there are three outliers: observations No. 21, No. 12, and No. 36. A quick numeric check is sketched below.
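As a numeric complement to the Residuals vs Leverage plot, standardized residuals and Cook's distance can be used to flag candidate outliers and influential points. The sketch below uses conventional rule-of-thumb cutoffs (2 for standardized residuals, 4/n for Cook's distance); these cutoffs are our assumptions, so the flagged rows may not match the plot labels exactly.
std_res<-rstandard(b)        # standardized residuals of model b
cd<-cooks.distance(b)        # Cook's distance for each observation
which(abs(std_res)>2)        # candidate outliers (rule-of-thumb cutoff of 2)
which(cd>4/length(cd))       # candidate influential points (4/n cutoff)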
Now we need to adjust the model so that it meets the assumptions of simple linear regression.
First, we delete the outliers: records No. 21, No. 12, and No. 36.
data<-data[-c(12,21,36),]
Second, we log-transform bigmac2003 and bigmac2009; the log transformation can make the data closer to normally distributed.
Lbigmac2003<-log(data$bigmac2003)
Lbigmac2009<-log(data$bigmac2009)
Third, we look at the diagnostic plots of the final model.
par(mfrow=c(2,2))
finalmodel<-lm(Lbigmac2009~Lbigmac2003)
plot(finalmodel)
From the plots, the final model looks better than model b: the residuals are closer to normally distributed. However, the error terms still do not have constant variance, and the Scale-Location plot still shows an increasing trend. We therefore delete three more outliers from the final model and check whether finalmodel2 meets all the assumptions.
par(mfrow=c(2,2))
data<-data[-c(34,29,51),]
Lbigmac2003<-log(data$bigmac2003)
Lbigmac2009<-log(data$bigmac2009)
finalmodel2<-lm(Lbigmac2009~Lbigmac2003)
plot(finalmodel2)
From the Scale-Location plot of finalmodel2, the variance now appears constant. Thus finalmodel2 meets all the assumptions.
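The constant-variance judgment above is visual. If a formal check is wanted, the Breusch-Pagan test is one option; the sketch below assumes the lmtest package is installed and was not part of the original analysis. A large p-value is consistent with constant error variance.
library(lmtest)      # assumed installed; not run in the original analysis
bptest(finalmodel2)  # Breusch-Pagan test for non-constant error variance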
Finally, we use the Durbin-Watson test to check for autocorrelation in the residuals.
library(car)
## Loading required package: carData
durbinWatsonTest(finalmodel2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1169307 1.675945 0.204
## Alternative hypothesis: rho != 0
The p-value is 0.204, which is larger than 0.05, so we do not reject the null hypothesis of no autocorrelation. We therefore do not need the Cochrane-Orcutt procedure or any other method to remove autocorrelation.
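For reference only: had the Durbin-Watson test rejected, the Cochrane-Orcutt procedure could be applied. One way to do this in R is via the orcutt package; this is a hypothetical sketch under the assumption that the package is installed, and it is not needed for our data.
library(orcutt)                        # assumed installed; hypothetical usage
cofit<-cochrane.orcutt(finalmodel2)    # iteratively estimates rho and refits
summary(cofit)                         # the transformed model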
(C) Question 3. Interpret the coefficients associated with your final model. Discuss the meaning of the ANOVA F-test and the t-test for the slope.
Solution: My final model is finalmodel2.
summary(finalmodel2)
##
## Call:
## lm(formula = Lbigmac2009 ~ Lbigmac2003)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48046 -0.13583 -0.02189 0.06675 0.66794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92818 0.19496 4.761 1.95e-05 ***
## Lbigmac2003 0.70741 0.05964 11.862 1.36e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2374 on 46 degrees of freedom
## Multiple R-squared: 0.7536, Adjusted R-squared: 0.7483
## F-statistic: 140.7 on 1 and 46 DF, p-value: 1.36e-15
The t value of the slope is 11.862 with p-value 1.36e-15, so we reject the null hypothesis H0: β1 = 0 at the 0.05 significance level; the slope coefficient is significant. Because both variables are log-transformed, the slope estimate 0.70741 means that a 1% increase in the 2003 Big Mac minutes is associated with roughly a 0.71% increase in the 2009 minutes; the intercept 0.92818 is the predicted log of the 2009 minutes when log(bigmac2003) = 0, i.e., when bigmac2003 = 1 minute.
The F-statistic is 140.7 with p-value 1.36e-15, so at the 0.05 significance level we reject the ANOVA null hypothesis that the model explains none of the variation in the response (equivalently, H0: β1 = 0); the regression is significant overall.
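As a quick consistency check, in simple linear regression the overall F-statistic equals the square of the slope's t-statistic, which the summary output above confirms. The second line shows a back-transformed prediction for a hypothetical country whose 2003 Big Mac time is 30 minutes (an illustrative input, not a value from the data).
11.862^2                        # about 140.7, matching the F-statistic above
exp(0.92818+0.70741*log(30))    # predicted 2009 minutes for a hypothetical
                                # 30-minute 2003 value (about 28 minutes)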
(D) Question 4. Repeat this analysis for bread (use bread2003 to explain bread2009).
Solution:
datab<-read.csv('C:/Users/LMT/Desktop/CLASS/mas637 Reg/HW/UBSprices.csv')
# Scatterplot of the 2003 vs 2009 bread minutes
plot(datab$bread2003,datab$bread2009,main="bread2003 vs bread2009",col="blue")
# Fit the simple linear regression and draw the fitted line
b2<-lm(datab$bread2009~datab$bread2003)
abline(b2)
par(mfrow=c(2,2))
plot(b2)
From the Residuals vs Fitted plot, the points are spread roughly evenly about the zero line, so the average residual is close to zero and the linear relationship appears reasonable.
From the Normal Q-Q plot, the errors cannot be considered normal, because the residuals do not follow the straight dashed line closely.
From the Scale-Location plot, the points form a nearly horizontal band with roughly equal spread, so the error terms have approximately constant variance.
From the Residuals vs Leverage plot, there are three outliers: observations No. 21, No. 12, and No. 31.
Now we need to adjust the model.
First, we delete the outliers: records No. 21, No. 12, and No. 31.
datab<-datab[-c(12,21,31),]
Second, we log-transform bread2003 and bread2009, since the log transformation can make the data closer to normally distributed.
Lbread2003<-log(datab$bread2003)
Lbread2009<-log(datab$bread2009)
Third, we look at the diagnostic plots of the new model.
par(mfrow=c(2,2))
breadmodel<-lm(Lbread2009~Lbread2003)
plot(breadmodel)
From the four plots above, the model appears to satisfy all of the assumptions. This time, my final model is breadmodel.
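The normality judgment here rests on the Q-Q plot; if a formal check is desired, the Shapiro-Wilk test on the residuals is one option (base R; not part of the original analysis). A large p-value is consistent with normal errors.
shapiro.test(residuals(breadmodel))   # formal normality check of the residuals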
Finally, we use the Durbin-Watson test to check for autocorrelation in the residuals.
library(car)
durbinWatsonTest(breadmodel)
## lag Autocorrelation D-W Statistic p-value
## 1 0.08847046 1.788773 0.42
## Alternative hypothesis: rho != 0
The p-value is 0.42, which is larger than 0.05, so we do not reject the null hypothesis of no autocorrelation; we do not need the Cochrane-Orcutt procedure or any other method to remove autocorrelation.
summary(breadmodel)
##
## Call:
## lm(formula = Lbread2009 ~ Lbread2003)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60440 -0.20169 -0.04372 0.21074 0.72718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.20864 0.23419 5.161 4.44e-06 ***
## Lbread2003 0.59366 0.08109 7.321 2.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3008 on 49 degrees of freedom
## Multiple R-squared: 0.5224, Adjusted R-squared: 0.5126
## F-statistic: 53.59 on 1 and 49 DF, p-value: 2.116e-09
The t value of the slope is 7.321 with p-value 2.12e-09, so we reject the null hypothesis H0: β1 = 0 at the 0.05 significance level; the slope coefficient is significant. Since both variables are logged, the slope estimate 0.59366 means that a 1% increase in the 2003 bread minutes is associated with roughly a 0.59% increase in the 2009 bread minutes.
The F-statistic is 53.59 with p-value 2.116e-09, so at the 0.05 significance level we reject the ANOVA null hypothesis that the model explains none of the variation in the response (equivalently, H0: β1 = 0); the regression is significant overall.