Last Name: Motwani
First Name: Sahil

1. Read the data into R. Response variable = Marble.Tombstone.Mean.Surface.Recession.Rate; covariate = Mean.SO2.concentrations over a 100-year period.

Answer:

To read the data into R, I use the following commands. I also shorten the column names in the dataset:

tomb=read.csv("tombstone.csv")   # read the dataset into a data frame
ls()
## [1] "tomb"
colnames(tomb)[2:3]<- c("SO2.Conc","Surface.Reces.")   # shorten the long column names
names(tomb)
## [1] "City"           "SO2.Conc"       "Surface.Reces."

After executing tomb=read.csv("tombstone.csv"), the variable tomb is in memory. We also shortened the column names with the help of colnames().

2. Plot and explore data.

Answer:

I plot the data using the following commands:

library(ggplot2)    # for qplot()
library(gridExtra)  # for grid.arrange()
plot1<-qplot(SO2.Conc,Surface.Reces.,data=tomb)
plot2<-qplot(SO2.Conc,Surface.Reces.,data=tomb)+geom_smooth(method = 'lm')
plot3<-qplot(SO2.Conc,Surface.Reces.,data=tomb)+stat_smooth(se=F)+theme_bw()
grid.arrange(plot1,plot2,plot3,ncol=3)

We can see the scatter plot, the fitted least-squares line, and the curved loess smooth; the smooth is almost a straight line.

nrow(tomb)
## [1] 21
cor(tomb$SO2.Conc,tomb$Surface.Reces.)
## [1] 0.9008731
summary(tomb)
##            City       SO2.Conc     Surface.Reces. 
##  Albany,NY   : 1   Min.   : 12.0   Min.   :0.140  
##  Baltimore,MD: 1   1st Qu.: 91.0   1st Qu.:1.010  
##  Boston,MA   : 1   Median :122.0   Median :1.530  
##  Brooklyn,NY : 1   Mean   :136.5   Mean   :1.496  
##  Cambridge,MA: 1   3rd Qu.:197.0   3rd Qu.:1.980  
##  Chicago,IL  : 1   Max.   :323.0   Max.   :3.160  
##  (Other)     :15

There are 21 observations. The correlation coefficient between SO2 concentration and surface recession is 0.9008731; since it is close to 1, we can conclude that the variables are strongly, positively, linearly related. We can also visualize the data with a scatterplot matrix.
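
A minimal sketch of such a matrix, using the same pairs() call applied later to the bus data:

pairs(tomb)   # scatter plots of every pair of columns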

In the lower-triangle panels of the matrix we can see that the data points are correlated.

3. Perform linear regression

Answer:

model<- lm(Surface.Reces.~ SO2.Conc, data=tomb)
plot(Surface.Reces.~SO2.Conc, data=tomb)
points(tomb$SO2.Conc,model$fitted.values,pch=20,col="red")   # fitted values in red

model$coefficients[1]
## (Intercept) 
##   0.3229959
model$coefficients[2]
##    SO2.Conc 
## 0.008593333
summary(model)$r.squared
## [1] 0.8115724
summary(model)$adj.r.squared
## [1] 0.8016552

3.1

The estimated intercept is \(\widehat{\beta_0}\) = 0.3229959 and the estimated slope is \(\widehat{\beta_1}\) = 0.0085933.
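
Together these give the fitted line \(\widehat{y} = 0.3229959 + 0.0085933\,x\), where \(x\) is the mean SO2 concentration and \(\widehat{y}\) is the predicted mean surface recession rate.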

3.2

The \(R^{2}\) is 0.8115724, i.e. the model explains about 81% of the variability of the response around its mean, so the data points fall fairly close to the fitted regression line.
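
As a quick check, in simple linear regression \(R^{2}\) is just the square of the correlation coefficient computed in step 2:

cor(tomb$SO2.Conc,tomb$Surface.Reces.)^2
## [1] 0.8115724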

4. Perform statistics on the model

Answer:

4.1

Performing t tests: we carry out hypothesis tests on the slope and the intercept. The t statistics and p-values for both can be obtained from the summary() of our model as follows.

summary(model)$coef[,3]
## (Intercept)    SO2.Conc 
##    2.122239    9.046242
summary(model)$coef[,4]
##  (Intercept)     SO2.Conc 
## 4.718525e-02 2.578534e-08

As we can see, the t value for the intercept is 2.1222387 and for the slope is 9.0462416. The p-value for the intercept is 0.0471852 and for the slope is \(2.58*10^{-8}\). The null hypothesis states that the slope is equal to zero. The slope's t value is highly significant, well above the rule-of-thumb critical value of about 2, and since the p-value \(2.58*10^{-8}\) is far below the significance level (0.05), we reject the null hypothesis.
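
The slope's t value can also be recovered by hand as the estimate divided by its standard error:

summary(model)$coef[2,1]/summary(model)$coef[2,2]   # estimate / standard error
## [1] 9.046242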

4.2

Now I perform the ANOVA (F) test.

anova(model)
## Analysis of Variance Table
## 
## Response: Surface.Reces.
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## SO2.Conc   1 10.9031 10.9031  81.835 2.579e-08 ***
## Residuals 19  2.5314  0.1332                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the F value is 81.835 and the p-value is \(2.58*10^{-8}\). Since this is simple linear regression, the ANOVA F test and the t test on the slope are equivalent, as \(t_0^{2} = F_0\). The null hypothesis is that the slope equals zero; the very small p-value is strong evidence that it is not.
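
This equivalence can be checked directly from the output above:

summary(model)$coef[2,3]^2   # squared slope t statistic; approximately 81.835, the F value above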

4.3

Now I calculate the confidence intervals for the coefficients, the fitted values, and the confidence and prediction intervals:

confint(model,level=0.95)
##                   2.5 %     97.5 %
## (Intercept) 0.004446349 0.64154545
## SO2.Conc    0.006605098 0.01058157
model$fitted.values
##         1         2         3         4         5         6         7 
## 0.4261159 0.4948626 0.4948626 0.7182892 0.7354759 1.1135825 1.1049892 
##         8         9        10        11        12        13        14 
## 1.1307692 1.1995159 1.3284159 1.3713825 1.5432492 1.5432492 1.8526092 
##        15        16        17        18        19        20        21 
## 1.8697959 2.0158825 2.2479025 2.3338359 2.3768025 2.4197692 3.0986425
predict.lm(model,interval = "confidence")
##          fit       lwr       upr
## 1  0.4261159 0.1276356 0.7245962
## 2  0.4948626 0.2094375 0.7802876
## 3  0.4948626 0.2094375 0.7802876
## 4  0.7182892 0.4729586 0.9636199
## 5  0.7354759 0.4930475 0.9779043
## 6  1.1135825 0.9248239 1.3023412
## 7  1.1049892 0.9152900 1.2946885
## 8  1.1307692 0.9438424 1.3176960
## 9  1.1995159 1.0192244 1.3798074
## 10 1.3284159 1.1572428 1.4995889
## 11 1.3713825 1.2021867 1.5405784
## 12 1.5432492 1.3761806 1.7103178
## 13 1.5432492 1.3761806 1.7103178
## 14 1.8526092 1.6666152 2.0386032
## 15 1.8697959 1.6820050 2.0575867
## 16 2.0158825 1.8103314 2.2214336
## 17 2.2479025 2.0069821 2.4888230
## 18 2.3338359 2.0781916 2.5894801
## 19 2.3768025 2.1135420 2.6400631
## 20 2.4197692 2.1487417 2.6907967
## 21 3.0986425 2.6921265 3.5051585
predict.lm(model,interval = "prediction")
##          fit         lwr      upr
## 1  0.4261159 -0.39409843 1.246330
## 2  0.4948626 -0.32069155 1.310417
## 3  0.4948626 -0.32069155 1.310417
## 4  0.7182892 -0.08411227 1.520691
## 5  0.7354759 -0.06604303 1.536995
## 6  1.1135825  0.32663218 1.900533
## 7  1.1049892  0.31781271 1.892166
## 8  1.1307692  0.34425624 1.917282
## 9  1.1995159  0.41455342 1.984478
## 10 1.3284159  0.54549746 2.111334
## 11 1.3713825  0.58889402 2.153871
## 12 1.5432492  0.76121790 2.325281
## 13 1.5432492  0.76121790 2.325281
## 14 1.8526092  1.06631740 2.638901
## 15 1.8697959  1.08307708 2.656515
## 16 2.0158825  1.22473635 2.807029
## 17 2.2479025  1.44683842 3.048967
## 18 2.3338359  1.52822118 3.139451
## 19 2.3768025  1.56873869 3.184866
## 20 2.4197692  1.60914169 3.230397
## 21 3.0986425  2.23324303 3.964042
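
The same machinery gives a prediction at a concentration outside the observed cities; for example, for a hypothetical SO2 level of 150 (an illustrative value, not from the data):

new.obs <- data.frame(SO2.Conc=150)   # hypothetical new concentration
predict.lm(model,newdata=new.obs,interval="prediction")
# point estimate: 0.3229959 + 0.0085933*150, roughly 1.61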

4.4

Plotting the fitted line together with its confidence and prediction bands using ci.plot() from the HH package:

library(HH)   # provides ci.plot()
ci.plot(model)

5. Observation and Analysis

Answer:

As we can see, the p-value is highly significant, being very small compared to 0.05. There is a strong correlation between the marble tombstone mean surface recession rate and the mean SO2 concentration over the 100-year period, and the variation in the response variable is well explained by the linear regression model. The confidence bands lie close to the fitted line, as seen in ci.plot(model). The residual plot shows a random pattern in the residuals, indicating that the deterministic portion of the model (the predictor variable) is capturing useful information. The null hypothesis states that the slope is equal to zero; we reject it.
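
The residual plot referred to above can be drawn with the same commands used for the bus models in the next section:

plot(model$residuals,pch=20)       # residuals of the tombstone model
points(model$residuals,type="h")   # drop lines down to zero
abline(h=0,col="red",lwd=3)        # zero reference line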

Repeating steps 1-5 for the bus.csv dataset.

6.1

Answer

Using Expenses per car mile (pence) as the response variable and Car miles per year (1000s) as the covariate, I carry out the following analysis.

d=read.csv("bus.csv")
names(d)
## [1] "Expenses.per.car.mile..pence."     
## [2] "Car.miles.per.year..1000s."        
## [3] "Percent.of.Double.Deckers.in.fleet"
## [4] "Percent.of.fleet.on.fuel.oil"      
## [5] "Receipts.per.car.mile..pence."
names(d)=c("Expense","Car_mile","DoubleDecker","Fleet","Receipts")   # shorten the column names
pairs(d)   # scatterplot matrix of all five variables

attach(d)   # make the columns accessible by bare name
plot(Car_mile,Expense,pch=20)
cor(Car_mile,Expense)
## [1] -0.3978242
model1=lm(Expense~Car_mile)
summary(model1)
## 
## Call:
## lm(formula = Expense ~ Car_mile)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0123 -0.9417 -0.1894  0.8993  2.6176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.878e+01  4.075e-01  46.085   <2e-16 ***
## Car_mile    -4.450e-05  2.188e-05  -2.034   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared:  0.1583, Adjusted R-squared:   0.12 
## F-statistic: 4.136 on 1 and 22 DF,  p-value: 0.0542
abline(model1)
lines(lowess(Expense~Car_mile))

library(HH)
ci.plot(model1)

anova(model1)
## Analysis of Variance Table
## 
## Response: Expense
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Car_mile   1  7.506  7.5058  4.1365 0.0542 .
## Residuals 22 39.920  1.8145                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model1,level=0.95)
##                     2.5 %       97.5 %
## (Intercept)  1.793660e+01 1.962700e+01
## Car_mile    -8.987441e-05 8.761294e-07
plot(model1$residuals,pch=20)
points(model1$residuals,type="h")
abline(h=0,col="red",lwd=3)

par(mfrow = c(2,2))
plot(model1)

As we can see, the p-value (0.0542) is not significant, being greater than 0.05. The residual plot shows points lying far from the center, and the non-random pattern in the residuals indicates that the deterministic portion of the model (the predictor variable) fails to capture some information, which leaks into the residuals. The null hypothesis states that the slope is equal to zero; we cannot reject it.

6.2

Answer

Using Expenses per car mile (pence) as the response variable and Percent of Double Deckers in fleet as the covariate, I carry out the following analysis.

names(d)
## [1] "Expense"      "Car_mile"     "DoubleDecker" "Fleet"       
## [5] "Receipts"
plot(DoubleDecker,Expense,pch=20)
model2=lm(Expense~DoubleDecker)
summary(model2)
## 
## Call:
## lm(formula = Expense ~ DoubleDecker)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2992 -0.5985 -0.1054  0.2084  3.5161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.54775    0.55637  29.742  < 2e-16 ***
## DoubleDecker  0.03289    0.01011   3.252  0.00366 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.207 on 22 degrees of freedom
## Multiple R-squared:  0.3246, Adjusted R-squared:  0.2939 
## F-statistic: 10.57 on 1 and 22 DF,  p-value: 0.003657
abline(model2)
lines(lowess(Expense~DoubleDecker))

library(HH)
ci.plot(model2)

anova(model2)
## Analysis of Variance Table
## 
## Response: Expense
##              Df Sum Sq Mean Sq F value   Pr(>F)   
## DoubleDecker  1 15.395 15.3949  10.574 0.003657 **
## Residuals    22 32.031  1.4559                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model2,level=0.95)
##                    2.5 %      97.5 %
## (Intercept)  15.39390854 17.70159668
## DoubleDecker  0.01191374  0.05386638
plot(model2$residuals,pch=20)
points(model2$residuals,type="h")
abline(h=0,col="red",lwd=3)

par(mfrow = c(2,2))
plot(model2)

As we can see, the p-value (0.00366) is significant, being well below 0.05. The residual plot shows a random pattern in the residuals, indicating that the deterministic portion of the model (the predictor variable) is capturing useful information. The null hypothesis states that the slope is equal to zero; we reject it.

6.3

Answer

Using Expenses per car mile (pence) as the response variable and Percent of fleet on fuel oil as the covariate, I carry out the following analysis.

names(d)
## [1] "Expense"      "Car_mile"     "DoubleDecker" "Fleet"       
## [5] "Receipts"
plot(Fleet,Expense,pch=20)
model3=lm(Expense~Fleet)
summary(model3)
## 
## Call:
## lm(formula = Expense ~ Fleet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6282 -1.1915 -0.3800  0.4129  3.1490 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.876372   1.966369   9.091 6.64e-09 ***
## Fleet        0.003375   0.022341   0.151    0.881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.467 on 22 degrees of freedom
## Multiple R-squared:  0.001037,   Adjusted R-squared:  -0.04437 
## F-statistic: 0.02283 on 1 and 22 DF,  p-value: 0.8813
abline(model3)
lines(lowess(Expense~Fleet))

library(HH)
ci.plot(model3)

anova(model3)
## Analysis of Variance Table
## 
## Response: Expense
##           Df Sum Sq Mean Sq F value Pr(>F)
## Fleet      1  0.049 0.04916  0.0228 0.8813
## Residuals 22 47.376 2.15347
confint(model3,level=0.95)
##                   2.5 %      97.5 %
## (Intercept) 13.79837118 21.95437183
## Fleet       -0.04295722  0.04970821
plot(model3$residuals,pch=20)
points(model3$residuals,type="h")
abline(h=0,col="red",lwd=3)

par(mfrow = c(2,2))
plot(model3)

As we can see, the p-value (0.881) is not at all significant, being far greater than 0.05. The residual plot shows points lying far from the center, and the non-random pattern in the residuals indicates that the deterministic portion of the model (the predictor variable) fails to capture some information, which leaks into the residuals. The null hypothesis states that the slope is equal to zero; we cannot reject it.

6.4

Answer

Using Expenses per car mile (pence) as the response variable and Receipts per car mile (pence) as the covariate, I carry out the following analysis.

names(d)
## [1] "Expense"      "Car_mile"     "DoubleDecker" "Fleet"       
## [5] "Receipts"
plot(Receipts,Expense,pch=20)
cor(Receipts,Expense)
## [1] 0.7868772
model4=lm(Expense~Receipts)
summary(model4)
## 
## Call:
## lm(formula = Expense ~ Receipts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91091 -0.56057  0.09436  0.41483  2.62151 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.65846    1.60108   5.408 1.97e-05 ***
## Receipts     0.47987    0.08024   5.981 5.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9061 on 22 degrees of freedom
## Multiple R-squared:  0.6192, Adjusted R-squared:  0.6019 
## F-statistic: 35.77 on 1 and 22 DF,  p-value: 5.097e-06
abline(model4)
lines(lowess(Expense~Receipts))

library(HH)
ci.plot(model4)

anova(model4)
## Analysis of Variance Table
## 
## Response: Expense
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Receipts   1 29.365 29.3648  35.769 5.097e-06 ***
## Residuals 22 18.061  0.8209                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model4,level=0.95)
##                 2.5 %     97.5 %
## (Intercept) 5.3380250 11.9788852
## Receipts    0.3134688  0.6462633
plot(model4$residuals,pch=20)
points(model4$residuals,type="h")
abline(h=0,col="red",lwd=3)

par(mfrow = c(2,2))
plot(model4)

As we can see, the p-value (\(5.1*10^{-6}\)) is highly significant, being very small compared to 0.05. The residual plot shows a random pattern in the residuals, indicating that the deterministic portion of the model (the predictor variable) is capturing useful information. The null hypothesis states that the slope is equal to zero; we reject it.

6.5

Answer

As we can see from the analysis, model2 (answer 6.2) and model4 (answer 6.4) are significant models that explain the variability in the data, while model1 and model3 do not. This shows that the variability in the response variable 'Expenses per car mile (pence)' can be explained by the covariates 'Percent of Double Deckers in fleet' and 'Receipts per car mile (pence)'. In real-life terms, this cross-sectional analysis of 24 British bus companies (1951) shows that expenses per car mile cannot be explained by the number of miles a bus travels over the year or by the percentage of the fleet on fuel oil; they can, however, be significantly explained by the number of double-decker buses in the fleet and by the receipts per car mile over the year.
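
The four fits can be compared at a glance by collecting their \(R^{2}\) values (a quick sketch using the models fitted above):

models <- list(Car_mile=model1,DoubleDecker=model2,Fleet=model3,Receipts=model4)
sapply(models,function(m) summary(m)$r.squared)
# R-squared from the summaries above: 0.1583, 0.3246, 0.0010, 0.6192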