Read
To read data into R, I use the following command. I am also trying to change column names in the dataset
tomb=read.csv("tombstone.csv")
ls()
## [1] "tomb"
colnames(tomb)[2:3]<- c("SO2.Conc","Surface.Reces.")
names(tomb)
## [1] "City" "SO2.Conc" "Surface.Reces."
We can see that after executing tomb=read.csv("tombstone.csv"), we have a variable tomb in the memory. We also changed the names of column to a bit shorter with help of
I have to plot the data using the following command
plot1<-qplot(SO2.Conc,Surface.Reces.)
plot2<-qplot(SO2.Conc,Surface.Reces.)+geom_smooth(method = 'lm')
plot3<-qplot(SO2.Conc,Surface.Reces.)+stat_smooth(se=F)+theme_bw()
grid.arrange(plot1,plot2,plot3,ncol=3)
We can see the scatter plot and also the curved lowess line. It is almost a straight line
nrow(tomb)
## [1] 21
cor(SO2.Conc,Surface.Reces.)
## [1] 0.9008731
summary(tomb)
## City SO2.Conc Surface.Reces.
## Albany,NY : 1 Min. : 12.0 Min. :0.140
## Baltimore,MD: 1 1st Qu.: 91.0 1st Qu.:1.010
## Boston,MA : 1 Median :122.0 Median :1.530
## Brooklyn,NY : 1 Mean :136.5 Mean :1.496
## Cambridge,MA: 1 3rd Qu.:197.0 3rd Qu.:1.980
## Chicago,IL : 1 Max. :323.0 Max. :3.160
## (Other) :15
There are 21 observation. The correlation coefficient of SO2 Concentration and Surface Recession is 0.9008731. Since it is rather close to 1, we can conclude that the variables are positively linearly related. We can also visualize the data with the following plot.
We can see in the lower half triangle that the data points are correlated.
model<- lm(Surface.Reces.~ SO2.Conc, data=tomb)
plot(Surface.Reces.~SO2.Conc)
points(tomb$SO2.Conc,model$fitted.values,pch=20,col="red")
model$coefficients[1]
## (Intercept)
## 0.3229959
model$coefficients[2]
## SO2.Conc
## 0.008593333
summary(model)$r.squared
## [1] 0.8115724
summary(model)$adj.r.squared
## [1] 0.8016552
The values for \(\widehat{\beta_0}\) = 0.3229959 and \(\widehat{\beta_1}\) = 0.0085933
The \(R^{2}\) is 0.8115724. This shows that the model fits our data and the model can explain 81% of the variability of the response data around it’s mean. So the data points will fall closer to the fitted regression line.
Performing t tests. We do hypothesis testing on the slope and intercept. The t statistics and the p-value for both slope and intercept can be obtained as follows, using the summary function on our model.
summary(model)$coef[,3]
## (Intercept) SO2.Conc
## 2.122239 9.046242
summary(model)$coef[,4]
## (Intercept) SO2.Conc
## 4.718525e-02 2.578534e-08
As we can see, the t value for intercept is 2.1222387 and for slope is 9.0462416. The p-value for the intercept is 0.0471852 and for slope is \(2.58*10^{-8}\). The null hypothesis states that the slope is equal to zero. The t value for slope is quite significant as it is greater than our \(\alpha\) of 2. Also, since the P-value \(2.58*10^{-8}\) is less than the significance level (0.05), we cannot accept the null hypothesis.Thus we reject the null hypothesis.
Now I am performing the anova(F) test.
anova(model)
## Analysis of Variance Table
##
## Response: Surface.Reces.
## Df Sum Sq Mean Sq F value Pr(>F)
## SO2.Conc 1 10.9031 10.9031 81.835 2.579e-08 ***
## Residuals 19 2.5314 0.1332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the F value is 81.835 and the p-value is \(2.58*10^{-8}\). Since this is linear regression, the anova test and t test are similar as \(t_0^{2} = F_0\). The null hypothesis is that the slope is equal to zero. There is strong evidence that the slope is not equal to zero as p-value is quite significant.
Now calculating the confidence interval, the fitted values, the prediction interval
confint(model,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 0.004446349 0.64154545
## SO2.Conc 0.006605098 0.01058157
model$fitted.values
## 1 2 3 4 5 6 7
## 0.4261159 0.4948626 0.4948626 0.7182892 0.7354759 1.1135825 1.1049892
## 8 9 10 11 12 13 14
## 1.1307692 1.1995159 1.3284159 1.3713825 1.5432492 1.5432492 1.8526092
## 15 16 17 18 19 20 21
## 1.8697959 2.0158825 2.2479025 2.3338359 2.3768025 2.4197692 3.0986425
predict.lm(model,interval = "confidence")
## fit lwr upr
## 1 0.4261159 0.1276356 0.7245962
## 2 0.4948626 0.2094375 0.7802876
## 3 0.4948626 0.2094375 0.7802876
## 4 0.7182892 0.4729586 0.9636199
## 5 0.7354759 0.4930475 0.9779043
## 6 1.1135825 0.9248239 1.3023412
## 7 1.1049892 0.9152900 1.2946885
## 8 1.1307692 0.9438424 1.3176960
## 9 1.1995159 1.0192244 1.3798074
## 10 1.3284159 1.1572428 1.4995889
## 11 1.3713825 1.2021867 1.5405784
## 12 1.5432492 1.3761806 1.7103178
## 13 1.5432492 1.3761806 1.7103178
## 14 1.8526092 1.6666152 2.0386032
## 15 1.8697959 1.6820050 2.0575867
## 16 2.0158825 1.8103314 2.2214336
## 17 2.2479025 2.0069821 2.4888230
## 18 2.3338359 2.0781916 2.5894801
## 19 2.3768025 2.1135420 2.6400631
## 20 2.4197692 2.1487417 2.6907967
## 21 3.0986425 2.6921265 3.5051585
predict.lm(model,interval = "prediction")
## fit lwr upr
## 1 0.4261159 -0.39409843 1.246330
## 2 0.4948626 -0.32069155 1.310417
## 3 0.4948626 -0.32069155 1.310417
## 4 0.7182892 -0.08411227 1.520691
## 5 0.7354759 -0.06604303 1.536995
## 6 1.1135825 0.32663218 1.900533
## 7 1.1049892 0.31781271 1.892166
## 8 1.1307692 0.34425624 1.917282
## 9 1.1995159 0.41455342 1.984478
## 10 1.3284159 0.54549746 2.111334
## 11 1.3713825 0.58889402 2.153871
## 12 1.5432492 0.76121790 2.325281
## 13 1.5432492 0.76121790 2.325281
## 14 1.8526092 1.06631740 2.638901
## 15 1.8697959 1.08307708 2.656515
## 16 2.0158825 1.22473635 2.807029
## 17 2.2479025 1.44683842 3.048967
## 18 2.3338359 1.52822118 3.139451
## 19 2.3768025 1.56873869 3.184866
## 20 2.4197692 1.60914169 3.230397
## 21 3.0986425 2.23324303 3.964042
Plotting the confidence interval, the fitted values, the prediction interval using ci.plot()
ci.plot(model)
As we can see that the p-value is significant as the value is very small compared to 0.05. There is a correlation of Marble Tombstone Mean Surface Recession Rate and Mean SO2 concentrations over a 100 year period. The variation in the response variable can be explained with the linear regression model. The confidence interval bands are not very far away from the fitted line as seen in the ci.plot(model). The residual plot shows that there is a random pattern in the residuals indicating that the deterministic portion that is (predictor variables) of the model is capturing useful information. The null hypothesis states that the slope is equal to zero. We reject the null hypothesis.
Using response variable as Expenses per car mile (pence), covariate as Car miles per year (1000s), doing the following analysis.
d=read.csv("bus.csv")
names(d)
## [1] "Expenses.per.car.mile..pence."
## [2] "Car.miles.per.year..1000s."
## [3] "Percent.of.Double.Deckers.in.fleet"
## [4] "Percent.of.fleet.on.fuel.oil"
## [5] "Receipts.per.car.mile..pence."
names(d)=c("Expense","Car_mile","DoubleDecker","Fleet","Receipts")
pairs(d)
attach(d)
plot(Car_mile,Expense,pch=20)
cor(Car_mile,Expense)
## [1] -0.3978242
model1=lm(Expense~Car_mile)
summary(model1)
##
## Call:
## lm(formula = Expense ~ Car_mile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0123 -0.9417 -0.1894 0.8993 2.6176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.878e+01 4.075e-01 46.085 <2e-16 ***
## Car_mile -4.450e-05 2.188e-05 -2.034 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared: 0.1583, Adjusted R-squared: 0.12
## F-statistic: 4.136 on 1 and 22 DF, p-value: 0.0542
abline(model1)
lines(lowess(Expense~Car_mile))
library(HH)
ci.plot(model1)
anova(model1)
## Analysis of Variance Table
##
## Response: Expense
## Df Sum Sq Mean Sq F value Pr(>F)
## Car_mile 1 7.506 7.5058 4.1365 0.0542 .
## Residuals 22 39.920 1.8145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model1,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 1.793660e+01 1.962700e+01
## Car_mile -8.987441e-05 8.761294e-07
plot(model1$residuals,pch=20)
points(model1$residuals,type="h")
abline(h=0,col="red",lwd=3)
par(mfrow = c(2,2))
plot(model1)
As we can see that the p-value is not at all significant as the value is greater than 0.05. The residual plot shows that the points are quite away from the center and the non-random pattern in the residuals indicates that the deterministic portion that is (predictor variables) of the model are not capturing some information that is leaking into the residuals. The null hypothesis states that the slope is equal to zero. We cannot reject the null hypothesis.
Using response variable as Expenses per car mile (pence), covariate as Percent of Double Deckers in fleet, doing the following analysis.
names(d)
## [1] "Expense" "Car_mile" "DoubleDecker" "Fleet"
## [5] "Receipts"
plot(DoubleDecker,Expense,pch=20)
model2=lm(Expense~DoubleDecker)
summary(model2)
##
## Call:
## lm(formula = Expense ~ DoubleDecker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2992 -0.5985 -0.1054 0.2084 3.5161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.54775 0.55637 29.742 < 2e-16 ***
## DoubleDecker 0.03289 0.01011 3.252 0.00366 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.207 on 22 degrees of freedom
## Multiple R-squared: 0.3246, Adjusted R-squared: 0.2939
## F-statistic: 10.57 on 1 and 22 DF, p-value: 0.003657
abline(model2)
lines(lowess(Expense~DoubleDecker))
library(HH)
ci.plot(model2)
anova(model2)
## Analysis of Variance Table
##
## Response: Expense
## Df Sum Sq Mean Sq F value Pr(>F)
## DoubleDecker 1 15.395 15.3949 10.574 0.003657 **
## Residuals 22 32.031 1.4559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model2,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 15.39390854 17.70159668
## DoubleDecker 0.01191374 0.05386638
plot(model2$residuals,pch=20)
points(model2$residuals,type="h")
abline(h=0,col="red",lwd=3)
par(mfrow = c(2,2))
plot(model2)
As we can see that the p-value is significant as the value is very small compared to 0.05. The residual plot shows that there is a random pattern in the residuals indicating that the deterministic portion that is (predictor variables) of the model is capturing useful information. The null hypothesis states that the slope is equal to zero. We reject the null hypothesis.
Using response variable as Expenses per car mile (pence), covariate as Percent of fleet on fuel oil, doing the following analysis.
names(d)
## [1] "Expense" "Car_mile" "DoubleDecker" "Fleet"
## [5] "Receipts"
plot(Fleet,Expense,pch=20)
model3=lm(Expense~Fleet)
summary(model3)
##
## Call:
## lm(formula = Expense ~ Fleet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6282 -1.1915 -0.3800 0.4129 3.1490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.876372 1.966369 9.091 6.64e-09 ***
## Fleet 0.003375 0.022341 0.151 0.881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.467 on 22 degrees of freedom
## Multiple R-squared: 0.001037, Adjusted R-squared: -0.04437
## F-statistic: 0.02283 on 1 and 22 DF, p-value: 0.8813
abline(model3)
lines(lowess(Expense~Fleet))
library(HH)
ci.plot(model3)
anova(model3)
## Analysis of Variance Table
##
## Response: Expense
## Df Sum Sq Mean Sq F value Pr(>F)
## Fleet 1 0.049 0.04916 0.0228 0.8813
## Residuals 22 47.376 2.15347
confint(model3,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 13.79837118 21.95437183
## Fleet -0.04295722 0.04970821
plot(model3$residuals,pch=20)
points(model3$residuals,type="h")
abline(h=0,col="red",lwd=3)
par(mfrow = c(2,2))
plot(model3)
As we can see that the p-value is not at all significant as the value is greater than 0.05. The residual plot shows that the points are quite away from the center and the non-random pattern in the residuals indicates that the deterministic portion that is (predictor variables) of the model are not capturing some information that is leaking into the residuals. The null hypothesis states that the slope is equal to zero. We cannot reject the null hypothesis.
Using response variable as Expenses per car mile (pence), covariate as Receipts per car mile (pence), doing the following analysis.
names(d)
## [1] "Expense" "Car_mile" "DoubleDecker" "Fleet"
## [5] "Receipts"
plot(Receipts,Expense,pch=20)
cor(Receipts,Expense)
## [1] 0.7868772
model4=lm(Expense~Receipts)
summary(model4)
##
## Call:
## lm(formula = Expense ~ Receipts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91091 -0.56057 0.09436 0.41483 2.62151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.65846 1.60108 5.408 1.97e-05 ***
## Receipts 0.47987 0.08024 5.981 5.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9061 on 22 degrees of freedom
## Multiple R-squared: 0.6192, Adjusted R-squared: 0.6019
## F-statistic: 35.77 on 1 and 22 DF, p-value: 5.097e-06
abline(model4)
lines(lowess(Expense~Receipts))
library(HH)
ci.plot(model4)
anova(model4)
## Analysis of Variance Table
##
## Response: Expense
## Df Sum Sq Mean Sq F value Pr(>F)
## Receipts 1 29.365 29.3648 35.769 5.097e-06 ***
## Residuals 22 18.061 0.8209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model4,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 5.3380250 11.9788852
## Receipts 0.3134688 0.6462633
plot(model4$residuals,pch=20)
points(model4$residuals,type="h")
abline(h=0,col="red",lwd=3)
par(mfrow = c(2,2))
plot(model4)
As we can see that the p-value is highly significant as the value is very small compared to 0.05. The residual plot shows that there is a random pattern in the residuals indicating that the deterministic portion that is (predictor variables) of the model is capturing useful information. The null hypothesis states that the slope is equal to zero. We reject the null hypothesis.
As we can see from the analysis model2(answer 6.2) and model4(answer 6.4) are significant models and are able to explain the variablility in the data. The other two models model1 and model3 are not able to explain the variability of data. This shows that the variability in response variable ‘Expense’Expenses per car mile (pence)’ can be explained by the co-variates ‘Percent of Double Deckers in fleet’ and ‘Receipts per car mile (pence)’. So in a real life example,the Cross-sectional analysis of 24 British bus companies (1951) shows that the expenses per car mile cannot be explained by the number of miles a bus travels over the year or the percentage of buses on fuel. The expenses can be significantly explained by factors such as number of double decker buses in the fleet and the receipts per car mile over the year.