Sharma Shivankit M10625565

R Markdown

Summary of answers

Dataset   | Covariate (x)                 | Response (y)           | \(\beta\)0 (intercept) | \(\beta\)1 (slope) | R2       | t value | F value | p > 0.05?
----------|-------------------------------|------------------------|------------------------|--------------------|----------|---------|---------|----------
Tombstone | Mean SO2 levels               | Marble recession rate  | 0.322996               | 0.008593           | 0.8116   | 9.046   | 81.835  | No
Bus       | Car miles per year            | Expenses per car mile  | 18.78                  | -0.0000445         | 0.1582   | -2.0338 | 4.1365  | Yes
Bus       | Percent of double deckers     | Expenses per car mile  | 16.5477                | 0.03289            | 0.3246   | 3.2517  | 10.574  | No
Bus       | Percent of fleet on oil fuel  | Expenses per car mile  | 17.8763                | 0.003375           | 0.001036 | 0.1510  | 0.0228  | Yes
Bus       | Receipts per car mile         | Expenses per car mile  | 8.6584                 | 0.479866           | 0.6191   | 5.9807  | 35.769  | No

(A "No" in the last column means the p-value is below 0.05, i.e. the slope is statistically significant.)

Q-1)

Read csv file into the system.

  • Covariate = Mean SO2 concentrations over a 100 year period (This is x)
  • Response variable = Marble Tombstone Mean Surface Recession Rate (This is y)
tombs <- read.csv("C:\\Users\\sharm_000\\OneDrive\\University\\BANA 7038\\Homework 2\\tombstone.csv",header = T,sep = ",")
names(tombs)
## [1] "City"                                                      
## [2] "Modelled.100.Year.Mean.SO2.Concentration..ug.m..3."        
## [3] "Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years."

We rename the columns as follows (a sketch of this step follows the list):

  • mso2c = Mean SO2 concentrations over a 100 year period
  • mtmsrr = Marble Tombstone Mean Surface Recession Rate
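A minimal sketch of the renaming step (the original chunk is not echoed); it assumes the column order returned by names(tombs) above:

# Rename columns to short, convenient names
names(tombs) <- c("City", "mso2c", "mtmsrr")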

Q-2)

Data Exploration

str(tombs)
## 'data.frame':    21 obs. of  3 variables:
##  $ City  : Factor w/ 21 levels "Albany,NY","Baltimore,MD",..: 21 8 16 19 10 11 9 1 20 13 ...
##  $ mso2c : int  12 20 20 46 48 92 91 94 102 117 ...
##  $ mtmsrr: num  0.27 0.14 0.33 0.81 0.84 1.08 1.78 1.21 1.09 1.72 ...
  #Explore distribution of response/dependent variable mtmsrr 
  summary(tombs$mtmsrr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.140   1.010   1.530   1.496   1.980   3.160
  • Mean < Median. Hence distribution for mtmsrr is left skewed.

  • Plot

The scatter plot of the points together with the fitted regression line shows a visible positive relationship between the SO2 levels and the surface recession rate.
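A minimal sketch of the exploratory plot described above; here the regression line is fitted inline, while the model object itself is built formally in Q-3:

# Scatter plot of recession rate against SO2 with a fitted line overlaid
plot(tombs$mso2c, tombs$mtmsrr,
     xlab = "Mean SO2 Concentration", ylab = "Mean Surface Recession Rate")
abline(lm(mtmsrr ~ mso2c, data = tombs), lty = 2)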

Colored Scatterplot Matrix

  library(psych)
  pairs.panels(tombs[c("mtmsrr","mso2c")]) 

The stretched ellipse shows a strong correlation, and the steeply rising red line (loess smooth) shows that mtmsrr increases with mso2c.

Q-3)

Linear regression model

  • Building Linear regression model

mtombs <- lm (mtmsrr ~ mso2c, data = tombs)
#Beta coefficient estimates
mtombs$coefficients
## (Intercept)       mso2c 
## 0.322995899 0.008593333

\(\beta\)0 (intercept) = 0.322996 and \(\beta\)1 (slope) = 0.008593

  • R2 and its interpretation
summary(mtombs)$r.squared 
## [1] 0.8115724
R2 = 0.8116. The model explains the response well, since an R2 closer to 1 indicates a better fit.
About 81% of the variation in mtmsrr is explained by the model.
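Because this is a simple linear regression with a single covariate, R2 can be cross-checked as the squared sample correlation between x and y (a quick sanity check, not part of the original output):

# R^2 equals the squared correlation in simple linear regression
cor(tombs$mso2c, tombs$mtmsrr)^2   # should match summary(mtombs)$r.squared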

Q-4)

Hypothesis testing using the linear model

  • 4.1 – T Test

summary(mtombs)$coefficients
##                Estimate   Std. Error  t value     Pr(>|t|)
## (Intercept) 0.322995899 0.1521958377 2.122239 4.718525e-02
## mso2c       0.008593333 0.0009499341 9.046242 2.578534e-08
In the tombs model, the t value is 9.046; it is the coefficient estimate divided by its standard error. This t statistic corresponds to the slope for the covariate mso2c.
As the p-value (2.578534e-08) is much less than 0.05, we reject the null hypothesis that x has no effect on y.
Hence there is a significant linear relationship between the variables in this regression model.
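As a quick check of the statement above (the t statistic is the estimate divided by its standard error):

# Reproduce the slope's t statistic from the coefficient table
coefs <- summary(mtombs)$coefficients
coefs["mso2c", "Estimate"] / coefs["mso2c", "Std. Error"]   # approximately 9.046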
  • 4.2 ANOVA test

#ANOVA test
anova(mtombs)
## Analysis of Variance Table
## 
## Response: mtmsrr
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## mso2c      1 10.9031 10.9031  81.835 2.579e-08 ***
## Residuals 19  2.5314  0.1332                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 81.835, which is much larger than 1, and the p value (2.579e-08) is far below 0.05.
Therefore, we reject the null hypothesis that x has no effect on y.
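For a simple linear regression, the ANOVA F statistic for the single covariate is the square of the slope's t statistic, which gives an easy cross-check of the two tests:

# F statistic equals the squared t statistic of the slope
summary(mtombs)$coefficients["mso2c", "t value"]^2   # 9.046^2, approximately 81.835
anova(mtombs)[["F value"]][1]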
  • 4.3 Confidence interval

Confidence intervals for the coefficients

  confint(mtombs)
##                   2.5 %     97.5 %
## (Intercept) 0.004446349 0.64154545
## mso2c       0.006605098 0.01058157
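These intervals can be reproduced by hand from the coefficient table as estimate ± t(0.975, df = 19) × standard error, with 19 residual degrees of freedom as shown in the ANOVA table above:

# Manual 95% confidence interval for the slope
0.008593333 + c(-1, 1) * qt(0.975, df = 19) * 0.0009499341   # matches the mso2c row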

For Fitted Values

  fitval <- fitted(mtombs)
  fitval
##         1         2         3         4         5         6         7 
## 0.4261159 0.4948626 0.4948626 0.7182892 0.7354759 1.1135825 1.1049892 
##         8         9        10        11        12        13        14 
## 1.1307692 1.1995159 1.3284159 1.3713825 1.5432492 1.5432492 1.8526092 
##        15        16        17        18        19        20        21 
## 1.8697959 2.0158825 2.2479025 2.3338359 2.3768025 2.4197692 3.0986425

Confidence intervals for the fitted values

  ptombs <- predict(mtombs,interval = "confidence")
  ptombs[1:5,]
##         fit       lwr       upr
## 1 0.4261159 0.1276356 0.7245962
## 2 0.4948626 0.2094375 0.7802876
## 3 0.4948626 0.2094375 0.7802876
## 4 0.7182892 0.4729586 0.9636199
## 5 0.7354759 0.4930475 0.9779043

Confidence intervals at new covariate values (mso2c = 30, 110, 210)

  predval <- c(30,110,210)
  new.so2 <- data.frame(mso2c = predval)
  pdtombs <- predict(mtombs, newdata = new.so2, interval="confidence")
  pdtombs
##         fit       lwr      upr
## 1 0.5807959 0.3112588 0.850333
## 2 1.2682625 1.0934071 1.443118
## 3 2.1275959 1.9059316 2.349260
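Note that interval = "confidence" gives intervals for the mean response at these SO2 levels. If intervals for individual new observations were wanted instead, interval = "prediction" would give wider bounds; shown here only as an illustrative alternative:

# Prediction intervals for individual new observations (wider than the CIs above)
predict(mtombs, newdata = new.so2, interval = "prediction")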
  • 4.4 Final plotting

#Final Plotting
plot(tombs$mso2c, tombs$mtmsrr,xlab="Mean SO2 Concentration", ylab="Mean Surface Recession Rate", main="Tombstones Regression plot")
abline(mtombs, lty=2)
  
  #Confidence intervals for fitted values, plotted against the covariate
  #(points ordered by mso2c so the lines are drawn left to right)
  ord <- order(tombs$mso2c)
  lines(tombs$mso2c[ord], ptombs[ord,2], col="blue")
  lines(tombs$mso2c[ord], ptombs[ord,3], col="blue")
  
  #Confidence intervals for predict values
  lines(predval,pdtombs[,2],col="red")
  lines(predval,pdtombs[,3],col="red")

Q5

Summary

On the basis of our analysis, we can confidently say that there is a strong correlation between the levels of atmospheric SO2 and the rate of marble recession on tombstones in the corresponding areas.

The \(\beta\)0 (intercept) is 0.3229 and the \(\beta\)1 (slope) is 0.008593. The value of R2 is 0.8116, which indicates a strong linear relationship.

The t statistic for the slope is 9.046 with a p value well below 0.05; similarly, the F value is 81.835 with the same p value.

On the basis of these results, we can say that, to a good approximation, the recession rate of marble increases linearly with the level of atmospheric SO2.

Testing with a larger sample size would help to further confirm our theory that the levels of atmospheric SO2 may be a cause of marble erosion.



Q6

The dataset used is the cross-sectional analysis of 24 British bus companies (1951). It contains the following columns:
## [1] "Expenses.per.car.mile..pence."     
## [2] "Car.miles.per.year..1000s."        
## [3] "Percent.of.Double.Deckers.in.fleet"
## [4] "Percent.of.fleet.on.fuel.oil"      
## [5] "Receipts.per.car.mile..pence."

Renaming the columns in the following manner (a sketch of this step follows the list):

  • exp = “Expenses.per.car.mile..pence.”
  • car = “Car.miles.per.year..1000s.”
  • dd = “Percent.of.Double.Deckers.in.fleet”
  • oil = “Percent.of.fleet.on.fuel.oil”
  • rec = “Receipts.per.car.mile..pence.”
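A minimal sketch of the data preparation for Q6. The file name "bus.csv" is an assumption (the original read.csv path is not shown), and the column order follows the listing above:

# Read the bus data and rename columns to short names (file name assumed)
bus <- read.csv("bus.csv", header = TRUE)
names(bus) <- c("exp", "car", "dd", "oil", "rec")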

Plotting

  1. Car miles per year vs expenses per mile
  2. Percent of double deckers in fleet vs expenses per mile
  3. Percent of fleet on oil fuel vs expenses per mile
  4. Receipts per mile vs expenses per mile

6.1 Car miles per year vs expenses per mile

Defining the relationship

  • Covariate = Car miles per year (1000s) (This is x)
  • Response variable = Expenses per car mile (pence) (This is y)

Exploring the structure: summary of the response variable exp (Expenses per car mile, pence)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.56   16.95   17.76   18.17   18.62   21.24

Mean > Median. Hence distribution for exp is right skewed.

  • Plotting the correlation matrix (a code sketch follows the interpretation below)

1) When we compare Car miles per year vs expenses per mile, a correlation value of 0.4 suggests a weak correlation.
2) For Percent of double deckers in fleet vs expenses per mile, the correlation is 0.57, which suggests a moderate correlation since it is above 0.5.
3) For Percent of fleet on oil fuel vs expenses per mile, the correlation is 0.03, which is close to 0 and shows essentially no correlation.
4) Lastly, for Receipts per mile vs expenses per mile, the correlation is 0.79, which suggests a strong correlation between these two variables.
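A sketch of the correlation matrix these interpretations refer to, assuming the renamed bus data frame from the sketch above:

# Colored scatterplot matrix with correlations, as in Q-2
library(psych)
pairs.panels(bus[c("exp", "car", "dd", "oil", "rec")])
round(cor(bus[c("exp", "car", "dd", "oil", "rec")]), 2)   # plain numeric version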
  • Computing the values of \(\beta\)0 and \(\beta\)1
##   (Intercept)           car 
##  1.878180e+01 -4.449914e-05

\(\beta\)0 (intercept) = 18.78 and \(\beta\)1 (slope) = -0.0000445
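The lm() calls behind these coefficients are not echoed in the output. A minimal sketch of the four simple regressions used in sections 6.1 to 6.4 (the object names mcar, mdd, moil, mrec are illustrative):

# One simple linear regression per covariate, all with exp as the response
mcar <- lm(exp ~ car, data = bus)
mdd  <- lm(exp ~ dd,  data = bus)
moil <- lm(exp ~ oil, data = bus)
mrec <- lm(exp ~ rec, data = bus)
mcar$coefficients   # should reproduce the intercept and slope quoted above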

R2 and its interpretation

## [1] 0.1582641
R2 = 0.1582 means the model does not explain the response well, since the value is close to 0.
Only about 16% of the variation in exp is explained by the model.

Hypothesis testing using the linear model

  • T Test

##                  Estimate   Std. Error  t value     Pr(>|t|)
## (Intercept)  1.878180e+01 4.075464e-01 46.08506 2.223005e-23
## car         -4.449914e-05 2.187948e-05 -2.03383 5.420264e-02
The t value is -2.034; it is the coefficient estimate divided by its standard error. This t statistic corresponds to the slope for the covariate car.
As the p-value (0.0542) is greater than 0.05, we fail to reject the null hypothesis that x has no effect on y.
  • ANOVA test

## Analysis of Variance Table
## 
## Response: exp
##           Df Sum Sq Mean Sq F value Pr(>F)  
## car        1  7.506  7.5058  4.1365 0.0542 .
## Residuals 22 39.920  1.8145                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 4.136 and the p value (0.0542) is greater than 0.05.

Therefore, we fail to reject the null hypothesis that x has no effect on y.

  • Confidence interval

Confidence intervals for the coefficients

##                     2.5 %       97.5 %
## (Intercept)  1.793660e+01 1.962700e+01
## car         -8.987441e-05 8.761294e-07

For Fitted Values

##        1        2        3        4        5        6        7        8 
## 18.50435 16.72461 18.45429 17.50401 17.80576 18.72231 17.98611 18.67861 
##        9       10       11       12       13       14       15       16 
## 17.97904 18.73076 18.68497 18.19143 18.62245 18.10969 16.68994 18.33063 
##       17       18       19       20       21       22       23       24 
## 18.50827 17.75436 17.86734 18.36129 18.73606 18.61057 18.08512 18.43805

Predicted confidence intervals

##        fit      lwr      upr
## 1 18.50435 17.83996 19.16874
## 2 16.72461 15.14429 18.30493
## 3 18.45429 17.81459 19.09398
## 4 17.50401 16.61724 18.39078
## 5 17.80576 17.12523 18.48629

For predicted values

##        fit      lwr      upr
## 1 18.78047 17.93627 19.62466
## 2 18.77691 17.93538 19.61843
## 3 18.77246 17.93427 19.61065

6.2 Percent of double deckers in fleet vs expenses per mile

  • Covariate = Percent.of.Double.Deckers.in.fleet (This is x)
  • Response variable = Expenses per car mile (pence) (This is y)

  • Computing the values of \(\beta\)0 and \(\beta\)1

## (Intercept)          dd 
## 16.54775261  0.03289006

\(\beta\)0 (intercept) = 16.5477 and \(\beta\)1 (slope) = 0.03289

R2 and its interpretation

## [1] 0.3246126
R2 = 0.3246 means the model explains only a modest share of the response.
About 32% of the variation in exp is explained by the model.

Hypothesis testing using the linear model

  • T Test

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 16.54775261 0.55637160 29.742267 2.922432e-19
## dd           0.03289006 0.01011456  3.251753 3.656954e-03
The t value is 3.2517; it is the coefficient estimate divided by its standard error. This t statistic corresponds to the slope for the covariate dd.
As the p-value (0.00366) is less than 0.05, we reject the null hypothesis that x has no effect on y.
  • ANOVA test

## Analysis of Variance Table
## 
## Response: exp
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## dd         1 15.395 15.3949  10.574 0.003657 **
## Residuals 22 32.031  1.4559                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 10.574 and the p value is less than 0.05 (0.00365)

Therefore, we reject the null hypothesis that x has no effect on y

Confidence intervals for the coefficients

##                   2.5 %      97.5 %
## (Intercept) 15.39390854 17.70159668
## dd           0.01191374  0.05386638

For Fitted Values

##        1        2        3        4        5        6        7        8 
## 19.83676 17.98406 18.70238 18.03307 18.16594 19.00924 18.87176 18.65041 
##        9       10       11       12       13       14       15       16 
## 17.02301 18.80335 18.30178 17.37527 17.72390 18.11727 17.11379 17.96696 
##       17       18       19       20       21       22       23       24 
## 18.77540 17.64200 17.42296 18.56556 19.83676 16.72371 17.22299 18.21166

Predicted confidence intervals

##        fit      lwr      upr
## 1 19.83676 18.65739 21.01612
## 2 17.98406 17.45968 18.50844
## 3 18.70238 18.08903 19.31573
## 4 18.03307 17.51486 18.55128
## 5 18.16594 17.65514 18.67675

For predicted values

##        fit      lwr      upr
## 1 17.53445 16.88237 18.18653
## 2 20.16566 18.79421 21.53711
## 3 23.45467 20.04577 26.86356

6.3 Percent of fleet on oil fuel vs expenses per mile

  • Covariate = Percent.of.fleet.on.fuel.oil (This is x)
  • Response variable = Expenses per car mile (pence) (This is y)

  • Computing the values of \(\beta\)0 and \(\beta\)1

##  (Intercept)          oil 
## 17.876371507  0.003375493

\(\beta\)0 (intercept) = 17.8763 and \(\beta\)1 (slope) = 0.003375

R2 and its interpretation

## [1] 0.00103655
R2 = 0.00104 means the model explains essentially none of the response, since the value is practically 0.
Only about 0.1% of the variation in exp is explained by the model.

Hypothesis testing using the linear model

  • T Test

##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.876371507 1.96636930 9.0910550 6.637217e-09
## oil          0.003375493 0.02234115 0.1510886 8.812827e-01
The t value is 0.1510; it is the coefficient estimate divided by its standard error. This t statistic corresponds to the slope for the covariate oil.
As the p-value (0.8813) is much greater than 0.05, we fail to reject the null hypothesis that x has no effect on y.
  • ANOVA test

## Analysis of Variance Table
## 
## Response: exp
##           Df Sum Sq Mean Sq F value Pr(>F)
## oil        1  0.049 0.04916  0.0228 0.8813
## Residuals 22 47.376 2.15347

The F value is 0.0228, which is close to 0, and the p value (0.8813) is much greater than 0.05.

Therefore, we fail to reject the null hypothesis that x has no effect on y.

Confidence intervals for the coefficients

##                   2.5 %      97.5 %
## (Intercept) 13.79837118 21.95437183
## oil         -0.04295722  0.04970821

For Fitted Values

##        1        2        3        4        5        6        7        8 
## 18.21392 18.16170 18.15171 18.19141 18.15677 18.19701 18.18806 18.19731 
##        9       10       11       12       13       14       15       16 
## 18.08309 18.20683 18.20548 18.06830 18.09099 18.19802 18.21392 18.17814 
##       17       18       19       20       21       22       23       24 
## 18.18874 18.10432 18.20825 18.16909 18.21392 18.09774 18.19272 18.20255

Predicted confidence intervals

##        fit      lwr      upr
## 1 18.21392 17.34826 19.07958
## 2 18.16170 17.53012 18.79328
## 3 18.15171 17.48168 18.82174
## 4 18.19141 17.50420 18.87861
## 5 18.15677 17.50957 18.80398

For predicted values

##        fit      lwr      upr
## 1 17.97764 15.26512 20.69015
## 2 18.24768 17.01371 19.48165
## 3 18.58522 12.85200 24.31845

6.4 Receipts per car mile (pence) vs Expenses per car mile (pence)

  • Covariate = Receipts per car mile (pence). (This is x)
  • Response variable = Expenses per car mile (pence) (This is y)

  • Computing the values of \(\beta\)0 and \(\beta\)1

## (Intercept)         rec 
##    8.658455    0.479866

\(\beta\)0 (intercept) = 8.658455 and \(\beta\)1 (slope) = 0.479866

R2 and its interpretation

## [1] 0.6191757
R2 = 0.6192 means the model explains the response reasonably well, since the value is closer to 1.
About 62% of the variation in exp is explained by the model.

Hypothesis testing using the linear model

  • T Test

##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 8.658455 1.60107683 5.407895 1.973762e-05
## rec         0.479866 0.08023504 5.980754 5.097023e-06
The t value is 5.9808; it is the coefficient estimate divided by its standard error. This t statistic corresponds to the slope for the covariate rec.
As the p-value (5.097e-06) is much less than 0.05, we reject the null hypothesis that x has no effect on y.
  • ANOVA test

## Analysis of Variance Table
## 
## Response: exp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## rec        1 29.365 29.3648  35.769 5.097e-06 ***
## Residuals 22 18.061  0.8209                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 35.769 and the p value is less than 0.05 (0.000005097023)

Therefore, we reject the null hypothesis that x does not affect y

Confidence intervals for the coefficients

##                 2.5 %     97.5 %
## (Intercept) 5.3380250 11.9788852
## rec         0.3134688  0.6462633

For Fitted Values

##        1        2        3        4        5        6        7        8 
## 20.70309 17.88628 18.93719 17.34883 17.89108 17.92467 18.28937 20.34319 
##        9       10       11       12       13       14       15       16 
## 17.10410 18.31816 17.48799 17.75672 21.01501 17.96786 17.60316 17.82390 
##       17       18       19       20       21       22       23       24 
## 18.25578 17.92467 18.49091 16.84977 18.54849 16.20675 17.63195 17.77111

Predicted confidence intervals

##        fit      lwr      upr
## 1 20.70309 19.74463 21.66156
## 2 17.88628 17.49030 18.28226
## 3 18.93719 18.47040 19.40397
## 4 17.34883 16.87113 17.82653
## 5 17.89108 17.49551 18.28664

For predicted values

##         fit      lwr       upr
## 1  23.05444 21.31783  24.79104
## 2  61.44372 46.43332  76.45412
## 3 109.43033 77.78277 141.07788
  • Final plotting

Summary

In conclusion, from the observations made on the "bus" dataset, the Receipts per car mile vs Expenses per car mile relationship shows the strongest association between covariate and response (R2 = 0.62), with Percent of double deckers in fleet also statistically significant but explaining much less of the variation.

The remaining covariates do not exhibit a distinct correlation with expenses; in particular, for Percent of fleet on oil fuel vs expenses per mile there is essentially zero correlation between the observed variables.