Homework 1

Last Name: Nair

First Name: Sreejith

M-number: M13451169

E-mail: nairsn@mail.uc.edu

Setting up Working Directory

getwd()

## [1] "C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Week 1"

setwd('C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Week 1')
getwd()

## [1] "C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Week 1"

Install/ Invoke required Packages

library('rmarkdown')
library('asbio')

## Warning: package 'asbio' was built under R version 3.6.2

## Loading required package: tcltk

1. Read <tombstone.csv> into R. Use response variable = Marble Tombstone Mean Surface Recession Rate, and covariate = Mean SO2 concentrations over a 100 year period. Description: Marble Tombstone Mean Surface Recession Rates and Mean SO2 concentrations over a 100 year period.

tombstone <- read.csv('tombstone.csv',h=T)
head(tombstone)

##                    City Modelled.100.Year.Mean.SO2.Concentration..ug.m..3.
## 1 Washington,DC (Rural)                                                 12
## 2 Cincinnati,OH (Rural)                                                 20
## 3 Philadelphia,PA (Rura                                                 20
## 4           Richmond,VA                                                 46
## 5         Fall River,MA                                                 48
## 6           Hartford,CT                                                 92
##   Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years.
## 1                                                       0.27
## 2                                                       0.14
## 3                                                       0.33
## 4                                                       0.81
## 5                                                       0.84
## 6                                                       1.08

2. Plot data, explore data, and briefly describe what you observe.

str(tombstone)

## 'data.frame':    21 obs. of  3 variables:
##  $ City                                                      : Factor w/ 21 levels "Albany,NY","Baltimore,MD",..: 21 8 16 19 10 11 9 1 20 13 ...
##  $ Modelled.100.Year.Mean.SO2.Concentration..ug.m..3.        : int  12 20 20 46 48 92 91 94 102 117 ...
##  $ Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years.: num  0.27 0.14 0.33 0.81 0.84 1.08 1.78 1.21 1.09 1.72 ...

colnames(tombstone) <- c("City","SO2Conc","RecRate")

summary(tombstone)

##            City       SO2Conc         RecRate     
##  Albany,NY   : 1   Min.   : 12.0   Min.   :0.140  
##  Baltimore,MD: 1   1st Qu.: 91.0   1st Qu.:1.010  
##  Boston,MA   : 1   Median :122.0   Median :1.530  
##  Brooklyn,NY : 1   Mean   :136.5   Mean   :1.496  
##  Cambridge,MA: 1   3rd Qu.:197.0   3rd Qu.:1.980  
##  Chicago,IL  : 1   Max.   :323.0   Max.   :3.160  
##  (Other)     :15

Description: This dataset consists of Mean SO2 concentration in different cities and corresponding Mean Surface Recession rates in these cities. Covariate, Mean SO2 Concentration in a Numeric and Response Variable, Mean Surface Recession Rate is Integer.

For Simplicity of computations, renamed the column names as “City”,“SO2Conc”,“RecRate”

plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab = 'Mean SO2 Concentration',ylab='Mean Surface Recession Rate')

We can observe a linear relation between Response Variable and Covariate. As the Mean SO2 concentration increases, there is a proportional increase in Mean Surface Recession Rate

plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab = 'Mean SO2 Concentration',ylab='Mean Surface Recession Rate')
abline(a = 0.1, b = 0.010,lty=2)

By observing the plot, I applied different values for intercept and slope to find a best fitting line. Although this method doesn’t give us the correct values, it gives us a sense of relationship between the variables. From the above manual estimation, we got Intercept = 0.1 and Slope = 0.01

Interpretation: Intercept = 0.1 (Mean Surface Recession Rate = 01 when Mean SO2 Concentration is 0) Slope = 0.01 (Mean Surface Recession Rate changes by 0.01 for every unit increase in Mean SO2 Concentration)

3. Perform linear regression using lm() function

ModelA <- lm(RecRate~SO2Conc,data=tombstone)
summary(ModelA)

## 
## Call:
## lm(formula = RecRate ~ SO2Conc, data = tombstone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72384 -0.19138  0.06136  0.13320  0.69412 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3229959  0.1521958   2.122   0.0472 *  
## SO2Conc     0.0085933  0.0009499   9.046 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 19 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8017 
## F-statistic: 81.83 on 1 and 19 DF,  p-value: 2.579e-08

plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab='Mean SO2 Concentration',ylab = 'Mean Recession Rate')
abline(ModelA,col='red')

3.1. Obtain coefficient estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

ModelA$coefficients

## (Intercept)     SO2Conc 
## 0.322995899 0.008593333

\(\hat{\beta}_0\) = 0.322995899 \(\hat{\beta}_1\) = 0.008593333

3.2. Obtain fitted values and the sum of fitted values.

ModelA$fitted.values

##         1         2         3         4         5         6         7         8 
## 0.4261159 0.4948626 0.4948626 0.7182892 0.7354759 1.1135825 1.1049892 1.1307692 
##         9        10        11        12        13        14        15        16 
## 1.1995159 1.3284159 1.3713825 1.5432492 1.5432492 1.8526092 1.8697959 2.0158825 
##        17        18        19        20        21 
## 2.2479025 2.3338359 2.3768025 2.4197692 3.0986425

sum(ModelA$fitted.values)

## [1] 31.42

Sum of fitted values = 31.42

3.3. Obtain the sum of all values of response variable.

sum(tombstone$RecRate)

## [1] 31.42

Sum of all Response values = 31.42

3.4. Verify the fact that the sum of fitted values is always the same as the sum of response variable. In addition, verify the fact that the mean of the fitted values is always the same as the mean of response variable

plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab='Mean SO2 Concentration',ylab = 'Mean Recession Rate')
abline(ModelA,col='red')
points(tombstone$SO2Conc,ModelA$fitted.values,pch=15)
for (i in 1:dim(tombstone)[1])
{
  lines(c(tombstone$SO2Conc[i],tombstone$SO2Conc[i]),
        c(ModelA$fitted.values[i],tombstone$RecRate[i]))
}

Yes, sum of fitted values = sum of actual values of the response variable in the dataset. This is because lm() model derives Intercept and Slope estimates are Least Square estimates i.e. the estimates compensates positive residuals with negative residuals to a minimum optimum variation and hence if we sum up all fitted values, it will be same as the sum of actual observed values .

sum(tombstone$RecRate) / nrow(tombstone)

## [1] 1.49619

sum(ModelA$fitted.values) / nrow(tombstone)

## [1] 1.49619

Mean of Fitted Values = Mean of Observed Values = 1.49619

3.5. Obtain residuals and the sum of residuals, and verify the fact that the sum of residuals is always zero.

ModelA$residuals

##           1           2           3           4           5           6 
## -0.15611590 -0.35486256 -0.16486256  0.09171078  0.10452411 -0.03358255 
##           7           8           9          10          11          12 
##  0.67501079  0.07923079 -0.10951588  0.39158412 -0.19138254 -0.53324921 
##          13          14          15          16          17          18 
##  0.35675079  0.12739080 -0.33979586  0.69411747  0.16209748 -0.72383585 
##          19          20          21 
##  0.13319748 -0.26976919  0.06135750

sum(ModelA$residuals)

## [1] 5.065393e-16

Yes, sum of residuals values = 0. This is because lm() model derives Intercept and Slope estimates are Least Square estimates i.e. the estimates compensates positive residuals with negative residuals to a minimum optimum variation and hence if we sum up all residuals, we get zero.

3.6. Obtain the standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Are these standard errors satisfactory and why?

std_err <- summary(ModelA)$coef[,2]
std_err

##  (Intercept)      SO2Conc 
## 0.1521958377 0.0009499341

2 X Std Error (\(\hat{\beta}_0\) estimate) = 2 * 0.1521958377 = 0.3043917 which is less than Intercept estimate and hence we can say Std Error for \(\hat{\beta}_0\) is satisfactory

2 X Std Error (\(\hat{\beta}_1\) estimate) = 2 * 0.0009499341 = 0.001899868 which is less than the estmate value and hence we say Std Error for \(\hat{\beta}_1\) is satisfactory

4. Suppose we increase SO2 Concentration by one unit, how does such a change influence the Surface Recession Rate?

b0_est <- summary(ModelA)$coef[,1][1]
b1_est <- summary(ModelA)$coef[,1][2]

b1_est

##     SO2Conc 
## 0.008593333

Mean Recession Rate increases by 0.008593333 for every unit increase of Mean SO2 Concentration

5. Does the intercept of the linear regression have natural interpretation? If so, what does it mean?

b0_est

## (Intercept) 
##   0.3229959

Yes, the natural interpretation of Intercept is that in a city with zero Mean SO2 concentration, Mean Recession rate = 0.3229959

6. Which city has the highest Surface Recession Rate?

tombstone[tombstone$SO2Conc == max(tombstone$SO2Conc),]

##          City SO2Conc RecRate
## 21 Chicago,IL     323    3.16

Chicago is the city with the highest Surface Recession Rate at 323

7. Which city has the largest residual (i.e., the largest absolute value) according to the linear regression you just fitted?

tombstone[(abs(tombstone$RecRate - ModelA$fitted.values)) == max(abs(tombstone$RecRate - ModelA$fitted.values)),]

##           City SO2Conc RecRate
## 18 Brooklyn,NY     234    1.61

Brooklyn have the highest absolute residual value at 0.7238359

8. Calculate the mean of covariate and mean of response. Verify the fact that the fitted regression line go through the point \((\bar{x},\bar{y})\).

mean_covar <- mean(tombstone$SO2Conc)
mean_resp <- mean(tombstone$RecRate)
mean_covar

## [1] 136.5238

mean_resp

## [1] 1.49619

mean_resp_pred <- b0_est+b1_est*mean_covar
mean_resp_pred

## (Intercept) 
##     1.49619

plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab='Mean SO2 Concentration',ylab = 'Mean Recession Rate')
abline(ModelA,col='red')
points(mean_covar,mean_resp,pch=15,col='blue')

When we apply Mean Covariance (136.5238) to the Linear Regression equation, we get a Response Variable Prediction = 1.49619, which is same as the Mean of the Response Variable in the dataset.

Same we can verify by plotting the mean point on the line.

9. Repeat the same questions (1-9) for the date set <bus.csv>. Description: Cross-sectional analysis of 24 British bus companies (1951). Use response variable = Expenses per car mile (pence), covariate = Car miles per year (1000s).

9.1 Read <bus.csv> into R.

Description: Cross-sectional analysis of 24 British bus companies (1951). Use response variable = Expenses per car mile (pence), covariate = Car miles per year (1000s).

bus <- read.csv('bus.csv',h=T)
head(bus)

##   Expenses.per.car.mile..pence. Car.miles.per.year..1000s.
## 1                         19.76                       6235
## 2                         17.85                      46230
## 3                         19.96                       7360
## 4                         16.80                      28715
## 5                         18.20                      21934
## 6                         16.71                       1337
##   Percent.of.Double.Deckers.in.fleet Percent.of.fleet.on.fuel.oil
## 1                             100.00                       100.00
## 2                              43.67                        84.53
## 3                              65.51                        81.57
## 4                              45.16                        93.33
## 5                              49.20                        83.07
## 6                              74.84                        94.99
##   Receipts.per.car.mile..pence.
## 1                         25.10
## 2                         19.23
## 3                         21.42
## 4                         18.11
## 5                         19.24
## 6                         19.31

9.2 Plot data, explore data, and briefly describe what you observe.

bus <- bus[,1:2]
str(bus)

## 'data.frame':    24 obs. of  2 variables:
##  $ Expenses.per.car.mile..pence.: num  19.8 17.9 20 16.8 18.2 ...
##  $ Car.miles.per.year..1000s.   : int  6235 46230 7360 28715 21934 1337 17881 2319 18040 1147 ...

colnames(bus) <- c('Expenses','CarMiles')
head(bus)

##   Expenses CarMiles
## 1    19.76     6235
## 2    17.85    46230
## 3    19.96     7360
## 4    16.80    28715
## 5    18.20    21934
## 6    16.71     1337

Description: This dataset consists of Expenses/mile and Car miles/year information of 24 companies. Expenses is a Double and Response Variable, Car miles is an Integer.

For Simplicity of computations, I am taking a subset of the first two columns from the original dataset and renamed thecolumn names as “Expenses” and “CarMiles”

summary(bus)

##     Expenses        CarMiles    
##  Min.   :16.56   Min.   : 1028  
##  1st Qu.:16.95   1st Qu.: 3781  
##  Median :17.75   Median : 9794  
##  Mean   :18.17   Mean   :13749  
##  3rd Qu.:18.61   3rd Qu.:18668  
##  Max.   :21.24   Max.   :47009

plot(bus$CarMiles,bus$Expenses,pch=4,xlab = 'Car Miles/year',ylab='Expenses / Mile')

Although we can observe a relation between Response Variable and Covariate, from the plot we cannot really say that there is a proper linear relationship. As the Car Miles increase, we can see Expenses / mile decreasing for many cars but there are also some cases where Expenses is on lower sphere and Car Miles / Year is also less.

plot(bus$CarMiles,bus$Expenses,pch=4,xlab = 'Car Miles / Year',ylab='Expenses / Mile')
abline(a = 19, b = -0.0001,lty=2)

9.3 Perform linear regression using lm() function

ModelB <- lm(Expenses~CarMiles,data=bus)
summary(ModelB)

## 
## Call:
## lm(formula = Expenses ~ CarMiles, data = bus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0123 -0.9417 -0.1894  0.8993  2.6176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.878e+01  4.075e-01  46.085   <2e-16 ***
## CarMiles    -4.450e-05  2.188e-05  -2.034   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared:  0.1583, Adjusted R-squared:   0.12 
## F-statistic: 4.136 on 1 and 22 DF,  p-value: 0.0542

plot(bus$CarMiles,bus$Expenses,pch=4,xlab='Car Miles / Year',ylab = 'Expenses / Mile')
abline(ModelB,col='red')

9.3.1 Obtain coefficient estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

ModelB$coefficients

##   (Intercept)      CarMiles 
##  1.878180e+01 -4.449914e-05

\(\hat{\beta}_0\) = 1.878180e+01 \(\hat{\beta}_1\) = -4.449914e-05

9.3.2 Obtain fitted values and the sum of fitted values.

ModelB$fitted.values

##        1        2        3        4        5        6        7        8 
## 18.50435 16.72461 18.45429 17.50401 17.80576 18.72231 17.98611 18.67861 
##        9       10       11       12       13       14       15       16 
## 17.97904 18.73076 18.68497 18.19143 18.62245 18.10969 16.68994 18.33063 
##       17       18       19       20       21       22       23       24 
## 18.50827 17.75436 17.86734 18.36129 18.73606 18.61057 18.08512 18.43805

sum(ModelB$fitted.values)

## [1] 436.08

Sum of fitted values = 436.08

9.3.3 Obtain the sum of all values of response variable.

sum(bus$Expenses)

## [1] 436.08

Sum of all Response values = 436.08

9.3.4 Verify the fact that the sum of fitted values is always the same as the sum of response variable. In addition, verify the fact that the mean of the fitted values is always the same as the mean of response variable

plot(bus$CarMiles,bus$Expenses,pch=4,xlab='Car Miles / Year',ylab = 'Expenses / Mile')
abline(ModelB,col='red')

points(bus$CarMiles,ModelB$fitted.values,pch=15)

for (i in 1:dim(bus)[1])
{
  lines(c(bus$CarMiles[i],bus$CarMiles[i]),
        c(ModelB$fitted.values[i],bus$Expenses[i]))
}

sum(bus$Expenses) / nrow(bus)

## [1] 18.17

sum(ModelB$fitted.values) / nrow(bus)

## [1] 18.17

Mean of Fitted Values = Mean of Observed Values = 18.17

9.3.5 Obtain residuals and the sum of residuals, and verify the fact that the sum of residuals is always zero.

ModelB$residuals

##          1          2          3          4          5          6          7 
##  1.2556501  1.1253933  1.5057117 -0.7040092  0.3942422 -2.0123067  0.8238871 
##          8          9         10         11         12         13         14 
##  2.0613915 -1.4190375 -0.1807615 -1.2849719 -0.5714319  2.6175494  0.1203130 
##         15         16         17         18         19         20         21 
##  0.1700581 -0.8806252 -0.8482658  0.5456387 -1.2873447 -0.8512851  2.4339431 
##         22         23         24 
## -1.6905693 -1.1251235 -0.1980461

sum(ModelB$residuals)

## [1] 1.637579e-15

9.3.6 Obtain the standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Are these standard errors satisfactory and why?

intercepts <- summary(ModelB)$coef[,1]
intercepts

##   (Intercept)      CarMiles 
##  1.878180e+01 -4.449914e-05

std_err <- summary(ModelB)$coef[,2]
std_err

##  (Intercept)     CarMiles 
## 4.075464e-01 2.187948e-05

std_err[1] * 2 < intercepts[1]

## (Intercept) 
##        TRUE

2 X Std Error (\(\hat{\beta}_0\) estimate) is less than 2 * \(\hat{\beta}_0\) estimate and hence we can say Std Error for \(\hat{\beta}_0\) is satisfactory

std_err[2] * 2 < intercepts[2]

## CarMiles 
##    FALSE

2 X Std Error (\(\hat{\beta}_1\) estimate) is not less than 2 * \(\hat{\beta}_1\) estimate and hence we cannot say that Std Error for \(\hat{\beta}_1\) is satisfactory

9.4 Suppose we increase Car Miles / Year by one unit, how does such a change influence the Expense / Year?

b0_est <- summary(ModelB)$coef[,1][1]
b1_est <- summary(ModelB)$coef[,1][2]

b1_est

##      CarMiles 
## -4.449914e-05

Mean Recession Rate decreases by 4.449914e-05 for every unit increase of Car Miles / Year

9.5 Does the intercept of the linear regression have natural interpretation? If so, what does it mean?

b0_est

## (Intercept) 
##     18.7818

Yes, the natural interpretation of Intercept is that a new car with no Miles yet registered, the Expense / mile will be 18.7818

9.8 Calculate the mean of covariate and mean of response. Verify the fact that the fitted regression line go through the point \((\bar{x},\bar{y})\).

mean_covar <- mean(bus$CarMiles)
mean_resp <- mean(bus$Expenses)
mean_covar

## [1] 13748.62

mean_resp

## [1] 18.17

mean_resp_pred <- b0_est+b1_est*mean_covar
mean_resp_pred

## (Intercept) 
##       18.17

plot(bus$CarMiles,bus$Expenses,pch=4,xlab='Car Miles / Year',ylab = 'Expenses / Mile')
abline(ModelB,col='red')

points(mean_covar,mean_resp,pch=15,col='blue')

When we apply Mean Covariance (13748.62 car miles / year) to the Linear Regression equation, we get a Response Variable Prediction = 18.17, which is same as the Mean of the Response Variable in the dataset.

Same we can verify by plotting the mean point on the line.

#Omitted.