getwd()
## [1] "C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Week 1"
setwd('C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Week 1')
getwd()
## [1] "C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Week 1"
library('rmarkdown')
library('asbio')
## Warning: package 'asbio' was built under R version 3.6.2
## Loading required package: tcltk
tombstone <- read.csv('tombstone.csv',h=T)
head(tombstone)
## City Modelled.100.Year.Mean.SO2.Concentration..ug.m..3.
## 1 Washington,DC (Rural) 12
## 2 Cincinnati,OH (Rural) 20
## 3 Philadelphia,PA (Rura 20
## 4 Richmond,VA 46
## 5 Fall River,MA 48
## 6 Hartford,CT 92
## Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years.
## 1 0.27
## 2 0.14
## 3 0.33
## 4 0.81
## 5 0.84
## 6 1.08
str(tombstone)
## 'data.frame': 21 obs. of 3 variables:
## $ City : Factor w/ 21 levels "Albany,NY","Baltimore,MD",..: 21 8 16 19 10 11 9 1 20 13 ...
## $ Modelled.100.Year.Mean.SO2.Concentration..ug.m..3. : int 12 20 20 46 48 92 91 94 102 117 ...
## $ Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years.: num 0.27 0.14 0.33 0.81 0.84 1.08 1.78 1.21 1.09 1.72 ...
colnames(tombstone) <- c("City","SO2Conc","RecRate")
summary(tombstone)
## City SO2Conc RecRate
## Albany,NY : 1 Min. : 12.0 Min. :0.140
## Baltimore,MD: 1 1st Qu.: 91.0 1st Qu.:1.010
## Boston,MA : 1 Median :122.0 Median :1.530
## Brooklyn,NY : 1 Mean :136.5 Mean :1.496
## Cambridge,MA: 1 3rd Qu.:197.0 3rd Qu.:1.980
## Chicago,IL : 1 Max. :323.0 Max. :3.160
## (Other) :15
Description: This dataset consists of Mean SO2 concentration in different cities and corresponding Mean Surface Recession rates in these cities. Covariate, Mean SO2 Concentration in a Numeric and Response Variable, Mean Surface Recession Rate is Integer.
For Simplicity of computations, renamed the column names as “City”,“SO2Conc”,“RecRate”
Data Description: 1, There are 21 cities in the dataset. 2, Mean SO2 Concentration: Minimum = 12 | Maximum = 323 | Median = 122 | Mean = 136.5 | Range = 311 | IQR = 106 3, Mean Recession Rate: Minimum = 0.140 | Maximum = 3.160 | Median = 1.530 | Mean = 1.496 | Range = 3.02 | IQR = 0.97
plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab = 'Mean SO2 Concentration',ylab='Mean Surface Recession Rate')
We can observe a linear relation between Response Variable and Covariate. As the Mean SO2 concentration increases, there is a proportional increase in Mean Surface Recession Rate
plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab = 'Mean SO2 Concentration',ylab='Mean Surface Recession Rate')
abline(a = 0.1, b = 0.010,lty=2)
By observing the plot, I applied different values for intercept and slope to find a best fitting line. Although this method doesn’t give us the correct values, it gives us a sense of relationship between the variables. From the above manual estimation, we got Intercept = 0.1 and Slope = 0.01
Interpretation: Intercept = 0.1 (Mean Surface Recession Rate = 01 when Mean SO2 Concentration is 0) Slope = 0.01 (Mean Surface Recession Rate changes by 0.01 for every unit increase in Mean SO2 Concentration)
ModelA <- lm(RecRate~SO2Conc,data=tombstone)
summary(ModelA)
##
## Call:
## lm(formula = RecRate ~ SO2Conc, data = tombstone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72384 -0.19138 0.06136 0.13320 0.69412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3229959 0.1521958 2.122 0.0472 *
## SO2Conc 0.0085933 0.0009499 9.046 2.58e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.365 on 19 degrees of freedom
## Multiple R-squared: 0.8116, Adjusted R-squared: 0.8017
## F-statistic: 81.83 on 1 and 19 DF, p-value: 2.579e-08
plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab='Mean SO2 Concentration',ylab = 'Mean Recession Rate')
abline(ModelA,col='red')
ModelA$coefficients
## (Intercept) SO2Conc
## 0.322995899 0.008593333
\(\hat{\beta}_0\) = 0.322995899 \(\hat{\beta}_1\) = 0.008593333
ModelA$fitted.values
## 1 2 3 4 5 6 7 8
## 0.4261159 0.4948626 0.4948626 0.7182892 0.7354759 1.1135825 1.1049892 1.1307692
## 9 10 11 12 13 14 15 16
## 1.1995159 1.3284159 1.3713825 1.5432492 1.5432492 1.8526092 1.8697959 2.0158825
## 17 18 19 20 21
## 2.2479025 2.3338359 2.3768025 2.4197692 3.0986425
sum(ModelA$fitted.values)
## [1] 31.42
Sum of fitted values = 31.42
sum(tombstone$RecRate)
## [1] 31.42
Sum of all Response values = 31.42
plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab='Mean SO2 Concentration',ylab = 'Mean Recession Rate')
abline(ModelA,col='red')
points(tombstone$SO2Conc,ModelA$fitted.values,pch=15)
for (i in 1:dim(tombstone)[1])
{
lines(c(tombstone$SO2Conc[i],tombstone$SO2Conc[i]),
c(ModelA$fitted.values[i],tombstone$RecRate[i]))
}
Yes, sum of fitted values = sum of actual values of the response variable in the dataset. This is because lm() model derives Intercept and Slope estimates are Least Square estimates i.e. the estimates compensates positive residuals with negative residuals to a minimum optimum variation and hence if we sum up all fitted values, it will be same as the sum of actual observed values .
sum(tombstone$RecRate) / nrow(tombstone)
## [1] 1.49619
sum(ModelA$fitted.values) / nrow(tombstone)
## [1] 1.49619
Mean of Fitted Values = Mean of Observed Values = 1.49619
ModelA$residuals
## 1 2 3 4 5 6
## -0.15611590 -0.35486256 -0.16486256 0.09171078 0.10452411 -0.03358255
## 7 8 9 10 11 12
## 0.67501079 0.07923079 -0.10951588 0.39158412 -0.19138254 -0.53324921
## 13 14 15 16 17 18
## 0.35675079 0.12739080 -0.33979586 0.69411747 0.16209748 -0.72383585
## 19 20 21
## 0.13319748 -0.26976919 0.06135750
sum(ModelA$residuals)
## [1] 5.065393e-16
Yes, sum of residuals values = 0. This is because lm() model derives Intercept and Slope estimates are Least Square estimates i.e. the estimates compensates positive residuals with negative residuals to a minimum optimum variation and hence if we sum up all residuals, we get zero.
std_err <- summary(ModelA)$coef[,2]
std_err
## (Intercept) SO2Conc
## 0.1521958377 0.0009499341
2 X Std Error (\(\hat{\beta}_0\) estimate) = 2 * 0.1521958377 = 0.3043917 which is less than Intercept estimate and hence we can say Std Error for \(\hat{\beta}_0\) is satisfactory
2 X Std Error (\(\hat{\beta}_1\) estimate) = 2 * 0.0009499341 = 0.001899868 which is less than the estmate value and hence we say Std Error for \(\hat{\beta}_1\) is satisfactory
b0_est <- summary(ModelA)$coef[,1][1]
b1_est <- summary(ModelA)$coef[,1][2]
b1_est
## SO2Conc
## 0.008593333
Mean Recession Rate increases by 0.008593333 for every unit increase of Mean SO2 Concentration
b0_est
## (Intercept)
## 0.3229959
Yes, the natural interpretation of Intercept is that in a city with zero Mean SO2 concentration, Mean Recession rate = 0.3229959
tombstone[tombstone$SO2Conc == max(tombstone$SO2Conc),]
## City SO2Conc RecRate
## 21 Chicago,IL 323 3.16
Chicago is the city with the highest Surface Recession Rate at 323
tombstone[(abs(tombstone$RecRate - ModelA$fitted.values)) == max(abs(tombstone$RecRate - ModelA$fitted.values)),]
## City SO2Conc RecRate
## 18 Brooklyn,NY 234 1.61
Brooklyn have the highest absolute residual value at 0.7238359
mean_covar <- mean(tombstone$SO2Conc)
mean_resp <- mean(tombstone$RecRate)
mean_covar
## [1] 136.5238
mean_resp
## [1] 1.49619
mean_resp_pred <- b0_est+b1_est*mean_covar
mean_resp_pred
## (Intercept)
## 1.49619
plot(tombstone$SO2Conc,tombstone$RecRate,pch=4,xlab='Mean SO2 Concentration',ylab = 'Mean Recession Rate')
abline(ModelA,col='red')
points(mean_covar,mean_resp,pch=15,col='blue')
When we apply Mean Covariance (136.5238) to the Linear Regression equation, we get a Response Variable Prediction = 1.49619, which is same as the Mean of the Response Variable in the dataset.
Same we can verify by plotting the mean point on the line.
Description: Cross-sectional analysis of 24 British bus companies (1951). Use response variable = Expenses per car mile (pence), covariate = Car miles per year (1000s).
bus <- read.csv('bus.csv',h=T)
head(bus)
## Expenses.per.car.mile..pence. Car.miles.per.year..1000s.
## 1 19.76 6235
## 2 17.85 46230
## 3 19.96 7360
## 4 16.80 28715
## 5 18.20 21934
## 6 16.71 1337
## Percent.of.Double.Deckers.in.fleet Percent.of.fleet.on.fuel.oil
## 1 100.00 100.00
## 2 43.67 84.53
## 3 65.51 81.57
## 4 45.16 93.33
## 5 49.20 83.07
## 6 74.84 94.99
## Receipts.per.car.mile..pence.
## 1 25.10
## 2 19.23
## 3 21.42
## 4 18.11
## 5 19.24
## 6 19.31
bus <- bus[,1:2]
str(bus)
## 'data.frame': 24 obs. of 2 variables:
## $ Expenses.per.car.mile..pence.: num 19.8 17.9 20 16.8 18.2 ...
## $ Car.miles.per.year..1000s. : int 6235 46230 7360 28715 21934 1337 17881 2319 18040 1147 ...
colnames(bus) <- c('Expenses','CarMiles')
head(bus)
## Expenses CarMiles
## 1 19.76 6235
## 2 17.85 46230
## 3 19.96 7360
## 4 16.80 28715
## 5 18.20 21934
## 6 16.71 1337
Description: This dataset consists of Expenses/mile and Car miles/year information of 24 companies. Expenses is a Double and Response Variable, Car miles is an Integer.
For Simplicity of computations, I am taking a subset of the first two columns from the original dataset and renamed thecolumn names as “Expenses” and “CarMiles”
summary(bus)
## Expenses CarMiles
## Min. :16.56 Min. : 1028
## 1st Qu.:16.95 1st Qu.: 3781
## Median :17.75 Median : 9794
## Mean :18.17 Mean :13749
## 3rd Qu.:18.61 3rd Qu.:18668
## Max. :21.24 Max. :47009
Data Description: Expenses/Mile: Minimum = 16.56 | Maximum = 21.24 | Median = 17.75 | Mean = 18.17 | Range = 4.68 | IQR = 1.66 Car Miles/Year: Minimum = 1028 | Maximum = 47009 | Median = 9794 | Mean = 13749 | Range = 45981 | IQR = 14887
plot(bus$CarMiles,bus$Expenses,pch=4,xlab = 'Car Miles/year',ylab='Expenses / Mile')
Although we can observe a relation between Response Variable and Covariate, from the plot we cannot really say that there is a proper linear relationship. As the Car Miles increase, we can see Expenses / mile decreasing for many cars but there are also some cases where Expenses is on lower sphere and Car Miles / Year is also less.
plot(bus$CarMiles,bus$Expenses,pch=4,xlab = 'Car Miles / Year',ylab='Expenses / Mile')
abline(a = 19, b = -0.0001,lty=2)
By observing the plot, I applied different values for intercept and slope to find a best fitting line. Although this method doesn’t give us the correct values, it gives us a sense of relationship between the variables. From the above manual estimation, we got Intercept = 19 and Slope = 0.0001
Interpretation: Intercept = 0.1 (Mean Surface Recession Rate = 01 when Mean SO2 Concentration is 0) Slope = 0.01 (Mean Surface Recession Rate changes by 0.01 for every unit increase in Mean SO2 Concentration)
ModelB <- lm(Expenses~CarMiles,data=bus)
summary(ModelB)
##
## Call:
## lm(formula = Expenses ~ CarMiles, data = bus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0123 -0.9417 -0.1894 0.8993 2.6176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.878e+01 4.075e-01 46.085 <2e-16 ***
## CarMiles -4.450e-05 2.188e-05 -2.034 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared: 0.1583, Adjusted R-squared: 0.12
## F-statistic: 4.136 on 1 and 22 DF, p-value: 0.0542
plot(bus$CarMiles,bus$Expenses,pch=4,xlab='Car Miles / Year',ylab = 'Expenses / Mile')
abline(ModelB,col='red')
ModelB$coefficients
## (Intercept) CarMiles
## 1.878180e+01 -4.449914e-05
\(\hat{\beta}_0\) = 1.878180e+01 \(\hat{\beta}_1\) = -4.449914e-05
ModelB$fitted.values
## 1 2 3 4 5 6 7 8
## 18.50435 16.72461 18.45429 17.50401 17.80576 18.72231 17.98611 18.67861
## 9 10 11 12 13 14 15 16
## 17.97904 18.73076 18.68497 18.19143 18.62245 18.10969 16.68994 18.33063
## 17 18 19 20 21 22 23 24
## 18.50827 17.75436 17.86734 18.36129 18.73606 18.61057 18.08512 18.43805
sum(ModelB$fitted.values)
## [1] 436.08
Sum of fitted values = 436.08
sum(bus$Expenses)
## [1] 436.08
Sum of all Response values = 436.08
plot(bus$CarMiles,bus$Expenses,pch=4,xlab='Car Miles / Year',ylab = 'Expenses / Mile')
abline(ModelB,col='red')
points(bus$CarMiles,ModelB$fitted.values,pch=15)
for (i in 1:dim(bus)[1])
{
lines(c(bus$CarMiles[i],bus$CarMiles[i]),
c(ModelB$fitted.values[i],bus$Expenses[i]))
}
Yes, sum of fitted values = sum of actual values of the response variable in the dataset. This is because lm() model derives Intercept and Slope estimates are Least Square estimates i.e. the estimates compensates positive residuals with negative residuals to a minimum optimum variation and hence if we sum up all fitted values, it will be same as the sum of actual observed values .
sum(bus$Expenses) / nrow(bus)
## [1] 18.17
sum(ModelB$fitted.values) / nrow(bus)
## [1] 18.17
Mean of Fitted Values = Mean of Observed Values = 18.17
ModelB$residuals
## 1 2 3 4 5 6 7
## 1.2556501 1.1253933 1.5057117 -0.7040092 0.3942422 -2.0123067 0.8238871
## 8 9 10 11 12 13 14
## 2.0613915 -1.4190375 -0.1807615 -1.2849719 -0.5714319 2.6175494 0.1203130
## 15 16 17 18 19 20 21
## 0.1700581 -0.8806252 -0.8482658 0.5456387 -1.2873447 -0.8512851 2.4339431
## 22 23 24
## -1.6905693 -1.1251235 -0.1980461
sum(ModelB$residuals)
## [1] 1.637579e-15
Yes, sum of residuals values = 0. This is because lm() model derives Intercept and Slope estimates are Least Square estimates i.e. the estimates compensates positive residuals with negative residuals to a minimum optimum variation and hence if we sum up all residuals, we get zero.
intercepts <- summary(ModelB)$coef[,1]
intercepts
## (Intercept) CarMiles
## 1.878180e+01 -4.449914e-05
std_err <- summary(ModelB)$coef[,2]
std_err
## (Intercept) CarMiles
## 4.075464e-01 2.187948e-05
std_err[1] * 2 < intercepts[1]
## (Intercept)
## TRUE
2 X Std Error (\(\hat{\beta}_0\) estimate) is less than 2 * \(\hat{\beta}_0\) estimate and hence we can say Std Error for \(\hat{\beta}_0\) is satisfactory
std_err[2] * 2 < intercepts[2]
## CarMiles
## FALSE
2 X Std Error (\(\hat{\beta}_1\) estimate) is not less than 2 * \(\hat{\beta}_1\) estimate and hence we cannot say that Std Error for \(\hat{\beta}_1\) is satisfactory
b0_est <- summary(ModelB)$coef[,1][1]
b1_est <- summary(ModelB)$coef[,1][2]
b1_est
## CarMiles
## -4.449914e-05
Mean Recession Rate decreases by 4.449914e-05 for every unit increase of Car Miles / Year
b0_est
## (Intercept)
## 18.7818
Yes, the natural interpretation of Intercept is that a new car with no Miles yet registered, the Expense / mile will be 18.7818
mean_covar <- mean(bus$CarMiles)
mean_resp <- mean(bus$Expenses)
mean_covar
## [1] 13748.62
mean_resp
## [1] 18.17
mean_resp_pred <- b0_est+b1_est*mean_covar
mean_resp_pred
## (Intercept)
## 18.17
plot(bus$CarMiles,bus$Expenses,pch=4,xlab='Car Miles / Year',ylab = 'Expenses / Mile')
abline(ModelB,col='red')
points(mean_covar,mean_resp,pch=15,col='blue')
When we apply Mean Covariance (13748.62 car miles / year) to the Linear Regression equation, we get a Response Variable Prediction = 18.17, which is same as the Mean of the Response Variable in the dataset.
Same we can verify by plotting the mean point on the line.
#Omitted.