Title: “Working with Tombstone and Bus Dataset”

Author: “Ajinkya Prashant Dalvi”

Date: “01/22/2020”

Part 1

Q1. Read <tombstone.csv> into R. Use response variable = Marble Tombstone Mean Surface Recession Rate, and covariate = Mean SO2 concentrations over a 100 year period. Description: Marble Tombstone Mean Surface Recession Rates and Mean SO2 concentrations over a 100 year period.

Reading the tombstone dataset.

The response variable y is the “Marble Tombstone Mean Surface Recession Rate” (mm per 100 years) and the covariate x is the “Mean SO2 concentration over a 100 year period” (ug/m^3).

TomStone_DataSet <- read.csv("tombstone.csv", header = TRUE)
TomStone_DataSet

##                     City
## 1  Washington,DC (Rural)
## 2  Cincinnati,OH (Rural)
## 3  Philadelphia,PA (Rura
## 4            Richmond,VA
## 5          Fall River,MA
## 6            Hartford,CT
## 7            Evanston,IL
## 8              Albany,NY
## 9          Washington,DC
## 10         Louisville,KY
## 11         Providence,RI
## 12          Cambridge,MA
## 13          Baltimore,MD
## 14             Newark,NJ
## 15             Boston,MA
## 16         Pittsburgh,PA
## 17         Cincinnati,OH
## 18           Brooklyn,NY
## 19       Philadelphia,PA
## 20       Indianapolis,IN
## 21            Chicago,IL
##    Modelled.100.Year.Mean.SO2.Concentration..ug.m..3.
## 1                                                  12
## 2                                                  20
## 3                                                  20
## 4                                                  46
## 5                                                  48
## 6                                                  92
## 7                                                  91
## 8                                                  94
## 9                                                 102
## 10                                                117
## 11                                                122
## 12                                                142
## 13                                                142
## 14                                                178
## 15                                                180
## 16                                                197
## 17                                                224
## 18                                                234
## 19                                                239
## 20                                                244
## 21                                                323
##    Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years.
## 1                                                        0.27
## 2                                                        0.14
## 3                                                        0.33
## 4                                                        0.81
## 5                                                        0.84
## 6                                                        1.08
## 7                                                        1.78
## 8                                                        1.21
## 9                                                        1.09
## 10                                                       1.72
## 11                                                       1.18
## 12                                                       1.01
## 13                                                       1.90
## 14                                                       1.98
## 15                                                       1.53
## 16                                                       2.71
## 17                                                       2.41
## 18                                                       1.61
## 19                                                       2.51
## 20                                                       2.15
## 21                                                       3.16

Q2- Plot data, explore data, and briefly describe what you observe.

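The scatter plot shows a positive, roughly linear trend: cities with a higher mean SO2 concentration tend to have a higher marble tombstone surface recession rate. The dashed reference line drawn by eye (intercept 0.2, slope 0.010) rises from left to right, which is consistent with this positive trend. The plot relates recession rate to SO2 concentration across cities; it does not show how SO2 concentration changed over the 100 years.

x <- TomStone_DataSet[,2]  # covariate: mean SO2 concentration
y <- TomStone_DataSet[,3]  # response: mean surface recession rate
plot(x, y, pch=20)
abline(a=0.2, b=0.010,lty=2)  # rough reference line chosen by eye, not the fitted line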
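
As a rough numerical check on the visual impression, the sample correlation between the two variables can be computed. The sketch below re-enters the 21 observations from the printed data frame so that it runs without tombstone.csv; the vector names so2 and rate are my own and not part of the original analysis.

```r
# Data copied from the printed data frame (so2 and rate are illustrative names)
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

r <- cor(so2, rate)  # Pearson correlation; a positive value means an upward trend
r
r^2                  # with a single covariate this equals the R-squared reported by lm()
```

A correlation close to +1 supports describing the relationship as strong and positive.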

Q3. Perform linear regression using the lm() function

model2 <- lm(y ~ x, data=TomStone_DataSet) # obtain least squares estimates
summary(model2)

## 
## Call:
## lm(formula = y ~ x, data = TomStone_DataSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72384 -0.19138  0.06136  0.13320  0.69412 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3229959  0.1521958   2.122   0.0472 *  
## x           0.0085933  0.0009499   9.046 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 19 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8017 
## F-statistic: 81.83 on 1 and 19 DF,  p-value: 2.579e-08

Q3 Sub-Questions

3.1. Obtain coefficient estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Beta0 = 0.3229959 and Beta1 = 0.0085933. For every one-unit (1 ug/m^3) increase in the covariate x (mean SO2 concentration over a 100 year period), the estimated mean of the response y (Marble Tombstone Mean Surface Recession Rate) increases by 0.0085933 mm per 100 years. When the mean SO2 concentration is zero, the estimated mean recession rate is 0.3229959 mm per 100 years.

model2 <- lm(y ~ x, data=TomStone_DataSet) # obtain least square estimate
summary(model2)

## 
## Call:
## lm(formula = y ~ x, data = TomStone_DataSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72384 -0.19138  0.06136  0.13320  0.69412 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3229959  0.1521958   2.122   0.0472 *  
## x           0.0085933  0.0009499   9.046 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 19 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8017 
## F-statistic: 81.83 on 1 and 19 DF,  p-value: 2.579e-08

plot(x,y,pch=20) 
abline(model2)
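
The estimates that lm() reports can be reproduced from the closed-form least squares formulas \(\hat{\beta}_1 = S_{xy}/S_{xx}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). A minimal sketch, assuming the data are re-entered by hand (so2, rate and the intermediate names are illustrative, not from the original code):

```r
# Data copied from the printed data frame (illustrative names)
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

x_bar <- mean(so2)
y_bar <- mean(rate)
Sxx <- sum((so2 - x_bar)^2)                 # spread of the covariate
Sxy <- sum((so2 - x_bar) * (rate - y_bar))  # co-variation of covariate and response

b1 <- Sxy / Sxx            # slope estimate
b0 <- y_bar - b1 * x_bar   # intercept estimate
c(b0 = b0, b1 = b1)

coef(lm(rate ~ so2))       # should agree with the hand computation
```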

3.2. Obtain fitted values and the sum of fitted values.

For the first observation, y1 is 0.27 and x1 is 12. The fitted value of the first observation is 0.4261159, which means that, given x=12, the fitted regression line estimates the mean of the response variable y to be 0.4261159. The observed value y1=0.27 is therefore a little below the estimated mean response, which is why the first residual is negative. The fitted values are the points on the estimated line: each pair (xi, fitted value i) lies exactly on the line, whereas the observed pairs (xi, yi) generally do not.

fitted_values <- model2$fitted.values
fitted_values

##         1         2         3         4         5         6         7 
## 0.4261159 0.4948626 0.4948626 0.7182892 0.7354759 1.1135825 1.1049892 
##         8         9        10        11        12        13        14 
## 1.1307692 1.1995159 1.3284159 1.3713825 1.5432492 1.5432492 1.8526092 
##        15        16        17        18        19        20        21 
## 1.8697959 2.0158825 2.2479025 2.3338359 2.3768025 2.4197692 3.0986425

#Sum of Fitted Values
sum(fitted_values)

## [1] 31.42
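
To see where a single fitted value comes from, the first one can be rebuilt from the coefficients. This is a sketch with the data re-entered by hand; so2, rate, fit and y_hat_1 are illustrative names.

```r
# Data copied from the printed data frame (illustrative names)
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

fit <- lm(rate ~ so2)
b <- coef(fit)

# Point on the fitted line at the first covariate value (so2 = 12)
y_hat_1 <- b[["(Intercept)"]] + b[["so2"]] * so2[1]
y_hat_1

rate[1] - y_hat_1   # residual of observation 1 (negative: observed value lies below the line)

predict(fit, newdata = data.frame(so2 = 12))  # same fitted value via predict()
```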

3.3. Obtain the sum of all values of response variable.

# y is assigned as a response variable
sum(y)

## [1] 31.42

3.4. Verify the fact that the sum of fitted values is always the same as the sum of response variable. In addition, verify the fact that the mean of the fitted values is always the same as the mean of response variable

The sum of the fitted values and the sum of the response variable (y) are the same, and so are the mean of the fitted values and the mean of the response.
Each fitted value is calculated as Beta0 + xiBeta1, and each residual is the response minus its fitted value, yi - (Beta0 + xiBeta1). Because the residuals of a least squares fit with an intercept sum to zero, the sum of the response variable (y) equals the sum of the fitted values, and dividing both sums by n gives equal means. In the equation yi = Beta0 + xiBeta1 + ei the individual residuals ei are generally not zero; only their total is.

#Sum and mean of fitted values and response variables
sum(y)

## [1] 31.42

sum(fitted_values)

## [1] 31.42

mean(y)

## [1] 1.49619

mean(fitted_values)

## [1] 1.49619
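
Why these equalities hold for any least squares fit that includes an intercept follows from the first normal equation. A short derivation, writing \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) for the fitted values and \(e_i = y_i - \hat{y}_i\) for the residuals (the symbol \(e_i\) is my notation):

```latex
\[
\left.\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
\right|_{(\hat{\beta}_0,\hat{\beta}_1)}
= -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
\quad\Longrightarrow\quad \sum_{i=1}^{n} e_i = 0
\]
\[
\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} (\hat{y}_i + e_i) = \sum_{i=1}^{n} \hat{y}_i
\quad\Longrightarrow\quad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i
\]
```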

3.5. Obtain residuals and the sum of residuals, and verify the fact that the sum of residuals is always zero.

The sum of the residuals is zero up to floating-point rounding error. Here the sum is rounded to eight decimal places, which gives exactly zero.

residuals <- model2$residuals
residuals

##           1           2           3           4           5           6 
## -0.15611590 -0.35486256 -0.16486256  0.09171078  0.10452411 -0.03358255 
##           7           8           9          10          11          12 
##  0.67501079  0.07923079 -0.10951588  0.39158412 -0.19138254 -0.53324921 
##          13          14          15          16          17          18 
##  0.35675079  0.12739080 -0.33979586  0.69411747  0.16209748 -0.72383585 
##          19          20          21 
##  0.13319748 -0.26976919  0.06135750

round(sum(residuals),8)

## [1] 0
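
Rounding hides the fact that the raw sum is usually a tiny non-zero number. A tolerance-based comparison makes the check explicit. The sketch below assumes the data are re-entered by hand (illustrative names) and uses base R's all.equal().

```r
# Data copied from the printed data frame (illustrative names)
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

fit <- lm(rate ~ so2)
e <- resid(fit)

sum(e)                        # floating-point noise around zero
isTRUE(all.equal(sum(e), 0))  # tolerance-based check instead of rounding
sum(e * so2)                  # second normal equation: also zero up to rounding
```

The last line checks a related fact: the residuals are also orthogonal to the covariate.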

3.6. Obtain the standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Are these standard errors satisfactory and why?

summary(model2)$coef[,2]

##  (Intercept)            x 
## 0.1521958377 0.0009499341

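In both cases the standard error is small relative to the estimate, which means the coefficients are estimated with little sampling variability. Beta0 = 0.3229959 is about 2.1 times its standard error and Beta1 = 0.0085933 is about 9.0 times its standard error (these ratios are the t values in the summary). Since both estimates are more than twice their standard errors, both are significantly different from zero at the 5% level, so the standard errors are satisfactory, especially for the slope.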
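
The standard errors can be reproduced from the textbook formulas \(\mathrm{se}(\hat{\beta}_1) = s/\sqrt{S_{xx}}\) and \(\mathrm{se}(\hat{\beta}_0) = s\sqrt{1/n + \bar{x}^2/S_{xx}}\), where \(s^2\) is the residual sum of squares divided by \(n-2\). A sketch under the assumption that the data are re-entered by hand; all names are illustrative.

```r
# Data copied from the printed data frame (illustrative names)
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

fit <- lm(rate ~ so2)
n   <- length(rate)
s2  <- sum(resid(fit)^2) / (n - 2)   # residual variance on n - 2 degrees of freedom
Sxx <- sum((so2 - mean(so2))^2)

se_b1 <- sqrt(s2 / Sxx)                          # standard error of the slope
se_b0 <- sqrt(s2 * (1 / n + mean(so2)^2 / Sxx))  # standard error of the intercept
se <- c(se_b0, se_b1)
se

t_values <- unname(coef(fit)) / se   # estimate divided by its standard error
t_values
```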

Q4. Suppose we increase SO2 Concentration by one unit, how does such a change influence the Surface Recession Rate?

For every one-unit increase in SO2 Concentration (x), the estimated Marble Tombstone Mean Surface Recession Rate (y) increases by Beta1, which is 0.0085933 mm per 100 years.

summary(model2)

## 
## Call:
## lm(formula = y ~ x, data = TomStone_DataSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72384 -0.19138  0.06136  0.13320  0.69412 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3229959  0.1521958   2.122   0.0472 *  
## x           0.0085933  0.0009499   9.046 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 19 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8017 
## F-statistic: 81.83 on 1 and 19 DF,  p-value: 2.579e-08

Q5. Does the intercept of the linear regression have natural interpretation? If so, what does it mean?

Yes, the intercept of the linear regression has a natural interpretation. It is the estimated mean recession rate when the SO2 concentration is zero: even with no SO2, marble tombstones are estimated to recede at Beta0 = 0.3229959, that is about 0.32 mm per 100 years. Note that x = 0 lies below the smallest observed concentration (12), so this value is an extrapolation.

Q6. Which city has the highest Surface Recession Rate?

Chicago, IL (row 21) has the highest Surface Recession Rate, 3.16 mm per 100 years.

Q7. Which city has the largest residual (i.e., the largest absolute value) according to the linear regression you just fitted?

Brooklyn, NY has the largest residual in absolute value. It is the 18th entry in the list of residuals, with abs(-0.72383585) = 0.72383585.

abs(model2$residuals)

##          1          2          3          4          5          6 
## 0.15611590 0.35486256 0.16486256 0.09171078 0.10452411 0.03358255 
##          7          8          9         10         11         12 
## 0.67501079 0.07923079 0.10951588 0.39158412 0.19138254 0.53324921 
##         13         14         15         16         17         18 
## 0.35675079 0.12739080 0.33979586 0.69411747 0.16209748 0.72383585 
##         19         20         21 
## 0.13319748 0.26976919 0.06135750

abs(model2$residuals[18])

##        18 
## 0.7238359

TomStone_DataSet[18,1]

## [1] Brooklyn,NY
## 21 Levels: Albany,NY Baltimore,MD Boston,MA Brooklyn,NY ... Washington,DC (Rural)
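
The look-ups in Q6 and Q7 can also be done programmatically with which.max(), which avoids scanning the printed vectors by eye. A sketch with the data re-entered by hand; the names city, so2 and rate are illustrative, and the third city label is kept truncated exactly as printed.

```r
# Data copied from the printed data frame (illustrative names)
city <- c("Washington,DC (Rural)", "Cincinnati,OH (Rural)", "Philadelphia,PA (Rura",
          "Richmond,VA", "Fall River,MA", "Hartford,CT", "Evanston,IL", "Albany,NY",
          "Washington,DC", "Louisville,KY", "Providence,RI", "Cambridge,MA",
          "Baltimore,MD", "Newark,NJ", "Boston,MA", "Pittsburgh,PA", "Cincinnati,OH",
          "Brooklyn,NY", "Philadelphia,PA", "Indianapolis,IN", "Chicago,IL")
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

fit <- lm(rate ~ so2)

i_rate <- which.max(rate)              # index of the highest recession rate
city[i_rate]

i_res <- which.max(abs(resid(fit)))    # index of the largest absolute residual
city[i_res]
abs(resid(fit))[i_res]
```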

Q8. Calculate the mean of covariate and mean of response. Verify the fact that the fitted regression line goes through the point \((\bar{x},\bar{y})\).

Covariate mean, mean(x) = 136.5238. Response mean, mean(y) = 1.49619.

Evaluating the fitted line at mean(x) gives 0.3229959 + 0.0085933 * 136.5238 ≈ 1.49619 = mean(y), so the line passes through \((\bar{x},\bar{y})\).

Note: the red dot on the graph marks this point.
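
The code that produced the graph is not shown in this section. One possible way to check the claim and mark the point in red is sketched below; the data are re-entered by hand, and the names, axis labels and plotting choices are my assumptions rather than the original code.

```r
# Data copied from the printed data frame (illustrative names)
so2 <- c(12, 20, 20, 46, 48, 92, 91, 94, 102, 117, 122,
         142, 142, 178, 180, 197, 224, 234, 239, 244, 323)
rate <- c(0.27, 0.14, 0.33, 0.81, 0.84, 1.08, 1.78, 1.21, 1.09, 1.72, 1.18,
          1.01, 1.90, 1.98, 1.53, 2.71, 2.41, 1.61, 2.51, 2.15, 3.16)

fit   <- lm(rate ~ so2)
x_bar <- mean(so2)
y_bar <- mean(rate)

# Height of the fitted line at the covariate mean
y_at_x_bar <- unname(predict(fit, newdata = data.frame(so2 = x_bar)))
isTRUE(all.equal(y_at_x_bar, y_bar))   # the line passes through (x_bar, y_bar)

if (!interactive()) pdf(NULL)          # avoid writing Rplots.pdf when run as a script
plot(so2, rate, pch = 20,
     xlab = "Mean SO2 concentration", ylab = "Mean surface recession rate")
abline(fit)
points(x_bar, y_bar, col = "red", pch = 19)   # the point of means
```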

Part 2

Q1. Read <Bus.csv> into R.

Bus_DataSet <- read.csv("bus.csv",h=T)
Bus_DataSet

##    Expenses.per.car.mile..pence. Car.miles.per.year..1000s.
## 1                          19.76                       6235
## 2                          17.85                      46230
## 3                          19.96                       7360
## 4                          16.80                      28715
## 5                          18.20                      21934
## 6                          16.71                       1337
## 7                          18.81                      17881
## 8                          20.74                       2319
## 9                          16.56                      18040
## 10                         18.55                       1147
## 11                         17.40                       2176
## 12                         17.62                      13267
## 13                         21.24                       3581
## 14                         18.23                      15104
## 15                         16.86                      47009
## 16                         17.45                      10139
## 17                         17.66                       6147
## 18                         18.30                      23089
## 19                         16.58                      20550
## 20                         17.51                       9450
## 21                         21.17                       1028
## 22                         16.92                       3848
## 23                         16.96                      15656
## 24                         18.24                       7725
##    Percent.of.Double.Deckers.in.fleet Percent.of.fleet.on.fuel.oil
## 1                              100.00                       100.00
## 2                               43.67                        84.53
## 3                               65.51                        81.57
## 4                               45.16                        93.33
## 5                               49.20                        83.07
## 6                               74.84                        94.99
## 7                               70.66                        92.34
## 8                               63.93                        95.08
## 9                               14.45                        61.24
## 10                              68.58                        97.90
## 11                              53.33                        97.50
## 12                              25.16                        56.86
## 13                              35.76                        63.58
## 14                              47.72                        95.29
## 15                              17.21                       100.00
## 16                              43.15                        89.40
## 17                              67.73                        92.54
## 18                              33.27                        67.53
## 19                              26.61                        98.32
## 20                              61.35                        86.72
## 21                             100.00                       100.00
## 22                               5.35                        65.58
## 23                              20.53                        93.72
## 24                              50.59                        96.63
##    Receipts.per.car.mile..pence.
## 1                          25.10
## 2                          19.23
## 3                          21.42
## 4                          18.11
## 5                          19.24
## 6                          19.31
## 7                          20.07
## 8                          24.35
## 9                          17.60
## 10                         20.13
## 11                         18.40
## 12                         18.96
## 13                         25.75
## 14                         19.40
## 15                         18.64
## 16                         19.10
## 17                         20.00
## 18                         19.31
## 19                         20.49
## 20                         17.07
## 21                         20.61
## 22                         15.73
## 23                         18.70
## 24                         18.99

x <- Bus_DataSet[,2]  # covariate: Car miles per year (1000s)
y <- Bus_DataSet[,1]  # response: Expenses per car mile (pence)

Q2. Plot data, explore data, and briefly describe what you observe.

The scatter plot shows a weak negative trend. Most observations have low annual mileage, so many points cluster at the left end of the graph, and the fitted regression line slopes downward from left to right. This suggests that expenses per car mile tend to decrease as car miles per year increase, although the points are widely scattered around the line.

Q3. Perform linear regression using lm() function

model1 <- lm(y ~ x, data=Bus_DataSet) # obtain least square estimate
summary(model1)

## 
## Call:
## lm(formula = y ~ x, data = Bus_DataSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0123 -0.9417 -0.1894  0.8993  2.6176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.878e+01  4.075e-01  46.085   <2e-16 ***
## x           -4.450e-05  2.188e-05  -2.034   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared:  0.1583, Adjusted R-squared:   0.12 
## F-statistic: 4.136 on 1 and 22 DF,  p-value: 0.0542

plot(x,y,pch=20)
abline(model1)

Q3 Part 2: Sub-Questions

3.1. Obtain coefficient estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Beta0 = 1.878e+01 and Beta1 = -4.450e-05. For every one-unit increase in the covariate x (Car miles per year, in 1000s), the estimated mean of the response y (Expenses per car mile, in pence) decreases by 4.450e-05 pence (a negative trend).

#Obtain coefficient estimates Beta0 and Beta1
model1 <- lm(y ~ x, data=Bus_DataSet) # obtain least square estimate
summary(model1)

## 
## Call:
## lm(formula = y ~ x, data = Bus_DataSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0123 -0.9417 -0.1894  0.8993  2.6176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.878e+01  4.075e-01  46.085   <2e-16 ***
## x           -4.450e-05  2.188e-05  -2.034   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared:  0.1583, Adjusted R-squared:   0.12 
## F-statistic: 4.136 on 1 and 22 DF,  p-value: 0.0542

plot(x,y,pch=20)
abline(model1)
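
The slope looks tiny because one unit of x is small compared with the range of the data (roughly 1,000 to 47,000). Rescaling the covariate makes the coefficient easier to read without changing the fit. A sketch with the two columns re-entered by hand under the illustrative names miles and expenses; the factor 10,000 is an arbitrary choice.

```r
# Columns copied from the printed data frame (illustrative names)
miles <- c(6235, 46230, 7360, 28715, 21934, 1337, 17881, 2319, 18040, 1147,
           2176, 13267, 3581, 15104, 47009, 10139, 6147, 23089, 20550, 9450,
           1028, 3848, 15656, 7725)
expenses <- c(19.76, 17.85, 19.96, 16.80, 18.20, 16.71, 18.81, 20.74, 16.56, 18.55,
              17.40, 17.62, 21.24, 18.23, 16.86, 17.45, 17.66, 18.30, 16.58, 17.51,
              21.17, 16.92, 16.96, 18.24)

fit_bus <- lm(expenses ~ miles)
fit_10k <- lm(expenses ~ I(miles / 10000))  # one unit = 10,000 of the original units

coef(fit_bus)
coef(fit_10k)   # slope is 10,000 times larger, intercept is unchanged

isTRUE(all.equal(fitted(fit_bus), fitted(fit_10k)))  # same fitted values either way
```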

3.2. Obtain fitted values and the sum of fitted values.

For the first observation, y1 is 19.76 and x1 is 6235. The fitted value of the first observation is 18.50435, which means that, given x=6235, the fitted regression line estimates the mean of the response variable y to be 18.50435. The observed value y1=19.76 is therefore above the estimated mean response, so the first residual is positive. The fitted values are the points on the estimated line: each pair (xi, fitted value i) lies exactly on the line, whereas the observed pairs (xi, yi) generally do not.

fitted_values <- model1$fitted.values
fitted_values

##        1        2        3        4        5        6        7        8 
## 18.50435 16.72461 18.45429 17.50401 17.80576 18.72231 17.98611 18.67861 
##        9       10       11       12       13       14       15       16 
## 17.97904 18.73076 18.68497 18.19143 18.62245 18.10969 16.68994 18.33063 
##       17       18       19       20       21       22       23       24 
## 18.50827 17.75436 17.86734 18.36129 18.73606 18.61057 18.08512 18.43805

#Sum of Fitted Values
sum(fitted_values)

## [1] 436.08

3.3. Obtain the sum of all values of response variable.

The sum of the responses is 436.08.

# y is assigned as a response variable
sum(y)

## [1] 436.08

3.4. Verify the fact that the sum of fitted values is always the same as the sum of response variable. In addition, verify the fact that the mean of the fitted values is always the same as the mean of response variable

The sum of the fitted values and the sum of the response variable (y) are the same, and so are the mean of the fitted values and the mean of the response.

#Sum and mean of fitted values and response variables
sum(y)

## [1] 436.08

sum(fitted_values)

## [1] 436.08

mean(y)

## [1] 18.17

mean(fitted_values)

## [1] 18.17

Each fitted value is calculated as Beta0 + xiBeta1, and each residual is the response minus its fitted value, yi - (Beta0 + xiBeta1). Because the residuals of a least squares fit with an intercept sum to zero, the sum of the response variable (y) equals the sum of the fitted values, and dividing both sums by n gives equal means. In the equation yi = Beta0 + xiBeta1 + ei the individual residuals ei are generally not zero; only their total is.

3.5. Obtain residuals and the sum of residuals, and verify the fact that the sum of residuals is always zero.

The sum of the residuals is zero up to floating-point rounding error. Here the sum is rounded to eight decimal places, which gives exactly zero.

residuals <- model1$residuals
residuals

##          1          2          3          4          5          6 
##  1.2556501  1.1253933  1.5057117 -0.7040092  0.3942422 -2.0123067 
##          7          8          9         10         11         12 
##  0.8238871  2.0613915 -1.4190375 -0.1807615 -1.2849719 -0.5714319 
##         13         14         15         16         17         18 
##  2.6175494  0.1203130  0.1700581 -0.8806252 -0.8482658  0.5456387 
##         19         20         21         22         23         24 
## -1.2873447 -0.8512851  2.4339431 -1.6905693 -1.1251235 -0.1980461

round(sum(residuals),8)

## [1] 0

3.6. Obtain the standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Are these standard errors satisfactory and why?

The standard error of Beta0 is 4.075464e-01 (about 0.408) and the standard error of Beta1 is 2.187948e-05.

summary(model1)$coef[,2]

##  (Intercept)            x 
## 4.075464e-01 2.187948e-05

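Beta0 = 1.878e+01 is about 46 times its standard error, so the intercept is estimated precisely. Beta1 = -4.450e-05 is only about 2.03 times its standard error in absolute value (t = -2.034, p = 0.0542), which falls just short of significance at the 5% level. The standard error of the intercept is therefore satisfactory, while the standard error of the slope is large relative to the estimate, so the negative trend is only weakly supported by these data.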
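
One way to make the precision of the slope concrete is a 95% confidence interval. A sketch using base R's confint(), with the two columns re-entered by hand (miles, expenses and fit_bus are illustrative names):

```r
# Columns copied from the printed data frame (illustrative names)
miles <- c(6235, 46230, 7360, 28715, 21934, 1337, 17881, 2319, 18040, 1147,
           2176, 13267, 3581, 15104, 47009, 10139, 6147, 23089, 20550, 9450,
           1028, 3848, 15656, 7725)
expenses <- c(19.76, 17.85, 19.96, 16.80, 18.20, 16.71, 18.81, 20.74, 16.56, 18.55,
              17.40, 17.62, 21.24, 18.23, 16.86, 17.45, 17.66, 18.30, 16.58, 17.51,
              21.17, 16.92, 16.96, 18.24)

fit_bus <- lm(expenses ~ miles)
ci <- confint(fit_bus, level = 0.95)   # rows: coefficients; columns: lower, upper
ci

# Does the interval for the slope contain zero?
slope_ci_contains_zero <- ci["miles", 1] < 0 && ci["miles", 2] > 0
slope_ci_contains_zero
```

Because the reported p-value for the slope (0.0542) is above 0.05, the 95% interval for the slope contains zero, while the interval for the intercept lies well away from it.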

Q4. Suppose we increase Car miles per year by one unit, how does such a change influence the Expenses per car mile?

If car miles per year (in 1000s) increase by one unit, the estimated expenses per car mile decrease by 4.450e-05 pence (Beta1 = -4.450e-05).

summary(model1)

## 
## Call:
## lm(formula = y ~ x, data = Bus_DataSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0123 -0.9417 -0.1894  0.8993  2.6176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.878e+01  4.075e-01  46.085   <2e-16 ***
## x           -4.450e-05  2.188e-05  -2.034   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared:  0.1583, Adjusted R-squared:   0.12 
## F-statistic: 4.136 on 1 and 22 DF,  p-value: 0.0542

Q5. Does the intercept of the linear regression have natural interpretation? If so, what does it mean?

The intercept of the linear regression has no natural interpretation here. It is the estimated expense per car mile (about 18.78 pence) when car miles per year are zero, but with zero miles driven there is no meaningful expense per mile. In addition, x = 0 lies below the smallest observed value (1028), so the intercept is an extrapolation.

Q8. Calculate the mean of covariate and mean of response. Verify the fact that the fitted regression line goes through the point \((\bar{x},\bar{y})\).

Covariate mean, mean(x) = 13748.62. Response mean, mean(y) = 18.17.

The plotted point lies on the estimated regression line. Therefore, the fitted regression line goes through the point (mean(x), mean(y)), where mean(x) is the mean of Car miles per year (1000s) and mean(y) is the mean of Expenses per car mile (pence).

mean(x)

## [1] 13748.62

mean(y)

## [1] 18.17
