Julia Russell - Homework 2

1.

Read into R. Use the response variable ‘Marble Tombstone Mean Surface Recession Rate’ and the covariate ‘Mean SO2 Concentrations Over a Hundred Year Period’.

Answer:

I read the file into R using the following command.

dt=read.csv("tombstone.csv")

After executing dt=read.csv("tombstone.csv"), the data frame dt is available in memory.

2.

Plot data, explore data, and briefly describe what you observe.

Answer:

I use the commands below to return the number of variables, view and change the variable names, return the number of observations, and summarize the data.

Number of variables:

length(names(dt))
## [1] 3

Variables before renaming:

names(dt)
## [1] "City"                                                      
## [2] "Modelled.100.Year.Mean.SO2.Concentration..ug.m..3."        
## [3] "Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years."
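The renaming command itself is not shown in this report, although the summary further down uses the shortened names city, mean.concentration, and mean.recession. A minimal sketch of how such a rename could be done, assuming a plain assignment to names(); dt_demo and its values are hypothetical placeholders, not the tombstone measurements:

```r
# Hypothetical two-row stand-in with the same column names that read.csv() produced.
# The values are placeholders for illustration only.
dt_demo <- data.frame(
  City = c("A", "B"),
  Modelled.100.Year.Mean.SO2.Concentration..ug.m..3. = c(10, 20),
  Marble.Tombstone.Mean.Surface.Recession.Rate..mm.100years. = c(0.1, 0.2)
)

# Replace the long names with short ones, in column order.
names(dt_demo) <- c("city", "mean.concentration", "mean.recession")
names(dt_demo)
```

Applying the same assignment to dt would give the names used in the rest of the analysis.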

Number of observations:

nrow(dt)
## [1] 21

Summary of data:

summary(dt)
##            city    mean.concentration mean.recession 
##  Albany,NY   : 1   Min.   : 12.0      Min.   :0.140  
##  Baltimore,MD: 1   1st Qu.: 91.0      1st Qu.:1.010  
##  Boston,MA   : 1   Median :122.0      Median :1.530  
##  Brooklyn,NY : 1   Mean   :136.5      Mean   :1.496  
##  Cambridge,MA: 1   3rd Qu.:197.0      3rd Qu.:1.980  
##  Chicago,IL  : 1   Max.   :323.0      Max.   :3.160  
##  (Other)     :15

After renaming, there are 3 variables (city, mean.concentration, and mean.recession) with 21 observations. The summary above gives the range and quartiles of each variable.

Below I plot kernel density plots to show the distributions of each variable.

The distributions look relatively normal with some skew.

I plot boxplots to show the quartiles, the range and to help identify outliers.

The median SO2 concentration sits in the lower half of its range, which indicates some right skew. The median recession rate lies closer to the center of its range.

Lastly I plot the scatter to explore the relationship between Mean Concentration of SO2 and the Mean Recession Rate.

The variables exhibit a linear relationship.
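The plotting commands are not reproduced in this report. The sketch below shows one way to draw the three kinds of exploratory plot described above (kernel densities, boxplots, and a scatter plot) with base R graphics. It uses R's built-in cars data as a stand-in because tombstone.csv is not included here; with the homework data, x and y would be dt$mean.concentration and dt$mean.recession.

```r
# Stand-in data: built-in `cars` (speed as covariate, dist as response).
x <- cars$speed
y <- cars$dist

op <- par(mfrow = c(2, 2))                      # 2 x 2 grid of panels
plot(density(x), main = "Density of covariate") # kernel density estimates
plot(density(y), main = "Density of response")
boxplot(x, y, names = c("covariate", "response"), main = "Boxplots")
plot(x, y, xlab = "covariate", ylab = "response", main = "Scatter plot")
par(op)                                         # restore previous layout
```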

3.

Perform linear regression using the lm() function.

Answer:

Below I create the regression model.

model<-lm(mean.recession~mean.concentration, data=dt)

The model object now holds the fitted regression, including the coefficient estimates.

3.1

Obtain the coefficient estimates.

Answer:

I executed the following code to obtain the coefficients.

model$coefficients
##        (Intercept) mean.concentration 
##        0.322995899        0.008593333

The intercept, \(\widehat{\beta}_{0}\), is estimated at 0.323. The slope of the covariate, \(\widehat{\beta}_{1}\), is estimated at 0.009.

\(\widehat{\beta}_{0}\) represents the expected rate of recession when no sulfur dioxide is present. \(\widehat{\beta}_{1}\) represents the expected change in the recession rate (mm/100 years) per one-unit (ug/m^3) increase in the sulfur dioxide concentration.
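As a worked illustration of how the two estimates combine, the sketch below plugs a concentration into the fitted line. The value of 100 ug/m^3 is a hypothetical input chosen for illustration, not a city from the data; the result is about 1.18 mm per 100 years.

```r
b0 <- 0.322995899   # intercept from model$coefficients above
b1 <- 0.008593333   # slope from model$coefficients above
so2 <- 100          # hypothetical SO2 concentration (ug/m^3), for illustration only

# Fitted line: expected recession rate = intercept + slope * concentration
predicted <- b0 + b1 * so2
predicted
```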

3.2

Obtain R2. Explain what it means.

Answer:

I executed the following code to obtain R2.

summary(model)$r.squared
## [1] 0.8115724

The value R2 (0.81) is calculated as one minus the ratio of the residual sum of squares to the total sum of squares. It is the proportion of the variance in the recession rate that is explained by the regression model, here about 81%.
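The definition of R2 can be checked by hand. The sketch below does so on R's built-in cars data, used as a stand-in because tombstone.csv is not included here; the same three lines would apply to the tombstone model.

```r
# Stand-in model on built-in data.
fit <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_manual <- 1 - rss / tss

# Should agree with the value that summary() reports.
all.equal(r2_manual, summary(fit)$r.squared)
```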

4.

4.1

Perform t-tests, obtain the t statistics and p values, interpret the results, make a conclusion (i.e. to reject or not reject) and explain why. Note: please explain what the null hypothesis is.

Answer:

I executed the code below to obtain the t statistics from the linear regression model.

summary(model)$coefficients[,3:4]
##                     t value     Pr(>|t|)
## (Intercept)        2.122239 4.718525e-02
## mean.concentration 9.046242 2.578534e-08

The null hypothesis is that the slope is zero (\(\beta_{1} = 0\)), that is, a change in the concentration of SO2 has no effect on the rate of recession. The t value of 9.046 and p value of about 2.6e-08 mean that, if the null hypothesis were true, a t statistic at least this extreme would occur with a probability of roughly 2.6e-08.

I reject the null hypothesis because the p value is far below the 0.05 significance level.
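The p value can be reproduced from the t statistic alone. This is a minimal sketch that assumes a two-sided test with n - 2 = 19 degrees of freedom (21 observations, two estimated coefficients); it should closely match the Pr(>|t|) value reported above.

```r
t_value <- 9.046242   # t statistic for mean.concentration reported above
df <- 21 - 2          # observations minus the two estimated coefficients

# Two-sided p value: probability of a |t| at least this large under the null.
p_value <- 2 * pt(-abs(t_value), df)
p_value
p_value < 0.05        # decision at the 5% significance level
```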

4.2

Perform ANOVA test (F test), obtain F statistic and p value, interpret the results, make a conclusion, and explain why. Note: please explain what a null hypothesis is.

Answer:

I executed the code below to obtain the F statistic and p value from the linear regression model.

f<-summary(model)$f
f
##    value    numdf    dendf 
## 81.83449  1.00000 19.00000
pf(f[1],f[2],f[3],lower.tail=FALSE)
##        value 
## 2.578534e-08

The F statistic tests the overall significance of the model; the null hypothesis is that all slope coefficients are zero. Because there is only one covariate, the F statistic equals the square of the slope's t statistic (9.046^2 is about 81.83), and the two tests give the same p value. The p value is the probability of observing an F statistic at least this large if the null hypothesis were true. Since the p value (about 2.6e-08) is less than .05, I reject the null hypothesis.
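The link between the F test and the t test in a one-covariate model can be verified numerically. The sketch below uses only the statistics reported above and assumes 1 and 19 degrees of freedom, as in the f output.

```r
t_value <- 9.046242          # slope t statistic reported in 4.1
f_value <- t_value^2         # with one covariate, F is the square of t
f_value

# The two tests give the same p value.
p_from_f <- pf(f_value, 1, 19, lower.tail = FALSE)
p_from_t <- 2 * pt(-abs(t_value), 19)
all.equal(p_from_f, p_from_t)
```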

4.3

Compute confidence intervals for coefficients, fitted values, and predicted values. Interpret the meaning of these quantities.

Answer:

The following code computes 95% confidence intervals for the model’s coefficients and fitted values, and 95% prediction intervals for predicted values.

confCoeff<-confint(model,level=.95)
confFv<-predict.lm(model,interval = "confidence", level=.95)
confPv<-predict.lm(model,interval = "prediction", level=.95)
## Warning in predict.lm(model, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses

The confint function returns the lower and upper bounds of the 95% confidence intervals for the model intercept and slope.

The predict.lm function with its interval argument set to “confidence” calculates, for each observed covariate value, a 95% confidence interval for the mean response at that value.

The predict.lm function with its interval argument set to “prediction” is centered on the same fitted values but treats each one as a new observation, so the interval also accounts for the residual variance and is wider. It gives the lower and upper limits within which a future response would fall with 95% confidence.
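To make the coefficient interval concrete, the sketch below rebuilds it by hand as estimate plus or minus the critical t value times the standard error, and compares it with confint. It runs on R's built-in cars data as a stand-in, since tombstone.csv is not included here.

```r
# Stand-in model on built-in data.
fit <- lm(dist ~ speed, data = cars)

est <- coef(summary(fit))["speed", "Estimate"]
se  <- coef(summary(fit))["speed", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))   # two-sided 95% critical value

manual_ci <- c(est - t_crit * se, est + t_crit * se)
manual_ci
confint(fit, "speed", level = 0.95)          # should give the same bounds
```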

4.4

Plot data points, the regression line, the confidence interval for fitted values, and the confidence interval for predicted values.

Answer:

The plot below shows the regression line and confidence intervals on the graph.

The blue line represents the fitted values, while the red and green lines represent the 95% confidence intervals for fitted and predicted values respectively.
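The plotting code is not reproduced in this report. One possible way to draw such a plot with base R graphics is sketched below, again on the built-in cars data as a stand-in and with the same color scheme. Predicting on a grid of new covariate values (the hypothetical grid data frame) gives smooth bands.

```r
# Stand-in model on built-in data.
fit <- lm(dist ~ speed, data = cars)

# Evenly spaced covariate values across the observed range.
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

plot(dist ~ speed, data = cars, ylim = range(pred_band, cars$dist),
     main = "Regression line with 95% intervals")
lines(grid$speed, conf_band[, "fit"], col = "blue")                         # fitted line
matlines(grid$speed, conf_band[, c("lwr", "upr")], col = "red", lty = 2)    # confidence band
matlines(grid$speed, pred_band[, c("lwr", "upr")], col = "green", lty = 3)  # prediction band
legend("topleft", legend = c("fit", "confidence", "prediction"),
       col = c("blue", "red", "green"), lty = 1:3)
```

The prediction band is always at least as wide as the confidence band, because it adds the residual variance of a single new observation.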

5.

Based on the analysis above, what is your observation and conclusion?

Answer:

The regression model is statistically significant (p < .05). So it is reasonable to say that for every 1 ug/m^3 increase in the average concentration of SO2, the average rate of recession increases by about 0.0086 mm per 100 years.

6.

Repeat the analysis above for the dataset using different covariates. Description: cross-sectional analysis of 24 British bus companies in 1951.

Before creating the regression models, I imported the file, identified the number of variables, their names, the number of observations, and printed a brief summary of the data.

No. of variables:

## [1] 5

Variable names:

## [1] "Expenses.per.car.mile..pence."     
## [2] "Car.miles.per.year..1000s."        
## [3] "Percent.of.Double.Deckers.in.fleet"
## [4] "Percent.of.fleet.on.fuel.oil"      
## [5] "Receipts.per.car.mile..pence."

No. of observations:

## [1] 24

Summary:

##  Expenses.per.car.mile..pence. Car.miles.per.year..1000s.
##  Min.   :16.56                 Min.   : 1028             
##  1st Qu.:16.95                 1st Qu.: 3781             
##  Median :17.75                 Median : 9794             
##  Mean   :18.17                 Mean   :13749             
##  3rd Qu.:18.61                 3rd Qu.:18668             
##  Max.   :21.24                 Max.   :47009             
##  Percent.of.Double.Deckers.in.fleet Percent.of.fleet.on.fuel.oil
##  Min.   :  5.35                     Min.   : 56.86              
##  1st Qu.: 31.61                     1st Qu.: 82.69              
##  Median : 48.46                     Median : 92.94              
##  Mean   : 49.32                     Mean   : 86.99              
##  3rd Qu.: 66.06                     3rd Qu.: 96.85              
##  Max.   :100.00                     Max.   :100.00              
##  Receipts.per.car.mile..pence.
##  Min.   :15.73                
##  1st Qu.:18.68                
##  Median :19.27                
##  Mean   :19.82                
##  3rd Qu.:20.22                
##  Max.   :25.75

The file has 5 variables with Expenses.per.car.mile..pence., Car.miles.per.year..1000s., Percent.of.Double.Deckers.in.fleet, Percent.of.fleet.on.fuel.oil, & Receipts.per.car.mile..pence. as names, and 24 observations.

6.1

Use the response variable ‘Expenses per car mile (pence)’ and covariate ‘Car miles per year (1000s)’.

Answer:

The following code is similar to the code above and produces the same graphs and statistics, except that instead of a manually created confidence interval plot, it uses the ci.plot function from the HH package. The alternative hypothesis is that car miles per year have an effect on expenses per car mile. The null hypothesis is that car miles per year have no effect.

Exploratory plots:

Model:

model<-lm(Expenses.per.car.mile..pence.~ Car.miles.per.year..1000s.)
summary(model)
## 
## Call:
## lm(formula = Expenses.per.car.mile..pence. ~ Car.miles.per.year..1000s.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0123 -0.9417 -0.1894  0.8993  2.6176 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.878e+01  4.075e-01  46.085   <2e-16 ***
## Car.miles.per.year..1000s. -4.450e-05  2.188e-05  -2.034   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347 on 22 degrees of freedom
## Multiple R-squared:  0.1583, Adjusted R-squared:   0.12 
## F-statistic: 4.136 on 1 and 22 DF,  p-value: 0.0542

Coefficients:

model$coefficients
##                (Intercept) Car.miles.per.year..1000s. 
##               1.878180e+01              -4.449914e-05

R squared:

summary(model)$r.squared
## [1] 0.1582641

t statistic and p value:

summary(model)$coefficients[,3:4]
##                             t value     Pr(>|t|)
## (Intercept)                46.08506 2.223005e-23
## Car.miles.per.year..1000s. -2.03383 5.420264e-02

F-statistic and p value:

f<-summary(model)$f
f
##     value     numdf     dendf 
##  4.136463  1.000000 22.000000
pf(f[1],f[2],f[3],lower.tail=FALSE)
##      value 
## 0.05420264

Plot of confidence intervals and regression line:

6.2

Use the response variable ‘Expenses per car mile (pence)’ and covariate ‘Percent of Double Deckers in fleet’.

Answer:

Exploratory plots:

Model:

model<-lm(Expenses.per.car.mile..pence.~ Percent.of.Double.Deckers.in.fleet)
summary(model)
## 
## Call:
## lm(formula = Expenses.per.car.mile..pence. ~ Percent.of.Double.Deckers.in.fleet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2992 -0.5985 -0.1054  0.2084  3.5161 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        16.54775    0.55637  29.742  < 2e-16
## Percent.of.Double.Deckers.in.fleet  0.03289    0.01011   3.252  0.00366
##                                       
## (Intercept)                        ***
## Percent.of.Double.Deckers.in.fleet ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.207 on 22 degrees of freedom
## Multiple R-squared:  0.3246, Adjusted R-squared:  0.2939 
## F-statistic: 10.57 on 1 and 22 DF,  p-value: 0.003657

Coefficients:

model$coefficients
##                        (Intercept) Percent.of.Double.Deckers.in.fleet 
##                        16.54775261                         0.03289006

R squared:

summary(model)$r.squared
## [1] 0.3246126

t statistic and p value:

summary(model)$coefficients[,3:4]
##                                      t value     Pr(>|t|)
## (Intercept)                        29.742267 2.922432e-19
## Percent.of.Double.Deckers.in.fleet  3.251753 3.656954e-03

F-statistic and p value:

f<-summary(model)$f
f
##   value   numdf   dendf 
## 10.5739  1.0000 22.0000
pf(f[1],f[2],f[3],lower.tail=FALSE)
##       value 
## 0.003656954

Plot of confidence intervals and regression line:

6.3

Use the response variable ‘Expenses per car mile (pence)’ and covariate ‘Percent of fleet on fuel oil’.

Answer:

Exploratory plots:

Model:

model<-lm(Expenses.per.car.mile..pence.~ Percent.of.fleet.on.fuel.oil)
summary(model)
## 
## Call:
## lm(formula = Expenses.per.car.mile..pence. ~ Percent.of.fleet.on.fuel.oil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6282 -1.1915 -0.3800  0.4129  3.1490 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  17.876372   1.966369   9.091 6.64e-09 ***
## Percent.of.fleet.on.fuel.oil  0.003375   0.022341   0.151    0.881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.467 on 22 degrees of freedom
## Multiple R-squared:  0.001037,   Adjusted R-squared:  -0.04437 
## F-statistic: 0.02283 on 1 and 22 DF,  p-value: 0.8813

Coefficients:

model$coefficients
##                  (Intercept) Percent.of.fleet.on.fuel.oil 
##                 17.876371507                  0.003375493

R squared:

summary(model)$r.squared
## [1] 0.00103655

t statistic and p value:

summary(model)$coefficients[,3:4]
##                                t value     Pr(>|t|)
## (Intercept)                  9.0910550 6.637217e-09
## Percent.of.fleet.on.fuel.oil 0.1510886 8.812827e-01

F-statistic and p value:

f<-summary(model)$f
f
##       value       numdf       dendf 
##  0.02282775  1.00000000 22.00000000
pf(f[1],f[2],f[3],lower.tail=FALSE)
##     value 
## 0.8812827

Plot of confidence intervals and regression line:

6.4

Use the response variable ‘Expenses per car mile (pence)’ and covariate ‘Receipts per car mile (pence)’.

Answer:

Exploratory plots:

Model:

model<-lm(Expenses.per.car.mile..pence.~ Receipts.per.car.mile..pence.)
summary(model)
## 
## Call:
## lm(formula = Expenses.per.car.mile..pence. ~ Receipts.per.car.mile..pence.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91091 -0.56057  0.09436  0.41483  2.62151 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    8.65846    1.60108   5.408 1.97e-05 ***
## Receipts.per.car.mile..pence.  0.47987    0.08024   5.981 5.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9061 on 22 degrees of freedom
## Multiple R-squared:  0.6192, Adjusted R-squared:  0.6019 
## F-statistic: 35.77 on 1 and 22 DF,  p-value: 5.097e-06

Coefficients:

model$coefficients
##                   (Intercept) Receipts.per.car.mile..pence. 
##                      8.658455                      0.479866

R squared:

summary(model)$r.squared
## [1] 0.6191757

t statistic and p value:

summary(model)$coefficients[,3:4]
##                                t value     Pr(>|t|)
## (Intercept)                   5.407895 1.973762e-05
## Receipts.per.car.mile..pence. 5.980754 5.097023e-06

F-statistic and p value:

f<-summary(model)$f
f
##    value    numdf    dendf 
## 35.76942  1.00000 22.00000
pf(f[1],f[2],f[3],lower.tail=FALSE)
##        value 
## 5.097023e-06

Plot of confidence intervals and regression line:

6.5

What are your observations on these analyses?

Answer:

Two explanatory variables each have, individually, a statistically significant effect (p < .05) on the response variable, expenses per car mile (pence): 1) the percent of double deckers in a fleet, and 2) receipts per car mile.

For every one-pence increase in receipts per car mile, expenses per car mile increase by approximately 0.48 pence. One could multiply the receipts per car mile by 0.48, add 8.66, and get a reasonable estimate of how much was spent per car mile.

For every one-point increase in the percent of double deckers in the fleet, expenses per car mile increase by approximately 0.033 pence. One could multiply the percent of double deckers by 0.033, add 16.55, and get a reasonable estimate of how much was spent per car mile.
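Comparing several single-covariate models is easier with the key statistics side by side. The sketch below fits one model per covariate in a loop and collects the slope, R squared, and slope p value into a table. It uses R's built-in mtcars data as a stand-in because the bus company file is not included here; the response mpg and the four covariate names are placeholders for the bus variables.

```r
# Stand-in data: built-in `mtcars`, with mpg as the response.
covariates <- c("wt", "hp", "qsec", "drat")

results <- t(sapply(covariates, function(v) {
  # Build the formula mpg ~ <covariate> and fit a simple linear regression.
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  s <- summary(fit)
  c(slope     = unname(coef(fit)[2]),
    r.squared = s$r.squared,
    p.value   = s$coefficients[2, 4])   # Pr(>|t|) for the slope
}))

results   # one row per covariate
```

Sorting the rows by p.value or r.squared would then show at a glance which covariates explain the response best on their own.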