Q1. Import data

ridership <- read.csv("ridership.csv", header = TRUE, sep = ",")
summary(ridership)
##       City       WeeklyRiders     PricePerWeek    CityPopulation   
##  Min.   : 1.0   Min.   :115696   Min.   : 15.00   Min.   :1590000  
##  1st Qu.: 7.5   1st Qu.:149600   1st Qu.: 27.50   1st Qu.:1617500  
##  Median :14.0   Median :161600   Median : 40.00   Median :1695000  
##  Mean   :14.0   Mean   :160026   Mean   : 49.93   Mean   :1680111  
##  3rd Qu.:20.5   3rd Qu.:176000   3rd Qu.: 75.00   3rd Qu.:1725000  
##  Max.   :27.0   Max.   :192000   Max.   :102.00   Max.   :1800000  
##  MonthlyIncomeOfRiders AvgParkingRates
##  Min.   : 5800         Min.   : 50    
##  1st Qu.: 8400         1st Qu.: 75    
##  Median :11600         Median :100    
##  Mean   :11063         Mean   :107    
##  3rd Qu.:13888         3rd Qu.:140    
##  Max.   :16200         Max.   :200

Weekly ridership as a function of Price Per Week

model1 <- lm(WeeklyRiders ~ PricePerWeek, ridership)
summary(model1)
## 
## Call:
## lm(formula = WeeklyRiders ~ PricePerWeek, data = ridership)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8218.4 -4883.2  -465.9  3436.9 11817.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197208.34    2266.48   87.01  < 2e-16 ***
## PricePerWeek   -744.75      39.89  -18.67 3.41e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5620 on 25 degrees of freedom
## Multiple R-squared:  0.9331, Adjusted R-squared:  0.9304 
## F-statistic: 348.5 on 1 and 25 DF,  p-value: 3.413e-16

\[WeeklyRiders = 1.9720834\times 10^{5} + (-744.749 \times PricePerWeek)\]

Significance of the regression = \[3.4134003\times 10^{-16}\]

The null hypothesis is that there is no relationship between Weekly Riders and Price per week. We reject the null hypothesis, as \(p<0.05\).

adjusted \(R^2\) = 0.93

This is strong.

93% of the variation of Weekly Riders is explained by Price per week.

Plot

plot(ridership$PricePerWeek, ridership$WeeklyRiders)
abline(model1)

par(mfrow=c(2,2))
plot(model1)

The model’s best fit line looks like a good predictor of the data. The ‘residual vs fitted’ plot appears potentially normally distributed, the Q-Q plot follows the 45-degree line, the ‘scale-location’ plot appears homoskedastic and the none of the data points in ‘residuals vs leverage’ appear to have undue influence on the regression.

Q2. Linear regression analysis with transformation (SMSA)

smsa <- read.csv("SMSA.csv", header = TRUE, sep = ",")
summary(smsa)
##                       city        state       JanTemp        JulyTemp    
##  Akron                  : 1   OH     : 7   Min.   :12.0   Min.   :63.00  
##  Albany-Schenectady-Troy: 1   NY     : 6   1st Qu.:27.0   1st Qu.:72.00  
##  Allentown-Bethlehem    : 1   CA     : 4   Median :31.0   Median :74.00  
##  Atlanta                : 1   PA     : 4   Mean   :33.8   Mean   :74.41  
##  Baltimore              : 1   CT     : 3   3rd Qu.:39.5   3rd Qu.:77.00  
##  Birmingham             : 1   MA     : 3   Max.   :67.0   Max.   :85.00  
##  (Other)                :53   (Other):32                                 
##      RelHum           Rain         Mortality        Education    
##  Min.   :38.00   Min.   :10.00   Min.   : 790.7   Min.   : 9.00  
##  1st Qu.:55.50   1st Qu.:33.50   1st Qu.: 899.4   1st Qu.:10.40  
##  Median :57.00   Median :38.00   Median : 946.2   Median :11.00  
##  Mean   :57.75   Mean   :38.51   Mean   : 941.2   Mean   :10.97  
##  3rd Qu.:60.00   3rd Qu.:44.00   3rd Qu.: 984.1   3rd Qu.:11.50  
##  Max.   :73.00   Max.   :65.00   Max.   :1113.2   Max.   :12.30  
##                                                                  
##    PopDensity     pNonWhite          pWC             pop         
##  Min.   :1441   Min.   : 0.80   Min.   :33.80   Min.   : 124833  
##  1st Qu.:3138   1st Qu.: 4.90   1st Qu.:43.40   1st Qu.: 566515  
##  Median :3626   Median : 9.50   Median :45.50   Median : 914427  
##  Mean   :3910   Mean   :11.88   Mean   :46.39   Mean   :1438037  
##  3rd Qu.:4566   3rd Qu.:15.70   3rd Qu.:49.90   3rd Qu.:1717201  
##  Max.   :9699   Max.   :38.50   Max.   :62.20   Max.   :8274961  
##                                                                  
##     pophouse         income            PM                HCPot       
##  Min.   :2.650   Min.   :25782   Min.   :0.0001152   Min.   :  1.00  
##  1st Qu.:3.210   1st Qu.:30004   1st Qu.:0.0005285   1st Qu.:  7.00  
##  Median :3.270   Median :32452   Median :0.0010340   Median : 15.00  
##  Mean   :3.247   Mean   :33247   Mean   :0.0012651   Mean   : 38.47  
##  3rd Qu.:3.360   3rd Qu.:35496   3rd Qu.:0.0017225   3rd Qu.: 30.50  
##  Max.   :3.530   Max.   :47966   Max.   :0.0076810   Max.   :648.00  
##                                                                      
##      NOxPot           S02Pot            NOx        
##  Min.   :  1.00   Min.   :  1.00   Min.   :  1.00  
##  1st Qu.:  4.00   1st Qu.: 13.00   1st Qu.:  4.00  
##  Median :  9.00   Median : 32.00   Median :  9.00  
##  Mean   : 22.97   Mean   : 54.66   Mean   : 22.97  
##  3rd Qu.: 24.50   3rd Qu.: 70.00   3rd Qu.: 24.50  
##  Max.   :319.00   Max.   :278.00   Max.   :319.00  
## 

Description of SMSA data

Source: US Labor Department

Variable Names:

  • city City name

  • JanTemp Mean January temperature (degrees Farenheit)

  • JulyTemp Mean July temperature (degrees Farenheit)

  • RelHum Relative Humidity

  • Rain Annual rainfall (inches)

  • Mortality Age adjusted mortality

  • Education Median education

  • PopDensity Population density

  • %NonWhite Percentage of non whites

  • %WC Percentage of white collar workers

  • pop Population

  • pop/house Population per household

  • income Median income

  • PM Mortality/pop

  • HCPot HC pollution potential

  • NOxPot Nitrous Oxide pollution potential

  • SO2Pot Sulfur Dioxide pollution potential

  • NOx Nitrous Oxide

Linear regression (NOx is predictor variable, PM is response variable)

model2 <- lm(PM ~ NOx, smsa)
summary(model2)
## 
## Call:
## lm(formula = PM ~ NOx, data = smsa)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0011270 -0.0006478 -0.0001879  0.0004673  0.0063337 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.400e-03  1.590e-04   8.807  3.2e-12 ***
## NOx         -5.889e-06  3.078e-06  -1.913   0.0607 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001094 on 57 degrees of freedom
## Multiple R-squared:  0.06034,    Adjusted R-squared:  0.04386 
## F-statistic:  3.66 on 1 and 57 DF,  p-value: 0.06075

\[PM = 0.0014 + (-5.9\times 10^{-6} \times NOx)\]

The null hypothesis is that there is no relationship between PM and NOx.

p-value of the regression = \[0.0607\]

We fail to reject the null hypothesis, as \(p>0.05\).

adjusted \(R^2\) = 0.044

This is weak.

4% of the variation of PM is explained by NOx.

Plot the best fit line on a scatterplot of the data

plot(smsa$NOx, smsa$PM)
abline(model2)

The fitted line looks like a poor fit to the data, with a lot of data above the line at the lower and higher NOx ranges, but almost all below the data in the low-middle NOx range.

Plot the residual charts

par(mfrow=c(2,2))
plot(model2)

  • Do the data appear to be linear?

No, the data does not appear to be linear. There are several points (inluding point 15) at the upper range of fitted values which have high residuals.

  • Independently sampled values, normally distributed errors: Normal Q-Q. Do the points line up on the 45-degree line?

No, there is a prominent data point (15) way above the 45-degree line.

  • Constant variance: Scale-Location

Is the spread of the points the same?

No, the standardized residuals increases as the fitted values increases. The data is heteroskedastic.

  • What are your conclusions about the regression?

The linear regression is a poor fit for the data. There is a suggestion that a data transformation might help improve the fit.

  • Could you improve the model fit by transforming the data? (Hint: Yes, you can).

Yes. With the data fitting poorly at higher fitted values, perhaps a log-transformation might help improve the model’s fit.

If so, answer the following questions based on log-transformed data:

smsa$logNOx <- log(smsa$NOx)
smsa$logPM <- log(smsa$PM)

model3 <- lm(logPM ~ logNOx, data = smsa)
summary(model3)
## 
## Call:
## lm(formula = logPM ~ logNOx, data = smsa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75021 -0.35550  0.08547  0.47918  2.05753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.20246    0.20878 -29.708  < 2e-16 ***
## logNOx      -0.32954    0.07971  -4.134 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7212 on 57 degrees of freedom
## Multiple R-squared:  0.2307, Adjusted R-squared:  0.2172 
## F-statistic: 17.09 on 1 and 57 DF,  p-value: 0.0001181
  • What is the equation of your regression line?

\[logPM = -6.202 + (-0.33 \times logNOx)\]

  • Did the significance of the regression change? Do you reject or fail to reject the null hypothesis?

The null hypothesis is that there is no relationship between \(log(PM)\) and \(log(NOx)\).

p-value of the regression = \[10^{-4}\]

We reject the null hypothesis, as \(p<0.05\).

  • Did the r^2 change? Is it strong or weak?

adjusted \(R^2\) = 0.217

This is weak, though better than the adjusted \(R^2\) for the linear regression model on data which had not been log transformed.

  • How much of the variation in Y has been explained by the independent variable X? (Reminder: use the names of your variables in your answer)

21.7% of the variation in \(log(PM)\) is explained by the independent variable \(log(NOx)\).

  • Plot best fit line on a scatterplot of the transformed data – how does it look?
plot(smsa$logNOx, smsa$logPM)
abline(model3)

The plot of \(log(PM)\) against \(log(NOx)\) has a spread of data across both dependent and independent variables. The fitted line appears to fairly represent a downward trend in the \(log(PM)\) as \(log(NOx)\) increases, although there is a lot of variation (residual) in the dependent variable for any given independent value.

  • Did transforming the data improve the linear fit?

Yes, the linear fit was improved, as is also shown by the residual plot and the Q-Q plot. The scale-location plot shows the transformed data is homoskedastic, and the residual-leverage plot shows that no transformed data point has undue influence on the linear regression.

par(mfrow=c(2,2))
plot(model3)