ridership <- read.csv("ridership.csv", header = TRUE, sep = ",")
summary(ridership)
## City WeeklyRiders PricePerWeek CityPopulation
## Min. : 1.0 Min. :115696 Min. : 15.00 Min. :1590000
## 1st Qu.: 7.5 1st Qu.:149600 1st Qu.: 27.50 1st Qu.:1617500
## Median :14.0 Median :161600 Median : 40.00 Median :1695000
## Mean :14.0 Mean :160026 Mean : 49.93 Mean :1680111
## 3rd Qu.:20.5 3rd Qu.:176000 3rd Qu.: 75.00 3rd Qu.:1725000
## Max. :27.0 Max. :192000 Max. :102.00 Max. :1800000
## MonthlyIncomeOfRiders AvgParkingRates
## Min. : 5800 Min. : 50
## 1st Qu.: 8400 1st Qu.: 75
## Median :11600 Median :100
## Mean :11063 Mean :107
## 3rd Qu.:13888 3rd Qu.:140
## Max. :16200 Max. :200
model1 <- lm(WeeklyRiders ~ PricePerWeek, ridership)
summary(model1)
##
## Call:
## lm(formula = WeeklyRiders ~ PricePerWeek, data = ridership)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8218.4 -4883.2 -465.9 3436.9 11817.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 197208.34 2266.48 87.01 < 2e-16 ***
## PricePerWeek -744.75 39.89 -18.67 3.41e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5620 on 25 degrees of freedom
## Multiple R-squared: 0.9331, Adjusted R-squared: 0.9304
## F-statistic: 348.5 on 1 and 25 DF, p-value: 3.413e-16
\[WeeklyRiders = 197208.34 - 744.75 \times PricePerWeek\]
p-value of the regression = \[3.41\times 10^{-16}\]
The null hypothesis is that there is no relationship between Weekly Riders and Price per week. We reject the null hypothesis, as \(p<0.05\).
adjusted \(R^2\) = 0.93
This indicates a strong fit: about 93% of the variation in Weekly Riders is explained by Price per week.
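As a quick supplement (a minimal base-R sketch, not part of the original output), the coefficient table, a confidence interval for the slope, and a point prediction at a hypothetical price can be pulled straight from the fitted model object:

# Coefficient estimates, standard errors, t values and p-values for model1
coef(summary(model1))

# 95% confidence interval for the PricePerWeek slope
confint(model1, "PricePerWeek", level = 0.95)

# Predicted weekly riders at a hypothetical price of $50 per week
predict(model1, newdata = data.frame(PricePerWeek = 50))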
plot(ridership$PricePerWeek, ridership$WeeklyRiders)
abline(model1)
par(mfrow=c(2,2))
plot(model1)
The model’s best-fit line looks like a good predictor of the data. The ‘Residuals vs Fitted’ plot shows the residuals scattered fairly evenly around zero with no obvious pattern, the Q-Q plot follows the 45-degree line (consistent with approximately normal residuals), the ‘Scale-Location’ plot appears homoskedastic, and none of the points in ‘Residuals vs Leverage’ appear to have undue influence on the regression.
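As an optional cross-check of the Q-Q assessment (a minimal base-R sketch, not part of the original analysis), a Shapiro-Wilk test can be run on the residuals; a large p-value is consistent with approximately normal residuals:

# Shapiro-Wilk normality test on the residuals of model1
shapiro.test(residuals(model1))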
smsa <- read.csv("SMSA.csv", header = TRUE, sep = ",")
summary(smsa)
## city state JanTemp JulyTemp
## Akron : 1 OH : 7 Min. :12.0 Min. :63.00
## Albany-Schenectady-Troy: 1 NY : 6 1st Qu.:27.0 1st Qu.:72.00
## Allentown-Bethlehem : 1 CA : 4 Median :31.0 Median :74.00
## Atlanta : 1 PA : 4 Mean :33.8 Mean :74.41
## Baltimore : 1 CT : 3 3rd Qu.:39.5 3rd Qu.:77.00
## Birmingham : 1 MA : 3 Max. :67.0 Max. :85.00
## (Other) :53 (Other):32
## RelHum Rain Mortality Education
## Min. :38.00 Min. :10.00 Min. : 790.7 Min. : 9.00
## 1st Qu.:55.50 1st Qu.:33.50 1st Qu.: 899.4 1st Qu.:10.40
## Median :57.00 Median :38.00 Median : 946.2 Median :11.00
## Mean :57.75 Mean :38.51 Mean : 941.2 Mean :10.97
## 3rd Qu.:60.00 3rd Qu.:44.00 3rd Qu.: 984.1 3rd Qu.:11.50
## Max. :73.00 Max. :65.00 Max. :1113.2 Max. :12.30
##
## PopDensity pNonWhite pWC pop
## Min. :1441 Min. : 0.80 Min. :33.80 Min. : 124833
## 1st Qu.:3138 1st Qu.: 4.90 1st Qu.:43.40 1st Qu.: 566515
## Median :3626 Median : 9.50 Median :45.50 Median : 914427
## Mean :3910 Mean :11.88 Mean :46.39 Mean :1438037
## 3rd Qu.:4566 3rd Qu.:15.70 3rd Qu.:49.90 3rd Qu.:1717201
## Max. :9699 Max. :38.50 Max. :62.20 Max. :8274961
##
## pophouse income PM HCPot
## Min. :2.650 Min. :25782 Min. :0.0001152 Min. : 1.00
## 1st Qu.:3.210 1st Qu.:30004 1st Qu.:0.0005285 1st Qu.: 7.00
## Median :3.270 Median :32452 Median :0.0010340 Median : 15.00
## Mean :3.247 Mean :33247 Mean :0.0012651 Mean : 38.47
## 3rd Qu.:3.360 3rd Qu.:35496 3rd Qu.:0.0017225 3rd Qu.: 30.50
## Max. :3.530 Max. :47966 Max. :0.0076810 Max. :648.00
##
## NOxPot S02Pot NOx
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 4.00 1st Qu.: 13.00 1st Qu.: 4.00
## Median : 9.00 Median : 32.00 Median : 9.00
## Mean : 22.97 Mean : 54.66 Mean : 22.97
## 3rd Qu.: 24.50 3rd Qu.: 70.00 3rd Qu.: 24.50
## Max. :319.00 Max. :278.00 Max. :319.00
##
Source: US Labor Department
Variable Names:
city        City name
state       State
JanTemp     Mean January temperature (degrees Fahrenheit)
JulyTemp    Mean July temperature (degrees Fahrenheit)
RelHum      Relative humidity
Rain        Annual rainfall (inches)
Mortality   Age-adjusted mortality
Education   Median education
PopDensity  Population density
pNonWhite   Percentage of non-white residents
pWC         Percentage of white-collar workers
pop         Population
pophouse    Population per household
income      Median income
PM          Mortality/pop
HCPot       HC pollution potential
NOxPot      Nitric oxide pollution potential
SO2Pot      Sulfur dioxide pollution potential
NOx         Nitric oxide pollution potential (same values as NOxPot)
model2 <- lm(PM ~ NOx, smsa)
summary(model2)
##
## Call:
## lm(formula = PM ~ NOx, data = smsa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0011270 -0.0006478 -0.0001879 0.0004673 0.0063337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.400e-03 1.590e-04 8.807 3.2e-12 ***
## NOx -5.889e-06 3.078e-06 -1.913 0.0607 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001094 on 57 degrees of freedom
## Multiple R-squared: 0.06034, Adjusted R-squared: 0.04386
## F-statistic: 3.66 on 1 and 57 DF, p-value: 0.06075
\[PM = 0.0014 - 5.9\times 10^{-6} \times NOx\]
The null hypothesis is that there is no relationship between PM and NOx.
p-value of the regression = \[0.0607\]
We fail to reject the null hypothesis, as \(p>0.05\).
adjusted \(R^2\) = 0.044
This indicates a weak fit: only about 4% of the variation in PM is explained by NOx.
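For a one-predictor model, the multiple \(R^2\) reported above (0.060) is simply the squared correlation between the two variables, which can be verified directly; a minimal base-R sketch:

# Squared Pearson correlation equals the multiple R-squared of model2
cor(smsa$NOx, smsa$PM)^2
summary(model2)$r.squared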
plot(smsa$NOx, smsa$PM)
abline(model2)
The fitted line looks like a poor fit to the data: many points lie above the line at the lowest and highest NOx values, while almost all of the points in the low-to-middle NOx range fall below it.
par(mfrow=c(2,2))
plot(model2)
No, the data do not appear to be linear. Several points (including point 15) at the upper range of fitted values have large residuals.
No, there is a prominent data point (15) well above the 45-degree line.
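To see which city the flagged observation corresponds to, the row labelled 15 in the diagnostic plots can be inspected directly (a minimal base-R sketch; the label is the row number used by plot.lm):

# Inspect the observation flagged as 15 in the diagnostic plots
smsa[15, c("city", "NOx", "PM")]

# Its standardized residual under model2
rstandard(model2)[15]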
Is the spread of the points the same?
No, the spread of the standardized residuals increases as the fitted values increase; the data are heteroskedastic.
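A more formal check of this visual impression would be a Breusch-Pagan test; a hedged sketch, assuming the lmtest package is installed (an optional add-on, not part of the original analysis):

# Breusch-Pagan test for heteroskedasticity in model2
# install.packages("lmtest")   # uncomment if the package is not installed
library(lmtest)
bptest(model2)   # a small p-value indicates non-constant error variance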
The linear regression is a poor fit for the data. There is a suggestion that a data transformation might help improve the fit.
Yes. Since the model fits poorly at the higher fitted values, a log transformation might improve the fit.
If so, answer the following questions based on log-transformed data:
smsa$logNOx <- log(smsa$NOx)
smsa$logPM <- log(smsa$PM)
model3 <- lm(logPM ~ logNOx, data = smsa)
summary(model3)
##
## Call:
## lm(formula = logPM ~ logNOx, data = smsa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75021 -0.35550 0.08547 0.47918 2.05753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.20246 0.20878 -29.708 < 2e-16 ***
## logNOx -0.32954 0.07971 -4.134 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7212 on 57 degrees of freedom
## Multiple R-squared: 0.2307, Adjusted R-squared: 0.2172
## F-statistic: 17.09 on 1 and 57 DF, p-value: 0.0001181
\[log(PM) = -6.202 - 0.33 \times log(NOx)\]
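Exponentiating both sides gives the equivalent power-law form (a quick rearrangement of the fitted equation, since log here is the natural logarithm):
\[PM \approx e^{-6.202} \times NOx^{-0.33} \approx 0.0020 \times NOx^{-0.33}\]
so, under this fit, doubling NOx is associated with a factor of roughly \(2^{-0.33} \approx 0.80\), i.e. about 20% lower PM.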
The null hypothesis is that there is no relationship between \(log(PM)\) and \(log(NOx)\).
p-value of the regression = \[1.18\times 10^{-4}\]
We reject the null hypothesis, as \(p<0.05\).
adjusted \(R^2\) = 0.217
This is still weak, though better than the adjusted \(R^2\) of the linear regression model fitted to the untransformed data.
21.7% of the variation in \(log(PM)\) is explained by the independent variable \(log(NOx)\).
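The two fits can also be compared programmatically; a minimal base-R sketch (note that the \(R^2\) values are not strictly comparable, because the responses are on different scales):

# Adjusted R-squared of the untransformed and log-log models
summary(model2)$adj.r.squared
summary(model3)$adj.r.squared

# Back-transform a fitted value from the log-log model to the original PM scale
# (a hypothetical NOx value of 20 is used purely for illustration)
exp(predict(model3, newdata = data.frame(logNOx = log(20))))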
plot(smsa$logNOx, smsa$logPM)
abline(model3)
The plot of \(log(PM)\) against \(log(NOx)\) has a spread of data across both dependent and independent variables. The fitted line appears to fairly represent a downward trend in the \(log(PM)\) as \(log(NOx)\) increases, although there is a lot of variation (residual) in the dependent variable for any given independent value.
Yes, the linear fit is improved, as is also shown by the ‘Residuals vs Fitted’ plot and the Q-Q plot below. The ‘Scale-Location’ plot shows the transformed data are homoskedastic, and the ‘Residuals vs Leverage’ plot shows that no transformed data point has undue influence on the regression.
par(mfrow=c(2,2))
plot(model3)
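As with the first model, the same quick numerical cross-checks could be applied to model3 (a hedged sketch; shapiro.test is base R, while bptest again assumes the lmtest package is installed):

# Quick numerical cross-checks of the model3 diagnostics
shapiro.test(residuals(model3))   # normality of residuals
library(lmtest)                   # assumed installed, as above
bptest(model3)                    # Breusch-Pagan test for heteroskedasticity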