This dataset contains air pollution and related values for 41 U.S. cities, collected from U.S. government publications. The data are mean values over the years 1969-1971. The variable names and descriptions are listed below:
City: City name
SO2: Sulfur dioxide content of air in micrograms per cubic meter
Temp: Average annual temperature in degrees Fahrenheit
Man: Number of manufacturing enterprises employing 20 or more workers
Pop: Population size in thousands from the 1970 census
Wind: Average annual wind speed in miles per hour
Rain: Average annual precipitation in inches
RainDays: Average number of days with precipitation per year
[Reference: Sokal, R.R. and Rohlf, F.J. (1981) Biometry, 2nd edition, San Francisco: W.H. Freeman, 239. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, 20-21. (http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html)]
Below, the Air Pollution Dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset). Scatterplots of “SO2” against each independent variable are also generated to give some visual insight into the data. (In this analysis, only three independent variables [“Pop”, “Temp”, and “RainDays”] are considered against the dependent variable “SO2”.)
##Load in the Air Pollution Dataset
rm(list=ls())
#Get dataset from Project Documents File
airpollution <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #2/air pollution.csv", header=TRUE)
#Display the head and tail of the data
head(airpollution)
## City SO2 Temp Man Pop Wind Rain RainDays
## 1 Phoenix 10 70.3 213 582 6.0 7.05 36
## 2 Little Rock 13 61.0 91 132 8.2 48.52 100
## 3 San Francisco 12 56.7 453 716 8.7 20.66 67
## 4 Denver 17 51.9 454 515 9.0 12.95 86
## 5 Hartford 56 49.1 412 158 9.0 43.37 127
## 6 Wilmington 36 54.0 80 80 9.0 40.25 114
tail(airpollution)
## City SO2 Temp Man Pop Wind Rain RainDays
## 36 Salt Lake City 28 51.0 137 176 8.7 15.17 89
## 37 Norfolk 31 59.3 96 308 10.6 44.68 116
## 38 Richmond 26 57.8 197 299 7.6 42.59 115
## 39 Seattle 29 51.1 379 531 9.4 38.79 164
## 40 Charleston 31 55.2 35 71 6.5 40.75 148
## 41 Milwaukee 16 45.7 569 717 11.8 29.07 123
#Display the summary statistics and the structure of the data
summary(airpollution)
## City SO2 Temp Man
## Albany : 1 Min. : 8.00 Min. :43.50 Min. : 35.0
## Albuquerque: 1 1st Qu.: 13.00 1st Qu.:50.60 1st Qu.: 181.0
## Atlanta : 1 Median : 26.00 Median :54.60 Median : 347.0
## Baltimore : 1 Mean : 30.05 Mean :55.76 Mean : 463.1
## Buffalo : 1 3rd Qu.: 35.00 3rd Qu.:59.30 3rd Qu.: 462.0
## Charleston : 1 Max. :110.00 Max. :75.50 Max. :3344.0
## (Other) :35
## Pop Wind Rain RainDays
## Min. : 71.0 Min. : 6.000 Min. : 7.05 Min. : 36.0
## 1st Qu.: 299.0 1st Qu.: 8.700 1st Qu.:30.96 1st Qu.:103.0
## Median : 515.0 Median : 9.300 Median :38.47 Median :115.0
## Mean : 608.6 Mean : 9.444 Mean :36.76 Mean :113.9
## 3rd Qu.: 717.0 3rd Qu.:10.600 3rd Qu.:43.11 3rd Qu.:128.0
## Max. :3369.0 Max. :12.700 Max. :59.80 Max. :166.0
##
str(airpollution)
## 'data.frame': 41 obs. of 8 variables:
## $ City : Factor w/ 41 levels "Albany","Albuquerque",..: 31 20 36 12 15 41 39 18 23 3 ...
## $ SO2 : int 10 13 12 17 56 36 29 14 10 24 ...
## $ Temp : num 70.3 61 56.7 51.9 49.1 54 57.3 68.4 75.5 61.5 ...
## $ Man : int 213 91 453 454 412 80 434 136 207 368 ...
## $ Pop : int 582 132 716 515 158 80 757 529 335 497 ...
## $ Wind : num 6 8.2 8.7 9 9 9 9.3 8.8 9 9.1 ...
## $ Rain : num 7.05 48.52 20.66 12.95 43.37 ...
## $ RainDays: int 36 100 67 86 127 114 111 116 128 115 ...
#Generate a scatterplot of the data (Independent Variable = Population Size)
plot(y = airpollution$SO2,x = airpollution$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")
#Generate a scatterplot of the data (Independent Variable = Temperature)
plot(y = airpollution$SO2,x = airpollution$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")
#Generate a scatterplot of the data (Independent Variable = Rainy Days)
plot(y = airpollution$SO2,x = airpollution$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")
The independent variables in this analysis are the population size (in thousands) from the 1970 Census, the average annual temperature (in degrees Fahrenheit), and the average number of days with precipitation per year for each city in the dataset. The dependent variable is the sulfur dioxide content of the air (in micrograms per m^3) for the corresponding city.
We want to determine whether the variation observed in the response variable (‘SO2’) can be explained by variation in any of the three independent variables being considered (‘Pop’, ‘Temp’, and ‘RainDays’). The null hypothesis is therefore that a city's population size from the 1970 Census, its average annual temperature, and its average number of days with precipitation per year have no significant effect on the sulfur dioxide content of its air.
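One way to write this hypothesis down formally is sketched below. The notation (the coefficients β and the error term ε) is an assumption introduced here for illustration and does not appear in the original analysis; it is the standard form of a linear model with three predictors.

```latex
\[
\mathrm{SO2}_i = \beta_0 + \beta_1\,\mathrm{Pop}_i + \beta_2\,\mathrm{Temp}_i
               + \beta_3\,\mathrm{RainDays}_i + \varepsilon_i,
\qquad i = 1, \dots, 41
\]
\[
H_0 : \beta_1 = \beta_2 = \beta_3 = 0
\qquad \text{versus} \qquad
H_1 : \beta_j \neq 0 \ \text{for at least one } j \in \{1, 2, 3\}
\]
```

Under this reading, the overall F-test reported by “summary()” addresses the joint null hypothesis, while the individual t-tests address each coefficient separately.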
To test this hypothesis, we fit a multiple linear regression model with the “lm()” function, which lets us assess whether the variation in sulfur dioxide content can be explained by variation in population size, average annual temperature, and average number of days with precipitation per year. We first fit a simple linear regression model for each independent variable and draw its regression line through a scatterplot of the data, so that the fit between that variable and the sulfur dioxide content of the air can be assessed visually. The multiple linear regression model is then built using entry-wise, step-wise, and hierarchical methods.
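As a minimal, self-contained sketch of the “lm()” call, the block below fits the same three-predictor formula to only the 12 cities printed by “head()” and “tail()” earlier in this document. The names `air_subset` and `fit_subset` are illustrative and not part of the original analysis, and because only a subset of the 41 cities is used, the coefficients will not match the full-data results reported later. The sketch uses the `data =` argument so that coefficient names are the bare column names.

```r
# Illustrative subset: the 12 rows shown by head() and tail() above
# (the full analysis uses all 41 cities, so estimates here will differ)
air_subset <- data.frame(
  City     = c("Phoenix", "Little Rock", "San Francisco", "Denver",
               "Hartford", "Wilmington", "Salt Lake City", "Norfolk",
               "Richmond", "Seattle", "Charleston", "Milwaukee"),
  SO2      = c(10, 13, 12, 17, 56, 36, 28, 31, 26, 29, 31, 16),
  Temp     = c(70.3, 61.0, 56.7, 51.9, 49.1, 54.0,
               51.0, 59.3, 57.8, 51.1, 55.2, 45.7),
  Pop      = c(582, 132, 716, 515, 158, 80, 176, 308, 299, 531, 71, 717),
  RainDays = c(36, 100, 67, 86, 127, 114, 89, 116, 115, 164, 148, 123)
)

# Fit SO2 on the three independent variables; data = keeps the formula short
fit_subset <- lm(SO2 ~ Pop + Temp + RainDays, data = air_subset)

# Inspect the estimated coefficients of the subset fit
print(coef(fit_subset))
```

Passing `data =` instead of writing `airpollution$...` inside the formula is a stylistic choice; it also makes “predict()” with new data straightforward.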
###Generate three simple linear regression models (one for each independent variable)
##Population
airpollution_model_pop <- lm(airpollution$SO2~airpollution$Pop)
summary(airpollution_model_pop)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.545 -14.456 -4.019 11.019 72.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.868316 4.713844 3.791 0.000509 ***
## airpollution$Pop 0.020014 0.005644 3.546 0.001035 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.67 on 39 degrees of freedom
## Multiple R-squared: 0.2438, Adjusted R-squared: 0.2244
## F-statistic: 12.57 on 1 and 39 DF, p-value: 0.001035
#Plot the regression line of the simple linear regression model "airpollution_model_pop"
plot(y = airpollution$SO2,x = airpollution$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")
abline(airpollution_model_pop, col='black',lwd=2.5)
##Temperature
airpollution_model_temp <- lm(airpollution$SO2~airpollution$Temp)
summary(airpollution_model_temp)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.248 -11.830 -3.305 4.456 72.680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.5711 26.3437 4.121 0.00019 ***
## airpollution$Temp -1.4081 0.4686 -3.005 0.00462 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.42 on 39 degrees of freedom
## Multiple R-squared: 0.188, Adjusted R-squared: 0.1672
## F-statistic: 9.03 on 1 and 39 DF, p-value: 0.004624
#Plot the regression line of the simple linear regression model "airpollution_model_temp"
plot(y = airpollution$SO2,x = airpollution$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")
abline(airpollution_model_temp, col='black',lwd=2.5)
##Rainy Days
airpollution_model_raindays <- lm(airpollution$SO2~airpollution$RainDays)
summary(airpollution_model_raindays)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$RainDays)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.098 -12.499 -4.408 5.919 77.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.2270 15.3991 -0.469 0.6415
## airpollution$RainDays 0.3273 0.1318 2.484 0.0174 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.09 on 39 degrees of freedom
## Multiple R-squared: 0.1366, Adjusted R-squared: 0.1144
## F-statistic: 6.169 on 1 and 39 DF, p-value: 0.0174
#Plot the regression line of the simple linear regression model "airpollution_model_raindays"
plot(y = airpollution$SO2,x = airpollution$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")
abline(airpollution_model_raindays, col='black',lwd=2.5)
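To make the coefficient tables above easier to read, the following sketch recomputes the t values and p-values of the three slopes from the printed estimates and standard errors. The data frame `slopes` is an illustrative name; the numbers are copied from the “summary()” outputs above, so small rounding differences are expected.

```r
# Slope estimates and standard errors copied from the three summary() outputs
slopes <- data.frame(
  term     = c("Pop", "Temp", "RainDays"),
  estimate = c(0.020014, -1.4081, 0.3273),
  std_err  = c(0.005644, 0.4686, 0.1318)
)

# Residual degrees of freedom: 41 cities minus 2 estimated coefficients
resid_df <- 39

# t value = estimate / standard error
slopes$t_value <- slopes$estimate / slopes$std_err

# Two-sided p-value from the t distribution
slopes$p_value <- 2 * pt(-abs(slopes$t_value), df = resid_df)

# With a single predictor, the model F-statistic equals the squared t value
slopes$f_value <- slopes$t_value^2

print(slopes)
```

The squared t values should agree, up to rounding, with the F-statistics reported at the bottom of each summary (12.57, 9.03, and 6.169).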
###Generate an entry-wise multiple linear regression model
airpollution_model_mlr <- lm(airpollution$SO2~airpollution$Pop+airpollution$Temp+airpollution$RainDays)
#Display summary of the entry-wise multiple linear regression model
summary(airpollution_model_mlr)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop + airpollution$Temp +
## airpollution$RainDays)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.031 -10.157 -1.737 10.057 64.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.208674 33.245537 1.600 0.11800
## airpollution$Pop 0.018854 0.004976 3.789 0.00054 ***
## airpollution$Temp -1.011712 0.441312 -2.293 0.02766 *
## airpollution$RainDays 0.191234 0.120206 1.591 0.12014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.19 on 37 degrees of freedom
## Multiple R-squared: 0.4446, Adjusted R-squared: 0.3995
## F-statistic: 9.872 on 3 and 37 DF, p-value: 6.42e-05
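As a rough illustration of what the fitted model implies, the sketch below plugs the printed coefficients into the regression equation for two cities taken from the “head()” and “tail()” output (Phoenix and Milwaukee). The names `coefs` and `cities` are illustrative; in practice “fitted()” or “predict()” on the model object would give the same values without rounding.

```r
# Coefficients copied from the summary of the entry-wise model above
coefs <- c(intercept = 53.208674, Pop = 0.018854,
           Temp = -1.011712, RainDays = 0.191234)

# Two cities taken from the head() and tail() output
cities <- data.frame(
  City     = c("Phoenix", "Milwaukee"),
  SO2      = c(10, 16),
  Pop      = c(582, 717),
  Temp     = c(70.3, 45.7),
  RainDays = c(36, 123)
)

# Fitted value = intercept + sum of (coefficient * predictor value)
cities$fitted <- coefs[["intercept"]] +
  coefs[["Pop"]]      * cities$Pop +
  coefs[["Temp"]]     * cities$Temp +
  coefs[["RainDays"]] * cities$RainDays

# Residual = observed SO2 minus fitted SO2
cities$residual <- cities$SO2 - cities$fitted

print(cities)
```

For Phoenix the fitted value comes out slightly below zero, which is not physically meaningful for a concentration and is worth keeping in mind when interpreting the linear fit.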
###Generate a step-wise multiple linear regression model (using MASS package)
library("MASS")
## Warning: package 'MASS' was built under R version 3.1.2
airpollution_model_step <- stepAIC(airpollution_model_mlr, direction="both")
## Start: AIC=241.66
## airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
##
## Df Sum of Sq RSS AIC
## <none> 12240 241.66
## - airpollution$RainDays 1 837.3 13078 242.37
## - airpollution$Temp 1 1738.7 13979 245.10
## - airpollution$Pop 1 4748.5 16989 253.10
airpollution_model_step$anova # display results
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
##
## Final Model:
## airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 37 12240.33 241.6557
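The AIC values in the step-wise table can be reproduced by hand. The sketch below assumes the form used for linear models by “extractAIC()”, namely n * log(RSS / n) + 2 * (number of estimated coefficients); the data frame `aic_table` is an illustrative name, and the RSS values are copied from the output above.

```r
# Number of observations (cities)
n <- 41

# RSS values copied from the stepAIC() output; n_coef counts the intercept
aic_table <- data.frame(
  model  = c("Pop + Temp + RainDays", "drop RainDays", "drop Temp", "drop Pop"),
  rss    = c(12240.33, 13078, 13979, 16989),
  n_coef = c(4, 3, 3, 3)
)

# AIC up to an additive constant, as used when comparing lm fits
aic_table$aic <- n * log(aic_table$rss / n) + 2 * aic_table$n_coef

print(aic_table)
```

Because dropping any single term gives a larger AIC than the full model, the procedure stops at the first step and the final model equals the initial model.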
###Generate a hierarchical multiple linear regression model
airpollution_model_one <- lm(airpollution$SO2~airpollution$Pop)
summary(airpollution_model_one)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.545 -14.456 -4.019 11.019 72.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.868316 4.713844 3.791 0.000509 ***
## airpollution$Pop 0.020014 0.005644 3.546 0.001035 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.67 on 39 degrees of freedom
## Multiple R-squared: 0.2438, Adjusted R-squared: 0.2244
## F-statistic: 12.57 on 1 and 39 DF, p-value: 0.001035
airpollution_model_two <- lm(airpollution$SO2~airpollution$Pop+airpollution$Temp)
summary(airpollution_model_two)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop + airpollution$Temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.318 -14.189 -1.125 11.056 64.542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.698488 23.256491 3.943 0.000334 ***
## airpollution$Pop 0.018987 0.005075 3.741 0.000604 ***
## airpollution$Temp -1.312781 0.406627 -3.228 0.002566 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.55 on 38 degrees of freedom
## Multiple R-squared: 0.4066, Adjusted R-squared: 0.3754
## F-statistic: 13.02 on 2 and 38 DF, p-value: 4.941e-05
airpollution_model_three <- lm(airpollution$SO2~airpollution$Pop+airpollution$Temp+airpollution$RainDays)
summary(airpollution_model_three)
##
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop + airpollution$Temp +
## airpollution$RainDays)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.031 -10.157 -1.737 10.057 64.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.208674 33.245537 1.600 0.11800
## airpollution$Pop 0.018854 0.004976 3.789 0.00054 ***
## airpollution$Temp -1.011712 0.441312 -2.293 0.02766 *
## airpollution$RainDays 0.191234 0.120206 1.591 0.12014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.19 on 37 degrees of freedom
## Multiple R-squared: 0.4446, Adjusted R-squared: 0.3995
## F-statistic: 9.872 on 3 and 37 DF, p-value: 6.42e-05
anova(airpollution_model_one, airpollution_model_two, airpollution_model_three)
## Analysis of Variance Table
##
## Model 1: airpollution$SO2 ~ airpollution$Pop
## Model 2: airpollution$SO2 ~ airpollution$Pop + airpollution$Temp
## Model 3: airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39 16665
## 2 38 13078 1 3587.0 10.8429 0.002189 **
## 3 37 12240 1 837.3 2.5309 0.120144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
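The F-statistics in the table above can be recomputed from its other columns. In this sketch, each added term's sum of squares is divided by the residual mean square of the largest model (Model 3); the variable names are illustrative, and the inputs are copied from the printed table, so results agree only up to rounding.

```r
# Sum of squares explained by each added term (Temp, then RainDays)
sum_sq   <- c(Temp = 3587.0, RainDays = 837.3)
df_added <- c(1, 1)

# Residual sum of squares and degrees of freedom of the largest model
rss_full <- 12240.33
df_full  <- 37

# Residual mean square of the largest model is the common denominator
mse_full <- rss_full / df_full

# Partial F-statistic for each added term
f_stat <- (sum_sq / df_added) / mse_full

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df_added, df_full, lower.tail = FALSE)

print(data.frame(f_stat, p_value))
```

Read this way, adding “Temp” to the population-only model gives a significant improvement at the 0.05 level, while further adding “RainDays” does not.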
*To be completed in the final version of this project.
confint(airpollution_model_mlr, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -14.153181586 120.57052977
## airpollution$Pop 0.008770595 0.02893702
## airpollution$Temp -1.905894888 -0.11752894
## airpollution$RainDays -0.052326465 0.43479421
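These intervals follow the usual form estimate ± t × standard error. The sketch below rebuilds them from the coefficient table of the entry-wise model, using the 0.975 quantile of the t distribution with 37 residual degrees of freedom; `est`, `se`, and `ci` are illustrative names, and the inputs are the rounded values printed earlier.

```r
# Estimates and standard errors copied from the entry-wise model summary
est <- c("(Intercept)" = 53.208674, Pop = 0.018854,
         Temp = -1.011712, RainDays = 0.191234)
se  <- c(33.245537, 0.004976, 0.441312, 0.120206)

# Critical value for a two-sided 95% interval with 37 residual df
t_crit <- qt(0.975, df = 37)

# Lower and upper limits for each coefficient
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)

print(ci)
```

The intervals for “Pop” and “Temp” exclude zero, while those for the intercept and “RainDays” include it, which matches the pattern of significant and non-significant t-tests in the model summary.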
*To be completed in the final version of this project.