Assignment #02: Statistical Significance (Multiple Linear Regression) [Outline]

Statistical Significance (Multiple Linear Regression) Project - Analysis of Air Pollution Dataset

Brendan Howell

Rensselaer Polytechnic Institute

03/12/15 - Version 1.0

1. Data

Data Selection and Description

U.S. Air Pollution Dataset

This dataset contains air pollution and related values for 41 U.S. cities, collected from U.S. government publications. The data are mean values over the years 1969-1971, and the variable names and descriptions are listed below:

City: City name
SO2: Sulfur dioxide content of air in micrograms per cubic meter
Temp: Average annual temperature in degrees Fahrenheit
Man: Number of manufacturing enterprises employing 20 or more workers
Pop: Population size in thousands from the 1970 census
Wind: Average annual wind speed in miles per hour
Rain: Average annual precipitation in inches
RainDays: Average number of days with precipitation per year

[Reference: Sokal, R.R. and Rohlf, F.J. (1981) Biometry, 2nd edition, San Francisco: W.H. Freeman, 239. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, 20-21. (http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html)]

Organization of Data

Below, the Air Pollution Dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset). Additionally, scatterplots of the data are generated, which offer some visual insight into the data. (In carrying out this analysis, only three independent variables [“Pop”, “Temp”, and “RainDays”] will be considered against the dependent variable “SO2”.)

##Load in the Air Pollution Dataset
rm(list=ls())
#Get dataset from Project Documents File
airpollution <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #2/air pollution.csv", header=TRUE)
#Display the head and tail of the data
head(airpollution)
##            City SO2 Temp Man Pop Wind  Rain RainDays
## 1       Phoenix  10 70.3 213 582  6.0  7.05       36
## 2   Little Rock  13 61.0  91 132  8.2 48.52      100
## 3 San Francisco  12 56.7 453 716  8.7 20.66       67
## 4        Denver  17 51.9 454 515  9.0 12.95       86
## 5      Hartford  56 49.1 412 158  9.0 43.37      127
## 6    Wilmington  36 54.0  80  80  9.0 40.25      114
tail(airpollution)
##              City SO2 Temp Man Pop Wind  Rain RainDays
## 36 Salt Lake City  28 51.0 137 176  8.7 15.17       89
## 37        Norfolk  31 59.3  96 308 10.6 44.68      116
## 38       Richmond  26 57.8 197 299  7.6 42.59      115
## 39        Seattle  29 51.1 379 531  9.4 38.79      164
## 40     Charleston  31 55.2  35  71  6.5 40.75      148
## 41      Milwaukee  16 45.7 569 717 11.8 29.07      123
#Display the summary statistics and the structure of the data
summary(airpollution)
##           City         SO2              Temp            Man        
##  Albany     : 1   Min.   :  8.00   Min.   :43.50   Min.   :  35.0  
##  Albuquerque: 1   1st Qu.: 13.00   1st Qu.:50.60   1st Qu.: 181.0  
##  Atlanta    : 1   Median : 26.00   Median :54.60   Median : 347.0  
##  Baltimore  : 1   Mean   : 30.05   Mean   :55.76   Mean   : 463.1  
##  Buffalo    : 1   3rd Qu.: 35.00   3rd Qu.:59.30   3rd Qu.: 462.0  
##  Charleston : 1   Max.   :110.00   Max.   :75.50   Max.   :3344.0  
##  (Other)    :35                                                    
##       Pop              Wind             Rain          RainDays    
##  Min.   :  71.0   Min.   : 6.000   Min.   : 7.05   Min.   : 36.0  
##  1st Qu.: 299.0   1st Qu.: 8.700   1st Qu.:30.96   1st Qu.:103.0  
##  Median : 515.0   Median : 9.300   Median :38.47   Median :115.0  
##  Mean   : 608.6   Mean   : 9.444   Mean   :36.76   Mean   :113.9  
##  3rd Qu.: 717.0   3rd Qu.:10.600   3rd Qu.:43.11   3rd Qu.:128.0  
##  Max.   :3369.0   Max.   :12.700   Max.   :59.80   Max.   :166.0  
## 
str(airpollution)
## 'data.frame':    41 obs. of  8 variables:
##  $ City    : Factor w/ 41 levels "Albany","Albuquerque",..: 31 20 36 12 15 41 39 18 23 3 ...
##  $ SO2     : int  10 13 12 17 56 36 29 14 10 24 ...
##  $ Temp    : num  70.3 61 56.7 51.9 49.1 54 57.3 68.4 75.5 61.5 ...
##  $ Man     : int  213 91 453 454 412 80 434 136 207 368 ...
##  $ Pop     : int  582 132 716 515 158 80 757 529 335 497 ...
##  $ Wind    : num  6 8.2 8.7 9 9 9 9.3 8.8 9 9.1 ...
##  $ Rain    : num  7.05 48.52 20.66 12.95 43.37 ...
##  $ RainDays: int  36 100 67 86 127 114 111 116 128 115 ...
#Generate a scatterplot of the data (Independent Variable = Population Size)
plot(y = airpollution$SO2,x = airpollution$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")

#Generate a scatterplot of the data (Independent Variable = Temperature)
plot(y = airpollution$SO2,x = airpollution$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

#Generate a scatterplot of the data (Independent Variable = Rainy Days)
plot(y = airpollution$SO2,x = airpollution$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")

2. The Linear Model (Multiple Linear Regression)

Description of independent variables and the dependent variable

In this analysis, the independent variables are the population size (in thousands) from the 1970 Census, the average annual temperature (in degrees Fahrenheit), and the average number of days with precipitation per year for each city in the dataset. The dependent variable is the sulfur dioxide content of the air (in micrograms per m^3) for the corresponding city.

Description of the null hypothesis (H_0)

In this analysis, we are trying to determine whether the variation observed in the response variable (‘SO2’) can be explained by variation in any of the three predictors being considered (‘Pop’, ‘Temp’, and ‘RainDays’). Therefore, the null hypothesis being tested states that a city's population size from the 1970 Census, its average annual temperature, and its average number of days with precipitation per year do not have a significant effect on the sulfur dioxide content of its air.
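
One conventional way to write this hypothesis formally is sketched below. The symbols \beta_0 through \beta_3 and \varepsilon_i are notation assumed here for the population coefficients and the error term; they do not appear elsewhere in this outline, which refers to the fitted estimates as b_0 and b_1.

```latex
\begin{aligned}
\text{SO2}_i &= \beta_0 + \beta_1\,\text{Pop}_i + \beta_2\,\text{Temp}_i + \beta_3\,\text{RainDays}_i + \varepsilon_i, \qquad i = 1, \dots, 41 \\
H_0 &: \beta_1 = \beta_2 = \beta_3 = 0 \\
H_A &: \beta_j \neq 0 \ \text{for at least one } j \in \{1, 2, 3\}
\end{aligned}
```

Under this formulation, the overall F-test of a model with three predictors fit to 41 cities has 3 and 37 degrees of freedom.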

Multiple Linear Regression Model

To determine whether the variation observed in the response variable (‘SO2’) can be explained by variation in any of the three predictors (‘Pop’, ‘Temp’, and ‘RainDays’), we fit a multiple linear regression model using the “lm()” function. As a first step, a simple linear regression is fit for each predictor, and its regression line is drawn through the corresponding scatterplot so that the fit between that predictor (population size, average annual temperature, or average number of days with precipitation per year) and the response (sulfur dioxide content of the air) can be assessed graphically. (In building the multiple linear regression model, entry-wise, step-wise, and hierarchical methodologies will be used.)
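
As a minimal sketch of how such a model can be specified, the example below uses the formula interface of “lm()” with a “data” argument. It is fit only to the 12 rows printed by head() and tail() above, not to the full 41-city file, so its coefficients will not match the results reported in this document. The names air_subset, fit_subset, and new_city are illustrative, and the values in new_city are hypothetical.

```r
# Illustrative subset: the 12 rows shown by head() and tail(),
# NOT the full 41-city dataset, so estimates differ from the report.
air_subset <- data.frame(
  City = c("Phoenix", "Little Rock", "San Francisco", "Denver",
           "Hartford", "Wilmington", "Salt Lake City", "Norfolk",
           "Richmond", "Seattle", "Charleston", "Milwaukee"),
  SO2 = c(10, 13, 12, 17, 56, 36, 28, 31, 26, 29, 31, 16),
  Temp = c(70.3, 61.0, 56.7, 51.9, 49.1, 54.0,
           51.0, 59.3, 57.8, 51.1, 55.2, 45.7),
  Pop = c(582, 132, 716, 515, 158, 80, 176, 308, 299, 531, 71, 717),
  RainDays = c(36, 100, 67, 86, 127, 114, 89, 116, 115, 164, 148, 123)
)

# Formula interface: column names are looked up inside the data frame
fit_subset <- lm(SO2 ~ Pop + Temp + RainDays, data = air_subset)

# Coefficient names are the plain column names (Pop, Temp, RainDays)
print(coef(fit_subset))

# predict() can score new rows because the formula refers to column names
new_city <- data.frame(Pop = 500, Temp = 55, RainDays = 110)  # hypothetical values
pred_new <- predict(fit_subset, newdata = new_city)
print(pred_new)
```

Writing the formula against column names rather than “airpollution$...” vectors keeps the coefficient labels short and lets predict() work on new data frames.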

###Generate three simple linear regression models (one for each independent variable)

##Population
airpollution_model_pop <- lm(airpollution$SO2~airpollution$Pop)
summary(airpollution_model_pop)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.545 -14.456  -4.019  11.019  72.549 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.868316   4.713844   3.791 0.000509 ***
## airpollution$Pop  0.020014   0.005644   3.546 0.001035 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.67 on 39 degrees of freedom
## Multiple R-squared:  0.2438, Adjusted R-squared:  0.2244 
## F-statistic: 12.57 on 1 and 39 DF,  p-value: 0.001035
#Plot the regression line of the simple linear regression model "airpollution_model_pop"
plot(y = airpollution$SO2,x = airpollution$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")
abline(airpollution_model_pop, col='black',lwd=2.5)

##Temperature
airpollution_model_temp <- lm(airpollution$SO2~airpollution$Temp)
summary(airpollution_model_temp)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.248 -11.830  -3.305   4.456  72.680 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       108.5711    26.3437   4.121  0.00019 ***
## airpollution$Temp  -1.4081     0.4686  -3.005  0.00462 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.42 on 39 degrees of freedom
## Multiple R-squared:  0.188,  Adjusted R-squared:  0.1672 
## F-statistic:  9.03 on 1 and 39 DF,  p-value: 0.004624
#Plot the regression line of the simple linear regression model "airpollution_model_temp"
plot(y = airpollution$SO2,x = airpollution$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")
abline(airpollution_model_temp, col='black',lwd=2.5)

##Rainy Days
airpollution_model_raindays <- lm(airpollution$SO2~airpollution$RainDays)
summary(airpollution_model_raindays)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.098 -12.499  -4.408   5.919  77.301 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            -7.2270    15.3991  -0.469   0.6415  
## airpollution$RainDays   0.3273     0.1318   2.484   0.0174 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.09 on 39 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1144 
## F-statistic: 6.169 on 1 and 39 DF,  p-value: 0.0174
#Plot the regression line of the simple linear regression model "airpollution_model_raindays"
plot(y = airpollution$SO2,x = airpollution$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")
abline(airpollution_model_raindays, col='black',lwd=2.5)
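
Because each of the three models above has a single predictor, the correlation coefficient r can be recovered from the reported Multiple R-squared: its magnitude is the square root of R-squared and its sign is the sign of the slope. The sketch below carries out that arithmetic with values copied from the summaries above; the variable names are illustrative.

```r
# Multiple R-squared values copied from the three summary() outputs
r_squared <- c(Pop = 0.2438, Temp = 0.188, RainDays = 0.1366)

# Slope estimates copied from the same outputs (only their signs are used)
slope <- c(Pop = 0.020014, Temp = -1.4081, RainDays = 0.3273)

# In simple linear regression, r = sign(slope) * sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)
print(round(r, 3))
```

This gives correlations of roughly 0.49 for Pop, -0.43 for Temp, and 0.37 for RainDays.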

###Generate an entry-wise multiple linear regression model
airpollution_model_mlr <- lm(airpollution$SO2~airpollution$Pop+airpollution$Temp+airpollution$RainDays)
#Display summary of the entry-wise multiple linear regression model
summary(airpollution_model_mlr)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + 
##     airpollution$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.031 -10.157  -1.737  10.057  64.098 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           53.208674  33.245537   1.600  0.11800    
## airpollution$Pop       0.018854   0.004976   3.789  0.00054 ***
## airpollution$Temp     -1.011712   0.441312  -2.293  0.02766 *  
## airpollution$RainDays  0.191234   0.120206   1.591  0.12014    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.19 on 37 degrees of freedom
## Multiple R-squared:  0.4446, Adjusted R-squared:  0.3995 
## F-statistic: 9.872 on 3 and 37 DF,  p-value: 6.42e-05
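
The overall F-statistic in the summary above can be reproduced from the Multiple R-squared, the number of observations, and the number of predictors. The sketch below is a hand check using the rounded values printed in the output, so it agrees with the reported figures only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary() output of the entry-wise model
r_squared <- 0.4446   # Multiple R-squared
n <- 41               # number of cities
k <- 3                # number of predictors

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_stat <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Upper-tail probability of the F distribution with k and n - k - 1 df
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

The computed statistic is close to the reported 9.872 on 3 and 37 degrees of freedom.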
###Generate a step-wise multiple linear regression model (using MASS package)
library("MASS")
## Warning: package 'MASS' was built under R version 3.1.2
airpollution_model_step <- stepAIC(airpollution_model_mlr, direction="both")
## Start:  AIC=241.66
## airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
## 
##                         Df Sum of Sq   RSS    AIC
## <none>                               12240 241.66
## - airpollution$RainDays  1     837.3 13078 242.37
## - airpollution$Temp      1    1738.7 13979 245.10
## - airpollution$Pop       1    4748.5 16989 253.10
airpollution_model_step$anova # display results
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
## 
## Final Model:
## airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                         37   12240.33 241.6557
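
For linear models, the AIC values printed by stepAIC() follow the formula n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below recomputes two rows of the table above from the printed RSS values; since no deletion lowers the AIC, the procedure keeps all three predictors. The variable names are illustrative.

```r
n <- 41  # number of cities

# Full model: residual deviance from the stepwise path, 4 coefficients
rss_full <- 12240.33
aic_full <- n * log(rss_full / n) + 2 * 4

# Model without RainDays: RSS from the stepAIC table, 3 coefficients
rss_no_raindays <- 13078
aic_no_raindays <- n * log(rss_no_raindays / n) + 2 * 3

print(c(full = aic_full, no_raindays = aic_no_raindays))
```

The full model's value is the smaller of the two, which matches the "<none>" row being listed first in the table.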
###Generate a hierarchical multiple linear regression model
airpollution_model_one <- lm(airpollution$SO2~airpollution$Pop)
summary(airpollution_model_one)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.545 -14.456  -4.019  11.019  72.549 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.868316   4.713844   3.791 0.000509 ***
## airpollution$Pop  0.020014   0.005644   3.546 0.001035 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.67 on 39 degrees of freedom
## Multiple R-squared:  0.2438, Adjusted R-squared:  0.2244 
## F-statistic: 12.57 on 1 and 39 DF,  p-value: 0.001035
airpollution_model_two <- lm(airpollution$SO2~airpollution$Pop+airpollution$Temp)
summary(airpollution_model_two)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop + airpollution$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.318 -14.189  -1.125  11.056  64.542 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       91.698488  23.256491   3.943 0.000334 ***
## airpollution$Pop   0.018987   0.005075   3.741 0.000604 ***
## airpollution$Temp -1.312781   0.406627  -3.228 0.002566 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.55 on 38 degrees of freedom
## Multiple R-squared:  0.4066, Adjusted R-squared:  0.3754 
## F-statistic: 13.02 on 2 and 38 DF,  p-value: 4.941e-05
airpollution_model_three <- lm(airpollution$SO2~airpollution$Pop+airpollution$Temp+airpollution$RainDays)
summary(airpollution_model_three)
## 
## Call:
## lm(formula = airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + 
##     airpollution$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.031 -10.157  -1.737  10.057  64.098 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           53.208674  33.245537   1.600  0.11800    
## airpollution$Pop       0.018854   0.004976   3.789  0.00054 ***
## airpollution$Temp     -1.011712   0.441312  -2.293  0.02766 *  
## airpollution$RainDays  0.191234   0.120206   1.591  0.12014    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.19 on 37 degrees of freedom
## Multiple R-squared:  0.4446, Adjusted R-squared:  0.3995 
## F-statistic: 9.872 on 3 and 37 DF,  p-value: 6.42e-05
anova(airpollution_model_one, airpollution_model_two, airpollution_model_three)
## Analysis of Variance Table
## 
## Model 1: airpollution$SO2 ~ airpollution$Pop
## Model 2: airpollution$SO2 ~ airpollution$Pop + airpollution$Temp
## Model 3: airpollution$SO2 ~ airpollution$Pop + airpollution$Temp + airpollution$RainDays
##   Res.Df   RSS Df Sum of Sq       F   Pr(>F)   
## 1     39 16665                                 
## 2     38 13078  1    3587.0 10.8429 0.002189 **
## 3     37 12240  1     837.3  2.5309 0.120144   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
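
The F values in the sequential ANOVA table above divide the sum of squares gained at each step by the residual mean square of the largest model. The sketch below is a hand check using figures copied from the output (the residual sum of squares 12240.33 is taken from the stepwise path shown earlier); the variable names are illustrative.

```r
# Residual mean square of the largest model (Pop + Temp + RainDays)
rss_three <- 12240.33
df_resid_three <- 37
mse_three <- rss_three / df_resid_three

# Sum of squares gained by adding each term (1 df each), from the table
ss_temp <- 3587.0
ss_raindays <- 837.3

f_temp <- (ss_temp / 1) / mse_three
f_raindays <- (ss_raindays / 1) / mse_three

# Upper-tail p-values on 1 and 37 degrees of freedom
p_temp <- pf(f_temp, 1, df_resid_three, lower.tail = FALSE)
p_raindays <- pf(f_raindays, 1, df_resid_three, lower.tail = FALSE)

print(c(f_temp = f_temp, f_raindays = f_raindays))
print(c(p_temp = p_temp, p_raindays = p_raindays))
```

The F value for RainDays is also the square of its t value (1.591) in the full-model summary, which is why the two p-values agree.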

Plot of the 95% confidence intervals of the regression line, b_0 and b_1

*To be completed in the final version of this project.

confint(airpollution_model_mlr, level=0.95)
##                               2.5 %       97.5 %
## (Intercept)           -14.153181586 120.57052977
## airpollution$Pop        0.008770595   0.02893702
## airpollution$Temp      -1.905894888  -0.11752894
## airpollution$RainDays  -0.052326465   0.43479421
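
The intervals returned by confint() for a linear model are the estimate plus or minus a t critical value times the standard error. The sketch below reproduces the three slope intervals from the estimates and standard errors printed in the entry-wise model summary; the variable names are illustrative, and small differences from the output above come from rounding.

```r
# Estimates and standard errors copied from the entry-wise model summary
est <- c(Pop = 0.018854, Temp = -1.011712, RainDays = 0.191234)
se  <- c(Pop = 0.004976, Temp = 0.441312, RainDays = 0.120206)

# Two-sided 95% critical value on 37 residual degrees of freedom
t_crit <- qt(0.975, df = 37)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)
```

The interval for RainDays includes zero, in line with its p-value of 0.12 in the model summary, while the intervals for Pop and Temp do not.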

Interpretation of the results of the statistical analysis (b_0, b_1, and r)

*To be completed in the final version of this project.