Assignment #02: Statistical Significance (Multiple Linear Regression) [Final]

Statistical Significance (Multiple Linear Regression) Project - Analysis of Air Pollution Dataset

Brendan Howell

Renselaer Polytechnic Institute

03/19/15 - Version 1.0

1. Data

Data Selection and Description

U.S. Air Pollution Dataset

In this dataset, air pollution and related values for 41 U.S. cities are contained, which were collected from U.S. government publications. The data are mean values over the years 1969-1971, and the variable names/descriptions are listed below:

City: City name SO2: Sulfur dioxide content of air in micrograms per cubic meter Temp: Average annual temperature in degrees Fahrenheit Man: Number of manufacturing enterprises employing 20 or more workers Pop: Population size in thousands from the 1970 census Wind: Average annual wind speed in miles per hour Rain: Average annual precipitation in inches RainDays: Average number of days with precipitation per year

[Reference: Sokal, R.R. and Rohlf, F.J. (1981) Biometry, 2nd edition, San Francisco: W.H. Freeman, 239. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, 20-21. (http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html)]

Organization of Data

Below, the Air Pollution Dataset is loaded into R, and its summary statistics and its structure are display (along with the “head” and the “tail” of the dataset). Additionally, scatter plots and box plots of the data are generated, which offers some visual/graphical insight into the data itself. (In carrying out this analysis, only three independent variables [“Pop”, “Temp”, and “RainDays”] will be considered against the dependent variable “SO2”.)

##Load in the Air Pollution Dataset
rm(list=ls())

#Get dataset from Project Documents File
airpollution_0 <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #2/air pollution.csv", header=TRUE)

#Remove unnecessary variables from the dataset
airpollution_1 <- subset(airpollution_0, select = c(SO2, Pop, Temp, RainDays))

#Display the head and tail of the data
head(airpollution_1)
##   SO2 Pop Temp RainDays
## 1  10 582 70.3       36
## 2  13 132 61.0      100
## 3  12 716 56.7       67
## 4  17 515 51.9       86
## 5  56 158 49.1      127
## 6  36  80 54.0      114
tail(airpollution_1)
##    SO2 Pop Temp RainDays
## 36  28 176 51.0       89
## 37  31 308 59.3      116
## 38  26 299 57.8      115
## 39  29 531 51.1      164
## 40  31  71 55.2      148
## 41  16 717 45.7      123
#Display the summary statistics, a "scattergram," and the structure of the data
summary(airpollution_1)
##       SO2              Pop              Temp          RainDays    
##  Min.   :  8.00   Min.   :  71.0   Min.   :43.50   Min.   : 36.0  
##  1st Qu.: 13.00   1st Qu.: 299.0   1st Qu.:50.60   1st Qu.:103.0  
##  Median : 26.00   Median : 515.0   Median :54.60   Median :115.0  
##  Mean   : 30.05   Mean   : 608.6   Mean   :55.76   Mean   :113.9  
##  3rd Qu.: 35.00   3rd Qu.: 717.0   3rd Qu.:59.30   3rd Qu.:128.0  
##  Max.   :110.00   Max.   :3369.0   Max.   :75.50   Max.   :166.0
plot(airpollution_1)

str(airpollution_1)
## 'data.frame':    41 obs. of  4 variables:
##  $ SO2     : int  10 13 12 17 56 36 29 14 10 24 ...
##  $ Pop     : int  582 132 716 515 158 80 757 529 335 497 ...
##  $ Temp    : num  70.3 61 56.7 51.9 49.1 54 57.3 68.4 75.5 61.5 ...
##  $ RainDays: int  36 100 67 86 127 114 111 116 128 115 ...
##Population Size
#Generate a scatterplot of the data (Independent Variable = Population Size)
plot(y = airpollution_1$SO2,x = airpollution_1$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")

#Generate a boxplot of the data (Independent Variable = Population Size)
Pop_boxplot_1<-boxplot(x = airpollution_1$Pop, pch=21, bg="darkviolet", main="Population Size", xlab = "Population Size (in thousands) from the 1970 Census")

##Identify and remove outliers from the 'Population Size' data (First Round)
Pop_boxplot_1$out
## [1] 3369 1513 1950
airpollution_1a<-airpollution_1[airpollution_1$Pop!=3369,]
airpollution_1b<-airpollution_1a[airpollution_1a$Pop!=1513,]
airpollution_1c<-airpollution_1b[airpollution_1b$Pop!=1950,]

#Generate second version of a boxplot of the data (Independent Variable = Population Size)
Pop_boxplot_2<-boxplot(x = airpollution_1c$Pop, pch=21, bg="darkviolet", main="Population Size", xlab = "Population Size (in thousands) from the 1970 Census")

##Identify and remove outliers from the 'Population Size' data (Second Round)
Pop_boxplot_2$out
## [1] 1233
airpollution_1d<-airpollution_1c[airpollution_1c$Pop!=1233,]

#Generate third version of a boxplot of the data (Independent Variable = Population Size)
Pop_boxplot_3<-boxplot(x = airpollution_1d$Pop, pch=21, bg="darkviolet", main="Population Size", xlab = "Population Size (in thousands) from the 1970 Census")

Pop_boxplot_3$out
## numeric(0)
airpollution_2<-airpollution_1d[]
summary(airpollution_2)
##       SO2             Pop             Temp          RainDays    
##  Min.   : 8.00   Min.   : 71.0   Min.   :43.50   Min.   : 36.0  
##  1st Qu.:13.00   1st Qu.:277.0   1st Qu.:51.00   1st Qu.:100.0  
##  Median :24.00   Median :497.0   Median :55.00   Median :115.0  
##  Mean   :27.24   Mean   :456.4   Mean   :55.74   Mean   :113.5  
##  3rd Qu.:31.00   3rd Qu.:622.0   3rd Qu.:59.30   3rd Qu.:128.0  
##  Max.   :94.00   Max.   :905.0   Max.   :75.50   Max.   :166.0
#Generate a final scatterplot of the new data (Independent Variable = Population Size)
plot(y = airpollution_2$SO2,x = airpollution_2$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")

#Generate a final boxplot of the data (Independent Variable = Population Size)
boxplot(x = airpollution_2$Pop, pch=21, bg="darkviolet", main="Population Size",  xlab = "Population Size (in thousands) from the 1970 Census")

##Temperature
#Generate a scatterplot of the data (Independent Variable = Temperature)
plot(y = airpollution_2$SO2,x = airpollution_2$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

#Generate a boxplot of the data (Independent Variable = Temperature)
Temp_boxplot_1<-boxplot(x = airpollution_2$Temp, pch=21, bg="darkviolet", main="Temperature", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

##Identify and remove outliers from the 'Temperature' data (First Round)
Temp_boxplot_1$out
## [1] 75.5
airpollution_2a<-airpollution_2[airpollution_2$Temp!=75.5,]

#Generate a second scatterplot of the data (Independent Variable = Temperature)
plot(y = airpollution_2a$SO2,x = airpollution_2a$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

#Generate a second boxplot of the data (Independent Variable = Temperature)
Temp_boxplot_2<-boxplot(x = airpollution_2a$Temp, pch=21, bg="darkviolet", main="Temperature", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

Temp_boxplot_2$out
## numeric(0)
airpollution_3<-airpollution_2a[]
summary(airpollution_3)
##       SO2             Pop             Temp          RainDays     
##  Min.   : 8.00   Min.   : 71.0   Min.   :43.50   Min.   : 36.00  
##  1st Qu.:13.75   1st Qu.:268.8   1st Qu.:50.85   1st Qu.: 99.75  
##  Median :25.00   Median :502.0   Median :54.75   Median :115.00  
##  Mean   :27.72   Mean   :459.8   Mean   :55.19   Mean   :113.14  
##  3rd Qu.:31.00   3rd Qu.:622.5   3rd Qu.:58.17   3rd Qu.:128.25  
##  Max.   :94.00   Max.   :905.0   Max.   :70.30   Max.   :166.00
#Generate a final scatterplot of the data (Independent Variable = Temperature)
plot(y = airpollution_3$SO2,x = airpollution_3$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

#Generate a final boxplot of the data (Independent Variable = Temperature)
boxplot(x = airpollution_3$Temp, pch=21, bg="darkviolet", main="Temperature", xlab = "Average Annual Temperature (in degrees Fahrenheit)")

##Rainy Days
#Generate a scatterplot of the data (Independent Variable = Rainy Days)
plot(y = airpollution_3$SO2,x = airpollution_3$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")

#Generate a boxplot of the data (Independent Variable = Rainy Days)
RainDays_boxplot_1 <- boxplot(x = airpollution_3$RainDays, pch=21, bg="darkviolet", main="Number of Days with Precipitation", xlab = "Average Number of Days with Precipitation (per year)")

##Identify and remove outliers from the 'Rainy Days' data (First Round)
RainDays_boxplot_1$out
## [1] 36
airpollution_3a<-airpollution_3[airpollution_3$RainDays!=36,]

#Generate a second scatterplot of the data (Independent Variable = Rainy Days)
plot(y = airpollution_3a$SO2,x = airpollution_3a$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")

#Generate a second boxplot of the data (Independent Variable = Rainy Days)
RainDays_boxplot_2 <- boxplot(x = airpollution_3a$RainDays, pch=21, bg="darkviolet", main="Number of Days with Precipitation", xlab = "Average Number of Days with Precipitation (per year)")

##Identify and remove outliers from the 'Rainy Days' data (Second Round)
RainDays_boxplot_2$out
## [1] 58
airpollution_3b<-airpollution_3a[airpollution_3a$RainDays!=58,]

#Generate a third scatterplot of the data (Independent Variable = Rainy Days)
plot(y = airpollution_3b$SO2,x = airpollution_3b$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")

#Generate a third boxplot of the data (Independent Variable = Rainy Days)
RainDays_boxplot_3 <- boxplot(x = airpollution_3b$RainDays, pch=21, bg="darkviolet", main="Number of Days with Precipitation", xlab = "Average Number of Days with Precipitation (per year)")

airpollution_4<-airpollution_3b[]
summary(airpollution_4)
##       SO2             Pop             Temp          RainDays    
##  Min.   : 8.00   Min.   : 71.0   Min.   :43.50   Min.   : 67.0  
##  1st Qu.:14.00   1st Qu.:282.5   1st Qu.:50.55   1st Qu.:103.5  
##  Median :26.00   Median :502.0   Median :54.25   Median :115.5  
##  Mean   :28.74   Mean   :462.6   Mean   :54.70   Mean   :117.0  
##  3rd Qu.:31.00   3rd Qu.:623.5   3rd Qu.:57.67   3rd Qu.:130.8  
##  Max.   :94.00   Max.   :905.0   Max.   :68.40   Max.   :166.0
#Generate a final scatterplot of the data (Independent Variable = Rainy Days)
plot(y = airpollution_4$SO2,x = airpollution_4$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")

#Generate a final boxplot of the data (Independent Variable = Rainy Days)
boxplot(x = airpollution_4$RainDays, pch=21, bg="darkviolet", main="Number of Days with Precipitation", xlab = "Average Number of Days with Precipitation (per year)")

Above, scatter plots and box plots were generated as a means for determining whether outliers were present in the dataset. As outliers were removed from the dataset, the data was re-tested (via new-scatter-plot-and-box-plot generation) to see if all of the outliers had been removed from the dataset. Upon completing this process, a new dataset [“airpollution_4”] was created, containing no outliers for the independent variables being considered in this analysis. This outlier-less dataset was created so that some of the bias that might exist in the original dataset could be accounted for before linear regression models are generated for the data. If outliers were to exist within the dataset being used to generate linear regression models, the validity and the statistical significance of the results drawn from the linear regression models would be more likely to be brought into question.

2. The Linear Model (Multiple Linear Regression)

Description of independent variables and the dependent variable

In this experiment, the independent variables that are being included are the population size (in thousands) from the 1970 Census, the average annual temperature (in degrees Fahrenheit), and the average number of days with precipitation (per year) for all of the individual cities that are contained within this dataset. Additionally, the dependent variable is the sulfur dioxide content of the air (in micrograms per m^3) for each corresponding city.

Description of the null hypothesis (H_0)

In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘SO2’ in this analysis) can be explained by the variation existent in any of the three treatments being considered in this experiment (which correspond to ‘Pop’, ‘Temp’, and ‘RainDays’). Therefore, the null hypothesis that is being tested states that the the population size in a given city from the 1970 Census, the average annual temperature in a given city, and average number of days with precipitation each year in a given city do not have a significant effect on the sulfur dioxide content of the air in a given city.

Multiple Linear Regression Model

In order to determine whether or not the variation that is observed in the response variable (which corresponds to ‘SO2’ in this analysis) can be explained by the variation existent in any of the three treatments being considered in this experiment (which correspond to ‘Pop’, ‘Temp’, and ‘RainDays’), we can generate a multiple linear regression model using the “lm()” function. With this multiple linear regression model, we will be able to determine if the variation in sulfur dioxide content can be explained by the variation existent in population size, average annual temperature, and average number of days with precipitation per year. Upon generating the multiple linear regression model, a regression line will be drawn through a scatter plot of the data, which allows for the model’s fit (with respect to the relationship between the independent variables in this experiment (population size, average annual temperature, and average number of days with precipitation per year) and the dependent variable in this experiment (sulfur dioxide content in the air) to be visualized graphically. (In building this multiple linear regression model, entry-wise, hierarchical, and step-wise methodologies will be used.)

###Generate three simple linear regression models (one for each independent variable) [These are simply created for reference as we delve deeper into generating multiple linear regression models.]
##Population
airpollution_model_pop <- lm(airpollution_4$SO2~airpollution_4$Pop)
summary(airpollution_model_pop)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$Pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.328 -14.307  -3.667   2.764  62.831 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        32.704677   7.472569   4.377  0.00012 ***
## airpollution_4$Pop -0.008581   0.014421  -0.595  0.55599    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.64 on 32 degrees of freedom
## Multiple R-squared:  0.01094,    Adjusted R-squared:  -0.01996 
## F-statistic: 0.3541 on 1 and 32 DF,  p-value: 0.556
#Plot the regression line of the simple linear regression model "airpollution_model_pop"
plot(y = airpollution_4$SO2,x = airpollution_4$Pop, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Population Size", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Population Size (in thousands) from the 1970 Census")
abline(airpollution_model_pop, col='black',lwd=2.5)

##Temperature
airpollution_model_temp <- lm(airpollution_4$SO2~airpollution_4$Temp)
summary(airpollution_model_temp)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.326 -12.908  -4.019   5.752  59.335 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          97.7853    28.6619   3.412  0.00177 **
## airpollution_4$Temp  -1.2624     0.5209  -2.423  0.02121 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.15 on 32 degrees of freedom
## Multiple R-squared:  0.1551, Adjusted R-squared:  0.1287 
## F-statistic: 5.873 on 1 and 32 DF,  p-value: 0.02121
#Plot the regression line of the simple linear regression model "airpollution_model_temp"
plot(y = airpollution_4$SO2,x = airpollution_4$Temp, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Temperature", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Annual Temperature (in degrees Fahrenheit)")
abline(airpollution_model_temp, col='black',lwd=2.5)

##Rainy Days
airpollution_model_raindays <- lm(airpollution_4$SO2~airpollution_4$RainDays)
summary(airpollution_model_raindays)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.116  -9.784  -5.236   6.861  62.599 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -10.4109    16.0927  -0.647   0.5223  
## airpollution_4$RainDays   0.3345     0.1349   2.479   0.0186 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.09 on 32 degrees of freedom
## Multiple R-squared:  0.1611, Adjusted R-squared:  0.1349 
## F-statistic: 6.146 on 1 and 32 DF,  p-value: 0.01863
#Plot the regression line of the simple linear regression model "airpollution_model_raindays"
plot(y = airpollution_4$SO2,x = airpollution_4$RainDays, pch=21, bg="darkviolet", main="Sulfur Dioxide Content of Air vs. Number of Days with Precipitation", ylab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", xlab = "Average Number of Days with Precipitation (per year)")
abline(airpollution_model_raindays, col='black',lwd=2.5)

For the simple linear regression analysis that is performed where ‘Pop’ is analyzed against the response variable ‘SO2’, a p-value = 0.55599 is returned, indicating that there is roughly a probability of 0.55599 that the resulting associated F-value (0.3541) is the result of solely randomization. Therefore, based on this result, we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in population size being considered in this analysis and, as such, is likely caused solely by randomization. (See above results for p-value and F-value under the heading “Population.”)

For the simple linear regression analysis that is performed where ‘Temp’ is analyzed against the response variable ‘SO2’, a p-value = 0.02121 is returned, indicating that there is roughly a probability of 0.02121 that the resulting associated F-value (5.873) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in average annual temperature being considered in this analysis and, as such, is not likely caused solely by randomization. (See above results for p-value and F-value under the heading “Temperature.”)

For the simple linear regression analysis that is performed where ‘RainDays’ is analyzed against the response variable ‘SO2’, a p-value = 0.0186 is returned, indicating that there is roughly a probability of 0.0186 that the resulting associated F-value (6.146) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in average number of days with precipitation being considered in this analysis and, as such, is not likely caused solely by randomization. (See above results for p-value and F-value under the heading “Rainy Days.”)

Upon generating individual simple linear regression models for each of the independent variables being considered in this analysis, an entry-wise multiple linear regression model is generated by simply including each of the three independent variables being considered in this analysis into one linear regression model.

###Generate an entry-wise multiple linear regression model
airpollution_model_mlr <- lm(airpollution_4$SO2~airpollution_4$Pop+airpollution_4$Temp+airpollution_4$RainDays)
#Display summary of the entry-wise multiple linear regression model
summary(airpollution_model_mlr)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$Pop + airpollution_4$Temp + 
##     airpollution_4$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.829  -9.705  -2.180   5.010  57.638 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             50.714901  41.967661   1.208    0.236
## airpollution_4$Pop      -0.006053   0.013201  -0.459    0.650
## airpollution_4$Temp     -0.856882   0.570165  -1.503    0.143
## airpollution_4$RainDays  0.236599   0.148002   1.599    0.120
## 
## Residual standard error: 17.93 on 30 degrees of freedom
## Multiple R-squared:  0.2269, Adjusted R-squared:  0.1496 
## F-statistic: 2.936 on 3 and 30 DF,  p-value: 0.04931

Upon generating this entry-wise multiple linear regression model where ‘Pop’, ‘Temp’, and ‘RainDays’ are all analyzed against the response variable ‘SO2’, p-values equal to 0.650, 0.143, and 0.120 [respectively] for each of these dependent variables are returned, indicating that there is roughly a probability of 0.650, 0.143, and 0.120 (for each of these independent variables, respectively) that the resulting model’s associated F-value (2.936) is the result of solely randomization. Therefore, based on this entry-wise model’s yielded results (and its p-value of 0.04931), we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in population size, average annual temperature, or average number of days with precipitation (per year) being considered in this analysis and, as such, is likely caused solely by randomization. (See above results for corresponding p-values and F-value.)

Next, a step-wise multiple linear regression model is generated, which corresponds to a linear regression model that grows to include more independent variables as a model’s adjusted R-squared value increases with the addition of another independent variable. The process for how this model is generated is laid out in the R-code below:

###Generate a step-wise multiple linear regression model.
#Generate a correlation matrix for the dataset, which will offer some insight into determining both the amount of variance in 'SO2' that can be explained by each of the independent variables being considered in this analysis and whether any existence of suppression is likely to exist within a linear regression model comprised of this data.
cor(airpollution_4)
##                 SO2         Pop        Temp    RainDays
## SO2       1.0000000 -0.10461443 -0.39379451  0.40138362
## Pop      -0.1046144  1.00000000  0.06933643 -0.04329827
## Temp     -0.3937945  0.06933643  1.00000000 -0.42754387
## RainDays  0.4013836 -0.04329827 -0.42754387  1.00000000
#Upon generating this correlation matrix, it's clear that (with a value of 0.4014) 'RainDays' is most highly correlated with 'SO2'. With a value of -0.3938, 'Temp' is the second-most highly correlated variable [against 'SO2']. Lastly, with a value of -0.1046, 'Pop' is the least highly correlated variable [against 'SO2']. As far as suppression is concerned, with a value of -0.4275, it appears that some suppression may exist when considering the variables 'Temp' and 'RainDays' against each other. Therefore, in any multiple linear regression model that contains these two independent variables, bouts of suppression (if really existent) could cause their respective coefficients in the model to increase/magnify in a misleading way.

##Based on the above correlation matrix, 'RainDays' will be the first independent variable to be included in the model, followed by 'Temp' and 'Pop', consecutively.
#Begin generating a step-wise multiple linear regression model with the addition of 'RainDays'.
airpollution_model_step_a <- lm(airpollution_4$SO2~airpollution_4$RainDays)
summary(airpollution_model_step_a)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.116  -9.784  -5.236   6.861  62.599 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -10.4109    16.0927  -0.647   0.5223  
## airpollution_4$RainDays   0.3345     0.1349   2.479   0.0186 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.09 on 32 degrees of freedom
## Multiple R-squared:  0.1611, Adjusted R-squared:  0.1349 
## F-statistic: 6.146 on 1 and 32 DF,  p-value: 0.01863
#Next, add 'Temp' to the step-wise multiple linear regression model.
airpollution_model_step_b <- lm(airpollution_4$SO2~airpollution_4$RainDays + airpollution_4$Temp)
summary(airpollution_model_step_b)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays + airpollution_4$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.994  -9.890  -3.435   5.882  59.277 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              48.5996    41.1785   1.180    0.247
## airpollution_4$RainDays   0.2376     0.1461   1.627    0.114
## airpollution_4$Temp      -0.8716     0.5620  -1.551    0.131
## 
## Residual standard error: 17.7 on 31 degrees of freedom
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.1713 
## F-statistic: 4.411 on 2 and 31 DF,  p-value: 0.02062
#Since the value of adjusted R-squared increased from 0.1349 to 0.1713 with the addition of 'Temp', 'Temp' will stand to be included in this step-wise linear regression model and 'Pop' will be the last variable to be included in it.
airpollution_model_step_c <- lm(airpollution_4$SO2~airpollution_4$RainDays + airpollution_4$Temp+airpollution_4$Pop)
summary(airpollution_model_step_c)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays + airpollution_4$Temp + 
##     airpollution_4$Pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.829  -9.705  -2.180   5.010  57.638 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             50.714901  41.967661   1.208    0.236
## airpollution_4$RainDays  0.236599   0.148002   1.599    0.120
## airpollution_4$Temp     -0.856882   0.570165  -1.503    0.143
## airpollution_4$Pop      -0.006053   0.013201  -0.459    0.650
## 
## Residual standard error: 17.93 on 30 degrees of freedom
## Multiple R-squared:  0.2269, Adjusted R-squared:  0.1496 
## F-statistic: 2.936 on 3 and 30 DF,  p-value: 0.04931
#Since the value of adjusted R-squared decreased from 0.1713 to 0.1496 with the addition of 'Pop', 'Pop' will be removed from the final version of the step-wise linear regression model. (In this version of the model, the p-value associated with 'Pop' is equal to 0.650, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in population size [see further explanation below].)
airpollution_model_step_final <- lm(airpollution_4$SO2~airpollution_4$RainDays + airpollution_4$Temp)
summary(airpollution_model_step_final)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays + airpollution_4$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.994  -9.890  -3.435   5.882  59.277 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              48.5996    41.1785   1.180    0.247
## airpollution_4$RainDays   0.2376     0.1461   1.627    0.114
## airpollution_4$Temp      -0.8716     0.5620  -1.551    0.131
## 
## Residual standard error: 17.7 on 31 degrees of freedom
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.1713 
## F-statistic: 4.411 on 2 and 31 DF,  p-value: 0.02062

Upon generating this entry-wise multiple linear regression model where ‘RainDays’, and ‘Temp’ are both analyzed against the response variable ‘SO2’, p-values equal to 0.114 and 0.131 [respectively] for each of these dependent variables are returned, indicating that there is roughly a probability of 0.114 and 0.131 (for each of these independent variables, respectively) that the resulting model’s associated F-value (4.411) is the result of solely randomization. Therefore, based on this step-wise model’s yielded results (and its p-value of 0.02062), we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in population size, average annual temperature, or average number of days with precipitation (per year) being considered in this analysis and, as such, is likely caused solely by randomization. (See above results for corresponding p-values and F-value.)

Lastly, a hierarchical multiple linear regression model is generated in this analysis, which is developed in part by using theory to determine the most logical way (in an intuitive sense) to build a linear regression model for this data. According to the 2005 Air Quality Fact Sheet created by the Australian Government’s Department of the Environment and Heritage, “about 99% of the sulfur dioxide in air comes from human sources.” Furthermore, “the main source of sulfur dioxide in the air is industrial activity that processes materials that contain sulfur, eg. the generation of electricity from coal, oil or gas that contains sulfur.” For these reasons, one might infer that the ranking of the three independent variables being considered in this analysis (on the basis of selecting the factors that likely most highly contribute to the Sulfur Dioxide content of the air) would correspond to the following order: ‘Pop’, ‘RainDays’, and ‘Temp’. In taking this hierarchical ordering into account, each individual independent variable will be included into a hierarchical multiple linear regression model “one-at-a-time” until all three independent variables have been included in the model. [Reference: http://www.environment.gov.au/protection/publications/factsheet-sulfur-dioxide-so2]

###Generate a hierarchical multiple linear regression model
#Begin by generating a linear regression model that includes 'Pop'
airpollution_model_hier_a <- lm(airpollution_4$SO2~airpollution_4$Pop)
summary(airpollution_model_hier_a)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$Pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.328 -14.307  -3.667   2.764  62.831 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        32.704677   7.472569   4.377  0.00012 ***
## airpollution_4$Pop -0.008581   0.014421  -0.595  0.55599    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.64 on 32 degrees of freedom
## Multiple R-squared:  0.01094,    Adjusted R-squared:  -0.01996 
## F-statistic: 0.3541 on 1 and 32 DF,  p-value: 0.556
#In the above version of the model, the p-value associated with 'Pop' is equal to 0.55599, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in population size [see further explanation below]. For this reason, 'Pop' will be removed from further consideration, and 'RainDays' will be considered next.
airpollution_model_hier_b <- lm(airpollution_4$SO2~airpollution_4$RainDays)
summary(airpollution_model_hier_b)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.116  -9.784  -5.236   6.861  62.599 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -10.4109    16.0927  -0.647   0.5223  
## airpollution_4$RainDays   0.3345     0.1349   2.479   0.0186 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.09 on 32 degrees of freedom
## Multiple R-squared:  0.1611, Adjusted R-squared:  0.1349 
## F-statistic: 6.146 on 1 and 32 DF,  p-value: 0.01863
#Based on the above version of the model, the p-value associated with 'RainDays' is equal to 0.0186, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air can be explained by the variation in average number of days with precipitation (per year) [see further explanation below]. For this reason, 'RainDays' will not be removed from further consideration, and 'Temp' will be added to this hierarchical model.
airpollution_model_hier_c <- lm(airpollution_4$SO2~airpollution_4$RainDays + airpollution_4$Temp)
summary(airpollution_model_hier_c)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays + airpollution_4$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.994  -9.890  -3.435   5.882  59.277 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              48.5996    41.1785   1.180    0.247
## airpollution_4$RainDays   0.2376     0.1461   1.627    0.114
## airpollution_4$Temp      -0.8716     0.5620  -1.551    0.131
## 
## Residual standard error: 17.7 on 31 degrees of freedom
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.1713 
## F-statistic: 4.411 on 2 and 31 DF,  p-value: 0.02062
#Based on the above version of the model, the p-value associated with 'RainDays' is equal to 0.114 and the p-value associated with 'Temp' is equal to 0.131, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in average number of days with precipitation (per year) or the average annual temperature [see further explanation below]. While the nature of variable-significance differs in this model with comparison to any simple linear regression model that respectively considers 'RainDays' and 'Temp' (since the simple linear regression models for both of these independent variables suggests that 'RainDays' and 'Temp' are both significant), the value of adjusted R-squared in the hierarchical multiple linear regression model that contains both of these independent variables is greater than either of these respective simple linear regression models (and is greater than the multiple linear regression model that contains all three independent variables being included in this analysis). Therefore, in moving forward with the selection of the best hierarchical multiple linear regression model, the version that includes both 'RainDays' and 'Temp' is the strongest model that can be generated using this data.
airpollution_model_hier_final <- lm(airpollution_4$SO2~airpollution_4$RainDays + airpollution_4$Temp)
summary(airpollution_model_hier_final)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays + airpollution_4$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.994  -9.890  -3.435   5.882  59.277 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              48.5996    41.1785   1.180    0.247
## airpollution_4$RainDays   0.2376     0.1461   1.627    0.114
## airpollution_4$Temp      -0.8716     0.5620  -1.551    0.131
## 
## Residual standard error: 17.7 on 31 degrees of freedom
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.1713 
## F-statistic: 4.411 on 2 and 31 DF,  p-value: 0.02062

Upon walking through these three different linear-regression-model-generation techniques (entry-wise, step-wise, and hierarchical multiple linear regression), it’s clear that, for the data being included in this analysis (which excludes all of the outliers that were discovered from the generation of box plots), the most statistically significant multiple linear regression model where ‘SO2’ is the dependent variable includes the independent variables ‘RainDays’ and ‘Temp’. This model, its summary statistics, and other relevant graphical displays necessary to carry out a thorough interpretation of the model are shown in the R-code below:

#Generate the final resulting mutiple linear regression model
airpollution_model_final <- lm(airpollution_4$SO2~airpollution_4$RainDays + airpollution_4$Temp)
summary(airpollution_model_final)
## 
## Call:
## lm(formula = airpollution_4$SO2 ~ airpollution_4$RainDays + airpollution_4$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.994  -9.890  -3.435   5.882  59.277 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              48.5996    41.1785   1.180    0.247
## airpollution_4$RainDays   0.2376     0.1461   1.627    0.114
## airpollution_4$Temp      -0.8716     0.5620  -1.551    0.131
## 
## Residual standard error: 17.7 on 31 degrees of freedom
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.1713 
## F-statistic: 4.411 on 2 and 31 DF,  p-value: 0.02062
#Generate a 3-dimensional plot that graphically displays the dependent variable 'SO2' against the independent variables 'RainDays' and 'Temp'
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.2
airpollution_model_final_plot <- scatterplot3d(,z = airpollution_4$SO2, x = airpollution_4$RainDays, y = airpollution_4$Temp, pch=21, bg="darkviolet", main = "Regression Plane: 'SO2' vs. 'RainDays' and 'Temp'", xlab = "Average Number of Days with Precipitation (per year)", ylab = "Average Annual Temperature (in degrees Fahrenheit)", zlab = "Sulfur Dioxide Content of Air (in micrograms per m^3)", axis=TRUE)
airpollution_model_final_plot$plane3d(airpollution_model_final,col='red')

95% confidence intervals

Below, the numerical values associated with the 95% confidence intervals are generated, which serve to represent the fact that for 95% of the possible samples of data, the true mean value fall between those confidence interval levels.

#Generate numerical 95% confidence intervals
confint(airpollution_model_final, level=0.95)
##                                2.5 %      97.5 %
## (Intercept)             -35.38456887 132.5837517
## airpollution_4$RainDays  -0.06032065   0.5355740
## airpollution_4$Temp      -2.01772531   0.2745381

The interpreted the results of the statistical analysis (b_0, b_1, b_2, and r)

For the multiple linear regression analysis that is performed where ‘RainDays’, and ‘Temp’ are both analyzed against the response variable ‘SO2’, p-values equal to 0.114 and 0.131 [respectively] for each of these dependent variables are returned, indicating that there is roughly a probability of 0.114 and 0.131 (for each of these independent variables, respectively) that the resulting model’s associated F-value (4.411) is the result of solely randomization. Therefore, based on this step-wise model’s yielded results (and its p-value of 0.02062), we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the content of Sulfur Dioxide in the air cannot be explained by the variation in population size, average annual temperature, or average number of days with precipitation (per year) being considered in this analysis and, as such, is likely caused solely by randomization. (See above summary results for corresponding p-values and F-value.)

In further analyzing the results of the simple linear regression analysis, it’s important to note that the value of b_0 (the linear model’s y-intercept) is 48.5996,the value of b_1 (which represents both the slope of the linear model and the coefficient associated with the variable ‘RainDays’ in the linear model) is 0.2376, and the value of b_2 (which represents both the slope of the linear model and the coefficient associated with the variable ‘Temp’ in the linear model) is -0.8716. These values indicate the relationship between the independent variables corresponding to average annual temperature and average number of days with precipitation per year and the dependent variable corresponding to the Sulfur Dioxide content in the air, which corresponds to the idea that an increase in one unit of “average annual temperature” (in degrees Fahrenheit) results in a decrease in 0.8716 units of “Sulfur Dioxide content in the air” and that an increase in one unit of “average number of days with precipitation per year” results in an increase in 0.2376 units of “Sulfur Dioxide content in the air.”

Additionally, it’s important to note the metrics that measures the correlation between the independent variables corresponding to average annual temperature and average number of days with precipitation per year and the dependent variable corresponding to the Sulfur Dioxide content in the air, which are multiple R-squared and adjusted R-squared (since we want to take into account any bias that might be associated with the number of explanatory variables being included in the model, this analysis emphasis the value of adjusted R-squared rather than the value of multiple R-squared). Since the value of adjusted R-squared is 0.1713, it can be inferred that the variation that exists in the independent variables corresponding to average annual temperature and average number of days with precipitation per year can explain approximately 17.13% of the variation existent in the dependent variable corresponding to the Sulfur Dioxide content in the air. Although the final model’s value of adjusted R-squared here is the largest value that was attained throughout the process of selecting the most statistically significant multiple linear regression model, it would not lead one to put much faith in asserting that the independent variables being considered in this analysis (‘RainDays’ and ‘Temp’) explain a significant portion of the variation existent in the dependent variable being considered here. However, if more data were collected for these government publications over a longer period of time than the years from 1969-1971, one might be able to better generate a model that can offer some stronger insight into the determination of the most statistically significant independent variables to take into account when trying to explain the variation that is existent in the dependent variable corresponding to the Sulfur Dioxide content in the air.

3. Diagnostics/Model Adequacy Checking

In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can both create a Normal Quantile-Quantile (QQ) Plot of our data, histograms for our data, and perform a Shapiro-Wilk Test of Normality on our data.

#Create a Normal Q-Q Plot for the data pertaining to Sulfur Dioxide Content in the Air.
qqnorm(airpollution_4[,"SO2"], main = "Normal Q-Q Plot of Sulfur Dioxide Content in the Air")
qqline(airpollution_4[,"SO2"])

#Generate histograms for all of the different variables being considered in this analysis ('SO2', 'Pop', 'Temp', and 'RainDays')
hist(airpollution_4$SO2, xlab = "Sulfur Dioxide Content in the Air (in micrograms per m^3)", main = "Histogram of Sulfur Dioxide Content in the Air")

hist(airpollution_4$Pop, xlab = "Population Size (in thousands) from the 1970 Census", main = "Histogram of Population Size")

hist(airpollution_4$Temp, xlab = "Average Annual Temperature (in degrees Fahrenheit)", main = "Histogram of Average Annual Temperature")

hist(airpollution_4$RainDays, xlab = "Average Number of Days with Precipitation (per year)", main = "Histogram of Average Number of Days with Precipitation")

#Create a Normal Q-Q Plot of the residuals for "airpollution_model_final".
qqnorm(residuals(airpollution_model_final), main = "Normal Q-Q Plot of Residuals of 'airpollution_model_final'")
qqline(residuals(airpollution_model_final))

Shapiro-Wilk Test of Normality Results

#Perform Shapiro-Wilk Test of Normality on the Sulfur Dioxide Content in the Air (normality is assummed if p > 0.1).
shapiro.test(airpollution_4[,"SO2"])
## 
##  Shapiro-Wilk normality test
## 
## data:  airpollution_4[, "SO2"]
## W = 0.8501, p-value = 0.000281

Upon both constructing Normal Q-Q Plots, histograms, and performing Shapiro-Wilk Tests of Normality on the data in this analysis, it’s likely that we can readily assume our data does not exhibit normality. First of all, the constructed Normal Q-Q Plots did not seem to display a trend of data that aligned closely with the Normal Q-Q Line. Secondly, the histogram associated with “Sulfur Dioxide Content in the Air” exhibits significant skewness, the histogram associated with “Population Size” exhibits both skewness and kurtosis, and the histogram associated with “Average Annual Temperature” exhibits kurtosis, which would not lead one to back up their assumptions about the data being normal for those variables. (The histogram associated with “Average Number of Days with Precipitation” was the only plot that seemed to exhibit very little skewness or kurtosis.) Lastly, the fact that the resulting p-value of the Shapiro-Wilk Test of Normality for “airpollution_4[,“SO2”]” was < 0.1 would lead one to assume that the data does not exhibit normality.

In further attempting to determine the adequacy of our final model multiple linear regression model, we can generate a “quality of fit” model that plots residual error against the fitted model that was developed in our analysis.

#Create a "Quality of Fit Model" that plots the residuals of "airpollution_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(airpollution_model_final),residuals(airpollution_model_final), main = "Residuals of 'airpollution_model_final' Against Fitted Model 'airpollution_model_final'")
abline(0,0, col='darkviolet', lwd=2.5)

Because the resulting plot does not really appear to exhibit a symmetric distribution of residuals clumped around the zero line (especially considering the outliers that exist closer to the right of the plot), the multiple linear regression model that is developed here does not really suggest good fit. There, we cannot confidently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results. However, if more data were collected for these government publications over a longer period of time than the years from 1969-1971, one might be able to better generate a model that can offer some stronger insight into the determination of the most statistically significant independent variables to take into account when trying to explain the variation that is existent in the dependent variable corresponding to the Sulfur Dioxide content in the air.