How should we explain and predict SO2 level?

DATS6101 - Final Project

Question Development

This report is to give an analysis answering “How we should predict Air Pollution Level? So first, let me talk about, how I developed my question. Started with very basic and broad thinking to select what kinds of problem that I want to explore, there were many categories, such as social, politics, business, medical, environmental, education related problems, etc. Specifically narrowing down to environmental problem, I found several datasets related to this topic such as Air Quality, Air Pollution, Deforestation, Water Pollution, Global Warming data. But I chose Air pollution data which was very specific and straightforward about SO2 content in the air which is the main cause of acid rain and other predictor variables. The dataset can be found in (http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html)]. A text file (.txt) is load after data is copied into it. Thus, a question I came out with was”Can the variation observed in SO2 content of air be explained by the variation existent in any other factors considered in the experiment?" Simply, “Can the SO2 level be explained and predicted by our predictors?”

Data Description

The dataset is on air pollution and related values for 41 U.S.cities over years 1969 to 1971, collected from U.S. government publication. Our response variable is SO2 - Sulfur dioxide content of air in microgram per cubic meter. Our predictor variables are Temp - Average annual temperature in degrees Fahrenheit, Man - Number of manufacturing enterprises employing 20 or more workers, Pop - population size in thousands from the 1970 census, Wind - Average annual wind speed in miles per hour, Rain - Average annual precipitation in inches, RainDays - Average number of days raining, City - Names of cities. Except City variable, all other variables are continuous. City variable is used for future reference of reasoning for possible problem understanding. It may help us to understand the context better since we know some characteristics of the cities. There is no missing values.(since sample size is small, I skipped test and training part.)

Exploratory Data Analysis

First and last 5 observations from the data is presented to see how it looks. Also plots for each combination of two variables are presented too. After checking each variable’s distribution, Wind and Raindays variables are approximately normally distributed and others are not. In addition, Rain and RainDays are left skewed while SO2, Temp, Man, Pop and Wind variables are right skewed.

Now we check the correlation between variables. we see that there is a high multicollinearity problem between Man and Pop variable. Those two have correlation coefficient of .95 which is very high. Intuitively, it makes sense because Man variable means number of manufacturing enterprises with more than 20 workers and Pop is population size of the city, and bigger city tends to have large manufacturing enterprises meaning more job opportunities with more population. When I checked Variance Inflation Factors for predictor variables, man and pop were 22.05 and 17.18 correspondingly. Usually we omit the one with highest VIF, pop is still huge in VIF and Man variable has a higher correlation with SO2, so pop variable will be ignored instead for the rest of analysis. Most significant factor we can assume for emission of SO2 is human source which is here manufacturing factories so it makes sense too. After omitting Pop variable, VIF are all below 5 meaning we are good to go.

#setting a working directory
setwd("C:/Users/Sean/Desktop/DATS6101")

#removing all previous assigned data
rm(list=ls())

#loading airpollution dataset to R
air <-  read.table('airpollution.txt', sep="\t", header=TRUE)

#attaching data for convenience
attach(air)

#checking data structure and how it looks
str(air)

## 'data.frame':    41 obs. of  8 variables:
##  $ City    : Factor w/ 41 levels "Albany","Albuquerque",..: 31 20 36 12 15 41 39 18 23 3 ...
##  $ SO2     : int  10 13 12 17 56 36 29 14 10 24 ...
##  $ Temp    : num  70.3 61 56.7 51.9 49.1 54 57.3 68.4 75.5 61.5 ...
##  $ Man     : int  213 91 453 454 412 80 434 136 207 368 ...
##  $ Pop     : int  582 132 716 515 158 80 757 529 335 497 ...
##  $ Wind    : num  6 8.2 8.7 9 9 9 9.3 8.8 9 9.1 ...
##  $ Rain    : num  7.05 48.52 20.66 12.95 43.37 ...
##  $ RainDays: int  36 100 67 86 127 114 111 116 128 115 ...

head(air)

##            City SO2 Temp Man Pop Wind  Rain RainDays
## 1       Phoenix  10 70.3 213 582  6.0  7.05       36
## 2   Little Rock  13 61.0  91 132  8.2 48.52      100
## 3 San Francisco  12 56.7 453 716  8.7 20.66       67
## 4        Denver  17 51.9 454 515  9.0 12.95       86
## 5      Hartford  56 49.1 412 158  9.0 43.37      127
## 6    Wilmington  36 54.0  80  80  9.0 40.25      114

tail(air)

##              City SO2 Temp Man Pop Wind  Rain RainDays
## 36 Salt Lake City  28 51.0 137 176  8.7 15.17       89
## 37        Norfolk  31 59.3  96 308 10.6 44.68      116
## 38       Richmond  26 57.8 197 299  7.6 42.59      115
## 39        Seattle  29 51.1 379 531  9.4 38.79      164
## 40     Charleston  31 55.2  35  71  6.5 40.75      148
## 41      Milwaukee  16 45.7 569 717 11.8 29.07      123

#checking if there is any missing values
sum(is.na(data.frame(air)))

## [1] 0

#excluding City variable since it is the name of the city
air1 <- air[2:8]

#attaching data for convenience
attach(air1)

## The following objects are masked from air:
## 
##     Man, Pop, Rain, RainDays, SO2, Temp, Wind

#basic summary about data with mean and quantiles
summary(air1)

##       SO2              Temp            Man              Pop        
##  Min.   :  8.00   Min.   :43.50   Min.   :  35.0   Min.   :  71.0  
##  1st Qu.: 13.00   1st Qu.:50.60   1st Qu.: 181.0   1st Qu.: 299.0  
##  Median : 26.00   Median :54.60   Median : 347.0   Median : 515.0  
##  Mean   : 30.05   Mean   :55.76   Mean   : 463.1   Mean   : 608.6  
##  3rd Qu.: 35.00   3rd Qu.:59.30   3rd Qu.: 462.0   3rd Qu.: 717.0  
##  Max.   :110.00   Max.   :75.50   Max.   :3344.0   Max.   :3369.0  
##       Wind             Rain          RainDays    
##  Min.   : 6.000   Min.   : 7.05   Min.   : 36.0  
##  1st Qu.: 8.700   1st Qu.:30.96   1st Qu.:103.0  
##  Median : 9.300   Median :38.74   Median :115.0  
##  Mean   : 9.444   Mean   :36.77   Mean   :113.9  
##  3rd Qu.:10.600   3rd Qu.:43.11   3rd Qu.:128.0  
##  Max.   :12.700   Max.   :59.80   Max.   :166.0

#plots between variables
plot(air1)

#VIF
#it used to work before loading last 4 packages and actually got the values for each predictor variables,
#but somehow it doesn't work now. I do not know what happened. it gave Man with 22 and pop with 17.
#correlations among variables
cor.air <- cor(air1, method = "pearson")
corrplot(cor.air, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

#it looks like there is a multicollinearty problem here
cor(Man, Pop, use = "complete")

## [1] 0.9552693

#due to high correlatoin between Man and Pop which are predictor variables.
#Man with higher correlation with SO2 remains. Pop is excluded.
air2 <-  air[c(2:4,6:8)]
attach(air2)

## The following objects are masked from air1:
## 
##     Man, Rain, RainDays, SO2, Temp, Wind

## The following objects are masked from air:
## 
##     Man, Rain, RainDays, SO2, Temp, Wind

#checking correlation coefficients again
cor.air2 <- cor(air2, method = "pearson")
corrplot(cor.air2, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

#checking normality of distribution
shapiro.test(air1$SO2)

## 
##  Shapiro-Wilk normality test
## 
## data:  air1$SO2
## W = 0.81165, p-value = 9.723e-06

shapiro.test(air1$Man)

## 
##  Shapiro-Wilk normality test
## 
## data:  air1$Man
## W = 0.60548, p-value = 2.781e-09

shapiro.test(air1$Temp)

## 
##  Shapiro-Wilk normality test
## 
## data:  air1$Temp
## W = 0.93554, p-value = 0.02215

shapiro.test(air1$Wind)

## 
##  Shapiro-Wilk normality test
## 
## data:  air1$Wind
## W = 0.98057, p-value = 0.6973

shapiro.test(air1$Rain)

## 
##  Shapiro-Wilk normality test
## 
## data:  air1$Rain
## W = 0.94214, p-value = 0.03725

shapiro.test(air1$RainDays)

## 
##  Shapiro-Wilk normality test
## 
## data:  air1$RainDays
## W = 0.9654, p-value = 0.2419

#histograms of distribution of each variables
hist(SO2)

hist(Man)

hist(Temp)

hist(Wind)

hist(Rain)

hist(RainDays)

#checking skewness of each variables
skewness(SO2)

## [1] 1.584113

skewness(Man)

## [1] 3.484603

skewness(Temp)

## [1] 0.8229757

skewness(Rain)

## [1] -0.6925181

skewness(Wind)

## [1] 0.002675131

skewness(RainDays)

## [1] -0.5500923

Variable Selection

Now we are to apply variable selection method to obtain simpler and effective model. For our data now, without Pop variable, there are 5 predictor variables are left to consider. After applying variable selection techniques, all three variable selection methods suggest building a linear regression model of SO2 ~ Man+ Temp +Wind + Rain.

Linear Regression Models

The full model, reduced model, and final model with log transformation are presented below. Final model has been improved from original model as 60% to 67% R-squared.

#linear regression with a full model
lm.air <-lm(formula = SO2~Man+Pop, data=air1)
lm.air

## 
## Call:
## lm(formula = SO2 ~ Man + Pop, data = air1)
## 
## Coefficients:
## (Intercept)          Man          Pop  
##    26.32508      0.08243     -0.05661

summary(lm.air)

## 
## Call:
## lm(formula = SO2 ~ Man + Pop, data = air1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.389 -12.831  -1.277   7.609  49.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.32508    3.84044   6.855 3.87e-08 ***
## Man          0.08243    0.01470   5.609 1.96e-06 ***
## Pop         -0.05661    0.01430  -3.959 0.000319 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.49 on 38 degrees of freedom
## Multiple R-squared:  0.5863, Adjusted R-squared:  0.5645 
## F-statistic: 26.93 on 2 and 38 DF,  p-value: 5.207e-08

VIF(lm.air)

## [1] 2.417329

#linear regression with a reduced model suggested by variable selection
lm.air1 <- lm(SO2~Man+Temp+Wind+Rain, data=air2)
lm.air1

## 
## Call:
## lm(formula = SO2 ~ Man + Temp + Wind + Rain, data = air2)
## 
## Coefficients:
## (Intercept)          Man         Temp         Wind         Rain  
##   123.11833      0.02548     -1.61144     -3.63024      0.52423

summary(lm.air1)

## 
## Call:
## lm(formula = SO2 ~ Man + Temp + Wind + Rain, data = air2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.374  -9.088  -3.042   7.205  58.785 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 123.118333  31.290702   3.935 0.000365 ***
## Man           0.025476   0.004537   5.615 2.27e-06 ***
## Temp         -1.611436   0.401373  -4.015 0.000289 ***
## Wind         -3.630245   1.892342  -1.918 0.063020 .  
## Rain          0.524235   0.229407   2.285 0.028297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.59 on 36 degrees of freedom
## Multiple R-squared:  0.6028, Adjusted R-squared:  0.5587 
## F-statistic: 13.66 on 4 and 36 DF,  p-value: 7.168e-07

VIF(lm.air1)

## [1] 2.517784

#applying log transformation on the response variable
lnSO2 <- log(SO2)
lm.air2 <- lm(lnSO2~ Man+Temp+Wind+Rain, data=air2)
lm.air2

## 
## Call:
## lm(formula = lnSO2 ~ Man + Temp + Wind + Rain, data = air2)
## 
## Coefficients:
## (Intercept)          Man         Temp         Wind         Rain  
##   7.7628404    0.0005556   -0.0699392   -0.1803936    0.0200318

summary(lm.air2)

## 
## Call:
## lm(formula = lnSO2 ~ Man + Temp + Wind + Rain, data = air2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77450 -0.26074 -0.05387  0.27927  1.14267 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.7628404  0.9032385   8.594 3.02e-10 ***
## Man          0.0005556  0.0001310   4.242 0.000148 ***
## Temp        -0.0699392  0.0115861  -6.036 6.21e-07 ***
## Wind        -0.1803936  0.0546244  -3.302 0.002172 ** 
## Rain         0.0200318  0.0066221   3.025 0.004568 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4501 on 36 degrees of freedom
## Multiple R-squared:  0.6303, Adjusted R-squared:  0.5892 
## F-statistic: 15.35 on 4 and 36 DF,  p-value: 2.053e-07

VIF(lm.air2)

## [1] 2.705063

Validating the final model and checking goodness of fit

For the model we built, we need to check if there is any problem in the model. First plot shows studentized residuals. Studentized residual is used for dectecting and identifying outliers and we normally look at values not in range of -2 to 2.We see that observation 31 has value more than 2 which indicates possible outlier of the data.

The second plot shows leverage and it shows influential points which might be an outliers but they are not always outliers that can be excluded from the model. There are total of 5 observations are considered influential points. But only observation 1 and 11 have has extreme values and the others are relatively small.

The third plot shows cook’s distance of the each observation. Cook’s distance measures the effect of deleting a given observation. Thus cook’s distance with large values are expected to be excluded from the model too. here we are interested in the observations with cook’s distance value of more than 0.1 (4 divided by # of Obs.) .Here too only observation 1 and 11 have high value and the other two are relatively small. Regarding not all the influential points are outliers, we are left with observation 1 and 11 and 31. Observation 1 is phoenix city and SO2 value and other variable values were not extreme. And observation 11 is Chicago city such The observation is just a description of the city rather than a mistake in data collection. Obs. #31 - Providence is a city which has high concentration of SO2 although it has a small population, 179,000 people but sill has high SO2 levels of big cities like Chicago, 110 μg / m3 . Thus, We treat it as an outlier.

Our final model fits better than original model that q-q plot shows linear trend after log transformation.

#checking studentized residuals
lm.air$residuals

##           1           2           3           4           5           6 
##  -0.9385022 -13.3545143 -11.1374002 -17.5977616   4.6559128   7.6087174 
##           7           8           9          10          11          12 
##   9.7497184   6.4087731 -14.4257288  -4.5273483  -1.2770499  14.1447347 
##          13          14          15          16          17          18 
##  -6.5203019 -12.9493160  13.2543110 -13.7066548  20.3825835   6.6108308 
##          19          20          21          22          23          24 
## -12.8312027 -15.0329255   0.9978004  -7.6031631  -5.3050404  22.6141825 
##          25          26          27          28          29          30 
## -21.3479570 -15.7668438  -1.8246579   8.3150134  13.5793048  35.5057196 
##          31          32          33          34          35          36 
##  49.5326041  -8.7828527  -5.6347010 -22.3893648  -5.9641236   0.3442079 
##          37          38          39          40          41 
##  14.1960775   0.3607744   1.4905012   5.8087922 -16.6431487

r <- rstudent(lm.air2)
plot(r, ylab="Residual", xlab="Observation Number",main="Studentized Residuals", pch=24, bg="green")
points(31,r[31], pch=24, bg='red')
abline(h=2, col="red",lty=5, lwd=2)

#checking leverage
lev <- hat(model.matrix(lm.air2))
plot(lev, ylab="Leverage", xlab="Observation Number",main="Leverage", pch=24, bg="green")
points(c(1,9,11,14,23),lev[c(1,9,11,14,23)], pch=24, bg='red')
abline(h=.2, col="red", lty=5, lwd=2)

air[lev>0.2,]

##           City SO2 Temp  Man  Pop Wind  Rain RainDays
## 1      Phoenix  10 70.3  213  582  6.0  7.05       36
## 9        Miami  10 75.5  207  335  9.0 59.80      128
## 11     Chicago 110 50.6 3344 3369 10.4 34.44      122
## 14     Wichita   8 56.6  125  277 12.7 30.58       82
## 23 Albuquerque  11 56.8   46  244  8.9  7.77       58

#checking cook's distances
cook <- cooks.distance(lm.air2)
plot(cook,ylab="Cook's Distance" , xlab="Observation Number", main="Cook's Distance",pch=24, bg="green")
points(c(1,11,25,31),cook[c(1,11,25,31)], col='red',pch=24, bg='red')
abline(h=.1, col="red", lty=5, lwd=2)

plot(Man, r, xlab= "Man", ylab="Studentized Residual", main="# of Manufacturing Enterprises vs. Residual")

plot(Temp,r, xlab= "Temp", ylab="Studentized Residual", main="Average Temperature in F` vs Residual")

plot(Rain, r, xlab= "Rain", ylab="Studentized Residual", main="Average Precipitation vs. Residual")

plot(Wind, r, xlab= "Wind", ylab="Studentized Residual", main="Average Wind Speed vs. Residual")

#get rid off an outlier #31
airfinal <- air2[-31,]
airfinal

##    SO2 Temp  Man Wind  Rain RainDays
## 1   10 70.3  213  6.0  7.05       36
## 2   13 61.0   91  8.2 48.52      100
## 3   12 56.7  453  8.7 20.66       67
## 4   17 51.9  454  9.0 12.95       86
## 5   56 49.1  412  9.0 43.37      127
## 6   36 54.0   80  9.0 40.25      114
## 7   29 57.3  434  9.3 38.89      111
## 8   14 68.4  136  8.8 54.47      116
## 9   10 75.5  207  9.0 59.80      128
## 10  24 61.5  368  9.1 48.34      115
## 11 110 50.6 3344 10.4 34.44      122
## 12  28 52.3  361  9.7 38.74      121
## 13  17 49.0  104 11.2 30.85      103
## 14   8 56.6  125 12.7 30.58       82
## 15  30 55.6  291  8.3 43.11      123
## 16   9 68.3  204  8.4 56.77      113
## 17  47 55.0  625  9.6 41.31      111
## 18  35 49.9 1064 10.1 30.96      129
## 19  29 43.5  699 10.6 25.94      137
## 20  14 54.5  381 10.0 37.00       99
## 21  56 55.9  775  9.5 35.89      105
## 22  14 51.5  181 10.9 30.18       98
## 23  11 56.8   46  8.9  7.77       58
## 24  46 47.6   44  8.8 33.36      135
## 25  11 47.1  391 12.4 36.11      166
## 26  23 54.0  462  7.1 39.04      132
## 27  65 49.7 1007 10.9 34.99      155
## 28  26 51.5  266  8.6 37.01      134
## 29  69 54.6 1692  9.6 39.93      115
## 30  61 50.4  347  9.4 36.22      147
## 32  10 61.6  337  9.2 49.10      105
## 33  18 59.4  275  7.9 46.00      119
## 34   9 66.2  641 10.9 35.94       78
## 35  10 68.9  721 10.8 48.19      103
## 36  28 51.0  137  8.7 15.17       89
## 37  31 59.3   96 10.6 44.68      116
## 38  26 57.8  197  7.6 42.59      115
## 39  29 51.1  379  9.4 38.79      164
## 40  31 55.2   35  6.5 40.75      148
## 41  16 45.7  569 11.8 29.07      123

SO2final <-  airfinal$SO2

#applying log transformation on regression model without the outlier
log.SO2 <- log(SO2final)
lm.airfinal <- lm(log.SO2~ Man+Temp+Wind+Rain, data=airfinal)

#summary of our final model
summary(lm.airfinal)

## 
## Call:
## lm(formula = log.SO2 ~ Man + Temp + Wind + Rain, data = airfinal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66868 -0.23369 -0.05061  0.29197  0.84018 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.6772137  0.8235887   9.322 5.16e-11 ***
## Man          0.0005839  0.0001197   4.877 2.33e-05 ***
## Temp        -0.0655090  0.0106682  -6.141 5.04e-07 ***
## Wind        -0.1928899  0.0499626  -3.861 0.000466 ***
## Rain         0.0176780  0.0060889   2.903 0.006353 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4101 on 35 degrees of freedom
## Multiple R-squared:  0.6683, Adjusted R-squared:  0.6303 
## F-statistic: 17.63 on 4 and 35 DF,  p-value: 5.22e-08

#checking and comparing distribution of residuals between original and final model
qqnorm(lm.air$residuals)
qqline(lm.air$residuals)

hist(lm.air$residuals, xlab= "Residuals", main="Histogram of Residual")

qqnorm(lm.airfinal$residuals)
qqline(lm.airfinal$residuals)

hist(lm.airfinal$residuals, xlab= "Residuals", main="Histogram of Residual")

Prediction

Mean square error (essentially Variance) with 155.82 and Standard Deviation of 12,48 and R squared = 67%. Overall, this is not a good predictive model.

log.predictions <- predict(lm.airfinal)

#checking normality for SO2 and log of SO2
shapiro.test(SO2)

## 
##  Shapiro-Wilk normality test
## 
## data:  SO2
## W = 0.81165, p-value = 9.723e-06

shapiro.test(log.SO2)

## 
##  Shapiro-Wilk normality test
## 
## data:  log.SO2
## W = 0.9548, p-value = 0.111

#predicted SO2
predictions <- exp(log.predictions)
predictions

##          1          2          3          4          5          6 
##   8.702402  20.294357  18.438931  20.807779  41.762245  23.617158 
##          7          8          9         10         11         12 
##  21.554847  12.696215   8.787197  19.347398 136.721012  26.462017 
##         13         14         15         16         17         18 
##  18.411989   8.442583  28.961093  14.959977  27.597323  37.665842 
##         19         20         21         22         23         24 
##  38.462479  21.213603  26.306767  17.119459  11.065146  32.362023 
##         25         26         27         28         29         30 
##  21.468352  41.684960  33.970798  31.634198  51.552206  30.123953 
##         32         33         34         35         36         37 
##  18.766082  25.429790   9.465680  10.520848  20.212212  13.380636 
##         38         39         40         41 
##  26.917402  30.679144  34.748931  25.881153

#observed SO2
air$SO2

##  [1]  10  13  12  17  56  36  29  14  10  24 110  28  17   8  30   9  47
## [18]  35  29  14  56  14  11  46  11  23  65  26  69  61  94  10  18   9
## [35]  10  28  31  26  29  31  16

#making table that shows observed and predicted SO2
lm_real_pred <- data.frame(predictions, airfinal$SO2)
lm_real_pred

##    predictions airfinal.SO2
## 1     8.702402           10
## 2    20.294357           13
## 3    18.438931           12
## 4    20.807779           17
## 5    41.762245           56
## 6    23.617158           36
## 7    21.554847           29
## 8    12.696215           14
## 9     8.787197           10
## 10   19.347398           24
## 11  136.721012          110
## 12   26.462017           28
## 13   18.411989           17
## 14    8.442583            8
## 15   28.961093           30
## 16   14.959977            9
## 17   27.597323           47
## 18   37.665842           35
## 19   38.462479           29
## 20   21.213603           14
## 21   26.306767           56
## 22   17.119459           14
## 23   11.065146           11
## 24   32.362023           46
## 25   21.468352           11
## 26   41.684960           23
## 27   33.970798           65
## 28   31.634198           26
## 29   51.552206           69
## 30   30.123953           61
## 32   18.766082           10
## 33   25.429790           18
## 34    9.465680            9
## 35   10.520848           10
## 36   20.212212           28
## 37   13.380636           31
## 38   26.917402           26
## 39   30.679144           29
## 40   34.748931           31
## 41   25.881153           16

lm_real_pred <- mutate(lm_real_pred,airfinal$SO2-predictions)

## Warning: package 'bindrcpp' was built under R version 3.4.4

head(lm_real_pred)

##   predictions airfinal.SO2 airfinal$SO2 - predictions
## 1    8.702402           10                   1.297598
## 2   20.294357           13                  -7.294357
## 3   18.438931           12                  -6.438931
## 4   20.807779           17                  -3.807779
## 5   41.762245           56                  14.237755
## 6   23.617158           36                  12.382842

tail(lm_real_pred)

##    predictions airfinal.SO2 airfinal$SO2 - predictions
## 35    20.21221           28                  7.7877877
## 36    13.38064           31                 17.6193637
## 37    26.91740           26                 -0.9174018
## 38    30.67914           29                 -1.6791441
## 39    34.74893           31                 -3.7489305
## 40    25.88115           16                 -9.8811533

names(lm_real_pred) <- c("predicted","Observed","error")

#mean square error(variance here) and standard deviation
mse <- data.frame(mean((lm_real_pred$error)^2))
print(mse)

##   mean..lm_real_pred.error..2.
## 1                      155.872

rmse_air2 <- sqrt(mse)
print(rmse_air2)

##   mean..lm_real_pred.error..2.
## 1                     12.48487

#checking VIFs for final model
VIF(lm.airfinal)

## [1] 3.014371

Possible Improvement by…

we may obtain a dataset that has a larger sample size. More nationwide research and information about other cities that were not considered or included in the data.
Regarding the fact that the data was collected more than 40 years ago, we may hav more advanced technology for more accurate measuring of SO2 level and sampling for the data now.
we need more of background research may help to add more predictor variables in the data. size and more predictor variables to analyze such as number of the vehicles or other power plants that are primarily related to the amount of SO2 emission.

References

http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html http://www.stat.columbia.edu/~martin/W2024/R7.pdf http://www.environment.gov.au/protection/publications/factsheet-sulfur-dioxide-so2 http://thestatsgeek.com/2014/01/25/r-squared-and-goodness-of-fit-in-linear-regression/ http://www.stat.columbia.edu/~martin/W2024/R7.pdf