This report presents an analysis answering the question "How should we predict air pollution levels?" First, let me describe how I developed the question. I started broadly, thinking about what kinds of problems I wanted to explore across many categories: social, political, business, medical, environmental, educational, and so on. Narrowing down to environmental problems, I found several datasets on topics such as air quality, air pollution, deforestation, water pollution, and global warming. I chose an air pollution dataset because it is specific and straightforward: it records the SO2 content of the air, which is the main cause of acid rain, along with several candidate predictor variables. The dataset can be found at http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html; the data were copied into a text file (.txt) and loaded from there. The question I arrived at was, "Can the variation observed in the SO2 content of the air be explained by the variation in the other factors considered in the study?" More simply, "Can the SO2 level be explained and predicted by our predictors?"
The dataset covers air pollution and related measurements for 41 U.S. cities over the years 1969 to 1971, collected from U.S. government publications. The response variable is SO2, the sulfur dioxide content of the air in micrograms per cubic meter. The predictor variables are Temp, the average annual temperature in degrees Fahrenheit; Man, the number of manufacturing enterprises employing 20 or more workers; Pop, the population in thousands from the 1970 census; Wind, the average annual wind speed in miles per hour; Rain, the average annual precipitation in inches; RainDays, the average number of days with precipitation per year; and City, the name of the city. Except for City, all variables are continuous. The City variable is kept only for reference, since knowing which city an observation comes from helps us interpret unusual values later. There are no missing values. (Because the sample size is small, I skipped splitting the data into training and test sets.)
The first and last six observations of the data are shown below to give a sense of how it looks, along with scatter plots for each pair of variables. Checking each variable's distribution, Wind and RainDays are approximately normally distributed while the others are not. In terms of skewness, SO2, Temp, Man, and Pop are right skewed, Rain and RainDays are left skewed, and Wind is roughly symmetric. (A compact version of these normality and skewness checks appears right after the skewness output below.)
Next we check the correlations between variables. There is a clear multicollinearity problem between Man and Pop: their correlation coefficient is about 0.955, which is very high. Intuitively this makes sense, since Man is the number of manufacturing enterprises with 20 or more workers and Pop is the city's population, and larger cities tend to have more large manufacturers and more job opportunities, and therefore more people. The variance inflation factors for the predictors confirm this: Man and Pop come out at about 22.05 and 17.18, respectively (the corresponding VIF code appears below). The usual practice is to drop the variable with the highest VIF, but Pop's VIF is also very large and Man has the stronger correlation with SO2, so Pop is dropped instead for the rest of the analysis. This is also sensible substantively, since the most important source of SO2 emissions here is human activity, namely manufacturing. After omitting Pop, all VIFs fall below 5, so we are good to proceed.
#setting a working directory
setwd("C:/Users/Sean/Desktop/DATS6101")
#removing all previously assigned objects
rm(list = ls())
#loading the packages used below; the original does not show its library() calls, so these are
#assumptions: corrplot for corrplot(), fmsb for VIF(), e1071 for skewness(), dplyr for mutate()
library(corrplot)
library(fmsb)
library(e1071)
library(dplyr)
#loading the air pollution dataset into R
air <- read.table('airpollution.txt', sep = "\t", header = TRUE)
#attaching data for convenience
attach(air)
#checking data structure and how it looks
str(air)
## 'data.frame': 41 obs. of 8 variables:
## $ City : Factor w/ 41 levels "Albany","Albuquerque",..: 31 20 36 12 15 41 39 18 23 3 ...
## $ SO2 : int 10 13 12 17 56 36 29 14 10 24 ...
## $ Temp : num 70.3 61 56.7 51.9 49.1 54 57.3 68.4 75.5 61.5 ...
## $ Man : int 213 91 453 454 412 80 434 136 207 368 ...
## $ Pop : int 582 132 716 515 158 80 757 529 335 497 ...
## $ Wind : num 6 8.2 8.7 9 9 9 9.3 8.8 9 9.1 ...
## $ Rain : num 7.05 48.52 20.66 12.95 43.37 ...
## $ RainDays: int 36 100 67 86 127 114 111 116 128 115 ...
head(air)
## City SO2 Temp Man Pop Wind Rain RainDays
## 1 Phoenix 10 70.3 213 582 6.0 7.05 36
## 2 Little Rock 13 61.0 91 132 8.2 48.52 100
## 3 San Francisco 12 56.7 453 716 8.7 20.66 67
## 4 Denver 17 51.9 454 515 9.0 12.95 86
## 5 Hartford 56 49.1 412 158 9.0 43.37 127
## 6 Wilmington 36 54.0 80 80 9.0 40.25 114
tail(air)
## City SO2 Temp Man Pop Wind Rain RainDays
## 36 Salt Lake City 28 51.0 137 176 8.7 15.17 89
## 37 Norfolk 31 59.3 96 308 10.6 44.68 116
## 38 Richmond 26 57.8 197 299 7.6 42.59 115
## 39 Seattle 29 51.1 379 531 9.4 38.79 164
## 40 Charleston 31 55.2 35 71 6.5 40.75 148
## 41 Milwaukee 16 45.7 569 717 11.8 29.07 123
#checking whether there are any missing values
sum(is.na(data.frame(air)))
## [1] 0
#excluding the City variable since it is just the city name, not a predictor
air1 <- air[2:8]
#attaching data for convenience
attach(air1)
## The following objects are masked from air:
##
## Man, Pop, Rain, RainDays, SO2, Temp, Wind
#basic summary of the data (means and quantiles)
summary(air1)
## SO2 Temp Man Pop
## Min. : 8.00 Min. :43.50 Min. : 35.0 Min. : 71.0
## 1st Qu.: 13.00 1st Qu.:50.60 1st Qu.: 181.0 1st Qu.: 299.0
## Median : 26.00 Median :54.60 Median : 347.0 Median : 515.0
## Mean : 30.05 Mean :55.76 Mean : 463.1 Mean : 608.6
## 3rd Qu.: 35.00 3rd Qu.:59.30 3rd Qu.: 462.0 3rd Qu.: 717.0
## Max. :110.00 Max. :75.50 Max. :3344.0 Max. :3369.0
## Wind Rain RainDays
## Min. : 6.000 Min. : 7.05 Min. : 36.0
## 1st Qu.: 8.700 1st Qu.:30.96 1st Qu.:103.0
## Median : 9.300 Median :38.74 Median :115.0
## Mean : 9.444 Mean :36.77 Mean :113.9
## 3rd Qu.:10.600 3rd Qu.:43.11 3rd Qu.:128.0
## Max. :12.700 Max. :59.80 Max. :166.0
#scatter plot matrix of all variable pairs
plot(air1)
#per-predictor VIFs for a model using all predictors (a reconstruction: the original call no
#longer ran after later package loads but had reported roughly 22 for Man and 17 for Pop;
#car::vif is assumed here and requires the car package)
car::vif(lm(SO2 ~ ., data = air1))
#per-predictor VIFs after dropping Pop; these are all expected to fall below 5
car::vif(lm(SO2 ~ . - Pop, data = air1))
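A per-predictor VIF can also be computed by hand: regress the predictor of interest on the remaining predictors and take 1/(1 - R-squared). A minimal sketch for Man, which should reproduce the figure of roughly 22 quoted above (assuming the air1 data frame created earlier):
#auxiliary regression of Man on the other predictors; its R^2 determines Man's VIF
aux.man <- lm(Man ~ Temp + Pop + Wind + Rain + RainDays, data = air1)
#VIF for Man = 1 / (1 - R^2) of the auxiliary regression
1 / (1 - summary(aux.man)$r.squared)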
#correlations among variables
cor.air <- cor(air1, method = "pearson")
corrplot(cor.air, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)
#it looks like there is a multicollinearity problem here
cor(Man, Pop, use = "complete")
## [1] 0.9552693
#due to the high correlation between the predictors Man and Pop,
#Man (which correlates more strongly with SO2) is kept and Pop is excluded
air2 <- air[c(2:4,6:8)]
attach(air2)
## The following objects are masked from air1:
##
## Man, Rain, RainDays, SO2, Temp, Wind
## The following objects are masked from air:
##
## Man, Rain, RainDays, SO2, Temp, Wind
#checking correlation coefficients again
cor.air2 <- cor(air2, method = "pearson")
corrplot(cor.air2, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)
#checking normality of distribution
shapiro.test(air1$SO2)
##
## Shapiro-Wilk normality test
##
## data: air1$SO2
## W = 0.81165, p-value = 9.723e-06
shapiro.test(air1$Man)
##
## Shapiro-Wilk normality test
##
## data: air1$Man
## W = 0.60548, p-value = 2.781e-09
shapiro.test(air1$Temp)
##
## Shapiro-Wilk normality test
##
## data: air1$Temp
## W = 0.93554, p-value = 0.02215
shapiro.test(air1$Wind)
##
## Shapiro-Wilk normality test
##
## data: air1$Wind
## W = 0.98057, p-value = 0.6973
shapiro.test(air1$Rain)
##
## Shapiro-Wilk normality test
##
## data: air1$Rain
## W = 0.94214, p-value = 0.03725
shapiro.test(air1$RainDays)
##
## Shapiro-Wilk normality test
##
## data: air1$RainDays
## W = 0.9654, p-value = 0.2419
#histograms showing the distribution of each variable
hist(SO2)
hist(Man)
hist(Temp)
hist(Wind)
hist(Rain)
hist(RainDays)
#checking the skewness of each variable
skewness(SO2)
## [1] 1.584113
skewness(Man)
## [1] 3.484603
skewness(Temp)
## [1] 0.8229757
skewness(Rain)
## [1] -0.6925181
skewness(Wind)
## [1] 0.002675131
skewness(RainDays)
## [1] -0.5500923
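As a compact alternative to the variable-by-variable checks above, the Shapiro-Wilk p-values and skewness values can be collected in one pass; this small sketch is not part of the original code and assumes the air1 data frame and the skewness() function loaded earlier:
#Shapiro-Wilk p-values for every variable; values above 0.05 (here Wind and RainDays)
#are consistent with approximate normality
sapply(air1, function(x) shapiro.test(x)$p.value)
#skewness of every variable (positive = right skewed, negative = left skewed)
sapply(air1, skewness)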
Now we apply variable selection to obtain a simpler but still effective model. With Pop removed, five predictor variables remain to consider. All three variable selection methods applied suggest the same linear regression model: SO2 ~ Man + Temp + Wind + Rain.
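The selection code itself is not shown in the original; as one illustration, a stepwise search with base R's step() (AIC criterion) can be run on the model containing all five remaining predictors. This is a sketch of one possible procedure, not necessarily one of the three methods used originally:
#stepwise search in both directions, starting from all five predictors left after dropping Pop
full5 <- lm(SO2 ~ Man + Temp + Wind + Rain + RainDays, data = air2)
step(full5, direction = "both", trace = 0)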
The initial model (SO2 regressed on Man and Pop), the reduced model suggested by variable selection, and the final model with a log-transformed response are presented below. The final model improves on the initial one, raising R-squared from roughly 59% to 67%.
#initial model: SO2 regressed on Man and Pop (before Pop is dropped)
lm.air <- lm(formula = SO2 ~ Man + Pop, data = air1)
lm.air
##
## Call:
## lm(formula = SO2 ~ Man + Pop, data = air1)
##
## Coefficients:
## (Intercept) Man Pop
## 26.32508 0.08243 -0.05661
summary(lm.air)
##
## Call:
## lm(formula = SO2 ~ Man + Pop, data = air1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.389 -12.831 -1.277 7.609 49.533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.32508 3.84044 6.855 3.87e-08 ***
## Man 0.08243 0.01470 5.609 1.96e-06 ***
## Pop -0.05661 0.01430 -3.959 0.000319 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.49 on 38 degrees of freedom
## Multiple R-squared: 0.5863, Adjusted R-squared: 0.5645
## F-statistic: 26.93 on 2 and 38 DF, p-value: 5.207e-08
#VIF() here (assumed to be fmsb::VIF) returns 1/(1 - R^2) of the whole fitted model, not per-predictor VIFs
VIF(lm.air)
## [1] 2.417329
#linear regression with a reduced model suggested by variable selection
lm.air1 <- lm(SO2~Man+Temp+Wind+Rain, data=air2)
lm.air1
##
## Call:
## lm(formula = SO2 ~ Man + Temp + Wind + Rain, data = air2)
##
## Coefficients:
## (Intercept) Man Temp Wind Rain
## 123.11833 0.02548 -1.61144 -3.63024 0.52423
summary(lm.air1)
##
## Call:
## lm(formula = SO2 ~ Man + Temp + Wind + Rain, data = air2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.374 -9.088 -3.042 7.205 58.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 123.118333 31.290702 3.935 0.000365 ***
## Man 0.025476 0.004537 5.615 2.27e-06 ***
## Temp -1.611436 0.401373 -4.015 0.000289 ***
## Wind -3.630245 1.892342 -1.918 0.063020 .
## Rain 0.524235 0.229407 2.285 0.028297 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 36 degrees of freedom
## Multiple R-squared: 0.6028, Adjusted R-squared: 0.5587
## F-statistic: 13.66 on 4 and 36 DF, p-value: 7.168e-07
VIF(lm.air1)
## [1] 2.517784
#applying log transformation on the response variable
lnSO2 <- log(SO2)
lm.air2 <- lm(lnSO2~ Man+Temp+Wind+Rain, data=air2)
lm.air2
##
## Call:
## lm(formula = lnSO2 ~ Man + Temp + Wind + Rain, data = air2)
##
## Coefficients:
## (Intercept) Man Temp Wind Rain
## 7.7628404 0.0005556 -0.0699392 -0.1803936 0.0200318
summary(lm.air2)
##
## Call:
## lm(formula = lnSO2 ~ Man + Temp + Wind + Rain, data = air2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77450 -0.26074 -0.05387 0.27927 1.14267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7628404 0.9032385 8.594 3.02e-10 ***
## Man 0.0005556 0.0001310 4.242 0.000148 ***
## Temp -0.0699392 0.0115861 -6.036 6.21e-07 ***
## Wind -0.1803936 0.0546244 -3.302 0.002172 **
## Rain 0.0200318 0.0066221 3.025 0.004568 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4501 on 36 degrees of freedom
## Multiple R-squared: 0.6303, Adjusted R-squared: 0.5892
## F-statistic: 15.35 on 4 and 36 DF, p-value: 2.053e-07
VIF(lm.air2)
## [1] 2.705063
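To compare the three fits side by side, the R-squared values can be pulled straight from the model summaries (a small convenience sketch; note that lm.air2 models log(SO2), so its R-squared is on the log scale and the comparison is only indicative):
#multiple R-squared for the initial, reduced, and log-transformed models
sapply(list(initial = lm.air, reduced = lm.air1, log.model = lm.air2),
       function(m) summary(m)$r.squared)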
For the model we built, we need to check whether anything is wrong with it. The first plot shows the studentized residuals, which are used to detect and identify outliers; values outside the range -2 to 2 deserve attention. Observation 31 has a value above 2, which flags it as a possible outlier.
The second plot shows leverage, which highlights influential points. High-leverage points may be outliers, but they are not automatically outliers to be excluded from the model. Five observations exceed the leverage cut-off of 0.2, but only observations 1 and 11 have extreme values; the others are comparatively small.
The third plot shows the Cook's distance of each observation. Cook's distance measures how much the fit changes when a given observation is deleted, so observations with large Cook's distances are also candidates for exclusion. Here we look at observations with Cook's distance above roughly 0.1 (4 divided by the number of observations, 4/41 ≈ 0.098). Again only observations 1 and 11 stand out; the other two flagged points are comparatively small. Since not every influential point is an outlier, we are left with observations 1, 11, and 31 to examine. Observation 1 is Phoenix, and neither its SO2 value nor its other values are extreme. Observation 11 is Chicago, whose values simply describe a very large city rather than reflecting a mistake in data collection. Observation 31 is Providence, which records an SO2 concentration of 94 μg/m³, comparable to a much larger city such as Chicago (110 μg/m³), despite a population of only about 179,000. We therefore treat it as an outlier.
Our final model fits better than the original model: after the log transformation, the Q-Q plot of the residuals shows a roughly linear trend.
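The flagged observations can also be identified programmatically using the same cut-offs (|studentized residual| > 2, leverage > 0.2, Cook's distance > 4/n); a short sketch assuming the lm.air2 fit from above:
n <- nrow(air2)
which(abs(rstudent(lm.air2)) > 2)        #possible outliers by studentized residual
which(hatvalues(lm.air2) > 0.2)          #high-leverage observations
which(cooks.distance(lm.air2) > 4 / n)   #influential observations by Cook's distance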
#raw residuals from the initial model (SO2 ~ Man + Pop), shown for reference
lm.air$residuals
## 1 2 3 4 5 6
## -0.9385022 -13.3545143 -11.1374002 -17.5977616 4.6559128 7.6087174
## 7 8 9 10 11 12
## 9.7497184 6.4087731 -14.4257288 -4.5273483 -1.2770499 14.1447347
## 13 14 15 16 17 18
## -6.5203019 -12.9493160 13.2543110 -13.7066548 20.3825835 6.6108308
## 19 20 21 22 23 24
## -12.8312027 -15.0329255 0.9978004 -7.6031631 -5.3050404 22.6141825
## 25 26 27 28 29 30
## -21.3479570 -15.7668438 -1.8246579 8.3150134 13.5793048 35.5057196
## 31 32 33 34 35 36
## 49.5326041 -8.7828527 -5.6347010 -22.3893648 -5.9641236 0.3442079
## 37 38 39 40 41
## 14.1960775 0.3607744 1.4905012 5.8087922 -16.6431487
#studentized residuals from the log-transformed model
r <- rstudent(lm.air2)
plot(r, ylab="Residual", xlab="Observation Number",main="Studentized Residuals", pch=24, bg="green")
points(31,r[31], pch=24, bg='red')
abline(h=2, col="red",lty=5, lwd=2)
#checking leverage
lev <- hat(model.matrix(lm.air2))
plot(lev, ylab="Leverage", xlab="Observation Number",main="Leverage", pch=24, bg="green")
points(c(1,9,11,14,23),lev[c(1,9,11,14,23)], pch=24, bg='red')
abline(h=.2, col="red", lty=5, lwd=2)
air[lev>0.2,]
## City SO2 Temp Man Pop Wind Rain RainDays
## 1 Phoenix 10 70.3 213 582 6.0 7.05 36
## 9 Miami 10 75.5 207 335 9.0 59.80 128
## 11 Chicago 110 50.6 3344 3369 10.4 34.44 122
## 14 Wichita 8 56.6 125 277 12.7 30.58 82
## 23 Albuquerque 11 56.8 46 244 8.9 7.77 58
#checking Cook's distances
cook <- cooks.distance(lm.air2)
plot(cook,ylab="Cook's Distance" , xlab="Observation Number", main="Cook's Distance",pch=24, bg="green")
points(c(1,11,25,31),cook[c(1,11,25,31)], col='red',pch=24, bg='red')
abline(h=.1, col="red", lty=5, lwd=2)
plot(Man, r, xlab= "Man", ylab="Studentized Residual", main="# of Manufacturing Enterprises vs. Residual")
plot(Temp, r, xlab= "Temp", ylab="Studentized Residual", main="Average Temperature in F vs. Residual")
plot(Rain, r, xlab= "Rain", ylab="Studentized Residual", main="Average Precipitation vs. Residual")
plot(Wind, r, xlab= "Wind", ylab="Studentized Residual", main="Average Wind Speed vs. Residual")
#removing the outlier, observation 31
airfinal <- air2[-31,]
airfinal
## SO2 Temp Man Wind Rain RainDays
## 1 10 70.3 213 6.0 7.05 36
## 2 13 61.0 91 8.2 48.52 100
## 3 12 56.7 453 8.7 20.66 67
## 4 17 51.9 454 9.0 12.95 86
## 5 56 49.1 412 9.0 43.37 127
## 6 36 54.0 80 9.0 40.25 114
## 7 29 57.3 434 9.3 38.89 111
## 8 14 68.4 136 8.8 54.47 116
## 9 10 75.5 207 9.0 59.80 128
## 10 24 61.5 368 9.1 48.34 115
## 11 110 50.6 3344 10.4 34.44 122
## 12 28 52.3 361 9.7 38.74 121
## 13 17 49.0 104 11.2 30.85 103
## 14 8 56.6 125 12.7 30.58 82
## 15 30 55.6 291 8.3 43.11 123
## 16 9 68.3 204 8.4 56.77 113
## 17 47 55.0 625 9.6 41.31 111
## 18 35 49.9 1064 10.1 30.96 129
## 19 29 43.5 699 10.6 25.94 137
## 20 14 54.5 381 10.0 37.00 99
## 21 56 55.9 775 9.5 35.89 105
## 22 14 51.5 181 10.9 30.18 98
## 23 11 56.8 46 8.9 7.77 58
## 24 46 47.6 44 8.8 33.36 135
## 25 11 47.1 391 12.4 36.11 166
## 26 23 54.0 462 7.1 39.04 132
## 27 65 49.7 1007 10.9 34.99 155
## 28 26 51.5 266 8.6 37.01 134
## 29 69 54.6 1692 9.6 39.93 115
## 30 61 50.4 347 9.4 36.22 147
## 32 10 61.6 337 9.2 49.10 105
## 33 18 59.4 275 7.9 46.00 119
## 34 9 66.2 641 10.9 35.94 78
## 35 10 68.9 721 10.8 48.19 103
## 36 28 51.0 137 8.7 15.17 89
## 37 31 59.3 96 10.6 44.68 116
## 38 26 57.8 197 7.6 42.59 115
## 39 29 51.1 379 9.4 38.79 164
## 40 31 55.2 35 6.5 40.75 148
## 41 16 45.7 569 11.8 29.07 123
SO2final <- airfinal$SO2
#log-transforming the response and refitting the regression without the outlier
log.SO2 <- log(SO2final)
lm.airfinal <- lm(log.SO2~ Man+Temp+Wind+Rain, data=airfinal)
#summary of our final model
summary(lm.airfinal)
##
## Call:
## lm(formula = log.SO2 ~ Man + Temp + Wind + Rain, data = airfinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66868 -0.23369 -0.05061 0.29197 0.84018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6772137 0.8235887 9.322 5.16e-11 ***
## Man 0.0005839 0.0001197 4.877 2.33e-05 ***
## Temp -0.0655090 0.0106682 -6.141 5.04e-07 ***
## Wind -0.1928899 0.0499626 -3.861 0.000466 ***
## Rain 0.0176780 0.0060889 2.903 0.006353 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4101 on 35 degrees of freedom
## Multiple R-squared: 0.6683, Adjusted R-squared: 0.6303
## F-statistic: 17.63 on 4 and 35 DF, p-value: 5.22e-08
#checking and comparing the residual distributions of the original and final models
qqnorm(lm.air$residuals)
qqline(lm.air$residuals)
hist(lm.air$residuals, xlab= "Residuals", main="Histogram of Residual")
qqnorm(lm.airfinal$residuals)
qqline(lm.airfinal$residuals)
hist(lm.airfinal$residuals, xlab= "Residuals", main="Histogram of Residual")
The final model has a mean squared error of about 155.87 (essentially the error variance on the original scale), a root mean squared error of about 12.48 μg/m³, and an R-squared of about 67%. Given that the average SO2 level in the data is only about 30 μg/m³, this is still not a strong predictive model.
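The numbers above can be reproduced compactly (a condensed version of the step-by-step calculation that follows; the object names are chosen so they do not clash with those used below):
#back-transform the log-scale fitted values and compare with the observed SO2 values
obs.SO2  <- airfinal$SO2
pred.SO2 <- exp(predict(lm.airfinal))
mse.air  <- mean((obs.SO2 - pred.SO2)^2)
c(MSE = mse.air, RMSE = sqrt(mse.air))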
log.predictions <- predict(lm.airfinal)
#checking normality for SO2 and log of SO2
shapiro.test(SO2)
##
## Shapiro-Wilk normality test
##
## data: SO2
## W = 0.81165, p-value = 9.723e-06
shapiro.test(log.SO2)
##
## Shapiro-Wilk normality test
##
## data: log.SO2
## W = 0.9548, p-value = 0.111
#predicted SO2 (back-transformed from the log scale)
predictions <- exp(log.predictions)
predictions
## 1 2 3 4 5 6
## 8.702402 20.294357 18.438931 20.807779 41.762245 23.617158
## 7 8 9 10 11 12
## 21.554847 12.696215 8.787197 19.347398 136.721012 26.462017
## 13 14 15 16 17 18
## 18.411989 8.442583 28.961093 14.959977 27.597323 37.665842
## 19 20 21 22 23 24
## 38.462479 21.213603 26.306767 17.119459 11.065146 32.362023
## 25 26 27 28 29 30
## 21.468352 41.684960 33.970798 31.634198 51.552206 30.123953
## 32 33 34 35 36 37
## 18.766082 25.429790 9.465680 10.520848 20.212212 13.380636
## 38 39 40 41
## 26.917402 30.679144 34.748931 25.881153
#observed SO2
air$SO2
## [1] 10 13 12 17 56 36 29 14 10 24 110 28 17 8 30 9 47
## [18] 35 29 14 56 14 11 46 11 23 65 26 69 61 94 10 18 9
## [35] 10 28 31 26 29 31 16
#building a table of observed and predicted SO2
lm_real_pred <- data.frame(predictions, airfinal$SO2)
lm_real_pred
## predictions airfinal.SO2
## 1 8.702402 10
## 2 20.294357 13
## 3 18.438931 12
## 4 20.807779 17
## 5 41.762245 56
## 6 23.617158 36
## 7 21.554847 29
## 8 12.696215 14
## 9 8.787197 10
## 10 19.347398 24
## 11 136.721012 110
## 12 26.462017 28
## 13 18.411989 17
## 14 8.442583 8
## 15 28.961093 30
## 16 14.959977 9
## 17 27.597323 47
## 18 37.665842 35
## 19 38.462479 29
## 20 21.213603 14
## 21 26.306767 56
## 22 17.119459 14
## 23 11.065146 11
## 24 32.362023 46
## 25 21.468352 11
## 26 41.684960 23
## 27 33.970798 65
## 28 31.634198 26
## 29 51.552206 69
## 30 30.123953 61
## 32 18.766082 10
## 33 25.429790 18
## 34 9.465680 9
## 35 10.520848 10
## 36 20.212212 28
## 37 13.380636 31
## 38 26.917402 26
## 39 30.679144 29
## 40 34.748931 31
## 41 25.881153 16
#adding the prediction error (observed minus predicted)
lm_real_pred <- mutate(lm_real_pred,airfinal$SO2-predictions)
head(lm_real_pred)
## predictions airfinal.SO2 airfinal$SO2 - predictions
## 1 8.702402 10 1.297598
## 2 20.294357 13 -7.294357
## 3 18.438931 12 -6.438931
## 4 20.807779 17 -3.807779
## 5 41.762245 56 14.237755
## 6 23.617158 36 12.382842
tail(lm_real_pred)
## predictions airfinal.SO2 airfinal$SO2 - predictions
## 35 20.21221 28 7.7877877
## 36 13.38064 31 17.6193637
## 37 26.91740 26 -0.9174018
## 38 30.67914 29 -1.6791441
## 39 34.74893 31 -3.7489305
## 40 25.88115 16 -9.8811533
names(lm_real_pred) <- c("predicted","Observed","error")
#mean squared error and root mean squared error (its square root)
mse <- data.frame(mean((lm_real_pred$error)^2))
print(mse)
## mean..lm_real_pred.error..2.
## 1 155.872
rmse_air2 <- sqrt(mse)
print(rmse_air2)
## mean..lm_real_pred.error..2.
## 1 12.48487
#overall VIF, 1/(1 - R^2), for the final model
VIF(lm.airfinal)
## [1] 3.014371
To improve on this analysis, we could obtain a dataset with a larger sample size, drawing on more nationwide research and including cities that were not covered here.
Since the data were collected more than 40 years ago, we now have more advanced technology for measuring SO2 levels and for sampling, which should yield more accurate data.
More background research would also help in adding further predictor variables, such as the number of vehicles or the presence of power plants, which are primary sources of SO2 emissions.
http://lib.stat.cmu.edu/DASL/Datafiles/AirPollution.html
http://www.stat.columbia.edu/~martin/W2024/R7.pdf
http://www.environment.gov.au/protection/publications/factsheet-sulfur-dioxide-so2
http://thestatsgeek.com/2014/01/25/r-squared-and-goodness-of-fit-in-linear-regression/