In this section, we set the benchmark for model selection and explanatory data analysis. We uses the dataset “meuse” in the library “sp”. We first load the data into our console and use the head() function to view the first six rows of the dataset. For detail explanation of the dataset, please refer to the link: https://docs.google.com/document/d/e/2PACX-1vTzSpJfqlmWnsU2Xe4ELARfMp9xGDAtjEXN1twUSNUR5Q4scaDcrzgNeNNbZ-ndA_SyMPhvmzBgwj96/pub
library(sp)
library(corrplot)
## corrplot 0.84 loaded
data(meuse)
head(meuse)
## x y cadmium copper lead zinc elev dist om ffreq soil
## 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6 1 1
## 2 181025 333558 8.6 81 277 1141 6.983 0.01222430 14.0 1 1
## 3 181165 333537 6.5 68 199 640 7.800 0.10302900 13.0 1 1
## 4 181298 333484 2.6 81 116 257 7.655 0.19009400 8.0 1 2
## 5 181307 333330 2.8 48 117 269 7.480 0.27709000 8.7 1 2
## 6 181390 333260 3.0 61 137 281 7.791 0.36406700 7.8 1 2
## lime landuse dist.m
## 1 1 Ah 50
## 2 1 Ah 30
## 3 1 Ah 150
## 4 0 Ga 270
## 5 0 Ah 380
## 6 0 Ga 470
In this section, we extract some basic summary information from our dataset
## Obtaining the summary information for each variable
summary(meuse)
## x y cadmium copper
## Min. :178605 Min. :329714 Min. : 0.200 Min. : 14.00
## 1st Qu.:179371 1st Qu.:330762 1st Qu.: 0.800 1st Qu.: 23.00
## Median :179991 Median :331633 Median : 2.100 Median : 31.00
## Mean :180005 Mean :331635 Mean : 3.246 Mean : 40.32
## 3rd Qu.:180630 3rd Qu.:332463 3rd Qu.: 3.850 3rd Qu.: 49.50
## Max. :181390 Max. :333611 Max. :18.100 Max. :128.00
##
## lead zinc elev dist
## Min. : 37.0 Min. : 113.0 Min. : 5.180 Min. :0.00000
## 1st Qu.: 72.5 1st Qu.: 198.0 1st Qu.: 7.546 1st Qu.:0.07569
## Median :123.0 Median : 326.0 Median : 8.180 Median :0.21184
## Mean :153.4 Mean : 469.7 Mean : 8.165 Mean :0.24002
## 3rd Qu.:207.0 3rd Qu.: 674.5 3rd Qu.: 8.955 3rd Qu.:0.36407
## Max. :654.0 Max. :1839.0 Max. :10.520 Max. :0.88039
##
## om ffreq soil lime landuse dist.m
## Min. : 1.000 1:84 1:97 0:111 W :50 Min. : 10.0
## 1st Qu.: 5.300 2:48 2:46 1: 44 Ah :39 1st Qu.: 80.0
## Median : 6.900 3:23 3:12 Am :22 Median : 270.0
## Mean : 7.478 Fw :10 Mean : 290.3
## 3rd Qu.: 9.000 Ab : 8 3rd Qu.: 450.0
## Max. :17.000 (Other):25 Max. :1000.0
## NA's :2 NA's : 1
## Next: we are interested in the number of missing values (NAs) in our dataset and then we decide how to treat the missing values (for more details regarding how to treat missing values, refer to the site: https://oscrproject.wixsite.com/website/post/data-manipulation-data-cleaning)
sum(is.na(meuse))
## [1] 3
## There are only three missing values in the dataset, we simply omit the missing rows for the purpose of this exercise
meuse <- na.omit(meuse)
## We obtain the boxplot for every variable except the x and y
boxplot(meuse[-c(1,2)])
## We find that zinc, lead and dist.m have a more spread-out distribution
## Next we obtain the correlation plot for the variables
cormat<-round(cor(meuse[-c(10,11,12,13)], use="pairwise.complete.obs"),2)
## we exclude these four columns because they are factor/categorical variables
cormat
## x y cadmium copper lead zinc elev dist om dist.m
## x 1.00 0.86 0.03 0.01 -0.16 -0.12 0.31 0.12 -0.05 0.14
## y 0.86 1.00 0.23 0.26 0.07 0.11 0.11 -0.18 0.16 -0.16
## cadmium 0.03 0.23 1.00 0.92 0.80 0.92 -0.56 -0.61 0.72 -0.62
## copper 0.01 0.26 0.92 1.00 0.82 0.91 -0.58 -0.61 0.73 -0.61
## lead -0.16 0.07 0.80 0.82 1.00 0.95 -0.58 -0.58 0.55 -0.58
## zinc -0.12 0.11 0.92 0.91 0.95 1.00 -0.59 -0.64 0.68 -0.66
## elev 0.31 0.11 -0.56 -0.58 -0.58 -0.59 1.00 0.53 -0.35 0.50
## dist 0.12 -0.18 -0.61 -0.61 -0.58 -0.64 0.53 1.00 -0.56 0.98
## om -0.05 0.16 0.72 0.73 0.55 0.68 -0.35 -0.56 1.00 -0.58
## dist.m 0.14 -0.16 -0.62 -0.61 -0.58 -0.66 0.50 0.98 -0.58 1.00
corplot.matrix<-corrplot.mixed(cormat, lower="number",upper="pie",number.digits = 2)
## From the correlation matrix, we can see that there are strong correlation between cadminum, copper, lead and zinc. There is also strong correlation between dist and dis.m, this is because they are both distribution but in different units.
## Next, because our response variable is lead and we found three variables that are likely to be significant, we draw a scatterplot with x being the other three variable and y being lead
plot(x = meuse$cadmium, y = meuse$lead)
plot(x = meuse$copper, y = meuse$lead)
plot(x = meuse$zinc, y = meuse$lead)
From these three scatter plots, we can see that there are extremely linear and positive relationships between the three variables and lead. In the following linear regression analysis, these might be extremely useful parameters.
In this section, we simply build a model using all variables to predict the level of lead in the river. This Rsquare will be regarded as the benchmark for subsequent analysis
model_bench <- lm(lead~.,data = meuse) ## notice: the "." indicates that we are using all other variables in the dataset except the response variable lead
summary(model_bench)
##
## Call:
## lm(formula = lead ~ ., data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.330 -10.755 0.244 9.086 64.421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.832e+02 1.318e+03 0.594 0.553418
## x -2.291e-03 8.331e-03 -0.275 0.783809
## y -1.038e-03 7.031e-03 -0.148 0.882832
## cadmium -1.180e+01 2.059e+00 -5.731 7.26e-08 ***
## copper -3.956e-01 3.225e-01 -1.227 0.222309
## zinc 4.427e-01 1.721e-02 25.726 < 2e-16 ***
## elev -1.027e+00 3.626e+00 -0.283 0.777465
## dist -9.036e+01 6.289e+01 -1.437 0.153310
## om -1.855e+00 1.004e+00 -1.848 0.066968 .
## ffreq2 -1.352e+01 8.368e+00 -1.616 0.108670
## ffreq3 -2.359e+01 1.108e+01 -2.129 0.035238 *
## soil2 -4.029e+00 7.016e+00 -0.574 0.566804
## soil3 -2.485e+00 1.103e+01 -0.225 0.822070
## lime1 -2.352e+01 6.935e+00 -3.391 0.000937 ***
## landuseAb 8.548e+00 1.906e+01 0.448 0.654587
## landuseAg 2.180e+00 2.036e+01 0.107 0.914875
## landuseAh 7.162e+00 1.719e+01 0.417 0.677598
## landuseAm 7.322e+00 1.743e+01 0.420 0.675256
## landuseB -3.199e+00 2.218e+01 -0.144 0.885542
## landuseBw 1.183e+01 2.085e+01 0.567 0.571505
## landuseDEN -2.456e+01 3.001e+01 -0.819 0.414646
## landuseFh 2.169e+01 2.865e+01 0.757 0.450588
## landuseFw 1.577e+00 1.855e+01 0.085 0.932428
## landuseGa 6.300e+01 2.673e+01 2.357 0.020019 *
## landuseSPO -1.184e+01 2.948e+01 -0.402 0.688625
## landuseSTA 1.366e+01 2.415e+01 0.565 0.572810
## landuseTv 3.089e+01 2.877e+01 1.074 0.285076
## landuseW -8.839e-01 1.762e+01 -0.050 0.960071
## dist.m 9.603e-02 5.309e-02 1.809 0.072911 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.63 on 123 degrees of freedom
## Multiple R-squared: 0.9664, Adjusted R-squared: 0.9587
## F-statistic: 126.2 on 28 and 123 DF, p-value: < 2.2e-16
From the summary, we can see that the Rsquared is extremely high — 0.9664. This indicates that using the predictors in our dataset, we are able to explain 96.64% of the variations in the response variable lead. However, a closer inspection of the summary reveals that only a few of the variables are significant (*** indication at the end of the row). We might need to conduct feature selection later in our analysis.