EDA & Benchmark

In this section, we set the benchmark for model selection and explanatory data analysis. We uses the dataset “meuse” in the library “sp”. We first load the data into our console and use the head() function to view the first six rows of the dataset. For detail explanation of the dataset, please refer to the link: https://docs.google.com/document/d/e/2PACX-1vTzSpJfqlmWnsU2Xe4ELARfMp9xGDAtjEXN1twUSNUR5Q4scaDcrzgNeNNbZ-ndA_SyMPhvmzBgwj96/pub

library(sp)
library(corrplot)

## corrplot 0.84 loaded

data(meuse)
head(meuse)

##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2
##   lime landuse dist.m
## 1    1      Ah     50
## 2    1      Ah     30
## 3    1      Ah    150
## 4    0      Ga    270
## 5    0      Ah    380
## 6    0      Ga    470

Part One: Explanatory Data Analysis

In this section, we extract some basic summary information from our dataset

## Obtaining the summary information for each variable
summary(meuse)

##        x                y             cadmium           copper      
##  Min.   :178605   Min.   :329714   Min.   : 0.200   Min.   : 14.00  
##  1st Qu.:179371   1st Qu.:330762   1st Qu.: 0.800   1st Qu.: 23.00  
##  Median :179991   Median :331633   Median : 2.100   Median : 31.00  
##  Mean   :180005   Mean   :331635   Mean   : 3.246   Mean   : 40.32  
##  3rd Qu.:180630   3rd Qu.:332463   3rd Qu.: 3.850   3rd Qu.: 49.50  
##  Max.   :181390   Max.   :333611   Max.   :18.100   Max.   :128.00  
##                                                                     
##       lead            zinc             elev             dist        
##  Min.   : 37.0   Min.   : 113.0   Min.   : 5.180   Min.   :0.00000  
##  1st Qu.: 72.5   1st Qu.: 198.0   1st Qu.: 7.546   1st Qu.:0.07569  
##  Median :123.0   Median : 326.0   Median : 8.180   Median :0.21184  
##  Mean   :153.4   Mean   : 469.7   Mean   : 8.165   Mean   :0.24002  
##  3rd Qu.:207.0   3rd Qu.: 674.5   3rd Qu.: 8.955   3rd Qu.:0.36407  
##  Max.   :654.0   Max.   :1839.0   Max.   :10.520   Max.   :0.88039  
##                                                                     
##        om         ffreq  soil   lime       landuse       dist.m      
##  Min.   : 1.000   1:84   1:97   0:111   W      :50   Min.   :  10.0  
##  1st Qu.: 5.300   2:48   2:46   1: 44   Ah     :39   1st Qu.:  80.0  
##  Median : 6.900   3:23   3:12           Am     :22   Median : 270.0  
##  Mean   : 7.478                         Fw     :10   Mean   : 290.3  
##  3rd Qu.: 9.000                         Ab     : 8   3rd Qu.: 450.0  
##  Max.   :17.000                         (Other):25   Max.   :1000.0  
##  NA's   :2                              NA's   : 1

## Next: we are interested in the number of missing values (NAs) in our dataset and then we decide how to treat the missing values (for more details regarding how to treat missing values, refer to the site: https://oscrproject.wixsite.com/website/post/data-manipulation-data-cleaning)
sum(is.na(meuse))

## [1] 3

## There are only three missing values in the dataset, we simply omit the missing rows for the purpose of this exercise
meuse <- na.omit(meuse)
## We obtain the boxplot for every variable except the x and y 
boxplot(meuse[-c(1,2)])

## We find that zinc, lead and dist.m have a more spread-out distribution

## Next we obtain the correlation plot for the variables
cormat<-round(cor(meuse[-c(10,11,12,13)], use="pairwise.complete.obs"),2) 
## we exclude these four columns because they are factor/categorical variables
cormat

##             x     y cadmium copper  lead  zinc  elev  dist    om dist.m
## x        1.00  0.86    0.03   0.01 -0.16 -0.12  0.31  0.12 -0.05   0.14
## y        0.86  1.00    0.23   0.26  0.07  0.11  0.11 -0.18  0.16  -0.16
## cadmium  0.03  0.23    1.00   0.92  0.80  0.92 -0.56 -0.61  0.72  -0.62
## copper   0.01  0.26    0.92   1.00  0.82  0.91 -0.58 -0.61  0.73  -0.61
## lead    -0.16  0.07    0.80   0.82  1.00  0.95 -0.58 -0.58  0.55  -0.58
## zinc    -0.12  0.11    0.92   0.91  0.95  1.00 -0.59 -0.64  0.68  -0.66
## elev     0.31  0.11   -0.56  -0.58 -0.58 -0.59  1.00  0.53 -0.35   0.50
## dist     0.12 -0.18   -0.61  -0.61 -0.58 -0.64  0.53  1.00 -0.56   0.98
## om      -0.05  0.16    0.72   0.73  0.55  0.68 -0.35 -0.56  1.00  -0.58
## dist.m   0.14 -0.16   -0.62  -0.61 -0.58 -0.66  0.50  0.98 -0.58   1.00

corplot.matrix<-corrplot.mixed(cormat, lower="number",upper="pie",number.digits = 2)

## From the correlation matrix, we can see that there are strong correlation between cadminum, copper, lead and zinc. There is also strong correlation between dist and dis.m, this is because they are both distribution but in different units. 

## Next, because our response variable is lead and we found three variables that are likely to be significant, we draw a scatterplot with x being the other three variable and y being lead

plot(x = meuse$cadmium, y = meuse$lead)

plot(x = meuse$copper, y = meuse$lead)

plot(x = meuse$zinc, y = meuse$lead)

From these three scatter plots, we can see that there are extremely linear and positive relationships between the three variables and lead. In the following linear regression analysis, these might be extremely useful parameters.

Part Two: Setting Benchmark

In this section, we simply build a model using all variables to predict the level of lead in the river. This Rsquare will be regarded as the benchmark for subsequent analysis

model_bench <- lm(lead~.,data = meuse) ## notice: the "." indicates that we are using all other variables in the dataset except the response variable lead
summary(model_bench)

## 
## Call:
## lm(formula = lead ~ ., data = meuse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.330 -10.755   0.244   9.086  64.421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.832e+02  1.318e+03   0.594 0.553418    
## x           -2.291e-03  8.331e-03  -0.275 0.783809    
## y           -1.038e-03  7.031e-03  -0.148 0.882832    
## cadmium     -1.180e+01  2.059e+00  -5.731 7.26e-08 ***
## copper      -3.956e-01  3.225e-01  -1.227 0.222309    
## zinc         4.427e-01  1.721e-02  25.726  < 2e-16 ***
## elev        -1.027e+00  3.626e+00  -0.283 0.777465    
## dist        -9.036e+01  6.289e+01  -1.437 0.153310    
## om          -1.855e+00  1.004e+00  -1.848 0.066968 .  
## ffreq2      -1.352e+01  8.368e+00  -1.616 0.108670    
## ffreq3      -2.359e+01  1.108e+01  -2.129 0.035238 *  
## soil2       -4.029e+00  7.016e+00  -0.574 0.566804    
## soil3       -2.485e+00  1.103e+01  -0.225 0.822070    
## lime1       -2.352e+01  6.935e+00  -3.391 0.000937 ***
## landuseAb    8.548e+00  1.906e+01   0.448 0.654587    
## landuseAg    2.180e+00  2.036e+01   0.107 0.914875    
## landuseAh    7.162e+00  1.719e+01   0.417 0.677598    
## landuseAm    7.322e+00  1.743e+01   0.420 0.675256    
## landuseB    -3.199e+00  2.218e+01  -0.144 0.885542    
## landuseBw    1.183e+01  2.085e+01   0.567 0.571505    
## landuseDEN  -2.456e+01  3.001e+01  -0.819 0.414646    
## landuseFh    2.169e+01  2.865e+01   0.757 0.450588    
## landuseFw    1.577e+00  1.855e+01   0.085 0.932428    
## landuseGa    6.300e+01  2.673e+01   2.357 0.020019 *  
## landuseSPO  -1.184e+01  2.948e+01  -0.402 0.688625    
## landuseSTA   1.366e+01  2.415e+01   0.565 0.572810    
## landuseTv    3.089e+01  2.877e+01   1.074 0.285076    
## landuseW    -8.839e-01  1.762e+01  -0.050 0.960071    
## dist.m       9.603e-02  5.309e-02   1.809 0.072911 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.63 on 123 degrees of freedom
## Multiple R-squared:  0.9664, Adjusted R-squared:  0.9587 
## F-statistic: 126.2 on 28 and 123 DF,  p-value: < 2.2e-16

From the summary, we can see that the Rsquared is extremely high — 0.9664. This indicates that using the predictors in our dataset, we are able to explain 96.64% of the variations in the response variable lead. However, a closer inspection of the summary reveals that only a few of the variables are significant (*** indication at the end of the row). We might need to conduct feature selection later in our analysis.

EDA & Benchmark

Huizi Yu

10/26/2019

Part One: Explanatory Data Analysis

Part Two: Setting Benchmark