The problem wants us to plot the residuals versus the fitted values and interpret the resulting plot. In order to do so, we need to first build our model and obtain the fitted values and the residuals.
Using the ASWELLS data set, I imported the table and briefly looked at the raw data. I noticed the DEPTH variable was in terms of factors, which needs to be changed or else these factors will all be predictors in our model. I changed the values in DEPTH from factors to numerics.
aswell <- read.table("/Users/seankrinik/Documents/CSULB/STAT510/MT1/ASWELLS.txt",
header=T)
str(aswell)
## 'data.frame': 328 obs. of 4 variables:
## $ LATITUDE : num 23.8 23.8 23.8 23.8 23.8 ...
## $ LONGITUDE: num 90.7 90.7 90.7 90.7 90.6 ...
## $ DEPTH : Factor w/ 45 levels "*","100","105",..: 37 32 32 7 13 37 36 33 37 37 ...
## $ ARSENIC : int 331 302 193 232 19 33 2 1 23 111 ...
aswell$DEPTH <- as.double(aswell$DEPTH)
Now that the data set is cleaned and ready to go, I build the model as requested in the problem. The question calls for arsenic level as a function of latitude, longitude and depth. therefore out model will look like \[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_{\text{lat}} + \hat{\beta_2}X_{\text{lon}} + \hat{\beta_3}X_{\text{dep}}\] Below, I create the model I will use to extract the fitted values and residuals.
model2 <- lm(ARSENIC~LATITUDE+LONGITUDE+DEPTH,aswell)
Now we need to plot the fitted (predicted) values against the residuals for the model. Our assumption for this model entails the residuals being of constant variance, so there should no visible patterns in the plot. I the residuals vs. fitted by extracting the values (top), and I also have included the residual diagnostic plot from the plot.lm capability (bottom). The diagnostic plot shows the trend curve as well which helps in visualization.
plot(model2$fitted.values,model2$residuals)
plot(model2, which = c(1,1))
Unfortunately, you can see a downward linear trend and also an exponential trend as the fitted values increase. This could indicate the response is likely a function of its mean. A square-root transformation might help fix some of the residual errors and allow us to fulfill the equal variance assumption.
Using the ASWELLS data set again, the problem is now asking for us to look for outliers, and further determine if any of the outliers are influential points. To begin, we need to look at the studentized residuals and find any that have values larger than 3 (since the residuals are standardized, values greater than 3 are three standard deviations from average). These will be our outliers.
Below, I take a look over the studentized residuals and find those who have values over 3. We end up with 10 data points of interest.
std.res <- rstudent(model2)
outliers <- std.res[which(std.res>=3)];outliers
## 105 189 199 215 235 307 310 312
## 3.767053 3.156988 4.710071 3.216332 3.805881 3.543353 3.746283 3.021516
## 322 327
## 4.098881 4.793430
Now that we have our outliers, we need to determine the cutoff for influence using leverage. Any leverage points with values larger than \(h_{ii} > \frac{2(k+1)}{n}\) or in our case, \[ \begin{aligned} \frac{2(k+1)}{n} &= \frac{2(3+1)}{328}\\ \\ &= 0.02439024 \end{aligned} \] will be considered influential. Let’s gather the influential points from the hat matrix and see if any of these points match our outliers.
To do so, I obtained the hat matrix values as a vector and compared the data point numbers to those in our outlier vector above to see if any match. Firstly, we need the hat matrix values and our influential point cut off value:
hat_values <- hatvalues(model2); head(hat_values)
## 1 2 3 4 5 6
## 0.012587286 0.012794117 0.012438282 0.038380540 0.010343870 0.005079992
influential <- 8/length(aswell$LATITUDE);influential
## [1] 0.02439024
Our influential points:
lvg_points <- hat_values[which(hat_values >= influential)];lvg_points
## 4 14 22 28 29 30
## 0.03838054 0.02930059 0.02590719 0.02695893 0.02698915 0.03105630
## 31 40 117 167 168 169
## 0.02849673 0.04158686 0.03106229 0.02489322 0.02499345 0.02465182
## 170 171 199 207 290 294
## 0.02460604 0.02868424 0.02700739 0.02737627 0.03168799 0.02643552
Now that we have our influential points, we can compare the data points of our outliers to the influential points and see if any match:
lvg_points[which(names(lvg_points) %in% names(outliers))]
## 199
## 0.02700739
From the above result, we see that data point 199 is an outlier and is an influential point. A quick check using the residual diagnostic plots for our model confirm shows that data point 199 is marked as an outlier of interest:
plot(model2, which = c(5,5))
Notice point 199 has high values in the x and y directions.
As a suggestion, we can remove this point from our data set since our sample size is large. This is probably the cheapest and easiest option. Another alternative is to investigate why the arsenic level recorded is so high (possible error?).
If we reevaluate our model without data point 199, our Multiple \(R^2\) jumps from \(0.1481\) to \(0.1731\) which ultimately doesn’t do much for the poor model fit. The issue of poor model fit lies within further residual analysis.