The slides for the Lab 7 session are now available to review here.

REVIEW OF PREVIOUS CLASSES

Exploratory Data Analysis (EDA)

As good Social/Data Scientists, we must always look at the data to identify potential problems. Here are some of the main things you should be on the lookout for:

  • Do we have a set of sensible variables?
  • Are there abnormalities in the data?
  • Which candidate variables should be considered for the model?
  • Do we need to create new variables?
  • Do we need to transform some variables?

Residual Diagnosis

  1. USE residual plots (an important diagnostic tool!) to test the “Linearity” and “Homoscedasticity” assumptions
  2. USE a histogram of the residuals to test the “Normality of Residuals” assumption
  3. USE Added Variable Plots to avoid omitted variable bias (see the sketch below)
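
For the added-variable plots, the avPlots() function in the car package can be used. A minimal sketch, to be run after fitting the lm1 model below:

library(car)
avPlots(lm1)   # one added-variable (partial regression) plot per predictor in lm1
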
#install.packages("olsrr")
library(olsrr)
#install.packages("rlang")
library(rlang)
#install.packages("Hmisc")
library(Hmisc)
library(car)

Create A New Dataset With No Missing Observations

good<-read.csv("good.csv")
goodR<-na.omit(good[,c("CHID","readss97","AGE97","faminc97","HOME97","bthwht")])
attach(goodR)

Run the following regression

lm1<-lm(readss97~AGE97 + faminc97 + bthwht, data=goodR)
summary(lm1)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + faminc97 + bthwht, data = goodR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.923  -9.416  -0.372   9.376  63.690 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.356e+01  9.844e-01  95.043  < 2e-16 ***
## AGE97        7.155e-01  1.200e-01   5.960 2.98e-09 ***
## faminc97     8.727e-05  6.299e-06  13.856  < 2e-16 ***
## bthwht      -3.502e+00  7.294e-01  -4.801 1.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.06 on 1960 degrees of freedom
## Multiple R-squared:  0.1174, Adjusted R-squared:  0.1161 
## F-statistic: 86.94 on 3 and 1960 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm1)

Based on the diagnostics in Lab 6, it looks like there may be problems with the variables faminc97 and AGE97. These problems can largely be fixed by re-specifying our model: we log-transform the faminc97 variable, center AGE97 (i.e. subtract a constant, 8 in the code below, so that 0 corresponds to an 8-year-old), and also include the squared term of the centered age.

TRANSFORMING VARIABLES: faminc97 & AGE97

hist(faminc97)

goodR$faminc97R<-ifelse(goodR$faminc97<=1,1,goodR$faminc97)   # recode incomes <= 1 to 1 because log(0) is undefined
  # Or: goodR$faminc97R<-ifelse(goodR$faminc97 <= 1, 1,
  #                             ifelse(goodR$faminc97 > 1, goodR$faminc97, NA))
goodR$loginc <-log(goodR$faminc97R)

goodR$ageC<-goodR$AGE97-8    # center age so that 0 corresponds to an 8-year-old
goodR$ageC2<- goodR$ageC^2   # squared term for the centered age
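
It can also help to look at the distribution of the new log-income variable; a quick sketch:

hist(goodR$loginc)   # compare with hist(faminc97) above
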

lm2<-lm(readss97~loginc + ageC + ageC2 + bthwht+
        HOME97,data=goodR)
summary(lm2)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + ageC2 + bthwht + HOME97, 
##     data = goodR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.824  -9.389  -0.377   9.518  55.985 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.13936    3.15466  17.479  < 2e-16 ***
## loginc       1.69500    0.30738   5.514 3.97e-08 ***
## ageC         0.37552    0.12910   2.909  0.00367 ** 
## ageC2       -0.03267    0.04638  -0.704  0.48134    
## bthwht      -2.53381    0.74463  -3.403  0.00068 ***
## HOME97       1.52543    0.12229  12.474  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.67 on 1958 degrees of freedom
## Multiple R-squared:  0.1636, Adjusted R-squared:  0.1615 
## F-statistic:  76.6 on 5 and 1958 DF,  p-value: < 2.2e-16

We examine the residual plots to check for outliers, heteroscedasticity, non-normality, etc. Compare this residual plot with the one from the model in which the faminc97 variable isn’t transformed.

plot(lm1,1)         # `,1` shows only the 1st diagnostic plot: Residuals vs Fitted

plot(lm2,1)

Graph of Model Residuals Against loginc

resid2<-lm2$residuals
plot(resid2~loginc, data=goodR)

New model after making the variable transformations (lm2 shows that ageC2 is not needed):

lm3<-lm(readss97~loginc + ageC + bthwht+
          HOME97,data=goodR)
summary(lm3)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.469  -9.516  -0.324   9.517  55.657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.8712     3.1312  17.524  < 2e-16 ***
## loginc        1.6871     0.3071   5.493 4.47e-08 ***
## ageC          0.4100     0.1194   3.433 0.000609 ***
## bthwht       -2.6817     0.7143  -3.754 0.000179 ***
## HOME97        1.5321     0.1219  12.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared:  0.1634, Adjusted R-squared:  0.1617 
## F-statistic: 95.65 on 4 and 1959 DF,  p-value: < 2.2e-16
plot(lm3,1)

MULTICOLLINEARITY

library(car)
vif(lm3)
##   loginc     ageC   bthwht   HOME97 
## 1.260362 1.115411 1.085926 1.312712
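
All of these VIFs are close to 1, well below the common rule-of-thumb cutoffs of 5 or 10, so multicollinearity does not appear to be a problem in lm3. As a quick sketch, you could flag any predictor whose VIF exceeds a chosen threshold:

v <- vif(lm3)
v[v > 5]   # empty here: no predictor in lm3 exceeds a VIF of 5
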

7 IDENTIFYING OUTLIERS

Key reference material: Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity.

The rules of thumb used to identify potential outliers are listed below (k = number of IVs; N = sample size). NOTE: these “rules” are not hard-and-fast, but guidelines only! ALWAYS produce a plot, histogram, or sample quantiles of your outlier measures.

Measure                      Cut-off value
|standardized residual|      > 2 (or 3)
leverage                     > (2k+2)/N
|DFFITS|                     > 2*sqrt((k+1)/N)
|DFBETAS|                    > 2/sqrt(N)
Cook’s D                     > 4/N
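
A minimal sketch of computing these cut-offs for the model above (k = 4 independent variables, N = 1964 observations); the standardized-residual cut-off of 2 or 3 does not depend on k or N:

k <- 4            # number of independent variables in lm3
N <- nobs(lm3)    # 1964
c(leverage = (2*k + 2)/N,        # ~0.0051
  dffits   = 2*sqrt((k + 1)/N),  # ~0.101
  dfbetas  = 2/sqrt(N),          # ~0.045
  cooks_d  = 4/N)                # ~0.002
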

Create a new data frame where we can store all of our outlier information

outliers <- goodR[,c("CHID","readss97","loginc","ageC","HOME97","bthwht")] # CHID = Children ID
outliers$r <- rstudent(lm3)
outliers$lev <- hatvalues(lm3)
outliers$cd <- cooks.distance(lm3)
outliers$dffit<- dffits(lm3)
# In this step we are creating a separate data frame with the dfbetas so that we can properly label the columns
dfbetaR<-dfbetas(lm3)
dfbetaR <- as.data.frame(dfbetaR)
dfbetas(lm3)
# To check the order of the variable names use the colnames() function
colnames(dfbetaR)
## [1] "(Intercept)" "loginc"      "ageC"        "bthwht"      "HOME97"
# Next assign new variable names and add the dfb appendage
colnames(dfbetaR)<-c("int_dfb","income_dfb",
                     "ageC_dfb","bthwht_dfb","home_dfb")
# find out the number of observations in `lm3`
nobs(lm3)
## [1] 1964

Now we want to combine the two data frames using the cbind() function

outliers<-cbind(outliers,dfbetaR)
# use the head() function to make sure the new columns have been added to your outliers data frame
head(outliers)

7.1 DISCREPANCY

Discrepancy is the difference between the predicted and observed value, measured by Studentized Residuals. The basic idea of Studentized Residuals is to delete the observations one at a time, each time refitting the regression model on the remaining n–1 observations. Then, we compare the observed response values to their fitted values based on the models with the ith observation deleted. This produces deleted residuals. Standardizing the deleted residuals produces studentized residuals.
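
A minimal sketch of this leave-one-out idea for a single observation (assuming lm3 and goodR from above); the result matches rstudent(lm3)[i]:

i <- 1
fit_minus_i <- lm(readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR[-i, ])
e_i <- residuals(lm3)[i]             # ordinary residual for observation i
h_i <- hatvalues(lm3)[i]             # leverage of observation i
s_i <- summary(fit_minus_i)$sigma    # residual SD with observation i deleted
e_i / (s_i * sqrt(1 - h_i))          # externally studentized residual for observation i
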

Plotting Out and Evaluating Official Threshold

Our critical standardized residual value is abs(standardized resid) > 2 (or 3). Which cutoff seems more plausible? Let’s plot to find out.

r <- rstudent(lm3) # create a named numeric vector of studentized residuals
plot(outliers$r, 
     xlab="Index",
     ylab="studentized residuals",
     pch=19)
abline(h = 3, col="red")  # add cutoff line
abline(h = -3, col="red")  # add cutoff line
text(x=1:length(r), y=r, labels=ifelse(r > 3,names(r),""), col="red", pos = 4)  # add labels # pos = 4 to position at the right 
text(x=1:length(r), y=r, labels=ifelse(r<(-3.0),names(r),""), col="red", pos = 4)  # add labels # pos = 4 to position at the right 

# Or, use this code for interactive point identification to avoid overlapping labels
identify(1:length(r),r,row.names(outliers))

## integer(0)
# abline(lm(r~CHID, data=outliers), col="red") # fit a regression line (y~x) 

According to the plot, observations with studentized residuals greater than +/-3 might be the real outliers in our dataset. Let’s subset our outliers data frame by those observations.

rstudent1<-subset(outliers,abs(r)>3)

Use the View() function to examine observations with standardized residuals greater than +/-3

View(rstudent1)

What do these observations have in common?

Quite a few observations have low readss97 values, and you can also see that there seem to be a lot of observations with high income and low reading scores, or observations with low income and high reading scores.


7.2 LEVERAGE

Leverage is a measure of how far away an observation’s independent-variable values are from those of the other observations. Measured by the hat values, it is the standardized distance of the observation’s predictors from the mean of the predictors.
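
A minimal sketch of where the hat values come from (assuming lm3 from above): they are the diagonal of the hat matrix H = X(X'X)^(-1)X', and they sum to the number of estimated coefficients.

X <- model.matrix(lm3)                              # design matrix (intercept + 4 predictors)
H_diag <- diag(X %*% solve(t(X) %*% X) %*% t(X))    # diagonal of the hat matrix
all.equal(unname(H_diag), unname(hatvalues(lm3)))   # TRUE
sum(hatvalues(lm3))                                 # 5, i.e. k + 1 coefficients
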

Plotting Out and Evaluating Official Threshold

Our official critical leverage value is leverage > (2k+2)/N, or leverage > (2*4+2)/1964 (≈ .005). Is this value too strict? Let’s plot to find out!

plot(outliers$lev, 
     xlab="Index",
     ylab="leverage",
     pch=19)
abline(lm(lev~CHID, data=outliers), col="red")  # Add a fitted line 
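
As with the other diagnostics, you may also want to draw the rule-of-thumb cut-off on this plot (a minimal sketch, using k = 4 and N = 1964 as above):

abline(h = (2*4 + 2)/1964, col = "blue", lty = 2)   # leverage cut-off (2k+2)/N ≈ 0.005
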

Residuals vs Leverage plot

plot(lm3, which = 5)


Selecting a Reasonable Cut-off Point

According to the plot, maybe observations with leverage values greater than .03 are the real outliers. Let’s subset our outliers data frame by those observations.

leverage1<-subset(outliers,lev > .03)
# Use the view function to examine observations with leverage values greater than .03
View(leverage1)

What do these observations have in common? Many observations have a loginc value less than 1, a high reading score, and high parenting practices scores. A pattern is emerging. In the previous step, we saw that some observations with standardized residuals greater than +/-3 also had very low loginc values but higher reading scores and high parenting practices scores.


7.3 INFLUENCE (Cook’s D)

Cook’s distance was introduced by the American statistician R. Dennis Cook in 1977. It is used to identify influential data points by considering both the residual and the leverage, i.e., it takes into account both the x and y values of the observation.
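
A minimal sketch of that combination (assuming lm3 from above): Cook’s D can be computed directly from the ordinary residuals and leverages, and the result matches cooks.distance(lm3).

e  <- residuals(lm3)            # ordinary residuals
h  <- hatvalues(lm3)            # leverages
p  <- length(coef(lm3))         # number of estimated coefficients (5)
s2 <- summary(lm3)$sigma^2      # estimated error variance
D  <- e^2 * h / (p * s2 * (1 - h)^2)
all.equal(unname(D), unname(cooks.distance(lm3)))   # TRUE
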

Our critical Cook’s Distance value is Cook’s D > 4/N, or here Cook’s D > 4/1964 (0.00203666).

Plotting Out and Evaluating Official Threshold

plot(cooks.distance(lm3), pch="*", cex=2, main="Influential Obs by Cook's distance Using Official Threshold (4/1964)")  
abline(h = 4/1964, col="red")  # add cutoff line

Alternatively, using functions from library olsrr

ols_plot_cooksd_bar(lm3)   # threshold = 4/n = 4/1964 = 0.002 

ols_plot_cooksd_chart(lm3)

Maybe our critical Cook’s D value of 4/1964 is too strict. Let’s take a closer look at the Cook’s D values that are large according to this critical value.

First, create a new data frame containing observations with Cook’s D values above 4/1964.

large_cd<-subset(outliers, cd > (4/1964)) 
View(large_cd)

Next, produce a histogram and print the sample quantiles of our Cook’s D outlier measure to the console.

# install.packages("Hmisc")
# library(Hmisc)
describe(large_cd$cd)
## large_cd$cd 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       87        0       87        1 0.007076 0.008124 0.002100 0.002186 
##      .25      .50      .75      .90      .95 
## 0.002465 0.003157 0.005046 0.011376 0.017548 
## 
## lowest : 0.002050058 0.002059938 0.002072141 0.002075410 0.002096715
## highest: 0.018274246 0.023348340 0.063262844 0.077436290 0.104389886
hist(large_cd$cd)

quantile(large_cd$cd, probs = seq(0, 1, 0.05))  # from 0 to 1, by 0.05 
##          0%          5%         10%         15%         20%         25% 
## 0.002050058 0.002099812 0.002186473 0.002266736 0.002379564 0.002465379 
##         30%         35%         40%         45%         50%         55% 
## 0.002534835 0.002585449 0.002928289 0.003031257 0.003156610 0.003283638 
##         60%         65%         70%         75%         80%         85% 
## 0.003404639 0.003746563 0.004223186 0.005046340 0.005655253 0.005933571 
##         90%         95%        100% 
## 0.011375837 0.017547768 0.104389886

Notice that there is a big jump in the upper tail: the 90th percentile value, 0.011375837, is roughly double the 85th percentile (about 0.0059) and more than twice the 75th percentile (about 0.00504). This is a good indicator that observations at or above the 90th percentile are the real outliers.


Selecting A Reasonable Cut-off Point

Let’s take a closer look at these observations

large_cd2<-subset(outliers, cd > 0.011375837)
# Use the View() function to examine observations in this object
View(large_cd2)

Now, focusing just on the variables included in the regression model, what do you notice? Again, we find that quite a few observations have very low loginc values but high reading scores and high parenting practices scores.

Cook’s Distance Plot

cd <- cooks.distance(lm3)
plot(cooks.distance(lm3), pch="*", cex=2, main="Influential Obs by Cooks distance")  
abline(h = 0.011375837, col="red")  # add cutoff line
text(x=1:length(cd), y=cd, labels=ifelse(cd>0.011375837,names(cd),""), col="red", pos = 4)  # add labels # pos = 4 to position at the right 

# Or alternatively,
identify(1:length(cd),cooks.distance(lm3),row.names(outliers)) # interactive labeling, LOVE it! :)

## integer(0)

Let’s fit a model without outliers (i.e. dropping observations with Cook’s D greater than 0.011375837). How do these results compare to the results obtained with the lm3 model?

lm4<-lm(readss97~loginc + ageC + bthwht + HOME97, 
        data=subset(outliers, cd < 0.011375837))
# Do the model estimates or R-Squared change?
summary(lm3)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.469  -9.516  -0.324   9.517  55.657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.8712     3.1312  17.524  < 2e-16 ***
## loginc        1.6871     0.3071   5.493 4.47e-08 ***
## ageC          0.4100     0.1194   3.433 0.000609 ***
## bthwht       -2.6817     0.7143  -3.754 0.000179 ***
## HOME97        1.5321     0.1219  12.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared:  0.1634, Adjusted R-squared:  0.1617 
## F-statistic: 95.65 on 4 and 1959 DF,  p-value: < 2.2e-16
summary(lm4)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = subset(outliers, 
##     cd < 0.011375837))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.817  -9.370  -0.429   9.339  56.389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.1573     3.3941  13.305  < 2e-16 ***
## loginc        2.8991     0.3511   8.258  2.7e-16 ***
## ageC          0.4076     0.1177   3.463 0.000546 ***
## bthwht       -2.5110     0.7042  -3.566 0.000371 ***
## HOME97        1.3760     0.1233  11.163  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 1950 degrees of freedom
## Multiple R-squared:  0.1831, Adjusted R-squared:  0.1814 
## F-statistic: 109.3 on 4 and 1950 DF,  p-value: < 2.2e-16

Influence Plot

influencePlot() creates a “bubble” plot of studentized residuals versus hat values, with the area of each circle proportional to that observation’s Cook’s distance.

# library(car)
influencePlot(lm3, id.method="noteworthy", # "identify": for interactive labeling
              main="Influence Plot", 
              sub="Circle size is proportional to Cook's Distance")

##        StudRes         Hat       CookD
## 345  -3.798934 0.001779784 0.005111247
## 715   2.342583 0.054617275 0.063262844
## 1132  1.326663 0.049370524 0.018274246
## 1192  3.815063 0.003395120 0.009848521
## 2759  3.059742 0.039876923 0.077436290
## 2760  3.578872 0.039381913 0.104389886

7.4 INFLUENCE (DFBETAS)

We can also use DFBETAS to find potential outliers. DFBETA measures the difference in each parameter estimate with and without the influential point. Large values of DFBETAS indicate observations that are influential in estimating a given parameter.
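
A minimal sketch of that difference for a single observation (assuming lm3 and goodR from above); dfbetas() reports a standardized version of the same quantity:

i <- 1
fit_minus_i <- lm(readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR[-i, ])
coef(lm3) - coef(fit_minus_i)   # raw DFBETA: change in each coefficient when obs i is dropped
dfbetas(lm3)[i, ]               # standardized version used in this section
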

Plotting Out and Evaluating Official Threshold

Our critical dfbeta cutoff value is abs(dfbetas) > 2/sqrt(N), or here abs(dfbetas) > 2/sqrt(1964) (0.04512937). But maybe this value is too strict. To get a better idea of which values of each variable might be outliers, we need to plot the dfbetas for each variable (excluding the intercept).

Use the par() function with mfrow to print all four dfbeta plots in one window (2 rows, 2 columns)

par(mfrow=c(2,2))
plot(outliers$income_dfb)
plot(outliers$ageC_dfb)
plot(outliers$bthwht_dfb)
plot(outliers$home_dfb)

One variable that stands out, which should not be a surprise, is our income variable.


Selecting A Reasonable Cut-off Point

Based on the plots, a reasonable critical value for our income dfbeta is about abs(dfbeta) > .2. Let’s take a closer look at these observations.

inc<-subset(outliers, abs(income_dfb) >  .2) 
View(inc)

Focusing specifically on the loginc values, what do these observations have in common? Again, we find that quite a few of these observations have loginc values less than 1.


7.5 INFLUENCE (DFFITS)

DFFITS measures the difference in the fitted value for observation i when the model is estimated with and without that observation. Create a scatterplot of the DFFITS values for lm3 using the dffits() function.
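
A minimal sketch (assuming lm3 from above): DFFITS can equivalently be written in terms of the studentized residuals and leverages used earlier.

all.equal(unname(dffits(lm3)),
          unname(rstudent(lm3) * sqrt(hatvalues(lm3)/(1 - hatvalues(lm3)))))   # TRUE
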

Plotting Out and Evaluating Official Threshold

cv <- 2*sqrt((4+1)/1964)   # critical value 2*sqrt((k+1)/N) ≈ 0.101
plot(dffits(lm3), 
     ylab = "Standardized dfFits", xlab = "Index", 
     main = paste("Standardized DfFits, \n critical value = 2*sqrt((k+1)/N) = +/-",
                  round(cv, 3)))
# Critical values shown as horizontal lines
abline(h = cv, col = "red", lty = 2)
abline(h = -cv, col = "red", lty = 2)

Let’s take a closer look at these observations

large_dffits<-subset(outliers, dffit > 0.1)  # keeps only the large positive dffits; abs(dffit) > 0.1 would also flag large negative values
View(large_dffits)

Next, produce a histogram and print the sample quantiles of our dffits outlier measure to the console.

# library(Hmisc)
describe(large_dffits$dffit)
## large_dffits$dffit 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       51        0       51        1   0.1739   0.1055   0.1008   0.1016 
##      .25      .50      .75      .90      .95 
##   0.1081   0.1246   0.1696   0.2816   0.4524 
## 
## lowest : 0.1002484 0.1006708 0.1006789 0.1009103 0.1012990
## highest: 0.3023354 0.3418022 0.5630620 0.6235657 0.7246348
hist(large_dffits$dffit)

quantile(large_dffits$dffit, probs = seq(0, 1, 0.05))  # from 0 to 1, by 0.05 
##        0%        5%       10%       15%       20%       25%       30% 
## 0.1002484 0.1007946 0.1015823 0.1037258 0.1060657 0.1080966 0.1113601 
##       35%       40%       45%       50%       55%       60%       65% 
## 0.1127046 0.1149095 0.1222516 0.1245861 0.1279384 0.1311345 0.1402279 
##       70%       75%       80%       85%       90%       95%      100% 
## 0.1591148 0.1695848 0.1812054 0.2424442 0.2815864 0.4524321 0.7246348

Selecting A Reasonable Cut-off Point

large_dffits2<-subset(outliers, dffit > 0.2424442) # e.g., cut at the 85th percentile
View(large_dffits2)
df <- dffits(lm3)
plot(dffits(lm3), pch="*", cex=2, main="Influential Obs by DFFIT")  
abline(h = 0.2424442, col="red")   
abline(h = -0.2424442, col="red")  
text(x=1:length(df), y=df, labels=ifelse(df>0.2424442, names(df),""), col="red", pos = 4)
text(x=1:length(df), y=df, labels=ifelse(df<(-0.2424442), names(df),""), col="red", pos = 4) 
# Or, use this code for interactive point identification
identify(1:length(df),dffits(lm3),row.names(outliers))

## integer(0)

7.6 DEALING WITH OUTLIERS

One common theme that emerged from the analysis is that observations with very low incomes are a problem. (To determine if there are other problematic variable values or observations, make sure to perform a thorough outlier analysis.)

How do we deal with outliers? There are TWO approaches.

1. You can TAKE OUT these observations

In this example, we might want to omit from our analysis observations with a Cook’s D value greater than 0.011375837. First, create a new data frame (here called outliers2) by subsetting your outliers data frame to include only those observations that have a Cook’s D value LESS than our new critical value (0.011375837).

outliers2<-subset(outliers, cd < 0.011375837)

# Now you can compare the results produced from our lm3 model with the results produced with our new dataset.
lm5<-lm(readss97~loginc + ageC + HOME97+ bthwht, data=outliers2)
summary(lm3)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.469  -9.516  -0.324   9.517  55.657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.8712     3.1312  17.524  < 2e-16 ***
## loginc        1.6871     0.3071   5.493 4.47e-08 ***
## ageC          0.4100     0.1194   3.433 0.000609 ***
## bthwht       -2.6817     0.7143  -3.754 0.000179 ***
## HOME97        1.5321     0.1219  12.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared:  0.1634, Adjusted R-squared:  0.1617 
## F-statistic: 95.65 on 4 and 1959 DF,  p-value: < 2.2e-16
summary(lm5)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + HOME97 + bthwht, data = outliers2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.817  -9.370  -0.429   9.339  56.389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.1573     3.3941  13.305  < 2e-16 ***
## loginc        2.8991     0.3511   8.258  2.7e-16 ***
## ageC          0.4076     0.1177   3.463 0.000546 ***
## HOME97        1.3760     0.1233  11.163  < 2e-16 ***
## bthwht       -2.5110     0.7042  -3.566 0.000371 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 1950 degrees of freedom
## Multiple R-squared:  0.1831, Adjusted R-squared:  0.1814 
## F-statistic: 109.3 on 4 and 1950 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm3)

plot(lm5)


2. Or, set specific VARIABLE VALUES to missing

In this example, most of the observations that were flagged as outliers had very low loginc values, so we might consider setting loginc values less than or equal to 1 to NA (i.e., missing).

First, create a new data frame (here, called outliers3) that is an exact copy of our original outliers data frame

outliers3<-outliers
# Next, set loginc values less than or equal to 1 to NA
outliers3$loginc[outliers3$loginc <= 1]<-NA
# you could also use:
# outliers3$loginc<-ifelse(outliers3$loginc <= 1, NA,outliers3$loginc)

NOTE: you can use the same format to set other VARIABLE VALUES to missing. Once you have finished setting specific variable values to missing, compare the lm3 estimates with the estimates produced with the outliers3 data frame. Are there differences? Has the R-Squared value changed?

lm6<-lm(readss97~loginc + ageC + HOME97+ bthwht, data=outliers3)
summary(lm3)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.469  -9.516  -0.324   9.517  55.657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.8712     3.1312  17.524  < 2e-16 ***
## loginc        1.6871     0.3071   5.493 4.47e-08 ***
## ageC          0.4100     0.1194   3.433 0.000609 ***
## bthwht       -2.6817     0.7143  -3.754 0.000179 ***
## HOME97        1.5321     0.1219  12.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared:  0.1634, Adjusted R-squared:  0.1617 
## F-statistic: 95.65 on 4 and 1959 DF,  p-value: < 2.2e-16
summary(lm6)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + HOME97 + bthwht, data = outliers3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.885  -9.325  -0.396   9.395  56.577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.8643     3.6322  11.801  < 2e-16 ***
## loginc        3.1751     0.3897   8.148 6.53e-16 ***
## ageC          0.4005     0.1184   3.382 0.000734 ***
## HOME97        1.3463     0.1260  10.684  < 2e-16 ***
## bthwht       -2.5328     0.7095  -3.570 0.000366 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.52 on 1950 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.1818, Adjusted R-squared:  0.1802 
## F-statistic: 108.3 on 4 and 1950 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm3)

plot(lm6)


Policy Report 2 due at 12:59 pm March 29th, 2019 (Next Friday) (Updated)

Assignment 7 due at 12:59 pm March 25th, 2019 (Next Monday)

Import bloodpress.txt (information for 20 patients) into RStudio. You will use the following variables:

  • BP - Blood Pressure
  • BSA - Body Surface Area
  • weight

  • Create and present a correlation matrix between all three variables. What is the correlation between BSA and weight, and what does it suggest about a regression where both are independent variables?

  • Regress blood pressure on BSA and weight and interpret the coefficients and p-values.

  • Run a VIF on the model you just ran

  • Regress BP on BSA only, and interpret the coefficients. What do the results suggest about whether there was a multicollinearity problem in your previous regression?