The slides for Lab 5 session are now available to review here.

Install the packages that will be used for this tutorial.

install.packages(“dplyr”)

install.packages(“car”)

options(scipen=999, digits = 2)
library(ggplot2)
library(dplyr)
library(car)

Transform a Continuous Varible to a Categorical one

Let’s a randomly created dataset.

set.seed(100)
df <- data.frame(a = rnorm(100))
df$category <- cut(df$a, breaks=c(-Inf, 0.5, 0.6, Inf),                    
                   labels=c("low","middle","high"))

# Inf and -Inf are positive and negative infinity 
# breaks = unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.

Alternatively:

set.seed(100)  # get the same random number!
df1 <- data.frame(a = rnorm(100))
df1$category[df1$a < 0.5] <- "low"
df1$category[df1$a > 0.5 & df1$a < 0.6] <- "middle"
df1$category[df1$a > 0.6] <- "high"
str(df$category)
##  Factor w/ 3 levels "low","middle",..: 1 1 1 3 1 1 1 3 1 1 ...
str(df1$category)
##  chr [1:100] "low" "low" "low" "high" "low" "low" "low" "high" "low" ...
df1$category <- as.factor(df1$category)
summary(df$category)
##    low middle   high 
##     72      1     27
summary(df1$category)
##   high    low middle 
##     27     72      1

Or, using mutate function under dplyr for adding new variables and keeping existing ones, see more: https://dplyr.tidyverse.org/reference/mutate.html

Import the good dataset into R and and assign the dataframe the variable name ‘good’

good <- read.csv("good.csv")
attach(good)

Exploratory Data Analysis

Continuous data must be collected into bins for histogram, but distribution varies with number of bins used.

hist(good$faminc97, freq = FALSE)

Histogram with Density Plot

The Density Plot shows the smoothed distribution of the points(smooth out the noise) along the numeric axis. The peaks of the density plot are at the locations where there is the highest concentration of points.

hist(good$faminc97, freq = FALSE)
lines(density(good$faminc97))

ggplot(good, aes(x=faminc97)) + 
  labs(title = "Graph 1: Histogram and Kernel Density of Family Income") +
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")  # alpha: color opacity
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Quantile-quantile plots (Q-Q plot) for one variable

Build a Q-Q plot to check how well a data set came from or fits a particular theoretical distribution such as a Normal or exponential. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight. More demonstrations: http://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html

qqnorm(good$faminc97)
qqline(good$faminc97,col="red")

# If distribution is near-Gaussian, points should follow red line.

Check the Pearson correlation between two variables.

cor(good$readss97, good$faminc97, use = "complete.obs") # remember `cor.test()`?
## [1] 0.31
cor(good$AGE97, good$readss97, use = "complete.obs")
## [1] 0.12
cor(good$AGE97, good$faminc97, use = "complete.obs")
## [1] 0.05

Any Extreme Outliners?

Using boxplot we tell R to create a horizontal box plot. Horizontal and vertical box plots display the distribution of data by using a rectangular box and whiskers. Whiskers are lines that indicate a data range outside of the box.

boxplot(good$faminc97)

boxplot(good$readss97)

Use loginc97 or remove it from our model for now.

hist(good$loginc97, freq = FALSE)
lines(density(good$loginc97))


Multiple Linear Regression and Assumption Diagnostics

Recall what we have already done in Lab 2 on Model diagnoses here.

# Subset good data set with selected variables
goodreg <- na.omit(good[,c("readss97","AGE97","HOME97","bthwht")]) 
attach(goodreg)
## The following objects are masked from good:
## 
##     AGE97, bthwht, HOME97, readss97
lm <- lm(readss97~AGE97 + HOME97 + bthwht, data = goodreg)
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96 < 0.0000000000000002 ***
## AGE97          0.399      0.120    3.32              0.00093 ***
## HOME97         1.829      0.110   16.61 < 0.0000000000000002 ***
## bthwht        -2.848      0.719   -3.96             0.000077 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <0.0000000000000002

5.1 Assumption of Linearity

Continuous Idependent Variables (IV) and Dependent Variables (DV). The relationship between two continuous variables is most commonly investigated using scatter plots (see graphing section below).

Create Pairwise Comparison Scatterplot

pairs(goodreg,panel=panel.smooth)

# Scatterplot matrices are good for determining rough linear correlations of metadata that contain continuous variables.
# Scatterplot matrices are not so good for looking at discrete variables.

Create Bivariate Scatterplots for two variables

ggplot(goodreg, aes(x = AGE97, y = readss97)) +
  geom_point(size = 0.6) +
  xlab("Respondent's Age") +
  ylab("Reading") +
  theme_bw() +
  geom_smooth(method = "loess")

ggplot(goodreg, aes(x = HOME97, y = readss97)) +
  geom_point(size = 0.6) +
  xlab("Home97") +
  ylab("Reading") +
  theme_bw() +
  geom_smooth(method = "loess")


5.2 Assumption of Homoscedasticity

See lab 5 slides for more descriptions and refer to this post for more explanations: https://data.library.virginia.edu/diagnostic-plots/

plot(lm)

Plotting Residuals by Fitted Values

plot(fitted.values(lm), residuals(lm))
abline(0, 0, col= "red")

Plotting Residuals by Predictors (X)

plot(lm$residuals,ylab="Residuals",main="Index plot of residuals")
abline(0, 0, col= "red")


5.3 Assumption of Normality of Residuals

We assume that the distribution of the residuals is normal. We want our residuals to follow a normal distribution centered around a mean of 0. This would mean that most of the residuals are quite small, and close to 0 meaning that there is little difference between predicted and observed values

good.res <- resid(lm)
hist(good.res,10)

boxplot(good.res,main="Boxplot of residuals") 

qqnorm(resid(lm)) # A quantile normal plot for checking normality
qqline(resid(lm))


5.4 Multicollinearity Issue

Multicollinearity between predictors can lead to unreliable and unstable estimates of regression coefficients and inflated SE. Check Variance Inflation Factor (VIF). The general rule of thumb is that VIFs > 4 warrant further investigation, while VIFs > 10 are signs of serious multicollinearity requiring correction.

# Check Variance Inflation Factors (VIF) measure
vif(lm)
##  AGE97 HOME97 bthwht 
##    1.1    1.1    1.1

5.5 Omitted Variable Bias

“Added-variable plots” provide graphic information about the marginal importanceof a predictor variable (e.g.,HOME97) given the other variables in the model. Both the response variable Y and the predictor variable under investigation (say, HOME97) are both regressed against the other predictor variables already in the regression model and the residuals are obtained for each.

These two sets of residuals reflect the part of each (Y and X1) that is not linearly associated with the other predictor variables.

  1. Run a regression of the DV in our model (readss97) on our IVs for our initial model and produces a temporary dataset called AVP that contains residuals RESID_READ for our model.
AVP <-lm(good$readss97 ~ good$AGE97 + good$bthwht)
RESID_READ <- resid(AVP)
  1. Run a regression with home97 as DV and saves residuals (called resid_home) # in a new dataset (called AVP1)
AVP1<-lm(HOME97 ~ AGE97 + bthwht, data = good)
RESID_HOME <- resid(AVP1)
  1. Compare plotted residuals.
# What are the trends? How are they different? 
plot(density(resid(AVP))) # A density plot

plot(density(resid(AVP1))) # A density plot

qqnorm(resid(AVP)) # checking normality of residuals
qqline(resid(AVP))

qqnorm(resid(AVP1)) # checking normality of residuals
qqline(resid(AVP1))

Combining the two sets of residuals on a scatterplot

avPlot(model=lm(good$readss97 ~ good$AGE97 + good$HOME97 + good$bthwht),
       variable=good$HOME97)

Let’s examine the changes in adjusted R2, AIC and BIC by including HOME97.

lm1 <- lm(readss97 ~ AGE97 + bthwht, data = goodreg)
summary(lm1)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + bthwht, data = goodreg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55.68  -9.99  -0.56  10.15  62.32 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   97.728      0.982   99.52 < 0.0000000000000002 ***
## AGE97          0.827      0.125    6.59       0.000000000057 ***
## bthwht        -4.469      0.761   -5.88       0.000000004953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1961 degrees of freedom
## Multiple R-squared:  0.031,  Adjusted R-squared:  0.03 
## F-statistic: 31.4 on 2 and 1961 DF,  p-value: 0.0000000000000394
lm2 <- lm(readss97 ~ AGE97 + bthwht + HOME97, data = goodreg)
summary(lm2)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + bthwht + HOME97, data = goodreg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96 < 0.0000000000000002 ***
## AGE97          0.399      0.120    3.32              0.00093 ***
## bthwht        -2.848      0.719   -3.96             0.000077 ***
## HOME97         1.829      0.110   16.61 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <0.0000000000000002
AIC(lm1,lm2)
##     df   AIC
## lm1  4 16413
## lm2  5 16157
BIC(lm1,lm2)
##     df   BIC
## lm1  4 16436
## lm2  5 16185

Assignment 5 due at 12:59 pm March 4th, 2019 (Next Monday)

Using the statekffmultregression dataset:

  1. Create a histogram and boxplot of the official poverty rate

  2. Using qqnorm and qqline, assess the normality of the official poverty rate. Is it normally distributed?

  3. Create a scatterplot of the official poverty rate and the unemployment rate, with a regression line. what does it tell us?

  4. Regress official poverty rate on the unemployment rate. Present the output and plot the residuals on a scatterplot and the qqnorm graph. Are any assumptions of OLS violated? If so, what are the costs?