Install the packages that will be used for this tutorial.
install.packages("dplyr")
install.packages("car")
install.packages("ggplot2")
options(scipen=999, digits = 2) # suppress scientific notation; print to 2 significant digits
library(ggplot2)
library(dplyr)
library(car)
Let's create a randomly generated dataset.
set.seed(100)
df <- data.frame(a = rnorm(100))
df$category <- cut(df$a, breaks=c(-Inf, 0.5, 0.6, Inf),
labels=c("low","middle","high"))
# Inf and -Inf are positive and negative infinity
# breaks = unique cut points, or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
Alternatively:
set.seed(100) # get the same random number!
df1 <- data.frame(a = rnorm(100))
df1$category[df1$a < 0.5] <- "low"
df1$category[df1$a > 0.5 & df1$a < 0.6] <- "middle"
df1$category[df1$a > 0.6] <- "high"
str(df$category)
## Factor w/ 3 levels "low","middle",..: 1 1 1 3 1 1 1 3 1 1 ...
str(df1$category)
## chr [1:100] "low" "low" "low" "high" "low" "low" "low" "high" "low" ...
df1$category <- as.factor(df1$category)
summary(df$category)
##    low middle   high 
##     72      1     27
summary(df1$category)
##   high    low middle 
##     27     72      1
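Note that the level order differs: cut() keeps the order of the breaks (low, middle, high), while as.factor() sorts the character values alphabetically. A minimal sketch of imposing the same ordering on df1:
df1$category <- factor(df1$category, levels = c("low", "middle", "high"))
summary(df1$category)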
Alternatively, use the mutate() function from dplyr, which adds new variables while keeping existing ones; see https://dplyr.tidyverse.org/reference/mutate.html for more.
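A minimal sketch of the same recoding with mutate() and case_when() (the data frame df2 is introduced here only for illustration):
# same recoding as above, this time with mutate() + case_when()
df2 <- df1 %>%
  mutate(category = case_when(
    a <= 0.5           ~ "low",
    a > 0.5 & a <= 0.6 ~ "middle",
    a > 0.6            ~ "high"
  ))
table(df2$category)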
Import the good dataset into R and assign the data frame to the variable name 'good'.
good <- read.csv("good.csv")
attach(good)
Continuous data must be grouped into bins for a histogram, but the apparent distribution varies with the number of bins used.
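To see how the number of bins changes the picture, vary the breaks argument; a minimal sketch (10 and 50 are arbitrary illustrative values, not from the original):
hist(good$faminc97, breaks = 10)
hist(good$faminc97, breaks = 50)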
hist(good$faminc97, freq = FALSE)
Histogram with Density Plot
The density plot shows the smoothed distribution of the points (smoothing out the noise) along the numeric axis. The peaks of the density plot occur where there is the highest concentration of points.
hist(good$faminc97, freq = FALSE)
lines(density(good$faminc97))
ggplot(good, aes(x=faminc97)) +
labs(title = "Graph 1: Histogram and Kernel Density of Family Income") +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666") # alpha: color opacity
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
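To address that message and control the bars directly, pass binwidth to geom_histogram(); a minimal sketch (the value 5000 is an arbitrary choice for an income variable, not from the original):
ggplot(good, aes(x = faminc97)) +
  geom_histogram(aes(y = ..density..), binwidth = 5000, colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666")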
Quantile-quantile plots (Q-Q plot) for one variable
Build a Q-Q plot to check how well a data set fits a particular theoretical distribution, such as the normal or exponential. If both sets of quantiles come from the same distribution, the points should form a roughly straight line. More demonstrations: http://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html
qqnorm(good$faminc97)
qqline(good$faminc97, col="red") # If the distribution is near-Gaussian, points should follow the red line.
Check the Pearson correlation between two variables.
cor(good$readss97, good$faminc97, use = "complete.obs") # remember `cor.test()`?
## [1] 0.31
cor(good$AGE97, good$readss97, use = "complete.obs")
## [1] 0.12
cor(good$AGE97, good$faminc97, use = "complete.obs")
## [1] 0.05
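As the comment above hints, cor.test() reports a significance test and confidence interval rather than just the point estimate; a minimal sketch restricted to complete cases:
with(na.omit(good[, c("readss97", "faminc97")]),
     cor.test(readss97, faminc97))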
Any Extreme Outliers?
Using boxplot() we tell R to create a box plot. Horizontal and vertical box plots display the distribution of data using a rectangular box and whiskers. Whiskers are lines that indicate the data range outside of the box.
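By default boxplot() draws a vertical box; a minimal sketch of the horizontal version, using the horizontal argument:
boxplot(good$faminc97, horizontal = TRUE)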
boxplot(good$faminc97)
boxplot(good$readss97)
Family income is heavily skewed with extreme outliers, so use loginc97 (log family income) instead, or remove it from our model for now.
hist(good$loginc97, freq = FALSE)
lines(density(good$loginc97))
# Subset the good data set with selected variables
goodreg <- na.omit(good[,c("readss97","AGE97","HOME97","bthwht")])
attach(goodreg)
## The following objects are masked from good:
## 
##     AGE97, bthwht, HOME97, readss97
lm <- lm(readss97~AGE97 + HOME97 + bthwht, data = goodreg)
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 0.0000000000000002 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 0.0000000000000002 ***
## bthwht -2.848 0.719 -3.96 0.000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <0.0000000000000002
Continuous Independent Variables (IV) and Dependent Variables (DV). The relationship between two continuous variables is most commonly investigated using scatter plots (see the graphing section below).
Create Pairwise Comparison Scatterplot
pairs(goodreg, panel=panel.smooth)
# Scatterplot matrices are good for determining rough linear correlations of metadata that contain continuous variables.
# Scatterplot matrices are not so good for looking at discrete variables.
Create Bivariate Scatterplots for two variables
ggplot(goodreg, aes(x = AGE97, y = readss97)) +
geom_point(size = 0.6) +
xlab("Respondent's Age") +
ylab("Reading") +
theme_bw() +
geom_smooth(method = "loess")ggplot(goodreg, aes(x = HOME97, y = readss97)) +
geom_point(size = 0.6) +
xlab("Home97") +
ylab("Reading") +
theme_bw() +
geom_smooth(method = "loess")See lab 5 slides for more descriptions and refer to this post for more explanations: https://data.library.virginia.edu/diagnostic-plots/
plot(lm)
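Calling plot() on an lm object cycles through four diagnostic plots; a minimal sketch of arranging them on one panel first:
par(mfrow = c(2, 2))  # 2 x 2 grid of plots
plot(lm)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the plotting layout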
Plotting Residuals by Fitted Values
plot(fitted.values(lm), residuals(lm))
abline(0, 0, col = "red")
Plotting Residuals by Predictors (X)
plot(lm$residuals,ylab="Residuals",main="Index plot of residuals")
abline(0, 0, col= "red")We assume that the distribution of the residuals is normal. We want our residuals to follow a normal distribution centered around a mean of 0. This would mean that most of the residuals are quite small, and close to 0 meaning that there is little difference between predicted and observed values
good.res <- resid(lm)
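As a formal complement to the graphical checks below, a normality test can be run on the residuals; a minimal sketch (with samples this large such tests flag even trivial departures, so read the result alongside the plots):
shapiro.test(good.res)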
hist(good.res, 10)
boxplot(good.res, main = "Boxplot of residuals")
qqnorm(resid(lm)) # A quantile normal plot for checking normality
qqline(resid(lm))
Multicollinearity between predictors can lead to unreliable and unstable estimates of the regression coefficients and inflated standard errors. Check the Variance Inflation Factor (VIF). The general rule of thumb is that VIFs > 4 warrant further investigation, while VIFs > 10 are signs of serious multicollinearity requiring correction.
# Check Variance Inflation Factors (VIF) measure
vif(lm)
##  AGE97 HOME97 bthwht 
##    1.1    1.1    1.1
"Added-variable plots" provide graphic information about the marginal importance of a predictor variable (e.g., HOME97) given the other variables in the model. The response variable Y and the predictor under investigation (say, HOME97) are each regressed against the other predictor variables already in the regression model, and the residuals are obtained for each.
These two sets of residuals reflect the part of each (Y and X1) that is not linearly associated with the other predictor variables.
First fit AVP, the regression whose residuals (RESID_READ) are the part of reading scores not explained by the other predictors.
AVP <- lm(good$readss97 ~ good$AGE97 + good$bthwht)
RESID_READ <- resid(AVP)
AVP1 <- lm(HOME97 ~ AGE97 + bthwht, data = good)
RESID_HOME <- resid(AVP1)
# What are the trends? How are they different?
plot(density(resid(AVP)))  # A density plot
plot(density(resid(AVP1))) # A density plot
qqnorm(resid(AVP))  # checking normality of residuals
qqline(resid(AVP))
qqnorm(resid(AVP1)) # checking normality of residuals
qqline(resid(AVP1))
Combining the two sets of residuals on a scatterplot
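A minimal sketch of doing this by hand, refitting both auxiliary regressions on the complete-case subset goodreg so the two residual vectors line up (res_y and res_x are names introduced here only for illustration):
res_y <- resid(lm(readss97 ~ AGE97 + bthwht, data = goodreg))
res_x <- resid(lm(HOME97 ~ AGE97 + bthwht, data = goodreg))
plot(res_x, res_y, xlab = "HOME97 | others", ylab = "readss97 | others")
abline(lm(res_y ~ res_x), col = "red")  # slope matches the HOME97 coefficient in the full model
The avPlot() function from the car package builds the same display directly from the fitted model: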
avPlot(model=lm(good$readss97 ~ good$AGE97 + good$HOME97 + good$bthwht),
       variable=good$HOME97)
Let's examine the changes in adjusted R2, AIC, and BIC from including HOME97.
lm1 <- lm(readss97 ~ AGE97 + bthwht, data = goodreg)
summary(lm1)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + bthwht, data = goodreg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.68 -9.99 -0.56 10.15 62.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.728 0.982 99.52 < 0.0000000000000002 ***
## AGE97 0.827 0.125 6.59 0.000000000057 ***
## bthwht -4.469 0.761 -5.88 0.000000004953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16 on 1961 degrees of freedom
## Multiple R-squared: 0.031, Adjusted R-squared: 0.03
## F-statistic: 31.4 on 2 and 1961 DF, p-value: 0.0000000000000394
lm2 <- lm(readss97 ~ AGE97 + bthwht + HOME97, data = goodreg)
summary(lm2)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + bthwht + HOME97, data = goodreg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 0.0000000000000002 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## bthwht -2.848 0.719 -3.96 0.000077 ***
## HOME97 1.829 0.110 16.61 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <0.0000000000000002
AIC(lm1, lm2)
##     df   AIC
## lm1  4 16413
## lm2  5 16157
BIC(lm1, lm2)
##     df   BIC
## lm1  4 16436
## lm2  5 16185
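To set the adjusted R2 values side by side as well, they can be pulled directly from the model summaries (a minimal sketch using the fitted objects above):
c(lm1 = summary(lm1)$adj.r.squared,
  lm2 = summary(lm2)$adj.r.squared)
The lower AIC and BIC and the higher adjusted R2 for lm2 all point toward keeping HOME97 in the model.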
Using the statekffmultregression dataset:
Create a histogram and boxplot of the official poverty rate
Using qqnorm and qqline, assess the normality of the official poverty rate. Is it normally distributed?
Create a scatterplot of the official poverty rate and the unemployment rate, with a regression line. What does it tell us?
Regress official poverty rate on the unemployment rate. Present the output and plot the residuals on a scatterplot and the qqnorm graph. Are any assumptions of OLS violated? If so, what are the costs?