Regression analysis is a statistical technique that allows researchers to probe the relationship between variables. In bivariate analysis, we assess the relationship between one predictor and one response variable. To give an example, a bivariate analysis might consider how being male (predictor) affects the probability of voting for Candidate A (response). Of course, bivariate analyses are rare, because the questions that puzzle social scientists are often quite complex. In the previous example, voting for Candidate A surely cannot be entirely explained by gender alone. Multiple regression, where we have one response but multiple predictors, is much more common and forms the foundation of quantitative analysis.
For each one of our lessons we will be playing with fake datasets that we create here. I find that this illuminates what regression analysis achieves most clearly, because we play the role of Nature, assigning ex-ante values to the relationships between our variables. Since we are Nature, we know exactly what the true relations between our variables are. This is for pedagogical purposes, of course; things are not this easy in real life…
Today’s running example will consider an individual’s support for the local executive/mayor. The response variable is a feeling thermometer that ranges 0-100, with warmer temperatures indicating more approving views of the mayor. Call this thermometer. The key predictors include: age, a continuous random variable that is normally distributed; college, a dichotomous or “dummy” random variable that takes the value 1 if the individual completed a 4-year college degree; and income, a continuous random variable that indicates the individual’s income in thousands of $US. We create each of these variables in turn:
#Create Predictors
#We draw from the following distributions 100 draws.
set.seed(333) #We must set the seed so that our work is replicable, any small # is fine
age<-rnorm(100,45,15) #This creates the variable *age* which is drawn from the normal distribution
#with mean 45 and st. dev. 15.
college<-rbinom(100,1,0.3) #This creates the variable *college* which is drawn from the binomial
#distribution, where the probability of obtaining a draw with value 1 is 0.3
income<-10*rbeta(100,2,8) #This creates the variable *income* which is drawn from the beta distribution
#with shape parameters 2 and 8, such that it has a slight positive skew.
We might want to plot these variables to ensure that we have created the correctly specified variables, but you will have to trust me in this instance. I will show later in this class how to draw histograms and density plots of the variables.
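If you would rather not take my word for it, a couple of quick, optional checks with base R functions might look like this:
summary(age)    #the mean should be near 45, with a spread consistent with st. dev. 15
table(college)  #roughly 30 of the 100 draws should equal 1
summary(income) #values should be positive and slightly right-skewed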
To create the dependent variable we will specify a linear predictor:
\[\text{thermometer}_i = \alpha + \beta_1\,\text{age}_i + \beta_2\,\text{college}_i + \beta_3\,\text{income}_i + \epsilon_i. \]
Understanding what this equation is saying is fundamental to understanding regression analysis. Remember, since we play the role of Nature, we can specify the exact relationship between thermometer and all the other variables. The equation above says that in our universe the value of thermometer for person \(i\) is a linear function of a parameter \(\alpha\); of person \(i\)’s values on age, college, and income; and of an error term \(\epsilon_i\). The effect of each predictor is weighted by a term \(\beta_j\), where \(j=1,2,3\). These beta parameters are what regression analysis attempts to recover: the effect of any given predictor variable on the response, when everything else is held constant.
To obtain our dependent variable, we must simply specify the values for the parameters and add them up.
#Create Dependent Variable
a<-80 #The intercept is the thermometer feeling towards the mayor
#when all variables are set to zero. In this case it would
#be the thermometer value when the person is age 0, has no
#college education and no income. This is nonsensical, which
#highlights the fact that the intercept is not always
#interpretable.
b1<-(-.5) #Beta1 says that a one year increase in age decreases support
#for the mayor by .5 points.
b2<-15 #Beta2 says that being college educated increases support for
#the mayor by 15 points! (ugh, liberals...)
b3<-(-2) #Beta3 says that for every thousand dollar increase in income
#person i's support for the mayor drops by 2 points.
e<-rnorm(100,0,2.5) #The error term is drawn from the normal, mean 0, st.dev 2.5
y<- a + age*b1 + college*b2 + income*b3 + e #So our values for y are a function of a, plus each one of
#the predictors and their weights, plus an error term.
thermometer<-y #I rename y
mayor<-as.data.frame(cbind(thermometer, #I create a data set to use subsequently.
age,
college,
income))
We now take off our data-generating-process hat and assume that we are handed the data set mayor. As with every statistical analysis, the first and most important step is to explore our data. We want to get a sense not only of the distribution of the data but also to check for potentially odd observations, which may be the result of coding errors.
First, let us look at the dependent variable thermometer.
summary(thermometer) #this produces basic descriptive statistics of the variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.06 52.51 57.88 58.52 66.54 83.41
hist(thermometer) #this plots a histogram of the variable
d_therm<-density(thermometer) #this calculates the density of the variable
plot(d_therm) #this plots the density, which is neater/cooler than the histogram
Nothing odd here: the dependent variable is approximately normally distributed with a mean around 58. We could do this individually for every variable in our data set, but this is sometimes impractical. We can look at all of our variables at once using some of the previous commands and the graphing package ggplot2. To install a package in R, simply type install.packages("") and write the name of the package within the quotes.
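For example, installing and then loading ggplot2 (which we use below) would look like this; install.packages() only needs to be run once per machine:
install.packages("ggplot2") #downloads and installs the package (run once)
library(ggplot2)            #loads the package into the current session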
summary(mayor) #This summarizes all of our variables in dataset.
## thermometer age college income
## Min. :33.06 Min. :12.18 Min. :0.00 Min. :0.1308
## 1st Qu.:52.51 1st Qu.:35.84 1st Qu.:0.00 1st Qu.:1.1096
## Median :57.88 Median :45.73 Median :0.00 Median :1.8562
## Mean :58.52 Mean :44.66 Mean :0.28 Mean :1.9668
## 3rd Qu.:66.54 3rd Qu.:51.35 3rd Qu.:1.00 3rd Qu.:2.7760
## Max. :83.41 Max. :74.02 Max. :1.00 Max. :4.8085
#Notice also that this provides the number of missing obs if
#there are any. None here.
library(reshape2) #this package transforms data from wide to long format
library(ggplot2) #this is a powerful graphing package
d <- melt(mayor) #melt turns wide-format data into long-format data.
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) + #here we plot the data `d`
facet_wrap(~variable,scales = "free") + #here we indicate what to plot, and the scales for each plot
geom_histogram() #here we specify the plot type
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These figures allow visual inspection of the data. We can see that thermometer is approximately bounded by 20 and 90, age by 15 and 80, income by 0 and 5 (remember this is in thousands of dollars), and that just over 70 observations have no college education.
Before proceeding with more sophisticated analyses, it is often useful to inspect the correlations between variables as a shorthand for regression. One way of doing this efficiently is by obtaining a correlation matrix, which shows how the variables move together. An easy way to do this is with the corrplot package:
library(corrplot)
M <- cor(mayor)
corrplot(M, method="circle", #Here we symbolize corr with circles, matrix discards upper half.
type="lower",
order="hclust",
tl.col="black",
tl.srt=45)
Positive correlations are displayed in blue, whereas negative correlations are displayed in red, as given by the legend at the bottom of the plot. The size of each circle is proportional to the size of the coefficient. So, the diagonal is obviously blue and large because the correlation between thermometer and itself is perfect, or 1. We can stick for now with the first column, since it tells us the relationship between thermometer and all the predictors. Notice that college has a moderate positive correlation with thermometer, but age and income have negative correlations with thermometer. age in particular seems to have a large negative correlation coefficient, whereas income’s relationship to thermometer is much weaker.
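If you want a formal test of any single pairwise correlation rather than just its size, the base R function cor.test() is one option (we do not rely on it for the rest of the lesson):
cor.test(thermometer, age) #Pearson correlation between thermometer and age, with a p-value and confidence interval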
A slightly cooler plot, with much more coding, is below (feel free to skip to the next section):
cormat <- round(cor(mayor),2) #Create a correlation matrix.
library(reshape2) #Use package to melt.
melted_cormat <- melt(cormat) #Melt the correlations matrix
head(melted_cormat) #This shows the first few rows: each variable's correlation with the others.
## Var1 Var2 value
## 1 thermometer thermometer 1.00
## 2 age thermometer -0.68
## 3 college thermometer 0.59
## 4 income thermometer -0.39
## 5 thermometer age -0.68
## 6 age age 1.00
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
#For a plot with only the bottom half, write the following function:
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
## thermometer age college income
## thermometer 1 -0.68 0.59 -0.39
## age NA 1.00 0.11 0.20
## college NA NA 1.00 -0.09
## income NA NA NA 1.00
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
We intend to regress our dependent variable thermometer on a host of predictors. R is an object-based language, which assigns values to objects that we can then summarize. To run a regression, therefore, we assign a linear regression model to an object. The lm function is specified as follows: lm(response~predictor_1+predictor_2...+predictor_n). So, in our case, we can specify at least three models of increasing complexity:
linear.model.1<-lm(thermometer~age) #Here we regress therm on age.
linear.model.2<-lm(thermometer~age + college) #Here on age and college.
linear.model.3<-lm(thermometer~age + college + income) #Here on all predictors.
To obtain the information from the linear model we can use a variety of functions:
print(linear.model.1)
##
## Call:
## lm(formula = thermometer ~ age)
##
## Coefficients:
## (Intercept) age
## 80.9269 -0.5017
summary(linear.model.1)
##
## Call:
## lm(formula = thermometer ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.118 -5.344 -2.421 7.248 19.090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.92689 2.55591 31.663 < 2e-16 ***
## age -0.50165 0.05451 -9.204 6.47e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.796 on 98 degrees of freedom
## Multiple R-squared: 0.4636, Adjusted R-squared: 0.4582
## F-statistic: 84.71 on 1 and 98 DF, p-value: 6.472e-15
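A couple of other standard extractor functions work on the same fitted object and pull out particular pieces of it:
coef(linear.model.1)    #the estimated coefficients as a named vector
confint(linear.model.1) #95% confidence intervals for those coefficients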
Still, these outputs are quite messy when we want to compare models. A very neat way of showing results, akin to how they would display in Stata, is given by the package texreg, which can also be used to export results to LaTeX format. Here we are concerned with the function screenreg, which takes as input a list of regression models and produces a very neat table of coefficients from the different models, side by side.
library(texreg)
## Version: 1.36.7
## Date: 2016-06-21
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(linear.model.1,linear.model.2,linear.model.3))
##
## ===============================================
## Model 1 Model 2 Model 3
## -----------------------------------------------
## (Intercept) 80.93 *** 78.94 *** 81.30 ***
## (2.56) (1.01) (0.84)
## age -0.50 *** -0.56 *** -0.53 ***
## (0.05) (0.02) (0.02)
## college 15.92 *** 15.40 ***
## (0.69) (0.54)
## income -1.80 ***
## (0.22)
## -----------------------------------------------
## R^2 0.46 0.92 0.95
## Adj. R^2 0.46 0.92 0.95
## Num. obs. 100 100 100
## RMSE 7.80 3.07 2.39
## ===============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
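To export the same table to LaTeX rather than the screen, the texreg() function from the same package writes a .tex file directly; the file name below is just an illustration:
texreg(list(linear.model.1,linear.model.2,linear.model.3), #same list of models as above
       file="mayor_models.tex")                            #writes a LaTeX table to this file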
We can also plot the thermometer scores against some of the predictors. Let’s do this for thermometer and age from the first model.
plot(age, thermometer, xlab="Age",ylab ="Thermometer")
#curve (coef(linear.model.1)[1]+coef(linear.model.1)[2]*x, add=TRUE)
curve (cbind(1,x) %*% coef(linear.model.1), add=TRUE)
As can be seen from this figure, as age increases, the thermometer score tends to drop. The regression line fits this relationship quite nicely. Now assume we want to investigate whether this relationship differs for individuals with a college education. This is also easily done for two predictors.
colors<-ifelse(college==1, "blue", "black")
plot(age, thermometer, xlab="Age",ylab ="Thermometer", col=colors, pch=20)
curve (cbind(1,x,1) %*% coef(linear.model.2), add=TRUE, col="blue")
curve (cbind(1,x,0) %*% coef(linear.model.2), add=TRUE, col="black")
Here, we have taken the estimates from linear.model.2 to show that the regression line for individuals who completed a college education sits above the line for those who did not. Their intercept on the y axis is higher, meaning that, at any given age, college-educated individuals have higher average support for the mayor.
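A common trick for making the intercept interpretable is to mean-center the predictors, that is, to subtract each predictor’s sample mean so that a value of zero corresponds to the average respondent. Below we re-estimate the two-predictor model with centered age and college: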
c.age <- age - mean(age)                       #Center age at its sample mean.
c.college <- college - mean(college)           #Center college at its sample mean.
linear.model.4<-lm(thermometer~c.age+c.college) #Re-estimate the model with the centered predictors.
library(texreg)
screenreg(linear.model.4)
##
## =======================
## Model 1
## -----------------------
## (Intercept) 58.52 ***
## (0.31)
## c.age -0.56 ***
## (0.02)
## c.college 15.92 ***
## (0.69)
## -----------------------
## R^2 0.92
## Adj. R^2 0.92
## Num. obs. 100
## RMSE 3.07
## =======================
## *** p < 0.001, ** p < 0.01, * p < 0.05
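Notice that the slope coefficients are identical to those in linear.model.2; centering only changes the intercept. The intercept is now 58.52, which is simply the sample mean of thermometer: it is the predicted support of a respondent of average age and average college status, a far more sensible quantity than the support of a respondent aged zero.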