1 Introduction to R, RStudio, and RMarkdown

1.1 Key Objectives

By the end of class, you should be able to

  1. import data (.csv, .sav, .dat) and load packages
  2. view, describe, and manipulate data
  3. make simple plots (histogram, scatterplot, line plot)
  4. present the results in R markdown

1.2 R Basics

  • What is R and why do we want to use R?
  • RStudio is an integrated development environment, or IDE (a software application that provides comprehensive facilities to computer programmers for software development), for R programming.
  • An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
# R uses "#" to add comments so that 
# you and others can understand the R code
# for example, to calculate 1+1, we type
1+1
## [1] 2

Q1. For the following code, can you tell what output we would get?

1+3
#2+4
3+6

A. 4  6  9

B. 4  9

C. 4

For simple arithmetic besides addition (+), we can also, for example, do subtraction (-), multiplication (*), division (/), exponentiation (^), and modulo (%%).
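
A minimal sketch of these operators (the values are arbitrary; the results are shown as comments):

7 - 2    # subtraction: 5
7 * 2    # multiplication: 14
7 / 2    # division: 3.5
7 ^ 2    # exponentiation: 49
7 %% 2   # modulo (remainder): 1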

Q2. Perform the following calculations in R:

  1. \(\sqrt{2}\)
  2. \(3+\frac{4}{5}\)
  3. \(3^2-4\)
  4. \(16-\frac{8}{4}\)
  5. \(\frac{16-8}{4}\)
  6. \(e^4\)

Hint: the R code for these problems is:

  1. sqrt(2)
  2. 3+4/5
  3. 3^2-4
  4. 16-8/4
  5. (16-8)/4
  6. exp(4)

Moreover, we often need to install and load packages. For example, to use the foreign package, which includes powerful data-importing tools, we do

install.packages("foreign") # install the packages 
library("foreign") # load the package into the current work environment
??foreign # look up what is in the package

More general help within R can be accessed through the following commands:

help.start() # general self-help guide to R
# search the help system for a topic, for example:
help.search("Mathematical Functions") 
help() # general help function

Or click “Help” on the top panel that is next to “Window.”

Q3. Use the help system to find a function that would compute the standard deviation of a group of numbers.

1.3 Working Environment Setup

  • RStudio basics:
    • top-left panel: Code editor allowing you to create and open a file containing an R script. The R script is where you keep a record of your work. An R script can be created as follows: File -> New File -> R Script.
    • Bottom-left panel: R console for typing R commands
    • Top-right panel:
      • Workspace tab: shows the list of R objects you created during your R session
      • History tab: shows the history of all previous commands
    • Bottom-right panel:
      • Files tab: show files in your working directory
      • Plots tab: show the history of plots you created. From this tab, you can export a plot to a PDF or an image file
      • Packages tab: show external R packages available on your system. If checked, the package is loaded in R.
  • working directory: a folder where R reads and saves files
    • check current working directory: getwd()
    • change working directory:
      • A. setwd("/path/to/my/directory") (using tab after typing quotation marks can be helpful if the desired path is under current path, e.g., you want to go from “/a/b” to “/a/b/c”). Alternatively, you can go to Session -> Set Working Directory -> Choose Directory….
      • B. Create a sub-directory in the folder you would like to store and read files. From RStudio, use the menu to change your working directory under Session -> Set Working Directory -> Choose Directory. Choose the directory you’ve just created.
    • set a default working directory: A default working directory is a folder where RStudio goes, every time you open it. You can change the default working directory from RStudio menu under: Tools –> Global options –> click on “Browse” to select the default working directory you want.
  • saving data: Before quitting, go to the Session menu and choose “Save Workspace As”. It will be a .RData file. If you want your workspace back, you can go to your R working directory and double-click on the .RData file; R will start and your workspace will be loaded. Or you can start R from the Start menu, go to the Session menu, and choose “Load Workspace”. (A small sketch of the corresponding console commands follows this list.)
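
A minimal sketch of these working-directory and workspace commands from the console (the path and file name below are hypothetical placeholders):

getwd() # show the current working directory
# setwd("/path/to/my/directory") # hypothetical path: change the working directory
save.image("my_workspace.RData") # save the current workspace to a file
load("my_workspace.RData") # load the saved workspace back into R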

1.4 Basic Data Management

Before we learn about data, a couple of concepts can be important.

  • object: any variable you define or a result of a function, e.g., “an object of data”
  • variable: a word or letter that you define and assign to some value or values. Variable names can start with a letter or a period but cannot start with a number or be a number by itself. Some examples: result, b.2, .answer (but not 2)
  • assignment: The assignment operator is <- or =. What you type on the right is assigned to what you type on the left, so it reads “the left side is assigned the value on the right”. For example, after y<-1, we have assigned 1 to the variable y.

We also need to note that

  • R is case sensitive (e.g. Result is a different variable than result).
  • R is insensitive to white space before or after the assignment operator (e.g. x <- 2 and x<- 2 are the same).
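
A minimal sketch illustrating assignment and case sensitivity (the variable names are arbitrary):

result <- 10     # assign 10 to the variable result
Result <- 20     # a different variable, because R is case sensitive
result + Result  # returns 30
y = 1            # "=" also works for assignment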

Q4. Which of the following variable assignments can be successfully executed?

A. 2b<-2

B. b2 <- 2

C. b 2<-2

D. b2=2

Some commands involving managing objects in your workspace (the global environment):

  • To see what objects are in your workspace: ls()
  • To remove an object “obj” from your workspace: rm(obj)
  • To remove all objects from your workspace: rm(list=ls())

We can read in data from a text file or a .dat, .csv, or other file using the read.table() and read.csv() commands. For a .sav file, we can use read.spss() from the foreign package. If your data file is in the same directory as your R session/.RData file, you can just type the name of the file. In addition, you can import data by clicking File -> Import Dataset. To interactively choose the file to load, try read.table(file.choose()).

?read.table() # what is the read.table() function?
# import .sav data file as a variable named df; note the "to.data.frame=T" argument
df=read.spss("R_011719.sav",to.data.frame=T) 
# import .csv data file as a variable named df_csv
df_csv=read.csv("R_011719.csv",header=T)
# import .dat data file as a variable named df_dat
df_dat=read.table("R_011719.dat",header=T)

In your top-right panel, you should be able to see, for example, that df has 30 observations of 2 variables. df here is a dataframe:

A data frame has the variables of a data set as columns and the observations as rows.

  • to view the data, we use View(), e.g., View(df).
  • but sometimes you may have a large dataset where viewing the entire dataset becomes cumbersome. In that case, you may want to look at only the first or last observations in your dataset using head() and tail(), respectively.
  • to overview your data, you can also do str(), which shows you the structure of your data, such as the total number of observations, the total number of variables, a full list of variables names, the data type of each variable, and the first observations.
  • to select a particular element in the dataframe, for example the first row and the second column in df, you can use df[1,2]. You can also select by name: for example, the x score of the second participant is df$x[2], or equivalently df[2,"x"]. For more than one element, say rows 1, 2, 3 and columns 2, 3, 4, you can do df[1:3,2:4]. “:” here means “to”; for example, 1:3 reads 1 to 3.
  • to check for missing data, we use complete.cases(). To create a new dataframe that excludes rows with missing data, we use newdata <- na.omit(mydata)
  • to describe the data, we can use summary(df)
View(df_dat) 
head(df_dat)
tail(df_dat)
str(df_dat)
df[1,2]
df$x[2]
df[2,"x"]
df[1:3,1:2]
# check if the dataset contains missing data, what do you find?
complete.cases(df_dat) 
# if the missing data are coded differently, when we import 
# the data, we can note how the missing data are encoded
# by using the na.strings argument 
df_dat2=read.table("R_011719.dat",header=T,na.strings = "-8")
View(df_dat2)
# check if the dataframe contains missing data, what do you find?
complete.cases(df_dat2)
# create a new dataframe excluding missing data
newdata <- na.omit(df_dat2)
# check if the new dataframe contains missing data, what do you find?
complete.cases(newdata)
# describe the new data
summary(newdata)

Q5. Write scripts in three different ways to select the y scores of the fifth to seventh participants in the df dataframe, and assign the selected scores to a variable named sel.

1.5 Simple Plots

For one variable:

xscores=df_dat$x
hist(xscores) # histogram
plot(xscores) # scatterplot
# line plot using blue points overlaid by a line, followed by a boxplot
plot(xscores,type="o",col="blue"); boxplot(xscores)

1.6 A Gentle Introduction to R Markdown

  • R Markdown is a low-overhead way of writing reports which includes R code and the code’s automatically-generated output. It also lets you include nicely-typeset math, hyperlinks, images, and some basic formatting.
    • Markdown is a low-overhead mark-up language invented by John Gruber. There are now many programs for translating documents written in Markdown into documents in HTML, PDF or even Word format (among others).
    • R Markdown is an extension of Markdown to incorporate running code, in R, and including its output in the document.
  • Starting and rendering an R markdown file:
    • before starting an R markdown file, make sure that you have installed the package rmarkdown.
    • to start an R Markdown file, click File -> New File -> R Markdown; alternatively, you can click the new-file icon (a page with a green plus sign) at the top left of your RStudio window and then select R Markdown.
    • to render the R Markdown file, click the “Knit” button at the top of the top-left panel and choose the format you would like to knit your R Markdown file to.
  • Title, Author, Date, Output Format, Table of Contents
    • You can specify things like title, author and date in the header of your R Markdown file. This goes at the very beginning of the file, preceded and followed by lines containing three dashes. Thus the beginning of this file looks like so:

      ---
      title: Using R Markdown for Class Reports
      author: A Student
      date: 12 January 2018
      ---
    • You can also use the header to tell R Markdown whether you want it to render to HTML (the default), PDF, or something else. To have this turned into PDF, for instance, I’d write

      ---
      title: Using R Markdown for Class Reports
      author: A Student
      date: 12 January 2018
      output: pdf_document
      ---
    • Adding a table of contents is done as an option to the output type.

      ---
      title: Using R Markdown for Class Reports
      author: A Student
      date: 12 January 2018
      output:
        html_document:
          toc: true
      ---
  • Basic text formatting: for the most part, you just type as if you were typing in a word document.
    • paragraph breaks: To insert a break between paragraphs, include a single completely blank line.
    • line breaks: To force a line break, put two blank spaces at the end of a line.
    • bold and italicize: Text to be italicized goes inside a single set of underscores or asterisks. Text to be bolded goes inside a double set of underscores or asterisks.
    • quotations: Set-off quoted paragraphs are indicated by an initial >
    • computer script formatting: Text to be printed in a fixed-width font, without further interpretation, goes in paired left-single-quotes, a.k.a. “back-ticks”, without line breaks in your typing. (Thus `R` vs. R.) If you want to display multiple lines like this, start them with three back-ticks in a row on a line by themselves, and end them the same way.
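    • for example, the basic formatting marks above would be typed like this (a small illustrative snippet):

      _italics_ or *italics*, __bold__ or **bold**

      > a set-off quotation

      `fixed-width code`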
  • Mathematical formatting: Since this is a statistics class, you need to be able to write out mathematical expressions, often long series of them. R Markdown gives you the syntax to render complex mathematical formulas and derivations, and have them displayed very nicely, via LaTeX. The math can either be inline or set off (displays).
    • Inline math is marked off with a pair of dollar signs ($), as \(\pi r^2\) or \(e^{i\pi}\).

      Inline math is marked off with a pair of dollar
      signs (`$`), as $\pi r^2$ or $e^{i\pi}$.
    • Mathematical displays are marked off with \[ and \], as in \[ e^{i \pi} = -1 \]

      Mathematical displays are marked off with `\[` and `\]`, as in
      \[
      e^{i \pi} = -1
      \]
    • for more help with mathematical formatting, see this and this, or do your own Google search. Note that much of the online help for LaTeX also applies to math in R Markdown.
  • Organizational formatting:
    • headers: The character # at the beginning of a line means that the rest of the line is interpreted as a section header. The number of #s at the beginning of the line indicates whether it is treated as a section, sub-section, sub-sub-section, etc. of the document. For instance, Basic Formatting in R Markdown above is preceded by a single #, but Headers at the start of this paragraph was preceded by ###. Do not interrupt these headers by line-breaks.
    • bullet lists:
      • This is a list marked where items are marked with bullet points.
      • Each item in the list should start with a * (asterisk) character, or a single dash (-).
      • Each item should also be on a new line.
        • Indent lines and begin them with + for sub-bullets.
        • Sub-sub-bullets aren’t really a thing in R Markdown.
    • numbered lists:
      1. Lines which begin with a numeral (0–9), followed by a period, will usually be interpreted as items in a numbered list.
      2. R Markdown handles the numbering in what it renders automatically.
      3. This can be handy when you lose count or don’t update the numbers yourself when editing. (Look carefully at the .Rmd file for this item.)
        1. Sub-lists of numbered lists, with letters for sub-items, are a thing.
        2. They are however a fragile thing, which you’d better not push too hard.
  • Hyperlinks and images
    • Hyperlinks: Hyperlinks anchored by URLs are easy: just type the URL (e.g., the URL of the source file for this document). Hyperlinks anchored to text have the anchor in square brackets, then the link in parentheses.

      [anchor in square brackets, then the link
      in parentheses]()
    • Images: Images begin with an exclamation mark, then the text to use if the image can’t be displayed, then either the file address of the image (in the same directory as your document) or a URL. Here are two examples, one for an image in the directory and one for a URL.

      ![A local image](r-project.png)
      ![A remote image](https://goo.gl/images/d7UTqd)
  • Including code: The real point of R Markdown is that it lets you include your code, have the code run automatically when your document is rendered, and seamlessly include the results of that code in your document. The code comes in two varieties, code chunks and inline code.
    • Inline Code: Code output can also be seamlessly incorporated into the text, using inline code. This is code not set off on a line by itself, but beginning with `r and ending with `. Notice that inline code does not display the commands run, just their output.
    • Code Chunks and Their Results: A code chunk is simply an off-set piece of code by itself. It is preceded by ```{r} on a line by itself, and ended by a line which just says ```. The code itself goes in between. Code chunks (but not inline code) can take a lot of options which modify how they are run, and how they appear in the document. These options go after the initial r and before the closing } that announces the start of a code chunk.
      • One of the most common options turns off printing out the code, but leaves the results alone: ```{r, echo=FALSE}
      • Another runs the code but includes neither the text of the code nor its output: ```{r, include=FALSE}. This might seem pointless, but it can be useful for code chunks which do set-up, like loading data files, initial model estimates, etc.
      • A third prints the code in the document but does not run it: ```{r, eval=FALSE}. This is useful if you want to talk about the (nicely formatted) code.
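    • For instance, a chunk that runs summary(newdata) but hides the code from the rendered document would be typed as:

      ```{r, echo=FALSE}
      summary(newdata)
      ```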

Example:

df_dat2=read.table("R_011719.dat",header=T,na.strings = "-8")
View(df_dat2)
# check if the dataframe contains missing data, what do you find?
complete.cases(df_dat2)
# create a new dataframe excluding missing data
newdata <- na.omit(df_dat2)
# check if the new dataframe contains missing data, what do you find?
complete.cases(newdata)
# describe the new data
summary(newdata)
xscores=newdata$x # histogram
hist(xscores) # scatterplot 
plot(xscores)
# line plot using blue points overlayed by a line
plot(xscores,type="o",col="blue"); boxplot(xscores)
plot(newdata) # scatterplot 
# line plots using blue points overlayed by a line
plot(newdata$x,type="o",col="blue",ylim=c(0,30))
lines(newdata$y, type="o", pch=22, lty=2, col="red")
boxplot(newdata) # boxplot

1.7 Summary

1.7.1 List of Key Commands:

  • to install a package/packages: install.packages()
  • to load a package: library()
  • to show the current working directory: getwd()
  • to set the working directory: setwd()
  • to read the dataset: read.table() or foreign::read.spss() or read.csv()
  • to view the dataframe: View()
  • to look at the first/last observations in your dataframe: head(), tail()
  • to check if observations are complete: complete.cases()
  • to exclude observations with missing values: na.omit()
  • to describe the data: summary()
  • to make a histogram: hist()
  • to make a scatterplot/line plot: plot()
  • to make a box plot: boxplot()

Q6. Read the following statements and scripts. For statements, judge whether the statement is correct; for scripts, determine whether the script can be executed successfully and, if so, what the output would be.

  1. given a dataframe named “dep”, view(dep)
  2. given a dataframe named “dep”, view("dep")
  3. given a dataframe named “dep” of 2 variables with 30 observations and 1 missing value, sum(complete.cases(dep))
  4. str() shows you the structure of your data, such as the total number of observations, the total number of variables, a full list of variables names, the data type of each variable, and the last observations.
  5. given a variable named z, summary(z) can give you information about the second quartile of z.
  6. dim() can give you information about the number of participants and of variables in your dataset.

Q7.

Answer the following questions in R Markdown and knit it to an html file:

  1. describe the rnorm function in one sentence. (Hint: do help(rnorm).)
  2. create a random sample of length 200 from a normal distribution with mean 4 and standard deviation 2, and name it ra.
  3. view ra and show the first and last 5 observations of ra. (Hint: do help(head) and help(tail).)
  4. find the min, max, mean, median, and standard deviation of ra.
  5. (bonus) how many values are between 3 and 5? Plot them in a histogram.

2 Regression Models in R

2.1 Key Objectives

By the end of class, you should be able to

  1. compute descriptive statistics of a dataset
  2. fit, interpret, and plot regression models
  3. understand regression models

2.2 Descriptive statistics in R

Several functions can be helpful in doing descriptive statistics:

  1. summary() from base R
  2. describe() from the psych package
  3. dfSummary() from the summarytools package

(remember to install and to load the packages before you use the functions)

For example, we want to gather some basic information about the bfi dataset from the psych package:

  1. if psych package is not installed already, run install.packages("psych").
  2. load the package psych by running library(psych).
  3. load the dataset bfi by data(bfi).
  4. since we will be working with this dataset for the entire session, run attach(bfi).

attach() is a command that allows you to directly access all the variables in the dataset by simply giving their names, instead of having to extract the variable from the dataset and to manually put it into the global environment, e.g., A4<-bfi$A4, or referencing variables by calling the dataset first, e.g., bfi$A4. After you are done working with this dataset, you can run, for example, detach(bfi) so that the dataset variables are no longer directly accessible.

For a particular variable, we can use describe() from psych, e.g., describe(A4); summary() from base R, e.g., summary(A4); or descr() from summarytools, e.g., descr(A4). We may also be interested in the relationships between variables, e.g., covariance, cov(x,y) for x and y and cov(df) for all variables in df; correlation, cor(x,y) for x and y and cor(df) for all variables in df.
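
A minimal sketch of these commands for the attached bfi data (assuming summarytools is installed; the use = "complete.obs" argument handles the missing values in bfi):

library(summarytools) # psych was loaded above
describe(A4) # psych: detailed descriptives for one variable
summary(A4)  # base R: five-number summary plus the mean
descr(A4)    # summarytools descriptives
cov(A1, A2, use = "complete.obs") # covariance between two variables
cor(A1, A2, use = "complete.obs") # correlation between two variables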

Q1:

  1. view first 20 observations,
  2. view last 20 observations
  3. view the A4 scores for observations 50-100
  4. obtain descriptive statistics for the bfi dataset.
  5. obtain descriptive statistics for variables A4 and age.
  6. find the covariance and correlation between all variables and between A4 and age.
  7. execute cor(A4,age)*sd(A4)*sd(age), what did you find?
  8. execute cor(scale(A4),scale(age)), what did this do?

In addition to numerical descriptions, R can also provide visual descriptions, e.g., histograms, scatterplots, correlation graphs, etc. Having covered plotting in base R, here we present a prettier way of plotting using ggplot2. Specifically, ggplot2 can help make a histogram, a scatterplot, and a correlation heatmap. For convenience, we make the correlation heatmap with the ggplot2-based package ggcorrplot. Moreover, we show a useful function, qplot(); see ?qplot().

# run install.packages("ggplot2") first if it is not installed
library(ggplot2) 
library(psych); data(bfi); bfi=na.omit(bfi); attach(bfi)
ggplot(bfi, aes(x=age)) + geom_histogram() # histogram
# using factor() to make A4 a categorical variable
# histogram of age with A4 coloring
ggplot(bfi, aes(x=age, color=factor(A4))) + geom_histogram() 
ggplot(bfi, aes(x=A4, y=age)) + geom_point() # scatterplot
# correlation heatmap with ggcorrplot
install.packages("ggcorrplot");library(ggcorrplot)
ggcorrplot(cor(bfi), p.mat = cor_pmat(bfi), hc.order=TRUE, type='lower')
# correlation heatmap with corrplot
install.packages("corrplot");library(corrplot)
corrplot(cor(bfi), method="circle") # try setting method="number" or "color"
# ggplot version of base plot
qplot(age,A4,data=bfi) 
qplot(age,A4,colour=factor(gender))
# plot scatter plot with a line plot of group means
# A: in base R
plot(A4,age)
lines(aggregate(age,list(factor(A4)),mean)$x,col="red", lwd=3)
# B: in ggplot2
group.means=data.frame(aggregate(age,list(factor(A4)),mean))
colnames(group.means)=c("A4","group_mean")
ggplot(data=bfi,aes(x=factor(A4),y=age))+
  geom_point()+
  geom_point(data=group.means, aes(x=A4, y=group_mean, group=1, color="blue"))+
  geom_line(data=group.means, aes(x=A4, y=group_mean, group=1, color="red"))

Q2: Run qplot(A4,age,colour=factor(gender))+geom_abline(intercept=mean(age),slope=0). What does geom_abline() do? What does the line mean?

2.3 Regression models in R

2.3.2 Visualization

To visualize the regression, we can use either base R or ggplot2:

srm=lm(age~A4)
plot(A4,age)
abline(srm, col="blue",lwd=3)
# using ggplot2
ggplot(bfi, aes(x=A4, y=age)) +   
  geom_smooth(method = "lm", se = FALSE)

Q3:

  1. write out the estimated regression model in R markdown.
  2. include a regression plot in the R markdown without displaying the scripts that produce the plot.
  3. knit the R markdown file into a pdf file.

2.4 Understanding regression models

To understand regression models, there are a few key things:

  1. prediction and residuals
  2. regression coefficients

2.4.1 Prediction and residuals

After estimating the regression model, we may wish to obtain the predicted values for the dependent variable, \(\hat{y}\). In simple regression, \(\hat{y}=\hat{b}_0+\hat{b}_1x_1\). In multiple regression, \(\hat{y}=\hat{b}_0+\hat{b}_1x_1+...+\hat{b}_px_p\). The predicted values for the data are stored in the “fitted.values” element of the fitted lm model object. For example, if we named our model mr, then to access the predicted values we can run mr$fitted.values.
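
For instance, a quick look at the stored predictions (a minimal sketch, assuming the fitted model object is named mr as above):

head(mr$fitted.values) # first few predicted values stored in the model object
head(fitted(mr))       # the fitted() extractor returns the same values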

But what if we wish to find \(\hat{y}\) given the \(X\) values of a new participant, that is, to make new predictions given the estimated model? One way is to figure out the exact model form and the corresponding coefficients. Using the same example, we can first do coefficients(mr) or mr$coefficients. This tells us that \(\hat{y}=8.33+ 2.41\times IQ+.15\times SES\). Now we can simply plug new IQ and SES values into the formula. But this can get tiring if there are a lot of predictors. To obtain new predictions, we can also use the predict() function. Note that the new data need to have exactly the same variable names as the data you used to estimate the model (minus the dependent variable, of course).

# fabricate some new IQ and SES data
new.IQSES=data.frame("IQ"=IQ[1:10],"SES"=SES[11:20])
# use predict function to make predictions
predict(mr,new.IQSES)

Residuals are defined as \(y-\hat{y}\). Residuals are important as they are indicators of what your model is missing. To find the residuals for mr, we can run residuals(mr) or mr$residuals.

The variance of the residuals is the variance that the model fails to explain. Since \(R^2\) is defined as the proportion of variance explained by the model prediction \(\hat{y}=f(X)\), \(R^2\) can be computed as \(cor(\hat{y},y)^2\). Moreover, given the relationships \(y=\hat{y}+e\) and \(var(y)=var(\hat{y})+var(e)\), we also have \(R^2=1-\frac{var(e)}{var(y)}\). We can verify this in R:

# r squared as the squared correlation between y and predicted y
cor(mr$fitted.values,lang)^2
# r squared as 1 - variance of residuals
1-var(mr$residuals)/var(lang)

2.4.2 Regression Coefficients

In a simple regression model, \(y=b_0+b_1x+\epsilon\), we have \(b_1=\frac{cov(X,Y)}{var(X)}\). Given \(cor(X,Y)=\frac{cov(X,Y)}{s_Xs_Y}\), we get \(b_1=\frac{cor(X,Y)s_Xs_Y}{s_X^2}=cor_{XY}\frac{s_Y}{s_X}\).

Q4. Check whether the two formulas give the same result and whether the result is correct:

  1. regress lang on IQ and obtain the regression coefficient
  2. write a script for \(b_1=\frac{cov(X,Y)}{var(X)}\) where \(X\) is IQ and \(Y\) is lang
  3. write a script for \(b_1=cor_{XY}\frac{s_Y}{s_X}\) where \(X\) is IQ and \(Y\) is lang

In multiple regression, regression coefficients are partial regression coefficients. For example, in \(Y=b_0+b_1X_1+b_2X_2\), \(b_1\) and \(b_2\) each mark the unique contributions of \(X_1\) and \(X_2\), respectively. To observe this, we can first regress \(X_2\) on \(X_1\) and obtain the residuals. The residuals can be seen as the part of \(X_2\) that is “clean” of the influence of \(X_1\), i.e., the part of \(X_2\) that is unique from \(X_1\). Now we regress \(Y\) on the residuals. The coefficient will be the same as \(b_2\).

library(MASS)
attach(nlschools)
mr=lm(lang ~ IQ + SES)
summary(mr)
mr1=lm(SES~IQ)
mr2=lm(lang~mr1$residuals)

Q5. Obtain the coefficient of IQ as in the multiple regression where both IQ and SES are predictors of lang by using a series of simple regressions.

Standardization is the process of subtracting the mean from each value and then dividing by the standard deviation, which results in a variable with a mean of 0 and a standard deviation of 1. Standardization is useful when one of the variables has a very large scale, as this may lead to regression coefficients of a very small order of magnitude. To standardize variables, we can use the scale() function. One important use of this is to obtain standardized regression coefficients. Standardized regression coefficients can be obtained by running a regression model with all variables standardized. For example, in simple regression, we can run lm(scale(y)~scale(x)).

Alternatively, standardized regression coefficients can be calculated from the unstandardized ones via the formula \(b_j^{\text{std}}=\frac{s_{X_j}}{s_Y}b_j\).
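
A minimal sketch checking this formula for the IQ coefficient in the model mr above (assuming nlschools is still attached):

mr.std <- lm(scale(lang) ~ scale(IQ) + scale(SES)) # all variables standardized
coef(mr.std)[2]                 # standardized coefficient for IQ
coef(mr)[2] * sd(IQ) / sd(lang) # the same value from the formula above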

Q6. Compare the coefficients, significance, and R-squared from regressing lang on IQ and SES with those from regressing lang on standardized IQ and standardized SES.

2.5 Summary

2.5.1 List of Key Commands

  • to describe a dataset: describe() from the psych package
  • to fit linear models, including simple and multiple linear regression: lm()
  • to look into details of regression model: coef(), summary(), predict(), residuals()
  • to compute the covariance of x and y: cov(x,y)
  • to compute the variance of x: var(x)
  • to compute the correlation of x and y: cor(x,y)
  • to standardize a variable x, scale(x)

3 Model Fit and Comparison

3.1 Key Objectives

By the end of class, you should be able to

  1. compute semipartial and partial correlations
  2. calculate partial R-squared
  3. understand sum of squares
  4. compare regression models

3.2 Correlation

In regression, the regression coefficient for each regressor is a partial regression slope. The partial regression coefficients provide information about the relationship between regressor \(j\) and \(Y\), statistically controlling for the other regressors. Other measures of partial relationships include partial and semipartial correlation.

3.2.1 Partial Correlations

For partial correlations we are explaining the relationship between two variables while holding one or more other variables constant. If we had three variables X1, X2, and X3, a possible partial correlation would be the correlation between X1 and X2 holding X3 constant for both X1 and X2. The formula for this partial correlation is

\[r_{12.3}= \frac{r_{12}- r_{13} r_{23}}{\sqrt{1-r^2_{13}}\sqrt{1-r^2_{23}}}\]

Let’s compute the partial correlation of lang and SES holding IQ constant for both lang and SES. First, we will regress lang on IQ. Then we will regress SES on IQ. Taking the correlation of the residuals of these two models gives the partial correlation.

library(MASS);attach(na.omit(nlschools))
# through regression
lm.part.lang <- lm(lang ~ IQ)
lm.part.SES <- lm(SES ~ IQ)
cor(residuals(lm.part.lang),residuals(lm.part.SES))
# by formula
(cor(lang,SES)-cor(lang,IQ)*cor(IQ,SES))/(sqrt(1-cor(lang,IQ)^2)*sqrt(1-cor(SES,IQ)^2)) 

To plot the partial correlation of lang and SES we can just plot the residuals of our two models.

plot(residuals(lm.part.SES),residuals(lm.part.lang))
lm.partial <- lm(residuals(lm.part.lang) ~ residuals(lm.part.SES))
abline(lm.partial,col='blue')

We could do these steps again for each of the pairs of variables we are interested in. However, there is an easier way to find all the partial correlations in R. The function pcor() from the package ppcor can be used to find partial correlations in R.
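
A minimal sketch of pcor() (assuming ppcor is installed and the complete-case nlschools data are attached as above):

# install.packages("ppcor") # if ppcor is not installed yet
library(ppcor)
# matrix of partial correlations among lang, IQ, and SES
pcor(cbind(lang, IQ, SES))$estimate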

Q1: Execute the following code and explain what you find with the plotting scripts.

plot(SES,lang)
lm.m <- lm(lang~SES+IQ)
abline(coef(lm.m)[1:2],col='blue')

3.2.2 Semi-partial Correlations

Semi-partial correlation is the correlation between X1 and X2 holding X3 constant for only one of the two variables. The formula for the semi-partial correlation with X3 partialled out of X2 is

\[r_{1(2.3)}= \frac{r_{12}- r_{13} r_{23}}{\sqrt{1-r^2_{23}}}\]

or

\[r_{2(1.3)}= \frac{r_{12}- r_{13} r_{23}}{\sqrt{1-r^2_{13}}}\]

SES.out <-lm(SES~IQ)
# the unique part of SES not explained by IQ
SES.part <-residuals(SES.out)
summary(lm(lang ~ SES.part))
cor(SES.part,lang) # semipartial correlation
# by formula
(cor(lang,SES)-cor(lang,IQ)*cor(IQ,SES))/(sqrt(1-cor(SES,IQ)^2))
# semipartial scatterplot
plot(SES.part,lang)
abline(lm(lang ~ SES.part))

As before, we can also use the spcor() function from the ppcor package to find the semi-partial correlations for us.
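
A minimal sketch of spcor() (again assuming ppcor is installed):

library(ppcor)
# matrix of semi-partial correlations among lang, IQ, and SES
spcor(cbind(lang, IQ, SES))$estimate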

Q2: Compare the semipartial scatterplot with the partial scatterplot and the plot from Q1. Discuss what you find.

3.3 \(R^2\) in multiple regression

  • \(R^2\): the proportion of variance in the dependent variable that has been explained by the model.
  • \(sr_1^2\): the proportion of variance uniquely explained by \(X_1\) relative to all of \(Y\)
  • \(pr^2_1\): \(X_1\)’s contribution to explaining Y relative to variance in Y not already explained by \(X_2\), given two predictors.
  • \(R^2=r^2_{YX_1}+sr_2^2=r_{YX_2}^2+sr_1^2\)
  • incremental \(R^2\) from adding predictor \(j\) to the other predictors: \(\Delta R^2_j=sr_j^2=R^2_K-R^2_{K-j}\). Also note that \(pr^2_j=\frac{sr_j^2}{1-R^2_{K-j}}\). In the special case where \(r_{X_1X_2}=0\), we have \(R^2=r^2_{YX_1}+r^2_{YX_2}\)
SES.out <- lm(SES ~ IQ)
SES.part <- residuals(SES.out)
cor(IQ,lang) # ryx1
cor(lang,SES.part) # semipartial correlation, sr2
rsquared<-cor(IQ,lang)^2+cor(lang,SES.part)^2
rsquared  
## verify the above answer with the two-predictor model
lm.both <- lm(lang ~ IQ + SES, nlschools)
summary(lm.both)

Q3: find the partial \(R^2_{\text{SES}}\) and the incremental \(\Delta R^2_{\text{SES}}\) given the model lang~SES+IQ and the model lang~IQ+SES.

\(R^2\) does not take model complexity into account, and optimizing \(R^2\) would lead to a model choice that involves great complexity and predicts poorly out of sample. The adjusted \(R^2\), \(R_{\text{adj}}^2=1-\frac{SSE/(n-p-1)}{SST/(n-1)}=1-(1-R^2)\frac{n-1}{n-p-1}\), where \(n\) is the sample size and \(p\) is the number of predictors, incorporates \(p\) and thereby takes model complexity into consideration.
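
A minimal sketch comparing this formula with the value reported by summary() (the object name lm.adj is arbitrary):

lm.adj <- lm(lang ~ SES + IQ, nlschools)
n <- nrow(nlschools); p <- 2
r2 <- summary(lm.adj)$r.squared
1 - (1 - r2) * (n - 1) / (n - p - 1) # adjusted R^2 from the formula
summary(lm.adj)$adj.r.squared        # should match the value above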

3.4 Sum of squares

The distance from any point in a collection of data to the mean of the data is the deviation. This can be written as \(y_i-\bar{y}\), where \(y_i\) is the \(i\)th data point and \(\bar{y}\) is the mean. If all such deviations are squared and then summed, as in \(\sum^n_{i=1}(y_i-\bar{y})^2\), this gives the sum of squares for these data.
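
For instance, a minimal sketch of the total sum of squares for lang (assuming nlschools is attached):

sum((lang - mean(lang))^2)     # total sum of squares for lang
var(lang) * (length(lang) - 1) # equivalently, the variance times (n - 1)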

  • Type 1 sum of squares (sequential): The SS for each factor is the incremental improvement in the error SS as each factor effect is added to the regression model. In other words, it is the effect of each factor as the factors are considered one at a time, in the order they are entered in the model. The SS can also be viewed as the reduction in the residual sum of squares (SSE) obtained by adding that term to a fit that already includes the terms listed before it.
  • Type 2 sum of squares (hierarchical or partially sequential): SS is the reduction in residual error due to adding the term to the model after all other terms except those that contain it, or the reduction in residual sum of squares obtained by adding that term to a model consisting of all other terms that do not contain the term in question. An interaction comes into play only when all involved factors are included in the model.
  • Type 3 sum of squares (marginal or orthogonal): SS gives the sum of squares that would be obtained for each variable if it were entered last into the model. That is, the effect of each variable is evaluated after all other factors have been accounted for. Therefore the result for each term is equivalent to what is obtained with Type I analysis when the term enters the model as the last one in the ordering.
lm.m <- lm(lang~SES+IQ)
#install.packages("car")
library(car)
anova(lm.m) # type 1
Anova(lm.m,type="2") # type 2
Anova(lm.m,type="3") # type 3

3.5 Model Comparisons

ANOVA can be used to do model comparisons, which can be conducted with anova() in R.

lm.ls <- lm(lang ~ SES, nlschools)
anova(lm.ls)
lm.m <- lm(lang~SES+IQ)
# use sum-to-zero contrasts (relevant for type 3 SS with car::Anova)
options(contrasts = c("contr.sum", "contr.poly"))
anova(lm.m)
anova(lm.ls,lm.m)

Q4: Given two independent predictors, discuss when type 1, 2, and 3 sum of squares are equal to each other and demonstrate it in R.

3.6 Calculating p-values without lm()

A few important concepts and formulas:

  • tolerance: the tolerance of an independent variable measures its redundancy with the other independent variables. The tolerance of the \(j\)th IV is \(\text{tol}_j=1-R^2_j\), where \(R^2_j\) is the \(R^2\) of a model with the \(j\)th IV as the dependent variable and all the other IVs as predictors.
  • t statistic: for each IV, we employ a t test to reveal its statistical significance (or lack thereof): \(t_j=\frac{\hat{\beta}_j}{\text{se}_{\hat{\beta}_j}}\) for the \(j\)th IV.
  • the standard error of a regression coefficient estimate: \(\text{se}_{\hat{\beta}_j}=\sqrt{\frac{\text{residual MSE}}{N \times s^2_{X_j} \times \text{tol}_j}}\), where \(s^2_{X_j}\) is the variance of the \(j\)th IV.
  • F statistic: for model significance, we use an F test: \(F=\frac{\text{regression MS}}{\text{residual MS}}=\frac{\sigma^2_{\hat{y}}/p}{\sigma^2_e/(N-p-1)}\)

Now we turn to R to implement the concepts above and perform significance testing for a given estimated model. First, we show how to perform significance testing for an IV using the estimated model parameters.

library(MASS)
attach(nlschools)
# significance testing for IV
lm.full <- lm(lang ~ SES + IQ, nlschools) # estimate the full model 
b_ses <-lm.full$coefficients[[2]] # obtain beta estimate for SES
# to obtain the tolerance for SES, we regress SES on the other IV (IQ)
lm.ses <-lm(SES~IQ); tol.ses <- 1 -summary(lm.ses)$r.squared # tolerance
# to obtain the standard error of the SES beta estimate, we need the residual mse
SSE <-anova(lm.full)$`Sum Sq`[[3]]; df <-anova(lm.full)$Df[[3]]; mse <-SSE/df 
SES_se = sqrt(mse/(nrow(nlschools) * var(SES) * tol.ses)) # the se for b_SES 
# now we check the accuracy of SES_se by looking at the summary(lm.full)
summary(lm.full)$coefficients[2,2] # se of SES beta estimate
# given the beta estimate and the corresponding se, we compute t
t_ses <-b_ses/SES_se
# now we check the accuracy of t value by looking at the summary(lm.full)
summary(lm.full)$coefficients[2,3] # t value
# having obtained the t value and the df, we compute the two-sided p value for SES 
2*pt(abs(t_ses), df, lower.tail = FALSE) 

Now we show how to perform significance testing for the model as a whole using the estimated model parameters.

# significance testing for model
# because we have 2 predictors, p = 2 here, nrow() computes the sample size
mse_reg <- var(predict(lm.full))/2 
mse_resid <-var(lang - predict(lm.full))/(nrow(nlschools)-2-1)
f <- mse_reg/mse_resid
# now we check the accuracy of F calculation by looking at the summary(lm.full)
summary(lm.full)$fstatistic
# now we compute the model p value
1- pf(f,2, df)

Q5: Using the same example, compute the predictor t value and significance for IQ and compute the model significance of a model with SES as the dependent variable and IQ as the independent variable.

3.7 Summary

3.7.1 List of Commands

  1. partial correlation: pcor() from ppcor package
  2. semipartial correlations: spcor() from ppcor package
  3. sum of squares: anova() and Anova() from car package

3.7.2 Citations and Resources

  1. a great read on \(r^2\), partial \(r^2\) and incremental \(r^2\)
  2. the section on sum of squares is adapted from this handout on sum of squares
  3. see section 6 of this handout for more discussion on \(R^2\) and adjusted \(R^2\)

4 Regression Assumptions and Diagnostics in R

4.1 Key Objectives

By the end of class, you should be able to

  1. understand the four assumptions of linear regression
  2. perform diagnostic procedures to check for assumption violations
  3. plot diagnostic information
  4. compute values that assess outliers, leverage, and influence

4.2 Assumptions and Diagnostics for Linear Regression

Four assumptions:

  1. linearity: the relationship between the independent and dependent variables needs to be linear (in the beta parameters).
  2. Homoscedasticity: the conditional distribution of Y given X has equal variances across X.
  3. Normality: the conditional distribution of Y given X is Normally distributed.
  4. Independence of Observations: Observations in the data must be sampled independently from one another.

4.2.1 Linearity Assumption Diagnostics

A good way to check linearity prior to the analysis is to obtain a scatterplot of the DV versus the IV. The scatterplot can offer visual clues about the nature of the relationship between the IV and the DV. After the analysis, linearity can also be checked by plotting the residuals against the predicted values from the model. This plot shows whether the residuals have non-linear patterns: if there is a non-linear relationship between the predictor variables and the outcome that the model does not capture, the pattern will show up here. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication that you don’t have non-linear relationships.

library(MASS)
attach(nlschools)
# create an artificial variable that has a nonlinear relationship with lang
nonlinear.var=scale(nlschools$lang)^3
# pre-analysis linearity check
plot(lang, nonlinear.var)
plot(lang, IQ)
# post-analysis linearity check
mod.nonlinear <- lm(nlschools$lang~nonlinear.var)
mod.linear <- lm(nlschools$lang~nlschools$IQ)
# residuals against predicted y
plot(mod.linear$fitted.values,mod.linear$residuals); # or try plot(mod.linear,1)
plot(mod.nonlinear$fitted.values,mod.nonlinear$residuals) # or try plot(mod.nonlinear,1)

4.2.2 Homoscedasticity Diagnostics

This assumption states that the conditional distribution of Y given X has equal variances across X. Equivalently, the residual variance is the same across X. If the assumption is satisfied, residuals are spread equally along the ranges of predictors.

To obtain visual clues after the analysis, we can use scale-location plots. This plot shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.

stand_r <- rstandard(mod.linear)  #standardized residual 
plot(mod.linear$fitted.values, stand_r, xlab = "", ylab = "std. residuals", 
    main = "std. residuals vs predicted Y")
abline(h = 0, col = "blue")
plot(mod.linear, 3)  # an alternative method

Q1. Plot the predictor against the residuals and against the standardized residuals in the simple regression model above. What would you expect to see, and what did you find? Why?

4.2.3 Normality Assumption Diagnostics

The conditional distribution of Y given X is normally distributed.

Prior to analysis, histogram of Y scores can be plotted.

As the assumption implies that the residuals are normally distributed, we can make a histogram of the model residuals after the analysis to check the normality assumption. A QQ plot can also be used to check this assumption after the analysis. If the points fall on a straight diagonal line, then normality is likely to be met. With sample data, we would expect small variations in the tails to occur. However, large discrepancies from the line indicate a violation of the normality assumption.

hist(nlschools$lang)  # pre-analysis histogram
# after-analysis plots
hist(stand_r)  # residual histogram
# qq plot
qqnorm(stand_r, main = "Residuals Normal Probability Plot")
qqline(stand_r)
plot(mod.linear, 2)  #an alternative method for qq plot

4.2.4 Independence of Observations Assumption Diagnostics

Observations in the data must be sampled independently from one another. This is normally evaluated based on how the data is collected.

To check independence, plot residuals against any time variables available (e.g., order of observation), any spatial variables available (e.g., the location of the observation), and any independent variables included in the model (e.g., factors, regressors). A pattern that is not random suggests lack of independence.

Because time or spatial correlations are so frequent, it is important when making observations to record any time or spatial variables that could conceivably influence the results. This not only allows you to make the residual plots to detect possible lack of independence, but also allows you to change to a model incorporating additional time or spatial variables if lack of independence is detected in these plots. Dependence on time or spatial variables is a common source of lack of independence, but the other plots might also detect lack of independence. Note that since the residuals sum to zero, they are not strictly independent, so the plot is really a rough approximation.
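
As an illustration, a minimal sketch plotting the residuals of mod.linear (from above) against observation order, used here as a stand-in for a time variable:

plot(seq_along(residuals(mod.linear)), residuals(mod.linear),
     xlab = "observation order", ylab = "residuals")
abline(h = 0, col = "blue") # a patternless band around zero is consistent with independence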

4.3 Other Diagnostics

4.3.1 Multicollinearity Diagnostics

If we had a multiple regression, we would also want to check the VIF to see whether we have multicollinearity problems: \(VIF_j=\frac{1}{\text{tol}_j}=\frac{1}{1-R_j^2}\). A VIF greater than 10 is considered high multicollinearity.

We can obtain the VIFs with the car package, or compute one directly from the tolerance we calculated earlier.

library(car)
mod <- lm(lang ~ IQ + SES, nlschools)
vif(mod)
1/tol.ses

4.3.2 Diagnostics for Unusual Observations

After an initial fit of a model (including possible nonlinear transformations of X and/or Y), it may become clear that certain observations are unusual in the sense that they are extremely atypical of X and/or Y, either in the univariate or the bivariate sense.

If the fitted model does not give a set of residuals that appear to be reasonably in agreement with the model for those residuals, then we must question the model and/or its assumptions.

A related issue is the importance of each case on estimation and other aspects of the analysis. In some data sets, the observed statistics may change in important ways if just one case is deleted from the data. This will involve two types of diagnostic statistics, distance measures and leverage values.

In general, the goal of this diagnostic is to assess

  • Outliers: An observation with large residual.
    • An observation whose dependent-variable value is unusual given its values on the predictor variables.
    • An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
  • Leverage: An observation with an extreme value on a predictor variable
    • Leverage is a measure of how far an independent variable deviates from its mean.
    • These leverage points can have an effect on the estimate of regression coefficients.
  • Influence
    • Influence can be thought of as the product of leverage and outlierness.
    • Removing the observation substantially changes the estimate of coefficients.

For demonstration, run source("http://www.statpower.net/R2101/Leverage.R"). 20 observations of two variables are included. One of the points is marked in red (\(X =0,Y =1.6\)). The regression line for the points is plotted in blue, and three statistics for this red point are given in the title of the plot:

  • Leverage. \(h_i=\frac{(x_i-\bar{x})^2}{\sum(x-\bar{x})^2}+\frac{1}{n}\) for a single x. This is a measure of how unusual the X value of a point is, relative to the X observations as a whole. Leverage has an absolute minimum of 1/n, and we can see that the red point is right in the middle of the points on the X axis, with a leverage of 1/20 = 0.05. Leverage is the proportion of the total sum of squares of the explanatory variable contributed by the ith case. Common cutoffs include \(3\frac{p+1}{n}\) or \(2\frac{p+1}{n}\).
  • Studentized Residual. \(\frac{\hat{\epsilon}_i}{SE(\hat{\epsilon}_i)}\). This can inform us about whether a point is an outlier with respect to Y given the specified model. It is a measure of the size of the residual, standardized by the estimated standard deviation of the residuals based on all the data but the red point. The red point is a barely detectable smidgen below the regression line, and has a Studentized Residual of −.025. A studentized residual with a magnitude exceeding 3 is a red flag.
  • Cook’s D. \(D_i=\sum^n_{j=1}\frac{(\hat{y}_{j(i)}-\hat{y}_j)^2}{p\hat{\sigma}^2}\) This statistic is a measure of the influence of the red point. Influence is the amount that the red point is affecting the regression line, measured by how much the regression line would change if the point were not included in the analysis. Cook’s D is within rounding error of zero in this case. This point is almost right on the regression line, and removing it would not change the regression line detectably. In fact, there are actually two regression lines plotted on this graph. One is in red, and represents the linear fit with the red point included, and one is in blue, representing the linear fit for the 19 points excluding the red point. The blue and red regression lines are almost identical, so the red line cannot be seen. If \(D_i\) is greater than 0.5, then the ith data point is worthy of further investigation as it may be influential. If \(D_i\) is greater than 1, then the ith data point is quite likely to be influential. Or, if \(D_i\) sticks out like a sore thumb from the other \(D_i\) values, it is almost certainly influential.

  • relationship among them:
    • \(SE(\hat{\epsilon_i})=\hat{\sigma}\sqrt{1-h_i}\)
    • \(D_i=\frac{1}{p}(\text{studres}_i)^2(\frac{h_i}{1-h_i})\)

Moreover, Mahalanobis distance can also be used to assess leverage whereas dfbeta can help assess influence.

Now we turn to R for implementation:

library(MASS)
mod.inf=lm(lang~SES,nlschools)
# detecting outliers in y through studentized residuals
studentized.residuals=studres(mod.inf)
plot(studentized.residuals)
# magnitude exceeding 3 is a red flag
abline(h=c(-3,3), col="blue") 
# which cases are bad?
which(abs(studentized.residuals)>=3)
# how bad
studentized.residuals[which(abs(studentized.residuals)>=3)]
# cook's distance
cookd=cooks.distance(mod.inf)
plot(cookd)
# leverage point
plot(influence(mod.inf)$hat)
abline(h=2*2/nrow(nlschools), col="blue") 

# built-in plots
# cook's distance
plot(mod.inf,4)
# leverage point
plot(mod.inf,5)

#mahalanobis distance for leverage
library(psych)
#imagine we are using both SES and IQ as IV
outlier(cbind(nlschools$SES,nlschools$IQ))

#dfbeta for influence
?influence.measures
influence.measures(mod.inf)

Q2. Run plot(mod.inf,6) and figure out what this plot means.

4.4 Summary

4.4.1 List of Commands

  1. standardized residuals: rstandard() from base stats package
  2. Q-Q plot: qqnorm() and qqline() from base stats package
  3. VIF: vif() from car package
  4. various influence measure: influence() and influence.measures() from base stats package
  5. studentized residuals: studres() from MASS package
  6. Cook’s distance: cooks.distance() from base stats package
  7. Mahalanobis distance for leverage: outlier() from psych package

4.4.2 Resources

  1. The demonstration plot is borrowed from James Steiger’s slides on Outliers, Leverage, and Influence. We highly recommend going over the slides for an in-depth yet fun exploration of the topic!
  2. For diagnostics in general, see lesson 9 of this PSU online materials

5 Data Transformation

5.1 Key Objectives

By the end of class, you should be able to

  1. dummy code categorical predictors by hand and through functions
  2. understand the relationship between linear regression and t-test/anova
  3. transform data given assumption violation

Categorical predictors are common in psychological data. Sex (male, female), group membership (control, experimental), and levels (high, medium, low) are all examples of categorical variables. In order to include these qualitative variables in our regression model, we need to dummy code them. Basically, dummy coding is assigning a unique number to each group. While any numbers can be assigned to the groups, we often use codes 1 unit apart (0, 1, 2, etc.).

5.2 Dichotomous Predictors

With dichotomous variables (categorical variables with 2 groups), a zero-one indicator is commonly used. For sex, this would be coding males as 0 and females as 1. Oftentimes data found online will be coded with 1 and 2, which can make interpretation hard. Thus, we can change this in R.

5.2.1 Dummy Coding

We can also use the dummy.code function in the psych package. Using this function will output a matrix with two columns: one where male is coded as 0 and female is coded as 1, and one where female is coded as 0 and male is coded as 1. It is important when dummy coding to always be aware of which group has which value.

library(psych)
data(bfi)
head(bfi) # Male is 1 and Female is 2
bfi <-na.omit(bfi)
attach(bfi)
#Method 1
female <-gender -1 #Male is 0 and Female is 1

#Method 2
female <-dummy.code(gender)[,2]

Once our variables are dummy coded, we can use them as predictors in our regression model. When we run this analysis, we get an estimate for the dichotomous variable female. When using 0 and 1 codes, \(b_0\) corresponds to the mean of Y for the group coded 0 (here, males), and \(b_1\) is the difference between the group means. This is analogous to our interpretation of the slope as the amount of change in Y for a unit change in the predictor; however, now a unit change in a categorical predictor (from 0 to 1) indicates a change in group.

lm_female <-lm(age ~ female)
summary(lm_female)

lm_female$df.residual # (N-1)-p
anova(lm_female)

plot(age ~ female)
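
As a quick check of the interpretation above (a minimal sketch using the attached bfi variables), the group means should reproduce the coefficients:

mean(age[female == 0])                          # should equal the intercept (the male mean)
mean(age[female == 1]) - mean(age[female == 0]) # should equal the coefficient of female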

You can also tell R that your variable is a categorical variable using the factor() function, which encodes the levels of your variable. Using this function allows you to skip dummy coding, use the variable directly in the regression model, and still obtain the same results.

gender_factor <-factor(gender)
lm_factor <-lm(age ~ gender_factor)
summary(lm_factor)

5.2.2 Relationship with \(t\) test

We can also conduct a t-test in R to obtain the same results, using the t.test function in the base stats package. The argument var.equal states whether or not the variances should be treated as equal. If TRUE, the pooled variance is used to estimate the variance, which is what we want here.

ttest <-t.test(age ~ female, var.equal=T)
ttest$statistic^2 #equal to our F value

5.3 Multiple Regression

We conducted simple regression with a categorical predictor. Now we can generalize it to multiple regression. Let’s use the variable A4 from the dataset and treat it as a continuous variable. We can add it to our regression model just like we have learned before. However, our interpretation of the coefficients is slightly different. Now the estimate for \(\beta_1\) is the difference in means between the male group and the female group holding the A4 variable constant. Similarly, the coefficient of A4 is the change in age for a unit increase in A4, holding gender constant.

lm_female2 <-lm(age ~ female + A4)
lm_female2$coefficients
lm_female2$df.residual
anova(lm_female2)

#extra code if interested, won't discuss in class
library(rgl)
newdat <- expand.grid(female = c(0,1), A4 = 1:6)
newdat$pp <- predict(lm_female2, newdata = newdat)
plot3d(female, A4, age, 
       col="blue", size=1, type="s", 
       main="3D Linear Model Fit")
with(newdat, surface3d(unique(female), unique(A4), matrix(pp, nrow=2),
                      alpha=0.3, front = "line", back = "line"))

5.3.1 Multi-category predictors

The variable education in the bfi dataset is an ordinal variable with 5 categories. The categories include high school (1), finished high school (2), some college (3), college graduate (4), and graduate degree (5). When we have multiple categories like this, we need to code multiple dummy variables. We can get all the information we need from one fewer dummy variable than the total number of categories we want to code. Therefore, we can drop the first dummy variable created by the dummy.code function. By dropping this first column we are implicitly stating that it is our reference category.

Interpreting the \(\beta\)s is an extension of the dichotomous case. Each \(\beta\) coefficient is the average difference between the reference group (here, high school) and the comparison group coded by the dummy variable of interest, holding all other variables constant. So \(\beta_1\) here would be the average difference between those in high school and those who finished high school, holding the other variables constant. Similarly, \(\beta_2\) would be the average difference between those in high school and those with some college, holding the other variables constant. The remaining \(\beta\)s follow this pattern.

edu_dummies <-dummy.code(education)[,-1]
lm_edu <-lm(age ~ edu_dummies)
summary(lm_edu)

5.3.2 Multiple Regression with 2 Categorical Predictors

When two categorical variables are in the regression model, we can interpret them just like we did in the previous sections.

lm_factor2 <-lm(age ~ female + edu_dummies)
summary(lm_factor2)

5.4 Data Transformation given Assumption Violation

5.4.1 Non-linearity

The first step is to plot x against y to gauge what transformation would be most helpful. Based on the shape, the following transformation techniques are available:

  • exponential transformation: \(x_\text{new}=e^{x_\text{old}}\)
  • reciprocal transformation: \(x_\text{new}=\frac{1}{x_\text{old}}\)
  • (natural) logarithm transformation (on X): \(x_\text{new}=\text{ln}{ x_\text{old}}\)
  • (natural) logarithm transformation (on Y): \(y_\text{new}=\text{ln}{ y_\text{old}}\)
  • (natural) logarithm transformation (on both X and Y): \(x_\text{new}=\text{ln}{ x_\text{old}}\) and \(y_\text{new}=\text{ln}{ y_\text{old}}\)

After transformation, plot the transformed variable against the other variable; an approximately linear pattern indicates a successful transformation.
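
As a quick illustration (a minimal sketch using simulated data rather than the bfi dataset), here is what a logarithm transformation on X might look like in practice:

set.seed(1)
x=runif(200,1,100)
y=3*log(x)+rnorm(200,sd=.5)    # y is roughly linear in log(x)
plot(x,y)                      # curved pattern before transformation
x.log=log(x)                   # (natural) logarithm transformation on X
plot(x.log,y)                  # roughly linear after transformation
summary(lm(y~x.log))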

5.4.2 heteroscedasticity

If the error variances are unequal, try “stabilizing the variance” by transforming y:

  • square root transformation: \(y_\text{new}=\sqrt{ y_\text{old}}\) Sometimes people use this to deal with count variables (e.g., how many drinks do you have a day); see the sketch after this list. However, it is less recommended now given the prevalence of and preference for generalized linear models, in particular the Poisson regression model.
  • arcsine transformation: \(y_\text{new}=\text{sin}^{-1}(\sqrt{y_\text{old}})\) Sometimes people use this to deal with probabilities of binary outcomes (e.g., how likely are you to drink Monday night). However, it is less recommended now given the prevalence of and preference for generalized linear models, in particular the logistic regression model.
  • reciprocal transformation: \(y_\text{new}=\frac{1}{y_\text{old}}\)
  • (natural) logarithm transformation: \(y_\text{new}=\ln(y_\text{old})\)
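
As a quick illustration of the square root transformation (a minimal sketch with simulated count-like data, not a dataset used in class):

set.seed(2)
x=runif(300,0,5)
y=rpois(300,lambda=exp(.5+.4*x))   # count outcome whose variance grows with its mean
mod.raw=lm(y~x)
plot(mod.raw$fitted.values,mod.raw$residuals)    # fanning residuals (heteroscedasticity)
y.sqrt=sqrt(y)                                   # square root transformation on y
mod.sqrt=lm(y.sqrt~x)
plot(mod.sqrt$fitted.values,mod.sqrt$residuals)  # spread is more even after transformation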

5.4.3 non-normality

Given nonnormal residuals, a power transformation, a.k.a. the Box-Cox transformation, can be used to transform the dependent variable. One additional benefit of the power transformation is that it also alleviates heteroscedasticity when such an issue is present. Despite its flexibility, interpreting an analysis of the transformed variable can be quite challenging; transformations for assumption violations in general often risk interpretational difficulties.

In R, we can use BoxCox() from the forecast package.

messed.up.y=exp(rnorm(1000))
# how messed up is it
hist(messed.up.y)
qqnorm(y=messed.up.y)
# box cox transformation
library(forecast)
clean.y=BoxCox(messed.up.y,lambda = "auto")
# what lambda was selected?
BoxCox.lambda(messed.up.y) 
# how clean is it
hist(clean.y)
qqnorm(y=clean.y)

5.5 Summary

5.5.1 List of Commands

  1. create dummy variables: dummy.code() from psych package
  2. encode a vector as a factor: factor() from base package
  3. performs t-tests: t.test() from base stats package
  4. power transformation: BoxCox() (and BoxCox.lambda() to find out about the \(\lambda\)) from forecast package

5.5.2 Resources and Citations

  1. more reading on categorical predictors
  2. on transformations for assumption violations, please look at this page to determine when each transformation should be used.

6 Hierarchical Regression in R

6.1 Key Objectives

By the end of class, you should be able to

  1. Fit hierarchical regression models
  2. Compute the change in \(R^2\) between steps
  3. Compute \(F\) values to assess the unique contribution of each subsequent step

6.2 Hierarchical Regression

Hierarchical regression here refers to a procedure in which we add new variables into the model sequentially. Hierarchical regression therefore often involves several regression models. The models are hierarchical because each prior model is nested within the subsequent model.

Two linear models are nested if one (the restricted model) is obtained from the other (the full model) by setting some parameters to zero (i.e. removing terms from the model), or some other constraint on the parameters. For example, we have model one as \(lang=\beta_0+\beta_1\times IQ+\epsilon\) and model two as \(lang=\beta_0+\beta_1\times IQ+\beta_2\times SES+\epsilon\). Model one is nested within model two because model one is equivalent to model two if we set \(\beta_2\) to zero. Oftentimes, the more complex model is regarded as the full model and the nested model/simpler model is regarded as the restricted model.

We can compare the fit of nested models to the same dataset with the F test. To compare two models, we have \(F=\frac{(RSS_R-RSS_F)/(df_R-df_F)}{RSS_F/df_F}\), where RSS stands for the residual sum of squares and df refers to the degrees of freedom for the residuals. The residual sum of squares represents the part of the outcome that the model is unable to account for. Essentially, we test whether the residual sum of squares for the restricted model is substantially larger than the residual sum of squares for the full model. That is, the F test answers the question of whether the simpler model explains significantly less of the outcome than the more complex model. Equivalently, the F test tests whether the added parameters differ significantly from zero. In the prior example, when comparing model one to model two, the null hypothesis that model one is adequate is equivalent to the hypothesis that \(\beta_2\) is zero.

We illustrate the simplest case of hierarchical regression in R, where we compare a simple regression model to an intercept-only model.

library(MASS)
# intercept-only model
mod0=lm(lang~1,nlschools)
sum0=summary(mod0)
ss0=anova(mod0)
# lang ~ IQ model
mod1=lm(lang~IQ,nlschools)
sum1=summary(mod1)
ss1=anova(mod1)
anova(mod0,mod1)

Q1

In the prior example, what is the relationship between the F statistics and the t statistic of the IQ coefficient estimate?

6.3 \(R^2\) Change in Hierarchical Regression

Q2 What is the \(R^2\) in the intercept-only model? Would an F statistic be meaningful for an intercept-only model?

Model \(R^2\) can be defined as \(\frac{SS_{\text{reg}}}{SS_{\text{tot}}}\) or \(1-\frac{RSS}{TSS}\) where RSS is the residual sum of squares and TSS a.k.a \(SS_{\text{tot}}\) is the total sum of squares. In general, model \(R^2\) refers to the proportion of variance in the outcome that is accounted for by the model.

In hierarchical regression, we have \(R^2\)s for all the compared models. The change in \(R^2\) between models, sometimes denoted \(\Delta R^2\), can be of interest because it indicates the additional variance explained by the predictors added in a subsequent model. The change in \(R^2\) can be computed as \(\frac{SS_f-SS_r}{SS_\text{tot}}\) or \(\frac{RSS_r-RSS_f}{SS_\text{tot}}\). The former represents the proportion of additional variance that the complex model explains, and the latter represents the proportion of the reduction in unexplained variance from the simpler model to the more complex model.

We first show the change in \(R^2\) from the intercept-only model to the simple regression model:

TSS=sum(ss1$`Sum Sq`) # total sum of squares
ESS=ss1$`Sum Sq`[1] # explained sum of squares
ESS/TSS # model r squared
# to check the model r squared
sum1$r.squared
# the r squared change between model 0 and model 1
sum1$r.squared-sum0$r.squared

However, hierarchical regressions often move from, at the very least, a simple regression model to multiple regression models. So here we give another example where we go from a one-predictor model to a three-predictor model. For example, we first consider intelligence (something arguably innate) as a predictor of language test score, so we have model one with lang as the DV and IQ as the IV. But we also wonder how “environmental” factors may explain the language test score above and beyond IQ. We consider two factors: socioeconomic status and the class type (multi-grade or not). So we have a more complex model with lang as the DV and IQ, SES, and COMB as predictors.

We already built model one in R, so now we build the more complex model with three predictors in R and compute the change in \(R^2\).

mod3=lm(lang~IQ+SES+COMB,nlschools)
sum3=summary(mod3)
ss3=anova(mod3)
# model r squared
TSS=sum(ss3$`Sum Sq`) # total sum of squares
# here instead of using ESS/TSS, we use 1-RSS/TSS
RSS=ss3$`Sum Sq`[4] # residual sum of squares
1-RSS/TSS # model r squared
# to check the model r squared
sum3$r.squared
# the r squared change between model 1 and model 3
sum3$r.squared-sum1$r.squared
# (ss_f-ss_r)/TSS
ss_f=sum(ss3$`Sum Sq`[1:3])
ss_r=ss1$`Sum Sq`[1]
TSS=sum(ss3$`Sum Sq`)
(ss_f-ss_r)/TSS
# (RSS_r-RSS_f)/TSS
rss_r=ss1$`Sum Sq`[2]
rss_f=ss3$`Sum Sq`[4]
(rss_r-rss_f)/TSS

Q3

Regress language test score on both IQ and SES.

  1. compute the \(R^2\) of the new model
  2. find the \(\Delta R^2\) from this model to mod3
  3. find the \(\Delta R^2\) from mod1 to this model

6.4 Significance Testing in Hierarchical Regression

To compare nested models, an F statistic can be computed: \(F=\frac{(RSS_R-RSS_F)/(df_R-df_F)}{RSS_F/df_F}\). This is sometimes referred to as the F change test. The F change test differs from testing the significance of an individual model or of a parameter estimate. We can test the model significance of each model in a hierarchical regression, but here the \(F\) statistic tests whether a subsequent model fits the data significantly better than the prior model. While adding predictors always increases the \(R^2\), such an increase may be due to chance or sampling variability. Hence, we cannot simply look at the \(\Delta R^2\) and decide that the subsequent model is better or that the added predictors are meaningful; significance testing of the model comparison is required.

One notable scenario is that the restricted model is significant, the full model is significant, but the model comparison is not. What happens here is that the full model is only significant because of the variance explained by the predictors that are already in the restricted model, and the added predictors do not explain a statistically significant amount of additional variance. We illustrate this scenario in R.

noise=rnorm(nrow(nlschools),40,9)
mod.ex=lm(lang~IQ+noise,nlschools)
sum.ex=summary(mod.ex)
# model p value
1-pf(sum.ex$fstatistic[1],sum.ex$fstatistic[2],sum.ex$fstatistic[3])
# model r squared
sum.ex$r.squared
# delta r squared compared to mod1
sum.ex$r.squared-sum1$r.squared
# significance testing for model comparison
anova(mod1,mod.ex)

Having illustrated the confusing scenario, we return to our regular programming and show significance testing of comparing the simple regression model to the intercept-only model by using the formula.

# model comparison between 0 and 1
RSS_r=ss0$`Sum Sq`
RSS_f=ss1$`Sum Sq`[2] # what would 1 be?
df_r=ss0$Df
df_f=ss1$Df[2]
f=((RSS_r-RSS_f)/(df_r-df_f))/(RSS_f/df_f)
p.val=1-pf(f,ss1$Df[1],ss1$Df[2])
# we check this by using `anova()`
anova(mod0,mod1)

In hierarchical regression, we often compare more than two models. Each subsequent model is compared to the immediately prior model; that is, the predictors added at each step are tested for statistical significance. At each new step, we ask: given the prior model, do the new predictors explain a statistically significant amount of additional variance above and beyond the existing predictors?

mod2=lm(lang~IQ+SES,nlschools)
summary(mod2)
mod3=lm(lang~IQ+SES+COMB,nlschools)
summary(mod3)
anova(mod1,mod2,mod3)

Hierarchical regression can become tricky in confirmatory hypothesis testing where each step corresponds to a pre-specified hypothesis. Imagine a significant step 1 model and a nonsignificant step 2 model: this happens when you add so many irrelevant predictors in step 2 that the gain in explained sum of squares no longer justifies the degrees of freedom spent. But now you need to decide what to do for step 3. Given step 2, the model comparison between steps 2 and 3 may be significant if a “good” predictor is added, yet the step 3 model may still fail to reach statistical significance because of how poor the step 2 model was. What now? One strategy is to focus on reporting the F test of the model comparison, because you can still conclude that the added predictors are statistically significant above and beyond the existing predictors in the prior model.

Q4 Compare the change between mod1 and mod3 using anova().

6.5 Summary

Q5 Test the following models hierarchically using the bfi dataset from psych

  • model 1: DV=agreeableness and IV=age
  • model 2: adding the predictor gender
  • model 3: adding the predictor education (note that this is a multi-category categorical variable)

  • hint 1: first exclude all cases with missing values
  • hint 2: create a variable denoting agreeableness: agree <- rowSums(bfi[,1:5])

6.5.1 List of Commands

  1. sum of squares and dfs for the model: anova()
  2. p value for the F test: pf()

6.5.2 Resources and Citations

  1. the portion on F change test is adapted from some (really good) slides.
  2. some R-bloggers entry that may be helpful

7 Model Generalizability, Bootstrapping, and Cross Validation

7.1 Key Objectives

By the end of class, you should be able to

  1. understand the bias-variance trade-off and the generalizability of in-sample model fit
  2. execute bootstrapping and cross-validation to compute “generalizable” \(R^2\)

7.2 Bias-Variance Tradeoff, Generalizability, and Optimism

In evaluating model fit, we often use \(R^2\). It is worth noting that \(R^2(y,\hat{y})=1-\frac{MSE\times n}{(n-1)s_Y^2}\), where MSE is the mean squared error (RSS/n) from fitting the model to the sample data. From the equation, we can see that while \(R^2\) indicates the goodness of fit, MSE shows the discrepancy between the model predictions and the observed sample data.

MSE can be decomposed as \(MSE=\mathbb{E}[(\hat{y}-\mathbb{E}(\hat{y}))^2]+\mathbb{E}[(\mathbb{E}(\hat{y})-y)^2]+\sigma^2=\text{var}(\hat{y})+\text{bias}(\hat{y},y)^2+\sigma^2\). Here the bias measures how well the model estimates the true values, the variance measures how variable the model’s predictions are, and \(\sigma^2\) denotes the irreducible error. Oftentimes, the more complex the model is, the less biased it becomes. However, the complexity also makes the model’s predictions sensitive to changes in the input: that is, the model would have high variance.

Another (VERY) important issue concerns in-sample versus out-of-sample MSE. In-sample MSE refers to the MSE of the model on the sample data, and out-of-sample MSE refers to the MSE of the model on new data. As might be expected, how well the model does on your sample data may not generalize to a new sample. The reason is that your model is trained to fit your sample data as well as possible (and of course the model looks rosy on your sample), but the trained model may not generalize well to a new sample. In fact, the expected out-of-sample MSE can be approximated as \(\frac{1}{n}\sum^n_{i=1}(Y_i-\hat{f}_i)^2+\frac{2}{n}\sigma^2(p+1)\), where the former part is the in-sample MSE and the latter part is often referred to as the optimism.

Basically, this formula says that using the in-sample MSE to estimate model performance outside of the training sample may not be the best idea: averaged over repeated sampling, the in-sample MSE underestimates the true expected error. Moreover, the optimism part of the formula says that noise in the data or extra parameters in the model enable the model to pretend it can do a good job on new data. On the other hand, a large training sample makes it difficult for the model to pretend that the fit is better than it really is.

Given the direct inverse relationship between \(R^2\) and MSE, \(R^2\) is also subject to the bias-variance tradeoff: when the model is complex, \(R^2\) is higher, indicating low bias, but such an \(R^2\) can vary a lot across samples, meaning high variance. Similarly, \(R^2\) is an optimistic estimator of the model fit on new data, so it would be naive to take the in-sample \(R^2\) as the true model fit. Thus, some advocate the use of the in-sample adjusted \(R^2\) to estimate the model’s general performance. Recall \(R^2_{adj}=1-\frac{MSE\frac{n}{n-p-1}}{s_Y^2}\), where \(MSE\frac{n}{n-p-1}\approx MSE+MSE\frac{p+1}{n}\) for fixed \(p\) and large \(n\). The latter part can be thought of as a penalty meant to cancel out the “optimism.” But if you compare this with the formula for the optimism, in the best-case scenario of a perfect model where \(MSE=\hat{\sigma}^2\), the penalty is only half the size of the optimism! This means that the in-sample adjusted \(R^2\) still overestimates the general model performance.
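
As a quick sanity check of the adjusted \(R^2\) formula above, here is a minimal sketch (the model lang ~ IQ + SES on nlschools is chosen only for illustration):

library(MASS)
mod=lm(lang~IQ+SES,nlschools)
n=nrow(nlschools); p=2
RSS=sum(resid(mod)^2)
MSE=RSS/n                                   # in-sample MSE
1-(MSE*n/(n-p-1))/var(nlschools$lang)       # adjusted R^2 computed by hand
summary(mod)$adj.r.squared                  # should match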

Hence, we introduce bootstrapping and cross-validation to help mitigate the “inaccuracy” of the in-sample \(R^2\) and to better assess the model fit beyond the training sample.

7.3 Loops and Functional Programming in R

Before we can do bootstrapping or cross-validation, we need to understand two programming basics: loops and functions.

A loop is a way to repeat a set of code multiple times under some condition. A for loop will run the same code for the argument from a user-specified starting value to a user-specified ending value. A for loop is particularly helpful when the next iteration depends upon results from the last iteration. Here we show for loops in R. You should also look up while and repeat loops.

# for loop
# example 1: print out elements in a sequence one at a time
sequence=seq(1:10)
for (element in sequence){
  print(element)
}
# example 2: sum numbers from 1 to 100
sum=0
for (i in 1:100){
  sum=i+sum
}
sum
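
For reference, here is a minimal sketch of the same 1-to-100 sum written with a while loop, which keeps repeating as long as its condition holds:

# example 3: sum numbers from 1 to 100 with a while loop
total=0
i=1
while (i<=100){
  total=total+i
  i=i+1
}
total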

So far we have been using built-in functions in R. We can also specify our own functions. To construct a function, we use the keyword function(). Arguments are added in the parentheses to tell the function what values to take. The task the function performs is included in a pair of braces. For example, suppose you want to see the summary statistics of x and y and then get the \(R^2\) from regressing y on x. It would be tedious to write (almost identical) code every time you have a new x and y, so instead you can write a function for this.

# simulate some x and y 
x=rnorm(100)
y=.5*x+rnorm(100)*.5
# how you would normally do this
summary(x);summary(y)
mod=lm(y~x);summary(mod)$r.squared
# a function to do this
sum.b.t<-function(x,y){
  # show summary statistics for x and y
  print(summary(x)); 
  print(summary(y));
  mod=lm(y~x);
  # output r squared
  summary(mod)$r.squared
}
sum.b.t(x,y)
funt.output=sum.b.t(x,y)
# what about some new data
a=rnorm(100)
b=.5*a+rnorm(100)*.7
sum.b.t(a,b)
funt.output=sum.b.t(a,b)

7.4 Cross-Validation

A way to obtain an approximation of the out-of-sample model fit is through cross-validation. How can we validate a model? A researcher collects a sample from the population and trains a model by fitting the sample data. In order to validate the model, the researcher might collect a second sample from the same population, independently of the first sample. Applying the model from the first sample to this new sample, the model fit on the new sample would be a literal out-of-sample performance, as the second sample is “out” of the first sample.

Cross-validation inherits the basic idea: given one large sample, we can split that sample into many subsamples. When the original sample is randomly partitioned into \(k\) equal sized subsamples, we call this k-fold cross-validation. The procedure goes as follows:

  1. Split the dataset into \(k\) groups
  2. For each unique group:
    1. Take the group as a hold out or test data set
    2. Take the remaining groups as a training data set
    3. Fit a model on the training set and compute the \(R^2\) on the test set
    4. Retain the \(R^2\) and discard the model
  3. Average obtained \(R^2\)s

Here we demonstrate (a variant of) this in R; this variant, which pools the predictions across folds before computing \(R^2\), may lead to a better estimate than the original procedure.

# step 1
library(MASS)
dat=nlschools # use the nlschools data
k=5
# create a group index indicating the 
# group membership of an observation
ind=rep(1:k,length.out=nrow(dat))
# step 2
yhats=NULL
ys=NULL
for (g in 1:k){
  mod=lm(lang~SES+IQ,dat[ind!=g,])
  yhats=c(yhats,predict(mod,newdata=dat[ind==g,]))
  ys=c(ys,dat[ind==g,]$lang)
}
#step 3
cor(ys,yhats)^2
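
For comparison, here is a minimal sketch of the original procedure described in the list above, which computes \(R^2\) within each fold and then averages the k values (reusing dat, ind, and k from the code above):

r2s=NULL
for (g in 1:k){
  mod=lm(lang~SES+IQ,dat[ind!=g,])
  preds=predict(mod,newdata=dat[ind==g,])
  r2s=c(r2s,cor(dat[ind==g,]$lang,preds)^2)
}
mean(r2s)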

7.5 Bootstrap

One way to tackle the “inaccuracy” of the in-sample \(R^2\) is to show the variability in the \(R^2\) by providing a confidence interval. This can be achieved using bootstrapping. The general idea behind bootstrapping is that, without any additional assumptions, the distribution of the measurements in our sample is our best estimate of the population distribution. This procedure is useful in that it can give us a “confidence” interval for \(R^2\) and it also provides a (slightly) better \(R^2\).

The procedure works as follows.

  1. We draw B independent random samples of size N from our sample with replacement. Each new sample is called a bootstrap sample. Sampling with replacement allows each bootstrap sample to differ from our original data.
  2. We estimate the model on each of these bootstrap samples and evaluate it (in our case, via \(R^2\)) on the observations not included in that bootstrap sample. Thus, we have B \(R^2\)s in total. The standard deviation of these estimates is the bootstrap standard error.
# create a sample of nlschools
library(MASS)
dat=nlschools[1:25,]
# population R^2
summary(lm(lang~SES+IQ,nlschools))$r.squared
# in-sample R^2
summary(lm(lang~SES+IQ,dat))$r.squared
boot.r2<-function(N,data){
  bind=sample(1:nrow(data),N,replace = T)
  bdat=data[bind,]
  mod=lm(lang~SES+IQ,bdat)
  cor(predict(mod,data[-unique(bind),]),data[-unique(bind),]$lang)^2
}


br2s=replicate(5000,boot.r2(25,dat))
mean(br2s)
sd(br2s)
#Visualize the bootstrap replicates
hist(br2s, main = 'Bootstrap Replicates of R^2')
abline(v = quantile(br2s, c(.025, .975)), col = 'blue') # with confidence limits
abline(v = summary(lm(lang~SES+IQ,dat))$r.squared, col = 'red')
abline(v = mean(br2s), col = 'green')

8 Variable Selection and Exploratory Testing

8.1 Key Objectives

By the end of class, you should be able to

  1. understand stepwise regression
  2. implement regularized regression

8.2 Stepwise Regression

Stepwise regression is a method of variable selection through an automatic procedure. In each step, a predictor is considered for addition to or removal from the model based on some prespecified criterion. Here we will use the criterion AIC. AIC stands for the Akaike information criterion. It is an estimator that deals with the trade-off between the goodness of fit of the model and the simplicity of the model.

\[AIC=-2\ln(\hat{L})+2k,\] where \(\hat{L}\) is the maximized likelihood of the model and \(k\) is the number of estimated parameters; lower AIC is better.
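
As a quick check (a minimal sketch; the model lang ~ IQ on nlschools is chosen only for illustration), the formula can be verified against R’s AIC() function:

library(MASS)
mod=lm(lang~IQ,nlschools)
k=length(coef(mod))+1             # number of estimated parameters (+1 for the residual variance)
-2*as.numeric(logLik(mod))+2*k    # AIC computed by hand
AIC(mod)                          # should match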

We will discuss two types of stepwise regression.

  1. Forward
  2. Backward

8.2.1 Forward Stepwise Regression

Steps in forward stepwise regression

  1. Start with the intercept-only model (null model).
  2. Add the predictor that gives the lowest AIC value for the new model.
  3. Repeat this procedure, adding one predictor at a time.
  4. Stop when adding any remaining predictor no longer decreases the AIC.
library(MASS)
# forward stepwise via AIC
newdata <-nlschools[,-3]
full <- lm(lang ~ ., newdata)
null <- lm(lang ~ 1, newdata)
forward.step <- step(null,scope=list(upper=full,lower=null))
summary(forward.step)

8.2.2 Backward Stepwise Regression

Steps in backward stepwise regression

  1. Start with the largest model you are willing to consider.
  2. Remove the predictor whose removal gives the lowest AIC value for the new model.
  3. Repeat this procedure, removing one predictor at a time.
  4. Stop when removing any remaining predictor no longer decreases the AIC.
# backward stepwise via AIC
backward.step <- step(full,scope=list(upper=full,lower=null))
summary(backward.step)

8.3 Regularized Regression

8.3.1 Ridge Regression (L2-regularization)

Ridge regression can be used to enhance out-of-sample model performance, in particular to ameliorate high variance in the \(\beta\) estimates and problems from correlated predictors. To control the potentially high variance of the \(\beta\)s in regular linear regression, a ridge constraint is imposed on the loss function: that is, in addition to estimating the \(\beta\)s that minimize the squared errors, we add the constraint that \(\sum^p_{j=1} \beta_j^2 \leq t\).

Ridge regression has a closed-form solution. That is, there is a formula for the \(\beta\) estimates, \(\hat{\beta}^{\text{ridge}}_\lambda=(Z^TZ+\lambda I_p)^{-1}Z^Ty\), where \(Z\) contains the standardized predictors and \(y\) is centered. \(\lambda\) is a tuning parameter that we call the shrinkage parameter, and we have a set of \(\hat{\beta}\)s for each \(\lambda\). Essentially,

  • \(\lambda\) controls the size of the coefficients
  • \(\lambda\) controls the amount of regularization
  • as \(\lambda \rightarrow 0\), we obtain the OLS \(\beta\) estimates
  • as \(\lambda \rightarrow \infty\), we obtain an intercept-only model

Ridge regression can be implemented through the glmnet package. The glmnet() function has an alpha argument that determines what type of model is fitted; if alpha = 0, a ridge regression model is fitted. In particular, here we show an implementation where an “optimal” \(\lambda\) is “automatically” selected through cross-validation within the function cv.glmnet().

library(glmnet)
library(psych)
bfi=na.omit(bfi)
bfi$agree=rowSums(bfi[,1:5])
bfi=bfi[,-c(1:5)]
x.train=data.matrix(bfi[1:100,-24])
y.train=bfi[1:100,24]
x.test=data.matrix(bfi[101:2236,-24])
y.test=bfi[101:2236,24]

# regular linear model
lm.mod=lm(agree~.,bfi[1:100,])
coef(lm.mod)
mean((y.test-predict(lm.mod,bfi[101:2236,-24]))^2)

# ridge model
fit.ridge.cv <- cv.glmnet(x.train, y.train, type.measure="mse", alpha=0,family="gaussian")
# plotting effects of lambda
plot(fit.ridge.cv)
# coefficient of ridge regression
coef(fit.ridge.cv)
mean((y.test-predict(fit.ridge.cv,x.test))^2)

Note that ridge regression shrinks the coefficients of low-variance components, so it is more “fair” to standardize the predictors prior to ridge regression (glmnet() does this internally by default via its standardize = TRUE argument). Moreover, ridge regression produces biased \(\beta\) estimates that nevertheless often enjoy low prediction error due to reduced variance.

8.3.2 LASSO Regression (L1-regularization)

In an exploratory context, using a regression with all available predictors may not be the wisest choice, as including irrelevant predictors can result in noisier predictions and generally poor parameter estimates. Hence, selecting the relevant/important predictors can be of particular interest, both for variable selection itself and to enhance prediction. LASSO (the least absolute shrinkage and selection operator) was introduced to do exactly this!

The LASSO coefficients are the solutions to the \(l_1\)-constrained optimization problem: minimize \((y-Z\beta)^T(y-Z\beta)\) s.t. \(\sum^p_{j=1}|\beta_j|\leq t\), or equivalently, minimize the loss function \((y-Z\beta)^T(y-Z\beta)+\lambda ||\beta||_1\). Here \(t\) and \(\lambda\) are inversely related. LASSO regression does not have a closed-form solution for the \(\beta\) estimates.

LASSO regression can also be implemented through the glmnet package. If alpha = 1, a LASSO regression model is fitted. In particular, here we show an implementation where an “optimal” \(\lambda\) is “automatically” selected through cross-validation within the function cv.glmnet().

# LASSO model
fit.lasso.cv <- cv.glmnet(x.train, y.train, type.measure="mse", alpha=1,family="gaussian")
# plotting effects of lambda
plot(fit.lasso.cv)
# coefficient of LASSO regression
coef(fit.lasso.cv)
mean((y.test-predict(fit.lasso.cv,x.test))^2)

Note that a possible natural next step is to use the selected predictors from LASSO to form a new regression model for significance testing in a new sample.
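
As a rough sketch of that next step (reusing fit.lasso.cv, x.test, and y.test from above, and treating the held-out rows as the “new sample” purely for illustration):

# predictors with nonzero LASSO coefficients at lambda.1se
sel=coef(fit.lasso.cv,s="lambda.1se")
keep=setdiff(rownames(sel)[which(sel[,1]!=0)],"(Intercept)")
keep
# refit an ordinary regression with only the selected predictors
test.dat=data.frame(agree=y.test,x.test[,keep,drop=FALSE])
summary(lm(agree~.,test.dat))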

8.3.3 Elastic Net Regression (L1- and L2-regularization)

Elastic net, simply put, is a combination of LASSO and ridge. This is evident in the estimation of the \(\beta\)s, where \(\hat{\beta}=\arg\min_\beta ||y-X\beta||^2+\lambda_2||\beta||^2_2+\lambda_1||\beta||_1\). In elastic net (as parameterized in glmnet), the goal is to minimize the residual sum of squares + \((1-\alpha) \cdot\) ridge penalty + \(\alpha \cdot\) LASSO penalty. Elastic net is advantageous because it performs both variable selection and collinearity reduction.

  • In the case where you have more parameters than observations, \(p>n\), the LASSO method selects at most n variables before it saturates, because of the nature of the convex optimization problem. This can be a defect for a variable selection method. By contrast, the elastic net method can select more than \(n\) variables in this case because of the ridge regression regularization.
  • If there is a group of variables that have high pairwise correlations, then whereas LASSO tends to select only one variable from that group, the elastic net method can select more than one variable.
  • In the \(n >p\) case, if there are high correlations between predictors, it has been empirically observed that the prediction performance of LASSO is dominated by ridge regression. In this case, the elastic net method can achieve better prediction performance by using ridge regression regularization.

The implementation of elastic net involves searching for the optimal \(\alpha\) ourselves. Here we implement such a search using cross-validation.

# load loads of packages all at once
# you need to install doParallel and foreach 
pkgs <- list("glmnet", "doParallel", "foreach")
lapply(pkgs, require, character.only = T)
# 
registerDoParallel(cores = 4)
a <- seq(0.1, 0.9, 0.05)
search <- foreach(i = a, .combine = rbind, .packages="glmnet") %dopar% {
  cv <- cv.glmnet(x.train, y.train, nfold = 10, type.measure = "mse", parallel = TRUE, alpha = i) 
  data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i)
}
cv.la <- search[search$cvm == min(search$cvm), ]
mod <- glmnet(x.train, y.train, lambda = cv.la$lambda.1se, alpha = cv.la$alpha)
coef(mod)
mean((y.test-predict(mod,x.test))^2)

8.4 Summary

8.4.1 List of Commands

  1. conducts stepwise regression (forward or backward): step()

9 Logistic Regression

9.1 Key Objectives

By the end of class, you should be able to

  1. run logistic regression in R
  2. interpret beta coefficients
  3. understand the accuracy, sensitivity, and specificity of the model prediction

9.2 Logistic Regression

Previously we have only worked with dependent variables that are continuous. However, sometimes we are interested in modeling a binary dependent variable, such as:

  1. Success or Fail
  2. Right or Wrong
  3. Married or Not Married
  4. History of depression or No history of depression

We will be working with Ross’ dataset located in the lab folder for today. This dataset contains 205 observations of 5 variables: whether or not a participant had a suicide attempt (SuicideAttempt), depression symptomology (depression), gender (Gender), age (Age), and lifetime frequency of self-injury (LifetimeFreq). We are interested in predicting the likelihood of attempting suicide from depression. In this case, we have a dichotomous dependent variable, suicide attempt, and a continuous independent variable, depression.

library(psych)
# make sure to save the data file into your working directory
# setwd(" ") FILL IN your working directory in the blank here
dat<-read.csv("SAdat.csv")
attach(dat)

#explore the data
describe(dat)
describeBy(dat, SuicideAttempt)
hist(Depression)
hist(Gender)
hist(Age)
hist(LifetimeFreq)

counts <-table(SuicideAttempt)
barplot(counts, main="Suicide Attempt Distribution", xlab="Suicide Attempt")

Let’s first look at our results if we just did the linear regression like we normally would instead of a logistic regression. This means we will be treating SuicideAttempt as a continuous variable.

SuicideAttempt.num <-as.numeric(SuicideAttempt)
head(SuicideAttempt.num) #1=no, 2=yes
#Recode so that 0=no, 1=yes
SuicideAttempt.new <-SuicideAttempt.num -1
#Fit model with suicide attempt as DV and depression as IV
lm1 <-lm(SuicideAttempt.new ~ Depression); summary(lm1)
#Check to see if the outcome is normally distributed?
plot(density(SuicideAttempt.new)); plot(lm1)
#plot 1 Residuals vs the fitted values
#plot 2 Normal Q-Q plot
#plot 3 Sqrt of standardized residuals vs fitted values
#plot 4 Standardized residuals vs the leverage values

Question 1 What assumptions of linear regression does a binary dependent variable violate?

In the case of a binary outcome, e.g., \(Y=0\) or \(1\), we should use logistic regression, where we model the log odds of \(Y=1\) occurring,

\[\text{logit}(P(Y=1)) = \beta_0 + \beta_1 X_1\]

where \(\text{logit}(p) = \ln \frac{p}{1-p}\), or, equivalently written as a probability,

\[P(Y=1) = \frac{e^{\beta_0 + \beta_1X_1}}{1 + e^{\beta_0 + \beta_1X_1}}.\]

We also might be interested in the odds ratio. First, consider the definition of odds: the odds of an event are \(\frac{p}{1-p}\), where \(p\) is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then its odds are \(0.80 / (1 - 0.80) = 4\), or \(4:1\).

The odds ratio is subsequently defined as the ratio of the odds of A in the presence of B to the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A to the odds of B in the absence of A. For example, say the odds of developing depression given exposure to Qimin’s lab are \(O_{D|Q}=\frac{\text{\# of depressed students after Qimin's lab}}{\text{\# of non-depressed students after Qimin's lab}}=\frac{1}{3}\), and the odds of developing depression without exposure to Qimin’s lab are \(O_{D|\sim Q}=\frac{1}{14-1}=\frac{1}{13}\). Then the odds ratio of the two would be \(OR=\frac{1/3}{1/13}=\frac{13}{3}\).

We now show in R how to implement logistic regression and how to compute odds ratio and the corresponding confidence interval.

#Intercept only model
log.null <- glm(SuicideAttempt ~ 1, family=binomial); summary(log.null)
b0 <-as.numeric(log.null$coefficients[1]) #estimated logit = -1.079

#convert logit (log odds) to an odds ratio
exp(coef(log.null)) #1:3 odds
#low odds that you would have attempted suicide given no predictors
prob.null <- (exp(b0))/(1+exp(b0)) #.25
#given no other information, there is an estimated 25% chance of attempting suicide

#Add in Depression as a predictor
log.reg <- glm(SuicideAttempt ~ Depression, family=binomial); summary(log.reg); 
b0 <-as.numeric(log.reg$coefficients[1]); b1 <-as.numeric(log.reg$coefficients[2])
#estimated logit = -.91 + .053*depression
exp(coef(log.reg)) #odds ratio
prob.b0 <-exp(b0 + b1*0)/(1 + exp(b0 + b1*0)) #When Depression is 0

9.3 \(\beta\) Coefficients Estimates and Predictions

  • \(\beta_0\): Here \(e^{\beta_0}=.15\). We can interpret that as the odds of attempting suicide when depression is 0.
  • \(\beta_1\):
    1. If the estimated \(\beta_1\) is NEGATIVE, the log odds, odds, and probability of the event occurring are estimated to DECREASE as the predictor (\(x_1\)) increases.
    2. If the estimated \(\beta_1\) is POSITIVE, the log odds, odds, and probability of the event occurring are estimated to INCREASE as the predictor (\(x_1\)) increases.

Since in our model \(\beta_1\) was positive, we expect that as depression increases, the log odds, odds, and probability of attempting suicide will increase. Here \(e^{\beta_1}=1.05\). We can interpret that as: a one-unit increase in depression is associated with a 5% increase in the odds of attempting suicide. That is, the ratio of the odds of attempting suicide at depression\(=x+1\) versus depression\(=x\) is 1.05.

Question 2 what is the interpretation of \(e^{\beta_0+\beta_1}\) in our example?

Furthermore, we may also be interested in estimating the logit, odds, and probability at specific values of \(x_1\). For example, let’s consider how likely someone is to attempt suicide if they have the mean depression score.

#Estimated logit, odds, and probability of attempting suicide 
#when someone scores at the mean value for depression
Depression_mean <-mean(Depression); logit.m <- b0 + b1*Depression_mean #logit (log odds)
odds.m <-exp(b0 + b1*Depression_mean) #odds
#Probability (estimated probability of attempting suicide for the mean depression value)
prob.m <- (exp(b0 + b1*Depression_mean))/(1+exp(b0 + b1*Depression_mean))

Question 3 Repeat the above procedures given \(x_1\) equals 5 or 25.

It is always a good idea to plot your data. Let’s plot the distribution of predicted probabilities. Note you have to use the numeric version of the dependent variable in order to plot the logistic regression. We can plot our IV (depression) against our DV (suicide attempt). The blue line represents the logistic regression curve.

#plotting the distribution of predicted probabilities
plot(Depression,SuicideAttempt.new,xlab="Depression",ylab="Suicide Attempt") 
curve(predict(log.reg,data.frame(Depression=x),type="resp"),add=TRUE, col="blue")

#plot the distribution of depression for each suicide attempt classification
install.packages("sm") #choose no for install question
library(sm)
sm.density.compare(Depression, SuicideAttempt.new, xlab = "Depression")
title(main="Distribution for Each Suicide Attempt Class")
colfill<-c(2:(2+length(levels(SuicideAttempt)))) 
legend(42,0.03, levels(SuicideAttempt),fill=colfill)

9.4 Model Fit

To test for model fit, again we can adopt a model comparison approach, for example, to test \(\beta_1\), we can compare the following models:

  • restricted model: \(logit(E(Y))=\beta_0\)
  • full model: \(logit(E(Y))=\beta_0+\beta_1X_1\)

To compare the two models, we want to consider the deviance: \(D=2(ll_s-ll_m)\), where \(ll_s\) represents the log-likelihood of the saturated model given the observed data and \(ll_m\) refers to the log-likelihood of the model of interest given the observed data. Here, the saturated model is a model with a parameter for every observation, so that the data are fitted exactly. The likelihood measures how plausible the model’s parameters are given the observed data. When applied to a single model, deviance measures the model fit against the saturated model; moreover, by construction, deviance can also be used to compare two models.
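
As a quick check (a minimal sketch reusing log.reg from above): for ungrouped binary data, the saturated model fits each observation exactly, so its log-likelihood is 0 and the residual deviance reported by glm() is simply \(-2\,ll_m\).

log.reg$deviance                  # residual deviance from glm()
-2*as.numeric(logLik(log.reg))    # -2 times the model log-likelihood; should match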

Suppose, in the framework of the GLM, we have two nested models, \(M_1\) and \(M_2\). In particular, suppose that \(M_1\) contains the parameters in \(M_2\) plus \(k\) additional parameters. Then, under the null hypothesis that \(M_2\) is the true model, the difference between the deviances of the two models follows an approximate chi-squared distribution with \(k\) degrees of freedom. In our example, \(M_2\) is the intercept-only model and \(M_1\) is the model with one regressor, so the difference between the deviances of the two models follows a chi-squared distribution with 1 degree of freedom. To implement this in R, we have

anova(log.null, log.reg, test="Chisq")

As deviance heavily involves the log-likelihood, and the model fitting process maximizes the log-likelihood of the fitted model, we can manipulate the likelihood to construct pseudo R-squared measures that quantify model fit in more familiar terms.

For pseudo \(R^2\), there are two versions for logistic regression we could use: Cox & Snell or Nagelkerke. Cox & Snell’s \(R^2\) is \(1-(\frac{L_0}{L_{full}})^{2/N}\). It uses the ratio of the likelihoods to reflect the improvement of the full model over the intercept-only model. However, Cox & Snell’s \(R^2\) has a maximum value of \(1-[p^p(1-p)^{1-p}]^2\), which is less than 1. To correct for this, some researchers use Nagelkerke’s \(R^2\), which rescales Cox & Snell’s \(R^2\) so that the maximum possible value is 1. Its formula is \(\frac{1-(\frac{L_0}{L_{full}})^{2/N}}{1-L_0^{2/N}}\).

#Log-Likelihood
ll.null <- as.numeric(logLik(log.null)); ll.full <- as.numeric(logLik(log.reg))
-2*ll.null; -2*ll.full #negative log likelihood
1-exp(-((-2*ll.null-(-2*ll.full))/205)) #Cox & Snell R^2
(1-exp(-((-2*ll.null-(-2*ll.full))/205)))/(1-exp(-(-2*ll.null)/205)) #Nagelkerke R^2

Question 4 Compare the model where age and depression are both predictors to the intercept-only model and then to the model with depression as the sole predictor. Compute the Cox & Snell \(R^2\) and the Nagelkerke \(R^2\) for the model with age and depression as predictors.

9.5 Accuracy, Sensitivity, and Specificity

Important Definitions

  • True Positive: number of correctly identified positive cases
  • True Negative: number of correctly identified negative cases
  • False Positive: number of cases incorrectly identified as positive
  • False Negative: number of cases incorrectly identified as negative
  • Accuracy is the (total number correct)/total = (True Pos + True Neg)/(Total)
  • Sensitivity is the “True Positive Rate,” or the ability to correctly identify who will attempt suicide.
  • Specificity is the “True Negative Rate,” or the ability to correctly identify who will not attempt suicide.

In order to calculate these results, we need to transform each person’s predicted probability into a class. If we use a cutoff of .5, then values greater than .5 indicate that the person is predicted to attempt suicide and values less than .5 indicate that the person is not. Cutoffs can be specified by the researcher based on theory.

We can use the confusionMatrix() function from the caret package to get a matrix of true positive, true negative, false positive, and false negative counts. We can then use those to calculate accuracy, sensitivity, and specificity; the function will also directly provide these results. Also note that we set the argument positive to 1. This is because, in our data, attempting suicide is coded as 1 and not attempting is coded as 0, so the positive case is that the person attempted suicide and the negative case is that they did not. Make sure you always check which case is positive or negative in your own data and specify the argument accordingly.

install.packages("caret"); library(caret)
install.packages("e1071");library(e1071)

preds <- predict(log.reg,type="response"); pred.class <- as.factor(ifelse(preds>.5,1,0))
SuicideAttempt.num <-as.numeric(SuicideAttempt); SuicideAttempt.new <-SuicideAttempt.num -1
confusionMatrix(pred.class,as.factor(SuicideAttempt.new), positive='1')

(150+9)/205 #Accuracy
9/(43+9) #Sensitivity
150/(150+3) #Specificity

9.6 ROC and AUC

The receiver operating characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various threshold settings. The ROC curve quantifies the tradeoff between sensitivity and specificity: any increase in sensitivity will be accompanied by a decrease in specificity. To interpret the ROC plot,

  1. The closer the curve to the top left corner, the more accurate the test.

  2. The closer the curve comes to the diagonal of the ROC space, the less accurate the test. If the curve is the diagonal line, that would indicate no predictive value of the model.

With an ROC curve, one way we can choose an optimal cutpoint is to see which point on the curve is closest to the top left corner (maximizing both sensitivity and specificity).

We can also calculate the area under the curve (AUC), which is another way of combining sensitivity and specificity as a performance metric. AUC is equivalent to the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. Thus, an AUC of 0.5 means our model has no better accuracy than chance. A model with perfect accuracy has an AUC of 1.

#ROC curve
install.packages("pROC"); library(pROC)
#Depression as predictor
pred1 <-predict(log.reg, type="response")
roc.out <- roc(SuicideAttempt,Depression,plot=T,ci=T,print.thres=0.5) #ci: compute confidence interval

# ROC curve for model with depression and gender as predictor
mult.log.reg1 <- glm(SuicideAttempt ~ Depression + Gender, family=binomial)
pred2 <-predict(mult.log.reg1, type="response")
roc.out2 <- roc(SuicideAttempt,pred2,plot=T,ci=T,print.thres=0.5)

pred.class2 <- as.factor(ifelse(pred2>.5,1,0))
#compare our two ROC curves and the accuracy.
confusionMatrix(pred.class2,as.factor(SuicideAttempt.new), positive='1') 

#find optimal cutoff for model with 1 predictor
install.packages("ROCR"); library(ROCR)

pred = prediction(pred1,SuicideAttempt)
#"sens": sensitivity,"spec": specificity, "fpr": False positive rate
roc.perf = performance(pred, "sens","spec" ,"fpr") 

plot(roc.perf); abline(a=0, b= 1)

#function finds the point on the curve that is to the most top-left of the graph
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(roc.perf, pred))

roc.out # AUC

9.7 Multiple Logistic Regression

Just like in multiple linear regression, in multiple logistic regression, we are adding more predictors into our model.

#1 categorical predictor
mult.log.reg1 <- glm(SuicideAttempt ~ Depression + Gender, family=binomial)
summary(mult.log.reg1)

#plot 
plot(Depression, SuicideAttempt.new, xlab="Depression", ylab="Suicide Attempt") 
curve(predict(mult.log.reg1, data.frame(Depression=x, Gender=0), type="resp"), 
      add=TRUE, col="blue")
curve(predict(mult.log.reg1, data.frame(Depression=x, Gender=1), type="resp"), 
      add=TRUE, col="red")

#continuous predictors
mult.log.reg2 <- glm(SuicideAttempt.new ~ Depression + Age, family=binomial)
summary(mult.log.reg2)
plot(Depression, SuicideAttempt.new, xlab="Depression", ylab="Suicide Attempt") 
curve(predict(mult.log.reg2, data.frame(Depression=x, Age=15), type="resp"), 
      add=TRUE, col="purple")
curve(predict(mult.log.reg2, data.frame(Depression=x, Age=20), type="resp"), 
      add=TRUE, col="blue")
curve(predict(mult.log.reg2, data.frame(Depression=x, Age=25), type="resp"), 
      add=TRUE, col="lightblue")
curve(predict(mult.log.reg2, data.frame(Depression=x, Age=30), type="resp"), 
      add=TRUE, col="red")
curve(predict(mult.log.reg2, data.frame(Depression=x, Age=35), type="resp"), 
      add=TRUE, col="yellow")
curve(predict(mult.log.reg2, data.frame(Depression=x, Age=40), type="resp"), 
      add=TRUE, col="green")

#accuracy, sensitivity, specificity for 1st model
preds2 <- predict(mult.log.reg1,type="response") 
pred.class2 <- as.factor(ifelse(preds2>.5,1,0))
confusionMatrix(pred.class2,as.factor(SuicideAttempt.new), positive='1')

##accuracy, sensitivity, specificity for 2nd model
preds2 <- predict(mult.log.reg2,type="response") 
pred.class2 <- as.factor(ifelse(preds2>.5,1,0))
confusionMatrix(pred.class2,as.factor(SuicideAttempt.new), positive='1')

9.8 Summary

9.8.1 List of Commands

  1. conducts logistic regression: glm()
  2. computes log-likelihood: logLik()
  3. computes accuracy, sensitivity, and specificity: confusionMatrix() from caret package

9.8.2 Citations and Resources

  1. Awesome summary on R squares and pseudo R squares
  2. How to do logistic regression in R/SAS by Penn State

10 Polynomial Regression

10.1 Motivation

One of the assumptions of linear regression is linearity in the \(\beta\)s. This restriction on the functional form often translates into a linear relationship between \(X\) and \(Y\). However, sometimes the relationship between \(X\) and \(Y\) is insufficiently captured by a linear term. To check whether the linear term is sufficient, we can plot the residuals from a regression with only the linear term against the predicted values from that model. If the residuals show a non-linear pattern in the plot, we may suspect a more-than-linear relationship between the independent and dependent variables. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication that the linear term is sufficient.

library(MASS)
attach(nlschools)
# create an artificial variable that has a nonlinear relationship with lang
nonlinear.var=scale(nlschools$lang)^3
# pre-analysis linearity check
plot(lang, nonlinear.var)
plot(lang, IQ)
# post-analysis linearity check
mod.nonlinear <- lm(nlschools$lang~nonlinear.var)
mod.linear <- lm(nlschools$lang~nlschools$IQ)
# residuals against predicted y
plot(mod.linear$fitted.values,mod.linear$residuals); # or try plot(mod.linear,1)
plot(mod.nonlinear$fitted.values,mod.nonlinear$residuals) # or try plot(mod.nonlinear,1)

Polynomial regression, a form of regression analysis in which the relationship between the independent variable \(X\) and the dependent variable \(Y\) is modelled as an \(n\)th degree polynomial in \(X\), is one way to account for the nonlinear relationship between \(X\) and \(Y\).

10.2 Polynomial Regression in R

It is worth noting that polynomial regression, despite its capacity to capture relationships between \(X\) and \(Y\) beyond a linear fashion, is not a nonlinear regression. For example, consider quadratic regression, i.e., polynomial regression including up to the quadratic term, \(Y=\beta_0+\beta_1 X + \beta_2 X^2+E\); the \(\beta\)s still enter linearly. For this reason, polynomial regression is considered a special case of multiple linear regression. However, as \(X\) and its higher-order terms are highly collinear, it is recommended to orthogonalize them, i.e., to make them uncorrelated. R can automatically orthogonalize the predictors via poly().

fakey = scale(nlschools$lang) + scale(nlschools$lang)^2 + rnorm(length(nlschools$lang))
mod = lm(fakey ~ poly(nlschools$lang, 2))
summary(mod)
plot(nlschools$lang, mod$fitted.values)

What does this mean? The fitted model tests the hypothesis of a quadratic relationship between lang and the artificially created variable fakey, \(\hat{Y}=.99+22.74\,\text{lang}+56.73\,\text{lang}^2\) (note that poly() uses orthogonal polynomial terms by default, so the coefficients do not correspond directly to raw lang and lang\(^2\); use poly(..., raw = TRUE) for that). Here, the individual \(\beta\) coefficients are less interpretable, and it may be better to interpret the model as a whole, \(F(2,2284)=1756,\ p<.001\). One implication of polynomial regression is that the change in \(Y\) due to \(X\) no longer remains constant across all values of \(X\). Treating the coefficients as if they were on the raw terms, the estimated rate of change is the first derivative of the estimated function, \(\frac{d\hat{Y}}{d\,\text{lang}}=22.74+2\times 56.73\times \text{lang}\).

The first derivative of the quadratic regression function, \(aX^2+bX+c\), is \(2aX+b\). Here we can see that quadratic regression assumes the acceleration of change is a constant \(2a\). However, sometimes the acceleration itself can also vary with \(X\). In that case we would want to use cubic regression, \(aX^3+bX^2+cX+d\). The rate of change in cubic regression is the first derivative, \(3aX^2+2bX+c\), and the acceleration in the rate of change is the second derivative, \(6aX+2b\). To fit a cubic regression,

fakey = scale(nlschools$lang) + scale(nlschools$lang)^2 - scale(nlschools$lang)^3 + 
    rnorm(length(nlschools$lang))
mod = lm(fakey ~ poly(nlschools$lang, 3))
summary(mod)
plot(nlschools$lang, mod$fitted.values)

10.3 More on Polynomial Regression

Given a quadratic or cubic function, we can calculate the critical point(s) where the relationship between \(X\) and \(Y\) changes from positive to negative or vice versa. The critical point for a quadratic function is at \(X=-\frac{b}{2a}\), and for a cubic function the critical points are at \(X=\frac{-b\pm \sqrt{b^2-3ac}}{3a}\). To estimate critical points, we can plug in the coefficient estimates. However, the estimated critical points are descriptive only, and you should not generalize them to the population because no significance testing has been done.
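
As a minimal sketch (reusing fakey from above and refitting with raw = TRUE so that the coefficients correspond directly to the \(X\) and \(X^2\) terms):

quad.raw=lm(fakey ~ poly(nlschools$lang, 2, raw=TRUE))
b=coef(quad.raw)[2]   # coefficient on X
a=coef(quad.raw)[3]   # coefficient on X^2
-b/(2*a)              # estimated critical point (descriptive only)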

To determine up to which degree \(n\) one should include polynomial terms, visualization via a scatterplot can be helpful. Another approach is to use a partition of the sample to test which polynomial regression fits better. To do this, we use hierarchical regression, where we start with the linear model and add polynomial terms one at a time.

Note that you should not add a cubic term without the quadratic term. More generally, you should not add a \(p\)th-order polynomial term without including all polynomial terms of order 1 to \(p-1\). As such, in hierarchical regression, you should always test the quadratic model against the linear model first, and only when that comparison is significant do you test the cubic model against the quadratic. That is, you test polynomial regression models sequentially, comparing the \((n+1)\)th-order polynomial regression model to the \(n\)th-order model one step at a time.

linear.mod = lm(fakey ~ nlschools$lang)
quad.mod = lm(fakey ~ poly(nlschools$lang, 2))
cubic.mod = lm(fakey ~ poly(nlschools$lang, 3))
anova(linear.mod, quad.mod, cubic.mod)
summary(mod)

11 Moderation Models in R

11.1 Key Objectives

By the end of class, you should be able to

  1. Specify a moderation model
  2. Plot simple slopes of x at various values of the moderator
  3. Assess significance of simple slopes
  4. Use floodlight analysis (Johnson-Neyman)

11.2 Moderation

Moderation and interaction are synonymous. Let \(Y\) be our dependent variable, \(X\) be our predictor, and \(Z\) be our moderator. Here \(X\times Z\) is the interaction between \(X\) and \(Z\), also referred to as our moderation effect. In moderation analysis, we are unable to tell which variable is moderating the effect of the other variable on \(Y\): does \(Z\) moderate the effect of \(X\) on \(Y\), or does \(X\) moderate the effect of \(Z\) on \(Y\)? You choose the variable you call the moderator based on theory.

One motivation for testing moderation is that when a moderator \(Z\) exists, the effect of \(X\) on \(Y\) becomes a function of \(Z\). For example, in a pretest-posttest design, the treatment effect can be statistically significant only for some pretest scores. This is why procedures like the Johnson-Neyman method and Potthoff’s extensions test the treatment effect at different pretest scores and determine the region of pretest scores in which the treatment effect is statistically significant.

Here we will be using the epi.bfi dataset from the psych package. The dependent variable is bdi (depression) and the independent variable is epiNeur (neuroticism). Based on theory, we will say that the moderator is stateanx (state anxiety). We are interested in predicting bdi from epiNeur, stateanx, and the interaction between the two predictors.

library(psych);dat <-epi.bfi;attach(dat)
#explore data
describe(dat);hist(bdi);hist(stateanx);hist(epiNeur)
#center variables to deal with collinearity
stateanx.c <-stateanx - mean(stateanx);epiNeur.c <-epiNeur - mean(epiNeur)
#moderation model
lm1 <- lm(bdi ~ stateanx.c + epiNeur.c + stateanx.c:epiNeur.c); summary(lm1)
#shorthand
lm1.b <- lm(bdi ~ stateanx.c*epiNeur.c);summary(lm1.b)

What if one predictor is categorical and the other is continuous? For the purpose of this exercise, we recode trait anxiety into low, medium, and high groups (three equal-width intervals via cut()) and aim to examine the effects of trait anxiety and state anxiety on depression.

library(MASS);library(car);
ta <- as.factor(cut(dat$traitanx, 3))
#moderation model
lm2 <- lm(bdi ~ stateanx.c + ta + ta:stateanx.c); 
summary(lm2);Anova(lm2,type=3)
#shorthand
lm2.b <- lm(bdi ~ stateanx.c*ta);summary(lm2.b)

What if both predictors are categorical? For the purpose of this exercise, we recode both trait and state anxiety into low, medium, and high groups (again via cut()) and once more examine the effects of trait anxiety and state anxiety on depression.

sa <- as.factor(cut(dat$stateanx, 3))
#moderation model
lm3 <- lm(bdi ~ sa + ta + ta:sa); 
summary(lm3);Anova(lm3,type=3)
#shorthand
lm3.b <- lm(bdi ~ sa*ta);summary(lm3.b)

We can interpret the intercept (\(b_0\)) as the expected value of \(Y\) for an individual with average values of \(X\) and \(Z\). The \(\beta\) estimate for the predictor \(X\) is the expected increase in \(Y\) due to a one-unit increase in \(X\) for an individual with an average value of \(Z\). The \(\beta\) estimate for the moderator \(Z\) is the expected increase in \(Y\) due to a one-unit increase in \(Z\) for an individual with an average value of \(X\). Finally, the \(\beta\) estimate for the interaction \(X\times Z\) is the expected increase in the effect of \(X\) on \(Y\) for a one-unit increase in \(Z\).

11.3 Plotting Simple Slopes

The simple slopes are just the regressions of the outcome \(Y\) on the predictor \(X\) at specific values of the moderator \(Z\). Plotting these simple slopes can give us a better idea of how specific values of \(Z\) moderate the effect of \(X\) on \(Y\).

install.packages("rockchalk")
library(rockchalk)
plotSlopes(lm1, plotx="epiNeur.c",modx="stateanx.c", modxVals="std.dev.")

# add in CI bands
plotCurves(lm1, plotx="epiNeur.c",modx="stateanx.c", modxVals="std.dev.", interval="confidence")
describe(stateanx.c) # sd = 11.5, min = -18.85, max = 39.15

# "high" versus "low" values of state anxiety
plotSlopes(lm1, plotx="epiNeur.c",modx="stateanx.c", modxVals=c(-11.5,11.5), interval="confidence")

# specific values of state anxiety
plotSlopes(lm1, plotx="epiNeur.c",modx="stateanx.c", modxVals=c(-18.85, -10, 0, 10, 20, 30, 39.15))

When we have two categorical predictors, we can use

interaction.plot(ta,sa,bdi)

11.4 Significance Region Testing

We are often interested in assessing the significance of the conditional effect of \(X\) on \(Y\) at a specific value of \(Z\). This is particularly helpful when we have theory-based \(Z\) values of interest. However, if you have no theory about which point would be of interest, you can use the Johnson-Neyman approach to look at all values of the moderator and see where there is a significant conditional effect of \(X\).

sig_regions calculates the Johnson-Neyman (J-N) regions of significance for an interaction – the points at which the simple effect of the categorical predictor changes from non-significant to significant.

install.packages("reghelper"); library(reghelper)
install.packages("probemod"); library(probemod)
install.packages("interactions"); library(interactions)
# simple slope analysis
simple_slopes(lm1); simple_slopes(aov(lm2)); simple_slopes(lm3) # default moderator values
simple_slopes(lm1, levels = list(epiNeur.c = 0)) # effect of state anxiety at epiNeur.c = 0
# Johnson-Neyman when one predictor is categorical and the other is continuous
sig_regions(lm2)
# Johnson-Neyman when both predictors are continuous
states <- as.data.frame(state.x77)
states$HSGrad <- states$`HS Grad`
fit <- lm(Income ~ HSGrad + Murder*Illiteracy, data = states)
johnson_neyman(fit, pred = Murder, modx = Illiteracy)
johnson_neyman(lm1, pred = stateanx.c, modx = epiNeur.c)

12 Mediation

12.1 Key Objectives

By the end of class, you should be able to

  1. understand the simple mediation model
  2. test the indirect effect via the Sobel test and the percentile bootstrap
  3. test and estimate a mediation model with a covariate or multiple mediators

12.2 A Simple Mediation Model

We are often interested in causal relationships between variables. One variable may affect another variable directly, indirectly (through other variables), or both. If a variable \(X\) is assumed to cause the outcome \(Y\), \(X\) is called a causal variable. However, the effect of \(X\) on \(Y\) could be mediated by a mediating variable \(M\), as in: \[M=i_M+aX+e_M\] \[Y=i_Y+c'X+bM+e_Y\]

  1. Direct effect \(c'=[\hat{Y}|(X=x,M=m)]-[\hat{Y}|(X=x-1,M=m)]\): two cases that differ by one unit on \(X\) but are equal on \(M\) are estimated to differ by \(c'\) units on \(Y\). For example, one minute of exercise can directly result in a weight loss of \(1lb\) at the same level of calorie intake, because calories are burned directly during exercise.

  2. Indirect effect \(ab\): two cases that differ by one unit on \(X\) are estimated to differ by \(ab\) on \(Y\) as a result of the effect of \(X\) on \(M\), which, in turn, affects \(Y\).

    • \(a=[\hat{M}|X=x]-[\hat{M}|X=x-1]\) quantifies how much two cases that differ by one unit on \(X\) are estimated to differ on \(M\).
    • \(b=[\hat{Y}|(X=x,M=m)]-[\hat{Y}|(X=x,M=m-1)]\): two cases that differ by one unit on \(M\) but are equal on \(X\) are estimated to differ by \(b\) units on \(Y\).
    • example: one minute of exercise suppresses calorie intake by \(200\) calories, every hundred of which corresponds to \(2lb\) of weight gain, so the indirect effect of exercise on weight through food intake is \(200/100\times (-2)=-4\).
  3. Total effect \(c=[\hat{Y}|(X=x)]-[\hat{Y}|(X=x-1)]\) quantifies how much two cases that differ by one unit on \(X\) are estimated to differ on \(Y\). The total effect of exercising one minute, directly and indirectly through food intake, on weight is \(-1-4=-5\) (see the short calculation after this list).
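
The arithmetic of the exercise example can be written out directly (a minimal sketch; the _ex objects are just the hypothetical numbers used above, not estimates from data):

c_prime_ex  <- -1                        # direct effect: 1 lb lost per minute of exercise, intake held constant
a_ex        <- -200                      # one minute of exercise reduces intake by 200 calories
b_ex        <- 2/100                     # 2 lb of weight gain per 100 calories of intake
indirect_ex <- a_ex * b_ex               # -4
total_ex    <- c_prime_ex + indirect_ex  # -5
c(indirect = indirect_ex, total = total_ex)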

Note that the total effect is perfectly partitioned into the direct and indirect effects in OLS/ML regression: \(c=c'+ab\). Substituting the first equation into the second gives \(Y=(i_Y+bi_M)+(ab+c')X+(e_Y+be_M)\). While statistical inference for the total effect and the direct effect is straightforward through testing the regression coefficients \(c\) and \(c'\), the test for the indirect effect, aka the mediation effect, can be nontrivial.
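
One quick way to see this partition is to simulate data from the two mediation equations and compare the total effect with \(c'+ab\); in OLS they match exactly within the same sample. A minimal sketch with arbitrary true values (names such as a_hat are ours, not from any package):

set.seed(123)
n <- 1000
x <- rnorm(n)
m <- 0.5*x + rnorm(n)          # true a  = 0.5
y <- 0.3*x + 0.4*m + rnorm(n)  # true c' = 0.3, true b = 0.4
a_hat  <- coef(lm(m ~ x))["x"]
fit_y  <- lm(y ~ x + m)
b_hat  <- coef(fit_y)["m"]
cp_hat <- coef(fit_y)["x"]
c_hat  <- coef(lm(y ~ x))["x"]
c(total = unname(c_hat), direct_plus_indirect = unname(cp_hat + a_hat*b_hat)) # identical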

One classic test is the Sobel test, also known as the delta method or the product-of-coefficients approach. Assuming the sampling distribution of \(ab\) to be normal, one known estimator for the standard error is \(se_{ab}=\sqrt{a^2se^2_b+b^2se^2_a+se^2_ase^2_b}\). Then we can compute a z-statistic, \(Z=\frac{ab}{se_{ab}}\), and proceed with obtaining the \(p\) value. Moreover, a confidence interval can also be computed using the standard error and the critical \(z\) value: \(CI=ab\pm Z_{\alpha/2}se_{ab}\).
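
The Sobel z and its confidence interval can be wrapped in a small helper (sobel_z is a hypothetical name, not a function from any package; it simply implements the formulas above). The hand calculation below does the same thing for the Mroz example.

sobel_z <- function(a, se_a, b, se_b, alpha = .05) {
  se_ab <- sqrt(a^2*se_b^2 + b^2*se_a^2 + se_a^2*se_b^2)  # delta-method standard error
  z     <- a*b/se_ab
  p     <- 2*(1 - pnorm(abs(z)))                          # two-tailed p value
  ci    <- a*b + c(-1, 1)*qnorm(1 - alpha/2)*se_ab
  list(ab = a*b, se = se_ab, z = z, p = p, ci = ci)
}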

Problem 1 State the null hypothesis for testing the indirect effect.

Let’s calculate the total, direct, and indirect effects of a mediation model. We will use the dataset Mroz from the package Ecdat. This dataset includes the outcome variable of wife’s hourly earnings (hearnw). Let’s first consider the independent variable, wife’s mother’s education (educwm), and the mediator, wife’s education in years (educw). Again, here we have

  • \(Y\): wife hourly earnings (hearnw)
  • \(X\): wife’s mother’s education (educwm)
  • \(M\): wife’s education in years (educw)
install.packages("Ecdat")
library(Ecdat)
?Mroz
head(Mroz)

# total effect
lm.xy <- lm(hearnw ~ educwm, Mroz)
summary(lm.xy) # c = .087, p < .05
# direct effect
lm.xy_m <- lm(hearnw ~ educwm + educw, Mroz)
summary(lm.xy_m) # c_prime = -.057, p = .12
# indirect effect (Sobel test by hand)
lm.xm <- lm(educw ~ educwm, Mroz)
a = lm.xm$coefficients[2]; se_a = summary(lm.xm)$coefficients[2,2]
b = lm.xy_m$coefficients[3]; se_b = summary(lm.xy_m)$coefficients[3,2]
z = as.numeric(a*b/sqrt(a^2*se_b^2 + b^2*se_a^2 + se_a^2*se_b^2))
2*(1 - pnorm(abs(z))) # two-tailed p value for the indirect effect

a*b + lm.xy_m$coefficients[2] # check: ab + c_prime recovers the total effect c

We can compare this result with a mediation model fit using the R package lavaan, analyzing the model with its sem() function.

library(lavaan)

# Run Mediation
model.med <- ' ## Regressions
              hearnw ~ c_prime*educwm # direct effect (c_prime)
              hearnw ~ b1*educw       # b path: mediator -> outcome
              educw ~ a1*educwm       # a path: predictor -> mediator

              ## Define indirect effect (a*b)
              a1b1 := a1*b1

              ## Define total effect (c)
              c := c_prime + (a1*b1)'

# analyze the model
fit.med = sem(model.med,data=Mroz)
summary(fit.med,ci=T,fit=T) # c_prime= - .057, p = .132

Note that the Sobel test has some disadvantages. The assumption of a normal sampling distribution for \(ab\) has been shown, through both derivation and simulation, to be unmet. As such, simulation research comparing this approach to various competing inferential methods has shown that it is among the lowest in power and generates confidence intervals that tend to be less accurate than some other methods.

To address the irregularity in the sampling distribution of \(ab\), a bootstrap approach can be adopted. By resampling the observations, an \(ab\) estimate can be obtained for each resampled dataset to construct an empirically derived representation of the sampling distribution of the indirect effect, from which a confidence interval can be obtained. This yields inferences that are more likely to be accurate, and with higher power, than the normal-theory approach. This is implemented in lavaan.

bfit.med = bootstrapLavaan(fit.med)  # matrix of parameter estimates, one row per bootstrap sample
boot.ab = bfit.med[,2]*bfit.med[,3]  # columns 2 and 3 correspond to the b1 and a1 estimates in this model
quantile(boot.ab, c(.025, .975))     # 95% percentile bootstrap CI for the indirect effect
mean(boot.ab)                        # bootstrap estimate of ab

Caution is warranted for the bootstrap approach as well, despite it being the widely recommended method. Because the bootstrap resamples the collected data to approximate the population distribution, the approach fails when the collected data are not representative. In small samples, the bootstrap can also fail because unusual observations can be more prominent than in a larger sample, where such outliers can be “washed out” by other representative observations. Alternatively, this may also show that the bootstrap accounts for unusual observations even in a small sample, but more research is needed. Nevertheless, the bootstrap can have advantages in power compared with the Sobel test in small samples.

Other approaches for testing \(ab\), including the Monte Carlo confidence intervals and the distribution of the product, are discussed in the supplementary reading.

Here we illustrate how to visualize the mediation model in a graph using the package semPlot.

install.packages("semPlot")
library(semPlot)
semPaths(fit.med,sizeMan=8,sizeInt=8,layout="tree2",rotation=2,edge.label.cex=1.5,whatLabels="est")

12.3 Mediation Models with a Covariate or with Multiple Mediators

We can control for other variables in our model by adding them as covariates. All the interpretations of the total, direct, and indirect effects still hold, but they are now conditional on holding the covariates constant. Let’s add a binary covariate indicating whether the wife participated in the labor force in 1975 (work).

# recode the factor: 1 = participated in the labor force, 0 = did not
Mroz$work2 = as.numeric(Mroz$work) - 1

model.med.cov <- ' ## Regressions
              hearnw ~ c_prime*educwm # direct effect (c_prime)
              hearnw ~ work2          # covariate on the outcome
              hearnw ~ b1*educw       # b path
              educw ~ a1*educwm       # a path
              educw ~ work2           # covariate on the mediator

              ## Define indirect effect (a*b)
              a1b1 := a1*b1

              ## Define total effect (c)
              c := c_prime + (a1*b1)'

fit.med.cov = sem(model.med.cov,data=Mroz)
summary(fit.med.cov,std=T,ci=T,fit=T)
semPaths(fit.med.cov,sizeMan=8,sizeInt=8,layout="tree2",rotation=2,edge.label.cex=1.5)

Problem 2 Conduct inferential tests using the bootstrap approach.

So far we have only considered path analyses with one mediator. However, we can specify more complex paths as well: independent variables can affect the dependent variable through more than one indirect path. Page 466 of the textbook shows a few examples of multiple-mediator path analyses. Here, let’s say we are interested in 3 mediators: wife’s labor market experience in years (experience), husband’s wage (wageh), and number of kids under the age of 6 (child6).

model.multi <- ' # direct effect (c_prime)
            hearnw ~ c_prime*educw

            # mediators
            experience ~ a1*educw
            wageh ~ a2*educw
            child6 ~ a3*educw
            hearnw ~ b1*experience + b2*wageh + b3*child6

            # indirect effects (a*b)
            a1b1 := a1*b1
            a2b2 := a2*b2
            a3b3 := a3*b3

            # total indirect effect
            ab := a1b1 + a2b2 + a3b3

            # total effect (c)
            c := c_prime + (a1*b1) + (a2*b2) + (a3*b3)'


fit.multi = sem(model.multi,data=Mroz,se="boot",bootstrap=1000)
summary(fit.multi,std=T,ci=T,fit=T)
semPaths(fit.multi,sizeMan=8,sizeInt=8,layout="tree2",rotation=2,edge.label.cex=1.5)

Problem 3 Write down the mediation null hypotheses and conduct bootstrap tests for these models.

12.4 Mediation Models with Dichotomous Variables

In mediation models, dichotomous variables, such as dummy-coded variables, can be readily handled as predictors without special treatment. For dichotomous mediators or outcomes, however, things become more involved. The supplementary reading provides a fairly good guide to the details, with great simplification, and also gives syntax for running such mediation models in SPSS.

The problem with dichotomous outcomes or mediators stems from the introduction of logistic regression. Recall that the mediation equations are, for example, given a dichotomous outcome \(Y\) and continuous \(X\) and \(M\),

  1. \(logit(Y)=cX\)
  2. \(M=aX+E\)
  3. \(logit(Y)=bM+c'X\)

As seen here, the coefficient \(a\) is on a different scale from \(c\), \(b\), and \(c'\), because \(a\) is expressed in terms of raw scores whereas the others are expressed in terms of logits. This makes the common formulation of the indirect effect \(ab\) no longer applicable, because \(a\) and \(b\) are no longer on the same scale. The conversion to make them comparable can be analytically complex.
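
To see the scale mismatch concretely, compare a logistic fit for a dichotomous outcome with a linear fit for the mediator (a minimal sketch; it reuses work2, educwm, and educw from the Mroz examples above, with work2 as \(Y\), educwm as \(X\), and educw as \(M\)):

# b and c' live on the logit scale ...
glm.y <- glm(work2 ~ educwm + educw, family = binomial, data = Mroz)
coef(glm.y)  # change in log-odds of working per one-unit increase in each predictor
# ... while a lives on the raw scale of the mediator
lm.m <- lm(educw ~ educwm, data = Mroz)
coef(lm.m)   # years of wife's education per year of mother's education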

lavaan regards the problem of dichotomous (or ordinal) dependent variables as a violation of the normality assumption in the relevant regression functions of the mediation model. It treats a dichotomous variable as arising from an underlying latent variable that is normally distributed; the dichotomous variable is simply the result of dichotomizing this continuous normal variable. So lavaan estimates mediation models with the underlying continuous, normally distributed latent variable and circumvents the need for logistic regression. To achieve this, lavaan uses a special estimator called diagonally weighted least squares (referred to as DWLS in lavaan and more commonly as WLSMV in the literature). We now illustrate this in R.

Mroz2 = Mroz
Mroz2$educw.c = Mroz2$educw - mean(Mroz2$educw) # center M1 (continuous)
Mroz2$educwm.c = Mroz2$educwm - mean(Mroz2$educwm) # center X (continuous)
Mroz2$city2 = as.numeric(Mroz2$city) - 1 #recode 1= live in large city; 0 = not live in large city

model.med.dich <-' # direct effect (c_prime)
                  work2 ~ c_prime*educwm.c 

                  work2 ~ b1*educw.c + b2*city2

                   # mediators
                  educw.c ~ a1*educwm.c
                  city2 ~ a2*educwm.c

                  # indirect effects (a*b)
                  a1b1 := a1*b1
                  a2b2 := a2*b2

                  # total indirect
                  ab := a1b1 + a2b2

                  # total effect (c)
                  c := c_prime + (a1*b1) + (a2*b2)
                  '

fit.med.dich = sem(model.med.dich, data = Mroz2, estimator = "DWLS",
                   ordered = c("city2","work2"), se = "bootstrap")

summary(fit.med.dich,std=T,ci=T,fit=T)

12.5 Effect Size Measures for Mediation Models

The literature is filled with numerous effect size measures for mediation models:

I. ratio of the indirect effect to the total effect: \(P_M=\frac{ab}{c}\)
II. ratio of the indirect effect to the direct effect: \(R_M=\frac{ab}{c'}\)
III. proportion of variance in Y explained by the indirect effect: \(R^2_{med}=r^2_{MY}-(R^2_{Y.MX}-r^2_{XY})\)
IV. kappa-squared (the size of an indirect effect relative to how large it could possibly be given the variances and correlations between variables in the data): \(\kappa^2=\frac{ab}{Max(ab)}\)

However, these all have problems. I, as a ratio, should be bounded between 0 and 1, but with a negative \(ab\) this is not true. II is very unstable because, given a small \(c'\), the ratio can become very large. Moreover, both I and II vary a lot across samples. III does not work because in some cases \(ab\) can be larger than \(c\); that is, the indirect effect \(ab\) can exist in the absence of a detectable association between \(X\) and \(Y\) (i.e., \(c\)). IV is a good idea, but current implementations contain computational errors, making it unusable in practice. For these reasons, Hayes recommends the following two effect size measures:
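
For concreteness, measures I and II can be computed for the simple Mroz mediation from the earlier estimates (a minimal sketch; these ratios inherit the instabilities just described):

c_total  <- lm.xy$coefficients[2]    # total effect c
c_direct <- lm.xy_m$coefficients[2]  # direct effect c_prime
P_M <- as.numeric(a*b/c_total)       # ratio of indirect to total effect
R_M <- as.numeric(a*b/c_direct)      # ratio of indirect to direct effect
c(P_M = P_M, R_M = R_M)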

  • the partially standardized effect expresses effects relative to the standard deviation of \(Y\), which makes the effects relatable to the variability in the outcome: \(c'_{ps}=\frac{c'}{\hat{\sigma}_Y}\), \(ab_{ps}=\frac{ab}{\hat{\sigma}_Y}\), and \(c_{ps}=\frac{c}{\hat{\sigma}_Y}\). For example, two people who differ by \(1\) unit in calorie intake would differ by \(.02\) standard deviations in weight.
  • the completely standardized effect scales the effects not only with regard to \(Y\) but also with regard to \(X\): \(c'_{cs}=\frac{c'\hat{\sigma}_X}{\hat{\sigma}_Y}=\hat{\sigma}_Xc'_{ps}\), \(ab_{cs}=\frac{ab\hat{\sigma}_X}{\hat{\sigma}_Y}=\hat{\sigma}_Xab_{ps}\), and \(c_{cs}=\frac{c\hat{\sigma}_X}{\hat{\sigma}_Y}=\hat{\sigma}_Xc_{ps}\). This is generally not meaningful for dichotomous predictors. For example, two people who differ by \(1\) standard deviation in calorie intake would differ by \(ab_{cs}\) standard deviations in weight. A short computation of these measures for the simple Mroz mediation follows.
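
As a rough illustration, the partially and completely standardized indirect effects for the simple Mroz mediation can be computed from the a and b estimates obtained in the Sobel calculation above (a minimal sketch; whether these are the most meaningful scalings here is a substantive judgment):

ab   <- as.numeric(a*b)  # indirect effect from the Sobel calculation above
sd_y <- sd(Mroz$hearnw)
sd_x <- sd(Mroz$educwm)
c(ab = ab,
  ab_ps = ab/sd_y,       # partially standardized: in SDs of Y per unit of X
  ab_cs = ab*sd_x/sd_y)  # completely standardized: in SDs of Y per SD of X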