good_1_ <- read.csv("good.csv") # for illustration
good <- good_1_ # rename the data frame
Install the packages that will be used for this tutorial.
install.packages("ppcor")
install.packages("ggplot2")
install.packages("ggthemes")
install.packages("data.table")
A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related.
Assumptions for the Pearson r correlation:
Variables should be continuous.
Both variables should be approximately normally distributed (variables have a bell-shaped curve).
The two variables have a linear relationship, which appears as a straight line on a scatterplot.
The data is homoscedastic, which means the variance around the best fit line is the same for all values of the predictor variable (X).
"pearson" is the default method (you will get the same result if you omit the method argument). Specify another method when appropriate, such as "kendall" or "spearman".
cor(good$read97, good$read02, use="complete.obs", method="pearson")
## [1] 0.6988714
# In the cor() function, if use = "complete.obs", missing values are handled by casewise deletion.
Besides cor(), we can use cor.test(). The difference is that cor.test() also reports a p-value and a 95% confidence interval for the correlation.
cor.test(good$read97, good$read02, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: good$read97 and good$read02
## t = 33.436, df = 1171, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6683699 0.7270266
## sample estimates:
## cor
## 0.6988714
# What is the p-value? What can you say about the correlation between read97 and read02?
# Read the R documentation for more details --> help("cor.test")
First, let's create a new data frame with the variables we will use. There are at least two ways to do this. First, we can use the cbind() function, which combines R objects by columns (rbind() combines them by rows).
goodMini <- cbind(good$read02, good$read97, good$AFDCpreg) # append columns
Let's look at the new object created by cbind() using View(). Note that "View" needs to be capitalized.
View(goodMini)
Can you see the column names? Not really! Let us rename the column names as follows.
colnames(goodMini, do.NULL = FALSE) # check current column names
## [1] "col1" "col2" "col3"
colnames(goodMini) <- c("read02","read97","AFDCpreg") # rename column names
View(goodMini)
# Examine the class or type of an object using class()
class(goodMini)
## [1] "matrix"
# Convert matrix to a data frame for running regression later on.
library(data.table)
goodMini <- data.table(goodMini)
Alternatively, we can subset our original data frame by selecting the named columns directly, which is quicker!
goodMini <- good[,c("read02","read97","AFDCpreg")] # subset data frame with named columns
In R, missing values are represented by the symbol NA (not available), and in many cases NA prevents us from running statistical analyses smoothly. Hence, it is good practice to always "clean up" your data first.
The na.omit() function returns the object with any rows that contain NA values removed.
goodMiniClean <- na.omit(goodMini)
What can you tell from the two summary results?
summary(goodMini)
## read02 read97 AFDCpreg
## Min. : 7.00 Min. : 2.00 Min. :0.0000
## 1st Qu.: 57.00 1st Qu.:45.00 1st Qu.:0.0000
## Median : 69.00 Median :61.00 Median :0.0000
## Mean : 66.46 Mean :56.29 Mean :0.1592
## 3rd Qu.: 79.00 3rd Qu.:71.00 3rd Qu.:0.0000
## Max. :100.00 Max. :97.00 Max. :1.0000
## NA's :1026 NA's :2041 NA's :246
summary(goodMiniClean)
## read02 read97 AFDCpreg
## Min. : 16.00 Min. : 2.00 Min. :0.0000
## 1st Qu.: 70.00 1st Qu.:44.75 1st Qu.:0.0000
## Median : 78.00 Median :60.00 Median :0.0000
## Mean : 76.03 Mean :55.89 Mean :0.1461
## 3rd Qu.: 84.00 3rd Qu.:71.00 3rd Qu.:0.0000
## Max. :100.00 Max. :97.00 Max. :1.0000
A partial correlation is a measure of the strength and direction of the linear relationship between two continuous variables, while controlling for the effect of one or more other continuous variables. Like the Pearson correlation, it applies to continuous variables only.
library(ppcor)
## Warning: package 'ppcor' was built under R version 3.5.2
## Loading required package: MASS
pcor(goodMiniClean, method = c("pearson"))
## $estimate
## read02 read97 AFDCpreg
## read02 1.0000000 0.690630225 -0.178061139
## read97 0.6906302 1.000000000 0.007543908
## AFDCpreg -0.1780611 0.007543908 1.000000000
##
## $p.value
## read02 read97 AFDCpreg
## read02 0.000000e+00 4.611467e-152 4.729960e-09
## read97 4.611467e-152 0.000000e+00 8.055768e-01
## AFDCpreg 4.729960e-09 8.055768e-01 0.000000e+00
##
## $statistic
## read02 read97 AFDCpreg
## read02 0.000000 31.1642823 -5.9052771
## read97 31.164282 0.0000000 0.2461975
## AFDCpreg -5.905277 0.2461975 0.0000000
##
## $n
## [1] 1068
##
## $gp
## [1] 1
##
## $method
## [1] "pearson"
# If the variables are not all numeric, you will see an error as follows:
# "Error in pcor(goodMiniClean, method = c("pearson")) : 'x' must be numeric"
What can you interpret from the results?
Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. A scatter plot shows how two variables relate to one another; the strength and direction of that relationship is their correlation.
Pairwise Comparison Scatterplot
pairs(goodMiniClean, panel = panel.smooth) # Do you like this plot better?
Bivariate Scatterplot of Read 97 against Read 02
good$childnum <- as.factor(good$childnum)
plot(good$read97, good$read02, main="Scatterplot Example",
xlab="Reading Scores '97 ", ylab="Reading Scores '02 ", pch=10)Let the response \(y_i\) be the read02 and the explanatory variable \(x_i\) be read97 (\(i = 1, \dots, n=3563\)).
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
To be specific,
\[\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_i\]
\[\textbf{Var}(y_i | x_i) = \sigma^2\]
\[\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\] or
\[y_i \overset{iid}{\sim} \mathcal{N}( \beta_0 + \beta_1 x_i, \sigma^2)\]
The least squares estimates \(b_0\) and \(b_1\) solve the minimization problem
\[\min_{b_0,\,b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^{2}\]
The function lm() will be used extensively. This function solves the minimization problem defined above.
The basic output of lm() contains two elements: the Call and the Coefficients. The Call restates the regression model you estimated. The Coefficients are the estimated regression coefficients; in this case there are two, the intercept and the regression weight of our sole predictor. The general form lm(DV ~ IV1 + IV2 + ...) is used for fitting linear models.
General rule for regression: when we regress Y on X, we use the values of the explanatory variable X to predict those of the response variable Y. Let's regress reading score in 2002 (Y) on reading score in 1997 (X).
We will use the clean data, which has no missing values, for computing the linear regression.
reg <- lm(goodMiniClean$read02 ~ goodMiniClean$read97)
Let us look at the results of the fit by printing a summary of the model object. The output includes the model formula, the coefficient estimates with their standard errors, the residual standard error, and the multiple and adjusted R-squared.
What is the coefficient? The y-intercept? Is the coefficient statistically significant?
summary(reg)
##
## Call:
## lm(formula = goodMiniClean$read02 ~ goodMiniClean$read97)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.866 -4.265 0.523 4.722 39.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.65799 0.74020 72.49 <2e-16 ***
## goodMiniClean$read97 0.40029 0.01249 32.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.041 on 1066 degrees of freedom
## Multiple R-squared: 0.4907, Adjusted R-squared: 0.4902
## F-statistic: 1027 on 1 and 1066 DF, p-value: < 2.2e-16
# Computes confidence intervals for the estimated coefficients in a fitted model.
confint(reg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 52.2055813 55.1103949
## goodMiniClean$read97 0.3757778 0.4247941
Before using the model, we need to check the model assumptions in the following steps:
Check linearity first; if linearity is satisfied, then
Check homoscedasticity; if homoscedasticity is satisfied, then
Check normality.
We plot the residuals against the fitted values to
check linearity, by verifying that the residuals scatter symmetrically around the horizontal line \(h=0\) with no systematic curvature;
check homoscedasticity, by verifying that the residuals fall within a band of roughly constant width (the variability is approximately the same across the range of fitted values).
plot(reg$fitted, reg$residuals,
pch = 16,
main = "residual plot")
abline(h=0, lwd=4, col="red")
We look at the Q-Q plot of the residuals to check normality.
qqnorm(reg$residuals)
qqline(reg$residuals, lwd=4, col="blue")
Create a column of predicted values of Y (Y') in the dataset.
goodMiniClean$predicted <- fitted(reg)
Create a column of residuals in the dataset.
goodMiniClean$residuals <- residuals(reg)
Get standardized regression coefficients (dividing x and y by their standard deviations).
Also called beta coefficients or beta weights, standardized coefficients are obtained by rescaling the dependent and independent variables so that each has variance equal to 1. Hence, a standardized coefficient tells you how many standard deviations the dependent variable is expected to change per one standard deviation increase in the predictor variable.
regStandard <- lm(scale(good$read02)~ scale(good$read97))
summary(regStandard)
##
## Call:
## lm(formula = scale(good$read02) ~ scale(good$read97))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.43302 -0.23944 0.02685 0.27489 2.35036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55556 0.01390 39.96 <2e-16 ***
## scale(good$read97) 0.46307 0.01385 33.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4761 on 1171 degrees of freedom
## (2390 observations deleted due to missingness)
## Multiple R-squared: 0.4884, Adjusted R-squared: 0.488
## F-statistic: 1118 on 1 and 1171 DF, p-value: < 2.2e-16
This will become more useful as we add more independent variables to our regression.
The ggplot2 package produces a scatter plot between two variables with more flexibility, making it easy to add a regression line as well as other nice features.
Base R
plot(goodMiniClean$read97, goodMiniClean$read02,
main="Scatterplot Example with LS line added",
xlab="Reading Scores '97 ",
ylab="Reading Scores '02 ",
pch=10)
abline(reg, col="red", lwd=4) # many other ways.
abline(h=mean(goodMiniClean$read02), lwd=4, col="blue") # add a horizontal line, y=mean(y)
ggplot
library(ggplot2)
library(ggthemes)
plot <- ggplot(goodMiniClean, aes(x=read97, y=read02)) + geom_point() +
  stat_smooth(method="lm", se=FALSE)
# can add titles, subtitles and a caption in ggplot
plot + labs(title = "Scatterplot of Reading Scores in 1997 & Reading Scores in 2002",
subtitle = "Lab 2") + labs(caption = "(based on data from good package)")# can change smooth line color # Smoothed conditional means
plot + labs(title = "Scatterplot of Reading Scores in 1997 & Reading Scores in 2002",
subtitle = "Lab 2") + labs(caption = "(based on data from good package)") +
geom_smooth(method="lm", colour="red", size=0.5, se = TRUE)# can add themes
plot + labs(title = "Scatterplot of Reading Scores in 1997 & Reading Scores in 2002",
subtitle = "Lab 2") + labs(caption = "(based on data from good package)") +
geom_point(color='lightblue4') +
geom_smooth(method="lm", colour="black", size=0.5) +
theme_economist(base_size = 10, base_family = "sans") # Other themes e.g., theme_grey, theme_bw, theme_linedraw, theme_light, theme_dark, theme_minimal or theme_classic
# customize your theme
# https://www.r-bloggers.com/custom-themes-in-ggplot2/
Dataset: "statekffdata" dataset
Documentation: “KFF_State_Codebook” on Canvas