Lecture: Correlation and Regression

Slides for Week 4 are available here.

In this tutorial, you will learn how to:

  • run a correlation test
  • visualize correlations
  • create a smaller dataset
  • compute a simple linear regression

Basic Set Up: Install and load all necessary libraries

Tips: If you need to install a new package, remove the # first so the install.packages() line runs. Install a package only once, then put the # back in front of that line to prevent re-running install.packages() when knitting. You will also need to load each library with library() every session before you can use its functions.

options(scipen=999, digits = 2) # turn off scientific notation; print numbers to 2 significant digits
# install.packages("ggcorrplot")
library(ggcorrplot) # for function ggcorrplot()
library(ggplot2)
library(ggthemes)

Step 1: Import the Dataset

smoking_data <- read.csv("birthweight_smoking.csv")

smoking_data1 <- smoking_data 

# smoking_data1 is now the spare copy in case we "mess up" smoking_data!
# smoking_data <- smoking_data1 

Since we will explore correlation this week, let’s keep the data structure as it is for now: the correlation function cannot work on factor variables, so the binary variables stay as 0/1 numeric columns.
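
To see why this matters, here is a minimal sketch (kept as comments so it will not run when knitting; smoker_factor is a hypothetical name and the error message is paraphrased):

# smoker_factor <- factor(smoking_data$smoker)   # hypothetical factor version
# cor(smoking_data$birthweight, smoker_factor)   # fails: cor() requires numeric input
# as.numeric(as.character(smoker_factor))        # one way to convert back to numeric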

summary(smoking_data)
##     nprevist     alcohol        tripre1       tripre2        tripre3    
##  Min.   : 0   Min.   :0.00   Min.   :0.0   Min.   :0.00   Min.   :0.00  
##  1st Qu.: 9   1st Qu.:0.00   1st Qu.:1.0   1st Qu.:0.00   1st Qu.:0.00  
##  Median :12   Median :0.00   Median :1.0   Median :0.00   Median :0.00  
##  Mean   :11   Mean   :0.02   Mean   :0.8   Mean   :0.15   Mean   :0.03  
##  3rd Qu.:13   3rd Qu.:0.00   3rd Qu.:1.0   3rd Qu.:0.00   3rd Qu.:0.00  
##  Max.   :35   Max.   :1.00   Max.   :1.0   Max.   :1.00   Max.   :1.00  
##     tripre0      birthweight       smoker       unmarried         educ     
##  Min.   :0.00   Min.   : 425   Min.   :0.00   Min.   :0.00   Min.   : 0.0  
##  1st Qu.:0.00   1st Qu.:3062   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:12.0  
##  Median :0.00   Median :3420   Median :0.00   Median :0.00   Median :12.0  
##  Mean   :0.01   Mean   :3383   Mean   :0.19   Mean   :0.23   Mean   :12.9  
##  3rd Qu.:0.00   3rd Qu.:3750   3rd Qu.:0.00   3rd Qu.:0.00   3rd Qu.:14.0  
##  Max.   :1.00   Max.   :5755   Max.   :1.00   Max.   :1.00   Max.   :17.0  
##       age         drinks    
##  Min.   :14   Min.   : 0.0  
##  1st Qu.:23   1st Qu.: 0.0  
##  Median :27   Median : 0.0  
##  Mean   :27   Mean   : 0.1  
##  3rd Qu.:31   3rd Qu.: 0.0  
##  Max.   :44   Max.   :21.0
str(smoking_data)
## 'data.frame':    3000 obs. of  12 variables:
##  $ nprevist   : int  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: int  4253 3459 2920 2600 3742 3420 2325 4536 2850 2948 ...
##  $ smoker     : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : int  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : int  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : int  0 0 0 0 0 0 0 0 0 0 ...

Step 2: Correlation Analysis

A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related. Pearson is the default method (you will get the same result without specifying it). Specify other methods, such as "kendall" or "spearman", when appropriate.

cor()

Check the Pearson correlation between two continuous variables.

cor(smoking_data$birthweight, smoking_data$educ)
## [1] 0.11
cor(smoking_data$birthweight, smoking_data$drinks)
## [1] -0.032
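
If a rank-based measure is more appropriate, the same call works with a different method argument (a quick sketch; output not shown here):

cor(smoking_data$birthweight, smoking_data$educ, method = "spearman")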

cor.test()

Besides the function cor(), we can use cor.test(). The difference between the two is that cor.test() also outputs a p-value and a 95% confidence interval for the correlation. An \(\alpha\) of 0.05 means we accept a 5% risk of concluding that a correlation exists when, in fact, no correlation exists.

cor.test(smoking_data$birthweight, smoking_data$educ, method = "pearson") 
## 
##  Pearson's product-moment correlation
## 
## data:  smoking_data$birthweight and smoking_data$educ
## t = 6, df = 2998, p-value = 0.000000008
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07 0.14
## sample estimates:
##  cor 
## 0.11
# What is the p-value? 
# What can you say about the correlation between birthweight and educ?
# Read on the R documentation for more elaborations --> help("cor.test") 

The test statistics show that the correlation coefficient between birthweight and educational attainment (r = 0.11) is statistically significant at the 0.05 level, with a p-value of 0.000000008 (lower than 0.05). We reject the null hypothesis that birthweight and educational attainment are uncorrelated, though the correlation is very weak.
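
If you want to reuse pieces of the output rather than read them off the console, cor.test() returns a list whose components you can extract; a minimal sketch:

res <- cor.test(smoking_data$birthweight, smoking_data$educ) # store the test object
res$estimate # correlation coefficient r
res$p.value  # p-value
res$conf.int # 95% confidence interval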

Create a smaller data frame

First, let’s create a new data frame containing just the variables we will use. We can quickly subset the original data frame by selecting columns, specifying the column names like this: data[, c("variable1", "variable2")].

dataMini <- smoking_data[,c("birthweight","nprevist","age","educ","drinks","smoker","alcohol")] 
# subset data frame by shortlisting named columns
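
An equivalent base-R alternative uses subset() with the select argument (dataMini2 is just an illustrative name):

dataMini2 <- subset(smoking_data,
                    select = c(birthweight, nprevist, age, educ,
                               drinks, smoker, alcohol))
# same content as dataMini above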

Generate a Correlation Matrix

Now check the Pearson correlations among all the variables at once: cor() applied to a data frame returns the full pairwise correlation matrix. Because the correlation function cannot analyze factors, we did not convert the categorical variables (such as smoker and alcohol) to factors here.

# basic correlation matrix
cor(dataMini)
##             birthweight nprevist    age    educ drinks smoker alcohol
## birthweight       1.000    0.227  0.080  0.1052 -0.032 -0.169 -0.0336
## nprevist          0.227    1.000  0.147  0.2102 -0.061 -0.109 -0.0425
## age               0.080    0.147  1.000  0.4457  0.026 -0.144  0.0381
## educ              0.105    0.210  0.446  1.0000 -0.013 -0.233  0.0049
## drinks           -0.032   -0.061  0.026 -0.0129  1.000  0.092  0.6041
## smoker           -0.169   -0.109 -0.144 -0.2326  0.092  1.000  0.1209
## alcohol          -0.034   -0.043  0.038  0.0049  0.604  0.121  1.0000
# round the matrix to 1 decimal place, save the matrix object
cor_df <- round(cor(dataMini),1)

# export to csv
write.csv(cor_df, "correlation.csv")

Visualize Correlation: ggcorrplot()

# default with labels of correlation coefficient (lab = TRUE)
ggcorrplot(cor_df, lab = TRUE)

# method = "circle"
ggcorrplot(cor_df, method = "circle", hc.order = TRUE)

# hc.order = TRUE: reorder variables by hierarchical clustering, grouping similar variables together
ggcorrplot(cor_df, 
           lab = TRUE, 
           hc.order = TRUE)

# Get the lower triangle, turn insignificant coefficients blank
ggcorrplot(cor_df, 
           lab = TRUE, 
           hc.order = TRUE, 
           type = "lower",
           insig = "blank")

# change the themes and color palettes
ggcorrplot(cor_df, lab = TRUE, hc.order = TRUE, type = "lower",
   outline.col = "white",
   ggtheme = ggplot2::theme_minimal(),
   colors = c("#6D9EC1", "white", "#E46726"))

# which IV has the largest correlation coefficient (r)?
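
Since ggcorrplot() returns a regular ggplot object, you can save any of these plots with ggsave(); a sketch (the file name is just an example):

p <- ggcorrplot(cor_df, lab = TRUE, hc.order = TRUE, type = "lower")
ggsave("correlation_heatmap.png", plot = p, width = 6, height = 5)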

Plot Pairwise Comparison Scatterplots

Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. A scatter plot shows how two variables vary together, and the strength and direction of that relationship is their correlation (note that correlation alone does not tell us that one variable causes the other).

pairs(dataMini, panel = panel.smooth) # Do you like this plot better?
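
If the full matrix is too crowded, pairs() also accepts a formula, so you can focus on a few variables; a sketch:

pairs(~ birthweight + nprevist + educ, data = dataMini,
      panel = panel.smooth)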


Step 3: Linear Regression Analysis

Let the dependent variable \(y_i\) be the birthweight and the independent variable \(x_i\) be nprevist (\(i = 1, \dots, n=3000\)).

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

  • Parameters of interest
    • intercept: \(\beta_0\)
    • slope: \(\beta_1\)
    • variance: \(\sigma^2\)

OLS Estimates

  • The OLS estimators \(\hat\beta_0\) and \(\hat\beta_1\) are obtained by minimizing sum of squared errors:

\[\min_{b_0,\,b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^{2}\]
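
For simple regression this minimization has a closed-form solution, \(\hat\beta_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\), which you can verify by hand (a sketch; it should reproduce the lm() coefficients below):

x <- dataMini$nprevist
y <- dataMini$birthweight
b1 <- cov(x, y) / var(x)      # slope estimate
b0 <- mean(y) - b1 * mean(x)  # intercept estimate
c(intercept = b0, slope = b1) # compare with coef(lm(birthweight ~ nprevist))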

The function lm() will be used extensively; it solves the minimization problem we defined above.

The basic output of lm() contains two elements: the Call and the Coefficients. The former tells you which regression you estimated; the latter contains the regression coefficients. In this case there are two coefficients: the intercept and the regression weight of our sole predictor. In general, lm(DV ~ IV1 + IV2 + ...) is used for fitting linear models.

General rule for regression: When we regress Y on X (y ~ x), we use the values of explanatory variable X to predict those of response variable Y.

Let’s start with a simple linear regression. Regress birthweight (y) on nprevist (x) using the small dataset.

model1 <- lm(birthweight ~ nprevist, data = dataMini)

Examining a Fit

Let us look at the results of the fit by printing a summary of the model object. The output includes the model formula, the coefficient estimates with their standard errors, the residual standard error, and the multiple and adjusted R-squared.

What is the coefficient? The y-intercept? Is the coefficient statistically significant?

summary(model1)
## 
## Call:
## lm(formula = birthweight ~ nprevist, data = dataMini)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2672.6  -302.7    27.4   367.4  2225.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2979.93      33.24    89.7 <0.0000000000000002 ***
## nprevist       36.66       2.87    12.8 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 577 on 2998 degrees of freedom
## Multiple R-squared:  0.0517, Adjusted R-squared:  0.0514 
## F-statistic:  163 on 1 and 2998 DF,  p-value: <0.0000000000000002

The Intercept

What is the average infant birthweight when nprevist is equal to zero?

The intercept is 2979.93 grams, meaning that when nprevist is 0 (no prenatal visits), our model predicts an average infant birthweight of 2979.93 grams.
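
You can confirm this with predict(); a quick sketch using our fitted model:

predict(model1, newdata = data.frame(nprevist = 0)) # predicted birthweight at zero visits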

The Beta Coefficient

The value of \(\beta\) is the amount by which the dependent variable changes when the independent variable increases by one unit. So, \(\beta\) = 36.66 for nprevist means that the dependent variable infants’ birthweight (y) goes up by 36.66 grams on average when the number of prenatal visits (x) goes up by 1 unit. The p-value is less than 0.05, suggesting that the coefficient for nprevist is statistically significant at the 0.05 level.
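
The slope can be pulled out of the model directly, and the one-unit interpretation can be checked by comparing predictions one visit apart (a sketch):

coef(model1)["nprevist"] # slope: grams per additional prenatal visit
# the difference in predicted birthweight between 10 and 11 visits equals the slope:
diff(predict(model1, newdata = data.frame(nprevist = c(10, 11))))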

Estimate Confidence Intervals

The interval covers the true value of \(\beta_i\) with 95% confidence: in 95% of all samples that could be drawn, a confidence interval computed this way would cover the true value of \(\beta_i\).

# Computes confidence intervals for the estimated coefficients in a fitted model.
confint(model1, level=0.95) 
##             2.5 % 97.5 %
## (Intercept)  2915   3045
## nprevist       31     42
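
These intervals can be reproduced by hand from the summary output as the estimate plus or minus the critical t value times the standard error (a sketch for the nprevist slope):

est   <- coef(summary(model1))["nprevist", "Estimate"]
se    <- coef(summary(model1))["nprevist", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(model1)) # two-sided 95% critical value
c(lower = est - tcrit * se, upper = est + tcrit * se)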