Functions	Tasks
`cov()`	Calculates covariance using two vectors.
`cor.test()`	Calculates correlation coefficient and conduct hypothesis testing.
`lm()`	Performs a linear modeling using ordinary least squares method.
`abline()`	Adds one or more straight lines through the current plot.
`cor()`	Computes correlations with either vectors or a matrix or data.frame.
`tab_model()`	Creates a well-formatted regression outputs into HTML format.
`tab_xtab()`	Creates a well-formatted contigency table as HTML file.
`tab_corr()`	Creates a well-formatted correlation table as HTML file.

0. Before We Get Started: Downloading data, setting the working directory, and reading data

In this lab, we will download two variables from the American Fact Finder, join the two data using a key variable, clean the data, and conduct correlation and bivariate regression. The question we will examine is “Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?”

# Install required packages ----------- Run these if you haven't installed these
install.packages("tidyverse", dependencies = TRUE)
install.packages("stargazer", dependencies = TRUE)
install.packages("sjPlot", dependencies = TRUE)

# Call the packages using library()
library(tidyverse)
library(stargazer)
library(sjPlot)

The data we will use is from the last week, the two data.frames that we joined and cleaned. This data is the joined data.frame on the median household income and the education attainment for census tracts in Clayton, Cobb, DeKalb, and Fulton County.

# USE YOUR OWN working directory and file name
setwd("D:/Dropbox (Personal)/School/Fall2020/CP6025/Labs/Lab6_2020") # Use your own pathname
hinc.edu <- read.csv("data_from_week6_edu_hinc.csv")

# Check the data
head(hinc.edu)

##                 GEO_ID               tract         county   state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia  1393
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia  2134
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia  2452
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia  3520
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia  4072
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia  1959
##   total.error lessHS lessHS.error   HS HS.error someCol someCol.error bachelor
## 1         202    318           93  581      134     338           103      123
## 2         321    306          199  688      236     753           183      265
## 3         241    190           90  877      185     825           184      371
## 4         398   1034          222 1347      325     855           213      235
## 5         389   1059          300 1319      365    1075           266      530
## 6         322    781          243  430      149     567           221      169
##   bachelor.error graduate graduate.error
## 1             53       33             28
## 2             99      122             59
## 3            133      189             70
## 4            136       49             49
## 5            226       89             85
## 6            105       12             18
##                                         NAME.y  hinc hinc.error   p.lessHS
## 1 Census Tract 402.02, Clayton County, Georgia 31524       6665 0.22828428
## 2 Census Tract 402.03, Clayton County, Georgia 36786       5722 0.14339269
## 3 Census Tract 402.04, Clayton County, Georgia 39194       5231 0.07748777
## 4 Census Tract 403.02, Clayton County, Georgia 33190       1973 0.29375000
## 5 Census Tract 403.03, Clayton County, Georgia 37236       7343 0.26006876
## 6 Census Tract 403.06, Clayton County, Georgia 27372      10845 0.39867279
##        p.HS p.someCol p.bachelor  p.graduate hinc.num hinc.num.ord
## 1 0.4170854 0.2426418 0.08829864 0.023689878    31524         poor
## 2 0.3223993 0.3528585 0.12417994 0.057169634    36786         poor
## 3 0.3576672 0.3364600 0.15130506 0.077079935    39194         poor
## 4 0.3826705 0.2428977 0.06676136 0.013920455    33190         poor
## 5 0.3239194 0.2639980 0.13015717 0.021856582    37236         poor
## 6 0.2194997 0.2894334 0.08626850 0.006125574    27372         poor
##   p.graduate.ord
## 1       very low
## 2       very low
## 3            low
## 4       very low
## 5       very low
## 6       very low

1. Covariance and Correlation

Before we examine the data we have, let us briefly go through what correlation is and what it isn’t. The correlation can tell us how intimately two variables are aligned on a line (see the image below). Notice that the correlation coefficient on the second row in the image are all 1 regardless of how steep or gentle the slope of the line is. In other words, the correlation coefficient is NOT about the slope of the line; it is about the spread of data points from the line.

(image source: Wikipedia)

First, we will look at the relationship between percent people with graduate degree in a neighborhood and the median household income of a neighborhood. We anecdotally know that neighborhoods with more educated population tend to be wealthier. In other words, as a variable for the education level increases, we expect to see an increase in a variable for income. Let’s visualize hinc.edu to see if this is true for Georgia too. We can plot one variable on the x-axis and plot the other variable on the y-axis to create a scatterplot.

# Creating a Scatterplot between hinc and p.graduate
plot(hinc.edu$p.graduate,
     hinc.edu$hinc.num,
     main = "Scatterplot between % gradute degree \n and median household income",
     xlab = "Proportion of residents with graduate degrees",
     ylab = "Median household income",
     col = "blue")

The scatterplot shows that there is a clear positive relationship between the two variables. But without running the correlation analysis, we cannot quantify how strong the relationship is and whether the relationship is statistically significant.

The code below calculates covariance and correlation with which we can quantify the relationship between the two variables.

# Covariance
cov(hinc.edu$p.graduate, hinc.edu$hinc.num) # This value by itself is not really useful because it is affected by the unit of the data.

## [1] 3105.944

# Correlation
cor.test(hinc.edu$p.graduate, hinc.edu$hinc.num,
         alternative = "two.sided", # or "less" "greater" for one-tailed test
         method = "pearson",        # or "kendall" "spearman"
         conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  hinc.edu$p.graduate and hinc.edu$hinc.num
## t = 24.844, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6974296 0.7761786
## sample estimates:
##       cor 
## 0.7393218

The correlation coefficient is 0.7393218 The t-value is huge with a large degrees of freedom (and of course a very small p-value). These numbers indicate a statistically significant positive correlation between the two variables.

Compare the scatterplot of p.graduate and hinc.num against the large correlation plot above with multiple correlation coefficients and corresponding scatterplot. The correlation coefficient we got from the data is 0.7393218, which is close to 0.8. Can see the similarity in terms of how spread the points are from the line?

If the correlation coefficient we got from the data was, for example, 0.4, the points in the scatterplot would have been more spread out. Again, correlation coefficient do not tell us anything about the slope! It only tells us whether it is upward/downward slope and how spread out the points are from the line.

A few more examples of correlation tests

We have seen that there is a positive correlation between p.graduate and hinc.num. Let’s examine if there is the opposite relationship between having low educational attainment and the median household income of neighborhoods in the four county. This time, let’s use p.HS, the proportion of residents who are older than 25 and have high school education.

# Creating a Scatterplot between hinc and p.HS
plot(hinc.edu$p.HS,
     hinc.edu$hinc.num,
     main = "Scatterplot between % high school graduate \n and median household income",
     xlab = "Proportion of residents with high school diploma",
     ylab = "Median household income",
     col = "blue")

The plot shows that there is a very clear negative relationship between the two variables. We can easily predict from this plot that the correlation coefficient will be large (to be minimize potential confusions, I’d say the absolute value (or the magnitude) of the coefficient will be large or the coefficient is large and negative or something similar). Let’s run the analysis.

cor.test(hinc.edu$p.HS, hinc.edu$hinc.num,
         alternative = "two.sided", # or "less" "greater" for one-tailed test
         method = "pearson",        # or "kendall" "spearman"
         conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  hinc.edu$p.HS and hinc.edu$hinc.num
## t = -23.285, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7567151 -0.6723812
## sample estimates:
##        cor 
## -0.7171636

Just like we predicted, the correlation coefficient is -0.7171636 and the hypothesis test result is highly statistically significant. They indicate that there is a strong negative correlation between p.HS and hinc.num.

Before you move on, try repeating what we’ve done so far with other educational attainment levels, for example, hinc.edu$someCol. Examine if there a statistically significant correlation, what the scatterplot looks like, and how the plot compares with the large correlation chart (i.e., the first figure of this document).

2. Bivariate Regression Analysis

While the correlation tells us how intimately two variables are related, it does not answer the question of how much change in p.graduate (or any other variable for that matter) is associated with how much change in hinc.num. Framing questions in this way is important for planners and policy makers as we often want to know not only the existence of relationships but also how much change in the median household income we can expect if we make plans and policies to promote higher education, for example.

Warning: I worded the last sentence as if there is a causal relationship between education attainment and median household income at census tract level, but I CANNOT KNOW IF IT IS CAUSAL FROM ONLY THE REGRESSION ANALYSIS. To establish a causal relationship, you need to consider the four criteria you learned earlier. The regression as presented below only provide support for covariance criteria.

The following code performs a bivariate regression using hinc.num as the y-variable (i.e., the dependent variable, the variable that is being explained) and p.graduate as the x-variable (i.e., the independent variable, the variable that is used to explain the dependent variable) and assigns the estimated regression model and other related values in an R object called ols1. Printing summary(ols1) gives you a formatted information of the regression, but the R object ols1 contains many other values not displayed in summary(ols1) as well.

# bivariate linear model 
ols1 <- lm(hinc.num ~ p.graduate, data = hinc.edu)
summary(ols1)

## 
## Call:
## lm(formula = hinc.num ~ p.graduate, data = hinc.edu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85749 -14481  -3105  11704 107182 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28038       1894   14.81   <2e-16 ***
## p.graduate    230422       9275   24.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24390 on 512 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.5457 
## F-statistic: 617.2 on 1 and 512 DF,  p-value: < 2.2e-16

The regression result shows that the increase in the proportion of graduate degree holder from 0% to 100% is associated with the increase in the median household income by 230,422 USD.

The t value and p value (i.e., Pr(>|t|)) for p.graduate suggest that, if the null hypothesis (i.e., that there exists no relationship between the the proportion of graduate degree holders and the median household income of Census Tracts, or in statistical term, that the coefficient is not different from zero) was correct, it is extremely unlikely to see a regression result like this. Because it is so unlikely thing to happen if the null hypothesis was true, we can conclude that we can reject the null hypothesis - why? because in the world where there is no relationship between graduate degree and income, seeing the relationship we just saw in the regression is soooo unlikely.

The regression result also indicates that the coefficient of determination (Multiple R-squared in the summary(ols1)) is 0.5465967. This tells us how much of the variation in the median household income is explained by variation in the proportion of graduate degree holders.

The plot we created earlier for the correlation can be modified to include a regression line.

# Scatterplot with regression line
plot(hinc.edu$p.graduate,
     hinc.edu$hinc.num,
     main = "Scatterplot between % gradute degree and median household income",
     xlab = "Proportion of residents with graduate degrees",
     ylab = "Median household income",
     col = "blue")
abline(ols1)

Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?

The answer from the data is yes. The correlation and bivariate regression strongly suggest that there exists a positive relationship between the two variables. BUT AGAIN, THIS SHOULD NOT BE INTERPRETED AS A CAUSAL RELATIONSHIP YET!

Just for you to check, below are a bivariate regression result using p.HS instead of p.graduate.

# bivariate linear model 
ols2 <- lm(hinc.num ~ p.HS, data = hinc.edu)
summary(ols2)

## 
## Call:
## lm(formula = hinc.num ~ p.HS, data = hinc.edu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94362 -17023  -1660  12225  98158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   117041       2430   48.17   <2e-16 ***
## p.HS         -237988      10221  -23.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25240 on 512 degrees of freedom
## Multiple R-squared:  0.5143, Adjusted R-squared:  0.5134 
## F-statistic: 542.2 on 1 and 512 DF,  p-value: < 2.2e-16

# Scatterplot with regression line
plot(hinc.edu$p.HS,
     hinc.edu$hinc.num,
     main = "Scatterplot between % high school graduate \n and median household income",
     xlab = "Proportion of residents with less-than-high school diploma",
     ylab = "Median household income",
     col = "blue")
abline(ols2)

3. Reporting the results in pretty formats

Although R outputs okay-formatted outputs in the console window, that is not pretty enough to simply screenshot them and include in your, for example, option paper. R has various packages that help us bring the outputs of correlation and/or regression analysis to Word, Excel, etc. in a well-formatted, publication-ready formats.

Some of the most commonly used packages for reporting include stargazer and sjPlot. I personally prefer sjPlot, so that’s what I am going to focus today. Note that these packages offer numerous options for you to customize your tables, and I cannot cover them all here. I provide two links below for you to explore.

stargazer: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf sjPlot: https://strengejacke.github.io/sjPlot/index.html

Basic command

For reporting regression outputs, we can use tab_model() function. It has many many arguments, so be sure to check ?tab_model to see what options you have. You don’t need to know them all - just see what are available and come back to it later when you need it. The output of this function is in HTML format, and the result will be shown in the Viewer panel.

# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
          show.ci = 0.95, # alpha-level for confidence interval
          show.se = TRUE, # Whether to show standard errors of coefficient estimates
          show.stat = TRUE) # Whether to show t value for coefficient estimates

	hinc.num					hinc.num
Predictors	Estimates	std. Error	CI	Statistic	p	Estimates	std. Error	CI	Statistic	p
(Intercept)	28038.11	1893.46	24318.20 – 31758.02	14.81	<0.001	117041.07	2429.86	112267.35 – 121814.80	48.17	<0.001
p.graduate	230421.55	9274.64	212200.53 – 248642.58	24.84	<0.001
p.HS						-237987.56	10220.56	-258066.94 – -217908.17	-23.29	<0.001
Observations	514					514
R² / R² adjusted	0.547 / 0.546					0.514 / 0.513

If the formatted table doesn’t automatically show up in the Viewer panel, you can manual click it to open it. You will see a very nicely formatted table that looks like one of those tables in published papers. Also notice that, when the two (or more) models have same variables, those variables share the same rows to make it easier for us to compare them across models.

Labeling the variable names

Okay, the table looks nice except that the name of the variables are still in a code-like format (e.g., p.HS). Let’s replace that with something meaningful.

# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
          show.ci = 0.95, # alpha-level for confidence interval
          show.se = TRUE, # Whether to show standard errors of coefficient estimates
          show.stat = TRUE, # Whether to show t value for coefficient estimates
    
          pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"), 
          # We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
          
          dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017")) # This label will go to the top of the table

	dependent = Median Household Income 2017					dependent = Median Household Income 2017
Predictors	Estimates	std. Error	CI	Statistic	p	Estimates	std. Error	CI	Statistic	p
(Intercept)	28038.11	1893.46	24318.20 – 31758.02	14.81	<0.001	117041.07	2429.86	112267.35 – 121814.80	48.17	<0.001
% Graduate degree holders	230421.55	9274.64	212200.53 – 248642.58	24.84	<0.001
% High school diploma holders						-237987.56	10220.56	-258066.94 – -217908.17	-23.29	<0.001
Observations	514					514
R² / R² adjusted	0.547 / 0.546					0.514 / 0.513

Looking good! The last thing we need is to save it in a format that is compatible with MS Word, Excel, etc. We can supply file = argument to specify the file name. Don’t forget to put .html at the end of your file name!

tab_model(ols1, ols2, # You can compare multiple models in one table as well.
          show.ci = 0.95, # alpha-level for confidence interval
          show.se = TRUE, # Whether to show standard errors of coefficient estimates
          show.stat = TRUE, # Whether to show t value for coefficient estimates
          
          pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"), 
          # We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
          
          dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017"),
          # This label will go to the top of the table

          file = "ols1_ols2_reg.html") # This is the output file name. This file will be save in your working directory.

	dependent = Median Household Income 2017					dependent = Median Household Income 2017
Predictors	Estimates	std. Error	CI	Statistic	p	Estimates	std. Error	CI	Statistic	p
(Intercept)	28038.11	1893.46	24318.20 – 31758.02	14.81	<0.001	117041.07	2429.86	112267.35 – 121814.80	48.17	<0.001
% Graduate degree holders	230421.55	9274.64	212200.53 – 248642.58	24.84	<0.001
% High school diploma holders						-237987.56	10220.56	-258066.94 – -217908.17	-23.29	<0.001
Observations	514					514
R² / R² adjusted	0.547 / 0.546					0.514 / 0.513

Once saved, you can double click the file to open it in your web browser. You will be able to highlight them either with your mouse or by pressing control+a (or ⌘+a if you are using a mac) and copy-paste it into Excel or Word.

Other tables you can explore

That was formatting a regression result using sjPlot. The package also provide functions for correlation table, contingency table, etc. I show some examples below.

Note that cor() is a function that can calculate correlation coefficient for multiple variables in a table format. But this function does not give us the hypothesis test result, so that is the difference between cor() and cor.test().

# Correlation table
mycor <- cor( hinc.edu[, c("p.lessHS", "p.HS", "p.bachelor", "p.graduate", "hinc.num")] ) 
tab_corr(mycor, triangle = "lower", show.p = T)

	p.lessHS	p.HS	p.bachelor	p.graduate
p.lessHS
p.HS	0.497
p.bachelor	-0.700	-0.886
p.graduate	-0.652	-0.846	0.801
hinc.num	-0.601	-0.717	0.768	0.739
Computed correlation used pearson-method with listwise-deletion.

# Contigency table
tab_xtab(hinc.edu$county, hinc.edu$hinc.num.ord, # two variable with which to build the contingency table
         var.labels = c("County", "Median Household Income - Ordinal")) # group names

County	Median Household Income - Ordinal				Total
County	lower-middle class	poor	upper-middle class	wealthy	Total
Clayton County	21	20	8	0	49
Cobb County	22	11	43	44	120
DeKalb County	51	28	38	26	143
Fulton County	34	70	39	59	202
Total	128	129	128	129	514
χ²=79.087 · df=9 · Cramer’s V=0.226 · p=0.000

Week 7 - Correlation and Bivariate Regression

Bonwoo Koo

10/02/2020