| Functions | Tasks |
|---|---|
cov() |
Calculates covariance using two vectors. |
cor.test() |
Calculates correlation coefficient and conduct hypothesis testing. |
lm() |
Performs a linear modeling using ordinary least squares method. |
abline() |
Adds one or more straight lines through the current plot. |
cor() |
Computes correlations with either vectors or a matrix or data.frame. |
tab_model() |
Creates a well-formatted regression outputs into HTML format. |
tab_xtab() |
Creates a well-formatted contigency table as HTML file. |
tab_corr() |
Creates a well-formatted correlation table as HTML file. |
In this lab, we will download two variables from the American Fact Finder, join the two data using a key variable, clean the data, and conduct correlation and bivariate regression. The question we will examine is “Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?”
# Install required packages ----------- Run these if you haven't installed these
install.packages("tidyverse", dependencies = TRUE)
install.packages("stargazer", dependencies = TRUE)
install.packages("sjPlot", dependencies = TRUE)
# Call the packages using library()
library(tidyverse)
library(stargazer)
library(sjPlot)
The data we will use is from the last week, the two data.frames that we joined and cleaned. This data is the joined data.frame on the median household income and the education attainment for census tracts in Clayton, Cobb, DeKalb, and Fulton County.
# USE YOUR OWN working directory and file name
setwd("D:/Dropbox (Personal)/School/Fall2020/CP6025/Labs/Lab6_2020") # Use your own pathname
hinc.edu <- read.csv("data_from_week6_edu_hinc.csv")
# Check the data
head(hinc.edu)
## GEO_ID tract county state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 1393
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 2134
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 2452
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 3520
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 4072
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 1959
## total.error lessHS lessHS.error HS HS.error someCol someCol.error bachelor
## 1 202 318 93 581 134 338 103 123
## 2 321 306 199 688 236 753 183 265
## 3 241 190 90 877 185 825 184 371
## 4 398 1034 222 1347 325 855 213 235
## 5 389 1059 300 1319 365 1075 266 530
## 6 322 781 243 430 149 567 221 169
## bachelor.error graduate graduate.error
## 1 53 33 28
## 2 99 122 59
## 3 133 189 70
## 4 136 49 49
## 5 226 89 85
## 6 105 12 18
## NAME.y hinc hinc.error p.lessHS
## 1 Census Tract 402.02, Clayton County, Georgia 31524 6665 0.22828428
## 2 Census Tract 402.03, Clayton County, Georgia 36786 5722 0.14339269
## 3 Census Tract 402.04, Clayton County, Georgia 39194 5231 0.07748777
## 4 Census Tract 403.02, Clayton County, Georgia 33190 1973 0.29375000
## 5 Census Tract 403.03, Clayton County, Georgia 37236 7343 0.26006876
## 6 Census Tract 403.06, Clayton County, Georgia 27372 10845 0.39867279
## p.HS p.someCol p.bachelor p.graduate hinc.num hinc.num.ord
## 1 0.4170854 0.2426418 0.08829864 0.023689878 31524 poor
## 2 0.3223993 0.3528585 0.12417994 0.057169634 36786 poor
## 3 0.3576672 0.3364600 0.15130506 0.077079935 39194 poor
## 4 0.3826705 0.2428977 0.06676136 0.013920455 33190 poor
## 5 0.3239194 0.2639980 0.13015717 0.021856582 37236 poor
## 6 0.2194997 0.2894334 0.08626850 0.006125574 27372 poor
## p.graduate.ord
## 1 very low
## 2 very low
## 3 low
## 4 very low
## 5 very low
## 6 very low
Before we examine the data we have, let us briefly go through what correlation is and what it isn’t. The correlation can tell us how intimately two variables are aligned on a line (see the image below). Notice that the correlation coefficient on the second row in the image are all 1 regardless of how steep or gentle the slope of the line is. In other words, the correlation coefficient is NOT about the slope of the line; it is about the spread of data points from the line.
(image source: Wikipedia)
First, we will look at the relationship between percent people with graduate degree in a neighborhood and the median household income of a neighborhood. We anecdotally know that neighborhoods with more educated population tend to be wealthier. In other words, as a variable for the education level increases, we expect to see an increase in a variable for income. Let’s visualize hinc.edu to see if this is true for Georgia too. We can plot one variable on the x-axis and plot the other variable on the y-axis to create a scatterplot.
# Creating a Scatterplot between hinc and p.graduate
plot(hinc.edu$p.graduate,
hinc.edu$hinc.num,
main = "Scatterplot between % gradute degree \n and median household income",
xlab = "Proportion of residents with graduate degrees",
ylab = "Median household income",
col = "blue")
The scatterplot shows that there is a clear positive relationship between the two variables. But without running the correlation analysis, we cannot quantify how strong the relationship is and whether the relationship is statistically significant.
The code below calculates covariance and correlation with which we can quantify the relationship between the two variables.
# Covariance
cov(hinc.edu$p.graduate, hinc.edu$hinc.num) # This value by itself is not really useful because it is affected by the unit of the data.
## [1] 3105.944
# Correlation
cor.test(hinc.edu$p.graduate, hinc.edu$hinc.num,
alternative = "two.sided", # or "less" "greater" for one-tailed test
method = "pearson", # or "kendall" "spearman"
conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: hinc.edu$p.graduate and hinc.edu$hinc.num
## t = 24.844, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6974296 0.7761786
## sample estimates:
## cor
## 0.7393218
The correlation coefficient is 0.7393218 The t-value is huge with a large degrees of freedom (and of course a very small p-value). These numbers indicate a statistically significant positive correlation between the two variables.
Compare the scatterplot of p.graduate and hinc.num against the large correlation plot above with multiple correlation coefficients and corresponding scatterplot. The correlation coefficient we got from the data is 0.7393218, which is close to 0.8. Can see the similarity in terms of how spread the points are from the line?
If the correlation coefficient we got from the data was, for example, 0.4, the points in the scatterplot would have been more spread out. Again, correlation coefficient do not tell us anything about the slope! It only tells us whether it is upward/downward slope and how spread out the points are from the line.
We have seen that there is a positive correlation between p.graduate and hinc.num. Let’s examine if there is the opposite relationship between having low educational attainment and the median household income of neighborhoods in the four county. This time, let’s use p.HS, the proportion of residents who are older than 25 and have high school education.
# Creating a Scatterplot between hinc and p.HS
plot(hinc.edu$p.HS,
hinc.edu$hinc.num,
main = "Scatterplot between % high school graduate \n and median household income",
xlab = "Proportion of residents with high school diploma",
ylab = "Median household income",
col = "blue")
The plot shows that there is a very clear negative relationship between the two variables. We can easily predict from this plot that the correlation coefficient will be large (to be minimize potential confusions, I’d say the absolute value (or the magnitude) of the coefficient will be large or the coefficient is large and negative or something similar). Let’s run the analysis.
cor.test(hinc.edu$p.HS, hinc.edu$hinc.num,
alternative = "two.sided", # or "less" "greater" for one-tailed test
method = "pearson", # or "kendall" "spearman"
conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: hinc.edu$p.HS and hinc.edu$hinc.num
## t = -23.285, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7567151 -0.6723812
## sample estimates:
## cor
## -0.7171636
Just like we predicted, the correlation coefficient is -0.7171636 and the hypothesis test result is highly statistically significant. They indicate that there is a strong negative correlation between p.HS and hinc.num.
Before you move on, try repeating what we’ve done so far with other educational attainment levels, for example, hinc.edu$someCol. Examine if there a statistically significant correlation, what the scatterplot looks like, and how the plot compares with the large correlation chart (i.e., the first figure of this document).
While the correlation tells us how intimately two variables are related, it does not answer the question of how much change in
p.graduate(or any other variable for that matter) is associated with how much change inhinc.num. Framing questions in this way is important for planners and policy makers as we often want to know not only the existence of relationships but also how much change in the median household income we can expect if we make plans and policies to promote higher education, for example.
Warning: I worded the last sentence as if there is a causal relationship between education attainment and median household income at census tract level, but I CANNOT KNOW IF IT IS CAUSAL FROM ONLY THE REGRESSION ANALYSIS. To establish a causal relationship, you need to consider the four criteria you learned earlier. The regression as presented below only provide support for covariance criteria.
The following code performs a bivariate regression using hinc.num as the y-variable (i.e., the dependent variable, the variable that is being explained) and p.graduate as the x-variable (i.e., the independent variable, the variable that is used to explain the dependent variable) and assigns the estimated regression model and other related values in an R object called ols1. Printing summary(ols1) gives you a formatted information of the regression, but the R object ols1 contains many other values not displayed in summary(ols1) as well.
# bivariate linear model
ols1 <- lm(hinc.num ~ p.graduate, data = hinc.edu)
summary(ols1)
##
## Call:
## lm(formula = hinc.num ~ p.graduate, data = hinc.edu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85749 -14481 -3105 11704 107182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28038 1894 14.81 <2e-16 ***
## p.graduate 230422 9275 24.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24390 on 512 degrees of freedom
## Multiple R-squared: 0.5466, Adjusted R-squared: 0.5457
## F-statistic: 617.2 on 1 and 512 DF, p-value: < 2.2e-16
The regression result shows that the increase in the proportion of graduate degree holder from 0% to 100% is associated with the increase in the median household income by 230,422 USD.
The t value and p value (i.e., Pr(>|t|)) for p.graduate suggest that, if the null hypothesis (i.e., that there exists no relationship between the the proportion of graduate degree holders and the median household income of Census Tracts, or in statistical term, that the coefficient is not different from zero) was correct, it is extremely unlikely to see a regression result like this. Because it is so unlikely thing to happen if the null hypothesis was true, we can conclude that we can reject the null hypothesis - why? because in the world where there is no relationship between graduate degree and income, seeing the relationship we just saw in the regression is soooo unlikely.
The regression result also indicates that the coefficient of determination (Multiple R-squared in the summary(ols1)) is 0.5465967. This tells us how much of the variation in the median household income is explained by variation in the proportion of graduate degree holders.
The plot we created earlier for the correlation can be modified to include a regression line.
# Scatterplot with regression line
plot(hinc.edu$p.graduate,
hinc.edu$hinc.num,
main = "Scatterplot between % gradute degree and median household income",
xlab = "Proportion of residents with graduate degrees",
ylab = "Median household income",
col = "blue")
abline(ols1)
The answer from the data is yes. The correlation and bivariate regression strongly suggest that there exists a positive relationship between the two variables. BUT AGAIN, THIS SHOULD NOT BE INTERPRETED AS A CAUSAL RELATIONSHIP YET!
Just for you to check, below are a bivariate regression result using p.HS instead of p.graduate.
# bivariate linear model
ols2 <- lm(hinc.num ~ p.HS, data = hinc.edu)
summary(ols2)
##
## Call:
## lm(formula = hinc.num ~ p.HS, data = hinc.edu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94362 -17023 -1660 12225 98158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117041 2430 48.17 <2e-16 ***
## p.HS -237988 10221 -23.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25240 on 512 degrees of freedom
## Multiple R-squared: 0.5143, Adjusted R-squared: 0.5134
## F-statistic: 542.2 on 1 and 512 DF, p-value: < 2.2e-16
# Scatterplot with regression line
plot(hinc.edu$p.HS,
hinc.edu$hinc.num,
main = "Scatterplot between % high school graduate \n and median household income",
xlab = "Proportion of residents with less-than-high school diploma",
ylab = "Median household income",
col = "blue")
abline(ols2)
Although R outputs okay-formatted outputs in the console window, that is not pretty enough to simply screenshot them and include in your, for example, option paper. R has various packages that help us bring the outputs of correlation and/or regression analysis to Word, Excel, etc. in a well-formatted, publication-ready formats.
Some of the most commonly used packages for reporting include stargazer and sjPlot. I personally prefer sjPlot, so that’s what I am going to focus today. Note that these packages offer numerous options for you to customize your tables, and I cannot cover them all here. I provide two links below for you to explore.
stargazer: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf sjPlot: https://strengejacke.github.io/sjPlot/index.html
For reporting regression outputs, we can use tab_model() function. It has many many arguments, so be sure to check ?tab_model to see what options you have. You don’t need to know them all - just see what are available and come back to it later when you need it. The output of this function is in HTML format, and the result will be shown in the Viewer panel.
# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
show.ci = 0.95, # alpha-level for confidence interval
show.se = TRUE, # Whether to show standard errors of coefficient estimates
show.stat = TRUE) # Whether to show t value for coefficient estimates
| hinc.num | hinc.num | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Statistic | p | Estimates | std. Error | CI | Statistic | p |
| (Intercept) | 28038.11 | 1893.46 | 24318.20 – 31758.02 | 14.81 | <0.001 | 117041.07 | 2429.86 | 112267.35 – 121814.80 | 48.17 | <0.001 |
| p.graduate | 230421.55 | 9274.64 | 212200.53 – 248642.58 | 24.84 | <0.001 | |||||
| p.HS | -237987.56 | 10220.56 | -258066.94 – -217908.17 | -23.29 | <0.001 | |||||
| Observations | 514 | 514 | ||||||||
| R2 / R2 adjusted | 0.547 / 0.546 | 0.514 / 0.513 | ||||||||
If the formatted table doesn’t automatically show up in the Viewer panel, you can manual click it to open it. You will see a very nicely formatted table that looks like one of those tables in published papers. Also notice that, when the two (or more) models have same variables, those variables share the same rows to make it easier for us to compare them across models.
Okay, the table looks nice except that the name of the variables are still in a code-like format (e.g., p.HS). Let’s replace that with something meaningful.
# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
show.ci = 0.95, # alpha-level for confidence interval
show.se = TRUE, # Whether to show standard errors of coefficient estimates
show.stat = TRUE, # Whether to show t value for coefficient estimates
pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"),
# We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017")) # This label will go to the top of the table
| dependent = Median Household Income 2017 | dependent = Median Household Income 2017 | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Statistic | p | Estimates | std. Error | CI | Statistic | p |
| (Intercept) | 28038.11 | 1893.46 | 24318.20 – 31758.02 | 14.81 | <0.001 | 117041.07 | 2429.86 | 112267.35 – 121814.80 | 48.17 | <0.001 |
| % Graduate degree holders | 230421.55 | 9274.64 | 212200.53 – 248642.58 | 24.84 | <0.001 | |||||
| % High school diploma holders | -237987.56 | 10220.56 | -258066.94 – -217908.17 | -23.29 | <0.001 | |||||
| Observations | 514 | 514 | ||||||||
| R2 / R2 adjusted | 0.547 / 0.546 | 0.514 / 0.513 | ||||||||
Looking good! The last thing we need is to save it in a format that is compatible with MS Word, Excel, etc. We can supply file = argument to specify the file name. Don’t forget to put .html at the end of your file name!
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
show.ci = 0.95, # alpha-level for confidence interval
show.se = TRUE, # Whether to show standard errors of coefficient estimates
show.stat = TRUE, # Whether to show t value for coefficient estimates
pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"),
# We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017"),
# This label will go to the top of the table
file = "ols1_ols2_reg.html") # This is the output file name. This file will be save in your working directory.
| dependent = Median Household Income 2017 | dependent = Median Household Income 2017 | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Statistic | p | Estimates | std. Error | CI | Statistic | p |
| (Intercept) | 28038.11 | 1893.46 | 24318.20 – 31758.02 | 14.81 | <0.001 | 117041.07 | 2429.86 | 112267.35 – 121814.80 | 48.17 | <0.001 |
| % Graduate degree holders | 230421.55 | 9274.64 | 212200.53 – 248642.58 | 24.84 | <0.001 | |||||
| % High school diploma holders | -237987.56 | 10220.56 | -258066.94 – -217908.17 | -23.29 | <0.001 | |||||
| Observations | 514 | 514 | ||||||||
| R2 / R2 adjusted | 0.547 / 0.546 | 0.514 / 0.513 | ||||||||
Once saved, you can double click the file to open it in your web browser. You will be able to highlight them either with your mouse or by pressing control+a (or ⌘+a if you are using a mac) and copy-paste it into Excel or Word.
That was formatting a regression result using sjPlot. The package also provide functions for correlation table, contingency table, etc. I show some examples below.
Note that cor() is a function that can calculate correlation coefficient for multiple variables in a table format. But this function does not give us the hypothesis test result, so that is the difference between cor() and cor.test().
# Correlation table
mycor <- cor( hinc.edu[, c("p.lessHS", "p.HS", "p.bachelor", "p.graduate", "hinc.num")] )
tab_corr(mycor, triangle = "lower", show.p = T)
| p.lessHS | p.HS | p.bachelor | p.graduate | hinc.num | |
|---|---|---|---|---|---|
| p.lessHS | |||||
| p.HS | 0.497 | ||||
| p.bachelor | -0.700 | -0.886 | |||
| p.graduate | -0.652 | -0.846 | 0.801 | ||
| hinc.num | -0.601 | -0.717 | 0.768 | 0.739 | |
| Computed correlation used pearson-method with listwise-deletion. | |||||
# Contigency table
tab_xtab(hinc.edu$county, hinc.edu$hinc.num.ord, # two variable with which to build the contingency table
var.labels = c("County", "Median Household Income - Ordinal")) # group names
| County |
Median Household Income - Ordinal |
Total | |||
|---|---|---|---|---|---|
| lower-middle class | poor | upper-middle class | wealthy | ||
| Clayton County | 21 | 20 | 8 | 0 | 49 |
| Cobb County | 22 | 11 | 43 | 44 | 120 |
| DeKalb County | 51 | 28 | 38 | 26 | 143 |
| Fulton County | 34 | 70 | 39 | 59 | 202 |
| Total | 128 | 129 | 128 | 129 | 514 | χ2=79.087 · df=9 · Cramer’s V=0.226 · p=0.000 |