| Functions | Tasks |
|---|---|
| lm() | Fits a linear model using the ordinary least squares method. |
| abline() | Adds one or more straight lines to the current plot. |
| cor() | Computes correlations between vectors or among the columns of a matrix or data.frame. |
| tab_model() | Creates a well-formatted regression output table in HTML format. |
| tab_xtab() | Creates a well-formatted contingency table as an HTML file. |
| tab_corr() | Creates a well-formatted correlation table as an HTML file. |
In this lab, we will download two variables from the American Fact Finder, join the two datasets using a key variable, clean the data, and conduct a bivariate regression. The question we will examine is “Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?”
# Install required packages ----------- Run these if you haven't installed these
install.packages("tidyverse", dependencies = TRUE)
install.packages("stargazer", dependencies = TRUE)
install.packages("sjPlot", dependencies = TRUE)
# Call the packages using library()
library(tidyverse)
library(stargazer)
library(sjPlot)
The data we will use are from last week: the two data.frames that we joined and cleaned. This is the joined data.frame of median household income and educational attainment for census tracts in Clayton, Cobb, DeKalb, and Fulton counties.
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 8\\Lab7_2020") # Use your own pathname
hinc.edu <- read.csv("data_from_week6_edu_hinc.csv")
# Check the data
head(hinc.edu)
## GEO_ID tract county state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 1393
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 2134
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 2452
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 3520
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 4072
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 1959
## total.error lessHS lessHS.error HS HS.error someCol someCol.error bachelor
## 1 202 318 93 581 134 338 103 123
## 2 321 306 199 688 236 753 183 265
## 3 241 190 90 877 185 825 184 371
## 4 398 1034 222 1347 325 855 213 235
## 5 389 1059 300 1319 365 1075 266 530
## 6 322 781 243 430 149 567 221 169
## bachelor.error graduate graduate.error
## 1 53 33 28
## 2 99 122 59
## 3 133 189 70
## 4 136 49 49
## 5 226 89 85
## 6 105 12 18
## NAME.y hinc hinc.error p.lessHS
## 1 Census Tract 402.02, Clayton County, Georgia 31524 6665 0.22828428
## 2 Census Tract 402.03, Clayton County, Georgia 36786 5722 0.14339269
## 3 Census Tract 402.04, Clayton County, Georgia 39194 5231 0.07748777
## 4 Census Tract 403.02, Clayton County, Georgia 33190 1973 0.29375000
## 5 Census Tract 403.03, Clayton County, Georgia 37236 7343 0.26006876
## 6 Census Tract 403.06, Clayton County, Georgia 27372 10845 0.39867279
## p.HS p.someCol p.bachelor p.graduate hinc.num hinc.num.ord
## 1 0.4170854 0.2426418 0.08829864 0.023689878 31524 poor
## 2 0.3223993 0.3528585 0.12417994 0.057169634 36786 poor
## 3 0.3576672 0.3364600 0.15130506 0.077079935 39194 poor
## 4 0.3826705 0.2428977 0.06676136 0.013920455 33190 poor
## 5 0.3239194 0.2639980 0.13015717 0.021856582 37236 poor
## 6 0.2194997 0.2894334 0.08626850 0.006125574 27372 poor
## p.graduate.ord
## 1 very low
## 2 very low
## 3 low
## 4 very low
## 5 very low
## 6 very low
While the correlation tells us how closely two variables are related, it does not answer the question of how much change in p.graduate (or any other variable, for that matter) is associated with how much change in hinc.num. Framing questions in this way is important for planners and policy makers, as we often want to know not only whether a relationship exists but also, for example, how much change in median household income we can expect if we make plans and policies to promote higher education.
Warning: I worded the last sentence as if there were a causal relationship between educational attainment and median household income at the census tract level, but I CANNOT KNOW WHETHER IT IS CAUSAL FROM THE REGRESSION ANALYSIS ALONE. To establish a causal relationship, you need to consider the four criteria you learned earlier. The regression as presented below only provides support for the covariance criterion.
The following code performs a bivariate regression using hinc.num as the y-variable (i.e., the dependent variable, the variable being explained) and p.graduate as the x-variable (i.e., the independent variable, the variable used to explain the dependent variable) and assigns the estimated regression model and other related values to an R object called ols1. Printing summary(ols1) gives you formatted information about the regression, but the R object ols1 also contains many other values not displayed in summary(ols1).
# bivariate linear model
ols1 <- lm(hinc.num ~ p.graduate, data = hinc.edu)
summary(ols1)
##
## Call:
## lm(formula = hinc.num ~ p.graduate, data = hinc.edu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85749 -14481 -3105 11704 107182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28038 1894 14.81 <2e-16 ***
## p.graduate 230422 9275 24.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24390 on 512 degrees of freedom
## Multiple R-squared: 0.5466, Adjusted R-squared: 0.5457
## F-statistic: 617.2 on 1 and 512 DF, p-value: < 2.2e-16
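As mentioned above, ols1 stores more than what summary(ols1) prints. Below is a minimal sketch of a few base R accessor functions you can use to pull out individual pieces of the fitted model; the values should match the summary output above.
# Explore what else is stored in the fitted model object
names(ols1)           # Names of the components stored inside ols1
coef(ols1)            # Coefficient estimates (intercept and slope)
confint(ols1)         # 95% confidence intervals for the coefficients
head(residuals(ols1)) # First few residuals (observed minus fitted values)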
The regression result shows that an increase in the proportion of graduate degree holders from 0% to 100% (i.e., from 0 to 1) is associated with an increase in the median household income of 230,422 USD. Don’t worry if this interpretation doesn’t make sense to you just yet.
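Because p.graduate is measured as a proportion, a more intuitive way to read the coefficient is per 10-percentage-point change. Here is a minimal sketch of that arithmetic using the coefficient estimated above:
# Expected change in median household income for a 0.10 (i.e., 10-percentage-point)
# increase in the proportion of graduate degree holders
coef(ols1)["p.graduate"] * 0.10   # roughly 23,042 USD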
The t value and p value (i.e., Pr(>|t|)) for p.graduate suggest that, if the null hypothesis were true (i.e., that there is no relationship between the proportion of graduate degree holders and the median household income of census tracts, or in statistical terms, that the coefficient is not different from zero), it would be extremely unlikely to observe a regression result like this. Because such a result would be so unlikely under the null hypothesis, we can reject the null hypothesis: in a world where there is no relationship between graduate degrees and income, the relationship we just saw in the regression would be very improbable.
The regression result also indicates that the coefficient of determination (Multiple R-squared in summary(ols1)) is 0.5465967. This tells us how much of the variation in the median household income is explained by the variation in the proportion of graduate degree holders, about 55% in this case.
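If you want to pull the R-squared value out programmatically, for example to report it with full precision, a minimal sketch:
# Extract the coefficient of determination from the model summary
summary(ols1)$r.squared
## [1] 0.5465967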
The plot we created earlier for the correlation can be modified to include a regression line.
# Scatterplot with regression line
plot(hinc.edu$p.graduate,
     hinc.edu$hinc.num,
     main = "Scatterplot between % graduate degree and median household income",
     xlab = "Proportion of residents with graduate degrees",
     ylab = "Median household income",
     col = "blue")
abline(ols1)
The answer from the data is yes. The correlation (results from the last two weeks’ labs) and the bivariate regression strongly suggest that there is a positive relationship between the two variables. BUT AGAIN, THIS SHOULD NOT BE INTERPRETED AS A CAUSAL RELATIONSHIP YET!
Just for you to check, below is a bivariate regression result using p.HS instead of p.graduate.
# bivariate linear model
ols2 <- lm(hinc.num ~ p.HS, data = hinc.edu)
summary(ols2)
##
## Call:
## lm(formula = hinc.num ~ p.HS, data = hinc.edu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94362 -17023 -1660 12225 98158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117041 2430 48.17 <2e-16 ***
## p.HS -237988 10221 -23.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25240 on 512 degrees of freedom
## Multiple R-squared: 0.5143, Adjusted R-squared: 0.5134
## F-statistic: 542.2 on 1 and 512 DF, p-value: < 2.2e-16
# Scatterplot with regression line
plot(hinc.edu$p.HS,
     hinc.edu$hinc.num,
     main = "Scatterplot between % high school graduate \n and median household income",
     xlab = "Proportion of residents with a high school diploma",
     ylab = "Median household income",
     col = "blue")
abline(ols2)
Although R prints reasonably formatted output in the console window, it is not polished enough to simply screenshot it and include it in, for example, your option paper. R has various packages that help us bring the outputs of correlation and/or regression analyses into Word, Excel, etc. in well-formatted, publication-ready form.
Some of the most commonly used packages for reporting include stargazer and sjPlot. I personally prefer sjPlot, so that is what I am going to focus on today. Note that these packages offer numerous options for customizing your tables, and I cannot cover them all here. I provide two links below for you to explore.
stargazer: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf
sjPlot: https://strengejacke.github.io/sjPlot/index.html
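Although the rest of this lab uses sjPlot, here is a minimal sketch of the equivalent exporting step in stargazer in case you prefer it; the output file name is just an example.
# Export the two regression models to an HTML table with stargazer
stargazer(ols1, ols2, type = "html", out = "ols1_ols2_stargazer.html")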
For reporting regression outputs, we can use the tab_model() function. It has many arguments, so be sure to check ?tab_model to see what options you have. You don’t need to know them all; just see what is available and come back to them later when you need them. The output of this function is in HTML format, and the result will be shown in the Viewer panel.
# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
show.ci = 0.95, # Confidence level for the confidence interval
show.se = TRUE, # Whether to show standard errors of coefficient estimates
show.stat = TRUE) # Whether to show t value for coefficient estimates
| | hinc.num | | | | | hinc.num | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Statistic | p | Estimates | std. Error | CI | Statistic | p |
| (Intercept) | 28038.11 | 1893.46 | 24318.20 – 31758.02 | 14.81 | <0.001 | 117041.07 | 2429.86 | 112267.35 – 121814.80 | 48.17 | <0.001 |
| p.graduate | 230421.55 | 9274.64 | 212200.53 – 248642.58 | 24.84 | <0.001 | |||||
| p.HS | -237987.56 | 10220.56 | -258066.94 – -217908.17 | -23.29 | <0.001 | |||||
| Observations | 514 | 514 | ||||||||
| R2 / R2 adjusted | 0.547 / 0.546 | 0.514 / 0.513 | ||||||||
If the formatted table doesn’t automatically show up in the Viewer panel, you can manually click it to open it. You will see a nicely formatted table that looks like the tables in published papers. Also notice that, when two (or more) models share the same variables, those variables share the same rows, which makes it easier to compare them across models.
Okay, the table looks nice, except that the names of the variables are still in a code-like format (e.g., p.HS). Let’s replace them with something more meaningful.
# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
show.ci = 0.95, # Confidence level for the confidence interval
show.se = TRUE, # Whether to show standard errors of coefficient estimates
show.stat = TRUE, # Whether to show t value for coefficient estimates
pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"),
# We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017")) # This label will go to the top of the table
| | dependent = Median Household Income 2017 | | | | | dependent = Median Household Income 2017 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Statistic | p | Estimates | std. Error | CI | Statistic | p |
| (Intercept) | 28038.11 | 1893.46 | 24318.20 – 31758.02 | 14.81 | <0.001 | 117041.07 | 2429.86 | 112267.35 – 121814.80 | 48.17 | <0.001 |
| % Graduate degree holders | 230421.55 | 9274.64 | 212200.53 – 248642.58 | 24.84 | <0.001 | |||||
| % High school diploma holders | -237987.56 | 10220.56 | -258066.94 – -217908.17 | -23.29 | <0.001 | |||||
| Observations | 514 | 514 | ||||||||
| R2 / R2 adjusted | 0.547 / 0.546 | 0.514 / 0.513 | ||||||||
Looking good! The last thing we need to do is save it in a format that is compatible with MS Word, Excel, etc. We can supply the file = argument to specify the file name. Don’t forget to put .html at the end of your file name!
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
show.ci = 0.95, # Confidence level for the confidence interval
show.se = TRUE, # Whether to show standard errors of coefficient estimates
show.stat = TRUE, # Whether to show t value for coefficient estimates
pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"),
# We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017"),
# This label will go to the top of the table
file = "ols1_ols2_reg.html") # This is the output file name. This file will be save in your working directory.
| | dependent = Median Household Income 2017 | | | | | dependent = Median Household Income 2017 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Statistic | p | Estimates | std. Error | CI | Statistic | p |
| (Intercept) | 28038.11 | 1893.46 | 24318.20 – 31758.02 | 14.81 | <0.001 | 117041.07 | 2429.86 | 112267.35 – 121814.80 | 48.17 | <0.001 |
| % Graduate degree holders | 230421.55 | 9274.64 | 212200.53 – 248642.58 | 24.84 | <0.001 | |||||
| % High school diploma holders | -237987.56 | 10220.56 | -258066.94 – -217908.17 | -23.29 | <0.001 | |||||
| Observations | 514 | 514 | ||||||||
| R2 / R2 adjusted | 0.547 / 0.546 | 0.514 / 0.513 | ||||||||
Once saved, you can double-click the file to open it in your web browser. You can then highlight the table either with your mouse or by pressing Ctrl+A (or ⌘+A if you are using a Mac) and copy-paste it into Excel or Word.
That was how to format a regression result using sjPlot. The package also provides functions for correlation tables, contingency tables, etc. I show some examples below.
Note that cor() can calculate correlation coefficients for multiple variables at once in a matrix format. However, cor() does not give us hypothesis test results; that is the difference between cor() and cor.test() (see the cor.test() sketch after the correlation table below).
# Correlation table
mycor <- cor( hinc.edu[, c("p.lessHS", "p.HS", "p.bachelor", "p.graduate", "hinc.num")] )
tab_corr(mycor, triangle = "lower", show.p = T)
| p.lessHS | p.HS | p.bachelor | p.graduate | hinc.num | |
|---|---|---|---|---|---|
| p.lessHS | |||||
| p.HS | 0.497 | ||||
| p.bachelor | -0.700 | -0.886 | |||
| p.graduate | -0.652 | -0.846 | 0.801 | ||
| hinc.num | -0.601 | -0.717 | 0.768 | 0.739 | |
| Computed correlation used pearson-method with listwise-deletion. | |||||
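As noted above, cor() alone does not test hypotheses. For a single pair of variables, a minimal sketch of the corresponding test using base R's cor.test():
# Correlation test for one pair of variables: returns the correlation coefficient
# together with a t statistic, p-value, and confidence interval
cor.test(hinc.edu$p.graduate, hinc.edu$hinc.num)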
# Contingency table
tab_xtab(hinc.edu$county, hinc.edu$hinc.num.ord, # two variables with which to build the contingency table
         var.labels = c("County", "Median Household Income - Ordinal")) # group names
## Warning in sprintf(" <td class=\"summary tdata\" colspan=\"%i\">%s=%.3f ·
## df=%i · %s=%.3f · %s=%.3f</td>", : one argument not used by format
## ' <td class="summary tdata" colspan="%i">%s=%.3f · df=%i · %s=%.3f
## · %s=%.3f</td>'
County × Median Household Income - Ordinal:

| County | lower-middle class | poor | upper-middle class | wealthy | Total |
|---|---|---|---|---|---|
| Clayton County | 21 | 20 | 8 | 0 | 49 |
| Cobb County | 22 | 11 | 43 | 44 | 120 |
| DeKalb County | 51 | 28 | 38 | 26 | 143 |
| Fulton County | 34 | 70 | 39 | 59 | 202 |
| Total | 128 | 129 | 128 | 129 | 514 |

χ2=79.087 · df=9 · Cramer’s V=0.226 · p=0.000
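If you want the chi-square test itself in the console rather than in the HTML table, here is a minimal sketch using base R's chisq.test(); the statistic should match the χ2 reported at the bottom of the table above.
# Chi-square test of independence between county and the ordinal income category
chisq.test(table(hinc.edu$county, hinc.edu$hinc.num.ord))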