Functions Tasks
lm() Performs a linear modeling using ordinary least squares method.
abline() Adds one or more straight lines through the current plot.
cor() Computes correlations with either vectors or a matrix or data.frame.
tab_model() Creates a well-formatted regression outputs into HTML format.
tab_xtab() Creates a well-formatted contingency table as HTML file.
tab_corr() Creates a well-formatted correlation table as HTML file.

0. Before We Get Started: Downloading data, setting the working directory, and reading data

In this lab, we will download two variables from the American Fact Finder, join the two data using a key variable, clean the data, and conduc bivariate regression. The question we will examine is “Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?”

# Install required packages ----------- Run these if you haven't installed these
install.packages("tidyverse", dependencies = TRUE)
install.packages("stargazer", dependencies = TRUE)
install.packages("sjPlot", dependencies = TRUE)
# Call the packages using library()
library(tidyverse)
library(stargazer)
library(sjPlot)

The data we will use is from the last week, the two data.frames that we joined and cleaned. This data is the joined data.frame on the median household income and the education attainment for census tracts in Clayton, Cobb, DeKalb, and Fulton County.

# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 8\\Lab7_2020") # Use your own pathname
hinc.edu <- read.csv("data_from_week6_edu_hinc.csv") 
# Check the data
head(hinc.edu)
##                 GEO_ID               tract         county   state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia  1393
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia  2134
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia  2452
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia  3520
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia  4072
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia  1959
##   total.error lessHS lessHS.error   HS HS.error someCol someCol.error bachelor
## 1         202    318           93  581      134     338           103      123
## 2         321    306          199  688      236     753           183      265
## 3         241    190           90  877      185     825           184      371
## 4         398   1034          222 1347      325     855           213      235
## 5         389   1059          300 1319      365    1075           266      530
## 6         322    781          243  430      149     567           221      169
##   bachelor.error graduate graduate.error
## 1             53       33             28
## 2             99      122             59
## 3            133      189             70
## 4            136       49             49
## 5            226       89             85
## 6            105       12             18
##                                         NAME.y  hinc hinc.error   p.lessHS
## 1 Census Tract 402.02, Clayton County, Georgia 31524       6665 0.22828428
## 2 Census Tract 402.03, Clayton County, Georgia 36786       5722 0.14339269
## 3 Census Tract 402.04, Clayton County, Georgia 39194       5231 0.07748777
## 4 Census Tract 403.02, Clayton County, Georgia 33190       1973 0.29375000
## 5 Census Tract 403.03, Clayton County, Georgia 37236       7343 0.26006876
## 6 Census Tract 403.06, Clayton County, Georgia 27372      10845 0.39867279
##        p.HS p.someCol p.bachelor  p.graduate hinc.num hinc.num.ord
## 1 0.4170854 0.2426418 0.08829864 0.023689878    31524         poor
## 2 0.3223993 0.3528585 0.12417994 0.057169634    36786         poor
## 3 0.3576672 0.3364600 0.15130506 0.077079935    39194         poor
## 4 0.3826705 0.2428977 0.06676136 0.013920455    33190         poor
## 5 0.3239194 0.2639980 0.13015717 0.021856582    37236         poor
## 6 0.2194997 0.2894334 0.08626850 0.006125574    27372         poor
##   p.graduate.ord
## 1       very low
## 2       very low
## 3            low
## 4       very low
## 5       very low
## 6       very low

   

1. Bivariate Regression Analysis

While the correlation tells us how intimately two variables are related, it does not answer the question of how much change in p.graduate (or any other variable for that matter) is associated with how much change in hinc.num. Framing questions in this way is important for planners and policy makers as we often want to know not only the existence of relationships but also how much change in the median household income we can expect if we make plans and policies to promote higher education, for example.

Warning: I worded the last sentence as if there is a causal relationship between education attainment and median household income at census tract level, but I CANNOT KNOW IF IT IS CAUSAL FROM ONLY THE REGRESSION ANALYSIS. To establish a causal relationship, you need to consider the four criteria you learned earlier. The regression as presented below only provide support for covariance criteria.

 

The following code performs a bivariate regression using hinc.num as the y-variable (i.e., the dependent variable, the variable that is being explained) and p.graduate as the x-variable (i.e., the independent variable, the variable that is used to explain the dependent variable) and assigns the estimated regression model and other related values in an R object called ols1. Printing summary(ols1) gives you a formatted information of the regression, but the R object ols1 contains many other values not displayed in summary(ols1) as well.

# bivariate linear model 
ols1 <- lm(hinc.num ~ p.graduate, data = hinc.edu)
summary(ols1)
## 
## Call:
## lm(formula = hinc.num ~ p.graduate, data = hinc.edu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85749 -14481  -3105  11704 107182 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28038       1894   14.81   <2e-16 ***
## p.graduate    230422       9275   24.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24390 on 512 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.5457 
## F-statistic: 617.2 on 1 and 512 DF,  p-value: < 2.2e-16

The regression result shows that the increase in the proportion of graduate degree holder from 0% to 100% is associated with the increase in the median household income by 230,422 USD. Don’t worry if this interpretation doesn’t make sense to you just yet.

The t value and p value (i.e., Pr(>|t|)) for p.graduate suggest that, if the null hypothesis (i.e., that there exists no relationship between the the proportion of graduate degree holders and the median household income of Census Tracts, or in statistical term, that the coefficient is not different from zero) was correct, it is extremely unlikely to see a regression result like this. Because it is so unlikely thing to happen if the null hypothesis was true, we can conclude that we can reject the null hypothesis - why? because in the world where there is no relationship between graduate degree and income, seeing the relationship we just saw in the regression is soooo unlikely.

The regression result also indicates that the coefficient of determination (Multiple R-squared in the summary(ols1)) is 0.5465967. This tells us how much of the variation in the median household income is explained by variation in the proportion of graduate degree holders.

 

The plot we created earlier for the correlation can be modified to include a regression line.

# Scatterplot with regression line
plot(hinc.edu$p.graduate,
     hinc.edu$hinc.num,
     main = "Scatterplot between % gradute degree and median household income",
     xlab = "Proportion of residents with graduate degrees",
     ylab = "Median household income",
     col = "blue")+
  abline(ols1)

## integer(0)

Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?

The answer from the data is yes. The correlation (results from last two week’s lab) and bivariate regression strongly suggest that there exists a positive relationship between the two variables. BUT AGAIN, THIS SHOULD NOT BE INTERPRETED AS A CAUSAL RELATIONSHIP YET!

Just for you to check, below are a bivariate regression result using p.HS instead of p.graduate.

# bivariate linear model 
ols2 <- lm(hinc.num ~ p.HS, data = hinc.edu)
summary(ols2)
## 
## Call:
## lm(formula = hinc.num ~ p.HS, data = hinc.edu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94362 -17023  -1660  12225  98158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   117041       2430   48.17   <2e-16 ***
## p.HS         -237988      10221  -23.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25240 on 512 degrees of freedom
## Multiple R-squared:  0.5143, Adjusted R-squared:  0.5134 
## F-statistic: 542.2 on 1 and 512 DF,  p-value: < 2.2e-16
# Scatterplot with regression line
plot(hinc.edu$p.HS,
     hinc.edu$hinc.num,
     main = "Scatterplot between % high school graduate \n and median household income",
     xlab = "Proportion of residents with less-than-high school diploma",
     ylab = "Median household income",
     col = "blue")+
  abline(ols2)

## integer(0)

2. Reporting the results in pretty formats

Although R outputs okay-formatted outputs in the console window, that is not pretty enough to simply screenshot them and include in your, for example, option paper. R has various packages that help us bring the outputs of correlation and/or regression analysis to Word, Excel, etc. in a well-formatted, publication-ready formats.

Some of the most commonly used packages for reporting include stargazer and sjPlot. I personally prefer sjPlot, so that’s what I am going to focus today. Note that these packages offer numerous options for you to customize your tables, and I cannot cover them all here. I provide two links below for you to explore.

stargazer: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf sjPlot: https://strengejacke.github.io/sjPlot/index.html

Basic command

For reporting regression outputs, we can use tab_model() function. It has many many arguments, so be sure to check ?tab_model to see what options you have. You don’t need to know them all - just see what are available and come back to it later when you need it. The output of this function is in HTML format, and the result will be shown in the Viewer panel.

# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
          show.ci = 0.95, # alpha-level for confidence interval
          show.se = TRUE, # Whether to show standard errors of coefficient estimates
          show.stat = TRUE) # Whether to show t value for coefficient estimates
  hinc.num hinc.num
Predictors Estimates std. Error CI Statistic p Estimates std. Error CI Statistic p
(Intercept) 28038.11 1893.46 24318.20 – 31758.02 14.81 <0.001 117041.07 2429.86 112267.35 – 121814.80 48.17 <0.001
p.graduate 230421.55 9274.64 212200.53 – 248642.58 24.84 <0.001
p.HS -237987.56 10220.56 -258066.94 – -217908.17 -23.29 <0.001
Observations 514 514
R2 / R2 adjusted 0.547 / 0.546 0.514 / 0.513

 

If the formatted table doesn’t automatically show up in the Viewer panel, you can manual click it to open it. You will see a very nicely formatted table that looks like one of those tables in published papers. Also notice that, when the two (or more) models have same variables, those variables share the same rows to make it easier for us to compare them across models.

Labeling the variable names

Okay, the table looks nice except that the name of the variables are still in a code-like format (e.g., p.HS). Let’s replace that with something meaningful.

# Regression output using sjPlot package
tab_model(ols1, ols2, # You can compare multiple models in one table as well.
          show.ci = 0.95, # alpha-level for confidence interval
          show.se = TRUE, # Whether to show standard errors of coefficient estimates
          show.stat = TRUE, # Whether to show t value for coefficient estimates
    
          pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"), 
          # We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
          
          dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017")) # This label will go to the top of the table
  dependent = Median Household Income 2017 dependent = Median Household Income 2017
Predictors Estimates std. Error CI Statistic p Estimates std. Error CI Statistic p
(Intercept) 28038.11 1893.46 24318.20 – 31758.02 14.81 <0.001 117041.07 2429.86 112267.35 – 121814.80 48.17 <0.001
% Graduate degree holders 230421.55 9274.64 212200.53 – 248642.58 24.84 <0.001
% High school diploma holders -237987.56 10220.56 -258066.94 – -217908.17 -23.29 <0.001
Observations 514 514
R2 / R2 adjusted 0.547 / 0.546 0.514 / 0.513

 

Looking good! The last thing we need is to save it in a format that is compatible with MS Word, Excel, etc. We can supply file = argument to specify the file name. Don’t forget to put .html at the end of your file name!

tab_model(ols1, ols2, # You can compare multiple models in one table as well.
          show.ci = 0.95, # alpha-level for confidence interval
          show.se = TRUE, # Whether to show standard errors of coefficient estimates
          show.stat = TRUE, # Whether to show t value for coefficient estimates
          
          pred.labels = c("(Intercept)", "% Graduate degree holders", "% High school diploma holders"), 
          # We have three variables, intercept, p.graduate, p.HS. So we need to supply three names.
          
          dv.labels = c("dependent = Median Household Income 2017", "dependent = Median Household Income 2017"),
          # This label will go to the top of the table

          file = "ols1_ols2_reg.html") # This is the output file name. This file will be save in your working directory.
  dependent = Median Household Income 2017 dependent = Median Household Income 2017
Predictors Estimates std. Error CI Statistic p Estimates std. Error CI Statistic p
(Intercept) 28038.11 1893.46 24318.20 – 31758.02 14.81 <0.001 117041.07 2429.86 112267.35 – 121814.80 48.17 <0.001
% Graduate degree holders 230421.55 9274.64 212200.53 – 248642.58 24.84 <0.001
% High school diploma holders -237987.56 10220.56 -258066.94 – -217908.17 -23.29 <0.001
Observations 514 514
R2 / R2 adjusted 0.547 / 0.546 0.514 / 0.513

 

Once saved, you can double click the file to open it in your web browser. You will be able to highlight them either with your mouse or by pressing control+a (or ⌘+a if you are using a mac) and copy-paste it into Excel or Word.

Other tables you can explore

That was formatting a regression result using sjPlot. The package also provide functions for correlation table, contingency table, etc. I show some examples below.

Note that cor() is a function that can calculate correlation coefficient for multiple variables in a table format. But this function does not give us the hypothesis test result, so that is the difference between cor() and cor.test().

# Correlation table
mycor <- cor( hinc.edu[, c("p.lessHS", "p.HS", "p.bachelor", "p.graduate", "hinc.num")] ) 
tab_corr(mycor, triangle = "lower", show.p = T)
  p.lessHS p.HS p.bachelor p.graduate hinc.num
p.lessHS          
p.HS 0.497        
p.bachelor -0.700 -0.886      
p.graduate -0.652 -0.846 0.801    
hinc.num -0.601 -0.717 0.768 0.739  
Computed correlation used pearson-method with listwise-deletion.
# Contigency table
tab_xtab(hinc.edu$county, hinc.edu$hinc.num.ord, # two variable with which to build the contingency table
         var.labels = c("County", "Median Household Income - Ordinal")) # group names
## Warning in sprintf(" <td class=\"summary tdata\" colspan=\"%i\">%s=%.3f &middot;
## df=%i &middot; %s=%.3f &middot; %s=%.3f</td>", : one argument not used by format
## ' <td class="summary tdata" colspan="%i">%s=%.3f &middot; df=%i &middot; %s=%.3f
## &middot; %s=%.3f</td>'
County Median Household
Income - Ordinal
Total
lower-middle class poor upper-middle class wealthy
Clayton County 21 20 8 0 49
Cobb County 22 11 43 44 120
DeKalb County 51 28 38 26 143
Fulton County 34 70 39 59 202
Total 128 129 128 129 514
χ2=79.087 · df=9 · Cramer’s V=0.226 · p=0.000