Welcome to topic 4! In this unit, we’ll go through the steps of a basic data analysis project. This includes creating a professional looking summary statistics table, making some basic plots and histograms of our data, determining correlations between our variables, and beginning to build a basic regression model. We will not have time to delve into statistical theory in this course; if this type of analysis interests you, I would recommend taking a statistics course to learn more. Many of these skills are taught in much greater detail in my course, DIDA 130: Introduction to Statistical Thinking.
Starting Out: Summary Statistics Table
When we are beginning to work with a new data set, getting to know our data is always the first step in our analysis. The first thing we will do is choose the variables that interest us, which we might want to further analyze later on. The most basic way to begin getting to know your data is with a summary statistic table: this will usually include measures of central tendency (mean or median) and variability (standard deviation, minimum and maximum).
# We will start by choosing a few variables to study.
setwd("~/Binghamton/Harp325")
Warning: The working directory was changed to C:/Users/mhaller/Documents/Binghamton/Harp325 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
data <- read.csv("ny_data_topic4.csv")
library(dplyr)
variables <- data %>%
select(c(crime_rate, frac_middleclass, unemp_rate, inequality, racial_segregation, percent_college_degree, median_house_value))
variables
NA
#Next, we'll make our summary table! This is very easy to do with Dplyr
#Why do we put a 1 at the end of the name? If we want to use any of these functions later, R will get confused if, for example, a mean object already exists
mean1 <- variables %>% summarise(across(everything(), mean, na.rm=TRUE))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(everything(), mean, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function
instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning
was generated.
sd1 <- variables %>% summarise(across(everything(), sd, na.rm=TRUE))
min1 <- variables %>% summarise(across(everything(), min, na.rm=TRUE))
max1 <- variables %>% summarise(across(everything(), max, na.rm=TRUE))
#put together into a table using rbind()
table <- rbind(mean1, sd1, min1, max1)
#Usually, we want the variables to be in the rows and the summary measures to be in the columns - let's flip them around
rownames(table) <- c("Mean", "Standard Deviation", "Minimum", "Maximum")
#Transpose the dataframe to flip it around
table <- t(table)
#remove scientific notation
options(scipen = 999)
#Tell R to round the data to make it look nicer
table <- table %>%
as.data.frame %>%
mutate_if(is.numeric, round, digits=2)
table
So, this is a start! Unfortunately, it still doesn’t look very nice, does it? Let’s install a new package to help us make a much better looking summary table.
library(kableExtra)
Warning: package ‘kableExtra’ was built under R version 4.2.2
Attaching package: ‘kableExtra’
The following object is masked from ‘package:dplyr’:
group_rows
#we'll use this package to make our summary statistic table - again, we can use the dplyr pipe operator to make this really easy
table %>%
kbl() %>%
kable_styling()
| Mean | Standard Deviation | Minimum | Maximum | |
|---|---|---|---|---|
| crime_rate | 0.01 | 0.00 | 0.00 | 0.01 |
| frac_middleclass | 0.54 | 0.08 | 0.35 | 0.70 |
| unemp_rate | 0.05 | 0.01 | 0.03 | 0.07 |
| inequality | 0.39 | 0.11 | 0.24 | 1.09 |
| racial_segregation | 0.14 | 0.11 | 0.01 | 0.42 |
| percent_college_degree | 21.52 | 8.23 | 11.50 | 49.40 |
| median_house_value | 158573.27 | 169087.12 | 67183.20 | 1333001.33 |
Ok, so far, this looks much better than it did originally! Let’s try out some other themes to see what looks best.
#This classic theme looks just like a table from a publication!
table %>%
kbl() %>%
kable_classic_2("striped", full_width = F)
| Mean | Standard Deviation | Minimum | Maximum | |
|---|---|---|---|---|
| crime_rate | 0.01 | 0.00 | 0.00 | 0.01 |
| frac_middleclass | 0.54 | 0.08 | 0.35 | 0.70 |
| unemp_rate | 0.05 | 0.01 | 0.03 | 0.07 |
| inequality | 0.39 | 0.11 | 0.24 | 1.09 |
| racial_segregation | 0.14 | 0.11 | 0.01 | 0.42 |
| percent_college_degree | 21.52 | 8.23 | 11.50 | 49.40 |
| median_house_value | 158573.27 | 169087.12 | 67183.20 | 1333001.33 |
#Same thing, but let's add a title now too
table %>%
#These lines add a table to the figure
#We have to use a bit of HTML to format it
kbl(caption = "<center><strong>Figure 1: Socioeconomic Data in NYS</strong></center>",
format = "html") %>%
kable_classic_2("striped", full_width = F)
| Mean | Standard Deviation | Minimum | Maximum | |
|---|---|---|---|---|
| crime_rate | 0.01 | 0.00 | 0.00 | 0.01 |
| frac_middleclass | 0.54 | 0.08 | 0.35 | 0.70 |
| unemp_rate | 0.05 | 0.01 | 0.03 | 0.07 |
| inequality | 0.39 | 0.11 | 0.24 | 1.09 |
| racial_segregation | 0.14 | 0.11 | 0.01 | 0.42 |
| percent_college_degree | 21.52 | 8.23 | 11.50 | 49.40 |
| median_house_value | 158573.27 | 169087.12 | 67183.20 | 1333001.33 |
NA
NA
What if we want to change the colors of the rows, conditional on the values? This isn’t super meaningful for our summary table, but it might be for other purposes. What if, for example, we wanted to compare each variable by NYS regions?
#Create a region summary table
regions <- data %>%
select(region, crime_rate, frac_middleclass, unemp_rate, inequality, racial_segregation, percent_college_degree, median_house_value) %>%
group_by(region) %>%
summarise(across(crime_rate:median_house_value, mean, na.rm=T)) %>%
mutate_if(is.numeric, round, digits=2)
#Create a copy of our original table so we don't lose our work!
regions1 <- regions
#Set conditional cell specification
regions1$racial_segregation = cell_spec(regions1$racial_segregation, background = ifelse(regions1$racial_segregation > 0.15, "red", "blue"), color = "white")
#print table
regions1 %>%
kbl(escape = F, align = "c") %>%
kable_classic_2("striped", full_width = F)
| region | crime_rate | frac_middleclass | unemp_rate | inequality | racial_segregation | percent_college_degree | median_house_value |
|---|---|---|---|---|---|---|---|
| capital | 0.01 | 0.54 | 0.04 | 0.38 | 0.14 | 22.60 | 129784.21 |
| central | 0.01 | 0.57 | 0.05 | 0.37 | 0.12 | 19.76 | 104267.26 |
| finger lakes | 0.01 | 0.59 | 0.05 | 0.33 | 0.11 | 18.51 | 109765.14 |
| hudson valley | 0.01 | 0.44 | 0.04 | 0.40 | 0.14 | 29.16 | 243805.70 |
| mohawk valley | 0.01 | 0.55 | 0.04 | 0.38 | 0.11 | 18.23 | 99841.70 |
| new york | 0.01 | 0.44 | 0.06 | 0.60 | 0.31 | 26.66 | 489531.19 |
| north country | 0.01 | 0.62 | 0.05 | 0.32 | 0.13 | 16.10 | 98032.63 |
| southern tier | 0.01 | 0.57 | 0.04 | 0.38 | 0.07 | 21.84 | 101241.35 |
| suffolk | 0.00 | 0.44 | 0.04 | 0.42 | 0.25 | 31.45 | 284928.75 |
| western | 0.01 | 0.57 | 0.05 | 0.37 | 0.19 | 18.18 | 92936.76 |
NA
I can even highlight a row or two for easy comparison!
regions %>%
kbl(align = "c") %>%
kable_paper("striped", full_width = F) %>%
row_spec(c(3, 8), bold = T, color = "white", background = "cornflowerblue")
| region | crime_rate | frac_middleclass | unemp_rate | inequality | racial_segregation | percent_college_degree | median_house_value |
|---|---|---|---|---|---|---|---|
| capital | 0.01 | 0.54 | 0.04 | 0.38 | 0.14 | 22.60 | 129784.21 |
| central | 0.01 | 0.57 | 0.05 | 0.37 | 0.12 | 19.76 | 104267.26 |
| finger lakes | 0.01 | 0.59 | 0.05 | 0.33 | 0.11 | 18.51 | 109765.14 |
| hudson valley | 0.01 | 0.44 | 0.04 | 0.40 | 0.14 | 29.16 | 243805.70 |
| mohawk valley | 0.01 | 0.55 | 0.04 | 0.38 | 0.11 | 18.23 | 99841.70 |
| new york | 0.01 | 0.44 | 0.06 | 0.60 | 0.31 | 26.66 | 489531.19 |
| north country | 0.01 | 0.62 | 0.05 | 0.32 | 0.13 | 16.10 | 98032.63 |
| southern tier | 0.01 | 0.57 | 0.04 | 0.38 | 0.07 | 21.84 | 101241.35 |
| suffolk | 0.00 | 0.44 | 0.04 | 0.42 | 0.25 | 31.45 | 284928.75 |
| western | 0.01 | 0.57 | 0.05 | 0.37 | 0.19 | 18.18 | 92936.76 |
Next, we will plot our data to get a better sense of its distribution and relationships between variables. Let’s start with some histograms, which plot the distribution of individual variables. We can only produce a histogram of numeric variables; this type of plot does not work very well for categorical variables. What can we learn from a histogram? Histograms will give us a sense of the shape of the data: What is the range of the data? Is each county equally distributed across the range, or do we see an interesting pattern? Is there a clear mid-point/mean? Is the data skewed? Are there major outliers that we should look into? We especially need to know if there is an odd pattern or irregularity in the data before we can analyze it further. We won’t go too deep into what to do if your data has a strange distribution and/or major outliers, but it’s good practice to check for these things now. Really big outliers can, for example, bias the results of our regression or give us misleading results.
#We'll make these using ggplot, although the built in hist() function in R works pretty well
hist(data$inequality)
Next, I’ll show the ggplot2 version. We would, ideally, take a look at histograms for all of the main variables of interest.
library(ggplot2)
ggplot(data, aes(x = racial_segregation))+
geom_histogram(fill = "cornflowerblue", color = "gray", bins = 10)+
theme_minimal()+
labs(title = "Distribution of Racial Segregation")
NA
NA
Next, we might think about the relationships between variables using some plots. I’ll make a few plots side by side using the cowplot package and geom_smooth to plot a regression line to illustrate the relationships between variables.
library(cowplot)
a <- ggplot(data)+
geom_point(aes(x = inequality, y = racial_segregation, color = region))+
geom_smooth(method = "lm", aes(x = inequality, y = racial_segregation))+
theme_minimal()+
labs(title = "Segregation vs Inequality")
b <- ggplot(data)+
geom_point(aes(x = crime_rate, y = racial_segregation, color = region))+
geom_smooth(method = "lm", aes(x = crime_rate, y = racial_segregation))+
theme_minimal()+
labs(title = "Segregation vs Crime Rates")
c <- ggplot(data)+
geom_point(aes(x = percent_college_degree, y = racial_segregation, color = region))+
geom_smooth(method = "lm", aes(x = percent_college_degree, y = racial_segregation))+
theme_minimal()+
labs(title = "Segregation vs College Education")
plot_grid(a, b, c, nrow = 1)
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
Be sure to make the plot full screen! Otherwise, it looks a bit squished. We can see that there is a linear relationship in all plots, although the inequality plot has the clearest relationship. We can easily look at the relationship between all of the variables by checking correlations. First, I’ll show you how to make a correlation table, and then we’ll make a correlation plot to better illustrate correlation.
#Make a correlation table
corr_data <- data %>% select(racial_segregation, crime_rate, frac_middleclass, unemp_rate, inequality, percent_college_degree, median_house_value)
#Calculate correlations
corr_table <- cor(corr_data)
#Make the table
corr_table %>%
kbl(align = "c") %>%
kable_paper("striped", full_width = F)
| racial_segregation | crime_rate | frac_middleclass | unemp_rate | inequality | percent_college_degree | median_house_value | |
|---|---|---|---|---|---|---|---|
| racial_segregation | 1.0000000 | 0.3979778 | -0.4881346 | 0.2291692 | 0.5078557 | 0.3016265 | 0.4194023 |
| crime_rate | 0.3979778 | 1.0000000 | -0.3396174 | 0.0725339 | 0.3550824 | 0.1546712 | 0.1770690 |
| frac_middleclass | -0.4881346 | -0.3396174 | 1.0000000 | 0.2147388 | -0.6092046 | -0.7742870 | -0.6257974 |
| unemp_rate | 0.2291692 | 0.0725339 | 0.2147388 | 1.0000000 | 0.1831194 | -0.4577376 | 0.0893107 |
| inequality | 0.5078557 | 0.3550824 | -0.6092046 | 0.1831194 | 1.0000000 | 0.5782252 | 0.8894241 |
| percent_college_degree | 0.3016265 | 0.1546712 | -0.7742870 | -0.4577376 | 0.5782252 | 1.0000000 | 0.6475987 |
| median_house_value | 0.4194023 | 0.1770690 | -0.6257974 | 0.0893107 | 0.8894241 | 0.6475987 | 1.0000000 |
Creating a correlation plot makes the relationships between variables much easier to see.
#Make the correlation plot
library(corrplot)
corrplot 0.92 loaded
#The cex arguments just change the size of the text on the plot
corrplot(corr_table, tl.cex = 1, cl.cex = 0.5)
There are some additional style options that you can customize with a correlation plot! For example:
#note that tl.cex changes the font size for the row and column labels, while cl.cex changes the font size of the legend on the side
corrplot(corr_table, method = 'number', tl.cex = 0.8, cl.cex = 0.5)
Next, we’ll compute a basic regression using our data:
summary(lm(formula = racial_segregation ~ inequality,
data = data))
Call:
lm(formula = racial_segregation ~ inequality, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.147613 -0.078843 0.002082 0.054144 0.283150
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05096 0.04386 -1.162 0.25
inequality 0.49839 0.10914 4.567 0.0000252 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09288 on 60 degrees of freedom
Multiple R-squared: 0.2579, Adjusted R-squared: 0.2455
F-statistic: 20.85 on 1 and 60 DF, p-value: 0.00002515
Note that this regression is the same as the one performed in the third plot from the point plots above. Linear regression with one variable is simply a line drawn in two-dimensional space that minimizes the mean squared error between the plotted points and the line. We cannot make the same plot if we add more than one independent variable; each additional variable adds another dimension to the calculation of the regression line, and we cannot visualize that easily.
How do we interpret this regression? The equation looks something like this:
racial segregation = 0.058 + 0.004college degree + E
In other words, a 1 percentage point increase in the number of residents with a college degree is associated with a 0.004 unit increase in racial segregation. This may be a surprising result, depending on what factors you expect would impact the level of segregation in a neighborhood. With an R-Squared value of 0.091, we can see that college degree only explains about 1% of the variation in racial segregation - there are likely other variables that better capture this variation. We can also see that percent college degree is significant, but weakly so, indicating again that it may not have a strong relationship with racial segregation. Next, we’ll add some more variables into the analysis to get a better sense of the factors that impact racial segregation at the county level.
#education and crime rates are both measured in percentages, but crime_rate is stored as a fraction. Let's multiply it by 100 so the units are the same as college degree
data <- data %>% mutate(crime_rate1 = crime_rate*100)
summary(reg2 <- lm(formula = racial_segregation ~ percent_college_degree + crime_rate1 + median_house_value,
data = data))
Call:
lm(formula = racial_segregation ~ percent_college_degree + crime_rate1 +
median_house_value, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.12389 -0.06826 -0.01354 0.04509 0.30140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01385390254 0.04955557230 -0.280 0.78081
percent_college_degree 0.00037384604 0.00189703970 0.197 0.84446
crime_rate1 0.17480250281 0.05932060756 2.947 0.00462 **
median_house_value 0.00000021617 0.00000009266 2.333 0.02315 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09276 on 58 degrees of freedom
Multiple R-squared: 0.2846, Adjusted R-squared: 0.2476
F-statistic: 7.69 on 3 and 58 DF, p-value: 0.0002071
In this regression, we are explaining much more variation: about 29%, indicating that the variables in this multiple regression model explain the variation in racial segregation much better.
In Class Activity: If time, pick three variables of interest. Repeat these steps, using one of the three variables as the dependent variable and the other as the independent variables.
Resources
Zhu, Hao (2019). Create Awesome HTML Table with knitr::kable and kableExtra, https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html.