knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(reshape2)
library(ggpubr)
library(ggfortify)
library(psych)
require(scales)
This is the 2nd of 3 notebooks that researches the question, “Can education and wages be used as predictors for home values?”. The first notebook completed data cleaning and transformation; this notebook will focus on visualization.
## Data for this project is 2745 rows and 16 features
head(master,1)
## state county dropout hs_diploma some_college
## 1 AK Fairbanks North Star Borough 3,381 13,785 24,520
## four_year_degree dropout_percent hs_diploma_percent some_college_percent
## 1 20,152 5.5 22.3 39.7
## four_year_degree_percent region region_num resident_count avg_annual_pay
## 1 32.6 West 4 2275 53107.01
## metro avg_home_value
## 1 Fairbanks 244178.1
Here is a helpful description of the 16 columns in the master data file.
describe(master)
## vars n mean sd median trimmed
## state* 1 2745 25.80 14.12 24.00 25.86
## county* 2 2745 820.89 457.23 823.00 819.14
## dropout* 3 2745 1159.42 688.04 1159.00 1156.84
## hs_diploma* 4 2745 1290.96 741.38 1294.00 1291.38
## some_college* 5 2745 1261.15 733.32 1269.00 1260.75
## four_year_degree* 6 2745 1210.68 721.27 1206.00 1209.53
## dropout_percent 7 2745 13.38 6.04 12.20 12.86
## hs_diploma_percent 8 2745 34.36 7.29 34.60 34.62
## some_college_percent 9 2745 30.56 5.01 30.50 30.54
## four_year_degree_percent 10 2745 21.70 9.54 19.20 20.40
## region* 11 2745 2.43 1.08 3.00 2.41
## region_num 12 2745 2.52 0.83 2.00 2.50
## resident_count 13 2745 3262.90 13086.08 728.00 1200.12
## avg_annual_pay 14 2745 41527.11 9278.22 39763.01 40322.16
## metro* 15 2745 257.38 276.68 152.00 225.20
## avg_home_value 16 2745 160108.92 111972.30 130851.33 141978.32
## mad min max range skew kurtosis
## state* 17.79 1.00 50.0 49.0 0.04 -1.27
## county* 566.35 1.00 1634.0 1633.0 0.04 -1.11
## dropout* 882.15 1.00 2356.0 2355.0 0.02 -1.21
## hs_diploma* 938.49 1.00 2583.0 2582.0 -0.01 -1.18
## some_college* 932.56 1.00 2538.0 2537.0 0.00 -1.19
## four_year_degree* 941.45 1.00 2438.0 2437.0 0.01 -1.24
## dropout_percent 5.78 1.40 48.5 47.1 0.90 1.05
## hs_diploma_percent 7.12 8.10 55.6 47.5 -0.32 -0.03
## some_college_percent 5.04 11.40 48.0 36.6 0.02 -0.03
## four_year_degree_percent 7.41 6.90 74.6 67.7 1.39 2.22
## region* 1.48 1.00 4.0 3.0 -0.24 -1.34
## region_num 1.48 1.00 4.0 3.0 0.27 -0.58
## resident_count 708.68 25.00 495918.0 495893.0 22.01 747.80
## avg_annual_pay 6241.75 0.01 134664.0 134664.0 2.61 14.38
## metro* 223.87 1.00 844.0 843.0 0.63 -1.05
## avg_home_value 62020.49 29574.67 1527482.2 1497907.5 4.18 31.44
## se
## state* 0.27
## county* 8.73
## dropout* 13.13
## hs_diploma* 14.15
## some_college* 14.00
## four_year_degree* 13.77
## dropout_percent 0.12
## hs_diploma_percent 0.14
## some_college_percent 0.10
## four_year_degree_percent 0.18
## region* 0.02
## region_num 0.02
## resident_count 249.77
## avg_annual_pay 177.09
## metro* 5.28
## avg_home_value 2137.17
We’ll start with a few histograms to get a feel for the normality of some of the key features.
Histograms
Histograms of high school graduation rates and four year college graduation rates.
#Percent with high school diploma
hsd_hist <- ggplot(master, aes(x=hs_diploma_percent)) + geom_histogram()
#Percent with four year degrees
fyd_hist <- ggplot(master, aes(x=four_year_degree_percent)) + geom_histogram()
#Show education histograms
ggarrange(hsd_hist, fyd_hist,
labels = c("hsd", "fyd"),
ncol = 2, nrow = 1,
common.legend = TRUE, legend = "bottom")
The high school diploma distribution appears normal, which makes sense given that a high school education is prevalent across the US and is publicly available. The right skew to a four year degree reflects that even though university’s are in every state, not everyone goes to (and completes) a four year degree for a number of reasons.
Avg annual pay
library(ggfortify)
ggplot(master, aes(x=avg_annual_pay)) +
geom_histogram(bins = 50) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
The histogram of average annual pay below is skewed right. This makes sense as all US states are in this dataset and there are some very high salaries in prestiguous counties in the US.
Home values plot
ggplot(master, aes(x=avg_home_value)) +
geom_histogram(bins = 50) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
Similar to wages and for similar reasons, home values in some counties are very high, giving the distribution a right skew.
Correlation Heat Maps
Now We’ll move on to creating heat maps to highlight the correlations between Education, Wages, Resident Count and Home Values. Our first heatmap will look at the four education categories of high school dropout, high school diploma, some college, and four year degree. There should be high correlation between each of these so we’ll select the one with highest correlation to home values to move forward with.
#Select the data for the education heat map
ehm <- master %>%
select(hs_dropout = dropout_percent, hs_diploma = hs_diploma_percent, some_college = some_college_percent, college_degree = four_year_degree_percent, avg_home_value)
#Calculate the correlation values
cormat <- round(cor(ehm),2)
melted_cormat <- melt(cormat)
#Create the Heatmap
ggheatmap <- ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)
The education heatmap correlation of 69% with a college degree and home values was somewhat expected. The -61% corelation for a high school diploma (so no higher education) with home values was initially surprising, but does make sense.
Next we’ll generate a heatmap using wages, resident count, and home values.
#Select the data for the heat map
chm <- master %>%
select(avg_annual_pay, resident_count, avg_home_value)
#Calculate the correlation values
cormat <- round(cor(chm),2)
melted_cormat <- melt(cormat)
#Create the Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)
The 52% correlation for wages is strong, but not as strong as expected. What’s surprising is that average pay is showing a lower correlation than education in the previous heatmap.
Scatter Charts
Home Values Compared to Education, Wages, and Resident Count Colored by Census Region
There are 2,745 counties across 50 states, but only 4 census regions. We’ll use regions to create several scatter charts of the relationship between education, wages, resident count, and home values.
scatterdf <- select(master, region, region_num, hs_diploma_percent, four_year_degree_percent, avg_annual_pay, resident_count, avg_home_value)
#Scatter chart of hs diploma percent
hsd_scatter <- ggplot(scatterdf, aes(avg_home_value, hs_diploma_percent, colour = region)) +
geom_point(alpha=0.5, position = 'jitter') +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
#Scatter chart of four year degree percent
fyd_scatter <- ggplot(scatterdf, aes(avg_home_value, four_year_degree_percent, colour = region)) +
geom_point(alpha=0.5, position = 'jitter') +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
#Scatter chart of average annual pay
aap_scatter <- ggplot(scatterdf, aes(avg_home_value, avg_annual_pay, colour = region)) +
geom_point(alpha=0.5, position = 'jitter') +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
#Scatter chart of resident count
rc_scatter <- ggplot(scatterdf, aes(avg_home_value, resident_count, colour = region)) +
geom_point(alpha=0.5, position = 'jitter') +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
#Show scatter charts
ggarrange(hsd_scatter, fyd_scatter, aap_scatter, rc_scatter,
labels = c("hsd", "fyd", "aap", "rc"),
ncol = 2, nrow = 2,
common.legend = TRUE, legend = "bottom")
Violin Plot
We’ll conclude this visuals section with comparisons using violin plots
Home Values
#Violin of home values
hv_violin <- ggplot(scatterdf, aes(region_num, avg_home_value, colour = region)) +
geom_violin(trim = FALSE)
#Display the violin plots
ggarrange(hv_violin,
labels = c("hv"),
ncol = 1, nrow = 1,
common.legend = TRUE, legend = "top")
Education
#Violin of high school diploma
hsd_violin <- ggplot(scatterdf, aes(region_num, hs_diploma_percent, colour = region)) +
geom_violin(trim = FALSE)
#Violin of four year degree
fyd_violin <- ggplot(scatterdf, aes(region_num, four_year_degree_percent, colour = region)) +
geom_violin(trim = FALSE)
#Display the violin plots
ggarrange(hsd_violin, fyd_violin,
labels = c("hsd", "fyd"),
ncol = 2, nrow = 1,
common.legend = TRUE, legend = "bottom")
Wages and Resident Count
#Violin of wages
aap_violin <- ggplot(scatterdf, aes(region_num, avg_annual_pay, colour = region)) +
geom_violin(trim = FALSE) +
scale_y_continuous(labels = comma)
#Violin of resident count
rc_violin <- ggplot(scatterdf, aes(region_num, resident_count, colour = region)) +
geom_violin(trim = FALSE) +
scale_y_continuous(labels = comma)
#Display the violin plots
ggarrange(aap_violin, rc_violin,
labels = c("aap", "rc"),
ncol = 2, nrow = 1,
common.legend = TRUE, legend = "bottom")
Summary
The visualizations have served their purpose of making us much more familiar with the data. In the final notebook Analysis and Conclusions, we’ll use the information learned here for further analysis; these visualizations make a good case for application of a regression model to this data.
Here is a summary of what’s been learned so far…
1. High school education is normally distributed.
2. Four year degrees, wages, and home values are all right skewed (this consistency in their skew could help a model’s ability to make predictions.)
3. Home values do appear to be correlated to education and wages.
4. In the violin plot we also saw that resident count in each county might be a feature worth including in a model to predict home values.
5. Overall, viewing this data by census region worked well. The visuals definitely highlight that the Midwest is the lowest cost area and that the West and Northeast are both higher cost and a source of most of the outlier data. This is probably what most of us would have guessed, but it’s always good to see data that validates what we’ve already perceived or been told is true.
This article from STHDA was a huge help in generating the heatmaps.
This article from STHDA helped me analye the data’s alignment to regression conditions.