knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(reshape2)
library(ggpubr)
library(ggfortify)
library(psych)

require(scales)

Introduction to Visualization

This is the 2nd of 3 notebooks that researches the question, “Can education and wages be used as predictors for home values?”. The first notebook completed data cleaning and transformation; this notebook will focus on visualization.

About the Data

## Data for this project is 2745 rows and 16 features
head(master,1)
##   state                       county dropout hs_diploma some_college
## 1    AK Fairbanks North Star Borough   3,381     13,785       24,520
##   four_year_degree dropout_percent hs_diploma_percent some_college_percent
## 1           20,152             5.5               22.3                 39.7
##   four_year_degree_percent region region_num resident_count avg_annual_pay
## 1                     32.6   West          4           2275       53107.01
##       metro avg_home_value
## 1 Fairbanks       244178.1

Here is a helpful description of the 16 columns in the master data file.

describe(master)
##                          vars    n      mean        sd    median   trimmed
## state*                      1 2745     25.80     14.12     24.00     25.86
## county*                     2 2745    820.89    457.23    823.00    819.14
## dropout*                    3 2745   1159.42    688.04   1159.00   1156.84
## hs_diploma*                 4 2745   1290.96    741.38   1294.00   1291.38
## some_college*               5 2745   1261.15    733.32   1269.00   1260.75
## four_year_degree*           6 2745   1210.68    721.27   1206.00   1209.53
## dropout_percent             7 2745     13.38      6.04     12.20     12.86
## hs_diploma_percent          8 2745     34.36      7.29     34.60     34.62
## some_college_percent        9 2745     30.56      5.01     30.50     30.54
## four_year_degree_percent   10 2745     21.70      9.54     19.20     20.40
## region*                    11 2745      2.43      1.08      3.00      2.41
## region_num                 12 2745      2.52      0.83      2.00      2.50
## resident_count             13 2745   3262.90  13086.08    728.00   1200.12
## avg_annual_pay             14 2745  41527.11   9278.22  39763.01  40322.16
## metro*                     15 2745    257.38    276.68    152.00    225.20
## avg_home_value             16 2745 160108.92 111972.30 130851.33 141978.32
##                               mad      min       max     range  skew kurtosis
## state*                      17.79     1.00      50.0      49.0  0.04    -1.27
## county*                    566.35     1.00    1634.0    1633.0  0.04    -1.11
## dropout*                   882.15     1.00    2356.0    2355.0  0.02    -1.21
## hs_diploma*                938.49     1.00    2583.0    2582.0 -0.01    -1.18
## some_college*              932.56     1.00    2538.0    2537.0  0.00    -1.19
## four_year_degree*          941.45     1.00    2438.0    2437.0  0.01    -1.24
## dropout_percent              5.78     1.40      48.5      47.1  0.90     1.05
## hs_diploma_percent           7.12     8.10      55.6      47.5 -0.32    -0.03
## some_college_percent         5.04    11.40      48.0      36.6  0.02    -0.03
## four_year_degree_percent     7.41     6.90      74.6      67.7  1.39     2.22
## region*                      1.48     1.00       4.0       3.0 -0.24    -1.34
## region_num                   1.48     1.00       4.0       3.0  0.27    -0.58
## resident_count             708.68    25.00  495918.0  495893.0 22.01   747.80
## avg_annual_pay            6241.75     0.01  134664.0  134664.0  2.61    14.38
## metro*                     223.87     1.00     844.0     843.0  0.63    -1.05
## avg_home_value           62020.49 29574.67 1527482.2 1497907.5  4.18    31.44
##                               se
## state*                      0.27
## county*                     8.73
## dropout*                   13.13
## hs_diploma*                14.15
## some_college*              14.00
## four_year_degree*          13.77
## dropout_percent             0.12
## hs_diploma_percent          0.14
## some_college_percent        0.10
## four_year_degree_percent    0.18
## region*                     0.02
## region_num                  0.02
## resident_count            249.77
## avg_annual_pay            177.09
## metro*                      5.28
## avg_home_value           2137.17

Visualization

We’ll start with a few histograms to get a feel for the normality of some of the key features.

Histograms
Histograms of high school graduation rates and four year college graduation rates.

#Percent with high school diploma
hsd_hist <- ggplot(master, aes(x=hs_diploma_percent)) + geom_histogram()

#Percent with four year degrees
fyd_hist <- ggplot(master, aes(x=four_year_degree_percent)) + geom_histogram()

#Show education histograms
ggarrange(hsd_hist, fyd_hist, 
                    labels = c("hsd", "fyd"),
                    ncol = 2, nrow = 1,
                    common.legend = TRUE, legend = "bottom")

The high school diploma distribution appears normal, which makes sense given that a high school education is prevalent across the US and is publicly available. The right skew to a four year degree reflects that even though university’s are in every state, not everyone goes to (and completes) a four year degree for a number of reasons.

Avg annual pay

library(ggfortify)
ggplot(master, aes(x=avg_annual_pay)) +
  geom_histogram(bins = 50) + 
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma)

The histogram of average annual pay below is skewed right. This makes sense as all US states are in this dataset and there are some very high salaries in prestiguous counties in the US.

Home values plot

ggplot(master, aes(x=avg_home_value)) + 
  geom_histogram(bins = 50) + 
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma)

Similar to wages and for similar reasons, home values in some counties are very high, giving the distribution a right skew.

Correlation Heat Maps
Now We’ll move on to creating heat maps to highlight the correlations between Education, Wages, Resident Count and Home Values. Our first heatmap will look at the four education categories of high school dropout, high school diploma, some college, and four year degree. There should be high correlation between each of these so we’ll select the one with highest correlation to home values to move forward with.

#Select the data for the education heat map
ehm <- master %>%
  select(hs_dropout = dropout_percent, hs_diploma = hs_diploma_percent, some_college = some_college_percent,  college_degree = four_year_degree_percent, avg_home_value)

#Calculate the correlation values
cormat <- round(cor(ehm),2)

melted_cormat <- melt(cormat)

#Create the Heatmap
ggheatmap <- ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()

reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}

ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)

The education heatmap correlation of 69% with a college degree and home values was somewhat expected. The -61% corelation for a high school diploma (so no higher education) with home values was initially surprising, but does make sense.

Next we’ll generate a heatmap using wages, resident count, and home values.

#Select the data for the heat map
chm <- master %>%
  select(avg_annual_pay, resident_count, avg_home_value)

#Calculate the correlation values
cormat <- round(cor(chm),2)

melted_cormat <- melt(cormat)

#Create the Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()

reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}

ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)

The 52% correlation for wages is strong, but not as strong as expected. What’s surprising is that average pay is showing a lower correlation than education in the previous heatmap.

Scatter Charts
Home Values Compared to Education, Wages, and Resident Count Colored by Census Region
There are 2,745 counties across 50 states, but only 4 census regions. We’ll use regions to create several scatter charts of the relationship between education, wages, resident count, and home values.

scatterdf <- select(master, region, region_num, hs_diploma_percent, four_year_degree_percent, avg_annual_pay, resident_count, avg_home_value)

#Scatter chart of hs diploma percent
hsd_scatter <- ggplot(scatterdf, aes(avg_home_value, hs_diploma_percent, colour = region)) + 
  geom_point(alpha=0.5, position = 'jitter') +
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma)

#Scatter chart of four year degree percent
fyd_scatter <- ggplot(scatterdf, aes(avg_home_value, four_year_degree_percent, colour = region)) + 
  geom_point(alpha=0.5, position = 'jitter') +
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma)

#Scatter chart of average annual pay
aap_scatter <- ggplot(scatterdf, aes(avg_home_value, avg_annual_pay, colour = region)) + 
  geom_point(alpha=0.5, position = 'jitter') +
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma)

#Scatter chart of resident count
rc_scatter <- ggplot(scatterdf, aes(avg_home_value, resident_count, colour = region)) + 
  geom_point(alpha=0.5, position = 'jitter') +
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma)

#Show scatter charts
ggarrange(hsd_scatter, fyd_scatter, aap_scatter, rc_scatter, 
                    labels = c("hsd", "fyd", "aap", "rc"),
                    ncol = 2, nrow = 2,
                    common.legend = TRUE, legend = "bottom")

Violin Plot
We’ll conclude this visuals section with comparisons using violin plots
Home Values

#Violin of home values
hv_violin <- ggplot(scatterdf, aes(region_num, avg_home_value, colour = region)) + 
  geom_violin(trim = FALSE) 

#Display the violin plots
ggarrange(hv_violin,
          labels = c("hv"),
          ncol = 1, nrow = 1,
          common.legend = TRUE, legend = "top")

Education

#Violin of high school diploma
hsd_violin <- ggplot(scatterdf, aes(region_num, hs_diploma_percent, colour = region)) + 
  geom_violin(trim = FALSE)

#Violin of four year degree
fyd_violin <- ggplot(scatterdf, aes(region_num, four_year_degree_percent, colour = region)) +
  geom_violin(trim = FALSE)

#Display the violin plots
ggarrange(hsd_violin, fyd_violin,
          labels = c("hsd", "fyd"),
          ncol = 2, nrow = 1,
          common.legend = TRUE, legend = "bottom")

Wages and Resident Count

#Violin of wages
aap_violin <- ggplot(scatterdf, aes(region_num, avg_annual_pay, colour = region)) + 
  geom_violin(trim = FALSE) + 
  scale_y_continuous(labels = comma)

#Violin of resident count
rc_violin <- ggplot(scatterdf, aes(region_num, resident_count, colour = region)) + 
  geom_violin(trim = FALSE) + 
  scale_y_continuous(labels = comma)

#Display the violin plots
ggarrange(aap_violin, rc_violin,
          labels = c("aap", "rc"),
          ncol = 2, nrow = 1,
          common.legend = TRUE, legend = "bottom")

Summary
The visualizations have served their purpose of making us much more familiar with the data. In the final notebook Analysis and Conclusions, we’ll use the information learned here for further analysis; these visualizations make a good case for application of a regression model to this data.

Here is a summary of what’s been learned so far…
1. High school education is normally distributed.

2. Four year degrees, wages, and home values are all right skewed (this consistency in their skew could help a model’s ability to make predictions.)

3. Home values do appear to be correlated to education and wages.

4. In the violin plot we also saw that resident count in each county might be a feature worth including in a model to predict home values.

5. Overall, viewing this data by census region worked well. The visuals definitely highlight that the Midwest is the lowest cost area and that the West and Northeast are both higher cost and a source of most of the outlier data. This is probably what most of us would have guessed, but it’s always good to see data that validates what we’ve already perceived or been told is true.

References

This article from STHDA was a huge help in generating the heatmaps.

This article from STHDA helped me analye the data’s alignment to regression conditions.