To start, it would be smart to have these packages ready:

geojsonio: reads shapefiles

sp: works with shapefiles

sf: works with shapefiles - simple features format

mclust: cluster analyses

tmap: thematic maps

ggplot2: helps with graphing

ggthemes: extra themes for ggplot2 plots

dplyr: the classic data wrangler

pander: the classic for formatting RMD tables

tidycensus: downloads Census Bureau data

cartogram: builds cartograms (maps with areas scaled by a variable)

maptools: for spatial object manipulation

corrplot: the classic for correlation plots

moments: to calculate skewness
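
A small sketch for checking which of these packages are installed before loading them. The names `pkgs`, `installed`, and `not_installed` are my own, and the check only reports what is missing rather than installing anything.

```r
pkgs <- c("geojsonio", "sp", "sf", "mclust", "tmap", "ggplot2", "ggthemes",
          "dplyr", "pander", "tidycensus", "cartogram", "maptools",
          "corrplot", "moments")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach whatever is available
for (p in pkgs[installed]) library(p, character.only = TRUE)
```

Only dplyr, corrplot, and moments are called directly in the code that follows.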

One technique from this class that I found interesting is the correlation plot. To demonstrate it, load the census data used throughout the course:

URL <- "https://github.com/DS4PS/cpp-529-master/raw/master/data/ltdb_std_2010_sample.rds"
census.dat <- readRDS( gzcon( url( URL ) ) )

Copy three variables into standalone vectors. The columns carry a 12 suffix in the file; the 10 suffix used here matches the 2010 labels in the plots:

hinc10 <- census.dat$hinc12
dmulti10 <- census.dat$dmulti12
ppov10 <- census.dat$ppov12

Apply a cube-root transformation to reduce skewness:

Median Household Income

hinc10.rt <- (hinc10)^(1/3)
skewness(hinc10.rt)
## [1] 0.4270911
hist(hinc10.rt, main = "Histogram of Median Household Income in 2010 (Cube Root)")

Housing Units

dmulti10.rt <- (dmulti10)^(1/3)
skewness(dmulti10.rt)
## [1] -1.746615
hist(dmulti10.rt, main = "Histogram of Housing Units in 2010 (Cube Root)")

Percent in Poverty

ppov10.rt <- (ppov10)^(1/3)
skewness(ppov10.rt)
## [1] 0.01563461
hist(ppov10.rt, main = "Distribution of Percent in Poverty in 2010 (Cube Root)")
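
To see why a cube root helps, here is a minimal sketch on simulated data rather than the census file. The `skew()` helper is hypothetical and written in base R; it uses the moment-based definition (third central moment divided by the second central moment raised to 1.5), which to my understanding is the same quantity `moments::skewness()` reports. The log-normal "income" values are an assumption made purely for illustration.

```r
# Moment-based skewness, written out in base R (hypothetical helper)
skew <- function(x) {
  x  <- x[is.finite(x)]            # drop NA / NaN / Inf
  m2 <- mean((x - mean(x))^2)      # second central moment
  m3 <- mean((x - mean(x))^3)      # third central moment
  m3 / m2^1.5
}

set.seed(529)
x <- rlnorm(5000, meanlog = 10, sdlog = 0.8)  # simulated right-skewed values

skew_raw <- skew(x)         # strongly positive
skew_rt  <- skew(x^(1/3))   # much closer to zero

round(c(raw = skew_raw, cube_root = skew_rt), 2)
```

A value near zero means a roughly symmetric distribution. A strongly negative value, as with the housing units variable above, suggests the transformation did not make that variable symmetric.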

A correlation plot compares the three variables (untransformed here) at a glance:

d <- as.data.frame(census.dat)
df.dim <- dplyr::select(d, hinc12, dmulti12, ppov12)
corrplot(cor(df.dim, use = 'complete.obs'), 
         order = 'hclust',
         tl.col = 'black',
         tl.cex = 0.75)

This type of plot is often the most visually intuitive: the size and color of each circle show how strong the correlation is and whether it is positive or negative.
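
The circles are drawn from an ordinary correlation matrix, so it can help to print the numbers as well. The sketch below uses a simulated data frame (`toy`, with made-up columns `income`, `poverty`, and `units`) instead of the census data, and shows how `use = "complete.obs"` drops rows that contain missing values before the correlations are computed.

```r
set.seed(1)
n <- 200
income <- rnorm(n)

# Simulated stand-ins: poverty is built to move against income,
# units is unrelated noise
toy <- data.frame(
  income  = income,
  poverty = -0.7 * income + rnorm(n, sd = 0.5),
  units   = rnorm(n)
)
toy$units[c(3, 17)] <- NA   # a couple of missing values

# Rows with any NA are dropped before computing the matrix
r <- cor(toy, use = "complete.obs")
round(r, 2)
```

The matrix is symmetric with ones on the diagonal, which is why the plot mirrors itself across the diagonal.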

Scatterplots offer a more detailed view.

Plotting each pair of variables shows every observation, so you can judge the shape and strength of a relationship directly instead of relying on a single coefficient.

Household Income and Percent Poverty

# custom scatterplot with an optional lowess trend line
jplot <- function( x1, x2, lab1="", lab2="", draw.line=TRUE, ... )
{
    # semi-transparent gray points so dense regions appear darker
    plot( x1, x2,
          pch=19, 
          col=gray(0.6, alpha = 0.2), 
          cex=0.5,  
          bty = "n",
          xlab=lab1, 
          ylab=lab2, cex.lab=1.2,
          ... )

    if( draw.line ){ 
        # keep only pairs where both values are finite before smoothing
        ok <- is.finite(x1) & is.finite(x2)
        lines( lowess(x2[ok]~x1[ok]), col="red", lwd=3 )
    }
}
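
The smoothing step inside jplot can be looked at on its own. This is a minimal sketch on simulated vectors (`x` and `y` are made up, with one missing value added on purpose): it filters to finite pairs the same way the function does, then inspects what `lowess()` returns.

```r
set.seed(2)
x <- runif(300, 0, 10)
y <- 5 - 0.4 * x + rnorm(300)   # downward trend plus noise
x[5] <- NA                      # one missing value

# Same filter as in jplot: keep pairs where both values are finite
ok  <- is.finite(x) & is.finite(y)
fit <- lowess(x[ok], y[ok])

# A list with x (sorted ascending) and y (the smoothed values)
str(fit)
```

Because the result is already sorted by x, passing it straight to `lines()` draws a clean curve from left to right.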

jplot( hinc10.rt, ppov10.rt, lab1="Household Income in 2010 (Cube Root)", lab2="Percent Poverty in 2010 (Cube Root)" )

The plot shows a somewhat negative, roughly linear relationship: tracts with higher household income tend to have lower poverty rates.

Household Income and Housing Units

jplot( hinc10.rt, dmulti10.rt, lab1="Household Income in 2010 (Cube Root)", lab2="Housing Units in 2010 (Cube Root)" )

Unlike the previous pair, these two variables show no clear linear relationship.
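
A lack of linear relationship does not always mean a lack of relationship, which is one reason to look at the scatterplot and trend line and not only at a coefficient. The sketch below uses simulated data (all variable names are made up) to compare the Pearson correlation for a straight-line pattern and for a U-shaped one.

```r
set.seed(3)
x <- runif(500, 0, 10)

y_linear <- -0.5 * x + rnorm(500)      # straight-line pattern
y_curved <- (x - 5)^2 + rnorm(500)     # U-shaped pattern

r_linear <- cor(x, y_linear)   # clearly negative
r_curved <- cor(x, y_curved)   # near zero despite a strong pattern

round(c(linear = r_linear, curved = r_curved), 2)
```

In the second case the coefficient is close to zero even though y depends heavily on x, so a lowess line on the scatterplot would reveal structure that the correlation plot alone would hide.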

Housing Units and Percent Poverty

jplot( dmulti10.rt, ppov10.rt, lab1="Housing Units in 2010 (Cube Root)", lab2="Percent Poverty in 2010 (Cube Root)" )