To start, install and load the following packages:
geojsonio: reads shapefiles
sp: works with shapefiles
sf: works with shapefiles - simple features format
mclust: cluster analyses
tmap: theme maps
ggplot2: helps with graphing
ggthemes: extra themes for ggplot2 graphics
dplyr: the classic data wrangler
pander: the classic for formatting RMD tables
tidycensus: retrieves Census data
cartogram: for spatial maps
maptools: for spatial object manipulation
corrplot: the classic for correlation plots
moments: to calculate skewness
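The list above can be loaded in one step. This is a minimal sketch, not part of the course material: the names `pkgs`, `installed`, and `not.installed` are my own, and packages that are missing are only reported, not installed. As far as I know, maptools has been archived on CRAN, so it may show up as not installed on newer setups.

```r
# Packages named in the list above (`pkgs` is an illustrative name)
pkgs <- c("geojsonio", "sp", "sf", "mclust", "tmap", "ggplot2", "ggthemes",
          "dplyr", "pander", "tidycensus", "cartogram", "maptools",
          "corrplot", "moments")

# Check which packages are installed without attaching them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not.installed <- pkgs[!installed]

if (length(not.installed) > 0) {
  message("Not installed: ", paste(not.installed, collapse = ", "))
}

# Attach only the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Anything reported as not installed can then be added with `install.packages()` before continuing.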
One of the more interesting techniques in this class has been the use of correlation plots. The examples below use the census data from this course:
URL <- "https://github.com/DS4PS/cpp-529-master/raw/master/data/ltdb_std_2010_sample.rds"
census.dat <- readRDS( gzcon( url( URL ) ) )
Store the three variables of interest under 2010-style names to keep the code below tidy:
hinc10 <- census.dat$hinc12
dmulti10 <- census.dat$dmulti12
ppov10 <- census.dat$ppov12
Reduce skewness with a cube-root transformation, starting with median household income:
hinc10.rt <- (hinc10)^(1/3)
skewness(hinc10.rt)
## [1] 0.4270911
hist(hinc10.rt, main = "Histogram of Median Household Income in 2010 (Cube Root)")
Housing Units
dmulti10.rt <- (dmulti10)^(1/3)
skewness(dmulti10.rt)
## [1] -1.746615
hist(dmulti10.rt, main = "Histogram of Housing Units in 2010 (Cube Root)")
Percent in Poverty
ppov10.rt <- (ppov10)^(1/3)
skewness(ppov10.rt)
## [1] 0.01563461
hist(ppov10.rt, main = "Distribution of Percent in Poverty in 2010 (Cube Root)")
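To see how the choice of transformation changes skewness, here is a small sketch on simulated right-skewed data (not the census data). The helper `skew.manual` is hypothetical and written out so the example runs without the moments package; it uses the moment-based definition (third central moment divided by the variance to the power 3/2), which I understand to be the one `moments::skewness()` uses.

```r
# Moment-based skewness: third central moment over variance^(3/2).
# `skew.manual` is a hypothetical helper for this sketch only.
skew.manual <- function(x) {
  x <- x[is.finite(x)]            # drop NA / Inf before computing moments
  m2 <- mean((x - mean(x))^2)     # second central moment
  m3 <- mean((x - mean(x))^3)     # third central moment
  m3 / m2^(3/2)
}

set.seed(529)
# Simulated right-skewed values standing in for an income-like variable
x <- rlnorm(5000, meanlog = 10, sdlog = 1)

# Skewness under several candidate transformations
skew.by.transform <- c(
  raw       = skew.manual(x),
  sqrt      = skew.manual(sqrt(x)),
  cube.root = skew.manual(x^(1/3)),
  log       = skew.manual(log(x))
)
round(skew.by.transform, 2)
```

Values closer to zero indicate a more symmetric distribution. A transformation that is too strong for a given variable can push the skewness past zero into negative territory, so it is worth checking each variable separately.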
A correlation plot compares the three variables at once (here on their original, untransformed scale):
d <- as.data.frame(census.dat)
df.dim <- dplyr::select(d, hinc12, dmulti12, ppov12)
corrplot(cor(df.dim, use = 'complete.obs'),
         order = 'hclust',
         tl.col = 'black',
         tl.cex = 0.75)
This type of plot is visually intuitive: the size and color intensity of each circle reflect the strength of the correlation, and the color shows its direction.
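The circles are drawn from an ordinary correlation matrix, which can also be printed directly. The sketch below uses simulated stand-ins for the three variables (the names `income`, `poverty`, `multi`, and `sim` are made up for illustration) and the same `use = "complete.obs"` setting as the call above.

```r
set.seed(529)
n <- 500

# Simulated stand-ins for the three census variables (not real data)
income  <- rnorm(n, mean = 50, sd = 10)
poverty <- 40 - 0.5 * income + rnorm(n, sd = 4)  # built to fall as income rises
multi   <- rnorm(n, mean = 20, sd = 5)           # built to be unrelated to both
sim <- data.frame(income, poverty, multi)

# A few missing values, to show the effect of use = "complete.obs"
sim$poverty[1:10] <- NA

# Rows containing any NA are dropped before the correlations are computed
cor.mat <- cor(sim, use = "complete.obs")
round(cor.mat, 2)
```

Each off-diagonal entry is the Pearson correlation for one pair of variables, which is the number a circle in the plot represents.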
Scatterplots of each pair of variables complement the correlation plot. They show the shape of each relationship, and a fitted lowess line makes it easier to judge whether the relationship is roughly linear.
Household Income and Percent Poverty
# custom scatterplot with a lowess line
jplot <- function( x1, x2, lab1="", lab2="", draw.line=TRUE, ... )
{
  # small, semi-transparent gray points keep dense regions readable
  plot( x1, x2,
        pch=19,
        col=gray(0.6, alpha = 0.2),
        cex=0.5,
        bty = "n",
        xlab=lab1,
        ylab=lab2, cex.lab=1.2,
        ... )
  if( draw.line ){
    # lowess() cannot handle missing values, so keep only finite pairs
    ok <- is.finite(x1) & is.finite(x2)
    lines( lowess(x2[ok]~x1[ok]), col="red", lwd=3 )
  }
}
jplot( hinc10.rt, ppov10.rt, lab1="Household Income in 2010 (Cube Root)", lab2="Percent Poverty in 2010 (Cube Root)" )
The plot shows a moderately negative, roughly linear relationship between the two: as household income rises, the percent in poverty falls.
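To put a number on a relationship like this one, the Pearson correlation and the slope of a simple linear fit are the usual starting points. The sketch below uses simulated values rather than the census variables; `income.rt` and `poverty.rt` are illustrative names, and the coefficients used to generate them are arbitrary.

```r
set.seed(529)
n <- 1000

# Simulated stand-ins for two transformed variables (not the census data)
income.rt  <- rnorm(n, mean = 35, sd = 4)
poverty.rt <- 6 - 0.1 * income.rt + rnorm(n, sd = 0.4)

# Pearson correlation: strength and direction of the linear association
r <- cor(income.rt, poverty.rt, use = "complete.obs")

# Slope of a simple linear fit: change in y per one-unit change in x
fit <- lm(poverty.rt ~ income.rt)
slope <- unname(coef(fit)["income.rt"])

c(r = round(r, 2), slope = round(slope, 3))
```

With the real data, the same two calls can be applied to `hinc10.rt` and `ppov10.rt` to check whether the visual impression matches the computed correlation.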
Household Income and Housing Units
jplot( hinc10.rt, dmulti10.rt, lab1="Household Income in 2010 (Cube Root)", lab2="Housing Units in 2010 (Cube Root)" )
There is no clear linear relationship here.
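The absence of a linear trend does not by itself mean two variables are unrelated. As a hedged illustration on simulated data (all names and shapes here are made up), the sketch below shows a U-shaped relationship that Pearson correlation misses almost entirely, and a monotone but curved relationship where the rank-based Spearman correlation is higher than Pearson.

```r
set.seed(529)
x <- runif(1000, min = -3, max = 3)

# Strong dependence, but U-shaped: the linear trend is close to zero
y.curve <- x^2 + rnorm(1000, sd = 0.5)

# Monotone but curved: y always tends to rise with x, just not in a line
y.mono <- exp(x) + rnorm(1000, sd = 0.2)

pearson.curve <- cor(x, y.curve)
pearson.mono  <- cor(x, y.mono)
spearman.mono <- cor(x, y.mono, method = "spearman")

round(c(pearson.curve = pearson.curve,
        pearson.mono  = pearson.mono,
        spearman.mono = spearman.mono), 2)
```

This is why the lowess line in the scatterplot is useful alongside the correlation plot: it can reveal curvature that a single correlation coefficient hides.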
Housing Units and Percent Poverty
jplot( dmulti10.rt, ppov10.rt, lab1="Housing Units in 2010 (Cube Root)", lab2="Percent Poverty in 2010 (Cube Root)" )