Introduction

The focus for this code-through assignment is analyzing and customizing the cluster charts of a characteristic of a neighborhood tract. The customized charts below show the density of the percent unemployed from the inhabitants of a neighborhood in Phoenix. This allows us to see the variations of each group and allows the different groups to be labeled according to those variations.



Installing the Packages

The geojsonio, sp, sf and mclust packages allow for the data analysis of the Census Data for Phoenix to take place. The *ggplot, ggpubr, RColorBrewer, and ggthemes packages allow for the customization of plots. The dplyr and the pander** packages assist in the code commands.

library(geojsonio)    # read shapefiles
library(sp)           # work with shapefiles
library(sf)           # work with shapefiles - simple features format
library(mclust)        # cluster analysis 
library(ggplot2)       # graphing 
library(ggpubr)        #linetypes
library(RColorBrewer)  #color palettes
library(ggthemes)     #preset themes
library(dplyr)
library(pander)



Data Analysis

The data used in this code-through assignment is from the 2012 American Communities Survey made available through the Diversity and Disparities Project.

The Phoenix data is extracted from the shapefile, and the variables within the data set such as percent white, percent property owners, and percent unemployed are converted to show their z-scores for comparison. We will be concentrating on the percent unemployed variable for this code-through. Through the z-scores, clusters are able to be formed from the data.

github.url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/phx_dorling.geojson"
phx <- geojson_read( x=github.url,  what="sp" )
d1 <- phx@data
head( d1[,1:6] ) %>% pander()
GEOID2 GEOID STATEFP COUNTYFP TRACTCE AFFGEOID
4.013e+09 04013010101 04 013 010101 1400000US04013010101
4.013e+09 04013010102 04 013 010102 1400000US04013010102
4.013e+09 04013030401 04 013 030401 1400000US04013030401
4.013e+09 04013030402 04 013 030402 1400000US04013030402
4.013e+09 04013040502 04 013 040502 1400000US04013040502
4.013e+09 04013040506 04 013 040506 1400000US04013040506
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

d2 <- select( d1, keep.these )
d3 <- apply( d2, 2, scale )

head( d3[,1:6] ) %>% pander()
pnhwht12 pnhblk12 phisp12 pntv12 pfb12 polang12
1.382 -0.8508 -1.038 -0.2962 -0.6524 -0.9802
1.358 -0.8508 -1.125 -0.22 -0.1142 -0.6567
1.533 -0.8508 -1.153 -0.2962 -0.7297 -1.201
1.24 -0.8508 -1.058 -0.1715 -0.8447 -0.9941
1.054 -0.6265 -0.7904 -0.1733 -0.7813 -0.8673
1.535 -0.8472 -1.168 -0.2962 -0.8854 -1.101

The mclust command allows for the data to be classified and clustered into groups.

# library( mclust )
set.seed( 1234 )
fit <- Mclust( d3 )
phx$cluster <- as.factor( fit$classification )
summary( fit )
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVE (ellipsoidal, equal orientation) model with 8 components: 
## 
##  log-likelihood   n  df       BIC       ICL
##       -16327.92 913 922 -38940.87 -39010.66
## 
## Clustering table:
##   1   2   3   4   5   6   7   8 
##  76  33 165  57  95 154 148 185
d2$cluster <- d2$cluster <- as.factor( paste0("GROUP-",fit$classification) )



Cluster Plotting Customization

Customize cluster plots by modifying colors, linetypes, backgrounds, fill opacity, titles and labels. Customizing cluster plots will assist in providing better data visualizations and understanding of the data. First, we will see a list of possible colors and linetypes to use through the colors and show_line_types commands.

#colors() will generate a full list of colors

head(colors())
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"
show_line_types()



Line Plotting

One of the options in customizing cluster charts is line plotting. In these line plots, you can see the eight different clusters that show the percent unemployed. Through this basic line plotting you can see the difference between each cluster. For example, Group #2 shows a low density of those who are unemployed, but a higher percentage. In comparison, Group #6 has a very high density, but low percentage of those unemployed.

ggplot(d2, aes(x=punemp12)) +
  geom_density()+
  facet_wrap( ~ cluster, nrow=2 )

Use linewidth to adjust the thickness of the lines in the plot. The color of the line can also be adjusted by assigning a color from the color list above. These arguments would be included in the geom_density command.


ggplot(d2, aes(x=punemp12)) +
  geom_density(linewidth = 1.25, color = "hotpink4")+
  facet_wrap( ~ cluster, nrow=2 )



Fill Plotting

After adjusting the line plot, a fill can be added so the plot will now show shaded areas for the data.

ggplot( d2, aes( x=punemp12 ) ) +
        geom_density(linewidth = 1.25, color = "hotpink4", fill="hotpink" ) +
        facet_wrap( ~ cluster, nrow=2 )

Adjust the opacity of the fill by adding in the alpha argument. Lower values will show higher transparency whereas higher alpha values will show less transparency.

ggplot( d2, aes( x=punemp12 ) ) +
        geom_density( alpha = 0.25, linewidth = 1.25, color = "hotpink4", fill="hotpink" ) +
        facet_wrap( ~ cluster, nrow=2 )

A color palette can also be assigned to the cluster charts. Using the scale_fill_brewer command, the charts below have been assigned a blue palette.

  ggplot( d2, aes(x=punemp12, group=cluster, fill=cluster)) +
  geom_density( linewidth = .75, alpha = 0.75)+
  scale_fill_brewer(type = "seq", palette = 1)+
  facet_wrap( ~ cluster, nrow=2 )



Chart Background Themes

Background themes can be customized to show grid line variations and colors. Adding contrast and panel gridlines may help in comparing the data. As you can see below, Groups #3, #4, and #7 are above the 0.05 density line whereas Groups #1, #5, #6, and #8 all surpass the 0.10 density line. The element_rect will allow you to fill a color for the panel.background. The panel.grid.major can be adjusted to show the primary gridlines, and the panel.grid.minor will show the secondary gridlines. There are also many preset background themes such as theme_minimal, theme_classic, and theme_dark that can be used for these charts. The text labels of the groups can be adjusted in size by indicating a value to the element_text command for strip.text.x.

ggplot( d2, aes( x=punemp12 ) ) +
        geom_density( alpha = 0.85, linewidth = 1.25, color = "hotpink4", fill="hotpink" )+
        facet_wrap( ~ cluster, nrow=2 )+
  theme(panel.background = element_rect(fill = "turquoise"),
        panel.grid.major = element_line(size = .75, linetype = 'dashed',
                                colour = "white"),
         panel.grid.minor = element_line(size = 0.05, linetype = 'solid',
                                colour = "white"),
        strip.text.x = element_text(size = 25))



Adding Titles and a Legend

Use the xlab, ylab, and labs commands to show titles for the x-axis, y-axis, and main title. Add a legend for the groups by including legend_position in the theme command. Indicate the placement for the legend. In this case, the legend is on the “right”.

  ggplot( d2, aes(x=punemp12, group=cluster, fill=cluster))+
  geom_density( linewidth = .75, alpha = 0.75)+
    theme(
      legend.position="right",
      panel.spacing = unit(0.5, "lines"),
      strip.text.x = element_text(size = 25)
    )+
  xlab( "Percent Unemployed" )+ 
  ylab("Density") + 
  facet_wrap(~cluster, nrow=2 )+
  labs(title="Proportion Unemployed by Group")+
  theme_classic()



Group Comparison & Labeling

Stacked Chart of Groups

Some of the individual cluster charts may be difficult to compare to one another because they seem close in values. Therefore, the charts may be stacked by getting rid of the facet_wrap command line of the code. It is helpful to view both the stacked chart and the individual cluster charts for better comparison.

  ggplot( d2, aes(x=punemp12, group=cluster, fill=cluster))+
  geom_density( linewidth = .75, alpha = 0.5)+
    theme(
      legend.position="right")+
  xlab( "Percent Unemployed" )+
  ylab("Density")+
  labs(title="Proportion Unemployed by Group (Stacked)")+
  theme_classic()



Labeling Clusters

Labeling clusters will help to distinguish each group and provide a unique identity to each cluster. Using the for(i in 1) command will allow you to label each cluster through text. Combine the text labels you came up with for each cluster by using the c() command for the legend.

for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 1) {
    d2$cluster[i] = "Hmm..."
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 2) {
    d2$cluster[i] = "Okay"
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 3) {
    d2$cluster[i] = "Meh"
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 1) {
    d2$cluster[i] = "Getting there"
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 1) {
    d2$cluster[i] = "Uhh"
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 1) {
    d2$cluster[i] = "Probably not"
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 1) {
    d2$cluster[i] = "A little better"
  }
}
for (i in 1:length(d2$cluster)) {
  if (d2$cluster[i] == 1) {
    d2$cluster[i] = "Not the best"
  }
}
d2$cluster <- c("Hmm...", "Okay", "Meh", "Getting there", "Uhh", "Probably not", "A little better", "Not the best")[d2$cluster]

Customize the title of the legend by using the guides command. Here you can see the new legend with the customized labels for each cluster and the new legend title Groups.

  ggplot( d2, aes(x=punemp12, group=cluster, fill=cluster)) +
  geom_density( linewidth = .75, alpha = 0.5)+
    theme(
      legend.position="right")+
  xlab( "Percent Unemployed" )+
  ylab("Density")+
  labs(title="Proportion Unemployed by Group (Stacked)")+
  theme_classic()+
  guides(fill=guide_legend("Groups"))

The labels will also be applied to each chart when adding the facet_wrap command back into the code.

 ggplot( d2, aes(x=punemp12, group=cluster, fill=cluster))+
  geom_density( linewidth = .75, alpha = 0.75)+
    theme(
      legend.position="right",
      panel.spacing = unit(0.5, "lines"),
      strip.text.x = element_text(size = 25)
    ) +
  xlab( "Percent Unemployed" )+ 
  ylab("Density") + 
  facet_wrap(~cluster, nrow=2 )+
  labs(title="Percent Unemployed by Group")+
  theme_classic()+
  guides(fill=guide_legend("Groups"))

References

Wickham, H., Chang, W., Henry, L., Pederson, T., Takahashi, K., Wilke, C., Woo, K., Yutani, H., Dunnington, D.. (Retrieved 11 November 2023). “Smoothed Density Estimates”. https://ggplot2.tidyverse.org/reference/geom_density.html

Stack Overflow. (Retrieved 15 November 2023). “Naming clusters in R”.. https://stackoverflow.com/questions/69275964/naming-clusters-in-r

Holtz, Yan. (Retrieved 11 November 2023). “The Issue with Stacking”.From Data to Viz. https://www.data-to-viz.com/caveat/stacking.html

Stack Overflow. (Retrieved 15 November 2023). “How to change legend title in ggplot”. https://stackoverflow.com/questions/14622421/how-to-change-legend-title-in-ggplot

Data Novia. (Retrieved 11 November 2023). “Linetypes in R The Ultimate Guide for R Base Plot and GGPLOT https://www.datanovia.com/en/blog/line-types-in-r-the-ultimate-guide-for-r-base-plot-and-ggplot/#:~:text=In%20R%20base%20plot%20functions,the%20size%20of%20lines%2C%20respectively