Draft

The 2020 demographics were projections at the time of the original work (2017). I hope to get actuals and redo this. The data sources were very messy, and a different set of work was peformed to combine and tidy the data that are input to this visualization process.

Visualize the change in demographics through time in each North Carolina Wake County. Data includes population counts by district, race, ethnicity, year, and age range. The age ranges go from birth to greater than 85. The age breaks are in five year increments. Race and ethnicity break down the same population. Races are black, white, and other. Ethnicities are Hispanic and non-Hispanic.

Source CSV can be downloaded at https://docs.google.com/spreadsheets/d/1PguhEERjgX901MRtJ9mNMSpZwM_wOeRbd5r3ruzxRuU/edit?usp=sharing

# data manipulation 
library(tidyr)
library(dplyr)

# weighted median implementation
library(matrixStats)

# plot animation
require(plotly)

Functions

The function reshapecolumn takes a column, extracts data from the column name, enriches with additional columns, and returns a normalized structure that can be bound to other normalized columns as rows to create a long data frame.

This function will be applied to all of the columns of interest in the source data. The function is very much ad hoc, reflecting the definition of the input file.

reshapecolumn <- function(colname) {
  
  parts <- colname %>% 
           strsplit("[_]") %>% 
           unlist()
  
  # divisor column to calculate percentage
  divisorcol <- parts %>%
                (function(x) (paste("totalpop",x[2],x[3],sep="_")))
  
  # total population for the year will be used for plotting
  totalpop_year <- paste("totalpop",parts[3],sep="_")
  
  # extract data from the column names
  colsfromname <- data.frame("classification" = as.character(parts[1]),
                             "agerange" = as.character(parts[2]),
                             "year" = as.integer(parts[3]))

  # build the frame 
  newdataframe <- cbind(rawdemo["nc_senate"], 
                        colsfromname, 
                        round(rawdemo[colname] / rawdemo[divisorcol] * 100),
                        round(rawdemo[colname]))
  
  # set the new column names (must precede next step)
  names(newdataframe)[5] <- "percent"
  names(newdataframe)[6] <- "count"
  
  # add the total population for the nc senate district to the rows 
  newdataframe <- inner_join(newdataframe, 
                             rawdemo[,c("nc_senate", totalpop_year)],
                             by="nc_senate")
  
  # set the column name
  names(newdataframe)[7] <- "totalpop_district_year"
  
  newdataframe$totalpop_district_year <- round(newdataframe$totalpop_district_year)
  
  return(newdataframe)
}

Select and munge the source data

The file is selected and read. Its name is printed for tracking what was done.


csvfilter <- matrix(c("CSV", "*.csv"), ncol=2)

rawdemo <- choose.files(filters = csvfilter,
                        caption = "Select demographics source file",
                        multi=FALSE)  %T>%
           print() %>%          
           read.csv()
[1] "C:\\Users\\wmc\\Documents\\vote\\NCSenateDemographics_1990_2020_FULL FILE.csv"
rawdemo

The data columns of interest (those with classified demographic data) are identified by a common name pattern (the argument to grep). The identified column names are captured in the vector cohorts. The reshapecolumn function is applied to each element of the vector, with each returned data frame bound as rows to the next. This process transforms the wide structure to a long structure, with enrichment by the function.

cohorts <- names(rawdemo)[grep("(hisp|white|black|other).*_age",
                               (names(rawdemo)))]
head(cohorts, 30) 
 [1] "white_age0_1990"  "white_age5_1990"  "white_age10_1990" "white_age15_1990"
 [5] "white_age20_1990" "white_age25_1990" "white_age30_1990" "white_age35_1990"
 [9] "white_age40_1990" "white_age45_1990" "white_age50_1990" "white_age55_1990"
[13] "white_age60_1990" "white_age65_1990" "white_age70_1990" "white_age75_1990"
[17] "white_age80_1990" "white_age85_1990" "black_age0_1990"  "black_age5_1990" 
[21] "black_age10_1990" "black_age15_1990" "black_age20_1990" "black_age25_1990"
[25] "black_age30_1990" "black_age35_1990" "black_age40_1990" "black_age45_1990"
[29] "black_age50_1990" "black_age55_1990"
length(cohorts)
[1] 360
demolong <- 
  cohorts %>%
  
  lapply(FUN=reshapecolumn) %>%    # feeds each name into the function
                                   # and builds a list of tables for each 
                                   # named column
  
  do.call('rbind', .)              # binds the result into a single data frame

head(demolong,20)
nrow(demolong)
[1] 18000

The age column values identify an age range for the age grouping. The values are mapped to an age midpoint integer for the group. The values are joined to the rows so the values can be used for weighted means.

agemeans <- data.frame(agerange  = unique(demolong$agerange),
                       midpoint_age = seq(from=2, by = 5,length.out = 18))

demolong <- inner_join(demolong, agemeans, by="agerange")

The data is restructured again to group and summarize. The summarization defines weighted means for age and population percentage.

demosumm <- demolong %>% 
  group_by(year, nc_senate, classification, totalpop_district_year) %>%
  summarise(mean_age = round(weighted.mean(midpoint_age,count)),
            median_age = round(weightedMedian(midpoint_age,count)),
            totalpop_class = sum(count)
            )
`summarise()` has grouped output by 'year', 'nc_senate', 'classification'. You can override using the `.groups` argument.
demosumm$percent <- round((demosumm$totalpop_class / demosumm$totalpop_district_year) * 100, 1)
demosumm$identity <- paste("dist", demosumm$nc_senate, demosumm$classification, sep = "_")

# show the structure of the result, using the first few rows
head(demosumm,10)

Visualization

A specific district and a related set of classification is used to create a motion plot through multiple years.

wake <- subset(demosumm,
               nc_senate %in% c(14,15,16,17)) 
wake$nc_senate <- as.factor(wake$nc_senate)

wake

Changes to Hispanic versus non-Hispanic

Plot changes to population size and percentage of Hispanic and non-Hispanics.

Uses ggplot (for plots) and plotly (for animation). https://plotly-r.com/animating-views.html

gg <- ggplot(subset(wake, classification %in% c("hispanic","nonhisp")),
             aes(mean_age, 
                 totalpop_district_year,
                 shape = classification)) +
        geom_point(aes(size = percent, 
                       frame = year, 
                       ids = identity,
                       color = nc_senate
                       )
                   )
ggplotly(gg) %>% 
  animation_opts(
    frame = 2000, 
    redraw = FALSE
  )

Changes to racial makeup

Plot changes to population size and percentage of black, white, and other classifications.

Uses ggplot (for plots) and plotly (for animation). https://plotly-r.com/animating-views.html

gg <- ggplot(subset(wake, classification %in% c("white","black","other")),
             aes(mean_age, 
                 totalpop_district_year,
                 shape = classification)) +
        geom_point(aes(size = percent, 
                       frame = year, 
                       ids = identity,
                       color = nc_senate
                       )
                   )
ggplotly(gg) %>% 
  animation_opts(
    frame = 2000, 
    redraw = FALSE
  )
---
title: "Demographic changes in Wake County NC Senate Districts across voting years"
author: "Mark Connolly"
date: "`r Sys.Date()`"
output: html_notebook
---
## Draft
The 2020 demographics were projections at the time of the original work (2017).  I hope to get actuals and redo this.  The data sources were very messy, and a different set of work was peformed to combine and tidy the data that are input to this visualization process.

Visualize the change in demographics through time in each North Carolina Wake County.  Data includes population counts by district, race, ethnicity, year, and age range.  The age ranges go from birth to greater than 85.  The age breaks are in five year increments. Race and ethnicity break down the same population.  Races are black, white, and other.  Ethnicities are Hispanic and non-Hispanic.

Source CSV can be downloaded at https://docs.google.com/spreadsheets/d/1PguhEERjgX901MRtJ9mNMSpZwM_wOeRbd5r3ruzxRuU/edit?usp=sharing


```{r setup, warning=FALSE}
# data manipulation 
library(tidyr)
library(dplyr)

# weighted median implementation
library(matrixStats)

# plot animation
require(plotly)
```
## Functions 
The function ```reshapecolumn``` takes a column, extracts data from the column name, enriches with additional columns, and returns a normalized structure that can be bound to other normalized columns as rows to create a long data frame.

This function will be applied to all of the columns of interest in the source data.  The function is very much ad hoc, reflecting the definition of the input file.

```{r}
reshapecolumn <- function(colname) {
  
  parts <- colname %>% 
           strsplit("[_]") %>% 
           unlist()
  
  # divisor column to calculate percentage
  divisorcol <- parts %>%
                (function(x) (paste("totalpop",x[2],x[3],sep="_")))
  
  # total population for the year will be used for plotting
  totalpop_year <- paste("totalpop",parts[3],sep="_")
  
  # extract data from the column names
  colsfromname <- data.frame("classification" = as.character(parts[1]),
                             "agerange" = as.character(parts[2]),
                             "year" = as.integer(parts[3]))

  # build the frame 
  newdataframe <- cbind(rawdemo["nc_senate"], 
                        colsfromname, 
                        round(rawdemo[colname] / rawdemo[divisorcol] * 100),
                        round(rawdemo[colname]))
  
  # set the new column names (must precede next step)
  names(newdataframe)[5] <- "percent"
  names(newdataframe)[6] <- "count"
  
  # add the total population for the nc senate district to the rows 
  newdataframe <- inner_join(newdataframe, 
                             rawdemo[,c("nc_senate", totalpop_year)],
                             by="nc_senate")
  
  # set the column name
  names(newdataframe)[7] <- "totalpop_district_year"
  
  newdataframe$totalpop_district_year <- round(newdataframe$totalpop_district_year)
  
  return(newdataframe)
}
```

## Select and munge the source data
The file is selected and read.  Its name is printed for tracking what was done.  

```{r}

csvfilter <- matrix(c("CSV", "*.csv"), ncol=2)

rawdemo <- choose.files(filters = csvfilter,
                        caption = "Select demographics source file",
                        multi=FALSE)  %T>%
           print() %>%          
           read.csv()
rawdemo
```


The data columns of interest (those with classified demographic data) are identified by a common name pattern (the argument to ```grep```). The identified column names are captured in the vector ```cohorts```.  The ```reshapecolumn``` function is applied to each element of the vector, with each returned data frame bound as rows to the next.  This process transforms the wide structure to a long structure, with enrichment by the function.

```{r}
cohorts <- names(rawdemo)[grep("(hisp|white|black|other).*_age",
                               (names(rawdemo)))]
head(cohorts, 30) 
length(cohorts)
```

```{r}
demolong <- 
  cohorts %>%
  
  lapply(FUN=reshapecolumn) %>%    # feeds each name into the function
                                   # and builds a list of tables for each 
                                   # named column
  
  do.call('rbind', .)              # binds the result into a single data frame

head(demolong,20)
nrow(demolong)
```

The age column values identify an age range for the age grouping.  The values are mapped to an age midpoint integer for the group.  The values are joined to the rows so the values can be used for weighted means.

```{r}
agemeans <- data.frame(agerange  = unique(demolong$agerange),
                       midpoint_age = seq(from=2, by = 5,length.out = 18))

demolong <- inner_join(demolong, agemeans, by="agerange")
```

The data is restructured again to group and summarize.  The summarization defines weighted means for age and population percentage.

```{r}
demosumm <- demolong %>% 
  group_by(year, nc_senate, classification, totalpop_district_year) %>%
  summarise(mean_age = round(weighted.mean(midpoint_age,count)),
            median_age = round(weightedMedian(midpoint_age,count)),
            totalpop_class = sum(count)
            )

demosumm$percent <- round((demosumm$totalpop_class / demosumm$totalpop_district_year) * 100, 1)
demosumm$identity <- paste("dist", demosumm$nc_senate, demosumm$classification, sep = "_")

# show the structure of the result, using the first few rows
head(demosumm,10)
```


### Visualization
A specific district and a related set of classification is used to create a motion plot through multiple years.

```{r}
wake <- subset(demosumm,
               nc_senate %in% c(14,15,16,17)) 
wake$nc_senate <- as.factor(wake$nc_senate)

wake
```

### Changes to Hispanic versus non-Hispanic
Plot changes to population size and percentage of Hispanic and non-Hispanics.

Uses ```ggplot``` (for plots) and ```plotly``` (for animation). https://plotly-r.com/animating-views.html

```{r, warning=FALSE, fig.width=8, fig.height=6}
gg <- ggplot(subset(wake, classification %in% c("hispanic","nonhisp")),
             aes(mean_age, 
                 totalpop_district_year,
                 shape = classification)) +
        geom_point(aes(size = percent, 
                       frame = year, 
                       ids = identity,
                       color = nc_senate
                       )
                   )
ggplotly(gg) %>% 
  animation_opts(
    frame = 2000, 
    redraw = FALSE
  )
```

### Changes to racial makeup
Plot changes to population size and percentage of black, white, and other classifications.

Uses ```ggplot``` (for plots) and ```plotly``` (for animation). https://plotly-r.com/animating-views.html

```{r, warning=FALSE, fig.width=8, fig.height=6}
gg <- ggplot(subset(wake, classification %in% c("white","black","other")),
             aes(mean_age, 
                 totalpop_district_year,
                 shape = classification)) +
        geom_point(aes(size = percent, 
                       frame = year, 
                       ids = identity,
                       color = nc_senate
                       )
                   )
ggplotly(gg) %>% 
  animation_opts(
    frame = 2000, 
    redraw = FALSE
  )
```
