income <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_medInc.csv")
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_pop.csv")
vb <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_vb.csv")

INTRO

Garden State Elite Volleyball Club (GSEVC) is located in Succasunna, New Jersey. North and West of the center of New Jersey, in Morris County. A high income, high growth county with an engaged volleyball community. In many ways the perfect spot to have started a premier volleyball center / club.

But … what if the plan went beyond the start and beyond the singular location?

What if the plan were to expand and grow? What then?

What other locations and what other counties in the state of New Jersey are promising for a volleyball club? And how do you even measure such a thing?

Answering these questions forms the crux of this exploratory data analysis (EDA). An introduction to applied Data Science. A high level exploration of income levels, population growth, and volleyball engagement at the county level in New Jersey.

METRICS

We make use of Median Income, Population Growth, and Volleyball Engagement in assessing volleyball favor-ability at the county level.

Prior to exploring our output, visualization, and interpretation, we prepare our data. While this isn’t the most glamorous step, it is essential, and thus we briefly visit it to better understand the labor that provides the fruit we later enjoy.

Our data prep process (applied to each metric) included but was not limited to:

  • locating, uploading, and then reading in our data (as a .csv file),
  • removing excess data (rows, columns, characters, whitespace, etc.),
  • operating upon provided metrics to create metrics, and
  • putting variables and observations into compatible forms.

As for the background and reasoning behind our choice in metrics …

Median Income

Median income is generated from 2019 US Census data and provides insight regarding the relative wealth of a county. We assume that a higher household income is indicative of a wealthier market. One that has more money to spend on their children’s sports participation (ie. joining a volleyball club).

#Tidy income dataset
income <- income %>% select(1,2) #drop 3rd column
income <- income[-c(1),] #drop 1st row
income <- income[-nrow(income),] #remove last row
income <- income %>% rename( #rename columns
    `Median Income ($)` = X2,
    County = `Median Household Income by County`
    )

#Process County column
stopwords <- c("County","New Jersey")
income$County  <-  removeWords(income$County,stopwords) #remove 'County' from County column entries
income$County <- trimws(income$County) #remove excess whitespace

#Process Median Income column
income$`Median Income ($)` <- str_replace_all(income$`Median Income ($)`, "[^[:alnum:]]", "") #remove non-alpha numeric characters
income$`Median Income ($)` <- sapply(income$`Median Income ($)`, as.numeric)

#head(income) #verify

Population Growth

Population Growth is generated from 2019 US Census data and provides insight regarding the relative growth of a county. We assume that higher growth is indicative that there are more prospective student-athletes and thus a larger market.

#Tidy population dataset
pop <- pop %>% select(1:2,13) #drop columns 3-12
pop <- pop[-c(1:4,26:30),] #drop 1st 4 and last 5 rows
pop <- pop %>% rename( #rename columns
    County = `table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts)`,
    `2010 Pop.` = X2,
    `2019 Pop.` = X13
    )

#Process County column
pop$County <- removeWords(pop$County,stopwords) #remove 'County', 'New Jersey' from County column entries
pop$County <- str_replace_all(pop$County, "[^[:alnum:]]", " ") #remove non-alpha numeric characters
pop$County <- trimws(pop$County) #remove excess whitespace

#Process 2010 Pop. column
pop$`2010 Pop.` <- str_replace_all(pop$`2010 Pop.`, "[[:punct:]]", "") #remove commas
pop$`2010 Pop.` <- sapply(pop$`2010 Pop.`, as.numeric) #convert to numeric

#Create population growth variable
pop$`Pop.Growth`<- (pop$`2019 Pop.` - pop$`2010 Pop.`) / pop$`2010 Pop.`
pop$`Pop.Growth` <- round(pop$`Pop.Growth`,3) * 100

#head(pop) #verify

We calculate the population growth rate (%) for each county as:

\[ Pop.Growth = \frac{Pop (2019) - Pop(2010)}{Pop(2010)} \]

What we end up with is a decimal to the nth degree and so we round our values to the 3rd decimal place and then multiply by 100 to bring our Population Growth metric into a simple % form. This metric provides clear insight as to which counties are growing vs. which counties are not.

Volleyball Engagement

Volleyball Engagement is generated from state level data (documented further down) and provides insight regarding the relative enthusiasm a county has toward the sport of volleyball. Being that high schools are much more likely to offer girls volleyball, we assume that a greater concentration of boys programs within a county is indicative of a higher general enthusiasm for volleyball.

#Tidy volleyball dataset
vb <- vb %>% rename( #rename columns
    `Boys_Teams` = `Boys Volleyball Teams`,
    `HSs` = `High Schools`
    )

#replace NAs with 0
vb[is.na(vb)] <- 0

#Process County column
vb$County <- removeWords(vb$County,stopwords) #remove 'County', 'New Jersey' from County column entries
vb$County <- str_replace_all(vb$County, "[^[:alnum:]]", " ") #remove non-alpha numeric characters
vb$County <- trimws(vb$County) #remove excess whitespace

#Create VB Engagement metric
vb$TPH <- vb$Boys_Teams / vb$HSs #teams per HS
#vb$TPS <- vb$Boys_Teams / (vb$Students/2) #teams per student (assuming 1/2 male)
vb$TPH <- round(vb$TPH,3) * 100

#head(vb) #verify

There are national metrics of engagement (ie. # of high schoolers playing) but when it comes down to volleyball engagement at a county level, there is no readily available metric. Thus, I created one:

  • From High-Schools.com I pulled the number of high schools and student enrollment at a county level. This data is from the 2015-2016 school year.
  • From highschoolsports.nj.com I pulled the number of boy's volleyball teams at a county level. This data is from the 2018-2019 volleyball season.

While the High-Schools.com data was in a ready-to-consume form, the highschoolsports.nj.com data had to be manually tallied. I went through each division’s list, searched the location of the school, and then placed a tally for the county in which it resided.

From this, I calculated the volleyball engagement rate (%) for each county as:

\[ TPH = \frac{NumBoysTeams}{NumHighSchools} \]

What we end up with is a decimal to the nth degree and so we round our values to the 3rd decimal place and then multiply by 100 to bring our Teams Per High School (TPH) volleyball engagement metric into a simple % form. This metric provides clear insight as to which counties have higher vs. lower volleyball enthusiasm.

……………………………………………………………………..

OUTPUT

We merge our clean, compatible dataframes into one, and output the resulting table:

#join dfs by County
df <- merge(income, pop, by="County")
df <- merge(df, vb, by="County")
df <- df %>% select(1:2,5,9) #drop columns 3-12

df %>% kbl() %>% kable_minimal()
County Median Income ($) Pop.Growth TPH
Atlantic 62110 -4.0 7.7
Bergen 101144 3.0 9.2
Burlington 87416 -0.8 11.1
Camden 70451 -1.4 23.9
Cape May 67074 -5.4 0.0
Cumberland 54149 -4.7 0.0
Essex 61510 1.9 22.1
Gloucester 87283 1.2 17.9
Hudson 71189 6.0 31.8
Hunterdon 115379 -3.1 5.6
Mercer 81057 0.3 5.0
Middlesex 89533 1.9 25.5
Monmouth 99733 -1.8 15.2
Morris 115527 -0.1 7.8
Ocean 70909 5.3 20.0
Passaic 69688 0.1 19.6
Salem 66842 -5.6 0.0
Somerset 113611 1.7 11.4
Sussex 94520 -5.9 14.3
Union 80198 3.7 15.1
Warren 81307 -3.2 0.0

We draw the following insight:

  • The top 3 wealthiest counties are Morris, Hunterdon, and Somerset. The bottom 3 are Cumberland, Essex, and Atlantic.
  • The top 3 counties for growth are Hudson, Ocean, and Union. The bottom 3 are Sussex, Salem, and Cape May.
  • The top 3 counties for volleyball are Hudson, Middlesex, and Camden. The bottom 4 are Cape May, Cumberland, Salem, and Warren.

With this combination of metrics in mind, how do we identify the best opportunities for expansion? We can map and rank it …

Favorability Map

A choropleth map is a thematic map in which pre-defined areas (in our case counties) are shaded in proportion to a statistical variable. In our case, we’re considering three metrics and the modeling function (the choropleth map) requires just one.

To adapt, I created a combination variable by normalizing our metrics (income, growth, and engagement), summing the result, normalizing the combination variable, and then rounding these values to the 3rd decimal place and multiplying by 100.

The resulting % is our Favorability score. This metric provides clear insight as to which counties are most favorable for a volleyball center based on income, growth, and engagement:

# obtain region codes for new jersey
data(county.regions, package = "choroplethrMaps")

region <- county.regions %>% filter(state.name == "new jersey")

#convert field names to lowercase
names(df) <- lapply(names(df), tolower)
df[[1]] <- tolower(df[[1]])

#simplify column names
df <- df %>% rename(
    income = `median income ($)`,
    growth = pop.growth
    )

#head(df)
#normalize data scales to be from 0 to 1
i <- df$income
n_i = (i-min(i))/(max(i)-min(i))

g <- df$growth
n_g <- (g-min(g))/(max(g)-min(g))

v <- df$tph
n_v <- (v-min(v))/(max(v)-min(v))

#combine metrics
combined <- n_i + n_g + n_v
n_combined <- (combined-min(combined))/(max(combined)-min(combined))

#merge with df
df$combined <- round(n_combined,3) * 100
#head(df)

df_comb <- df %>% select(1,5)
df_comb <- df_comb %>% rename(value = combined)
#Choropleth map: combined metrics

#join data to region codes
plot_4 <- inner_join(df_comb, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_4, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Volleyball Club Favorability by County", fill = "Favorability Score")

The above plot provides a clear visualization regarding county and regional favorability for a volleyball club. The most favorable counties appear to be in the eastern part of the state (near New York City or the coast).

To more clearly see where each county stands, we create a simple barplot where the counties are ranked based on their Favorability Score:

ggplot(df_comb) +
  geom_bar(aes(x = reorder(county, value) , y = value, fill = value), stat = "identity", position = "dodge", width = 1) + coord_flip() +
  theme(legend.position = "none") +
  labs( title = "Volleyball Favorability County Rankings", x = "", y = "", fill = "Source")

CONCLUSION

From the map and barplot above, we see that, based on our elected metrics:

  • Hudson is the most favorable,
  • Bergen, Ocean, Somerset, and Middlesex are very favorable,
  • Monmouth, Gloucester, Union, and Morris are favorable, and
  • remaining counties are either fairly favorable or not at all (ie. Cumberland).

Based on the rankings, Hudson, Bergen, Ocean, Somerset, and Middlesex counties are the most promising with regard to expansion. These are the counties where the reward is more likely to outweigh the risk when we take into account income, growth, and engagement.

Next Steps

There are a number of next steps that would take a report of this nature to ‘the next level’, and I’ve noted a few below:

  • account for more predictive metrics (ie. what precedes the growth of a VB community?),
  • account for each county’s ease of doing business,
  • account for commute times / ease of access,
  • update scoring metric based on GSEVC feedback (ie. income > growth),
  • be more specific and dig into data at a town level for counties of interest,
  • account for existing volleyball club locations with pins on our output map or as a metric to be considered along with income, growth, and engagment,
  • improve the accuracy of the engagement metric or score volleyball engagement based on male and female teams, and/or
  • create an interactive app where visualizations update based on User adjustment of metric weight and consideration.

Being that this was a high level, introductory report, I just wanted to show some of the capabilities of Data Science and how it might be applied to help your business and/or guide your decision-making. I hope it’s proven useful.

All the Best! Magnus

……………………………………………………………………..

APPENDIX

In the appendix I’ve included a number of additional plots and an interpretation of our final plots. Being that this information could provide valuable insight, I wanted to include without overloading the main report.

Metric Maps

I’ve included choropleth maps for individual components / metrics. They provide a visual representation of how each county fared with regard to the metric in question:

##Median Income

#select only pertinent columns
df_in <- df %>% select(1,2)
df_in <- df_in %>% rename(value = income)

#join data to region codes
plot_1 <- inner_join(df_in, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_1, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Median Income by County", fill = "Income Range ($)")

##Population Growth

#select only pertinent columns
df_pop <- df %>% select(1,3)
df_pop <- df_pop %>% rename(value = growth)

#join data to region codes
plot_2 <- inner_join(df_pop, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_2, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Population Growth by County", fill = "Population Growth (%)")

##Volleyball Engagement

#select only pertinent columns
df_vb <- df %>% select(1,4)
df_vb <- df_vb %>% rename(value = tph)

#join data to region codes
plot_3 <- inner_join(df_vb, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_3, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Volleyball Engagement by County", fill = "Volleyball Engagement (%)")

Favorability Score Table

I’ve included the favorability score table to provide clear, numeric values and better understand the ranking of our preceding volleyball favorability map and bar plot visualizations:

df_comb %>% kbl() %>% kable_minimal()
county value
atlantic 19.8
bergen 78.2
burlington 56.0
camden 59.5
cape may 7.0
cumberland 0.0
essex 62.9
gloucester 73.4
hudson 100.0
hunterdon 60.1
mercer 46.7
middlesex 88.8
monmouth 67.3
morris 75.0
ocean 80.0
passaic 58.5
salem 6.0
somerset 85.7
sussex 46.2
union 73.7
warren 26.1

Favorability Score Interpretation

While I’d expected Monmouth county to score higher, I didn’t know that the county had seen a net decrease in population in the past decade and this factor alone set the county’s score in a lower bracket. Whereas Hudson county was the highest performing county for population growth and engagement. Being the top in each of these metrics, set the county as the most favorable even though it wasn’t in the top 3 for income.

When we pause and think for a second, it makes sense. Hudson county has seen a rise in population, livability, and median income. It’s seen an influx of high earning residents (commuting to the city) as well as Hispanic immigrants. Both of these demographics could be favorable in supporting a new volleyball club. A younger, higher earning population, that loves the game is a great community to become a part of.