income <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_medInc.csv")
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_pop.csv")
vb <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_vb.csv")
Garden State Elite Volleyball Club (GSEVC) is located in Succasunna, New Jersey. North and West of the center of New Jersey, in Morris County. A high income, high growth county with an engaged volleyball community. In many ways the perfect spot to have started a premier volleyball center / club.
But … what if the plan went beyond the start and beyond the singular location?
What if the plan were to expand and grow? What then?
What other locations and what other counties in the state of New Jersey are promising for a volleyball club? And how do you even measure such a thing?
Answering these questions forms the crux of this exploratory data analysis (EDA). An introduction to applied Data Science. A high level exploration of income levels, population growth, and volleyball engagement at the county level in New Jersey.
We make use of Median Income, Population Growth, and Volleyball Engagement in assessing volleyball favor-ability at the county level.
Prior to exploring our output, visualization, and interpretation, we prepare our data. While this isn’t the most glamorous step, it is essential, and thus we briefly visit it to better understand the labor that provides the fruit we later enjoy.
Our data prep process (applied to each metric) included but was not limited to:
As for the background and reasoning behind our choice in metrics …
Median income is generated from 2019 US Census data and provides insight regarding the relative wealth of a county. We assume that a higher household income is indicative of a wealthier market. One that has more money to spend on their children’s sports participation (ie. joining a volleyball club).
#Tidy income dataset
income <- income %>% select(1,2) #drop 3rd column
income <- income[-c(1),] #drop 1st row
income <- income[-nrow(income),] #remove last row
income <- income %>% rename( #rename columns
`Median Income ($)` = X2,
County = `Median Household Income by County`
)
#Process County column
stopwords <- c("County","New Jersey")
income$County <- removeWords(income$County,stopwords) #remove 'County' from County column entries
income$County <- trimws(income$County) #remove excess whitespace
#Process Median Income column
income$`Median Income ($)` <- str_replace_all(income$`Median Income ($)`, "[^[:alnum:]]", "") #remove non-alpha numeric characters
income$`Median Income ($)` <- sapply(income$`Median Income ($)`, as.numeric)
#head(income) #verify
Population Growth is generated from 2019 US Census data and provides insight regarding the relative growth of a county. We assume that higher growth is indicative that there are more prospective student-athletes and thus a larger market.
#Tidy population dataset
pop <- pop %>% select(1:2,13) #drop columns 3-12
pop <- pop[-c(1:4,26:30),] #drop 1st 4 and last 5 rows
pop <- pop %>% rename( #rename columns
County = `table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts)`,
`2010 Pop.` = X2,
`2019 Pop.` = X13
)
#Process County column
pop$County <- removeWords(pop$County,stopwords) #remove 'County', 'New Jersey' from County column entries
pop$County <- str_replace_all(pop$County, "[^[:alnum:]]", " ") #remove non-alpha numeric characters
pop$County <- trimws(pop$County) #remove excess whitespace
#Process 2010 Pop. column
pop$`2010 Pop.` <- str_replace_all(pop$`2010 Pop.`, "[[:punct:]]", "") #remove commas
pop$`2010 Pop.` <- sapply(pop$`2010 Pop.`, as.numeric) #convert to numeric
#Create population growth variable
pop$`Pop.Growth`<- (pop$`2019 Pop.` - pop$`2010 Pop.`) / pop$`2010 Pop.`
pop$`Pop.Growth` <- round(pop$`Pop.Growth`,3) * 100
#head(pop) #verify
We calculate the population growth rate (%) for each county as:
\[ Pop.Growth = \frac{Pop (2019) - Pop(2010)}{Pop(2010)} \]
What we end up with is a decimal to the nth degree and so we round our values to the 3rd decimal place and then multiply by 100 to bring our Population Growth metric into a simple % form. This metric provides clear insight as to which counties are growing vs. which counties are not.
Volleyball Engagement is generated from state level data (documented further down) and provides insight regarding the relative enthusiasm a county has toward the sport of volleyball. Being that high schools are much more likely to offer girls volleyball, we assume that a greater concentration of boys programs within a county is indicative of a higher general enthusiasm for volleyball.
#Tidy volleyball dataset
vb <- vb %>% rename( #rename columns
`Boys_Teams` = `Boys Volleyball Teams`,
`HSs` = `High Schools`
)
#replace NAs with 0
vb[is.na(vb)] <- 0
#Process County column
vb$County <- removeWords(vb$County,stopwords) #remove 'County', 'New Jersey' from County column entries
vb$County <- str_replace_all(vb$County, "[^[:alnum:]]", " ") #remove non-alpha numeric characters
vb$County <- trimws(vb$County) #remove excess whitespace
#Create VB Engagement metric
vb$TPH <- vb$Boys_Teams / vb$HSs #teams per HS
#vb$TPS <- vb$Boys_Teams / (vb$Students/2) #teams per student (assuming 1/2 male)
vb$TPH <- round(vb$TPH,3) * 100
#head(vb) #verify
There are national metrics of engagement (ie. # of high schoolers playing) but when it comes down to volleyball engagement at a county level, there is no readily available metric. Thus, I created one:
the number of high schools
and student enrollment
at a county level. This data is from the 2015-2016 school year.number of boy's volleyball teams
at a county level. This data is from the 2018-2019 volleyball season.While the High-Schools.com data was in a ready-to-consume form, the highschoolsports.nj.com data had to be manually tallied. I went through each division’s list, searched the location of the school, and then placed a tally for the county in which it resided.
From this, I calculated the volleyball engagement rate (%) for each county as:
\[ TPH = \frac{NumBoysTeams}{NumHighSchools} \]
What we end up with is a decimal to the nth degree and so we round our values to the 3rd decimal place and then multiply by 100 to bring our Teams Per High School (TPH) volleyball engagement metric into a simple % form. This metric provides clear insight as to which counties have higher vs. lower volleyball enthusiasm.
……………………………………………………………………..
We merge our clean, compatible dataframes into one, and output the resulting table:
#join dfs by County
df <- merge(income, pop, by="County")
df <- merge(df, vb, by="County")
df <- df %>% select(1:2,5,9) #drop columns 3-12
df %>% kbl() %>% kable_minimal()
County | Median Income ($) | Pop.Growth | TPH |
---|---|---|---|
Atlantic | 62110 | -4.0 | 7.7 |
Bergen | 101144 | 3.0 | 9.2 |
Burlington | 87416 | -0.8 | 11.1 |
Camden | 70451 | -1.4 | 23.9 |
Cape May | 67074 | -5.4 | 0.0 |
Cumberland | 54149 | -4.7 | 0.0 |
Essex | 61510 | 1.9 | 22.1 |
Gloucester | 87283 | 1.2 | 17.9 |
Hudson | 71189 | 6.0 | 31.8 |
Hunterdon | 115379 | -3.1 | 5.6 |
Mercer | 81057 | 0.3 | 5.0 |
Middlesex | 89533 | 1.9 | 25.5 |
Monmouth | 99733 | -1.8 | 15.2 |
Morris | 115527 | -0.1 | 7.8 |
Ocean | 70909 | 5.3 | 20.0 |
Passaic | 69688 | 0.1 | 19.6 |
Salem | 66842 | -5.6 | 0.0 |
Somerset | 113611 | 1.7 | 11.4 |
Sussex | 94520 | -5.9 | 14.3 |
Union | 80198 | 3.7 | 15.1 |
Warren | 81307 | -3.2 | 0.0 |
We draw the following insight:
With this combination of metrics in mind, how do we identify the best opportunities for expansion? We can map and rank it …
A choropleth map is a thematic map in which pre-defined areas (in our case counties) are shaded in proportion to a statistical variable. In our case, we’re considering three metrics and the modeling function (the choropleth map) requires just one.
To adapt, I created a combination variable by normalizing our metrics (income, growth, and engagement), summing the result, normalizing the combination variable, and then rounding these values to the 3rd decimal place and multiplying by 100.
The resulting % is our Favorability score
. This metric provides clear insight as to which counties are most favorable for a volleyball center based on income, growth, and engagement:
# obtain region codes for new jersey
data(county.regions, package = "choroplethrMaps")
region <- county.regions %>% filter(state.name == "new jersey")
#convert field names to lowercase
names(df) <- lapply(names(df), tolower)
df[[1]] <- tolower(df[[1]])
#simplify column names
df <- df %>% rename(
income = `median income ($)`,
growth = pop.growth
)
#head(df)
#normalize data scales to be from 0 to 1
i <- df$income
n_i = (i-min(i))/(max(i)-min(i))
g <- df$growth
n_g <- (g-min(g))/(max(g)-min(g))
v <- df$tph
n_v <- (v-min(v))/(max(v)-min(v))
#combine metrics
combined <- n_i + n_g + n_v
n_combined <- (combined-min(combined))/(max(combined)-min(combined))
#merge with df
df$combined <- round(n_combined,3) * 100
#head(df)
df_comb <- df %>% select(1,5)
df_comb <- df_comb %>% rename(value = combined)
#Choropleth map: combined metrics
#join data to region codes
plot_4 <- inner_join(df_comb, region, by=c("county" = "county.name"))
#create choropleth map
county_choropleth(plot_4, state_zoom = "new jersey", num_colors = 6) +
scale_fill_brewer(palette="Blues") +
labs(title = "Volleyball Club Favorability by County", fill = "Favorability Score")
The above plot provides a clear visualization regarding county and regional favorability for a volleyball club. The most favorable counties appear to be in the eastern part of the state (near New York City or the coast).
To more clearly see where each county stands, we create a simple barplot where the counties are ranked based on their Favorability Score
:
ggplot(df_comb) +
geom_bar(aes(x = reorder(county, value) , y = value, fill = value), stat = "identity", position = "dodge", width = 1) + coord_flip() +
theme(legend.position = "none") +
labs( title = "Volleyball Favorability County Rankings", x = "", y = "", fill = "Source")
From the map and barplot above, we see that, based on our elected metrics:
Based on the rankings, Hudson, Bergen, Ocean, Somerset, and Middlesex counties are the most promising with regard to expansion. These are the counties where the reward is more likely to outweigh the risk when we take into account income, growth, and engagement.
There are a number of next steps that would take a report of this nature to ‘the next level’, and I’ve noted a few below:
Being that this was a high level, introductory report, I just wanted to show some of the capabilities of Data Science and how it might be applied to help your business and/or guide your decision-making. I hope it’s proven useful.
All the Best! Magnus
……………………………………………………………………..
In the appendix I’ve included a number of additional plots and an interpretation of our final plots. Being that this information could provide valuable insight, I wanted to include without overloading the main report.
I’ve included choropleth maps for individual components / metrics. They provide a visual representation of how each county fared with regard to the metric in question:
##Median Income
#select only pertinent columns
df_in <- df %>% select(1,2)
df_in <- df_in %>% rename(value = income)
#join data to region codes
plot_1 <- inner_join(df_in, region, by=c("county" = "county.name"))
#create choropleth map
county_choropleth(plot_1, state_zoom = "new jersey", num_colors = 6) +
scale_fill_brewer(palette="Blues") +
labs(title = "Median Income by County", fill = "Income Range ($)")
##Population Growth
#select only pertinent columns
df_pop <- df %>% select(1,3)
df_pop <- df_pop %>% rename(value = growth)
#join data to region codes
plot_2 <- inner_join(df_pop, region, by=c("county" = "county.name"))
#create choropleth map
county_choropleth(plot_2, state_zoom = "new jersey", num_colors = 6) +
scale_fill_brewer(palette="Blues") +
labs(title = "Population Growth by County", fill = "Population Growth (%)")
##Volleyball Engagement
#select only pertinent columns
df_vb <- df %>% select(1,4)
df_vb <- df_vb %>% rename(value = tph)
#join data to region codes
plot_3 <- inner_join(df_vb, region, by=c("county" = "county.name"))
#create choropleth map
county_choropleth(plot_3, state_zoom = "new jersey", num_colors = 6) +
scale_fill_brewer(palette="Blues") +
labs(title = "Volleyball Engagement by County", fill = "Volleyball Engagement (%)")
I’ve included the favorability score table to provide clear, numeric values and better understand the ranking of our preceding volleyball favorability map and bar plot visualizations:
county | value |
---|---|
atlantic | 19.8 |
bergen | 78.2 |
burlington | 56.0 |
camden | 59.5 |
cape may | 7.0 |
cumberland | 0.0 |
essex | 62.9 |
gloucester | 73.4 |
hudson | 100.0 |
hunterdon | 60.1 |
mercer | 46.7 |
middlesex | 88.8 |
monmouth | 67.3 |
morris | 75.0 |
ocean | 80.0 |
passaic | 58.5 |
salem | 6.0 |
somerset | 85.7 |
sussex | 46.2 |
union | 73.7 |
warren | 26.1 |
While I’d expected Monmouth county to score higher, I didn’t know that the county had seen a net decrease in population in the past decade and this factor alone set the county’s score in a lower bracket. Whereas Hudson county was the highest performing county for population growth and engagement. Being the top in each of these metrics, set the county as the most favorable even though it wasn’t in the top 3 for income.
When we pause and think for a second, it makes sense. Hudson county has seen a rise in population, livability, and median income. It’s seen an influx of high earning residents (commuting to the city) as well as Hispanic immigrants. Both of these demographics could be favorable in supporting a new volleyball club. A younger, higher earning population, that loves the game is a great community to become a part of.