GSEVC Intro to EDA

income <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_medInc.csv")
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_pop.csv")
vb <- read_csv("https://raw.githubusercontent.com/Magnus-PS/GSEVC/main/nj_county_vb.csv")

INTRO

Garden State Elite Volleyball Club (GSEVC) is located in Succasunna, New Jersey. North and West of the center of New Jersey, in Morris County. A high income, high growth county with an engaged volleyball community. In many ways the perfect spot to have started a premier volleyball center / club.

But … what if the plan went beyond the start and beyond the singular location?

What if the plan were to expand and grow? What then?

What other locations and what other counties in the state of New Jersey are promising for a volleyball club? And how do you even measure such a thing?

Answering these questions forms the crux of this exploratory data analysis (EDA). An introduction to applied Data Science. A high level exploration of income levels, population growth, and volleyball engagement at the county level in New Jersey.

METRICS

We make use of Median Income, Population Growth, and Volleyball Engagement in assessing volleyball favor-ability at the county level.

Prior to exploring our output, visualization, and interpretation, we prepare our data. While this isn’t the most glamorous step, it is essential, and thus we briefly visit it to better understand the labor that provides the fruit we later enjoy.

Our data prep process (applied to each metric) included but was not limited to:

locating, uploading, and then reading in our data (as a .csv file),
removing excess data (rows, columns, characters, whitespace, etc.),
operating upon provided metrics to create metrics, and
putting variables and observations into compatible forms.

As for the background and reasoning behind our choice in metrics …

Median Income

Median income is generated from 2019 US Census data and provides insight regarding the relative wealth of a county. We assume that a higher household income is indicative of a wealthier market. One that has more money to spend on their children’s sports participation (ie. joining a volleyball club).

#Tidy income dataset
income <- income %>% select(1,2) #drop 3rd column
income <- income[-c(1),] #drop 1st row
income <- income[-nrow(income),] #remove last row
income <- income %>% rename( #rename columns
    `Median Income ($)` = X2,
    County = `Median Household Income by County`
    )

#Process County column
stopwords <- c("County","New Jersey")
income$County  <-  removeWords(income$County,stopwords) #remove 'County' from County column entries
income$County <- trimws(income$County) #remove excess whitespace

#Process Median Income column
income$`Median Income ($)` <- str_replace_all(income$`Median Income ($)`, "[^[:alnum:]]", "") #remove non-alpha numeric characters
income$`Median Income ($)` <- sapply(income$`Median Income ($)`, as.numeric)

#head(income) #verify

Population Growth

Population Growth is generated from 2019 US Census data and provides insight regarding the relative growth of a county. We assume that higher growth is indicative that there are more prospective student-athletes and thus a larger market.

#Tidy population dataset
pop <- pop %>% select(1:2,13) #drop columns 3-12
pop <- pop[-c(1:4,26:30),] #drop 1st 4 and last 5 rows
pop <- pop %>% rename( #rename columns
    County = `table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts)`,
    `2010 Pop.` = X2,
    `2019 Pop.` = X13
    )

#Process County column
pop$County <- removeWords(pop$County,stopwords) #remove 'County', 'New Jersey' from County column entries
pop$County <- str_replace_all(pop$County, "[^[:alnum:]]", " ") #remove non-alpha numeric characters
pop$County <- trimws(pop$County) #remove excess whitespace

#Process 2010 Pop. column
pop$`2010 Pop.` <- str_replace_all(pop$`2010 Pop.`, "[[:punct:]]", "") #remove commas
pop$`2010 Pop.` <- sapply(pop$`2010 Pop.`, as.numeric) #convert to numeric

#Create population growth variable
pop$`Pop.Growth`<- (pop$`2019 Pop.` - pop$`2010 Pop.`) / pop$`2010 Pop.`
pop$`Pop.Growth` <- round(pop$`Pop.Growth`,3) * 100

#head(pop) #verify

We calculate the population growth rate (%) for each county as:

\[ Pop.Growth = \frac{Pop (2019) - Pop(2010)}{Pop(2010)} \]

What we end up with is a decimal to the nth degree and so we round our values to the 3rd decimal place and then multiply by 100 to bring our Population Growth metric into a simple % form. This metric provides clear insight as to which counties are growing vs. which counties are not.

Volleyball Engagement

Volleyball Engagement is generated from state level data (documented further down) and provides insight regarding the relative enthusiasm a county has toward the sport of volleyball. Being that high schools are much more likely to offer girls volleyball, we assume that a greater concentration of boys programs within a county is indicative of a higher general enthusiasm for volleyball.

#Tidy volleyball dataset
vb <- vb %>% rename( #rename columns
    `Boys_Teams` = `Boys Volleyball Teams`,
    `HSs` = `High Schools`
    )

#replace NAs with 0
vb[is.na(vb)] <- 0

#Process County column
vb$County <- removeWords(vb$County,stopwords) #remove 'County', 'New Jersey' from County column entries
vb$County <- str_replace_all(vb$County, "[^[:alnum:]]", " ") #remove non-alpha numeric characters
vb$County <- trimws(vb$County) #remove excess whitespace

#Create VB Engagement metric
vb$TPH <- vb$Boys_Teams / vb$HSs #teams per HS
#vb$TPS <- vb$Boys_Teams / (vb$Students/2) #teams per student (assuming 1/2 male)
vb$TPH <- round(vb$TPH,3) * 100

#head(vb) #verify

There are national metrics of engagement (ie. # of high schoolers playing) but when it comes down to volleyball engagement at a county level, there is no readily available metric. Thus, I created one:

From High-Schools.com I pulled the number of high schools and student enrollment at a county level. This data is from the 2015-2016 school year.
From highschoolsports.nj.com I pulled the number of boy's volleyball teams at a county level. This data is from the 2018-2019 volleyball season.

While the High-Schools.com data was in a ready-to-consume form, the highschoolsports.nj.com data had to be manually tallied. I went through each division’s list, searched the location of the school, and then placed a tally for the county in which it resided.

From this, I calculated the volleyball engagement rate (%) for each county as:

\[ TPH = \frac{NumBoysTeams}{NumHighSchools} \]

What we end up with is a decimal to the nth degree and so we round our values to the 3rd decimal place and then multiply by 100 to bring our Teams Per High School (TPH) volleyball engagement metric into a simple % form. This metric provides clear insight as to which counties have higher vs. lower volleyball enthusiasm.

……………………………………………………………………..

OUTPUT

We merge our clean, compatible dataframes into one, and output the resulting table:

#join dfs by County
df <- merge(income, pop, by="County")
df <- merge(df, vb, by="County")
df <- df %>% select(1:2,5,9) #drop columns 3-12

df %>% kbl() %>% kable_minimal()

County	Median Income ($)	Pop.Growth	TPH
Atlantic	62110	-4.0	7.7
Bergen	101144	3.0	9.2
Burlington	87416	-0.8	11.1
Camden	70451	-1.4	23.9
Cape May	67074	-5.4	0.0
Cumberland	54149	-4.7	0.0
Essex	61510	1.9	22.1
Gloucester	87283	1.2	17.9
Hudson	71189	6.0	31.8
Hunterdon	115379	-3.1	5.6
Mercer	81057	0.3	5.0
Middlesex	89533	1.9	25.5
Monmouth	99733	-1.8	15.2
Morris	115527	-0.1	7.8
Ocean	70909	5.3	20.0
Passaic	69688	0.1	19.6
Salem	66842	-5.6	0.0
Somerset	113611	1.7	11.4
Sussex	94520	-5.9	14.3
Union	80198	3.7	15.1
Warren	81307	-3.2	0.0

We draw the following insight:

The top 3 wealthiest counties are Morris, Hunterdon, and Somerset. The bottom 3 are Cumberland, Essex, and Atlantic.
The top 3 counties for growth are Hudson, Ocean, and Union. The bottom 3 are Sussex, Salem, and Cape May.
The top 3 counties for volleyball are Hudson, Middlesex, and Camden. The bottom 4 are Cape May, Cumberland, Salem, and Warren.

With this combination of metrics in mind, how do we identify the best opportunities for expansion? We can map and rank it …

Favorability Map

A choropleth map is a thematic map in which pre-defined areas (in our case counties) are shaded in proportion to a statistical variable. In our case, we’re considering three metrics and the modeling function (the choropleth map) requires just one.

To adapt, I created a combination variable by normalizing our metrics (income, growth, and engagement), summing the result, normalizing the combination variable, and then rounding these values to the 3rd decimal place and multiplying by 100.

The resulting % is our Favorability score. This metric provides clear insight as to which counties are most favorable for a volleyball center based on income, growth, and engagement:

# obtain region codes for new jersey
data(county.regions, package = "choroplethrMaps")

region <- county.regions %>% filter(state.name == "new jersey")

#convert field names to lowercase
names(df) <- lapply(names(df), tolower)
df[[1]] <- tolower(df[[1]])

#simplify column names
df <- df %>% rename(
    income = `median income ($)`,
    growth = pop.growth
    )

#head(df)

#normalize data scales to be from 0 to 1
i <- df$income
n_i = (i-min(i))/(max(i)-min(i))

g <- df$growth
n_g <- (g-min(g))/(max(g)-min(g))

v <- df$tph
n_v <- (v-min(v))/(max(v)-min(v))

#combine metrics
combined <- n_i + n_g + n_v
n_combined <- (combined-min(combined))/(max(combined)-min(combined))

#merge with df
df$combined <- round(n_combined,3) * 100
#head(df)

df_comb <- df %>% select(1,5)
df_comb <- df_comb %>% rename(value = combined)

#Choropleth map: combined metrics

#join data to region codes
plot_4 <- inner_join(df_comb, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_4, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Volleyball Club Favorability by County", fill = "Favorability Score")

The above plot provides a clear visualization regarding county and regional favorability for a volleyball club. The most favorable counties appear to be in the eastern part of the state (near New York City or the coast).

To more clearly see where each county stands, we create a simple barplot where the counties are ranked based on their Favorability Score:

ggplot(df_comb) +
  geom_bar(aes(x = reorder(county, value) , y = value, fill = value), stat = "identity", position = "dodge", width = 1) + coord_flip() +
  theme(legend.position = "none") +
  labs( title = "Volleyball Favorability County Rankings", x = "", y = "", fill = "Source")

CONCLUSION

From the map and barplot above, we see that, based on our elected metrics:

Hudson is the most favorable,
Bergen, Ocean, Somerset, and Middlesex are very favorable,
Monmouth, Gloucester, Union, and Morris are favorable, and
remaining counties are either fairly favorable or not at all (ie. Cumberland).

Based on the rankings, Hudson, Bergen, Ocean, Somerset, and Middlesex counties are the most promising with regard to expansion. These are the counties where the reward is more likely to outweigh the risk when we take into account income, growth, and engagement.

Next Steps

There are a number of next steps that would take a report of this nature to ‘the next level’, and I’ve noted a few below:

account for more predictive metrics (ie. what precedes the growth of a VB community?),
account for each county’s ease of doing business,
account for commute times / ease of access,
update scoring metric based on GSEVC feedback (ie. income > growth),
be more specific and dig into data at a town level for counties of interest,
account for existing volleyball club locations with pins on our output map or as a metric to be considered along with income, growth, and engagment,
improve the accuracy of the engagement metric or score volleyball engagement based on male and female teams, and/or
create an interactive app where visualizations update based on User adjustment of metric weight and consideration.

Being that this was a high level, introductory report, I just wanted to show some of the capabilities of Data Science and how it might be applied to help your business and/or guide your decision-making. I hope it’s proven useful.

All the Best! Magnus

……………………………………………………………………..

APPENDIX

In the appendix I’ve included a number of additional plots and an interpretation of our final plots. Being that this information could provide valuable insight, I wanted to include without overloading the main report.

Metric Maps

I’ve included choropleth maps for individual components / metrics. They provide a visual representation of how each county fared with regard to the metric in question:

##Median Income

#select only pertinent columns
df_in <- df %>% select(1,2)
df_in <- df_in %>% rename(value = income)

#join data to region codes
plot_1 <- inner_join(df_in, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_1, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Median Income by County", fill = "Income Range ($)")

##Population Growth

#select only pertinent columns
df_pop <- df %>% select(1,3)
df_pop <- df_pop %>% rename(value = growth)

#join data to region codes
plot_2 <- inner_join(df_pop, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_2, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Population Growth by County", fill = "Population Growth (%)")

##Volleyball Engagement

#select only pertinent columns
df_vb <- df %>% select(1,4)
df_vb <- df_vb %>% rename(value = tph)

#join data to region codes
plot_3 <- inner_join(df_vb, region, by=c("county" = "county.name"))

#create choropleth map
county_choropleth(plot_3, state_zoom = "new jersey", num_colors = 6) +
  scale_fill_brewer(palette="Blues") +
  labs(title = "Volleyball Engagement by County", fill = "Volleyball Engagement (%)")

Favorability Score Table

I’ve included the favorability score table to provide clear, numeric values and better understand the ranking of our preceding volleyball favorability map and bar plot visualizations:

df_comb %>% kbl() %>% kable_minimal()

county	value
atlantic	19.8
bergen	78.2
burlington	56.0
camden	59.5
cape may	7.0
cumberland	0.0
essex	62.9
gloucester	73.4
hudson	100.0
hunterdon	60.1
mercer	46.7
middlesex	88.8
monmouth	67.3
morris	75.0
ocean	80.0
passaic	58.5
salem	6.0
somerset	85.7
sussex	46.2
union	73.7
warren	26.1

Favorability Score Interpretation

While I’d expected Monmouth county to score higher, I didn’t know that the county had seen a net decrease in population in the past decade and this factor alone set the county’s score in a lower bracket. Whereas Hudson county was the highest performing county for population growth and engagement. Being the top in each of these metrics, set the county as the most favorable even though it wasn’t in the top 3 for income.

When we pause and think for a second, it makes sense. Hudson county has seen a rise in population, livability, and median income. It’s seen an influx of high earning residents (commuting to the city) as well as Hispanic immigrants. Both of these demographics could be favorable in supporting a new volleyball club. A younger, higher earning population, that loves the game is a great community to become a part of.