Overview

For Project 2, we were asked to import three untidy datasets, make them tidy, and conduct some analysis with the newly transformed dataframes. For my project, I worked with the following:

Florida Demographics

This dataset was compiled to help assess people’s risk of losing their homes in Florida. Each row represents one Census tract and includes various observations such as income and poverty level by age, sex, and race, and median household costs by owner/renter status. I aimed to analyze this demographic data to understand which groups might be at higher risk.

Importing and Tidying the Housing Dictionary Dataset

First, I created a subset of the housing dictionary, using a regular expression to filter for rows beginning with “ACS”. The ACS provides an estimated value and a margin of error for each observation, flagged in the variable name with an “e” or “m” suffix respectively. I dropped all rows with the “m” flag so that the field_name column could be reused later when tidying the main dataset.

# Load dataset dictionary and
# create a subset of the housing dictionary
# removing duplicates and cleaning up field names
acs_dictionary <- read_csv("data_dictionary_1-FL.csv") |>
  select(field_name, dk_column_name, year) |>
  filter(str_detect(dk_column_name, "^ACS") & str_detect(field_name, "e$")) |>
  mutate(
    field_name = str_sub(field_name, 1, -2),
    dk_column_name = str_remove(dk_column_name, " - Estimate"),
    dk_column_name = str_remove(dk_column_name, "ACS - ")
  )
## Rows: 379 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): field_name, dk_column_name, description, dataset, data_source, sour...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Importing the Main Dataset

Next, I imported the main dataset and used the melt() function from the reshape2 package to convert it into a long dataframe: the indicator columns were gathered into a new variable column and a value column, one row per indicator per Census tract, keyed by the geoid column.

# Load dataset as df
housing_df <- read_csv("data_1-FL.csv")
## Rows: 1605 Columns: 385
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): county, economic_distress_pop_agg, economic_distress_simple_agg, ...
## dbl (381): geoid, geoid_year, state, state_fips_code, county_fips_code, b190...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# pivot long using geoid as the key
long_housing_df <- housing_df |>
  subset(select = -c( geoid_year : county_fips_code)) |>
  melt(id='geoid')
  
head(long_housing_df)
##        geoid    variable  value
## 1 1.2001e+10 b19083_001e 0.3928
## 2 1.2001e+10 b19083_001e 0.5466
## 3 1.2001e+10 b19083_001e 0.3641
## 4 1.2001e+10 b19083_001e 0.3662
## 5 1.2001e+10 b19083_001e 0.5686
## 6 1.2001e+10 b19083_001e 0.4269
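
As an aside, reshape2 is superseded, so the same reshape could be done with tidyr’s pivot_longer to keep the whole pipeline in the tidyverse. A minimal sketch (long_housing_alt is just an illustrative name; values_transform unifies the mixed column types, which the as.double() calls later suggest is needed):

# tidyverse alternative to the melt() call above (same shape)
long_housing_alt <- housing_df |>
  select(-(geoid_year:county_fips_code)) |>
  pivot_longer(
    cols = -geoid,
    names_to = "variable",
    values_to = "value",
    values_transform = as.character  # coerce all values to character
  )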

Tidying up the DF

First, I needed to limit my rows to observations related to the ACS, so I constructed a regular expression to filter for them. As with the data dictionary df, each ACS observation was split across two rows: 1) estimate and 2) margin of error. I first extracted the “e” and “m” flags into their own column, then used pivot_wider to pull the estimate and margin of error values into a single row per observation. I broke the geoid out into three columns based on the numerical codes representing the state, county, and tract ids. I then left_joined the dataframe with my dictionary dataframe and separated the dk_column_name from the dictionary into three columns, which will be used for grouping the data into distinct indicators to help visualize and analyze it.

# pattern to find ACS fields 
pattern_acs <- regex(
  r"(
    ^([bs]    # start with b or s 
    (\d{4})   # followed by 4 digits
    [\d]?     # optional fifth digit
    )|        # or
    (dp05)    # the dp05 table
    [_]+      # followed by one or more underscores
  )", 
  comments = TRUE
)
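
As a quick spot check, the pattern can be tested against a few sample variable names; the first two should match and the last should not:

# sanity-check pattern_acs: expect TRUE, TRUE, FALSE
str_detect(c("b19083_001e", "dp05_0001e", "geoid_year"), pattern_acs)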

acs <- long_housing_df |>
  filter(str_detect(variable, pattern_acs)) |>
  mutate(
    type = str_sub(variable, -1), 
    variable = str_sub(variable, 1, -2)
  ) |>
  pivot_wider(
    names_from = type,
    values_from = value,
    names_prefix = "type_"
  ) |>
  rename(c("estimate" = "type_e", "margin_of_error" = "type_m")) |> 
  filter(estimate != '-666666666') |>
  mutate(
    estimate = as.double(estimate),
    margin_of_error = as.double(margin_of_error)
  ) |>
  extract(
    col="geoid", 
    into=c("state","county","tract"),
    regex="([0-9]{2})([0-9]{3})([0-9]{4})"
  ) |>
  left_join(acs_dictionary, join_by(variable == field_name)) |>
  separate(
    col = dk_column_name,
    into = c('variable_1', 'variable_2', 'variable_3'),
    sep = " - "
  ) |>
  subset(select = -c(variable))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 47644 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(acs)
## # A tibble: 6 × 9
##   state county tract estimate margin_of_error variable_1   variable_2 variable_3
##   <chr> <chr>  <chr>    <dbl>           <dbl> <chr>        <chr>      <chr>     
## 1 12    001    0004     0.393          0.0724 Gini Index … <NA>       <NA>      
## 2 12    001    0008     0.547          0.0649 Gini Index … <NA>       <NA>      
## 3 12    001    0009     0.364          0.0878 Gini Index … <NA>       <NA>      
## 4 12    001    0012     0.366          0.0894 Gini Index … <NA>       <NA>      
## 5 12    001    0015     0.569          0.0634 Gini Index … <NA>       <NA>      
## 6 12    001    0015     0.427          0.0479 Gini Index … <NA>       <NA>      
## # ℹ 1 more variable: year <chr>
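
The separate() warning above is expected: single-level indicators such as the Gini Index have nothing to split into variable_2 and variable_3, so those cells are filled with NA. A quick tally shows how the indicator levels are distributed:

# count rows by how many indicator levels they carry
acs |>
  count(has_v2 = !is.na(variable_2), has_v3 = !is.na(variable_3))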

Visualizing Income Inequality

With the dataset transformed into tidy form, I can now easily visualize the data by indicator type. I can use the values stored in variable_1 to filter the data by indicator, create groups, and summarise the data. For example, I filtered the data by “Gini Index” and grouped by county in Florida. Once the data was grouped, I summarised the mean to calculate the average value across all census tracts in each county. I then visualized the data with a density plot to quickly get a sense of the distribution of averages.

# filter by indicator and plot a density plot
filterFor <- function(f, ntl) {
  df <- acs |>
    filter(str_detect(variable_1, f)) |> 
    group_by(county) |>
    summarise(
      avg_estimate = mean(estimate)
    )
  
  num_counties <- nrow(df)
  
  num_counties_under_ntl <- df |>
    filter(avg_estimate < ntl) |>
    nrow()
  
  peak <- df |>
    summarise(
      min = min(avg_estimate),
      max = max(avg_estimate),
      mean = mean(avg_estimate),
      median = median(avg_estimate),
      counties_under_ntl = num_counties_under_ntl,
      total_counties = num_counties
    )
  
  # build plot
  print(ggplot(df, aes(avg_estimate)) +
    geom_density() +
    geom_vline(xintercept = peak$mean) +
    labs(
      x ='', 
      y='',
      title=paste(f,'by County')
    ) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    ))
  
  peak
}

# plot the Gini Index
income_inequality <- filterFor("Gini Index", 0.417) 

income_inequality
## # A tibble: 1 × 6
##     min   max  mean median counties_under_ntl total_counties
##   <dbl> <dbl> <dbl>  <dbl>               <int>          <int>
## 1 0.348 0.547 0.434  0.429                  15             62
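
From the returned summary, the share of counties below the national benchmark falls out directly:

# share of FL counties with an average Gini below the US index
income_inequality$counties_under_ntl / income_inequality$total_counties
## [1] 0.2419355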

Visualizing Housing Costs

I created a function to filter and group my dataframe by multiple variables, and used it to plot housing costs by homeownership status (owners and renters). Overlaying the two densities makes the groups easy to compare.

variable_by <- function(x, y, f) { 
  df <- acs |>
    filter(str_detect(variable_1, x) & variable_2 == y) |> 
    group_by(county, variable_3) |>
    summarise(
      avg_estimate = mean(estimate),
      .groups = "drop"  # drop grouping and silence the summarise() message
    )

 varplot <- ggplot(df, aes(avg_estimate, fill=variable_3)) +
    geom_density(alpha  = 0.4) +
    labs(
      x ='', 
      y='',
      title=paste(x, y)
    ) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.title.y=element_blank(),
      #axis.text.y=element_blank(),
      axis.ticks.y=element_blank(),
      legend.position="bottom"
    )
 
  if (f == "facet") {
    varplot <- varplot + facet_wrap(~variable_3)
  }

   varplot 
}

# plot housing costs by ownership status
variable_by("Median Monthly Housing Cost","By ownership status", "")

Visualizing Income and Poverty Level

I applied a similar method to generate density plots for Median Household Income and Below Poverty Level, breaking the findings out by race, age, and sex. Having created a function, I can now just call variable_by, passing in the name of the indicator I want to explore and the variable to group by, to output results from my tidy dataframe.

# plot poverty level by race, age, and sex
variable_by("Below Poverty", "By Race", "facet")

variable_by("Below Poverty", "By Age", "facet")

variable_by("Below Poverty", "By Sex", "")

# plot income by race and age of householder
variable_by("Median Household Income", "By Race of Householder", "facet")

variable_by("Median Household Income", "By Age of Householder", "facet")

# earnings by sex
variable_by("Median Earnings", "By Sex", "")

Conclusion

We see that the mean Gini Index in Florida was 0.4338199 in 2020, higher than the US Gini index of 0.417 when calculated using income after taxes and transfers (Source). However, 15 out of 62 counties had Gini indices below the national index, with the lowest county index at 0.3477. Renters also appear to have higher monthly housing costs on average than homeowners.

Black or African American residents appear to have higher representation among groups below the poverty level, though they had comparable income levels. Not surprisingly, people aged 15-24 and 65+ had lower household incomes, suggesting that minors, young adults, and seniors were more likely to face economic hardship.


Marriage Dataset

The marriage dataset was suggested by DC. In their post, they suggested the following analysis:

Using this dataset I would analyze the correlation between the different educational, geographical and social economic status variables to determine if there are certain factors that contribute to the instances of marriages / lack of marriages.

After importing the CSV, I used pivot_longer to transform the table from a wide to a long format. I used a regular expression to split the column names into “demographic” and “age”, excluding the year and date columns, which serve as my keys. I used mutate to create a new category column, assigning its value based on the demographic column, and recast the demographic column as a factor. Finally, I used str_detect to filter out all observations that disaggregate data by child status, leaving the dataframe marriage_demographics_df of totals regardless of child status.

messy_marriage_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/both_sexes.csv")
## New names:
## • `` -> `...1`
## Rows: 17 Columns: 75
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (74): ...1, year, all_2534, HS_2534, SC_2534, BAp_2534, BAo_2534, GD_25...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# pivot data to long format, converting cols to demographic and age col
marriage_pivoted_df <- messy_marriage_data |>
  pivot_longer(
    cols = !(year|date),
    names_to = c("demographic", "age"), 
    names_pattern = "([A-Za-z_]+)_([0-9]+)$",
    values_to = "value"
  ) |>
  mutate(age = ifelse(is.na(age), age, paste(substr(age, 1,2),substr(age, 3,4), sep="-")),
    category = "education",
    category = ifelse(str_detect(demographic,"poor|mid|rich"), "economic status", category),
    category = ifelse(str_detect(demographic,"Midwest|South|Mountain|Pacific"), "region", category),
    category = ifelse(str_detect(demographic,"White|Black|Hisp|Asian"), "race", category),
    demographic = as.factor(demographic)
  ) |>
  relocate(category, .before = demographic) 

# make df of all  demographics that don't include child status
marriage_demographics_df <- marriage_pivoted_df   |>
  filter(!str_detect(demographic,"n?o?kids"))

head(marriage_demographics_df)
## # A tibble: 6 × 6
##    year date       category  demographic age    value
##   <dbl> <date>     <chr>     <fct>       <chr>  <dbl>
## 1  1960 1960-01-01 education all         25-34  0.123
## 2  1960 1960-01-01 education HS          25-34  0.110
## 3  1960 1960-01-01 education SC          25-34  0.152
## 4  1960 1960-01-01 education BAp         25-34  0.239
## 5  1960 1960-01-01 education BAo         25-34  0.239
## 6  1960 1960-01-01 education GD          25-34 NA
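
Since the category column is assigned by pattern-matching on the demographic names, it is worth verifying the mapping before plotting. A quick distinct() lists which demographics landed in each category:

# verify the demographic -> category assignments
marriage_demographics_df |>
  distinct(category, demographic) |>
  arrange(category)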

Visualizing Marriage Statistics

Now that the data is in tidy format, I can quickly conduct some exploratory analysis. I created a function to filter the dataframe by category and visualize the data with facets. The function takes two arguments: 1) the category to filter by and 2) a vector of strings giving the order in which to display the demographic levels.

I visualized the percentage of couples married since 1960, broken out by age, for economic status, education, race, and region. These were plotted as lines for each of the three age groups in the dataset: a) 25-34, b) 35-44, and c) 45-54. I used geom_smooth to add a trend line with confidence intervals.

# filters df by category and plots as facet
plot_by_cat <- function(cat, forder) {
  
  df <- marriage_demographics_df |>
    filter(str_detect(category, cat)) |>
    mutate(
      demographic = factor(demographic, levels = forder)
    )
  
  
  cat_uc <- paste(toupper(substr(cat, 1, 1)), substr(cat, 2, str_length(cat)), sep="")
  
  ggplot(df, aes(x=date, y=value)) +
    geom_line(aes(color=age)) +
    facet_wrap(~demographic) +
    geom_smooth(method=lm) + 
    scale_y_continuous(label=scales::percent) +
    labs(title = paste(cat_uc, " vs Age", sep="")) +
    theme(plot.title = element_text(hjust = 0.5))
  }

plot_by_cat("economic status", c('poor', 'mid', 'rich'))
## `geom_smooth()` using formula = 'y ~ x'

plot_by_cat("race", c('Black','White','Hisp'))
## `geom_smooth()` using formula = 'y ~ x'

plot_by_cat("education", c('all','NE', 'GD', 'HS', 'SC', 'BAo' , 'BAp', 'MA'))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_smooth()`).

plot_by_cat("region", c("Pacific","Mountain","Midwest","South"))
## `geom_smooth()` using formula = 'y ~ x'

Conclusion

Not surprisingly, couples aged 25-34 married more often than couples in other age groups across all four categories we explored: economic status, education, race, and region. Marriage rates appear to be trending upward, possibly due to population changes over the last 60 years.

Differences by category

While the percentages for all three economic statuses were fairly similar in 1960, couples categorized as poor married at higher rates over the last 20 years than other groups, particularly at ages 25-34. Couples categorized as rich saw a decline or slowdown in marriage rates from 1960 to ~2005, but have seen an uptick since then. As with economic status, rates among Black, White, and Hispanic couples were fairly similar in 1960; while rates among White and Hispanic couples have remained similar, Black couples now outpace the other groups by nearly 20%. Marriage rates appear fairly similar across all eight education levels and all four regions examined.


Spotify Most Streamed Songs of 2023

Separating the artists from track information

After importing the dataset, I created a separate artist table:

1. Created an id column for each track based on the row number.
2. Created a subset of track_ids and artists. Since some tracks credit multiple artists, I split these into their own rows using separate_longer_delim.
3. From the keys dataframe, created a new dataframe of unique artists, each given their own id number.
4. Left joined the artist df to the keys df by artist name, then dropped the artists column to get my final keys df.

Finally, I created a tracks subset from the original that contains single-observation columns such as Beats per Minute (bpm) and key. All other columns, which spread multiple observations across them, were tidied for analysis in the next section.

# After reading in the file 
# Add row numbers as id column
sptfy_messy_data <- read_csv("spotify-2023.csv", locale = locale(encoding = "Latin1"))  |>
  mutate(track_id = row_number()) |>
  relocate(track_id, .before = track_name) |>
  rename(c('artists' = `artist(s)_name`)) 
## Rows: 953 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): track_name, artist(s)_name, streams, key, mode
## dbl (17): artist_count, released_year, released_month, released_day, in_spot...
## num  (2): in_deezer_playlists, in_shazam_charts
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# create joiner table 
sptfy_artists_key <- sptfy_messy_data |>
  subset(select=c(track_id, artists)) |>
  separate_longer_delim(cols = c(artists), delim="," ) |> 
  mutate(artists = str_trim(artists))

# make distinct list of artists and give them their own id
sptfy_artists_by_name <- sptfy_artists_key |>
  distinct(artists) |>
  mutate(artist_id = row_number()) |>
  relocate(artist_id, .before = artists)

# join artist list to keys to bring in the unique artist id
sptfy_artists_key <- sptfy_artists_key |>
  left_join(sptfy_artists_by_name, join_by(artists == artists)) |>
  subset(select=-c(artists))

# create subset of track info
sptfy_tracks <- sptfy_messy_data |>
  subset(select=c(track_id,track_name, artist_count:released_day, streams, bpm, key, mode))

Which key does Bad Bunny prefer to use for his songs?

Now that we have the data in tidy form, I can filter it easily. I created a function that left joins sptfy_tracks to sptfy_artists_by_name through the joining table and filters by artist. In this example I filtered by “Bad Bunny” to plot a bar chart showing the number of songs in the dataset per musical key. The chart shows that Bad Bunny preferred the key of C#.

get_artist_df <- function(name) {
  sptfy_artists_by_name |>
    filter(artists == name) |>
    left_join(sptfy_artists_key, join_by(artist_id == artist_id)) |>
    left_join(sptfy_tracks, join_by(track_id == track_id))
}

ggplot(subset(get_artist_df(c("Bad Bunny")), !is.na(key)), aes(y=key)) +
  geom_bar()
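
The same result can be read off numerically by counting tracks per key:

# tally Bad Bunny's tracks by key, most common first
get_artist_df("Bad Bunny") |>
  filter(!is.na(key)) |>
  count(key, sort = TRUE)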

Addressing columns with Multiple Observations

This next section uses pivot_longer to transform the remaining columns into groupable variables with their respective values. From this long df, I subsetted a new sptfy_metrics df, filtering for variables that have a percent sign (%) in their name. These will be used for comparing danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness.

# create subset of non-track info
# and pivot remaining columns into
# format id, variable, values 
sptfy_long_df <- sptfy_messy_data |>
  subset(select=-c(track_name:released_day, streams, bpm, key, mode)) |>
  pivot_longer(
    cols = !track_id,
    names_to = 'variable',
    values_to = 'value',
    values_transform = list(value=as.character) # convert to chars
  )

# subset spotify metrics where column names
# match percent regex
sptfy_metrics <- sptfy_long_df |>
  filter(str_detect(variable, "([A-Za-z]+)_%")) |>
  mutate(
    variable = substr(variable, 0, str_length(variable)-2 ),
    value = as.integer(value) # convert value back to int
  )

head(sptfy_metrics)
## # A tibble: 6 × 3
##   track_id variable         value
##      <int> <chr>            <int>
## 1        1 danceability        80
## 2        1 valence             89
## 3        1 energy              83
## 4        1 acousticness        31
## 5        1 instrumentalness     0
## 6        1 liveness             8

Exploratory Analysis

Because my data is now in tidy format, I can group it and perform summary calculations easily. In the following example, I use a function that groups any long dataframe by variable and computes each metric’s mean across all tracks. I used a bar chart and a boxplot to get two different views of the means and distributions of the seven Spotify-specific metrics.

# func: get the mean of each variable in a long df
get_variable_mean <- function(df) {
  df |>
    group_by(variable) |>
    summarise(
      mean_value = mean(value)
    )
}

# plot the means of the variables      
ggplot(get_variable_mean(sptfy_metrics), aes(x=mean_value, y=variable)) +
  geom_bar(stat = "identity") +
  labs(title='Variable Means')

# boxplot of the metric distributions
ggplot(sptfy_metrics, aes(y=variable, x=value)) +
  geom_boxplot() +
  labs(title='Boxplot of Variables')

Is there a correlation between BPMs and Spotify’s metrics?

In this example, I try to see if there is a correlation between Beats per Minute and the seven Spotify Metrics.

# helper: join one column from sptfy_tracks onto a long metrics df
add_column_from_tracks <- function(df, col) {
  df_cols <- names(df) |>
    append(col)
  new_df <- df |>
    left_join(sptfy_tracks, join_by(track_id == track_id)) |>
    subset(select = df_cols)
}

sptfy_metrics_with_bpm <- add_column_from_tracks(sptfy_metrics, "bpm")

ggplot(sptfy_metrics_with_bpm, aes(x=value, y=bpm)) +
  geom_point()  + 
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  facet_wrap(~variable)  +
  labs(title='Comparing BPM vs ___') +
  geom_smooth(method=lm) 
## `geom_smooth()` using formula = 'y ~ x'

ggplot(sptfy_metrics_with_bpm, aes(x=value, y=bpm)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~variable) +
  labs(title='Comparing Log BPM vs Log ___') +
  geom_smooth(method=lm)
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 926 rows containing non-finite outside the scale range
## (`stat_smooth()`).
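
To put a number on what the scatterplots suggest, a per-metric Pearson correlation between bpm and each metric can be computed directly from the long dataframe:

# correlation between bpm and each metric, ignoring missing pairs
sptfy_metrics_with_bpm |>
  group_by(variable) |>
  summarise(cor_with_bpm = cor(value, bpm, use = "complete.obs"))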

Conclusion

Unfortunately, I couldn’t see much evidence of a correlation between Beats per Minute and the seven Spotify metrics. Further understanding of the metrics’ definitions might be required to reveal patterns between different variables.