Project 2 - Data Transformation
Introduction
In this assignment, I prepare three data sets for downstream analysis. I chose the following data sets, identified in the Week 6 Discussion items: NSDUH prescription pain reliever misuse by state, world population history and projections from the US Census Bureau, and USDA fluid milk availability.
Generate CSVs
To begin, I added the three .CSV files, in the "wide" structure in which they appeared in their discussion threads, to my GitHub. They were downloaded directly from the sources linked on the discussion forum and include all of the information in the original data sets. The GitHub paths are stored below for future reference:
NSDUH_path <- "https://raw.githubusercontent.com/naomibuell/DATA607/main/STUSAB%20x%20PNRNMYR.csv"
pop_path <- "https://raw.githubusercontent.com/naomibuell/DATA607/main/world_pop_data.csv"
milk_path <- "https://raw.githubusercontent.com/naomibuell/DATA607/main/dyfluid.csv"
NSDUH Prescription Pain Reliever Misuse by State
The first data set I work with is from the 2018-19 National Survey on Drug Use and Health (NSDUH), conducted by the Substance Abuse and Mental Health Services Administration (SAMHSA). This data set shows the prevalence of pain reliever misuse in each state. It is untidy because it has three rows for each state: a row for the proportion misusing opioids (rc_pain_relievers_past_year_misuse = 1), the proportion not misusing opioids (rc_pain_relievers_past_year_misuse = 0), and the sum of these two (rc_pain_relievers_past_year_misuse = Overall). Splitting it up this way is redundant: if we know one of these proportions, we can calculate the other two (e.g., 1 - proportion misusing = proportion not misusing). I aim to tidy the data for analysis by creating one row for each state, where the observation records the prevalence of misuse in the past year.
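As a quick sketch of that redundancy (using a made-up prevalence value), the other two rows can always be recovered from the misuse row alone:
prev_misuse <- 0.042 # hypothetical share misusing
prev_not_misusing <- 1 - prev_misuse # recovers the "0" row: 0.958
prev_overall <- prev_misuse + prev_not_misusing # recovers the "Overall" row: 1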
Import, tidy, and transform NSDUH data
I read the .CSV file into R and use dplyr (with janitor's clean_names()) to tidy and transform the NSDUH data. Importantly, I make states the row-level unit of observation, which tidies the data and makes it easier to analyze.
NSDUH <- read_csv(NSDUH_path) |>
clean_names() |>
rename(
state = state_us_abbreviation,
prevalence = column_percent,
se = column_percent_se,
ci_lower = column_percent_ci_lower,
ci_upper = column_percent_ci_upper
) |>
filter(
rc_pain_relievers_past_year_misuse == "1 - Misused within the past year",
# keep only misuse prevalences
state != "Overall" # only keeping state values, dropping national
) |>
# removing extra variables
select(
-c(
starts_with("total"),
starts_with("row"),
rc_pain_relievers_past_year_misuse,
weighted_count,
count_se
)
  )
I also add in a region variable (from a built-in data set) and a KFF variable for whether a state has expanded Medicaid.
# Get regional data from built-in data set
regions <- data.frame(state.abb, state.region, state.name) |>
rename(state = state.abb,
region = state.region,
Location = state.name)
NSDUH_regions <-
left_join(NSDUH, regions, by = "state") |>
mutate(Location = if_else(state == "DC", "District of Columbia", Location))
# Get medicaid expansion status
medicaid_exp <-
read_csv(
"https://raw.githubusercontent.com/naomibuell/DATA607/main/medicaid_expansion.csv",
skip = 2
)
NSDUH_medicaid <-
left_join(NSDUH_regions, medicaid_exp, by = "Location") |>
clean_names() |>
  # recode expansion status to a binary indicator (1 = expanded, 0 = not) using its factor level
  mutate(expanded = abs(as.numeric(as.factor(status_of_medicaid_expansion_decision)) - 2)) |>
  select(-c("location", "status_of_medicaid_expansion_decision"))
Now that the data has been transformed with the appropriate variables, I do some final cleaning by revising data types.
NSDUH_tidy <- NSDUH_medicaid |>
mutate(
state = as.factor(state),
ci_lower = parse_double(ci_lower),
ci_upper = parse_double(ci_upper)
  )
Here is the final tidied data set for analysis:
head(NSDUH_tidy)
# A tibble: 6 × 7
state prevalence se ci_lower ci_upper region expanded
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 AK 0.042 0.0051 0.033 0.053 West 1
2 AL 0.053 0.008 0.04 0.072 South 0
3 AR 0.035 0.0051 0.026 0.046 South 1
4 AZ 0.04 0.0044 0.032 0.05 West 1
5 CA 0.037 0.003 0.032 0.044 West 1
6 CO 0.046 0.0105 0.029 0.071 West 1
Analysis
First, I compare and rank prevalence by state, checking for regional patterns.
NSDUH_tidy |>
arrange(desc(prevalence)) |>
  head(5)
# A tibble: 5 × 7
state prevalence se ci_lower ci_upper region expanded
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 AL 0.053 0.008 0.04 0.072 South 0
2 OR 0.051 0.0068 0.039 0.066 West 1
3 CO 0.046 0.0105 0.029 0.071 West 1
4 KY 0.046 0.0062 0.035 0.06 South 1
5 MT 0.045 0.0073 0.033 0.062 West 1
Alabama, Oregon, Colorado, Kentucky, and Montana have the five highest rates of pain reliever misuse.
NSDUH_tidy |>
  arrange(prevalence) |>
  head(5)
# A tibble: 5 × 7
state prevalence se ci_lower ci_upper region expanded
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 IL 0.024 0.0027 0.019 0.03 North Central 1
2 NE 0.024 0.0031 0.019 0.031 North Central 1
3 SD 0.024 0.0038 0.018 0.033 North Central 1
4 WY 0.026 0.0045 0.019 0.036 West 0
5 NY 0.027 0.0025 0.022 0.032 Northeast 1
Illinois, Nebraska, South Dakota, Wyoming, and New York are the five states with the lowest rates of pain reliever misuse.
The following plot illustrates distributions of misuse prevalence by region:
NSDUH_tidy |>
na.omit() |>
ggplot(aes(x = region, y = prevalence)) +
geom_boxplot() +
  coord_flip()
The Northeast generally has the lowest rates of pain reliever misuse (mean = 0.03) and the West generally has the highest (mean = 0.041). Alabama is the high outlier in the South. Wyoming is the low outlier in the West.
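The regional means quoted above can be recovered with a quick summary over the tidied data (a sketch):
NSDUH_tidy |>
  na.omit() |>
  group_by(region) |>
  summarise(mean_prevalence = mean(prevalence))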
Next, I determine whether there is a correlation between state pain reliever misuse and whether states have expanded Medicaid.
r = cor(NSDUH_tidy$expanded, NSDUH_tidy$prevalence)
r_sqrd = r^2
There is a very weak negative correlation between the binary indicator for Medicaid expansion and the prevalence of pain reliever misuse (r = -0.12): states with Medicaid expansion show a slight tendency toward lower misuse prevalence, and vice versa. Only about 1% of the variance in misuse prevalence (r^2 ≈ 0.014) can be explained by Medicaid expansion; the remaining 99% may be influenced by other factors not considered in this analysis.
NSDUH_tidy |>
  na.omit() |>
  ggplot(aes(x = expanded, y = prevalence, group = expanded)) +
  geom_boxplot() +
  coord_flip()
On average, states that have adopted Medicaid expansion have slightly lower rates of pain reliever misuse than those that have not (mean = 0.036 vs. 0.038).
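The two group means can be checked directly (a sketch over the tidied data):
NSDUH_tidy |>
  group_by(expanded) |>
  summarise(mean_prevalence = mean(prevalence, na.rm = TRUE))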
Conclusion
In conclusion, the Northeast had the lowest rates of pain reliever misuse and the West had the highest. There is little evidence of a correlation between whether states have expanded Medicaid and their rate of pain reliever misuse.
World Population History and Projections
The second data set I work with is from the world population clock maintained by the US Census Bureau. The data set shows population over time and by country, including projected populations up to the year 2050. This data set is untidy in the context of my analysis because each row holds multiple years' worth of population observations for a single country. I aim to tidy this data set by lengthening it, creating one row for each country and year (i.e., there will be one row for each possible combination of country and year).
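As a quick illustration of that reshape on a made-up example (toy country names and values), pivot_longer() turns year columns into one row per country-year:
toy <- tibble::tibble(country = c("A", "B"), `2020` = c(10, 20), `2021` = c(11, 21))
tidyr::pivot_longer(toy, cols = -country, names_to = "year", values_to = "population")
# note: year comes out as character here, just as the real data needs as.integer() below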
Import, tidy, and transform population data
I read the .CSV file into R and use tidyr and dplyr to tidy and transform the world population data. Importantly, I use the pivot_longer() function to convert the unit of observation to the country-year level.
world_pop <- read_csv(pop_path) |>
clean_names() |>
select(-x1) |>
pivot_longer(
cols = starts_with("x"),
names_to = "year",
values_to = "population"
) |>
mutate(
year = as.integer(str_replace_all(year, "[x]", ""))
  )
Here is my tidy data set for analysis:
head(world_pop)
# A tibble: 6 × 10
rank country country_code area land_area_km growth_rate world_percentage
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 China CHN 9706961 9424703. 0 0.179
2 1 China CHN 9706961 9424703. 0 0.179
3 1 China CHN 9706961 9424703. 0 0.179
4 1 China CHN 9706961 9424703. 0 0.179
5 1 China CHN 9706961 9424703. 0 0.179
6 1 China CHN 9706961 9424703. 0 0.179
# ℹ 3 more variables: density <dbl>, year <int>, population <dbl>
Analysis
Here I analyze population growth trends and determine the correlation between population density and population growth.
top_countries <- world_pop |>
filter(rank <= 15) |>
mutate(population = population/1000000000) # reporting population in billions
ggplot(top_countries, aes(x = year, y = population, color = country)) +
geom_smooth(se = FALSE) +
labs(y = "Population in Billions")# Get range of values for top 2 countries, China and India
top_stats <- top_countries |>
filter(country == "China" | country == "India") |>
summarise(min = round(min(population), 2),
max = round(max(population), 2),
pd = max(year) - min(year))
# Get range for remaining countries
other_stats <- top_countries |>
filter(country != "China" & country != "India") |>
summarise(min = round(min(population), 2),
            max = round(max(population), 2))
Countries' populations generally increase over time (with the exception of China), and India is soon expected to surpass China in population size. The top two countries, China and India, range from 0.7 billion to 1.67 billion people over the 70-year period; the other countries in the top 15 have drastically smaller populations, all below 0.38 billion.
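To pin down the crossover year in the projections, a quick sketch over the tidied world_pop (columns country, year, and population):
world_pop |>
  select(country, year, population) |>
  filter(country %in% c("China", "India")) |>
  pivot_wider(names_from = country, values_from = population) |>
  filter(India > China) |>
  slice_min(year) # first projected year India's population exceeds China's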
ggplot(world_pop, aes(x = density, y = population)) +
  geom_point()
The scatterplot of density vs. population shows no discernible pattern.
r = cor(world_pop$density, world_pop$population)
r_sqrd = r^2
Based on the correlation coefficient, there is a very weak negative correlation between population density (at the single point in time when the data was pulled) and population over time (r = -0.03). Only 0.07% of the variance in population can be explained by density. Since this is very close to zero, there is little evidence of a relationship between population and density in this data.
Conclusion
In conclusion, countries' populations tend to rise over time, both historically and in the projections, with the exception of China. We did not find evidence that population size is correlated with population density.
USDA National Food Availability Trends: Milk
The final data set I work with is from the USDA Economic Research Service and describes fluid milk availability in the U.S. since 1909. This data set is untidy because categories, or types, of milk are used as column headers (e.g., whole milk is coded as the name of a variable instead of as a level of a factor-type variable). I aim to tidy this data set by lengthening it, creating one observation for each year and type of milk.
Import, tidy, and transform milk data
I read the .CSV file into R and use dplyr and tidyr to tidy and transform the milk data. The file contains a number of columns calculating subtotals (e.g., a column for total plain and flavored whole milk) in addition to the grand total. I remove these total columns so I can work with just the raw numbers for each type of milk; totals and subtotals can be recalculated as needed in the analysis. Importantly, I transform the observation level to year and type of milk, i.e., there will be one row for each combination of year and milk type.
# read in data
milk_raw <- read_csv(milk_path, skip_empty_rows = TRUE, skip = 0) |>
clean_names() |>
mutate_all( ~ str_replace_all(str_replace_all(as.character(.), "-", ""), ",", "")) # removing dashes and commas
# get column names
milk_cols <- milk_raw |>
slice(1:6) |> # for the first 6 rows,
summarise_all(~ ifelse(all(is.na(.)), NA, paste(na.omit(.), collapse = " "))) |> # concatenating text down each column to create name for each column
  mutate_all(~ str_replace_all(as.character(.), c("  " = " "))) # collapse the double spaces left by the concatenation
colnames(milk_raw) <- milk_cols # assign column names and clean
milk_clean <- milk_raw |>
clean_names() |> # revise column names
rename(
population = u_s_population_july_11_millions,
whole_plain = total_plain,
whole_flavored = flavored2,
two_perc = lower_fat_and_skim_milk_plain_2_percent,
one_perc = x1_percent,
non_whole_flavored = flavored_other_than_whole2,
skim_buttermilk_consumed = skim_milk_and_buttermilk_consumed_where_produced,
eggnog = other_beverage_milk_eggnog3
) |>
  select(-c( # remove unnecessary columns
starts_with("na"),
starts_with("whole_milk_plain_consu"),
sales,
starts_with("total")
)) |>
mutate( # fix variable types
year = parse_double(year),
population = parse_double(population),
whole_plain = parse_double(whole_plain),
whole_flavored = parse_double(whole_flavored),
two_perc = parse_double(two_perc),
one_perc = parse_double(one_perc),
non_whole_flavored = parse_double(non_whole_flavored),
buttermilk = parse_double(buttermilk),
skim_milk = parse_double(skim_milk),
skim_buttermilk_consumed = parse_double(skim_buttermilk_consumed),
eggnog = parse_double(eggnog),
miscellaneous = parse_double(miscellaneous),
) |>
filter(!is.na(year)) |> # remove extra rows
pivot_longer( # transform level of observation
cols = -c(year, population),
names_to = "type",
values_to = "pounds_per_cap"
) |>
  mutate(type = as.factor(type)) # make milk type a factor; pounds remain unadjusted for population (totals, in millions)
Here is the tidied and transformed data set for analysis:
head(milk_clean)
# A tibble: 6 × 4
year population type pounds_per_cap
<dbl> <dbl> <fct> <dbl>
1 1909 90.5 whole_plain 20764
2 1909 90.5 whole_flavored 128
3 1909 90.5 two_perc NA
4 1909 90.5 one_perc NA
5 1909 90.5 non_whole_flavored 86
6 1909 90.5 buttermilk 221
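Because the subtotal columns were dropped, any subtotal can be recomputed from the tidy data. For example, a sketch recovering the whole-milk (plain plus flavored) subtotal by year:
milk_clean |>
  filter(str_starts(type, "whole")) |>
  group_by(year) |>
  summarise(whole_subtotal = sum(pounds_per_cap, na.rm = TRUE))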
Analysis
I analyze this data to better understand the shift between two types of product: whole milk and lower-fat milk. The plot below shows trends in the availability of each milk type, in millions of pounds:
# pounds available by milk type
ggplot(milk_clean, aes(x = year, y = pounds_per_cap, color = type)) +
geom_smooth(se = FALSE, size = .5) +
geom_point(size = .5) +
labs(y = "Million Pounds Per Capita Available")Whole (plain) milk sticks out as the best performer by far historically, surpassed only recently by two-percent.
The plot below compares trends in whole vs. lower-fat milk consumption, plotting the total pounds of whole and non-whole milk available:
# gen binary for whole milk vs. not for grouping and comparison
milk_analysis <- milk_clean |>
mutate(whole = str_starts(type, "whole"),
pounds_per_cap = pounds_per_cap / 1000) |> # switching from millions to billions
group_by(year, whole) |>
summarise(sum = sum(pounds_per_cap, na.rm = TRUE))
# plot pounds per capita: whole vs. not
ggplot(milk_analysis, aes(x = year, y = sum, color = whole)) +
geom_point(size = .5) +
geom_smooth() +
labs(y = "Billion Pounds Per Capita Available")# Get diffs:
milk_diffs <- milk_analysis |>
pivot_wider(names_from = whole,
values_from = sum) |>
clean_names() |>
mutate(diff = true - false, # get differences between whole and non-whole pounds per capita
         abs_diff = abs(diff))
As the graph above shows, whole milk has not always led skim and other lower-fat milks in availability. Whole milk's lead over non-whole milk peaked in 1957, at 44.65 billion more pounds available. In 1987, however, whole milk lost the lead, surpassed by skim and other lower-fat milks.
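Both milestones can be read off milk_diffs (a sketch; diff is whole minus non-whole, in billions of pounds):
milk_diffs |>
  ungroup() |>
  slice_max(diff) # year whole milk's lead peaked (1957)
milk_diffs |>
  ungroup() |>
  filter(diff < 0) |>
  slice_min(year) # first year non-whole milk led (1987)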
Conclusion
In conclusion, the decline in whole milk's popularity and the shift to lower-fat milk consumption began in the 1960s, with non-whole milks taking over the lead in 1987. Though non-whole milks had been increasing in availability since the '50s, in recent years (since 2010) their availability has been dropping. I wonder if this is due to the rise of alternative milks, like soy and nut milks, with folks substituting vegan milks for skim or lower-fat milks. Data on alternative milks are not available in this USDA data set, but it would be interesting to investigate potential substitution effects.