Project 2 - Data Transformation

Author

Naomi Buell

Introduction

In this assignment, I prepare different data sets for downstream analysis work. I chose the following three data sets identified in the Week 6 Discussion items:

  1. NSDUH Prescription Pain Reliever Misuse by State

  2. World Population History and Projections

  3. USDA National Food Availability Trends: Milk

Generate CSVs

To begin, I added the three .CSV files, in the “wide” structure in which they appeared in their discussion threads, to my GitHub. They were downloaded directly from the sources linked on the discussion forum and include all of the information in the original data sets. The GitHub paths are stored below for future reference:

NSDUH_path <- "https://raw.githubusercontent.com/naomibuell/DATA607/main/STUSAB%20x%20PNRNMYR.csv"
pop_path <- "https://raw.githubusercontent.com/naomibuell/DATA607/main/world_pop_data.csv"
milk_path <- "https://raw.githubusercontent.com/naomibuell/DATA607/main/dyfluid.csv"
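
Note that the code chunks throughout assume the required packages are loaded in a setup chunk that is not shown here; a minimal setup would be something like:

library(tidyverse) # readr, dplyr, tidyr, stringr, ggplot2
library(janitor)   # clean_names()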

NSDUH Prescription Pain Reliever Misuse by State

The first data set I work with is from the 2018-19 National Survey on Drug Use and Health (NSDUH) from the Substance Abuse and Mental Health Services Administration (SAMHSA). This data set shows the prevalence of pain reliever misuse in each state. It is untidy because it has three rows for each state: a row for the proportion misusing opioids (rc_pain_relievers_past_year_misuse = 1), the proportion not misusing opioids (rc_pain_relievers_past_year_misuse = 0), and the sum of these two (rc_pain_relievers_past_year_misuse = Overall). Splitting it up this way is redundant: if we know one of these proportions, we can calculate the other two (e.g., 1 - proportion misusing = proportion not misusing). I aim to tidy the data for analysis by creating one row for each state, where the observation records the prevalence of misuse in the past year.
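
The redundancy claim can be spot-checked against the raw file. The sketch below re-reads the CSV, uses the janitor-cleaned column names that appear in the next chunk, and assumes the total rows are labeled "Overall" in both the state and misuse columns:

read_csv(NSDUH_path) |>
  clean_names() |>
  filter(
    state_us_abbreviation != "Overall", # drop the national rows
    rc_pain_relievers_past_year_misuse != "Overall" # drop the per-state totals
  ) |>
  group_by(state_us_abbreviation) |>
  summarise(total = sum(column_percent)) |>
  filter(abs(total - 1) > 0.005) # rows returned here (allowing for rounding) would contradict the claim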

Import, tidy, and transform NSDUH data

I read the information from the .CSV file into R and use janitor and dplyr to tidy and transform the NSDUH data. Importantly, I make the state the row-level unit of observation to tidy my data and make it easier to analyze.

NSDUH <- read_csv(NSDUH_path) |>
  clean_names() |>
  rename(
    state = state_us_abbreviation,
    prevalence = column_percent,
    se = column_percent_se,
    ci_lower = column_percent_ci_lower,
    ci_upper = column_percent_ci_upper
  ) |>
  filter(
    rc_pain_relievers_past_year_misuse == "1 - Misused within the past year",
    # keep only misuse prevalences
    state != "Overall" # only keeping state values, dropping national
  ) |>       
  # removing extra variables
  select(
    -c(
      starts_with("total"),
      starts_with("row"),
      rc_pain_relievers_past_year_misuse,
      weighted_count,
      count_se
    )
  )

I also add a region variable (from a built-in data set) and a variable from KFF indicating whether each state has adopted Medicaid expansion, for use in the analysis.

# Get regional data from built in dataset
regions <- data.frame(state.abb, state.region, state.name) |>
  rename(state = state.abb,
         region = state.region,
         Location = state.name)

NSDUH_regions <-
  left_join(NSDUH, regions, by = "state") |>
  mutate(Location = if_else(state == "DC", "District of Columbia", Location))

# Get Medicaid expansion status from KFF
medicaid_exp <-
  read_csv(
    "https://raw.githubusercontent.com/naomibuell/DATA607/main/medicaid_expansion.csv",
    skip = 2
  )

NSDUH_medicaid <-
  left_join(NSDUH_regions, medicaid_exp, by = "Location") |>
  clean_names() |>
  # Recode the two-level expansion status to a 0/1 indicator: the first factor
  # level (expansion adopted) maps to |1 - 2| = 1; the second (not adopted) to |2 - 2| = 0
  mutate(expanded = abs(as.numeric(as.factor(status_of_medicaid_expansion_decision)) - 2)) |>
  select(-c("location", "status_of_medicaid_expansion_decision"))

Now that the data has been transformed with the appropriate variables, I do some final cleaning by revising data types.

NSDUH_tidy <- NSDUH_medicaid |> 
  mutate(
    state = as.factor(state),
    ci_lower = parse_double(ci_lower),
    ci_upper = parse_double(ci_upper)
    )

Here is the final tidied data set for analysis:

head(NSDUH_tidy)
# A tibble: 6 × 7
  state prevalence     se ci_lower ci_upper region expanded
  <fct>      <dbl>  <dbl>    <dbl>    <dbl> <fct>     <dbl>
1 AK         0.042 0.0051    0.033    0.053 West          1
2 AL         0.053 0.008     0.04     0.072 South         0
3 AR         0.035 0.0051    0.026    0.046 South         1
4 AZ         0.04  0.0044    0.032    0.05  West          1
5 CA         0.037 0.003     0.032    0.044 West          1
6 CO         0.046 0.0105    0.029    0.071 West          1

Analysis

First, I compare and rank prevalence by state, checking for regional patterns.

NSDUH_tidy |> 
  arrange(desc(prevalence)) |> 
  head(5)
# A tibble: 5 × 7
  state prevalence     se ci_lower ci_upper region expanded
  <fct>      <dbl>  <dbl>    <dbl>    <dbl> <fct>     <dbl>
1 AL         0.053 0.008     0.04     0.072 South         0
2 OR         0.051 0.0068    0.039    0.066 West          1
3 CO         0.046 0.0105    0.029    0.071 West          1
4 KY         0.046 0.0062    0.035    0.06  South         1
5 MT         0.045 0.0073    0.033    0.062 West          1

Alabama, Oregon, Colorado, Kentucky, and Montana are the five states with the highest rates of opioid misuse.

NSDUH_tidy |> 
  arrange(prevalence) |> 
  head(5)
# A tibble: 5 × 7
  state prevalence     se ci_lower ci_upper region        expanded
  <fct>      <dbl>  <dbl>    <dbl>    <dbl> <fct>            <dbl>
1 IL         0.024 0.0027    0.019    0.03  North Central        1
2 NE         0.024 0.0031    0.019    0.031 North Central        1
3 SD         0.024 0.0038    0.018    0.033 North Central        1
4 WY         0.026 0.0045    0.019    0.036 West                 0
5 NY         0.027 0.0025    0.022    0.032 Northeast            1

Illinois, Nebraska, South Dakota, Wyoming, and New York are the five states with the lowest rates of opioid misuse.

The following plot illustrates distributions of misuse prevalence by region:

NSDUH_tidy |>
  na.omit() |> 
  ggplot(aes(x = region, y = prevalence)) +
  geom_boxplot() +
  coord_flip()

The Northeast generally has the lowest rates of pain reliever misuse (mean = 0.03) and the West generally has the highest (mean = 0.041). Alabama is a high outlier in the South; Wyoming is a low outlier in the West.
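
The regional means cited above can be reproduced with a quick summary along these lines:

NSDUH_tidy |>
  na.omit() |>
  group_by(region) |>
  summarise(mean_prevalence = mean(prevalence)) |>
  arrange(mean_prevalence)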

Next, I determine whether there is a correlation between state pain reliever misuse and whether states have expanded Medicaid.

r <- cor(NSDUH_tidy$expanded, NSDUH_tidy$prevalence)
r_sqrd <- r^2

There is a very weak negative correlation between the binary indicator for whether a state has expanded Medicaid and the prevalence of pain reliever misuse (r = -0.12): with Medicaid expansion, there may be a slight tendency for the prevalence of misuse to decrease, and vice versa. Only about 1% of the variance in misuse prevalence can be explained by Medicaid expansion; the remaining 99% may be influenced by other factors not considered in this analysis.

NSDUH_tidy |>
  na.omit() |> 
  ggplot(aes(x = expanded, y = prevalence, group = expanded)) +
  geom_boxplot() +
  coord_flip()

On average, states that have adopted Medicaid expansion have slightly lower rates of pain reliever misuse than those that have not (mean = 0.036 vs. 0.038).
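
The group means cited here can be computed the same way, for example:

NSDUH_tidy |>
  na.omit() |>
  group_by(expanded) |>
  summarise(n = n(), mean_prevalence = mean(prevalence))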

Conclusion

In conclusion, the Northeast had the lowest rates of pain reliever misuse and the West had the highest. There is little evidence of a meaningful correlation between whether states have expanded Medicaid and their rate of pain reliever misuse.

World Population History and Projections

The second data set I work with is from the world population clock maintained by the US Census Bureau. The data set shows population over time by country, including projected populations up to the year 2050. This data set is untidy in the context of my analysis because it has multiple years' worth of population observations in a single row for each country. I aim to tidy this data set by lengthening it, creating one row for each combination of country and year.

Import, tidy, and transform population data

I read the information from the .CSV file into R and use tidyr and dplyr to tidy and transform the world population data. Importantly, I use the pivot_longer() function to convert the unit of observation to the country-year level.

world_pop <- read_csv(pop_path) |> 
  clean_names() |> 
  select(-x1) |> 
  pivot_longer(
    cols = starts_with("x"), 
    names_to = "year", 
    values_to = "population"
  ) |> 
  mutate(
    year = as.integer(str_replace_all(year, "[x]", ""))
  )

Here is my tidy data set for analysis:

head(world_pop)
# A tibble: 6 × 10
   rank country country_code    area land_area_km growth_rate world_percentage
  <dbl> <chr>   <chr>          <dbl>        <dbl>       <dbl>            <dbl>
1     1 China   CHN          9706961     9424703.           0            0.179
2     1 China   CHN          9706961     9424703.           0            0.179
3     1 China   CHN          9706961     9424703.           0            0.179
4     1 China   CHN          9706961     9424703.           0            0.179
5     1 China   CHN          9706961     9424703.           0            0.179
6     1 China   CHN          9706961     9424703.           0            0.179
# ℹ 3 more variables: density <dbl>, year <int>, population <dbl>
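
As a quick structural check on the pivot, each country should now contribute the same number of rows, one per year; a minimal sketch:

world_pop |>
  count(country) |>
  distinct(n) # a single value here means every country has one row per year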

Analysis

Here I analyze population growth trends and examine the correlation between population density and population size.

top_countries <- world_pop |> 
  filter(rank <= 15) |> 
  mutate(population = population/1000000000) # reporting population in billions

ggplot(top_countries, aes(x = year, y = population, color = country)) +
  geom_smooth(se = FALSE) + 
  labs(y = "Population in Billions")

# Get range of values for the top 2 countries, China and India
top_stats <- top_countries |>
  filter(country == "China" | country == "India") |>
  summarise(min = round(min(population), 2),
            max = round(max(population), 2),
            pd = max(year) - min(year)) # length of the period covered, in years

# Get range for remaining countries
other_stats <- top_countries |>
  filter(country != "China" & country != "India") |>
  summarise(min = round(min(population), 2),
            max = round(max(population), 2))

Countries' populations generally increase over time, with the exception of China. India is expected to surpass China in population size soon. The top two countries, China and India, range from 0.7 billion to 1.67 billion over the 70-year period, while the remaining countries in the top 15 have drastically smaller populations, all below 0.38 billion.

ggplot(world_pop, aes(x = density, y = population)) +
  geom_point()

The scatterplot of density vs. population shows no discernible pattern.

r <- cor(world_pop$density, world_pop$population)
r_sqrd <- r^2

Based on the correlation coefficient, there is a very weak negative correlation between population density (at the single point in time when the data were pulled) and population over time (r = -0.03). Only 0.07% of the variance in population can be explained by density. Since this is very close to zero, there is little evidence of a relationship between population and density in this data.

Conclusion

In conclusion, countries' populations generally rise over time, both historically and in the projections, with the exception of China. We did not find evidence that population size is correlated with population density.