Goal: In this data analysis report, you will estimate which region of the United States had the highest rate of alcohol consumption among people aged 12 and older in 2018. You will also answer a couple of extra questions using the same dataset. To do this, you will go through all the steps of data analysis: data acquisition, cleaning, preprocessing, modeling/description, and creating a graph to communicate your findings. The only step that is not required is inference; inference is needed when we want to extend from our data to new scenarios (e.g., studying just a few hundred people to understand a big population of millions). In this case, we don't need to do formal inference – we will assume that our data perfectly represent what we care about.
Each step builds on previous steps. However, if you get stuck on a step, you are not completely at a loss! At multiple points in this document, you can choose to load a correct version of what your data frame should look like at that point. If you are stuck on an earlier section, you can load the new, correct data frame and continue from there. To complete this assignment, work through the rest of this document and, whenever you see an Exercise, fill in the …s in the code chunks as requested. Make sure to run all code chunks from top to bottom. No text answers are required on this assignment.
Set up the libraries required for visualizations.
#If you get an error saying dplyr or ggplot2 has not been found, then you have not installed them yet! To install them, "uncomment" the next two lines by deleting the hash/pound symbols and try again! Warnings are OK!
#install.packages("dyplr")
#install.packages("ggplot2")
This code gets both of the necessary datasets (it requires Internet access!). No need to edit it – we're not worrying too much about how it works, but it is loading data files from a GitHub repository set up for the class.
NSDUH <- read.csv("https://raw.githubusercontent.com/cgmr-asu/NEU290/main/NSDUH.csv")
CDC_Geographic <- read.csv("https://raw.githubusercontent.com/cgmr-asu/NEU290/main/CDC_Geographic.csv")
There are many ways to check the datasets out and get a sense of what is in them! One way is to "glimpse" them. Go ahead and "glimpse" both data frames to see what the data look like (more details in the notes document).
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(NSDUH)
## Rows: 867
## Columns: 14
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "C…
## $ Year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, …
## $ Pop12_17 <int> 380805, 69400, 485521, 232986, 3140739, 385648, 295157, 6647…
## $ Pop18_25 <int> 499453, 62791, 602265, 302029, 3919577, 493921, 323120, 8838…
## $ Pop26Plus <int> 2812905, 368460, 3329482, 1687337, 21392421, 2798960, 223576…
## $ Alc12_17 <dbl> 0.150033, 0.158988, 0.186864, 0.167065, 0.152791, 0.206330, …
## $ Alc18_25 <dbl> 0.509551, 0.598311, 0.584521, 0.537894, 0.555278, 0.688926, …
## $ Alc26Plus <dbl> 0.372703, 0.559151, 0.532858, 0.393372, 0.553801, 0.627921, …
## $ Mar1_17 <dbl> 0.063662, 0.110781, 0.077371, 0.079670, 0.076595, 0.098228, …
## $ Mar18_25 <dbl> 0.124672, 0.239907, 0.151780, 0.164807, 0.160916, 0.216673, …
## $ Mar26Plus <dbl> 0.025967, 0.071362, 0.036657, 0.033891, 0.045736, 0.060119, …
## $ Tob12_17 <dbl> 0.166578, 0.163918, 0.151071, 0.195714, 0.092235, 0.174446, …
## $ Tob18_25 <dbl> 0.451976, 0.484270, 0.397968, 0.558846, 0.351353, 0.499545, …
## $ Tob26Plus <dbl> 0.330659, 0.304220, 0.309969, 0.391210, 0.220699, 0.298565, …
glimpse(CDC_Geographic)
## Rows: 51
## Columns: 3
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Co…
## $ Division <chr> "East South Central", "Pacific", "Mountain", "West South Cent…
## $ Region <chr> "South", "West", "West", "South", "West", "West", "Northeast"…
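By the way, glimpse() isn't the only way to look over a data frame. If you'd like a different angle (completely optional), here is a quick sketch using two base R functions:

head(NSDUH)     # shows the first six rows
summary(NSDUH)  # quick numeric summaries of each column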
The NSDUH data frame contains information about drug use across the US states and DC. The CDC_Geographic data frame contains a guide to the geographic "divisions" that the CDC uses to categorize different US states and territories. Use a data wrangling function to "join" these data frames together by adding the CDC geographic data onto the NSDUH data. Remember, you can go back and check the skills practice and online textbook! If you "mess up" the NSDUH data frame while trying out code, you can always reload the data by rerunning the data acquisition code above and starting over!
NSDUH <- NSDUH %>%
left_join(CDC_Geographic, by = "State")
Have a look at the data! It should now list the divisions and regions for states.
glimpse(NSDUH)
## Rows: 867
## Columns: 16
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "C…
## $ Year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, …
## $ Pop12_17 <int> 380805, 69400, 485521, 232986, 3140739, 385648, 295157, 6647…
## $ Pop18_25 <int> 499453, 62791, 602265, 302029, 3919577, 493921, 323120, 8838…
## $ Pop26Plus <int> 2812905, 368460, 3329482, 1687337, 21392421, 2798960, 223576…
## $ Alc12_17 <dbl> 0.150033, 0.158988, 0.186864, 0.167065, 0.152791, 0.206330, …
## $ Alc18_25 <dbl> 0.509551, 0.598311, 0.584521, 0.537894, 0.555278, 0.688926, …
## $ Alc26Plus <dbl> 0.372703, 0.559151, 0.532858, 0.393372, 0.553801, 0.627921, …
## $ Mar1_17 <dbl> 0.063662, 0.110781, 0.077371, 0.079670, 0.076595, 0.098228, …
## $ Mar18_25 <dbl> 0.124672, 0.239907, 0.151780, 0.164807, 0.160916, 0.216673, …
## $ Mar26Plus <dbl> 0.025967, 0.071362, 0.036657, 0.033891, 0.045736, 0.060119, …
## $ Tob12_17 <dbl> 0.166578, 0.163918, 0.151071, 0.195714, 0.092235, 0.174446, …
## $ Tob18_25 <dbl> 0.451976, 0.484270, 0.397968, 0.558846, 0.351353, 0.499545, …
## $ Tob26Plus <dbl> 0.330659, 0.304220, 0.309969, 0.391210, 0.220699, 0.298565, …
## $ Division <chr> "East South Central", "Pacific", "Mountain", "West South Cen…
## $ Region <chr> "South", "West", "West", "South", "West", "West", "Northeast…
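If you want to double-check that the join matched every state (an optional sanity check, not part of the assignment), you can look for rows where the CDC columns came back empty – if everything matched, this should return zero rows:

# Optional check: rows where the join did not find a matching Region
NSDUH %>%
  filter(is.na(Region))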
For this analysis, we only want data from the year 2018. The following code should use a data wrangling tool to make sure NSDUH only has 2018 data.
NSDUH <- NSDUH %>%
filter(Year == 2018)
glimpse(NSDUH)
## Rows: 51
## Columns: 16
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "C…
## $ Year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ Pop12_17 <int> 371033, 57444, 562546, 237382, 3017701, 433719, 268807, 6932…
## $ Pop18_25 <int> 505103, 71183, 763917, 312348, 4174962, 589317, 383297, 9252…
## $ Pop26Plus <int> 3225318, 455906, 4715857, 1949363, 25917724, 3782655, 240814…
## $ Alc12_17 <dbl> 0.082234, 0.087305, 0.079122, 0.084060, 0.088679, 0.106916, …
## $ Alc18_25 <dbl> 0.458341, 0.534787, 0.490376, 0.486640, 0.544955, 0.624899, …
## $ Alc26Plus <dbl> 0.476373, 0.570019, 0.538643, 0.440504, 0.537543, 0.653031, …
## $ Mar1_17 <dbl> 0.057989, 0.085400, 0.060367, 0.058163, 0.088481, 0.097544, …
## $ Mar18_25 <dbl> 0.183446, 0.307125, 0.207976, 0.158035, 0.264822, 0.343871, …
## $ Mar26Plus <dbl> 0.074906, 0.162506, 0.100118, 0.076023, 0.119085, 0.156205, …
## $ Tob12_17 <dbl> 0.053078, 0.065626, 0.029289, 0.054730, 0.025569, 0.041461, …
## $ Tob18_25 <dbl> 0.306354, 0.258003, 0.214940, 0.308876, 0.170364, 0.262379, …
## $ Tob26Plus <dbl> 0.313648, 0.248320, 0.203073, 0.317618, 0.149708, 0.216631, …
## $ Division <chr> "East South Central", "Pacific", "Mountain", "West South Cen…
## $ Region <chr> "South", "West", "West", "South", "West", "West", "Northeast…
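A quick, optional way to confirm that only 2018 rows survived the filter:

# Optional check: which years remain in the data frame?
NSDUH %>%
  distinct(Year)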
We ultimately want to look at the overall rate of alcohol consumption in states and regions. Our dataset only includes rates for specific age ranges, so we'll need to do some math and transformations to find the overall rate! This will take a couple of preprocessing steps to create a variable called AlcoholRate.
Create a new variable in NSDUH called "PopTotal" using a data wrangling function. PopTotal should be the total population of people aged 12 and up in each state, adding together the 12-17 year-olds, the 18-25 year-olds, and the 26+ year-olds. (Hint: In 2018, Alabama had 4,101,454 people aged 12 or above.)
NSDUH <- NSDUH %>%
mutate(PopTotal = Pop12_17 + Pop18_25 + Pop26Plus)
glimpse(NSDUH)
## Rows: 51
## Columns: 17
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "C…
## $ Year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ Pop12_17 <int> 371033, 57444, 562546, 237382, 3017701, 433719, 268807, 6932…
## $ Pop18_25 <int> 505103, 71183, 763917, 312348, 4174962, 589317, 383297, 9252…
## $ Pop26Plus <int> 3225318, 455906, 4715857, 1949363, 25917724, 3782655, 240814…
## $ Alc12_17 <dbl> 0.082234, 0.087305, 0.079122, 0.084060, 0.088679, 0.106916, …
## $ Alc18_25 <dbl> 0.458341, 0.534787, 0.490376, 0.486640, 0.544955, 0.624899, …
## $ Alc26Plus <dbl> 0.476373, 0.570019, 0.538643, 0.440504, 0.537543, 0.653031, …
## $ Mar1_17 <dbl> 0.057989, 0.085400, 0.060367, 0.058163, 0.088481, 0.097544, …
## $ Mar18_25 <dbl> 0.183446, 0.307125, 0.207976, 0.158035, 0.264822, 0.343871, …
## $ Mar26Plus <dbl> 0.074906, 0.162506, 0.100118, 0.076023, 0.119085, 0.156205, …
## $ Tob12_17 <dbl> 0.053078, 0.065626, 0.029289, 0.054730, 0.025569, 0.041461, …
## $ Tob18_25 <dbl> 0.306354, 0.258003, 0.214940, 0.308876, 0.170364, 0.262379, …
## $ Tob26Plus <dbl> 0.313648, 0.248320, 0.203073, 0.317618, 0.149708, 0.216631, …
## $ Division <chr> "East South Central", "Pacific", "Mountain", "West South Cen…
## $ Region <chr> "South", "West", "West", "South", "West", "West", "Northeast…
## $ PopTotal <int> 4101454, 584533, 6042320, 2499093, 33110387, 4805691, 306024…
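To compare your result against the hint above (optional), you can pull out Alabama's total directly:

# Optional check: does Alabama's PopTotal match the hint (4,101,454)?
NSDUH %>%
  filter(State == "Alabama") %>%
  select(State, PopTotal)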
Add a new variable to NSDUH called "AlcoholTotal". This should be an estimate of the total number of people aged 12 and older (from all three age groups) who consumed alcohol last month. The data wrangling function may look similar to Exercise 3, but with additional math steps. As a hint, to estimate the number of 12-17 year-olds who drink alcohol, you can multiply the total number of 12-17 year-olds by the rate of alcohol use for 12-17 year-olds. Note: Some numbers might include decimals – it doesn't really make sense to have .5 or .3 of a person, but for these estimates it's OK that the counts aren't always whole numbers! They're imperfect estimates!
NSDUH <- NSDUH %>%
mutate(AlcoholTotal = (Pop12_17 * Alc12_17) + (Pop18_25 * Alc18_25) + (Pop26Plus * Alc26Plus))
glimpse(NSDUH)
## Rows: 51
## Columns: 18
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ Year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
## $ Pop12_17 <int> 371033, 57444, 562546, 237382, 3017701, 433719, 268807, 6…
## $ Pop18_25 <int> 505103, 71183, 763917, 312348, 4174962, 589317, 383297, 9…
## $ Pop26Plus <int> 3225318, 455906, 4715857, 1949363, 25917724, 3782655, 240…
## $ Alc12_17 <dbl> 0.082234, 0.087305, 0.079122, 0.084060, 0.088679, 0.10691…
## $ Alc18_25 <dbl> 0.458341, 0.534787, 0.490376, 0.486640, 0.544955, 0.62489…
## $ Alc26Plus <dbl> 0.476373, 0.570019, 0.538643, 0.440504, 0.537543, 0.65303…
## $ Mar1_17 <dbl> 0.057989, 0.085400, 0.060367, 0.058163, 0.088481, 0.09754…
## $ Mar18_25 <dbl> 0.183446, 0.307125, 0.207976, 0.158035, 0.264822, 0.34387…
## $ Mar26Plus <dbl> 0.074906, 0.162506, 0.100118, 0.076023, 0.119085, 0.15620…
## $ Tob12_17 <dbl> 0.053078, 0.065626, 0.029289, 0.054730, 0.025569, 0.04146…
## $ Tob18_25 <dbl> 0.306354, 0.258003, 0.214940, 0.308876, 0.170364, 0.26237…
## $ Tob26Plus <dbl> 0.313648, 0.248320, 0.203073, 0.317618, 0.149708, 0.21663…
## $ Division <chr> "East South Central", "Pacific", "Mountain", "West South …
## $ Region <chr> "South", "West", "West", "South", "West", "West", "Northe…
## $ PopTotal <int> 4101454, 584533, 6042320, 2499093, 33110387, 4805691, 306…
## $ AlcoholTotal <dbl> 1798475.4, 302958.0, 2959279.7, 1030657.6, 16474664.2, 28…
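As a quick, optional sanity check, no state's AlcoholTotal should ever be larger than its PopTotal (you can't have more drinkers than people). This sketch should come back empty:

# Optional check: any states where the estimated drinkers exceed the population?
NSDUH %>%
  filter(AlcoholTotal > PopTotal)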
Create a new variable in NSDUH called "AlcoholRate". This variable should have the rate of alcohol consumption for all people aged 12 and older in each state. Recall that the overall rate of consumption is the total number of people who drink alcohol divided by the total population.
NSDUH <- NSDUH %>%
mutate(AlcoholRate = AlcoholTotal / PopTotal)
glimpse(NSDUH)
## Rows: 51
## Columns: 19
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ Year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
## $ Pop12_17 <int> 371033, 57444, 562546, 237382, 3017701, 433719, 268807, 6…
## $ Pop18_25 <int> 505103, 71183, 763917, 312348, 4174962, 589317, 383297, 9…
## $ Pop26Plus <int> 3225318, 455906, 4715857, 1949363, 25917724, 3782655, 240…
## $ Alc12_17 <dbl> 0.082234, 0.087305, 0.079122, 0.084060, 0.088679, 0.10691…
## $ Alc18_25 <dbl> 0.458341, 0.534787, 0.490376, 0.486640, 0.544955, 0.62489…
## $ Alc26Plus <dbl> 0.476373, 0.570019, 0.538643, 0.440504, 0.537543, 0.65303…
## $ Mar1_17 <dbl> 0.057989, 0.085400, 0.060367, 0.058163, 0.088481, 0.09754…
## $ Mar18_25 <dbl> 0.183446, 0.307125, 0.207976, 0.158035, 0.264822, 0.34387…
## $ Mar26Plus <dbl> 0.074906, 0.162506, 0.100118, 0.076023, 0.119085, 0.15620…
## $ Tob12_17 <dbl> 0.053078, 0.065626, 0.029289, 0.054730, 0.025569, 0.04146…
## $ Tob18_25 <dbl> 0.306354, 0.258003, 0.214940, 0.308876, 0.170364, 0.26237…
## $ Tob26Plus <dbl> 0.313648, 0.248320, 0.203073, 0.317618, 0.149708, 0.21663…
## $ Division <chr> "East South Central", "Pacific", "Mountain", "West South …
## $ Region <chr> "South", "West", "West", "South", "West", "West", "Northe…
## $ PopTotal <int> 4101454, 584533, 6042320, 2499093, 33110387, 4805691, 306…
## $ AlcoholTotal <dbl> 1798475.4, 302958.0, 2959279.7, 1030657.6, 16474664.2, 28…
## $ AlcoholRate <dbl> 0.4384970, 0.5182906, 0.4897588, 0.4124126, 0.4975679, 0.…
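Because AlcoholRate is a proportion, every state's value should fall between 0 and 1. One optional way to eyeball that:

# Optional check: smallest and largest state-level alcohol rates
NSDUH %>%
  summarize(MinRate = min(AlcoholRate), MaxRate = max(AlcoholRate))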
Now, let’s find out which regions have the highest and lowest average rates of alcohol consumption. To do this, we can look at the average rates of alcohol consumption for all states in each of the 4 CDC regions of the US (West, Northeast, South, Midwest).
Create a new data frame called NSDUH_Regions. This data frame should have 4 rows (one for each Region). It should have 2 variables: “Region” should be the region name and “AverageAlcoholRate” should be the mean alcohol rate for all states within the particular region.
NSDUH_Regions <- NSDUH %>%
group_by(Region) %>%
summarize(AverageAlcoholRate = mean(AlcoholRate, na.rm = TRUE))
glimpse(NSDUH_Regions)
## Rows: 4
## Columns: 2
## $ Region <chr> "Midwest", "Northeast", "South", "West"
## $ AverageAlcoholRate <dbl> 0.5463229, 0.5714646, 0.4791133, 0.5012730
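If you'd rather read the answer straight off a table than a plot, an optional extra step is to sort the regions from highest to lowest average rate:

# Optional: regions ordered from highest to lowest average alcohol rate
NSDUH_Regions %>%
  arrange(desc(AverageAlcoholRate))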
Create a bar plot that visualizes the NSDUH_Regions data in a sensible fashion. Note: There is more than one function for creating bar plots with ggplot – make sure to use one that is appropriate for your data!
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(ggplot2)
NSDUH_Regions %>%
ggplot(aes(x = Region, y = AverageAlcoholRate, fill = Region)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Average Alcohol Consumption by Region in 2018",
y = "Average Alcohol Rate",
x = "Region")
Excellent! Based on the graph above, you should be able to figure out which regions had the highest and lowest average statewide rates of alcohol consumption (at least in 2018)! The document from the top until this point is basically what you would expect to see in a real data analysis. For the next three exercises, you're going to answer a few extra questions that are less directly tied to the original question.
Rearrange the NSDUH data frame so that the states are ordered by overall alcohol consumption rate, with the lowest-rate state at the top of the data frame and the highest-rate state at the bottom.
NSDUH <- NSDUH %>%
arrange(AlcoholRate)
glimpse(NSDUH)
## Rows: 51
## Columns: 19
## $ State <chr> "Utah", "West Virginia", "Mississippi", "Arkansas", "Kent…
## $ Year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
## $ Pop12_17 <int> 312345, 124659, 241623, 237382, 339761, 371033, 513011, 3…
## $ Pop18_25 <int> 406023, 174236, 316108, 312348, 461868, 505103, 687939, 4…
## $ Pop26Plus <int> 1821899, 1235448, 1894943, 1949363, 2917791, 3225318, 449…
## $ Alc12_17 <dbl> 0.056698, 0.096124, 0.080492, 0.084060, 0.096198, 0.08223…
## $ Alc18_25 <dbl> 0.324851, 0.483520, 0.416831, 0.486640, 0.481976, 0.45834…
## $ Alc26Plus <dbl> 0.324884, 0.402934, 0.421761, 0.440504, 0.464227, 0.47637…
## $ Mar1_17 <dbl> 0.045811, 0.070514, 0.056821, 0.058163, 0.061808, 0.05798…
## $ Mar18_25 <dbl> 0.145029, 0.208839, 0.169864, 0.158035, 0.189620, 0.18344…
## $ Mar26Plus <dbl> 0.049647, 0.081155, 0.066478, 0.076023, 0.087077, 0.07490…
## $ Tob12_17 <dbl> 0.027104, 0.081434, 0.070829, 0.054730, 0.071363, 0.05307…
## $ Tob18_25 <dbl> 0.172845, 0.400053, 0.321114, 0.308876, 0.351077, 0.30635…
## $ Tob26Plus <dbl> 0.154902, 0.354690, 0.333158, 0.317618, 0.333691, 0.31364…
## $ Division <chr> "Mountain", "South Atlantic", "East South Central", "West…
## $ Region <chr> "West", "South", "South", "South", "South", "South", "Sou…
## $ PopTotal <int> 2540267, 1534343, 2452674, 2499093, 3719420, 4101454, 569…
## $ AlcoholTotal <dbl> 741512.1, 594033.3, 950425.4, 1030657.6, 1609811.0, 17984…
## $ AlcoholRate <dbl> 0.2919032, 0.3871581, 0.3875058, 0.4124126, 0.4328124, 0.…
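If you just want to see the two states at the extremes without scrolling the whole glimpse, here is an optional peek:

# Optional: the first (lowest rate) and last (highest rate) rows after arranging
NSDUH %>%
  select(State, AlcoholRate) %>%
  slice(c(1, n()))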
What is the best estimate you can calculate of the total number of 12-17 year-olds who consumed alcohol in the United States (minus the territories and DC) in 2018? Use data wrangling tools to create a new data frame called TotalAlcYouth. TotalAlcYouth should have 1 row and 1 variable, called "TotalYouthDrinkers". TotalYouthDrinkers should be an estimate of the total combined number of youth who drink alcohol across all the states in the United States. To create this data frame, you will need to use multiple data wrangling steps. Make sure to remove the District of Columbia from the calculation, as it is not a US state.
TotalAlcYouth <- NSDUH %>%
filter(State != "District of Columbia") %>%
summarize(TotalYouthDrinkers = sum(Pop12_17 * Alc12_17, na.rm = TRUE))
TotalAlcYouth
## TotalYouthDrinkers
## 1 2286540
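The same estimate can be built up in smaller steps if that reads more clearly to you. Here's an equivalent (optional) sketch – the name TotalAlcYouth_check is just for this illustration:

# Optional: the same estimate, built up with mutate() before summarize()
TotalAlcYouth_check <- NSDUH %>%
  filter(State != "District of Columbia") %>%
  mutate(YouthDrinkers = Pop12_17 * Alc12_17) %>%   # per-state youth estimates
  summarize(TotalYouthDrinkers = sum(YouthDrinkers, na.rm = TRUE))

TotalAlcYouth_check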
In some US states, more young adults use marijuana than tobacco, and in other US states, more young adults use tobacco than marijuana (at least in 2018). In which of the main regions of the US (Midwest, Northeast, South, West) would you expect states to have a preference for marijuana over tobacco? One way to think about this is to imagine a "typical" or "average" state in each region. For this exercise, create a new data frame called "RegionalDifferences". This data frame should have four rows (one for each region) and two columns called Region and PrefForMarijuana. PrefForMarijuana should tell us how much higher the rate of marijuana use is than the rate of tobacco use for young adults (18-25), on average (mean), across the states in the region (you will need to subtract variables and take a mean value of something). If you look at RegionalDifferences, positive values of PrefForMarijuana indicate a region where an average state has a preference for marijuana over tobacco. To do this, you will need to use multiple data wrangling steps (around three). Have a look at the result to find which regions have the strongest average state preferences for marijuana or tobacco in young adults.
RegionalDifferences <- NSDUH %>%
group_by(Region) %>%
summarize(PrefForMarijuana = mean(Mar18_25 - Tob18_25, na.rm = TRUE))
RegionalDifferences
## # A tibble: 4 × 2
## Region PrefForMarijuana
## <chr> <dbl>
## 1 Midwest -0.0758
## 2 Northeast 0.0289
## 3 South -0.0875
## 4 West -0.0000752
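To rank the regions from the strongest average preference for marijuana down to the strongest preference for tobacco (optional), you can arrange the result:

# Optional: order regions by their preference for marijuana over tobacco
RegionalDifferences %>%
  arrange(desc(PrefForMarijuana))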
Make sure to save your .Rmd Markdown file (with File -> Save) and also knit your Markdown to an .html file (click the Knit button). Once the file has been knit, scroll through to make sure everything looks right. Then, find the knit .html file on your computer by going to the folder where you saved the .Rmd file (when you knit the Markdown, the window that comes up shows the folder name at the top). Upload that .html file to Canvas as your submission!