Preparation Instructions

EXERCISES

Examine the Data

For this Challenge Problem assignment, you are going to be using the NHANES dataset (from the {NHANES} package). [Note: This is similar to how you accessed this dataset in a previous assignment.] Each case/observation represents a participant in the US National Health and Nutrition Examination Survey (NHANES) for the 2009-2010 and 2011-2012 sample years, and the data can be treated, for educational purposes, as a simple random sample from the American population.

  1. Examine the NHANES dataset. Be sure to show your work.
library(NHANES)  # provides the NHANES data frame
str(NHANES)
## tibble [10,000 × 76] (S3: tbl_df/tbl/data.frame)
##  $ ID              : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
##  $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Age             : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
##  $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
##  $ AgeMonths       : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
##  $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
##  $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
##  $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
##  $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
##  $ HHIncomeMid     : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
##  $ Poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
##  $ HomeRooms       : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
##  $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
##  $ Weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
##  $ Length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HeadCirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Height          : num [1:10000] 165 165 165 105 168 ...
##  $ BMI             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
##  $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
##  $ Pulse           : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
##  $ BPSysAve        : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
##  $ BPDiaAve        : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
##  $ BPSys1          : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
##  $ BPDia1          : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
##  $ BPSys2          : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
##  $ BPDia2          : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
##  $ BPSys3          : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
##  $ BPDia3          : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
##  $ Testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ DirectChol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
##  $ TotChol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
##  $ UrineVol1       : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
##  $ UrineFlow1      : num [1:10000] NA NA NA NA 0.094 ...
##  $ UrineVol2       : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ UrineFlow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DiabetesAge     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
##  $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
##  $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
##  $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
##  $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ nPregnancies    : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
##  $ nBabies         : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
##  $ Age1stBaby      : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
##  $ SleepHrsNight   : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
##  $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ PhysActiveDays  : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
##  $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TVHrsDayChild   : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
##  $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
##  $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AlcoholDay      : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
##  $ AlcoholYear     : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
##  $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
##  $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SmokeAge        : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
##  $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AgeFirstMarij   : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
##  $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
##  $ AgeRegMarij     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ SexAge          : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
##  $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
##  $ SexNumPartYear  : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
##  $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
##  $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
  1. How many cases/observations are in the NHANES dataset? How many variables? [Note: You may have used a function in the previous question that provided you with the answer to this question. If so, you don’t have to run additional R code.]
  • 10,000 cases/observations
  • 76 variables
  1. Identify two variables that are numerical and two variables that are categorical.
  • Numerical variables: Poverty and BMI
  • Categorical variables: MaritalStatus and Education
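The counts and variable types read off the `str()` output can also be checked programmatically. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in that is not part of the assignment) so that it runs without the NHANES package; the same base R calls should work on `NHANES` itself.

```r
# Hypothetical toy data frame standing in for NHANES
toy <- data.frame(
  Poverty   = c(1.36, 1.07, 1.91),
  BMI       = c(32.2, 15.3, 30.6),
  Education = factor(c("High School", NA, "Some College")),
  Gender    = factor(c("male", "male", "female"))
)

n_cases <- nrow(toy)  # number of cases/observations (rows)
n_vars  <- ncol(toy)  # number of variables (columns)

# One TRUE/FALSE per column: is it numerical? is it a factor (categorical)?
is_num <- sapply(toy, is.numeric)
is_cat <- sapply(toy, is.factor)

names(toy)[is_num]  # names of the numerical variables
names(toy)[is_cat]  # names of the categorical variables
```

In `str()` output, numerical variables are the ones labelled `int` or `num`, and categorical variables are the ones labelled `Factor`.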

Practice Problems

Now let’s practice wrangling the data using the dplyr data verbs to answer questions. Each question is meant to stand alone and not build on the others (unless specified).

  1. How many participants are from the 2009-2010 survey?
  • 5,000 participants are from the 2009-2010 survey
library(dplyr)  # provides %>%, filter(), summarise(), and the other data verbs
NHANES %>%
  filter(SurveyYr == "2009_10") %>%
  summarise(population = n())
  1. How many of the adult participants (18+ years old) fall into each of these general health categories: Excellent, Vgood, or Good?
  • Excellent = 773 participants
  • Vgood = 2161 participants
  • Good = 2654 participants
NHANES %>% 
  filter(Age >= 18, HealthGen %in% c("Excellent", "Vgood", "Good")) %>%
  group_by(HealthGen) %>%
  summarise(population = n())
  1. Select sex, age, and pulse from the NHANES dataset and arrange from largest to smallest pulse rate. Display only the first 6 rows of the dataset.
NHANES %>% 
  select(Gender, Age, Pulse) %>%
  arrange(desc(Pulse)) %>%
  head(6)
  1. Create a new variable in the NHANES dataset for height in inches, height_in, and save this updated dataset as the object titled NHANES1. Note: Height is recorded in centimeters, and one inch equals 2.54 centimeters, so height in inches is height in centimeters divided by 2.54. Be sure to examine the first few rows of the NHANES1 dataset to see if R did what you intended.
NHANES1 <- NHANES %>%
  mutate(height_in = Height / 2.54)
head(NHANES1)
  1. For the entire NHANES1 dataset, what is the mean and standard deviation of the new height_in variable (from the previous problem) and how many rows of data are there?
  • mean: 63.73
  • sd: 7.947
  • rows: 10,000
NHANES1 %>% 
  summarise(
    mean = mean(height_in, na.rm = TRUE),
    sd = sd(height_in, na.rm = TRUE),
    rows = n()
  )
  1. For each sex in the NHANES1 dataset, what is the mean and standard deviation of the new height_in variable (from the previous problem) and how many rows of data are there?
  • Female: mean = 61.65, sd = 6.61, rows = 5020
  • Male: mean = 65.82, sd = 8.60, rows = 4980
NHANES1 %>%
  group_by(Gender) %>% 
  summarise(
    mean = mean(height_in, na.rm = TRUE),
    sd = sd(height_in, na.rm = TRUE),
    rows = n()
  )
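The summaries above combine `na.rm = TRUE` with `n()`, and the two treat missing values differently. The following minimal sketch, using a made-up vector of heights (not NHANES data), illustrates the difference: the mean and standard deviation are computed over the non-missing values only, while a row count such as `n()` includes every row.

```r
heights <- c(165, 105, NA, 168)  # made-up heights in cm, one of them missing

mean_with_na    <- mean(heights)                # NA: the missing value propagates
mean_without_na <- mean(heights, na.rm = TRUE)  # mean of the 3 observed values
n_all           <- length(heights)              # counts every element, as n() counts every row
n_observed      <- sum(!is.na(heights))         # counts only the non-missing values
```

So a reported `rows` value describes the size of the data frame (or group), which can be larger than the number of values that actually went into the mean.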

Talk It Out!

You have code in the following form:

NHANES %>%
  filter(!is.na( **SOME_VARIABLES** )) %>%
  group_by( **SOME_VARIABLES** ) %>%
  summarise(count=n(), meanHt=mean(Height, na.rm=TRUE))

If SOME_VARIABLES was replaced with the variables listed in the questions below, describe what variables will appear in the output and how many rows there will be. [Note: Try to do this without running the code. Only run the code if you get stuck.] Hint: you can use the table() function to get the number of levels/categories in each categorical variable.

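  1. Gender
  • The output would contain the variables Gender, count, and meanHt, with 2 rows: one for each gender.
  1. Education
  • The output would contain Education, count, and meanHt, with 5 rows: one for each education category (participants with a missing Education value are filtered out).
  1. Gender, Education
  • The output would contain Gender, Education, count, and meanHt, with one row for each observed combination of gender and education category: up to 2 × 5 = 10 rows.
  1. Height, Education
  • The output would contain Height, Education, count, and meanHt. Since Height is a continuous numerical variable, there would be many rows, each representing a unique observed combination of height and education category, and meanHt would simply equal Height within each row.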
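The rule behind these answers is that `summarise()` returns one row per group that actually occurs in the data, plus one column per grouping variable and per summary. The sketch below checks this on a small made-up tibble (`toy`, hypothetical). It assumes that, for two variables, the template is filled in with one `!is.na()` condition per variable, since `is.na()` accepts a single argument.

```r
library(dplyr)

# Hypothetical toy data: 2 genders, 2 education categories, some missing values
toy <- tibble(
  Gender    = c("female", "male", "male", "female", "male"),
  Education = c("College", "College", "High School", NA, "High School"),
  Height    = c(160, 175, 180, 158, NA)
)

# Grouping by one variable: one row per education category
by_edu <- toy %>%
  filter(!is.na(Education)) %>%
  group_by(Education) %>%
  summarise(count = n(), meanHt = mean(Height, na.rm = TRUE))

# Grouping by two variables: one row per combination that occurs in the data
by_both <- toy %>%
  filter(!is.na(Gender), !is.na(Education)) %>%
  group_by(Gender, Education) %>%
  summarise(count = n(), meanHt = mean(Height, na.rm = TRUE), .groups = "drop")
```

Here `by_both` has 3 rows rather than 2 × 2 = 4, because no female participant with a "High School" value appears in the toy data. The product of the numbers of levels is an upper bound, not a guarantee.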

Putting It All Together

Now let’s put all of the dplyr data verbs together to answer a question. [Note: The inspiration for this example is from the Data Computing textbook, Chapter 7.]

Question: What are the demographic patterns of smoking in adults?

The end result is going to be a graph like this:

This section is going to walk you through how to tackle wrangling the data into the right form for creating the graph. The relevant variables here are: AgeDecade, Gender, and SmokeNow.

  1. Step 1: Use the graph to get an idea of what your end data frame will look like after you wrangle the data. What will be the rows and columns in your data frame? How many rows will there specifically be? [Hint: first identify what a case is]
  • Each row would be a case: a unique combination of age group and gender, with smoking status fixed at "Yes". With 6 adult age groups (20-29 through 70+) and 2 genders, there would be 6 × 2 = 12 rows.
  • The columns would be age group (AgeDecade), gender (Gender), smoking status (SmokeNow), and the proportion of smokers.
  1. Step 2: Now we’ll start wrangling the data to construct the graph. You’ll notice that the age groups in the figure start at 20-29. Start by removing participants from younger age groups and those for which age is unknown. [Note: the proportions of individuals who smoke include people we do not have this information for in the denominator. You will not get the correct proportions if you remove participants with NAs for the SmokeNow variable.]
NHANES %>%
  filter(!is.na(AgeDecade),
         AgeDecade != " 0-9",
         AgeDecade != " 10-19")
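To see why the note about NAs matters, here is a minimal sketch on made-up data (the `toy` tibble is hypothetical, not NHANES) comparing the proportion of smokers when missing SmokeNow values are kept in the denominator with the proportion when they are removed first.

```r
library(dplyr)

# Hypothetical toy data: 2 "Yes", 2 "No", and 4 missing values
toy <- tibble(SmokeNow = c("Yes", "No", NA, NA, "Yes", NA, "No", NA))

# Keep the NAs: the denominator is all 8 people
prop_keep_na <- toy %>%
  count(SmokeNow) %>%
  mutate(prop = n / sum(n)) %>%
  filter(SmokeNow == "Yes") %>%
  pull(prop)

# Drop the NAs first: the denominator shrinks to 4 people
prop_drop_na <- toy %>%
  filter(!is.na(SmokeNow)) %>%
  count(SmokeNow) %>%
  mutate(prop = n / sum(n)) %>%
  filter(SmokeNow == "Yes") %>%
  pull(prop)
```

In this toy case the proportion doubles from 0.25 to 0.5 once the NAs are removed, which is why the filter in this step only touches AgeDecade.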
  1. Step 3: In the following steps, we will continue to build on the code above. You can either copy and paste the previous code to add on to or save it as an object to which you will pipe on additional code. For this next step, get the number of people with each combination of age decade, gender, and smoking status and name this smoke_n.
smoke_n <- NHANES %>%
  filter(!is.na(AgeDecade),
         AgeDecade != " 0-9",
         AgeDecade != " 10-19") %>%
  group_by(AgeDecade, Gender, SmokeNow) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'AgeDecade', 'Gender'. You can override
## using the `.groups` argument.
  1. Step 4: Create a new column (total) in the table that you just created that gives the total number of people with each gender/age group (across all smoking categories). Then create a new column that gives the proportion in each smoking category relative to the total in that gender/age group (smoke_prop).
smoke_n <- smoke_n %>%
  group_by(AgeDecade, Gender) %>%
  mutate(total = sum(count)) %>%
  mutate(smoke_prop = count / total)
  1. Step 5: Finally, subset the table so that it only includes rows that correspond to the proportions of people who smoke. Save the new data frame in the object mod_NHANES. [Note: The steps above correspond to one way of making this table; there are multiple other ways of getting to the same answer.]
mod_NHANES <- smoke_n %>%
  filter(SmokeNow == "Yes")
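As one illustration of the "multiple other ways" mentioned in the note, the counting, totalling, and subsetting steps can be written as a single pipeline using `count()`. This is only a sketch on a made-up tibble (`toy` and `mod_toy` are hypothetical names); it is not the required solution.

```r
library(dplyr)

# Hypothetical toy data with the three relevant variables
toy <- tibble(
  AgeDecade = c(" 20-29", " 20-29", " 20-29", " 20-29", " 30-39", " 30-39"),
  Gender    = c("male", "male", "male", "female", "female", "female"),
  SmokeNow  = c("Yes", "No", NA, "Yes", "No", "Yes")
)

mod_toy <- toy %>%
  # count() is shorthand for group_by() + summarise(n()); NAs get their own row
  count(AgeDecade, Gender, SmokeNow, name = "count") %>%
  group_by(AgeDecade, Gender) %>%
  # mutate() keeps every row, so each row can see its group total
  mutate(total = sum(count), smoke_prop = count / total) %>%
  ungroup() %>%
  filter(SmokeNow == "Yes")
```

The design point is the same in both versions: totals are added with a grouped `mutate()` rather than `summarise()`, so the per-category counts survive alongside the group totals.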
  1. Once you have your data in the right form for producing the plot, make your graph. The code template has been started for you. Just replace the ______ with the appropriate syntax. [Note: Take out eval=FALSE in the options of the code chunk so that the code executes in your assignment.]
library(ggplot2)  # provides ggplot(), aes(), and the geoms
ggplot(mod_NHANES, aes(x = AgeDecade, y = smoke_prop, color = Gender)) +
  geom_point() +
  geom_line(aes(group = Gender)) +
  labs(x = "Age Group", y = "Proportion of People who Smoke")

