Download the image NHANES-Demographic-Smoking.png from the Canvas page for this assignment and save this file to the folder where the RMD file is located.
Be sure to change the author in the YAML to your name. Remember to keep it inside the quotes.
Try to use good coding practice. Read the short sections on good code with pipes.
Questions that require the use of R will have an R code chunk below them.
Examine the Data
For this Challenge Problem assignment, you are going to be using the NHANES dataset (from the {NHANES} package). [Note: This is similar to how you accessed this dataset in a previous assignment.] Each case/observation is supposed to represent a participant of the US National Health And Nutrition Examination Survey (NHANES) for the 2009-2010 and 2011-2012 sample years, and can be treated, for educational purposes, as a simple random sample from the American population.
NHANES dataset. Be sure to show your work.
library(NHANES)  # provides the NHANES dataset
str(NHANES)
## tibble [10,000 × 76] (S3: tbl_df/tbl/data.frame)
## $ ID : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
## $ SurveyYr : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
## $ Age : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
## $ AgeDecade : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
## $ AgeMonths : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
## $ Race1 : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
## $ Race3 : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Education : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
## $ MaritalStatus : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
## $ HHIncome : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
## $ HHIncomeMid : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
## $ Poverty : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
## $ HomeRooms : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
## $ HomeOwn : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
## $ Work : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
## $ Weight : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
## $ Length : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ HeadCirc : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ Height : num [1:10000] 165 165 165 105 168 ...
## $ BMI : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
## $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BMI_WHO : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
## $ Pulse : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
## $ BPSysAve : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
## $ BPDiaAve : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
## $ BPSys1 : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
## $ BPDia1 : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
## $ BPSys2 : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
## $ BPDia2 : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
## $ BPSys3 : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
## $ BPDia3 : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
## $ Testosterone : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ DirectChol : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
## $ TotChol : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
## $ UrineVol1 : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
## $ UrineFlow1 : num [1:10000] NA NA NA NA 0.094 ...
## $ UrineVol2 : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ UrineFlow2 : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ Diabetes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ DiabetesAge : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ HealthGen : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
## $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
## $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
## $ LittleInterest : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
## $ Depressed : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
## $ nPregnancies : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
## $ nBabies : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
## $ Age1stBaby : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
## $ SleepHrsNight : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
## $ SleepTrouble : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ PhysActive : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
## $ PhysActiveDays : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
## $ TVHrsDay : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
## $ CompHrsDay : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
## $ TVHrsDayChild : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
## $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
## $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ AlcoholDay : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
## $ AlcoholYear : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
## $ SmokeNow : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
## $ Smoke100 : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ Smoke100n : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ SmokeAge : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
## $ Marijuana : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ AgeFirstMarij : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
## $ RegularMarij : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
## $ AgeRegMarij : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ HardDrugs : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ SexEver : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ SexAge : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
## $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
## $ SexNumPartYear : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
## $ SameSex : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
## $ SexOrientation : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
## $ PregnantNow : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
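The first line of the str() output above already reports the dimensions of the dataset (10,000 rows and 76 columns). As a minimal sketch, the same counts can be pulled out directly with base R. The small `toy` data frame below is a hypothetical stand-in so that the snippet runs without the {NHANES} package.

```r
# Hypothetical stand-in for NHANES (3 rows, 2 columns) so the example is self-contained
toy <- data.frame(ID = c(1, 2, 3), Age = c(34, 4, 49))

n_rows <- nrow(toy)  # number of cases/observations
n_cols <- ncol(toy)  # number of variables
dims   <- dim(toy)   # both at once: c(rows, columns)

print(dims)
```

With the real data, `dim(NHANES)` should agree with the `[10,000 × 76]` header shown above.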
NHANES dataset? How many variables? [Note: You may have used a function in the previous question that provided you with the answer to this question. If so, you don’t have to run additional R code.]
Practice Problems
Now let’s practice wrangling the data using the dplyr data verbs to answer questions. Each question is meant to stand alone and not build on the others (unless specified).
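library(dplyr)  # supplies %>% and the data verbs used below
# Count the cases from the 2009_10 survey years
NHANES %>%
  filter(SurveyYr == "2009_10") %>%
  summarise(population = n())
# Count adults (Age >= 18) in each of the three best general-health categories
NHANES %>%
  filter(Age >= 18, HealthGen %in% c("Excellent", "Vgood", "Good")) %>%
  group_by(HealthGen) %>%
  summarise(population = n())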
NHANES dataset and arrange from largest to smallest pulse rate. Display only the first 6 rows of the dataset.
NHANES %>%
  select(Gender, Age, Pulse) %>%
  arrange(desc(Pulse)) %>%
  head(6)
NHANES dataset for height in inches, height_in, and save this updated dataset as the object titled NHANES1. Note: Height in inches equals height in centimeters divided by 2.54. Be sure to examine the first few rows of the NHANES1 dataset to see if R did what you intended.
NHANES1 <- NHANES %>%
  mutate(height_in = Height / 2.54)
head(NHANES1)
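As a quick arithmetic check on the conversion, the first Height value shown in the str() output is about 165 cm, which should come out near 65 inches. A minimal sketch in base R (the object names here are illustrative and not part of the assignment):

```r
# Convert a height from centimeters to inches (1 inch = 2.54 cm)
height_cm <- 165
height_in <- height_cm / 2.54

round(height_in, 2)  # 64.96
```

If the first rows of NHANES1 show height_in values far from this range for adults, the conversion was probably applied in the wrong direction.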
NHANES1 dataset, what is the mean and standard deviation of the new height_in variable (from the previous problem) and how many rows of data are there?
NHANES1 %>%
  summarise(
    mean = mean(height_in, na.rm = TRUE),
    sd = sd(height_in, na.rm = TRUE),
    rows = n()
  )
NHANES1 dataset, what is the mean and standard deviation of the new height_in variable (from the previous problem) and how many rows of data are there?
NHANES1 %>%
  group_by(Gender) %>%
  summarise(
    mean = mean(height_in, na.rm = TRUE),
    sd = sd(height_in, na.rm = TRUE),
    rows = n()
  )
Talk It Out!
You have code in the following form:
NHANES %>%
filter(!is.na( **SOME_VARIABLES** )) %>%
group_by( **SOME_VARIABLES** ) %>%
summarise(count=n(), meanHt=mean(Height, na.rm=TRUE))
If SOME_VARIABLES were replaced with the variables listed in the questions below, describe which variables will appear in the output and how many rows there will be. [Note: Try to do this without running the code. Only run the code if you get stuck.] Hint: you can use the table() function to get the number of levels/categories in each categorical variable.
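The reasoning can be sketched on a tiny example: the output has one column per grouping variable plus count and meanHt, and one row per combination of levels that actually occurs in the data. The `toy` data frame below is hypothetical and only borrows column names from NHANES. With more than one variable, the sketch assumes each gets its own is.na() condition inside filter().

```r
library(dplyr)

# Hypothetical toy data standing in for NHANES
toy <- data.frame(
  Gender  = factor(c("female", "male", "male", "female", "male")),
  HomeOwn = factor(c("Own", "Rent", NA, "Own", "Own")),
  Height  = c(160, 175, 180, NA, 170)
)

# table() lists the levels with their counts; nlevels() counts the levels
table(toy$HomeOwn)
nlevels(toy$HomeOwn)

# One grouping variable: one row per observed level
by_one <- toy %>%
  filter(!is.na(HomeOwn)) %>%
  group_by(HomeOwn) %>%
  summarise(count = n(), meanHt = mean(Height, na.rm = TRUE))

# Two grouping variables: one row per observed combination of levels
by_two <- toy %>%
  filter(!is.na(Gender), !is.na(HomeOwn)) %>%
  group_by(Gender, HomeOwn) %>%
  summarise(count = n(), meanHt = mean(Height, na.rm = TRUE), .groups = "drop")

dim(by_one)
dim(by_two)
```

Here by_two has 3 rows rather than 2 × 2 = 4, because no female renter appears in the toy data. The product of the level counts is therefore an upper bound on the number of rows, not a guarantee.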
Putting It All Together
Now let’s put all of the dplyr data verbs together to answer a question. [Note: The inspiration for this example is from the Data Computing textbook, Chapter 7.]
Question: What are the demographic patterns of smoking in adults?
The end result is going to be a graph like this:
[Figure: NHANES-Demographic-Smoking.png]
This section is going to walk you through how to tackle wrangling the data into the right form for creating the graph. The relevant variables here are: AgeDecade, Gender, and SmokeNow.
SmokeNow variable.]
NHANES %>%
  filter(!is.na(AgeDecade),
         AgeDecade != " 0-9",
         AgeDecade != " 10-19")
smoke_n.
smoke_n <- NHANES %>%
  filter(!is.na(AgeDecade),
         AgeDecade != " 0-9",
         AgeDecade != " 10-19") %>%
  group_by(AgeDecade, Gender, SmokeNow) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'AgeDecade', 'Gender'. You can override
## using the `.groups` argument.
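The message above is informational rather than an error: by default summarise() drops only the last grouping variable, so the result is still grouped by AgeDecade and Gender. A minimal sketch of the difference, using a hypothetical `toy` data frame in place of NHANES:

```r
library(dplyr)

# Hypothetical toy data with the same three grouping columns
toy <- data.frame(
  AgeDecade = c("20-29", "20-29", "30-39", "30-39"),
  Gender    = c("female", "male", "female", "female"),
  SmokeNow  = c("Yes", "No", "No", "No")
)

# Same as the default here, stated explicitly: only SmokeNow is dropped,
# so the result stays grouped by AgeDecade and Gender
still_grouped <- toy %>%
  group_by(AgeDecade, Gender, SmokeNow) %>%
  summarise(count = n(), .groups = "drop_last")

# Return an ungrouped result instead
ungrouped <- toy %>%
  group_by(AgeDecade, Gender, SmokeNow) %>%
  summarise(count = n(), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Setting `.groups` explicitly also silences the message. Which option is preferable depends on whether later steps rely on the remaining grouping.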
total) in the table that you just created that gives the total number of people in each gender/age group (across all smoking categories). Then create a new column that gives the proportion in each smoking category relative to the total in that gender/age group (smoke_prop).
smoke_n <- smoke_n %>%
  group_by(AgeDecade, Gender) %>%
  mutate(total = sum(count)) %>%
  mutate(smoke_prop = count / total)
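One way to check a table like this is to confirm that the proportions within each gender/age group add up to 1. The sketch below uses hypothetical counts (`toy_n`) in the same shape as smoke_n. The numbers are made up for illustration and are not NHANES results.

```r
library(dplyr)

# Hypothetical counts in the same shape as smoke_n
toy_n <- data.frame(
  AgeDecade = c("20-29", "20-29", "20-29", "20-29"),
  Gender    = c("female", "female", "male", "male"),
  SmokeNow  = c("No", "Yes", "No", "Yes"),
  count     = c(30, 10, 20, 20)
)

toy_prop <- toy_n %>%
  group_by(AgeDecade, Gender) %>%
  mutate(total = sum(count),
         smoke_prop = count / total)

# Sanity check: proportions within each gender/age group should sum to 1
check <- toy_prop %>%
  summarise(prop_sum = sum(smoke_prop), .groups = "drop")

check
```

Note that if a grouping column such as SmokeNow contains missing values, those rows form their own category and are included in the total, which lowers the proportions of the other categories.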
mod_NHANES. [Note: The steps above correspond to one way of making this table; there are multiple other ways of getting to the same answer.]
mod_NHANES <- smoke_n %>%
  filter(SmokeNow == "Yes")
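As one illustration of another route to the same kind of table, count() can replace the group_by() plus summarise(n()) pair. This is a sketch on a hypothetical `toy` data frame, not the assignment's required approach.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered NHANES rows
toy <- data.frame(
  AgeDecade = c("20-29", "20-29", "20-29", "30-39", "30-39"),
  Gender    = c("male", "male", "female", "male", "male"),
  SmokeNow  = c("Yes", "No", "Yes", "No", "No")
)

alt <- toy %>%
  count(AgeDecade, Gender, SmokeNow, name = "count") %>%  # one row per combination
  group_by(AgeDecade, Gender) %>%
  mutate(total = sum(count),
         smoke_prop = count / total) %>%
  ungroup() %>%
  filter(SmokeNow == "Yes")                               # keep current smokers only

alt
```

In this toy data the 30-39 male group has no "Yes" row, so it disappears after the final filter. With either approach, a group with no current smokers is absent from the table rather than shown with a proportion of 0.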
______ with the appropriate syntax. [Note: Take out eval=FALSE in the options of the code chunk so that the code executes in your assignment.]
ggplot(mod_NHANES, aes(x = AgeDecade, y = smoke_prop, color = Gender)) +
  geom_point() +
  geom_line(aes(group = Gender)) +
  labs(x = "Age Group", y = "Proportion of People who Smoke")