Introduction

Overview: In this lab exercise, you will begin to familiarize yourself with the statistical software R and RStudio.

Objectives: At the end of this lab, you will be able to:

Part 0: Download and organize files

Part 1: Import a dataset

We will load a dataset from the nhanesSubset.csv file into R, using the read.csv() function. For example, the command

# Not evaluated
mydata <- read.csv("datafile.csv")

loads the dataset from the file “datafile.csv” and stores it in an object called “mydata”. R does not execute the code above because the chunk option eval = FALSE is set (“eval” stands for “evaluate”).

In the code chunk below, read the dataset from the nhanesSubset.csv file and store it in an object called nhanes.

# A "code chunk" is everything between the ```{r} line above, and the ``` line below.
# When you want code to be executed, write it in a provided code chunk.
nhanes <- read.csv("nhanesSubset.csv")

The dataset is now held in the object named nhanes.
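
As a self-contained sketch of what read.csv() returns, the example below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names toy_path and toy are hypothetical and are not part of the NHANES data.

```r
# Hypothetical toy file; the real lab uses nhanesSubset.csv instead.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age", "1,2,34", "2,1,51"), toy_path)

# read.csv() returns a data frame with one column per CSV field.
toy <- read.csv(toy_path)

class(toy)  # the class of the object read.csv() returns
dim(toy)    # number of rows, then number of columns
names(toy)  # column names taken from the header line
```

If the file is not in the working directory, read.csv() stops with an error, so it is worth checking the folder first with getwd() and list.files().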

We will convert the nhanes object into the format (tibble, like “table”) that we will work with, and print that object (i.e., the stored dataset), using code similar to the following:

# Not evaluated
mydata <- as_tibble(mydata) 
print(mydata)
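
The following minimal sketch shows the effect of the conversion on a small made-up data frame. It assumes the dplyr package is installed (the lab may load it through another package); the object names toy and toy_tbl are hypothetical.

```r
library(dplyr)  # dplyr re-exports as_tibble() from the tibble package

# Hypothetical toy data frame, standing in for the object read.csv() returns.
toy <- data.frame(id = 1:3, age = c(34, 51, 27))

toy_tbl <- as_tibble(toy)

class(toy)      # a plain data frame
class(toy_tbl)  # a tibble is still a data frame, with extra classes added
print(toy_tbl)  # the printout shows the dimensions and each column's type
```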

In the code chunk below, convert the nhanes object to a tibble and print it.

# Enter code here
nhanes <- as_tibble(nhanes)
print(nhanes)

Entering the object name nhanes on its own line also prints the dataset, and the output appears when you knit your document. Note the following components of the table:

We may not be interested in all of the variables at the same time. We can use the select() function to select only a few columns (variables). For example, the command

# Not evaluated
mydata %>% select(variable.1, variable.2, variable.3)

will select variable.1, variable.2, and variable.3 from mydata, and print the result. The “pipe” symbol %>% means that we start with the object on the left side, and apply an action described by the right side.
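
To make the pipe concrete, the sketch below compares a piped call with the equivalent direct call on a small made-up tibble. It assumes the dplyr package is installed; the data and the names toy, piped, and direct are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from the NHANES file.
toy <- tibble(id = 1:3, sex = c(2, 1, 2), age = c(34, 51, 27))

# These two lines do the same thing: the pipe passes `toy` to select()
# as its first argument.
piped  <- toy %>% select(id, age)
direct <- select(toy, id, age)

names(piped)
all.equal(piped, direct)
```

The piped form reads left to right, which becomes more helpful once several actions are chained together.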

In the code chunk below, use select() to select and print just the id, race, ethnicity, sex, age, and healthStatus variables in the nhanes dataset.

# Enter code here
nhanes %>% select(id, race, ethnicity, sex, age, healthStatus)

Part 2: Types of variables

In order to perform statistical analyses correctly in R, we need to pay attention to the type of the data. In a tibble, we can have the following types:

R will perform an analysis depending on the way the variable is stored. For example, R will not permit you to calculate a mean for a variable stored as a factor (nominal or ordinal variable).
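
A short base R sketch of this behavior is shown below, using made-up codes rather than the NHANES data. The names codes, sex_factor, and factor_mean are hypothetical.

```r
# Hypothetical coded values, not taken from the NHANES file.
codes <- c(1, 2, 2, 1, 2)

mean(codes)  # numeric storage: R computes a mean of the codes

sex_factor <- factor(codes, levels = c(1, 2), labels = c("male", "female"))

# Factor storage: mean() returns NA and issues a warning instead of a number.
factor_mean <- suppressWarnings(mean(sex_factor))
is.na(factor_mean)

table(sex_factor)  # counts per category are the appropriate summary
```

The mean of the raw codes is computable but not meaningful, which is why storing nominal variables as factors helps guard against that mistake.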

As with many datasets, this NHANES dataset is coded. This means that instead of recording responses like “male” and “female” as these words (text), we store the data as numeric values that correspond to specific responses (i.e., numeric codes). For example, the variable sex has values 1 and 2 in the NHANES dataset. We need to know what each numeric value corresponds to if we want to understand the demographics of our sample. We can determine what type each variable should be by reading the data dictionary (nhanes_dataDictionary.xlsx).

STOP! Answer Questions 1 and 2 now.

Part 3: Creating factors and adding value labels

The inspect() function prints a summary of a data object:

# Not evaluated
inspect(mydata)

In the code chunk below, inspect the nhanes dataset.

# Enter code here
inspect(nhanes)

You will notice that some variables that should be nominal (categorical) are not stored as factors: their summaries show the minimum, 1st quartile, median, 3rd quartile, maximum, mean, and standard deviation rather than the expected percentage in each category level. We will do two tasks at once, using code similar to that below: convert a variable to a factor, and assign informative labels to its numeric codes.

# Not evaluated
mydata <- mydata %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )

Before moving on, let’s parse the above code.
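
One way to see what recode_factor() does on its own, outside of mutate(), is to apply it to a short numeric vector. The sketch below assumes the dplyr package is installed; the vector sex_codes is made up, and it reuses the 1 = male, 2 = female scheme from the example above.

```r
library(dplyr)

# Hypothetical codes using the 1 = male, 2 = female scheme from the example.
sex_codes <- c(1, 2, 2, 1)

# The backticks let a number be used as a name on the left of `=`.
sex_labeled <- recode_factor(sex_codes, `1` = "male", `2` = "female")

is.factor(sex_labeled)     # the result is a factor, not a numeric vector
levels(sex_labeled)        # levels appear in the order the labels were listed
as.character(sex_labeled)  # each code has been replaced by its label
```

Inside mutate(), the same call is applied to a whole column, and assigning the result back to sex overwrites the numeric column with the labeled factor.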

In the code chunk below, convert the sex variable in the nhanes dataset to a factor and code it according to the data dictionary.

# Enter code here
nhanes <- nhanes %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )

In the code chunk below, print the nhanes object again, and then use inspect() to see what in the summary has changed.

# Enter code here
print(nhanes)

inspect(nhanes)

STOP! Answer Question 3 now.

We can alter several variables at once using the mutate() command, for example,

# Not evaluated
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
    ),
  
  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
    ),
  
  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
    )
)

alters the race, ethnicity, and urban variables all at once.

In the code chunk below, convert all nominal variables (including the three above) to factors using mutate() and recode_factor(). Read the data dictionary carefully, as some variable codings are not consistent (e.g., everSmoke and smokeNow).

# Enter code here
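nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
  ),

  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
  ),

  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
  )
  # Recode the remaining nominal variables (e.g., everSmoke, smokeNow) the same
  # way, taking each variable's codes from the data dictionary.
)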

The variable healthStatus is an ordinal variable, so we want to tell R that it is an ordered factor. Create ordered value labels for the variable healthStatus using the .ordered = TRUE option, as in the following code:

# Not evaluated
mydata <- mydata %>% mutate(
  smokingCausesCancer = recode_factor(
    smokingCausesCancer,
    `1` = "strongly disagree",
    `2` = "disagree",
    `3` = "unsure",
    `4` = "agree",
    `5` = "strongly agree",
    .ordered = TRUE
  )
)

In the code chunk below, convert the variable healthStatus to an ordinal variable with an appropriate ordering.

# Enter code here
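# The labels below are placeholders: replace them with the codes listed for
# healthStatus in the data dictionary, adding or removing lines as needed and
# listing them in their logical order.
nhanes <- nhanes %>% mutate(
  healthStatus = recode_factor(
    healthStatus,
    `1` = "label for code 1",
    `2` = "label for code 2",
    `3` = "label for code 3",
    .ordered = TRUE
  )
)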

In the code chunk below, use the select() function as you did earlier to select and print just the variables id, race, and healthStatus.

# Enter code here
nhanes %>% select(id, race, healthStatus)

Note the variable types <fct> for “factor” and <ord> for “ordinal”.

To look closely at a single variable, we can use the pull() function, as in

# Not evaluated
mydata %>% pull(smokingCausesCancer)
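
The difference between select() and pull() can be sketched on a small made-up tibble, as below. This assumes the dplyr package is installed; the data and the names toy, race_tbl, and race_vec are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from the NHANES file.
toy <- tibble(
  id = 1:4,
  race = factor(c("white", "black", "other", "white"))
)

# select() keeps the result as a tibble with one column ...
race_tbl <- toy %>% select(race)

# ... while pull() extracts the column itself as a vector (here, a factor).
race_vec <- toy %>% pull(race)

class(race_tbl)
class(race_vec)
levels(race_vec)
```

Because pull() returns the factor itself, printing it also lists the levels, which makes it a convenient way to check how a variable was recoded.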

In the code chunk below, use the pull() function to print just the race variable, and in a separate command, print just the healthStatus variable.

# Enter code here
nhanes %>% pull(race)
nhanes %>% pull(healthStatus)

Note that the order of the levels of healthStatus is the order in which they were defined, not the order of the old numeric codes.
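
The point about level order can be illustrated with a made-up variable whose numeric codes are not in a logical order. The coding below (1 = "some", 2 = "none", 3 = "a lot") is hypothetical and is not taken from the data dictionary; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical coding in which the numeric codes are NOT in a logical order:
# 1 = "some", 2 = "none", 3 = "a lot".
codes <- c(1, 3, 2, 2, 1)

amount <- recode_factor(
  codes,
  `2` = "none",   # listed first, so it becomes the lowest level
  `1` = "some",
  `3` = "a lot",
  .ordered = TRUE
)

levels(amount)    # follows the order written above, not 1, 2, 3
amount < "a lot"  # comparisons are meaningful for ordered factors
```

The labels attached to each code are unchanged; only the order in which the code-label pairs are listed determines the order of the levels.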

STOP! Answer Question 4 now.

Look at the Codes (i.e., the labels for the numeric values) for the variable active in the data dictionary (the Excel file you downloaded earlier, nhanes_dataDictionary.xlsx; see column E). Then, in the code chunk below, reorder the response options so that they appear in a logical order, if necessary. You are not required to change the original codes; instead, list the numeric values in your code in an order that makes their data dictionary labels appear in a logical order.

# Enter code here

STOP! Answer Question 5 now.

You can extract the row of the nhanes dataset that corresponds to a unique patient using the filter() function (note the double equal signs). Here we use two “pipes” (%>%) to perform two actions in sequence: filter just the rows with id == 10, then select a few columns (variables) to display. For example:

# Not evaluated
nhanes %>%
  filter(id == 10) %>%
  select(id, race, ethnicity, sex, age)

prints the variables id, race, ethnicity, sex, and age for the subject with id = 10.
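
The same two-step pattern can be tried on a small made-up tibble, as in the sketch below. It assumes the dplyr package is installed; the data and the names toy and one_row are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from the NHANES file.
toy <- tibble(
  id  = c(9, 10, 11),
  sex = c("female", "male", "female"),
  age = c(40, 62, 35)
)

# `==` tests for equality; a single `=` inside filter() would be an error.
one_row <- toy %>%
  filter(id == 10) %>%  # step 1: keep only the rows where id equals 10
  select(id, age)       # step 2: keep only the listed columns

one_row
```

Because each id appears once, the result has a single row; if an id were repeated, filter() would return every matching row.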

In the code chunk below, print the variables id, race, ethnicity, sex, age, and healthStatus for the subject with id = 2.

# Enter code here
nhanes %>%
  filter(id == 2) %>%
  select(id, race, ethnicity, sex, age, healthStatus)

STOP! Answer Question 6 now.

Part 4: Saving a dataset

In order to save all the work you did on this dataset (such as setting variable types and value labels), you need to save the dataset as an R object using the following command:

# Not evaluated
save(nhanes, file = "nhanes.RData")

In the code chunk below, use the exact command above to save the modified dataset. Open the folder to make sure the file has been saved.

# Enter code here
save(nhanes, file = "nhanes.RData")

The next time you want to use this R object, set your working directory to the folder where the file is stored and then use the following command:

# Not evaluated
load("nhanes.RData")
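
A round trip through save() and load() can be sketched with a small made-up object and a temporary file, as below. The names toy, toy_path, original, and loaded_names are hypothetical; the lab itself saves nhanes to nhanes.RData.

```r
# Hypothetical object and temporary path; the lab itself saves `nhanes`.
toy <- data.frame(id = 1:2, sex = factor(c("male", "female")))
toy_path <- tempfile(fileext = ".RData")

save(toy, file = toy_path)

original <- toy
rm(toy)  # simulate starting a fresh R session without the object

# load() restores objects under their saved names and returns those names.
loaded_names <- load(toy_path)
loaded_names

identical(toy, original)  # factor columns and their levels are preserved
```

Note that load() is not assigned to a new object name: the object comes back under the name it had when it was saved.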

Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click on the “Submit Assignment” tab to complete your submission.