Introduction

Overview: In this lab exercise, you will begin to familiarize yourself with the statistical software R and RStudio.

Objectives: At the end of this lab, you will be able to:

Part 0: Download and organize files

Part 1: Import a dataset

We will load a dataset from the nhanesSubset.csv file into R, using the read.csv() function. For example, the command

# Not evaluated
mydata <- read.csv("datafile.csv")

loads the dataset from the file “datafile.csv” and stores it in an object called “mydata”. The code above is not executed by R because the option eval = FALSE is used (“eval” for “evaluate”).
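
To see read.csv() work from start to finish, here is a minimal sketch that first writes a tiny CSV file to a temporary folder and then reads it back. The file contents and column names are made up for illustration and are not part of the NHANES data.

```r
# Hypothetical example data; the columns and values are invented for illustration.
csv_path <- file.path(tempdir(), "datafile.csv")
writeLines(c("id,sex,age", "1,2,56", "2,2,73", "3,1,25"), csv_path)

# read.csv() returns a data frame; the first line of the file supplies the column names.
mydata <- read.csv(csv_path)

str(mydata)    # compact look at the column names and types
nrow(mydata)   # number of rows that were read
```

In the lab itself the file already exists in your working directory, so only the read.csv() line is needed.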

In the code chunk below, read the dataset from the nhanesSubset.csv file and store it in an object called nhanes.

nhanes <- read.csv("nhanesSubset.csv")

The dataset is now held in the object named nhanes.

We will convert the nhanes object into the format (tibble, like “table”) that we will work with, and print that object (i.e., the stored dataset), using code similar to the following:

# Not evaluated
mydata <- as_tibble(mydata) 
print(mydata)

In the code chunk below, convert the nhanes object to a tibble and print it.

# Enter code here
nhanes <- as_tibble(nhanes)
print(nhanes)
## # A tibble: 100 × 33
##       id  race ethnicity   sex   age familySize urban region
##    <int> <int>     <int> <int> <int>      <int> <int>  <int>
##  1     1     2         3     2    56          1     1      2
##  2     2     1         3     2    73          1     2      4
##  3     3     1         3     2    25          2     1      3
##  4     4     1         1     2    53          2     2      3
##  5     5     1         1     2    68          2     2      3
##  6     6     1         3     2    44          3     2      4
##  7     7     2         3     2    28          2     1      3
##  8     8     1         3     1    74          2     2      2
##  9     9     1         3     2    65          1     2      1
## 10    10     1         2     2    61          3     1      4
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <int>, healthStatus <int>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <int>,
## #   smokeNow <int>, active <int>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …

Printing the dataset by simply entering the object name nhanes on its own line should produce some output when you knit your document. Note the following components of the table:

We may not be interested in all of the variables at the same time. We can use the select() function to select only a few columns or variables. For example, the command

# Not evaluated
mydata %>% select(variable.1, variable.2, variable.3)

will select variable.1, variable.2, and variable.3 from mydata, and print the result. The “pipe” symbol %>% means that we start with the object on the left side, and apply an action described by the right side.
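
The following sketch illustrates what the pipe does, assuming the dplyr package is installed. The toy tibble and its fourth column are hypothetical. Writing mydata %>% select(...) gives the same result as calling select(mydata, ...) directly.

```r
library(dplyr)  # provides select(), tibble(), and the %>% pipe

# Hypothetical toy data standing in for mydata.
mydata <- tibble(
  variable.1 = 1:3,
  variable.2 = c("a", "b", "c"),
  variable.3 = c(TRUE, FALSE, TRUE),
  variable.4 = c(1.5, 2.5, 3.5)
)

piped  <- mydata %>% select(variable.1, variable.2, variable.3)
nested <- select(mydata, variable.1, variable.2, variable.3)

identical(piped, nested)  # the two forms give the same result
names(mydata)             # mydata itself still has all four columns
```

Because the result is only printed and not assigned back to mydata, the original object is unchanged.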

In the code chunk below, use select() to select and print just the id, race, ethnicity, sex, age, and healthStatus variables in the nhanes dataset.

# Enter code here

nhanes %>% select(id, race, ethnicity, sex, age, healthStatus)
## # A tibble: 100 × 6
##       id  race ethnicity   sex   age healthStatus
##    <int> <int>     <int> <int> <int>        <int>
##  1     1     2         3     2    56            4
##  2     2     1         3     2    73            4
##  3     3     1         3     2    25            3
##  4     4     1         1     2    53            5
##  5     5     1         1     2    68            4
##  6     6     1         3     2    44            2
##  7     7     2         3     2    28            3
##  8     8     1         3     1    74            3
##  9     9     1         3     2    65            3
## 10    10     1         2     2    61            3
## # ℹ 90 more rows

Part 2: Types of variables

In order to perform statistical analyses correctly in R, we need to pay attention to the type of the data. In a tibble, we can have the following types:

How R analyzes a variable depends on how that variable is stored. For example, R will not permit you to calculate a mean for a variable stored as a factor (nominal or ordinal variable).
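
As a small illustration in base R, the sketch below compares the mean of raw numeric codes with the mean of the same values stored as a factor. The five values are hypothetical. With base R's mean(), the factor version returns NA together with a warning, which is silenced here to keep the output short.

```r
# Hypothetical coded values: 1 and 2 are category codes, not quantities.
sex_codes  <- c(1, 2, 2, 1, 2)
sex_factor <- factor(sex_codes, levels = c(1, 2), labels = c("male", "female"))

mean(sex_codes)  # R computes this, but an average of category codes is not meaningful

# For a factor, base R's mean() gives NA and a warning (silenced here).
factor_mean <- suppressWarnings(mean(sex_factor))
factor_mean

# Counts per category are the appropriate summary for a factor.
table(sex_factor)
```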

As with many datasets, this NHANES dataset is coded. This means that instead of recording responses like “male” and “female” as these words (text), we store the data as numeric values that correspond to specific responses (i.e., numeric codes). For example, the variable sex has values 1 and 2 in the NHANES dataset. We need to know what each numeric value corresponds to if we want to understand the demographics of our sample. We can determine what type each variable should be by reading the data dictionary (nhanes_dataDictionary.xlsx).

STOP! Answer Questions 1 and 2 now.

Part 3: Creating factors and adding value labels

The inspect command prints a summary of a data object:

# Not evaluated
inspect(mydata)

In the code chunk below, inspect the nhanes dataset.

# Enter code here
inspect(nhanes)
## 
## quantitative variables:  
##             name   class     min        Q1   median
## 1             id integer   1.000  25.75000  50.5000
## 2           race integer   1.000   1.00000   1.0000
## 3      ethnicity integer   1.000   1.00000   3.0000
## 4            sex integer   1.000   1.00000   2.0000
## 5            age integer  17.000  34.75000  48.5000
## 6     familySize integer   1.000   2.00000   2.0000
## 7          urban integer   1.000   1.00000   2.0000
## 8         region integer   1.000   2.00000   3.0000
## 9            pir numeric   0.179   1.17625   1.9145
## 10  yrsEducation integer   0.000   8.00000  12.0000
## 11 maritalStatus integer   1.000   1.00000   1.0000
## 12  healthStatus integer   1.000   2.00000   3.0000
## 13  heightInSelf integer  52.000  62.00000  66.0000
## 14  weightLbSelf integer  82.000 135.00000 160.0000
## 15          beer integer   0.000   0.00000   0.0000
## 16          wine integer   0.000   0.00000   0.0000
## 17        liquor integer   0.000   0.00000   0.0000
## 18     everSmoke integer   0.000   0.00000   0.0000
## 19      smokeNow integer   1.000   2.00000   2.0000
## 20        active integer   1.000   1.25000   2.0000
## 21           SBP integer  94.000 112.00000 126.0000
## 22           DBP integer   0.000  64.00000  74.0000
## 23      weightKg numeric  35.800  63.85000  72.6500
## 24      heightCm numeric 139.400 157.20000 164.5000
## 25         waist numeric  68.100  84.50000  92.0000
## 26        tricep numeric   5.100  11.45000  19.2500
## 27         thigh numeric   4.400  10.10000  18.0000
## 28           BMD numeric   0.358   0.80750   0.9010
## 29           RBC numeric   3.430   4.28500   4.5900
## 30          lead numeric   0.700   2.07500   3.0500
## 31   cholesterol integer 122.000 179.25000 206.0000
## 32  triglyceride integer  29.000  86.75000 137.5000
## 33           hdl integer  14.000  41.75000  49.0000
##          Q3    max        mean         sd   n missing
## 1   75.2500 100.00  50.5000000 29.0114920 100       0
## 2    1.0000   2.00   1.2200000  0.4163332 100       0
## 3    3.0000   3.00   2.4400000  0.8912595 100       0
## 4    2.0000   2.00   1.5800000  0.4960450 100       0
## 5   70.2500  90.00  51.4300000 21.5285833 100       0
## 6    4.0000  10.00   3.0100000  1.7780650 100       0
## 7    2.0000   2.00   1.5400000  0.5009083 100       0
## 8    3.0000   4.00   2.7400000  0.9600084 100       0
## 9    3.3545   6.77   2.4536395  1.6725237  86      14
## 10  13.0000  17.00  10.8989899  4.0317489  99       1
## 11   4.0000   7.00   2.6500000  2.2036471 100       0
## 12   4.0000   5.00   2.8200000  1.0766578 100       0
## 13  69.0000  76.00  65.7473684  4.4338361  95       5
## 14 180.0000 270.00 160.0736842 37.6004307  95       5
## 15   0.5000 365.00   7.1313131 38.2141246  99       1
## 16   0.0000  61.00   1.2424242  6.4824581  99       1
## 17   0.0000  13.00   0.7070707  2.3571538  99       1
## 18   1.0000   1.00   0.4444444  0.4994328  99       1
## 19   2.0000   2.00   1.7878788  0.4108907  99       1
## 20   3.0000   3.00   2.2040816  0.8243608  98       2
## 21 141.0000 216.00 129.4216867 22.8815191  83      17
## 22  81.0000 118.00  73.1084337 16.6374629  83      17
## 23  82.1500 137.75  73.4323656 17.4040663  93       7
## 24 171.9000 184.10 164.4440860  9.9532981  93       7
## 25  98.7000 129.90  93.3670588 13.1250613  85      15
## 26  26.6000  39.80  19.7340909  9.5348349  88      12
## 27  29.5000  41.80  20.4780822 11.1813491  73      27
## 28   1.0510   1.26   0.9028824  0.1881680  68      32
## 29   4.9850   6.15   4.6184783  0.5013256  92       8
## 30   4.7250  13.30   3.6934783  2.5428549  92       8
## 31 239.5000 370.00 210.0326087 43.8146574  92       8
## 32 192.2500 524.00 152.6304348 92.8197731  92       8
## 33  61.0000 101.00  51.8913043 15.8068399  92       8

You will notice that some variables that should be nominal (categorical) are not stored as factors: for these variables, inspect() displays the minimum, 1st quartile, median, 3rd quartile, maximum, mean, and standard deviation instead of what we would expect, the percentage in each category level. We will do two tasks at once: convert a variable to a factor, and assign informative labels to the numeric codes, using code similar to that below.

# Not evaluated
mydata <- mydata %>% 
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
      )
)

Before moving on, let’s parse the above code.
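
One way to read the example is line by line. The sketch below repeats it on a small hypothetical tibble, with a comment on each piece; it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data with a coded sex variable.
mydata <- tibble(id = 1:4, sex = c(2L, 1L, 2L, 2L))

mydata <- mydata %>%          # start with mydata and store the result back in mydata
  mutate(                     # mutate() creates or overwrites columns
    sex = recode_factor(      # overwrite the sex column with a recoded factor
      sex,                    # the column whose values are recoded
      `1` = "male",           # old code 1 becomes "male"; backticks allow a number as a name
      `2` = "female"          # old code 2 becomes "female"
    )
  )

class(mydata$sex)    # "factor"
levels(mydata$sex)   # "male" "female"
```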

In the code chunk below, convert the sex variable in the nhanes dataset to a factor and code it according to the data dictionary.

# Enter code here
nhanes <- nhanes %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )

In the code chunk below, print the nhanes object again, and then use inspect() to see what in the summary has changed.

# Enter code here
print(nhanes)
## # A tibble: 100 × 33
##       id  race ethnicity sex     age familySize urban region
##    <int> <int>     <int> <fct> <int>      <int> <int>  <int>
##  1     1     2         3 fema…    56          1     1      2
##  2     2     1         3 fema…    73          1     2      4
##  3     3     1         3 fema…    25          2     1      3
##  4     4     1         1 fema…    53          2     2      3
##  5     5     1         1 fema…    68          2     2      3
##  6     6     1         3 fema…    44          3     2      4
##  7     7     2         3 fema…    28          2     1      3
##  8     8     1         3 male     74          2     2      2
##  9     9     1         3 fema…    65          1     2      1
## 10    10     1         2 fema…    61          3     1      4
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <int>, healthStatus <int>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <int>,
## #   smokeNow <int>, active <int>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …
inspect(nhanes) 
## 
## categorical variables:  
##   name  class levels   n missing
## 1  sex factor      2 100       0
##                                    distribution
## 1 female (58%), male (42%)                     
## 
## quantitative variables:  
##             name   class     min        Q1   median
## 1             id integer   1.000  25.75000  50.5000
## 2           race integer   1.000   1.00000   1.0000
## 3      ethnicity integer   1.000   1.00000   3.0000
## 4            age integer  17.000  34.75000  48.5000
## 5     familySize integer   1.000   2.00000   2.0000
## 6          urban integer   1.000   1.00000   2.0000
## 7         region integer   1.000   2.00000   3.0000
## 8            pir numeric   0.179   1.17625   1.9145
## 9   yrsEducation integer   0.000   8.00000  12.0000
## 10 maritalStatus integer   1.000   1.00000   1.0000
## 11  healthStatus integer   1.000   2.00000   3.0000
## 12  heightInSelf integer  52.000  62.00000  66.0000
## 13  weightLbSelf integer  82.000 135.00000 160.0000
## 14          beer integer   0.000   0.00000   0.0000
## 15          wine integer   0.000   0.00000   0.0000
## 16        liquor integer   0.000   0.00000   0.0000
## 17     everSmoke integer   0.000   0.00000   0.0000
## 18      smokeNow integer   1.000   2.00000   2.0000
## 19        active integer   1.000   1.25000   2.0000
## 20           SBP integer  94.000 112.00000 126.0000
## 21           DBP integer   0.000  64.00000  74.0000
## 22      weightKg numeric  35.800  63.85000  72.6500
## 23      heightCm numeric 139.400 157.20000 164.5000
## 24         waist numeric  68.100  84.50000  92.0000
## 25        tricep numeric   5.100  11.45000  19.2500
## 26         thigh numeric   4.400  10.10000  18.0000
## 27           BMD numeric   0.358   0.80750   0.9010
## 28           RBC numeric   3.430   4.28500   4.5900
## 29          lead numeric   0.700   2.07500   3.0500
## 30   cholesterol integer 122.000 179.25000 206.0000
## 31  triglyceride integer  29.000  86.75000 137.5000
## 32           hdl integer  14.000  41.75000  49.0000
##          Q3    max        mean         sd   n missing
## 1   75.2500 100.00  50.5000000 29.0114920 100       0
## 2    1.0000   2.00   1.2200000  0.4163332 100       0
## 3    3.0000   3.00   2.4400000  0.8912595 100       0
## 4   70.2500  90.00  51.4300000 21.5285833 100       0
## 5    4.0000  10.00   3.0100000  1.7780650 100       0
## 6    2.0000   2.00   1.5400000  0.5009083 100       0
## 7    3.0000   4.00   2.7400000  0.9600084 100       0
## 8    3.3545   6.77   2.4536395  1.6725237  86      14
## 9   13.0000  17.00  10.8989899  4.0317489  99       1
## 10   4.0000   7.00   2.6500000  2.2036471 100       0
## 11   4.0000   5.00   2.8200000  1.0766578 100       0
## 12  69.0000  76.00  65.7473684  4.4338361  95       5
## 13 180.0000 270.00 160.0736842 37.6004307  95       5
## 14   0.5000 365.00   7.1313131 38.2141246  99       1
## 15   0.0000  61.00   1.2424242  6.4824581  99       1
## 16   0.0000  13.00   0.7070707  2.3571538  99       1
## 17   1.0000   1.00   0.4444444  0.4994328  99       1
## 18   2.0000   2.00   1.7878788  0.4108907  99       1
## 19   3.0000   3.00   2.2040816  0.8243608  98       2
## 20 141.0000 216.00 129.4216867 22.8815191  83      17
## 21  81.0000 118.00  73.1084337 16.6374629  83      17
## 22  82.1500 137.75  73.4323656 17.4040663  93       7
## 23 171.9000 184.10 164.4440860  9.9532981  93       7
## 24  98.7000 129.90  93.3670588 13.1250613  85      15
## 25  26.6000  39.80  19.7340909  9.5348349  88      12
## 26  29.5000  41.80  20.4780822 11.1813491  73      27
## 27   1.0510   1.26   0.9028824  0.1881680  68      32
## 28   4.9850   6.15   4.6184783  0.5013256  92       8
## 29   4.7250  13.30   3.6934783  2.5428549  92       8
## 30 239.5000 370.00 210.0326087 43.8146574  92       8
## 31 192.2500 524.00 152.6304348 92.8197731  92       8
## 32  61.0000 101.00  51.8913043 15.8068399  92       8

STOP! Answer Question 3 now.

We can alter several variables at once using the mutate() command, for example,

# Not executed
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
    ),
  
  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
    ),
  
  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
    )
)

alters the race, ethnicity, and urban variables all at once.

In the code chunk below, convert all nominal variables (including the three above) to factors using mutate() and recode_factor(). Read the data dictionary carefully, as some variable codings are not consistent (e.g., everSmoke and smokeNow).

# Enter code here
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
  ),
  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
  ),
  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
  ),
  everSmoke = recode_factor(
    everSmoke,
    `1` = "yes",
    `2` = "no"
  ),
  smokeNow = recode_factor(
    smokeNow,
    `1` = "every day",
    `2` = "some days",
    `3` = "not at all"
  ),
  sex = recode_factor(
    sex,
    `1` = "male",
    `2` = "female"
  ),
  maritalStatus = recode_factor(
    maritalStatus,
    `1` = "married",
    `2` = "widowed",
    `3` = "divorced",
    `4` = "separated",
    `5` = "never married",
    `6` = "living with partner"
  )
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `everSmoke = recode_factor(everSmoke, `1` =
##   "yes", `2` = "no")`.
## Caused by warning:
## ! Unreplaced values treated as NA as `.x` is not
##   compatible.
## Please specify replacements exhaustively or supply
## `.default`.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining
##   warning.
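
A warning like this one suggests that at least one value in the column had no matching replacement and was turned into NA. In the inspect() output earlier, everSmoke ranges from 0 to 1, so the code 2 never occurs and the value 0 is left unmatched; maritalStatus has a maximum of 7 while only codes 1 to 6 are listed, which may explain the remaining warning. The sketch below uses hypothetical 0/1 data to show one way to check which codes are present before recoding and how many values are lost afterwards. The correct label for each code has to come from the data dictionary.

```r
library(dplyr)

# Hypothetical values coded 0/1, like the range inspect() reports for everSmoke.
toy <- tibble(everSmoke = c(0L, 1L, 1L, 0L, 1L))

# Step 1: list the codes that are actually present before recoding.
toy %>% count(everSmoke)

# Step 2: recode with names that do not cover every code (warning silenced here).
recoded <- suppressWarnings(
  toy %>% mutate(everSmoke = recode_factor(everSmoke, `1` = "yes", `2` = "no"))
)

# Step 3: the unmatched code 0 has become NA.
recoded %>% count(everSmoke)
sum(is.na(recoded$everSmoke))
```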

The variable healthStatus is an ordinal variable, so we want to tell R that it is an ordered factor. Create ordered value labels for the variable healthStatus using the .ordered = TRUE option, as in the following code:

# Not evaluated
mydata <- mydata %>% mutate(
  smokingCausesCancer = recode_factor(
    smokingCausesCancer,
    `1` = "strongly disagree",
    `2` = "disagree",
    `3` = "unsure",
    `4` = "agree",
    `5` = "strongly agree",
    .ordered = TRUE
  )
)

In the code chunk below, convert the variable healthStatus to an ordinal variable with an appropriate ordering.

# Enter code here
nhanes <- nhanes %>% mutate(
  healthStatus = recode_factor(
    healthStatus,
    `1` = "poor",
    `2` = "fair",
    `3` = "good",
    `4` = "very good",
    `5` = "excellent",
    .ordered = TRUE
  )
)

In the code chunk below, use the select() function as you did earlier to select and print just the variables id, race, and healthStatus.

# Enter code here
nhanes %>% select(id, race, healthStatus)
## # A tibble: 100 × 3
##       id race  healthStatus
##    <int> <fct> <ord>       
##  1     1 black very good   
##  2     2 white very good   
##  3     3 white good        
##  4     4 white excellent   
##  5     5 white very good   
##  6     6 white fair        
##  7     7 black good        
##  8     8 white good        
##  9     9 white good        
## 10    10 white good        
## # ℹ 90 more rows

Note the variable types <fct> for “factor” and <ord> for “ordered factor” (an ordinal variable).
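
The practical difference between the two types can be sketched in base R. Here factor(..., ordered = TRUE) is used as a base R counterpart to the .ordered = TRUE option; the four responses are hypothetical.

```r
# Hypothetical responses using the healthStatus labels from this lab.
labels_in_order <- c("poor", "fair", "good", "very good", "excellent")
responses <- c("good", "poor", "excellent", "fair")

unordered <- factor(responses, levels = labels_in_order)
ordered_f <- factor(responses, levels = labels_in_order, ordered = TRUE)

is.ordered(unordered)   # FALSE
is.ordered(ordered_f)   # TRUE

# Comparisons and max() are meaningful only for the ordered factor.
ordered_f >= "good"
max(ordered_f)
```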

To look closely at a single variable, we can use the pull() function, as in

# Not evaluated
mydata %>% pull(smokingCausesCancer)
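
As a brief sketch of how pull() differs from select(), assuming dplyr is installed: select() keeps a tibble with one column, while pull() returns the values themselves as a vector. The toy data are hypothetical.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(id = 1:3, race = factor(c("black", "white", "white")))

as_column <- toy %>% select(race)   # still a tibble, with one column
as_vector <- toy %>% pull(race)     # the column's values as a factor vector

class(as_column)
class(as_vector)
levels(as_vector)
```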

In the code chunk below, use the pull() function to print just the race variable, and in a separate command, print just the healthStatus variable.

# Enter code here
nhanes %>% pull(race)
##   [1] black white white white white white black white white
##  [10] white white white white white white black white white
##  [19] white white white white white white white white white
##  [28] white white black white white black white white white
##  [37] white white white black white white white white white
##  [46] white white white black black white white white white
##  [55] white black black black black white white white black
##  [64] white white white white white white black black white
##  [73] white white white black white white black white white
##  [82] white white white black white white white black black
##  [91] white white white white black white black white white
## [100] white
## Levels: white black
nhanes %>% pull(healthStatus)
##   [1] very good very good good      excellent very good
##   [6] fair      good      good      good      good     
##  [11] good      excellent fair      fair      very good
##  [16] fair      good      fair      very good very good
##  [21] fair      very good fair      good      fair     
##  [26] fair      fair      very good good      very good
##  [31] good      good      very good poor      excellent
##  [36] very good good      good      very good good     
##  [41] fair      very good good      poor      good     
##  [46] good      good      poor      good      very good
##  [51] fair      good      poor      poor      good     
##  [56] fair      good      poor      poor      good     
##  [61] fair      good      good      good      poor     
##  [66] very good fair      good      good      excellent
##  [71] good      very good good      fair      very good
##  [76] fair      fair      very good fair      good     
##  [81] excellent fair      fair      poor      good     
##  [86] very good very good poor      good      poor     
##  [91] fair      good      poor      good      poor     
##  [96] very good very good good      fair      fair     
## Levels: poor < fair < good < very good < excellent

Note that the order of the levels of healthStatus is the order in which they were defined, not the order of the old numeric codes.
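
This can be checked on a small hypothetical example, assuming dplyr is installed. The same mapping from codes to labels is written twice, once in code order and once in a different order; only the order of the levels changes.

```r
library(dplyr)

codes <- c(1L, 2L, 3L, 2L)   # hypothetical numeric codes

# Replacements listed in code order 1, 2, 3.
in_code_order <- recode_factor(codes, `1` = "low", `2` = "medium", `3` = "high")

# Same mapping, but listed in the order 3, 1, 2.
in_other_order <- recode_factor(codes, `3` = "high", `1` = "low", `2` = "medium")

levels(in_code_order)    # "low" "medium" "high"
levels(in_other_order)   # "high" "low" "medium"

# The values themselves are identical; only the level order differs.
identical(as.character(in_code_order), as.character(in_other_order))
```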

STOP! Answer Question 4 now.

Look at the codes (i.e., the labels for the numeric values) for the variable active in the data dictionary (the Excel file you downloaded earlier, nhanes_dataDictionary.xlsx; see column E). Then, in the code chunk below, reorder the response options so that they appear in a logical order, if necessary. You do not need to change which label belongs to which numeric value. Instead, list the numeric values in an order that makes their labels from the data dictionary appear in a logical order.

# Enter code here
nhanes$active <- factor(nhanes$active,
  levels = c(2, 3, 1),
  labels = c("Less Active", "About the Same", "More Active"))
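
To see how the levels and labels arguments of base R's factor() pair up, here is a minimal sketch with hypothetical codes and placeholder labels. The first entry of levels is matched with the first entry of labels, and so on, so each code keeps its own label while the display order changes.

```r
# Hypothetical codes; 1, 2, 3 stand for three response options.
codes <- c(1, 2, 3, 2, 1)

# levels gives the codes in the desired display order;
# labels gives the text for each code, in that same order.
reordered <- factor(codes,
  levels = c(2, 3, 1),
  labels = c("first shown", "second shown", "third shown"))

levels(reordered)         # display order follows the levels argument
as.character(reordered)   # each original code keeps its own label
table(reordered)          # counts appear in the new order
```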

STOP! Answer Question 5 now.

You can extract the row of the nhanes dataset that corresponds to a specific patient using the filter() function (note the double equal signs). Here we use two “pipes” (%>%) to perform two actions in sequence: filter just the rows with id == 10, then select a few columns or variables to display. For example:

# Not evaluated
nhanes %>%
  filter(id == 10) %>%
  select(id, race, ethnicity, sex, age)

prints the variables id, race, ethnicity, sex, and age for the subject with id = 10.
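
The role of the double equal sign can be sketched on hypothetical toy data, assuming dplyr is installed. The expression id == 3 is a test that produces TRUE or FALSE for every row, and filter() keeps the rows where it is TRUE.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  id  = 1:5,
  age = c(56L, 73L, 25L, 53L, 68L),
  sex = c("female", "female", "female", "male", "female")
)

# == tests each row for equality; rows where the test is TRUE are kept.
one_row <- toy %>%
  filter(id == 3) %>%
  select(id, age)

one_row

# The test itself is just a logical vector, one value per row.
toy$id == 3
```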

In the code chunk below, print the variables id, race, ethnicity, sex, age, and healthStatus for the subject with id = 2.

# Enter code here
nhanes %>%
  filter(id == 2) %>%
  select(id, race, ethnicity, sex, age, healthStatus)
## # A tibble: 1 × 6
##      id race  ethnicity    sex      age healthStatus
##   <int> <fct> <fct>        <fct>  <int> <ord>       
## 1     2 white not hispanic female    73 very good

STOP! Answer Question 6 now.

Part 4: Saving a dataset

In order to save all the work you did on this dataset (such as setting variable types and value labels), you need to save the dataset as an R object using the following command:

# Not evaluated
save(nhanes, file = "nhanes.RData")

In the code chunk below, use the exact command above to save the modified dataset. Open the folder to make sure the file has been saved.

# Enter code here
save(nhanes, file = "nhanes.RData")

The next time you want to use this R object, you would set your working directory to where this object is stored and then use the following command:

# Not evaluated
load("nhanes.RData")
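
As a self-contained illustration of the save and load round trip, the sketch below uses a hypothetical toy dataset and a temporary folder rather than your lab files. It suggests that the object comes back under its original name with its factor levels intact.

```r
# Hypothetical small dataset with a factor column.
toy <- data.frame(
  id  = 1:3,
  sex = factor(c("female", "male", "female"), levels = c("male", "female"))
)

rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)

original <- toy
rm(toy)            # remove the object from the workspace

load(rdata_path)   # restores an object named toy, factor levels included
identical(toy, original)
levels(toy$sex)
```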

Please turn in your completed worksheet (DOCX, i.e., a Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click on the “Submit Assignment” tab to complete your submission.