Lab 1

Introduction

Overview: In this lab exercise, you will begin to familiarize yourself with the statistical software R and RStudio.

Objectives: At the end of this lab, you will be able to:

Access a file from the course website in Carmen and download it to your external drive;
Access the statistical software R through RStudio and acquaint yourself with the statistical concepts of datasets and variables;
Assign data types to variables and create value labels for variables.

Part 0: Download and organize files

Open RStudio.
In the bottom-right pane of RStudio, click the Packages tab, then Install. In the search box, type and choose “mosaic” (lower case), then leaving everything else as it is, click Install.
Choose a location for your lab files that you will be able to access later. We strongly recommend the OSU OneDrive folder when working on the RStudio server.
In your chosen location (the OSU OneDrive folder strongly recommended), create a directory or folder called PUBHBIO 2210 Labs.
Enter the PUBHBIO 2210 Labs directory or folder and create a subdirectory named Lab 1.
Download the five lab files from Carmen while in the RStudio server:
1. lab-01-intro-blank.html
2. lab-01-intro-blank.Rmd
3. lab-01-intro-worksheet-blank.docx
4. nhanesSubset.csv
5. nhanes-data-dictionary.xls
If you have not downloaded all of these files, do so now.
Save the five downloaded files in the PUBHBIO 2210 Labs/Lab 1 directory you just created. When working on labs, it is important to keep all related files in the same directory.
The file lab-01-intro-blank.Rmd is a source file, and lab-01-intro-blank.html is the output file from “knitting” the source file using RStudio. Text written in the source file will generally appear in the output file in a nice format, along with results from running code. You can read the lab instructions from the source or output files as is convenient for you.
Open the source file lab-01-intro-blank.Rmd in RStudio. In the toolbar above the file editor (top left) window, click Knit (with a blue yarn ball icon). Make sure that the output file is produced.
At the top of this file, replace “Firstname Lastname” with your name as the author. Knit the document and make sure that your name appears at the top of the output.
Follow the instructions below to complete the rest of the lab, writing your code and answers in lab-01-intro-blank.Rmd using RStudio. As you work, knit the document to see your progress and make sure everything is working correctly.
When prompted to answer a question, answer it in the worksheet file lab-01-intro-worksheet-blank.docx. You will submit the completed worksheet and the updated HTML file to Carmen.

Part 1: Import a dataset

We will load a dataset from the nhanesSubset.csv file into R, using the read.csv() function. For example, the command

# Not evaluated
mydata <- read.csv("datafile.csv")

loads the dataset from the file “datafile.csv” and stores it in an object called “mydata”. The code above is not executed by R because the option eval = FALSE is used (“eval” for “evaluate”).
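As a minimal sketch of what read.csv() does, the example below writes a tiny made-up CSV file to a temporary location and reads it back. The file contents and the object name toy are hypothetical and are not part of the lab data.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age", "1,2,34", "2,1,51"), csv_path)

# read.csv() treats the first line as column names and returns a data frame
toy <- read.csv(csv_path)

# str() shows the number of rows and columns and each column's type
str(toy)
```

Because every value in this toy file is a whole number, R stores all three columns as integers.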

In the code chunk below, read the dataset from the nhanesSubset.csv file and store it in an object called nhanes.

# A "code chunk" is everything between the ```{r} line above, and the ``` line below.
# When you want code to be executed, write it in a provided code chunk.
nhanes <- read.csv("nhanesSubset.csv")

The dataset is now held in the object named nhanes.

We will convert the nhanes object into the format (tibble, like “table”) that we will work with, and print that object (i.e., the stored dataset), using code similar to the following:

# Not evaluated
mydata <- as_tibble(mydata) 
print(mydata)

In the code chunk below, convert the nhanes object to a tibble and print it.

# Enter code here
nhanes <- as_tibble(nhanes)
print(nhanes)

Printing the dataset, either with print() or by simply entering the object name nhanes on its own line, should produce output when you knit your document. Note the following components of the table:

A tibble: 100 x 33: this means the table has 100 rows (observations) and 33 columns (variables);
id race ethnicity ...: these are the column (variable) names;
<int> <int> <int> ... <dbl>: these are variable types, which will be discussed in Part 2;
The leftmost column, which has no column name, gives the row number. This is a numbering system for the table and does not necessarily correspond to a subject ID.
In the rightmost column, two entries are NA (not available), indicating missing data;
By default, printing a tibble object only shows the first few rows and columns. At the bottom of the output, there is a note that there are 90 more rows and 24 more variables, along with a list of the remaining variables and their types.
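The same components can also be checked directly with a few functions. The sketch below uses a small made-up data frame (the object toy and its values are hypothetical) and loads dplyr, which provides as_tibble(); in the lab, the mosaic package makes the same function available.

```r
library(dplyr)

# A small, made-up data frame converted to a tibble
toy <- as_tibble(data.frame(
  id  = 1:3,
  age = c(34.5, 51, 27),
  sex = c(2L, 1L, 2L)
))

dim(toy)            # number of rows, then number of columns
names(toy)          # column (variable) names
sapply(toy, class)  # storage type of each column
```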

We may not be interested in all of the variables at the same time. We can use the select() function to select only a few columns or variables. For example, the command

# Not evaluated
mydata %>% select(variable.1, variable.2, variable.3)

will select variable.1, variable.2, and variable.3 from mydata, and print the result. The “pipe” symbol %>% means that we start with the object on the left side, and apply an action described by the right side.
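To make the pipe concrete, the sketch below shows that piping an object into select() gives the same result as passing the object as the first argument. The tibble toy and its values are made up for illustration, and dplyr is loaded directly here.

```r
library(dplyr)

# A small, made-up tibble
toy <- tibble(id = 1:3, race = c(1L, 2L, 3L), age = c(34, 51, 27))

# With the pipe: start with toy, then select two columns
piped <- toy %>% select(id, age)

# Without the pipe: toy is passed as the first argument
nested <- select(toy, id, age)

identical(piped, nested)
```

The pipe becomes more useful as the number of steps grows, because the steps read from top to bottom instead of from the inside out.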

In the code chunk below, use select() to select and print just the id, race, ethnicity, sex, age, and healthStatus variables in the nhanes dataset.

# Enter code here
nhanes %>% select(id, race, ethnicity, sex, age, healthStatus)

Part 2: Types of variables

In order to perform statistical analyses correctly in R, we need to pay attention to the type of the data. In a tibble, we can have the following types:

integer variables (labeled <int> in a tibble) are numbers without a decimal part, i.e., discrete variables;
numeric variables (labeled <dbl> for “double precision” in a tibble) are numbers that can have a decimal part, i.e., continuous variables;
factor variables (labeled <fct> or <ord> in a tibble) are not numbers and have defined levels, i.e., nominal or ordinal variables;
character variables (labeled <chr> in a tibble) are free text;
logical variables (labeled <lgl> in a tibble) can take values of TRUE or FALSE.

R will perform an analysis depending on the way the variable is stored. For example, R will not calculate a mean for a variable stored as a factor (nominal or ordinal variable): mean() returns NA with a warning instead of a number.
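A minimal base R sketch of why the storage type matters follows. The vector codes and its labels are made up for illustration.

```r
# Numeric codes for a categorical variable (made-up values)
codes <- c(1L, 2L, 2L, 1L)

# Stored as integers, R will happily average the codes,
# even though the average of category codes has no meaning
mean(codes)

# Stored as a factor, the same data are treated as categories
sex <- factor(codes, levels = c(1, 2), labels = c("male", "female"))

# mean() on a factor gives NA (with a warning, suppressed here)
suppressWarnings(mean(sex))

# A frequency table is the appropriate summary for a factor
table(sex)
```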

As with many datasets, this NHANES dataset is coded. This means that instead of recording responses like “male” and “female” as these words (text), we store the data as numeric values that correspond to specific responses (i.e., numeric codes). For example, the variable sex has values 1 and 2 in the NHANES dataset. We need to know what each numeric value corresponds to if we want to understand the demographics of our sample. We can determine what type each variable should be by reading the data dictionary (nhanes_dataDictionary.xlsx).

STOP! Answer Questions 1 and 2 now.

Part 3: Creating factors and adding value labels

The inspect() function prints a summary of a data object:

# Not evaluated
inspect(mydata)

In the code chunk below, inspect the nhanes dataset.

# Enter code here
inspect(nhanes)

You will notice that some variables that should be nominal (categorical) are not stored as factors: their summaries display the minimum, 1st quartile, median, 3rd quartile, maximum, mean, and standard deviation instead of what we would expect, the percentage per category level. We will do two tasks at once: convert a variable to a factor, and assign informative labels to the numeric codes, using code similar to that below.

# Not evaluated
mydata <- mydata %>% 
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
      )
)

Before moving on, let’s parse the above code.

mydata <- means that we’re going to assign a new value to the mydata object. Everything in the code after this part is creating the value that is going to be assigned to mydata.
mydata %>% means we’re going to start with the current mydata object and perform a sequence of one or more actions on it to produce the final value.
mutate() is the action we’re performing. Mutating data adds a variable or replaces an existing one. We’ve split the argument (everything between the parentheses) of this function into multiple lines to make it easier to read.
sex = recode_factor(sex, ...) means that we’re replacing the variable sex with a version of itself that has been recoded as a factor. We’ve also split up the argument of recode_factor() into multiple lines for readability.
`1` = "male" means that we’re assigning the label "male" to the code 1, and similarly for `2` = "female".
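The steps above can be tried on a small made-up tibble before applying them to the real data. In this sketch, the object toy and its values are hypothetical, and dplyr is loaded directly to provide mutate() and recode_factor().

```r
library(dplyr)

# A small, made-up tibble with sex stored as numeric codes
toy <- tibble(id = 1:4, sex = c(2L, 1L, 2L, 2L))

# Replace the coded variable with a labeled factor
toy <- toy %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )

levels(toy$sex)  # the labels, in the order they were defined
table(toy$sex)   # counts per category
```

After the recode, the column prints with the labels rather than the codes, and summaries report counts per level.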

In the code chunk below, convert the sex variable in the nhanes dataset to a factor and code it according to the data dictionary.

# Enter code here
nhanes <- nhanes %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )

In the code chunk below, print the nhanes object again, and then use inspect() to see what in the summary has changed.

# Enter code here
print(nhanes)

inspect(nhanes)

STOP! Answer Question 3 now.

We can alter several variables at once using the mutate() command. For example, the command

# Not executed
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
    ),
  
  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
    ),
  
  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
    )
)

alters the race, ethnicity, and urban variables all at once.

In the code chunk below, convert all nominal variables (including the three above) to factors using mutate() and recode_factor(). Read the data dictionary carefully, as some variable codings are not consistent (e.g., everSmoke and smokeNow).

# Enter code here
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
  ),

  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
  ),

  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
  )
  # Add the remaining nominal variables above (comma-separated),
  # using the codes given in the data dictionary.
)

The variable healthStatus is an ordinal variable, so we want to tell R that it is an ordered factor. Create ordered value labels for the variable healthStatus using the .ordered = TRUE option, as in the following code:

# Not evaluated
mydata <- mydata %>% mutate(
  smokingCausesCancer = recode_factor(
    smokingCausesCancer,
    `1` = "strongly disagree",
    `2` = "disagree",
    `3` = "unsure",
    `4` = "agree",
    `5` = "strongly agree",
    .ordered = TRUE
  )
)

In the code chunk below, convert the variable healthStatus to an ordinal variable with an appropriate ordering.

# Enter code here
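# Labels assumed from the usual NHANES general-health coding (1 = excellent, ..., 5 = poor);
# confirm them against the data dictionary. Levels are listed from worst to best.
nhanes <- nhanes %>% mutate(
  healthStatus = recode_factor(
    healthStatus,
    `5` = "poor",
    `4` = "fair",
    `3` = "good",
    `2` = "very good",
    `1` = "excellent",
    .ordered = TRUE
  )
)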

In the code chunk below, use the select() function as you did earlier to select and print just the variables id, race, and healthStatus.

# Enter code here
nhanes %>% select(id, race, healthStatus)

Note the variable types <fct> for “factor” and <ord> for “ordinal”.

To look closely at a single variable, we can use the pull() function, as in

# Not evaluated
mydata %>% pull(smokingCausesCancer)

In the code chunk below, use the pull() function to print just the race variable, and in a separate command, print just the healthStatus variable.

# Enter code here
nhanes %>% pull(race)
nhanes %>% pull(healthStatus)

Note that the order of the levels of healthStatus is the order in which they were defined, not the order of the old numeric codes.
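A short sketch of this behavior on made-up data: the codes below are deliberately listed in reverse numeric order, and the resulting levels follow the order of definition. The variable names code and size, and their labels, are hypothetical.

```r
library(dplyr)

# Made-up codes where 1 is the largest category and 3 the smallest
toy <- tibble(code = c(1, 3, 2, 3))

toy <- toy %>% mutate(
  size = recode_factor(
    code,
    `3` = "small",
    `2` = "medium",
    `1` = "large",
    .ordered = TRUE
  )
)

# Levels follow the order of definition: small < medium < large
levels(toy$size)

# Ordered factors support comparisons: is row 1 larger than row 2?
toy$size[1] > toy$size[2]
```

Here the first row (code 1, "large") compares as greater than the second row (code 3, "small"), even though its numeric code is smaller.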

STOP! Answer Question 4 now.

Look at the codes (i.e., the labels for the numeric values) for the variable active in the data dictionary (the Excel file you downloaded earlier, nhanes_dataDictionary.xlsx; see column E). Then, in the code chunk below, reorder the response options so that they appear in a logical order, if necessary. You are not required to change the original codes for the variable. Instead, list the numeric values in your code in an order such that their labels from the data dictionary appear in a logical order.

# Enter code here

STOP! Answer Question 5 now.

You can identify a specific row from the nhanes dataset by extracting the row that corresponds to a unique patient using the filter() function (note the double equal signs). Here we use two “pipes” (%>%) to perform two actions in sequence: filter just the rows with id == 10 then select a few columns or variables to display. For example:

# Not evaluated
nhanes %>%
  filter(id == 10) %>%
  select(id, race, ethnicity, sex, age)

prints the variables id, race, ethnicity, sex, and age for the subject with id = 10.
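The same pattern can be tried on a small made-up tibble. In this sketch, the object toy and its values are hypothetical, and dplyr is loaded directly.

```r
library(dplyr)

# A small, made-up tibble
toy <- tibble(
  id  = c(10L, 11L, 12L),
  age = c(34, 51, 27),
  sex = c(2L, 1L, 2L)
)

# == tests for equality; a single = would be read as naming an argument
one_row <- toy %>%
  filter(id == 10) %>%
  select(id, age)

one_row
```

Because id is unique in this toy data, the filter keeps exactly one row; the select() step then keeps only the two requested columns.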

In the code chunk below, print the variables id, race, ethnicity, sex, age, and healthStatus for the subject with id = 2.

# Enter code here
nhanes %>%
  filter(id == 2) %>%
  select(id, race, ethnicity, sex, age, healthStatus)

STOP! Answer Question 6 now.

Part 4: Saving a dataset

In order to save all the work you did on this dataset (such as setting variable types and value labels), you need to save the dataset as an R object using the following command:

# Not evaluated
save(nhanes, file = "nhanes.RData")

In the code chunk below, use the exact command above to save the modified dataset. Open the folder to make sure the file has been saved.

# Enter code here
save(nhanes, file = "nhanes.RData")

The next time you want to use this R object, you would set your working directory to where this object is stored and then use the following command:

# Not evaluated
load("nhanes.RData")
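The sketch below illustrates, on made-up data, why saving as an R object matters: a save() and load() round trip keeps the factor and its level order, while writing to CSV and reading back does not. The object toy and the temporary file paths are hypothetical.

```r
# A small, made-up data frame with a labeled factor
toy <- data.frame(
  id  = 1:3,
  sex = factor(c("female", "male", "female"), levels = c("male", "female"))
)

# Save to a temporary .RData file, remove the object, and load it back
rdata_path <- tempfile(fileext = ".RData")
save(toy, file = rdata_path)
rm(toy)
load(rdata_path)   # restores an object named toy

levels(toy$sex)    # the factor and its level order are preserved

# By contrast, a CSV round trip does not keep the factor definition
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
class(from_csv$sex)
```

Note that load() restores the object under the name it was saved with, so no assignment is needed.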

Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click the “Submit Assignment” tab to complete your submission.