Overview: In this lab exercise, you will begin to familiarize yourself with the statistical software R and RStudio.
Objectives: At the end of this lab, you will be able to:
Open RStudio.
In the bottom-right pane of RStudio, click the Packages tab, then Install. In the search box, type and choose “mosaic” (lower case), then leaving everything else as it is, click Install.
Choose a location for your lab files that you will be able to access later. For example, we strongly recommend the OSU OneDrive folder while in the RStudio server.
In your chosen location (the OSU OneDrive folder strongly
recommended), create a directory or folder called
PUBHBIO 2210 Labs.
Enter the PUBHBIO 2210 Labs directory or folder and
create a subdirectory named Lab 1.
Download the five lab files from Carmen while in the RStudio server:
- lab-01-intro-blank.html
- lab-01-intro-blank.Rmd
- lab-01-intro-worksheet-blank.docx
- nhanesSubset.csv
- nhanes-data-dictionary.xls
If you have not downloaded all of these files, do so now.
Save the five downloaded files in the
PUBHBIO 2210 Labs/Lab 1 directory (i.e., save the
downloaded files in the Lab 1 directory or folder created).
When working on labs, it is important to keep all related files in the
same directory.
The file lab-01-intro-blank.Rmd is a
source file, and lab-01-intro-blank.html
is the output file from “knitting” the source file
using RStudio. Text written in the source file will
generally appear in the output file in a nice format, along with results
from running code. You can read the lab instructions from the source or
output files as is convenient for you.
Open the source file lab-01-intro-blank.Rmd in
RStudio. In the toolbar above the file editor (top
left) window, click Knit (with a blue yarn ball icon).
Make sure that the output file is produced.
At the top of this file, replace “Firstname Lastname” with your name as the author. Knit the document and make sure that your name appears at the top of the output.
Follow the instructions below to complete the rest of the lab,
writing your code and answers in lab-01-intro-blank.Rmd
using RStudio. As you work, knit the document to see
your progress and make sure everything is working correctly.
When prompted to answer a question, answer it in the worksheet
file lab-01-worksheet-blank.docx. You will submit the
completed worksheet and the updated HTML file to
Carmen.
We will load a dataset from the nhanesSubset.csv file
into R, using the read.csv() function. For example, the
command
# Not evaluated
mydata <- read.csv("datafile.csv")
loads the dataset from the file “datafile.csv” and stores it in an object
called “mydata”. The code above is not executed by R
because the option eval = FALSE is used (“eval” for
“evaluate”).
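The read.csv() pattern above can be tried on a small made-up file before using the lab data. This is a minimal sketch: the temporary file, its column names, and its values are invented for illustration and are not part of the NHANES data.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column names and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age", "1,34", "2,51", "3,27"), csv_path)

# read.csv() returns a data frame; <- stores it in the object `mydata`.
mydata <- read.csv(csv_path)

print(mydata)
dim(mydata)    # number of rows, then number of columns
names(mydata)  # column (variable) names
```

For the lab itself, the file name passed to read.csv() is resolved relative to the folder containing the Rmd file when you knit, which is why all lab files should be kept in the same directory.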
In the code chunk below, read the dataset from the
nhanesSubset.csv file and store it in an object called
nhanes.
# A "code chunk" is everything between the ```{r} line above, and the ``` line below.
# When you want code to be executed, write it in a provided code chunk.
nhanes <- read.csv("nhanesSubset.csv")
The dataset is now held in the object named nhanes.
We will convert the nhanes object into the format
(tibble, like “table”) that we will work with, and print
that object (i.e., the stored dataset), using code similar to the
following:
# Not evaluated
mydata <- as_tibble(mydata)
print(mydata)
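To see what the conversion changes, here is a small sketch on a made-up data frame. It assumes the dplyr package is installed (it is installed along with mosaic); the object `toy_df` and its columns are hypothetical.

```r
library(dplyr)  # re-exports as_tibble() from the tibble package

# A made-up data frame standing in for a dataset read by read.csv().
toy_df <- data.frame(id = 1:3, weight = c(61.5, 72.0, 80.3))
class(toy_df)

# Convert to a tibble; the data are unchanged, only the container differs.
toy_tbl <- as_tibble(toy_df)
class(toy_tbl)

# Printing a tibble shows its dimensions and the type of each column.
print(toy_tbl)
```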
In the code chunk below, convert the nhanes object to a
tibble and print it.
# Enter code here
nhanes <- as_tibble(nhanes)
print(nhanes)
Printing the dataset by simply entering the object name
nhanes on its own line should produce some output when you
knit your document. Note the following components of the table:
- A tibble: 100 x 33: this means the table has 100 rows (observations) and 33 columns (variables);
- id race ethnicity ...: these are the column (variable) names;
- <int> <int> <int> ... <dbl>: these are variable types, which will be discussed in Part 2;
- NA (not available): indicates missing data;
- Printing a tibble object only shows the first few rows and columns. At the bottom of the output, there is a note that there are 90 more rows and 24 more variables, along with a list of the remaining variables and their modes.
We may not be interested in all of the variables at the same time. We
can use the select() function to select only a few columns
or variables. For example, the command
# Not evaluated
mydata %>% select(variable.1, variable.2, variable.3)
will select variable.1, variable.2, and
variable.3 from mydata, and print the result.
The “pipe” symbol %>% means that we start with the
object on the left side, and apply an action described by the right
side.
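One way to read the pipe is that the object on the left becomes the first argument of the function on the right. The sketch below checks this on a made-up tibble; it assumes dplyr is installed, and the object `toy` and its columns are hypothetical.

```r
library(dplyr)  # provides %>%, tibble(), and select()

# A made-up tibble with three columns.
toy <- tibble(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1.5, 2.5))

# Piped form: start with toy, then select two columns.
piped <- toy %>% select(a, c)

# Equivalent direct call: toy is passed as the first argument.
direct <- select(toy, a, c)

# Both forms give the same result.
all.equal(piped, direct)
names(piped)
```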
In the code chunk below, use select() to select and
print just the id, race,
ethnicity, sex, age, and
healthStatus variables in the nhanes
dataset.
# Enter code here
nhanes %>% select(id, race, ethnicity, sex, age, healthStatus)
In order to perform statistical analyses correctly in
R, we need to pay attention to the
type of the data. In a tibble, we can have
the following types:
- Integers (<int> in a tibble) are numbers without a decimal part, i.e., a discrete variable;
- Doubles (<dbl> for “double precision” in a tibble) are numbers with a decimal part, i.e., a continuous variable;
- Factors (<fct> or <ord> in a tibble) are not numbers and have defined levels, i.e., nominal or ordinal variables;
- Character strings (<chr> in a tibble) are free text;
- Logicals (<lgl> in a tibble) can take values of TRUE or FALSE.
R will perform an analysis depending on the way the variable is stored. For example, R will not permit you to calculate a mean for a variable stored as a factor (nominal or ordinal variable).
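The effect of storage type on an analysis can be seen with a short base R sketch. The vector `group_codes` is made up, and base R's factor() is used here only for illustration; it is not necessarily the function used later in this lab.

```r
# Made-up numeric codes for a nominal variable with two categories.
group_codes <- c(1L, 2L, 2L, 1L)

# The same values stored as a factor with labeled levels.
group_factor <- factor(group_codes, levels = c(1, 2), labels = c("male", "female"))

class(group_codes)
class(group_factor)

# R will average the integer codes, even though the result is not meaningful.
mean(group_codes)

# For a factor, mean() returns NA (with a warning, suppressed here).
suppressWarnings(mean(group_factor))

# Counts per category are the appropriate summary for a factor.
table(group_factor)
```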
As with many datasets, this NHANES dataset is coded. This means that
instead of recording responses like “male” and “female” as these words
(text), we store the data as numeric values that correspond to specific
responses (i.e., numeric codes). For example, the variable
sex has values 1 and 2 in the
NHANES dataset. We need to know what each numeric value corresponds to
if we want to understand the demographics of our sample. We can
determine what type each variable should be by reading the data
dictionary (nhanes_dataDictionary.xlsx).
The inspect command prints a summary of a data
object:
# Not evaluated
inspect(mydata)
In the code chunk below, inspect the nhanes dataset.
# Enter code here
inspect(nhanes)
You will notice that some variables that should be nominal (categorical) are not stored as factors: their summaries show the minimum, 1st quartile, median, 3rd quartile, maximum, mean, and standard deviation rather than the percentage in each category level that we would expect. We will do two tasks at once: convert a variable to a factor, and assign informative labels to the numeric codes, using code similar to that below.
# Not evaluated
mydata <- mydata %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )
Before moving on, let’s parse the above code.
- mydata <- means that we’re going to assign a new value to the mydata object. Everything in the code after this part is creating the value that is going to be assigned to mydata.
- mydata %>% means we’re going to start with the current mydata object and perform a sequence of one or more actions on it to produce the final value.
- mutate() is the action we’re performing. Mutating data adds a variable or replaces an existing one. We’ve split the argument (everything between the parentheses) of this function into multiple lines to make it easier to read.
- sex = recode_factor(sex, ...) means that we’re replacing the variable sex with a version of itself that has been recoded as a factor. We’ve also split up the argument of recode_factor() into multiple lines for readability.
- `1` = "male" means that we’re assigning the label "male" to the code 1, and similarly for `2` = "female".
In the code chunk below, convert the sex variable in the
nhanes dataset to a factor and code it according to the
data dictionary.
# Enter code here
nhanes <- nhanes %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )
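To see the recoding step in isolation, the same pattern can be run on a small made-up tibble. This sketch assumes dplyr is installed; the object `toy` and its values are hypothetical, while the labels are the ones used in the example above.

```r
library(dplyr)  # provides tibble(), mutate(), recode_factor(), and %>%

# A made-up dataset with sex stored as numeric codes.
toy <- tibble(id = 1:4, sex = c(2L, 1L, 2L, 2L))

# Replace the coded variable with a labeled factor.
toy <- toy %>%
  mutate(
    sex = recode_factor(
      sex,
      `1` = "male",
      `2` = "female"
    )
  )

class(toy$sex)   # the variable is now a factor
levels(toy$sex)  # levels follow the order in which labels were given
table(toy$sex)   # counts per category
```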
In the code chunk below, print the nhanes object again,
and then use inspect() to see what in the summary has
changed.
# Enter code here
print(nhanes)
inspect(nhanes)
We can alter several variables at once using the
mutate() command, for example,
# Not executed
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
  ),
  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
  ),
  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
  )
)
alters the race, ethnicity, and
urban variables all at once.
In the code chunk below, convert all nominal variables (including the
three above) to factors using mutate() and
recode_factor(). Read the data dictionary carefully, as
some variable codings are not consistent (e.g., everSmoke
and smokeNow).
# Enter code here
nhanes <- nhanes %>% mutate(
  race = recode_factor(
    race,
    `1` = "white",
    `2` = "black",
    `3` = "other"
  ),
  ethnicity = recode_factor(
    ethnicity,
    `1` = "mexican-american",
    `2` = "other hispanic",
    `3` = "not hispanic"
  ),
  urban = recode_factor(
    urban,
    `1` = "metro area of 1 million",
    `2` = "other"
  )
  # Add the remaining nominal variables here (after a comma), coded per the data dictionary
)
The variable healthStatus is an ordinal variable, so we
want to tell R that it is an ordered factor. Create
ordered value labels for the variable
healthStatus using the .ordered = TRUE option
as in the following code:
# Not evaluated
mydata <- mydata %>% mutate(
  smokingCausesCancer = recode_factor(
    smokingCausesCancer,
    `1` = "strongly disagree",
    `2` = "disagree",
    `3` = "unsure",
    `4` = "agree",
    `5` = "strongly agree",
    .ordered = TRUE
  )
)
In the code chunk below, convert the variable
healthStatus to an ordinal variable with an appropriate
ordering.
# Enter code here
nhanes <- nhanes %>% mutate(
  # Labels below are assumed from the usual NHANES coding; confirm against the data dictionary
  healthStatus = recode_factor(
    healthStatus,
    `5` = "poor",
    `4` = "fair",
    `3` = "good",
    `2` = "very good",
    `1` = "excellent",
    .ordered = TRUE
  )
)
In the code chunk below, use the select() function as
you did earlier to select and print just the variables id,
race, and healthStatus.
# Enter code here
nhanes %>% select(id, race, healthStatus)
Note the variable types <fct> for “factor” and
<ord> for “ordinal”.
To look closely at a single variable, we can use the
pull() function, as in
# Not evaluated
mydata %>% pull(smokingCausesCancer)
In the code chunk below, use the pull() function to
print just the race variable, and in a separate command,
print just the healthStatus variable.
# Enter code here
nhanes %>% pull(race)
nhanes %>% pull(healthStatus)
Note that the order of the levels of healthStatus is the
order in which they were defined, not the order of the old numeric
codes.
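The point about level order can be checked on a made-up vector of codes. This is a sketch assuming dplyr is installed; the codes and the labels "low", "medium", and "high" are invented for illustration and do not come from the data dictionary.

```r
library(dplyr)  # provides recode_factor()

# Made-up numeric codes, where code 1 is the highest category.
codes <- c(1L, 3L, 2L, 3L)

# Labels are listed from lowest to highest, so code 3 comes first.
rating <- recode_factor(
  codes,
  `3` = "low",
  `2` = "medium",
  `1` = "high",
  .ordered = TRUE
)

levels(rating)         # order of definition, not order of the numeric codes
is.ordered(rating)     # the factor is ordinal
rating[1] > rating[2]  # comparisons use the level order ("high" vs "low")
```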
Look at the Codes (i.e., the labels for the numeric
values) for the variable active. Do this by consulting the
data dictionary (the Excel file you downloaded earlier,
nhanes_dataDictionary.xlsx; see column E). Then, in the
code chunk below, reorder the response options so that they appear in a
logical order, if necessary. You are not required to change the
original codes (i.e., the labels for the numeric values)
for the variable. Instead, list the numeric values in your code so
that their labels from the data dictionary appear in a
logical order.
# Enter code here
You can identify a specific row from the nhanes dataset
by extracting the row that corresponds to a unique patient using the
filter() function (note the double equal signs). Here we
use two “pipes” (%>%) to perform two actions in
sequence: filter just the rows with id == 10
then select a few columns or variables to display. For
example:
# Not evaluated
nhanes %>%
filter(id == 10) %>%
select(id, race, ethnicity, sex, age)
prints the variables id, race,
ethnicity, sex, and age for the subject with id =
10.
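The filter-then-select sequence can be tried on a small made-up tibble. This sketch assumes dplyr is installed; the object `toy` and its values are hypothetical.

```r
library(dplyr)  # provides tibble(), filter(), select(), and %>%

# A made-up dataset with one row per subject.
toy <- tibble(
  id  = c(5L, 10L, 15L),
  age = c(40L, 62L, 29L),
  sex = c(1L, 2L, 1L)
)

# == tests equality row by row; a single = would be read as naming an argument.
one_row <- toy %>%
  filter(id == 10) %>%
  select(id, age)

print(one_row)
```

Because id is unique for each subject, the filter keeps exactly one row.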
In the code chunk below, print the variables id,
race, ethnicity, sex,
age, and healthStatus for the subject with id
= 2.
# Enter code here
nhanes %>%
  filter(id == 2) %>%
  select(id, race, ethnicity, sex, age, healthStatus)
In order to save all the work you did on this dataset (such as
setting variable types and value labels), you need to save the dataset
as an R object using the following command:
# Not evaluated
save(nhanes, file = "nhanes.RData")
In the code chunk below, use the exact command above to save the modified dataset. Open the folder to make sure the file has been saved.
# Enter code here
save(nhanes, file = "nhanes.RData")
The next time you want to use this R object, you would
set your working directory to where this object is stored and then use
the following command:
# Not evaluated
load("nhanes.RData")
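The save and load steps can be tested as a round trip on a made-up object. This sketch uses only base R; the object `toy` and the temporary file location are hypothetical and chosen so that nothing is written to your lab folder.

```r
# A made-up dataset with a factor variable, standing in for nhanes.
toy <- data.frame(id = 1:3, sex = factor(c("male", "female", "female")))

# Save the object to a file in R's temporary directory.
rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)

# Remove the object, then restore it from the file.
rm(toy)
load(rdata_path)

exists("toy")   # the object is back under its original name
class(toy$sex)  # variable types such as factors are preserved
```

Unlike read.csv(), load() restores the object under the name it had when it was saved, so no assignment with <- is needed.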
Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click on the “Submit Assignment” tab to complete your submission.