Problem Set 1: Introduction to R and Measures of Disease

This problem set is meant to be a fresh introduction to using R, as well as an opportunity for you to calculate basic measures of disease using a simulated dataset. If R is completely new to you, here are some other, general R-coding resources:

R resources

For new users of R, the following sources are helpful learning resources:
* Calendar of UC Berkeley D-Lab workshops (Look for R bootcamps)
* Datacamp: R for SAS, SPSS, and Stata users (This course is for those familiar with statistical programming who only need to learn R-specific syntax. Great GUI learning interface, and free for introductory course, and academic discount for paid courses.)
* Try R codeschool (An interactive introductory course)
* UCLA Institute For Digital Research and Education: R resources (Great source for topic-specific tutorials)
* R -tutor introduction
* Johns Hopkins Coursera course in R programming

Dataset Overview

Before we get started with some basic analysis, here is a basic overview of the dataset (swineflu.Rdata). This dataset contains variables from a hypothetical study that followed individuals for 4 months (120 days total). There were no withdrawals or deaths among the study population during follow-up. The research question of interest that this cohort study attempts to answer is: Does wearing a facemask reduce the risk of becoming infected with swine flu?

The variables in the dataset are:

  1. swineflu: binary outcome variable.
  • 1= incident case of swine flu
  • 0 =person who is still at risk of swine flu
  1. facemask: binary exposure variable
  • 1=person wore face mask
  • 0=person did not wear face mask
  1. education: confounder variable. Categorical (1 low, 2 middle, 3 high)
  2. popdensity: confounder variable for population density. Ecological level variable. Continuous.
  3. hospital: binary selection bias variable.
  • 1=person went to the hospital
  • 0=person did not go to the hospital
  1. humidity: effect modifier. continuous (1 low, 2 middle, 3 high)
  2. persontime: continuous variable- each individual is in the study for a certain number of days, between 0 and 120 (study ends after 4 months, so maximum amount of person-days contributed to the study = 120 days).

New R Commands

As a quick reference for the future, the following commands will be used in this problem set.
load()
insheet()
dim() summary() head() tail() which()
tabulate()
table()
hist()
plot() boxplot()

Descriptive Statistics

Opening the Dataset: The dataset is available in multiple formats on the bspace website, including as an Rdata file, the file format most easily read into R. If you double click on this file, and have R installed on your computer, the dataset should automatically load. If in the future you want to keep your commands in a script so they can be re-run without change or consulted for reference, use this code in your R script (create a new blank do file through File -> New File -> R script):

  #Set working directory (the folder the data file is located in)
  setwd("C:/filepath/file_location")
  #Note the backslashes, /, rather than the normal file-path markers \.
  #Also note how hashtags are used to create comments, lines of text that is 
  #not run as code. 
  #Fill in the "filepath/file_location" with your own pathnames, for example:
  #setwd("C:/Users/Documents/250B/")

  #Load in Rdata
  load("swineflu.Rdata")

Importing Data from Spreadsheets to R:

Often times, your data are not available in R format, but most people know how to get their data into Excel format. (Though for large datasets like this, always check if the Excel version contains the same number of variables and observations; many versions of Excel have a row limit around 65,000.) A common and flexible format for data is a comma separated value - .csv - document. You can easily import .csv data into R and other statistical programs. In addition, the basic content of Excel files can be saved as a .csv for easier import. (For future reference, if your dataset contains dates or other columns that are stored with a particular cell format, the .csv format will preserve the data exactly as you see it presented and will not save any formats. For instance, it is generally best to format dates in Excel so that the full year is visible before saving as a .csv if you want R to be able to recognize them quickly.)

For practice, the swineflu.Rdata data is also saved in a .csv document, named swineflu.csv. To import this to R, first download it to the same location as the .dta document. Double click the swineflu.dta document and look at the command history window to note how R has written out the file location. Copy this location string. Now type:

swineflu <- read.csv("swineflu.csv", header=TRUE)

If the document has loaded properly, you should see the assigned data object name swineflu appear in the top right quadrant environmentand a quick dim(swineflu) should indicate there are 100,000 observations and 7 variables.

R Layout:

When you open Rstudio, there are 4 windows in the screen. The top left is a blank R script, where you can write permanent code (aka, you can save it an run the full script with one click (the run green button).) The bottom left window is the console, where code from the script is run, and output is displayed. R code can also be written directly into the console. The top right panel houses the “environment,” where all saved vaiables, vectors, and dataframes are listed. The bottom right window has a tab for “Files,” which lists the files in your working directory (the folder set with setwd()). There are also tabs for plots (after generating them with R commands), and for installed packages, which are additional user written functions. For example the package foreign, installed for the first time with install.package("foreign"") and loaded with library(foreign), includes the command read.xlsx(), which reads in excel files.

Start working with dataframes and variables:

Before analyzing any dataset, it is always useful to get to know what your data look like, and what type of variables you have. The first command you will be using is the codebook command.

R has a particular syntax, or order, that you have to follow in order for your commands to work. Typically, this syntax involves typing the “command” then the name of the “variable” that you want the command to work on within parenthesis, and then any extra frills and options go after the “variable.” For example try typing the following into the Console window.

summary(swineflu)
##     swineflu       education        facemask         hospital     
##  Min.   :0.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :2.000   Median :1.0000   Median :0.0000  
##  Mean   :0.048   Mean   :2.003   Mean   :0.5917   Mean   :0.0186  
##  3rd Qu.:0.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :3.000   Max.   :1.0000   Max.   :1.0000  
##     humidity       popdensity       persontime 
##  Min.   :1.000   Min.   :  2000   Min.   : 25  
##  1st Qu.:1.000   1st Qu.: 10000   1st Qu.: 72  
##  Median :2.000   Median : 57500   Median : 79  
##  Mean   :1.996   Mean   : 55200   Mean   : 79  
##  3rd Qu.:2.000   3rd Qu.:100000   3rd Qu.: 86  
##  Max.   :3.000   Max.   :100000   Max.   :120

In the above command, you can see that codebook gives you a variety of summary information about all the variables in the dataset. Use the $ operator to refer to a single variable.

summary(swineflu$persontime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      25      72      79      79      86     120

To see the help page for any command, enter a question mark followed by the command name in the console. For example, ?summary displays the summary() help file in the bottom right window.

You can examine the actual data with the head() and tail() commands, which display the first and last five rows, respectively.

head(swineflu)
##   swineflu education facemask hospital humidity popdensity persontime
## 1        0         1        0        0        3      10000         80
## 2        0         1        1        0        1     100000         79
## 3        0         1        1        0        2     100000         93
## 4        0         1        0        0        3      15000         94
## 5        0         1        0        0        2      10000         84
## 6        0         1        0        0        3       2000         88
tail(swineflu)
##        swineflu education facemask hospital humidity popdensity persontime
## 99995         0         3        1        0        1     100000         83
## 99996         0         3        0        0        3       2000         73
## 99997         0         3        1        0        3     100000         97
## 99998         0         3        1        0        2     100000         97
## 99999         0         3        1        0        3     100000         90
## 100000        0         3        1        0        2     100000         85

Indexing, logicals, and which Statements:

Now that we’ve worked out how to do a command on a variable, let’s work in an extra set of complexity. We will now use the brackets [] to refer to specific rows and columns in a dataframe, and which() to use a command on a variable by values of another variable. Brackets refer to a position in a vector or dataframe. vector[7] refers to the 7th value of the vector, while df[3,5] refers to the value in the third row and fifth columns of a dataframe. Here are two different ways to refer to the first 10 persontime values.

swineflu$persontime[1:10]
##  [1] 80 79 93 94 84 88 77 72 87 79
swineflu[1:10,7]
##  [1] 80 79 93 94 84 88 77 72 87 79

Leaving the row or column position equal returns all values across that row or column. The following code returns the values for all variables in the 100th row of data.

swineflu[100,]
##     swineflu education facemask hospital humidity popdensity persontime
## 100        0         1        0        0        2      15000         83

The which() command allows one to apply a funcition to part of the data, based on the values of another variable. For example, the following code 1) finds the average person-time, and 2) finds the average person-time for subjects which an education category equals 1.

#1)
mean(swineflu$persontime)
## [1] 78.99924
#2)
mean(swineflu$persontime[which(swineflu$education==1)])
## [1] 79.05551

Note the double equal signs, ==. This a logical indicator of equal, while a single = sign assigns a value to a new variable (equivalent to <-). For example:

Eight<-8
##Two equals to test if either side of the expression is equal 
Eight==7
## [1] FALSE
#One equal to assign a new value.
Eight=7
Eight
## [1] 7

There are several other logical arguments you could use: >= is greater than or equal to, <= less than or equal to, < less than (don’t type <<), != is not equal to, and of course > for greater than.

Descriptive Commands:

Now that you know the syntax, try the tabulate() and table() commands. You can tabulate() single variables, or do a crosstabulation using table(), or otherwise known as a “2x2 table.” For example:

tabulate(swineflu$facemask)
## [1] 59165
tabulate(swineflu$swineflu)
## [1] 4800
table(swineflu$swineflu,swineflu$facemask)
##    
##         0     1
##   0 38863 56337
##   1  1972  2828

The above table() command will break the population up into 4 possibilities: people negative for facemask and swine flu, positive for both, facemask wearers without swine flu, and non-facemask wearers with swine flu. Note the layout of this table places the 0 values of facemask in the top row and the 0 values of swine flu in the left-hand column. In most instances (depending on how exposure and outcome are categorized in the dataset), this results in a different layout than shown in the 2x2 tables in class. Be sure to check how exposure and disease are presented whenever you calculate or interpret measures from a 2x2 table.

The table() (and most other commands) can take more complicated arguments. (Use ?table to examine the full manual). For example, the first code below makes a 2x2 table with the 101st to the 200th rows of data. The second line of code creates a new variable highpop which is true if population density is over or equal to 10000.

table(swineflu$swineflu[101:200],swineflu$facemask[101:200])
##    
##      0  1
##   0 50 48
##   1  1  1
table(swineflu$swineflu, highpop = swineflu$popdensity >=10000 )
##    highpop
##     FALSE  TRUE
##   0  9537 85663
##   1   463  4337

Simple Graphics:

Another way to get to know your data is to use some basic graphical plots. Now that you know how to do the syntax, here are just a few of the commands you can play with:

hist(“variable”)
plot(“variable 1”, “variable 2)
boxplot(“variable 1”, “variable 2)
- note, that since the variables in this dataset tend to be categorical/binary, and not so much continuous, you won’t see typical scatterplots that you’re used to. For example:

#This first line of code just outputs the plots into a 2x2 grid, rather than one-by-one.
par(mfrow=c(2,2))
hist(swineflu$education)
hist(swineflu$swineflu)
hist(swineflu$popdensity)
hist(swineflu$persontime)

par(mfrow=c(1,2))
plot(swineflu$swineflu, swineflu$popdensity)
plot(swineflu$persontime, swineflu$popdensity)

Notice that the scatterplots are not that usefull for categorical variables like swineflu, which is either a 0 or a 1, depending on if the subject had swine-flue. Try boxplots to see if persontime, time each subject was followed in the study, varies between cases of swine flu and controls.

par(mfrow=c(1,1))
boxplot( swineflu$persontime~swineflu$swineflu)

On average, the control group (x-axis=0) was followed for longer than the swineflue group. Note the use of the ~ operator instead of the comma. This means “by”, as in persontime by value of the swineflu variable.

Generating and Removing Variables:

  • To remove a variable, it is simple: just use the rm() command and the name of the variable in quotes.
  • To generate variables, use the <- assignment operator with the name of the new variable to the left of the operator.

For example:

 swineflu2<- swineflu$swineflu
table(swineflu2,swineflu$swineflu)
##          
## swineflu2     0     1
##         0 95200     0
##         1     0  4800
  • note that since the two variables are identical, the 2x2 table’s observations are all in the yes/yes or no/no boxes of the 2x2 table.
rm("swineflu2") 

Simply duplicating variables is not as interesting, but you can do some very interesting things with the which statements. For instance: the variable “persontime” records how long a person has been in the study: the maximum possible value is 120 days, and the minimum is 1 day. What if you wanted to create a variable that was = 1 if they stayed the entire time, but equal to 0 otherwise? i.e.: create a binary variable that indicates if someone left early, or if they stayed the full time.

 swineflu$didnotleave <- 0
 swineflu$didnotleave[which(swineflu$persontime==120)] <-1 
 sum(swineflu$didnotleave)
## [1] 7

In the above commands, we first generated a variable that was equal to 0 for everyone (at this point, the 0 means that this person left early), attached to the swineflu dataframe. This could have also been generated as a standalone variable, but we’d have to explicitely set it up as a vector of 0s the same number of rows as the swineflu dataframe using the code didnotleave<-rep(0, nrow(swineflu)). In the next line of code, we fix the generated variable and recode it as a 1 for the people who stayed the full 120 days (replace takes a variable and replaces values with another value, given a which command). Notice the which command is within the square bracket [] operators. which returns a vector of indexs (locations of the positions in the vector that meet the logical criteria, here ==120). If the code is run on its own, it returns those indices:

which(swineflu$persontime==120)
## [1] 24635 36658 53612 61661 70644 86719 98564

When the which command is placed within the square brackets of swineflu$didnotleave[] , it is saying that, in the rows of the data where swineflu$persontime==120, replace the didnotleave variable with the value of 1.

Practice Assignment

Using the above commands, you should how be able to do the following homework assignment. Bold questions indicate trickier questions with respect to coding. We recommend writing your responses as a script file (File -> New File -> R Script) so that you can save it as a separate document for future reference and to compare with others’ approaches. Always saving your work in R in a script is incredibly useful for both the replicabilty of work, and because you can reuse, copy-and-paste, and tweak code for future work.

    1. Calculate the cumulative incidence of swine flu for full population using the simple cumulative method
  • 2. Calculate cumulative incidence of swine flu for people with at least 60 days of person-time using the simple cumulative method.

    1. Calculate the odds of swine flu for those wearing facemasks
    1. Calculate the odds of swine flu for those not wearing facemasks
    1. Out of the people who went to the hospital after feeling sick, what percent of them had swine flu?
  • 6.Calculate rate (incidence density) of disease for full population
    Hint: to get the total person-time, you need to sum up all the person-time contributed by each individual. generate “x” = sum(persontime) will give you a RUNNING total of all the person-time, i.e.: person 1 + person 2 + person 3…. _ person 100,000. The maximum of “x” will be the total person time. Alternatively, the fast way to do this is with a special form of generate called egen (egen allows more complex mathematical arguments when you generate variables. Try: egen x = sum(persontime)

  • 7. Calculate incidence density of disease for the first 60 days of follow-up.