GCH 601 Intro to Biostats, Fall 2019

Introduction

Lab number one for the course is a simple exercise for learning how to use STATA. While STATA is quite powerful for performing statistical analyses, its cost may be a limiting factor for many professionals looking to do statistical analysis of data for research and other industries. R and the R Studio Integrated Development Environment offer a reasonable solution as R and R Studio are open source software that are free to use. For this lab, I will show you how to replicate the work done on STATA using R.

First, the Libraries

R operates using a series of programs stored in libraries. These libraries can be downloaded and installed directly from the R repositories via R Studio by typing in some simple code into the console or your script. Here, I add the libraries that I am going to use for the lab. (Note the comments preceeded by a hashtag, or pound, sign.)

library(tidyverse) # To import data in STATA format
library(Hmisc) # To summarize data
library(haven)

Some project require many more libraries than that, but these two will do for this lab.

Bring in the data

Next, we need to bring in the dataset. For this lab, we are going to use the arbuthnot dataset, which is in a .dta format. Not a problem. R can handle it through the use of the haven package that you loaded above. Before you bring in your data, make sure you set the working directory to where the data are stored. Here, we will import the data to a temporary data frame and call it mydata.

mydata <- read_dta("arbuthnot.dta") # Imports the file in quotes. Make sure the working directory is correct.

Some Data Exploration

Let’s do some quick data exploration.

describe(mydata) # Look at the variables quickly and see some of their values

## mydata 
## 
##  3  Variables      82  Observations
## ---------------------------------------------------------------------------
## year  Format:%12.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       82        0       82        1     1670    27.67     1633     1637 
##      .25      .50      .75      .90      .95 
##     1649     1670     1690     1702     1706 
## 
## lowest : 1629 1630 1631 1632 1633, highest: 1706 1707 1708 1709 1710
## ---------------------------------------------------------------------------
## boys  Format:%12.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       82        0       79        1     5907     1906     3210     3400 
##      .25      .50      .75      .90      .95 
##     4759     6073     7576     7911     8100 
## 
## lowest : 2890 3079 3157 3196 3209, highest: 8102 8239 8366 8379 8426
## ---------------------------------------------------------------------------
## girls  Format:%12.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       82        0       80        1     5535     1835     2911     3188 
##      .25      .50      .75      .90      .95 
##     4457     5718     7150     7480     7654 
## 
## lowest : 2722 2746 2781 2840 2908, highest: 7656 7683 7687 7767 7779
## ---------------------------------------------------------------------------

Exercise #1

For the first exercise, we need to find out more about the boys variable in the dataset.

describe(mydata$boys)

## mydata$boys  Format:%12.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       82        0       79        1     5907     1906     3210     3400 
##      .25      .50      .75      .90      .95 
##     4759     6073     7576     7911     8100 
## 
## lowest : 2890 3079 3157 3196 3209, highest: 8102 8239 8366 8379 8426

Data Visualization

The lab exercise asks you to visualize the girls variable with a scatterplot. So let’s do that in R:

plot(mydata$year, mydata$girls) # Creates a basic scatterplot of the "girls" variable. (NOTE: X, Y)

If we wanted to change the color, we can change the color to red:

plot(mydata$year, mydata$girls, col = "red") # Adding the color red to the points.

Exercise #2

Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.) Read all about the plot function.

plot(mydata$year, # The same scatterplot, with some labeling and color to it.
     mydata$girls,
     type = "p",
     main = "Scatterplot of the number of girls baptized per year",
     xlab = "YEAR",
     ylab = "GIRLS",
     col = "red")
lines(lowess(mydata),
      col = "blue") # Adds a lowess line. What is that?

R as a Calculator

5 + 9

## [1] 14

5 / 9

## [1] 0.5555556

5 * 9

## [1] 45

5 - 9

## [1] -4

5^9

## [1] 1953125

sqrt(25) # Square root of "25"

## [1] 5

log(100, base = 10) # Log of 100, using the base 10

## [1] 2

exp(2) # Exponentiate 2

## [1] 7.389056

Adding New Variables to the Data Frame

mydata$total = mydata$boys + mydata$girls # Creates new variable "total" by addings "boys" plus "girls"

plot(mydata$year, # X variable
     mydata$total) # Y variable

5218 / 4683 # Ratio of boys to girls baptized in 1629

## [1] 1.114243

mydata$ratio = mydata$boys/mydata$girls

5218 / (5218 + 4683) # Proportion that are newborn boys in 1629

## [1] 0.5270175

mydata$boyratio = mydata$boys / mydata$total

mydata <- mutate(mydata, boyratio = (boys/total)) # A different way of doing it, courtesy of the "tidyverse" package

Exercise 3

Now, generate a plot of the proportion of boys born over time. What do you see?

plot(mydata$year, # The same scatterplot, with some labeling and color to it.
     mydata$boyratio,
     type = "p",
     main = "Scatterplot of the proportion of boys born over time",
     xlab = "YEAR",
     ylab = "PROPORTION OF BIRTHS THAT ARE BOYS",
     col = "blue")

mydata$moreboys = mydata$boys > mydata$girls # "TRUE" if there are more boys than girls being born

Exercise 4

Before we do exercise 4, we need to load the present data.

present <- read_dta("present.dta")

What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?

describe(present)

## present 
## 
##  3  Variables      74  Observations
## ---------------------------------------------------------------------------
## year  Format:%10.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       74        0       74        1     1976       25     1944     1947 
##      .25      .50      .75      .90      .95 
##     1958     1976     1995     2006     2009 
## 
## lowest : 1940 1941 1942 1943 1944, highest: 2009 2010 2011 2012 2013
## ---------------------------------------------------------------------------
## boys  Format:%10.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       74        0       74        1  1917502   242448  1441193  1615829 
##      .25      .50      .75      .90      .95 
##  1823071  1988038  2076156  2146859  2179796 
## 
## lowest : 1211684 1289734 1404587 1435301 1444365
## highest: 2179708 2179960 2184237 2186274 2208071
## ---------------------------------------------------------------------------
## girls  Format:%10.0g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       74        0       74        1  1825037   234497  1362835  1533097 
##      .25      .50      .75      .90      .95 
##  1731210  1897810  1979778  2046355  2075985 
## 
## lowest : 1148715 1223693 1330869 1359499 1364631
## highest: 2074824 2078142 2081318 2082052 2108162
## ---------------------------------------------------------------------------

How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?

summary(present$boys)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 1211684 1823071 1988038 1917502 2076156 2208071

summary(mydata$boys)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2890    4759    6073    5907    7576    8426

summary(present$girls)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 1148715 1731210 1897810 1825037 1979778 2108162

summary(present$boys)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 1211684 1823071 1988038 1917502 2076156 2208071

Exercise 5

Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from Ex 3 above, just replace the dataframe name.

present <- mutate(present, total = (boys + girls))
present <- mutate(present, boyratio = (boys/total))

plot(present$year, # The same scatterplot, with some labeling and color to it.
     present$boyratio,
     type = "p",
     main = "Scatterplot of the proportion of boys born over time",
     xlab = "YEAR",
     ylab = "PROPORTION OF BIRTHS THAT ARE BOYS",
     col = "blue")

Exercise 6

In what year did we see the most total number of births in the U.S.? Hint: First calculate the totals and save it as a new variable. Then, sort your dataset in descending order based on the total column.

sort(present$total) # This doesn't help much because you have too many observations

##  [1] 2360399 2513427 2735456 2794800 2808996 2936860 3136965 3144198
##  [9] 3159958 3167788 3258411 3288672 3326632 3333279 3494398 3501564
## [17] 3520959 3535068 3554149 3555970 3559529 3600206 3606274 3612258
## [25] 3629238 3638933 3669141 3680537 3699940 3731386 3750850 3756547
## [33] 3760358 3760561 3809394 3846986 3880894 3891494 3899589 3902120
## [41] 3909510 3932181 3941553 3952767 3952841 3953590 3959417 3999386
## [49] 4000240 4017362 4021726 4025933 4027490 4040958 4047295 4058814
## [57] 4065014 4089950 4098020 4110907 4112052 4130665 4138349 4158212
## [65] 4163090 4167362 4203812 4244796 4247694 4254784 4257850 4265555
## [73] 4268326 4316233

tibble(present$year,present$total) # Again, this doesn't help

## # A tibble: 74 x 2
##    `present$year` `present$total`
##             <dbl>           <dbl>
##  1           1940         2360399
##  2           1941         2513427
##  3           1942         2808996
##  4           1943         2936860
##  5           1944         2794800
##  6           1945         2735456
##  7           1946         3288672
##  8           1947         3699940
##  9           1948         3535068
## 10           1949         3559529
## # … with 64 more rows

max(present$total) # This tells you how many births were the highesst in a year, not what year

## [1] 4316233

present[present$total == max(present$total), "year"] # This does the trick!

## # A tibble: 1 x 1
##    year
##   <dbl>
## 1  2007

In that last command, we told R to filter data to keep only the highest “total” row. We then told it to return the value of “year” for that corresponding highest “total.”

That’s it!

That’s it. You’re all done with lab 1 using R. Congratulations!

GCH 601 Intro to Biostats, Fall 2019 – LAB #1

René F. Najera, DrPH

9/24/2019