The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. The lab specifications can be found here.

To clarify which is which: R is the name of the programming language itself, and RStudio is a convenient interface for working with R. Think of it like this:

  • R is like a car’s engine,
  • RStudio is like a car’s dashboard.
The difference between R and RStudio

Exercise 1

Knit your RMarkdown file and observe the result in the form of a knitted report. Now make the following changes in your RMarkdown file: (a) In the YAML header, find the author’s name and replace it with your own. (b) Replace the date with today’s date. (c) Find the sentence “The lab specifications can be found here,” and turn the word “here” into a link. (d) Add an image of a car dashboard (a metaphor for RStudio) and a car engine (a metaphor for R).

The arbuthnot dataset

The Arbuthnot data set refers to the work of Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.

Exercise 2

Load the tidyverse library into your environment, and then load the arbuthnot dataset. Now use the glimpse function to display its content, and the dim function to display its dimensions. Write down the variables associated with the dataset and the number of observations in your dataset.

# Load the tidyverse library below
library(tidyverse)

# load the data into the environment
arbuthnot <- read_csv("data/arbuthnot.csv")


# Use `glimpse` function to inspect your dataset below
glimpse(arbuthnot)
## Rows: 82
## Columns: 3
## $ year  <dbl> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639…
## $ boys  <dbl> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366…
## $ girls <dbl> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784…
# Use the `dim` function to see the dimensions of your dataset 
dim(arbuthnot)
## [1] 82  3
summary(arbuthnot)
##       year           boys          girls     
##  Min.   :1629   Min.   :2890   Min.   :2722  
##  1st Qu.:1649   1st Qu.:4759   1st Qu.:4457  
##  Median :1670   Median :6073   Median :5718  
##  Mean   :1670   Mean   :5907   Mean   :5535  
##  3rd Qu.:1690   3rd Qu.:7576   3rd Qu.:7150  
##  Max.   :1710   Max.   :8426   Max.   :7779

Answer: There are 3 variables in the arbuthnot dataset. The names of the variables are: year, boys and girls. The dataset consists of 82 observations (rows).

Exercise 3

What command would you use to extract just the counts of girls baptized? Try it out in the console!

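# Counts of boys baptized each year, for comparison
arbuthnot$boys

# Counts of girls baptized each year (the answer to this exercise)
arbuthnot$girls

# Total number of girls baptized across all years (a single number, not the yearly counts)
sum(arbuthnot$girls)

Answer: To extract the counts of girls baptized, use arbuthnot$girls, which returns the girls column as a vector with one count per year. Wrapping it in sum(), as in sum(arbuthnot$girls), instead collapses those counts into a single total.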
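
To see the difference between extracting a column and summarizing it, here is a minimal sketch that uses only the first five rows shown in the glimpse output above. The name `arbuthnot_head` is an assumption made for illustration, so that the example runs without the CSV file; it uses base R only.

```r
# First five rows of the arbuthnot data, copied from the glimpse output
# (arbuthnot_head is a hypothetical name used only for this sketch)
arbuthnot_head <- data.frame(
  year  = c(1629, 1630, 1631, 1632, 1633),
  boys  = c(5218, 4858, 4422, 4994, 5158),
  girls = c(4683, 4457, 4102, 4590, 4839)
)

# `$` extracts one column as a vector: one count per year
girls_counts <- arbuthnot_head$girls
print(girls_counts)

# sum() collapses that vector into a single total
girls_total <- sum(girls_counts)
print(girls_total)
```

The first call keeps one value per row, which is what the exercise asks for; the second returns a single number.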

Exercise 4

Create the plot and answer the following: is there an apparent trend in the number of girls baptized over the years? How would you describe it?

# Insert code below
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_line()

Answer: There is an overall increase in the number of girls baptized per year over the period.

Exercise 5

Now, generate a plot of the proportion of boys born over time. What do you see?

# Insert code below
# Total baptisms per year
arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)

# Plot total baptisms over time
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()

# Ratio of boys to girls, and proportion of boys among all baptisms
arbuthnot <- arbuthnot %>%
  mutate(boy_to_girl_ratio = boys / girls,
         boy_ratio = boys / total)

# Plot the proportion of boys over time
ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) +
  geom_line()

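Answer: There is no clear upward or downward trend in the proportion of boys over time. It fluctuates from year to year but stays above 0.5, meaning more boys than girls were baptized in every year.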
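
The proportion calculation can be checked by hand on a few rows. The sketch below is a base-R version run on the first five years from the glimpse output; `arbuthnot_head` is a hypothetical name, and five rows say nothing about the full 82-year series.

```r
# First five rows of the arbuthnot data, copied from the glimpse output
# (arbuthnot_head is a hypothetical name used only for this sketch)
arbuthnot_head <- data.frame(
  year  = c(1629, 1630, 1631, 1632, 1633),
  boys  = c(5218, 4858, 4422, 4994, 5158),
  girls = c(4683, 4457, 4102, 4590, 4839)
)

# Proportion of boys among all baptisms in each year
arbuthnot_head$total <- arbuthnot_head$boys + arbuthnot_head$girls
arbuthnot_head$boy_ratio <- arbuthnot_head$boys / arbuthnot_head$total
print(round(arbuthnot_head$boy_ratio, 3))

# Check whether boys are the majority in each of these years
boys_majority <- all(arbuthnot_head$boy_ratio > 0.5)
print(boys_majority)
```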

The present dataset

Answer the following questions with the present dataset:

Exercise 6

What years are included in this data set? What are the dimensions of the data frame? What are the variable names? How many observations are in your data?

# We already loaded the tidyverse library
# so we do not need to do this again. 
# However, we do need to load the new data into the environment
# Insert code below
present <- read_csv("data/present.csv")



# What years are included in the dataset?
present %>%
  summarize(min = min(year),
            max = max(year)
            )
glimpse(present)
## Rows: 63
## Columns: 3
## $ year  <dbl> 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950…
## $ boys  <dbl> 1211684, 1289734, 1444365, 1508959, 1435301, 1404587, 1691220, 1…
## $ girls <dbl> 1148715, 1223693, 1364631, 1427901, 1359499, 1330869, 1597452, 1…
summary(present)
##       year           boys             girls        
##  Min.   :1940   Min.   :1211684   Min.   :1148715  
##  1st Qu.:1956   1st Qu.:1799857   1st Qu.:1711404  
##  Median :1971   Median :1924868   Median :1831679  
##  Mean   :1971   Mean   :1885600   Mean   :1793915  
##  3rd Qu.:1986   3rd Qu.:2058524   3rd Qu.:1965538  
##  Max.   :2002   Max.   :2186274   Max.   :2082052
# What are the dimensions of the dataset?
dim(present)
## [1] 63  3
# What are the variable names? 
names(present)
## [1] "year"  "boys"  "girls"

Answer: The data cover the years 1940 to 2002. The data frame has 63 rows and 3 columns: the variables are year, boys, and girls, and there are 63 observations.
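
As a rough base-R counterpart to the commands above, the sketch below answers the same three questions on a small stand-in data frame. `present_head` is a hypothetical name and holds only the first five rows from the glimpse output, so its year range and row count differ from the full data.

```r
# First five rows of the present data, copied from the glimpse output
# (present_head is a hypothetical name used only for this sketch)
present_head <- data.frame(
  year  = c(1940, 1941, 1942, 1943, 1944),
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499)
)

# Earliest and latest year in the data
year_range <- range(present_head$year)
print(year_range)

# Dimensions: number of rows (observations), then number of columns (variables)
print(dim(present_head))

# Variable names
print(names(present_head))
```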

Exercise 7

How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?

# Insert code below
arbuthnot %>%
  summarize(min = min(boys),
            max = max(boys)
            )
arbuthnot %>%
  summarize(min = min(girls),
            max = max(girls)
            )
present %>%
  summarize(min = min(boys),
            max = max(boys)
            )
present %>%
  summarize(min = min(girls),
            max = max(girls)
            )
ggplot(data = present) +
  geom_line(aes(x = year, y = boys), color = 'red') +
  geom_line(aes(x = year, y = girls), color = 'green')

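Answer: No, the counts are not of a similar magnitude. The yearly counts in the present data are in the millions (boys range from 1,211,684 to 2,186,274), while Arbuthnot’s are in the thousands (boys range from 2,890 to 8,426), so the present counts are hundreds of times larger.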
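
One way to make "similar magnitude" concrete is to compare orders of magnitude. The sketch below takes the minimum and maximum counts of boys from the two summary() outputs shown earlier; the variable names are assumptions made for illustration.

```r
# Minimum and maximum yearly counts of boys, copied from the summary() outputs
arbuthnot_boys <- c(min = 2890, max = 8426)
present_boys   <- c(min = 1211684, max = 2186274)

# Order of magnitude: the power of ten just below each count
arbuthnot_magnitude <- floor(log10(arbuthnot_boys))
present_magnitude   <- floor(log10(present_boys))
print(arbuthnot_magnitude)
print(present_magnitude)

# How many times larger the present counts are (min vs. min, max vs. max)
scale_factor <- present_boys / arbuthnot_boys
print(round(scale_factor))
```

This compares summary statistics rather than matched years, which is enough to judge scale but not to compare individual years.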

Exercise 8

Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? What explains the differences you observe? Include a plot in your response.

# Insert code below
present <- present %>%
  mutate(total = boys + girls)

ggplot(data = present, aes(x = year, y = boys/total)) + 
  geom_line()

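Answer: The proportion of boys born in the U.S. declines slightly over the years, but it remains above 0.5 throughout. Boys are therefore still born in greater proportion than girls, and Arbuthnot’s observation holds up.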

Exercise 9

In which year did we see the largest total number of births in the U.S.?

# Insert code below
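# Smallest and largest yearly totals
present %>%
  summarize(min = min(total),
            max = max(total)
            )
# Sort years from most to fewest births; the first row is the year with the largest total
present %>%
  arrange(desc(total))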

Answer: The largest total number of births occurred in 1961.
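
An alternative to sorting is to ask directly for the row with the largest total. The sketch below does this in base R with which.max(). It runs on a hypothetical five-row stand-in, `present_head`, built from the glimpse output, so the year it finds applies to those five rows only, not to the full data.

```r
# First five rows of the present data, copied from the glimpse output
# (present_head is a hypothetical name used only for this sketch)
present_head <- data.frame(
  year  = c(1940, 1941, 1942, 1943, 1944),
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499)
)

# Total births per year
present_head$total <- present_head$boys + present_head$girls

# which.max() returns the position of the largest value
peak_row <- which.max(present_head$total)

# Use that position to look up the matching year
peak_year <- present_head$year[peak_row]
print(peak_year)
```

Applied to all 63 rows of present, the same two steps would return the year with the largest total in the full data.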