The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. The lab specifications can be found here.
To clarify which is which: R is the name of the
programming language itself, and RStudio is a convenient interface for
working with R. Think of it like this: R is the car's engine and RStudio is its dashboard.
Knit your RMarkdown file and observe the result in the form of a knitted report. Now change the following in your RMarkdown:
(a) In the YAML, identify the author’s name and replace it with your name.
(b) Replace the date with today’s date.
(c) Identify the sentence “The lab specifications can be found here,” and turn the word “here” into a link.
(d) Add an image of the car dashboard (a metaphor for RStudio) and the car engine (a metaphor for R).
arbuthnot dataset
The Arbuthnot dataset refers to the work of Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.
Load the tidyverse library into your environment, and
then load the arbuthnot dataset. Now use the
glimpse function to display its content, and the
dim function to display its dimensions. Write down the
variables associated with the dataset and the number of observations in
your dataset.
# Load the tidyverse library below
library(tidyverse)
# load the data into the environment
arbuthnot <- read_csv("data/arbuthnot.csv")
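# Use `glimpse` function to inspect your dataset below
glimpse(arbuthnot)
# Use `dim` function to display its dimensions
dim(arbuthnot)
# Summarize each variable
summary(arbuthnot)
## Rows: 82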
## Columns: 3
## $ year <dbl> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639…
## $ boys <dbl> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366…
## $ girls <dbl> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784…
## [1] 82 3
## year boys girls
## Min. :1629 Min. :2890 Min. :2722
## 1st Qu.:1649 1st Qu.:4759 1st Qu.:4457
## Median :1670 Median :6073 Median :5718
## Mean :1670 Mean :5907 Mean :5535
## 3rd Qu.:1690 3rd Qu.:7576 3rd Qu.:7150
## Max. :1710 Max. :8426 Max. :7779
Answer: There are 3 variables in the arbuthnot
dataset. The names of the variables are: year,
boys and girls. The dataset consists of 82
observations (rows).
What command would you use to extract just the counts of girls baptized? Try it out in the console!
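# Counts of boys baptized, one value per year
arbuthnot$boys
# Counts of girls baptized, one value per year
arbuthnot$girls
# Total number of girls baptized across all years (a single number)
sum(arbuthnot$girls)
Answer: To extract the counts of baptized girls, use arbuthnot$girls, which returns the girls column as a vector with one count per year. sum(arbuthnot$girls) does something different: it adds those counts into a single total.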
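To make the difference between extracting a column and summing it concrete, here is a minimal sketch. It rebuilds only the first five rows of the data, with values copied from the glimpse output above, so it runs without the CSV file. The name arbuthnot_head is made up for this illustration and is not part of the lab.

```r
# First five rows only, copied from the glimpse() output above.
# `arbuthnot_head` is a hypothetical name used for illustration.
arbuthnot_head <- data.frame(
  year  = c(1629, 1630, 1631, 1632, 1633),
  boys  = c(5218, 4858, 4422, 4994, 5158),
  girls = c(4683, 4457, 4102, 4590, 4839)
)

# `$` extracts the column as a vector: one count per year
girl_counts <- arbuthnot_head$girls
girl_counts

# sum() collapses that vector into a single total
girl_total <- sum(arbuthnot_head$girls)
girl_total
```

With the full dataset, the same two calls on arbuthnot return 82 yearly counts and one overall total, respectively.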
Create the plot and answer the following: is there an apparent trend in the number of girls baptized over the years? How would you describe it?
Answer: Yes. Overall, the number of girls baptized per year increases over the period.
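The code for this plot is not shown above, so the following is one possible sketch rather than the exact code used. It assumes ggplot2 is available (it is loaded as part of the tidyverse) and uses a five-row sample copied from the glimpse output so that it runs on its own. The names arbuthnot_head and girls_plot are hypothetical.

```r
library(ggplot2)

# First five rows only, copied from the glimpse() output above.
# `arbuthnot_head` is a hypothetical name used for illustration.
arbuthnot_head <- data.frame(
  year  = c(1629, 1630, 1631, 1632, 1633),
  boys  = c(5218, 4858, 4422, 4994, 5158),
  girls = c(4683, 4457, 4102, 4590, 4839)
)

# Line plot of girls baptized per year
girls_plot <- ggplot(data = arbuthnot_head, aes(x = year, y = girls)) +
  geom_line() +
  labs(x = "Year", y = "Girls baptized")

girls_plot
```

In the lab itself, passing the full arbuthnot data frame as the data argument draws the trend across all 82 years.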
Now, generate a plot of the proportion of boys born over time. What do you see?
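# Insert code below
# Total number of baptisms per year
arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()
# Ratio of boys to girls (not the proportion asked for, kept for comparison)
arbuthnot <- arbuthnot %>%
  mutate(boy_to_girl_ratio = boys / girls)
# Proportion of boys among all baptisms
arbuthnot <- arbuthnot %>%
  mutate(boy_ratio = boys / total)
ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) +
  geom_line()
Answer: There is no clear trend. The proportion of boys fluctuates from year to year without rising or falling steadily, and it stays slightly above 0.5, in line with Arbuthnot's observation.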
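The proportion of boys is boys / (boys + girls), which differs from the boy-to-girl ratio boys / girls. For 1629, for example, the proportion is 5218 / (5218 + 4683), or about 0.527. The sketch below computes the proportion and a logical flag for "more boys than girls" on a five-row sample copied from the glimpse output. It assumes dplyr is available (it is loaded as part of the tidyverse). The name arbuthnot_head and the column more_boys are introduced here for illustration only.

```r
library(dplyr)

# First five rows only, copied from the glimpse() output above.
# `arbuthnot_head` is a hypothetical name used for illustration.
arbuthnot_head <- data.frame(
  year  = c(1629, 1630, 1631, 1632, 1633),
  boys  = c(5218, 4858, 4422, 4994, 5158),
  girls = c(4683, 4457, 4102, 4590, 4839)
)

arbuthnot_head <- arbuthnot_head %>%
  mutate(
    total     = boys + girls,
    boy_ratio = boys / total,  # proportion of boys among all baptisms
    more_boys = boys > girls   # TRUE when boys outnumber girls that year
  )

arbuthnot_head
```

Running the same mutate on the full arbuthnot data frame and checking all(arbuthnot$more_boys) is one way to test Arbuthnot's claim for every year at once.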
present dataset
Answer the following questions with the present
dataset:
What years are included in this data set? What are the dimensions of the data frame? What are the variable names? How many observations are in your data?
# We already loaded the tidyverse library
# so we do not need to do this again.
# However, we do need to load the new data into the environment
# Insert code below
present <- read_csv("data/present.csv")
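# What years are included in the dataset?
present %>%
  summarize(min = min(year),
            max = max(year))
# Inspect the structure, summary statistics, dimensions, and variable names
glimpse(present)
summary(present)
dim(present)
names(present)
## Rows: 63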
## Columns: 3
## $ year <dbl> 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950…
## $ boys <dbl> 1211684, 1289734, 1444365, 1508959, 1435301, 1404587, 1691220, 1…
## $ girls <dbl> 1148715, 1223693, 1364631, 1427901, 1359499, 1330869, 1597452, 1…
## year boys girls
## Min. :1940 Min. :1211684 Min. :1148715
## 1st Qu.:1956 1st Qu.:1799857 1st Qu.:1711404
## Median :1971 Median :1924868 Median :1831679
## Mean :1971 Mean :1885600 Mean :1793915
## 3rd Qu.:1986 3rd Qu.:2058524 3rd Qu.:1965538
## Max. :2002 Max. :2186274 Max. :2082052
## [1] 63 3
## [1] "year" "boys" "girls"
Answer: The data cover the years 1940 to 2002. The data frame has 63 rows and 3 columns: the variables are year, boys, and girls, and there are 63 observations.
How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
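ggplot(data = present) +
  geom_line(aes(x = year, y = boys), color = 'red') +
  geom_line(aes(x = year, y = girls), color = 'green')
Answer: No. The counts in present are in the millions (boys range from 1,211,684 to 2,186,274 per year), while Arbuthnot's counts are in the thousands (boys range from 2,890 to 8,426 per year), so the present counts are more than a hundred times larger.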
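One rough way to put a number on "similar magnitude" is to divide the counts in one dataset by the counts in the other. The sketch below does this with the minimum and maximum yearly boys counts copied from the two summary outputs printed earlier. The variable names are made up for this illustration, and the comparison is only a back-of-the-envelope check, not part of the lab's required answer.

```r
# Minimum and maximum yearly boys counts, copied from the summary() output
# printed above for each dataset. All names here are hypothetical.
arbuthnot_boys_range <- c(min = 2890, max = 8426)
present_boys_range   <- c(min = 1211684, max = 2186274)

# How many times larger are the present counts?
magnitude_ratio <- present_boys_range / arbuthnot_boys_range
magnitude_ratio

# The same comparison expressed in powers of ten
log10(magnitude_ratio)
```

Both ratios are in the hundreds, which is why plotting the two datasets on a shared axis would flatten Arbuthnot's counts to almost nothing.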
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? What explains the differences you observe? Include a plot in your response.
# Insert code below
present <- present %>%
  mutate(total = boys + girls)
ggplot(data = present, aes(x = year, y = boys / total)) +
  geom_line()
Answer: The proportion of boys born in the U.S. declines slightly over this period, but it stays above 0.5 (about 0.513 in 1940, for example). Boys are still born in greater proportion than girls, so Arbuthnot's observation holds up.
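A reference line at 0.5 makes it easier to read the plot: any point above the line is a year in which more boys than girls were born. The sketch below adds such a line with geom_hline, using the first five years of present copied from the glimpse output so that it runs without the CSV file. It assumes dplyr and ggplot2 are available (both are loaded as part of the tidyverse). The names present_head, boy_prop, and prop_plot are hypothetical.

```r
library(dplyr)
library(ggplot2)

# First five rows only, copied from the glimpse() output above.
# `present_head` is a hypothetical name used for illustration.
present_head <- data.frame(
  year  = c(1940, 1941, 1942, 1943, 1944),
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499)
) %>%
  mutate(boy_prop = boys / (boys + girls))

# Proportion of boys over time, with a dashed line marking an even split
prop_plot <- ggplot(present_head, aes(x = year, y = boy_prop)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  labs(x = "Year", y = "Proportion of boys")

prop_plot
```

Because the proportions all sit close to 0.51, the reference line also stretches the y-axis, which keeps small year-to-year changes from looking larger than they are.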