Data 606 Lab 0

Heather Geiger

February 4, 2018

Arbuthnot data exploration

I tried to load the arbuthnot data as an R data set, but got an error message.

So, I will instead read in the data from a CSV file I found on the OpenIntro website.

download.file("https://www.openintro.org/stat/data/arbuthnot.csv",destfile="arbuthnot.csv")

arbuthnot <- read.csv("arbuthnot.csv",header=TRUE,stringsAsFactors=FALSE)

Exercise 1 - Extract the count of just girls baptized with this command:

arbuthnot$girls
##  [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910
## [15] 4617 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382
## [29] 3289 3013 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719
## [43] 6061 6120 5822 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127
## [57] 7246 7119 7214 7101 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626
## [71] 7452 7061 7514 7656 7683 5738 7779 7417 7687 7623 7380 7288

Exercise 2 - The number of girls baptized drops substantially after 1640 (from 5,332 that year to a low of 2,722 in 1650) and stays low until around 1660, when it starts increasing again.
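To back this description up with numbers rather than by eye, here is a minimal sketch that rebuilds the girls counts from the Exercise 1 output. The year vector is an assumption on my part (one row per consecutive year from 1629 to 1710), and the cutoff of 4000 for "low" is arbitrary.

```r
# Girls baptized per year, copied from the arbuthnot$girls output above.
girls <- c(
  4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784, 5332, 5200, 4910,
  4617, 3997, 3919, 3395, 3536, 3181, 2746, 2722, 2840, 2908, 2959, 3179, 3349, 3382,
  3289, 3013, 2781, 3247, 4107, 4803, 4881, 5681, 4858, 4319, 5322, 5560, 5829, 5719,
  6061, 6120, 5822, 5738, 5717, 5847, 6203, 6033, 6041, 6299, 6533, 6744, 7158, 7127,
  7246, 7119, 7214, 7101, 7167, 7302, 7392, 7316, 7483, 6647, 6713, 7229, 7767, 7626,
  7452, 7061, 7514, 7656, 7683, 5738, 7779, 7417, 7687, 7623, 7380, 7288
)

# Assumption: the 82 rows are the consecutive years 1629 to 1710.
years <- 1629:1710
stopifnot(length(girls) == length(years))

# Line plot of the trend over time.
plot(years, girls, type = "l", xlab = "Year", ylab = "Girls baptized")

# First and last year in which the count falls below the arbitrary cutoff.
low_years <- years[girls < 4000]
range(low_years)

# Year with the fewest girls baptized.
min_year <- years[which.min(girls)]
min_year
```

With the real data frame loaded, the same checks would use arbuthnot$year and arbuthnot$girls directly instead of the copied vectors.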

Exercise 3 - Make a plot of the proportion of boys over time with this command.

plot(arbuthnot$year,
arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls),
xlab="Year",
ylab="Proportion of baptisms that were boys",
type="l")

We find that the proportion of boys fluctuates over time, but is always above 0.50, so more boys than girls were baptized in every year.

On Your Own

First, we load the dataset “present”.

data(present,package='DATA606')

Now, we can explore this data to answer the questions.

  1. What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
dim(present)
## [1] 63  3
head(present)
##   year    boys   girls
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869
range(present$year)
## [1] 1940 2002

Like the arbuthnot data, this data frame also has 3 columns: “year”, “boys”, and “girls”.

There are 63 rows, corresponding to years 1940 to 2002.
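As a quick sanity check that no year is missing or repeated, the span of years should equal the number of rows; for the full data that is 2002 - 1940 + 1 = 63, which matches dim(present). The sketch below runs the same check on a small stand-in data frame, present_head (a name I made up), built from the six rows shown by head(present), so it runs without the DATA606 package.

```r
# First six rows of present, copied from the head(present) output above.
# present_head is a hypothetical stand-in for the full data frame.
present_head <- data.frame(
  year  = 1940:1945,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499, 1330869)
)

names(present_head)  # column names
nrow(present_head)   # number of rows

# If the years are consecutive with no repeats, the span equals the row count.
span <- max(present_head$year) - min(present_head$year) + 1
no_gaps <- span == nrow(present_head) && !anyDuplicated(present_head$year)
no_gaps
```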

  2. How do these counts compare to Arbuthnot’s? Are they on a similar scale?
range(arbuthnot$boys + arbuthnot$girls)
## [1]  5612 16145
range(present$boys + present$girls)
## [1] 2360399 4268326

The counts in present are much larger: total births (boys + girls) range from about 2.4 million to 4.3 million per year. In the arbuthnot data, the yearly totals are only in the thousands (5,612 to 16,145), so the two data sets are not on a similar scale.
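To put a rough number on the difference in scale, the sketch below divides the two ranges printed above. The values are copied from that output, and the variable names are my own.

```r
# Ranges of yearly totals (boys + girls), copied from the range() output above.
arbuthnot_range <- c(5612, 16145)
present_range   <- c(2360399, 4268326)

# How many times larger the U.S. totals are at the low and high ends.
scale_factor <- present_range / arbuthnot_range
round(scale_factor)

# The same comparison in orders of magnitude (powers of ten).
round(log10(scale_factor), 1)
```

Both ends of the range come out a few hundred times larger for the U.S. data.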

  3. Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.
#Let's put the plots side-by-side for Arbuthnot vs. United States so we can compare more clearly.
#Set ylim on the same scale for clearer comparison.

par(mfrow=c(1,2))

plot(arbuthnot$year,
arbuthnot$boys/arbuthnot$girls,
type="l",
xlab="Year",
ylab="Boy/girl birth ratio",
main="Arbuthnot birth sex ratio\n(1629-1710)",
ylim=range(c(arbuthnot$boys/arbuthnot$girls,present$boys/present$girls)))

plot(present$year,
present$boys/present$girls,
type="l",
xlab="Year",
ylab="Boy/girl birth ratio",
main="US birth sex ratio\n(1940-2002)",
ylim=range(c(arbuthnot$boys/arbuthnot$girls,present$boys/present$girls)))

In both the Arbuthnot and United States data, boys outnumber girls in every year, so Arbuthnot’s observation holds up in the U.S. However, the Arbuthnot ratios fluctuate much more from year to year, probably because the yearly counts are so much smaller.
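The claim that boys outnumber girls in every year can also be checked numerically instead of read off the plot. This is a minimal sketch on the six rows shown by head(present), stored in a stand-in data frame I call present_head; on the full data the same lines would use present and arbuthnot. It also shows that the boy/girl ratio used here and the proportion of boys used in Exercise 3 carry the same information.

```r
# First six rows of present, copied from the head(present) output above.
# present_head is a hypothetical stand-in for the full data frame.
present_head <- data.frame(
  year  = 1940:1945,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499, 1330869)
)

ratio      <- present_head$boys / present_head$girls
proportion <- present_head$boys / (present_head$boys + present_head$girls)

# The two measures are linked by proportion = ratio / (1 + ratio).
same_info <- all.equal(proportion, ratio / (1 + ratio))
same_info

# Boys outnumber girls exactly when the ratio is above 1 (proportion above 0.5).
all(ratio > 1)

# Standard deviation as one rough measure of year-to-year fluctuation.
sd(ratio)
```

Computing sd() of the ratio for each full data set would be one way to quantify the difference in fluctuation between the two panels.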

  4. In what year did we see the most total number of births in the U.S.?
present$year[which((present$boys + present$girls) == max((present$boys + present$girls)))]
## [1] 1961

The U.S. had its highest total number of births in 1961.
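A shorter way to get the same kind of answer is which.max(), which returns the index of the first maximum. The sketch below uses only the six rows shown by head(present), stored in a stand-in data frame I call present_head, so its result differs from 1961 because that subset stops at 1945.

```r
# First six rows of present, copied from the head(present) output above.
# present_head is a hypothetical stand-in for the full data frame.
present_head <- data.frame(
  year  = 1940:1945,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499, 1330869)
)

total <- present_head$boys + present_head$girls

# Year with the largest total within this six-row subset.
peak_year <- present_head$year[which.max(total)]
peak_year

# All years in the subset ranked by total births, largest first.
present_head$year[order(total, decreasing = TRUE)]
```

Unlike the == max() comparison, which.max() returns a single year even if two years tie for the maximum.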