Description

Homework 1 uses the “arbuthnot” and “present” data sets to investigate whether more boys or more girls were christened in each of the 82 years covered by Arbuthnot’s records. We also compare the results across two periods: 1629 to 1710 and 1940 to 2002.

The arbuthnot data set is named after Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the christening records for children born in London every year from 1629 to 1710. The “present” data set is an updated counterpart to the historical Arbuthnot data: it records the numbers of boys and girls born in the United States between 1940 and 2002.

Loading Packages

  • The tidyverse “umbrella” package, which houses a suite of R packages for data wrangling and data visualization.

  • The openintro R package, which provides data and custom functions that accompany the OpenIntro resources.

library(tidyverse)
library(openintro)

Dr. Arbuthnot’s Baptism Records

To get started, let’s take a peek at the data.

  • Without output
arbuthnot
  • With output

Eyeballing the output, we can see that the data set (a data frame, in R terms) contains three columns and eighty-two rows (82 x 3).

arbuthnot
## # A tibble: 82 x 3
##     year  boys girls
##    <int> <int> <int>
##  1  1629  5218  4683
##  2  1630  4858  4457
##  3  1631  4422  4102
##  4  1632  4994  4590
##  5  1633  5158  4839
##  6  1634  5035  4820
##  7  1635  5106  4928
##  8  1636  4917  4605
##  9  1637  4703  4457
## 10  1638  5359  4952
## # ... with 72 more rows

We can also check the dimensions of this data frame as well as the names of the variables, type of variables and the first few observations by inserting the name of the data set into the glimpse() function, as seen below:

glimpse(arbuthnot)
## Rows: 82
## Columns: 3
## $ year  <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639~
## $ boys  <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366~
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784~

Some Exploration

The following command extracts the boys column from the arbuthnot data frame:

arbuthnot$boys

Exercise-1: What command would we use to extract just the counts of girls baptized? Try it out in the console!

Answer-1:

arbuthnot$girls
##  [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910 4617
## [16] 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382 3289 3013
## [31] 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719 6061 6120 5822
## [46] 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127 7246 7119 7214 7101
## [61] 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626 7452 7061 7514 7656 7683
## [76] 5738 7779 7417 7687 7623 7380 7288
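
As a hedged aside, the same column can also be pulled out with base R's double-bracket operator or with dplyr's pull(). The sketch below is only an illustration: it builds a small data frame (the name arbuthnot_head is an assumption, not part of the lab) from the first three rows printed earlier, so it runs without the openintro package.

```r
library(dplyr)

# First three rows of arbuthnot, copied from the printed output above.
# The name arbuthnot_head is an illustrative assumption.
arbuthnot_head <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  boys  = c(5218L, 4858L, 4422L),
  girls = c(4683L, 4457L, 4102L)
)

girls_dollar  <- arbuthnot_head$girls            # the $ operator used above
girls_bracket <- arbuthnot_head[["girls"]]       # base R double-bracket form
girls_pull    <- arbuthnot_head %>% pull(girls)  # dplyr equivalent

# Check that all three approaches return the same integer vector
identical(girls_dollar, girls_bracket) && identical(girls_dollar, girls_pull)
```

The pull() form is convenient at the end of a pipeline; the $ form is shorter for interactive use.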

Data visualization

R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the following code:

By using geom_point()

ggplot(data=arbuthnot, aes(x=year, y=girls))+
  geom_point()

Change the style to line graph with geom_line()

ggplot(data=arbuthnot, aes(x=year, y=girls))+
  geom_line()

Exercise-2: Is there an apparent trend in the number of girls baptized over the years? How would we describe it? (To ensure that our lab report is comprehensive, be sure to include the code needed to make the plot as well as our written interpretation.)

Answer-2:

The two plots (point and line) show an overall increase in the number of girls baptized over the study period from 1629 to 1710. Nonetheless, the number of girls baptized fell sharply between about 1649 and 1659 and dipped again briefly in the early 1700s. One could attribute these dips to the demographic transition and to “Queen Anne’s War”, respectively.

[Note: this causal interpretation is speculative; it is not derived from the data.]

R as a big calculator

Example: the four basic arithmetic operations

5218+4683 # Addition
## [1] 9901
5218-4683 # Subtraction
## [1] 535
5218*4683 # Multiplication
## [1] 24435894
5218/4683 # Division
## [1] 1.114243

If we add the vector for baptisms for boys to that of girls, R can compute each of these sums simultaneously.

arbuthnot$boys + arbuthnot$girls
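
To make the element-wise behaviour concrete, here is a minimal sketch using only the first three years of counts from the output shown earlier. The vector names boys, girls, and totals are illustrative assumptions.

```r
# First three yearly counts (1629-1631), copied from the arbuthnot output above
boys  <- c(5218, 4858, 4422)
girls <- c(4683, 4457, 4102)

# `+` works element by element: the first boys value is added to the
# first girls value, the second to the second, and so on
totals <- boys + girls
totals
```

The first element is 5218 + 4683 = 9901, the same sum computed by hand in the calculator example.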

Adding a new variable to the data frame

We are interested in using this new vector of the total number of baptisms to generate some plots, so we’ll want to save it as a permanent column in our data frame. We can do this using the following code:

arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)

We can check whether the new variable total has been added to the data frame or not by using the following code:

glimpse(arbuthnot)
## Rows: 82
## Columns: 4
## $ year  <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639~
## $ boys  <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366~
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784~
## $ total <int> 9901, 9315, 8524, 9584, 9997, 9855, 10034, 9522, 9160, 10311, 10~

We can make a line plot of the total number of baptisms per year with the following code:

ggplot(data = arbuthnot, aes(x = year, y = total)) + 
  geom_line()

In a similar fashion, once we know the number of baptisms for boys and for girls in 1629, we can compute the ratio of the number of boys to the number of girls baptized with the following code:

5218 / 4683
## [1] 1.114243

Alternatively, we could calculate this ratio for every year by acting on the complete boys and girls columns, and then save those calculations into a new variable named boy_to_girl_ratio:

arbuthnot <- arbuthnot %>%
  mutate(boy_to_girl_ratio = boys / girls)

We can also compute the proportion of newborns that are boys in 1629 with the following code:

5218 / (5218 + 4683)
## [1] 0.5270175

Or we can compute this for all years simultaneously and add it as a new variable named boy_ratio to the dataset:

arbuthnot <- arbuthnot %>%
  mutate(boy_ratio = boys / total)
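
The ratio and the proportion carry the same information: if r = boys / girls, then boys / (boys + girls) = r / (1 + r), so the ratio exceeds 1 exactly when the proportion exceeds 0.5. The following sketch checks this for the 1629 counts; the variable names are illustrative assumptions.

```r
# 1629 counts from the arbuthnot output above
boys_1629  <- 5218
girls_1629 <- 4683

ratio      <- boys_1629 / girls_1629                 # boys per girl
proportion <- boys_1629 / (boys_1629 + girls_1629)   # share of boys

# Recover the proportion from the ratio: r / (1 + r)
proportion_from_ratio <- ratio / (1 + ratio)

# all.equal() compares with a small numerical tolerance
all.equal(proportion, proportion_from_ratio)
```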

Exercise-3: Now, generate a plot of the proportion of boys born over time. What do we see?

Answer-3:

To compare the number of boys with the proportion of boys born over time, I created two separate plots using the following code:

library(ggpubr)

a<-ggplot(data = arbuthnot, aes(x = year, y = boys)) + 
  geom_line()

b<-ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) + 
  geom_line(color="blue")

ggarrange(a, b, 
          labels = c("A", "B"),
          ncol = 2, nrow = 1)

In Figure A, the number of boys is increasing over time, while in Figure B, the proportion of boys is decreasing over time.

In addition to simple mathematical operators like subtraction and division, we can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can create a new variable called more_boys that tells us whether the number of births of boys outnumbered that of girls in each year with the following code:

arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)
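
Because TRUE counts as 1 when summed, a logical column like more_boys can be summarized directly. The sketch below is a hedged illustration on the first five rows of the printed data only (the names arbuthnot_head and more_boys_summary are assumptions); applying the same summarize() call to the full data frame would cover all 82 years.

```r
library(dplyr)

# First five rows of arbuthnot, copied from the printed output above
arbuthnot_head <- data.frame(
  year  = 1629:1633,
  boys  = c(5218L, 4858L, 4422L, 4994L, 5158L),
  girls = c(4683L, 4457L, 4102L, 4590L, 4839L)
)

more_boys_summary <- arbuthnot_head %>%
  mutate(more_boys = boys > girls) %>%
  summarize(
    years_with_more_boys = sum(more_boys),  # TRUE is counted as 1
    n_years              = n(),             # number of rows (years)
    every_year           = all(more_boys)   # TRUE only if no year is FALSE
  )

more_boys_summary
```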

More Practice

To find the minimum and maximum values of columns, we can use the functions min() and max() within a summarize() call, which we will learn more about in the following lab.

Here’s an example of how to find the minimum and maximum number of boy births in a year:

arbuthnot %>%
  summarize(min = min(boys),
            max = max(boys)
            )
## # A tibble: 1 x 2
##     min   max
##   <int> <int>
## 1  2890  8426
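
The summary above gives the extreme values but not the years in which they occurred. One possible way to recover the years is to filter on min() and max(), sketched here on the first ten rows of the printed data only. The names arbuthnot_head and extreme_years are assumptions, and the extremes in this subset differ from the full-data values of 2890 and 8426.

```r
library(dplyr)

# First ten rows of arbuthnot (year and boys), copied from the output above
arbuthnot_head <- data.frame(
  year = 1629:1638,
  boys = c(5218L, 4858L, 4422L, 4994L, 5158L,
           5035L, 5106L, 4917L, 4703L, 5359L)
)

# Keep only the rows where boys equals the smallest or the largest value
extreme_years <- arbuthnot_head %>%
  filter(boys == min(boys) | boys == max(boys))

extreme_years
```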

Let’s take a peek at the present data set for the rest of the exercises.

  • Without output
present

Exercise-4: What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?

Answer-4:

  • We can find the first year, the last year, and the number of years with the following code:
range(present$year) 
## [1] 1940 2002
length(present$year)
## [1] 63

In this data set, there are sixty-three years, spanning 1940 to 2002.

  • We can calculate the dimensions of the current data frame by the following code:

    dim(present)
    ## [1] 63  3

This data set contains 63 rows and 3 columns.

  • We can check the names of columns of the data set by the following code:

    colnames(present)
    ## [1] "year"  "boys"  "girls"

The three columns are named year, boys, and girls.

Exercise-5: How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?

Answer-5:

Comparing the present birth counts from the USA, which cover 1940 to 2002, with Arbuthnot’s London baptism counts, which cover 1629 to 1710, the scale is quite different. The present counts are in the millions per year, whereas Arbuthnot’s are in the thousands, because the present data cover a whole country in the modern era rather than just London in the 17th century.

Exercise-6: Make a plot that displays the proportion of boys born over time. What do we see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in our response. Hint: we should be able to reuse our code from Exercise 3 above, just replace the name of the data frame.

Answer-6:

The following code makes a plot of the proportion of boys born over time:

present <- present %>%
  mutate(total = boys + girls)

present <- present %>%
  mutate(boy_pro = boys / total)

ggplot(data = present, aes(x = year, y = boy_pro)) + 
  geom_line(color="red")

The proportion of boys declined in the sixties, increased in the late seventies, and then declined again in the early nineties. Arbuthnot’s observation about boys being born in greater proportion than girls does hold up in the U.S.
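
As a side note on the code in Answer-6, the two mutate() calls could be merged, since a column created inside mutate() can be used by later expressions in the same call. The sketch below is a hedged illustration using three rows of U.S. counts that appear in the sorted output for Exercise 7; the name present_top is an assumption.

```r
library(dplyr)

# Three rows of the present data (1961, 1960, 1957), copied from the
# sorted output shown for Exercise 7
present_top <- data.frame(
  year  = c(1961, 1960, 1957),
  boys  = c(2186274, 2179708, 2179960),
  girls = c(2082052, 2078142, 2074824)
)

present_top <- present_top %>%
  mutate(
    total   = boys + girls,   # created first ...
    boy_pro = boys / total    # ... and reused in the same mutate() call
  )

# Does the proportion of boys exceed one half in every row?
all(present_top$boy_pro > 0.5)
```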

Exercise-7: In what year did we see the largest total number of births in the U.S.? Hint: First calculate the totals and save them as a new variable. Then, sort our dataset in descending order based on the total column. We can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted result in our report we will need to use two new functions. First we use arrange() to sort the data by a variable. Then we wrap that variable in another function, desc(), to sort in descending order. The sample code is provided below.

Answer-7:

  • Calculate the totals, save them as a new variable, and sort the data set in descending order by the ‘total’ column.
present <- present %>%
  mutate(total = boys + girls)

present %>%
  arrange(desc(total))
## # A tibble: 63 x 5
##     year    boys   girls   total boy_pro
##    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  1961 2186274 2082052 4268326   0.512
##  2  1960 2179708 2078142 4257850   0.512
##  3  1957 2179960 2074824 4254784   0.512
##  4  1959 2173638 2071158 4244796   0.512
##  5  1958 2152546 2051266 4203812   0.512
##  6  1962 2132466 2034896 4167362   0.512
##  7  1956 2133588 2029502 4163090   0.513
##  8  1990 2129495 2028717 4158212   0.512
##  9  1991 2101518 2009389 4110907   0.511
## 10  1963 2101632 1996388 4098020   0.513
## # ... with 53 more rows

The year 1961 had the most total births with 4,268,326.
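
If only the top row is needed, one alternative to sorting the whole table is dplyr's slice_max(), which is available in dplyr 1.0.0 and later. This is a minimal sketch, assuming a small data frame (the names present_top and peak_year are illustrative) built from three rows of the sorted output above.

```r
library(dplyr)

# Three rows (year and total) copied from the sorted output above,
# deliberately listed out of order
present_top <- data.frame(
  year  = c(1957, 1960, 1961),
  total = c(4254784, 4257850, 4268326)
)

# Keep the single row with the largest total
peak_year <- present_top %>%
  slice_max(total, n = 1)

peak_year
```

On the full present data frame the same call would return the 1961 row, matching the result of arrange(desc(total)).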

Summary

The data for these 82 years showed that every year there were more male than female christenings. We found an increasing trend in the number of girls baptized in Dr. Arbuthnot’s baptism records. The counts are much larger in the present data set than in arbuthnot because they cover a whole country in the present day rather than just London in the 17th century.