Homework 1 uses the “arbuthnot” and “present” data sets to investigate whether more boys or more girls were christened in London over the 82 years from 1629 to 1710. We also compare the results across the two periods covered by the data: 1629 to 1710 and 1940 to 2002.
The arbuthnot data set is named after Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the christening records for children born in London every year from 1629 to 1710. The “present” data set is an updated counterpart to the historical Arbuthnot data: it records the numbers of boys and girls born in the United States between 1940 and 2002.
The tidyverse “umbrella” package houses a suite of many different R packages for data wrangling and data visualization.
The openintro R package provides data and custom functions that accompany the OpenIntro resources.
library(tidyverse)
library(openintro)
To get started, let’s take a peek at the data.
arbuthnot
Eyeballing the output, the data set (a data frame, in R terminology) contains 82 rows and 3 columns (82 x 3).
arbuthnot
## # A tibble: 82 x 3
## year boys girls
## <int> <int> <int>
## 1 1629 5218 4683
## 2 1630 4858 4457
## 3 1631 4422 4102
## 4 1632 4994 4590
## 5 1633 5158 4839
## 6 1634 5035 4820
## 7 1635 5106 4928
## 8 1636 4917 4605
## 9 1637 4703 4457
## 10 1638 5359 4952
## # ... with 72 more rows
We can also check the dimensions of this data frame as well as the
names of the variables, type of variables and the first few observations
by inserting the name of the data set into the glimpse()
function, as seen below:
glimpse(arbuthnot)
## Rows: 82
## Columns: 3
## $ year <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639~
## $ boys <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366~
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784~
The following command extracts the boys column from the arbuthnot data frame:
arbuthnot$boys
Exercise-1: What command would we use to extract just the counts of girls baptized? Try it out in the console!
Answer-1:
arbuthnot$girls
## [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910 4617
## [16] 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382 3289 3013
## [31] 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719 6061 6120 5822
## [46] 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127 7246 7119 7214 7101
## [61] 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626 7452 7061 7514 7656 7683
## [76] 5738 7779 7417 7687 7623 7380 7288
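As a side note, `$` is not the only way to pull a column out of a data frame. The sketch below uses a hypothetical three-row data frame, `arb_sample`, typed in by hand from the first rows printed earlier (it is not the full data set), and shows that base R's `[[ ]]` returns the same vector as `$`.

```r
# Hypothetical three-row sample copied from the first rows of arbuthnot
arb_sample <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  boys  = c(5218L, 4858L, 4422L),
  girls = c(4683L, 4457L, 4102L)
)

girls_dollar  <- arb_sample$girls        # extract the column by name with $
girls_bracket <- arb_sample[["girls"]]   # same column, name given as a string

# TRUE when both approaches return exactly the same vector
identical(girls_dollar, girls_bracket)
```

The `[[ ]]` form can be convenient when the column name is stored in a variable rather than typed directly.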
R has some powerful functions for making graphics. We
can create a simple plot of the number of girls baptized per year with
the following code:
By using geom_point()
ggplot(data=arbuthnot, aes(x=year, y=girls))+
geom_point()
Change the style to line graph with geom_line()
ggplot(data=arbuthnot, aes(x=year, y=girls))+
geom_line()
Exercise-2: Is there an apparent trend in the number of girls baptized over the years? How would we describe it? (To ensure that our lab report is comprehensive, be sure to include the code needed to make the plot as well as our written interpretation.)
Answer-2:
Both plots (points and line) show an overall increase in the number of girls baptized over the study period, 1629 to 1710. The increase is not steady, however: the counts fall sharply during the 1640s, stay low through the 1650s, and recover in the early 1660s, and there is a one-year dip in 1704. The data themselves do not explain these drops. Possible causes such as a demographic transition or “Queen Anne’s War” are only guesses on my part, not established explanations.
Example: The four basic mathematical operations
5218+4683 # Addition
## [1] 9901
5218-4683 # Subtraction
## [1] 535
5218*4683 # Multiplication
## [1] 24435894
5218/4683 # Division
## [1] 1.114243
If we add the vector for baptisms for boys to that of girls, R can compute each of these sums simultaneously.
arbuthnot$boys + arbuthnot$girls
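To make the element-wise behaviour concrete, here is a minimal sketch with two hypothetical vectors, `boys_sample` and `girls_sample`, holding only the first three years of counts shown earlier. Adding them pairs up the values position by position.

```r
# Hypothetical samples: counts for 1629, 1630 and 1631 only
boys_sample  <- c(5218L, 4858L, 4422L)
girls_sample <- c(4683L, 4457L, 4102L)

# One expression adds the first elements, the second elements, and so on
total_sample <- boys_sample + girls_sample
total_sample
## [1] 9901 9315 8524
```

These three sums match the first three values of the total column that appears later in the glimpse() output.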
We are interested in using this new vector of the total number of baptisms to generate some plots, so we’ll want to save it as a permanent column in our data frame. We can do this using the following code:
arbuthnot <- arbuthnot %>%
mutate(total = boys + girls)
We can check whether the new variable total has been
added to the data frame or not by using the following code:
glimpse(arbuthnot)
## Rows: 82
## Columns: 4
## $ year <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639~
## $ boys <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366~
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784~
## $ total <int> 9901, 9315, 8524, 9584, 9997, 9855, 10034, 9522, 9160, 10311, 10~
We can make a line plot of the total number of baptisms per year with the following code:
ggplot(data = arbuthnot, aes(x = year, y = total)) +
geom_line()
In a similar fashion, once we know the number of baptisms for boys and for girls in 1629, we can compute the ratio of the number of boys to the number of girls baptized with the following code:
5218 / 4683
## [1] 1.114243
Alternatively, we could calculate this ratio for every year by acting
on the complete boys and girls columns, and then save those calculations
into a new variable named boy_to_girl_ratio:
arbuthnot <- arbuthnot %>%
mutate(boy_to_girl_ratio = boys / girls)
We can also compute the proportion of newborns that are boys in 1629 with the following code:
5218 / (5218 + 4683)
## [1] 0.5270175
Or we can compute this for all years simultaneously and add it as a
new variable named boy_ratio to the dataset:
arbuthnot <- arbuthnot %>%
mutate(boy_ratio = boys / total)
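The boy-to-girl ratio and the proportion of boys carry the same information: if r = boys / girls, then boys / (boys + girls) = r / (1 + r). The sketch below checks this identity for the 1629 counts; the variable names are hypothetical and chosen only for illustration.

```r
# 1629 counts taken from the first row of arbuthnot
boys_1629  <- 5218
girls_1629 <- 4683

ratio      <- boys_1629 / girls_1629                 # boys per girl
proportion <- boys_1629 / (boys_1629 + girls_1629)   # share of boys among all baptisms

# The proportion can be recovered from the ratio alone
all.equal(proportion, ratio / (1 + ratio))
```

A ratio above 1 therefore always corresponds to a proportion above 0.5, so either variable can be used to judge whether boys outnumber girls.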
Exercise-3: Now, generate a plot of the proportion of boys born over time. What do we see?
Answer-3:
To compare the number of boys with the proportion of boys born over time, I created two separate plots using the following code:
library(ggpubr)
a<-ggplot(data = arbuthnot, aes(x = year, y = boys)) +
geom_line()
b<-ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) +
geom_line(color="blue")
ggarrange(a, b,
labels = c("A", "B"),
ncol = 2, nrow = 1)
In Figure A, the number of boys is increasing over time, while in Figure B, the proportion of boys is decreasing over time.
In addition to simple mathematical operators like subtraction and
division, we can ask R to make comparisons like greater than,
>, less than, <, and equality,
==. For example, we can create a new variable called
more_boys that tells us whether the number of births of
boys outnumbered that of girls in each year with the following code:
arbuthnot <- arbuthnot %>%
mutate(more_boys = boys > girls)
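Because R treats TRUE as 1 and FALSE as 0, a logical vector like this can be summarized directly. The following sketch uses hypothetical vectors containing only the first ten years of counts printed earlier, so its results describe that sample and not the full 82 years.

```r
# Hypothetical samples: counts for 1629 to 1638, copied from the printed output
boys_sample  <- c(5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359)
girls_sample <- c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952)

# Element-wise comparison gives one TRUE/FALSE per year
more_boys_sample <- boys_sample > girls_sample

n_years_more_boys <- sum(more_boys_sample)   # number of years with more boys
every_year        <- all(more_boys_sample)   # TRUE only if boys lead in every year
```

The same `sum()` and `all()` calls could be applied to the more_boys column of the full data frame.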
To find the minimum and maximum values of columns, we can use the
functions min() and max() within a
summarize() call, which we will learn more about in the
following lab.
Here’s an example of how to find the minimum and maximum number of boys baptized in a year:
arbuthnot %>%
summarize(min = min(boys),
max = max(boys)
)
## # A tibble: 1 x 2
## min max
## <int> <int>
## 1 2890 8426
Let’s take a peek at the present data set, which we use for the rest of
the exercises.
present
Exercise-4: What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
Answer-4:
range(present$year)
## [1] 1940 2002
length(present$year)
## [1] 63
In this data set, there are sixty-three years, spanning 1940 to 2002.
We can calculate the dimensions of the current data frame by the following code:
dim(present)
## [1] 63 3
This data set contains 63 rows and 3 columns.
We can check the names of columns of the data set by the following code:
colnames(present)
## [1] "year" "boys" "girls"
The three column names are year, boys, and girls, respectively.
Exercise-5: How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
Answer-5:
The scale is quite different. The present data record births in the United States from 1940 to 2002, whereas Arbuthnot’s data record christenings in London from 1629 to 1710. The present counts are much larger, in the millions per year rather than the thousands, because they cover a whole country in the modern era rather than just London in the 17th century.
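A rough way to put a number on the difference in magnitude is to divide one total by the other. The sketch below compares two figures that appear in this report's outputs: the London total for 1629 and the largest U.S. yearly total (1961). The choice of these two years is arbitrary and only meant as an illustration, and the variable names are hypothetical.

```r
# Totals copied from outputs shown in this report, not recomputed from the data
london_total_1629 <- 5218 + 4683   # arbuthnot, first year (boys + girls)
us_total_1961     <- 4268326       # present, year with the most births

# How many times larger the U.S. total is than the London total
scale_factor <- us_total_1961 / london_total_1629
round(scale_factor)
```

The U.S. figure is on the order of several hundred times the London one, which is why the two data sets cannot sensibly share a single y-axis for raw counts.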
Exercise-6: Make a plot that displays the proportion of boys born over time. What do we see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in our response. Hint: we should be able to reuse our code from Exercise 3 above, just replace the name of the data frame.
Answer-6:
The following code plots the proportion of boys born over time:
present <- present %>%
mutate(total = boys + girls)
present <- present %>%
mutate(boy_pro = boys / total)
ggplot(data = present, aes(x = year, y = boy_pro)) +
geom_line(color="red")
The proportion of boys declined in the 1960s, increased in the late 1970s, and declined again in the early 1990s. And yes, Arbuthnot’s observation that boys are born in greater proportion than girls holds up in the U.S. as well.
Exercise-7: In what year did we see the largest total number of births in the U.S.? Hint: First calculate the totals and save them as a new variable. Then, sort our dataset in descending order based on the total column. We can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted result in our report we will need two new functions: arrange() sorts the data by a variable, and desc(), wrapped around that variable, makes the order descending. The sample code is provided below.
Answer-7:
present <- present %>%
mutate(total = boys + girls)
present %>%
arrange(desc(total))
## # A tibble: 63 x 5
## year boys girls total boy_pro
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1961 2186274 2082052 4268326 0.512
## 2 1960 2179708 2078142 4257850 0.512
## 3 1957 2179960 2074824 4254784 0.512
## 4 1959 2173638 2071158 4244796 0.512
## 5 1958 2152546 2051266 4203812 0.512
## 6 1962 2132466 2034896 4167362 0.512
## 7 1956 2133588 2029502 4163090 0.513
## 8 1990 2129495 2028717 4158212 0.512
## 9 1991 2101518 2009389 4110907 0.511
## 10 1963 2101632 1996388 4098020 0.513
## # ... with 53 more rows
The year 1961 had the most total births with 4,268,326.
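Sorting the whole table is one way to find the peak year. As an alternative sketch, base R's `which.max()` returns the position of the largest value, which can then be used to pick out the matching row. The data frame `present_sample` below is hypothetical: it holds just three rows copied from the sorted output above, deliberately out of order.

```r
# Hypothetical sample: three rows copied from the sorted output, shuffled
present_sample <- data.frame(
  year  = c(1957, 1961, 1990),
  total = c(4254784, 4268326, 4158212)
)

# which.max() gives the row position of the largest total
peak_row <- present_sample[which.max(present_sample$total), ]
peak_row$year
```

On the full present data frame (with the total column added), the same pattern would be expected to return the row for 1961, in agreement with the sorted output.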
In Arbuthnot’s 82 years of christening records, boys outnumbered girls
in every year. The number of girls baptized shows an overall increasing
trend across those records. The counts in the present data set are much
larger than those in arbuthnot because they cover a whole country in the
modern era rather than just London in the 17th century.