library(tidyverse)
library(openintro)
library(RCurl)
arbuthnot$girls
## [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910 4617
## [16] 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382 3289 3013
## [31] 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719 6061 6120 5822
## [46] 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127 7246 7119 7214 7101
## [61] 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626 7452 7061 7514 7656 7683
## [76] 5738 7779 7417 7687 7623 7380 7288
glimpse(arbuthnot)
## Rows: 82
## Columns: 3
## $ year <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639…
## $ boys <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366…
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784…
min(arbuthnot$year)
## [1] 1629
max(arbuthnot$year)
## [1] 1710
arbuthnot
## # A tibble: 82 × 3
## year boys girls
## <int> <int> <int>
## 1 1629 5218 4683
## 2 1630 4858 4457
## 3 1631 4422 4102
## 4 1632 4994 4590
## 5 1633 5158 4839
## 6 1634 5035 4820
## 7 1635 5106 4928
## 8 1636 4917 4605
## 9 1637 4703 4457
## 10 1638 5359 4952
## # … with 72 more rows
Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)
# Insert code for Exercise 2 here
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
geom_line()
According to the graph, the number of girls baptized has trended upward
since around 1658. Between 1640 and 1660, there was a precipitous
decline in girls being baptized, from just under 5,500 in 1640 to under
3,000 by around 1658. Over the following several years there was a rapid
increase, with close to 6,000 girls baptized around 1664 before a small
but sharp decline about 2 years later. From that point forward, there
was a steady increase in girls baptized into the early 1700s, reaching
over 7,500 before a drastic drop to less than 6,000 around 1705,
followed by an equally drastic rebound to over 7,500.
Now, generate a plot of the proportion of boys born over time. What do you see?
# Insert code for Exercise 3 here
ggplot(data = arbuthnot, aes(x = year, y = boys)) +
geom_line()
The graph of boys being born follows a trajectory similar to that of
girls being baptized. From 1640 to 1660 there was a sharp decline in
boys being born, hitting a low of less than 3,000 in 1650 before a rapid
increase to 6,000 around the year 1665. There is a decline in boys being
born around 1705, from around 8,000 to close to 6,000. This mirrors the
decline that occurred with girls at almost the same time.
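The plot above shows the raw count of boys, while the exercise asks for the proportion of boys. A minimal sketch of how that proportion could be computed and plotted follows. The column name `boy_ratio` and the object `arbuthnot_head` are hypothetical names, not part of the original report, and the data are only the first ten rows of `arbuthnot` copied from the printed output above so that the sketch runs on its own; with the openintro package loaded, the full `arbuthnot` data frame could be used instead.

```r
library(dplyr)
library(ggplot2)

# Excerpt: first ten rows of arbuthnot, copied from the printed tibble
arbuthnot_head <- tibble(
  year  = 1629:1638,
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952)
)

# boy_ratio (hypothetical name): boys as a share of all baptisms that year
arbuthnot_head <- arbuthnot_head %>%
  mutate(total = boys + girls,
         boy_ratio = boys / total)

# Same line plot as before, with the proportion on the y axis;
# the plot is stored in p, call print(p) to draw it
p <- ggplot(data = arbuthnot_head, aes(x = year, y = boy_ratio)) +
  geom_line()
```

A proportion above 0.5 in a given year means more boys than girls were baptized that year, which is the pattern Arbuthnot pointed to.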
What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
# Insert code for Exercise 4 here
data('present', package='openintro')
df <- present
dim(df)
## [1] 63 3
glimpse(df)
## Rows: 63
## Columns: 3
## $ year <dbl> 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950…
## $ boys <dbl> 1211684, 1289734, 1444365, 1508959, 1435301, 1404587, 1691220, 1…
## $ girls <dbl> 1148715, 1223693, 1364631, 1427901, 1359499, 1330869, 1597452, 1…
distinct(df, year)
## # A tibble: 63 × 1
## year
## <dbl>
## 1 1940
## 2 1941
## 3 1942
## 4 1943
## 5 1944
## 6 1945
## 7 1946
## 8 1947
## 9 1948
## 10 1949
## # … with 53 more rows
min(df$year)
## [1] 1940
max(df$year)
## [1] 2002
colnames(df)
## [1] "year" "boys" "girls"
df
## # A tibble: 63 × 3
## year boys girls
## <dbl> <dbl> <dbl>
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869
## 7 1946 1691220 1597452
## 8 1947 1899876 1800064
## 9 1948 1813852 1721216
## 10 1949 1826352 1733177
## # … with 53 more rows
The years in the dataset run from 1940 to 2002. The data frame has 63 rows and 3 columns. The column names are “year”, “boys” and “girls”.
How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
df$boys
## [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876 1813852
## [10] 1826352 1823555 1923020 1971262 2001798 2059068 2073719 2133588 2179960
## [19] 2152546 2173638 2179708 2186274 2132466 2101632 2060162 1927054 1845862
## [28] 1803388 1796326 1846572 1915378 1822910 1669927 1608326 1622114 1613135
## [37] 1624436 1705916 1709394 1791267 1852616 1860272 1885676 1865553 1879490
## [46] 1927983 1924868 1951153 2002424 2069490 2129495 2101518 2082097 2048861
## [55] 2022589 1996355 1990480 1985596 2016205 2026854 2076969 2057922 2057979
dim(arbuthnot)
## [1] 82 3
colnames(arbuthnot)
## [1] "year" "boys" "girls"
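The arbuthnot dataset has 82 rows and 3 columns, 19 more rows than the present dataset. The present dataset covers 1940 to 2002, while the arbuthnot dataset covers 1629 to 1710. The column names are the same in both, although the columns are stored as integers in arbuthnot and as doubles in present. The counts are not of a similar magnitude: present records yearly counts in the millions (1,211,684 boys in 1940), whereas arbuthnot records yearly counts in the thousands (5,218 boys in 1629).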
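The difference in scale between the two datasets can be checked directly. The following is a minimal sketch that uses only the first ten boys counts of each dataset, copied from the printed output above so that it is self-contained; the names `arb_boys`, `pres_boys` and `scale_factor` are hypothetical.

```r
# First ten boys counts from each dataset, copied from the printed output
arb_boys  <- c(5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359)
pres_boys <- c(1211684, 1289734, 1444365, 1508959, 1435301,
               1404587, 1691220, 1899876, 1813852, 1826352)

# Smallest and largest count in each excerpt
range(arb_boys)
range(pres_boys)

# Rough scale factor between the two excerpts (ratio of mean counts)
scale_factor <- mean(pres_boys) / mean(arb_boys)
scale_factor
```

With the full data frames loaded, `range(arbuthnot$boys)` and `range(df$boys)` would give the same comparison over all years.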
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from Exercise 3 above, just replace the dataframe name.
ggplot(data = df, aes(x = year, y = boys)) +
geom_line()
Arbuthnot’s observation that boys are born in greater proportion than girls also holds up in the U.S. data. While the patterns of boys and girls born are almost the same in each dataset, boys are born at a slightly higher rate than girls in both.
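Because the plot above shows counts of boys, not proportions, the claim can also be checked numerically. The sketch below is one possible way to do that, not the report's own method: it counts the years in which boys outnumber girls and finds the smallest proportion of boys. The names `df_head`, `boy_ratio` and `check` are hypothetical, and the data are only the first ten rows of `present` copied from the printed output above.

```r
library(dplyr)

# Excerpt: first ten rows of present (stored as df in the report)
df_head <- tibble(
  year  = 1940:1949,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301,
            1404587, 1691220, 1899876, 1813852, 1826352),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499,
            1330869, 1597452, 1800064, 1721216, 1733177)
)

# boy_ratio (hypothetical name): proportion of births that were boys
check <- df_head %>%
  mutate(boy_ratio = boys / (boys + girls)) %>%
  summarise(years = n(),
            years_more_boys = sum(boys > girls),
            min_boy_ratio = min(boy_ratio))
check
```

If `years_more_boys` equals `years` and `min_boy_ratio` stays above 0.5, boys were born in greater proportion in every year examined. Using `y = boy_ratio` in the ggplot call would show the same thing graphically.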
In what year did we see the greatest total number of births in the U.S.? Hint: First calculate the totals and save them as a new variable. Then, sort your dataset in descending order based on the total column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted result in your report you will need to use two new functions: arrange (for sorting) and desc (for descending order). The sample code is provided below.
df %>%
group_by(year) %>%
summarise(total = sum(boys, girls)) %>%
arrange(desc(total))
## # A tibble: 63 × 2
## year total
## <dbl> <dbl>
## 1 1961 4268326
## 2 1960 4257850
## 3 1957 4254784
## 4 1959 4244796
## 5 1958 4203812
## 6 1962 4167362
## 7 1956 4163090
## 8 1990 4158212
## 9 1991 4110907
## 10 1963 4098020
## # … with 53 more rows
The U.S. recorded its greatest total number of births in 1961, with 4,268,326.
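The hint suggests saving the totals as a new variable, which `mutate` can do without the `group_by` and `summarise` steps. A minimal sketch under that reading follows. The names `df_head` and `top_years` are hypothetical, and the data are only the first ten rows of `present` copied from the printed output above, so the top year in this excerpt falls within 1940 to 1949 and not in 1961.

```r
library(dplyr)

# Excerpt: first ten rows of present (stored as df in the report)
df_head <- tibble(
  year  = 1940:1949,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301,
            1404587, 1691220, 1899876, 1813852, 1826352),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499,
            1330869, 1597452, 1800064, 1721216, 1733177)
)

# Add the yearly total as a new column, then sort from largest to smallest
top_years <- df_head %>%
  mutate(total = boys + girls) %>%
  arrange(desc(total))

top_years
```

Applied to the full `df`, the first row of the sorted result should agree with the `summarise` version above, since each year appears in exactly one row.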