## # A tibble: 3 × 3
## year boys girls
## <int> <int> <int>
## 1 1629 5218 4683
## 2 1630 4858 4457
## 3 1631 4422 4102
We can use the data.frame$variable notation to extract the values of a single variable (column). For example, arbuthnot$girls returns the number of girls baptized in each year, as shown below.
## [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910 4617
## [16] 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382 3289 3013
## [31] 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719 6061 6120 5822
## [46] 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127 7246 7119 7214 7101
## [61] 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626 7452 7061 7514 7656 7683
## [76] 5738 7779 7417 7687 7623 7380 7288
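As a minimal sketch of the extraction described above, the snippet below rebuilds only the first three rows of the table shown at the top of this report and pulls out the girls column in two equivalent ways. The name arbuthnot_head is hypothetical and is used only so the example runs without loading the full data set.

```r
# Hypothetical three-row stand-in for arbuthnot (values copied from the table above)
arbuthnot_head <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  boys  = c(5218L, 4858L, 4422L),
  girls = c(4683L, 4457L, 4102L)
)

# The $ notation returns the column as a plain vector
girls_dollar <- arbuthnot_head$girls

# Double-bracket indexing by name returns the same vector
girls_bracket <- arbuthnot_head[["girls"]]

identical(girls_dollar, girls_bracket)  # TRUE
```

Both forms return a vector rather than a data frame, which is why the output above prints as a bare sequence of numbers.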
The line of best fit shows that, in general, the number of girls baptized rose from 1629 to 1710. A closer look shows that the number fell between 1640 and about 1655. It then climbed sharply until about 1690, when it started to plateau.
# Insert code for Exercise 2 here
ggplot(data = arbuthnot, mapping = aes(x = year, y = girls)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = "lm") +
  # year is on the x-axis and the count of girls is on the y-axis
  labs(title = "Trend in number of girls baptized", x = "Year", y = "No. of girls baptized")
## `geom_smooth()` using formula = 'y ~ x'
The graph below shows that, in general, the proportion of boys born/baptized trended downward from 1629 to 1710. That is, boys made up a gradually smaller share of the total number of children born/baptized. However, as shown by the red dashed line, the proportion of boys remained above 0.5 (50%) of the total in every year. That is, more boys than girls were born/baptized each year.
# Insert code for Exercise 3 here
arbuthnot <- arbuthnot %>% mutate(total = boys + girls)
arbuthnot <- arbuthnot %>% mutate(boy_ratio = boys/total)
ggplot(data = arbuthnot, mapping = aes(x = year, y = boy_ratio)) +
  geom_point() +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "lm") +
  labs(title = "Proportion of boys born over time", x = "Year of observation", y = "Proportion of boys") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
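The claim that boys exceed 50% of the total can also be checked numerically instead of being read off the plot. The sketch below applies the same boys / (boys + girls) calculation to a hypothetical three-row stand-in (arbuthnot_head, with values copied from the first rows of the table at the top of this report). It illustrates the check only and does not cover the full data set.

```r
# Hypothetical three-row stand-in for arbuthnot (values copied from the table above)
arbuthnot_head <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  boys  = c(5218L, 4858L, 4422L),
  girls = c(4683L, 4457L, 4102L)
)

# Same calculation as the mutate() calls above, written in base R
total     <- arbuthnot_head$boys + arbuthnot_head$girls
boy_ratio <- arbuthnot_head$boys / total

# TRUE only if boys are more than half of the total in every row
all_above_half <- all(boy_ratio > 0.5)
all_above_half
```

Running the same all(boy_ratio > 0.5) check on the full arbuthnot data frame would confirm or refute the statement for every year at once.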
The observations cover the years 1940 to 2002. The data frame has three columns (variables) and 63 rows (observations). The variable (column) names are year, boys, and girls.
# Insert code for Exercise 4 here
data('present', package='openintro')
present %>% summarize(Start_year = min(year), end_year = max(year), columns = ncol(present), rows = nrow(present))
## # A tibble: 1 × 4
## Start_year end_year columns rows
## <dbl> <dbl> <int> <int>
## 1 1940 2002 3 63
Below are the variable (column) names in the present data set.
## [1] "year" "boys" "girls"
The summaries of the present and arbuthnot data sets show that the two data sets are similar in how their variables compare. In both data sets, the proportion of boys is higher than 0.5 (50%).
The two data sets differ in the magnitude of the variables. The counts of boys and girls are much higher in the present data set than in the arbuthnot data set. However, this difference in magnitude cannot be used to draw any inference, because the two data sets were collected in different places at vastly different times.
present <- present %>% mutate(total = boys + girls)
present <- present %>% mutate(boy_ratio = boys/total)
summary(present)
## year boys girls total
## Min. :1940 Min. :1211684 Min. :1148715 Min. :2360399
## 1st Qu.:1956 1st Qu.:1799857 1st Qu.:1711405 1st Qu.:3511262
## Median :1971 Median :1924868 Median :1831679 Median :3756547
## Mean :1971 Mean :1885600 Mean :1793915 Mean :3679515
## 3rd Qu.:1986 3rd Qu.:2058524 3rd Qu.:1965538 3rd Qu.:4023830
## Max. :2002 Max. :2186274 Max. :2082052 Max. :4268326
## boy_ratio
## Min. :0.5112
## 1st Qu.:0.5121
## Median :0.5125
## Mean :0.5125
## 3rd Qu.:0.5130
## Max. :0.5143
## year boys girls total boy_ratio
## Min. :1629 Min. :2890 Min. :2722 Min. : 5612 Min. :0.5027
## 1st Qu.:1649 1st Qu.:4759 1st Qu.:4457 1st Qu.: 9199 1st Qu.:0.5118
## Median :1670 Median :6073 Median :5718 Median :11813 Median :0.5157
## Mean :1670 Mean :5907 Mean :5535 Mean :11442 Mean :0.5170
## 3rd Qu.:1690 3rd Qu.:7576 3rd Qu.:7150 3rd Qu.:14723 3rd Qu.:0.5210
## Max. :1710 Max. :8426 Max. :7779 Max. :16145 Max. :0.5362
Below is a plot of the proportion of boys in the present data set over time. The plot shows that the proportion of boys remains greater than 0.5 (more than 50% of the total) throughout. However, the trend line shows that this proportion decreases over time. This is similar to the arbuthnot data set (its boy_ratio plot is also shown below). In both data sets, the proportion of boys is greater than 50% of the total, but it trends downward over the observation period.
ggplot(data = present, mapping = aes(x = year, y = boy_ratio)) +
  geom_point() +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "lm") +
  labs(title = "Proportion of boys born over time - present", x = "Year of observation", y = "Proportion of boys") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data = arbuthnot, mapping = aes(x = year, y = boy_ratio)) +
  geom_point() +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "lm") +
  labs(title = "Proportion of boys born over time - arbuthnot", x = "Year of observation", y = "Proportion of boys") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Between 1940 and 2002, 1961 was the year with the most total births in the United States.
# Insert code for Exercise 7 here
# Sort the years by total births, largest first
max_total_year <- present %>%
  arrange(desc(total))
head(max_total_year, n = 5)
## # A tibble: 5 × 5
## year boys girls total boy_ratio
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1961 2186274 2082052 4268326 0.512
## 2 1960 2179708 2078142 4257850 0.512
## 3 1957 2179960 2074824 4254784 0.512
## 4 1959 2173638 2071158 4244796 0.512
## 5 1958 2152546 2051266 4203812 0.512
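Sorting the whole table works, but the year with the largest total can also be looked up directly. The sketch below uses base R's which.max() on a hypothetical stand-in (present_top) that holds only the five rows printed above, so it illustrates the lookup and does not reproduce the full analysis.

```r
# Hypothetical stand-in holding the five rows printed above
present_top <- data.frame(
  year  = c(1961, 1960, 1957, 1959, 1958),
  total = c(4268326, 4257850, 4254784, 4244796, 4203812)
)

# which.max() returns the position of the largest total;
# indexing year by that position gives the matching year
peak_year <- present_top$year[which.max(present_top$total)]
peak_year  # 1961
```

Applied to the full present data frame, the same expression would return a single year instead of a sorted table, which is convenient when only the answer is needed.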