DATA 606 Lab
Load the required libraries and the data
library(tidyverse)
library(openintro)
library(dplyr)
data('arbuthnot', package='openintro')
Let's take a look at the data
arbuthnot
## # A tibble: 82 x 3
## year boys girls
## <int> <int> <int>
## 1 1629 5218 4683
## 2 1630 4858 4457
## 3 1631 4422 4102
## 4 1632 4994 4590
## 5 1633 5158 4839
## 6 1634 5035 4820
## 7 1635 5106 4928
## 8 1636 4917 4605
## 9 1637 4703 4457
## 10 1638 5359 4952
## # ... with 72 more rows
A compact glimpse of the same data
glimpse(arbuthnot)
## Rows: 82
## Columns: 3
## $ year <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639~
## $ boys <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366~
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784~
Let’s explore the data
Total number of girls baptized over the years
sum_girls <- sum(arbuthnot$girls)
sum_girls
## [1] 453841
Plot the number of girls baptized each year
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_point()

Plot the same data as a line graph
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_line()

Exercise 2: How would you describe the apparent trend in the number of girls baptized over the years?
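Draw a bar plot to show the trend in the number of girls baptized over the years
# Create a subset containing only the year and girls columns
girls_subdata <- subset(arbuthnot, select = c("year", "girls"))
# stat = "identity" plots the girls values as bar heights instead of counting rows
p <- ggplot(data = girls_subdata, aes(x = year, y = girls)) +
  geom_bar(stat = "identity")
p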

The plot shows that, overall, the number of girls baptized increased as the years progressed.
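The visual impression of an upward trend can be backed with a couple of numbers. The following is a minimal sketch, assuming the openintro package is installed; the linear fit and the ten-year windows are illustrative choices, not part of the lab instructions.

```r
# Load a fresh copy of the data (assumes the openintro package is installed)
data("arbuthnot", package = "openintro")

# Slope of a straight-line fit: average change in girls baptized per year.
# A positive slope is consistent with an overall increase.
fit <- lm(girls ~ year, data = arbuthnot)
slope <- unname(coef(fit)["year"])
slope

# Compare the average of the first ten years with the average of the last ten
first_decade_mean <- mean(head(arbuthnot$girls, 10))
last_decade_mean  <- mean(tail(arbuthnot$girls, 10))
c(first = first_decade_mean, last = last_decade_mean)
```

A straight line is only a rough summary here; it says nothing about dips or plateaus within the period, which the plots show better.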
Add a few new columns (total, boy-to-girl ratio, proportion of boys) to the data frame
# A single mutate() call suffices: later columns may use ones defined earlier in the same call
arbuthnot <- arbuthnot %>%
  mutate(
    total = boys + girls,
    boy_to_girl_ratio = boys / girls,
    boy_ratio = boys / total
  )
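To make the arithmetic behind the new columns concrete, here is a small worked sketch in base R. It uses only the first three rows printed in the tibble output earlier, copied by hand into a hypothetical data frame named `first_rows`, so it runs without any packages.

```r
# First three rows of arbuthnot, as printed in the output above
first_rows <- data.frame(
  year  = c(1629L, 1630L, 1631L),
  boys  = c(5218L, 4858L, 4422L),
  girls = c(4683L, 4457L, 4102L)
)

# Same three derived columns as in the lab, computed step by step
first_rows$total             <- first_rows$boys + first_rows$girls  # 9901, 9315, 8524
first_rows$boy_to_girl_ratio <- first_rows$boys / first_rows$girls  # above 1 when boys outnumber girls
first_rows$boy_ratio         <- first_rows$boys / first_rows$total  # above 0.5 when boys outnumber girls

first_rows
```

The two ratios carry the same information: boy_to_girl_ratio above 1 and boy_ratio above 0.5 both mean that more boys than girls were baptized that year.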
Exercise 3: Generate a plot of the proportion of boys born over time
Calculate the proportion of boys as a percentage
arbuthnot <- arbuthnot %>%
  mutate(boy_percent = (boys / total) * 100)
b <- ggplot(data = arbuthnot, aes(x = year, y = boy_percent)) +
  geom_bar(stat = "identity")
b

The percentage of boys is above 50 in every year, which means more boys than girls were baptized each year.
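A claim like "above 50 in every year" is hard to confirm by eye from a bar plot, so it can be checked directly. This is a minimal sketch, assuming the openintro package is installed; it recomputes the percentage from a fresh copy of the data rather than relying on the columns added above.

```r
# Load a fresh copy of the data (assumes the openintro package is installed)
data("arbuthnot", package = "openintro")

# Percentage of boys among all baptisms, one value per year
boy_percent <- 100 * arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)

# TRUE only if boys were the majority in every single year
boys_always_majority <- all(boy_percent > 50)
boys_always_majority

# Smallest and largest yearly percentage, to see how narrow the band is
range(boy_percent)
```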
Minimum and maximum number of boys baptized in a single year
arbuthnot %>%
  summarize(min = min(boys), max = max(boys))
## # A tibble: 1 x 2
## min max
## <int> <int>
## 1 2890 8426
Exercise 4: What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
The summary above is computed over every year in the arbuthnot data set, 1629 through 1710. The summary data frame itself has dimensions 1 x 2, and its variable (column) names are min and max.
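The questions in Exercise 4 can also be answered by inspecting a data frame directly. The following is a minimal sketch, assuming the openintro package is installed; it reloads a fresh copy of arbuthnot, so it reports the three original columns rather than the columns added earlier in this lab.

```r
# Load a fresh copy of the data (assumes the openintro package is installed)
data("arbuthnot", package = "openintro")

range(arbuthnot$year)  # first and last year covered
dim(arbuthnot)         # rows and columns: 82 and 3, matching the tibble header above
names(arbuthnot)       # "year" "boys" "girls"
```

The same three calls work on any other data frame, which makes them a convenient first step when a new data set is loaded.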
Exercise 5: How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
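The summary above was computed on the arbuthnot data itself, so these counts are Arbuthnot's own: between 2,890 and 8,426 boys per year, and therefore of the same magnitude.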
Exercise 6: Make a plot that displays the proportion of boys born over time. What do you see?
b <- ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) +
  geom_line()
b

The graph shows that the proportion of boys stays above 0.5 throughout the period, so more boys than girls were baptized in every year.
Exercise 7: In what year did we see the most total number of births in the U.S.?
# Sort by total in descending order and keep the top row
sorted_total <- arrange(arbuthnot, desc(total))
head(sorted_total, 1)
## # A tibble: 1 x 7
## year boys girls total boy_to_girl_ratio boy_ratio boy_percent
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1705 8366 7779 16145 1.08 0.518 51.8
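In the arbuthnot data, the year with the highest total was 1705, with 16,145 baptisms. Note that arbuthnot records baptisms in London, so this figure is not a count of U.S. births.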
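The same lookup can be done without sorting the whole data frame. This is a minimal base R sketch, assuming the openintro package is installed; `which.max()` is swapped in here as an alternative to the `arrange()` and `head()` approach used above, and it returns the position of the first maximum.

```r
# Load a fresh copy of the data (assumes the openintro package is installed)
data("arbuthnot", package = "openintro")

# Total baptisms per year
total <- arbuthnot$boys + arbuthnot$girls

# Position of the largest total, then the matching year and value
peak <- which.max(total)
peak_year  <- arrange_free_year <- arbuthnot$year[peak]  # 1705, as in the sorted output above
peak_total <- total[peak]                                # 16145
c(year = peak_year, total = peak_total)
```

Pointing the same code at a different data frame with year, boys, and girls columns would answer the question for that data set instead.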