Note: All red text consists of questions I had while exploring R

  1. What command would you use to extract just the counts of girls baptized? Try it!
arbuthnot$girls
##  [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910
## [15] 4617 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382
## [29] 3289 3013 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719
## [43] 6061 6120 5822 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127
## [57] 7246 7119 7214 7101 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626
## [71] 7452 7061 7514 7656 7683 5738 7779 7417 7687 7623 7380 7288
  1. Is there an apparent trend in the number of girls baptized over the years?
    How would you describe it?
library(ggplot2)
#plot(x = arbuthnot$year, y = arbuthnot$boys, type = "l", col="red")
ggplot(arbuthnot, aes(x = year)) +
    geom_point(aes(y = boys), color = "red", pch = 15) +
    geom_smooth(aes(y = boys), method = "lm", color = "red") +
    #geom_text(aes(y = boys, label = boys), size = 3) +
    geom_point(aes(y = girls), color = "blue") +
    geom_smooth(aes(y = girls), method = "lm", color = "blue")

I wanted to add a legend to the ggplot above and couldn’t figure out how. The lattice package produces this graph with much less code than ggplot2 because I can supply multiple columns as y values. Is there simpler ggplot2 code I could have used?
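
One possible answer to the legend question, as a minimal sketch rather than the only way: ggplot2 draws a legend automatically when a variable is mapped to an aesthetic such as colour inside aes(). That needs the data in long format, with one row per year and sex. The data frame `toy` and all of its values are invented for illustration; with the real data, `arbuthnot` would take its place.

```r
# Hypothetical stand-in for arbuthnot; years and counts are invented.
toy <- data.frame(year  = 1:5,
                  boys  = c(5200, 4900, 4400, 5000, 5100),
                  girls = c(4700, 4500, 4100, 4600, 4800))

# Stack the two count columns into long format: one row per year and sex.
long <- data.frame(year  = rep(toy$year, times = 2),
                   sex   = rep(c("boys", "girls"), each = nrow(toy)),
                   count = c(toy$boys, toy$girls))

# Mapping colour to `sex` inside aes() is what makes ggplot2 add a legend.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = year, y = count, colour = sex)) +
    geom_point() +
    geom_smooth(method = "lm") +
    scale_colour_manual(values = c(boys = "red", girls = "blue"))
  print(p)
}
```

With the data in this shape, one geom_point() and one geom_smooth() call cover both series, which brings the ggplot2 version close to the lattice one in length.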

library(lattice)
xyplot(boys + girls ~ year,
       arbuthnot, auto.key = TRUE,
       par.settings = list(superpose.symbol = list(pch = c(16,17), cex = 1.2,
                                                   col = c("orange", "blue"))))

Our graphs show that the trends in male and female births closely followed each other.

Question: did something cause the fluctuations in birth rate?

Hypothesis: External events caused a change in birth rates

Null hypothesis: Fluctuation is naturally occurring

Further exploration to reject Null Hypothesis

  • Perhaps English men went to war from 1640 to 1660
  • Perhaps there was an illness that affected childbirths or the overall population of England
  • Our data is based on baptisms, so perhaps there was a religious transformation of some sort
  • Perhaps there were economic changes

Further exploration to accept Null Hypothesis

  • Need more data from the previous century to compare these trends to.
  • Perhaps in 1640, we were already in the middle of some sort of population decline
  • Trends like this with sharp downturns but overall increase in population could be common over centuries
  1. Now, make a plot of the proportion of boys over time. What do you see? Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
proportion_boys <- arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)
plot(arbuthnot$year,proportion_boys, type="l")

  • Boys appear to make up more than 50% of overall births in every year. This data set is therefore consistent with Arbuthnot’s hypothesis.

On Your Own

The data are stored in a data frame called present.

  1. What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
source("more/present.R")
#present$year
#dim(present)
#names(present)
  • 1940-2002

  • 63 rows × 3 columns

  • “year” “boys” “girls”

 

  1. How do these counts compare to Arbuthnot’s? Are they on a similar scale?
library(lattice)
xyplot(boys+girls~year,
       present,auto.key = TRUE, 
       par.settings = list(superpose.symbol = list(pch = c(16,17), cex = 1.2,
                                                   col = c("orange", "blue"))))

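  • The present counts are in the millions per year, while Arbuthnot’s are in the thousands, so the raw counts are not on a similar scale. I will normalize the “boys born” columns from the present and Arbuthnot data and display them together to compare their trends.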
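
As a quick check on what "normalize" means here: with its default arguments, scale() centers a vector on its mean and divides by its standard deviation (a z-score). A minimal sketch using the first five girls counts from the arbuthnot output printed earlier; the variable names are only for illustration.

```r
# First five girls counts from the arbuthnot output shown earlier.
girls <- c(4683, 4457, 4102, 4590, 4839)

z_scale  <- scale(girls)                       # returns a 5 x 1 matrix
z_manual <- (girls - mean(girls)) / sd(girls)  # same values as a plain vector

# scale() stores the centering and scaling constants as attributes.
attr(z_scale, "scaled:center")  # mean of girls
attr(z_scale, "scaled:scale")   # standard deviation of girls

all.equal(as.numeric(z_scale), z_manual)
```

Because each scaled series has mean 0 and standard deviation 1, the shapes of two series can be compared on one axis even when the raw counts differ by orders of magnitude.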

library(plyr)
scaled_arbuthnot_boys <- scale(arbuthnot$boys)
#scaled_arbuthnot_boys
scaled_present_boys <- scale(present$boys)
scaled_list <- list(arbuthnot_boys = scaled_arbuthnot_boys, present_boys = scaled_present_boys)
#str(scaled_list)

# took this code off stack overflow
dat <- lapply(scaled_list, function(x) cbind(x = seq_along(x), y = x))
list_my_names <- names(dat)
lns <- sapply(dat, nrow)
dat <- as.data.frame(do.call("rbind", dat))
dat$group <- rep(list_my_names, lns)


library(ggplot2)
ggplot(dat, aes(x = x, y = V2, colour = group)) +
    theme_bw() +
    geom_line()

library(lattice)
xyplot(V2 ~ x | group, data = dat,
       groups=group,
       auto.key = TRUE, 
       par.settings = list(superpose.symbol = list(pch = 16, cex = 1.2,
                                                   col = c("orange", "blue"))))

# Pad the shorter present series with NA so both series have the same length
length(scaled_present_boys) <- length(scaled_arbuthnot_boys)
dat_2 <- cbind(scaled_present_boys, scaled_arbuthnot_boys)
my_dat_2 <- as.data.frame(dat_2)
names(my_dat_2) <- c("scaled_present", "scaled_arbuthnot")
library(lattice)
# Plot both scaled series against the row index instead of a hard-coded 1:82
xyplot(scaled_present + scaled_arbuthnot ~ seq_len(nrow(my_dat_2)),
       my_dat_2, auto.key = TRUE,
       par.settings = list(superpose.symbol = list(pch = c(16,17), cex = 1.2,
                                                   col = c("orange", "blue"))))

I did a lot of experimenting above. I wanted to normalize the entire data frame, but I settled for normalizing the boys columns. I borrowed much of the graphical display code from Stack Overflow while troubleshooting. Below I try to fully understand how I originally created dat:

dat <- lapply(scaled_list, function(x) cbind(x = seq_along(x), y = x))
list_my_names <- names(dat)
lns <- sapply(dat, nrow)
dat <- as.data.frame(do.call("rbind", dat))
dat$group <- rep(list_my_names, lns)
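  • lapply() applies an anonymous function (whose argument is named x) to each element of scaled_list
  • That function uses cbind() to pair each value with its position in the series (seq_along(x), i.e. the original row index). Because scale() returns a one-column matrix without a column name, the y = label is dropped, which is why the value column later shows up as V2
  • lns holds the row count of each element of dat
  • do.call("rbind", dat) stacks the list elements into one matrix, which as.data.frame() turns into a data frame
  • We add a group column that repeats each list name as many times as that element has rows (the counts in lns)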

Is the above correct? Is there a shorter way to do this? Is there any reading material that covers all of this? Would merge() (padding with NAs and keeping the original columns) avoid some of these steps and be a better route?
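
On the question of a shorter route, one possible base R sketch (not necessarily the best one) builds the long data frame directly with data.frame() and rbind(). The helper name `to_long` and the short input vectors are hypothetical. as.numeric() drops the matrix structure that scale() returns, so the value column keeps the name y instead of becoming V2.

```r
# Hypothetical short series standing in for the two scaled boys columns.
a <- scale(c(10, 12, 15, 11))
b <- scale(c(200, 260, 240))

# Turn one scaled series into a long data frame with an index and a label.
to_long <- function(values, label) {
  data.frame(x = seq_along(values),
             y = as.numeric(values),
             group = label)
}

# Series of different lengths stack without any NA padding.
dat_short <- rbind(to_long(a, "arbuthnot_boys"),
                   to_long(b, "present_boys"))
str(dat_short)
```

Stacking rows this way keeps one observation per row, which is the shape both ggplot2 (colour = group) and lattice (groups = group) expect, so a merge() with NA padding is not needed for these plots.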

Observations from normalized data

  • Growth appears to have been much more rapid in the present data set over its first 20-year span (1940-1960).
    • Could WW II and the post-war economy have had an effect on birth rates?
    • Adds more evidence to support my null hypothesis: the birth rate naturally sways as opposed to being driven by external events
  • Both data sets share similar trends: each has a dip in population growth, and each shows an overall increase in population growth. The deviations are similar if we exclude the 1940-1960 period.
  1. Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.
present_proportion_boys <- present$boys / (present$boys + present$girls)
proportion_boys <- arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)
plot(present$year,present_proportion_boys, type="l")

  • Once again the proportion of boys never dips below 50%. Arbuthnot’s hypothesis holds in the U.S. data as well.
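
The visual impression can also be checked numerically. This is a minimal sketch that uses only the first five rows of present (1940 to 1944), with values copied from the table printed in the next answer; the name `present_head` is hypothetical. With the full data frame, the same expressions would apply to `present` directly.

```r
# First five years of `present`, copied from the printed table (1940-1944).
present_head <- data.frame(
  year  = 1940:1944,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499)
)

prop_boys <- present_head$boys / (present_head$boys + present_head$girls)
ratio     <- present_head$boys / present_head$girls  # boy-to-girl ratio

all(prop_boys > 0.5)  # TRUE when boys outnumber girls in every listed year
range(prop_boys)      # smallest and largest proportion in the subset
```

A proportion above 0.5 is equivalent to a boy-to-girl ratio above 1, so either quantity answers the question.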

  1. In what year did we see the most total number of births in the U.S.? Refer to the help files or the R reference card http://cran.r-project.org/doc/contrib/Short-refcard.pdf to find helpful commands.
present$total_births <- present$boys+present$girls
ordered_by_births <- present[order(-present$total_births),]
ordered_by_births
##    year    boys   girls total_births
## 22 1961 2186274 2082052      4268326
## 21 1960 2179708 2078142      4257850
## 18 1957 2179960 2074824      4254784
## 20 1959 2173638 2071158      4244796
## 19 1958 2152546 2051266      4203812
## 23 1962 2132466 2034896      4167362
## 17 1956 2133588 2029502      4163090
## 51 1990 2129495 2028717      4158212
## 52 1991 2101518 2009389      4110907
## 24 1963 2101632 1996388      4098020
## 53 1992 2082097 1982917      4065014
## 61 2000 2076969 1981845      4058814
## 16 1955 2073719 1973576      4047295
## 50 1989 2069490 1971468      4040958
## 25 1964 2060162 1967328      4027490
## 62 2001 2057922 1968011      4025933
## 63 2002 2057979 1963747      4021726
## 15 1954 2059068 1958294      4017362
## 54 1993 2048861 1951379      4000240
## 60 1999 2026854 1932563      3959417
## 55 1994 2022589 1930178      3952767
## 59 1998 2016205 1925348      3941553
## 49 1988 2002424 1907086      3909510
## 14 1953 2001798 1900322      3902120
## 56 1995 1996355 1903234      3899589
## 57 1996 1990480 1901014      3891494
## 58 1997 1985596 1895298      3880894
## 13 1952 1971262 1875724      3846986
## 48 1987 1951153 1858241      3809394
## 46 1985 1927983 1832578      3760561
## 26 1965 1927054 1833304      3760358
## 47 1986 1924868 1831679      3756547
## 12 1951 1923020 1827830      3750850
## 31 1970 1915378 1816008      3731386
## 8  1947 1899876 1800064      3699940
## 43 1982 1885676 1794861      3680537
## 45 1984 1879490 1789651      3669141
## 44 1983 1865553 1773380      3638933
## 42 1981 1860272 1768966      3629238
## 41 1980 1852616 1759642      3612258
## 27 1966 1845862 1760412      3606274
## 30 1969 1846572 1753634      3600206
## 10 1949 1826352 1733177      3559529
## 32 1971 1822910 1733060      3555970
## 11 1950 1823555 1730594      3554149
## 9  1948 1813852 1721216      3535068
## 28 1967 1803388 1717571      3520959
## 29 1968 1796326 1705238      3501564
## 40 1979 1791267 1703131      3494398
## 39 1978 1709394 1623885      3333279
## 38 1977 1705916 1620716      3326632
## 7  1946 1691220 1597452      3288672
## 33 1972 1669927 1588484      3258411
## 37 1976 1624436 1543352      3167788
## 35 1974 1622114 1537844      3159958
## 36 1975 1613135 1531063      3144198
## 34 1973 1608326 1528639      3136965
## 4  1943 1508959 1427901      2936860
## 3  1942 1444365 1364631      2808996
## 5  1944 1435301 1359499      2794800
## 6  1945 1404587 1330869      2735456
## 2  1941 1289734 1223693      2513427
## 1  1940 1211684 1148715      2360399
ordered_by_births[1,]
##    year    boys   girls total_births
## 22 1961 2186274 2082052      4268326
ordered_by_births$year[1]
## [1] 1961

1961 was the year with the most births (4,268,326 in total).
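
A shorter route to the same answer, sketched under the assumption that only the year is needed: which.max() returns the position of the largest value, so no sorting is required. The three rows below are copied from the sorted table above, and `top3` is a hypothetical name.

```r
# Three rows copied from the sorted output above.
top3 <- data.frame(
  year  = c(1957, 1960, 1961),
  boys  = c(2179960, 2179708, 2186274),
  girls = c(2074824, 2078142, 2082052)
)
top3$total_births <- top3$boys + top3$girls

# Index of the row with the largest total, then the year in that row.
top3$year[which.max(top3$total_births)]
```

On the full data frame the equivalent call would be present$year[which.max(present$total_births)].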
