Lab 1: Intro to R

library(tidyverse)
library(openintro)
library(RCurl)

Exercise 1

arbuthnot$girls

##  [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910 4617
## [16] 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382 3289 3013
## [31] 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719 6061 6120 5822
## [46] 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127 7246 7119 7214 7101
## [61] 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626 7452 7061 7514 7656 7683
## [76] 5738 7779 7417 7687 7623 7380 7288

glimpse(arbuthnot)

## Rows: 82
## Columns: 3
## $ year  <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639…
## $ boys  <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366…
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784…

min(arbuthnot$year)

## [1] 1629

max(arbuthnot$year)

## [1] 1710

arbuthnot

## # A tibble: 82 × 3
##     year  boys girls
##    <int> <int> <int>
##  1  1629  5218  4683
##  2  1630  4858  4457
##  3  1631  4422  4102
##  4  1632  4994  4590
##  5  1633  5158  4839
##  6  1634  5035  4820
##  7  1635  5106  4928
##  8  1636  4917  4605
##  9  1637  4703  4457
## 10  1638  5359  4952
## # … with 72 more rows

Exercise 2

Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)

# Insert code for Exercise 2 here
ggplot(data = arbuthnot, aes(x = year, y = girls)) + 
 geom_line()

Yes. Overall, the number of girls baptized rose over the period, but not steadily. Counts fell sharply during the 1640s, from 5,332 in 1640 to a low of 2,722 in 1650, and stayed near 3,000 through 1659. They then climbed rapidly, reaching 5,681 in 1664, before a brief but sharp dip to 4,319 in 1666. From that point forward there was a fairly steady increase, passing 7,000 in 1683 and reaching about 7,700 in the late 1690s. The count then dropped abruptly to 5,738 in 1704 and rebounded to 7,779 in 1705.
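The turning points in this series can be checked against the counts printed in Exercise 1 rather than read off the plot. The following is a minimal sketch in base R; the names `low_year`, `peak_year`, and `drop_year` are illustrative, and the year vector assumes the counts are in year order from 1629 to 1710, as the tibble printout suggests.

```r
# Girls' baptism counts copied from the arbuthnot$girls output in Exercise 1.
girls <- c(
  4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784, 5332, 5200, 4910, 4617,
  3997, 3919, 3395, 3536, 3181, 2746, 2722, 2840, 2908, 2959, 3179, 3349, 3382, 3289, 3013,
  2781, 3247, 4107, 4803, 4881, 5681, 4858, 4319, 5322, 5560, 5829, 5719, 6061, 6120, 5822,
  5738, 5717, 5847, 6203, 6033, 6041, 6299, 6533, 6744, 7158, 7127, 7246, 7119, 7214, 7101,
  7167, 7302, 7392, 7316, 7483, 6647, 6713, 7229, 7767, 7626, 7452, 7061, 7514, 7656, 7683,
  5738, 7779, 7417, 7687, 7623, 7380, 7288
)

# Assumption: one count per year, in order, from 1629 to 1710.
year <- 1629:1710
stopifnot(length(girls) == length(year))

low_year  <- year[which.min(girls)]            # year with the fewest girls baptized
peak_year <- year[which.max(girls)]            # year with the most girls baptized
drop_year <- year[which.min(diff(girls)) + 1]  # year that ends the largest one-year fall

cat("Lowest count in", low_year, "\n")
cat("Highest count in", peak_year, "\n")
cat("Largest one-year drop ends in", drop_year, "\n")
```

`diff(girls)` gives each year's change from the previous year, so its most negative entry marks the steepest single-year decline.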

Exercise 3

Now, generate a plot of the proportion of boys born over time. What do you see?

# Insert code for Exercise 3 here
ggplot(data = arbuthnot, aes(x = year, y = boys)) + 
 geom_line()

The graph of boys being born follows a trajectory similar to that of girls being baptized. From 1640 to 1660 there was a sharp decline in boys being born, hitting a low of fewer than 3,000 in 1650 before a rapid increase to about 6,000 around the year 1665. There is another decline around 1705, from around 8,000 to close to 6,000, which mirrors the decline that occurred with girls at almost the same time.
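The plot in this exercise charts the number of boys, while the question asks about the proportion. One way the proportion could be computed and plotted is sketched below. To keep the snippet runnable without openintro, it uses only the ten rows of `arbuthnot` printed in Exercise 1; `arbuthnot_sample`, `arbuthnot_prop`, and `prop_boys` are illustrative names, not part of the lab.

```r
library(dplyr)
library(ggplot2)

# Illustrative sample: the ten rows of arbuthnot printed in Exercise 1.
arbuthnot_sample <- data.frame(
  year  = 1629:1638,
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952)
)

# prop_boys (hypothetical column name): boys as a share of all baptisms that year.
arbuthnot_prop <- arbuthnot_sample %>%
  mutate(prop_boys = boys / (boys + girls))

# Line plot of the proportion, with a dashed reference line at parity (0.5).
p_prop <- ggplot(arbuthnot_prop, aes(x = year, y = prop_boys)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed")

print(p_prop)
```

Replacing `arbuthnot_sample` with the full `arbuthnot` data frame should give the plot for all 82 years.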

Exercise 4

What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?

# Insert code for Exercise 4 here
data('present', package='openintro')
df <- present
dim(df)

## [1] 63  3

glimpse(df)

## Rows: 63
## Columns: 3
## $ year  <dbl> 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950…
## $ boys  <dbl> 1211684, 1289734, 1444365, 1508959, 1435301, 1404587, 1691220, 1…
## $ girls <dbl> 1148715, 1223693, 1364631, 1427901, 1359499, 1330869, 1597452, 1…

distinct(df, year)

## # A tibble: 63 × 1
##     year
##    <dbl>
##  1  1940
##  2  1941
##  3  1942
##  4  1943
##  5  1944
##  6  1945
##  7  1946
##  8  1947
##  9  1948
## 10  1949
## # … with 53 more rows

min(df$year)

## [1] 1940

max(df$year)

## [1] 2002

colnames(df)

## [1] "year"  "boys"  "girls"

df

## # A tibble: 63 × 3
##     year    boys   girls
##    <dbl>   <dbl>   <dbl>
##  1  1940 1211684 1148715
##  2  1941 1289734 1223693
##  3  1942 1444365 1364631
##  4  1943 1508959 1427901
##  5  1944 1435301 1359499
##  6  1945 1404587 1330869
##  7  1946 1691220 1597452
##  8  1947 1899876 1800064
##  9  1948 1813852 1721216
## 10  1949 1826352 1733177
## # … with 53 more rows

The dataset covers the years 1940 to 2002. The data frame has 63 rows and 3 columns. The column names are “year”, “boys” and “girls”.

Exercise 5

How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?

df$boys

##  [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876 1813852
## [10] 1826352 1823555 1923020 1971262 2001798 2059068 2073719 2133588 2179960
## [19] 2152546 2173638 2179708 2186274 2132466 2101632 2060162 1927054 1845862
## [28] 1803388 1796326 1846572 1915378 1822910 1669927 1608326 1622114 1613135
## [37] 1624436 1705916 1709394 1791267 1852616 1860272 1885676 1865553 1879490
## [46] 1927983 1924868 1951153 2002424 2069490 2129495 2101518 2082097 2048861
## [55] 2022589 1996355 1990480 1985596 2016205 2026854 2076969 2057922 2057979

dim(arbuthnot)

## [1] 82  3

colnames(arbuthnot)

## [1] "year"  "boys"  "girls"

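The arbuthnot dataset has 82 rows and 3 columns, 19 more rows than the present dataset. The present dataset covers 1940 to 2002, while the arbuthnot dataset covers 1629 to 1710. The column names are the same in both datasets, although the columns are stored as integers in arbuthnot and as doubles in present. The counts are not of a similar magnitude: the present counts are in the millions per year (boys range from about 1.2 to 2.2 million), whereas Arbuthnot’s are in the thousands (girls range from about 2,700 to 7,800).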
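The difference in magnitude can be put into a single number by comparing average yearly totals. The base R sketch below uses only the first ten rows of each dataset as printed in this report, so the result is indicative rather than exact for the full data; all variable names are illustrative.

```r
# First ten rows of arbuthnot (1629-1638), as printed in Exercise 1.
arbuthnot_boys  <- c(5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359)
arbuthnot_girls <- c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952)

# First ten rows of present (1940-1949), as printed in Exercise 4.
present_boys  <- c(1211684, 1289734, 1444365, 1508959, 1435301,
                   1404587, 1691220, 1899876, 1813852, 1826352)
present_girls <- c(1148715, 1223693, 1364631, 1427901, 1359499,
                   1330869, 1597452, 1800064, 1721216, 1733177)

# Yearly totals, then the ratio of the two averages.
arbuthnot_total <- arbuthnot_boys + arbuthnot_girls
present_total   <- present_boys + present_girls
magnitude_ratio <- mean(present_total) / mean(arbuthnot_total)

# log10 of the ratio gives the gap in orders of magnitude.
cat("Ratio of mean yearly totals:", round(magnitude_ratio), "\n")
cat("Orders of magnitude apart:", round(log10(magnitude_ratio), 1), "\n")
```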

Exercise 6

Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from Exercise 3 above, just replace the dataframe name.

ggplot(data = df, aes(x = year, y = boys)) + 
 geom_line()

Arbuthnot’s observation holds up in the U.S. data. The pattern of births over time is almost the same for boys and girls, and, as in Arbuthnot’s dataset, boys are born at a slightly but consistently higher rate than girls.
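Whether boys outnumber girls can also be checked directly instead of inferred from a plot of counts. The sketch below flags each year in which boys exceed girls, using the ten rows of `present` printed in Exercise 4; `present_sample`, `present_check`, `prop_boys`, and `more_boys` are illustrative names.

```r
library(dplyr)

# Illustrative sample: the ten rows of present printed in Exercise 4.
present_sample <- data.frame(
  year  = 1940:1949,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301,
            1404587, 1691220, 1899876, 1813852, 1826352),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499,
            1330869, 1597452, 1800064, 1721216, 1733177)
)

# Add the proportion of boys and a TRUE/FALSE flag for "more boys than girls".
present_check <- present_sample %>%
  mutate(
    prop_boys = boys / (boys + girls),
    more_boys = boys > girls
  )

# Count the years in which boys outnumber girls and show the range of proportions.
present_check %>%
  summarise(
    years           = n(),
    years_more_boys = sum(more_boys),
    min_prop_boys   = min(prop_boys),
    max_prop_boys   = max(prop_boys)
  )
```

Running the same pipeline on the full `df` would cover all 63 years.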

Exercise 7

In what year did we see the most total number of births in the U.S.? Hint: First calculate the totals and save it as a new variable. Then, sort your dataset in descending order based on the total column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted result in your report you will need to use two new functions: arrange (for sorting). We can arrange the data in a descending order with another function: desc (for descending order). The sample code is provided below.

df %>%
  group_by(year) %>%
  summarise(total = sum(boys, girls)) %>%
  arrange(desc(total))

## # A tibble: 63 × 2
##     year   total
##    <dbl>   <dbl>
##  1  1961 4268326
##  2  1960 4257850
##  3  1957 4254784
##  4  1959 4244796
##  5  1958 4203812
##  6  1962 4167362
##  7  1956 4163090
##  8  1990 4158212
##  9  1991 4110907
## 10  1963 4098020
## # … with 53 more rows

The U.S. saw the most total births in 1961, with 4,268,326.
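The pipeline above works because `group_by(year)` leaves one row per group, so `sum(boys, girls)` adds just that year's two counts. An alternative that follows the hint more literally (save the total as a new variable, then sort) is sketched below with `mutate()`. It runs on the ten rows of `present` printed in Exercise 4, so the top row here is only the maximum within 1940 to 1949; `present_sample` and `present_totals` are illustrative names.

```r
library(dplyr)

# Illustrative sample: the ten rows of present printed in Exercise 4.
present_sample <- data.frame(
  year  = 1940:1949,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301,
            1404587, 1691220, 1899876, 1813852, 1826352),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499,
            1330869, 1597452, 1800064, 1721216, 1733177)
)

# Add a total column row by row, then sort from largest to smallest total.
present_totals <- present_sample %>%
  mutate(total = boys + girls) %>%
  arrange(desc(total))

# The first row is the year with the most births within this sample.
head(present_totals, 3)
```

Unlike `summarise()`, `mutate()` keeps the `boys` and `girls` columns alongside the new `total`. Applied to the full `df`, this should put 1961 in the first row, matching the table above.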
