Geography 415 Lab 1

Objectives

Getting Started

This first lab is an introduction to R and Rstudio. R and R studio are installed on the geography department computers, but I encourage you to install these programs on your own computer. If you have a laptop, this will enable you to do your lab asignments at home, the coffee shop, or whereever.

R is available for free download at http://www.r-project.org, or by direct link at http://cran.stat.ucla.edu/. Download and install the version that is appropriate for your computer.

R is a little difficult to use by itself. There are helpful “front ends” that organize things and make life easier. The one we will use in class is called 'R studio.' Download and install it here: http://www.rstudio.com/ide/. If you are given an option to install the 'desktop' or 'server' version, choose the 'desktop' version. Accept all of the deault options.

You must first install R before R studio. But once you've installed R studio, you never need to touch R again. Just open up R studio when you want to use R. I live in R studio. In fact, I am typing this in R studio right now (with the help of an add-on package called knitr).

[Note: You often will find questions or comments in brackets like this. This indicates that there is something I want to you think about and answer now. Please duscuss it with a lab partner. You do not need to turn it in. The homework to be turned in as at the end of the page.]

A first look at R Studio

When you open up R studio for the first time, it probably looks something like this:

Screenshot of Rstudio

You will see three windows (they may be in different places than here). You should see windows called: Workspace, Console, and Files. There may be a fourth, blank window. Ignore this for now.

For starters, you can use R like a calculator. Below is a code fragment. Anything you see in grey, you can either 1) copy and paste to your R Console and then hit Enter to run it, or 2) type from scratch into the R Console.

3 + 5
## [1] 8
3 * 5
## [1] 15

Note how the answer is printed out in a new line. You can also create symbols, just like in algebra. For example, suppose we want to create a variable called x, and give it the value 3.

x <- 3  # read as 'x gets 3'.  This is the left carrot '<' followed by the hyphen '-'

You should now see the variable x in your workspace window. You can then use x in math, or change the value x gets.

x + 2
## [1] 5
x <- 4
x + 2
## [1] 6

Note, that if we have an assignment (“<-”), then the result is NOT printed to screen. It just happens silently.

You can also use “=” instead of “<-”, but you'll look like an amateur.

Data Types

There are two main types of data in R: numeric values and character values. (There's many more, but two's enough for now!) Let's 1) create one of each type of data, and 2) then try to add a number and a character, which should lead to an error.

a <- 3
a
## [1] 3
b <- "three"
b
## [1] "three"
a + b
## Error: non-numeric argument to binary operator

That gave a funny error. Errors in R can be very cryptic. But the gist of this error is that there is an “operator” in this case “+”; it was expecting numeric values, but got something else instead. I don't expect you to know what the error means – half of the time, I don't understand it myself.

Getting your hands dirty with data

We'll download some data that were collected in the 18th century, by John Arbuthonot, a London physician, in order to study the sex ratio (the ratio of males to females).

arbuthnot <- read.csv(file = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Arbuthnot.csv")

This instructed R to venture out on the web, download some data, and load it back into R. We'll look at importing data again, later, so I won't say anything more about this right now. You should see that the workspace window now has a data object called arbuthnot, which has 82 observations and 8 variables. Click on “arbuthnot” in the workspace window. You should see a new window that looks like simplified Excel spreadsheet. In the crazy language of R, this type of data object is called a 'data.frame.'

Each row of this data.frame contains the observations for one year. For each year, we have the number of boys christened in London, the number of girls christened, and a few other things, such as the number of deaths from plague. John Arbuthnot used these data to evaluate the hypothesis that boys and girls are born in equal proportion (actually, he was evaluating the hypothesis that an intelligent designer controls the birth rate, and not chance!).

[Questions: Why do you think that Arbuthnot collected data on christenings, rather than births? How would these data have been collected in 1710? Why might the number of christenings be different than the number of births?]

When you bring data into R, there are a few useful diagnostics that can be done in order to see what you have in hand. You can see the size of a data.frame by asking R for its dimensions:

dim(arbuthnot)
## [1] 82  8

you can also ask R for the names of the columns:

names(arbuthnot)
## [1] "X"         "Year"      "Males"     "Females"   "Plague"    "Mortality"
## [7] "Ratio"     "Total"    

Suppose that you don't want the entire data set but just one column: maybe the year. R provides multiple ways to do this. Sometimes one way is more convenient than other ways.

One way to ask R for a column is to use [row,column] notation. We can ask for every row, but only the 'Year' column

x <- arbuthnot[, "Year"]  #Note, row in [row, column] is blank.  This implies all rows.
x
##  [1] 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642
## [15] 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656
## [29] 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670
## [43] 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684
## [57] 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698
## [71] 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710

We can ask for every row, but only the second column (which, just happens to be the “Year”)

y <- arbuthnot[, 2]
y
##  [1] 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642
## [15] 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656
## [29] 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670
## [43] 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684
## [57] 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698
## [71] 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710

Note, that in the first case, Year was in quotes, but in the second case, the 2 was not. Why do you think this is?

In general… I always recommend using names rather than numbers. Sometimes data get rearranged. For example, maybe you added or deleted a column along the way or rearranged their order. If you use the name, you'll always get what you mean. If you use the number, then you are assuming that what you want is in a particular place. Sometimes assumptions are correct. Sometimes they aren't.

There is still one other common way to get a single column of a data.frame:

z <- arbuthnot$Year
z
##  [1] 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642
## [15] 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656
## [29] 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670
## [43] 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684
## [57] 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698
## [71] 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710

Here's another bit of notation: What if I only want the first five rows:

arbuthnot[1:5, "Year"]
## [1] 1629 1630 1631 1632 1633

And finally, suppose I want two columns: Males and Females. Here are two ways to do that:

arbuthnot[, c("Males", "Females")]  # Note that I put the names inside a c()
##    Males Females
## 1   5218    4683
## 2   4858    4457
## 3   4422    4102
## 4   4994    4590
## 5   5158    4839
## 6   5035    4820
## 7   5106    4928
## 8   4917    4605
## 9   4703    4457
## 10  5359    4952
## 11  5366    4784
## 12  5518    5332
## 13  5470    5200
## 14  5460    4910
## 15  4793    4617
## 16  4107    3997
## 17  4047    3919
## 18  3768    3395
## 19  3796    3536
## 20  3363    3181
## 21  3079    2746
## 22  2890    2722
## 23  3231    2840
## 24  3220    2908
## 25  3196    2959
## 26  3441    3179
## 27  3655    3349
## 28  3668    3382
## 29  3396    3289
## 30  3157    3013
## 31  3209    2781
## 32  3724    3247
## 33  4748    4107
## 34  5216    4803
## 35  5411    4881
## 36  6041    5681
## 37  5114    4858
## 38  4678    4319
## 39  5616    5322
## 40  6073    5560
## 41  6506    5829
## 42  6278    5719
## 43  6449    6061
## 44  6443    6120
## 45  6073    5822
## 46  6113    5738
## 47  6058    5717
## 48  6552    5847
## 49  6423    6203
## 50  6568    6033
## 51  6247    6041
## 52  6548    6299
## 53  6822    6533
## 54  6909    6744
## 55  7577    7158
## 56  7575    7127
## 57  7484    7246
## 58  7575    7119
## 59  7737    7214
## 60  7487    7101
## 61  7604    7167
## 62  7909    7302
## 63  7662    7392
## 64  7602    7316
## 65  7676    7483
## 66  6985    6647
## 67  7263    6713
## 68  7632    7229
## 69  8062    7767
## 70  8426    7626
## 71  7911    7452
## 72  7578    7061
## 73  8102    7514
## 74  8031    7656
## 75  7765    7683
## 76  6113    5738
## 77  8366    7779
## 78  7952    7417
## 79  8379    7687
## 80  8239    7623
## 81  7840    7380
## 82  7640    7288
# c() is the combine function.  It's used a lot in R.  Another way to get
# these columns
arbuthnot[, 3:4]  # This is much less safe.  What if the data were rearranged?  Then I might not get what I wanted.
##    Males Females
## 1   5218    4683
## 2   4858    4457
## 3   4422    4102
## 4   4994    4590
## 5   5158    4839
## 6   5035    4820
## 7   5106    4928
## 8   4917    4605
## 9   4703    4457
## 10  5359    4952
## 11  5366    4784
## 12  5518    5332
## 13  5470    5200
## 14  5460    4910
## 15  4793    4617
## 16  4107    3997
## 17  4047    3919
## 18  3768    3395
## 19  3796    3536
## 20  3363    3181
## 21  3079    2746
## 22  2890    2722
## 23  3231    2840
## 24  3220    2908
## 25  3196    2959
## 26  3441    3179
## 27  3655    3349
## 28  3668    3382
## 29  3396    3289
## 30  3157    3013
## 31  3209    2781
## 32  3724    3247
## 33  4748    4107
## 34  5216    4803
## 35  5411    4881
## 36  6041    5681
## 37  5114    4858
## 38  4678    4319
## 39  5616    5322
## 40  6073    5560
## 41  6506    5829
## 42  6278    5719
## 43  6449    6061
## 44  6443    6120
## 45  6073    5822
## 46  6113    5738
## 47  6058    5717
## 48  6552    5847
## 49  6423    6203
## 50  6568    6033
## 51  6247    6041
## 52  6548    6299
## 53  6822    6533
## 54  6909    6744
## 55  7577    7158
## 56  7575    7127
## 57  7484    7246
## 58  7575    7119
## 59  7737    7214
## 60  7487    7101
## 61  7604    7167
## 62  7909    7302
## 63  7662    7392
## 64  7602    7316
## 65  7676    7483
## 66  6985    6647
## 67  7263    6713
## 68  7632    7229
## 69  8062    7767
## 70  8426    7626
## 71  7911    7452
## 72  7578    7061
## 73  8102    7514
## 74  8031    7656
## 75  7765    7683
## 76  6113    5738
## 77  8366    7779
## 78  7952    7417
## 79  8379    7687
## 80  8239    7623
## 81  7840    7380
## 82  7640    7288

It's important that you become familiar with how to ask R for certain rows and columns. It's also important that you learn to refer to things by name, rather than by number.

An Introduction to Visualization

Now that we know how to ask R for only some of the data, we can think about the best ways to visually explore the data or to communicate the patterns and stories that are hidden in the data.

This section will have a lot of plotting commands with increasing complexity. You can largely copy and paste them. Remembering these commands is less important than thinking about the best way to visualize data and to communicate the story that we want to communicate.

One natural way to plot the data would be to let the x-axis represent time (years), and the y-axis represent the number of christenings.

plot(arbuthnot$Year, arbuthnot$Males)

plot of chunk unnamed-chunk-13

Note the default labels for the x and y axes. Not the prettiest. You'll fix these later. Maybe we want to add females to the plot:

plot(arbuthnot$Year, arbuthnot$Males)
# plot() creates a new plot
points(arbuthnot$Year, arbuthnot$Females)

plot of chunk unnamed-chunk-14

# points() adds to an existing plot

This plot is terrible! First, the circles are rather confusing. It's hard to tell what's what. By default, 'plot()' produces a 'scatterplot'. Scatterplots are useful when the data are scattered. But these data are not really scattered, they have an order – they are ordered by year. Maybe lines would be better than points at conveying this order? Secondly, the x-axis label is really terrible.

Look at the help file for the plot function.

help(plot)

In particular, look for the parts about 'type' and 'ylab'. Here's how you put that to action:

plot(arbuthnot$Year, arbuthnot$Males, type = "l", ylab = "Christenings")
lines(arbuthnot$Year, arbuthnot$Females)

plot of chunk unnamed-chunk-16

That's better but still not good enough. Maybe color would help:

plot(arbuthnot$Year, arbuthnot$Males, type = "l", col = "lightblue", 
    ylab = "Christenings")
lines(arbuthnot$Year, arbuthnot$Females, col = "pink")

plot of chunk unnamed-chunk-17

We're getting there: but the lines are a little thin and weak. Maybe we want to try a thicker line:

plot(arbuthnot$Year, arbuthnot$Males, type = "l", col = "lightblue", 
    lwd = 10, ylab = "Christenings")
lines(arbuthnot$Year, arbuthnot$Females, col = "pink", lwd = 10)

plot of chunk unnamed-chunk-18

[Questions: Are more males christened than females? Are you certain? How certain? Why are you this certain? Are more males born than females? How certain are you? Why?]

[Questions: I used light blue for males and pink for females. Why? Which would be better: 1) pink for males and light blue females or 2) green for males and black for females? Why? Might color choice be different in another culture? Why?]

Trying alternate displays

Depending on what we want to find out, it may be helpful to present the data in different ways. Maybe, we don't want to show the number of christenings, but the sex ratio of males to females. We could calculate this by hand for each year:

arbuthnot[1, ]
##   X Year Males Females Plague Mortality Ratio Total
## 1 1 1629  5218    4683      0      8771 1.114 9.901
5218/4683
## [1] 1.114

But that would get tedious. A better way is this:

sex.ratio <- arbuthnot$Males/arbuthnot$Females
sex.ratio
##  [1] 1.114 1.090 1.078 1.088 1.066 1.045 1.036 1.068 1.055 1.082 1.122
## [12] 1.035 1.052 1.112 1.038 1.028 1.033 1.110 1.074 1.057 1.121 1.062
## [23] 1.138 1.107 1.080 1.082 1.091 1.085 1.033 1.048 1.154 1.147 1.156
## [34] 1.086 1.109 1.063 1.053 1.083 1.055 1.092 1.116 1.098 1.064 1.053
## [45] 1.043 1.065 1.060 1.121 1.035 1.089 1.034 1.040 1.044 1.024 1.059
## [56] 1.063 1.033 1.064 1.072 1.054 1.061 1.083 1.037 1.039 1.026 1.051
## [67] 1.082 1.056 1.038 1.105 1.062 1.073 1.078 1.049 1.011 1.065 1.075
## [78] 1.072 1.090 1.081 1.062 1.048

We created a completely new variable: sex.ratio. You should see it listed in the Workspace window. Please note that sex.ratio is not part of the arbuthnot data.frame: sex.ratio is it's own object. We can plot the sex ratio now:

plot(arbuthnot$Year, sex.ratio, type = "l", lwd = 10)

plot of chunk unnamed-chunk-21

That plot's OK. But a sex ratio of 1.0 is really important. It means that there are just as many boys as girls. This was actually the value that Arbuthnot was interested in. How do we show this on the plot?

First we need to change the limits of the y axis. The plotting command for setting the y axis limits is 'ylim'. We need to choose reasonable lower and upper bounds. I'll choose .9 and 1.2 for lower and upper bounds. To put these two number together, we use the combine command c() again, i.e. c(.9, 1.2). The plot comand is now:

plot(arbuthnot$Year, sex.ratio, type = "l", lwd = 10, ylim = c(0.9, 
    1.2))

plot of chunk unnamed-chunk-22

That's a lot to swallow (we're almost done, though.) Look that over carefully. Note how we used a function c() within another function plot().

And finally, we might want to show 1.0 as a horizontal reference line.

plot(arbuthnot$Year, sex.ratio, type = "l", lwd = 10, ylim = c(0.9, 
    1.2))
abline(h = 1, lwd = 2)  # Type help(abline) to see how that one works!

plot of chunk unnamed-chunk-23

[How does this second plot of sex ratio compare to the first plot of counts?]

There. We're done for today.

Summary

We learned a few things about R in this lab that are super important.

  1. We learned how to create new objects in R using '<-'.
  2. We were introduced to the data.frame.
  3. We learned how to extract parts of a data.frame, using [,] and $

Second, we learned a few things about statistics and data visualization.

  1. We started to ask questions of our data: For example, What exactly is a 'sex ratio?' What is the sex ratio of births? What is the sex ratio of christenings? How are they related? Can the data we have (sex ratio of christenings) tell us something useful about what we really want to know (sex ratio at birth). Are we able to get better data? Will it be worth the effort?
  2. We are beginning to think critically about data visualization. What is the best way to visualize data? What variable should I plot? Maybe I need to calculate the variable first, as we did with sex ratio. How can I use visual symbols – such as line width, color, or axes – to better convey a message?

Homework

In R:

  1. Clean up the sex ratio plot by adding helpful axes labels and a title. The purpose of this is to a) get practice using the R help function, and b) practice producing better, more informative visualizations. There are a few ways to do this. Typing help(plot) will get you started. Look for instruction on how to use the 'main' argument. There are examples at the bottom. The help file will also direct you to another function that would work, called title(). Just like points() or lines(), title() is called AFTER you call plot(). Note, after you add a title, you might want to change the y axis label. Maybe to “Count”. It depends on your title. There's no right answer (but there are wrong answers!). These are design decisions that are worth thinking about. Include your plot in your homework.
  2. Modify the plot of christening counts by setting the limits of the y-axis to 0. Clean up the axis and give it a title. The purpose of this problem is to learn how axis limits affects our interpretation of plots. Include your plot in your homework.

Not in R:

  1. Compare and contrast the plot of christening counts vs the plots of sex ratio. What do these plots emphasize or reveal? What do they conceal or hide? Which is better? Describe why?
  2. Compare and contrast the plot of christening counts with the default y axis with the ploy with the y axis starting at 0. What do these plots emphasize or reveal? What do they conceal or hide? Which is better? Describe why?
  3. Describe the the sex ratio of christenings and the sex ratio of births? Describe what might cause them to be similar or different? Is it OK to use the sex ratio of christenings as a proxy for the sex ratio at birth? The purpose of this question is to begin thinking about measures and proxies. Proxies are hugely important in geographic research. There are proxies for everything from sex ratio to climate. (If you think it's difficult to measure the sex ratio - for which should have a good definition - imagine trying to quantify and measure “climate”).
  4. For grad students: There is obviously a subtle difference between the sex ratio of christenings and the sex ratio at birth. But what exactly is the “sex ratio at birth”? How might it vary across time and cultures? Why might it vary? Compare 1) 17th century London with 2) the modern-day US and 3) with China under the one-child policy. How does a measure like “sex ratio at birth” depend on interactions of biology, culture and technology?

List of commands that you learned today that you should know