This first lab is an introduction to R and Rstudio. R and R studio are installed on the geography department computers, but I encourage you to install these programs on your own computer. If you have a laptop, this will enable you to do your lab asignments at home, the coffee shop, or whereever.
R is available for free download at http://www.r-project.org, or by direct link at http://cran.stat.ucla.edu/. Download and install the version that is appropriate for your computer.
R is a little difficult to use by itself. There are helpful “front ends” that organize things and make life easier. The one we will use in class is called 'R studio.' Download and install it here: http://www.rstudio.com/ide/. If you are given an option to install the 'desktop' or 'server' version, choose the 'desktop' version. Accept all of the deault options.
You must first install R before R studio. But once you've installed R studio, you never need to touch R again. Just open up R studio when you want to use R. I live in R studio. In fact, I am typing this in R studio right now (with the help of an add-on package called knitr).
[Note: You often will find questions or comments in brackets like this. This indicates that there is something I want to you think about and answer now. Please duscuss it with a lab partner. You do not need to turn it in. The homework to be turned in as at the end of the page.]
When you open up R studio for the first time, it probably looks something like this:
You will see three windows (they may be in different places than here). You should see windows called: Workspace, Console, and Files. There may be a fourth, blank window. Ignore this for now.
For starters, you can use R like a calculator. Below is a code fragment. Anything you see in grey, you can either 1) copy and paste to your R Console and then hit Enter to run it, or 2) type from scratch into the R Console.
3 + 5
## [1] 8
3 * 5
## [1] 15
Note how the answer is printed out in a new line. You can also create symbols, just like in algebra. For example, suppose we want to create a variable called x, and give it the value 3.
x <- 3 # read as 'x gets 3'. This is the left carrot '<' followed by the hyphen '-'
You should now see the variable x
in your workspace window. You can then use x in math, or change the value x gets.
x + 2
## [1] 5
x <- 4
x + 2
## [1] 6
Note, that if we have an assignment (“<-”), then the result is NOT printed to screen. It just happens silently.
You can also use “=” instead of “<-”, but you'll look like an amateur.
There are two main types of data in R: numeric values and character values. (There's many more, but two's enough for now!) Let's 1) create one of each type of data, and 2) then try to add a number and a character, which should lead to an error.
a <- 3
a
## [1] 3
b <- "three"
b
## [1] "three"
a + b
## Error: non-numeric argument to binary operator
That gave a funny error. Errors in R can be very cryptic. But the gist of this error is that there is an “operator” in this case “+”; it was expecting numeric values, but got something else instead. I don't expect you to know what the error means – half of the time, I don't understand it myself.
We'll download some data that were collected in the 18th century, by John Arbuthonot, a London physician, in order to study the sex ratio (the ratio of males to females).
arbuthnot <- read.csv(file = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Arbuthnot.csv")
This instructed R to venture out on the web, download some data, and load it back into R. We'll look at importing data again, later, so I won't say anything more about this right now. You should see that the workspace window now has a data object called arbuthnot, which has 82 observations and 8 variables. Click on “arbuthnot” in the workspace window. You should see a new window that looks like simplified Excel spreadsheet. In the crazy language of R, this type of data object is called a 'data.frame.'
Each row of this data.frame contains the observations for one year. For each year, we have the number of boys christened in London, the number of girls christened, and a few other things, such as the number of deaths from plague. John Arbuthnot used these data to evaluate the hypothesis that boys and girls are born in equal proportion (actually, he was evaluating the hypothesis that an intelligent designer controls the birth rate, and not chance!).
[Questions: Why do you think that Arbuthnot collected data on christenings, rather than births? How would these data have been collected in 1710? Why might the number of christenings be different than the number of births?]
When you bring data into R, there are a few useful diagnostics that can be done in order to see what you have in hand. You can see the size of a data.frame by asking R for its dimensions:
dim(arbuthnot)
## [1] 82 8
you can also ask R for the names of the columns:
names(arbuthnot)
## [1] "X" "Year" "Males" "Females" "Plague" "Mortality"
## [7] "Ratio" "Total"
Suppose that you don't want the entire data set but just one column: maybe the year. R provides multiple ways to do this. Sometimes one way is more convenient than other ways.
One way to ask R for a column is to use [row,column]
notation. We can ask for every row, but only the 'Year' column
x <- arbuthnot[, "Year"] #Note, row in [row, column] is blank. This implies all rows.
x
## [1] 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642
## [15] 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656
## [29] 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670
## [43] 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684
## [57] 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698
## [71] 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710
We can ask for every row, but only the second column (which, just happens to be the “Year”)
y <- arbuthnot[, 2]
y
## [1] 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642
## [15] 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656
## [29] 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670
## [43] 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684
## [57] 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698
## [71] 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710
Note, that in the first case, Year was in quotes, but in the second case, the 2 was not. Why do you think this is?
In general… I always recommend using names rather than numbers. Sometimes data get rearranged. For example, maybe you added or deleted a column along the way or rearranged their order. If you use the name, you'll always get what you mean. If you use the number, then you are assuming that what you want is in a particular place. Sometimes assumptions are correct. Sometimes they aren't.
There is still one other common way to get a single column of a data.frame:
z <- arbuthnot$Year
z
## [1] 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642
## [15] 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656
## [29] 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670
## [43] 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684
## [57] 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698
## [71] 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710
Here's another bit of notation: What if I only want the first five rows:
arbuthnot[1:5, "Year"]
## [1] 1629 1630 1631 1632 1633
And finally, suppose I want two columns: Males and Females. Here are two ways to do that:
arbuthnot[, c("Males", "Females")] # Note that I put the names inside a c()
## Males Females
## 1 5218 4683
## 2 4858 4457
## 3 4422 4102
## 4 4994 4590
## 5 5158 4839
## 6 5035 4820
## 7 5106 4928
## 8 4917 4605
## 9 4703 4457
## 10 5359 4952
## 11 5366 4784
## 12 5518 5332
## 13 5470 5200
## 14 5460 4910
## 15 4793 4617
## 16 4107 3997
## 17 4047 3919
## 18 3768 3395
## 19 3796 3536
## 20 3363 3181
## 21 3079 2746
## 22 2890 2722
## 23 3231 2840
## 24 3220 2908
## 25 3196 2959
## 26 3441 3179
## 27 3655 3349
## 28 3668 3382
## 29 3396 3289
## 30 3157 3013
## 31 3209 2781
## 32 3724 3247
## 33 4748 4107
## 34 5216 4803
## 35 5411 4881
## 36 6041 5681
## 37 5114 4858
## 38 4678 4319
## 39 5616 5322
## 40 6073 5560
## 41 6506 5829
## 42 6278 5719
## 43 6449 6061
## 44 6443 6120
## 45 6073 5822
## 46 6113 5738
## 47 6058 5717
## 48 6552 5847
## 49 6423 6203
## 50 6568 6033
## 51 6247 6041
## 52 6548 6299
## 53 6822 6533
## 54 6909 6744
## 55 7577 7158
## 56 7575 7127
## 57 7484 7246
## 58 7575 7119
## 59 7737 7214
## 60 7487 7101
## 61 7604 7167
## 62 7909 7302
## 63 7662 7392
## 64 7602 7316
## 65 7676 7483
## 66 6985 6647
## 67 7263 6713
## 68 7632 7229
## 69 8062 7767
## 70 8426 7626
## 71 7911 7452
## 72 7578 7061
## 73 8102 7514
## 74 8031 7656
## 75 7765 7683
## 76 6113 5738
## 77 8366 7779
## 78 7952 7417
## 79 8379 7687
## 80 8239 7623
## 81 7840 7380
## 82 7640 7288
# c() is the combine function. It's used a lot in R. Another way to get
# these columns
arbuthnot[, 3:4] # This is much less safe. What if the data were rearranged? Then I might not get what I wanted.
## Males Females
## 1 5218 4683
## 2 4858 4457
## 3 4422 4102
## 4 4994 4590
## 5 5158 4839
## 6 5035 4820
## 7 5106 4928
## 8 4917 4605
## 9 4703 4457
## 10 5359 4952
## 11 5366 4784
## 12 5518 5332
## 13 5470 5200
## 14 5460 4910
## 15 4793 4617
## 16 4107 3997
## 17 4047 3919
## 18 3768 3395
## 19 3796 3536
## 20 3363 3181
## 21 3079 2746
## 22 2890 2722
## 23 3231 2840
## 24 3220 2908
## 25 3196 2959
## 26 3441 3179
## 27 3655 3349
## 28 3668 3382
## 29 3396 3289
## 30 3157 3013
## 31 3209 2781
## 32 3724 3247
## 33 4748 4107
## 34 5216 4803
## 35 5411 4881
## 36 6041 5681
## 37 5114 4858
## 38 4678 4319
## 39 5616 5322
## 40 6073 5560
## 41 6506 5829
## 42 6278 5719
## 43 6449 6061
## 44 6443 6120
## 45 6073 5822
## 46 6113 5738
## 47 6058 5717
## 48 6552 5847
## 49 6423 6203
## 50 6568 6033
## 51 6247 6041
## 52 6548 6299
## 53 6822 6533
## 54 6909 6744
## 55 7577 7158
## 56 7575 7127
## 57 7484 7246
## 58 7575 7119
## 59 7737 7214
## 60 7487 7101
## 61 7604 7167
## 62 7909 7302
## 63 7662 7392
## 64 7602 7316
## 65 7676 7483
## 66 6985 6647
## 67 7263 6713
## 68 7632 7229
## 69 8062 7767
## 70 8426 7626
## 71 7911 7452
## 72 7578 7061
## 73 8102 7514
## 74 8031 7656
## 75 7765 7683
## 76 6113 5738
## 77 8366 7779
## 78 7952 7417
## 79 8379 7687
## 80 8239 7623
## 81 7840 7380
## 82 7640 7288
It's important that you become familiar with how to ask R for certain rows and columns. It's also important that you learn to refer to things by name, rather than by number.
Now that we know how to ask R for only some of the data, we can think about the best ways to visually explore the data or to communicate the patterns and stories that are hidden in the data.
This section will have a lot of plotting commands with increasing complexity. You can largely copy and paste them. Remembering these commands is less important than thinking about the best way to visualize data and to communicate the story that we want to communicate.
One natural way to plot the data would be to let the x-axis represent time (years), and the y-axis represent the number of christenings.
plot(arbuthnot$Year, arbuthnot$Males)
Note the default labels for the x and y axes. Not the prettiest. You'll fix these later. Maybe we want to add females to the plot:
plot(arbuthnot$Year, arbuthnot$Males)
# plot() creates a new plot
points(arbuthnot$Year, arbuthnot$Females)
# points() adds to an existing plot
This plot is terrible! First, the circles are rather confusing. It's hard to tell what's what. By default, 'plot()' produces a 'scatterplot'. Scatterplots are useful when the data are scattered. But these data are not really scattered, they have an order – they are ordered by year. Maybe lines would be better than points at conveying this order? Secondly, the x-axis label is really terrible.
Look at the help file for the plot function.
help(plot)
In particular, look for the parts about 'type' and 'ylab'. Here's how you put that to action:
plot(arbuthnot$Year, arbuthnot$Males, type = "l", ylab = "Christenings")
lines(arbuthnot$Year, arbuthnot$Females)
That's better but still not good enough. Maybe color would help:
plot(arbuthnot$Year, arbuthnot$Males, type = "l", col = "lightblue",
ylab = "Christenings")
lines(arbuthnot$Year, arbuthnot$Females, col = "pink")
We're getting there: but the lines are a little thin and weak. Maybe we want to try a thicker line:
plot(arbuthnot$Year, arbuthnot$Males, type = "l", col = "lightblue",
lwd = 10, ylab = "Christenings")
lines(arbuthnot$Year, arbuthnot$Females, col = "pink", lwd = 10)
[Questions: Are more males christened than females? Are you certain? How certain? Why are you this certain? Are more males born than females? How certain are you? Why?]
[Questions: I used light blue for males and pink for females. Why? Which would be better: 1) pink for males and light blue females or 2) green for males and black for females? Why? Might color choice be different in another culture? Why?]
Depending on what we want to find out, it may be helpful to present the data in different ways. Maybe, we don't want to show the number of christenings, but the sex ratio of males to females. We could calculate this by hand for each year:
arbuthnot[1, ]
## X Year Males Females Plague Mortality Ratio Total
## 1 1 1629 5218 4683 0 8771 1.114 9.901
5218/4683
## [1] 1.114
But that would get tedious. A better way is this:
sex.ratio <- arbuthnot$Males/arbuthnot$Females
sex.ratio
## [1] 1.114 1.090 1.078 1.088 1.066 1.045 1.036 1.068 1.055 1.082 1.122
## [12] 1.035 1.052 1.112 1.038 1.028 1.033 1.110 1.074 1.057 1.121 1.062
## [23] 1.138 1.107 1.080 1.082 1.091 1.085 1.033 1.048 1.154 1.147 1.156
## [34] 1.086 1.109 1.063 1.053 1.083 1.055 1.092 1.116 1.098 1.064 1.053
## [45] 1.043 1.065 1.060 1.121 1.035 1.089 1.034 1.040 1.044 1.024 1.059
## [56] 1.063 1.033 1.064 1.072 1.054 1.061 1.083 1.037 1.039 1.026 1.051
## [67] 1.082 1.056 1.038 1.105 1.062 1.073 1.078 1.049 1.011 1.065 1.075
## [78] 1.072 1.090 1.081 1.062 1.048
We created a completely new variable: sex.ratio. You should see it listed in the Workspace window. Please note that sex.ratio is not part of the arbuthnot data.frame: sex.ratio is it's own object. We can plot the sex ratio now:
plot(arbuthnot$Year, sex.ratio, type = "l", lwd = 10)
That plot's OK. But a sex ratio of 1.0 is really important. It means that there are just as many boys as girls. This was actually the value that Arbuthnot was interested in. How do we show this on the plot?
First we need to change the limits of the y axis. The plotting command for setting the y axis limits is 'ylim'. We need to choose reasonable lower and upper bounds. I'll choose .9 and 1.2 for lower and upper bounds. To put these two number together, we use the combine command c() again, i.e. c(.9, 1.2). The plot comand is now:
plot(arbuthnot$Year, sex.ratio, type = "l", lwd = 10, ylim = c(0.9,
1.2))
That's a lot to swallow (we're almost done, though.) Look that over carefully. Note how we used a function c() within another function plot().
And finally, we might want to show 1.0 as a horizontal reference line.
plot(arbuthnot$Year, sex.ratio, type = "l", lwd = 10, ylim = c(0.9,
1.2))
abline(h = 1, lwd = 2) # Type help(abline) to see how that one works!
[How does this second plot of sex ratio compare to the first plot of counts?]
There. We're done for today.
We learned a few things about R in this lab that are super important.
Second, we learned a few things about statistics and data visualization.
In R:
Not in R:
<-
[,]
$
dim()
help()
c()
plot(x,y,...)
points(x,y)
lines(x,y)