CS&SS/SOC/STAT 221, University of Washington, Winter 2015
This R practice will cover some concepts and material from Chapters 8 and 10 of BPS 5e.
For the assignment, answer the problems (marked Problem) in this document. Some guidelines that you should follow for your submission:
Click on File -> New -> R Script. This will open a blank document above the console. As you go along, you can copy and paste your code here and save it. This is a good way to keep track of your code and to be able to reuse it later. To run your code from this document, you can copy and paste it into the console, highlight the code and hit the Run button, or highlight the code and press Command+Enter on a Mac or Control+Enter on a PC.
You will also want to save this script (code document). To do so click on the disk icon. The first time you hit save, RStudio will ask for a file name; you can name it anything you like. Once you hit save you’ll see the file appear under the Files tab in the lower right panel. You can reopen this file anytime by simply clicking on it.
If you reopen RStudio you can open an existing R Script with File -> Open File. Alternatively, a convenient menu item, especially if you have forgotten the location of the file, is to find and open the file using File -> Recent Files.
Chapter 8 of BPS discussed sampling, and Chapter 9 involved random assignment. To draw a simple random sample in R, you use the sample function.
For example, to draw a random sample of size 10 from the numbers 1 to 100, use the following:
sample(1:100, 10)
## [1] 75 60 41 89 45 36 10 14 24 79
Note that in R, 1:100 is a way to write the sequence 1, 2, 3, …, 100 without having to write out all the numbers. Try running sample a few times and you will see that the results are different each time (and different from what you see here). Note that by default this sampling is done without replacement. That means that once an element is drawn from the population, it cannot be drawn again (e.g., there cannot be more than one 5 in the sample).
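To see the difference between sampling with and without replacement, and to make a random draw repeatable, here is a minimal sketch. The call to set.seed, the seed value 221, and the variable names are additions for illustration and are not part of the lab text.

```r
# Fix the random number generator's starting point so the draw below
# can be reproduced. The seed value 221 is arbitrary (an assumption).
set.seed(221)

# Draw 10 numbers from 1 to 100 without replacement (the default).
without_repl <- sample(1:100, 10)
print(without_repl)

# anyDuplicated() returns 0 when every element is distinct, which is
# always the case when sampling without replacement.
print(anyDuplicated(without_repl) == 0)

# With replace = TRUE the same number may be drawn more than once,
# so a sample larger than the population becomes possible.
with_repl <- sample(1:5, 10, replace = TRUE)
print(with_repl)

# Resetting the same seed reproduces the first draw exactly.
set.seed(221)
same_again <- sample(1:100, 10)
print(identical(without_repl, same_again))
```

Setting a seed is optional. Without it, each run gives a different sample, which is the behavior described above.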
Problem 1 Exercise 8.29 in BPS 5e involves data on a 12,000-acre pine forest in Louisiana. To gather data on the forest, the U.S. Forest Service created a grid of 1,410 equally spaced circular plots. Randomly select 1% of these plots (rounded to the nearest integer) for a ground crew to visit. Do this 3 times. Note that the three samples are unlikely to contain the same plots.
You can also use the sample function to simulate an opinion poll. Suppose that you want to conduct an opinion poll on whether people like eating kale. The two responses that you allow are “Yes” and “No”. It is often convenient to represent “Yes” as 1 and “No” as 0, for reasons that will be clear in a moment. Since this is Seattle, suppose that 75% of people like eating kale. To simulate what would happen if you randomly sampled 20 individuals and asked them that question, do the following:
sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25))
## [1] 1 0 1 1 1 0 0 0 1 1 0 1 0 1 0 1 0 1 1 1
In this usage, the function has two more arguments: replace and prob. replace = TRUE means that you are sampling with replacement, so your sample can contain multiple 1s or 0s. This is needed because you are not giving sample the full population, only the distinct values present in it, and using the prob argument to tell sample what proportion of individuals in the population have each value. The prob argument assigns a probability to each value, matched by position (as probabilities these should sum to 1; R rescales the weights if they do not). In this case prob = c(0.75, 0.25) means sample the first value (1) with probability 0.75 and the second value (0) with probability 0.25. When using the prob argument, check that you are assigning the probabilities to the values that you intend.
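One way to check that the probabilities are attached to the intended values is to simulate a much larger poll and tabulate it. The following is a sketch only; the variable names and the sample size of 10,000 are assumptions chosen for illustration.

```r
# Values and their probabilities, paired by position:
# the first probability goes with the first value, and so on.
values <- c(1, 0)          # 1 = "Yes", 0 = "No"
probs  <- c(0.75, 0.25)    # 75% "Yes", 25% "No"

# The probabilities should add up to 1.
print(sum(probs))

# A large simulated poll: the share of each value should land close to
# the probabilities above, though not exactly on them.
big_poll <- sample(values, 10000, replace = TRUE, prob = probs)
print(table(big_poll) / length(big_poll))
```

If the share of 1s came out near 0.25 instead of 0.75, that would indicate the probabilities were listed in the wrong order.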
The reason for using 1 and 0 instead of “Yes” and “No” is that it makes calculating the proportion of “Yes” trivial: the proportion of 1s in a vector of 1s and 0s is just the mean of that vector.
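The following small check, using a fixed set of answers made up for illustration instead of a random sample, shows that the mean and the proportion of 1s agree.

```r
# A small, fixed set of answers: 1 = "Yes", 0 = "No".
# These eight values are hypothetical, not data from the lab.
answers <- c(1, 0, 1, 1, 0, 1, 1, 1)

# Counting the 1s and dividing by the number of answers ...
prop_by_count <- sum(answers == 1) / length(answers)

# ... gives the same number as taking the mean.
prop_by_mean <- mean(answers)

print(prop_by_count)  # 0.75 (6 of the 8 answers are 1)
print(prop_by_mean)   # 0.75
```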
mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
## [1] 0.6
Try doing this a few times.
Running that code multiple times is tedious. One way to run code multiple times is to use a for loop. The following code simulates that opinion poll 10 times and, each time, calculates the proportion who like kale.
for (i in 1:10) {
  print(mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25))))
}
## [1] 0.8
## [1] 0.8
## [1] 0.75
## [1] 0.75
## [1] 0.75
## [1] 0.9
## [1] 0.8
## [1] 0.85
## [1] 0.65
## [1] 0.75
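That code is equivalent to running the same line by hand once for each pass through the loop. The first five of those runs would look like this: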
mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
## [1] 0.75
mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
## [1] 0.7
mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
## [1] 0.7
mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
## [1] 0.75
mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
## [1] 0.8
However, the for loop is much more concise, especially as the number of times the task needs to be repeated increases.
Note that the code inside the for loop uses the print function, which prints its argument to the console. R usually does that automatically when you type an expression at the console, but values computed inside a for loop are not printed automatically, so print is needed if you want to see the output of mean.
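Printing is not the only option. As a sketch, the results of each pass through the loop can be stored in a vector for later use. The variable names below are hypothetical, and replicate is a base R function that the lab text itself does not use.

```r
# Without print(), a bare expression inside a for loop shows nothing.
for (i in 1:3) {
  mean(c(1, 0, 1, 1))          # computed, but not displayed
}

# Store each simulated proportion in a vector instead of printing it.
n_polls <- 10
props <- numeric(n_polls)      # a vector of 10 zeros to fill in
for (i in 1:n_polls) {
  props[i] <- mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25)))
}
print(props)

# replicate() repeats an expression and collects the results,
# with no explicit loop.
props2 <- replicate(n_polls,
                    mean(sample(1:0, 20, replace = TRUE, prob = c(0.75, 0.25))))
print(mean(props2))            # average proportion across the polls
```

Storing the proportions makes it possible to summarize them afterwards, for example with mean or hist, which printing alone does not allow.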
Problem 2 BPS 5e Exercise 10.57 discusses an opinion poll of the American public that shows that 65% have a favorable opinion of Microsoft. I am guessing that the number is different today, but let's go ahead with it nevertheless.
Parts of this lab were adapted from the OpenIntro Probability lab, which was released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.
Some data and examples come from Moore, The Basic Practice of Statistics, 5th ed.