Mini-Assignment 2: Due October 5, 2020 at 10:30am (CT)
This is the 2nd mini-assignment for Stats for Data Analysis I (Fall 2020).
In part 1 of this mini-assignment, you will flip coins both physically and virtually. In part 2, you will learn how to merge data sets and make plots in R.
rm(list=ls())
# setwd("ENTER YOUR WORKING DIRECTORY")
setwd("~/Desktop/2020Fall/STAT1/Week2/mini-assignment")
The sink() call below writes your console output to a file. This is the file that you upload to Canvas.
sink("YOURCNETID_MA2.Rout", split=TRUE)
Sys.info()
## sysname
## "Darwin"
## release
## "19.6.0"
## version
## "Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64"
## nodename
## "Xinyus-MacBook-Pro.local"
## machine
## "x86_64"
## login
## "root"
## user
## "xinyuwei"
## effective_user
## "xinyuwei"
Sys.time()
## [1] "2020-10-03 22:08:44 CST"
First, take a coin and flip it 10 times. Record each outcome here as commented code. For example, if your first flip results in tails, write “1: T” in the line below, and so on. If you have transcended the cash society and no longer carry coins, you can use https://justflipacoin.com/.
RECORD RESULTS HERE
* 1: T
* 2: T
* 3: H
* 4: T
* 5: T
* 6: H
* 7: T
* 8: H
* 9: H
* 10: T
What proportion of your flips were tails (i.e., the number of tails divided by the total number of flips)?
Proportion of tails =
proportion_of_tails <- 6/10
proportion_of_tails
## [1] 0.6
Next, we will simulate tossing a coin many, many times using R. The code below draws N independent tosses, each with probability of success P. The number of successes in N such tosses follows a binomial distribution.
Set up values for our simulation:
N <- 100 # number of trials
P <- 0.5 # probability of Heads (or Tails)
N
## [1] 100
P
## [1] 0.5
Define a sample space of two outcomes, heads or tails
We will define Heads =1 and Tails =0
sample.space <- c(0,1)
Toss those coins. Look at the results. Remember: 1=heads and 0=tails.
# prob follows the order of sample.space: P applies to 0 (tails), 1-P to 1 (heads)
tosses <- sample(sample.space,
                 size = N,
                 replace = TRUE,
                 prob = c(P, 1-P))
tosses
## [1] 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1 1 1
## [38] 1 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 1 1 1
## [75] 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0
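Since the number of heads in N independent tosses follows a binomial distribution, the count can also be drawn in a single step with rbinom(). The following is a minimal sketch rather than part of the assignment; the seed and the names `n.trials`, `p.heads`, and `heads.count` are assumptions introduced for illustration.

```r
set.seed(1)       # arbitrary seed (an assumption) so the draw is reproducible
n.trials <- 100   # plays the role of N
p.heads <- 0.5    # plays the role of P

# One draw from Binomial(n.trials, p.heads): the number of heads in 100 tosses
heads.count <- rbinom(n = 1, size = n.trials, prob = p.heads)
heads.count

# The implied proportion of tails
(n.trials - heads.count) / n.trials
```

Unlike sample(), this returns only the total count, not the sequence of individual tosses.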
What proportion of our flips landed tails?
We will use sum() to count the number of heads, and use length() to get the total number of tosses. Notice that because we want tails and not heads, we must subtract the number of heads from the total number of tosses.
proportion.tails <- (length(tosses) - sum(tosses)) / length(tosses)
proportion.tails
## [1] 0.46
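Because tails are coded as 0, the same proportion can also be computed with a logical comparison. A minimal sketch, using a short hypothetical vector `example.tosses` (an assumption, not the simulated `tosses`) so that the two approaches can be compared by eye:

```r
# Hypothetical toss vector for illustration (1 = heads, 0 = tails)
example.tosses <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)

# Approach used above: total tosses minus heads, divided by total tosses
prop.by.subtraction <- (length(example.tosses) - sum(example.tosses)) / length(example.tosses)

# Equivalent: the mean of a logical vector is the share of TRUE values
prop.by.mean <- mean(example.tosses == 0)

prop.by.subtraction
prop.by.mean

# Counts of each outcome, as a quick check
table(example.tosses)
```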
Now adapt the code from above in order to do the following:
* Toss the coin 1,000 times instead of 100 times.
* Calculate the proportion of tails again.
* Hint: You only need to copy and paste a few lines of code and change a single number.
* Make sure to print the new proportion you calculated to the screen!
new_tosses <- sample(sample.space,
                     size = 1000,
                     replace = TRUE,
                     prob = c(P, 1-P))
new_tosses
## [1] 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1
## [38] 1 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0
## [75] 1 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1
## [112] 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 1
## [149] 0 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 0 0 0 1
## [186] 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1 1 0 1 1
## [223] 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0
## [260] 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 0 1 0 1 0 0 0 0 1 1
## [297] 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1 0 1 1 0
## [334] 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1
## [371] 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1
## [408] 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 1 0 0
## [445] 0 1 0 1 0 1 0 0 1 0 0 1 0 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0
## [482] 1 0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1
## [519] 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1
## [556] 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0 1
## [593] 1 0 1 0 0 1 1 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1 1
## [630] 1 1 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1
## [667] 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1
## [704] 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 1 1
## [741] 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0
## [778] 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1
## [815] 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0
## [852] 1 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1 0
## [889] 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 0
## [926] 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0
## [963] 1 0 1 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1
## [1000] 1
new.proportion.tails <- (length(new_tosses) - sum(new_tosses)) / length(new_tosses)
new.proportion.tails
## [1] 0.506
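To see how the proportion of tails tends to settle near 0.5 as the number of tosses grows, the simulation can be repeated for several sample sizes. This is a sketch, not part of the assignment: the seed, the vector `toss.counts`, and the helper `simulate.proportion.tails()` are all assumptions introduced for illustration.

```r
set.seed(2020)  # arbitrary seed (an assumption) so the results are reproducible

# Hypothetical helper: toss a coin n times and return the proportion of tails
simulate.proportion.tails <- function(n, p.heads = 0.5) {
  tosses <- sample(c(0, 1), size = n, replace = TRUE,
                   prob = c(1 - p.heads, p.heads))  # order matches c(0, 1)
  mean(tosses == 0)
}

toss.counts <- c(10, 100, 1000, 10000)
proportions <- sapply(toss.counts, simulate.proportion.tails)

# One row per sample size
data.frame(n = toss.counts, proportion.tails = proportions)
```

Small samples can land far from 0.5, while the larger ones should typically sit much closer to it.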
Merging data sets together is one of the most important tools that you will use as a quantitative analyst. Merging is usually how we correctly link together data from multiple sources for an analysis. It is often the case that you will have data from multiple sources for the same units of observation. The examples below are for a relatively small number of units of observation (e.g., 50 states + DC), which we could probably link together by hand in a spreadsheet, but we might make an error. We also wouldn’t be able to reproduce our entire analysis from raw data to results if we don’t have code that does every step. Moreover, suppose our unit of analysis is the county rather than the state. Now, we have 3,000+ observations in our data set. I don’t think we would want to link counties from multiple data sets by hand.
Let’s start out with two simple data sets: height (measured in inches) and weight (measured in pounds).
height <- data.frame(
  name = c("Bob", "Carol", "Don"),
  height = c(71,67,66)
)
height
## name height
## 1 Bob 71
## 2 Carol 67
## 3 Don 66
weight <- data.frame(
  name = c("Don", "Carol", "Bob"),
  weight = c(145,140,150)
)
weight
## name weight
## 1 Don 145
## 2 Carol 140
## 3 Bob 150
Notice that the two data sets aren’t in the same order. That’s okay! The magic of merging is that, as long as we have a common identifier in both data sets, they don’t need to be in the same order to link them together. We will use the merge() command to link together the height and weight of the three individuals in our study.
The first argument in the merge command is one of our data sets, the second argument is the data set that we want to “merge” together with that first data set, and the third argument is the name of the column, common to both data sets, that we will use to link them together.
height.weight <- merge(x = height, y = weight, by = "name")
Inspect our newly merged data set to confirm it worked properly.
height.weight
## name height weight
## 1 Bob 71 150
## 2 Carol 67 140
## 3 Don 66 145
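By default, merge() keeps only the rows whose identifier appears in both data sets, so a unit missing from one source silently drops out. The sketch below uses two hypothetical data frames (`ages` and `cities`, which are not part of the assignment) to show how the `all = TRUE` argument keeps unmatched rows and fills the gaps with NA.

```r
# Hypothetical data: Don has no city, Eve has no age
ages <- data.frame(name = c("Bob", "Carol", "Don"), age = c(34, 29, 41))
cities <- data.frame(name = c("Carol", "Bob", "Eve"),
                     city = c("Chicago", "Boston", "Denver"))

# Default merge: only names found in both data sets are kept
inner <- merge(x = ages, y = cities, by = "name")
inner

# all = TRUE: every name from either data set is kept, with NA where data is missing
outer <- merge(x = ages, y = cities, by = "name", all = TRUE)
outer
```

Comparing nrow(inner) with nrow(outer) is a quick way to notice that some units failed to match.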
We are using data from the Covid Tracking Project (https://covidtracking.com/) for this part of the mini-assignment. The file covid_positives.csv contains the number of positive covid tests by month in the U.S., and the file covid_tests.csv contains the total number of tests run by month. Read both of these data sets into R and adapt the code from above to merge them together.
pos <- read.csv("covid_positives.csv")
tests <- read.csv("covid_tests.csv")
covid <- merge(x = pos, y = tests, by = "month")
covid
## month positive_tests tot_tests
## 1 1 0 4
## 2 2 18 6504
## 3 3 198277 1120188
## 4 4 876469 5321445
## 5 5 720198 10953496
## 6 6 836969 15576143
## 7 7 1907815 23646772
## 8 8 1460408 23043028
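After a merge it is worth confirming that no rows were lost or duplicated. The sketch below rebuilds the first three months from the values printed above instead of reading the CSV files; the names `pos.check`, `tests.check`, and `merged.check` are assumptions introduced for illustration.

```r
# First three months, copied from the printed output above
pos.check <- data.frame(month = 1:3, positive_tests = c(0, 18, 198277))
tests.check <- data.frame(month = 1:3, tot_tests = c(4, 6504, 1120188))

merged.check <- merge(x = pos.check, y = tests.check, by = "month")

# Same number of rows as the input, so no month was dropped
nrow(merged.check) == nrow(pos.check)

# No month appears more than once
anyDuplicated(merged.check$month) == 0

# No missing values were introduced
anyNA(merged.check)
```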
Let’s plot the monthly number of total covid tests across time along with the monthly number of positive covid tests. First, we plot the total number of tests using plot().
Next, we add the trend in positive tests to the plot using points().
Finally, we label both series on our plot using text().
plot(x = covid$month, y = covid$tot_tests, type = "b",
     col = "navyblue", lwd = 2,
     xlab = "Month", ylab = "(Positive) Tests",
     main = "Covid Testing by Month in U.S.")
points(x = covid$month, y = covid$positive_tests, type = "b",
       col = "darkred", lwd = 2)
text(x = 3, y = 5300000, labels = "Total Tests", col = "navyblue")
text(x = 7, y = 3000000, labels = "Positive Tests", col = "darkred")
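Labels placed with text() need hand-picked coordinates, which have to be revisited whenever the data change. One alternative is legend(), sketched below. The data frame `covid.demo` is an assumption: it is rebuilt from the merged values printed earlier so that the example runs without the CSV files.

```r
# Monthly values copied from the merged output printed earlier
covid.demo <- data.frame(
  month = 1:8,
  positive_tests = c(0, 18, 198277, 876469, 720198, 836969, 1907815, 1460408),
  tot_tests = c(4, 6504, 1120188, 5321445, 10953496, 15576143, 23646772, 23043028)
)

# Total tests first, since this series sets the y-axis range
plot(x = covid.demo$month, y = covid.demo$tot_tests, type = "b",
     col = "navyblue", lwd = 2,
     xlab = "Month", ylab = "Tests",
     main = "Covid Testing by Month in U.S.")

# Positive tests added to the same axes
lines(x = covid.demo$month, y = covid.demo$positive_tests, type = "b",
      col = "darkred", lwd = 2)

# legend() anchors the labels to a corner, so no y-coordinates are needed
legend("topleft", legend = c("Total Tests", "Positive Tests"),
       col = c("navyblue", "darkred"), lwd = 2, pch = 1, bty = "n")
```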
Now you need to create a plot of the test positivity rate by month. You need to create a new variable that is the rate of positive tests. Then you need to plot that single series across time using plot(). Be sure to label your axes and include a title!
library(tidyverse)
new_covid <- covid %>% mutate(positivity_rate = positive_tests/tot_tests)
new_covid
## month positive_tests tot_tests positivity_rate
## 1 1 0 4 0.000000000
## 2 2 18 6504 0.002767528
## 3 3 198277 1120188 0.177003324
## 4 4 876469 5321445 0.164705075
## 5 5 720198 10953496 0.065750515
## 6 6 836969 15576143 0.053734034
## 7 7 1907815 23646772 0.080679722
## 8 8 1460408 23043028 0.063377435
plot(x = new_covid$month, y = new_covid$positivity_rate, type = "b",
     col = "navyblue", lwd = 2,
     xlab = "Month", ylab = "Positivity Rate",
     main = "Covid Testing Positivity Rate by Month in U.S.")
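The new variable can also be created in base R, without loading tidyverse. A minimal sketch, assuming a data frame `covid.rates` rebuilt from the merged values printed earlier (the name is hypothetical and stands in for `covid`):

```r
# Monthly values copied from the merged output printed earlier
covid.rates <- data.frame(
  month = 1:8,
  positive_tests = c(0, 18, 198277, 876469, 720198, 836969, 1907815, 1460408),
  tot_tests = c(4, 6504, 1120188, 5321445, 10953496, 15576143, 23646772, 23043028)
)

# Base R equivalent of mutate(): assign a new column with $
covid.rates$positivity_rate <- covid.rates$positive_tests / covid.rates$tot_tests

# Express the rate as a percentage, rounded to one decimal place
round(100 * covid.rates$positivity_rate, 1)

# Month with the highest positivity rate
covid.rates$month[which.max(covid.rates$positivity_rate)]
```

Both approaches give the same column; the base R form simply avoids the extra package dependency.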