.Rmd for Mini-Assignment 2 of Statistics I

Print your system information to screen (for grading / participation tracking)

Sys.info()

##                                                                                            sysname 
##                                                                                           "Darwin" 
##                                                                                            release 
##                                                                                           "19.6.0" 
##                                                                                            version 
## "Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64" 
##                                                                                           nodename 
##                                                                         "Xinyus-MacBook-Pro.local" 
##                                                                                            machine 
##                                                                                           "x86_64" 
##                                                                                              login 
##                                                                                             "root" 
##                                                                                               user 
##                                                                                         "xinyuwei" 
##                                                                                     effective_user 
##                                                                                         "xinyuwei"

Sys.time()

## [1] "2020-10-03 22:08:44 CST"

PART 1: COIN FLIPPING

First, take a coin and flip it 10 times. Record each outcome here as commented code. For example, if your first flip results in tails, write “1: T” in the line below, and so on. If you have transcended the cash society and no longer carry coins, you can use https://justflipacoin.com/.

RECORD RESULTS HERE

* 1: T
* 2: T
* 3: H
* 4: T
* 5: T
* 6: H
* 7: T
* 8: H
* 9: H
* 10: T

What proportion of your flips were tails (i.e., number of tails divided by total flips)?
Proportion of tails =

proportion_of_tails <- 6/10
proportion_of_tails

## [1] 0.6
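The same proportion can also be computed from the recorded flips instead of being counted by hand. This is a minimal sketch, assuming the ten outcomes above are stored in a character vector; the name `flips` is illustrative and not part of the assignment template.

```r
# Recorded outcomes from the ten physical flips above ("flips" is an illustrative name)
flips <- c("T", "T", "H", "T", "T", "H", "T", "H", "H", "T")

# The comparison is TRUE for tails; the mean of TRUE/FALSE values is the proportion of TRUEs
proportion_of_tails <- mean(flips == "T")
proportion_of_tails
```

With six tails out of ten flips, this agrees with the value of 0.6 entered manually.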

Next, we will simulate tossing a coin many, many times using R. The code below draws a specified number of independent trials, N, each with a probability of success, P; the number of successes in such a sample follows a binomial distribution.
Set up values for our simulation:

N <- 100  # number of trials
P <- 0.5  # probability of Heads (or Tails)
N

## [1] 100

P

## [1] 0.5

Define a sample space of two outcomes, heads or tails
We will define Heads =1 and Tails =0

sample.space <- c(0,1)

Toss those coins. Look at the results. Remember: 1=heads and 0=tails.

tosses <- sample(sample.space,
                 size = N,
                 replace = TRUE,
                 prob = c(1 - P, P))  # order matches sample.space: tails (0), then heads (1)
tosses

##   [1] 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1 1 1
##  [38] 1 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 1 1 1
##  [75] 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0

What proportion of our flips landed tails?
We will use sum() to count the number of heads, and use length() to get the total number of tosses. Notice that because we want tails and not heads, we must subtract the number of heads from the total number of tosses.

proportion.tails <- (length(tosses) - sum(tosses)) / length(tosses)
proportion.tails

## [1] 0.46
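Because tails are coded as 0, the proportion of tails can also be written as the mean of a logical comparison. The sketch below checks that both approaches agree. The seed value is an arbitrary assumption, added only so that repeated runs produce the same tosses; the names `by_subtraction` and `by_mean` are illustrative.

```r
set.seed(2020)  # arbitrary seed (an assumption) so reruns give the same tosses

N <- 100
P <- 0.5                 # probability of heads
sample.space <- c(0, 1)  # 0 = tails, 1 = heads

tosses <- sample(sample.space, size = N, replace = TRUE, prob = c(1 - P, P))

# Two equivalent ways to get the share of tails
by_subtraction <- (length(tosses) - sum(tosses)) / length(tosses)
by_mean <- mean(tosses == 0)

# Confirm that the two calculations match
all.equal(by_subtraction, by_mean)
```

The `mean(tosses == 0)` form avoids the subtraction step and reads closer to the question being asked.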

Now adapt the code from above in order to do the following:

* Toss the coin 1,000 times instead of 100 times.
* Again, calculate the proportion of tails.
* Hint: You only need to copy and paste a few lines of code and change a single number.
* Make sure to print the new proportion you calculated to the screen!

new_tosses <- sample(sample.space,
                     size = 1000,
                     replace = TRUE,
                     prob = c(1 - P, P))  # order matches sample.space: tails (0), then heads (1)
new_tosses

##    [1] 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1
##   [38] 1 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0
##   [75] 1 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1
##  [112] 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 1
##  [149] 0 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 0 0 0 1
##  [186] 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1 1 0 1 1
##  [223] 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0
##  [260] 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 0 1 0 1 0 0 0 0 1 1
##  [297] 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1 0 1 1 0
##  [334] 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1
##  [371] 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1
##  [408] 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 1 0 0
##  [445] 0 1 0 1 0 1 0 0 1 0 0 1 0 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0
##  [482] 1 0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1
##  [519] 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1
##  [556] 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0 1
##  [593] 1 0 1 0 0 1 1 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1 1
##  [630] 1 1 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1
##  [667] 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1
##  [704] 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 1 1
##  [741] 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0
##  [778] 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1
##  [815] 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0
##  [852] 1 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1 0
##  [889] 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 0
##  [926] 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0
##  [963] 1 0 1 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1
## [1000] 1

new.proportion.tails <- (length(new_tosses) - sum(new_tosses)) / length(new_tosses)
new.proportion.tails

## [1] 0.506
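The results so far (0.6 from 10 flips, 0.46 from 100, 0.506 from 1,000) suggest that the proportion tends to settle near 0.5 as the number of tosses grows, although any single run can deviate. The sketch below repeats the simulation for several sample sizes. The helper name `simulate_tails`, the chosen sizes, and the seed are all assumptions made for illustration.

```r
set.seed(2020)  # arbitrary seed (an assumption) for reproducibility

sample.space <- c(0, 1)  # 0 = tails, 1 = heads
P <- 0.5                 # probability of heads

# Hypothetical helper: simulate n tosses and return the share of tails
simulate_tails <- function(n) {
  tosses <- sample(sample.space, size = n, replace = TRUE, prob = c(1 - P, P))
  mean(tosses == 0)
}

sizes <- c(10, 100, 1000, 10000)
proportions <- sapply(sizes, simulate_tails)

# One row per sample size, with the simulated proportion of tails
data.frame(tosses = sizes, proportion_tails = proportions)
```

Wrapping the simulation in a function keeps the sample size as the only thing that changes between runs, which is the same idea as the hint about changing a single number.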

PART 2

Merging data sets together is one of the most important tools that you will use as a quantitative analyst. Merging is usually how we correctly link together data from multiple sources for an analysis. It is often the case that you will have data from multiple sources for the same units of observation. The examples below are for a relatively small number of units of observation (e.g., 50 states + DC), which we could probably link together by hand in a spreadsheet, but we might make an error. We also wouldn’t be able to reproduce our entire analysis from raw data to results if we don’t have code that does every step. Moreover, suppose our unit of analysis is the county rather than the state. Now, we have 3,000+ observations in our data set. I don’t think we would want to link counties from multiple data sets by hand.

Let’s start out with two simple data sets: height (measured in inches) and weight (measured in pounds).

height <- data.frame(
  name = c("Bob", "Carol", "Don"),
  height = c(71,67,66)
)
height

##    name height
## 1   Bob     71
## 2 Carol     67
## 3   Don     66

weight <- data.frame(
  name = c("Don", "Carol", "Bob"),
  weight = c(145, 140, 150)  # weights in pounds, stored in a column named weight
)
weight

##    name weight
## 1   Don    145
## 2 Carol    140
## 3   Bob    150

Notice that the two data sets aren’t in the same order. That’s okay! The magic of merging is that, as long as we have a common identifier in both data sets, they don’t need to be in the same order to link them together. We will use the merge() command to link together the height and weight of the three individuals in our study.
The first argument in the merge command is one of our data sets, the second argument is the data set that we want to “merge” together with that first data set, and the third argument is the column name common in both data sets that we will use to link them together.

height.weight <- merge(x = height, y = weight, by = "name")

Inspect our newly merged data set to confirm it worked properly.

height.weight

##    name height weight
## 1   Bob     71    150
## 2 Carol     67    140
## 3   Don     66    145
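The example above has the same three names in both data sets. The sketch below shows what merge() does when the identifiers do not match perfectly. The person "Eve" and her weight are invented for illustration, and the object names `inner` and `full` are assumptions; `all = TRUE` is a standard argument of base R's merge().

```r
height <- data.frame(
  name = c("Bob", "Carol", "Don"),
  height = c(71, 67, 66)
)

# "Eve" is a hypothetical person who appears only in the weight data
weight <- data.frame(
  name = c("Don", "Carol", "Eve"),
  weight = c(145, 140, 128)
)

# Default behavior: keep only names found in both data sets
inner <- merge(x = height, y = weight, by = "name")

# all = TRUE: keep every name from either data set, filling gaps with NA
full <- merge(x = height, y = weight, by = "name", all = TRUE)

inner
full
```

Comparing the row counts before and after a merge is a quick way to notice when units were silently dropped because their identifiers did not match.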

We are using data from the Covid Tracking Project (https://covidtracking.com/) for this part of the mini-assignment. The covid_positives.csv contains the number of positive covid tests by month in the U.S., and the covid_tests.csv contains the total number of tests run by month. Read both of these data sets into R and adapt the code from above to merge them together.

pos <- read.csv("covid_positives.csv")
tests <- read.csv("covid_tests.csv")
covid <- merge(x = pos, y = tests, by = "month")
covid

##   month positive_tests tot_tests
## 1     1              0         4
## 2     2             18      6504
## 3     3         198277   1120188
## 4     4         876469   5321445
## 5     5         720198  10953496
## 6     6         836969  15576143
## 7     7        1907815  23646772
## 8     8        1460408  23043028

Let’s plot the monthly number of total covid tests across time along with the monthly number of positive covid tests. We will use plot(). First, let’s plot the total number of tests using plot.
Let’s add the trend in positive tests to the plot.
Let’s label both series on our plot using text().

plot(x = covid$month, y = covid$tot_tests, type = "b",
     col = "navyblue", lwd = 2,
     xlab = "Month", ylab = "(Positive) Tests", 
     main = "Covid Testing by Month in U.S.")
points(x = covid$month, y = covid$positive_tests, type = "b",
       col = "darkred", lwd = 2)
text(x = 3, y = 5300000, labels = "Total Tests", col = "navyblue")
text(x = 7, y = 3000000, labels = "Positive Tests", col = "darkred")
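Labeling with text() requires picking coordinates by hand. As one possible alternative, the sketch below labels the two series with legend() instead. The data frame is re-entered from the printed table above so the example runs without the CSV files, and the helper name `plot_tests` is hypothetical.

```r
# Merged data re-entered from the printed table above (avoids needing the CSV files)
covid <- data.frame(
  month = 1:8,
  positive_tests = c(0, 18, 198277, 876469, 720198, 836969, 1907815, 1460408),
  tot_tests = c(4, 6504, 1120188, 5321445, 10953496, 15576143, 23646772, 23043028)
)

# Hypothetical helper: draws both series and labels them with a legend
plot_tests <- function(d) {
  plot(x = d$month, y = d$tot_tests, type = "b",
       col = "navyblue", lwd = 2,
       xlab = "Month", ylab = "Tests",
       main = "Covid Testing by Month in U.S.")
  points(x = d$month, y = d$positive_tests, type = "b",
         col = "darkred", lwd = 2)
  # The legend is placed by keyword rather than by x/y coordinates
  legend("topleft", legend = c("Total Tests", "Positive Tests"),
         col = c("navyblue", "darkred"), lwd = 2, pch = 1, bty = "n")
  invisible(NULL)
}
```

Calling `plot_tests(covid)` draws the figure on the active graphics device. A legend stays in place if the data change, whereas hand-picked text() coordinates may end up overlapping the lines.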

Now you need to create a plot of the test positivity rate by month. You need to create a new variable that is the rate of positive tests. Then you need to plot that single series across time using plot(). Be sure to label your axes and include a title!

library(tidyverse)
new_covid <- covid %>% mutate(positivity_rate = positive_tests/tot_tests)
new_covid

##   month positive_tests tot_tests positivity_rate
## 1     1              0         4     0.000000000
## 2     2             18      6504     0.002767528
## 3     3         198277   1120188     0.177003324
## 4     4         876469   5321445     0.164705075
## 5     5         720198  10953496     0.065750515
## 6     6         836969  15576143     0.053734034
## 7     7        1907815  23646772     0.080679722
## 8     8        1460408  23043028     0.063377435

plot(x = new_covid$month, y = new_covid$positivity_rate, type = "b",
     col = "navyblue", lwd = 2,
     xlab = "Month", ylab = "Positivity Rate", 
     main = "Covid Testing Positivity Rate by Month in U.S.")
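The new variable can also be created without loading the tidyverse. The sketch below uses base R column assignment and expresses the rate as a percentage. The data frame is re-entered from the printed table above so the example runs without the CSV files; the name `positivity_pct` is illustrative.

```r
# Merged data re-entered from the printed table above (avoids needing the CSV files)
covid <- data.frame(
  month = 1:8,
  positive_tests = c(0, 18, 198277, 876469, 720198, 836969, 1907815, 1460408),
  tot_tests = c(4, 6504, 1120188, 5321445, 10953496, 15576143, 23646772, 23043028)
)

# Base R alternative to mutate(): assign a new column directly
covid$positivity_rate <- covid$positive_tests / covid$tot_tests

# Express the rate as a percentage, rounded to one decimal place
positivity_pct <- round(100 * covid$positivity_rate, 1)
positivity_pct
```

In percentage terms, the rate peaks in month 3 at roughly 17.7%, matching the 0.177 shown in the table above.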

Xinyu Wei

10/3/2020

PPHA 31002 Stats for Data Analysis I
