This homework is due by 11:59pm on Sunday, February 9th. To complete this assignment, follow these steps:
  1. Download the homework2.Rmd file from Canvas.

  2. Open homework2.Rmd in RStudio.

  3. Replace the “Your Name Here” text in the author: field with your own name.

  4. Supply your solutions to the homework by editing homework2.Rmd.

  5. When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to homework2_YourNameHere.Rmd, and submit on Canvas. (YourNameHere should be changed to your own name.)

Homework tips:
  1. Recall the following useful RStudio hotkeys.
Keystroke      Description
<tab>          Autocompletes commands and filenames, and lists arguments for functions.
<up>           Cycles through previous commands in the console prompt.
<ctrl-up>      Lists the history of previous commands matching an unfinished one.
<ctrl-enter>   Runs the current line from the source window in the Console. Good for trying out ideas from a source file.
<ESC>          Aborts an unfinished command and gets out of the + prompt.

Note: Shown above are the Windows/Linux keys. For Mac OS X, the <ctrl> key should be substituted with the <command> (⌘) key.

  2. Instead of sending code line-by-line with <ctrl-enter>, you can send entire code chunks, and even run all of the code chunks in your .Rmd file. Look under the menu of the Source panel.

  3. Run your code in the Console and Knit HTML frequently to check for errors.

  4. You may find it easier to solve a problem by interacting only with the Console at first.

Homework 2 outline

In this homework, we will review some statistical operations which help us understand data relationships and differences between groups. We’ll use multiple datasets so we can think clearly about each analytical piece. Please pay attention to which dataset a question refers to.

library(tidyverse)
## -- Attaching packages --------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Problem 1: For this question, we’ll use the USArrests dataset which comes natively in R. Type ?USArrests in the console to learn more about the column definitions.

(a) Checking distributions:
Let’s check the distribution of the Rape column. Create a histogram and use the breaks argument to ensure you get enough bars to see the shape of the data. Does the data “look” normal?
hist(USArrests$Rape, breaks = 15)

(b) Checking distributions (part 2).
We are curious about data dispersion and outliers. Create a boxplot comparing the distributions for Murder and Assault. Which appears to have the larger IQR? Median? Max/min? What does the visualization indicate about the overall occurrence of Assault vs Murder? (Note: the boxplot labeled “1” corresponds to the first argument, and “2” to the second.)
a.murder <- USArrests$Murder
a.assault <- USArrests$Assault
boxplot(a.murder, a.assault)

# Assault appears to have the larger IQR, the larger median, and the higher max; Murder has the lower min.
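The boxplot can be cross-checked numerically. The following is a minimal sketch, assuming only the built-in USArrests data; the helper name stats_for is made up for illustration and is not part of the assignment.

```r
# Numeric companion to the boxplot, using the USArrests data that ships with R.
# stats_for is a hypothetical helper name chosen for this sketch.
stats_for <- function(x) {
  c(min = min(x), median = median(x), max = max(x), IQR = IQR(x))
}

# One column of summary statistics per variable.
box_summary <- sapply(USArrests[, c("Murder", "Assault")], stats_for)
print(box_summary)
```

Because Assault is recorded on a much larger scale than Murder, drawing both on one axis compresses the Murder box; the table makes its spread easier to read.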
(c) Correlations
We are curious if there may be a linear relationship between any of the variables in USArrests. Use the plot function to create scatter plots between all variables (it will automatically plot scatters between numeric columns only). This can be done easily in a single call (like we did in lecture 3 with our cars data). Which column pairs appear to show potential linear relationships? What about the UrbanPop column, does it show a pseudo linear trend with any of the other columns?
plot(USArrests[,c('Murder', 'Assault', 'UrbanPop', 'Rape')])

# Murder & Assault, Murder & Rape, Assault & Rape, and UrbanPop & Rape appear to have roughly linear relationships.
(d) Correlations
Use the cor() function to calculate the Pearson Correlation Coefficients between the variables. Interpret and explain the results.
cor(USArrests)
##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000
# The closer a coefficient is to 1, the stronger the positive linear relationship. Assault and Murder have the highest correlation (about 0.80), while UrbanPop and Murder have the weakest (about 0.07).
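A correlation coefficient on its own does not say whether the relationship could be due to chance. As a hedged sketch (this goes slightly beyond what the question asks), cor.test() from base R reports a p-value and confidence interval for a single pair:

```r
# Pearson correlation test for one pair of columns from the built-in USArrests data.
ct <- cor.test(USArrests$Murder, USArrests$Assault, method = "pearson")

ct$estimate   # the same coefficient that cor() reports for this pair
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for the coefficient
```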

Problem 2: Hypothesis testing

(a) What are the means and medians of Murder and Rape?
a.rape <- USArrests$Rape
median(a.murder)
## [1] 7.25
mean(a.murder)
## [1] 7.788
median(a.rape)
## [1] 20.1
mean(a.rape)
## [1] 21.232
(b) We looked at the histogram for Rape earlier. Now take a look at the histogram for Murder. Is it normal?
hist(USArrests$Murder, breaks = 15)

#the histogram for 'rape' looks a lot more like a normal distribution than the histogram for murder
(c) It’s a little tough to tell if Murder is normally distributed. Let’s use shapiro.test() on the Murder column of our dataframe to see if it is statistically unlikely to be normal. The null hypothesis is that the data is normal, so the lower the p-value, the stronger the evidence against normality. What did the results show?
shapiro.test(a.murder)
## 
##  Shapiro-Wilk normality test
## 
## data:  a.murder
## W = 0.95703, p-value = 0.06674
# The p-value (0.067) is slightly above the usual 0.05 cutoff, so we cannot reject the hypothesis that Murder is normally distributed.
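The decision rule can also be written out in code. This is a small sketch in which the 0.05 cutoff is an assumed convention, not something the test itself prescribes:

```r
# Shapiro-Wilk test on the Murder column of the built-in USArrests data.
sw <- shapiro.test(USArrests$Murder)

alpha <- 0.05                           # assumed conventional cutoff
reject_normality <- sw$p.value < alpha  # TRUE would mean evidence against normality

sw$p.value
reject_normality
```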
(d) Using hypothesis testing to determine robustness of results:
Is the observed difference in means between Murder and Rape statistically significant? For this question, use a parametric test (assume the data is normal). Explain your interpretation.
t.test(a.murder, a.rape)
## 
##  Welch Two Sample t-test
## 
## data:  a.murder and a.rape
## t = -9.2031, df = 69.245, p-value = 1.237e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.35807 -10.52993
## sample estimates:
## mean of x mean of y 
##     7.788    21.232
# The p-value (1.237e-13) is far below 0.05, and the 95% confidence interval for the difference in means (-16.36 to -10.53) does not contain 0, so the difference is statistically significant.
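To see where the t statistic and the fractional degrees of freedom come from, the Welch calculation can be reproduced by hand. A minimal sketch, assuming the built-in USArrests data; the variable names are illustrative:

```r
x <- USArrests$Murder
y <- USArrests$Rape

# Squared standard error of each sample mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Compare with the built-in test (Welch is the default, var.equal = FALSE).
res <- t.test(x, y)
c(manual = t_manual, builtin = unname(res$statistic))
c(manual = df_manual, builtin = unname(res$parameter))
```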
Pretend the data is not normal. Use a non-parametric test to determine if the difference is statistically significant. Does the non-parametric test show statistical significance?
wilcox.test(a.murder, a.rape)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  a.murder and a.rape
## W = 187, p-value = 2.39e-13
## alternative hypothesis: true location shift is not equal to 0
#yes this is still statistically significant due to the low p-value
(e) Reflection:
Which results show a lower p-value between the parametric and non-parametric tests we just performed? Any thoughts on why? What is the overall interpretation about whether the means between Rape and Murder are actually different?

The parametric test (Welch t-test, p = 1.237e-13) had a lower p-value than the non-parametric test (Wilcoxon rank sum, p = 2.39e-13). Parametric tests make stronger assumptions about the data and use the observed values themselves rather than only their ranks, so they typically have more power when those assumptions hold. Both tests agree that the difference between Murder and Rape is very unlikely to have arisen by chance.


Problem 3: More Hypothesis testing

For this question, we will use the Lottery.csv dataset. Import the dataset in the code block below. The columns are as follows:
- MSR: Master Sales Region (regional identifier for the retail location)
- WeeksActive: number of weeks with active lottery retail operation during the study period
- InstantSalesAmt: total dollars of in-person sales for this location
- OnlineSalesAmt: total dollars of online sales for this location
- TotalSalesAmt: total sales dollars for this location
- Name: establishment name
- City: establishment city

lottery <- read.csv("lottery.csv")
(a) Multiple comparisons (Anova):
We want to compare TotalSalesAmt across 3 different MSRs (215, 216, and 217). In order to perform a multiple comparison test (ANOVA), we need to convert the MSR column to a factor. Convert it below.
lottery$MSR <- as.factor(lottery$MSR)
Assume the data is normal and perform a parametric multiple comparison test (ANOVA). Hint: There are other MSR values in our dataframe. Before we can run the ANOVA using the column, we’ll need to use some functions we have learned previously to filter to only those MSRs we want in our comparison (215, 216, 217). Using the summary() of the test results, interpret whether total sales differ significantly between the three regions tested.
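# Keep only the two columns the test needs, selected by name rather than position
sub <- lottery[, c("MSR", "TotalSalesAmt")]
# Restrict to the three regions being compared
new <- sub[sub$MSR %in% c(215, 216, 217), ]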

testaov <- aov(TotalSalesAmt ~ MSR, data = new)
summary(testaov)
##              Df    Sum Sq   Mean Sq F value Pr(>F)
## MSR           2 1.168e+10 5.839e+09   0.633  0.531
## Residuals   380 3.504e+12 9.222e+09
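The filter-then-test pattern, and how to pull the p-value out of the summary, can be sketched on made-up data. Everything in the toy data frame below is invented for illustration; only the column names MSR and TotalSalesAmt mirror the lottery dataset.

```r
set.seed(1)  # make the invented data reproducible

# Hypothetical stand-in for the lottery data: four regions, ten locations each.
toy <- data.frame(
  MSR = factor(rep(c(214, 215, 216, 217), each = 10)),
  TotalSalesAmt = rnorm(40, mean = 1e5, sd = 2e4)
)

# Keep the three regions of interest and drop the now-empty factor level.
keep <- droplevels(toy[toy$MSR %in% c(215, 216, 217), ])

# One-way ANOVA of sales by region.
fit <- aov(TotalSalesAmt ~ MSR, data = keep)

# The first row of the summary table holds the MSR term.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

If p_value is above the chosen cutoff (commonly 0.05), the test gives no evidence that mean sales differ between the regions, which is the reading of Pr(>F) = 0.531 in the output above.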
(b) Non-parametric multiple comparisons:
Now run the same comparison using the Kruskal-Wallis test, which has the same syntax as ANOVA but uses the kruskal.test() function. The Kruskal-Wallis test is the non-parametric equivalent of the ANOVA. Note: Kruskal-Wallis test result p-values are visible without calling summary(). How does the p-value result differ from the ANOVA above? Did the statistical significance determination change?
kruskal.test(TotalSalesAmt ~ MSR, data = new)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  TotalSalesAmt by MSR
## Kruskal-Wallis chi-squared = 0.77439, df = 2, p-value = 0.679
# The p-value is slightly higher (0.679 vs 0.531 for the ANOVA), and neither result is statistically significant, so the determination did not change.

Problem 4: Relationships between categorical variables.

For this question, we will use the NYPD shootings dataset. Download and import the NYPD_Shootings.csv dataset below.
nypdsh <- read.csv("NYPD_Shootings.csv")
(b) Contingency Tables
While we can’t calculate a Pearson correlation coefficient for two categorical variables, we are curious if there is a relationship between the perpetrator sex (PERP_SEX) and the victim sex (VIC_SEX). Create a contingency table for the counts of the unique combinations of these two variables. Is there an obvious difference between the proportions?
table(nypdsh$PERP_SEX, nypdsh$VIC_SEX)
##    
##         F     M
##   F    47   252
##   M  1305 11111
# Rows are PERP_SEX and columns are VIC_SEX. An overwhelming majority of cases involve male perpetrators, and most victims are male regardless of the perpetrator's sex.
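Raw counts make the proportions hard to compare because there are far more male perpetrators. As a sketch, the counts printed above can be re-entered by hand (so the CSV is not needed) and converted to row proportions with prop.table():

```r
# Counts copied from the contingency table above: rows = PERP_SEX, columns = VIC_SEX.
counts <- matrix(
  c(47, 252,
    1305, 11111),
  nrow = 2, byrow = TRUE,
  dimnames = list(PERP_SEX = c("F", "M"), VIC_SEX = c("F", "M"))
)

# margin = 1 divides each row by its total, giving the victim-sex split per perpetrator sex.
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

With the real data, prop.table(table(nypdsh$PERP_SEX, nypdsh$VIC_SEX), margin = 1) should give the same proportions.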
(c) Is the observed difference “stat sig”?
Use the Fisher’s Exact Test to determine if the observed difference between gender on gender shooting proportions is statistically significant. What is the odds ratio? Print out the odds ratio from the results as shown in lecture. Interpret and explain the results (including the odds ratio) and your conclusions.
perps <- nypdsh$PERP_SEX
vics <- nypdsh$VIC_SEX
fisher.test(perps, vics)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  perps and vics
## p-value = 0.005741
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.131249 2.188011
## sample estimates:
## odds ratio 
##   1.587883
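# The p-value is < 0.05, so the association between perpetrator sex and victim sex is statistically significant. The odds ratio (about 1.59) does not compare how often men and women commit shootings. It says the odds that the victim is female are about 1.59 times higher when the perpetrator is female than when the perpetrator is male.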
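The question also asks for the odds ratio to be printed on its own. The lecture's exact approach is not shown here, so the following is one possible sketch: it rebuilds the 2x2 table from the counts printed earlier and reads the estimate element of the fisher.test() result.

```r
# Counts copied from the contingency table in part (b): rows = PERP_SEX, columns = VIC_SEX.
counts <- matrix(
  c(47, 252,
    1305, 11111),
  nrow = 2, byrow = TRUE,
  dimnames = list(PERP_SEX = c("F", "M"), VIC_SEX = c("F", "M"))
)

ft <- fisher.test(counts)
ft$estimate  # conditional maximum likelihood estimate of the odds ratio
ft$p.value

# Plain cross-product odds ratio, for comparison:
# (odds of a female victim given a female perpetrator) /
# (odds of a female victim given a male perpetrator)
sample_or <- (47 / 252) / (1305 / 11111)
sample_or
```

fisher.test() reports a conditional estimate, so it differs slightly from the cross-product value, but both lead to the same interpretation.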