Download the homework2.Rmd file from Canvas.
Open homework2.Rmd in RStudio.
Replace the “Your Name Here” text in the author: field with your own name.
Supply your solutions to the homework by editing homework2.Rmd.
When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to homework2_YourNameHere.Rmd, and submit on Canvas. (YourNameHere should be changed to your own name.)
| Keystroke | Description |
|---|---|
| <tab> | Autocompletes commands and filenames, and lists arguments for functions. |
| <up> | Cycles through previous commands in the console prompt. |
| <ctrl-up> | Lists history of previous commands matching an unfinished one. |
| <ctrl-enter> | Runs the current line from the source window in the Console. Good for trying out ideas from a source file. |
| <ESC> | Aborts an unfinished command and gets out of the + prompt. |
Note: Shown above are the Windows/Linux keys. For Mac OS X, the <ctrl> key should be substituted with the <command> (⌘) key.
Instead of sending code line-by-line with <ctrl-enter>, you can send entire code chunks, and even run all of the code chunks in your .Rmd file. Look under the
Run your code in the Console and Knit HTML frequently to check for errors.
You may find it easier to solve a problem by interacting only with the Console at first.
In this homework, we will review some statistical operations which help us understand data relationships and differences between groups. We’ll use multiple datasets so we can think clearly about each analytical piece. Please pay attention to which dataset a question refers to.
library(tidyverse)
## -- Attaching packages --------------------------------- tidyverse 1.3.0 --
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
?USArrests in the console to learn more about the column definitions.
Rape column. Create a histogram and use the breaks argument to ensure you get enough bars to see the shape.
hist(USArrests$Rape, breaks = 15)
Murder and Assault. Which appears to have the larger IQR? Median? Max/min? What does the visualization indicate about the overall occurrence of Rape vs Murder? (Note: the boxplot labeled “1” corresponds to the first argument, and “2” to the second.)
a.murder <- USArrests$Murder
a.assault <- USArrests$Assault
boxplot(a.murder, a.assault)
#Assault appears to have the larger IQR, the larger median, and the higher max; Murder has the lower min.
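The reading of the boxplots can be cross-checked numerically. The following is a minimal sketch using the built-in USArrests data; the helper name `spread_summary` is made up for illustration and is not part of the homework.

```r
# Hypothetical helper (an assumption, not from the homework):
# returns the centre and spread statistics the question asks about.
spread_summary <- function(x) {
  c(min = min(x), median = median(x), max = max(x), IQR = IQR(x))
}

murder_stats  <- spread_summary(USArrests$Murder)
assault_stats <- spread_summary(USArrests$Assault)

# One row per variable, one column per statistic.
print(rbind(Murder = murder_stats, Assault = assault_stats))
```

Comparing the two rows side by side should agree with what the boxplots suggest.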
plot function to create scatter plots between all variables (it will automatically plot scatters between numeric columns only). This can be done easily in a single call (like we did in lecture 3 with our cars data). Which column pairs appear to show potential linear relationships? What about the UrbanPop column, does it show a pseudo linear trend with any of the other columns?
plot(USArrests[,c('Murder', 'Assault', 'UrbanPop', 'Rape')])
#Murder & Assault, Murder & Rape, Assault & Rape, and UrbanPop & Rape seem to have a linear relationship.
cor() function to calculate the Pearson Correlation Coefficients between the variables. Interpret and explain the results.
cor(USArrests)
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
#The closer the coefficient is to 1, the stronger the positive linear relationship. Assault and Murder have the highest correlation (about 0.80); Murder and UrbanPop have the lowest (about 0.07).
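The matrix from cor() gives the coefficients but not whether they differ from zero. One way to check a single pair is cor.test(); this is a sketch for the Murder/Assault pair only, and the variable names `ct`, `r_estimate`, and `p_value` are chosen here for illustration.

```r
# Pearson correlation test for one pair of columns from the built-in data.
ct <- cor.test(USArrests$Murder, USArrests$Assault, method = "pearson")

r_estimate <- unname(ct$estimate)  # same value as the Murder/Assault entry of cor(USArrests)
p_value    <- ct$p.value           # probability of a correlation this strong if the true value were 0

print(c(r = r_estimate, p = p_value))
```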
Murder and Rape?
a.rape <- USArrests$Rape
median(a.murder)
## [1] 7.25
mean(a.murder)
## [1] 7.788
median(a.rape)
## [1] 20.1
mean(a.rape)
## [1] 21.232
Rape earlier, take a look at the histogram for Murder. Is it normal?
hist(USArrests$Murder, breaks = 15)
#the histogram for 'rape' looks a lot more like a normal distribution than the histogram for murder
Murder is normally distributed. Let’s use the shapiro.test() on the Murder column of our dataframe to see if it is statistically unlikely to be normal. The lower the p-value, the less likely the data is normal. What did the results show?
shapiro.test(a.murder)
##
## Shapiro-Wilk normality test
##
## data: a.murder
## W = 0.95703, p-value = 0.06674
#the p-value (0.06674) is slightly higher than the usual 0.05 cutoff, so we cannot reject the hypothesis that Murder is normally distributed
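The decision can also be made explicit in code and paired with a visual check. This is a sketch that assumes the conventional 0.05 cutoff; the names `sw`, `alpha`, and `reject_normality` are illustrative.

```r
murder <- USArrests$Murder

# Shapiro-Wilk test: the null hypothesis is that the data are normal.
sw <- shapiro.test(murder)

alpha <- 0.05                          # assumed significance level
reject_normality <- sw$p.value < alpha
print(reject_normality)

# Q-Q plot: points close to the reference line suggest approximate normality.
qqnorm(murder)
qqline(murder)
```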
Murder and Rape statistically significant? For this question, use a parametric test (assume the data is normal). Explain your interpretation.
t.test(a.murder, a.rape)
##
## Welch Two Sample t-test
##
## data: a.murder and a.rape
## t = -9.2031, df = 69.245, p-value = 1.237e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.35807 -10.52993
## sample estimates:
## mean of x mean of y
## 7.788 21.232
#the p-value is extremely low (1.237e-13) and the 95% confidence interval for the difference in means (-16.36 to -10.53) does not include 0, so the difference is statistically significant
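To see where the reported t and df come from, the Welch statistic can be reproduced by hand. The sketch below uses the standard Welch formulas; the variable names are chosen here for illustration.

```r
x <- USArrests$Murder
y <- USArrests$Rape

# Per-group variance of the mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Compare with the built-in test.
t_builtin <- unname(t.test(x, y)$statistic)
print(c(t_manual = t_manual, t_builtin = t_builtin, df = df_manual))
```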
wilcox.test(a.murder, a.rape)
##
## Wilcoxon rank sum test with continuity correction
##
## data: a.murder and a.rape
## W = 187, p-value = 2.39e-13
## alternative hypothesis: true location shift is not equal to 0
#yes this is still statistically significant due to the low p-value
Rape and Murder are actually different?
<<<The parametric test had a slightly lower p-value (1.237e-13) than the non-parametric test (2.39e-13). The parametric test makes stricter assumptions about the data (normality) and uses the actual values rather than only their ranks, so when those assumptions hold it has more power to detect a difference. Both p-values are far below 0.05, so either test supports the conclusion that Rape and Murder are different.>>>
For this question, we will use the Lottery.csv dataset. Import the dataset in the code block below. The columns are as follows:
- MSR: Master Sales Region (regional identifier for the retail location)
- WeeksActive: number of weeks with active lottery retail operation during the study period
- InstantSalesAmt: total dollars of in-person sales for this location
- OnlineSalesAmt: total dollars of online sales for this location
- TotalSalesAmt: total sales dollars for this location
- Name: establishment name
- City: establishment city
lottery <- read.csv("lottery.csv")
TotalSalesAmt across 3 different MSRs (215, 216, and 217). In order to perform a multiple comparison test (ANOVA), we need to convert the MSR column to factor. Convert it below.
lottery$MSR <- as.factor(lottery$MSR)
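summary() of the test results interpret whether there are statistically different totals of sales between the three regions tested.
# keep only the two columns needed, selected by name rather than position
sub <- lottery[, c("MSR", "TotalSalesAmt")]
# restrict to the three regions being compared
new <- sub[sub$MSR %in% c(215, 216, 217), ]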
testaov <- aov(TotalSalesAmt ~ MSR, data = new)
summary(testaov)
## Df Sum Sq Mean Sq F value Pr(>F)
## MSR 2 1.168e+10 5.839e+09 0.633 0.531
## Residuals 380 3.504e+12 9.222e+09
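The p-value in the Pr(>F) column can also be pulled out of the summary programmatically. Lottery.csv is not available here, so this sketch uses R's built-in PlantGrowth data as a stand-in; the names `fit`, `anova_table`, and `significant`, and the 0.05 cutoff, are assumptions for illustration.

```r
# Stand-in data: PlantGrowth has a numeric 'weight' and a 3-level factor 'group'.
fit <- aov(weight ~ group, data = PlantGrowth)

# summary() on an aov object returns a list; its first element is the ANOVA table.
anova_table <- summary(fit)[[1]]

# The first row is the factor term; the last row holds the residuals.
p_value <- anova_table[["Pr(>F)"]][1]

significant <- p_value < 0.05  # assumed significance level
print(c(p_value = p_value, significant = significant))
```

Applied to testaov above, the same extraction would return the 0.531 shown in the summary, which is well above 0.05.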
kruskal.test() function. The Kruskal-Wallis test is the non-parametric equivalent of the ANOVA. Note: Kruskal-Wallis test result p-values are visible without calling summary(). How does the p-value result differ from the ANOVA above? Did the statistical significance determination change?
kruskal.test(TotalSalesAmt ~ MSR, data = new)
##
## Kruskal-Wallis rank sum test
##
## data: TotalSalesAmt by MSR
## Kruskal-Wallis chi-squared = 0.77439, df = 2, p-value = 0.679
#the p-value is somewhat higher (0.679 vs 0.531) and neither test is statistically significant, so the determination did not change
NYPD_Shootings.csv dataset below.
nypdsh <- read.csv("NYPD_Shootings.csv")
PERP_SEX) and the victim sex (VIC_SEX). Create a contingency table for the counts of the unique combinations of these two variables. Is there an obvious difference between the proportions?
table(nypdsh$PERP_SEX, nypdsh$VIC_SEX)
##
## F M
## F 47 252
## M 1305 11111
#An overwhelming majority of cases involve male perpetrators (rows are PERP_SEX, columns are VIC_SEX): 1305 men against women and 11111 men against men
perps <- nypdsh$PERP_SEX
vics <- nypdsh$VIC_SEX
fisher.test(perps, vics)
##
## Fisher's Exact Test for Count Data
##
## data: perps and vics
## p-value = 0.005741
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.131249 2.188011
## sample estimates:
## odds ratio
## 1.587883
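#The p-value is < .05, so the association between perpetrator sex and victim sex is statistically significant. The odds ratio of about 1.59 means the odds of the victim being female are about 1.59 times higher when the perpetrator is female than when the perpetrator is male. It does not compare how often men and women commit shootings, which is why it is smaller than the raw counts might suggest.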
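The odds ratio can be reproduced directly from the contingency table counts shown earlier. This is a sketch that rebuilds the 2x2 table by hand rather than reading NYPD_Shootings.csv; the names `shootings`, `sample_or`, and `ft` are illustrative.

```r
# Counts copied from the contingency table above (rows: PERP_SEX, columns: VIC_SEX).
shootings <- matrix(
  c(47,   252,
    1305, 11111),
  nrow = 2, byrow = TRUE,
  dimnames = list(PERP_SEX = c("F", "M"), VIC_SEX = c("F", "M"))
)

# Sample (cross-product) odds ratio: odds of a female victim given a female
# perpetrator, divided by the odds of a female victim given a male perpetrator.
sample_or <- (shootings["F", "F"] * shootings["M", "M"]) /
  (shootings["F", "M"] * shootings["M", "F"])

# fisher.test() reports a conditional maximum likelihood estimate,
# which is close to, but not identical to, the cross-product ratio.
ft <- fisher.test(shootings)

print(c(sample_or = sample_or, fisher_or = unname(ft$estimate)))
```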