Here I am going to answer the following question: does drinking beer make you more attractive to mosquitos?
The ideas and data were taken from the examples in the link.
List of activities:
* Exploratory data analysis
* Test of hypotheses
* Permutation test
Reading the data
library(ggplot2)
library(dplyr)
set.seed(12345)
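# stringsAsFactors = TRUE keeps Type as a factor, as in the output below (R >= 4.0 reads strings as character by default)
dataset <- read.csv("dataset.csv", header = TRUE, stringsAsFactors = TRUE)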
Look at the data quickly with summary() and head().
summary(dataset)
## Count Type
## Min. :12.00 Beer :25
## 1st Qu.:19.00 Water:18
## Median :21.00
## Mean :21.77
## 3rd Qu.:24.00
## Max. :31.00
head(dataset)
## Count Type
## 1 27 Beer
## 2 19 Beer
## 3 20 Beer
## 4 20 Beer
## 5 23 Beer
## 6 17 Beer
So we have 25 records for subjects who drank beer and 18 for subjects who drank water.
Let’s calculate the mean for each group:
means <- dataset %>%
  group_by(Type) %>%
  summarise(mean(Count))
means
## Source: local data frame [2 x 2]
##
## Type mean(Count)
## (fctr) (dbl)
## 1 Beer 23.60000
## 2 Water 19.22222
The difference between the beer and water means:
## [1] 4.377778
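The code behind this number is not shown. A minimal sketch of one way to get it, assuming a data frame with `Count` and `Type` columns like `dataset`; the helper name `mean_difference` and the toy values are illustrative only and are not the study data:

```r
# Hypothetical helper: difference between the "Beer" and "Water" group means.
mean_difference <- function(df) {
  group_means <- tapply(df$Count, df$Type, mean)  # per-group means, named by Type
  unname(group_means["Beer"] - group_means["Water"])
}

# Toy data for illustration only (NOT the real measurements).
toy <- data.frame(
  Count = c(27, 19, 20, 22, 18, 17),
  Type  = c("Beer", "Beer", "Beer", "Water", "Water", "Water")
)
mean_difference(toy)
```

Called on the real data as `mean_difference(dataset)`, this should reproduce the difference between the two group means reported above (23.60000 and 19.22222).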
Next, let’s make an exploratory plot of the data:
g <- ggplot(dataset, aes(x = Type, y = Count, fill = Type)) +  # bare column names, not dataset$...
  geom_boxplot() +                                # one box per group
  geom_point(colour = "red", size = 4) +          # raw observations, drawn on top of the boxes
  expand_limits(y = c(10, 30)) +                  # y axis covers at least 10 to 30
  labs(title = "The number of bites by mosquitoes",
       x = "Subjects",
       y = "Count")
g
Inference: the difference between the means is 4.38. Is that sufficient evidence to claim that drinking beer makes you more attractive to mosquitos?
Let’s test the null hypothesis that the means of the two types are equal.
t.test(Count ~ Type, data = dataset)$p.value
## [1] 0.0007474019
t.test(Count ~ Type, data = dataset)$conf.int
## [1] 1.957472 6.798084
## attr(,"conf.level")
## [1] 0.95
Inference: the p-value is less than the significance level (0.05), so we reject the null hypothesis. The means of the two types are different.
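By default `t.test()` runs Welch's test, which does not assume equal variances in the two groups. As a rough sketch of what it computes, the statistic and degrees of freedom can be rebuilt by hand. `welch_t` is a hypothetical helper, and the toy values are made up for illustration:

```r
# Hypothetical helper: Welch t statistic and degrees of freedom for two samples.
welch_t <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_stat, df = df)
}

# Toy data for illustration only (NOT the real measurements).
beer  <- c(25, 22, 28, 24, 21, 26)
water <- c(20, 18, 23, 19, 21, 17)
manual  <- welch_t(beer, water)
builtin <- t.test(beer, water)   # Welch test is the default (var.equal = FALSE)
all.equal(unname(manual["t"]), unname(builtin$statistic))
```

The same comparison could be run on the real groups by passing `dataset$Count[dataset$Type == "Beer"]` and `dataset$Count[dataset$Type == "Water"]`.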
Let’s now test the null hypothesis that the labels are irrelevant (exchangeable). Permuting the labels over and over is a handy way to create a null distribution for our test statistic and to see how extreme our data are with respect to that permuted distribution. The procedure is as follows:
1. consider a data frame with count and type,
2. permute the type (group) labels,
3. recalculate the statistic (such as the difference in means),
4. calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed.
group <- as.character(dataset$Type)
testStat <- function(w, g) {mean(w[g == "Beer"]) - mean(w[g == "Water"])}
observedStat <- testStat(dataset$Count, group)
observedStat
## [1] 4.377778
permutations <- sapply(1 : 50000, function(i) testStat(dataset$Count, sample(group)))
round(mean(permutations > observedStat), 3)  # proportion of permutations more extreme than observed; the 3 belongs to round(), not mean()
## [1] 0
hist(permutations, col = "#62829C", breaks = 50000)
abline(v = observedStat, lwd=3, col = "red")
abline(v = 0, lwd=2, col = "blue")
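A proportion of exactly 0 only means that none of the sampled permutations exceeded the observed statistic; it does not mean the true p-value is zero. A common convention, sketched here as an assumption rather than as part of the original analysis, counts the observed labelling as one of the permutations. The helper `perm_p_value` and the toy data are hypothetical:

```r
# Hypothetical helper: one-sided permutation p-value with the "+1" correction,
# so the reported p-value can never be exactly 0.
perm_p_value <- function(w, g, n_perm = 2000) {
  stat <- function(w, g) mean(w[g == "Beer"]) - mean(w[g == "Water"])
  observed <- stat(w, g)
  permuted <- replicate(n_perm, stat(w, sample(g)))  # shuffle the labels each time
  (sum(permuted >= observed) + 1) / (n_perm + 1)
}

set.seed(12345)
# Toy data for illustration only (NOT the real measurements).
toy_count <- c(25, 22, 28, 24, 21, 26, 20, 18, 23, 19, 21, 17)
toy_group <- rep(c("Beer", "Water"), each = 6)
p_toy <- perm_p_value(toy_count, toy_group)
p_toy
```

With 50000 permutations, the smallest value this estimate can report is 1/50001, about 0.00002.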
Inference: we reject the null hypothesis. The data for the two types differ.
Drinking beer makes you more attractive to mosquitos.