Does drinking beer make you more attractive to mosquitos?

Data processing

Exploratory data analysis

Reading the data


library(ggplot2) 
library(dplyr)

set.seed(12345)

dataset <- read.csv("dataset.csv", header = TRUE)

Look at the data quickly with summary() and head().

summary(dataset)
##      Count          Type   
##  Min.   :12.00   Beer :25  
##  1st Qu.:19.00   Water:18  
##  Median :21.00             
##  Mean   :21.77             
##  3rd Qu.:24.00             
##  Max.   :31.00
head(dataset)
##   Count Type
## 1    27 Beer
## 2    19 Beer
## 3    20 Beer
## 4    20 Beer
## 5    23 Beer
## 6    17 Beer

So we have 25 records for subjects who drank beer and 18 for subjects who drank water.

Let’s calculate the mean for each group:

means <- dataset %>%
  group_by(Type) %>%
  summarise(mean(Count))
means
## Source: local data frame [2 x 2]
## 
##     Type mean(Count)
##   (fctr)       (dbl)
## 1   Beer    23.60000
## 2  Water    19.22222

Difference between beer and water means:

## [1] 4.377778
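
The code that produced this difference is not shown above. One way to compute it, sketched here on a small made-up data frame with the same two columns (the `toy` values are illustrative, not the study data), is to take the group means with `tapply()` and subtract them:

```r
# Toy stand-in with the same columns as `dataset`; values are made up.
toy <- data.frame(
  Count = c(27, 19, 20, 21, 18, 15),
  Type  = c("Beer", "Beer", "Beer", "Water", "Water", "Water")
)

# Mean of Count within each level of Type, indexed by group name.
group_means <- tapply(toy$Count, toy$Type, mean)

# Difference in means, Beer minus Water.
mean_diff <- group_means[["Beer"]] - group_means[["Water"]]
mean_diff
```

Replacing `toy` with `dataset` should give the Beer minus Water difference for the real data, which is the quantity reported above.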

For the next step, let’s make an exploratory plot of the data:

# Use bare column names inside aes(); expand_limits() keeps 10 and 30 on the y axis.
g <- ggplot(dataset, aes(x = Type, y = Count, fill = Type)) +
  geom_point(colour = "red", size = 4) +
  geom_boxplot() +
  expand_limits(y = c(10, 30)) +
  labs(title = "The number of bites by mosquitoes",
    x = "Subjects",
    y = "Count")
g

Infer: The difference between the group means is 4.38. Is this sufficient evidence to claim that drinking beer makes you more attractive to mosquitos?

Hypothesis test

Let’s test the null hypothesis that the means of the two groups are equal.

t.test(Count ~ Type, data = dataset)$p.value
## [1] 0.0007474019

t.test(Count ~ Type, data = dataset)$conf.int
## [1] 1.957472 6.798084
## attr(,"conf.level")
## [1] 0.95

Infer: The p-value is below the 0.05 significance level, so we reject the null hypothesis. The group means are different, and the 95% confidence interval for the difference (Beer minus Water) excludes zero.
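
By default `t.test()` runs Welch's unequal-variance test (`var.equal = FALSE`). As a rough illustration of what that computes, the sketch below reproduces the statistic, the degrees of freedom, and the two-sided p-value by hand and checks them against `t.test()`. The vectors `beer` and `water` are hypothetical toy values, not the study data.

```r
# Hypothetical toy samples; not the study data.
beer  <- c(27, 19, 20, 20, 23, 17, 25, 24)
water <- c(21, 18, 15, 19, 22, 16)

# Squared standard error of each group mean (variances not pooled).
se2_beer  <- var(beer)  / length(beer)
se2_water <- var(water) / length(water)
t_stat <- (mean(beer) - mean(water)) / sqrt(se2_beer + se2_water)

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (se2_beer + se2_water)^2 /
  (se2_beer^2 / (length(beer) - 1) + se2_water^2 / (length(water) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_stat), df)

# Same test via t.test(); the result is stored once and reused.
fit <- t.test(beer, water)
c(manual = p_manual, t.test = fit$p.value)
```

Storing the result in `fit` also makes `fit$p.value` and `fit$conf.int` available without running the test twice.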

Permutation test

Let’s test the null hypothesis that the labels are irrelevant (exchangeable). Permuting the labels over and over is a handy way to build a null distribution for our test statistic, so that we can see how extreme our data are relative to that permuted distribution. The procedure is as follows:
1. consider a data frame with count and type,
2. permute the type (group) labels,
3. recalculate the statistic (such as the difference in means),
4. calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed.

group <- as.character(dataset$Type)

testStat <- function(w, g) {mean(w[g == "Beer"]) - mean(w[g == "Water"])}

observedStat <- testStat(dataset$Count, group)
observedStat
## [1] 4.377778

permutations <- sapply(1 : 50000, function(i) testStat(dataset$Count, sample(group)))
round(mean(permutations > observedStat), 3)
## [1] 0

hist(permutations, col = "#62829C", breaks = 50000)
abline(v = observedStat, lwd=3, col = "red")
abline(v = 0, lwd = 2, col = "blue")

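Infer: Essentially none of the permuted differences exceed the observed one, so we reject the null hypothesis that the labels are exchangeable. Mosquito counts differ between the beer and water groups.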
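
A proportion of exactly 0 only means that none of the 50000 shuffles exceeded the observed difference, so the p-value is better read as "less than 1/50000" than as zero. A common convention is to count the observed arrangement as one of the permutations, giving (k + 1) / (B + 1) for k exceedances out of B shuffles. The sketch below wraps the steps above in a hypothetical helper, `perm_p_value()`, and runs it on made-up toy data. The function name, its arguments, and the toy values are assumptions for illustration.

```r
# Hypothetical helper: one-sided permutation p-value for the
# difference in means (Beer minus Water).
perm_p_value <- function(count, group, n_perm = 2000) {
  stat <- function(w, g) mean(w[g == "Beer"]) - mean(w[g == "Water"])
  observed <- stat(count, group)
  # Shuffle the labels n_perm times and recompute the statistic.
  permuted <- replicate(n_perm, stat(count, sample(group)))
  # Count the observed arrangement itself, so the result is never 0.
  (sum(permuted >= observed) + 1) / (n_perm + 1)
}

# Made-up toy data; not the study data.
set.seed(1)
toy_count <- c(27, 19, 20, 20, 23, 17, 21, 18, 15, 19, 22, 16)
toy_group <- rep(c("Beer", "Water"), each = 6)

p <- perm_p_value(toy_count, toy_group)
p
```

With this convention the smallest reportable value is 1 / (n_perm + 1), which makes the resolution of the test explicit.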


Zanin Pavel

March 12, 2016
