# Save the dataset fruitfly and rmarkdown file in the same folder!!!
# Import dataset fruitfly
fruitfly<-read.csv("fruitfly.csv")
Compare the distribution of lifespan among the five experimental groups of fruitflies.
Hint: use the Console in RStudio to examine the data set before attempting this exercise. For instance, type “names(fruitfly)” (without quotes) in the Console to see the variables in the data set, and type “fruitfly” to see the entire data set.
# Plot the appropriate figure to visualize the association between one quantitative variable and one categorical variable.
boxplot(lifespan~type, data = fruitfly)
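We use a side-by-side boxplot because lifespan is quantitative and type is categorical, so one box per group lets us compare the five distributions directly.
# Get the group means of one quantitative variable categorized by one categorical variable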
# tapply(quantitative, categorical, function)
tapply(fruitfly$lifespan, fruitfly$type, mean)
## 1 2 3 4 5
## 63.56 64.80 63.36 56.76 38.72
tapply(fruitfly$lifespan, fruitfly$type, sd)
## 1 2 3 4 5
## 16.45215 15.65248 14.53983 14.92838 12.10207
The group supplied with 8 virgin females (type = 5) has the shortest mean lifespan (38.72 days).
Let’s use the normal distribution to compare the lifespan distribution of the group supplied with 8 virgin females with that of the group supplied with 8 newly pregnant females.
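The two tapply() calls above can also be combined into a single table. The following is only a sketch: the data frame `toy` is a made-up stand-in for fruitfly.csv (its values are invented for illustration), and `summary_by_group` is a hypothetical name.

```r
# Hypothetical toy data standing in for fruitfly.csv (values are made up)
toy <- data.frame(
  lifespan = c(60, 70, 65, 40, 35, 45),
  type     = c(1, 1, 1, 5, 5, 5)
)

# Split lifespan by group, then compute the mean and sd of each piece.
# t() turns the result into one row per group.
summary_by_group <- t(sapply(
  split(toy$lifespan, toy$type),
  function(x) c(mean = mean(x), sd = sd(x))
))
summary_by_group

# Label of the group with the smallest mean lifespan
shortest <- rownames(summary_by_group)[which.min(summary_by_group[, "mean"])]
shortest
```

With the real data, replacing `toy` with `fruitfly` should give the same means and standard deviations as the separate tapply() calls.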
# Supplied with 8 newly pregnant females N(63.4, 14.5)
# P(X<30)
round(pnorm(30, 63.4, 14.5), 4)
## [1] 0.0106
# P(30<X<50)
round(diff(pnorm(c(30,50), 63.4, 14.5)), 4)
## [1] 0.1671
# P(50<X<70)
round(diff(pnorm(c(50,70), 63.4, 14.5)), 4)
## [1] 0.4978
# P(X>70)
round(1-pnorm(70, 63.4, 14.5), 4)
## [1] 0.3245
# Supplied with 8 virgin females N(38.7,12.1)
# P(X<30)
round(pnorm(30, 38.7, 12.1), 4)
## [1] 0.2361
# P(30<X<50)
round(diff(pnorm(c(30,50), 38.7, 12.1)), 4)
## [1] 0.5888
# P(50<X<70)
round(diff(pnorm(c(50,70), 38.7, 12.1)), 4)
## [1] 0.1703
# P(X>70)
round(1-pnorm(70, 38.7, 12.1), 4)
## [1] 0.0048
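Since the same four intervals are used for both groups, the eight calculations above can be done in two calls. This is a sketch, not the required method: `cuts`, `interval_probs`, `pregnant8`, and `virgin8` are names introduced here for illustration.

```r
# Cut points (in days) that define the four lifespan intervals used above
cuts <- c(-Inf, 30, 50, 70, Inf)

# Probability of each interval under N(mean, sd):
# pnorm() gives the cumulative probability at each cut point,
# and diff() subtracts neighbors to get the area between them.
interval_probs <- function(mean, sd) {
  p <- diff(pnorm(cuts, mean = mean, sd = sd))
  names(p) <- c("X<30", "30<X<50", "50<X<70", "X>70")
  round(p, 4)
}

pregnant8 <- interval_probs(63.4, 14.5)  # supplied with 8 newly pregnant females
virgin8   <- interval_probs(38.7, 12.1)  # supplied with 8 virgin females

# One row per group, one column per interval
rbind(pregnant8, virgin8)
```

Each row should add up to 1 (up to rounding), which is a quick check that no interval was left out.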
tapply(fruitfly$lifespan, fruitfly$type, mean)
## 1 2 3 4 5
## 63.56 64.80 63.36 56.76 38.72
mean(c(81, 65, 56, 70, 56))
## [1] 65.6
Supplied with 8 newly pregnant females, mean: 63.36 days
Supplied with 8 virgin females, mean: 38.72 days
The escaped fruitflies most likely came from the group supplied with 8 newly pregnant females, because their mean lifespan (65.6 days) is much closer to that group's mean (63.36 days) than to the mean of the group supplied with 8 virgin females (38.72 days).
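The "closer mean" reasoning above can be written out as a small calculation. This is a sketch of that comparison only, using the group means and the five escaped lifespans already shown; the names `group_means`, `escaped`, `distance`, and `closest` are introduced here for illustration.

```r
# Group means reported by tapply() above
group_means <- c(pregnant8 = 63.36, virgin8 = 38.72)

# Lifespans (days) of the five escaped fruitflies
escaped <- c(81, 65, 56, 70, 56)
escaped_mean <- mean(escaped)

# Absolute distance from the escaped mean to each group mean
distance <- abs(group_means - escaped_mean)
distance

# Name of the group whose mean is closest
closest <- names(which.min(distance))
closest
```

This only compares means; it does not account for how much a mean of five flies can vary from sample to sample.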
fruitflysubset<-subset(fruitfly,type==5)
Fill in the table to compare the theoretical quantiles (calculated using the normal distribution) and observed quantiles (calculated using fruitflysubset) from the two groups. (Round all quantiles to one decimal place.)
# Supplied with 8 virgin females N(38.7,12.1)
# Theoretical (calculated using the normal distribution)
# 10th percentile
round(qnorm(p=0.10,mean=38.72,sd=12.10),1)
## [1] 23.2
# 25th percentile
round(qnorm(p=0.25,mean=38.72,sd=12.10),1)
## [1] 30.6
# 50th percentile
round(qnorm(p=0.50,mean=38.72,sd=12.10),1)
## [1] 38.7
# 75th percentile
round(qnorm(p=0.75,mean=38.72,sd=12.10),1)
## [1] 46.9
# 90th percentile
round(qnorm(p=0.90,mean=38.72,sd=12.10),1)
## [1] 54.2
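# Observed (sample quantiles)
# Note: these calls use all of fruitfly$lifespan; for the type 5 group only, use fruitflysubset$lifespan
# 10th percentile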
quantile(x=fruitfly$lifespan,probs=0.10)
## 10%
## 34.4
# 25th percentile
quantile(x=fruitfly$lifespan,probs=0.25)
## 25%
## 46
# 50th percentile
quantile(x=fruitfly$lifespan,probs=0.50)
## 50%
## 58
# 75th percentile
quantile(x=fruitfly$lifespan,probs=0.75)
## 75%
## 70
# 90th percentile
quantile(x=fruitfly$lifespan,probs=0.90)
## 90%
## 79.6
hist(x=fruitfly$lifespan)
hist(x=fruitflysubset$lifespan)
The theoretical and observed quantiles are fairly far apart. This does not by itself invalidate the assumption that lifespan follows a normal distribution, because the data are only a subset of one sample and may not reflect the entire population. If the experiment were repeated with more fruitflies, the observed quantiles would most likely move closer to those of the normal distribution. One subset cannot completely invalidate a model for the whole population.
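To make this comparison easier to read, the theoretical and observed quantiles can be placed side by side in one table. The following is a sketch under stated assumptions: `compare_quantiles` is a hypothetical helper, and `sim` is simulated data (not the real fruitfly measurements) used only so the example runs on its own. With the real data, the vector passed in would be fruitflysubset$lifespan.

```r
# Percentiles requested in the table
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# Theoretical quantiles come from a normal distribution with the
# sample's own mean and sd; observed quantiles come from the data.
compare_quantiles <- function(x, probs) {
  data.frame(
    percentile  = paste0(100 * probs, "%"),
    theoretical = round(qnorm(probs, mean = mean(x), sd = sd(x)), 1),
    observed    = round(unname(quantile(x, probs = probs)), 1)
  )
}

# Simulated stand-in for one group's lifespans (assumption, not real data)
set.seed(1)
sim <- rnorm(25, mean = 38.72, sd = 12.10)

quantile_table <- compare_quantiles(sim, probs)
quantile_table
```

Passing the same vector to both columns guarantees that the theoretical and observed quantiles describe the same group.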