# Save the dataset fruitfly and rmarkdown file in the same folder!!!
# Import dataset fruitfly

fruitfly<-read.csv("fruitfly.csv")

Question 1:

Compare the distribution of lifespan among the five experimental groups of fruitflies.

  1. Produce an appropriate figure to compare the distribution of lifespan among the five experimental groups of fruitflies. What figure did you produce?

Hint: use the Console in RStudio to examine the dataset before attempting this exercise. For instance, type “names(fruitfly)” (without quotes) in the Console to see the variables in the data set and type “fruitfly” to see the entire dataset.

# Plot the appropriate figure to visualize the association between one quantitative variable and one categorical variable.
boxplot(lifespan~type, data = fruitfly)

We produced side-by-side boxplots of lifespan, one box per experimental group (type).

  1. Identify the group with the shortest average lifespan and provide the mean and standard deviation of lifespan among this group.
# get the group means of one quantitative variable categorized by one categorical variable

# tapply(quantitative, categorical, function)

tapply(fruitfly$lifespan, fruitfly$type, mean)
##     1     2     3     4     5 
## 63.56 64.80 63.36 56.76 38.72
tapply(fruitfly$lifespan, fruitfly$type, sd)
##        1        2        3        4        5 
## 16.45215 15.65248 14.53983 14.92838 12.10207

The group supplied with 8 virgin females (type = 5) has the shortest average lifespan: mean 38.72 days, standard deviation 12.10 days.
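The two tapply() calls above can be combined into one summary table, and which.min() can pick out the group with the smallest mean. The sketch below is only an illustration: `toy` is a small made-up data frame with invented values standing in for fruitfly.csv, and `group_summary` and `shortest` are hypothetical names.

```r
# Made-up stand-in for fruitfly (values invented for illustration only)
toy <- data.frame(
  type     = c(1, 1, 1, 2, 2, 2, 5, 5, 5),
  lifespan = c(60, 65, 70, 62, 66, 73, 30, 40, 47)
)

# Mean and standard deviation of lifespan for each group
group_mean <- tapply(toy$lifespan, toy$type, mean)
group_sd   <- tapply(toy$lifespan, toy$type, sd)
group_summary <- data.frame(
  type = names(group_mean),
  mean = as.vector(group_mean),
  sd   = as.vector(group_sd)
)

# Row for the group with the shortest average lifespan
shortest <- group_summary[which.min(group_summary$mean), ]
print(shortest)
```

With the real data, replacing `toy` by `fruitfly` should reproduce the means and standard deviations printed above.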

Question 2:

Let’s use the normal distribution to compare the lifespan distributions of the group supplied with 8 virgin females and the group supplied with 8 newly pregnant females.

  1. Using the normal distribution, fill in the table below to calculate the probability of surviving within the given range of days. Some answers have been filled in for you. (Round all probabilities to 4 decimals.)
# Supplied with 8 newly pregnant females N(63.4, 14.5)
# P(X<30)
round(pnorm(30, 63.4, 14.5), 4)
## [1] 0.0106
# P(30<X<50)
round(diff(pnorm(c(30,50), 63.4, 14.5)), 4)
## [1] 0.1671
# P(50<X<70)
round(diff(pnorm(c(50,70), 63.4, 14.5)), 4)
## [1] 0.4978
# P(X>70)
round(1-pnorm(70, 63.4, 14.5), 4)
## [1] 0.3245
# Supplied with 8 virgin females N(38.7,12.1)
# P(X<30)
round(pnorm(30, 38.7, 12.1), 4)
## [1] 0.2361
# P(30<X<50)
round(diff(pnorm(c(30,50), 38.7, 12.1)), 4)
## [1] 0.5888
# P(50<X<70)
round(diff(pnorm(c(50,70), 38.7, 12.1)), 4)
## [1] 0.1703
# P(X>70)
round(1-pnorm(70, 38.7, 12.1), 4)
## [1] 0.0048
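The eight calculations above follow one pattern, so the whole table can be built in a single step. This is a minimal sketch under the same normal parameters quoted in the comments above; `breaks`, `interval_probs`, and `prob_table` are hypothetical names introduced here.

```r
# Cut points (days); -Inf and Inf close off the two open-ended ranges
breaks <- c(-Inf, 30, 50, 70, Inf)

# Probability of falling in each interval under N(mean, sd)
interval_probs <- function(mean, sd) {
  round(diff(pnorm(breaks, mean = mean, sd = sd)), 4)
}

# One row per group, one column per range of days
prob_table <- rbind(
  pregnant = interval_probs(63.4, 14.5),
  virgin   = interval_probs(38.7, 12.1)
)
colnames(prob_table) <- c("X<30", "30<X<50", "50<X<70", "X>70")
print(prob_table)
```

Because pnorm(-Inf) is 0 and pnorm(Inf) is 1, diff() over the cut points gives the same four probabilities as the separate calls.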
  1. Suppose five fruitflies escape from their experimental conditions in a different lab, but they were noted to survive 81, 65, 56, 70, and 56 days. Do you think they came from the ‘supplied with 8 newly pregnant females’ group or the ‘supplied with 8 virgin females’ group and why?
tapply(fruitfly$lifespan, fruitfly$type, mean)
##     1     2     3     4     5 
## 63.56 64.80 63.36 56.76 38.72
mean(c(81, 65, 56, 70, 56))
## [1] 65.6

Supplied with 8 newly pregnant females, mean: 63.36 days. Supplied with 8 virgin females, mean: 38.72 days.

The fruitflies most likely came from the group supplied with 8 newly pregnant females, because the mean lifespan of the escaped fruitflies (65.6 days) is much closer to that group's mean (63.36 days) than to the mean of the group supplied with 8 virgin females (38.72 days).
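As a cross-check on the comparison of means, one could also compare how likely the five observed lifespans are under each normal model from Question 2. This is a different technique from the one used in the answer (a log-likelihood comparison), sketched under the assumption that N(63.4, 14.5) and N(38.7, 12.1) describe the two groups; `escaped`, `loglik_pregnant`, and `loglik_virgin` are hypothetical names.

```r
# Lifespans (days) of the five escaped flies, from the question
escaped <- c(81, 65, 56, 70, 56)

# Log-likelihood of the sample under each normal model used earlier
loglik_pregnant <- sum(dnorm(escaped, mean = 63.4, sd = 14.5, log = TRUE))
loglik_virgin   <- sum(dnorm(escaped, mean = 38.7, sd = 12.1, log = TRUE))

# The larger (less negative) value marks the more plausible group
print(c(pregnant = loglik_pregnant, virgin = loglik_virgin))
```

The log-likelihood is larger under the newly pregnant model, which agrees with the conclusion drawn from the means.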

  1. Submit the following code to create a data set that only contains the group of fruitflies with the shortest average lifespan. Be sure to enter the number corresponding to the type of fruitflies you identified in (1b) after the double equal sign.
fruitflysubset<-subset(fruitfly,type==5)

Fill in the table to compare the theoretical quantiles (calculated using the normal distribution) and observed quantiles (calculated using fruitflysubset) from the two groups. (Round all quantiles to one decimal place.)

# Supplied with 8 virgin females N(38.7,12.1)
# Theoretical quantiles from the normal model (qnorm)
# 10th percentile
round(qnorm(p=0.10,mean=38.72,sd=12.10),1)
## [1] 23.2
# 25th percentile
round(qnorm(p=0.25,mean=38.72,sd=12.10),1)
## [1] 30.6
# 50th percentile
round(qnorm(p=0.50,mean=38.72,sd=12.10),1)
## [1] 38.7
# 75th percentile
round(qnorm(p=0.75,mean=38.72,sd=12.10),1)
## [1] 46.9
# 90th percentile
round(qnorm(p=0.90,mean=38.72,sd=12.10),1)
## [1] 54.2
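# Observed quantiles from the data (quantile)
# Note: these calls use fruitfly$lifespan (all five groups), not fruitflysubset$lifespan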
# 10th quantile
quantile(x=fruitfly$lifespan,probs=0.10)
##  10% 
## 34.4
# 25th quantile
quantile(x=fruitfly$lifespan,probs=0.25)
## 25% 
##  46
# 50th quantile
quantile(x=fruitfly$lifespan,probs=0.50)
## 50% 
##  58
# 75th quantile
quantile(x=fruitfly$lifespan,probs=0.75)
## 75% 
##  70
# 90th quantile
quantile(x=fruitfly$lifespan,probs=0.90)
##  90% 
## 79.6
  1. Are the theoretical and observed quantiles close together or far apart? Does that validate or invalidate the assumption that lifespan follows a normal distribution?
hist(x=fruitfly$lifespan)

hist(x=fruitflysubset$lifespan)

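The theoretical and observed quantiles are fairly far apart: the observed values are about 11 to 25 days higher (for example, 34.4 vs. 23.2 at the 10th percentile and 79.6 vs. 54.2 at the 90th). This does not invalidate the assumption that lifespan follows a normal distribution, because the observed quantiles above were computed from fruitfly$lifespan, which pools all five groups, while the theoretical quantiles use the mean and standard deviation of the type 5 group alone. The gap therefore reflects the longer-lived groups rather than a departure from normality. A fair check would compare the N(38.7, 12.1) quantiles with the quantiles of fruitflysubset$lifespan; if those are close together, that supports the normal assumption for this group.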
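A minimal sketch of a like-for-like quantile comparison, where both columns come from the same sample. The vector `x` is simulated and is NOT the fruitfly data: it stands in for fruitflysubset$lifespan, and the sample size of 25 is an arbitrary assumption. `probs` and `quantile_table` are hypothetical names.

```r
# Simulated stand-in for fruitflysubset$lifespan (NOT the real data)
set.seed(1)
x <- rnorm(25, mean = 38.7, sd = 12.1)

probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# Theoretical quantiles from a normal model fitted to x, next to the
# observed sample quantiles of the same x
quantile_table <- data.frame(
  percentile  = probs * 100,
  theoretical = round(qnorm(probs, mean = mean(x), sd = sd(x)), 1),
  observed    = round(unname(quantile(x, probs = probs)), 1)
)
print(quantile_table)
```

With the real data, `x <- fruitflysubset$lifespan` would replace the simulated vector; rows where the two columns differ by many days would point to a departure from normality.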