Problem 1: Birds of The Caribbean

Load the data into R and assign it to a variable called birds

birds <- read.csv("birds.csv", stringsAsFactors = T)
birds
##    immigrationDates
## 1              0.00
## 2              0.00
## 3              0.04
## 4              0.21
## 5              0.29
## 6              0.54
## 7              0.63
## 8              0.88
## 9              0.96
## 10             1.25
## 11             1.67
## 12             1.75
## 13             1.84
## 14             1.96
## 15             2.01
## 16             2.51
## 17             2.72
## 18             3.30
## 19             3.51
## 20             4.05
## 21             4.85
## 22             6.94
## 23             8.73
## 24            10.57
## 25            11.11
## 26            12.45
## 27            14.00
## 28            17.30
## 29            17.92
## 30            18.05
## 31            18.43
## 32            22.48
## 33            22.48
## 34            23.48
## 35            26.32
## 36            26.45
## 37            28.87

Calculate the number of species which migrated at least 5 million years ago

birds >= 5
##       immigrationDates
##  [1,]            FALSE
##  [2,]            FALSE
##  [3,]            FALSE
##  [4,]            FALSE
##  [5,]            FALSE
##  [6,]            FALSE
##  [7,]            FALSE
##  [8,]            FALSE
##  [9,]            FALSE
## [10,]            FALSE
## [11,]            FALSE
## [12,]            FALSE
## [13,]            FALSE
## [14,]            FALSE
## [15,]            FALSE
## [16,]            FALSE
## [17,]            FALSE
## [18,]            FALSE
## [19,]            FALSE
## [20,]            FALSE
## [21,]            FALSE
## [22,]             TRUE
## [23,]             TRUE
## [24,]             TRUE
## [25,]             TRUE
## [26,]             TRUE
## [27,]             TRUE
## [28,]             TRUE
## [29,]             TRUE
## [30,]             TRUE
## [31,]             TRUE
## [32,]             TRUE
## [33,]             TRUE
## [34,]             TRUE
## [35,]             TRUE
## [36,]             TRUE
## [37,]             TRUE
mig <- birds[birds >= 5, ]
length(mig)
## [1] 16
paste("The number of species which migrated at least 5 million years ago:", length(mig))
## [1] "The number of species which migrated at least 5 million years ago: 16"
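The same count can be obtained directly from the logical comparison, because each `TRUE` counts as 1 when summed. This is a minimal sketch: it re-enters the dates printed above as a plain vector so that it runs without `birds.csv`, and the names `dates` and `n_old` are introduced here for illustration only.

```r
# Immigration dates (millions of years ago), re-entered from the printed data
dates <- c(0.00, 0.00, 0.04, 0.21, 0.29, 0.54, 0.63, 0.88, 0.96, 1.25,
           1.67, 1.75, 1.84, 1.96, 2.01, 2.51, 2.72, 3.30, 3.51, 4.05,
           4.85, 6.94, 8.73, 10.57, 11.11, 12.45, 14.00, 17.30, 17.92, 18.05,
           18.43, 22.48, 22.48, 23.48, 26.32, 26.45, 28.87)

# Each TRUE counts as 1, so the sum is the number of species dated 5 mya or earlier
n_old <- sum(dates >= 5)
n_old
```

With the data frame, the equivalent call would be sum(birds$immigrationDates >= 5). Unlike length(birds[birds >= 5, ]), it does not rely on the data frame having a single column.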

Plot the data on a frequency histogram and describe the shape of the frequency distribution.

# Pass the column itself; birds[birds$immigrationDates, ] would use the dates as row indices
hist(birds$immigrationDates, 
     col = "pink",
     main = "Frequency of Bird Immigration Millions of Years Ago",
     xlab = "Dates (Millions of Years Ago)",
     ylab = "Number of Birds") 

The frequency histogram does not show a normal distribution; instead it is positively (right) skewed. Most of the values occurred within the last 0-5 million years.
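A quick numerical check of the skew is to compare the mean with the median: in a right-skewed sample the long upper tail pulls the mean above the median. The sketch below re-enters the printed dates as a plain vector; `dates` and `skew_gap` are illustrative names, not part of the original analysis.

```r
# Immigration dates (millions of years ago), re-entered from the printed data
dates <- c(0.00, 0.00, 0.04, 0.21, 0.29, 0.54, 0.63, 0.88, 0.96, 1.25,
           1.67, 1.75, 1.84, 1.96, 2.01, 2.51, 2.72, 3.30, 3.51, 4.05,
           4.85, 6.94, 8.73, 10.57, 11.11, 12.45, 14.00, 17.30, 17.92, 18.05,
           18.43, 22.48, 22.48, 23.48, 26.32, 26.45, 28.87)

# Mean and median side by side
c(mean = mean(dates), median = median(dates))

# A positive gap (mean above median) is consistent with positive skew
skew_gap <- mean(dates) - median(dates)
skew_gap
```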

Overlay a density line on the histogram.

# Pass the column itself; freq = F puts the histogram on the density scale
hist(birds$immigrationDates, freq = F,
     col = "pink",
     main = "Frequency of Bird Immigration Millions of Years Ago",
     xlab = "Dates (Millions of Years Ago)",
     ylab = "Density")
lines(density(birds$immigrationDates), col = "red")

Replicating the sample plot

bird <- birds$immigrationDates
# Note: these two names shadow the base functions mean() and sd(); function calls still work
mean <- mean(bird)
sd <- sd(bird)
n <- length(bird)
# Random sample from a normal distribution with the observed mean and sd
bird.rn <- rnorm(n, mean, sd)
plot(density(bird), 
     col = "red",
     ylim = c(0.00,0.05),
     xlim = c(-10, 40),
     xlab = "migration date, mln years",
     main = "Distributions of the Migration Dates: Observed and Fitted Normal")
lines(density(bird.rn), col = "blue")
# abline() is called for its side effect, so the calls are made separately rather than chained with +
abline(v = mean, col = "blue", lwd = 2)
abline(v = mean + sd, col = "blue", lty = 3)
abline(v = mean - sd, col = "blue", lty = 3)

Present the data on a box plot.

boxplot(birds,
        ylab = "migration date, mln years", main = "Approximate Dates of Migration of Bird Species in Lesser Antilles") 

# points() is called for its side effect, so the calls are made separately rather than chained with +
points(mean, pch = 18, col = "blue", cex = 2)
points(mean + sd, pch = 25, col = "red", bg = "red")
# pch = 24 matches the symbol used for "Mean - sd" in the legend
points(mean - sd, pch = 24, col = "red", bg = "red")
legend("topright", 
       legend = c("Mean", "Mean + sd", "Mean - sd"), 
       col = c("blue", "red", "red"), 
       pch = c(18, 25, 24), 
       pt.cex = c(2, 1, 1),
       bty = "n", 
       horiz = F, pt.bg = "red"
  )  

Describe in your own words what box plot is presenting

The graph above is a boxplot, which displays the central tendency and spread of the data. The median (the middle value of the data) is indicated by the thick black line at roughly 3.5 million years ago (mya). The box spans the first to the third quartile and therefore contains the middle 50% of the data; in this case roughly 1-17 mya. The whiskers extend to the most extreme values that lie within 1.5 times the box height of the box edges. Here there are no outliers, so they mark the minimum and maximum of the data at roughly 0 mya and 29 mya.
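The values that the boxplot draws can be read off numerically with `boxplot.stats()`, which returns the lower whisker, lower hinge, median, upper hinge and upper whisker, plus any outliers. This is a sketch on the printed dates re-entered as a vector; `dates` and `bp` are illustrative names.

```r
# Immigration dates (millions of years ago), re-entered from the printed data
dates <- c(0.00, 0.00, 0.04, 0.21, 0.29, 0.54, 0.63, 0.88, 0.96, 1.25,
           1.67, 1.75, 1.84, 1.96, 2.01, 2.51, 2.72, 3.30, 3.51, 4.05,
           4.85, 6.94, 8.73, 10.57, 11.11, 12.45, 14.00, 17.30, 17.92, 18.05,
           18.43, 22.48, 22.48, 23.48, 26.32, 26.45, 28.87)

bp <- boxplot.stats(dates)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats

# Points beyond the whiskers (an empty vector means no outliers)
bp$out
```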

From https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/data-presentation/box-and-whisker-plots.html

Visually and Statistically Checking Normality

qqnorm(bird, main = "Migration QQ plot")
qqline(bird)

shapiro.test(bird)
## 
##  Shapiro-Wilk normality test
## 
## data:  bird
## W = 0.82802, p-value = 5.028e-05

Visual inspection of the QQ plot shows that the points do not follow the reference line, which is the first indication that the data are not normally distributed. The Shapiro-Wilk normality test returned a p-value of 5.028e-05, which is much smaller than the alpha level of 0.05. The p-value is the probability of obtaining a W statistic at least this extreme if the data had been drawn from a normal distribution. Because this probability is so small, we reject the null hypothesis that the data are normally distributed and conclude that the bird data are not normally distributed (alternative hypothesis).

Probability that a randomly picked bird species immigrated into the Lesser Antilles between 1 and 5 million years ago.

pnorm(5, mean, sd) - pnorm(1, mean, sd)
## [1] 0.141978
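Because the normality checks above rejected the normal model, it may be worth setting this model-based probability next to the proportion actually observed in the sample. The sketch below does both on the printed dates re-entered as a vector; `dates`, `m`, `s`, `p_normal` and `p_empirical` are illustrative names, and the empirical proportion is offered only as a comparison, not as the answer the exercise asks for.

```r
# Immigration dates (millions of years ago), re-entered from the printed data
dates <- c(0.00, 0.00, 0.04, 0.21, 0.29, 0.54, 0.63, 0.88, 0.96, 1.25,
           1.67, 1.75, 1.84, 1.96, 2.01, 2.51, 2.72, 3.30, 3.51, 4.05,
           4.85, 6.94, 8.73, 10.57, 11.11, 12.45, 14.00, 17.30, 17.92, 18.05,
           18.43, 22.48, 22.48, 23.48, 26.32, 26.45, 28.87)

m <- mean(dates)
s <- sd(dates)

# Probability under a normal distribution with the sample mean and sd
p_normal <- pnorm(5, m, s) - pnorm(1, m, s)

# Proportion of observed species with dates between 1 and 5 mya
p_empirical <- mean(dates >= 1 & dates <= 5)

c(normal_model = p_normal, empirical = p_empirical)
```

The gap between the two numbers is another symptom of the skew: the normal curve places probability on negative dates and too little on the 1-5 mya range where many observations sit.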

Problem 2: Exploding Termites

Load the data into R and assign it to a variable called termites

termites <- read.csv("termite_gland .csv", stringsAsFactors = T)
termites
##    glandColor immobilization
## 1        blue       unharmed
## 2        blue       unharmed
## 3        blue       unharmed
## 4        blue    immobilized
## 5        blue    immobilized
## 6        blue    immobilized
## 7        blue    immobilized
## 8        blue    immobilized
## 9        blue    immobilized
## 10       blue    immobilized
## 11       blue    immobilized
## 12       blue    immobilized
## 13       blue    immobilized
## 14       blue    immobilized
## 15       blue    immobilized
## 16       blue    immobilized
## 17       blue    immobilized
## 18       blue    immobilized
## 19       blue    immobilized
## 20       blue    immobilized
## 21       blue    immobilized
## 22       blue    immobilized
## 23       blue    immobilized
## 24       blue    immobilized
## 25       blue    immobilized
## 26       blue    immobilized
## 27       blue    immobilized
## 28       blue    immobilized
## 29       blue    immobilized
## 30       blue    immobilized
## 31       blue    immobilized
## 32       blue    immobilized
## 33       blue    immobilized
## 34       blue    immobilized
## 35       blue    immobilized
## 36       blue    immobilized
## 37       blue    immobilized
## 38       blue    immobilized
## 39       blue    immobilized
## 40       blue    immobilized
## 41      white       unharmed
## 42      white       unharmed
## 43      white       unharmed
## 44      white       unharmed
## 45      white       unharmed
## 46      white       unharmed
## 47      white       unharmed
## 48      white       unharmed
## 49      white       unharmed
## 50      white       unharmed
## 51      white       unharmed
## 52      white       unharmed
## 53      white       unharmed
## 54      white       unharmed
## 55      white       unharmed
## 56      white       unharmed
## 57      white       unharmed
## 58      white       unharmed
## 59      white       unharmed
## 60      white       unharmed
## 61      white       unharmed
## 62      white       unharmed
## 63      white       unharmed
## 64      white       unharmed
## 65      white       unharmed
## 66      white       unharmed
## 67      white       unharmed
## 68      white       unharmed
## 69      white       unharmed
## 70      white       unharmed
## 71      white       unharmed
## 72      white    immobilized
## 73      white    immobilized
## 74      white    immobilized
## 75      white    immobilized
## 76      white    immobilized
## 77      white    immobilized
## 78      white    immobilized
## 79      white    immobilized
## 80      white    immobilized

Calculate number of observations for each gland colour separately.

summary(termites$glandColor)
##  blue white 
##    40    40

The number of observations for the blue gland colour is 40, and the number of observations for the white gland colour is also 40.

Create a contingency table representing the data

term.tbl <- table(termites$immobilization, termites$glandColor)
term.tbl
##              
##               blue white
##   immobilized   37     9
##   unharmed       3    31

Visualize the data in the contingency table on a Mosaic plot

mosaicplot(t(term.tbl), col = c("pink", "lightblue"), main= "Termite Gland Colour and Immobilization", xlab= "colour", ylab = "fate")

Perform statistical analysis of the data to answer the research question

The null hypothesis (H0) states that there is no association between gland colour and its effect on the termite, i.e. whether it is immobilized or unharmed.

The alternative hypothesis (Ha) states that gland colour is associated with a termite’s fate when it comes into contact with the liquid.

The statistical test used to analyse this data will be Fisher’s Exact Test. The Chi-Squared Test relies on a large-sample approximation and is usually considered reliable only when no more than 20% of the cells have an expected count below 5. This rule applies to expected rather than observed counts: although one observed cell is small (the 3 blue unharmed termites), the expected counts for this table are 23 and 17, so the Chi-Squared Test would also be acceptable here. Fisher’s Exact Test is nevertheless well suited to 2x2 contingency tables of categorical variables, because it computes the p-value exactly from the hypergeometric distribution with the marginal totals fixed, rather than relying on an approximation.
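The expected counts that the Chi-Squared rule of thumb refers to can be computed from the marginal totals as (row total x column total) / grand total. The sketch below rebuilds the contingency table from the counts shown above; `term_tbl` and `expected` are illustrative names.

```r
# Contingency table rebuilt from the printed counts (filled column by column)
term_tbl <- matrix(c(37, 3, 9, 31), nrow = 2,
                   dimnames = list(fate = c("immobilized", "unharmed"),
                                   colour = c("blue", "white")))

# Expected count in each cell under independence of fate and colour
expected <- outer(rowSums(term_tbl), colSums(term_tbl)) / sum(term_tbl)
expected

# TRUE if every expected count is at least 5
all(expected >= 5)
```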

fisher.test(term.tbl)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  term.tbl
## p-value = 1.254e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    9.454827 246.938567
## sample estimates:
## odds ratio 
##   39.60155

The p-value of 1.254e-10 is much smaller than the significance level of alpha = 0.05. We therefore reject H0 in favour of Ha: there is strong statistical evidence of an association between termite gland colour and fate. If p had been greater than alpha, we would have retained the null hypothesis of no association.

Degrees of Freedom (DF)

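DF = (rows - 1)(columns - 1). For our 2x2 table, DF = (2 - 1)(2 - 1) = 1. Note that Fisher’s Exact Test does not itself use degrees of freedom; this value is what a Chi-Squared Test on the same table would use.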

Odds Ratio

The odds ratio estimated by Fisher’s Test is 39.60. An odds ratio measures the strength of association between two categorical variables that each have two categories (in this case colour and fate). Here it compares the odds of immobilization for blue-gland liquid with the odds for white-gland liquid. If there is no association, the odds ratio is 1. A value of 39.60, with a 95% confidence interval (9.45 to 246.94) that excludes 1, therefore suggests a strong association between gland colour and termite fate.
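For comparison, the sample odds ratio can be computed by hand from the four cell counts. It differs slightly from the value printed by `fisher.test()`, which reports a conditional maximum likelihood estimate rather than the simple cross-product ratio. The sketch below rebuilds the table from the printed counts; `term_tbl`, `sample_or` and `conditional_or` are illustrative names.

```r
# Contingency table rebuilt from the printed counts (filled column by column)
term_tbl <- matrix(c(37, 3, 9, 31), nrow = 2,
                   dimnames = list(fate = c("immobilized", "unharmed"),
                                   colour = c("blue", "white")))

# Odds of immobilization for each gland colour
odds_blue  <- term_tbl["immobilized", "blue"]  / term_tbl["unharmed", "blue"]
odds_white <- term_tbl["immobilized", "white"] / term_tbl["unharmed", "white"]

# Sample (cross-product) odds ratio: (37 * 31) / (3 * 9)
sample_or <- odds_blue / odds_white
sample_or

# Conditional maximum likelihood estimate reported by fisher.test()
conditional_or <- unname(fisher.test(term_tbl)$estimate)
conditional_or
```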

Conclusion

Fisher’s Exact Test revealed very strong statistical evidence (p = 1.254e-10, far below 0.05) that gland colour is associated with whether termites are unharmed or immobilized. The odds ratio of 39.6 indicates that this association is strong. We therefore reject the null hypothesis and conclude that gland colour is significantly associated with a termite’s fate when it comes into contact with the liquid. In these data, liquid from blue glands immobilized most termites (37 of 40), whereas liquid from white glands left most termites unharmed (31 of 40).

Problem 3: Eating When Diving

Load the data into R

dolphin <- read.csv("dolphin_oxygen.csv", stringsAsFactors = T)
dolphin
##    individual oxygenUseNonfeeding oxygenUseFeeding
## 1           1                42.2             71.0
## 2           2                51.7             77.3
## 3           3                59.8             82.6
## 4           4                66.5             96.1
## 5           5                81.9            106.6
## 6           6                82.0            112.8
## 7           7                81.3            121.2
## 8           8                81.3            126.4
## 9           9                96.0            127.5
## 10         10               104.1            143.1

Summary of Data

summary(dolphin)
##    individual    oxygenUseNonfeeding oxygenUseFeeding
##  Min.   : 1.00   Min.   : 42.20      Min.   : 71.00  
##  1st Qu.: 3.25   1st Qu.: 61.48      1st Qu.: 85.97  
##  Median : 5.50   Median : 81.30      Median :109.70  
##  Mean   : 5.50   Mean   : 74.68      Mean   :106.46  
##  3rd Qu.: 7.75   3rd Qu.: 81.97      3rd Qu.:125.10  
##  Max.   :10.00   Max.   :104.10      Max.   :143.10

Non-feeding

  • The minimum data point is 42.20 and the maximum is 104.10.

  • Mean = 74.68, Median = 81.30.

  • The first quartile is at 61.48 and the third is at 81.97.

Feeding

  • The minimum data point is 71.00 and the maximum is 143.10.

  • Mean = 106.46, Median = 109.70.

  • The first quartile is at 85.97 and the third is at 125.10.

Present the data on two different plots.

Boxplot

boxplot(dolphin$oxygenUseNonfeeding, dolphin$oxygenUseFeeding, col = c("pink", "lightblue"), names = c("Non-Feeding", "Feeding"), ylab = "Oxygen Use", xlab = "Dive Type", main = "Dolphin Oxygen Use After Diving")

Overlaid Histogram

plot_colors <- c("pink", "lightblue")
plot_angles <- c(135, 45)

# xlim spans both samples so that the feeding histogram is not clipped
hist(dolphin$oxygenUseNonfeeding, 
     col = plot_colors[1], density = 15, angle = plot_angles[1],
     xlim = c(40, 160),
     xlab = "Oxygen Use (ml O2/kg)", ylab = "Frequency", 
     main = "Dolphin Oxygen Use After Diving")
hist(dolphin$oxygenUseFeeding, 
     col = plot_colors[2], density = 15, angle = plot_angles[2], add = T)

# Legend labels follow the order of plot_colors: non-feeding first, then feeding
legend("topright", legend = c("non-feeding", "feeding"), 
       border = plot_colors,  fill = plot_colors, density = 15, angle = plot_angles, bty = "n")

Statistical Analysis

Hypothesis

Null hypothesis (H0) = The mean oxygen use for feeding and non-feeding dives is not statistically different, i.e. dive type has no effect on oxygen use.

Alternative hypothesis (Ha) = The mean oxygen use for feeding and non-feeding dives is statistically different, i.e. post-dive oxygen consumption is influenced by whether the dive was a feeding or a non-feeding dive.

Paired T test

The t-test was chosen to determine whether the mean oxygen use in feeding and non-feeding dives is significantly different, i.e. to test the null hypothesis. It is paired because each dolphin is measured twice, once per dive type (feeding and non-feeding). These take the place of the ‘before’ and ‘after’ variables often seen in paired t-tests.

before <- dolphin$oxygenUseNonfeeding
after <- dolphin$oxygenUseFeeding
qqnorm(before, main = "Oxygen Use Non-Feeding Dive QQ Plot")
qqline(before)

shapiro.test(before)
## 
##  Shapiro-Wilk normality test
## 
## data:  before
## W = 0.951, p-value = 0.6804
qqnorm(after, main = "Oxygen Use Feeding Dive QQ Plot")
qqline(after)

shapiro.test(after)
## 
##  Shapiro-Wilk normality test
## 
## data:  after
## W = 0.95397, p-value = 0.7155

T-test assumptions

The paired t-test assumes continuous data, which is met because oxygen use is measured in ml O2/kg. It also assumes normality; strictly, this applies to the paired differences, but the normality of both samples is checked above. The QQ plots show the points lying close to the reference line, and the p-values from both Shapiro-Wilk tests are greater than alpha = 0.05, so we have no evidence against normality and proceed on the assumption that the data are normally distributed.
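Since the paired t-test works on the within-dolphin differences, the normality check can also be applied to those differences directly. The sketch below re-enters the printed oxygen-use values as vectors so that it runs without `dolphin_oxygen.csv`; `nonfeeding`, `feeding`, `diffs` and `sw` are illustrative names, and no result is quoted here because this test was not part of the original output.

```r
# Oxygen use per dolphin (ml O2/kg), re-entered from the printed data
nonfeeding <- c(42.2, 51.7, 59.8, 66.5, 81.9, 82.0, 81.3, 81.3, 96.0, 104.1)
feeding    <- c(71.0, 77.3, 82.6, 96.1, 106.6, 112.8, 121.2, 126.4, 127.5, 143.1)

# One difference per dolphin: feeding minus non-feeding
diffs <- feeding - nonfeeding
diffs

# Shapiro-Wilk test on the differences; a p-value above 0.05 gives no evidence against normality
sw <- shapiro.test(diffs)
sw
```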

Degrees of Freedom

For a paired t-test, the DF is n-1, where n is the total number of paired observations. Therefore, DF = 10 -1 = 9.

T-test

t.test(before, after, paired = T, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  before and after
## t = -13.774, df = 9, p-value = 2.361e-07
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -36.99942 -26.56058
## sample estimates:
## mean difference 
##          -31.78
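The t statistic and p-value in the output above can be reproduced by hand from the paired differences: t is the mean difference divided by its standard error, with n - 1 degrees of freedom. This is a sketch on the printed values re-entered as vectors; `nonfeeding`, `feeding`, `diffs`, `se`, `t_stat` and `p_value` are illustrative names.

```r
# Oxygen use per dolphin (ml O2/kg), re-entered from the printed data
nonfeeding <- c(42.2, 51.7, 59.8, 66.5, 81.9, 82.0, 81.3, 81.3, 96.0, 104.1)
feeding    <- c(71.0, 77.3, 82.6, 96.1, 106.6, 112.8, 121.2, 126.4, 127.5, 143.1)

# Same order as t.test(before, after): non-feeding minus feeding
diffs <- nonfeeding - feeding
n <- length(diffs)

# Standard error of the mean difference
se <- sd(diffs) / sqrt(n)

# t statistic and two-sided p-value with n - 1 degrees of freedom
t_stat <- mean(diffs) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

c(t = t_stat, df = n - 1, p = p_value)
```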

The p-value

The p-value from the paired t-test is p = 2.361e-07. This is the probability of observing a mean difference at least as extreme as the one in our sample if the null hypothesis were true. Since it is much smaller than the significance level of alpha = 0.05, we have significant evidence to reject the null hypothesis (that there is no difference between the means) in favour of the alternative hypothesis: that a dolphin’s post-dive oxygen use is influenced by dive type, i.e. whether or not the dive was for feeding.

The confidence interval

The 95% confidence interval runs from -36.99942 to -26.56058 and does not include zero. This means we can be 95% confident that the true mean difference in oxygen use (non-feeding minus feeding) lies between roughly -37 and -27. Because the whole interval is far from zero, a true mean difference of zero is not compatible with the data at the 5% significance level. This is additional support for rejecting the null hypothesis.
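The interval itself is the mean difference plus or minus a critical t value times the standard error of the mean difference. The sketch below reproduces it from the printed values re-entered as vectors; `nonfeeding`, `feeding`, `diffs`, `se`, `t_crit` and `ci` are illustrative names.

```r
# Oxygen use per dolphin (ml O2/kg), re-entered from the printed data
nonfeeding <- c(42.2, 51.7, 59.8, 66.5, 81.9, 82.0, 81.3, 81.3, 96.0, 104.1)
feeding    <- c(71.0, 77.3, 82.6, 96.1, 106.6, 112.8, 121.2, 126.4, 127.5, 143.1)

# Same order as t.test(before, after): non-feeding minus feeding
diffs <- nonfeeding - feeding
n <- length(diffs)
se <- sd(diffs) / sqrt(n)

# Critical value leaving 2.5% in each tail of the t distribution with n - 1 df
t_crit <- qt(0.975, df = n - 1)

# Lower and upper limits of the 95% confidence interval for the mean difference
ci <- mean(diffs) + c(-1, 1) * t_crit * se
ci
```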

Is homoscedasticity (homogeneity of variance) an important factor in this experiment?

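No. Homoscedasticity is an important assumption for tests that compare independent groups, such as the two-sample Student’s t-test and ANOVA (which is used when the data include three or more groups). The paired t-test is effectively a one-sample test on the within-pair differences, so only the distribution of those differences matters and the variances of the two dive types do not need to be equal.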

Conclusion

We have strong statistical support for the alternative hypothesis, stating that the mean oxygen use for feeding and non-feeding dives is statistically different. The paired t-test gave a p-value of 2.361e-07, far below the significance level of alpha = 0.05. Additionally, the 95% confidence interval for the mean difference, from -37 to -27, excludes zero. Based on these results, we can confidently reject the null hypothesis and conclude that post-dive oxygen consumption is influenced by dive type, and is significantly higher after a feeding dive than after a non-feeding dive.