Exercise 1

The dataset on body temperatures includes a column coding male (1) and female (2) subjects. Use a \(t\)-test to look for differences in temperature between males and females.

First let us look at the data.

normtemp = read.csv("normtemp.csv")

summary(normtemp)
##        X               temp             sex          weight     
##  Min.   :  1.00   Min.   : 96.30   Min.   :1.0   Min.   :57.00  
##  1st Qu.: 33.25   1st Qu.: 97.80   1st Qu.:1.0   1st Qu.:69.00  
##  Median : 65.50   Median : 98.30   Median :1.5   Median :74.00  
##  Mean   : 65.50   Mean   : 98.25   Mean   :1.5   Mean   :73.76  
##  3rd Qu.: 97.75   3rd Qu.: 98.70   3rd Qu.:2.0   3rd Qu.:79.00  
##  Max.   :130.00   Max.   :100.80   Max.   :2.0   Max.   :89.00

The mean temperature for the entire sample is 98.25\(^\circ\)F and the mean weight is 73.76 kg (units assumed). The sex variable is coded numerically but is clearly meant to be categorical, so let's convert it to a factor.

normtemp$sex <- factor(normtemp$sex, labels = c("M", "F"))
summary(normtemp)
##        X               temp        sex        weight     
##  Min.   :  1.00   Min.   : 96.30   M:65   Min.   :57.00  
##  1st Qu.: 33.25   1st Qu.: 97.80   F:65   1st Qu.:69.00  
##  Median : 65.50   Median : 98.30          Median :74.00  
##  Mean   : 65.50   Mean   : 98.25          Mean   :73.76  
##  3rd Qu.: 97.75   3rd Qu.: 98.70          3rd Qu.:79.00  
##  Max.   :130.00   Max.   :100.80          Max.   :89.00

Now we can look at the distribution of the samples by sex.

library(ggplot2)

ggplot(normtemp) +
  geom_boxplot(aes(x = sex, y = temp)) + 
  labs(title = "Distribution of sampled temperature by sex", 
       x = "Sex",
       y = paste0("Temperature (", "\u00B0F)")) +
  theme_bw()

Visually it looks like females have a slightly higher body temperature than males, but is this statistically significant? The null hypothesis is that there is no difference between the means for males and females:

\[H_0: \mu_{\text{Male Temp}} = \mu_{\text{Female Temp}}\]

The two-sided alternative hypothesis is that there is a difference:

\[H_a: \mu_{\text{Male Temp}} \neq \mu_{\text{Female Temp}}\]

The two-sided alternative is the conservative choice when we have no prior expectation about the direction of the difference.

The two-sided \(t\)-test can be carried out by first reformatting the data into a new data frame with one column per sex (possible here because both groups contain 65 observations), then performing the \(t\)-test.

tempData <- data.frame(female = normtemp$temp[which(normtemp$sex == "F")], 
                       male = normtemp$temp[which(normtemp$sex == "M")])

t.test(tempData$female, tempData$male, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  tempData$female and tempData$male
## t = 2.2854, df = 127.51, p-value = 0.02394
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03881298 0.53964856
## sample estimates:
## mean of x mean of y 
##  98.39385  98.10462

The test returns a \(p\)-value of 0.02394. At a significance level of \(\alpha = 0.05\), we can reject the null hypothesis and conclude that mean body temperature differs between males and females. However, at a stricter significance level, say \(\alpha = 0.01\), we can no longer reject the null hypothesis.
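
As a check on the reported numbers, the Welch statistic can be rebuilt by hand. The sketch below assumes only the group means and standard deviations quoted in the summary of this exercise and the group size of 65; the variable names are illustrative.

```r
# Summary statistics for the two groups (n = 65 each)
mean_f <- 98.3938462; sd_f <- 0.7434878; n_f <- 65
mean_m <- 98.1046154; sd_m <- 0.6987558; n_m <- 65

# Per-group variance of the sample mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Standard error of the difference in means (variances not pooled)
se_diff <- sqrt(v_f + v_m)

# Welch t statistic
t_stat <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_two <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

round(c(t = t_stat, df = df_welch, p = p_two), 4)
round(ci, 4)
```

The values should agree with the t.test() output to within rounding of the inputs.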

Looking at the data, it appears that females have a higher temperature on average. Changing our alternative hypothesis to a one-sided test where:

\[H_a: \mu_{\text{Female Temp}} > \mu_{\text{Male Temp}}\]

results in the following \(t\)-test:

t.test(tempData$female, tempData$male, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  tempData$female and tempData$male
## t = 2.2854, df = 127.51, p-value = 0.01197
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.07954459        Inf
## sample estimates:
## mean of x mean of y 
##  98.39385  98.10462

This halves the \(p\)-value, because the \(t\) distribution is symmetric around zero and the observed difference lies in the direction of the alternative. However, because the \(p\)-value is still greater than 0.01, we cannot reject the null hypothesis at the 1% significance level.
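
The relationship between the one-sided and two-sided \(p\)-values can be sketched directly from the \(t\) distribution. The statistic and degrees of freedom are copied from the output above; the variable names are illustrative.

```r
# Test statistic and Welch degrees of freedom from the t.test() output
t_stat <- 2.2854
df_welch <- 127.51

# Two-sided: probability in both tails beyond |t|
p_two <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

# One-sided "greater": probability in the upper tail only
p_greater <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# One-sided "less": the complementary lower tail
p_less <- pt(t_stat, df = df_welch, lower.tail = TRUE)

round(c(two.sided = p_two, greater = p_greater, less = p_less), 5)
```

Halving only applies when the observed difference points in the direction of the alternative; for alternative = "less" the same data would give a \(p\)-value close to 1.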

Let’s look at the variance of the two groups.

var.test(tempData$female, tempData$male)
## 
##  F test to compare two variances
## 
## data:  tempData$female and tempData$male
## F = 1.1321, num df = 64, denom df = 64, p-value = 0.6211
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6905408 1.8561126
## sample estimates:
## ratio of variances 
##           1.132131
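
The \(F\) statistic reported above is the ratio of the two sample variances. A minimal sketch, assuming the group standard deviations quoted in the summary of this exercise and 65 observations per group:

```r
# Sample standard deviations and group sizes
sd_f <- 0.7434878; n_f <- 65
sd_m <- 0.6987558; n_m <- 65

# F statistic: ratio of the sample variances (female over male)
f_stat <- sd_f^2 / sd_m^2

# Upper-tail probability under F with (n_f - 1, n_m - 1) degrees of freedom
p_upper <- pf(f_stat, df1 = n_f - 1, df2 = n_m - 1, lower.tail = FALSE)

# Two-sided p-value: double the smaller of the two tails
p_two <- 2 * min(p_upper, 1 - p_upper)

round(c(F = f_stat, p = p_two), 4)
```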

The \(p\)-value for this test is quite high (0.6211), so we cannot reject the null hypothesis that the two groups have equal variances. Running the \(t\)-test again under the assumption of equal variance results in:

t.test(tempData$female, tempData$male, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  tempData$female and tempData$male
## t = 2.2854, df = 128, p-value = 0.02393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03882216 0.53963938
## sample estimates:
## mean of x mean of y 
##  98.39385  98.10462

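Basically the same result, which is expected: with equal group sizes the pooled and Welch \(t\) statistics are identical, and only the degrees of freedom differ slightly. But hey, at least I was thorough.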
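
As an aside, t.test() also has a formula interface that avoids reshaping the data and works when the groups have different sizes. The sketch below uses synthetic data, not the normtemp values; the data frame `demo` and its parameters are made up for illustration.

```r
set.seed(1)  # synthetic data for illustration only

# Long-format data: one row per subject, with a factor for sex
demo <- data.frame(
  temp = c(rnorm(65, mean = 98.1, sd = 0.7),
           rnorm(65, mean = 98.4, sd = 0.7)),
  sex  = factor(rep(c("M", "F"), each = 65), levels = c("M", "F"))
)

# Formula interface: response ~ grouping factor
res_formula <- t.test(temp ~ sex, data = demo)

# Equivalent call with two vectors, in factor-level order (M first, then F)
res_vectors <- t.test(demo$temp[demo$sex == "M"],
                      demo$temp[demo$sex == "F"])

c(formula = unname(res_formula$statistic),
  vectors = unname(res_vectors$statistic))
```

Note that the formula interface takes the groups in factor-level order, so with levels M then F the estimated difference is male minus female, and the sign of \(t\) is flipped relative to a female-first call.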

Summary:

A two-sample \(t\)-test was performed to compare mean body temperature between males and females. There was a significant difference at the 5% level between females (M = 98.39\(^\circ\)F, SD = 0.74\(^\circ\)F) and males (M = 98.10\(^\circ\)F, SD = 0.70\(^\circ\)F); \(t\)(128) = 2.29, \(p\) = 0.024.


Exercise 2

The file gapC.csv contains socio-economic information for 173 countries from the GapMinder dataset. Each country has been assigned to one of seven geographical regions, roughly corresponding to the continents. Carry out a one-way analysis of variance of the life expectancy variable to look for differences across the different regions.

- State the null and alternate hypotheses
- Make a boxplot of life expectancy per continent
- Carry out the ANOVA and give the \(F\)-statistic and the \(p\)-value obtained
- On the basis of this state whether or not life expectancy varies across continents

gap <- read.csv("gapC.csv")
gap$continent <- factor(gap$continent)

summary(gap)
##    country          lifeexpectancy  continent 
##  Length:173         Min.   :47.79   AF   :56  
##  Class :character   1st Qu.:62.64   AS   :35  
##  Mode  :character   Median :72.97   EE   :22  
##                     Mean   :69.20   LATAM:28  
##                     3rd Qu.:76.13   NORAM: 3  
##                     Max.   :83.39   OC   : 8  
##                     NA's   :1       WE   :21

The mean life expectancy for the entire data set is 69.20 years, with one missing value. The groups are far from balanced: sample sizes range from 3 countries (NORAM) to 56 (AF).

Displaying life expectancy by continent:

ggplot(gap) +
  geom_boxplot(aes(x = continent, y = lifeexpectancy, fill = continent)) +
  labs(title = "Life expectancy by continent",
       x = "Continent",
       y = "Life Expectancy (years)") +
  theme_bw()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

Visually, there do appear to be differences in life expectancy between the continents. The null hypothesis is that mean life expectancy is the same on every continent:

\[H_0: \mu_{\text{AF}} = \mu_{\text{AS}} = \mu_{\text{EE}} = \mu_{\text{LATAM}} = \mu_{\text{NORAM}} = \mu_{\text{OC}} = \mu_{\text{WE}}\]

The alternative hypothesis is that at least one continent's mean life expectancy differs from the others.

Performing a one-way ANOVA, the results show:

aov(lifeexpectancy ~ continent, data = gap)
## Call:
##    aov(formula = lifeexpectancy ~ continent, data = gap)
## 
## Terms:
##                 continent Residuals
## Sum of Squares   9757.236  7141.470
## Deg. of Freedom         6       165
## 
## Residual standard error: 6.578878
## Estimated effects may be unbalanced
## 1 observation deleted due to missingness

Most of the variation is between continents (sum of squares 9757.2, against 7141.5 within groups), although the contrast is not as large as in the example of the DO measurements. Calculating the \(F\)-statistic:

summary(aov(lifeexpectancy ~ continent, data = gap))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## continent     6   9757  1626.2   37.57 <2e-16 ***
## Residuals   165   7141    43.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

results in a very high \(F\)-value (37.57) and an extremely low \(p\)-value (< 2e-16), so we can reject the null hypothesis, even at the 0.1% significance level, and conclude that mean life expectancy differs for at least one continent.
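
The entries in the ANOVA table can be reproduced from the sums of squares and degrees of freedom printed by aov(). A minimal sketch, with those four numbers copied from the output above and illustrative variable names:

```r
# Sums of squares and degrees of freedom from the aov() output
ss_between <- 9757.236; df_between <- 6    # continent
ss_within  <- 7141.470; df_within  <- 165  # residuals

# Mean squares: sum of squares divided by degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic: between-group mean square over residual mean square
f_stat <- ms_between / ms_within

# p-value: upper tail of the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Residual standard error and share of total variation between groups
rse <- sqrt(ms_within)
prop_between <- ss_between / (ss_between + ss_within)

round(c(F = f_stat, RSE = rse, prop.between = prop_between), 4)
p_value
```

The share of the total sum of squares attributable to continent comes out at a little under 58%.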

Summary:

We found a statistically significant difference in mean life expectancy between continents (\(F\)(6, 165) = 37.57, \(p\) < 2e-16).