The dataset on body temperatures includes a column coding male (1) and female (2) subjects. Use a \(t\)-test to look for differences in temperature between males and females.
First let us look at the data.
normtemp <- read.csv("normtemp.csv")
summary(normtemp)
## X temp sex weight
## Min. : 1.00 Min. : 96.30 Min. :1.0 Min. :57.00
## 1st Qu.: 33.25 1st Qu.: 97.80 1st Qu.:1.0 1st Qu.:69.00
## Median : 65.50 Median : 98.30 Median :1.5 Median :74.00
## Mean : 65.50 Mean : 98.25 Mean :1.5 Mean :73.76
## 3rd Qu.: 97.75 3rd Qu.: 98.70 3rd Qu.:2.0 3rd Qu.:79.00
## Max. :130.00 Max. :100.80 Max. :2.0 Max. :89.00
The mean temperature for the entire sample is 98.25\(^\circ\)F, and the mean weight is 73.76 kg (units assumed). The sex variable is clearly meant to be categorical, so let’s change that now.
normtemp$sex <- factor(normtemp$sex, labels = c("M", "F"))
summary(normtemp)
## X temp sex weight
## Min. : 1.00 Min. : 96.30 M:65 Min. :57.00
## 1st Qu.: 33.25 1st Qu.: 97.80 F:65 1st Qu.:69.00
## Median : 65.50 Median : 98.30 Median :74.00
## Mean : 65.50 Mean : 98.25 Mean :73.76
## 3rd Qu.: 97.75 3rd Qu.: 98.70 3rd Qu.:79.00
## Max. :130.00 Max. :100.80 Max. :89.00
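The recoding works because `factor()` assigns `labels` to the distinct values in sorted order, so 1 becomes "M" and 2 becomes "F". A minimal check on a made-up vector (the values in `sex_codes` are illustrative and are not taken from normtemp.csv):

```r
# Hypothetical codes using the same scheme as the dataset: 1 = male, 2 = female
sex_codes <- c(1, 2, 2, 1, 2)

# Levels are the sorted unique values (1, 2); labels are matched to them in order
sex_factor <- factor(sex_codes, labels = c("M", "F"))

print(sex_factor)
print(table(sex_factor))
```

Passing `levels = c(1, 2)` explicitly as well would make the mapping independent of which codes happen to appear in the data.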
Now we can look at the distribution of the samples by sex.
library(ggplot2)
ggplot(normtemp) +
  geom_boxplot(aes(x = sex, y = temp)) +
  labs(title = "Distribution of sampled temperature by sex",
       x = "Sex",
       y = "Temperature (\u00B0F)") +
  theme_bw()
Visually it looks like females have a slightly higher body temperature than males, but is this statistically significant? The null hypothesis is that there is no difference between the means for males and females:
\[H_0: \mu_{Male Temp} = \mu_{Female Temp}\]
The two-sided alternative hypothesis is that there is a difference:
\[H_a: \mu_{Male Temp} \neq \mu_{Female Temp}\]
The two-sided alternative is the cautious choice when we have no prior expectation about the direction of the difference.
The two-sided \(t\)-test can be calculated by first reformatting the data into a new data frame with one column per sex (possible here because both groups contain 65 subjects), then performing the \(t\)-test.
tempData <- data.frame(female = normtemp$temp[which(normtemp$sex == "F")],
male = normtemp$temp[which(normtemp$sex == "M")])
t.test(tempData$female, tempData$male, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: tempData$female and tempData$male
## t = 2.2854, df = 127.51, p-value = 0.02394
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03881298 0.53964856
## sample estimates:
## mean of x mean of y
## 98.39385 98.10462
The test returns a \(p\)-value of 0.02394. At a significance level of \(\alpha = 0.05\), we can reject the null hypothesis and conclude that the mean temperatures of males and females differ. However, at a stricter significance level, say \(\alpha = 0.01\), we could no longer reject the null hypothesis.
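The statistic, degrees of freedom, and confidence interval in this output can be reproduced by hand from the group summaries. The sketch below assumes 65 subjects per group and group standard deviations of about 0.743 (females) and 0.699 (males), as obtained from `sd()` on each column; the variable names are illustrative.

```r
# Group summaries (means from the t.test output; SDs assumed from sd() per group)
n_f <- 65
n_m <- 65
mean_f <- 98.3938462
mean_m <- 98.1046154
sd_f <- 0.7434878
sd_m <- 0.6987558

# Variance of each sample mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic: difference in means divided by its standard error
se <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_two <- 2 * pt(-abs(t_stat), df)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, df = df, p = p_two), 5))
print(round(ci, 4))
```

The results should agree with the `t.test()` output up to rounding of the inputs.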
Looking at the data, it appears that females have a higher temperature on average. Changing our alternative hypothesis to a one-sided test where:
\[H_a: \mu_{Female Temp} > \mu_{Male Temp}\]
results in the following \(t\)-test:
t.test(tempData$female, tempData$male, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: tempData$female and tempData$male
## t = 2.2854, df = 127.51, p-value = 0.01197
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.07954459 Inf
## sample estimates:
## mean of x mean of y
## 98.39385 98.10462
This of course halves the \(p\)-value, because the \(t\) distribution is symmetric around zero and the observed difference lies in the direction of the alternative. However, because the \(p\)-value (0.01197) is still greater than 0.01, we cannot reject the null hypothesis at the \(\alpha = 0.01\) level.
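The relationship between the one-sided and two-sided \(p\)-values can be checked directly from the \(t\) distribution. A small sketch, plugging in the rounded statistic and degrees of freedom printed in the output:

```r
t_stat <- 2.2854   # t statistic from the Welch test output
df <- 127.51       # Welch degrees of freedom from the same output

# One-sided p-value: area in the upper tail only
p_greater <- pt(t_stat, df, lower.tail = FALSE)

# Two-sided p-value: both tails, which are equal by symmetry
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

print(c(one_sided = p_greater, two_sided = p_two_sided))
```

Had the observed difference pointed in the opposite direction, the one-sided \(p\)-value would instead be one minus half the two-sided value.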
Let’s look at the variance of the two groups.
var.test(tempData$female, tempData$male)
##
## F test to compare two variances
##
## data: tempData$female and tempData$male
## F = 1.1321, num df = 64, denom df = 64, p-value = 0.6211
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6905408 1.8561126
## sample estimates:
## ratio of variances
## 1.132131
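The \(F\) statistic reported here is simply the ratio of the two sample variances, compared against an \(F\) distribution with 64 and 64 degrees of freedom. A minimal sketch, assuming the group standard deviations obtained from `sd()` on each column (about 0.743 for females and 0.699 for males):

```r
# Group standard deviations (assumed from sd() on each column) and sizes
sd_f <- 0.7434878
sd_m <- 0.6987558
n_f <- 65
n_m <- 65

# F statistic: ratio of the sample variances
f_stat <- sd_f^2 / sd_m^2

# Two-sided p-value: twice the smaller tail area
upper_tail <- pf(f_stat, n_f - 1, n_m - 1, lower.tail = FALSE)
p_value <- 2 * min(upper_tail, 1 - upper_tail)

print(c(F = f_stat, p = p_value))
```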
The \(p\)-value for this test is quite high (0.6211), so we cannot reject the null hypothesis that the two groups have equal variances. Running the \(t\)-test again under the assumption of equal variance results in:
t.test(tempData$female, tempData$male, var.equal = TRUE)
##
## Two Sample t-test
##
## data: tempData$female and tempData$male
## t = 2.2854, df = 128, p-value = 0.02393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03882216 0.53963938
## sample estimates:
## mean of x mean of y
## 98.39385 98.10462
This gives basically the same result as the Welch test, but hey, at least I was thorough.
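The near-identical results are expected: with equal group sizes the pooled and Welch standard errors coincide, so only the degrees of freedom differ (128 versus 127.51). A sketch of the pooled calculation, again assuming 65 subjects per group and the group standard deviations from `sd()`:

```r
# Group summaries (SDs assumed from sd() on each column)
n_f <- 65
n_m <- 65
mean_f <- 98.3938462
mean_m <- 98.1046154
sd_f <- 0.7434878
sd_m <- 0.6987558

# Pooled variance: weighted average of the two sample variances
df <- n_f + n_m - 2
sp2 <- ((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / df

# Standard error of the difference and the pooled t statistic
se <- sqrt(sp2 * (1 / n_f + 1 / n_m))
t_stat <- (mean_f - mean_m) / se

# Two-sided p-value on n_f + n_m - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(t = t_stat, df = df, p = p_value), 5))
```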
Summary:
A two-sample \(t\)-test was performed to compare the mean body temperature between males and females. There was a significant difference at the \(\alpha = 0.05\) level in the mean body temperature between females (M = 98.39\(^\circ\)F, SD = 0.74\(^\circ\)F) and males (M = 98.10\(^\circ\)F, SD = 0.70\(^\circ\)F); \(t\)(128) = 2.2854, \(p\) = 0.02393.
The file gapC.csv contains socio-economic information for 173 countries from the GapMinder dataset. Each country has been assigned to one of seven geographical regions, roughly corresponding to the continents. Carry out a one-way analysis of variance of the life expectancy variable to look for differences across the different regions.
- State the null and alternate hypotheses
- Make a boxplot of life expectancy per continent
- Carry out the ANOVA and give the \(F\)-statistic and the \(p\)-value obtained
- On the basis of this state whether or not life expectancy varies across continents
gap <- read.csv("gapC.csv")
gap$continent <- factor(gap$continent)
summary(gap)
## country lifeexpectancy continent
## Length:173 Min. :47.79 AF :56
## Class :character 1st Qu.:62.64 AS :35
## Mode :character Median :72.97 EE :22
## Mean :69.20 LATAM:28
## 3rd Qu.:76.13 NORAM: 3
## Max. :83.39 OC : 8
## NA's :1 WE :21
The mean life expectancy for the entire data set is 69.20 years, with one missing value. The groups are far from balanced: sample sizes range from 3 (NORAM) to 56 (AF).
Displaying life expectancy by continent:
ggplot(gap) +
  # map continent to x so the axis matches its "Continent" label
  geom_boxplot(aes(x = continent, y = lifeexpectancy, fill = continent)) +
  labs(title = "Life expectancy by continent",
       x = "Continent",
       y = "Life Expectancy (years)") +
  theme_bw()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
Visually, there do appear to be differences in mean life expectancy among the continents. The null hypothesis is that mean life expectancy is the same on every continent, or:
\[H_0: \mu_{AF} = \mu_{AS} = \mu_{EE} = \mu_{LATAM} = \mu_{NORAM} = \mu_{OC} = \mu_{WE}\]
The alternative hypothesis is that the mean of at least one continent differs from the others.
Performing a one-way ANOVA gives:
aov(lifeexpectancy ~ continent, data = gap)
## Call:
## aov(formula = lifeexpectancy ~ continent, data = gap)
##
## Terms:
## continent Residuals
## Sum of Squares 9757.236 7141.470
## Deg. of Freedom 6 165
##
## Residual standard error: 6.578878
## Estimated effects may be unbalanced
## 1 observation deleted due to missingness
Most of the variation lies between the continents (a sum of squares of 9757.2 against 7141.5 for the residuals), but it is not as large as in the example of the DO measurements. Calculating the \(F\)-statistic:
summary(aov(lifeexpectancy ~ continent, data = gap))
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 6 9757 1626.2 37.57 <2e-16 ***
## Residuals 165 7141 43.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
results in a very high \(F\)-value (37.57) and an extremely low \(p\)-value (<2e-16), meaning that we can reject the null hypothesis, even at the \(\alpha = 0.001\) level, and conclude that the mean life expectancy of at least one continent differs from the others.
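The \(F\) value in the table can be reproduced from the sums of squares and degrees of freedom printed by `aov()`. A minimal sketch using those printed values (the variable names are illustrative):

```r
# Sums of squares and degrees of freedom from the aov() output
ss_between <- 9757.236
df_between <- 6
ss_resid <- 7141.470
df_resid <- 165

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between <- ss_between / df_between
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value
f_stat <- ms_between / ms_resid
p_value <- pf(f_stat, df_between, df_resid, lower.tail = FALSE)

# The residual standard error is the square root of the residual mean square
rse <- sqrt(ms_resid)

print(c(F = f_stat, p = p_value, rse = rse))
```

The between-group degrees of freedom are the number of continents minus one (7 - 1 = 6), and the residual degrees of freedom are the 172 non-missing observations minus the 7 groups.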
Summary:
We found a statistically significant difference in mean life expectancy according to continent (\(F\)(6, 165) = 37.57, \(p\) < 2e-16).