We will investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data.

library(tidyverse)
library(here)
library(ggfortify)

The Data

We will be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

https://www.openintro.org/book/statdata/?data=bdims

bdims<-read_csv(here("data","bdims.csv"))
glimpse(bdims)
## Rows: 507
## Columns: 25
## $ bia.di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40…
## $ bii.di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29…
## $ bit.di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33…
## $ che.de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20…
## $ che.di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30…
## $ elb.di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13…
## $ wri.di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10…
## $ kne.di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19…
## $ ank.di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14…
## $ sho.gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 111.0,…
## $ che.gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, …
## $ wai.gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 77.5, 81…
## $ nav.gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81…
## $ hip.gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 9…
## $ thi.gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60…
## $ bic.gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34…
## $ for.gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27…
## $ kne.gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38…
## $ cal.gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40…
## $ ank.gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23…
## $ wri.gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16…
## $ age    <dbl> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22…
## $ wgt    <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 76…
## $ hgt    <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0,…
## $ sex    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
bdims<-bdims %>%
  mutate(sex=ifelse(sex==1,"M","F")) %>%
  mutate(sex=as.factor(sex))

Are men taller than women?

There are different ways to investigate this, from the data.

We can find the mean and standard deviation of the heights of each gender

heights<-bdims %>%
  group_by(sex) %>%
  summarise(mean=mean(hgt),st.dev=sd(hgt))
heights

or we could do a box plot

ggplot(bdims,aes(x=sex,y=hgt))+
  geom_boxplot()+
  ylab("Height (cm)")+
  theme_bw()

It would appear so, in general, but we can also see that if we had chosen any one male and any one female, then we might have found that the female was taller than the male.

Note how each of the boxes are remarkably symmetric - for each gender the median is at the centre and the data are symmetrically distributed around it.

This means that the data are likely normally distributed.

We can get a better sense of that by plotting the data as a histograms:

ggplot(bdims,aes(x=hgt))+
  geom_histogram(bins=20)+
  xlab("Height (cm)")+
  facet_wrap(~sex,ncol=1)+
  theme_bw()

Many (but not all) numerical data in nature are normally distributed.

What does this distribution look like?

x<-seq(-4,4,0.01)
y<-dnorm(x)
df<-tibble(x=x,y=y)
ggplot(df,aes(x=x,y=y))+
  geom_line()+
  theme_bw()

Properties of the normal distribution

  1. It is a symmetrical distribution
  2. It is characterised by a mean value and a standard deviation
  3. For a standard normal the mean is 0 and the standard devation is 1
  4. 68% of the data lie within one standard deviation of the mean
  5. 95 of the data lie within two standard deviations of the mean
  6. 99.7% of the data lie within three standard devations of the mean.

Does this appear to be roughly true for the heights of males and females?

qq plots as a test for normality

A good way to tell whether a set of numbers is normally distributed is to do a ‘qqplot’.

If the data are normally distributed, they will lie on a straight line.

p <- ggplot(bdims, aes(sample = hgt)) +
  stat_qq()+
  stat_qq_line()+
  facet_wrap(~sex)
p

### t-test for difference Since the height data are normally distributed, we can use a t-test to determine whether there is difference between male and female heights:

First, lets pull the male and female heights out as separate vectors:

 male_heights<-bdims %>%
 filter(sex=="M") %>%
 select(hgt) %>%
 pull(hgt)

 female_heights<-bdims %>%
 filter(sex=="F") %>%
 select(hgt) %>%
 pull(hgt)
t.test(female_heights,male_heights)
## 
##  Welch Two Sample t-test
## 
## data:  female_heights and male_heights
## t = -21.059, df = 494.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.07406 -11.67201
## sample estimates:
## mean of x mean of y 
##  164.8723  177.7453

What do we conclude from this?

### Replicates

Let us treat this group of men and women as ‘the population’ and decide to investigate whether men are taller than women

SUpoose we did this by taking 1 woman at random and 1 man at random. What would we find?

sample(female_heights,1)
## [1] 165
sample(male_heights,1)
## [1] 177.8

Wit this data set it is possible that we would happen on a woman that was taller than the man we had picked. This experiment would give us a misleading idea of the population from which they were drawn.

What if we took sample of 10 men and 10 women and found the mean heights of these samples?

mean(sample(female_heights,10))
## [1] 166
mean(sample(male_heights,10))
## [1] 177.99

What if we took sample of 100 men and 100 women and found the mean heights of these samples?

mean(sample(female_heights,100))
## [1] 164.926
mean(sample(male_heights,100))
## [1] 178.196

Do you see how, the larger the sample we take, the average of that sample becomes more stable and more likely to represent the population from which it is drawn.

This is the value of taking replicates.

Exercise

Investigate the question: Do men weigh more than women?

Null hypothesis : Men and women weigh the same

Adapt the code above to determine whether men and women’s weights are normally distributed.

If so, carry out a t-test to determine whether there is a difference between the weights of men and those of women.

The random sample and cluster plots used in the lecture

x<-sample(1:1000,100,replace=TRUE)
y<-sample(1:1000,100,replace=TRUE)
df<-tibble(x=x,y=y)
ggplot(df,aes(x=x,y=y)) +
  geom_point()+
  theme_bw()

centres_x<-sample(x,10,replace=TRUE)
centres_y<-sample(y,10,replace=TRUE)

xs<-rep(0,100)
ys<-rep(0,100)
for (i in 1:10){
  for (j in 1:10){
    xs[(i-1)*10+j]=centres_x[i]+rnorm(1,centres_x[i],20)
    ys[(i-1)*10+j]=centres_y[i]+rnorm(1,centres_y[i],20)
  }
}

df2<-tibble(x=xs,y=ys)
ggplot(df2,aes(x=x,y=y)) +
  geom_point()+
  theme_bw()

```