Heights Data

Author

Dr Andrew Dalby

Background

This is a set of data that I have used for many years to teach data analysis in SPSS. The data set contains the heights of three groups of children. Two groups would be a simpler example, focused on t-tests, but three groups allow me to teach about the narrative behind the study as well as showing how to use different programs to create the required output. This is a very simple data set but it has a surprising number of uses.

This data allows me to show students:

  1. Categorical and quantitative data.

  2. Data entry.

  3. Long vs wide formats of the data.

  4. Creating graphical representations.

  5. Summary statistics.

  6. The importance of the research question and the underlying narrative.

  7. ANOVA.

  8. The different software approaches.

The data

Data on thirty children’s heights (cm)
Group 1  Group 2  Group 3
     93      105      100
     95      107      101
    101      110      103
    103      110      107
    108      115      111
    111      118      113
    114      120      115
    115      120      115
    115      123      118
    117      126      125

Creating the Data

The easiest way would be to create the data as a text file with either commas between the column values (a comma-separated values file, CSV) or tabs (a tab-separated values file, TSV). This then creates a discussion about wide or long data.
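
For example, a wide-format CSV version of the table above could look like the following (the column names here are just one possible choice):

Group1,Group2,Group3
93,105,100
95,107,101
101,110,103
103,110,107
108,115,111
111,118,113
114,120,115
115,120,115
115,123,118
117,126,125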

Wide data

As tabulated above, the data is in wide format. Each column corresponds to data from a particular group BUT the rows do not have any meaning. I cannot read across the three groups and assume any relationship in each of the ten rows. They are just arbitrary.

This is how people commonly set up Excel files with columns for each of the groups. This is not the best practice unless the rows really do have some meaning.

For example, here the values are arranged in height order from shortest to tallest. If these were all of the students in three schools, comparing the ordered heights might make sense depending on how you structured the study (although it doesn’t make much sense here).

If the rows correspond to blocking - such as multiple measurements of the same experimental unit at multiple time points - then, even if you are measuring a single variable, it makes sense to have the measurements in different columns because they are matched.

Long data

If you think about how many variables there are in the data set, there are TWO. There is height, which is a quantitative and continuous variable measured in cm, and there is group, which is a nominal categorical variable. The groups are labelled 1, 2 and 3 but the numbers have no particular meaning; they are just a way of encoding the groups. Each group could correspond to a different school, for example.

If we want to format the file so that each row corresponds to the data for an individual in the study then there will be two columns and 30 rows of data. For this set of data that is much better practice.

# First I create a vector that contains all of the height values
height <- c(93,95,101,103,108,111,114,115,115,117,105,107,110,110,115,118,
            120, 120, 123,126,100,101,103,107,111,113,115,115,118,125)
#Next I create a vector that contains the groups for each of the heights.
group <- c(rep(1,10),rep(2,10),rep(3,10))


#Now I combine them both together to create a dataframe.
data <- data.frame(group,height)

#I assign appropriate column names
colnames(data) <- c("Group","Height (cm)")

#I define Group as a factor that has three levels: 1, 2 and 3. This means that the values are now treated as categorical and not as numerical.
data$Group <- as.factor(data$Group)

#I write it as a comma separated variable file. This can be opened with Excel and SPSS
write.csv(data, "year_month_day_height.csv", row.names=FALSE)
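
One optional sanity check is to read the file back in and look at its structure. Note that read.csv() will bring Group back in as a plain number, so it would need to be converted to a factor again, and check.names=FALSE keeps the "Height (cm)" column name intact.

#Read the file back in to confirm its contents (a quick sanity check).
check <- read.csv("year_month_day_height.csv", check.names=FALSE)
str(check)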

You can write SPSS files from R if you install the haven library. As you made Group into a factor it will be coded as a nominal variable and not a scale variable in SPSS.

library(haven)

#You need to have column names compatible with SPSS which means removing the units.
colnames(data) <- c("Group","Height")
write_sav(data, "year_month_day_height.sav")
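
If you want to confirm how the variables were stored before opening the file in SPSS, haven can also read the .sav file back into R; a minimal check might look like this:

#Read the SPSS file back in and inspect how the variables were stored.
check_sav <- read_sav("year_month_day_height.sav")
str(check_sav)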

If you have the openxlsx2 library installed you can also write the file to Excel.

#| warning: false
library(openxlsx2)
colnames(data) <- c("Group","Height (cm)")
openxlsx2::write_xlsx(data, file="year_month_day_height.xlsx", colNames=TRUE)

Visualising the Data

I cannot emphasize the need to do this enough. The first thing that you always need to do with your data is look at it. In this case you have one measurement, height, which is a continuous variable, and a second categorical grouping variable.

A sensible way to display this data is a boxplot, which will show whether there are any outliers in the groups and also give you an idea of whether there is likely to be a difference between the groups if you choose to carry out a null hypothesis statistical test.

You can create a boxplot in base R simply from the data.

boxplot(data$`Height (cm)`~ data$Group, xlab="Group", ylab="Height (cm)", main = "Boxplot of Children's Heights from 3 Groups of Measurements", col=c("cadetblue","lightblue","dodgerblue"))

Note the Height ~ Group format. This is a common way in R of specifying the relationship between two variables. The variable on the left is the dependent variable and the one on the right is the independent variable; the first is plotted on the y-axis and the second on the x-axis.
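
An equivalent way to write the same call is to pass the data frame through the data argument, so the columns can be referred to by name (the backticks are needed because the column name contains a space):

boxplot(`Height (cm)` ~ Group, data=data, xlab="Group", ylab="Height (cm)")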

Within R there are also more visually impressive alternatives to the base graphics if you use the ggplot2 library. In this case the mean is also indicated on the boxplot.

library(ggplot2)
colnames(data) <- c("Group","Height")
d <- ggplot(data, aes(x=Group, y=Height))
d + geom_boxplot(fill="lightblue")+
  stat_summary(fun = mean, geom = "point",
               shape=18, size =2.5, color="purple4")

Or to match the colour scheme from before.

library(ggplot2)
d <- ggplot(data, aes(x=Group, y=Height))
d + geom_boxplot(aes(fill=Group))+
  scale_fill_manual(values=c("cadetblue","lightblue","dodgerblue"))+
  stat_summary(fun = mean, geom = "point",
               shape=18, size =2.5, color="purple4")+
  theme(legend.position = "none")

As a more recent alternative to the boxplot there is also the violin plot, which lets you see the density of the data alongside the boxplot summary.

library(ggplot2)
d <- ggplot(data, aes(x=Group, y=Height))
d + geom_violin(aes(fill=Group),trim = FALSE)+
  geom_boxplot(width = 0.2)+
  scale_fill_manual(values=c("cadetblue","lightblue","dodgerblue"))+
  stat_summary(fun = mean, geom = "point",
               shape=18, size =2.5, color="purple4")+
  theme(legend.position = "none")

Summarising The Data

You can calculate a summary of all of the data, as well as summary statistics such as the mean and standard deviation, both overall and for each group.

summary(data$Height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   93.0   105.5   112.0   111.1   116.5   126.0 
mean(data$Height)
[1] 111.1333
sd(data$Height)
[1] 8.431195
tapply(data$Height, data$Group, summary)
$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   93.0   101.5   109.5   107.2   114.8   117.0 

$`2`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  105.0   110.0   116.5   115.4   120.0   126.0 

$`3`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  100.0   104.0   112.0   110.8   115.0   125.0 
tapply(data$Height, data$Group, mean)
    1     2     3 
107.2 115.4 110.8 
tapply(data$Height, data$Group, sd)
       1        2        3 
8.727988 7.121173 8.038795 
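
If you would rather have the group means, standard deviations and counts in a single table instead of separate tapply() calls, one possible alternative (a sketch using base R's aggregate()) is:

#Summarise Height by Group in one call.
aggregate(Height ~ Group, data=data,
          FUN=function(x) c(mean=mean(x), sd=sd(x), n=length(x)))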

Null Hypothesis Statistical Testing

There are three groups and a single measured variable. This is the simplest case of a one-way ANOVA to compare the means of the different groups. In R you use the same formula format that we used before with the boxplots to specify the relationship between height and group.

model1 <- aov(Height ~ Group, data=data)
summary(model1)
            Df Sum Sq Mean Sq F value Pr(>F)  
Group        2  337.9  168.93   2.646 0.0892 .
Residuals   27 1723.6   63.84                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the F-test is 0.0892, which is not significant at the 0.05 (5%) level.

This is where the narrative and the question that you wanted to answer are important, because they define what your hypothesis is and which test you will carry out.

Here I did something general as there was no narrative suggesting that I should do something else.

Now imagine that the study compares one school from a wealthy area with two schools from poorer, more deprived areas, and that my hypothesis is that deprivation will affect the height of the students from the poorer areas.

In that case the test that I need to perform is Dunnett’s test, where the data from the wealthy school is the control group; in this case imagine that it is group 2.

library(DescTools)
DunnettTest(x=data$Height, g=data$Group, control=2)

  Dunnett's test for comparing several treatments with a control :  
    95% family-wise confidence level

$`2`
    diff    lwr.ci    upr.ci   pval    
1-2 -8.2 -16.53837 0.1383661 0.0543 .  
3-2 -4.6 -12.93837 3.7383661 0.3445    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case the result is still not significant, but it is closer to the threshold.

This is core to why people say that there are lies, damned lies and statistics. The issue is that people often change the hypothesis and the test that they apply after they get the data. They are fishing to try and find a significant result. This weakens science because it increases the likelihood of producing irreproducible results, as you are increasing the chances of making a type I error.

If I did some more fishing I might just decide to drop the third group from the results that I present and keep only group 1 and group 2, as these look the most promising for finding a significant result based on the Dunnett’s test. I will also use a one-sided hypothesis that group 1 will be shorter than group 2.

new <- subset(data, Group ==c(1,2))
t.test(new$Height ~ new$Group, alternative="less")

    Welch Two Sample t-test

data:  new$Height by new$Group
t = -1.8576, df = 7.7411, p-value = 0.05077
alternative hypothesis: true difference in means between group 1 and group 2 is less than 0
95 percent confidence interval:
       -Inf 0.05409785
sample estimates:
mean in group 1 mean in group 2 
          106.2           116.2 

This is even closer, and now it involves me suppressing data that disagreed with my hypothesis by removing group 3. This is all too easy for people to do, and we need to make sure that published studies actually follow their intended designs.