The purpose and the data

In this lesson we progress from data exploration (numerical summaries) to hypothesis testing using a small data set on the performance of two diets. These data contain information on 78 people using one of two diets and a control group. When there are more than two groups and you are interested in the differences between the groups you would (if assumptions are met) apply a one way analysis of variance, ANOVA.These lecture notes walk you through preparing the data, exploratory data analysis and undertaking statistical tests. Our final analysis is the ANOVA and associated post-hoc test.

Your unit outline lists the following reading for hypothesis testing https://moderndive.com/9-hypothesis-testing.html

Further reading on ANOVA and application in R can be found in Chapter 2 Mark Greenwood and Katharine Banner (unfinished text but a neat section on ANOVA) https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/54

The data are sourced from https://www.sheffield.ac.uk/mash/statistics/datasets and are described in the table below

Variable name Variable Data type
Person Participant number
gender 1 = male, 0 = female factor
Age Age (years) numerical
Height Height (cm) numerical
preweight Weight before (kg) numerical
Diet Diet factor
weight10weeks Weight 10 weeks(kg) numerical
weightLoss Weight lost (kg) numerical
load packages

Remember if your package does not load you need to

install (“package”)

require (tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
require(EnvStats)
## Loading required package: EnvStats
## 
## Attaching package: 'EnvStats'
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
## The following object is masked from 'package:base':
## 
##     print.default

read in the data frame

df_Diet<-read.csv("Diet_R.csv")

Last week you covered some of these commands to ‘take a peak’ at your data. Perhaps avoid using View when the data set is large. glimpse is a personal favourite as it details the structure of your data. Keep these commands handy or commit to memory as they are convenient for when ever you bring in data to the working environment or whenever you performa transformations on the variable in your dataframe.

## Choose on or two of the following to view the data source
View(df_Diet) ## opens the data frame in the source window
str(df_Diet)    # print the structure of the data frame in the console
head(df_Diet)   # gives the first 5 rows for all features
glimpse(df_Diet) # shows the structure of the data frame in a different format that str
class(df_Diet)  # list the class or type of an object ... in this case it is a data frame
summary(df_Diet) # Gives a summary of the data -- Quartiles for numeric, counts for factors and total characters for strings

Declaring factors

You can create a new object which is a vector of text (strings). In other statistical packages these were called namelist. It is a convenient way of performing a function for all features in the vector.

The “lapply” function stands for l=list apply. So it is a shorthand way of performing a loop.

  • for all i in the list MyFactors

    • is.factor(df_Diet$i)
  • stop when i= the number of elements in Myfactor

The code looks somewhat different though

MyFactors <- c("gender", "Diet") #create the name list
df_Diet[MyFactors] <- lapply(df_Diet[MyFactors], as.factor) #declare factors
str(df_Diet)
## 'data.frame':    78 obs. of  7 variables:
##  $ Person      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age         : int  22 46 55 33 50 50 37 28 28 45 ...
##  $ Height      : int  159 192 170 171 170 201 174 176 165 165 ...
##  $ pre.weight  : int  58 60 64 64 65 66 67 69 70 70 ...
##  $ Diet        : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weight6weeks: num  54.2 54 63.3 61.1 62.2 64 65 60.5 68.1 66.9 ...

Descriptive statistics

Here we will use the psych package. The psych package was developed at the Personality, Motivation and Cognition laboratory in the Department of Psychology at Northwestern University. The idea was to package statistical models most useful for personality, psychometric, and psychological research. It also a good package for reporting descriptive statistics as the format of the output is more in line with statistical software packages for social science (Minitab, SPSS etc).

The basic command is

where decribe(x) will take on all the default settings above.

You can see details on the function at https://personality-project.org/r/html/describe.html. Try the following code to see how the function works:

describe (df_Diet) #default for all variables
describe (df_Diet$Age) #default for one variable
describe(df_Diet$pre.weight, fast=TRUE) #just the basics
describe(df_Diet$pre.weight, skew = FALSE, ranges = FALSE, quant=TRUE)#customised output

Furthermore, you may report basic summary statistics by a grouping variable. The function is useful if you wish to explore the difference by a demographic factor (gender) or if the grouping variable is some experimental variable (Diet).

as before x is the dataframe or group of variables you wish to examine, group is the variable in which you would like to split the descriptive statistics output.

mat is for a matrix format or =NULL to present 2 separate tables.

The … implies that all other settings from the describe function may be applied here.

describeBy (df_Diet, group=df_Diet$gender) #default for all variables grouped by gender
describeBy (df_Diet$Age, group=df_Diet$gender, mat=TRUE) #desc for one variable, grouped in mat format

The aim of the analysis is to test whether either diet (Diet=2 and Diet=3)helped the respondents lose weight over the control group (Diet=1) and also decide which is the higher performing diet. A good place to start is to look at the descriptive statistics as grouped by Diet indicator variable in your data frame

#work in teams of 2 / 3 to decide which code is appropriate. Discuss your findings.
#hint: what should you measuring? Did you learn anything from Tristan that can make this clear in the output.
# as well as the table output is there a graph that you could provide that also helps show the differences between diets
Analytics extension task.

We have looked at some numerical summaries. The second and perhaps more important component of EDA is to visualise the data. During the week complete analytics extension tasks. These will application of skills covered in ‘last week’s’ class.

#create at the 3 visualisations of the data that you believe will help you - and the reader of your reported findings -- understand the data 

Hypothesis testing

Overview of hypothesis testing

An overall model and related assumptions are made (the most common being observations following a normal distribution)

  • The null (H0) and alternative (H1 or HA) hypothesis are specified. Usually the null specifies a particular value of a parameter.
  • With given data, the value of the test statistic is calculated.
  • Under the general assumptions, as well as assuming the null hypothesis is true, the distribution of the test statistic is known. Given the distribution and value of the test statistic, as well as the form of the alternative hypothesis, we can calculate a p-value of the test. Based on the p-value and pre-specified level of significance, we make a decision. One of:
  • Fail to reject the null hypothesis.
  • Reject the null hypothesis.

Preliminary data set up

Our analysis will be on weight loss. You need to create this variable in your dataframe (you should have completed this task in the previous exercise, but just in case). In addition we would like to test each diet level against the other and the control group we will make 3 new factors Control, Diet2 and Diet3. The following make use of your learning from week 2 on working with dataframes in R.

note that because we have already declared Diet to be a factor we cannot use (Diet = 1) as this is looking for a numerical variable. We can use the factor lables (in our case level 1 is “1” etc). The function uses the following syntax df_Diet<-df_Diet%>%mutate(control = ifelse(df_Diet$Diet==“1”,1,0)) to create a new dataframe with an extra variable = control, which is 1 when Diet factor is the first level = “1”.

df_Diet<-df_Diet%>%mutate(WeightLoss=pre.weight-weight6weeks)
df_Diet<-df_Diet%>%mutate(control = ifelse(df_Diet$Diet=="1",1,0))
df_Diet<-df_Diet%>%mutate(Diet2 = ifelse(Diet=="2",1,0))
df_Diet<-df_Diet%>%mutate(Diet3 = ifelse(Diet=="3",1,0))

Univariate tests of the mean (t tests) and variance (chi-sq)

Create variables for the some of the relevant statistics

The target group for our new diet is 20 to 60 year old adults. We want to know if the sampling done by our scientist meets this criterion. We’ll start with the mean which we will take to be the mid-point of our target range. The hypothesis test states (HO mu=40) that the sample is drawn from a population with an average age of 40. We are calculating the sample average to check if our sample does not comply (H1 mu not= 40).

First calculate the statistics for the whole frame and store these as constants (vectors of size 1) and then repeat fro one group. We will test whether that group is different from the average. The results indicate that the sample average is somewhat older than the target population of 40 years of age.

x_bar_age <- mean(df_Diet$Age) 
s_age <- sd(df_Diet$Age) 

To test for significance we use the t-test. It is a two sided test and we will set the level of significance to 0.05

t.test(df_Diet$Age ,mu=40,alternative = c("two.sided"),conf.level = 0.95)

However we cannot rely on the sample average. We also want our sample to be contained within the range that marketing has specified. Here the obvious thing to do would be to only accept people between the ages of 20 to 60. However, this would have been a costly exercise to undertake (the more detailed your sampling requirements are the harder it is to find a participant who meets the requirements). It was decided that as long as we get the bulk of the sample +/- 2 deviations within that range we feel that sample meets the target as set by marketing. This implies that (H0 sigma=10) the sample was drawn from a population with ages range of approximately 20 years or (H0 sigma=<10). We will use the chi-sq test to see if our data is statistically unlikely (H1 sigma > 10) to have been drawn from a the null population.

One of the problems with this analysis is that

varTest(df_Diet$Age , alternative = "less", conf.level = 0.95, sigma.squared = 100)

One of the problems with this analysis is that the sample maybe skewed with some subjects being 60+ but no subjects being under 20-25. In a more complete exploratory data analysis we would have examined the distribution of the age.

Analytics extension task
#overlay a kernal density plot of age on a uniform distribution between 20 and 60. 
#you will have to Google how to generate a uniform distribution or you can use the following cheat
#there are 78 subjects and the range is 40 so the height is about 1.95. create a new variable x=1.95 
# and plot this along with the 

Testing gender is balanced

The last test for a random sample (representative is a better term) examines if the number of females and males are equally represented. Because our dependent variable is a categorical variable we do not use the t-test but a test of proportions. Our null (H0) is proportion F = 0.5 and our alternate hypothesis (HA) is the prop F is not = 0.5.

df_Diet$gender= ifelse(df_Diet$gender=="1","F","M") #relabel gender
table(df_Diet$gender) #count the number of F and M, table displays these counts for factors
prop.test(x=33, n=78, p=0.5, alternative="two.sided", conf.level=0.90) # note similar format to t-test

It would seem that despite the sample being unbalanced that there is no evidence to reject that the sample is not a random draw from the target population (pob F=0.5). However this is not to say that our sample is balanced and if we were to believe that gender is a control variable then we would aim to have a more balanced sample. i.e. we would look to top up the sample with more females.

Testing for differences in the means between groups

In this section we are interested in differences between groups. If there are only two groups we use a t-test for differences in means and and F test for differences in variance. If there are more than two groups we will do a one-way ANOVA for differences in means (note we can undertake a test of variances using a Bartlett or Levene test, but that is an extension task)

Is our entry weight for our control group different from the dieters?

Here we ask the question, have we randomly sorted our subjects and in doing so not introduced bias in the starting conditions (current weight). This is a two-independent sample t-test.

t.test(df_Diet$pre.weight~df_Diet$control,mu=0,alternative = "two.sided",conf.level = 0.95,  var.equal = TRUE) #test of means assumes equal variances. If the test of variances rejects the null we could change the argument to var.equal = FALSE, see below that we do not rej HO on the test of variances.
Analytic extension task

You also want to test there is no difference between Diet2 and Deit3. However, you need to filter out the control group first.

#use filter to select the rows that correspond to subjects in the treatment groups
#run an independent t-test for means between Diet2 and Diet3
Analytic extension task

Strictly speaking the t-test do rely on normally distributed data. Our weight data is skewed – see earlier – a rigorous approach would be to explore this is EDA numerical and graphical. Furthermore test for normality - shapiro-Wilk normality test – and is there is a skew one could take the log transformed data and run t-test on the transformed variable. Run the code in your own time.

shapiro.test(df_Diet$pre.weight) #test for normality. results indicate a weak reject of null p=0.056
df_Diet$logpreweight= log(df_Diet$pre.weight) # log transformation
shapiro.test(df_Diet$logpreweight) # retest for normality on the log-transform.
#EDA plot pre.weight and overlay a normal dist. 

Testing for differences in the variance between groups

Strictly speaking the test above assumes the variance for each treatment/control group is the same. We can test for differences in variances using an F-test. The null hypothesis is that the ratio of the variances equals one indicating the variances are equal (Ho Var1=Var2). The alternate hypothesis (H1 var1 not = var2) would indicate that ratio is not equal to one. Typically we place the sample with the larger variance on the numerator so we are test F values > 1.

var.test (df_Diet$pre.weight~df_Diet$control,ratio=1,alternative = "two.sided",conf.level = 0.95) #test of = var is not rejected and therefore no need to run t-tests. 

Paired sample t-test

We are finally coming to the point. Did the subjects lose weight? In our first analysis we will filter the sample to look at only the Dieters. Then run a paired sample t-test to test our assumption that weight6weeks < pre.weight. The null hypothesis H0 is that the after weight is no less than the subjects’ weight before the diet (H0 weight6weeks >= pre.weight) with the alternate being that weight loss has occurred (H1 weight6weeks < pre.weight)

First we use the inbuilt function t-test paired sample.

df_Diet_only <- df_Diet%>%filter(control==0)
t.test(df_Diet_only$pre.weight, df_Diet_only$weight6weeks , paired = TRUE, alternative="greater", conf.level=0.90)

Please notes that Paired t-test analysis is performed as follow:

  • Calculate the difference (d) between each pair of value
  • Compute the mean (m) and the standard deviation (s) of d
  • Compare the average difference to 0. If there is any significant difference between the two pairs of samples, then the mean of d (m) is expected to be far from 0.

Using this information we can simply find the change in weight (weightLOSS) and run a univariate test on the constructed variable.

t.test(df_Diet_only$WeightLoss ,mu=0,alternative = "greater",conf.level = 0.95)
Analytic extension task

Think about what we have here: the difference variable is all wee need to work with. Please use you learning to determine if there is a difference in WeightLoss between to two diets.

Testing the differences in means for more than two groups using ANOVA

In our final analysis we will see if there is a difference in means between the control group, Diet2 and Diet3. This is the appropriate test for these data and all the steps leading up to this point was merely exploratory data analysis. There are two parts to the analysis firstly by examining the differences between groups: the null is that the means are equal for all groups (HO mu control = mu Diet2 = Mu diet3) with the alternate (H1) being that not all means are equal. Note that this is a two sided test and does not test whether one mean is less or greater than another. Furthermore it does not indicate which of the means are different. To aid the discussion we can use DescribeBy to see which of the diets have a higher or lower average weight loss. This helps with direction. Lastly we can use a post-hoc test to pairwise compare the treatment and control groups.

The other learning point here is that we will often save models as an object. In this way we can print results at any stage, but more importantly we can run additional code on the model object to investigate the relationships in the data to a further degree (i.e. examine residuals in a regression or perform post-hoc analysis in an ANOVA)

res.aov <- aov(WeightLoss ~ Diet, data = df_Diet)# Compute the analysis of variance and save the model object

summary(res.aov)# display the summary of the analysis

At last we come to the question. Do these diets work? Which Diet performed better? Was it significantly better?

TukeyHSD(res.aov)# Computes pairwise comparisons that consider the 'total variance'. 

The post-hoc tests (Tukey is not the only one) differ from a sequence of pairwise t-test but accounting to the total variance and degrees of freedom. The motivation for post-hoc tests is that running multiple independent sample tests may lead to suprious rejections of the null hypothesis. If had you had 7 groups then there would be 21 pairwise tests and at the 0.05 confidence level you would expect one of these tests to be rejected at the 5% level. Posthoc tests take this into account and correct for the fact you are running multiple tests see https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/59

Summary

This week we covered hypothesis testing. We concentrated on the main tests that any first year statistics student should know. Univariate t-tests and Chi-sq test apply to one random variable (feature) and are used to test the mean and variance in the data against an assumed population parameter (mu or sigma^2). Two sample tests may be independent (i.e. test of weights between control and treatment groups) or pairwise (two observations on the same subject such as before and after weight). Note that wen looking at before and after technically we are examining the differences. We then examined the three groups (more than two groups) and tested for a difference in means between groups. The one Way ANOVA confirmed that there is a difference but it did not inform us on which group has the the greatest change (highest means). We use a post-hoc test (TukeyHSD) to examine the differences.

How does this fit into our journey

In the first two weeks you became familiar with R and working with dataframes. In week 2 you worked with ggplot to visualise the data. The other task to do when examining / communicating the features of a data set is to give its numerical summaries (measures of central tendency, spread and distribution). We covered that in the first part of the lesson on descriptive statistics. You should now have a complement of skills that include – working in R studio, working with data frames (dplyer) and calculating statistics and generating visualisations. Please note that we cannot show you every possible trick in the book. You will have to use Google / github / stackoverflow to solve new problems or find new code as the need arises.

Looking ahead. The next two weeks covers regression based models and these rely heavily on partitioning variances and providing statistical output that align with t-tests, F-tests and ANOVA. Our theoretical 9insight from this week will hold us in good stead.