This guide contains the most common types of tests used in statistics and the R code needed to run them.
The most common tests are t-tests, ANOVA (Analysis of Variance), Chi-square tests, correlation tests, and regression analysis. Z-tests are also used, but they are less common because they require the population standard deviation, which is rarely known and must otherwise be assumed, so they are not included in this guide.
To start, you need to know what data you have. Is it quantitative or categorical?
Quantitative variables represent a numerical value (height, weight, age, etc.). These can be further divided into discrete or continuous data.
- Discrete: can only take on specific, whole numbers
- Continuous: can take on any value within a range.
Because most data is continuous, we will only discuss tests for continuous data in this guide.
Categorical data represents a category or group based on descriptions (gender, hair color, favorite food, etc.).
If you are using only categorical data, meaning both your independent and dependent variables are categorical, you are limited to using a chi-square test.
X: Categorical
Y: Categorical
The chi-square test generally uses a single categorical independent variable, which can be split into two or more categories. This test is used when you want to compare observed frequencies in categorical data to expected frequencies. Your X variable can be categorized between or within subjects.
A chi-square test between subjects compares categorical data across different groups of participants, each of which experiences only one condition. A chi-square test within subjects analyzes data where the same participants are measured multiple times across different conditions.
Examples:
- Is the distribution of colors in a bag of candies even?
- Is there a relationship between gender and movie preference?
- Is the proportion of people who recycle different across intervention groups?
Steps
1. Create contingency table
2. Run test
3. Interpret results
R Code
subdata = table(data$variable1, data$variable2) = creates a contingency table from your two categorical variables
chisq.test(subdata)
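To make this concrete, here is a minimal, self-contained sketch using the movie-preference example (the survey data frame and its gender and preference columns are made up for illustration):
# Hypothetical example: is movie preference related to gender?
set.seed(1)
survey = data.frame(
  gender = rep(c("M", "F"), each = 50),
  preference = sample(c("action", "comedy", "drama"), 100, replace = TRUE))
subdata = table(survey$gender, survey$preference)  # 2 x 3 contingency table
chisq.test(subdata)                                # Pearson's chi-squared test
Because these simulated data are generated independently of gender, the test will usually (correctly) find no association.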
Now, let’s say that your independent variables are categorical, but your dependent variables are continuous.
X: Categorical Between
Y: Continuous (1 Variable)
If you only have one dependent variable, you would use an independent t-test. This test is used to determine whether there is a difference between two independent, unrelated groups. For example, you want to test whether a new fertilizer improves corn yield compared to a traditional one.
There are six assumptions that must be met to use this test:
Y is continuous (yield, weight, height, etc.)
X consists of two categorical, independent groups (dichotomous). A dichotomous variable can be ordinal or nominal.
The observations are independent: each subject appears in only one group, and no subject influences another.
The first three assumptions relate to study design and how variables were measured. They MUST be met. The next three check your data rather than your study design.
There should be no significant outliers. A value is considered a potential outlier if it lies outside the interval
\[I = [q_{0.25} - 1.5 \times IQR;\ q_{0.75} + 1.5 \times IQR]\]
Use boxplot.stats(data$variable)$out to extract the values of the potential outliers. The which() function can help you extract the row numbers corresponding to the outliers, i.e. which(data$variable %in% c(out)).
The dependent variable should be approximately normally distributed for each category of your independent variable.
There needs to be homogeneity of variance. This means the population variance for each category of the independent variable is the same. A sketch for checking these three data assumptions follows.
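A minimal sketch of these checks in R, assuming a data frame named dataframe with a continuous column dv and a two-group column iv (all names here are placeholders):
# Outliers: values outside the 1.5 x IQR interval above
out = boxplot.stats(dataframe$dv)$out
which(dataframe$dv %in% c(out))  # row numbers of potential outliers
# Normality: Shapiro-Wilk test within each group (p > 0.05 suggests normality)
by(dataframe$dv, dataframe$iv, shapiro.test)
# Homogeneity of variance: F test for two groups (p > 0.05 suggests equal variances)
var.test(dv ~ iv, data = dataframe)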
Steps
1. Check assumptions.
2. State null and alternative hypotheses.
3. Calculate standard deviation for groups separately.
4. Set the significance level.
5. Calculate test statistic.
R Code
t.test(dv ~ iv, mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, data = dataframe)
dv = dependent_variable
iv = independent_variable
var.equal = default is “FALSE”. This is only set to “TRUE” if the data meet the assumption of homogeneity of variances. If the variances are not equal, the Welch approximation to the degrees of freedom is used.
paired = default is “FALSE”. Only set to “TRUE” when doing a paired t-test.
conf.level = default is 0.95.
data = the data frame that the independent samples t-test is run on.
This will give you important results including:
- t-value (t): measures the size of the difference relative to the variation in the sample data (aka the calculated difference represented in standard error). Larger t = larger difference between sample sets.
- Degrees of freedom (df): number of independent pieces of information used to calculate the test statistic. For an independent t-test this is n - 2 (2 samples).
- Statistical significance (p-value)
- 95% confidence interval (CI) of the mean difference
- Mean value of each group
Generating descriptive statistics
This will generate the mean, standard deviation, and sample size for each group. It uses the dplyr package, so load it first with library(dplyr).
object = dataframe %>% group_by(iv) %>% summarise(mean = mean(dv), sd = sd(dv), n = n())
%>% = simply means “then”. This tells R to take the data on the left and then apply the operation that follows.
group_by(iv) = This groups the data on the iv (independent
variable) specified by the parentheses.
summarise = run the descriptive statistics between the
parentheses.
mean = mean(dv) = generates the mean score of the dv
(dependent variable)
sd = sd(dv) = generates the standard deviation of the
dv
n = n() = generates the sample size
Mean Difference
This tells you the difference between the two group means generated
above.
mean_difference = object[1,2] - object[2,2]
object[a,b] = finds the statistics in row a and column
b
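Putting it all together, here is a self-contained sketch of the fertilizer example with simulated data (all numbers and object names are invented for illustration):
library(dplyr)
# Simulated corn yields for two fertilizers
set.seed(42)
dataframe = data.frame(
  fertilizer = rep(c("new", "traditional"), each = 20),
  yield = c(rnorm(20, mean = 185, sd = 10),   # new fertilizer plots
            rnorm(20, mean = 175, sd = 10)))  # traditional fertilizer plots
# Welch independent t-test (var.equal = FALSE is the default)
t.test(yield ~ fertilizer, data = dataframe)
# Descriptive statistics per group, then the mean difference
object = dataframe %>% group_by(fertilizer) %>%
  summarise(mean = mean(yield), sd = sd(yield), n = n())
object
object[1, 2] - object[2, 2]  # difference between the two group means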
X: Categorical Within
Y: Continuous (1 Variable)
The paired t-test is used to determine whether the means of two related groups are the same. In other words, there are two categories within subjects. For example, you want to determine whether a new drought-resistant wheat variety improves grain yield under low-water conditions compared to a conventional variety. To control for natural differences between subjects and locations (field/soil types), you will use a paired t-test.
Four assumptions for using a paired t-test:
Your Y variable should be continuous
Your X variable should consist of two categorical, related groups. This means the same subjects are present in both groups.
There should be no significant outliers.
The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.
R Code
t.test(dv ~ iv, data = dataframe, paired = TRUE, conf.level = 0.95)
paired = this is set to “TRUE”.
conf.level = this sets the confidence level of the interval.
The results from R are similar to those of an independent t-test.
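A minimal sketch of the wheat example with simulated data. Here the two measurements are passed as vectors, an equivalent alternative to the formula interface (all numbers are invented):
# Hypothetical paired data: grain yield for two wheat varieties
# grown side by side in the same 15 fields
set.seed(7)
conventional = rnorm(15, mean = 50, sd = 5)                      # yield per field
drought_resistant = conventional + rnorm(15, mean = 3, sd = 2)   # same fields
# Paired t-test: each field contributes one pair of measurements
t.test(drought_resistant, conventional, paired = TRUE, conf.level = 0.95)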
Okay, so great. Now, let’s say your independent variable is still categorical, but you have more than 2 categories between or within subjects. You only have 1 continuous dependent variable. This is where ANOVA comes in.
ANOVA, or analysis of variance, is used to compare the means of two or more groups. There are a few types of ANOVA, but since we covered most statistical terms in the sections above, we will be quick here.
X: Categorical Between
Y: Continuous (1 Variable)
Used when comparing the means of three or more independent groups to determine whether there is a significant difference between them. For example, you want to test three fertilizers on corn yield.
Design:
- Randomly assign the fertilizers to multiple corn plots under the same environmental conditions
- Apply fertilizers and maintain the plots
- Harvest and record the yield
- Perform a one-way ANOVA
R Code
res.aov = aov(y ~ x, data = dataframe)
aov = ANOVA. The results of this will tell you whether there is a difference in means, but we don’t know which pairs of groups are different, so we introduce a pairwise comparison if the ANOVA is significant.
TukeyHSD(res.aov)
TukeyHSD = performs a multiple pairwise-comparison between
the means of groups.
The results of this will include:
- diff: difference between the means of the two groups
- lwr, upr: lower and upper end points of the confidence interval at 95% (or whatever confidence level you set)
- p adj: p-value after adjustment for the multiple comparisons
You can then conduct a multiple comparison with the multcomp package (loaded with library(multcomp)).
summary(glht(model, linfct = mcp(group = "Tukey")))
glht = general linear hypothesis test
model = fitted model, like an object returned by aov()
linfct = specification of the linear hypotheses to be tested. Multiple comparisons in ANOVA are specified by objects returned from the function mcp(). Here, group should be replaced by the name of your grouping factor.
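For example, if the model was fitted with a grouping factor named fertilizer (a hypothetical name), the call would look like this:
# assuming a fitted model like: res.aov = aov(yield ~ fertilizer, data = dataframe)
library(multcomp)
summary(glht(res.aov, linfct = mcp(fertilizer = "Tukey")))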
Or you can conduct a pairwise t-test.
pairwise.t.test(data$variable, data$group, p.adjust.method = "BH")
p.adjust.method = the p-value adjustment method; "BH" is the Benjamini-Hochberg method.
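To make this concrete, here is a self-contained sketch of the three-fertilizer example with simulated data (all numbers and names invented):
# Hypothetical data: corn yield under three fertilizers
set.seed(1)
dataframe = data.frame(
  fertilizer = rep(c("A", "B", "C"), each = 15),
  yield = c(rnorm(15, 170, 8), rnorm(15, 182, 8), rnorm(15, 178, 8)))
res.aov = aov(yield ~ fertilizer, data = dataframe)
summary(res.aov)    # overall F test: do any group means differ?
TukeyHSD(res.aov)   # if significant, which pairs of fertilizers differ?
pairwise.t.test(dataframe$yield, dataframe$fertilizer, p.adjust.method = "BH")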
X: Categorical Within
Y: Continuous (1 Variable)
A repeated measures ANOVA is used when the same subjects (experimental units) are measured multiple times under different conditions or time points. For example, you want to evaluate the effect of different forage diets on cattle weight over time. The repeated-measures ANOVA controls for individual animal variation, gains statistical power from tracking the same animals over time, and detects trends over time, not just differences at one point.
A one-way repeated measures ANOVA is an extension of the paired-samples t-test for comparing the means of 3+ levels of a within-subjects variable.
A two-way repeated measures ANOVA is used to evaluate simultaneously the effect of two within-subject factors on a continuous outcome variable. Example: Effect of diet and exercise on cattle weight over time.
A three-way repeated measures ANOVA is used to evaluate simultaneously the effect of three within-subject factors on a continuous outcome variable. Example: Effect of diet, exercise, and supplementation on cattle weight over time.
R Code
One-way
Data Preparation: your data should be in long format, with one row per subject per condition, and the subject identifier coded as a factor.
model = aov(formula = y ~ x + Error(subject/x), data = dataframe)
summary(model)
The Error(subject/x) term tells aov() that x is a within-subject factor.
You can create a line graph with average response variable level plotted
over time to see the trend.
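As a minimal sketch, here is the one-way case with simulated long-format data (10 animals, each measured under three diets; all names and values invented), including the line graph:
set.seed(3)
dataframe = data.frame(
  subject = factor(rep(1:10, times = 3)),   # animal ID, coded as a factor
  diet = factor(rep(c("control", "forageA", "forageB"), each = 10)),
  weight = c(rnorm(10, 500, 20), rnorm(10, 515, 20), rnorm(10, 520, 20)))
# One-way repeated measures ANOVA: diet is a within-subject factor
model = aov(weight ~ diet + Error(subject/diet), data = dataframe)
summary(model)
# Line graph of the mean weight per diet to inspect the trend
means = tapply(dataframe$weight, dataframe$diet, mean)
plot(means, type = "b", xaxt = "n", xlab = "Diet", ylab = "Mean weight")
axis(1, at = 1:3, labels = names(means))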
X: Categorical Between (>1 Variable)
Y: Continuous (1 Variable)