Statistical Tests and R Code

This guide contains the most common types of tests used in statistics and the R code needed to run them.

Chart explaining what test to use and the steps needed to complete it. Source:https://medium.com/towards-data-science/demystifying-statistical-analysis-1-a-handy-cheat-sheet-b6229bf992cf

Types of Tests

The most common tests are t-tests, ANOVA (Analysis of Variance), Chi-square tests, correlation tests, and regression analysis. There are also various comparison tests. Z-tests are also used, but they are less common because they require the standard deviation of the population to be known, which is rarely the case in practice. For that reason, they are not included in this guide.

Types of Data

To start, you need to know what data you have. Is it quantitative or categorical?

Quantitative variables represent a numerical value (height, weight, age, etc.). They can be further divided into discrete or continuous data.
- Discrete: can only take on specific, whole numbers.
- Continuous: can take on any value within a range.

Because most data is continuous, we will only be discussing tests that look at continuous data in this guide.

Categorical data represents a category or group based on descriptions (gender, hair color, favorite food, etc.).

If you are using only categorical data, meaning both your independent and dependent variables are categorical, you are limited to using a chi-square test.

Chi-Square

X: Categorical
Y: Categorical

A chi-square test generally has only one independent variable, but that variable can be split into 2 or more categories. The test is used to analyze situations where you want to compare the observed frequencies in categorical data to the expected frequencies. Your X variable can be categorized between or within subjects.

A chi-square test between subjects compares categorical data between different groups of participants who each experience only one condition. A chi-square test within subjects analyzes data where the same participants are measured multiple times across different conditions.

Examples:
- Is the distribution of colors in a bag of candies even?
- Is there a relationship between gender and movie preference?
- Is the proportion of people who recycle different across intervention groups?

Steps
1. Create contingency table
2. Run test
3. Interpret results

R Code
subdata = table(data$variable1, data$variable2)
subdata = creates a contingency table with the two categorical variables needed.
chisq.test(subdata)
chisq.test = runs the chi-square test on the contingency table.
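As a rough illustration of the three steps, here is a minimal sketch using base R only. The survey data frame and its values are made up for this example and are not from a real study.

```r
# Hypothetical survey data: gender and movie preference for 12 people.
survey <- data.frame(
  gender = c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M"),
  movie  = c("Comedy", "Comedy", "Drama", "Comedy", "Comedy", "Drama",
             "Drama", "Drama", "Comedy", "Drama", "Drama", "Drama")
)

# Step 1: create the contingency table (rows = gender, columns = movie).
movie_table <- table(survey$gender, survey$movie)
print(movie_table)

# Step 2: run the test. R warns that the approximation may be unreliable
# when expected counts are small, as they are in this toy data set,
# so the warning is suppressed here.
movie_test <- suppressWarnings(chisq.test(movie_table))

# Step 3: interpret. Compare the p-value to your significance level and
# look at the expected counts the test was based on.
print(movie_test$p.value)
print(movie_test$expected)
```

With real data you would want larger counts in every cell of the table before trusting the p-value.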

Now, let’s say that your independent variables are categorical, but your dependent variables are continuous.

T-Tests

Independent t-test

X: Categorical Between
Y: Continuous (1 Variable)

If you only have one dependent variable, you would use an independent t-test. This test is used to determine whether there is a difference between the means of two independent, unrelated groups. For example, you want to test whether a new fertilizer improves corn yield compared to a traditional one.

  • Hypothesis: the new fertilizer will produce a higher average corn yield than the traditional fertilizer
  • X: type of fertilizer (A vs B)
  • Y: corn yield (bushels/acre)
  • Design: Select two separate groups of corn fields
    • Group 1: Fertilizer A
    • Group 2: Fertilizer B
  • Methods: Apply fertilizer in the same/recommended manner, record the yield of corn for each field. Apply the independent t-test to compare the mean yields for the two groups.

There are six assumptions that must be met to use this test:

  1. Y is continuous (yield, weight, height, etc.)

  2. X consists of two categorical, independent groups (dichotomous). A dichotomous variable can be nominal or ordinal.

  • Nominal: categorized by name (M/F, hair color, nationalities, names)
  • Ordinal: categorized by rank or order (education level, income level, satisfaction rating)
  3. There should be independence of observations. This means there is no relationship between the observations within each group or between the groups themselves.

The first three assumptions relate to study design and how variables were measured. They MUST be met. The next three check your data rather than your study design.

  4. There should be no significant outliers. Outliers are problematic because they can distort the group means and variances, which affects the other assumptions and the results and can lead to invalid conclusions.
  • To detect outliers in R, you can run summary(), min(), or max() on the variable. This will give you your minimum and maximum values.
  • You can also create a histogram or boxplot. A boxplot will help to visualize an outlier by using the interquartile range (IQR). An outlier based on this range would be anything outside of:

\[I = [q_{0.25} - 1.5 \times\ IQR;q_{0.75} + 1.5 \times\ IQR]\]

  • You can use out = boxplot.stats(data$variable)$out to extract the values of the potential outliers. The which() function can then give you the row numbers corresponding to those outliers, i.e. which(data$variable %in% out)
  5. The dependent variable should be approximately normally distributed for each category of your independent variable.

  6. There needs to be homogeneity of variance. This means the population variance for each category of the independent variable is the same.
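The three data checks (outliers, normality, homogeneity of variance) can be sketched in base R as follows. The corn data frame is made up, with one deliberately extreme value. Using shapiro.test() for normality and var.test() for equal variances is an assumption of this sketch; other checks (Q-Q plots, Levene's test from an add-on package) are also common.

```r
# Made-up corn yields (bushels/acre); the value 200 is a planted outlier.
corn <- data.frame(
  fertilizer = factor(rep(c("A", "B"), each = 8)),
  yield = c(150, 152, 149, 155, 151, 153, 148, 200,
            158, 160, 157, 162, 159, 161, 156, 163)
)

# Outliers: boxplot.stats() returns the values outside the 1.5 x IQR fences,
# and which() finds the rows they sit in.
out <- boxplot.stats(corn$yield)$out
outlier_rows <- which(corn$yield %in% out)
print(out)
print(outlier_rows)

# Normality: Shapiro-Wilk test run separately for each group.
# A small p-value suggests the group is not normally distributed.
normality_p <- tapply(corn$yield, corn$fertilizer,
                      function(v) shapiro.test(v)$p.value)
print(normality_p)

# Homogeneity of variance: F test comparing the two group variances.
# A small p-value suggests the variances differ.
variance_check <- var.test(yield ~ fertilizer, data = corn)
print(variance_check$p.value)
```

If the variance check fails, keeping var.equal = FALSE in t.test() (the Welch version) is the usual fallback.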

Steps
1. Check assumptions.
2. State null and alternative hypotheses.
3. Calculate the mean and standard deviation for each group separately.
4. Choose a significance level (commonly 0.05).
5. Calculate the test statistic and compare its p-value to the significance level.

R Code:
t.test(dv ~ iv, data = dataframe, mu = 0, var.equal = FALSE, conf.level = 0.95)
dv = dependent_variable
iv = independent_variable
mu = the difference in means assumed under the null hypothesis. Default is 0.
var.equal = default is “FALSE”, in which case the Welch approximation to the degrees of freedom is used. This is only set to “TRUE” if the data meets the assumption of homogeneity of variances.
paired = default is “FALSE”, so it can be left out for an independent t-test. Only set to “TRUE” when doing a paired t-test. R 4.4.0 and later return an error if paired is supplied together with a formula, even as paired = FALSE.
conf.level = default is 0.95.
data = the data frame that contains dv and iv.

This will give you important results including:
- T-value (t): measures the size of the difference relative to the variation in the sample data (i.e. the difference between the group means expressed in units of standard error). Larger absolute t = larger difference between sample sets
- Degrees of Freedom (df): number of independent pieces of information used to calculate the test statistic. For an independent t-test with equal variances this is n-2 (2 samples). With the Welch approximation it is at most n-2 and usually not a whole number
- Statistical significance (p-value)
- 95% confidence interval (CI) of the mean difference
- Mean value of each group
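To show where each of these results lives in the output, here is a small sketch of the fertilizer example. The yields are invented for illustration, and the test is run two-sided, which is the default in t.test().

```r
# Made-up corn yields (bushels/acre) for six fields per fertilizer.
corn <- data.frame(
  fertilizer = factor(rep(c("A", "B"), each = 6)),
  yield = c(150, 152, 149, 155, 151, 153,
            158, 160, 157, 162, 159, 161)
)

# Welch independent t-test (var.equal = FALSE is the default).
result <- t.test(yield ~ fertilizer, data = corn,
                 var.equal = FALSE, conf.level = 0.95)
print(result)

# The individual pieces can be pulled out of the result object.
t_value  <- result$statistic   # t-value
df       <- result$parameter   # degrees of freedom
p_value  <- result$p.value     # statistical significance
ci       <- result$conf.int    # 95% CI of the mean difference (A minus B)
grp_mean <- result$estimate    # mean of each group
```

Because the hypothesis in the fertilizer example is directional, a one-sided test could be requested through the alternative argument of t.test() instead.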

Generating descriptive statistics
This will generate the mean, standard deviation, and sample size for each group. It uses the dplyr package, so load it first with library(dplyr).
object = dataframe %>% group_by(iv) %>% summarise(mean = mean(dv), sd = sd(dv), n = n())
%>% = simply means “then”. This tells R that after looking for the data, it should then do something (aka the code that follows).
group_by(iv) = This groups the data on the iv (independent variable) specified by the parentheses.
summarise = run the descriptive statistics between the parentheses.
mean = mean(dv) = generates the mean score of the dv (dependent variable)
sd = sd(dv) = generates the standard deviation of the dv
n = n() = generates the sample size

Mean Difference
This tells you the difference between the two group means generated above.
mean_difference = object[1,2] - object[2,2]
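object[a,b] = finds the statistic in row a and column b. Column 2 holds the group means here, so rows 1 and 2 of that column are the two means being subtracted.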
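If dplyr is not installed, roughly the same summary table can be built with base R. This is only a sketch of one alternative, using made-up corn yields. The name object mirrors the summary table described above.

```r
# Made-up corn yields (bushels/acre) for two fertilizers.
corn <- data.frame(
  fertilizer = factor(rep(c("A", "B"), each = 6)),
  yield = c(150, 152, 149, 155, 151, 153,
            158, 160, 157, 162, 159, 161)
)

# tapply() applies a function to the yields of each fertilizer group.
group_mean <- tapply(corn$yield, corn$fertilizer, mean)
group_sd   <- tapply(corn$yield, corn$fertilizer, sd)
group_n    <- tapply(corn$yield, corn$fertilizer, length)

# Collect the results in one table: one row per group.
object <- data.frame(
  iv   = names(group_mean),
  mean = as.vector(group_mean),
  sd   = as.vector(group_sd),
  n    = as.vector(group_n)
)
print(object)

# Difference between the two group means (row 1 minus row 2).
mean_difference <- object$mean[1] - object$mean[2]
print(mean_difference)
```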

Paired t-test

X: Categorical Within
Y: Continuous (1 Variable)

The paired t-test is used to determine whether the mean of a dependent variable is the same in two related groups. In other words, there are two categories within subjects. For example, you want to determine whether a new drought-resistant wheat variety improves grain yield under low-water conditions compared to a conventional variety. To control for natural differences between subjects and locations (field/soil types), you will use a paired t-test.

  • Hypothesis: The new wheat variety will yield more grain than the conventional variety under drought conditions.
  • X: Wheat variety (A vs B)
  • Y: Grain yield (bushels/acre)
  • Study Design (Paired Approach)
    • Multiple plots in the same field
    • Split each plot in half:
      • One half in wheat A
      • One half in wheat B
    • Grow under same conditions
    • Harvest and record yield
    • Perform paired t-test comparing the varieties within each plot.

Four assumptions for using a paired t-test:

  1. Your Y variable should be continuous

  2. Your X variable should consist of two categorical, related groups. This means the same subjects are present in both groups.

  3. There should be no significant outliers.

  4. The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.

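R Code
t.test(dv ~ iv, data = dataframe, paired = TRUE, conf.level = 0.95)
paired = this is set to “TRUE”.
conf.level = this sets the confidence level of the interval.
Note: R 4.4.0 and later no longer accept paired together with a formula. In those versions, pass the two sets of measurements as separate vectors, with each subject's values in the same order: t.test(x, y, paired = TRUE, conf.level = 0.95)

The results from R are similar to those of an independent t-test.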
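The split-plot wheat example could be run along these lines. The yields are invented, and the two-vector form of t.test() is used here because it works the same way across R versions.

```r
# Made-up grain yields (bushels/acre): each row is one plot, split in half.
wheat <- data.frame(
  plot      = 1:8,
  variety_a = c(42, 45, 39, 47, 44, 41, 46, 43),
  variety_b = c(40, 42, 38, 44, 43, 39, 43, 41)
)

# The paired test works on the within-plot differences, so the normality
# assumption is checked on those differences.
differences <- wheat$variety_a - wheat$variety_b
print(shapiro.test(differences))

# Paired t-test: the rows must line up so each pair comes from the same plot.
paired_result <- t.test(wheat$variety_a, wheat$variety_b,
                        paired = TRUE, conf.level = 0.95)
print(paired_result)

# Equivalent view: a one-sample t-test of the differences against 0.
one_sample <- t.test(differences, mu = 0)
```

The last line is there only to show that a paired t-test is the same calculation as a one-sample test on the differences.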

Okay, so great. Now, let’s say your independent variable is still categorical, but you have more than 2 categories between or within subjects. You only have 1 continuous dependent variable. This is where ANOVA comes in.

ANOVA

ANOVA, or analysis of variance, is used to compare the means of two or more groups. There are several types of ANOVA, but since we covered most statistical terms in the categories above, we will be quick here.

One-Way ANOVA

X: Categorical Between
Y: Continuous (1 Variable)

Used when comparing the means of three or more independent groups to determine whether there is a significant difference between them. For example, you want to test three fertilizers on corn yield.

- Design: Randomly assign the fertilizers to multiple corn plots under the same environmental conditions.
- Apply fertilizers and maintain
- Harvest, record the yield
- Perform one-way ANOVA

R Code
res.aov = aov(y ~ x, data = dataframe)
summary(res.aov)
aov = ANOVA. The dependent variable (y) goes on the left of the ~ and the grouping variable (x) on the right. The results of this will tell you whether there is a difference in means, but we don’t know which pairs of groups are different, so we introduce a pairwise comparison if the ANOVA is significant.
TukeyHSD(res.aov)
TukeyHSD = performs a multiple pairwise-comparison between the means of groups.

The results of this will include:
- diff: difference between means of the two groups
- lwr, upr: lower and upper end point of the confidence interval at 95% (or whatever confidence level you set)
- p adj: p-value after adjustment for the multiple comparisons

You can then conduct a multiple comparison with the multcomp package (load it first with library(multcomp)).
summary(glht(model, linfct = mcp(group = "Tukey")))
glht = general linear hypothesis test
model = fitted model, like an object returned by aov()
linfct = specification of the linear hypotheses to be tested. Multiple comparisons in ANOVA are specified by objects returned from the function mcp(). Here, group is the name of the grouping variable in the model.

Or you can conduct a pairwise t-test.
pairwise.t.test(data$variable, data$group, p.adjust.method = "BH")
p.adjust.method = the method used to adjust the p-values for multiple comparisons. "BH" is the Benjamini-Hochberg method.
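Putting the one-way ANOVA workflow together, a minimal base R sketch of the three-fertilizer example might look like this. The yields are made up, and the multcomp step is left out so that no extra package is needed.

```r
# Made-up corn yields (bushels/acre) for five plots per fertilizer.
corn <- data.frame(
  fertilizer = factor(rep(c("A", "B", "C"), each = 5)),
  yield = c(150, 152, 149, 155, 151,
            158, 160, 157, 162, 159,
            151, 153, 150, 154, 152)
)

# One-way ANOVA: dependent variable on the left, grouping factor on the right.
res.aov <- aov(yield ~ fertilizer, data = corn)
anova_table <- summary(res.aov)
print(anova_table)

# Overall p-value for the fertilizer effect, taken from the ANOVA table.
anova_p <- anova_table[[1]][["Pr(>F)"]][1]

# Follow-up: which pairs of fertilizers differ?
tukey <- TukeyHSD(res.aov)
print(tukey)  # columns: diff, lwr, upr, p adj

# Alternative follow-up: pairwise t-tests with Benjamini-Hochberg adjustment.
pairwise <- pairwise.t.test(corn$yield, corn$fertilizer,
                            p.adjust.method = "BH")
print(pairwise)
```

The follow-up comparisons are only meaningful when the overall ANOVA p-value is below your significance level.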

Repeated Measures ANOVA

X: Categorical Within
Y: Continuous (1 Variable)

A repeated measures ANOVA is used when the same subjects (experimental units) are measured multiple times under different conditions or time points. For example, you want to evaluate the effect of different forage diets on cattle weight over time. The repeated-measures ANOVA controls for individual animal variation, gains statistical power from tracking the same animals over time, and detects trends over time rather than at just one point.

  • X Factor: Forage diet type (A, B, C)
  • Repeated (within subject) Factor: Time (monthly measurements)
  • Y Factor: Cattle weight (lbs)
  • Design: Randomly assign a group of cattle to one of the three diets.
    • Weigh each animal at month 0.
    • Feed and record weight each month for six months
    • Perform repeated measures ANOVA to analyze effects (time, diet, time x diet interaction).

A one-way repeated measures ANOVA is an extension of the paired-samples t-test for comparing the means of 3+ levels of a within-subjects variable.

A two-way repeated measures ANOVA is used to evaluate simultaneously the effect of two within-subject factors on a continuous outcome variable. Example: Effect of diet and exercise on cattle weight over time.

A three-way repeated measures ANOVA is used to evaluate simultaneously the effect of three within-subject factors on a continuous outcome variable. Example: Effect of diet, exercise, and supplementation on cattle weight over time.

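R Code
One-way
Data Preparation: the data should be in long format, with one row per subject per condition, and with subject and x stored as factors.
model = aov(y ~ x + Error(subject/x), data = dataframe)
summary(model)
You can create a line graph with the average of the response variable plotted over time to see the trend.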
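As a sketch of the one-way case, the example below tracks five animals over three monthly weigh-ins. The weights are invented, only time is used as the within-subject factor (diet is left out to keep it one-way), and the column names are placeholders.

```r
# Made-up weights (lbs): 5 animals, each weighed at months 0, 1, and 2.
# Long format: one row per animal per month, both stored as factors.
cattle <- data.frame(
  animal = factor(rep(1:5, times = 3)),
  month  = factor(rep(c("m0", "m1", "m2"), each = 5)),
  weight = c(500, 520, 510, 530, 515,
             512, 533, 521, 544, 526,
             525, 547, 534, 556, 540)
)

# One-way repeated measures ANOVA: month is the within-subject factor,
# and Error(animal/month) tells aov() that measurements are nested in animals.
model <- aov(weight ~ month + Error(animal/month), data = cattle)
print(summary(model))

# Average weight at each time point, then a simple line graph of the trend.
monthly_mean <- tapply(cattle$weight, cattle$month, mean)
print(monthly_mean)
plot(seq_along(monthly_mean), monthly_mean, type = "b", xaxt = "n",
     xlab = "Month", ylab = "Mean weight (lbs)")
axis(1, at = seq_along(monthly_mean), labels = names(monthly_mean))
```

In the summary output, the month effect is reported under the animal:month error stratum.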

Factorial ANOVA

X: Categorical Between (>1 Variable)
Y: Continuous (1 Variable)