Introduction to Statistical Analysis using R and RStudio

Review of Hypotheses

In general, hypotheses are never “proven”, but rather tested against. In most cases, they cannot be thoroughly evaluated by a single experiment, and that is where statistics becomes useful. Statistics allow us to evaluate our hypotheses using sample data that is representative of the larger population, allowing us to infer conclusions on the whole population.

Hypotheses are made in pairs of two mutually exclusive statements:

Null Hypothesis H₀: Hypothesis that shall be falsified
- What we test our results against either “fail to reject” or “reject”
- Hypothesis is never “accepted” or “proven”
Alternative Hypothesis H₁: Hypothesis that shall be confirmed (research hypothesis)

Hypotheses can be one-sided or two-sided (sometimes referred to as one or two “tailed”:

One-Sided: Also known as directional tests, because you are testing in only one direction of the distribution.
- H0 : the effect is ≥ 0 (the effect is less than or equal to 0)
- H1: the effect is < 0 (the effect is greater than 0)
  
  or
- H₀ : the effect is ≤ 0 (the effect is greater than or equal to 0)
- H₁: the effect is > 0 (the effect is greater than 0)
Two-Sided: Important to researchers where they are interested in seeing any kind of effect (positive or negative). This is the type of test that is most-often used. Called a two-sided test because it includes cases where the effect is both greater than and less than 0.
- H₀ : the effect = 0 (there is no effect)
- H₁: the effect ≠ 0 (there is an effect)

Review of Types of Error

In science, we assume our samples are representative of the larger population; however, we are unable to account for every random or uncontrollable variable that is active within the system. This can sometimes lead to errors in the interpretation of our statistical outcomes.

	H0 is true	H1 is true
Decision for H0	Correct decision 1-α (sensitivity)	Type II error β
Decision for H1	Type I error α (significance level)	Correct decision 1-β (power specificity)

Type I Error: probability of rejecting H₀ when it is true
- E.g. “False positives” –> a pregnancy test saying you are pregnant when you really aren’t
Type II Error: probability of not rejecting H0 when it is false
- E.g. “false negatives” –> a COVID test saying you don’t have the virus when you really do

Test Decision

Test decision is generally based off of the p-value, the (conditional) probability that the value of the test statistic is in the rejection region of H0 (the red-shaded region in the distribution images above). This rejection region is determined by the value of alpha, α. In most cases, especially in ecology and “non-medical” biology, α = 0.05. This means that there is a 5% chance of a Type I error. In the medical world, they may aim for a smaller α. For our class, we will use α = 0.05.

If the p-value is > α, then we fail to reject H₀
If the p-value is ≤ α, then we reject H₀

Types of Statistical Analyses

Two types of frequentest statistical analyses are parametric and nonparametric. Parametric statistics are based on assumptions about the population distribution from which the sample (the data) was taken. One of the assumptions made in parametric analyses is that the data is normally distributed. Data that is normally distributed peaks in the middle and is symmetrical about the mean (e.g. a bell curve). Data does not need to be perfectly normal in order to use parametric tests, however, you should check your data for normality before using these analyses.

Nonparametric statistics have fewer assumptions than parametric tests and are not restricted to normal distributions (the data can be collected from a sample that does not follow a specific distribution).

Comparison between normally distributed data and abnormally distributed data

Data can be checked for normality using visual methods such as density plots, histograms, and Q-Q plot. Density plots and histograms check to see if the data follow a bell shape and Q-Q plots (quantile-quantile plots) draw the correlation between a given sample and the normal distribution.

Hypothesis testing for normality:

Data can be quantitatively checked for normality by using the Shapiro-Wilk test, shapiro.test(x), where “x” is a numeric vector of data values. When checking a particular column in a data frame, instead of “x”, use the format “dataframe$column”.

Null Hypothesis: The data are normally distributed

Alternative Hypothesis: The data are not normally distributed

NOTE: If p > 0.05, normality can be assumed.

Example of Shapiro-Wilk normality test with normally distributed data outcome:

shapiro.test(rnorm(100, mean = 5, sd = 3))

## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100, mean = 5, sd = 3)
## W = 0.9897, p-value = 0.6405

Example of Shapiro-Wilk normality test with non-normally distributed data outcome:

shapiro.test(runif(100, min = 2, max = 4))

## 
##  Shapiro-Wilk normality test
## 
## data:  runif(100, min = 2, max = 4)
## W = 0.95724, p-value = 0.002567

If data is not normally distributed:

Transform the dependent variable: This is done by taking the log or square root of the dependent variable. Note: Test for normality should be repeated after data have been transformed.
Use a non-parametric test: You may use the non-parametric equivalent of the parametric test. Reminder: non-parametric tests are not restricted to specific distributions.

Nonparametric Tests in R

Parametric Tests

One-Sample t-tests
Two-Sample t-tests
Paired t-tests
One-Way ANOVA
Linear Regression

One-Sample t-test

One-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ). Generally, the theoretical mean comes from a previous experiment or from an experiment where you have control and treatment conditions. If you express your data as a “percent of control”, you can test whether the average value of treatment condition differs significantly from 100.

Assumptions:

The data are continuous (not discrete)
The data follow the normal probability distribution
The sample is a simple random sample from its population. Each individual in the population has an equal probability of being selected in the sample.

Typical questions:

Whether the mean (m) of the sample is equal to the theoretical mean (μ)?
Whether the mean (m) of the sample is less than the theoretical mean (μ)?
Whether the mean (m) of the sample is greater than the theoretical mean (μ)?

Corresponding null hypothesis (H₀):

H₀ : m = μ
H₀ : m ≤ μ
H₀ : m ≥ μ

The corresponding alternative hypotheses (Ha) are as follow:

H₁ : m ≠ μ (different)
H₁ : m > μ (greater)
H₁ : m < μ (less)

Note:

Hypothesis 1) is a two-tailed test
Hypotheses 2) & 3) are one-tailed tests

Example Using Student’s Sleep Data

Description of data: An experiment was conducted to determine which of two soporific drugs was better at increasing the hours of sleep individuals received, on average. There were two groups of 10 patients each. One group received the first drug, and the other group received the second drug. The amount of extra sleep that individuals received when drugged was measured. The data is contained in the sleep data set in R.
Data frame name: sleep
Format: a data frame with 20 observations on 3 variables

Extra numeric increase in hours of sleep
Group factor: drug given
ID factor: patient ID

NOTE: The group variable name may be misleading about the data: They represent measurements on 10 persons, not in “groups”.

head(sleep)

##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

Below is an example of a one-sample t-test using a dataset provided by R. The Sleep data show the effect of two soporific drugs on 10 patients, where the observed values are differences in hours of sleep compared to the control group (the “extra” column).

FIRST: check for normality using the Shapiro-Wilk test

shapiro.test(sleep$extra)

## 
##  Shapiro-Wilk normality test
## 
## data:  sleep$extra
## W = 0.94607, p-value = 0.3114

H₀: The data are normally distributed
H₁: The data are not normally distributed

The resulting p-value is greater than 0.05, so we fail to reject the null hypothesis. Therefore, the data are assumed to be normally distributed.

Now we can run our t-test:
Here, we are going to run a t-test on all observed data. To do this, we will specify x as the vector of values from the “extra” column and submit it to a one-sample t-test. We will do this by specifying x as sleep$extra, which pulls only the “extra” column from the “sleep” dataset. Our null hypothesis is that the average differences in sleep are equal to, or less than zero (zero is the default μ value) while our alternative hypothesis is that the average differences in sleep are greater than zero (which is specified using the alternative parameter):

Make your hypothesis:
- H₀: μ_change ≤ 0 (0 is the hypothesized value)
- H₁: μ_change > 0 (0 is the hypothesized value)
Run your test:

# One-sample t-test using Sleep data.
t.test(x = sleep$extra, alternative = "greater")

## 
##  One Sample t-test
## 
## data:  sleep$extra
## t = 3.413, df = 19, p-value = 0.001459
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  0.7597797       Inf
## sample estimates:
## mean of x 
##      1.54

Evaluate your hypothesis: The resulting p-value is smaller than 0.05, so we reject the null hypothesis
Interpret your results: There was a significant increase in the amount of sleep after taking the drug when compared to the control.

Two-Sample t-test

A Two-sample t-test (unpaired) is used to compare the mean of two independent groups.

Assumptions:

The data are continuous (not discrete)
The data follow the normal probability distribution
- Check using Shapiro-Wilk test shaprio.test() for BOTH samples
The sample is a simple random sample from its population. Each individual in the population has an equal probability of being selected in the sample.
The variances for the two independent groups are equal
- Check using F-test var.test() (If p-value is greater than 0.05, there is a significant difference between variances)

Typical questions:

Whether the mean of group A (m_A) is equal to the mean of group B (m_B)?
Whether the mean of group A (m_A) is less than the mean of group B (m_B)?
Whether the mean of group A (m_A) is more than the mean of group B (m_B)?

Corresponding null hypothesis (H₀):

H₀ : m_A = m_B
H₀ : m_A ≤ m_B
H₀ : m_A ≥ m_B

The corresponding alternative hypotheses (Ha) are as follow:

H₁ : m_A ≠ m_B (different)
H₁ : m_A > m_B (greater)
H₁ : m_A < m_B (less)

Note:

Hypothesis 1) is a two-tailed test
Hypotheses 2) & 3) are one-tailed tests

Example Using Student’s Sleep Data

Extra numeric increase in hours of sleep
Group factor: drug given
ID factor: patient ID

NOTE: The group variable name may be misleading about the data: They represent measurements on 10 persons, not in “groups”. For this example, we will be treating Group 1 and Group 2 as independent groups.

head(sleep)

##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

Below is an example of a two-sample t-test using a dataset provided by R. The Sleep data show the effect of two soporific drugs on 10 patients, where the observed values are differences in hours of sleep compared to the control group (the “extra” column).

FIRST: check for normality using the Shapiro-Wilk test
H₀: The data are normally distributed
H₁: The data are not normally distributed

#Shaprio-Wilk test for Group 1
with(sleep, shapiro.test(extra[group == "1"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  extra[group == "1"]
## W = 0.92581, p-value = 0.4079

#Shaprio-Wilk test for Group 2
with(sleep, shapiro.test(extra[group == "2"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  extra[group == "2"]
## W = 0.9193, p-value = 0.3511

The resulting p-values are greater than 0.05, so we fail to reject the null hypotheses and the data are assumed to be normally distributed.

SECOND: check for homogeneity in variances using the F-test
H₀: Var₁ = Var₂
H₁: Var₁ ≠ Var₂

var.test(extra ~ group, data = sleep)

## 
##  F test to compare two variances
## 
## data:  extra by group
## F = 0.79834, num df = 9, denom df = 9, p-value = 0.7427
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.198297 3.214123
## sample estimates:
## ratio of variances 
##          0.7983426

The resulting p-value is greater than 0.05, so we fail to reject the null hypotheses and conclude there is no significant differences between the variances of the two groups. Therefore, we can use the classic t-test, which assumes equality of the two variances.
NOTE: If variances were found to be unequal, the Welch’s t-test could me used to test if the means of two populations are equal.

Now we can run our t-test:
Our null hypothesis is that the mean extra hours of sleep for group 1 are equal to the mean extra hours of sleep for group 2, while our alternative hypothesis is that the mean hours of extra sleep for group 1 are different (greater than or less than) the mean extra hours of sleep for group 2:

Make your hypothesis:
- H₀: m₁ = m₂
- H₁: m₁ ≠ m₂
Run your test:

# Two-sample t-test using Sleep data.
t.test(extra ~ group, data = sleep, var.equal= TRUE)

## 
##  Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 18, p-value = 0.07919
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.363874  0.203874
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

#This will default to the Welch's t-test if you leave off "var.equal = TRUE"

Evaluate your hypothesis: The resulting p-value is greater than 0.05, so we fail to reject the null hypothesis
Interpret your results: There is no significant difference in the increase of sleep between groups 1 and 2.

Paired t-test

A Paired t-test compares the means between two related groups of samples: two values (i.e., pair of values) for the same samples, a “before vs. after” scenario, or the same sample under two different treatments. It is a special case of the two-sample t-test for comparing the sample means of a continuous response variable (Y) measured for two different values of a categorical variable (X).

Assumptions:

The data are continuous (not discrete)
The data follow the normal probability distribution
- Check using Shapiro-Wilk test shaprio.test()
The sample is a simple random sample from its population. Each individual in the population has an equal probability of being selected in the sample.

Typical questions:

Whether the mean difference (m) is equal to 0?
Whether the mean difference (m) is less than to 0?
Whether the mean difference (m) is greater than to 0?

Corresponding null hypothesis (H₀):

H₀ : m = 0
H₀ : m ≤ 0
H₀ : m ≥ 0

The corresponding alternative hypotheses (Ha) are as follow:

H₁ : m ≠ 0 (different)
H₁ : m > 0 (greater)
H₁ : m < 0 (less)

Note:

Hypothesis 1) is a two-tailed test
Hypotheses 2) & 3) are one-tailed tests

Example Using Student’s Sleep Data

Extra numeric increase in hours of sleep
Group factor: drug given
ID factor: patient ID

NOTE: The group variable name may be misleading about the data: They represent measurements on 10 persons, not in “groups”. These data are actually paired.

# The first value for group 1 is paired with the first value for group 2
## i.e., row 1 and row 11, row 2 and row 12, row 3 and row 13, etc.
## Notice how the IDs for these rows match
print(sleep)

##    extra group ID
## 1    0.7     1  1
## 2   -1.6     1  2
## 3   -0.2     1  3
## 4   -1.2     1  4
## 5   -0.1     1  5
## 6    3.4     1  6
## 7    3.7     1  7
## 8    0.8     1  8
## 9    0.0     1  9
## 10   2.0     1 10
## 11   1.9     2  1
## 12   0.8     2  2
## 13   1.1     2  3
## 14   0.1     2  4
## 15  -0.1     2  5
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10

Below is an example of a paired t-test using a dataset provided by R. The Sleep data show the effect of two soporific drugs on 10 patients, where the observed values are differences in hours of sleep compared to the control group (the “extra” column).

FIRST: check for normality using the Shapiro-Wilk test
H₀: The data are normally distributed
H₁: The data are not normally distributed

#First, find the difference between group 1 and group 2 values for each unique ID
sleep1 <- with(sleep, 
        extra[group == "2"] - extra[group == "1"])
head(sleep1)

## [1] 1.2 2.4 1.3 1.3 0.0 1.0

#Then complete the Shapiro-Wilk normality test for the differences
shapiro.test(sleep1)

## 
##  Shapiro-Wilk normality test
## 
## data:  sleep1
## W = 0.82987, p-value = 0.03334

The resulting p-values is less than 0.05, so we must reject the null hypotheses. However, if we look at the box plot of our data, we can see that we have an outlier for ID 9, with 4.6 extra hours.

#create boxplot for sleep data to view outliers
stripchart(sleep1, method = "stack", xlab = "hours",
                     main = "Sleep prolongation (n = 10)")
boxplot(sleep1, horizontal = TRUE, add = TRUE,
                at = .6, pars = list(boxwex = 0.5, staplewex = 0.25))

We can remove the paired values for ID 9, and rerun the normality test:

#Remove row values where ID == 9 
sleep.rm<-sleep[sleep$ID !=9, ]

#Now rerun the Shapiro Wilk test: find the difference between group 1 and group 2 values for each unique ID
sleep1 <- with(sleep.rm, 
        extra[group == "2"] - extra[group == "1"])
head(sleep1)

## [1] 1.2 2.4 1.3 1.3 0.0 1.0

#Then complete the Shapiro-Wilk normality test for the differences
shapiro.test(sleep1)

## 
##  Shapiro-Wilk normality test
## 
## data:  sleep1
## W = 0.95736, p-value = 0.7703

The resulting p-values is greater than 0.05, so we fail to reject the null hypotheses, and the data is considered to be normally distributed.

Now we can run our t-test:
Our null hypothesis is that the mean differences in sleep are equal to zero (zero is the default μ value) while our alternative hypothesis is that the mean differences in sleep are greater or less than zero (which is specified using the alternative parameter):

Make your hypothesis:
- H₀: m = 0
- H₁: m ≠ 0
Run your test:

# Paired t-test using Sleep data.
t.test(extra~group, data=sleep.rm, paired=TRUE)

## 
##  Paired t-test
## 
## data:  extra by group
## t = -5.6587, df = 8, p-value = 0.0004766
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7515777 -0.7373112
## sample estimates:
## mean of the differences 
##               -1.244444

#This will default to the standard t-test if you leave off "paired = TRUE"

Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis
Interpret your results: There is a significant difference in the increase of sleep between groups 1 and 2

One-way ANOVA test

The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups base on one single grouping variable (also called factor variable).

Assumptions:

All populations have a common variance
- Can check using Levene’s test leveneTest() to check for the homogeneity of variances (after the ANOVA is)
The data follow the normal probability distribution
- Check using Shapiro-Wilk test shaprio.test()
Within each sample, the observations are sampled randomly and independently of each other
Factor effects are additive

Typical questions:

Whether there is a difference in the sample means among any of my groups?

Corresponding null hypothesis (H₀):

H₀ : the means of the sample groups are the same

The corresponding alternative hypotheses (Ha) are as follow:

H₁ : At least one sample mean is not equal to the others.

Example Using PlantGrowth Data

Description of data: Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions. The data is contained in the PlantGrowth data set in R.
Data frame name: PlantGrowth
Format: A data frame of 30 cases on 2 variables.

[, 1] weight numeric
[, 2] group factor
The levels of group are ‘ctrl’, ‘trt1’, and ‘trt2’.

head(PlantGrowth)

##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

Below is an example of an ANOVA using a dataset, PlantGrowth, provided by R. We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.

FIRST: check for normality using the Shapiro-Wilk test for EACH group
H₀: The data are normally distributed
H₁: The data are not normally distributed

#Complete the Shapiro-Wilk normality test for each group
with(PlantGrowth, shapiro.test(weight[group =="ctrl"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  weight[group == "ctrl"]
## W = 0.95668, p-value = 0.7475

with(PlantGrowth, shapiro.test(weight[group =="trt1"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  weight[group == "trt1"]
## W = 0.93041, p-value = 0.4519

with(PlantGrowth, shapiro.test(weight[group =="trt2"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  weight[group == "trt2"]
## W = 0.94101, p-value = 0.5643

The resulting p-values are greater than 0.05, so we fail to reject the null hypotheses, and the data are considered to be normally distributed.

Now we can run our ANOVA:
Our null hypothesis is that the mean differences in sleep are equal to zero (zero is the default μ value) while our alternative hypothesis is that the mean differences in sleep are greater or less than zero (which is specified using the alternative parameter):

Make your hypothesis:

H₀: m₁ = … = m_k
- All of the population means are equal —> All average plant weights are the same
H₁: Not m₁ = … = m_k
- The population means are not all equal —> Not all of the plants have the same average weights

Run your test:

# Compute one-way ANOVA test with PlantGrowth data
res.aov <- aov(weight ~group, data = PlantGrowth)
# Summary of the analysis
summary(res.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output includes the columns F value and Pr(<F) corresponding to the p-value of the test.

Evaluate your hypothesis: The Pr(<F) value is less than 0.05, so we reject the null hypothesis

The “Signif. codes” tell you how significant your Pr(>F) value is. For this example, we can see that it is significant at an alpha of 0.05, but not at 0.01.

Interpret your results: In one-way ANOVA test, a significant p-value (Pr(>F)) indicates that some of the group means are different, but we don’t know which pairs of groups are different. Multiple pairwise-comparison can be done to determine where the mean difference between specific pairs of group is/are statistically significant.

Tukey multiple pairwise-comparisons: We can compute Tukey HSD (Tukey Honest Significant Differences) using the R function TukeyHSD() for performing multiple pairwise-comparisons between the means of different groups using the residuals from our ANOVA (res.aov).

# Compute Tukey pairwise-comparisons to find WHERE the difference in means takes place
TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## $group
##             diff        lwr       upr     p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

diff: difference between means of the two groups
lwr, upr: the lower and the upper end point of the confidence interval at 95% (default)
p adj: p-value after adjustment for the multiple comparisons

From the output, we can see that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.012.

This is the only p adj that is less than 0.05

Check the homogeneity of variance assumption using a residuals versus fits plot

# Plot residuals vs fit
plot(res.aov,1)

NOTE: Points 17, 15, 4 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.

Leven’s test leveneTest() can also be used to check for homogeneity of variances:

require(car)
leveneTest(weight~group, data = PlantGrowth)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  1.1192 0.3412
##       27

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
NOTE: leveneTest() is not a function in the base program of R and you must either “require” or install and call the library “car”.

Linear Regression

Linear regression is a commonly used modeling technique used to predict the value of a continuous variable Y given one or more KNOWN predictor variables X ( X₁, X₂, X₃, etc.). Linear regression analyses find a mathematical equation that scientists and data analysis can use to predict Y values when only X values are known.

The mathematical equation is generalized as: Y = β₀ + β₁ X₁ + ϵ

Y is the unknown continuous variable (dependent variable)
β₀ is the y-intercept
β₁ is the slope for X₁
Collectively, the y-intercepts and X slopes are called beta coefficients
X is your known input predictor variable(s) (independent variable(s))
- If you had more than one variable (X₁, X₂, X₃, etc), each variable would have it’s own slope
ϵ is your random error term (unknown/unmeasured factors affecting Y) *very rarely (if ever) are we ever able to account for every possible factor that may affect our Y

Linear Regression Line

This mathematical formula is very similar in use and concept to one that you are likely already very familiar with: Y = mX + b. Where Y is the dependent variable, m is the slope, X is the independent variable, and b is the y-intercept.

Assumptions:

The relationship between the response and predictor variables is linear and additive
The data are normally distributed

Typical questions:

Is there a relationship/correlation between the independent and dependent variables?

Corresponding null hypothesis (H₀):

H₀ : beta coefficients = 0 (There is no relationship between the independent and dependent vairables)

The corresponding alternative hypotheses (Ha) are as follow:

H₁ : beta coefficients ≠ 0 (A relationship exists between the independent and dependent variables)

**Example Using the Cars dataset**

Description of data: The data give the speed of cars and the distances taken to stop. Note: the data were recorded in the 1920s.

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Here, we will create a mathematical equation for stopping distance as a function of speed and build a linear regression model so that we can predict the stopping distance when only the speed of the car is known.

Graphical Analysis

Visualizing our data first can help us to better create our model.

Scatter plot: Visualize the linear relationship between X and Y
Box plot: Identify outliers
Density plot: Observe the distribution of X

Scatter Plot: Stopping Distance vs Speed

scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed")

The scatter plot with the smoothing line suggests a linear and positive relationship between dist and speed.

Box Plot: Checking for Outliers

par(mfrow=c(1, 2))  # divide graph area in 2 columns
boxplot(cars$speed, main="Speed", sub=paste("Outlier rows: ", boxplot.stats(cars$speed)$out))  # box plot for 'speed'
boxplot(cars$dist, main="Distance", sub=paste("Outlier rows: ", boxplot.stats(cars$dist)$out))  # box plot for 'distance'

Density Plot: General check for normality

library(e1071)  # for skewness function
par(mfrow=c(1, 2))  # divide graph area in 2 columns
plot(density(cars$speed), main="Density Plot: Speed", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(cars$speed), 2)))  # density plot for 'speed'
polygon(density(cars$speed), col="red")
plot(density(cars$dist), main="Density Plot: Distance", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(cars$dist), 2)))  # density plot for 'dist'
polygon(density(cars$dist), col="red")

Correlation Analysis

Correlation analysis looks at the strength of the relationship between the two continuous variables by computing their correlation coefficient. Correlation is a statistical measure that shows the degree of linear dependence between two variables and is scaled from -1 to 1.

If Y increases as you increase X, they are said to have a strong positive correlation and will have a value close to 1.
If Y decreases as you increase X (or vice versa), they are said to have a strong negative correlation and will have a value close to -1.
Values close to 0 suggest a weak relationship between the variables and that variation of the Y variable is not explained by the X variable. In this case, other variables should be investigated.

Remember, correlation ≠ causation. Correlation is only an aid to understand the relationship. If two variables have a high correlation, it does not necessarily mean that X causes Y.

Let’s look at the correlation between speed and dist:

cor(cars$speed, cars$dist)

## [1] 0.8068949

Our output shows that there is a strong positive correlation between speed and dist

Building the Linear Regression Model

The function for building a linear model is lm() and takes two main arguments: formula and data. Below is an example of the general syntax for this function:

model.name <- lm( Y ~ X, data = dataframe)

** Example using Cars**

linearMod <- lm(dist ~ speed, data=cars)
print(linearMod)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

We have now established the relationship between our predictor X and response Y as a mathematical formula.

The coefficients listed are the y-intercept (-17.579) and the slope of our predictor variable X, speed, (3.932).
These coefficients can now be used to form our formula: Y = 17.579 + 3.932X
- Note: in this simple model we did not include a random error term

Linear Regression Diagnostics

Before using the model we created to make predictions, we must first ensure that it is statistically significant. We can do this by using the summary function summary( ) on our named model, linearMod:

summary(linearMod)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As with the other statistical analyses, significance is checked by evaluating the p-value. With linear regression, the model can only be considered significant if ALL coefficient p-values are significant. Remember, p-values are only considered significant if they are less than the associated alpha value, α. For our purposes, α = 0.05.

Note: the t-value is the coefficient divided by standard error. It indicates the likelihood that the coefficient is not equal to 0 purple by chance. A t-value that is comparatively larger than it’s corresponding standard error indicates it is less likely that the coefficient is not equal to zero purely by chance. For the sake of this class though, you can go by the p-values.

Hypothesis Evaluation
For this example, both coefficient p-values are smaller than 0.05, and therefore the null hypothesis is rejected.

Model Comparisons

There are a couple of different methods that can be used to compare different models to see which one best fits or represents the data:

R-Squared and Adjusted R-Squared
R-squared tells us the proportion of variation in the dependent variable that has been explained by the model. The adjusted R-squared does the same, but also accounts for the number of predictors in the model. When comparing models, it is best to compare with the adjusted R-squared rather than just the R-squared. The R output refers to these as the “Multiple R-squared” and “Adjusted R-squared”.

In our example, we can see that about 64-65% of the variation in Y (stoping distance) is explained by the model.

For R-squared and adjusted R-squared, the higher the values, the better the fit.

AIC and BIC
We won’t get into the details here, but AIC (Akaike’s information criterion) and BIC (Bayesian information criterion) are both measures of the goodness of fit of linear regression models and can be used for model selection.

For model comparison, the model with the lowest AIC and BIC value is the “better” model.

AIC:

AIC(linearMod)

## [1] 419.1569

BIC:

BIC(linearMod)

## [1] 424.8929

Example using mtcars

Below are linear regression models using the dataframe mtcars to show the difference between models and using linear models with mulitple X variables. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Horsepower (hp) will be our Y variable.

A data frame with 32 observations on 11 (numeric) variables.

mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs Engine (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Models 1-4:

model.1 <- lm(hp ~ mpg + wt + drat + qsec, data=mtcars)  
summary (model.1)

## 
## Call:
## lm(formula = hp ~ mpg + wt + drat + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.801 -16.007  -5.482  11.614  97.338 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  473.779    105.213   4.503 0.000116 ***
## mpg           -2.877      2.381  -1.209 0.237319    
## wt            26.037     13.514   1.927 0.064600 .  
## drat           4.819     15.952   0.302 0.764910    
## qsec         -20.751      3.993  -5.197 1.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.25 on 27 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.7787 
## F-statistic: 28.27 on 4 and 27 DF,  p-value: 2.647e-09

AIC(model.1)

## [1] 319.6877

BIC(model.1)

## [1] 328.4822

model.2 <- lm(hp ~ mpg + wt + drat + vs, data=mtcars)  
summary (model.2)

## 
## Call:
## lm(formula = hp ~ mpg + wt + drat + vs, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.345 -17.558  -9.001  14.027 131.889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  223.900    125.096   1.790   0.0847 .
## mpg           -6.747      2.767  -2.439   0.0216 *
## wt             3.272     16.024   0.204   0.8397  
## drat          19.416     19.885   0.976   0.3375  
## vs           -50.335     19.505  -2.581   0.0156 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.86 on 27 degrees of freedom
## Multiple R-squared:  0.6907, Adjusted R-squared:  0.6449 
## F-statistic: 15.07 on 4 and 27 DF,  p-value: 1.361e-06

AIC(model.2)

## [1] 334.8215

BIC(model.2)

## [1] 343.6159

model.3 <- lm(hp ~ cyl + wt + drat + vs, data=mtcars)  
summary (model.3)

## 
## Call:
## lm(formula = hp ~ cyl + wt + drat + vs, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.151 -21.562  -7.202  22.794 122.591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -226.872    120.090  -1.889  0.06965 . 
## cyl           33.078      9.642   3.431  0.00195 **
## wt            11.142     12.040   0.925  0.36296   
## drat          38.102     19.680   1.936  0.06339 . 
## vs            -9.125     24.247  -0.376  0.70962   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.66 on 27 degrees of freedom
## Multiple R-squared:  0.7372, Adjusted R-squared:  0.6982 
## F-statistic: 18.93 on 4 and 27 DF,  p-value: 1.605e-07

AIC(model.3)

## [1] 329.6141

BIC(model.3)

## [1] 338.4085

model.4 <- lm(hp ~ cyl + wt, data=mtcars)  
summary (model.4)

## 
## Call:
## lm(formula = hp ~ cyl + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.98 -25.75 -10.93  20.88 130.95 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -51.806     26.231  -1.975   0.0579 .  
## cyl           31.388      6.343   4.949 2.93e-05 ***
## wt             1.330     11.577   0.115   0.9093    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.27 on 29 degrees of freedom
## Multiple R-squared:  0.6931, Adjusted R-squared:  0.6719 
## F-statistic: 32.75 on 2 and 29 DF,  p-value: 3.641e-08

AIC(model.4)

## [1] 330.5718

BIC(model.4)

## [1] 336.4348

By comparing the Adjusted R-squared values (looking for the highest value) and comparing the AIC and BIC (looking for the lowest values) we can see that Model 1 best fits our data; however, mpg and drat do not appear to be signicant variables. We can improve our model even better by removing those two variables:

model.5 <- lm(hp ~ wt + qsec, data=mtcars)  
summary (model.5)

## 
## Call:
## lm(formula = hp ~ wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.192 -17.334  -4.859  11.234  98.410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  441.263     64.639   6.827 1.70e-07 ***
## wt            38.670      5.957   6.492 4.17e-07 ***
## qsec         -23.474      3.262  -7.197 6.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.95 on 29 degrees of freedom
## Multiple R-squared:  0.7968, Adjusted R-squared:  0.7828 
## F-statistic: 56.87 on 2 and 29 DF,  p-value: 9.205e-11

AIC(model.5)

## [1] 317.3736

BIC(model.5)

## [1] 323.2365

Data and information from the Linear Regression section was borrowed from machinelearningplus.com. For more information on linear regression and how to use your linear model to predict new data be sure to visit their page.

Non-parametric Tests

Unpaired Two-Samples Wilcoxon Test (Independent t-test Alternative)
- Also known as Wilcoxon rank sum test or Mann-Whitney test
Paired samples Wilcoxon Test (Paired t-test Alternative)
- Also known as Wilcoxon signed rank test
Kruskal-Wallis Test (One-way ANOVA Alternative)

For these examples, we will assume that the previou assumption tests have already been completed.

Unpaired Two-Samples Wilcoxon Test (Non-parametric Independent t-test Alternative)

The unpaired two-samples Wilcoxon test (also known as Wilcoxon rank sum test or Mann-Whitney test) is a non-parametric alternative to the unpaired two-samples t-test, which can be used to compare two independent groups of samples. It’s used when your data are not normally distributed.

This function can be used for one or two sample tests. Here, we will show an example using a two-sample Wilcoxon test. For a one-sample Wilcoxon test example, visit: http://www.sthda.com/english/wiki/one-sample-wilcoxon-signed-rank-test-in-r/.

We will use the example data below which contains the weight of 19 individuals (9 women, 9 men).

# Create data
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
weight_data <- data.frame( 
                group = rep(c("Woman", "Man"), each = 9),
                weight = c(women_weight,  men_weight)
                )
#Check data
print(weight_data)

##    group weight
## 1  Woman   38.9
## 2  Woman   61.2
## 3  Woman   73.3
## 4  Woman   21.8
## 5  Woman   63.4
## 6  Woman   64.6
## 7  Woman   48.4
## 8  Woman   48.8
## 9  Woman   48.5
## 10   Man   67.8
## 11   Man   60.0
## 12   Man   63.4
## 13   Man   76.0
## 14   Man   89.4
## 15   Man   73.3
## 16   Man   67.3
## 17   Man   61.3
## 18   Man   62.4

Question: Is there any significant difference between women and men weights?

Hypothesis:

H₀: m_W = m_M
H₁: m_W ≠ m_M

Compute the two-sample Wilcoxon test:

wilcox.test(weight ~ group, data = weight_data)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  weight by group
## W = 66, p-value = 0.02712
## alternative hypothesis: true location shift is not equal to 0

Note: R may give a warning message stating “cannot compute exact p-value with tie”. This stems form the assumption of a Wilcoxon test that the responses are continuous. This can be suppressed by adding the argument “exact = FALSE”, but the results remain the same:

wilcox.test(weight ~ group, data = weight_data, exact = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  weight by group
## W = 66, p-value = 0.02712
## alternative hypothesis: true location shift is not equal to 0

Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis

Interpret your results: We can conclude that men’s median weight is significantly different from women’s median weight.

If you wanted to test whether the women’s weights were “greater than” or “less than”, you could add the argument alternative = "greater" or alternative = "less", respectively.
- These would be one-tailed tests.

Paired samples Wilcoxon Test (Non-parametric Paired t-test Alternative)

The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. Differences between paired samples should be distributed symmetrically around the median.

Here, we’ll use an example data set, which contains the weight of 10 mice before and after the treatment.

# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
mice_weight <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after)
                )
#Check data
print(mice_weight)

##     group weight
## 1  before  200.1
## 2  before  190.9
## 3  before  192.7
## 4  before  213.0
## 5  before  241.4
## 6  before  196.9
## 7  before  172.2
## 8  before  185.5
## 9  before  205.2
## 10 before  193.7
## 11  after  392.9
## 12  after  393.2
## 13  after  345.1
## 14  after  393.0
## 15  after  434.0
## 16  after  427.9
## 17  after  422.0
## 18  after  383.9
## 19  after  392.3
## 20  after  352.2

Question: Is there any significant difference in weights of mice before and after treatment?

Our null hypothesis is that the mean differences in weight are equal to zero, while our alternative hypothesis is that the mean differences in weight are greater or less than zero (which is specified using the alternative parameter):
Hypothesis:

H₀: m = 0
H₁: m ≠ 0

Compute the paired-sample Wilcoxon test:

  wilcox.test(weight ~ group, data = mice_weight, paired = TRUE)

## 
##  Wilcoxon signed rank exact test
## 
## data:  weight by group
## V = 55, p-value = 0.001953
## alternative hypothesis: true location shift is not equal to 0

Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis

Interpret your results: We can conclude that the median weight of the mice before treatment is significantly different than after treatment

If you wanted to test whether the weights after treatment were “greater than” or “less than”, you could add the argument alternative = "greater" or alternative = "less", respectively.
- These would be one-tailed tests.

Kruskal-Wallis Test (Non-parametric One-way ANOVA Alternative)

Kruskal-Wallis test by rank is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met.

Let’s revisit the PlantGrowth dataset used in our ANOVA example above. Recall that there were several outlier points (17, 15, 4), which could severely affect normality and homogeneity of variance, which could invalidate our assumptions and affect the outcome of the test.

Instead, let’s compute a Kruskal-Wallis test on the data to test our original question: Is there any significant difference between the average weights of plants in the 3 experimental conditions?

Hypothesis:

H₀: m₁ = … = m_k
- All of the population means are equal —> All average plant weights are the same
H₁: Not m₁ = … = m_k
- The population means are not all equal —> Not all of the plants have the same average weights

Compute the Kruskal-Wallis test using kruskal.test():

kruskal.test(weight ~ group, data = PlantGrowth)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  weight by group
## Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis

Interpret your results: We can conclude that there are significant difference between treatment groups

NOTE: Just like with the ANOVA, in order to tell where those differences occur, multiple pairwise-comparisons should be made between groups. We can use the function pairwise.wilcox.test() to calculate pairwise-comparisons between group levels with corrections for multiple testing.

pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                 p.adjust.method = "BH")

## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  PlantGrowth$weight and PlantGrowth$group 
## 
##      ctrl  trt1 
## trt1 0.199 -    
## trt2 0.095 0.027
## 
## P value adjustment method: BH

From the output, we can see that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.027.

This is the only p adj that is less than 0.05

END INTRODUCTION TO STATISTICAL ANALYSIS WITH R AND RSTUDIO