In general, hypotheses are never “proven”; they can only be tested. In most cases, they cannot be thoroughly evaluated by a single experiment, and that is where statistics become useful. Statistics allow us to evaluate our hypotheses using sample data that are representative of the larger population, so that we can infer conclusions about the whole population.
Hypotheses come in pairs of mutually exclusive statements:
Hypotheses can be one-sided or two-sided (sometimes referred to as one- or two-“tailed”):
In science, we assume our samples are representative of the larger population; however, we are unable to account for every random or uncontrollable variable that is active within the system. This can sometimes lead to errors in the interpretation of our statistical outcomes.
The test decision is generally based on the p-value, the (conditional) probability, assuming H0 is true, of obtaining a test statistic at least as extreme as the one observed. H0 is rejected when the test statistic falls in the rejection region (the red-shaded region in the distribution images above), which is determined by the value of alpha, α. In most cases, especially in ecology and “non-medical” biology, α = 0.05. This means that there is a 5% chance of a Type I error. In the medical world, researchers may aim for a smaller α. For our class, we will use α = 0.05.
Two types of frequentist statistical analyses are parametric and nonparametric. Parametric statistics are based on assumptions about the population distribution from which the sample (the data) was taken. One of the assumptions made in parametric analyses is that the data are normally distributed. Normally distributed data peak in the middle and are symmetrical about the mean (i.e., a bell curve). Data do not need to be perfectly normal in order to use parametric tests; however, you should check your data for normality before using these analyses.
Nonparametric statistics have fewer assumptions than parametric tests and are not restricted to normal distributions (the data can be collected from a sample that does not follow a specific distribution).
Data can be checked for normality using visual methods such as density plots, histograms, and Q-Q plots. Density plots and histograms show whether the data follow a bell shape, and Q-Q (quantile-quantile) plots compare the quantiles of a given sample against the quantiles of the normal distribution.
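For example, a minimal sketch of these visual checks using base R graphics and a simulated vector x as a stand-in for your own data:
# Simulated data as a stand-in for your own numeric vector
x <- rnorm(100, mean = 5, sd = 3)
par(mfrow = c(1, 3))                    # show three plots side by side
hist(x, main = "Histogram")             # look for a rough bell shape
plot(density(x), main = "Density plot") # should peak in the middle and be symmetric
qqnorm(x)                               # Q-Q plot: points should fall near the line
qqline(x)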
Data can be quantitatively checked for normality by using the Shapiro-Wilk test, shapiro.test(x), where “x” is a numeric vector of data values. When checking a particular column in a data frame, instead of “x”, use the format “dataframe$column”.
Null Hypothesis: The data are normally distributed
Alternative Hypothesis: The data are not normally distributed
NOTE: If p > 0.05, normality can be assumed.
Example of Shapiro-Wilk normality test with normally distributed data outcome:
shapiro.test(rnorm(100, mean = 5, sd = 3))
##
## Shapiro-Wilk normality test
##
## data: rnorm(100, mean = 5, sd = 3)
## W = 0.9897, p-value = 0.6405
Example of Shapiro-Wilk normality test with non-normally distributed data outcome:
shapiro.test(runif(100, min = 2, max = 4))
##
## Shapiro-Wilk normality test
##
## data: runif(100, min = 2, max = 4)
## W = 0.95724, p-value = 0.002567
If the data are not normally distributed, use one of the tests described in the Nonparametric Tests in R section below.
A one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ). Generally, the theoretical mean comes from a previous experiment or from an experiment where you have control and treatment conditions. If you express your data as a “percent of control”, you can test whether the average value of the treatment condition differs significantly from 100.
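For example, a minimal sketch of this “percent of control” idea, using a hypothetical vector pct_control of made-up treatment values:
# Hypothetical treatment values expressed as a percent of control
pct_control <- c(92, 105, 88, 110, 97, 101, 94, 99)
# One-sample t-test against the theoretical mean of 100
t.test(pct_control, mu = 100)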
Assumptions:
Typical questions:
Corresponding null hypothesis (H0):
The corresponding alternative hypotheses (Ha) are as follows:
Note:
Description of data: An experiment was conducted to determine which of two soporific drugs was better at increasing the hours of sleep individuals received, on average. There were two groups of 10 patients each. One group received the first drug, and the other group received the second drug. The amount of extra sleep that individuals received when drugged was measured. The data is contained in the sleep data set in R.
Data frame name: sleep
Format: a data frame with 20 observations on 3 variables
NOTE: The group variable name may be misleading: the values represent measurements on 10 persons, not on independent “groups”.
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
Below is an example of a one-sample t-test using a dataset provided by R. The Sleep data show the effect of two soporific drugs on 10 patients, where the observed values are differences in hours of sleep compared to the control group (the “extra” column).
FIRST: check for normality using the Shapiro-Wilk test
shapiro.test(sleep$extra)
##
## Shapiro-Wilk normality test
##
## data: sleep$extra
## W = 0.94607, p-value = 0.3114
H0: The data are normally distributed
H1: The data are not normally distributed
The resulting p-value is greater than 0.05, so we fail to reject the null hypothesis. Therefore, the data are assumed to be normally distributed.
Now we can run our t-test:
Here, we are going to run a t-test on all observed data. To do this, we will specify x as the vector of values from the “extra” column and submit it to a one-sample t-test. We will do this by specifying x as sleep$extra, which pulls only the “extra” column from the “sleep” dataset. Our null hypothesis is that the average differences in sleep are equal to, or less than zero (zero is the default μ value) while our alternative hypothesis is that the average differences in sleep are greater than zero (which is specified using the alternative parameter):
# One-sample t-test using Sleep data.
t.test(x = sleep$extra, alternative = "greater")
##
## One Sample t-test
##
## data: sleep$extra
## t = 3.413, df = 19, p-value = 0.001459
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 0.7597797 Inf
## sample estimates:
## mean of x
## 1.54
A Two-sample t-test (unpaired) is used to compare the means of two independent groups.
Assumptions:
shapiro.test() for BOTH samples
var.test() to check for equality of variances (if the p-value is less than 0.05, there is a significant difference between the variances)
Typical questions:
Corresponding null hypothesis (H0):
The corresponding alternative hypotheses (Ha) are as follows:
Note:
Description of data: An experiment was conducted to determine which of two soporific drugs was better at increasing the hours of sleep individuals received, on average. There were two groups of 10 patients each. One group received the first drug, and the other group received the second drug. The amount of extra sleep that individuals received when drugged was measured. The data is contained in the sleep data set in R.
Data frame name: sleep
Format: a data frame with 20 observations on 3 variables
NOTE: The group variable name may be misleading: the values represent measurements on 10 persons, not on independent “groups”. For this example, we will be treating Group 1 and Group 2 as independent groups.
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
Below is an example of a two-sample t-test using a dataset provided by R. The Sleep data show the effect of two soporific drugs on 10 patients, where the observed values are differences in hours of sleep compared to the control group (the “extra” column).
FIRST: check for normality using the Shapiro-Wilk test
H0: The data are normally distributed
H1: The data are not normally distributed
#Shapiro-Wilk test for Group 1
with(sleep, shapiro.test(extra[group == "1"]))
##
## Shapiro-Wilk normality test
##
## data: extra[group == "1"]
## W = 0.92581, p-value = 0.4079
#Shapiro-Wilk test for Group 2
with(sleep, shapiro.test(extra[group == "2"]))
##
## Shapiro-Wilk normality test
##
## data: extra[group == "2"]
## W = 0.9193, p-value = 0.3511
The resulting p-values are greater than 0.05, so we fail to reject the null hypotheses and the data are assumed to be normally distributed.
SECOND: check for homogeneity in variances using the F-test
H0: Var1 = Var2
H1: Var1 ≠ Var2
var.test(extra ~ group, data = sleep)
##
## F test to compare two variances
##
## data: extra by group
## F = 0.79834, num df = 9, denom df = 9, p-value = 0.7427
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.198297 3.214123
## sample estimates:
## ratio of variances
## 0.7983426
The resulting p-value is greater than 0.05, so we fail to reject the null hypothesis and conclude that there is no significant difference between the variances of the two groups. Therefore, we can use the classic t-test, which assumes equality of the two variances.
NOTE: If the variances were found to be unequal, Welch’s t-test could be used to test whether the means of the two populations are equal.
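In R, Welch’s t-test is the default for t.test() when var.equal = TRUE is not specified; a minimal sketch with the same data would look like this:
# Welch's t-test (does not assume equal variances)
t.test(extra ~ group, data = sleep)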
Now we can run our t-test:
Our null hypothesis is that the mean extra hours of sleep for group 1 equal the mean extra hours of sleep for group 2, while our alternative hypothesis is that the mean extra hours of sleep for group 1 differ from (are either greater than or less than) the mean extra hours of sleep for group 2:
# Two-sample t-test using Sleep data.
t.test(extra ~ group, data = sleep, var.equal= TRUE)
##
## Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 18, p-value = 0.07919
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.363874 0.203874
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
#This will default to the Welch's t-test if you leave off "var.equal = TRUE"
A Paired t-test compares the means between two related groups of samples: two values (i.e., pair of values) for the same samples, a “before vs. after” scenario, or the same sample under two different treatments. It is a special case of the two-sample t-test for comparing the sample means of a continuous response variable (Y) measured for two different values of a categorical variable (X).
Assumptions:
shapiro.test()
Typical questions:
Corresponding null hypothesis (H0):
The corresponding alternative hypotheses (Ha) are as follows:
Note:
Description of data: An experiment was conducted to determine which of two soporific drugs was better at increasing the hours of sleep individuals received, on average. There were two groups of 10 patients each. One group received the first drug, and the other group received the second drug. The amount of extra sleep that individuals received when drugged was measured. The data is contained in the sleep data set in R.
Data frame name: sleep
Format: a data frame with 20 observations on 3 variables
NOTE: The group variable name may be misleading: the values represent measurements on 10 persons, not on independent “groups”. These data are actually paired.
# The first value for group 1 is paired with the first value for group 2
## i.e., row 1 and row 11, row 2 and row 12, row 3 and row 13, etc.
## Notice how the IDs for these rows match
print(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 9 0.0 1 9
## 10 2.0 1 10
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## 14 0.1 2 4
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
Below is an example of a paired t-test using a dataset provided by R. The Sleep data show the effect of two soporific drugs on 10 patients, where the observed values are differences in hours of sleep compared to the control group (the “extra” column).
FIRST: check for normality using the Shapiro-Wilk test
H0: The data are normally distributed
H1: The data are not normally distributed
#First, find the difference between group 1 and group 2 values for each unique ID
sleep1 <- with(sleep,
extra[group == "2"] - extra[group == "1"])
head(sleep1)
## [1] 1.2 2.4 1.3 1.3 0.0 1.0
#Then complete the Shapiro-Wilk normality test for the differences
shapiro.test(sleep1)
##
## Shapiro-Wilk normality test
##
## data: sleep1
## W = 0.82987, p-value = 0.03334
The resulting p-value is less than 0.05, so we must reject the null hypothesis. However, if we look at the box plot of our data, we can see that we have an outlier for ID 9, with a difference of 4.6 extra hours.
#create boxplot for sleep data to view outliers
stripchart(sleep1, method = "stack", xlab = "hours",
main = "Sleep prolongation (n = 10)")
boxplot(sleep1, horizontal = TRUE, add = TRUE,
at = .6, pars = list(boxwex = 0.5, staplewex = 0.25))
We can remove the paired values for ID 9, and rerun the normality test:
#Remove row values where ID == 9
sleep.rm<-sleep[sleep$ID !=9, ]
#Now rerun the Shapiro Wilk test: find the difference between group 1 and group 2 values for each unique ID
sleep1 <- with(sleep.rm,
extra[group == "2"] - extra[group == "1"])
head(sleep1)
## [1] 1.2 2.4 1.3 1.3 0.0 1.0
#Then complete the Shapiro-Wilk normality test for the differences
shapiro.test(sleep1)
##
## Shapiro-Wilk normality test
##
## data: sleep1
## W = 0.95736, p-value = 0.7703
The resulting p-value is greater than 0.05, so we fail to reject the null hypothesis, and the data are considered to be normally distributed.
Now we can run our t-test:
Our null hypothesis is that the mean differences in sleep are equal to zero (zero is the default μ value), while our alternative hypothesis is that the mean differences in sleep are greater or less than zero (the two-sided default for the alternative parameter):
# Paired t-test using Sleep data.
t.test(extra~group, data=sleep.rm, paired=TRUE)
##
## Paired t-test
##
## data: extra by group
## t = -5.6587, df = 8, p-value = 0.0004766
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.7515777 -0.7373112
## sample estimates:
## mean of the differences
## -1.244444
#This will default to the standard t-test if you leave off "paired = TRUE"
The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-samples t-test for comparing means in situations where there are more than two groups. In one-way ANOVA, the data are organized into several groups based on a single grouping variable (also called a factor variable).
Assumptions:
leveneTest() to check for the homogeneity of variances (after the ANOVA is run)
shapiro.test()
Typical questions:
Corresponding null hypothesis (H0):
The corresponding alternative hypotheses (Ha) are as follows:
Description of data: Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions. The data is contained in the PlantGrowth data set in R.
Data frame name: PlantGrowth
Format: A data frame of 30 cases on 2 variables.
head(PlantGrowth)
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
Below is an example of an ANOVA using a dataset, PlantGrowth, provided by R. We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.
FIRST: check for normality using the Shapiro-Wilk test for EACH group
H0: The data are normally distributed
H1: The data are not normally distributed
#Complete the Shapiro-Wilk normality test for each group
with(PlantGrowth, shapiro.test(weight[group =="ctrl"]))
##
## Shapiro-Wilk normality test
##
## data: weight[group == "ctrl"]
## W = 0.95668, p-value = 0.7475
with(PlantGrowth, shapiro.test(weight[group =="trt1"]))
##
## Shapiro-Wilk normality test
##
## data: weight[group == "trt1"]
## W = 0.93041, p-value = 0.4519
with(PlantGrowth, shapiro.test(weight[group =="trt2"]))
##
## Shapiro-Wilk normality test
##
## data: weight[group == "trt2"]
## W = 0.94101, p-value = 0.5643
The resulting p-values are greater than 0.05, so we fail to reject the null hypotheses, and the data are considered to be normally distributed.
Now we can run our ANOVA:
Our null hypothesis is that the mean plant weights are equal across the three experimental conditions, while our alternative hypothesis is that at least one group mean differs from the others:
# Compute one-way ANOVA test with PlantGrowth data
res.aov <- aov(weight ~group, data = PlantGrowth)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 3.766 1.8832 4.846 0.0159 *
## Residuals 27 10.492 0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output includes the columns F value and Pr(>F), the latter corresponding to the p-value of the test.
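If you need the p-value itself (for example, to report it), one way to pull it out of the summary object is sketched below; the indexing assumes a single-factor aov fit like res.aov above:
# Extract the p-value for the group effect from the ANOVA table
summary(res.aov)[[1]][["Pr(>F)"]][1]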
We can use TukeyHSD() to perform multiple pairwise comparisons between the means of the different groups, using our fitted ANOVA model (res.aov).
# Compute Tukey pairwise-comparisons to find WHERE the difference in means takes place
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## $group
## diff lwr upr p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
From the output, we can see that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.012.
# Plot residuals vs fit
plot(res.aov,1)
NOTE: Points 17, 15, 4 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.
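One possible approach, assuming the labels 4, 15, and 17 in the residual plot correspond to row numbers in PlantGrowth, is to drop those rows and refit the model:
# Drop the flagged rows (assumed to be row numbers) and rerun the ANOVA
PlantGrowth.rm <- PlantGrowth[-c(4, 15, 17), ]
res.aov.rm <- aov(weight ~ group, data = PlantGrowth.rm)
summary(res.aov.rm)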
Levene’s test, leveneTest(), can also be used to check for homogeneity of variances:
require(car)
leveneTest(weight~group, data = PlantGrowth)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.1192 0.3412
## 27
From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
NOTE: leveneTest() is not a function in the base program of R and you must either “require” or install and call the library “car”.
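A minimal sketch of installing and loading the package:
# install.packages("car")   # only needs to be run once per machine
library(car)                 # load the package each session before calling leveneTest()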
Linear regression is a commonly used modeling technique for predicting the value of a continuous variable Y given one or more KNOWN predictor variables X (X1, X2, X3, etc.). Linear regression analyses find a mathematical equation that scientists and data analysts can use to predict Y values when only X values are known.
The mathematical equation is generalized as: Y = β0 + β1 X1 + ϵ
Linear Regression Line
This mathematical formula is very similar in use and concept to one that you are likely already familiar with: Y = mX + b, where Y is the dependent variable, m is the slope, X is the independent variable, and b is the y-intercept.
Assumptions:
Typical questions:
Corresponding null hypothesis (H0):
The corresponding alternative hypotheses (Ha) are as follows:
Description of data: The data give the speed of cars and the distances taken to stop. Note: the data were recorded in the 1920s.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
Here, we will create a mathematical equation for stopping distance as a function of speed and build a linear regression model so that we can predict the stopping distance when only the speed of the car is known.
Visualizing our data first can help us to better create our model.
Scatter Plot: Stopping Distance vs Speed
scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed")
The scatter plot with the smoothing line suggests a linear and positive relationship between dist and speed.
Box Plot: Checking for Outliers
par(mfrow=c(1, 2)) # divide graph area in 2 columns
boxplot(cars$speed, main="Speed", sub=paste("Outlier rows: ", boxplot.stats(cars$speed)$out)) # box plot for 'speed'
boxplot(cars$dist, main="Distance", sub=paste("Outlier rows: ", boxplot.stats(cars$dist)$out)) # box plot for 'distance'
Density Plot: General check for normality
library(e1071) # for skewness function
par(mfrow=c(1, 2)) # divide graph area in 2 columns
plot(density(cars$speed), main="Density Plot: Speed", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(cars$speed), 2))) # density plot for 'speed'
polygon(density(cars$speed), col="red")
plot(density(cars$dist), main="Density Plot: Distance", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(cars$dist), 2))) # density plot for 'dist'
polygon(density(cars$dist), col="red")
Correlation analysis looks at the strength of the relationship between the two continuous variables by computing their correlation coefficient. Correlation is a statistical measure that shows the degree of linear dependence between two variables and is scaled from -1 to 1.
Remember, correlation ≠ causation. Correlation is only an aid to understand the relationship. If two variables have a high correlation, it does not necessarily mean that X causes Y.
Let’s look at the correlation between speed and dist:
cor(cars$speed, cars$dist)
## [1] 0.8068949
Our output shows that there is a strong positive correlation between speed and dist.
The function for building a linear model is lm() and takes two main arguments: formula and data. Below is an example of the general syntax for this function:
model.name <- lm( Y ~ X, data = dataframe)
**Example using Cars**
linearMod <- lm(dist ~ speed, data=cars)
print(linearMod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
We have now established the relationship between our predictor X and response Y as a mathematical formula.
Before using the model we created to make predictions, we must first ensure that it is statistically significant. We can do this by calling the summary() function on our named model, linearMod:
summary(linearMod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
As with the other statistical analyses, significance is checked by evaluating the p-value. With linear regression, the model can only be considered significant if ALL coefficient p-values are significant. Remember, p-values are only considered significant if they are less than the associated alpha value, α. For our purposes, α = 0.05.
Note: the t-value is the coefficient estimate divided by its standard error; a comparatively large t-value indicates that the estimate is unlikely to be so far from zero purely by chance. For the sake of this class, though, you can go by the p-values.
Hypothesis Evaluation
For this example, both coefficient p-values are smaller than 0.05, and therefore the null hypothesis is rejected.
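Since the model is significant, we could now use it for prediction; a minimal sketch using predict() with a few hypothetical speeds:
# Predict stopping distances for hypothetical new speeds
new_speeds <- data.frame(speed = c(10, 15, 20))
predict(linearMod, newdata = new_speeds)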
There are a couple of different methods that can be used to compare different models to see which one best fits or represents the data:
R-Squared and Adjusted R-Squared
R-squared tells us the proportion of variation in the dependent variable that has been explained by the model. The adjusted R-squared does the same, but also accounts for the number of predictors in the model. When comparing models, it is best to compare with the adjusted R-squared rather than just the R-squared. The R output refers to these as the “Multiple R-squared” and “Adjusted R-squared”.
In our example, we can see that about 64-65% of the variation in Y (stopping distance) is explained by the model.
For R-squared and adjusted R-squared, the higher the values, the better the fit.
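If you want these values as numbers (for example, to compare several models side by side), they can be pulled from the model summary; a minimal sketch:
summary(linearMod)$r.squared       # Multiple R-squared
summary(linearMod)$adj.r.squared   # Adjusted R-squared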
AIC and BIC
We won’t get into the details here, but AIC (Akaike’s information criterion) and BIC (Bayesian information criterion) are both measures of the goodness of fit of linear regression models and can be used for model selection.
For model comparison, the model with the lowest AIC and BIC value is the “better” model.
AIC:
AIC(linearMod)
## [1] 419.1569
BIC:
BIC(linearMod)
## [1] 424.8929
Below are linear regression models using the data frame mtcars to show the difference between models and how to use linear models with multiple X variables. The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Horsepower (hp) will be our Y variable.
A data frame with 32 observations on 11 (numeric) variables.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Models 1-4:
model.1 <- lm(hp ~ mpg + wt + drat + qsec, data=mtcars)
summary (model.1)
##
## Call:
## lm(formula = hp ~ mpg + wt + drat + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.801 -16.007 -5.482 11.614 97.338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 473.779 105.213 4.503 0.000116 ***
## mpg -2.877 2.381 -1.209 0.237319
## wt 26.037 13.514 1.927 0.064600 .
## drat 4.819 15.952 0.302 0.764910
## qsec -20.751 3.993 -5.197 1.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.25 on 27 degrees of freedom
## Multiple R-squared: 0.8073, Adjusted R-squared: 0.7787
## F-statistic: 28.27 on 4 and 27 DF, p-value: 2.647e-09
AIC(model.1)
## [1] 319.6877
BIC(model.1)
## [1] 328.4822
model.2 <- lm(hp ~ mpg + wt + drat + vs, data=mtcars)
summary (model.2)
##
## Call:
## lm(formula = hp ~ mpg + wt + drat + vs, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.345 -17.558 -9.001 14.027 131.889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 223.900 125.096 1.790 0.0847 .
## mpg -6.747 2.767 -2.439 0.0216 *
## wt 3.272 16.024 0.204 0.8397
## drat 19.416 19.885 0.976 0.3375
## vs -50.335 19.505 -2.581 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.86 on 27 degrees of freedom
## Multiple R-squared: 0.6907, Adjusted R-squared: 0.6449
## F-statistic: 15.07 on 4 and 27 DF, p-value: 1.361e-06
AIC(model.2)
## [1] 334.8215
BIC(model.2)
## [1] 343.6159
model.3 <- lm(hp ~ cyl + wt + drat + vs, data=mtcars)
summary (model.3)
##
## Call:
## lm(formula = hp ~ cyl + wt + drat + vs, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.151 -21.562 -7.202 22.794 122.591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -226.872 120.090 -1.889 0.06965 .
## cyl 33.078 9.642 3.431 0.00195 **
## wt 11.142 12.040 0.925 0.36296
## drat 38.102 19.680 1.936 0.06339 .
## vs -9.125 24.247 -0.376 0.70962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.66 on 27 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.6982
## F-statistic: 18.93 on 4 and 27 DF, p-value: 1.605e-07
AIC(model.3)
## [1] 329.6141
BIC(model.3)
## [1] 338.4085
model.4 <- lm(hp ~ cyl + wt, data=mtcars)
summary (model.4)
##
## Call:
## lm(formula = hp ~ cyl + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.98 -25.75 -10.93 20.88 130.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51.806 26.231 -1.975 0.0579 .
## cyl 31.388 6.343 4.949 2.93e-05 ***
## wt 1.330 11.577 0.115 0.9093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.27 on 29 degrees of freedom
## Multiple R-squared: 0.6931, Adjusted R-squared: 0.6719
## F-statistic: 32.75 on 2 and 29 DF, p-value: 3.641e-08
AIC(model.4)
## [1] 330.5718
BIC(model.4)
## [1] 336.4348
By comparing the Adjusted R-squared values (looking for the highest value) and comparing the AIC and BIC (looking for the lowest values), we can see that Model 1 best fits our data; however, mpg and drat do not appear to be significant variables. We can improve our model further by removing those two variables:
model.5 <- lm(hp ~ wt + qsec, data=mtcars)
summary (model.5)
##
## Call:
## lm(formula = hp ~ wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.192 -17.334 -4.859 11.234 98.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 441.263 64.639 6.827 1.70e-07 ***
## wt 38.670 5.957 6.492 4.17e-07 ***
## qsec -23.474 3.262 -7.197 6.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.95 on 29 degrees of freedom
## Multiple R-squared: 0.7968, Adjusted R-squared: 0.7828
## F-statistic: 56.87 on 2 and 29 DF, p-value: 9.205e-11
AIC(model.5)
## [1] 317.3736
BIC(model.5)
## [1] 323.2365
Data and information from the Linear Regression section was borrowed from machinelearningplus.com. For more information on linear regression and how to use your linear model to predict new data be sure to visit their page.
For these examples, we will assume that the previous assumption tests have already been completed.
The unpaired two-samples Wilcoxon test (also known as Wilcoxon rank sum test or Mann-Whitney test) is a non-parametric alternative to the unpaired two-samples t-test, which can be used to compare two independent groups of samples. It’s used when your data are not normally distributed.
This function can be used for one or two sample tests. Here, we will show an example using a two-sample Wilcoxon test. For a one-sample Wilcoxon test example, visit: http://www.sthda.com/english/wiki/one-sample-wilcoxon-signed-rank-test-in-r/.
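For context, a minimal one-sample sketch using a hypothetical vector x of made-up measurements and a hypothetical theoretical median of 25:
# Hypothetical data: does the median of x differ from 25?
x <- c(22.4, 26.1, 24.8, 23.0, 27.5, 25.2, 21.9, 26.8)
wilcox.test(x, mu = 25)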
We will use the example data below, which contains the weights of 18 individuals (9 women, 9 men).
# Create data
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
weight_data <- data.frame(
group = rep(c("Woman", "Man"), each = 9),
weight = c(women_weight, men_weight)
)
#Check data
print(weight_data)
## group weight
## 1 Woman 38.9
## 2 Woman 61.2
## 3 Woman 73.3
## 4 Woman 21.8
## 5 Woman 63.4
## 6 Woman 64.6
## 7 Woman 48.4
## 8 Woman 48.8
## 9 Woman 48.5
## 10 Man 67.8
## 11 Man 60.0
## 12 Man 63.4
## 13 Man 76.0
## 14 Man 89.4
## 15 Man 73.3
## 16 Man 67.3
## 17 Man 61.3
## 18 Man 62.4
Question: Is there any significant difference between women’s and men’s weights?
Hypothesis:
Compute the two-sample Wilcoxon test:
wilcox.test(weight ~ group, data = weight_data)
##
## Wilcoxon rank sum test with continuity correction
##
## data: weight by group
## W = 66, p-value = 0.02712
## alternative hypothesis: true location shift is not equal to 0
Note: R may give a warning message stating “cannot compute exact p-value with ties”. This stems from the assumption of a Wilcoxon test that the responses are continuous. The warning can be suppressed by adding the argument “exact = FALSE”, but the results remain the same:
wilcox.test(weight ~ group, data = weight_data, exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: weight by group
## W = 66, p-value = 0.02712
## alternative hypothesis: true location shift is not equal to 0
Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis.
Interpret your results: We can conclude that men’s median weight is significantly different from women’s median weight.
alternative = "greater" or alternative = "less", respectively.
The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. Differences between paired samples should be distributed symmetrically around the median.
Here, we’ll use an example data set, which contains the weight of 10 mice before and after the treatment.
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
mice_weight <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
#Check data
print(mice_weight)
## group weight
## 1 before 200.1
## 2 before 190.9
## 3 before 192.7
## 4 before 213.0
## 5 before 241.4
## 6 before 196.9
## 7 before 172.2
## 8 before 185.5
## 9 before 205.2
## 10 before 193.7
## 11 after 392.9
## 12 after 393.2
## 13 after 345.1
## 14 after 393.0
## 15 after 434.0
## 16 after 427.9
## 17 after 422.0
## 18 after 383.9
## 19 after 392.3
## 20 after 352.2
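Before testing, we can informally check the symmetry assumption mentioned above by plotting the paired differences; a minimal sketch using the before and after vectors defined above:
# Paired differences for each mouse; look for rough symmetry around the median
diffs <- after - before
hist(diffs, main = "Paired differences", xlab = "after - before")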
Question: Is there any significant difference in weights of mice before and after treatment?
Our null hypothesis is that the median difference in weight is equal to zero, while our alternative hypothesis is that the median difference in weight is greater or less than zero (the two-sided default for the alternative parameter):
Hypothesis:
Compute the paired-sample Wilcoxon test:
wilcox.test(weight ~ group, data = mice_weight, paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: weight by group
## V = 55, p-value = 0.001953
## alternative hypothesis: true location shift is not equal to 0
Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis.
Interpret your results: We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment.
alternative = "greater" or alternative = "less", respectively.
The Kruskal-Wallis test by rank is a non-parametric alternative to the one-way ANOVA test, which extends the two-samples Wilcoxon test to situations where there are more than two groups. It is recommended when the assumptions of the one-way ANOVA test are not met.
Let’s revisit the PlantGrowth dataset used in our ANOVA example above. Recall that there were several outlier points (17, 15, 4) that could severely affect normality and homogeneity of variance, potentially invalidating our assumptions and affecting the outcome of the test.
Instead, let’s compute a Kruskal-Wallis test on the data to test our original question: Is there any significant difference between the average weights of plants in the 3 experimental conditions?
Hypothesis:
Compute the Kruskal-Wallis test using kruskal.test():
kruskal.test(weight ~ group, data = PlantGrowth)
##
## Kruskal-Wallis rank sum test
##
## data: weight by group
## Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Evaluate your hypothesis: The resulting p-value is less than 0.05, so we reject the null hypothesis.
Interpret your results: We can conclude that there are significant differences between the treatment groups.
NOTE: Just like with the ANOVA, in order to tell where those differences occur, multiple pairwise-comparisons should be made between groups. We can use the function pairwise.wilcox.test() to calculate pairwise-comparisons between group levels with corrections for multiple testing.
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: PlantGrowth$weight and PlantGrowth$group
##
## ctrl trt1
## trt1 0.199 -
## trt2 0.095 0.027
##
## P value adjustment method: BH
From the output, we can see that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.027.
END INTRODUCTION TO STATISTICAL ANALYSIS WITH R AND RSTUDIO