For Today’s class, we will be using the iris flower data set built in R. Specifically we will be looking at the setosa iris from the setosa_data.csv file uploaded on canvas. This data file will have five columns:
Sepal.Length
Sepal.Width
Petal.Length
Petal.width
Species (all of them are setosa)
T-test
You can compute the t-test manually using the formula. However, R already has a built-in function that does this for you: t.test()
t.test takes in many parameters (try using ? to see all of them!). However, at a minimum you must pass in one (or two) vectors of data for it to compare with a null hypothesis.
For most t-tests in this class, you will be sticking to the default values for the parameters. We will talk about exceptions as they come
Let’s say you want to find the one-sample t-test to see if petal length is not 0
One Sample t-test
data: petal_length
t = 59.528, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.412645 1.511355
sample estimates:
mean of x
1.462
Now, let’s try to see if the petal lengths and sepal lengths are significantly different
Welch Two Sample t-test
data: petal_length and sepal_length
t = -63.774, df = 71.464, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.654793 -3.433207
sample estimates:
mean of x mean of y
1.462 5.006
However, we could instead use a paired t-test by setting paired = TRUE
t.test(petal_length, sepal_length, paired =TRUE)
Paired t-test
data: petal_length and sepal_length
t = -71.835, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-3.643143 -3.444857
sample estimates:
mean difference
-3.544
Confidence Intervals
The general formula for a confidence interval is the sample mean +/- some margin of error
For a significance test, it can be calculated using the following:
95% CI: M +/- t * SE
Standard Error
The standard error of a sample can be calculated by dividing the standard deviation by the square root of the sample size.
You can obtain the standard deviation of a sample by calling sd() in r:
sd(petal_length)
[1] 0.173664
You can also find the standard deviation in the describe() function:
describe(setosa_data)
vars n mean sd median trimmed mad min max range skew kurtosis
Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11 -0.45
Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.60
Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10 0.65
Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18 1.26
Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
se
Sepal.Length 0.05
Sepal.Width 0.05
Petal.Length 0.02
Petal.Width 0.01
Species* 0.00
Previously we discussed how to calculate the t-score for a vector of data points. When calculating confidence intervals, what we actually want to get is the t-crit
qt(0.975, 10)
[1] 2.228139
The first number is probability of the confidence interval. For a 95% confidence interval,
Confidence Interval
So to calculate the confidence interval for the average petal length of the setosa species:
m =mean(petal_length) # sample meansd =sd(petal_length) # sample standard deviationn =length(petal_length) # sample sizese = sd /sqrt(n) # sample standard errort_crit <-qt(1- .05/2, df = n -1) # t-crit value for a 2-tailed testmargin_error <- t_crit * se # margin of error for confidence intervallower <- m - margin_errorupper <- m + margin_errorcat("95% Confidence Interval for Iris Petal.Length:\n")
95% Confidence Interval for Iris Petal.Length:
cat("Lower bound:", round(lower, 3), "\n")
Lower bound: 1.413
cat("Upper bound:", round(upper, 3), "\n")
Upper bound: 1.511
confint()
However, you can also just use confint(), which takes in a linear model and calculates a confidence interval for each predictor.
model <-lm(Petal.Length ~ Species, data = iris)confint(model)
Note that this gives you the confidence intervals of the slope and y-intercept estimate
What does this mean?
Data Visualization
ggplot
ggplot is not built into base R, so we will have to import it before we can use it.
library(ggplot2)
ggplot (cont.)
p <-ggplot(data = setosa_data, aes(Petal.Length, Sepal.Length))p
Here are the components of a ggplot call:
p <-: Although not necessary, this creates a plot variable p that we can use later. Think of this as like a template.
data =: The data frame from which the data will be plotted.
aes(x, y): This decides what your independent and dependent variables are.
You may notice that nothing happened when you ran that line of code. You will need to add something like geom_point for the plot to show up.
p1 <- p +geom_point()p1
Adding a line of best fit to ggplot
geom_point() isn’t the only thing we can add to a plot. We can also do geom_smooth() with method = "lm" as a parameter.
p2 <- p1 +geom_smooth(method ="lm")p2
Besides the regression, what else did this add to the plot?
If you want to remove the confidence band, you can do so by setting se = FALSE
p2_no_band <- p1 +geom_smooth(method ="lm", se =FALSE)p2_no_band
Customizing your graph
There are nearly infinite ways to customize your graph in ggplot! We won’t be able to cover all of them in this class, but here are just a few:
p3 <- p +geom_point(color ="blue", size =5) +geom_smooth(method ="lm", se =FALSE, color ="seagreen") +labs(title ="Petal vs Sepal Length", x ="Petal Length", y ="Sepal Length") +theme_minimal()p3
You can also do the main and axis titles separately without using labs