Lab 3

John Yi, Emma Fightmaster

Data

For Today’s class, we will be using the iris flower data set built in R. Specifically we will be looking at the setosa iris from the setosa_data.csv file uploaded on canvas. This data file will have five columns:

Sepal.Length
Sepal.Width
Petal.Length
Petal.width
Species (all of them are setosa)

T-test

You can compute the t-test manually using the formula. However, R already has a built-in function that does this for you: t.test()

t.test takes in many parameters (try using ? to see all of them!). However, at a minimum you must pass in one (or two) vectors of data for it to compare with a null hypothesis.
For most t-tests in this class, you will be sticking to the default values for the parameters. We will talk about exceptions as they come

Let’s say you want to find the one-sample t-test to see if petal length is not 0

setosa_data <- read.csv("setosa_data.csv")
petal_length <- setosa_data$Petal.Length
t.test(petal_length)


    One Sample t-test

data:  petal_length
t = 59.528, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.412645 1.511355
sample estimates:
mean of x 
    1.462

Now, let’s try to see if the petal lengths and sepal lengths are significantly different

sepal_length <- setosa_data$Sepal.Length
t.test(petal_length, sepal_length)


    Welch Two Sample t-test

data:  petal_length and sepal_length
t = -63.774, df = 71.464, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.654793 -3.433207
sample estimates:
mean of x mean of y 
    1.462     5.006

However, we could instead use a paired t-test by setting paired = TRUE

t.test(petal_length, sepal_length, paired = TRUE)


    Paired t-test

data:  petal_length and sepal_length
t = -71.835, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -3.643143 -3.444857
sample estimates:
mean difference 
         -3.544

Confidence Intervals

The general formula for a confidence interval is the sample mean +/- some margin of error

For a significance test, it can be calculated using the following:

95% CI: M +/- t * SE

Standard Error

The standard error of a sample can be calculated by dividing the standard deviation by the square root of the sample size.

You can obtain the standard deviation of a sample by calling sd() in r:

sd(petal_length)

[1] 0.173664

You can also find the standard deviation in the describe() function:

describe(setosa_data)

             vars  n mean   sd median trimmed  mad min max range skew kurtosis
Sepal.Length    1 50 5.01 0.35    5.0    5.00 0.30 4.3 5.8   1.5 0.11    -0.45
Sepal.Width     2 50 3.43 0.38    3.4    3.42 0.37 2.3 4.4   2.1 0.04     0.60
Petal.Length    3 50 1.46 0.17    1.5    1.46 0.15 1.0 1.9   0.9 0.10     0.65
Petal.Width     4 50 0.25 0.11    0.2    0.24 0.00 0.1 0.6   0.5 1.18     1.26
Species*        5 50 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN      NaN
               se
Sepal.Length 0.05
Sepal.Width  0.05
Petal.Length 0.02
Petal.Width  0.01
Species*     0.00

To calculate the standard error:

sd = sd(petal_length)
n = length(petal_length)
se = sd / sqrt(n)
print(se)

[1] 0.0245598

T-crit

Previously we discussed how to calculate the t-score for a vector of data points. When calculating confidence intervals, what we actually want to get is the t-crit

qt(0.975, 10)

[1] 2.228139

The first number is probability of the confidence interval. For a 95% confidence interval,

Confidence Interval

So to calculate the confidence interval for the average petal length of the setosa species:

m = mean(petal_length) # sample mean
sd = sd(petal_length) # sample standard deviation
n = length(petal_length) # sample size
se = sd / sqrt(n) # sample standard error

t_crit <- qt(1 - .05/2, df = n - 1) # t-crit value for a 2-tailed test

margin_error <- t_crit * se # margin of error for confidence interval

lower <- m - margin_error
upper <- m + margin_error

cat("95% Confidence Interval for Iris Petal.Length:\n")

95% Confidence Interval for Iris Petal.Length:

cat("Lower bound:", round(lower, 3), "\n")

Lower bound: 1.413

cat("Upper bound:", round(upper, 3), "\n")

Upper bound: 1.511

confint()

However, you can also just use confint(), which takes in a linear model and calculates a confidence interval for each predictor.

model <- lm(Petal.Length ~ Species, data = iris)

confint(model)

                     2.5 %   97.5 %
(Intercept)       1.341729 1.582271
Speciesversicolor 2.627912 2.968088
Speciesvirginica  3.919912 4.260088

Note that this gives you the confidence intervals of the slope and y-intercept estimate
What does this mean?

Data Visualization

ggplot

ggplot is not built into base R, so we will have to import it before we can use it.

library(ggplot2)

ggplot (cont.)

p <- ggplot(data = setosa_data, aes(Petal.Length, Sepal.Length))
p

Here are the components of a ggplot call:

p <-: Although not necessary, this creates a plot variable p that we can use later. Think of this as like a template.
data =: The data frame from which the data will be plotted.
aes(x, y): This decides what your independent and dependent variables are.

You may notice that nothing happened when you ran that line of code. You will need to add something like geom_point for the plot to show up.

p1 <- p + geom_point()
p1

Adding a line of best fit to ggplot

geom_point() isn’t the only thing we can add to a plot. We can also do geom_smooth() with method = "lm" as a parameter.

p2 <- p1 + geom_smooth(method = "lm")
p2

Besides the regression, what else did this add to the plot?

If you want to remove the confidence band, you can do so by setting se = FALSE

p2_no_band <- p1 + geom_smooth(method = "lm", se = FALSE)
p2_no_band

Customizing your graph

There are nearly infinite ways to customize your graph in ggplot! We won’t be able to cover all of them in this class, but here are just a few:

p3 <- p +
  geom_point(color = "blue", size = 5) +
  geom_smooth(method = "lm", se = FALSE, color = "seagreen") +
  labs(title = "Petal vs Sepal Length", x = "Petal Length", y = "Sepal Length") +
  theme_minimal()
p3

You can also do the main and axis titles separately without using labs

p4 <- p2 +
  ggtitle("Petal vs Sepal Length") +
  xlab("Petal Length") +
  ylab("Sepal Length")
p4

Saving your Graph

Although it’s nice to view your plots in RStudio, sometimes you would like to also have it saved for future use. You can do so using ggsave()

ggsave("petal_vs_sepal_length.png", p3)

You can specify the file type in the filename (for example, try changing the name to a .jpg)