Markdown Author: Jessie Bell, 2023
Libraries Used: car (for Levene's Test)
Answers: purple
Calculating probabilities using the normal distribution with z scores
ht <- seq(150, 190, .01) #seq creates a sequence of numbers that say, "start at 150, end at 190 with 0.01 increments"
plot(ht, dnorm(ht, 170, 8), type="l", col="#00c19a")
Answer The probability of a value under the Standard Normal Curve (SNC) taking on the value of -1.25 or smaller is 0.106
meanht <- 170 #see plot function above
sdht <- 8 #see plot function above
(160-meanht)/sdht
## [1] -1.25
#checks out!
#Calculate the prob of a value under the SNC taking on the value of -1.25 or smaller
pnorm(-1.25)
## [1] 0.1056498
Answer The probability of a randomly selected individual being shorter than 165 cm is 0.266
a <- (165-meanht)/sdht
a
## [1] -0.625
anew <- pnorm(a)
anew
## [1] 0.2659855
plot(ht, dnorm(ht, 170, 8), type="l", col="#e68613")
abline(v=165, col=2) #not necessary for you to have, but good for you to visualize
Answer The probability of a randomly selected individual being taller than 175 cm is 1 - 0.734 = 0.266, which checks out since 165 and 175 are equidistant from the mean of 170, so the two tail probabilities should match.
b <- (175-meanht)/sdht
b
## [1] 0.625
bnew <- 1-pnorm(b) #because now you want the right tail
bnew
## [1] 0.2659855
#another way to calculate this is:
pnorm(b, lower.tail = F) #where you tell R to calculate the upper tail instead of the lower tail.
## [1] 0.2659855
plot(ht, dnorm(ht, 170, 8), type="l", col="lightgreen")
abline(v=175, col=2) #not necessary for you to have, but good for you to visualize
Answer The probability of a randomly selected individual being between 165 and 175 cm tall is the probability that the value falls in neither of the tails we calculated in a and b. Since the total probability is 1, add the two tails and subtract from 1 to get the middle. That middle probability is 0.468.
c <- 1-(anew+bnew)
c #checks out! 68.26% of a normal distribution lies within 1 standard deviation of the mean, and 165 to 175 spans only about ±0.625 standard deviations, so a probability a bit under 0.68 makes sense.
## [1] 0.4680289
plot(ht, dnorm(ht, 170, 8), type="l", col="tomato")
abline(v=165, col=2)
abline(v=175, col=2) #not necessary for you to have, but good for you to visualize
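If you'd rather skip the z-scores entirely, pnorm can take the mean and sd directly; a quick alternative sketch that gives the same middle probability:
pnorm(175, mean = 170, sd = 8) - pnorm(165, mean = 170, sd = 8) #same 0.468 as above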
Helpful phrase: “If the p is low, the null has got to go.”
avglifespan_hrs <- 1200
n_lightbulbs <- 36
mean_sample_hrs <- 1150
sd_sample_hrs <- 100
H0: μ = 1200
HA: μ < 1200
lightbulbs <- seq(800, 1500, .1)
plot(lightbulbs, dnorm(lightbulbs, 1150, 100), type="l", col="plum")
bulb <- (mean_sample_hrs-avglifespan_hrs)/(sd_sample_hrs/sqrt(n_lightbulbs)) #z = (xbar - mu0)/(sd/sqrt(n)); the sample mean's standard error uses sqrt(n)
bulb
## [1] -3
# For a one-tailed (lower-tail) test with alpha = 0.05
critical <- qnorm(0.05) #about -1.645
p <- pnorm(bulb)
p
## [1] 0.001349898
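To finish the test off with the mnemonic above ("if the p is low, the null has got to go"), compare p to alpha, or equivalently compare the z statistic to the critical value. A quick sketch of that last step:
alpha <- 0.05
p < alpha        #if TRUE, reject H0 in favor of HA; if FALSE, fail to reject
bulb < critical  #same decision from the critical-value side (lower tail, so "less than")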
Testing for normality graphically below. Please know that normality is one of the three assumptions you are making when you run a t-test.
Mutant <- c(35.4, 33.0, 28.0, 42.2, 34.1, 33.2, 32.1, 36.8, 42.6, 33.2, 39.4, 28.6, 33.6,
36.0, 31.6, 29.3, 29.8, 23.1, 42.9, 32.7, 39.7, 38.4, 42.9, 29.2)
Wildtype <- c(33.6, 36.0, 31.6, 29.3, 29.8, 23.1, 42.9, 32.7, 39.7, 38.4, 42.9, 29.2, 32.5,
34.1, 33.8, 32.7, 36.7, 42.6, 43.2, 59.4, 18.6, 43.6, 56.0, 51.6, 24.3, 25.8, 53.3, 47.6, 32.7)
rangeData <- c(72.9, 40.9, 36.7, 64.2, 104.2, 33.6, 55.1, 44.3, 40.0, 91.1, 78.8) #not sure why this is a part of the lab?
mean(Mutant)
## [1] 34.49167
sd(Mutant)
## [1] 5.311507
hist(Mutant, ylim = c(0,.1), freq = F, col="pink")
curve(dnorm(x, 34.5, 5.3), add=T)
Answer The distribution for Wildtype below has a bit of a right skew, while the Mutant distribution above looks pretty normal. Because Wildtype seems like it could violate our normality assumption, we can also check for normality with other tools like a qqplot or the Shapiro-Wilk test; you will see examples of this in Part II problem 5 and Part III problem 6.
mean(Wildtype)
## [1] 37.16207
sd(Wildtype)
## [1] 9.950033
hist(Wildtype, ylim = c(0,.06), freq = F, col="steelblue")
curve(dnorm(x, 37.2, 9.95), add=T)
Answer Notice that about 68% of the data lie within one standard deviation of the mean under the normal curve.
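If you want to sanity-check that 68% figure yourself, something like this works (just a sketch):
pnorm(1) - pnorm(-1) #0.6826895, the area within one SD under the standard normal curve
mean(abs(Wildtype - mean(Wildtype)) <= sd(Wildtype)) #proportion of Wildtype values within one sample SD of the sample mean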
Answer The Q-Q plots below both look roughly linear, suggesting that both samples are approximately normally distributed.
qqnorm(Mutant, main="Mutant", col="pink")
qqline(Mutant)
qqnorm(Wildtype, main="Wildtype", col="steelblue")
qqline(Wildtype)
Testing for normality with statistical tests: SHAPIRO-WILK
Shapiro-Wilk hypotheses: H0: the data are normal; HA: the data are not normal
Answer The p-value for both Mutant & Wildtype is greater than 0.05 (p > 0.05), so we fail to reject the null: there is no evidence that the data are not normally distributed. In case you still aren’t sure, plot a friggin’ histogram! In fact, I always begin there. Visualize your data first. ALWAYS.
shapiro.test(Mutant)
##
## Shapiro-Wilk normality test
##
## data: Mutant
## W = 0.95859, p-value = 0.4107
shapiro.test(Wildtype)
##
## Shapiro-Wilk normality test
##
## data: Wildtype
## W = 0.96723, p-value = 0.4872
hist(Mutant, main="Mutant Distribution", col="pink")
hist(Wildtype, main="Wildtype Distribution", col="steelblue")
Notes: A histogram of your data, a qqplot of your data, AND the Shapiro-Wilk test are all just telling you whether your data violate the assumption of normality. You should also be testing equal variance in this lab; since the lab itself does not include an example, one using Levene's Test is worked below.
You must ensure 3 things before running a t-test:
1. your data are random, aka INDEPENDENT. Read Ch. 1 & Ch. 4 of your text for more information on randomness. For now, we assume our data are random since the data are already collected, and randomness happens in experimental design – which you will read more about in Ch. 13.
2. your data are approximately normal: histogram, qqplot, or the Shapiro-Wilk test
3. your data have equal variance: jitter/stripchart (a sketch of this follows the Levene's example below), or Levene's Test (can be done in R by adding the package “car” to your library and following the code below):
#install.packages("car")
library(car)
## Loading required package: carData
# Data
Mutant <- c(35.4, 33.0, 28.0, 42.2, 34.1, 33.2, 32.1, 36.8, 42.6, 33.2, 39.4, 28.6, 33.6, 36.0, 31.6, 29.3, 29.8, 23.1, 42.9, 32.7, 39.7, 38.4, 42.9, 29.2)
Wildtype <- c(33.6, 36.0, 31.6, 29.3, 29.8, 23.1, 42.9, 32.7, 39.7, 38.4, 42.9, 29.2, 32.5, 34.1, 33.8, 32.7, 36.7, 42.6, 43.2, 59.4, 18.6, 43.6, 56.0, 51.6, 24.3, 25.8, 53.3, 47.6, 32.7)
# Combine data into a data frame
WildData <- data.frame(Group = rep(c("Mutant", "Wildtype"), times = c(length(Mutant), length(Wildtype))),
Value = c(Mutant, Wildtype))
# Perform Levene's test
levene_test_result <- leveneTest(Value ~ Group, data = WildData)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
print(levene_test_result)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 5.3948 0.02422 *
## 51
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#the p value is less than 0.05, so we reject our null hypothesis in favor of the alternative. Our data do not have equal variance, so we may not have all of the assumptions met to even run a t-test. In these examples we will proceed because the violation is due to the small sample sizes.
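For the jitter/stripchart check from item 3 above, and for the t-test itself, here is a rough sketch using the WildData frame we just built. Note that R's t.test() applies Welch's correction by default (var.equal = FALSE), so it does not assume equal variance:
stripchart(Value ~ Group, data = WildData, method = "jitter",
           vertical = TRUE, pch = 19, col = c("pink", "steelblue")) #eyeball whether the two groups have similar spread
t.test(Value ~ Group, data = WildData) #Welch two-sample t-test; set var.equal = TRUE only if you are comfortable assuming equal variance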
Working with island.csv
islandData <- read.csv("island.csv")
Try this on your own and I will upload the new key next week.
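While you wait for the key, a sensible first step with any new data frame is simply to look at it. A generic sketch (I am not assuming anything about which columns island.csv contains):
str(islandData)     #column names and types
head(islandData)    #first few rows
summary(islandData) #quick summaries of each column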