By hand and through R: testing a dataset against a mu that is NOT equal to 0.
crabs <- c(25.8,24.6,26.1,22.9,25.1,27.3,24,24.5,23.9,26.2,24.3,24.6,23.3,25.5,28.1,24.8,23.5,26.3,25.4,25.5,23.9,27,24.8,22.9,25.4)
crabs
## [1] 25.8 24.6 26.1 22.9 25.1 27.3 24.0 24.5 23.9 26.2 24.3 24.6 23.3 25.5 28.1
## [16] 24.8 23.5 26.3 25.4 25.5 23.9 27.0 24.8 22.9 25.4
air_temp <- 24.3
By hand:
# calculating means:
# I'm assuming I can use these kinds of functions for my calculations "by hand"
mean_crabs <- mean(crabs)
mean_crabs
## [1] 25.028
mean_air <- mean(air_temp)
mean_air
## [1] 24.3
sd_crabs <- sd(crabs)
n_crabs <- 25
se_crabs <- sd_crabs/sqrt(n_crabs)
se_crabs
## [1] 0.2683605
t_crabs <- (mean_crabs - mean_air) / se_crabs
t_crabs # t-statistic
## [1] 2.712769
df_c <- 25-1
# P-value
p_value_crabs <- 2* pt(-abs(t_crabs),df_c)
p_value_crabs
## [1] 0.01214554
Body temperature of crabs is significantly different from the air temperature (P < 0.05). Mean crab body temp = 25.028 °C, air temp = 24.3 °C, SE = 0.27. This suggests crabs have a significantly higher mean body temperature than the environmental temperature.
Using R:
head(crabs)
## [1] 25.8 24.6 26.1 22.9 25.1 27.3
t.test(crabs, mu = 24.3) # mean air temp is 24.3
##
## One Sample t-test
##
## data: crabs
## t = 2.7128, df = 24, p-value = 0.01215
## alternative hypothesis: true mean is not equal to 24.3
## 95 percent confidence interval:
## 24.47413 25.58187
## sample estimates:
## mean of x
## 25.028
Body temperature of crabs is still significantly different from the environmental temperature (P < 0.05).
When n is NOT given and you want to find when Y > or < a value:
Pr[Y > value]: pnorm(q = value, mean, sd, lower.tail = F)
Pr[Y < value]: pnorm(q = value, mean, sd, lower.tail = T)
When working with a sample mean, compute z first:
Pr[ybar >= value]: pnorm(q = z, lower.tail = F)
Pr[ybar <= value]: pnorm(q = z, lower.tail = T)
se = sd / sqrt(n); Z = (Y - u) / se; Y = observed value; ybar = sample mean; u = population mean
Z tells us how many standard deviations a value is from the mean.
Central Limit Theorem:
The distribution of the means (or sums) of large random samples is approximately normal, even when the population being sampled is not normally distributed.
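A quick way to see this in R (a sketch; the exponential distribution and sample size of 30 are arbitrary choices):
means <- replicate(1000, mean(rexp(n = 30, rate = 1))) # 1000 sample means from a skewed distribution
hist(means) # roughly bell-shaped and centered near 1, despite the skewed source distribution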
If we want to find the value at a given percentile, for example "what is Y at the 35th
percentile?", we would use qnorm(p = 0.35, mean = u, sd = sd)
Normal distributions are continuous, symmetrical about the mean, and unimodal; the probability density is highest at the mean.
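A minimal sketch putting pnorm() and qnorm() together (the mean of 10 and sd of 2 are made-up numbers):
pnorm(q = 13, mean = 10, sd = 2, lower.tail = F) # Pr[Y > 13]
pnorm(q = 13, mean = 10, sd = 2, lower.tail = T) # Pr[Y < 13]
qnorm(p = 0.35, mean = 10, sd = 2) # the value of Y at the 35th percentile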
The following table lists the means and standard deviations of several different normal distributions. For each, a sample of 10 individuals was taken, as well as a sample of 30 individuals. For each sample, calculate the probability that the mean of the sample was greater than the given value. When working with a sample of known size n, use z = (ybar - u) / se, with se = sd / sqrt(n), ybar = the value in question, u = mean; then pnorm() to get Pr[ybar > value] or Pr[ybar < value].
d2 <- data.frame(mean = c(14,15,-23,72),
sd = c(5,3,4,50),
value = c(15,15.5,-22,45)
)
# check d2
# for n=10
# compute se
se_n10 <- d2$sd / sqrt(10)
#se_n10
# compute z in terms of se
z_n10 <- (d2$value - d2$mean) / se_n10
#z_n10
d2$pr_ybar_greater_value_n10 <- pnorm(q=z_n10, lower.tail=F)
d2$pr_ybar_less_value_n10 <- pnorm(q=z_n10, lower.tail = T)
# for n=30
se_n30 <- d2$sd / sqrt(30)
#se_n30
z_n30 <- (d2$value - d2$mean) / se_n30 # divide by the n = 30 standard error, not the sd
#z_n30
d2$pr_ybar_greater_value_n30 <- pnorm(q=z_n30, lower.tail=F)
d2$pr_ybar_less_value_n30 <- pnorm(q=z_n30, lower.tail = T)
d2
checked and correct^
The proportion of traffic fatalities resulting from drivers with high blood alcohol levels in 1982 was approximately normally distributed among U.S. states, with mean 0.569 and standard deviation 0.068 (U.S. Department of Transportation Traffic Safety Facts 1999).
u_drive <- 0.569
sd_drive <- 0.068
Notice that no n is given! This means we are using z = (y - u) / sd for this problem.
# so when is y > 0.65
z_drive <- (0.65 - u_drive) / sd_drive
z_drive # find the z-statistic
## [1] 1.191176
pnorm(q=z_drive, lower.tail = F) # use z in pnorm() to find the upper-tail probability, Pr[Y > 0.65]
## [1] 0.1167922
So, about 11.7% of states are expected to have more than 65% of their traffic deaths due to drunk driving.
# we want to know: at the 25th percentile (p = 0.25), what is the value of y? so we will use qnorm()
qnorm(p=0.25, mean = u_drive, sd = sd_drive)
## [1] 0.5231347
# which is the same thing as:
qnorm(p=0.75, mean=u_drive, sd=sd_drive, lower.tail = F) # lower.tail = F means taking the probability starting from the right; it gives the same value.
## [1] 0.5231347
The proportion of deaths due to drunk driving is 52.3% or less in 25% of states. In other words, a state in which 52.3% of driving deaths are due to drunk driving is at the 25th percentile.
Here is the general syntax of t.test():
#t.test(x,
#       y = second numeric vector (for a two-sample test),
#       alternative = type of test (default "two.sided"),
#       mu = hypothesized population mean under the null hypothesis; in other words, what you are testing against (default mu = 0),
#       paired = logical (TRUE/FALSE), use T for a paired test,
#       var.equal = logical (TRUE/FALSE), use T if the two samples are assumed to have equal variance (pooled t-test),
#       conf.level = confidence level for the test (default is 0.95)
#       )
Here is an example of how we use t.test() for one-sample, two-sample, and
paired t-tests:
H0: true mean = 0 Ha: true mean does not equal 0
head(cars)
t.test(x = cars)
##
## One Sample t-test
##
## data: cars
## t = 12.625, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 24.60221 33.77779
## sample estimates:
## mean of x
## 29.19
# That simple.
The one-sample t-test is similar to the z-statistic for when we have a sample of size n: t = (ybar - u) / SE(ybar), where t = test statistic, ybar = sample mean, u = hypothesized mean, SE(ybar) = sd / sqrt(n) (the sd of the sampling distribution), and df = n - 1.
All of this is given to you by t.test(); if you cannot use that
function because you have no raw data, you have to use statistical tables and
calculate it by hand.
see problem 11.14
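A minimal sketch of that by-hand route from summary statistics alone (all the numbers here are made up):
ybar <- 10.2; mu0 <- 10; s <- 0.8; n <- 16 # hypothetical summary statistics
se <- s / sqrt(n)
t_stat <- (ybar - mu0) / se # one-sample t
2 * pt(-abs(t_stat), df = n - 1) # two-tailed p-value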
# confidence interval equation:
\[ CI = \bar{d} \pm t_{\alpha(2),df} \times SE_{\bar{d}} \]
dbar = mean difference
t_alpha(2),df = critical value for alpha = 0.05 (two-tailed) and your df: n - 1
se_of_dbar = standard error of the difference = sd / sqrt(n)
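A minimal sketch of this by hand in R, using qt() for the critical value instead of a book table (dbar, sd_d, and n are made-up numbers):
dbar <- 2; sd_d <- 1.5; n <- 12 # hypothetical summary of paired differences
se_dbar <- sd_d / sqrt(n)
t_crit <- qt(p = 0.975, df = n - 1) # two-tailed critical value for alpha = 0.05
c(lower = dbar - t_crit * se_dbar, upper = dbar + t_crit * se_dbar) # roughly (1.05, 2.95)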
H0: means of two groups are equal Ha: means of two groups are not equal
head(cars)
t.test(cars$speed, cars$dist)
##
## Welch Two Sample t-test
##
## data: cars$speed and cars$dist
## t = -7.4134, df = 53.119, p-value = 9.628e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -35.04152 -20.11848
## sample estimates:
## mean of x mean of y
## 15.40 42.98
# Welch's t-test by default; it does not assume equal variances
Only used when you have before-after or matched data. H0: mean difference = 0; Ha: mean difference does not equal 0.
head(cars)
t.test(cars$speed, cars$dist, paired = T)
##
## Paired t-test
##
## data: cars$speed and cars$dist
## t = -8.9753, df = 49, p-value = 6.417e-12
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -33.75516 -21.40484
## sample estimates:
## mean difference
## -27.58
Astronauts increased in height by an average of approximately 40 mm (about an inch and a half) during the Apollo–Soyuz missions, due to the absence of gravity compressing their spines during their time in space. Does something similar happen here on Earth? An experiment supported by NASA measured the heights of six men immediately before going to bed, and then again after three days of bed rest (Styf et al. 1997). On average, they increased in height by 14 mm, with standard deviation 0.66 mm. Find the 95% confidence interval for the change in height after three days of bed rest.
# This is a before-and-after question, which would lead us to a paired t-test. However, we are only given the mean change, sd, and n, so we treat the differences as a one-sample problem. But we can't run t.test() either, because we don't have the raw data. So we have to calculate the CI by hand, using a statistical table for the critical value.
# we know the mean, sd and n.
mean_h <- 14
n_h <- 6
sd_h <- 0.66
se_h <- sd_h / sqrt(n_h)
se_h
## [1] 0.2694439
df_h <- n_h - 1
df_h
## [1] 5
crit_val_h <- 2.57 # from table in book
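# side check with R rather than the book table: qt(p = 0.975, df = df_h) gives 2.570582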
# confidence intervals are the mean +/- margin of error (ME)
me_h <- crit_val_h * se_h # margin of error
ci_lower_h <- mean_h - me_h
ci_upper_h <- mean_h + me_h
data.frame("confidence interval" = "", Lower = ci_lower_h,
upper = ci_upper_h)
and there is our 95% confidence interval.
Male koalas bellow during the breeding season, but do females pay attention? Charlton et al. (2012) measured responses of estrous female koalas to playbacks of bellows that had been modified on the computer to simulate male callers of different body size. Females were placed one at a time into an enclosure while loudspeakers played bellows simulating a larger male on one side (randomly chosen) and a smaller male on the other side. Male bellows were played repeatedly, alternating between sides, over 10 minutes. Females often turned to look in the direction of a loudspeaker (invisible to her) during a trial. The following data measure the preference of each of 15 females for the simulated sound of the “larger” male. Preference was calculated as the number of looks toward the larger-male side minus the number of looks to the smaller-male side. Preference is positive if a female looked most often toward the larger male, and it is negative if she looked most often in the direction of the smaller male.
H0: means of two groups are equal
Ha: means of two groups are not equal
each treatment group is an independent set of sample units
both treatments applied to every sample
minimizes extraneous impact (environment, genetics, etc)
Only used when you have before-after or matched data.
H0: mean difference = 0
Ha: mean difference does not equal 0
When your authors were growing up, it was thought that humans dreamed in black and white rather than color. A recent hypothesis is that North Americans did in fact dream in black and white in the era of black-and-white television and movies, but that we shifted back to dreaming in color after the introduction of color media. To test this hypothesis, Murzyn (2008) queried 30 older individuals who had grown up in the black-and-white era and 30 younger individuals who grew up with color media about their dreams. She recorded the percentage of color dreams for each individual. A mean of 68.4% of the younger peoples’ dreams were in color (with a standard deviation of 31.8%). On average, 33.9% of the older individuals’ dreams were in color, with a standard deviation of 36.9%. The scores were approximately normally distributed in each group. Is the difference between the two means statistically significant? We’ll use a two-sample t-test for this comparison.
d12.2 <- data.frame(sample = c("young", "old"),
ybar = c(68.4, 33.9), # means from the problem statement above
sd = c(31.8, 36.9),
n = c(30, 30))
d12.2
# H0: There is no difference between the old and young groups in the proportion of dreams in color
# Ha: There is a difference between the old and young groups in the proportion of dreams in color (with the young group dreaming in color more)
# The assumptions of a two-sample t-test are that the samples are random and independent of each other, the variances are similar (unless using Welch's t-test), and both samples are approximately normally distributed.
# The data meet these criteria well enough.
sd_young <- 31.8
sd_old <- 36.9
var_young <- sd_young^2
var_old <- sd_old^2
var_young
## [1] 1011.24
var_old
## [1] 1361.61
n_young <- 30
n_old <- 30
df_young <- 30-1
df_old <- 30-1
df_young
## [1] 29
df_old
## [1] 29
pooled_var <- ((df_young * var_young) + (df_old * var_old)) / (df_young + df_old) # pooled df = n1 + n2 - 2
pooled_var
## [1] 1186.425
se_12.2 <- sqrt(pooled_var * ((1/n_young) + (1/n_old)))
se_12.2
## [1] 8.893537
Calculate t: \[ t = \frac{\bar{Y}_1 - \bar{Y}_2}{SE} \] where ybar1 and ybar2 are the two sample means.
ybar1 <- 68.4
ybar2 <- 33.9
t_12.2 <- (ybar1 - ybar2) / se_12.2
t_12.2
## [1] 3.879222
n_young + n_old - 2
## [1] 58
# for df = 58, the statistical table gives a critical value of about 2.0
p12.2 <- 2 * (1 - pt(abs(t_12.2), df = 58))
p12.2
## [1] 0.0002737
# The p-value is < 0.05 (about 0.0003), which shows a significant difference between the means of the young and old groups.
# Additionally, the t-statistic (3.88) is greater than the critical value (2.0). This suggests that the young group dreams in color more, on average, than the old group.
We can detect whether data depart from normality graphically, by checking whether they look normally distributed or skewed.
Or, we can do this by formal calculation using the Shapiro-Wilk test. It estimates the mean and sd from the data, then tests the goodness of fit to the normal distribution. The null hypothesis is that the data do not differ from a normal distribution.
We can ignore violations if they do not significantly change the results.
If data contain zeros, we can use the log transformation \[ Y' = \ln(Y + 1) \] Doing this can make the data appear normally distributed with similar sds across groups, without changing the relative information in the data, as long as you apply the same transformation to all values and samples.
Other common transformations: the arcsine transformation for proportions, p' = arcsin[sqrt(p)],
and the square-root transformation for counts, Y' = sqrt[Y + 1/2]
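A minimal sketch of transforming and then re-checking normality with shapiro.test() (the data vector is made up):
y <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34) # hypothetical right-skewed counts with a zero
y_log <- log(y + 1) # log transformation, works with zeros
y_sqrt <- sqrt(y + 1/2) # square-root transformation for counts
shapiro.test(y_log) # H0: the transformed data do not differ from a normal distribution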
Non-parametric test: the Mann-Whitney U test compares the distributions of two groups but does not need as many assumptions as a two-sample t-test.
df_MW <- data.frame(starved = c(1.9,2.1,3.8,9.0,9.6,13,14.7,17.9,21.7,29.0,72.3,NA,NA),
fed = c(1.5,1.7,2.4,3.6,5.7,22.6,22.8,39,54.4,72.1,73.6,79.5,88.9))
df_MW
# H0: no difference between the two groups' distributions
# Ha: the two groups' distributions differ
hist(df_MW$starved)
hist(df_MW$fed)
# skewed, and a log transformation doesn't fix it, so a t-test will not be accurate
MWtest <- wilcox.test(x = df_MW$starved, y = df_MW$fed)
MWtest
##
## Wilcoxon rank sum exact test
##
## data: df_MW$starved and df_MW$fed
## W = 55, p-value = 0.3607
## alternative hypothesis: true location shift is not equal to 0
# W = test statistic used to get P
P > 0.05, so we fail to reject the null hypothesis; there is no evidence that the two distributions differ.
ANOVA compares means across multiple treatments/variables.
A planned comparison is a comparison between means identified as being of crucial interest during the design of the study, identified prior to obtaining the data.
planned comparisons have more power than unplanned comparisons.
With any ANOVA you could technically just run a bunch of t-tests, but that wouldn’t be as powerful as using ANOVA, because running many separate t-tests inflates the Type I error rate, while the ANOVA tests all group means at once.
An unplanned comparison is one of multiple comparisons, such as between all pairs of means, carried out to help determine where differences between means lie. It tests comparisons of interest not identified prior to obtaining the data.
Works like a series of t-tests but with a larger critical value to control Type I error.
You only need to perform a Tukey test if you have a significant p-value from your ANOVA. This will allow you to see which comparisons are significant.
TukeyHSD(x = aovdata)
library(ggplot2) # plots
library(dplyr) # sub-setting data
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr) # read in the data file
titanic <- read.csv("C:/Users/ahand/OneDrive - St. Mary's College of MD/SMCM_FA25/biostats/exam2_prep/titanic.csv",
stringsAsFactors = T) # character names become factors
head(titanic) # view the top portion of the data
summary(titanic) # summarizes the data into some statistical values
## passenger_class name age
## 1st:322 Carlsson,MrFransOlof : 2 Min. : 0.1667
## 2nd:280 Connolly,MissKate : 2 1st Qu.:21.0000
## 3rd:711 Kelly,MrJames : 2 Median :30.0000
## Abbing,MrAnthony : 1 Mean :31.1942
## Abbott,MasterEugeneJoseph: 1 3rd Qu.:41.0000
## Abbott,MrRossmoreEdward : 1 Max. :71.0000
## (Other) :1304 NA's :680
## embarked home_destination sex survive
## :493 :558 female:463 no :864
## Cherbourg :202 NewYork,NY : 65 male :850 yes:449
## Queenstown : 45 London : 14
## Southampton:573 Montreal,PQ : 10
## Cornwall/Akron,OH: 9
## Paris,France : 9
## (Other) :648
ggplot(titanic, aes(x = age)) +
geom_histogram() +
facet_wrap(~ passenger_class, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).
# dplyr:
titanic_by_passenger_class <- group_by(titanic, passenger_class) # groups titanic data by passenger class
head(titanic_by_passenger_class)
s1 <- summarize(titanic_by_passenger_class,
group_mean = mean(age, na.rm = T)) # summarizes the means
s2 <- summarize(titanic_by_passenger_class,
group_mean = mean(age, na.rm = T),
group_sd = sd(age, na.rm = T)) # summarizes means and sd in one table
s2
# ANOVA test
titanic_model <- aov(data = titanic, age ~ passenger_class) # runs anova to test for difference of age inside passenger_class
titanic_model
## Call:
## aov(formula = age ~ passenger_class, data = titanic)
##
## Terms:
## passenger_class Residuals
## Sum of Squares 26689.69 110763.68
## Deg. of Freedom 2 630
##
## Residual standard error: 13.25954
## Estimated effects may be unbalanced
## 680 observations deleted due to missingness
summary.aov(titanic_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## passenger_class 2 26690 13345 75.9 <2e-16 ***
## Residuals 630 110764 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 680 observations deleted due to missingness
use when the sample size in each group is equal
TukeyHSD(titanic_model) # Tukey test, one of a couple post-hoc tests
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = age ~ passenger_class, data = titanic)
##
## $passenger_class
## diff lwr upr p adj
## 2nd-1st -11.367459 -14.345803 -8.389115 0.0000000
## 3rd-1st -15.148115 -18.192710 -12.103521 0.0000000
## 3rd-2nd -3.780656 -6.871463 -0.689849 0.0116695
The Tukey test gives you the pairwise comparisons between groups.
Results are usually labeled using letters; treatments in group A are not sig different from each other but are sig different from treatments in group B, and the same goes within group B.
`p adj` is the adjusted p-value for that pair of groups.
We can see here that 1st class is significantly different from both other classes, and that 3rd and 2nd class are not sig different from each other at alpha = 0.01.
plot(titanic_model)
kruskal.test(data = titanic, age ~ passenger_class)
##
## Kruskal-Wallis rank sum test
##
## data: age by passenger_class
## Kruskal-Wallis chi-squared = 116.08, df = 2, p-value < 2.2e-16
The Kruskal-Wallis test is essentially a rank-based version of one-way ANOVA for multiple groups (an extension of the Mann-Whitney U test).
Used when the normal distribution assumption is violated.
Measures the fraction of the total variation explained by groups: \[ R^2 = \frac{SS_{groups}}{SS_{total}} \]
SSgroups = sum of squares among groups
SStotal = total sum of squares
The output x 100% gives the percentage of the variation in the data that is explained by that treatment.
An output of 0.40 means that 40% of the variation in the data is explained by the model.
variance within groups = mean square of the residuals (MS error)
variance between groups = mean square of the groups
Repeatability is different but similar: \[ \text{repeatability} = \frac{s_A^2}{s_A^2 + MS_{error}} \]
where s^2_A is the variance among groups. Below, we approximate it using the group and residual sums of squares, as with R^2.
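A minimal sketch of the repeatability formula itself (all values made up; n_rep is the number of repeat measurements per group):
ms_groups <- 0.006; ms_error <- 0.0002; n_rep <- 3 # hypothetical ANOVA mean squares
s2_A <- (ms_groups - ms_error) / n_rep # variance among groups
s2_A / (s2_A + ms_error) # repeatability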
ss_group15.20 <- 0.01579
ss_residuals15.20 <- 0.00415
R_squared15.20 <- (ss_group15.20) / (ss_group15.20 + ss_residuals15.20)
R_squared15.20
## [1] 0.7918756
About 79% of the variation is among groups; the remaining ~21% (1 - R^2) is within groups, i.e., due to measurement error.