Biostats Outline Part 2: Andrew Hand Directory

Chapter 10: Z-tests and the Normal Distribution

One sample t-Test Example:

By hand and in R, testing a dataset against a mu that is NOT equal to 0.

crabs <- c(25.8,24.6,26.1,22.9,25.1,27.3,24,24.5,23.9,26.2,24.3,24.6,23.3,25.5,28.1,24.8,23.5,26.3,25.4,25.5,23.9,27,24.8,22.9,25.4)
crabs
##  [1] 25.8 24.6 26.1 22.9 25.1 27.3 24.0 24.5 23.9 26.2 24.3 24.6 23.3 25.5 28.1
## [16] 24.8 23.5 26.3 25.4 25.5 23.9 27.0 24.8 22.9 25.4
air_temp <- 24.3

By hand:

# calculating means:  

# i'm assuming i can use these kinds of functions for my calculations "by hand"
mean_crabs <- mean(crabs)
mean_crabs
## [1] 25.028
mean_air<- mean(air_temp)
mean_air
## [1] 24.3
sd_crabs <- sd(crabs)
n_crabs <- 25
se_crabs <- sd_crabs/sqrt(n_crabs)
se_crabs
## [1] 0.2683605
t_crabs <- (mean_crabs - mean_air) / se_crabs 
t_crabs # t-statistic
## [1] 2.712769
df_c <- 25-1

# P-value
p_value_crabs <- 2* pt(-abs(t_crabs),df_c)
p_value_crabs
## [1] 0.01214554

Body temperature of crabs is significantly different from the air temperature (P < 0.05): mean crab body temp = 25.028 °C, air temp = 24.3 °C, SE = 0.27. This suggests crabs have a significantly higher mean body temperature than the environmental temperature.

Using R:

head(crabs)
## [1] 25.8 24.6 26.1 22.9 25.1 27.3
t.test(crabs, mu = 24.3)        # mean air temp is 24.3
## 
##  One Sample t-test
## 
## data:  crabs
## t = 2.7128, df = 24, p-value = 0.01215
## alternative hypothesis: true mean is not equal to 24.3
## 95 percent confidence interval:
##  24.47413 25.58187
## sample estimates:
## mean of x 
##    25.028

Body temperature of crabs is still significantly different from environmental temperature (P<0.05)

Z and Ybar questions:

When n is NOT given and you want Pr[Y > value] or Pr[Y < value]:

Pr[Y > value]: pnorm(q = value, mean, sd, lower.tail = F)

Pr[Y < value]: pnorm(q = value, mean, sd, lower.tail = T)

When n is given and you are working with ybar (the sample mean):

Pr[ybar >= value]: pnorm(q = z, lower.tail = F)

Pr[ybar <= value]: pnorm(q = z, lower.tail = T)

se = sd/sqrt(n); Z = (Y - u)/se, where Y = the observed value (use ybar, the sample mean, when working with means) and u = the population mean.

Z tells us how many standard deviations a value is from the mean.
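
A minimal sketch of both cases (the numbers here are made up, not from a problem):

# no n given: population with mean 10 and sd 2, what is Pr[Y > 13]?
u_ex <- 10
sd_ex <- 2
z_ex <- (13 - u_ex) / sd_ex                              # z-score for the value 13
pnorm(q = z_ex, lower.tail = F)                          # Pr[Y > 13]
pnorm(q = 13, mean = u_ex, sd = sd_ex, lower.tail = F)   # same answer without computing z

# n given: same population, n = 25, what is Pr[ybar > 10.5]?
se_ex <- sd_ex / sqrt(25)
z_ybar <- (10.5 - u_ex) / se_ex
pnorm(q = z_ybar, lower.tail = F)                        # Pr[ybar > 10.5]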

Central Limit Theorem:

The distribution of the mean of a large number of random samples, even from a non-normally distributed population, is approximately normal.
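
A quick simulation sketch of this idea (all values made up): draw many samples from a right-skewed (exponential) distribution and look at the distribution of their means.

set.seed(1)                                           # hypothetical seed, for reproducibility
raw_values <- rexp(n = 1000, rate = 1)                # a clearly non-normal (skewed) population
sample_means <- replicate(1000, mean(rexp(n = 30, rate = 1)))
hist(raw_values)                                      # skewed
hist(sample_means)                                    # approximately normal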

If we want to find the value of Y at a given percentile, for example "what is Y at the 35th percentile?", we use qnorm(p = 0.35, mean = u, sd = sd).
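
For example, a minimal sketch with made-up numbers (mean 10, sd 2):

qnorm(p = 0.35, mean = 10, sd = 2)                                 # the value of Y at the 35th percentile
pnorm(q = qnorm(p = 0.35, mean = 10, sd = 2), mean = 10, sd = 2)   # check: returns 0.35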

Normal distributions are continuous, symmetrical about the mean, have a single mode, and their probability density is highest at the mean.

Chapter 10 Problems

10.13

The following table lists the means and standard deviations of several different normal distributions. For each, a sample of 10 individuals was taken, as well as a sample of 30 individuals. For each sample, calculate the probability that the mean of the sample was greater than the given value. When working with a known n, use z = (ybar - u)/se, where se = sd/sqrt(n), ybar = the value of interest, and u = the mean; then use pnorm() to get Pr[ybar > value] or Pr[ybar < value].

d2 <- data.frame(mean = c(14,15,-23,72), 
                 sd = c(5,3,4,50), 
                 value = c(15,15.5,-22,45) 
                 ) 
                
# check d2

                              # for n=10
# compute se 
se_n10 <- d2$sd / sqrt(10) 
#se_n10

# compute z in terms of se 
z_n10 <- (d2$value - d2$mean) / se_n10
#z_n10

d2$pr_ybar_greater_value_n10 <- pnorm(q=z_n10, lower.tail=F)

d2$pr_ybar_less_value_n10 <- pnorm(q=z_n10, lower.tail = T)

                               # for n=30

se_n30 <- d2$sd / sqrt(30)
#se_n30

z_n30 <- (d2$value - d2$mean) / se_n30
#z_n30

d2$pr_ybar_greater_value_n30 <- pnorm(q=z_n30, lower.tail=F)

d2$pr_ybar_less_value_n30 <- pnorm(q=z_n30, lower.tail = T)

d2

checked and correct^

10.22

The proportion of traffic fatalities resulting from drivers with high blood alcohol levels in 1982 was approximately normally distributed among U.S. states, with mean 0.569 and standard deviation 0.068 (U.S. Department of Transportation Traffic Safety Facts 1999).

u_drive <- 0.569 
sd_drive <- 0.068

Notice that no n is given! This means we are using z = (y - u)/sd for this problem.

  1. What proportion of states would you expect to have more than 65% of their traffic fatalities from drunk driving?
# so when is y > 0.65

z_drive <- (0.65 - u_drive) / sd_drive
z_drive       # find the z-statistic
## [1] 1.191176
pnorm(q=z_drive, lower.tail = F)          # use z in pnorm() with lower.tail = F to get the upper-tail probability
## [1] 0.1167922

So, about 11.7% of states are expected to have more than 65% of their traffic deaths due to drunk driving

  2. What proportion of deaths due to drunk driving would you expect to be at the 25th percentile of this distribution?
# we want to know: when x = 25%, what is the value of y? so we will use qnorm()

qnorm(p=0.25, mean = u_drive, sd = sd_drive)
## [1] 0.5231347
# which is the same thing as:
qnorm(p=0.75, mean=u_drive, sd=sd_drive, lower.tail = F) # lower tail=F means taking the p value starting from the right. it will give you the same value.
## [1] 0.5231347

The 25th percentile of this distribution is about 52.3%. In other words, a state in which 52.3% of traffic deaths are due to drunk driving sits at the 25th percentile; 25% of states fall at or below that value.

Chapter 11: Normal Distributions and t-Tests

The t-test function in R, t.test(), handles one-sample, two-sample, and paired tests.

general syntax of t.test()

#t.test(x, 
#       y = "second numeric vector", 
#       alternative = "type of test (default is `two.sided`)", 
#       mu = "hypothesized population mean under the null hypothesis, i.e., what you are testing against. default mu = 0", 
#       paired = "logical (TRUE/FALSE), use T for paired", 
#       var.equal = "logical (TRUE/FALSE), use T if the two samples are assumed to have equal variance (pooled t-test)", 
#       conf.level = "confidence level for the test. default is 0.95"
#         )

here is an example of how we use t.test() for 1, 2, and paired t-tests:

one sample t-Test (student’s t-distribution, thank you William Gosset)

H0: true mean = 0

Ha: true mean does not equal 0

head(cars)
t.test(x = cars)
## 
##  One Sample t-test
## 
## data:  cars
## t = 12.625, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  24.60221 33.77779
## sample estimates:
## mean of x 
##     29.19
# That simple. 

The one-sample t-test is analogous to the Z statistic when we have a sample of size n: t = (ybar - u) / SE(ybar), where t = test statistic, ybar = sample mean, u = hypothesized population mean, SE(ybar) = standard error of the mean = sd/sqrt(n), and df = n - 1.

finding critical values

t.test() reports these for you. If you cannot use that function because you only have summary statistics (no raw data), you have to look the critical value up in a statistical table (or with qt()) and calculate by hand.

see problem 11.14

# confidence interval equation: 

"CI_lower_tail" <- "dbar - ta(2),df * se_of_dbar"
"CI_upper_tail" <- "dbar + ta(2),df * se_of_dbar"

data.frame(variable = c("dbar", 
                        "ta(2),df", 
                        "se_of_dbar"), 
           definition = c("mean difference", 
                          "critical value for alpha = 0.05 (two-tailed) and your df: n-1", 
                          "standard error of the difference = sd/sqrt(n)")
           )
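
A minimal sketch of this confidence interval in R, using qt() in place of the table lookup (the numbers below are placeholders, not from a problem):

dbar_ex <- 5                               # hypothetical mean difference
sd_ex   <- 2                               # hypothetical sd of the differences
n_ex    <- 10
se_dbar <- sd_ex / sqrt(n_ex)
crit    <- qt(p = 0.975, df = n_ex - 1)    # two-tailed 0.05 critical value
c(lower = dbar_ex - crit * se_dbar, upper = dbar_ex + crit * se_dbar)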

Welch's two-sample independent t-Test

H0: means of two groups are equal

Ha: means of two groups are not equal

head(cars)
t.test(cars$speed, cars$dist) 
## 
##  Welch Two Sample t-test
## 
## data:  cars$speed and cars$dist
## t = -7.4134, df = 53.119, p-value = 9.628e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -35.04152 -20.11848
## sample estimates:
## mean of x mean of y 
##     15.40     42.98
# Welch's t-test is the default and does not assume equal variances

Paired t-Test

only used when you have before-after or matched data.

H0: mean difference = 0

Ha: mean difference does not equal 0

head(cars)
t.test(cars$speed, cars$dist, paired = T)
## 
##  Paired t-test
## 
## data:  cars$speed and cars$dist
## t = -8.9753, df = 49, p-value = 6.417e-12
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -33.75516 -21.40484
## sample estimates:
## mean difference 
##          -27.58

Practice Problems

Problem 11.14

Astronauts increased in height by an average of approximately 40 mm (about an inch and a half) during the Apollo–Soyuz missions, due to the absence of gravity compressing their spines during their time in space. Does something similar happen here on Earth? An experiment supported by NASA measured the heights of six men immediately before going to bed, and then again after three days of bed rest (Styf et al. 1997). On average, they increased in height by 14 mm, with standard deviation 0.66 mm. Find the 95% confidence interval for the change in height after three days of bed rest.

# this is a before-and-after question, which would normally lead us to a paired t-test. However, we are only given the mean change, sd, and n, so we treat the changes as a single sample (a one-sample problem on the differences). We can't use t.test() either because we don't have the raw data, so we calculate the confidence interval by hand using a statistical table. 

# we know the mean, sd and n. 

mean_h <- 14 
n_h <- 6
sd_h <- 0.66
se_h <- sd_h / sqrt(n_h)
se_h
## [1] 0.2694439
df_h <- n_h - 1
df_h
## [1] 5
crit_val_h <- 2.57 # from table in book

# confidence intervals are the mean +/- margin of error (ME) 

me_h <- crit_val_h * se_h # margin of error

ci_lower_h <- mean_h - me_h
ci_upper_h <- mean_h + me_h

data.frame("confidence interval" = "", Lower = ci_lower_h,
           upper = ci_upper_h)

and there is our 95% confidence interval.

Problem 11.19

Male koalas bellow during the breeding season, but do females pay attention? Charlton et al. (2012) measured responses of estrous female koalas to playbacks of bellows that had been modified on the computer to simulate male callers of different body size. Females were placed one at a time into an enclosure while loudspeakers played bellows simulating a larger male on one side (randomly chosen) and a smaller male on the other side. Male bellows were played repeatedly, alternating between sides, over 10 minutes. Females often turned to look in the direction of a loudspeaker (invisible to her) during a trial. The following data measure the preference of each of 15 females for the simulated sound of the “larger” male. Preference was calculated as the number of looks toward the larger-male side minus the number of looks to the smaller-male side. Preference is positive if a female looked most often toward the larger male, and it is negative if she looked most often in the direction of the smaller male.

Chapter 12: Two sample independent and Paired t-Tests

Welch's two-sample independent t-Test

H0: means of two groups are equal

Ha: means of two groups are not equal

each sampling unit receives only one treatment; the two groups are independent of each other

head(cars)
t.test(cars$speed, cars$dist) 
## 
##  Welch Two Sample t-test
## 
## data:  cars$speed and cars$dist
## t = -7.4134, df = 53.119, p-value = 9.628e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -35.04152 -20.11848
## sample estimates:
## mean of x mean of y 
##     15.40     42.98
# Welch's t-test is the default and does not assume equal variances

Paired t-Test

both treatments applied to every sample

minimizes extraneous impact (environment, genetics, etc)

only used when you have before-after or matched data.

H0: mean difference = 0

Ha: mean difference does not equal 0

head(cars)
t.test(cars$speed, cars$dist, paired = T)
## 
##  Paired t-test
## 
## data:  cars$speed and cars$dist
## t = -8.9753, df = 49, p-value = 6.417e-12
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -33.75516 -21.40484
## sample estimates:
## mean difference 
##          -27.58

Chapter 12 Problems

12.2 Calculation practice: two-sample t-test.

When your authors were growing up, it was thought that humans dreamed in black and white rather than color. A recent hypothesis is that North Americans did in fact dream in black and white in the era of black-and-white television and movies, but that we shifted back to dreaming in color after the introduction of color media. To test this hypothesis, Murzyn (2008) queried 30 older individuals who had grown up in the black-and-white era and 30 younger individuals who grew up with color media about their dreams. She recorded the percentage of color dreams for each individual. A mean of 68.4% of the younger peoples’ dreams were in color (with a standard deviation of 31.8%). On average, 33.9% of the older individuals’ dreams were in color, with a standard deviation of 36.9%. The scores were approximately normally distributed in each group. Is the difference between the two means statistically significant? We’ll use a two-sample t-test for this comparison.

d12.2 <- data.frame(sample = c("young", "old"), 
                 ybar = c(68.4, 33.9),
                 sd = c(31.8, 36.9), 
                 n = c(30, 30))
d12.2
  a. State the null and alternative hypotheses for this test.
# H0: There is no difference between the old and young groups in the mean percentage of dreams in color.
# Ha: There is a difference between the old and young groups in the mean percentage of dreams in color. 
  b. What are the assumptions of a two-sample t-test? Do these data match these assumptions well enough?
# the assumptions of a two-sample t-test are that the samples are random, independent of each other, the variances are similar (unless using Welch's t-test), and both samples are approximately normally distributed.

# the data meet these criteria well enough. 
  c. What are the sample variances (s^2) for each group? variance (s^2) = sd^2
sd_young <- 31.8
sd_old <- 36.9

var_young <- sd_young^2
var_old <- sd_old^2

var_young
## [1] 1011.24
var_old
## [1] 1361.61
  d. How many degrees of freedom are associated with each group?
n_young <- 30
n_old <- 30
df_young <- 30-1 
df_old <- 30-1

df_young
## [1] 29
df_old
## [1] 29
  e. Calculate the pooled variance for these data. \[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \] s^2 = variance
pooled_var <- ((df_young * var_young) + (df_old * var_old)) / (n_young + n_old - 2)
pooled_var
## [1] 1186.425
  f. What is the standard error of the difference between means? \[ \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \] s_p^2 = pooled variance
se_12.2 <- sqrt(pooled_var * ((1/n_young) + (1/n_old)))
se_12.2
## [1] 8.893537

  g. Calculate t. \[ t = \frac{\bar{Y}_1 - \bar{Y}_2}{SE} \] ybar = the group mean

ybar1 <- 68.4
ybar2 <- 33.9
t_12.2 <- (ybar1 - ybar2) / se_12.2
t_12.2
## [1] 3.879222
  h. How many degrees of freedom does this t-statistic have?
n_young + n_old - 2 
## [1] 58
  i. What is the critical value of t with alpha = 0.05 for this test?
# the statistical table shows 2.0
  j. What is the most precise P-value that you can determine for this test?
p12.2 <- 2 * (1 - pt(abs(t_12.2), df = 58))
p12.2
  k. What can you conclude about the difference between the older and younger people in how often they dream in color? Interpret the conclusions of the test in terms of the original scientific question.
# The P-value is well below 0.05, which shows a significant difference between the mean percentages of color dreams in the young and old groups. 

# Additionally, the t-statistic (about 3.9) is greater than the critical value (2.0). This suggests that the young group dreams in color more often, on average, than the old group. 

Chapter 13: Handling Violations of Assumptions

Detecting deviations from normality

  • we can detect if something differs from the normal distribution graphically, by plotting the data (e.g., a histogram) and looking for skew or other departures from a normal shape.

  • or, we can do this by formal calculation using the Shapiro-Wilk test. It estimates the mean and sd from the data, then tests the goodness of fit to the normal distribution. The null hypothesis is that the data do not differ from a normal distribution (see the sketch after this list).

  • We can ignore violations if they do not significantly change the results.
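
A minimal sketch of the Shapiro-Wilk test mentioned above (the data vector is made up):

y_check <- c(2.1, 2.5, 2.4, 3.0, 2.8, 2.2, 2.7, 2.6, 3.1, 2.3)   # hypothetical sample
shapiro.test(y_check)   # a small P-value suggests the data deviate from a normal distribution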

Transforming Data

  • the best way to handle data that don’t meet the assumptions is to do a transformation.

log transformation

  • most common and robust: take the log of Y \[ Y' = ln[Y] \] then convert back with the exponential \[ Y = e^{(Y')} \] ALL data must be transformed in the same way.

If the data contain zeros, we can use \[ Y' = ln[Y+1] \] A transformation like this can make the data set approximately normally distributed with similar sd between groups without changing the relationships in the data, as long as you apply the same transformation to all values and samples.

arcsin transformation

p' = arcsin[sqrt(p)]

sqrt transformation

Y' = sqrt[Y + 1/2]
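
A minimal sketch of all three transformations on made-up vectors:

Y_raw <- c(0, 1, 3, 10, 30, 100)    # hypothetical right-skewed measurements (note the zero)
p_raw <- c(0.10, 0.25, 0.50, 0.90)  # hypothetical proportions

Y_log    <- log(Y_raw + 1)          # log transformation; the +1 handles zeros
Y_back   <- exp(Y_log) - 1          # back-transform to the original scale
p_arcsin <- asin(sqrt(p_raw))       # arcsine transformation for proportions
Y_sqrt   <- sqrt(Y_raw + 1/2)       # square-root transformation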

Mann-Whitney U Tests

  • compares the distribution of two groups
  • can be used in place of the two-sample t-Test when the normal distribution assumption can NOT be met.

Example Mann Whitney U test

non-parametric test: the Mann-Whitney U test compares the distributions of two groups but does not need as many assumptions as a two-sample t-test.

df_MW <- data.frame(starved = c(1.9,2.1,3.8,9.0,9.6,13,14.7,17.9,21.7,29.0,72.3,NA,NA), 
           fed = c(1.5,1.7,2.4,3.6,5.7,22.6,22.8,39,54.4,72.1,73.6,79.5,88.9))
df_MW
# Ha: the distributions of the two groups differ 
# H0: no difference between the distributions of the two groups

hist(df_MW$starved)

hist(df_MW$fed)

# skewed and log transformation doesn't work, so a t-test will not be accurate. 

MWtest <- wilcox.test(x = df_MW$starved, y = df_MW$fed)
MWtest
## 
##  Wilcoxon rank sum exact test
## 
## data:  df_MW$starved and df_MW$fed
## W = 55, p-value = 0.3607
## alternative hypothesis: true location shift is not equal to 0
# W = test statistic used to get P

P > 0.05, so we fail to reject the null hypothesis: there is no evidence that the two distributions differ.

Chapter 14: Experimental Design

confounding variables

  • masks the true relationships between variables of interest
  • try to introduce controls to manage these

experimental artifacts:

  • a bias in measurements produced by the experimental conditions themselves rather than by the treatment of interest

experimental goals and methods for experimentation:

  • reduce bias

control group:

  • measurements on subjects that do not receive treatment

randomization:

  • random assignment of treatments to subjects

blinding

  • not letting subjects and/or researchers know which treatment each subject received

methods for reducing sampling error:

replication

  • measuring multiple subjects in each treatment

Balance

  • equal sample sizes within each treatment (as close as possible)

blocking

  • grouping experimental units that are similar to each other (pairing, as in paired tests)

Factors:

-treatments/variables

factorial design:

  • interactions between two or more treatment variables
  • an interaction between variables means that the effect of one variable on the response depends on the state of the second variable.

Chapter 15: Analysis of Variance ANOVA

Planned and Unplanned Comparison

A planned comparison is a comparison between means identified as being of crucial interest during the design of the study, identified prior to obtaining the data.

planned comparisons have more power than unplanned comparisons.

With any ANOVA you could technically just run a bunch of pairwise t-tests, but that wouldn't be as good as using ANOVA: running many t-tests inflates the Type I error rate, and each t-test uses only part of the data, whereas ANOVA pools information across all groups.

An unplanned comparison is one of multiple comparisons, such as between all pairs of means, carried out to help determine where differences between means lie. These are tests of comparisons not identified prior to obtaining the data.

Tukey Kramer Test

works like a series of t-tests but with a larger critical value to control the Type I error rate across all the pairwise comparisons.

You only need to perform a Tukey test if you have a significant P-value from your ANOVA. This will allow you to see which comparisons are significant.

TukeyHSD(x = aovdata)

in class workshop with ANOVA

library introduction:

library(ggplot2) # plots 
library(dplyr) # sub-setting data 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr) # read in the data file
titanic <- read.csv("C:/Users/ahand/OneDrive - St. Mary's College of MD/SMCM_FA25/biostats/exam2_prep/titanic.csv", 
                    stringsAsFactors = T) # character names become factors 
head(titanic) # view the top portion of the data
summary(titanic) # summarizes the data into some statistical values
##  passenger_class                        name           age         
##  1st:322         Carlsson,MrFransOlof     :   2   Min.   : 0.1667  
##  2nd:280         Connolly,MissKate        :   2   1st Qu.:21.0000  
##  3rd:711         Kelly,MrJames            :   2   Median :30.0000  
##                  Abbing,MrAnthony         :   1   Mean   :31.1942  
##                  Abbott,MasterEugeneJoseph:   1   3rd Qu.:41.0000  
##                  Abbott,MrRossmoreEdward  :   1   Max.   :71.0000  
##                  (Other)                  :1304   NA's   :680      
##         embarked            home_destination     sex      survive  
##             :493                    :558     female:463   no :864  
##  Cherbourg  :202   NewYork,NY       : 65     male  :850   yes:449  
##  Queenstown : 45   London           : 14                           
##  Southampton:573   Montreal,PQ      : 10                           
##                    Cornwall/Akron,OH:  9                           
##                    Paris,France     :  9                           
##                    (Other)          :648

using ggplot with histograms

ggplot(titanic, aes(x = age)) + 
  geom_histogram() + 
  facet_wrap(~ passenger_class, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).

using dplyr

# dplyr: 
titanic_by_passenger_class <- group_by(titanic, passenger_class) # groups titanic data by passenger class numerically 
head(titanic_by_passenger_class)
s1 <- summarize(titanic_by_passenger_class, 
          group_mean = mean(age, na.rm = T)) # summarizes the means

s2 <- summarize(titanic_by_passenger_class, 
          group_mean = mean(age, na.rm = T), 
          group_sd = sd(age, na.rm = T)) # summarizes means and sd in one table
s2

ANOVA in R

# ANOVA test 
titanic_model <- aov(data = titanic, age ~ passenger_class) # runs anova to test for difference of age inside passenger_class
titanic_model
## Call:
##    aov(formula = age ~ passenger_class, data = titanic)
## 
## Terms:
##                 passenger_class Residuals
## Sum of Squares         26689.69 110763.68
## Deg. of Freedom               2       630
## 
## Residual standard error: 13.25954
## Estimated effects may be unbalanced
## 680 observations deleted due to missingness
summary.aov(titanic_model)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## passenger_class   2  26690   13345    75.9 <2e-16 ***
## Residuals       630 110764     176                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 680 observations deleted due to missingness

post-hoc tests

use after a significant ANOVA to find which pairs of groups differ; the Tukey-Kramer procedure also accommodates unequal group sizes (as in the titanic data)

TukeyHSD(titanic_model) # Tukey test, one of a couple post-hoc tests
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = age ~ passenger_class, data = titanic)
## 
## $passenger_class
##               diff        lwr        upr     p adj
## 2nd-1st -11.367459 -14.345803  -8.389115 0.0000000
## 3rd-1st -15.148115 -18.192710 -12.103521 0.0000000
## 3rd-2nd  -3.780656  -6.871463  -0.689849 0.0116695

The Tukey test gives you the pairwise comparisons between groups.

Results are often labeled using letters: treatments sharing the letter A are not significantly different from each other but are significantly different from treatments labeled B, and likewise within group B.

`p adj` is the adjusted P-value for that pairwise comparison.

we can see here that class 1 is significantly different from both other classes, and classes 3 and 2 are not significantly different from each other at alpha = 0.01 (p adj = 0.012).

residuals

  • residuals are the differences between the observed values and the values predicted by the model (here, the group means); plotting them helps check the ANOVA assumptions.
plot(titanic_model)

non-parametric alternative to ANOVA: Kruskal-Wallis test

kruskal.test(data = titanic, age ~ passenger_class)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  age by passenger_class
## Kruskal-Wallis chi-squared = 116.08, df = 2, p-value < 2.2e-16

the Kruskal-Wallis test is essentially a rank-based (Mann-Whitney-style) comparison extended to more than two groups

used when normal distribution assumption is violated

R Squared (R^2) repeatability

measures the fraction of the total variation explained by differences among groups \[ R^2 = \frac{SS_{groups}}{SS_{total}} \]

SSgroups = sum of squares of group

SStotal = sum of squares total

multiplying the output by 100% gives the percentage of the variation in the data that is explained by that treatment/grouping.

An output of 0.40 means that 40% of the variation in the data is explained by group membership.

variance within groups = meansq residuals

variance between groups = meansq group

repeatability is a related but different quantity.

\[ repeatability = \frac{s_A^2}{s_A^2 + MS_{error}} \]
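
A minimal sketch of this calculation with placeholder mean squares; the formula for s_A^2 used here assumes an equal number of measurements per group, n:

MS_groups <- 12.0                               # hypothetical mean square among groups
MS_error  <- 3.0                                # hypothetical mean square within groups (error)
n_per_group <- 4                                # hypothetical measurements per group

s2_A <- (MS_groups - MS_error) / n_per_group    # estimated variance among groups
repeatability <- s2_A / (s2_A + MS_error)
repeatability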

Creating a data frame using rep and length
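
A minimal sketch with made-up values, showing how rep() and length() can build a long-format data frame ready for aov():

measurements <- c(2.3, 2.7, 2.5, 3.1, 3.4, 3.0, 4.2, 4.0, 4.4)          # hypothetical data
groups <- rep(c("A", "B", "C"), each = length(measurements) / 3)        # one label per value
d_rep <- data.frame(group = groups, y = measurements)
d_rep
# aov(y ~ group, data = d_rep) would then fit the one-way ANOVA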

variance within groups is the mean squares of residuals

variance among groups is the mean squares of the group

repeatability is estimated in a similar way to R^2, using the variance components above

ss_group15.20 <- 0.01579
ss_residuals15.20 <- 0.00415
R_squared15.20 <- (ss_group15.20) / (ss_group15.20 + ss_residuals15.20)
R_squared15.20
## [1] 0.7918756

So about 79% of the variation is among groups, and the remaining ~21% is within-group variation (measurement error).