Probability and inference

Probabilities

For a normal random variable X with mean 5 and standard deviation 2, find the probability that X is less than 3. Find the probability that X is greater than 4.5.

pnorm(3.0, 5,2)

## [1] 0.1586553

pnorm(4.5, 5,2, lower.tail=FALSE)

## [1] 0.5987063

# or
1-pnorm(4.5,5,2)

## [1] 0.5987063
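These answers can be cross-checked by standardising X first, since P(X < x) equals the standard normal CDF evaluated at (x - mean) / sd. A minimal sketch; the variable names `mu`, `sigma`, `p_less` and `p_greater` are illustrative only:

```r
mu <- 5
sigma <- 2

# P(X < 3): standardise, then use the standard normal CDF
p_less <- pnorm((3 - mu) / sigma)

# P(X > 4.5): upper tail of the standardised value
p_greater <- pnorm((4.5 - mu) / sigma, lower.tail = FALSE)

# Both agree with the direct calls that pass the mean and sd
all.equal(p_less, pnorm(3, mu, sigma))
all.equal(p_greater, pnorm(4.5, mu, sigma, lower.tail = FALSE))
```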

Find the value K so that P(X > K) = 0.05.

qnorm(0.95, 5, 2)

## [1] 8.289707

# or
qnorm(0.05, 5, 2, lower.tail=FALSE)

## [1] 8.289707
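Because qnorm is the inverse of pnorm, the result can be checked by feeding K back into pnorm. A short sketch (the name `K` follows the question; `K_closed` is illustrative):

```r
K <- qnorm(0.05, mean = 5, sd = 2, lower.tail = FALSE)

# Feeding K back into pnorm should recover the upper-tail probability 0.05
pnorm(K, mean = 5, sd = 2, lower.tail = FALSE)

# Equivalent closed form: mean + z * sd, with z the 95th percentile of N(0, 1)
K_closed <- 5 + qnorm(0.95) * 2
all.equal(K, K_closed)
```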

When tossing a fair coin 10 times, find the probability of seeing no heads. Find the probability of seeing exactly 5 heads. Find the probability of seeing more than 7 heads.

dbinom(x = 0, size = 10, prob = 0.5)

## [1] 0.0009765625

dbinom(x = 5, size = 10, prob = 0.5)

## [1] 0.2460938

1-pbinom(q=7, size=10, prob=0.5)

## [1] 0.0546875
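"More than 7 heads" means 8, 9 or 10 heads, so the tail probability can also be obtained by summing the point probabilities or by counting outcomes directly. A small sketch of three equivalent routes (variable names are illustrative):

```r
# P(X > 7) = P(X = 8) + P(X = 9) + P(X = 10)
p_sum <- sum(dbinom(x = 8:10, size = 10, prob = 0.5))

# Same upper tail without the explicit subtraction from 1
p_tail <- pbinom(q = 7, size = 10, prob = 0.5, lower.tail = FALSE)

# Direct counting: favourable outcomes over all 2^10 sequences
p_count <- (choose(10, 8) + choose(10, 9) + choose(10, 10)) / 2^10

c(p_sum, p_tail, p_count)
```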

Univariate distributions

Simulate a sample of 100 random data points from a normal distribution with mean 100 and standard deviation 5, and store the result in a vector.

Plot a histogram and a boxplot of the vector you just created.

Calculate the sample mean and standard deviation.

Calculate the median and interquartile range.

Using the data above, test the hypothesis that the mean equals 100 (using t.test).

Test the hypothesis that mean equals 90.

Repeat the above two tests using a Wilcoxon signed rank test. Compare the p-values with those from the t-tests you just did.

x <- rnorm(n=100, mean=100, sd=5)

par(mfrow=c(1,2))
hist(x)
boxplot(x)

mean(x)

## [1] 100.334

sd(x)

## [1] 4.981055

median(x)

## [1] 100.3069

IQR(x)

## [1] 7.454873
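The interquartile range is the distance between the 25th and 75th percentiles, which can be confirmed with quantile. A minimal sketch on a separate seeded sample (`x_demo` and the seed are assumptions made only for reproducibility, so the numbers will differ from the unseeded output above):

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x_demo <- rnorm(n = 100, mean = 100, sd = 5)

# Quartiles of the sample
q <- quantile(x_demo, probs = c(0.25, 0.75))

# IQR() is the difference between the upper and lower quartile
iqr_manual <- unname(diff(q))
all.equal(iqr_manual, IQR(x_demo))
```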

t.test(x, mu=100)

## 
##  One Sample t-test
## 
## data:  x
## t = 0.67044, df = 99, p-value = 0.5041
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
##   99.3456 101.3223
## sample estimates:
## mean of x 
##   100.334

t.test(x, mu=90)

## 
##  One Sample t-test
## 
## data:  x
## t = 20.747, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 90
## 95 percent confidence interval:
##   99.3456 101.3223
## sample estimates:
## mean of x 
##   100.334
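The t statistic is the difference between the sample mean and the hypothesised mean, divided by the standard error sd / sqrt(n). Plugging in the rounded values reported above for mu = 90 gives (100.334 - 90) / (4.981055 / 10), which is approximately 20.75 and agrees with the reported t = 20.747. The sketch below reproduces the calculation on a separate seeded sample (`x_demo`, `mu0` and the seed are illustrative assumptions):

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x_demo <- rnorm(n = 100, mean = 100, sd = 5)

mu0 <- 100
n <- length(x_demo)

# t = (sample mean - hypothesised mean) / standard error
t_stat <- (mean(x_demo) - mu0) / (sd(x_demo) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Compare with the built-in test
res <- t.test(x_demo, mu = mu0)
all.equal(t_stat, unname(res$statistic))
all.equal(p_val, res$p.value)
```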

wilcox.test(x, mu=100)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  x
## V = 2684, p-value = 0.5858
## alternative hypothesis: true location is not equal to 100

wilcox.test(x, mu=90)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  x
## V = 5045, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 90
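Comparing the output above, the two tests lead to the same conclusions: for a mean of 100 neither rejects (p = 0.5041 for the t-test, p = 0.5858 for the Wilcoxon test), and for a mean of 90 both give p < 2.2e-16. To compare p-values side by side without reading the printed output, they can be extracted from the returned objects. A sketch on a separate seeded sample (`x_demo` and the seed are assumptions for reproducibility):

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x_demo <- rnorm(n = 100, mean = 100, sd = 5)

# Both functions return a list of class "htest" with a p.value element
p_values <- c(
  t_test   = t.test(x_demo, mu = 100)$p.value,
  wilcoxon = wilcox.test(x_demo, mu = 100)$p.value
)
p_values
```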

Use the t.test function to compare PupalWeight by T_treatment.

Repeat the above using a Wilcoxon rank sum test.

pupae <- read.csv("pupae.csv")

t.test(PupalWeight~T_treatment, data=pupae, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  PupalWeight by T_treatment
## t = 1.4385, df = 82, p-value = 0.1541
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.007715698  0.048012420
## sample estimates:
##  mean in group ambient mean in group elevated 
##              0.3222973              0.3021489

wilcox.test(PupalWeight~T_treatment, data=pupae)

## Warning in wilcox.test.default(x = c(0.244, 0.319, 0.221, 0.28, 0.257,
## 0.333, : cannot compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  PupalWeight by T_treatment
## W = 1017.5, p-value = 0.1838
## alternative hypothesis: true location shift is not equal to 0

Run the following code to generate some data:

base <- rnorm(20, 20, 5)
x <- base + rnorm(20,0,0.5)
y <- base + rnorm(20,1,0.5)

Using a two-sample t-test, compare the means of x and y. Assume that the variance is equal for the two samples.

t.test(x,y, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  x and y
## t = -0.44588, df = 38, p-value = 0.6582
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.491474  2.870057
## sample estimates:
## mean of x mean of y 
##  20.99835  21.80906

Repeat the above using a paired t-test. How has the p-value changed?

t.test(x,y, paired=TRUE)

## 
##  Paired t-test
## 
## data:  x and y
## t = -6.6634, df = 19, p-value = 2.26e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0653594 -0.5560568
## sample estimates:
## mean of the differences 
##              -0.8107081

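Which test is most appropriate? The paired t-test, because x and y are not independent: both are built from the same base values, so each element of x is naturally paired with the corresponding element of y. The two-sample test ignores this pairing, and the large variation in base (standard deviation 5) swamps the small shift between x and y. Accounting for the pairing, the p-value drops from 0.6582 to 2.26e-06.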
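A paired t-test is equivalent to a one-sample t-test on the element-wise differences, which makes clear why the shared component cancels out. A minimal sketch using the same data-generating recipe with a fixed seed (the `_demo` names and the seed are assumptions, so the numbers differ from the output above):

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
base_demo <- rnorm(20, 20, 5)
x_demo <- base_demo + rnorm(20, 0, 0.5)
y_demo <- base_demo + rnorm(20, 1, 0.5)

# Paired test versus one-sample test on the differences
paired <- t.test(x_demo, y_demo, paired = TRUE)
one_sample <- t.test(x_demo - y_demo, mu = 0)

all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)

# The shared base makes the two samples strongly correlated
cor(x_demo, y_demo)
```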

Simple linear regression

Perform a simple linear regression of Frass on PupalWeight. Produce and inspect the following:

Plots of the data

plot(Frass ~ PupalWeight, data = pupae)

Summary of the model

model <- lm(Frass ~ PupalWeight, data = pupae)
summary(model)

## 
## Call:
## lm(formula = Frass ~ PupalWeight, data = pupae)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77463 -0.21560 -0.01064  0.26259  0.89392 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.5046     0.1838   2.745  0.00746 ** 
## PupalWeight   4.2994     0.5773   7.448  9.1e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3332 on 81 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4065, Adjusted R-squared:  0.3991 
## F-statistic: 55.47 on 1 and 81 DF,  p-value: 9.1e-11
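In simple linear regression the slope is cov(x, y) / var(x), the intercept follows from the two means, and R-squared is the squared correlation. The sketch below checks this against lm on simulated data; the data frame `demo`, its values and the coefficients used to generate it are made up for illustration and are not the pupae data:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data that loosely mimics the two columns used above
demo <- data.frame(PupalWeight = runif(50, 0.2, 0.45))
demo$Frass <- 0.5 + 4.3 * demo$PupalWeight + rnorm(50, sd = 0.3)

fit <- lm(Frass ~ PupalWeight, data = demo)

# Least-squares estimates computed by hand
slope <- cov(demo$PupalWeight, demo$Frass) / var(demo$PupalWeight)
intercept <- mean(demo$Frass) - slope * mean(demo$PupalWeight)
all.equal(unname(coef(fit)), c(intercept, slope))

# R-squared for a single predictor is the squared correlation
r2 <- cor(demo$PupalWeight, demo$Frass)^2
all.equal(summary(fit)$r.squared, r2)
```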

Diagnostic plots.

# plot() on an lm object draws four diagnostic plots; a 2x2 grid shows them on one page
par(mfrow=c(2,2))
plot(model)

All of the above for a subset of the data, where Gender is 0, and CO2_treatment is 400.

plot(Frass ~ PupalWeight, data = pupae, subset=Gender==0 & CO2_treatment == 400)

model <- lm(Frass ~ PupalWeight, data = pupae, subset=Gender==0 & CO2_treatment == 400)
summary(model)

## 
## Call:
## lm(formula = Frass ~ PupalWeight, data = pupae, subset = Gender == 
##     0 & CO2_treatment == 400)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26720 -0.08526 -0.01585  0.13171  0.28181 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.6751     0.1845   3.660  0.00156 ** 
## PupalWeight   4.1189     0.6430   6.405 3.01e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1657 on 20 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.6723, Adjusted R-squared:  0.6559 
## F-statistic: 41.03 on 1 and 20 DF,  p-value: 3.006e-06

# plot() on an lm object draws four diagnostic plots; a 2x2 grid shows them on one page
par(mfrow=c(2,2))
plot(model)
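The subset argument of lm filters rows before fitting, so it gives the same fit as filtering the data frame first with subset(). A sketch on simulated data; the data frame `demo` and all of its values, including the CO2_treatment levels, are made up for illustration and are not the pupae data:

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data with the same column names as used above
demo <- data.frame(
  Gender = sample(c(0, 1), 80, replace = TRUE),
  CO2_treatment = sample(c(280, 400), 80, replace = TRUE),
  PupalWeight = runif(80, 0.2, 0.45)
)
demo$Frass <- 0.5 + 4 * demo$PupalWeight + rnorm(80, sd = 0.2)

# Option 1: filter inside lm() via the subset argument
fit_arg <- lm(Frass ~ PupalWeight, data = demo,
              subset = Gender == 0 & CO2_treatment == 400)

# Option 2: filter the data frame first, then fit
demo_sub <- subset(demo, Gender == 0 & CO2_treatment == 400)
fit_df <- lm(Frass ~ PupalWeight, data = demo_sub)

all.equal(coef(fit_arg), coef(fit_df))
```

Filtering first can be convenient when the same subset feeds several plots and models.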

Kushan De Silva

August 16, 2017