For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )

Here x represents the weights for the entire population.

Using the same process as before (in Null Distribution Exercises), set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. After that, set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages.

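f <- function(n, a){
  # n: number of repetitions; a: size of each random sample
  averages <- vector("numeric", n)
  for(i in 1:n){
    X <- sample(x, a)        # draw a mice at random from the population x
    averages[i] <- mean(X)   # store the average of this sample
  }
  averages
}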
set.seed(1)
averages5 = f(1000, 5)
set.seed(1)
averages50 = f(1000, 50)

Divide the plot area in two so that both histograms appear side by side in the same figure.

par(mfrow = c(1,2))
hist(averages5)
hist(averages50)
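
The narrowing of the second histogram can also be checked numerically. The following is a minimal sketch, assuming a simulated normal population `pop` (hypothetical mean 24 and standard deviation 3.5) as a stand-in for the downloaded `x`; the helper `sample_means` is likewise illustrative and not part of the exercise. It compares the spread of the averages for the two sample sizes.

```r
# Hypothetical stand-in for the population vector x (values are assumptions)
set.seed(1)
pop <- rnorm(10000, mean = 24, sd = 3.5)

# Illustrative helper: draw `reps` random samples of `size` and return their averages
sample_means <- function(pop, size, reps = 1000) {
  replicate(reps, mean(sample(pop, size)))
}

set.seed(1)
means5 <- sample_means(pop, 5)
set.seed(1)
means50 <- sample_means(pop, 50)

# Spread of the averages for each sample size
print(c(sd5 = sd(means5), sd50 = sd(means50)))

# The ratio should be close to sqrt(50 / 5), since the standard error
# of an average scales as 1 / sqrt(sample size)
print(sd(means5) / sd(means50))
```

With the real data, replacing `pop` by `x` should show the same pattern: the averages of 50 mice vary much less than the averages of 5 mice.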

What proportion of the averages from samples of size 50 fall between 23 and 25?

mean(averages50 >= 23 & averages50 <= 25)

Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43. The proportion between 23 and 25 is pnorm(Z2) - pnorm(Z1), where Z1 = (23 - 23.9)/0.43 and Z2 = (25 - 23.9)/0.43.

pnorm((25-23.9)/0.43) - pnorm((23 - 23.9)/0.43)
## [1] 0.9765648
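
As a cross-check on the standardization, the same probability can be obtained by passing the mean and standard deviation to `pnorm` directly. This is a small sketch; the variable names `mu`, `sigma`, `p_standardized`, and `p_direct` are illustrative and not part of the exercise.

```r
# Parameters of the normal approximation quoted in the text
mu <- 23.9
sigma <- 0.43

# Standardize the interval endpoints by hand
z1 <- (23 - mu) / sigma
z2 <- (25 - mu) / sigma
p_standardized <- pnorm(z2) - pnorm(z1)

# Equivalent call that lets pnorm do the standardization
p_direct <- pnorm(25, mean = mu, sd = sigma) - pnorm(23, mean = mu, sd = sigma)

print(c(p_standardized, p_direct))
```

Both values should agree with the result shown above.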
library(downloader) 
library(dplyr)
library(knitr)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"
filename <- basename(url)
download(url, destfile=filename)
dat <- read.csv(filename) 
#We will remove the lines that contain missing values:

dat <- na.omit(dat)

Use dplyr to create a vector x with the body weight of all males on the control (chow) diet. What is this population’s average?

x <- filter(dat, Sex == "M" & Diet == "chow") %>% select(Bodyweight) %>% unlist

library(rafalib)
xBar = mean(x)
#Now use the rafalib package and use the popsd function to compute the population standard deviation.
xSigma = popsd(x)
#Set the seed at 1. Take a random sample X of size 25 from x. What is the sample average?
set.seed(1)
X = mean(sample(x, 25))

Use dplyr to create a vector y with the body weight of all males on the high fat (hf) diet. What is this population’s average?

y <- filter(dat, Sex == "M" & Diet == "hf") %>% select(Bodyweight) %>% unlist
yBar = mean(y)
#population standard deviation
ySigma = popsd(y)
#sample average
set.seed(1)
Y = mean(sample(y,25))

Difference between the absolute population difference (yBar - xBar) and the absolute sample difference (Y - X)

a <- yBar - xBar
b = Y - X
abs(a)-abs(b)
## [1] 1.211716

Repeat the above for females. Make sure to set the seed to 1 before each sample call.

fx = filter(dat, Sex == "F" & Diet == "chow") %>% select(Bodyweight) %>% unlist
fxBar = mean(fx)
fxSigma = popsd(fx)
set.seed(1)
fX = mean(sample(fx,25))
fy = filter(dat, Sex == "F" & Diet == "hf") %>% select(Bodyweight) %>% unlist
fyBar = mean(fy)
fySigma = popsd(fy)
set.seed(1)
fY = mean(sample(fy,25))
fa = fyBar - fxBar
fb = fY - fX
abs(fb)-abs(fa)
## [1] 0.7364828

For the females, our sample estimates were closer to the population difference than with males. What is a possible explanation for this?

#The population variance of the females is smaller than that of the males; thus, the sample average has less variability.
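
The effect of population variance on the accuracy of a sample difference can be illustrated with simulated data. The sketch below is an assumption-laden example, not the mice data: the four populations, their means (24 and 26), and their standard deviations (2 and 5) are hypothetical, and `diff_error` is an illustrative helper.

```r
# Hypothetical populations: identical means, different spreads
set.seed(1)
chow_narrow <- rnorm(5000, mean = 24, sd = 2)
hf_narrow   <- rnorm(5000, mean = 26, sd = 2)
chow_wide   <- rnorm(5000, mean = 24, sd = 5)
hf_wide     <- rnorm(5000, mean = 26, sd = 5)

# Illustrative helper: absolute error of the sample difference (size 25 per group)
# relative to the population difference, repeated `reps` times
diff_error <- function(ctrl, trt, size = 25, reps = 1000) {
  true_diff <- mean(trt) - mean(ctrl)
  replicate(reps, {
    est <- mean(sample(trt, size)) - mean(sample(ctrl, size))
    abs(est - true_diff)
  })
}

err_narrow <- diff_error(chow_narrow, hf_narrow)
err_wide   <- diff_error(chow_wide, hf_wide)

# Average absolute error is smaller when the populations vary less
print(c(narrow = mean(err_narrow), wide = mean(err_wide)))
```

With the real data, comparing popsd(fx) and popsd(fy) against popsd(x) and popsd(y) would show whether the same reasoning applies to the females and males.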