Aknowledgment: This work was adapted from the Gelman and Hill 2006 Book.
Citation:
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
Install the ARM package for running regression
A test is graded from 0 to 50, with an average score of 35 and a standard deviation of 10. For comparison to other tests, it would be convenient to rescale to a mean of 100 and standard deviation of 15.
old_mean <- 35
old_std <- 10
old_var <- 10^2
new_mean <- 100
new_std <- 15
new_var <- 15^2
A linear transformation can be completed using the equation for a line
\[ y = aX + b \] Thus, E[Y]= a E[X] + b and Var[Y] = a^{2} * Var[X]
We want E[Y] = 100 and Var[Y] = 15^{2} = 225.
Using these, we get that: 100 = a * 35 + b 225 = a^{2} * 100
# Solving for a and b we get?
a = sqrt(225/100)
b = 100 - (a * 35)
So the linear transformation required is \[Y = aX + b\]
The following are the proportions of girl births in Vienna for each month in 1908 and 1909 (out of an average of 3900 births per month):
girls.1908 <- c(.4777, .4875, .4859, .4754, .4874, .4864, .4813, .4787, .4895, .4797, .4876, .4859)
girls.1909 <- c(.4857, .4907, .5010, .4903, .4860, .4911, .4871, .4725, .4822, .4870, .4823, .4973)
n <- 3900
girls <- c(girls.1908, girls.1909)
Compute the standard deviation of these proportions and compare to the standard deviation that would be expected if the sexes of babies were independently decided with a constant probability over the 24-month period.
sd_births <- sd(girls)
sd_births
## [1] 0.006409724
mean_birth <- mean(girls)
mean_birth
## [1] 0.485675
expected_sd <- sqrt(mean_birth * (1-mean_birth)/n)
expected_sd
## [1] 0.008003121
Demonstration of the Central Limit Theorem: let x = x1 + · · · + x20, the sum of 20 independent Uniform(0,1) random variables. In R, create 1000 simulations of x and plot their histogram. On the histogram, overlay a graph of the normal density function. Comment on any differences between the histogram and the curve.
sample_size <- 20
x <- 0
y <- 1
mean_xy <- 0.5 * (x + y)
mean_xy
## [1] 0.5
sum_mean_xy <- sample_size * mean_xy
sum_mean_xy
## [1] 10
#variance_xy <- (sum((0 - .5)^2 + (1 - .5)^2) / sample_size-1)
#variance_xy
variance_xy <- (1/12) * ((y - x)^2)
variance_xy
## [1] 0.08333333
sd_xy <- sqrt(sample_size * variance_xy)
sd_xy
## [1] 1.290994
#sd_xy <- sqrt(variance_xy) * sample_size
#sd_xy
simulations <- 1000
obs_1 <- runif(sample_size * simulations)
a <- matrix(obs_1, nrow = simulations, ncol = sample_size, byrow = F)
data1 <- data.frame(out = rowSums (a))
typeof(data1)
## [1] "list"
ggplot(data1, aes(x=out), family = "Times New Roman") + geom_histogram(aes(y=after_stat(density)), binwidth = 1, alpha = 0.1) + geom_density() + stat_function(fun=dnorm)
what is after_stat
Not sure where 1/12 came from… not sure if sample size goes inside parenthesis for sd
Distribution of averages and differences: the heights of men in the United States are approximately normally distributed with mean 69.1 inches and standard deviation 2.9 inches. The heights of women are approximately normally distributed with mean 63.7 inches and standard deviation 2.7 inches. Let x be the average height of 100 randomly sampled men, and y be the average height of 100 randomly sampled women. In R, create 1000 simulations of x − y and plot their histogram. Using the simulations, compute the mean and standard deviation of the distribution of x − y and compare to their exact values.
mean_male_height <- 69.1
sd_male_height <- 2.9
mean_female_height <- 63.7
sd_female_height <- 2.7
sample_size_height <- 100
simulations_height <- 1000
# Create a variable for normally distributed data for men and one for women
male_data <- rnorm(sample_size_height * simulations_height, mean = mean_male_height, sd = sd_male_height)
female_data <- rnorm(sample_size_height * simulations_height, mean = mean_female_height, sd = sd_female_height)
# Create a Matrix of the observations data.
matrix_male <- matrix(male_data, nrow=simulations_height, ncol=sample_size_height, byrow = F)
matrix_female <- matrix(female_data, nrow=simulations_height, ncol=sample_size_height, byrow = F)
# create data set or data frame.
data_2 <- data.frame(male_height = rowMeans(matrix_male), female_height = rowMeans(matrix_female))
data_2['diff'] = data_2['male_height'] - data_2['female_height']
data_2[0:3,]
## male_height female_height diff
## 1 69.54401 63.73056 5.813450
## 2 69.09551 63.55170 5.543810
## 3 68.77687 63.63329 5.143576
Now plot it
ggplot(data_2, aes(x=diff)) + geom_histogram() + ggtitle("x-y") + xlab("Value") + ylab("Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mean and sd from simulation
simulation_mean <- mean_male_height - mean_female_height
simulation_mean
## [1] 5.4
data_mean <- mean(data_2$diff)
data_mean
## [1] 5.41112
simulation_sd <- sqrt (sd_male_height^2 + sd_female_height^2)
simulation_sd
## [1] 3.962323
standard deviation from simulation
sd_data <- sd(data_2$diff)
sd_data
## [1] 0.3957295
Correlated random variables: suppose that the heights of husbands and wives have a correlation of 0.3. Let x and y be the heights of a married couple chosen at random. What are the mean and standard deviation of the average height, (x + y)/2?
mean_x <- mean_male_height
mean_y <- mean_female_height
stan_dev_x <- sd_male_height
stan_dev_y <- sd_female_height
height_corr <- .3
# plug into formula from chapter 2
mean_height <- (mean_x + mean_y) /2
mean_height
## [1] 66.4
sd_height <- (stan_dev_x^2 + stan_dev_y^2 + 2*height_corr*stan_dev_x*stan_dev_y^{1/2})
sd_height
## [1] 18.55911
Comments