Aknowledgment: This work was adapted from the Gelman and Hill 2006 Book.

Citation:

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.

Install the ARM package for running regression

Chapter 2 Exersize Questions

1

A test is graded from 0 to 50, with an average score of 35 and a standard deviation of 10. For comparison to other tests, it would be convenient to rescale to a mean of 100 and standard deviation of 15.

  1. How can the scores be linearly transformed to have this new mean and standard deviation?

Answer

old_mean <- 35
old_std <- 10
old_var <- 10^2

new_mean <- 100
new_std <- 15
new_var <- 15^2

A linear transformation can be completed using the equation for a line

\[ y = aX + b \] Thus, E[Y]= a E[X] + b and Var[Y] = a^{2} * Var[X]

We want E[Y] = 100 and Var[Y] = 15^{2} = 225.

Using these, we get that: 100 = a * 35 + b 225 = a^{2} * 100

# Solving for a and b we get?
a = sqrt(225/100)
b = 100 - (a * 35)

So the linear transformation required is \[Y = aX + b\]

Extra Credit

  1. There is another linear transformation that also rescales the scores to have mean 100 and standard deviation 15. What is it, and why would you not want to use it for this purpose?

Answer

2

The following are the proportions of girl births in Vienna for each month in 1908 and 1909 (out of an average of 3900 births per month):

girls.1908 <- c(.4777, .4875, .4859, .4754, .4874, .4864, .4813, .4787, .4895, .4797, .4876, .4859)
girls.1909 <- c(.4857, .4907, .5010, .4903, .4860, .4911, .4871, .4725, .4822, .4870, .4823, .4973)
     
n <- 3900
     
girls <- c(girls.1908, girls.1909)

Compute the standard deviation of these proportions and compare to the standard deviation that would be expected if the sexes of babies were independently decided with a constant probability over the 24-month period.

Answer

sd_births <- sd(girls)
sd_births 
## [1] 0.006409724
mean_birth <- mean(girls) 
mean_birth
## [1] 0.485675
expected_sd <- sqrt(mean_birth * (1-mean_birth)/n)
expected_sd
## [1] 0.008003121

3

Demonstration of the Central Limit Theorem: let x = x1 + · · · + x20, the sum of 20 independent Uniform(0,1) random variables. In R, create 1000 simulations of x and plot their histogram. On the histogram, overlay a graph of the normal density function. Comment on any differences between the histogram and the curve.

Answer

sample_size <- 20
x <- 0 
y <- 1

mean_xy <- 0.5 * (x + y)
mean_xy
## [1] 0.5
sum_mean_xy <- sample_size * mean_xy
sum_mean_xy
## [1] 10
#variance_xy <- (sum((0 - .5)^2 + (1 - .5)^2) / sample_size-1)
#variance_xy

variance_xy <- (1/12) * ((y - x)^2)
variance_xy
## [1] 0.08333333
sd_xy <- sqrt(sample_size * variance_xy)
sd_xy
## [1] 1.290994
#sd_xy <- sqrt(variance_xy) * sample_size
#sd_xy
simulations <- 1000

obs_1 <- runif(sample_size * simulations)

a <- matrix(obs_1, nrow = simulations, ncol = sample_size, byrow = F)

data1 <- data.frame(out = rowSums (a))

typeof(data1)
## [1] "list"
ggplot(data1, aes(x=out), family = "Times New Roman") + geom_histogram(aes(y=after_stat(density)), binwidth = 1, alpha = 0.1) + geom_density() + stat_function(fun=dnorm)

what is after_stat

Comment:

Not sure where 1/12 came from… not sure if sample size goes inside parenthesis for sd

4

Distribution of averages and differences: the heights of men in the United States are approximately normally distributed with mean 69.1 inches and standard deviation 2.9 inches. The heights of women are approximately normally distributed with mean 63.7 inches and standard deviation 2.7 inches. Let x be the average height of 100 randomly sampled men, and y be the average height of 100 randomly sampled women. In R, create 1000 simulations of x − y and plot their histogram. Using the simulations, compute the mean and standard deviation of the distribution of x − y and compare to their exact values.

mean_male_height <- 69.1
sd_male_height <- 2.9

mean_female_height <- 63.7
sd_female_height <- 2.7

sample_size_height <- 100
simulations_height <- 1000

# Create a variable for normally distributed data for men and one for women 

male_data <- rnorm(sample_size_height * simulations_height, mean = mean_male_height, sd = sd_male_height)

female_data <- rnorm(sample_size_height * simulations_height, mean = mean_female_height, sd = sd_female_height)

# Create a Matrix of the observations data. 
matrix_male <- matrix(male_data, nrow=simulations_height, ncol=sample_size_height, byrow = F)
     
matrix_female <- matrix(female_data, nrow=simulations_height, ncol=sample_size_height, byrow = F)

# create data set or data frame. 

data_2 <- data.frame(male_height = rowMeans(matrix_male), female_height = rowMeans(matrix_female))

data_2['diff'] = data_2['male_height'] - data_2['female_height']

data_2[0:3,]
##   male_height female_height     diff
## 1    69.54401      63.73056 5.813450
## 2    69.09551      63.55170 5.543810
## 3    68.77687      63.63329 5.143576

Now plot it

ggplot(data_2, aes(x=diff)) + geom_histogram() + ggtitle("x-y") + xlab("Value") + ylab("Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mean and sd from simulation

simulation_mean <- mean_male_height - mean_female_height
simulation_mean
## [1] 5.4
data_mean <- mean(data_2$diff)
data_mean
## [1] 5.41112
simulation_sd <- sqrt (sd_male_height^2 + sd_female_height^2)
simulation_sd
## [1] 3.962323

standard deviation from simulation

sd_data <- sd(data_2$diff)
sd_data
## [1] 0.3957295

Comments

5

Correlated random variables: suppose that the heights of husbands and wives have a correlation of 0.3. Let x and y be the heights of a married couple chosen at random. What are the mean and standard deviation of the average height, (x + y)/2?

Answer

mean_x  <- mean_male_height
mean_y <- mean_female_height

stan_dev_x <- sd_male_height
stan_dev_y <- sd_female_height
height_corr <- .3
# plug into formula from chapter 2
mean_height <- (mean_x + mean_y) /2
mean_height
## [1] 66.4
sd_height <- (stan_dev_x^2 + stan_dev_y^2 + 2*height_corr*stan_dev_x*stan_dev_y^{1/2})
sd_height
## [1] 18.55911