Foundations for Inference

Statistics and Probability for Data Analytics

CUNY MSDS DATA 606

Rose Koh

2018/03/14

Assignments

Chapter 4 Foundations for Inference

Practice: 4.3, 4.13, 4.23, 4.25, 4.39, 4.47
Graded: 4.4, 4.14, 4.24, 4.26, 4.34, 4.40, 4.48

The point estimate for the average height of active individuals is 171.143787
The median is 170.3

The point estimate for the standard deviation of the heights of active individuals is 9.4072052
The IQR is 14

To be considered unusual, a value needs to lie more than 2 standard deviations from the mean. For this case, we treat a height as unusual when |Z| > 2.

180cm

x <- 180
mu <- mean(bdims$hgt)
sd <- sd(bdims$hgt)
z <- (x - mu) / sd # 0.94
abs(z) > 2

## [1] FALSE

As above, we can conclude that 180cm is not unusual.

155cm

x <- 155
mu <- mean(bdims$hgt)
sd <- sd(bdims$hgt)
z <- (x - mu) / sd # -1.71
abs(z) > 2

## [1] FALSE

As above, we can conclude that 155cm is not unusual.
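The two checks above repeat the same steps, so they can be folded into one small helper. This is a minimal sketch: the function name `is_unusual` and its `cutoff` argument are hypothetical, and the mean and SD are hard-coded from the point estimates reported earlier instead of being recomputed from `bdims`.

```r
# Hypothetical helper: flags values more than `cutoff` SDs from the mean.
# mu and sigma default to the point estimates reported above (an assumption
# made so the block runs without the bdims data set).
is_unusual <- function(x, mu = 171.143787, sigma = 9.4072052, cutoff = 2) {
  z <- (x - mu) / sigma
  abs(z) > cutoff
}

heights <- c(180, 155)
unusual <- is_unusual(heights)
print(unusual)
```

Because the comparison uses `abs(z)`, the same rule covers heights that are unusually short as well as unusually tall.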

No, I would not expect a new sample to match the values above. Its mean and SD would be close to these values, but not the same.

n.samp <- 507
SE <- sd/sqrt(n.samp)
SE

## [1] 0.4177887
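To see what the standard error describes, the sketch below draws many samples of size 507 and compares the spread of their means with sd / sqrt(n). It assumes a normal population whose mean and SD equal the point estimates above; that population is an assumption for illustration, not the real `bdims` data.

```r
# Sketch: repeated sampling from an ASSUMED normal population
# (mean and SD set to the point estimates reported above).
set.seed(606)
pop_mean <- 171.143787
pop_sd   <- 9.4072052
n_samp   <- 507

# Mean of each of 2000 simulated samples of size 507
sample_means <- replicate(2000, mean(rnorm(n_samp, pop_mean, pop_sd)))

empirical_se   <- sd(sample_means)        # spread of the simulated means
theoretical_se <- pop_sd / sqrt(n_samp)   # sd / sqrt(n), as computed above
print(c(empirical = empirical_se, theoretical = theoretical_se))
```

Each simulated mean differs slightly from the others, which is why a new sample would not reproduce the original estimates exactly.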

F The sample mean (point estimate) 84.7067651 is always inside the confidence interval. The 95% confidence interval is a statement about the population mean (a parameter): we are 95% confident that it covers that parameter.
F For the sample mean to be nearly normal and the SE to be accurate, the sample observations should be independent and the sample size should be larger than 30. The population distribution should not be strongly skewed. If any prominent outliers are present, the sample should have at least 100 observations. The conditions for the confidence interval are met in this example. Every sample will show a somewhat different distribution.
F Samples are used to make inferences about the population. Because samples vary, different samples and different sample sizes give different confidence intervals, as we saw in lab4b. We cannot state that the means of 95% of random samples (n = 436) lie within this confidence interval.
T We are 95% confident that the confidence interval covers the parameter value (the average spending of American adults).
T The higher the confidence level, the wider the interval; a lower confidence level gives a narrower one.
F A sample 3 times bigger is not enough, since SE = sigma / sqrt(n). To shrink the margin of error to 1/3 of what it is now, we need a sample 9 times bigger than 436, because sigma / sqrt(9 * n) = (1/3) * sigma / sqrt(n).
T The margin of error is given by z * SE, which is 4.4.
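The claim about needing 9 times the sample size can be checked numerically. In this minimal sketch the function `margin_of_error` is hypothetical and `sigma = 45` is a made-up value; only the ratios matter, and they do not depend on sigma.

```r
# Margin of error for a z-based interval: z* times sigma / sqrt(n).
# sigma = 45 is a made-up value used for illustration only.
margin_of_error <- function(sigma, n, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  z_star * sigma / sqrt(n)
}

sigma <- 45
n     <- 436
me_n  <- margin_of_error(sigma, n)
me_3n <- margin_of_error(sigma, 3 * n)
me_9n <- margin_of_error(sigma, 9 * n)

print(me_n / me_3n)  # ratio when n is tripled
print(me_n / me_9n)  # ratio when n is multiplied by 9
```

Tripling n shrinks the margin of error by a factor of sqrt(3) only, while multiplying n by 9 shrinks it by a factor of 3.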

The samples are independent and randomly selected, and the sample size is larger than 30. The sample satisfies the basic requirements.
H0: mu = 32
H1: mu < 32

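x <- 32        # null value: mean age (months) in the general population
n <- 36        # sample size
mu <- 30.69    # sample mean
sd <- 4.31     # sample standard deviation
max <- 39      # sample maximum (not used below)
alpha <- 0.10  # significance level

Z <- (mu - x) / (sd / sqrt(n))   # test statistic, about -1.82
P <- pnorm(Z, mean = 0, sd = 1)  # lower-tail p-value for H1: mu < 32
P >= alpha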

## [1] FALSE

As above, the p-value is lower than the significance level of 0.10, thus we reject the null hypothesis H0.

There is significant evidence to infer that gifted children can count to 10 earlier than the general population does.

low <- mu - 1.645 * sd / sqrt(n)
high <- mu + 1.645 * sd / sqrt(n)

The 90% confidence interval is given by (29.5083417, 31.8716583).

e. Yes. With 90% confidence, the population mean for gifted children is between 29.5083417 and 31.8716583. The value of 32 months is outside of the confidence interval, so it is not a plausible value for the mean, which agrees with the result of the hypothesis test.
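The test and the interval above use the same ingredients, so they can be computed together from the summary statistics. This is a minimal sketch: the helper name `z_summary` is hypothetical, and it uses `qnorm(0.95)` instead of the rounded 1.645, so the interval bounds may differ from the ones above in the fourth decimal place.

```r
# Hypothetical helper: one-sample z-test from summary statistics.
# Returns the test statistic, the lower-tail p-value (for Ha: mu < null),
# and a two-sided confidence interval at level `conf`.
z_summary <- function(xbar, s, n, null_value, conf = 0.90) {
  se     <- s / sqrt(n)
  z      <- (xbar - null_value) / se
  z_star <- qnorm(1 - (1 - conf) / 2)
  list(z = z,
       p_lower = pnorm(z),
       ci = c(xbar - z_star * se, xbar + z_star * se))
}

# Values taken from the exercise above
res <- z_summary(xbar = 30.69, s = 4.31, n = 36, null_value = 32)
print(res)
```

The interval excludes 32 and the p-value falls below 0.10, so both approaches lead to the same decision.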

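x <- 100       # null value: mean IQ in the general population
n <- 36        # sample size
min <- 101     # sample minimum (not used below)
mu <- 118.2    # sample mean
sd <- 6.5      # sample standard deviation
max <- 131     # sample maximum (not used below)
alpha <- 0.10  # significance level

z <- (mu - x) / (sd / sqrt(n))  # divide by the standard error, not the raw SD
p <- 2 * (1 - pnorm(abs(z)))    # two-sided p-value for Ha: mu != 100

p < alpha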

## [1] TRUE

H0: mu = 100, avg of gifted children’s mothers’ IQ = avg of population’s IQ
Ha: mu != 100, avg of gifted children’s mothers’ IQ != avg of population’s IQ

Since p < alpha is TRUE, we reject the null hypothesis. The data favor the claim that mothers of gifted children have a higher mean IQ than mothers in the general population.

SE  <- sd / sqrt(n)
high <- mu + (1.645 * SE)
low <- mu - (1.645 * SE)

A 90% confidence interval for the average IQ of mothers of gifted children is (116.4179167, 119.9820833).

We rejected the null hypothesis because the p-value is near 0, less than the given significance level of 0.10.
The 90% confidence interval (116.4179167, 119.9820833) does not include the proposed mean (100), which also favors Ha.

Sampling distribution of the mean:

It is built from random, independent samples of a constant sample size n.
It is the distribution of the values of the mean from all the samples.
It obeys the Central Limit Theorem in that it is approximately normal (given that the sample size is larger than 30 and the population is not strongly skewed), and the sample means concentrate around the population mean (the spread becomes narrower) as the sample size increases.
- As the sample size increases
  - The shape becomes closer to a normal distribution (normal curve)
  - The center becomes taller (values close to the true population mean become more frequent)
  - The spread becomes narrower
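The behavior described in the list above can be seen in a short simulation. This is a sketch under an assumption: the population is taken to be exponential with rate 1 (mean 1, SD 1), chosen only because it is clearly right-skewed; the helper `sim_means` is hypothetical.

```r
# Sketch: sampling distribution of the mean for a right-skewed population.
# The exponential population (rate = 1, so mean 1 and SD 1) is an assumption.
set.seed(606)
sim_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

sizes   <- c(5, 30, 100)
results <- sapply(sizes, function(n) {
  m <- sim_means(n)
  # center and spread of the simulated means, next to sigma / sqrt(n)
  c(n = n, center = mean(m), spread = sd(m), theory = 1 / sqrt(n))
})
print(round(results, 3))
```

The center stays near the population mean for every n, while the spread tracks 1 / sqrt(n) and shrinks as n grows. Calling `hist(sim_means(5))` and `hist(sim_means(100))` would show the shape moving toward a normal curve.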

mu <- 9000
sd <- 1000
x <- 10500
z <- (x - mu) / sd
prob <- 1 - pnorm(z)

The probability that a randomly chosen light bulb lasts more than 10,500 hours is 0.0668072

For a random sample of 15 independent light bulbs, the distribution of the mean lifespan is centered at the population mean of 9000 hours, has a standard error of 1000 / sqrt(15) (about 258 hours), and has a nearly normal shape.

n <- 15
se <- sd / sqrt(n)
z <- (x - mu) / se   # already standardized, about 5.81
p <- 1 - pnorm(z)    # upper tail of the standard normal
p                    # about 3e-09, effectively zero

The probability that the mean lifespan of 15 randomly chosen light bulbs is more than 10,500 hours is approximately 0%.

library(ggplot2)

# Density of individual lifespans (population): mean mu, SD sd
normal.sample <- seq(mu - (4 * sd), mu + (4 * sd), length = 100)
h.norm <- dnorm(normal.sample, mean = mu, sd = sd)
df <- data.frame(name = "Population", x = normal.sample, h.norm)

# Density of the mean of 15 lifespans (sampling distribution): mean mu, SD se
random.sample <- seq(mu - (4 * se), mu + (4 * se), length = 100)
h.rand <- dnorm(random.sample, mean = mu, sd = se)
df <- rbind(df, data.frame(name = "Sample", x = random.sample, h.norm = h.rand))

ggplot(df, aes(x, h.norm,
               color = name)) + geom_line()

If the lifespans of light bulbs had a skewed distribution, we should not estimate the probabilities in either (a) or (c) using the normal distribution.
- For part (a), we would need to use the skewed distribution itself to do the probability calculation.
- For part (c), since the sample size is smaller than 30, we can’t use the CLT to assume that the sample mean is approximately normal.
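To illustrate the point about part (a), the sketch below compares the normal answer with the answer under one possible skewed model. The gamma distribution here is an assumed stand-in for "skewed lifespans", with shape and rate chosen to match a mean of 9000 and an SD of 1000; nothing in the exercise says lifespans follow a gamma distribution.

```r
# Compare P(X > 10500) under the normal model and a right-skewed alternative.
# The gamma model is an ASSUMPTION; shape and rate are chosen so that
# its mean is 9000 and its SD is 1000.
mu    <- 9000
sigma <- 1000
shape <- (mu / sigma)^2   # 81
rate  <- mu / sigma^2     # 0.009

p_normal <- pnorm(10500, mean = mu, sd = sigma, lower.tail = FALSE)
p_gamma  <- pgamma(10500, shape = shape, rate = rate, lower.tail = FALSE)
print(c(normal = p_normal, gamma = p_gamma))
```

Even with the same mean and SD, the two models give different tail probabilities, which is why the normal calculation in part (a) depends on the normality assumption.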

The p-value depends on the standard error.

To calculate the standard error:

sd(population) / sqrt(n)

With the SE, we can calculate the Z score:

(point estimate - null value) / standard error

To calculate the p-value:

take 1 - the probability value from the Z score, and multiply it by 2 (two-sided test)

So if we use N = 500, the denominator in the SE would be larger, thus the SE would be smaller. If the SE becomes smaller, then the Z score will get larger in absolute value. If the Z score is larger, then (1 - prob value from the Z score) * 2 will get smaller.

Thus the p-value will decrease.

For the same point estimate, a higher N makes it easier to reject the null hypothesis H0 in favor of the alternative hypothesis Ha.
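The steps above can be sketched as a small function. All numbers in the example calls are hypothetical (a point estimate of 101, a null value of 100, sigma of 10, and a smaller comparison sample of 50); only N changes between the two calls.

```r
# Two-sided p-value from summary statistics, following the steps above.
# All numbers in the calls below are hypothetical.
two_sided_p <- function(point_est, null_value, sigma, n) {
  se <- sigma / sqrt(n)               # standard error
  z  <- (point_est - null_value) / se # Z score
  2 * (1 - pnorm(abs(z)))             # two-sided p-value
}

p_small_n <- two_sided_p(point_est = 101, null_value = 100, sigma = 10, n = 50)
p_large_n <- two_sided_p(point_est = 101, null_value = 100, sigma = 10, n = 500)
print(c(n_50 = p_small_n, n_500 = p_large_n))
```

With the point estimate held fixed, the larger sample produces the smaller p-value, matching the reasoning above.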