Chapter 4 Foundations for Inference
Practice: 4.3, 4.13, 4.23, 4.25, 4.39, 4.47
Graded: 4.4, 4.14, 4.24, 4.26, 4.34, 4.40, 4.48

library('DATA606')          # Load the package
## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.
## 
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
## 
##     demo
vignette(package='DATA606') # Lists vignettes in the DATA606 package
## no vignettes found
vignette('os3')             # Loads a PDF of the OpenIntro Statistics book
## Warning: vignette 'os3' not found
data(package='DATA606')     # Lists data available in the package
4.4 Heights of adults.

Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, for 507 physically active individuals. The histogram below shows the sample distribution of heights in centimeters.

Answers
  1. Mean = 171.1; Median = 170.3

  2. SD = 9.4; IQR = 177.8 - 163.8 = 14

  3. Both z-scores (computed below) are within 2 standard deviations of the mean, so heights of 180 cm and 155 cm are not unusual.

z1 <- (180-171.1)/9.4
z1
## [1] 0.9468085
z2 <- (155-171.1)/9.4
z2
## [1] -1.712766
  4. If researchers take another sample of physically active individuals, I would not expect the mean and standard deviation to match the values above exactly. Sample-based point estimates only approximate the population parameters, and they vary from sample to sample.

  5. We use the standard error to quantify the variability of such an estimate.

SE1 <- 9.4 / sqrt(507)
SE1
## [1] 0.4174687
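The formula SE = s / sqrt(n) implies that the standard error shrinks with the square root of the sample size. A minimal sketch of that relationship, reusing the sample SD of 9.4 from above; the sample sizes 50 and 2028 are hypothetical and chosen only for illustration, and `se_mean` is a helper name introduced here:

```r
# Standard error of the mean for a given SD and sample size
se_mean <- function(s, n) s / sqrt(n)

s <- 9.4                        # sample SD from the exercise
n_values <- c(50, 507, 2028)    # 50 and 2028 are hypothetical sample sizes
se_values <- se_mean(s, n_values)
data.frame(n = n_values, SE = round(se_values, 4))
```

Quadrupling n from 507 to 2028 halves the standard error.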
4.14 Thanksgiving spending, Part I.

The 2009 holiday retail season, which kicked off on November 27, 2009 (the day after Thanksgiving), had been marked by somewhat lower self-reported consumer spending than was seen during the comparable period in 2008. To get an estimate of consumer spending, 436 randomly sampled American adults were surveyed. Daily consumer spending for the six-day period after Thanksgiving, spanning the Black Friday weekend and Cyber Monday, averaged $84.71. A 95% confidence interval based on this sample is ($80.31, $89.11). Determine whether the following statements are true or false, and explain your reasoning.

Answers:
  1. False. The confidence interval estimates the population mean, not the sample mean. The sample mean ($84.71) is the center of the interval, so it lies inside it with certainty.

  2. False. The sample of 436 individuals is large enough that a slight right skew does not invalidate the interval.

  3. False. The confidence interval is a statement about the population mean, not about the means of other samples.

  4. True

  5. True. A lower confidence level gives a narrower interval; to be more confident than 95%, we would need a wider one.

  6. False. The margin of error is proportional to 1/sqrt(n), so cutting it to a third requires a sample (3)^2 = 9 times larger.

  7. True. Margin of error is 4.4

n <- 436
m <- 84.71
x1 <- 80.31
x2 <- 89.11
(x2-x1) / 2
## [1] 4.4
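The reasoning in item 6 can be checked numerically. The sketch below assumes the interval was built as mean ± 1.96 × SE (the multiplier is an assumption; the exercise reports only the interval) and backs out the implied standard deviation. The helper `margin` is a name introduced here for illustration.

```r
n  <- 436
me <- (89.11 - 80.31) / 2            # margin of error from the reported interval
z_star <- 1.96                       # assumed 95% multiplier
s_implied <- me / z_star * sqrt(n)   # SD implied by ME = z* x s / sqrt(n)

# Margin of error for a given sample size, holding the implied SD fixed
margin <- function(n) z_star * s_implied / sqrt(n)

margin(3 * n) / me   # 1 / sqrt(3), about 0.577: not a third
margin(9 * n) / me   # 1 / 3
```

Tripling the sample only shrinks the margin of error by a factor of sqrt(3); a ninefold increase is needed to reach one third.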
4.24 Gifted children, Part I.

Researchers investigating characteristics of gifted children collected data from schools in a large city on a random sample of thirty-six children who were identified as gifted children soon after they reached the age of four. The following histogram shows the distribution of the ages (in months) at which these children first counted to 10 successfully. Also provided are some sample statistics.

Answers:
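  1. Yes. The sample is random, so the observations are independent, and the sample size is large enough (n = 36 > 30).
  2. H0: the mean age is 32 months; HA: the mean age is less than 32 months. Since the p-value (computed below) is less than the significance level of 0.10, we reject the null hypothesis. The data support the alternative that gifted children first count to 10 at an average age below 32 months.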
n <- 36
m <- 30.69
min <- 21
sd <- 4.31
max <- 39
slevel <- 0.10

## hypothesis that mean = 32
x1 <- 32

## standard error of mean 30.69
SE <- sd/sqrt(n)
SE
## [1] 0.7183333
## z-score of the sample mean of 30.69
z <- (m - x1)/ SE
z
## [1] -1.823666
## p-value: the lower-tail area below the z-score, i.e. the probability of a sample mean of 30.69 or less if the true mean is 32.

pval <- pnorm(z, mean=0, sd=1)
pval
## [1] 0.0341013
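  3. If the true mean were 32 months, the probability of observing a sample mean of 30.69 months or lower would be about 0.034. This is smaller than the significance level of 0.10, so the data provide evidence against the null hypothesis and in favor of the alternative; they do not prove it.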

lower_b <- (m - 1.65*SE)
upper_b <- (m + 1.65*SE)
lower_b
## [1] 29.50475
upper_b
## [1] 31.87525
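  4. We are 90% confident that the average age at which gifted children first count to 10 is between 29.50 and 31.88 months. This agrees with the hypothesis test: the interval lies entirely below 32 months.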
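The multiplier 1.65 used above is a rounded critical value. A minimal sketch of the same test and interval that takes the critical value from `qnorm()` instead; the helper name `z_test_lower` is hypothetical and not part of DATA606:

```r
# One-sample, lower-tailed z-test plus a two-sided confidence interval
z_test_lower <- function(xbar, mu0, s, n, conf = 0.90) {
  se <- s / sqrt(n)
  z  <- (xbar - mu0) / se
  z_star <- qnorm(1 - (1 - conf) / 2)   # about 1.645 for 90% confidence
  list(se = se, z = z, p_value = pnorm(z),
       ci = c(xbar - z_star * se, xbar + z_star * se))
}

res <- z_test_lower(xbar = 30.69, mu0 = 32, s = 4.31, n = 36)
res$p_value
res$ci
```

With the unrounded critical value the interval is marginally narrower than with 1.65, and it still excludes 32.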
4.26 Gifted children, Part II.

Exercise 4.24 describes a study on gifted children. In this study, along with variables on the children, the researchers also collected data on the mother’s and father’s IQ of the 36 randomly sampled gifted children. The histogram below shows the distribution of mother’s IQ. Also provided are some sample statistics.

Answers:

Sample statistics: n = 36, min = 101, mean = 118.2, sd = 6.5, max = 131. Hypothesized mean: 100.

  1. Null hypothesis H0: μ = 100. Alternative hypothesis HA: μ ≠ 100.
n <- 36 
min <- 101
m <- 118.2 
sd <- 6.5
max <- 131

## hypothesis that IQ is 100
x1 <- 100

## standard error for mean value 118.2
SE <- sd/sqrt(n)
SE
## [1] 1.083333
## z score of 118.2
z <- (m-x1)/SE
z
## [1] 16.8
## Lower-tail area: probability of a sample mean below 118.2 under H0
pval <- pnorm(z, mean=0, sd=1)

## Two-sided p-value: double the upper-tail area, since HA is mean != 100
2 * (1 - pval)
## [1] 0

Since the p-value is effectively 0, which is less than 0.10, we reject the null hypothesis that the mean is 100.

  2. Calculate a 90% confidence interval for the average IQ of mothers of gifted children.
lower_b <- (m - 1.65*SE)
upper_b <- (m + 1.65*SE)
lower_b
## [1] 116.4125
upper_b
## [1] 119.9875
  3. Yes. The test rejects the hypothesis that the mothers' average IQ is 100, and the 90% confidence interval agrees: 100 falls outside the interval (116.41, 119.99), which lies well above it.
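The subtraction from 1 in the test above prints exactly 0 because `pnorm(16.8)` rounds to 1 in double precision. A sketch that asks `pnorm()` for the upper tail directly, which keeps the tiny probability representable, and doubles it for the two-sided alternative (the variable names `mu0`, `s`, and `p_two_sided` are introduced here for illustration):

```r
m  <- 118.2; mu0 <- 100; s <- 6.5; n <- 36
se <- s / sqrt(n)
z  <- (m - mu0) / se

# Two-sided p-value computed from the upper tail to avoid rounding to 0
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_two_sided
p_two_sided < 0.10   # TRUE: reject H0 at the 0.10 level
```

The conclusion is unchanged; the only difference is that the reported p-value is a very small positive number rather than a literal 0.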
4.34 CLT. Define the term “sampling distribution” of the mean, and describe how the shape, center, and spread of the sampling distribution of the mean change as sample size increases.
Answer:

A sampling distribution of the mean is the distribution of sample means taken from many different samples of the same size from a population. As the sample size increases, the sampling distribution becomes more normal in shape, its center stays at the population mean, and its spread decreases.
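These claims can be illustrated by simulation. The sketch below draws repeated samples from a right-skewed population (an exponential distribution with mean 1, chosen purely for illustration) and summarizes the simulated sample means for a few arbitrary sample sizes.

```r
set.seed(606)                  # arbitrary seed for reproducibility
pop_sd <- 1                    # Exp(rate = 1) has mean 1 and SD 1
sizes <- c(5, 30, 200)         # illustrative sample sizes

# For each n, simulate 5000 sample means and summarize their distribution
sim <- sapply(sizes, function(n) {
  means <- replicate(5000, mean(rexp(n, rate = 1)))
  c(n = n, center = mean(means), spread = sd(means),
    theory_se = pop_sd / sqrt(n))
})
round(t(sim), 3)
```

The center column stays near the population mean of 1 for every n, while the spread column tracks 1/sqrt(n). A histogram of the simulated means for each n would show the skew fading as n grows.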

4.40 CFLBs. A manufacturer of compact fluorescent light bulbs advertises that the distribution of the lifespans of these light bulbs is nearly normal with a mean of 9,000 hours and a standard deviation of 1,000 hours.
Answer:
m <- 9000
sd <- 1000

x <- 10500

z <- (x-m)/sd
z
## [1] 1.5
pval <- pnorm(z)
p <- 1-pval
p
## [1] 0.0668072
## Alternative way of solving
1 - pnorm(10500, 9000, 1000)
## [1] 0.0668072
## plot
normalPlot(bounds = c(z,Inf))

  2. Since the lifespans are nearly normal, the mean lifespan of 15 bulbs is also nearly normal, with mean 9,000 hours and standard error 258.2 hours (calculated below): N(9000, 258.2).
## standard error of the mean lifespan of 15 bulbs
n <- 15

se <- sd / sqrt(n)
se
## [1] 258.1989
  3. What is the probability that the mean lifespan of 15 randomly chosen light bulbs is more than 10,500 hours?
n <- 15

se <- sd / sqrt(n)
se
## [1] 258.1989
z <- (x-m)/se
z
## [1] 5.809475
## probability of more than 10500 hours 
1 - pnorm(z)
## [1] 3.133452e-09
  4. Sketch the two distributions (population and sampling) on the same scale.
x <- 6000:12000
m
## [1] 9000
sd
## [1] 1000
se
## [1] 258.1989
y1 <- dnorm(x, m, sd)
y2 <- dnorm(x, m, se)
plot(x,y1,type="l",col="red")
lines(x,y2,col="blue")

  5. No. We could not estimate the probabilities in parts (a) and (c) if the lifespans were not nearly normal. In (a), the z-score could no longer be converted to a probability with the normal model, and in (c) a sample of 15 is too small for the sampling distribution of the mean to be approximated as normal.

4.48 Same observation, different sample size. Suppose you conduct a hypothesis test based on a sample where the sample size is n = 50, and arrive at a p-value of 0.08. You then refer back to your notes and discover that you made a careless mistake, the sample size should have been n = 500. Will your p-value increase, decrease, or stay the same? Explain.
Answer:

The p-value will decrease. With a larger sample size the SE shrinks, so the Z score, which has SE in its denominator, grows in magnitude by a factor of sqrt(500/50), about 3.16. The test statistic moves further from the null value, which reduces the tail area that the p-value measures.
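A numeric sketch of this argument, under assumptions the exercise does not state: the test is one-sided, and the sample mean and standard deviation are unchanged, so only n in the standard error changes.

```r
p_old <- 0.08
n_old <- 50
n_new <- 500

z_old <- qnorm(1 - p_old)               # z-score implied by the original p-value
z_new <- z_old * sqrt(n_new / n_old)    # SE shrinks by sqrt(10), so z grows by sqrt(10)
p_new <- pnorm(z_new, lower.tail = FALSE)

c(z_old = z_old, z_new = z_new, p_new = p_new)
```

Under these assumptions the corrected p-value is far smaller than 0.08; a two-sided test would lead to the same direction of change.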