Chapter 4 Foundations for Inference
Practice: 4.3, 4.13, 4.23, 4.25, 4.39, 4.47
Graded: 4.4, 4.14, 4.24, 4.26, 4.34, 4.40, 4.48
library('DATA606') # Load the package
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
vignette(package='DATA606') # Lists vignettes in the DATA606 package
## no vignettes found
vignette('os3') # Loads a PDF of the OpenIntro Statistics book
## Warning: vignette 'os3' not found
data(package='DATA606') # Lists data available in the package
Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, for 507 physically active individuals. The histogram below shows the sample distribution of heights in centimeters.
Mean = 171.1; Median = 170.3
SD = 9.4; IQR = 177.8 - 163.8 = 14
Since both z-scores (z1 and z2) are within 2 standard deviations of the mean, neither 180 cm nor 155 cm is an unusual height.
z1 <- (180-171.1)/9.4
z1
## [1] 0.9468085
z2 <- (155-171.1)/9.4
z2
## [1] -1.712766
If researchers take another sample of physically active individuals, I would expect the mean and standard deviation to differ from the values above, because sample-based point estimates only approximate the population parameters and vary from sample to sample.
We use the standard error to quantify the variability of such an estimate.
SE1 <- 9.4 / sqrt(507)
SE1
## [1] 0.4174687
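A minimal simulation sketch of what the standard error measures. It assumes, purely for illustration, that heights are normally distributed with the sample mean and SD standing in for the population values; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(606)            # arbitrary seed, for reproducibility only
pop_mean <- 171.1        # assumed population mean (borrowed from the sample)
pop_sd   <- 9.4          # assumed population SD (borrowed from the sample)
n_obs    <- 507          # sample size used in the study

# Draw 2000 hypothetical samples of 507 people and record each sample mean
sim_means <- replicate(2000, mean(rnorm(n_obs, pop_mean, pop_sd)))

sd(sim_means)            # spread of the simulated sample means
pop_sd / sqrt(n_obs)     # formula-based standard error
```

The two printed values should be close to each other, which is the sense in which SE describes how much a sample mean varies from sample to sample.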
The 2009 holiday retail season, which kicked off on November 27, 2009 (the day after Thanksgiving), had been marked by somewhat lower self-reported consumer spending than was seen during the comparable period in 2008. To get an estimate of consumer spending, 436 randomly sampled American adults were surveyed. Daily consumer spending for the six-day period after Thanksgiving, spanning the Black Friday weekend and Cyber Monday, averaged $84.71. A 95% confidence interval based on this sample is ($80.31, $89.11). Determine whether the following statements are true or false, and explain your reasoning.
False. The point estimate always lies inside the confidence interval, so we are 100% certain, not 95% confident, that the average spending of these 436 adults is in this range. The interval is a statement about the population mean.
False. Since the sample of 436 individuals is large enough, a slight right skew does not play an important role.
False. A confidence interval is a statement about the population parameter, not about the means of other samples.
True
True. To be more confident we need a wider interval; to be less confident than 95%, the interval will be narrower.
False. To decrease the margin of error to a third, we would have to increase the sample size by (3)^2 = 9 times.
True. Margin of error is 4.4
n <- 436
m <- 84.71
x1 <- 80.31
x2 <- 89.11
(x2 - x1) / 2
## [1] 4.4
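The reasoning about interval width and sample size can be checked numerically. This is a rough sketch that assumes the reported interval was built with a normal critical value (`qnorm(0.975)`); the variable names are illustrative only.

```r
lower <- 80.31
upper <- 89.11
me95  <- (upper - lower) / 2       # margin of error of the 95% interval
se_95 <- me95 / qnorm(0.975)       # standard error implied by the interval (assumed z-based)

# A 90% interval uses a smaller critical value, so it is narrower
me90 <- qnorm(0.95) * se_95
c(me90 = me90, me95 = me95)

# SE is proportional to 1 / sqrt(n): a sample 3 times larger shrinks the
# margin of error only by sqrt(3); a sample 9 times larger shrinks it to a third
me_3n <- me95 / sqrt(3)
me_9n <- me95 / sqrt(9)
c(me_3n = me_3n, me_9n = me_9n)
```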
Researchers investigating characteristics of gifted children collected data from schools in a large city on a random sample of thirty-six children who were identified as gifted children soon after they reached the age of four. The following histogram shows the distribution of the ages (in months) at which these children first counted to 10 successfully. Also provided are some sample statistics.
n <- 36
m <- 30.69
min <- 21
sd <- 4.31
max <- 39
slevel <- 0.10
## hypothesis that mean = 32
x1 <- 32
## standard error of mean 30.69
SE <- sd/sqrt(n)
SE
## [1] 0.7183333
## z-score of the sample mean of 30.69
z <- (m - x1)/ SE
z
## [1] -1.823666
## p-value: the area below the z-score, i.e. the probability of observing a sample mean of 30.69 or less if the null hypothesis (mean = 32) is true.
pval <- pnorm(z, mean=0, sd=1)
pval
## [1] 0.0341013
The p-value of 0.034 is smaller than the significance level of 0.10, so we reject the null hypothesis. The data provide evidence for the alternative hypothesis that the mean is less than 32 months; they do not prove it.
## 90% confidence interval for the mean, using z* = 1.65
lower_b <- (m - 1.65*SE)
upper_b <- (m + 1.65*SE)
lower_b
## [1] 29.50475
upper_b
## [1] 31.87525
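The test steps above (standard error, z-score, tail area) can be bundled into one function. `z_test_mean` is a hypothetical helper written for this sketch, not something provided by the DATA606 package, and it assumes the normal approximation is appropriate.

```r
# Hypothetical helper (not part of DATA606): z-test for a single mean
z_test_mean <- function(xbar, mu0, s, n,
                        alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  se <- s / sqrt(n)                  # standard error of the sample mean
  z  <- (xbar - mu0) / se            # test statistic
  p  <- switch(alternative,
               less      = pnorm(z),
               greater   = pnorm(z, lower.tail = FALSE),
               two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
  list(se = se, z = z, p.value = p)
}

# Gifted children: is the mean age at first counting to 10 below 32 months?
res <- z_test_mean(xbar = 30.69, mu0 = 32, s = 4.31, n = 36, alternative = "less")
res$z
res$p.value      # compare with the significance level of 0.10
```

Passing the numbers as arguments also avoids reusing names such as `sd`, `min` and `max`, which mask base R functions when assigned in the workspace.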
Exercise 4.24 describes a study on gifted children. In this study, along with variables on the children, the researchers also collected data on the mother’s and father’s IQ of the 36 randomly sampled gifted children. The histogram below shows the distribution of mother’s IQ. Also provided are some sample statistics.
n = 36; min = 101; mean = 118.2; sd = 6.5; max = 131; null value = 100
n <- 36
min <- 101
m <- 118.2
sd <- 6.5
max <- 131
## hypothesis that IQ is 100
x1 <- 100
## standard error for mean value 118.2
SE <- sd/sqrt(n)
SE
## [1] 1.083333
## z score of 118.2
z <- (m-x1)/SE
z
## [1] 16.8
## P-val indicating the probability of mean < 118.2
pval <- pnorm(z, mean=0, sd=1)
## Percentile for > 118.2 (Probability of mean > 118.2)
1 - pval
## [1] 0
Since the p-value is effectively 0 (it prints as exactly 0 only because of floating-point rounding) and is far less than 0.10, we reject the null hypothesis that the mean IQ is 100.
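A short sketch of why the printed p-value is exactly 0. For a z-score this large, `pnorm(z)` rounds to 1 in double precision, so `1 - pnorm(z)` loses the tail area entirely. Base R's `lower.tail = FALSE` argument computes the upper tail directly. The name `z_iq` is introduced here only for illustration.

```r
z_iq <- (118.2 - 100) / (6.5 / sqrt(36))     # same z-score as above, about 16.8

p_subtract <- 1 - pnorm(z_iq)                # subtraction from 1 rounds to 0
p_upper    <- pnorm(z_iq, lower.tail = FALSE) # upper tail computed directly

p_subtract
p_upper                                      # tiny, but strictly positive
2 * p_upper                                  # two-sided version, if the alternative is "different from 100"
```

Either way the conclusion is unchanged: the p-value is far below 0.10.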
## 90% confidence interval for the mean IQ, using z* = 1.65
lower_b <- (m - 1.65*SE)
upper_b <- (m + 1.65*SE)
lower_b
## [1] 116.4125
upper_b
## [1] 119.9875
A sampling distribution of the mean is the distribution of sample means taken from many different samples of the same size from the population. As the sample size increases, the sampling distribution becomes more nearly normal in shape, its center stays at the population mean, and its spread decreases.
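A small simulation sketch of this behavior. The population here is a hypothetical right-skewed one (exponential with mean 1 and SD 1), chosen only for illustration; the seed, the two sample sizes and the helper name `sample_means` are likewise assumptions.

```r
set.seed(606)                                   # arbitrary seed

# Simulate `reps` sample means, each from a sample of size n
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_n5   <- sample_means(5)
means_n100 <- sample_means(100)

# The center stays near the population mean of 1;
# the spread shrinks roughly like 1 / sqrt(n)
c(center = mean(means_n5),   spread = sd(means_n5))
c(center = mean(means_n100), spread = sd(means_n100))
```

Calling `hist()` on each vector shows the change in shape: the n = 5 means are still visibly right-skewed, while the n = 100 means look close to normal.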
m <- 9000
sd <- 1000
x <- 10500
z <- (x-m)/sd
z
## [1] 1.5
pval <- pnorm(z)
p <- 1-pval
p
## [1] 0.0668072
## Alternate way of solving
1 - pnorm(10500, 9000, 1000)
## [1] 0.0668072
## plot
normalPlot(bounds = c(z,Inf))
## standard error of the mean lifespan of a sample of 15 bulbs
n <- 15
se <- sd / sqrt(n)
se
## [1] 258.1989
z <- (x-m)/se
z
## [1] 5.809475
## probability of more than 10500 hours
1 - pnorm(z)
## [1] 3.133452e-09
x <- 6000:12000
m
## [1] 9000
sd
## [1] 1000
se
## [1] 258.1989
y1 <- dnorm(x, m, sd)
y2 <- dnorm(x, m, se)
plot(x,y1,type="l",col="red")
lines(x,y2,col="blue")
e) No. If the lifespans of the bulbs were not nearly normally distributed, we could not estimate the probabilities in parts a) and c). The normal model would not apply to the z-score for a single bulb in (a), and the sample size of 15 in (c) is too small for the distribution of the sample mean to be approximately normal.
As the sample size increases, the SE decreases, so the z-score, which has the SE in its denominator, grows in magnitude. The test statistic moves farther from the null value, and the p-value, which is the tail area beyond it, decreases.
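A rough numerical sketch of this effect, using the exercise's setup of a p-value of 0.08 at n = 50 and a corrected sample size of n = 500. It assumes a one-sided z-test and that the sample mean and standard deviation are unchanged when the sample size is corrected; neither assumption is stated in the answer above.

```r
p_n50 <- 0.08                                  # p-value obtained with n = 50
z_n50 <- qnorm(p_n50, lower.tail = FALSE)      # z-score implied by a one-sided p-value (assumption)

# Same sample mean and SD: SE shrinks by sqrt(500 / 50), so z grows by that factor
z_n500 <- z_n50 * sqrt(500 / 50)
p_n500 <- pnorm(z_n500, lower.tail = FALSE)

c(z_n50 = z_n50, z_n500 = z_n500)
c(p_n50 = p_n50, p_n500 = p_n500)
```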