https://github.com/asmozo24/DATA606_Homework

library(tidyverse) # load all libraries needed for this assignment
library(openintro)
#head(fastfood)
library(readxl)
library(data.table)
library(readr)
library(plyr)
library(dplyr)
#library(dice)
# #library(VennDiagram)
# #library(help = "dice")
library(DBI)
library(dbplyr)

library(rstudioapi)
library(RJDBC)
library(odbc)
library(RSQLite)
#library(rvest)
library(stringr)
library(readtext)
library(ggpubr)
#library(fitdistrplus)
library(ggplot2)
#library(moments)
library(qualityTools)
library(normalp)
library(utils)
library(MASS)
library(qqplotr)
library(DATA606)
## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.
getLabs()
##  [1] "Lab1"  "Lab2"  "Lab3"  "Lab4"  "Lab5a" "Lab5b" "Lab6"  "Lab7"  "Lab8" 
## [10] "Lab9"
library(StMoSim)

7.8 Heights of adults. (7.7, p.260)

Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, for 507 physically active individuals. The histogram below shows the sample distribution of heights in centimeters.

# (a) What is the point estimate for the average height of active individuals? 171.1 cm. What about the median? 170.3 cm.
summary(bdims$hgt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   147.2   163.8   170.3   171.1   177.8   198.1
# (b) What is the point estimate for the standard deviation of the heights of active individuals? About 9.41 cm (9.407205). What about the IQR? 14 cm.
# IQR = 3rd Qu. - 1st Qu. = 177.8 - 163.8 = 14.0

sd(bdims$hgt)  
## [1] 9.407205
# (c) Is a person who is 1m 80cm (180 cm) tall considered unusually tall? And is a person who is 1m 55cm (155 cm) considered unusually short? Explain your reasoning.
# Neither height is unusual. The z-scores are (180 - 171.1)/9.4 = 0.95 and (155 - 171.1)/9.4 = -1.71, so both heights lie within 2 standard deviations of the mean. The histogram agrees: neither value sits in a tail of the distribution.
x <- c(180, 155)
z <- (x - 171.1)/9.4 # z-scores for 180 cm and 155 cm

# (d) The researchers take another random sample of physically active individuals. Would you expect the mean and the standard deviation of this new sample to be the ones given above? Explain your reasoning.
# No. Point estimates vary from sample to sample, so a new random sample would very likely give a slightly different mean and standard deviation. With samples this large, the new values should still be close to the ones above.

# (e) The sample means obtained are point estimates for the mean height of all active individuals, if the sample of individuals is equivalent to a simple random sample. What measure do we use to quantify the variability of such an estimate? Compute this quantity using the data from the original sample under the condition that the data are a simple random sample.
# We use the standard error of the mean, SE = s/sqrt(n) = 0.4174687

SE = 9.4/sqrt(507)
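
Parts (d) and (e) can be illustrated with a small simulation. This is only a sketch: it assumes, for illustration, that heights follow a normal distribution with mean 171.1 and standard deviation 9.4 (the sample values standing in for unknown population values), and it does not use the `bdims` data. The names `n_rep`, `sample_means`, `se_formula` and `se_simulated` are made up for this example.

```r
# Illustrative simulation; the normal population is an assumption, not the bdims data.
set.seed(606)
n     <- 507     # sample size from the exercise
mu    <- 171.1   # sample mean, used as a stand-in for the population mean
sigma <- 9.4     # sample SD, used as a stand-in for the population SD
n_rep <- 5000    # number of simulated samples (arbitrary choice)

# Mean of each simulated sample of size n
sample_means <- replicate(n_rep, mean(rnorm(n, mean = mu, sd = sigma)))

se_formula   <- sigma / sqrt(n)     # standard error from the formula s/sqrt(n)
se_simulated <- sd(sample_means)    # observed spread of the simulated sample means

print(c(formula = se_formula, simulated = se_simulated))
```

The simulated sample means differ from one another, which is the point of part (d), and their spread should land close to the value given by the formula in part (e).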

Thanksgiving spending, Part I

  1. We are 95% confident that the average spending of these 436 American adults is between $80.31 and $89.11. False. A confidence interval is a statement about the population mean. The mean of these 436 adults is known exactly: it is the center of the interval, $84.71.

  2. This confidence interval is not valid since the distribution of spending in the sample is right skewed. False. The 436 adults were randomly sampled, and with a sample this large the sampling distribution of the mean is approximately normal even though the histogram of spending is right skewed.

  3. 95% of random samples have a sample mean between $80.31 and $89.11. False. The confidence interval describes plausible values for the population mean, not where the means of other samples will fall.

  4. We are 95% confident that the average spending of all American adults is between $80.31 and $89.11. True. This is the correct interpretation: the interval estimates the average spending of the whole population.

  5. A 90% confidence interval would be narrower than the 95% confidence interval since we don’t need to be as sure about our estimate. True. A 90% interval uses a smaller critical value (z* = 1.645 instead of 1.96), so its margin of error z* × SE is smaller and the interval is narrower. The price is a 10% rather than a 5% chance that the interval misses the true mean.

  6. In order to decrease the margin of error of a 95% confidence interval to a third of what it is now, we would need to use a sample 3 times larger. False. The margin of error is z* × (s/sqrt(n)), so it shrinks with the square root of n. Cutting it to a third requires a sample 3^2 = 9 times larger, not 3 times.

  7. The margin of error is 4.4. True. (89.11 - 80.31)/2 = 4.4, which is half the width of the confidence interval.
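
The arithmetic behind statements 5 to 7 can be checked with a short sketch. It works only from the reported interval (80.31, 89.11) and n = 436, and it assumes the interval was built with a normal critical value. The variable names are illustrative.

```r
# Quantities read off the reported 95% interval (80.31, 89.11)
lower <- 80.31
upper <- 89.11
n     <- 436

x_bar <- (lower + upper) / 2     # point estimate: midpoint of the interval
me_95 <- (upper - lower) / 2     # margin of error: half the interval width
se    <- me_95 / qnorm(0.975)    # implied standard error (normal critical value assumed)

# A 90% interval uses a smaller critical value, so it is narrower
me_90 <- qnorm(0.95) * se
ci_90 <- c(x_bar - me_90, x_bar + me_90)

# Margin of error scales with 1/sqrt(n): a third of the margin needs 3^2 times the sample
n_needed <- n * 3^2

print(c(x_bar = x_bar, me_95 = me_95, me_90 = me_90, n_needed = n_needed))
print(ci_90)
```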


Gifted Children, Part I

  1. Are conditions for inference satisfied? Yes.

Random sample: the data were collected from schools in a large city on a random sample of thirty-six children. Sample size: the sample is reasonably large (n ≥ 30). Independence: yes, the children were sampled at random from schools across the city, so they can be assumed to have different backgrounds and to be independent of one another.

  1. Suppose you read online that children first count to 10 successfully when they are 32 months old, on average. Perform a hypothesis test to evaluate if these data provide convincing evidence that the average age at which gifted children first count to 10 successfully is less than the general average of 32 months. Use a significance level of 0.10.

Null hypothesis H0: μ = 32; alternative hypothesis HA: μ < 32. With sample mean 30.69, s = 4.31 and n = 36, the standard error is SE = 4.31/sqrt(36) = 0.7183 and the test statistic is z = (30.69 - 32)/0.7183 = -1.82, which gives a one-sided p-value of about 0.034. Because the p-value is smaller than the significance level of 0.10, we reject the null hypothesis. These data provide convincing evidence that the average age at which gifted children first count to 10 successfully is less than the general average of 32 months.

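  1. Interpret the p-value in context of the hypothesis test and the data. If the true average age at which gifted children first count to 10 were 32 months, the probability of observing a sample mean of 30.69 months or lower in a random sample of 36 children would be about 0.034. A p-value does not show that a hypothesis is true or false; it measures how surprising the data would be if the null hypothesis held.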

  2. Calculate a 90% confidence interval for the average age at which gifted children first count to 10 successfully. sample mean ± z* × (s/sqrt(n)) = 30.69 ± 1.645 × (4.31/sqrt(36)) = (29.50834, 31.87166)

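  3. Do your results from the hypothesis test and the confidence interval agree? Explain. Yes. The 90% confidence interval lies entirely below 32 months, so 32 is not a plausible value for the average age of gifted children. This is consistent with evidence that the average is less than 32 months.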
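
The test statistic, p-value and interval for this exercise can be reproduced from the summary statistics used above (n = 36, mean 30.69, SD 4.31). This is a minimal sketch using a normal (z) approximation, as in the answers; the variable names are illustrative.

```r
# Summary statistics taken from the answers above
n     <- 36
x_bar <- 30.69
s     <- 4.31
mu_0  <- 32      # null value
alpha <- 0.10    # significance level

se      <- s / sqrt(n)            # standard error of the mean
z       <- (x_bar - mu_0) / se    # test statistic
p_value <- pnorm(z)               # lower-tail p-value for HA: mu < 32

z_star <- qnorm(0.95)             # critical value for a 90% interval
ci_90  <- x_bar + c(-1, 1) * z_star * se

print(round(c(z = z, p_value = p_value), 4))
print(round(ci_90, 2))
print(p_value < alpha)            # TRUE means H0 is rejected at the 0.10 level
```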

Gifted Children, Part II

  1. Perform a hypothesis test to evaluate if these data provide convincing evidence that the average IQ of mothers of gifted children is different than the average IQ for the population at large, which is 100. Use a significance level of 0.10. Null hypothesis H0: μ = 100; alternative hypothesis HA: μ ≠ 100 (two-sided, because the question asks whether the average is different). With sample mean 118.2, s = 6.5 and n = 36, the standard error is SE = 6.5/sqrt(36) = 1.083 and the test statistic is z = (118.2 - 100)/1.083 = 16.8, so the p-value is essentially 0. Because the p-value is far below the significance level (alpha) of 0.10, we reject the null hypothesis. The data provide convincing evidence that the average IQ of mothers of gifted children is different than the average IQ for the population at large.

  2. Calculate a 90% confidence interval for the average IQ of mothers of gifted children. sample mean ± z* × (s/sqrt(n)), with z* = 1.645 from the z-table: 118.2 ± 1.645 × (6.5/sqrt(36)) = (116.4179, 119.9821)

  3. Do your results from the hypothesis test and the confidence interval agree? Explain.

Yes. The 90% confidence interval lies entirely above 100, the mean of the population at large, so 100 is not a plausible value for the average IQ of mothers of gifted children. Both the interval and the test provide convincing evidence that this average is different from 100.
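
The two-sided test can be sketched in the same way. The helper `z_test_mean()` below is a hypothetical function written for this example, not part of any package; it uses the summary statistics from the answers above (n = 36, mean 118.2, SD 6.5) and a normal approximation.

```r
# Hypothetical helper: z-test for a single mean from summary statistics.
z_test_mean <- function(x_bar, s, n, mu_0) {
  se <- s / sqrt(n)                 # standard error of the mean
  z  <- (x_bar - mu_0) / se         # test statistic
  list(se = se, z = z, p_two_sided = 2 * pnorm(-abs(z)))
}

res <- z_test_mean(x_bar = 118.2, s = 6.5, n = 36, mu_0 = 100)

# 90% confidence interval using the same standard error
ci_90 <- 118.2 + c(-1, 1) * qnorm(0.95) * res$se

print(res$z)
print(res$p_two_sided)
print(round(ci_90, 2))
```

Doubling the tail area is what makes the test two-sided; for a one-sided alternative the factor of 2 would be dropped.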

30.69 + 1.645* ( 4.31/sqrt(36)) 
## [1] 31.87166

CLT. Define the term “sampling distribution” of the mean, and describe how the shape, center, and spread of the sampling distribution of the mean change as sample size increases.

The sampling distribution of the mean is the probability distribution of the sample means from all possible random samples of a given size drawn from a population. As the sample size increases, the center stays at the population mean, the spread (the standard error, σ/sqrt(n)) decreases, and the shape becomes closer to a normal distribution.
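
A small simulation makes the behaviour of the center and spread concrete. This sketch assumes a right-skewed population (exponential with mean 1 and standard deviation 1), chosen only for illustration; the sample sizes and the number of replications are arbitrary.

```r
set.seed(606)
# Right-skewed population: exponential with mean 1 and SD 1 (an arbitrary choice)
pop_sd <- 1
sizes  <- c(5, 30, 200)   # sample sizes to compare
n_rep  <- 5000            # simulated samples per sample size

# One column of simulated sample means per sample size
sim <- sapply(sizes, function(n) replicate(n_rep, mean(rexp(n, rate = 1))))
colnames(sim) <- paste0("n=", sizes)

results <- data.frame(
  n           = sizes,
  center      = colMeans(sim),         # stays near the population mean of 1
  spread      = apply(sim, 2, sd),     # shrinks as n grows
  theoretical = pop_sd / sqrt(sizes)   # sigma / sqrt(n)
)
print(results)
```

Plotting `hist(sim[, "n=5"])` next to `hist(sim[, "n=200"])` would also show the shape moving from skewed toward symmetric and bell-shaped.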

CFLBs. A manufacturer of compact fluorescent light bulbs advertises that the distribution of the lifespans of these light bulbs is nearly normal with a mean of 9,000 hours and a standard deviation of 1,000 hours.

# (a) What is the probability that a randomly chosen light bulb lasts more than 10,500 hours? 0.0668072, or about 6.68%
# p = 1 - pnorm(z), with z = (10500 - 9000)/1000 = 1.5
p = 1- pnorm((10500 - 9000)/1000) 

normalPlot(mean = 9000, sd = 1000, bounds = c(10500, 15000), tails = FALSE)

# (b) Describe the distribution of the mean lifespan of 15 light bulbs. Because the population is nearly normal, the sample mean is also nearly normal, with mean 9,000 hours and standard error 1000/sqrt(15) = 258.1989 hours.
SE = 1000/sqrt(15)

# (c) What is the probability that the mean lifespan of 15 randomly chosen light bulbs is more than 10,500 hours? 3.133452e-09, which is essentially 0

# prob = 1- pnorm(z)
prob = 1 - pnorm((10500 - 9000)/SE)

# (d) Sketch the two distributions (population and sampling) on the same scale.

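# Simulate draws from the population and from the sampling distribution of the mean (n = 15)
norm0 <- rnorm(100000, mean = 9000, sd = 1000)
norm1 <- rnorm(100000, mean = 9000, sd = SE)
norm <- data.frame(sample = norm0, mean = norm1)

# Overlay both densities on the same x-axis; the narrower curve is the sampling distribution
ggplot(norm) + geom_density(aes(x = sample)) + geom_density(aes(x = mean))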

# (e) Could you estimate the probabilities from parts (a) and (c) if the lifespans of light bulbs had a skewed distribution? No. Part (a) relies on the population itself being nearly normal, and in part (c) a sample of only 15 bulbs is too small for the Central Limit Theorem to make the sampling distribution of the mean approximately normal when the population is skewed.

Same Observation, Different Sample Size

Suppose you conduct a hypothesis test based on a sample where the sample size is n = 50, and arrive at a p-value of 0.08. You then refer back to your notes and discover that you made a careless mistake, the sample size should have been n = 500. Will your p-value increase, decrease, or stay the same? Explain.

Answer: The p-value will decrease. The sample size increases by a factor of 10 (from 50 to 500), so the standard error s/sqrt(n) shrinks by a factor of sqrt(10). For the same observed difference, the test statistic moves farther from 0 and the p-value gets smaller.
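
The size of the change can be sketched under two assumptions that are not stated in the question: the test is a two-sided z-test, and the only mistake was the value of n used in the standard error (same sample mean and standard deviation).

```r
p_old <- 0.08   # p-value computed with the wrong sample size
n_old <- 50
n_new <- 500

# Assumption: two-sided z-test; only n in the standard error was wrong
z_old <- qnorm(1 - p_old / 2)           # |z| implied by the original p-value
z_new <- z_old * sqrt(n_new / n_old)    # SE shrinks by sqrt(10), so |z| grows by sqrt(10)
p_new <- 2 * pnorm(-z_new)              # two-sided p-value with the corrected n

print(c(z_old = z_old, z_new = z_new, p_new = p_new))
```

Under a one-sided test the numbers would differ, but the direction of the change is the same.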