DATA 606 Chapter 6 Assignment

Chapter 6 - Inference for Categorical Data

Practice: 6.5, 6.11, 6.27, 6.43, 6.47

Graded: 6.6, 6.12, 6.20, 6.28, 6.44, 6.48

6.6

(a) False. We are 100% sure that 46% of the people in this sample support the decision; the confidence interval is a statement about the population, not the sample.

(b) True. Based on the sample, we are 95% confident that between 43% and 49% of all Americans support the decision.

(c) False. The interval from 43% to 49% is a 95% confidence interval for the population proportion, computed from the original sample. It is centered on this sample's 46%, not on the true proportion, so it does not tell us where 95% of new sample proportions would fall.

(d) False. A 90% confidence level uses a smaller critical value than a 95% level, so the margin of error would be lower than 3%.
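The claim in (d) can be checked numerically. The sketch below is only an illustration: the sample size `n = 1012` is an assumption that is not stated in this write-up, and `margin_of_error` is a hypothetical helper name.

```r
# Margin of error for a sample proportion at a given confidence level
# (normal approximation).
margin_of_error <- function(p_hat, n, conf_level) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  z * sqrt(p_hat * (1 - p_hat) / n)
}

p_hat <- 0.46  # sample proportion from the exercise
n <- 1012      # ASSUMED sample size, for illustration only

me_95 <- margin_of_error(p_hat, n, 0.95)
me_90 <- margin_of_error(p_hat, n, 0.90)
round(c(me_95 = me_95, me_90 = me_90), 4)
```

Under these assumptions the 90% margin of error comes out smaller than the 95% one, because only the critical value changes.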

6.12

(a) This is a sample statistic: 48% describes the sample of 1,259 respondents, not the population.

(b)

n<-1259
p<-0.48
(z <- -qnorm(0.025)) # critical value for 95% confidence

## [1] 1.959964

(SE<-sqrt(p*(1-p)/n))

## [1] 0.01408022

(ME<-z*SE)

## [1] 0.02759672

upper<-p+ME
lower<-p-ME
round(c(lower,upper),4)

## [1] 0.4524 0.5076

We are 95% confident that the population proportion is between 45.24% and 50.76%.

(c) Two conditions must hold for the sampling distribution of p̂ to be nearly normal. 1) The observations must be independent: assuming a random sample, this holds because n = 1259 is far less than 10% of all US residents. 2) The sample must contain at least 10 successes and 10 failures: np = 1259 × 0.48 ≈ 604 and n(1-p) = 1259 × 0.52 ≈ 655, both well above 10. Since both conditions are met, the normal approximation is reasonable for this sample.
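The success-failure condition in (c) can be verified directly. This is a minimal sketch using the values of n and p from part (b); the variable names are illustrative.

```r
# Success-failure check for the normal approximation to a sample proportion.
n <- 1259
p_hat <- 0.48

successes <- n * p_hat        # number of successes implied by p_hat
failures  <- n * (1 - p_hat)  # number of failures implied by p_hat

c(successes = successes, failures = failures)
all(c(successes, failures) >= 10)  # TRUE when the condition holds
```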

(d) The statement is not justified. The 95% confidence interval runs from 45.24% to 50.76%, so it includes 50% and most of it lies below 50%. We therefore cannot conclude that a majority of the population holds this view.
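The reasoning in (d) amounts to asking whether the whole interval lies above 0.5. A rough sketch follows; `prop_ci` is a hypothetical helper written for this illustration, not a base R function.

```r
# Confidence interval for a proportion using the normal approximation.
prop_ci <- function(p_hat, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  me <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - me, upper = p_hat + me)
}

ci <- prop_ci(0.48, 1259)
round(ci, 4)

# A "majority" claim needs the entire interval to sit above 0.5.
majority_supported <- unname(ci["lower"] > 0.5)
majority_supported
```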

6.20

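We use the formulas for the margin of error (ME) and standard error (SE) to solve for n:

$$ME = z^{*} \cdot SE \;\Rightarrow\; SE = \frac{ME}{z^{*}}$$

$$SE = \sqrt{\frac{p(1-p)}{n}} \;\Rightarrow\; n = \frac{p(1-p)}{SE^{2}}$$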

(z <- -qnorm(.025)) # critical value for 95% confidence

## [1] 1.959964

ME<-0.02
(SE<-ME/z)

## [1] 0.01020427

p<-0.48
(n<-p*(1-p)/SE^2)

## [1] 2397.07

Rounding up, we would need at least 2,398 respondents.
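The same calculation can be wrapped in a small function that rounds up automatically. This is a sketch; `required_n` is a hypothetical helper name, and the p = 0.5 case is included only as the conservative choice when no prior estimate of p is available.

```r
# Smallest sample size that keeps the margin of error at or below `me`.
required_n <- function(p, me, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(p * (1 - p) * (z / me)^2)  # always round up
}

n_planned <- required_n(0.48, 0.02)  # using the earlier estimate p = 0.48
n_worst   <- required_n(0.50, 0.02)  # conservative case, p = 0.5
c(n_planned = n_planned, n_worst = n_worst)
```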

6.28

We are 95% confident that the proportion of sleep-deprived residents in California is between 1.75 percentage points lower and 0.15 percentage points higher than in Oregon. Because this interval contains 0, the data do not show a clear difference between the two states.

p1<-0.08
p2<-0.088
p<-p1-p2
p

## [1] -0.008

n1<-11545
n2<-4691
SE<-sqrt(p1*(1-p1)/n1+p2*(1-p2)/n2)
SE

## [1] 0.004845984

z <- -qnorm(0.025) # critical value for 95% confidence
z

## [1] 1.959964

upper<-p+z*SE
lower<-p-z*SE
round(c(lower,upper),4)

## [1] -0.0175  0.0015
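The steps above can also be collected into one function for the difference of two proportions. This is a sketch under the same normal approximation; `diff_prop_ci` is a hypothetical helper name.

```r
# Confidence interval for p1 - p2 from two independent samples.
diff_prop_ci <- function(p1, n1, p2, n2, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  diff <- p1 - p2
  c(lower = diff - z * se, upper = diff + z * se)
}

# California (p1, n1) minus Oregon (p2, n2), values as used above.
ci_diff <- diff_prop_ci(0.08, 11545, 0.088, 4691)
round(ci_diff, 4)

# An interval that contains 0 is consistent with no difference.
contains_zero <- unname(ci_diff["lower"] < 0 && ci_diff["upper"] > 0)
contains_zero
```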

6.44

(a) H0: Barking deer do not favor a specific habitat; the observed differences in counts reflect natural sampling fluctuation.

Ha: Barking deer prefer some habitats over others.

(b) We can use a chi-square goodness of fit test.

(c) Two conditions must be checked. Independence: we assume each deer's habitat is independent of the others'. Sample size: each habitat needs an expected count of at least 5. The expected counts computed below (20.448, 62.622, 168.696, and 174.234) are all greater than 5, so this condition is also satisfied.

(d) The calculation below gives a very large chi-square statistic (276.61 on 3 degrees of freedom), which results in a p-value that is practically zero. Because the p-value is smaller than 0.05, we reject the null hypothesis: there is evidence that barking deer favor some habitats over others.

pwoods<-0.048
pcult<-0.147
pforest<-0.396
pother<-1-pwoods-pcult-pforest
df<-data.frame("Woods"=c(4,0.048),"Cultivated_grassplot"=c(16,0.147),"Deciduous_forests"=c(67,0.396),"Other"=c(345,1-0.048-0.147-0.396),"Total"=c(426,1))
#we calculate the expected values from each group
df<-rbind(df,c(df$Total[1]*df$Woods[2],df$Total[1]*df$Cultivated_grassplot[2],df$Total[1]*df$Deciduous_forests[2],df$Total[1]*df$Other[2],df$Total[1]))
df

##    Woods Cultivated_grassplot Deciduous_forests   Other Total
## 1  4.000               16.000            67.000 345.000   426
## 2  0.048                0.147             0.396   0.409     1
## 3 20.448               62.622           168.696 174.234   426

#we calculate the Z values for each group
obs<-c(df$Woods[1],df$Cultivated_grassplot[1],df$Deciduous_forests[1],df$Other[1])
expt<-c(df$Woods[3],df$Cultivated_grassplot[3],df$Deciduous_forests[3],df$Other[3])
z<-(obs-expt)/sqrt(expt)
z

## [1] -3.637372 -5.891521 -7.829815 12.937041

chi<-sum(z^2)
chi

## [1] 276.6135

df<-length(obs)-1
p<-1-pchisq(chi,df=df)
p

## [1] 0
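For reference, the goodness of fit calculation can be written as one function. This is a sketch that reuses the observed counts and expected proportions from above; `gof_test` is a hypothetical helper name. It asks `pchisq` for the upper tail directly with `lower.tail = FALSE`, which avoids the rounding to exactly 0 that `1 - pchisq(...)` produces for very small p-values.

```r
# Chi-square goodness of fit from observed and expected counts.
gof_test <- function(observed, expected) {
  statistic <- sum((observed - expected)^2 / expected)
  df <- length(observed) - 1
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)  # upper tail
  list(statistic = statistic, df = df, p_value = p_value)
}

observed <- c(4, 16, 67, 345)                        # counts as used above
expected <- 426 * c(0.048, 0.147, 0.396, 0.409)      # total * habitat share
res <- gof_test(observed, expected)
res$statistic
res$p_value  # extremely small, but not reported as exactly 0
```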

6.48

(a) The chi-square test for two-way tables.

(b) H0: There is no relationship between depression and coffee consumption; that is, the proportion of depressed women is the same at every level of coffee consumption.

Ha: There is a relationship between coffee consumption and depression in women; that is, the proportion of depressed women differs across levels of coffee consumption.

(c)

p_depressedTotal<-2607/50739
p_depressedTotal

## [1] 0.05138059

p_notDepressedTotal<-48132/50739
p_notDepressedTotal

## [1] 0.9486194

two_six_cups<-p_depressedTotal*6617
two_six_cups

## [1] 339.9854

#test statistic for this group
testStat<-(373-two_six_cups)^2/two_six_cups
testStat

## [1] 3.205914

rows<-2
cols<-5
df<-(rows-1)*(cols-1)
df

## [1] 4

pval<-1-pchisq(20.93,df=df)
pval

## [1] 0.0003269507

The calculated p-value is smaller than 0.05, so we reject the null hypothesis and conclude that coffee consumption and depression are associated: the proportion of depressed women differs across coffee consumption groups.
This result shows correlation, not causation, so coffee does not necessarily reduce depression.
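The expected count and test statistic contribution for a single cell, computed step by step earlier, can be sketched as one function. `cell_contribution` is a hypothetical helper name; the totals are the ones used in the calculation above.

```r
# Expected count and chi-square contribution for one cell of a two-way table.
cell_contribution <- function(observed, row_total, col_total, grand_total) {
  expected <- row_total * col_total / grand_total
  c(expected = expected,
    contribution = (observed - expected)^2 / expected)
}

cell <- cell_contribution(observed = 373,
                          row_total = 2607,     # depressed women overall
                          col_total = 6617,     # women in this coffee group
                          grand_total = 50739)  # all women in the table
round(cell, 4)
```

Summing this contribution over every cell of the table gives the overall test statistic.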