Problem 1: A medical researcher conjectures that the likelihood of having wrinkled skin around the eyes increases when a person smokes. The smoking habits as well as the presence of prominent wrinkles around the eyes were recorded for 500 randomly selected people from the population of interest. The following frequency table is obtained:
                Prominent Wrinkles   Wrinkles not prominent
Heavy smoker            95                     55
Light smoker            75                     75
Non-smoker              66                    134
Smoking_Wrinklees = matrix(c(95,55,75,75,66,134), nrow=3, byrow=TRUE, dimnames= list(c("Heavy smoker","Light smoker", "Non-smoker"),c("Prominent Wrinkles","Wrinkles not prominent")))
Smoking_Wrinklees
##              Prominent Wrinkles Wrinkles not prominent
## Heavy smoker                 95                     55
## Light smoker                 75                     75
## Non-smoker                   66                    134
chisq.test(Smoking_Wrinklees, correct = TRUE)
##
## Pearson's Chi-squared test
##
## data: Smoking_Wrinklees
## X-squared = 32.32, df = 2, p-value = 9.59e-08
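Ho: Smoking habits and the presence of prominent wrinkles are independent. Ha: Smoking habits and the presence of prominent wrinkles are associated.
The p-value (9.59e-08) is less than alpha (0.05); therefore, we reject Ho in favor of Ha. We can claim that a person's smoking habits are associated with the presence of prominent wrinkles around the eyes; in other words, the two variables are not independent. Because this is an observational sample, the test shows an association and does not by itself show that smoking causes wrinkles.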
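The statistic reported by chisq.test can be reproduced by hand, which also lets us check that every expected count is large enough for the chi-squared approximation. This is only a sketch; the names observed, expected, x_squared and p_value are introduced here for illustration.

```r
# Rebuild the observed table from the problem statement.
observed <- matrix(c(95, 55, 75, 75, 66, 134), nrow = 3, byrow = TRUE,
                   dimnames = list(c("Heavy smoker", "Light smoker", "Non-smoker"),
                                   c("Prominent Wrinkles", "Wrinkles not prominent")))

# Expected count under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and p-value on (rows - 1) * (cols - 1) degrees of freedom.
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(expected, 1)     # smallest expected count is 70.8, well above 5
round(x_squared, 2)    # 32.32, matching the chisq.test output
```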
round(prop.table(Smoking_Wrinklees, margin = 2)*100,1)
##              Prominent Wrinkles Wrinkles not prominent
## Heavy smoker               40.3                   20.8
## Light smoker               31.8                   28.4
## Non-smoker                 28.0                   50.8
These are column percentages, so each column describes the smoking habits of people with a given wrinkle status. Heavy smokers are the largest group among people with prominent wrinkles (40.3%), while non-smokers are the largest group among people without prominent wrinkles (50.8%).
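Since the researcher's conjecture is about how likely wrinkles are for each smoking habit, row percentages (margin = 1) may be a more direct comparison than column percentages. A short sketch, with observed and row_pct as illustrative names:

```r
# Rebuild the observed table from the problem statement.
observed <- matrix(c(95, 55, 75, 75, 66, 134), nrow = 3, byrow = TRUE,
                   dimnames = list(c("Heavy smoker", "Light smoker", "Non-smoker"),
                                   c("Prominent Wrinkles", "Wrinkles not prominent")))

# margin = 1 makes each row sum to 100%, i.e. percentages within each smoking group.
row_pct <- round(prop.table(observed, margin = 1) * 100, 1)
row_pct
# Prominent wrinkles: heavy 63.3%, light 50.0%, non-smoker 33.0%
```

Read this way, the share of people with prominent wrinkles falls from heavy smokers to light smokers to non-smokers, which is the direction the researcher conjectured.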
Sample_Smoker_count = c(200,150,150)
probvector_smoke= c(9/20, 3/10, 1/4)
chisq.test(Sample_Smoker_count, p= probvector_smoke, correct = FALSE)
##
## Chi-squared test for given probabilities
##
## data: Sample_Smoker_count
## X-squared = 7.7778, df = 2, p-value = 0.02047
Sample_Smoker_count lists the sample counts in the order non-smoker (200), light smoker (150), heavy smoker (150), and probvector_smoke gives the hypothesized proportions in the same order.
Ho: The population of interest is split as 45% non-smokers, 30% light smokers, and 25% heavy smokers. Ha: At least one of these proportions differs from the proposed value.
The p-value (0.02047) is below alpha (0.05), so we reject Ho in favor of Ha. The belief that the population of interest is split as 45% non-smokers, 30% light smokers, and 25% heavy smokers is not supported by the data.
In this sample, the split is actually 40% non-smokers (200/500), 30% light smokers (150/500), and 30% heavy smokers (150/500).
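The sample proportions and the goodness-of-fit statistic can be checked with a few lines of R. This is a sketch only; counts, hypothesized, expected and x_squared are names made up for the illustration.

```r
# Observed counts, in the same order as the hypothesized proportions.
counts <- c("Non-smoker" = 200, "Light smoker" = 150, "Heavy smoker" = 150)
hypothesized <- c(9/20, 3/10, 1/4)

# Observed sample proportions: 0.40, 0.30, 0.30.
counts / sum(counts)

# Expected counts under Ho: 225, 150, 125.
expected <- sum(counts) * hypothesized

# Goodness-of-fit statistic: sum of (observed - expected)^2 / expected.
x_squared <- sum((counts - expected)^2 / expected)
round(x_squared, 4)    # 7.7778, matching the chisq.test output
pchisq(x_squared, df = length(counts) - 1, lower.tail = FALSE)
```

The whole statistic comes from the non-smoker and heavy-smoker categories; the light-smoker count matches its expected value exactly.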
Problem 2: The results after rolling a die 300 times are shown in the next table:
            1's   2's   3's   4's   5's   6's
Frequency    45    52    50    58    55    40
Is there sufficient evidence to conclude that a loaded die was used in this experiment? Use a significance level of 0.05.
Note: A normal (not loaded) die is one with equal probability for all the faces of the die.
Dice_check = matrix(c(45,52,50,58,55,40), nrow=1, byrow=TRUE, dimnames= list(c("Frequency"),c("1's","2's","3's","4's","5's","6's")))
probvector_trueDice = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
Dice_check
##           1's 2's 3's 4's 5's 6's
## Frequency  45  52  50  58  55  40
chisq.test(Dice_check, p=probvector_trueDice, correct = FALSE)
##
## Chi-squared test for given probabilities
##
## data: Dice_check
## X-squared = 4.36, df = 5, p-value = 0.4988
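Ho: All six faces have probability 1/6 (the die is not loaded). Ha: At least one face has a probability different from 1/6.
Since the p-value (0.4988) is greater than alpha (0.05), we fail to reject Ho. There is not sufficient evidence to conclude that a loaded die was used in this experiment. This does not prove that the die is fair; it only means the observed counts are consistent with a fair die.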
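To see how far each face is from its expected count of 50, the Pearson residuals of the test can be inspected. A minimal sketch, where rolls, fit and pearson_res are illustrative names:

```r
# Observed counts for faces 1 to 6 after 300 rolls.
rolls <- c(45, 52, 50, 58, 55, 40)

# Goodness-of-fit test against equal probabilities.
fit <- chisq.test(rolls, p = rep(1/6, 6))

# Pearson residuals: (observed - expected) / sqrt(expected), with expected = 50.
pearson_res <- unname(fit$residuals)
round(pearson_res, 2)

# The squared residuals add up to the test statistic (4.36).
sum(pearson_res^2)
```

The largest residual in absolute value belongs to face 6 (40 observed against 50 expected), and even that one is only about 1.4 in absolute value, which is in line with the non-significant test.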
Problem 3:
Ho: The bursting strength of the new design is the same as that of the old design.
Ha: The bursting strength of the new design is greater than that of the old design.
Old_Design = c(210,212,211,211,190,213,212,211,164,209)
New_Design = c(216,217,162,137,219,216,179,153,152,217)
wilcox.test(Old_Design, New_Design, alternative="greater")
## Warning in wilcox.test.default(Old_Design, New_Design, alternative = "greater"):
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: Old_Design and New_Design
## W = 49, p-value = 0.5453
## alternative hypothesis: true location shift is greater than 0
t.test(New_Design, Old_Design, paired=TRUE, alternative="greater")
##
## Paired t-test
##
## data: New_Design and Old_Design
## t = -1.629, df = 9, p-value = 0.9311
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
## -37.19258 Inf
## sample estimates:
## mean difference
## -17.5
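Neither test gives a significant p-value (both are well above alpha = 0.05), so we fail to reject Ho: there is no evidence that the new bottle design has greater bursting strength than the old design. In fact, the mean difference (new minus old) is -17.5, so the new design performed worse on average in this sample.
Two caveats apply. First, the Wilcoxon call lists Old_Design first, so with alternative="greater" it actually tests whether the old design exceeds the new one. With the arguments swapped to match Ha, the statistic becomes W = 10*10 - 49 = 51, still very close to its null expectation of 50, so the conclusion does not change. Second, the paired t-test is appropriate only if the ten old and ten new measurements are matched pairs; if the two samples are independent, a two-sample test should be used instead.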
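If the two samples are independent (an assumption, since the problem statement is not reproduced here), the tests could be run with New_Design first so that alternative = "greater" matches Ha. The names rank_test and welch_test are introduced only for this sketch.

```r
Old_Design <- c(210, 212, 211, 211, 190, 213, 212, 211, 164, 209)
New_Design <- c(216, 217, 162, 137, 219, 216, 179, 153, 152, 217)

# Assumption: independent samples. New_Design goes first so that
# "greater" tests whether the new design exceeds the old one.
# exact = FALSE requests the normal approximation, since the data contain ties.
rank_test <- wilcox.test(New_Design, Old_Design,
                         alternative = "greater", exact = FALSE)

# Welch two-sample t-test (unequal variances) in the same direction.
welch_test <- t.test(New_Design, Old_Design, alternative = "greater")

rank_test$statistic    # W = 51
c(wilcoxon = rank_test$p.value, welch = welch_test$p.value)
```

Both p-values stay far above 0.05, so the decision not to reject Ho is the same under this setup.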
Problem 4:
For problem 4, you are going to use data from the dataset “airquality”, which is a built-in R dataset (you do not need to install any package to use it).
airquality
It contains daily air quality measurements taken at a site in New York over five months. Explore the “airquality” dataset by calling the str() function.
From this dataset, you are going to focus on two variables:
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island (New York)
Month: The month when the measurement was taken. It extends from May (5) to September (9).
Goal: Find out whether different months lead to statistically different average ozone levels.
You would use a one-way ANOVA test, since we have only one dependent variable (ozone level), and one independent variable with multiple levels (month).
shapiro.test(airquality$Ozone)
##
## Shapiro-Wilk normality test
##
## data: airquality$Ozone
## W = 0.87867, p-value = 2.79e-08
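The p-value is less than alpha (0.05), so we reject the hypothesis of normality: the distribution of the ozone values differs significantly from a normal distribution. Because the normality assumption of the one-way ANOVA is not met, we use the non-parametric Kruskal-Wallis test instead.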
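Strictly speaking, the ANOVA normality assumption concerns the residuals within the groups, not the pooled response. One possible way to check it, shown here only as a sketch with the illustrative names fit and res, is to apply the Shapiro-Wilk test to the residuals of the one-way model:

```r
# Fit the one-way ANOVA model; rows with missing Ozone are dropped by default.
fit <- aov(Ozone ~ factor(Month), data = airquality)

# Residuals are the deviations of each observation from its month mean.
res <- residuals(fit)
length(res)    # 116 days have a recorded Ozone value

# Normality check on the residuals rather than on the raw Ozone values.
shapiro.test(res)
```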
month_asfactor= as.factor(airquality$Month)
kruskal.test(Ozone ~ month_asfactor, data=airquality)
##
## Kruskal-Wallis rank sum test
##
## data: Ozone by month_asfactor
## Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06
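Ho: The distribution of ozone levels is the same in every month. Ha: At least one month has a different distribution of ozone levels.
The p-value is less than alpha (0.05), so we reject Ho. Ozone levels differ significantly between at least two of the months.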
month_asfactor = as.factor(airquality$Month)
mean_ozone_hold = round(tapply(airquality$Ozone, month_asfactor, mean, na.rm=TRUE),2)
mean_ozone_hold
##     5     6     7     8     9
## 23.62 29.44 59.12 59.96 31.45
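The mean ozone levels for July (59.12) and August (59.96) are roughly twice those of May, June, and September. The Kruskal-Wallis test shows that at least one month differs, but it does not identify which ones; a post hoc pairwise comparison would be needed to confirm that July and August are the months that stand out.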
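One possible follow-up is a set of pairwise Wilcoxon rank sum tests with a multiple-comparison adjustment. This sketch uses the Holm adjustment, which is a choice made here and not part of the original assignment; pairwise is an illustrative name.

```r
# Pairwise Wilcoxon rank sum tests between months, Holm-adjusted.
# exact = FALSE is passed on to wilcox.test because the data contain ties.
pairwise <- pairwise.wilcox.test(airquality$Ozone, airquality$Month,
                                 p.adjust.method = "holm", exact = FALSE)

# Lower-triangular matrix of adjusted p-values (rows 6 to 9, columns 5 to 8).
round(pairwise$p.value, 4)
```

Month pairs with an adjusted p-value below 0.05 are the ones whose ozone levels differ significantly.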