R Notebook

The R code you used for each exercise part.
The R output you got after running each line of code.
Your written answers (if needed). If a question asks you to justify/explain, your notebook must include a text section with the justification/explanation.

Problem 1: A medical researcher conjectures that the likelihood of having wrinkled skin around the eyes increases when a person smokes. The smoking habits as well as the presence of prominent wrinkles around the eyes were recorded for 500 randomly selected people from the population of interest. The following frequency table is obtained:

               Prominent Wrinkles   Wrinkles not prominent
Heavy smoker                   95                       55
Light smoker                   75                       75
Non-smoker                     66                      134

# Rows: smoking habit; columns: wrinkle status (counts from the frequency table)
Smoking_Wrinklees = matrix(c(95,55,75,75,66,134), nrow=3, byrow=TRUE,
  dimnames = list(c("Heavy smoker","Light smoker","Non-smoker"),
                  c("Prominent Wrinkles","Wrinkles not prominent")))

Smoking_Wrinklees

##              Prominent Wrinkles Wrinkles not prominent
## Heavy smoker                 95                     55
## Light smoker                 75                     75
## Non-smoker                   66                    134

Conduct a test to find out if someone’s smoking habits are associated with the presence of skin wrinkles. Use α=0.05.

chisq.test(Smoking_Wrinklees, correct = TRUE)

## 
##  Pearson's Chi-squared test
## 
## data:  Smoking_Wrinklees
## X-squared = 32.32, df = 2, p-value = 9.59e-08

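Ho: Smoking habits and the presence of prominent wrinkles are independent. Ha: Smoking habits and the presence of prominent wrinkles are associated.

The p-value (9.59e-08) is less than α = 0.05, so we reject Ho in favor of Ha. The data provide evidence that someone’s smoking habits are associated with the presence of skin wrinkles; in other words, wrinkle status depends on smoking category. Because this is an observational sample, the test shows an association, not that smoking causes wrinkles.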
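The chi-squared approximation is usually considered reliable when every expected count is at least 5. The following is a minimal sketch of that check; the name `wrinkle_test` is introduced here for illustration and is not part of the original notebook.

```r
# Rebuild the table so this block runs on its own
Smoking_Wrinklees = matrix(c(95,55,75,75,66,134), nrow=3, byrow=TRUE,
  dimnames = list(c("Heavy smoker","Light smoker","Non-smoker"),
                  c("Prominent Wrinkles","Wrinkles not prominent")))

# wrinkle_test is a hypothetical name used only in this sketch
wrinkle_test = chisq.test(Smoking_Wrinklees)

# Expected counts under independence: (row total * column total) / grand total
wrinkle_test$expected

# TRUE when every expected count meets the "at least 5" rule of thumb
all(wrinkle_test$expected >= 5)
```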

Conduct a deeper analysis to know which smoking category is associated with prominent wrinkles, and which one is linked to non-prominent wrinkles.

round(prop.table(Smoking_Wrinklees, margin = 2)*100,1)

##              Prominent Wrinkles Wrinkles not prominent
## Heavy smoker               40.3                   20.8
## Light smoker               31.8                   28.4
## Non-smoker                 28.0                   50.8

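The column percentages show that heavy smokers make up the largest share of people with prominent wrinkles (40.3%), while non-smokers make up the largest share of people without prominent wrinkles (50.8%). So heavy smoking is the category most associated with prominent wrinkles, and non-smoking is the category most associated with wrinkles not being prominent.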
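Column percentages are influenced by how many people fall in each smoking group (there are 200 non-smokers but only 150 in each of the other groups). As a complementary sketch, row percentages and standardized residuals compare the groups on an equal footing. The names `row_pct` and `std_res` are introduced here for illustration only.

```r
# Rebuild the table so this block runs on its own
Smoking_Wrinklees = matrix(c(95,55,75,75,66,134), nrow=3, byrow=TRUE,
  dimnames = list(c("Heavy smoker","Light smoker","Non-smoker"),
                  c("Prominent Wrinkles","Wrinkles not prominent")))

# Row percentages: within each smoking group, the share with each wrinkle status
row_pct = round(prop.table(Smoking_Wrinklees, margin = 1)*100, 1)
row_pct

# Standardized residuals: cells with |value| > 2 deviate notably from independence;
# positive means more observations than expected, negative means fewer
std_res = round(chisq.test(Smoking_Wrinklees)$stdres, 2)
std_res
```

A large positive residual in a cell indicates that the category is over-represented there relative to what independence would predict.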

Researchers believe that, in the population of interest, 45% of the people are non-smokers, 30% are light smokers, and the rest are heavy smokers. Does the data support their belief at α=0.05?

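# Row totals of the table, ordered: non-smoker, light smoker, heavy smoker
Sample_Smoker_count = c(200,150,150)
# Hypothesized proportions in the same order: 45%, 30%, 25%
probvector_smoke= c(9/20, 3/10, 1/4)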

chisq.test(Sample_Smoker_count, p= probvector_smoke, correct = FALSE)

## 
##  Chi-squared test for given probabilities
## 
## data:  Sample_Smoker_count
## X-squared = 7.7778, df = 2, p-value = 0.02047

Ho: The population of interest is split as 45% non-smokers, 30% light smokers, and 25% heavy smokers. Ha: At least one of these proportions is different from what was proposed.

The p-value (0.02047) is below α = 0.05, so we reject Ho in favor of Ha. The researchers' belief about the split of the population is not supported by the data.

In the sample, the actual split is 40% non-smokers (200/500), 30% light smokers (150/500), and 30% heavy smokers (150/500).
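The sample proportions and the test statistic can be reproduced by hand. This is a short sketch; the names `n_total`, `observed_prop`, `expected_count`, and `chi_sq_stat` are introduced here for illustration.

```r
# Counts ordered: non-smoker, light smoker, heavy smoker
Sample_Smoker_count = c(200,150,150)
probvector_smoke = c(9/20, 3/10, 1/4)

n_total = sum(Sample_Smoker_count)              # 500 people in the sample

# Observed share of each smoking category in the sample
observed_prop = Sample_Smoker_count / n_total
observed_prop

# Counts we would expect if the researchers' proportions were true
expected_count = n_total * probvector_smoke
expected_count

# Goodness-of-fit statistic: sum of (observed - expected)^2 / expected
chi_sq_stat = sum((Sample_Smoker_count - expected_count)^2 / expected_count)
chi_sq_stat
```

The statistic computed this way should agree with the X-squared value reported by chisq.test().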

Problem 2: The results after rolling a die 300 times are shown in the next table:

            1’s   2’s   3’s   4’s   5’s   6’s
Frequency    45    52    50    58    55    40

Is there sufficient evidence to conclude that a loaded die was used in this experiment? Use a significance level of 0.05.

Note: A normal (not loaded) die is one with equal probability for all the faces of the die.

Dice_check = matrix(c(45,52,50,58,55,40), nrow=1, byrow=TRUE, dimnames= list(c("Frequency"),c("1's","2's","3's","4's","5's","6's")))
probvector_trueDice = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

Dice_check

##           1's 2's 3's 4's 5's 6's
## Frequency  45  52  50  58  55  40

chisq.test(Dice_check, p=probvector_trueDice, correct = FALSE)

## 
##  Chi-squared test for given probabilities
## 
## data:  Dice_check
## X-squared = 4.36, df = 5, p-value = 0.4988

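Ho: The die is fair (each face has probability 1/6). Ha: The die is loaded (at least one face has a different probability).

Since the p-value (0.4988) is greater than α = 0.05, we fail to reject Ho. There is not sufficient evidence to conclude that a loaded die was used in this experiment.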
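The same decision can be reached by comparing the test statistic with the chi-squared critical value for 5 degrees of freedom. This is a sketch; the names `die_counts`, `expected_per_face`, `chi_sq_die`, and `critical_value` are introduced here for illustration.

```r
# Observed frequencies for faces 1 through 6
die_counts = c(45,52,50,58,55,40)

# Under a fair die, each face is expected 300/6 = 50 times
expected_per_face = sum(die_counts) / length(die_counts)

# Goodness-of-fit statistic
chi_sq_die = sum((die_counts - expected_per_face)^2 / expected_per_face)
chi_sq_die

# Critical value at alpha = 0.05 with df = 6 - 1 = 5
critical_value = qchisq(0.95, df = length(die_counts) - 1)
critical_value

# Reject Ho only if the statistic exceeds the critical value
chi_sq_die > critical_value
```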

Problem 3:

Test the claim that the bursting strength for the new design is greater than the bursting strength for the old design. Use the appropriate non-parametric test.

Ho: The bursting strength of the new design is the same as that of the old design. Ha: The bursting strength of the new design is greater than that of the old design.

Old_Design = c(210,212,211,211,190,213,212,211,164,209)
New_Design = c(216,217,162,137,219,216,179,153,152,217)

# New_Design goes first so that alternative="greater" tests New > Old, matching Ha
wilcox.test(New_Design, Old_Design, alternative="greater")

## Warning in wilcox.test.default(New_Design, Old_Design, alternative = "greater"):
## cannot compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  New_Design and Old_Design
## W = 51, p-value = 0.4849
## alternative hypothesis: true location shift is greater than 0

Repeat the analysis by conducting the analogous parametric test (assuming all the conditions required for the validity of this test are satisfied). Did both tests lead you to the same conclusion?

t.test(New_Design, Old_Design, paired=TRUE, alternative="greater")

## 
##  Paired t-test
## 
## data:  New_Design and Old_Design
## t = -1.629, df = 9, p-value = 0.9311
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  -37.19258       Inf
## sample estimates:
## mean difference 
##           -17.5

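Both tests lead to the same conclusion: neither p-value is below α = 0.05 (the paired t-test gives p = 0.9311), so we fail to reject Ho. There is not sufficient evidence that the new bottle design has a greater bursting strength than the old design. In fact, the sample mean difference (new minus old) is -17.5, which points in the opposite direction.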
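The rank sum test treats the two samples as independent, while the paired t-test treats them as matched pairs. The problem statement reproduced here does not say which design applies, so the sketch below shows the matching counterpart for each case. Whether the samples are paired is an assumption to check against the original problem, and the names `t_indep` and `w_paired` are introduced here for illustration.

```r
Old_Design = c(210,212,211,211,190,213,212,211,164,209)
New_Design = c(216,217,162,137,219,216,179,153,152,217)

# If the samples are independent: the two-sample (Welch) t-test is the
# parametric counterpart of the Wilcoxon rank sum test
t_indep = t.test(New_Design, Old_Design, alternative = "greater")
t_indep$p.value

# If the observations are paired: the Wilcoxon signed-rank test is the
# non-parametric counterpart of the paired t-test
w_paired = wilcox.test(New_Design, Old_Design, paired = TRUE,
                       alternative = "greater")
w_paired$p.value
```

Under either assumption the p-value is above 0.05, so the conclusion about the new design does not change.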

Problem 4:

For problem 4, you are going to use data from the dataset “airquality”, which is a built-in R dataset (you do not need to install any package to use it).

airquality

It contains daily measures of airquality in a place in New York (for five months). Explore the “airquality” dataset by calling the str() function.
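A minimal sketch of that exploration, using only base R. The extra checks on missing values and days per month are additions made here, not part of the original prompt.

```r
# Structure of the built-in dataset: variable names, types, and first values
str(airquality)

# Ozone contains missing values, which matters for mean() and for the tests
sum(is.na(airquality$Ozone))

# Number of daily records in each month (5 = May, ..., 9 = September)
table(airquality$Month)
```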

From this dataset, you are going to focus on two variables:

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island (New York)

Month: The month when the measurement was taken. It extends from May (5) to September (9).

Goal: Find out whether different months lead to statistically different average ozone levels.

What parametric test would you apply to answer this question? JUSTIFY (Do NOT apply the test, just mention it and justify)

You would use a one-way ANOVA. There is one quantitative response variable (ozone level) and one categorical explanatory variable (month) with more than two levels (five months), and the goal is to compare the mean ozone level across those groups.

Because most parametric methods rely on the assumption of Normal distribution, you want to check that the outcome variable in this problem is indeed well described by this distribution. Check whether ozone level is well described by a Normal distribution?

shapiro.test(airquality$Ozone)

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality$Ozone
## W = 0.87867, p-value = 2.79e-08

Ho: Ozone level follows a Normal distribution. The p-value (2.79e-08) is less than α = 0.05, so we reject Ho. The distribution of ozone is significantly different from a Normal distribution, so the Normality assumption is not supported.

You decided to conclude that the Normality assumption for ozone is suspect. Therefore, you chose to conduct a nonparametric test to answer your goal. Conduct the relevant nonparametric test at a significance level of 0.05. To conduct this test, use Month as a factor rather than as a numeric variable. Use the following script to make this conversion:

month_asfactor= as.factor(Month)

Now, apply the non-parametric test using month_asfactor rather than Month.

month_asfactor= as.factor(airquality$Month)

kruskal.test(Ozone ~ month_asfactor, data=airquality)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Ozone by month_asfactor
## Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06

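Ho: The distribution of ozone is the same in all five months. Ha: At least one month has a different ozone distribution.

The p-value (6.901e-06) is less than α = 0.05, so we reject Ho. Ozone levels differ significantly between at least two of the months.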

During which month (or months) does the average ozone seem to be statistically higher? Show the work that led you to select your answer.

month_asfactor = as.factor(airquality$Month)

mean_ozone_hold = round(tapply(airquality$Ozone, month_asfactor, mean, na.rm=TRUE),2)
mean_ozone_hold

##     5     6     7     8     9 
## 23.62 29.44 59.12 59.96 31.45

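Average ozone is highest in July (59.12) and August (59.96), well above the means for May (23.62), June (29.44), and September (31.45). Combined with the significant Kruskal-Wallis result, this suggests that July and August are the months with higher ozone levels, although the group means alone do not show which pairwise differences are statistically significant.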
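One way to check which months differ is a post hoc pairwise comparison. The sketch below uses pairwise Wilcoxon rank sum tests with a Holm adjustment; this choice of follow-up test is an assumption made here, not something the assignment specifies, and the name `pairwise_result` is introduced for illustration.

```r
month_asfactor = as.factor(airquality$Month)

# Pairwise Wilcoxon rank sum tests between months, with Holm-adjusted p-values.
# exact = FALSE requests the normal approximation, since ozone has tied values.
pairwise_result = pairwise.wilcox.test(airquality$Ozone, month_asfactor,
                                       p.adjust.method = "holm", exact = FALSE)
pairwise_result

# Median ozone per month, a robust companion to the means
tapply(airquality$Ozone, month_asfactor, median, na.rm = TRUE)
```

Each entry of the printed matrix is the adjusted p-value for one pair of months; pairs with values below 0.05 differ significantly.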