Assignment 2

Question 1: (30 points) A medical researcher conjectures that the likelihood of having wrinkled skin around the eyes increases when a person smokes. The smoking habits as well as the presence of prominent wrinkles around the eyes were recorded for 500 randomly selected people from the population of interest. The following frequency table is obtained:

# Build the contingency table of observed counts
data_matrix <- matrix(c(95, 55, 75, 75, 66, 134), nrow = 3, byrow = TRUE)

# Label the rows (smoking habit) and columns (wrinkle status)
row_names <- c("Heavy Smoker", "Light Smoker", "Non-smoker")
col_names <- c("Prominent Wrinkles", "Wrinkles Not Prominent")
colnames(data_matrix) <- col_names
rownames(data_matrix) <- row_names

# Display the table
data_matrix

##              Prominent Wrinkles Wrinkles Not Prominent
## Heavy Smoker                 95                     55
## Light Smoker                 75                     75
## Non-smoker                   66                    134

Conduct a test to find out if someone’s smoking habits are associated with the presence of skin wrinkles. Use alpha = 0.05. You must follow these steps to conduct this test:

# Perform the chi-squared test of independence
result <- chisq.test(data_matrix)

# Print the test result
result

## 
##  Pearson's Chi-squared test
## 
## data:  data_matrix
## X-squared = 32.32, df = 2, p-value = 9.59e-08

a1) State the hypotheses (Ho and Ha).

#Null Hypothesis (Ho): There is no association between someone’s smoking habits and the presence of skin wrinkles.

#Alternative Hypothesis (Ha): There is an association between someone’s smoking habits and the presence of skin wrinkles.

a2) Whether you reject or fail to reject Ho and why.

#Reject the null hypothesis (Ho) because the p-value (9.59e-08) is smaller than the significance level (0.05). The test statistic is X-squared = 32.32 with 2 degrees of freedom.
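
The statistic reported by chisq.test() can be reproduced by hand, which also makes it easy to check that every expected count is at least 5 (a common rule of thumb for the chi-squared approximation). This is a minimal sketch that rebuilds the same table; the names `obs`, `exp_counts`, `x2`, `dof`, and `p_val` are illustrative and not part of the original assignment.

```r
# Rebuild the observed table (same counts as above)
obs <- matrix(c(95, 55, 75, 75, 66, 134), nrow = 3, byrow = TRUE)

# Expected count for each cell = (row total * column total) / grand total
exp_counts <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic and its degrees of freedom
x2 <- sum((obs - exp_counts)^2 / exp_counts)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail p-value from the chi-squared distribution
p_val <- pchisq(x2, df = dof, lower.tail = FALSE)

round(x2, 2)      # 32.32, matching the chisq.test() output
min(exp_counts)   # smallest expected count (should be at least 5)
```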

a3) Your conclusion (i.e., whether smoking is associated with having wrinkles).

#At the 0.05 significance level, the data suggest that smoking habits are associated with the presence of prominent skin wrinkles.

Conduct a deeper analysis to know which smoking category is associated with prominent wrinkles, and which one is linked to non-prominent wrinkles. Show your work and discuss your results.

# Perform the chi-squared test for independence
chi_square_result <- chisq.test(data_matrix)

# Store the expected frequencies under independence
expected <- chi_square_result$expected

# Perform pairwise comparisons of proportions with pairwise.prop.test
# (column 1 is treated as successes, column 2 as failures; p-values are BH-adjusted)
pairwise_results <- pairwise.prop.test(data_matrix, p.adjust.method = "BH")

# Print the pairwise comparison results
pairwise_results

## 
##  Pairwise comparisons using Pairwise comparison of proportions 
## 
## data:  data_matrix 
## 
##              Heavy Smoker Light Smoker
## Light Smoker 0.0268       -           
## Non-smoker   9.8e-08      0.0029      
## 
## P value adjustment method: BH

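#Results: After the BH adjustment, all three pairwise comparisons are significant at the 0.05 level: heavy vs. light smokers (p = 0.0268), heavy smokers vs. non-smokers (p = 9.8e-08), and light smokers vs. non-smokers (p = 0.0029). The proportion with prominent wrinkles rises with smoking intensity: 33% of non-smokers (66/200), 50% of light smokers (75/150), and about 63% of heavy smokers (95/150). Heavy smoking is therefore the category most strongly linked to prominent wrinkles, and non-smoking the category most strongly linked to non-prominent wrinkles.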
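
Another common way to see which cells drive the association is to look at row proportions and standardized residuals from the chi-squared test. The sketch below is one possible approach, not necessarily the method the assignment expects; `obs`, `row_props`, and `std_res` are illustrative names. Cells with a standardized residual beyond roughly +/-2 depart noticeably from what independence would predict.

```r
# Rebuild the labelled contingency table
obs <- matrix(c(95, 55, 75, 75, 66, 134), nrow = 3, byrow = TRUE,
              dimnames = list(c("Heavy Smoker", "Light Smoker", "Non-smoker"),
                              c("Prominent Wrinkles", "Wrinkles Not Prominent")))

# Share of each smoking group with and without prominent wrinkles
row_props <- prop.table(obs, margin = 1)
round(row_props, 3)

# Standardized residuals: positive means more observations than expected
# under independence, negative means fewer
std_res <- chisq.test(obs)$stdres
round(std_res, 2)
```

A large positive residual in the "Prominent Wrinkles" column marks a group that has more prominent wrinkles than independence predicts, and a large negative one marks a group that has fewer.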

Question 2: (40 points) A researcher wants to compare the average anxiety levels of people living in Alaska and Hawaii. The researcher does not have any specific hypothesis in mind in terms of which state could have higher mean anxiety levels. She collected data on anxiety scores for two samples of randomly selected residents from both states. Each resident was given a score between 0 and 100 (higher scores mean more anxiety).

The anxiety scores she collected for each state are shown next. Create two vectors in R with these data (one vector for Alaska scores and another one for Hawaii scores).

Alaska scores: 69 76 64 65 67 77 56 67 62 82 56 77 71 68 76 69 64 66 83 77 75 79 71 75 86 67 70 73 77 71 78 64 62 58 67

Hawaii scores: 64 76 74 74 73 71 75 63 67 77 74 67 69 70 64 72 72 72 74 76 67 69 80 73 68 77 71 73 69 68 71 71 73 75 71

Assume that the variables involved in this problem follow a normal distribution. Also assume their variances can be safely considered to be the same (in other words, you do NOT need to do the test to compare two variances here; assume that the variances are equal).

# Create vectors for Alaska and Hawaii scores
alaska_scores <- c(69, 76, 64, 65, 67, 77, 56, 67, 62, 82, 56, 77, 71, 68, 76, 69, 64, 66, 83, 77, 75, 79, 71, 75, 86, 67, 70, 73, 77, 71, 78, 64, 62, 58, 67)
hawaii_scores <- c(64, 76, 74, 74, 73, 71, 75, 63, 67, 77, 74, 67, 69, 70, 64, 72, 72, 72, 74, 76, 67, 69, 80, 73, 68, 77, 71, 73, 69, 68, 71, 71, 73, 75, 71)

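# Perform the two-sample t-test
# Note: t.test() defaults to Welch's test (var.equal = FALSE); the pooled test
# that matches the equal-variance assumption requires var.equal = TRUE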
t_test_result <- t.test(alaska_scores, hawaii_scores)

# Print the test result
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  alaska_scores and hawaii_scores
## t = -0.70165, df = 51.503, p-value = 0.4861
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.860542  1.860542
## sample estimates:
## mean of x mean of y 
##  70.42857  71.42857
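
Because the problem states that the variances may be assumed equal, the pooled-variance (Student) version of the test is the one that matches the assumptions. A minimal sketch follows; `pooled_result` is an illustrative name. With equal sample sizes (35 per state) the t statistic is identical to Welch's, and only the degrees of freedom change (n1 + n2 - 2 = 68), so the p-value and interval shift slightly while the decision stays the same.

```r
alaska_scores <- c(69, 76, 64, 65, 67, 77, 56, 67, 62, 82, 56, 77, 71, 68, 76, 69, 64, 66, 83, 77, 75, 79, 71, 75, 86, 67, 70, 73, 77, 71, 78, 64, 62, 58, 67)
hawaii_scores <- c(64, 76, 74, 74, 73, 71, 75, 63, 67, 77, 74, 67, 69, 70, 64, 72, 72, 72, 74, 76, 67, 69, 80, 73, 68, 77, 71, 73, 69, 68, 71, 71, 73, 75, 71)

# Pooled-variance two-sample t-test (equal variances assumed)
pooled_result <- t.test(alaska_scores, hawaii_scores, var.equal = TRUE)
pooled_result

# Degrees of freedom for the pooled test: n1 + n2 - 2
pooled_result$parameter
```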

State the hypotheses (Ho and Ha) that this researcher should set up to conduct the test that will allow her to make the desired comparison.

#Ho: The mean anxiety level is the same in Alaska and Hawaii (mu_Alaska = mu_Hawaii).

#Ha: The mean anxiety level differs between Alaska and Hawaii (mu_Alaska is not equal to mu_Hawaii).

Make a decision using a significance level of 0.05. Justify your decision.

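#We fail to reject Ho since the p-value (0.4861) is greater than the significance level (0.05). The data do not provide sufficient evidence that the mean anxiety levels differ between the two states.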

Obtain a 95% confidence interval for the difference between the mean anxiety level in Alaska and the mean anxiety level in Hawaii. Does the interval lead you to the same conclusion you reached from the hypothesis test? Justify.

# Calculate the 95% confidence interval
conf_interval <- t.test(alaska_scores, hawaii_scores)$conf.int

# Print the confidence interval
conf_interval

## [1] -3.860542  1.860542
## attr(,"conf.level")
## [1] 0.95

#Yes, the interval leads to the same conclusion. The 95% confidence interval (-3.86, 1.86) contains zero, so a difference of zero between the mean anxiety levels in Alaska and Hawaii is plausible. This agrees with failing to reject Ho at the 0.05 level.
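
The link between the interval and the test can be made explicit by building the interval by hand. The sketch below uses the pooled standard error (the equal-variance assumption stated in the problem), so its endpoints will differ slightly from the Welch interval printed above; the names `sp2`, `se_diff`, `t_crit`, and `ci` are illustrative.

```r
alaska_scores <- c(69, 76, 64, 65, 67, 77, 56, 67, 62, 82, 56, 77, 71, 68, 76, 69, 64, 66, 83, 77, 75, 79, 71, 75, 86, 67, 70, 73, 77, 71, 78, 64, 62, 58, 67)
hawaii_scores <- c(64, 76, 74, 74, 73, 71, 75, 63, 67, 77, 74, 67, 69, 70, 64, 72, 72, 72, 74, 76, 67, 69, 80, 73, 68, 77, 71, 73, 69, 68, 71, 71, 73, 75, 71)

n1 <- length(alaska_scores)
n2 <- length(hawaii_scores)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(alaska_scores) + (n2 - 1) * var(hawaii_scores)) / (n1 + n2 - 2)

# Standard error of the difference in means
se_diff <- sqrt(sp2 * (1 / n1 + 1 / n2))

# 95% interval: difference in means +/- t critical value * standard error
diff_means <- mean(alaska_scores) - mean(hawaii_scores)
t_crit <- qt(0.975, df = n1 + n2 - 2)
ci <- diff_means + c(-1, 1) * t_crit * se_diff
ci

# The two-sided test rejects Ho at 0.05 exactly when 0 falls outside the interval
ci[1] > 0 | ci[2] < 0
```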

What kind of hypothesis test was conducted by these authors? A test to compare two means? A test to compare two variances? A chi-square test to test for independence? Choose the correct option among these three and justify.

#The authors conducted a test to compare two means. Their statement implies that they compared the average sleep time of two groups, one receiving melatonin supplementation and the other a placebo. The reported 36-minute increase is a difference in mean sleep time, and the P-value (0.046) assesses whether that difference is statistically significant.

State the hypotheses (Ho and Ha) that the authors were testing in this case. You need to state both Ho and Ha.

#Ho: Melatonin supplementation has no effect on mean sleep time (the mean sleep time is the same with melatonin and with placebo).

#Ha: Melatonin supplementation has an effect on mean sleep time (the mean sleep time differs between melatonin and placebo).

Did the authors find evidence to support the alternative hypothesis? Justify.

#Yes. The P-value (0.046) is below the conventional 0.05 significance level, so the result is statistically significant and supports the alternative hypothesis that melatonin supplementation lengthens average sleep time compared to a placebo.

Question 4: (15 points) The results after rolling a die 300 times are shown in the next table:

Is there sufficient evidence to conclude that a loaded die was used in this experiment? Use a significance level of 0.05. Note: A normal (not loaded) die is one with equal probability for all the faces of the die.

# Observed frequencies
observed <- c(45, 52, 50, 58, 55, 40)

# Define the expected probabilities for a fair die (1/6 for each face)
expected_probabilities <- rep(1/6, 6)

# Calculate the expected frequencies by multiplying the probabilities by the total number of rolls
total_rolls <- sum(observed)
expected <- expected_probabilities * total_rolls

# Perform the chi-squared test
chi_square_result <- chisq.test(observed, p = expected_probabilities)

# Print the test result
chi_square_result

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 4.36, df = 5, p-value = 0.4988

#We fail to reject the null hypothesis (Ho: the die is fair, with probability 1/6 for each face) at a significance level of 0.05, because the p-value (0.4988) is greater than 0.05. There is not sufficient evidence to conclude that a loaded die was used in this experiment. This does not prove that the die is fair; it only means the observed counts are consistent with a fair die.
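
The goodness-of-fit statistic can also be checked by hand and compared with the critical value instead of the p-value. A minimal sketch, using the same observed counts; `expected_counts`, `x2`, and `crit` are illustrative names.

```r
# Observed counts for faces 1 to 6
observed <- c(45, 52, 50, 58, 55, 40)

# Under a fair die, each face is expected in 1/6 of the rolls
expected_counts <- rep(sum(observed) / 6, 6)

# Chi-squared goodness-of-fit statistic
x2 <- sum((observed - expected_counts)^2 / expected_counts)

# Critical value at alpha = 0.05 with (6 - 1) = 5 degrees of freedom
crit <- qchisq(0.95, df = length(observed) - 1)

x2          # 4.36, matching the chisq.test() output
x2 > crit   # FALSE, so Ho is not rejected
```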