Data Dive - Hypothesis Testing

options(repos = c(CRAN = "https://cran.r-project.org/"))

ASSIGNMENT 7:

install.packages('pwr')

## 
## The downloaded binary packages are in
##  /var/folders/1t/lvl69_w12vj1sz_yxkxrvt7w0000gn/T//RtmpSwcuDJ/downloaded_packages

library(pwr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Obesity <- read.csv('/Users/ankit/Downloads/Obesity.csv')

Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:

Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value.
Determine if you have enough data to perform a Neyman-Pearson hypothesis test. If you do, perform one and interpret results. If not, explain why.
Perform a Fisher’s style test for significance, and interpret the p-value.

Alpha level: This is the threshold for statistical significance. It is the probability of making a Type 1 error, that is, rejecting the null hypothesis even though it is true.

In my case, since I am working with health-related data, I want the risk of making a Type 1 error to be small. That’s why I am setting the alpha (significance) level to 0.01.

alpha <- 0.01

Power level: It is the probability of correctly rejecting a false null hypothesis, i.e. of avoiding a Type 2 error (power = 1 - beta, where beta is the Type 2 error rate). I’ll choose it to be 0.80, which means I want an 80% chance of detecting a true effect.

power <- 0.8

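Minimum effect size: This is the smallest difference or relationship that we consider practically meaningful. Read as Cohen's w for a chi-squared test, the value of 0.2 used here lies between the conventional small (0.1) and medium (0.3) effects.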

minimum_effect_size <- 0.2
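These three values together determine how many observations a test needs. As a minimal sketch, assuming the effect size is measured as Cohen's w and the test is a chi-squared test with `df` degrees of freedom, the required sample size can be solved from the noncentral chi-squared distribution in base R. The helper name `required_n_chisq` is hypothetical and not part of any package; `pwr.chisq.test()` from the pwr package should give a comparable answer for the same inputs.

```r
# Hypothetical helper (not from the original analysis): smallest N at which a
# chi-squared test with `df` degrees of freedom reaches the target power for
# an effect of size w (Cohen's w) at significance level alpha.
required_n_chisq <- function(w, df, alpha, power) {
  crit <- qchisq(1 - alpha, df)  # rejection threshold under the null
  # power for sample size n: tail area of the noncentral chi-squared
  power_at <- function(n) pchisq(crit, df, ncp = n * w^2, lower.tail = FALSE)
  # solve power_at(n) = power for n, then round up to a whole observation
  n <- uniroot(function(n) power_at(n) - power, interval = c(2, 1e5))$root
  ceiling(n)
}

# alpha, power and w as chosen in the text; df = 2 is only an illustrative value
n_needed <- required_n_chisq(w = 0.2, df = 2, alpha = 0.01, power = 0.8)
print(n_needed)
```

The result would then be compared with the number of observations in the table being tested, for example `sum(contingency_table)`, with `df` set to (rows - 1) * (columns - 1) of that table. If the data has at least that many observations, there is enough data for a Neyman-Pearson style decision.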

PERFORMING NEYMAN-PEARSON TEST: Both columns used for this hypothesis, the weight category and the mode of transportation, are categorical, so a Neyman-Pearson test built on the mean of a continuous variable (for example a t-test) cannot be applied to them. Note that the Neyman-Pearson framework itself does not require continuous data; it requires fixing alpha, power, and a minimum effect size in advance and having a large enough sample. The continuous columns such as age and weight would not give any meaningful insight into this question.

So, instead of a mean-based Neyman-Pearson test, I am performing a chi-squared test of independence, which is designed for two categorical variables.

This test checks whether two variables are likely to be related. It will help us decide whether to reject or fail to reject the null hypothesis.

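HYPOTHESIS 1:
NULL HYPOTHESIS: People who walk as their mode of transportation do not come under the normal weight category. In terms of the test used here, mode of transportation and weight category are independent.
ALTERNATE HYPOTHESIS: People who walk as their mode of transportation come under the normal weight category, i.e. mode of transportation and weight category are related.

#VISUALIZATION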

# Group transport modes by whether they involve physical activity
df1 <- Obesity %>%
  mutate(physical_activity = case_when(
    MTRANS %in% c("Public_Transportation", "Automobile", "Motorbike") ~ "No Physical Activity",
    MTRANS %in% c("Walking", "Bike") ~ "Physical Activity",
    TRUE ~ "other"
  ))

# Drop underweight respondents so the comparison is normal weight vs. above
df1 <- df1 %>%
  filter(NObeyesdad != "Insufficient_Weight")

# Collapse the remaining classes into two weight categories
df1 <- df1 %>%
  mutate(Weight_Category = case_when(
    NObeyesdad == "Normal_Weight" ~ "Normal Weight",
    TRUE ~ "Overweight + Obese"
  ))

df1_normal_weight <- df1 %>%
  filter(Weight_Category == 'Normal Weight') %>%
  group_by(physical_activity) %>%
  summarise(normal_weight_count = n())


df1_total <- df1 %>%
  group_by(physical_activity) %>%
  summarise(total_count = n())


df1_results <- df1_normal_weight %>%
  inner_join(df1_total, by = "physical_activity")

df1_results$normal_weight_probability <- df1_results$normal_weight_count/df1_results$total_count
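The same conditional probabilities can be cross-checked with a row-normalised contingency table. The sketch below uses a small made-up data frame (`toy` is hypothetical and not the Obesity data) so that it runs on its own.

```r
# Made-up data for illustration only; not taken from Obesity.csv
toy <- data.frame(
  physical_activity = c("Physical Activity", "Physical Activity",
                        "No Physical Activity", "No Physical Activity",
                        "No Physical Activity"),
  Weight_Category   = c("Normal Weight", "Overweight + Obese",
                        "Normal Weight", "Overweight + Obese",
                        "Overweight + Obese")
)

# Rows: activity group; columns: weight category
counts <- table(toy$physical_activity, toy$Weight_Category)

# margin = 1 divides each row by its row total, giving
# P(weight category | activity group)
probs <- prop.table(counts, margin = 1)
print(probs)
```

Applied to `df1`, the "Normal Weight" column of such a table should match `normal_weight_probability`.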

ggplot(data = df1_results, aes(x = physical_activity, y = normal_weight_probability)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75) +
  labs(x = "\n Physical Activity", y = "Probability\n", title = "\n Probability of being Normal Weight considering physical activity\n") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title.x = element_text(face="bold", colour="red", size = 12),
        axis.title.y = element_text(face="bold", colour="red", size = 12),
        legend.title = element_text(face="bold", size = 10))

This graph suggests that the null hypothesis, ‘People having walk as mode of transportation do not come under normal weight category’, does not hold. A graph alone cannot reject it, so the tests below check this formally.

#parameters for sample size calculation
minimum_effect_size <- 0.2
alpha <- 0.01  # matches the alpha level chosen above
power <- 0.80

contingency_table <- table(df1$MTRANS, df1$NObeyesdad)

#performing Chi-squared test of independence
chi_square_test <- chisq.test(contingency_table)

## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect

print(chi_square_test)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 275.19, df = 20, p-value < 2.2e-16

# Perform Fisher's Exact Test
fisher_test_result <- fisher.test(contingency_table, simulate.p.value = TRUE, B = 10000)

# Print the results
print(fisher_test_result)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  10000 replicates)
## 
## data:  contingency_table
## p-value = 9.999e-05
## alternative hypothesis: two.sided

The X-squared value of 275.19 is large relative to its 20 degrees of freedom, indicating a substantial difference between the observed and expected frequencies in the contingency table. It provides evidence against the null hypothesis.
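The expected frequencies referred to here, and the reason for the "Chi-squared approximation may be incorrect" warning, can be inspected from the test object. This is a minimal sketch on a made-up table (`toy_table` is hypothetical); it assumes the usual rule of thumb that the approximation becomes unreliable when any expected count is below 5.

```r
# Made-up 2x2 table for illustration; one group is deliberately very small
toy_table <- matrix(c(30, 2, 25, 3), nrow = 2,
                    dimnames = list(group = c("A", "B"),
                                    outcome = c("yes", "no")))

# The warning is suppressed here because we inspect its cause directly
res <- suppressWarnings(chisq.test(toy_table))

# Expected counts under independence: row total * column total / grand total
print(res$expected)

# Number of cells whose expected count falls below 5
small_cells <- sum(res$expected < 5)
print(small_cells)
```

When such cells exist, a simulated or exact test is a reasonable complement to the chi-squared result.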

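The p-value is also very low in both the chi-squared test and Fisher’s exact test (below 2.2e-16 and 9.999e-05 respectively), well below the chosen alpha. We can therefore reject the null hypothesis and say that mode of transportation and weight category are related, which supports the claim that ‘people having walk as the mode of transportation fall under the normal weight category’. Note that the test itself only establishes an association across all transport modes and weight classes; the direction comes from the graph above.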
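The Fisher p-value of 9.999e-05 deserves a second look. With `simulate.p.value = TRUE`, R estimates the p-value by Monte Carlo as (1 + number of simulated tables at least as extreme as the observed one) / (B + 1), so it can never be smaller than 1 / (B + 1). The short check below shows that the reported value is this floor for B = 10000, which means none of the simulated tables was as extreme as the observed one.

```r
B <- 10000              # number of Monte Carlo replicates passed to fisher.test
p_floor <- 1 / (B + 1)  # smallest p-value the simulation can report
print(signif(p_floor, 4))
```

A larger `B` would lower this floor, but the conclusion at the chosen alpha would not change.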

HYPOTHESIS 2:
NULL HYPOTHESIS: People having a family history of being overweight do not come under the overweight and obese weight categories.
ALTERNATE HYPOTHESIS: People having a family history of being overweight come under the overweight and obese weight categories.

Both columns used for this hypothesis, family history of overweight and weight category, are again categorical, so the same reasoning applies: a mean-based Neyman-Pearson test such as a t-test does not fit, and the continuous columns such as age and weight would not give any meaningful insight into this question. I therefore use the chi-squared test of independence again.

This test checks whether two variables are likely to be related. It will help us decide whether to reject or fail to reject the null hypothesis.

df2 <- Obesity %>%
  filter(NObeyesdad != 'Insufficient_Weight') 

df2 <- df2 %>% 
  mutate(weight_category = case_when(
    NObeyesdad %in% c("Normal_Weight") ~ "Normal_Weight",
    NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II") ~ "Overweight",
    NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III") ~ "Obese",
    TRUE ~ "other"
  )
)

df2_family_history <- df2 %>%
  group_by(family_history_with_overweight) %>%
  summarise(family_history_records = n())

df2_family_history_and_weight_category <- df2 %>%
  group_by(family_history_with_overweight, weight_category) %>%
  summarise(weight_category_records = n())

## `summarise()` has grouped output by 'family_history_with_overweight'. You can
## override using the `.groups` argument.

df2_results <- df2_family_history %>%
  inner_join(df2_family_history_and_weight_category,
             by = "family_history_with_overweight")

df2_results$weight_category_probability <- df2_results$weight_category_records/df2_results$family_history_records

ggplot(data = df2_results, aes(x = weight_category, y = weight_category_probability, fill = family_history_with_overweight)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75) +
  labs(x = "\n Weight Category", y = "Probability\n", title = "\n Probability of being in a weight category considering family history \n") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title.x = element_text(face="bold", colour="red", size = 12),
        axis.title.y = element_text(face="bold", colour="red", size = 12),
        legend.title = element_text(face="bold", size = 10))

The graph suggests that the null hypothesis does not hold. A graph alone cannot reject it, so the tests below check this formally.

#parameters for sample size calculation
minimum_effect_size <- 0.2
alpha <- 0.01  # matches the alpha level chosen above
power <- 0.80

contingency_table <- table(df2$family_history_with_overweight, df2$weight_category)

#performing Chi-squared test of independence.
chi_square_test <- chisq.test(contingency_table)

#printing results
print(chi_square_test)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 412.25, df = 2, p-value < 2.2e-16

# Perform Fisher's Exact Test
fisher_test_result <- fisher.test(contingency_table, simulate.p.value = TRUE, B = 10000)

# Print the results
print(fisher_test_result)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  10000 replicates)
## 
## data:  contingency_table
## p-value = 9.999e-05
## alternative hypothesis: two.sided

The X-squared value of 412.25 is large relative to its 2 degrees of freedom, indicating a substantial difference between the observed and expected frequencies in the contingency table. It provides evidence against the null hypothesis.
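Because a chi-squared statistic grows with sample size, a large value does not by itself say how strong the association is. A common companion measure, sketched here as an assumption rather than as part of the original analysis, is Cohen's w = sqrt(X-squared / N), which is on the same scale as the minimum effect size chosen earlier. The table below is made up for illustration (`toy_table` is hypothetical).

```r
# Made-up 2x2 table for illustration; not the Obesity data
toy_table <- matrix(c(40, 10,
                      20, 30), nrow = 2, byrow = TRUE,
                    dimnames = list(family_history = c("yes", "no"),
                                    weight = c("Overweight", "Normal")))

# correct = FALSE turns off the continuity correction so that the statistic
# is the plain Pearson X-squared
test <- chisq.test(toy_table, correct = FALSE)

n <- sum(toy_table)                     # total number of observations
w <- sqrt(unname(test$statistic) / n)   # Cohen's w effect size
print(w)
```

On the real data the same calculation would use `contingency_table`; a w above the chosen minimum effect size of 0.2 would indicate a practically meaningful association and not just a statistically detectable one.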

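The p-value is also very low in both the chi-squared test and Fisher’s exact test (below 2.2e-16 and 9.999e-05 respectively), well below the chosen alpha. We can therefore reject the null hypothesis and say that family history and weight category are related, which supports the alternate hypothesis that ‘People having a family history of being overweight come under overweight and obese weight category’.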

Jagriti Mahajan

2023-10-10