Week 7 | Data Dive — Hypothesis Testing

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

The purpose of this assignment is to devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:

i) Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value.
ii) Determine if you have enough data to perform a Neyman-Pearson hypothesis test. If you do, perform one and interpret the results. If not, explain why.
iii) Perform a Fisher’s style test for significance, and interpret the p-value.

So, technically, you have two hypothesis tests for each hypothesis, for four tests in total.

Build two visualizations that best illustrate the results from the two pairs of hypothesis tests, one for each null hypothesis.

Loading Data

mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv",
                  delim = ";", show_col_types = FALSE)

glimpse(mpg)

## Rows: 4,424
## Columns: 37
## $ `Marital status`                                 <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode`                               <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order`                              <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course                                           <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t`                   <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification`                         <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)`                 <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality                                      <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification`                         <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification`                         <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation`                            <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation`                            <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade`                                <dbl> 127.3, 142.5, 124.8, …
## $ Displaced                                        <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs`                      <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor                                           <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date`                        <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender                                           <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder`                             <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment`                              <dbl> 20, 19, 19, 20, 45, 5…
## $ International                                    <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)`         <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)`            <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)`               <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)`         <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)`            <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)`               <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate`                              <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate`                                 <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP                                              <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target                                           <chr> "Dropout", "Graduate"…

Null Hypothesis 1 (H0): The average age at enrollment for students who dropped out (Target: “Dropout”) is the same as the average age at enrollment for students who graduated (Target: “Graduate”).

Null Hypothesis 2 (H0): The average admission grade for students who have educational special needs (Educational special needs: 1) is the same as the average admission grade for students who do not (Educational special needs: 0).

Hypothesis Test 1:

Alpha Level (α): 0.05
Power Level: 0.80
Minimum Effect Size (Cohen’s d): 0.50 (medium effect size)

We will perform a two-sample t-test to compare the means of age at enrollment for students who dropped out and those who graduated. Then, we will visualize the results.
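
The alpha level, power level, and minimum effect size stated above jointly imply a minimum number of students per group for a Neyman-Pearson test. A minimal sketch of that calculation, assuming equal group sizes and a two-sided test, could use base R's `power.t.test()`. The `stats::` prefix is there because attaching pwrss masks that function; the variable names `design` and `required_n` are illustrative only.

```r
# Solve for the per-group sample size implied by the stated design values.
# stats:: is explicit because pwrss masks power.t.test() once it is attached.
alpha <- 0.05
power <- 0.80
effect_size <- 0.50  # Cohen's d, so sd = 1 keeps delta on the d scale

design <- stats::power.t.test(
  delta = effect_size,
  sd = 1,
  sig.level = alpha,
  power = power,
  type = "two.sample",
  alternative = "two.sided"
)

# Round up to a whole number of students per group.
required_n <- ceiling(design$n)
print(required_n)
```

Comparing `required_n` with the group counts from `table(subset_df$Target)` would indicate whether both groups are large enough for the planned test.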

# Hypothesis Test 1: Age at Enrollment vs. Target (Dropout vs. Graduate)

# Filter the data for only "Dropout" and "Graduate" levels in the "Target" variable
subset_df <- mpg[mpg$Target %in% c("Dropout", "Graduate"), ]

alpha <- 0.05
power <- 0.80
effect_size <- 0.50

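# Take both groups from subset_df so the logical mask and the indexed column
# have the same length; a mask built from subset_df would be recycled over mpg
# and would mix the groups.
result_t_test <- t.test(subset_df$`Age at enrollment`[subset_df$Target == "Dropout"],
                        subset_df$`Age at enrollment`[subset_df$Target == "Graduate"])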


# Visualization
# Plot only the two groups being tested, so the two colors and the legend line up
boxplot(`Age at enrollment` ~ Target, data = subset_df, main = "Age at Enrollment by Target",
        xlab = "Target", ylab = "Age at Enrollment", col = c("blue", "green"))
legend("topright", legend = c("Dropout", "Graduate"), fill = c("blue", "green"))

# Interpretation of Results
if (result_t_test$p.value < alpha) {
  cat("H0 is rejected. There is a significant difference in age at enrollment between Dropout and Graduate students.")
} else {
  cat("H0 is not rejected. There is no significant difference in age at enrollment between Dropout and Graduate students.")
}

## H0 is not rejected. There is no significant difference in age at enrollment between Dropout and Graduate students.
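
A p-value alone does not say how large the observed difference is relative to the minimum effect size of 0.50. One way to check is to compute Cohen's d for the two groups. The sketch below does this in base R with a pooled standard deviation; `cohens_d` is a hypothetical helper and the ages are made-up values for illustration, not taken from the data set. The effsize package loaded at the top offers `cohen.d()` for the same purpose.

```r
# Cohen's d with a pooled standard deviation, written in base R.
# cohens_d is a hypothetical helper name used only for this illustration.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Made-up ages for two small groups (not real observations).
toy_dropout <- c(20, 24, 28)
toy_graduate <- c(19, 23, 27)

d <- cohens_d(toy_dropout, toy_graduate)
print(d)
```

Applied to the real data, the two arguments would be the age vectors for the "Dropout" and "Graduate" rows of `subset_df`, and the resulting d can be compared directly with the 0.50 threshold.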

Hypothesis Test 2:

Alpha Level (α): 0.05
Power Level: 0.80
Minimum Effect Size (Cohen’s d): 0.50 (medium effect size)

We will perform a two-sample t-test to compare the means of admission grade for students with and without educational special needs.
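
When one group is much smaller than the other, it can help to ask what power the test actually achieves at the stated alpha and effect size. The sketch below solves for power given a per-group size. The value `smaller_group_n <- 50` is a hypothetical placeholder, not the real count of students with educational special needs, and `power.t.test()` assumes equal group sizes, so plugging in the smaller group is only a rough, conservative approximation.

```r
# Solve for achieved power at a given per-group sample size.
alpha <- 0.05
effect_size <- 0.50    # Cohen's d, so sd = 1 keeps delta on the d scale
smaller_group_n <- 50  # hypothetical count; replace with the real group size

achieved <- stats::power.t.test(
  n = smaller_group_n,
  delta = effect_size,
  sd = 1,
  sig.level = alpha,
  type = "two.sample",
  alternative = "two.sided"
)

achieved_power <- achieved$power
print(round(achieved_power, 2))
```

If `achieved_power` falls below the 0.80 target, a non-significant result is weak evidence for the null hypothesis, because the test had a limited chance of detecting an effect of that size.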

# Hypothesis Test 2: Admission Grade vs. Educational Special Needs (1 vs. 0)
alpha <- 0.05
power <- 0.80
effect_size <- 0.50


result_t_test <- t.test(`Admission grade` ~ `Educational special needs`, data = mpg)

print(result_t_test$p.value)

## [1] 0.1731924

# Visualization
boxplot(`Admission grade` ~ `Educational special needs`, data = mpg, main = "Admission Grade by Educational Special Needs",
        xlab = "Educational Special Needs", ylab = "Admission Grade", col = c("blue", "green"))
legend("topright", legend = c("No Educational Special Needs", "With Educational Special Needs"), fill = c("blue", "green"))

# Interpretation of Results
if (result_t_test$p.value < alpha) {
  cat("H0 is rejected. There is a significant difference in admission grade between students with and without educational special needs.")
} else {
  cat("H0 is not rejected. There is no significant difference in admission grade between students with and without educational special needs.")
}

## H0 is not rejected. There is no significant difference in admission grade between students with and without educational special needs.

Hypothesis Test 1: Age at Enrollment vs. Target (Dropout vs. Graduate)
Result: H0 is not rejected.
Interpretation: There is no significant difference in age at enrollment between students who dropped out and those who graduated.

Hypothesis Test 2: Admission Grade vs. Educational Special Needs (1 vs. 0)
Result: H0 is not rejected.
Interpretation: There is no significant difference in admission grade between students with educational special needs and those without.

These results suggest that, based on the data and the chosen significance level, there is insufficient evidence to conclude that there are significant differences in age at enrollment or admission grade among the specified groups.

Further Considerations:

While these specific hypotheses did not yield significant results, it’s important to note that the absence of evidence of a significant difference does not necessarily mean that there are no practical or real-world differences. Further research and analyses may be needed to explore other factors that could influence these variables and to understand their implications better.

Additionally, it might be worth considering whether the sample size or the choice of statistical test had an impact on the results. Further investigations or different analytical approaches could provide deeper insights into these variables’ relationships and significance.
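
Two of the follow-ups mentioned above can be sketched briefly: reporting a confidence interval for the difference in means, which shows the range of differences the data are compatible with, and trying a rank-based test in case the t-test's assumptions are a concern. The data here are simulated, right-skewed values for two hypothetical groups, not the student data, and the choice of `wilcox.test()` as the alternative is an assumption about which other test might be tried.

```r
set.seed(1)

# Simulated, right-skewed "ages" for two hypothetical groups (not real data).
group_a <- 18 + rexp(200, rate = 1 / 5)
group_b <- 18 + rexp(200, rate = 1 / 5)

# Welch two-sample t-test (the default for t.test) and its 95% confidence
# interval for mean(group_a) - mean(group_b).
welch <- t.test(group_a, group_b)
ci <- welch$conf.int

# Rank-based alternative that does not rely on normality.
rank_test <- wilcox.test(group_a, group_b)

print(ci)
print(rank_test$p.value)
```

A narrow interval centered near zero supports "no meaningful difference" more strongly than a wide interval that merely includes zero.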

Vaishali Kondoju

2023-10-04