Importing data into “dataset”:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dataset <-read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Creating a custom table by recoding the necessary categorical columns:

dataset_1 <- dataset
# Recode the numeric marital-status codes into labels ("no" is the fallback)
dataset_1 <- mutate(dataset_1, marital_status = ifelse(`Marital status` == 1, "single",
                    ifelse(`Marital status` == 2, "married",
                    ifelse(`Marital status` == 3, "widower",
                    ifelse(`Marital status` == 4, "divorced",
                    ifelse(`Marital status` == 5, "facto union",
                    ifelse(`Marital status` == 6, "legally separated", "no")))))))

dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance    ` == 1, "day","evening"))

# Encode the outcome numerically; NA (rather than "no") keeps the column numeric
dataset_1 <- mutate(dataset_1, target = ifelse(Target == "Graduate", 2,
                    ifelse(Target == "Enrolled", 1,
                    ifelse(Target == "Dropout", 0, NA))))

dataset_1<-mutate(dataset_1, sem_results= rowMeans(select(dataset_1,`Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`)))
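
As an aside, the code-to-label recoding above can also be sketched with a named lookup vector instead of nested ifelse() calls. This is only an illustrative alternative in base R: the vector `status_codes` is hypothetical toy data standing in for the `Marital status` column, and the labels are copied from the recoding above.

```r
# Hypothetical toy codes standing in for the `Marital status` column
status_codes <- c(1, 2, 4, 6, 1, 9)

# Named vector mapping each code to its label
status_labels <- c(
  "1" = "single",
  "2" = "married",
  "3" = "widower",
  "4" = "divorced",
  "5" = "facto union",
  "6" = "legally separated"
)

# Look up each code by name; codes without a label come back as NA
marital_status <- unname(status_labels[as.character(status_codes)])

# Mirror the "no" fallback of the nested ifelse() calls
marital_status[is.na(marital_status)] <- "no"

marital_status
```

A lookup vector keeps the mapping in one place, which makes it easier to spot a misspelled or missing label than in a chain of nested conditions.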

Hypothesis 1 - Neyman-Pearson Framework

options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("pwr")

## Installing package into 'C:/Users/MSKR/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'pwr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\MSKR\AppData\Local\Temp\RtmpKSOKpd\downloaded_packages

library(pwr)

Null Hypothesis: Among students who have either graduated or dropped out, the true proportion of Dropout students is 0.5.

Alternative Hypothesis: Among these students, the true proportion of Dropout students is not equal to 0.5.
Effect Size (0.1):
- An effect size of 0.1 means that we want to detect a difference of at least 10 percentage points between the two proportions (0.5 versus 0.6). ES.h() converts this difference into Cohen's h, which determines how large a sample is needed to detect it.
Alpha (alpha = 0.05):
- This is the significance level of the test. It represents the probability of making a Type I error (rejecting the null hypothesis when it is true). An alpha of 0.05 is commonly used, meaning we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis.
Power (0.80):
- Power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true (i.e., detecting an effect when there is one). A power of 0.80 means we have an 80% chance of detecting an effect if the true effect size is at least 0.1.

# Define parameters for sample size calculation
effect_size <- 0.1  # Define your minimum effect size (10% difference)
alpha <- 0.05
power <- 0.80

# Calculate sample size for each group
n <- pwr.2p.test(h = ES.h(0.5, 0.5 + effect_size), sig.level = alpha, power = power)$n
n

## [1] 387.1677

This is the sample size per group required for a two-proportion test, given a minimum detectable difference of 10 percentage points, a significance level of 0.05, and a desired power of 0.80. Rounding up gives 388 observations per group.
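
To make the calculation less of a black box, the sketch below reproduces it approximately in base R. It assumes the usual arcsine definition of Cohen's h and the standard normal-approximation formula for a two-sided test; the names `p1`, `p2` and `n_approx` are introduced here for illustration only. Because the closed form ignores the tiny chance of rejecting in the wrong direction, it may differ from the pwr output in the later decimals.

```r
p1 <- 0.5      # proportion under the null hypothesis
p2 <- 0.6      # 0.5 plus the minimum difference of 0.1
alpha <- 0.05  # significance level
power <- 0.80  # desired power

# Cohen's h: difference of the arcsine-transformed proportions
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Normal-approximation sample size per group (two-sided test)
n_approx <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / h)^2

h
n_approx
ceiling(n_approx)  # round up to whole observations
```

Note that h is about -0.2 rather than 0.1: the 0.1 is the raw difference in proportions, while h is the transformed effect size that the power calculation actually uses.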

# Create a contingency table
dataset_2<-filter(dataset_1,Target!="Enrolled")
contingency_table <- table(dataset_2$Target)

We have filtered out the “Enrolled” students because they are still pursuing the program. They will be used later to test whether our model predicts that they will graduate or drop out.

contingency_table

## 
##  Dropout Graduate 
##     1421     2209

Perform the proportions test (prop.test() reports the chi-squared form of the z-test):

prop_test <- prop.test(contingency_table)
prop_test

## 
##  1-sample proportions test with continuity correction
## 
## data:  contingency_table, null probability 0.5
## X-squared = 170.63, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3755686 0.4075829
## sample estimates:
##         p 
## 0.3914601

The p-value (< 2.2e-16) is far smaller than the chosen alpha (0.05). Therefore, we reject the Null Hypothesis. The sample proportion of dropouts is 0.3914601.

Summarizing the results in short:

Test Type: One-sample proportions test with continuity correction.
Test Statistic: Chi-squared value is 170.63 with 1 degree of freedom.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true proportion of dropouts is estimated to be between approximately 0.3756 and 0.4076.
Sample Estimate: The observed proportion of dropouts in the data is approximately 0.3915.
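
The reported statistic can be checked by hand from the two counts in the contingency table. The sketch below, in base R, assumes the Yates continuity correction that prop.test() applies by default; the variable names are introduced here for illustration.

```r
dropouts <- 1421   # Dropout count from the contingency table
n <- 1421 + 2209   # Dropout + Graduate students
p0 <- 0.5          # proportion under the null hypothesis

# Observed proportion of dropouts
p_hat <- dropouts / n

# Chi-squared statistic with continuity correction:
# shrink |observed - expected| by 0.5 before squaring
x_squared <- (abs(dropouts - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)

# Cross-check against prop.test() with the same counts
check <- prop.test(dropouts, n, p = p0)
c(p_hat = p_hat, x_squared = x_squared, prop_test = unname(check$statistic))
```

The hand-computed value agrees with the X-squared of about 170.63 shown in the output above.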

# Create a bar plot of the Dropout and Graduate proportions
# after_stat() replaces the dot-dot notation deprecated in ggplot2 3.4.0
ggplot(dataset_2, aes(x = Target)) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +
  labs(title = "Proportion of Graduates vs. Dropouts",
       y = "Proportion",
       x = "Target") +
  scale_y_continuous(labels = scales::percent)

The above metrics and bar plot provide strong evidence that the true proportion of dropouts differs from 0.5, so we reject the Null Hypothesis in favor of the Alternative Hypothesis.

Hypothesis 2 - Fisher’s Significance Testing Framework

Null Hypothesis: There is no difference between the average semester grades of Graduate and Enrolled students.

Alternative Hypothesis: The average semester grades of Graduate and Enrolled students are different.

# Filter data for graduates and enrolled students
graduates <- dataset_1 |> filter(Target == "Graduate")
enrolled <- dataset_1 |> filter(Target == "Enrolled")

Perform the t-test

t_test_result <- t.test(graduates$sem_results, enrolled$sem_results)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  graduates$sem_results and enrolled$sem_results
## t = 11.611, df = 1153.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.287354 1.810905
## sample estimates:
## mean of x mean of y 
##  12.67056  11.12143

if (t_test_result$p.value < 0.05) {
  print("Reject H0 (equal means): the average semester grades of Graduate and Enrolled students differ significantly.")
} else {
  print("Fail to reject H0 (equal means): no significant difference in the average semester grades of Graduate and Enrolled students.")
}

## [1] "Reject H0 (equal means): the average semester grades of Graduate and Enrolled students differ significantly."

Summarizing the results in short:

Test Type: Welch Two Sample t-test comparing the means of two groups (graduates and enrolled).
Test Statistic: t = 11.611 with degrees of freedom (df) ≈ 1153.5.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true difference in means is estimated to be between approximately 1.2874 and 1.8109.
Sample Estimates:
- Mean of graduates (x): 12.67056
- Mean of enrolled (y): 11.12143
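
For readers who want to see where the t statistic and the fractional degrees of freedom come from, here is a minimal base R sketch of the Welch calculation. The data are simulated and hypothetical, not the student grades, so the numbers will not match the output above; the point is only that the hand formulas reproduce what t.test() reports.

```r
set.seed(1)
# Hypothetical grades, NOT the real data: unequal group sizes and spreads
x <- rnorm(200, mean = 12.7, sd = 1.5)
y <- rnorm(80, mean = 11.1, sd = 3.0)

# Variance of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = df_manual, lower.tail = FALSE)

# Cross-check against t.test(), which defaults to the Welch variant
welch <- t.test(x, y)
c(t_manual = t_manual, t_test = unname(welch$statistic),
  df_manual = df_manual, df_test = unname(welch$parameter))
```

Because the Welch test does not pool the variances, the degrees of freedom are generally not a whole number, which is why the report shows df ≈ 1153.5.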

# Create a box plot comparing sem_results across the Target groups
ggplot(dataset_1, aes(x = Target, y = sem_results)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Comparison of sem_results by Target group",
       y = "sem_results",
       x = "Target")

The results of this test show that the average semester grades of Graduate students (12.67) are significantly higher than those of Enrolled students (11.12), with a 95% confidence interval for the difference of roughly 1.29 to 1.81. We therefore reject the Null Hypothesis of equal means.

From the box plot, the Enrolled students’ grades appear closer to those of the Graduates than to those of the Dropouts. This is a good sign that the Enrolled students are progressing in the right direction, even though their average remains below that of the Graduates.

Week7_DataDive_HypothesisTesting

Kiran

2024-10-15
