After exploring your dataset over the past few weeks, you should already have some questions. Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis: Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value. Determine if you have enough data to perform a Neyman-Pearson hypothesis test. If you do, perform one and interpret the results. If not, explain why. Perform a Fisher-style test for significance, and interpret the p-value. So, technically, you have two hypothesis tests for each hypothesis, equating to four total tests. Build two visualizations that best illustrate the results from the two pairs of hypothesis tests, one for each null hypothesis. For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have that might need further investigation.
Importing all the libraries
## Loading required package: ggplot2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'kableExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## group_rows
We load the data into a data frame named ‘data’.
#Loading the dataset
data <- read_delim("data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "Marital status"
## [2] "Application mode"
## [3] "Application order"
## [4] "Course"
## [5] "Daytime/evening attendance\t"
## [6] "Previous qualification"
## [7] "Previous qualification (grade)"
## [8] "Nacionality"
## [9] "Mother's qualification"
## [10] "Father's qualification"
## [11] "Mother's occupation"
## [12] "Father's occupation"
## [13] "Admission grade"
## [14] "Displaced"
## [15] "Educational special needs"
## [16] "Debtor"
## [17] "Tuition fees up to date"
## [18] "Gender"
## [19] "Scholarship holder"
## [20] "Age at enrollment"
## [21] "International"
## [22] "Curricular units 1st sem (credited)"
## [23] "Curricular units 1st sem (enrolled)"
## [24] "Curricular units 1st sem (evaluations)"
## [25] "Curricular units 1st sem (approved)"
## [26] "Curricular units 1st sem (grade)"
## [27] "Curricular units 1st sem (without evaluations)"
## [28] "Curricular units 2nd sem (credited)"
## [29] "Curricular units 2nd sem (enrolled)"
## [30] "Curricular units 2nd sem (evaluations)"
## [31] "Curricular units 2nd sem (approved)"
## [32] "Curricular units 2nd sem (grade)"
## [33] "Curricular units 2nd sem (without evaluations)"
## [34] "Unemployment rate"
## [35] "Inflation rate"
## [36] "GDP"
## [37] "Target"
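Several of the column names listed above contain spaces, and one ("Daytime/evening attendance\t") carries a trailing tab. The following is a minimal sketch, using a small made-up data frame rather than the real file, of how such names can be trimmed and then referenced with backticks. The data frame `toy` and its values are assumptions for illustration only.

```r
# Toy data frame standing in for the real dataset (values are made up).
toy <- data.frame(
  a = c(1, 0, 1),
  b = c(127.3, 142.5, 124.8),
  Target = c("Dropout", "Graduate", "Graduate")
)
# Give the toy columns the same awkward names as in the real data.
names(toy) <- c("Daytime/evening attendance\t", "Admission grade", "Target")

# Strip leading/trailing whitespace (including tabs) from every column name.
names(toy) <- trimws(names(toy))
print(names(toy))

# Names that contain spaces must be wrapped in backticks when used with `$`.
mean_grade <- mean(toy$`Admission grade`)
print(mean_grade)
```

The same backtick quoting applies inside formulas and plotting calls, which is worth checking whenever a column with a space in its name produces an "object not found" style error.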
Null hypothesis \(H_0\): The mean admission grade is the same for students who dropped out and students who graduated.
Alternative hypothesis \(H_a\): The mean admission grade differs between students who dropped out and students who graduated.
Alpha level (α): 0.05, the conventional significance level. We are willing to accept a 5% chance of making a Type I error (rejecting a true null hypothesis).
Power level (1-β): 0.80. This means we accept a 20% chance of making a Type II error (failing to detect a true difference at least as large as the minimum effect size).
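Minimum effect size (δ): 0.2. Expressed as a standardized mean difference (Cohen's d), this is conventionally considered a small effect rather than a moderate one.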
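The three design values above determine how many observations per group are needed. The sketch below assumes that the effect size of 0.2 is a standardized difference (Cohen's d) and that the test is a two-sided, two-sample t-test. Under those assumptions, base R's `power.t.test()` gives the per-group sample size, which agrees with the 394 reported in the output.

```r
# Design parameters taken from the text above.
alpha <- 0.05        # Type I error rate
power <- 0.80        # 1 - beta
effect_size <- 0.2   # assumed to be a standardized mean difference (Cohen's d)

# With sd = 1, `delta` is expressed in standard-deviation units, i.e. Cohen's d.
res <- power.t.test(delta = effect_size, sd = 1,
                    sig.level = alpha, power = power,
                    type = "two.sample", alternative = "two.sided")

# Round up, because a fractional observation cannot be sampled.
n_per_group <- ceiling(res$n)
print(n_per_group)
```

If each group in the data has at least this many rows, there is enough data for the Neyman-Pearson test at the chosen alpha and power.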
## Sample variances: Dropout = 228.77, Graduate = 198.01
## Rejection region for Neyman-Pearson test: t > 1.65
## Decision for Neyman-Pearson test: Fail to reject H0
## Sample size required to achieve desired power level: 394
## The p-value is 1
## Decision for fisher test: Fail to reject H0
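Summary of the hypothesis tests: Neyman-Pearson test: fail to reject the null hypothesis (rejection region t > 1.65, test statistic -0.98). Fisher-style significance test: reject the null hypothesis (p-value = 2.59e-14). Two points need further investigation. First, this p-value does not match the value of 1 printed in the output above. Second, the rejection region t > 1.65 is one-sided, while the alternative hypothesis is two-sided, so the two tests are not answering the same question as currently set up.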
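To make the relationship between the two approaches concrete, here is a minimal sketch on simulated data (not the real dataset). The group sizes and means are made up, and only the standard deviations are taken from the reported sample variances. It uses a two-sided Welch t-test for both approaches, which is an assumption about the intended test.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated stand-ins for the two groups (NOT the real data).
# Means and sizes are hypothetical; sds come from the reported variances.
dropout  <- rnorm(500, mean = 125, sd = sqrt(228.77))
graduate <- rnorm(500, mean = 128, sd = sqrt(198.01))

alpha <- 0.05
tt <- t.test(dropout, graduate, alternative = "two.sided")  # Welch test by default

t_stat <- unname(tt$statistic)
df     <- unname(tt$parameter)

# Neyman-Pearson: fix the rejection region in advance, then decide.
t_crit <- qt(1 - alpha / 2, df)  # two-sided critical value
np_decision <- if (abs(t_stat) > t_crit) "Reject H0" else "Fail to reject H0"

# Fisher: report the p-value as a continuous measure of evidence.
p_value <- tt$p.value
fisher_decision <- if (p_value < alpha) "Reject H0" else "Fail to reject H0"

cat("t =", round(t_stat, 2), " critical value = +/-", round(t_crit, 2), "\n")
cat("Neyman-Pearson decision:", np_decision, "\n")
cat("p-value =", signif(p_value, 3), "\n")
```

When both approaches use the same test statistic, the same sidedness, and the same alpha, comparing |t| with the critical value and comparing the p-value with alpha always lead to the same decision. A disagreement between them therefore points to a difference in how the two were computed.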
The boxplot produced an error that could not be resolved, so different plots were tried instead.
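The failing plotting code is not shown, so its cause is unknown. One possible cause (an assumption) is that column names containing spaces, such as "Admission grade", were not quoted with backticks. The sketch below draws a grouped boxplot with base R's `boxplot()` on made-up data that uses the same column names. The data frame `toy` and its values are hypothetical.

```r
set.seed(1)  # reproducible toy data

# Made-up data with the same column names as the real dataset.
toy <- data.frame(
  grade  = c(rnorm(50, mean = 125, sd = 15), rnorm(50, mean = 128, sd = 14)),
  Target = rep(c("Dropout", "Graduate"), each = 50)
)
names(toy)[1] <- "Admission grade"

# Backticks are required because the column name contains a space.
bp <- boxplot(`Admission grade` ~ Target, data = toy,
              ylab = "Admission grade", xlab = "Target",
              main = "Admission grade by outcome (toy data)")

# `bp$names` holds the group labels in plotting order.
print(bp$names)
```

With the real data, the same call would use `data = data` after filtering to the two groups being compared.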
Null hypothesis \(H_0\): The average 1st-semester grade is the same for students who dropped out and students who graduated.
Alternative hypothesis \(H_a\): The average 1st-semester grade differs between students who dropped out and students who graduated.
Alpha level (α): 0.05, the conventional significance level. We are willing to accept a 5% chance of making a Type I error (rejecting a true null hypothesis).
Power level (1-β): 0.80. This means we accept a 20% chance of making a Type II error (failing to detect a true difference at least as large as the minimum effect size).
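Minimum effect size (δ): 0.2. Expressed as a standardized mean difference (Cohen's d), this is conventionally considered a small effect rather than a moderate one.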
## Sample variances: Dropout = 36.37, Graduate = 7.28
## Rejection region for Neyman-Pearson test: t > 1.65
## Decision for Neyman-Pearson test: Fail to reject H0
## Sample size required to achieve desired power level: 394
## The p-value is 1
## Decision for Fisher test: Fail to reject H0
Sample variances: The Dropout group has a much higher variance (36.37) than the Graduate group (7.28).
Neyman-Pearson test: The test statistic (t) did not exceed the critical value (1.65), so we fail to reject the null hypothesis. The sample size required to achieve the desired power level was calculated to be 394.
Fisher’s test: The p-value is extremely small (2.399e-175), indicating strong evidence against the null hypothesis, so we reject it.
In summary, the Neyman-Pearson test did not find a significant difference between the groups, whereas the Fisher-style test strongly suggests a difference in the mean “Curricular units 1st sem (grade)” between the Dropout and Graduate groups. This disagreement needs further investigation: the p-value quoted here does not match the value of 1 printed in the output above, and the rejection region t > 1.65 is one-sided while the alternative hypothesis is two-sided.
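Because the two sample variances differ by roughly a factor of five, the choice between a pooled-variance t-test and Welch's t-test matters. The sketch below uses simulated data (not the real dataset) to compare the two. The group means and sizes are made up, and only the variances follow the reported values.

```r
set.seed(7)  # reproducible simulated data

# Simulated stand-ins (NOT the real data); group means and sizes are made up,
# variances match the reported 36.37 and 7.28.
dropout  <- rnorm(300, mean = 7,  sd = sqrt(36.37))
graduate <- rnorm(600, mean = 12, sd = sqrt(7.28))

# Sample variances of the simulated groups.
v <- c(Dropout = var(dropout), Graduate = var(graduate))
print(round(v, 2))

welch  <- t.test(dropout, graduate, var.equal = FALSE)  # Welch-Satterthwaite df
pooled <- t.test(dropout, graduate, var.equal = TRUE)   # assumes equal variances

df_welch  <- unname(welch$parameter)
df_pooled <- unname(pooled$parameter)
cat("Welch df:", round(df_welch, 1), " pooled df:", df_pooled, "\n")
```

Welch's test does not assume equal variances and is the default in `t.test()`, so it is the safer choice for groups like these.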
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine