Write the first research question. Perform the statistical hypothesis tests to answer your research question. Perform both the parametric test and the corresponding non-parametric test and explain the results. For the parametric test, check all necessary assumptions. Finally, describe which test (parametric or non-parametric) is more suitable for your particular case and why. Also calculate the effect size and explain it. Finally, answer your research question clearly.
file_path <- "~/Desktop/MVA/HW1/mba_decision_dataset.csv"
mydata <- read.csv(file_path)
#read.csv() did not find the file via a relative path, so the full path is given above.
#The relative-path alternative would have been:
#mydata <- read.table("./mba_decision_dataset.csv",
#                     header = TRUE,
#                     sep = ",")
head(mydata)
## Person.ID Age Gender Undergraduate.Major Undergraduate.GPA Years.of.Work.Experience
## 1 1 27 Male Arts 3.18 8
## 2 2 24 Male Arts 3.03 4
## 3 3 33 Female Business 3.66 9
## 4 4 31 Male Engineering 2.46 1
## 5 5 28 Female Business 2.75 9
## 6 6 33 Male Business 3.58 3
## Current.Job.Title Annual.Salary..Before.MBA. Has.Management.Experience GRE.GMAT.Score
## 1 Entrepreneur 90624 No 688
## 2 Analyst 53576 Yes 791
## 3 Engineer 79796 No 430
## 4 Manager 105956 No 356
## 5 Entrepreneur 96132 No 472
## 6 Manager 101925 No 409
## Undergrad.University.Ranking Entrepreneurial.Interest Networking.Importance MBA.Funding.Source
## 1 185 7.9 7.6 Loan
## 2 405 3.8 4.1 Loan
## 3 107 6.7 5.5 Scholarship
## 4 257 1.0 5.3 Loan
## 5 338 9.5 4.9 Loan
## 6 280 3.4 7.1 Scholarship
## Desired.Post.MBA.Role Expected.Post.MBA.Salary Location.Preference..Post.MBA. Reason.for.MBA
## 1 Finance Manager 156165 International Entrepreneurship
## 2 Startup Founder 165612 International Career Growth
## 3 Consultant 122248 Domestic Skill Enhancement
## 4 Consultant 123797 International Entrepreneurship
## 5 Consultant 197509 Domestic Skill Enhancement
## 6 Marketing Director 99591 International Networking
## Online.vs..On.Campus.MBA Decided.to.Pursue.MBA.
## 1 On-Campus Yes
## 2 Online No
## 3 Online No
## 4 On-Campus No
## 5 Online Yes
## 6 On-Campus No
The unit of observation is one MBA student; there are 4500 students in the dataset.
Source of data: https://www.kaggle.com/datasets/ashaychoudhary/dataset-mba-decision-after-bachelors/data
Description:
Person ID – Unique identifier
Age – Age at the time of decision
Gender – Male, Female, Other
Undergraduate Major – Engineering, Business, Arts, Science, etc.
Undergraduate GPA – Scale from 0 to 4
Years of Work Experience – Years before MBA decision
Current Job Title – Analyst, Manager, Consultant, etc.
Annual Salary (Before MBA) – In USD
Has Management Experience – Yes/No
GRE/GMAT Score – Standardized test score
Undergrad University Ranking – Ranking of Bachelor’s institution
Entrepreneurial Interest – Scale from 1 to 10
Networking Importance – Scale from 1 to 10
MBA Funding Source – Self-funded, Loan, Scholarship, Employer
Desired Post-MBA Role – Consultant, Executive, Startup Founder, etc.
Expected Post-MBA Salary – Expected salary after MBA
Location Preference (Post-MBA) – Domestic, International
Reason for MBA – Career Growth, Skill Enhancement, Entrepreneurship, etc.
Online vs. On-Campus MBA – Preference for learning mode
Decided to Pursue MBA? – Yes/No (Target Variable)
The imported data set recorded the expected post-MBA salaries of all students. Can I conclude that the expected salaries are the same across genders, in other words, that the gender of a student does not affect their expected post-MBA salary?
library(psych)
psych::describe(mydata$Expected.Post.MBA.Salary)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4500 130292.9 40384.83 130085.5 130271.4 51845.04 60021 199999 139978 0.01 -1.2 602.02
describeBy(x = mydata$Expected.Post.MBA.Salary, group = mydata$Gender)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1995 131479 40491.51 131817 131742.1 52408.43 60043 199940 139897 -0.04 -1.19 906.55
## ---------------------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2320 129205 40321.86 128373.5 128894.5 51501.82 60021 199999 139978 0.06 -1.19 837.14
## ---------------------------------------------------------------------------
## group: Other
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 185 131144.9 39853.6 132622 131689.5 48937.66 60390 199801 139411 -0.09 -1.21 2930.09
Parameter estimates:
The minimum expected post-MBA salary is 60,021 USD and the maximum is 199,999 USD.
The average expected post-MBA salary is 130,292.9 USD.
Half of the students expect their post-MBA salary to be at most 130,085.5 USD (the median), while the other half expect it to be higher.
Here I see that there are three gender groups to observe: "Male", "Female" and "Other". Since there are more than two groups, I will use the one-way ANOVA (parametric) and the Kruskal-Wallis rank sum test (non-parametric).
The assumptions for the one-way ANOVA are:
- the analyzed variable is numeric,
- the variable is normally distributed in all populations,
- homoskedasticity (equal variances across the groups).
My null hypothesis is that all three means are the same. H1 is that at least one is different.
For the Levene test, our null hypothesis is homoskedasticity; H1 is heteroskedasticity.
library(car)
leveneTest(mydata$Expected.Post.MBA.Salary, group = mydata$Gender)
## Warning in leveneTest.default(mydata$Expected.Post.MBA.Salary, group = mydata$Gender):
## mydata$Gender coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.0728 0.9298
## 4497
The p-value (0.9298) is well above 0.05, so we cannot reject H0; we may assume homoskedasticity.
For the Shapiro-Wilk test, our null hypothesis is that Expected Post MBA Salary is normally distributed within each group; H1 is that it is not.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
  group_by(Gender) %>%
  shapiro_test(Expected.Post.MBA.Salary)
## # A tibble: 3 × 4
## Gender variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Female Expected.Post.MBA.Salary 0.955 2.24e-24
## 2 Male Expected.Post.MBA.Salary 0.955 3.11e-26
## 3 Other Expected.Post.MBA.Salary 0.954 1.05e- 5
We can reject H0 for all genders at p < 0.001: the expected salaries are not normally distributed in any group. Since the normality assumption of the one-way ANOVA is therefore violated, the non-parametric Kruskal-Wallis test is the more suitable test for this case.
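For completeness, the parametric one-way ANOVA itself could be run as follows (a minimal sketch; its output is omitted here, since with the normality assumption violated the non-parametric result below is the one to report):
anova_fit <- aov(Expected.Post.MBA.Salary ~ Gender, data = mydata)
summary(anova_fit)
# If Levene's test had indicated heteroskedasticity, Welch's version would be safer:
# oneway.test(Expected.Post.MBA.Salary ~ Gender, data = mydata, var.equal = FALSE)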
The null hypothesis of the Kruskal-Wallis test is that the location of the distribution of Expected Post MBA Salary is the same for all three groups; H1 is that at least one differs.
kruskal.test(Expected.Post.MBA.Salary ~ Gender,
             data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Expected.Post.MBA.Salary by Gender
## Kruskal-Wallis chi-squared = 3.4757, df = 2, p-value = 0.1759
We cannot reject H0, because the p-value is 0.1759; therefore we cannot say that the location of the distribution of Expected Post MBA Salary differs between the three genders.
kruskal_effsize(Expected.Post.MBA.Salary ~ Gender,
                data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Expected.Post.MBA.Salary 4500 0.000328 eta2[H] small
The effect size tells us how large the differences in Expected Post MBA Salary between the three gender groups actually are: with eta2[H] = 0.000328 they are small (indeed negligible).
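As a sanity check, the effect size can be reproduced by hand from the Kruskal-Wallis statistic above, using eta2[H] = (H - k + 1) / (n - k) with H = 3.4757, k = 3 groups and n = 4500:
(3.4757 - 3 + 1) / (4500 - 3)  # = 0.000328, matching the rstatix output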
Now we perform pairwise Wilcoxon rank sum tests as post-hoc comparisons, with Bonferroni correction for multiple testing.
library(rstatix)
groups_nonpar <- wilcox_test(Expected.Post.MBA.Salary ~ Gender,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)
groups_nonpar
## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Expected.Post.MBA.Salary Female Male 1995 2320 2389239 0.066 0.198 ns
## 2 Expected.Post.MBA.Salary Female Other 1995 185 185427 0.914 1 ns
## 3 Expected.Post.MBA.Salary Male Other 2320 185 208562. 0.524 1 ns
We cannot reject H0 for any pairwise comparison of the genders, because all adjusted p-values are well above 0.05; we cannot say that there is a difference in Expected Post MBA Salary between any pair of groups. The differences are not statistically significant.
library(dplyr)
library(rstatix)
library(ggpubr)
# Perform Wilcoxon test
pwc <- mydata %>%
  wilcox_test(Expected.Post.MBA.Salary ~ Gender,
              paired = FALSE,
              p.adjust.method = "bonferroni") %>%
  add_y_position(fun = "median", step.increase = 0.35)
# Perform Kruskal-Wallis test
Kruskal_results <- kruskal_test(Expected.Post.MBA.Salary ~ Gender,
                                data = mydata)
# Plot
ggboxplot(mydata, x = "Gender", y = "Expected.Post.MBA.Salary", add = "point") +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4, color = "darkred") +
  labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
       caption = get_pwc_label(pwc))
We found that the distribution of Expected Post MBA Salaries does not differ between the three genders (Male, Female and Other), Kruskal-Wallis chi-squared(2) = 3.48, p = 0.18, and the effect size was small. Post-hoc tests revealed no statistically significant difference between any pair of groups. To answer the research question: based on these data we cannot conclude that the gender of a student affects their expected post-MBA salary.
Write the second research question. Using two numerical variables from your dataset, calculate the appropriate correlation coefficient and explain it. Justify your decision. Perform the appropriate statistical test and interpret the result obtained. Answer your research question clearly.
The imported data set recorded GRE/GMAT scores and Undergraduate GPA. I want to check whether there is a correlation between them, for example whether a student with a higher Undergraduate GPA also scored higher on the GRE/GMAT than a student with a lower Undergraduate GPA.
library(car)
# Scatterplot matrix for Undergraduate GPA Score and GMAT Score
scatterplotMatrix(mydata[, c(5, 10)], smooth = FALSE,
                  main = "Scatterplot matrix for Undergraduate GPA Score and GRE/GMAT Score",
                  diagonal = "histogram")
## Warning in applyDefaults(diagonal, defaults = list(method = "adaptiveDensity"), : unnamed diag
## arguments, will be ignored
correlation <- cor(mydata[, 5], mydata[, 10], use = "complete.obs", method = "pearson")
# Print the correlation result
print(paste("Correlation between Undergraduate GPA Scores and GMAT scores:", correlation))
## [1] "Correlation between Undergraduate GPA Scores and GMAT scores: 0.0151703482617015"
As we can see from the histograms, the grades are not normally distributed. This could result from the fact that MBA students come from different backgrounds and had different undergraduate majors: someone who studied Engineering as an undergraduate will often have a lower GPA than someone who studied Arts. The same does not hold for GRE and GMAT scores; if anything, the opposite is true, as those who studied more rigorous undergraduate subjects will likely score higher on the GMAT or GRE than those who did not.
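That visual impression can be checked formally with Shapiro-Wilk tests (a quick sketch, output omitted; note that with n = 4500 the test is very sensitive to small deviations, so it should be read together with the histograms):
shapiro.test(mydata$Undergraduate.GPA)
shapiro.test(mydata$GRE.GMAT.Score)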
library(GGally)
ggpairs(mydata[, c(5, 10)],
        title = "Scatterplot matrix for Undergraduate GPA Score and GRE/GMAT Score")
Here we can see that the Pearson correlation coefficient is very small (r ≈ 0.0152), so there is practically no linear association between Undergraduate GPA and GRE/GMAT Score.
library(Hmisc)
correlation_result <- rcorr(as.matrix(mydata[, c(5, 10)]),
                            type = "pearson")
print(correlation_result$r)
## Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA 1.00000000 0.01517035
## GRE.GMAT.Score 0.01517035 1.00000000
print(correlation_result$P)
## Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA NA 0.3089471
## GRE.GMAT.Score 0.3089471 NA
Here we can see that p = 0.3089471, which means we cannot reject H0 (the null hypothesis that the true correlation is zero); we cannot say that there is a significant correlation.
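The same hypothesis can also be tested with base R's cor.test(), which additionally reports a 95% confidence interval for Pearson's r (a sketch using the column names shown above; output omitted, but its p-value should match the rcorr() result):
cor.test(mydata$Undergraduate.GPA, mydata$GRE.GMAT.Score, method = "pearson")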
library(Hmisc)
correlation_result <- rcorr(as.matrix(mydata[, c(5, 10)]),
                            type = "spearman")
print(correlation_result$r)
## Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA 1.00000000 0.01469981
## GRE.GMAT.Score 0.01469981 1.00000000
print(correlation_result$P)
## Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA NA 0.3241958
## GRE.GMAT.Score 0.3241958 NA
The Spearman correlation is the appropriate choice when the variables are ordinal or, as here, not normally distributed, since it is rank-based and does not assume normality. Even so, the correlation coefficient is 0.0147, which suggests that there is no meaningful monotonic relationship between Undergraduate GPA and GRE/GMAT score. That is further backed by the p-value of 0.3242, which indicates that the result is not statistically significant: the observed weak correlation could easily have occurred by chance.
In the end, Pearson and Spearman lead to the same conclusion. Given the non-normality seen above, Spearman is the more defensible coefficient for this case, but the answer to the research question is the same either way: there is no significant correlation between Undergraduate GPA and GRE/GMAT score.
Write the third research question. Using two categorical variables, perform the Pearson Chi2 test. Make sure that the necessary assumptions are met. Write down the null hypothesis and the alternative hypothesis as well as your findings based on the p-value of the test. Show empirical and theoretical frequencies and explain them. Also calculate the standardized residuals and interpret them. Calculate the effect size. Answer your research question clearly.
The imported data set recorded each student's Undergraduate Major and whether they have management experience. I want to check whether the undergraduate major a person chose is associated with whether they have management experience.
First, I have to create a table to summarize the data by Undergraduate Major and Management Experience.
library(dplyr)
undergrad_major_mgmt_experience <- mydata %>%
  group_by(Undergraduate.Major, Has.Management.Experience) %>%
  summarise(Count = n(), .groups = "drop")
print(undergrad_major_mgmt_experience)
## # A tibble: 10 × 3
## Undergraduate.Major Has.Management.Experience Count
## <chr> <chr> <int>
## 1 Arts No 552
## 2 Arts Yes 332
## 3 Business No 565
## 4 Business Yes 343
## 5 Economics No 517
## 6 Economics Yes 370
## 7 Engineering No 518
## 8 Engineering Yes 351
## 9 Science No 551
## 10 Science Yes 401
Now I will perform the Pearson chi-squared test to see if there is an association between the two categorical variables. Since I do not have a 2x2 table, I will not use Yates' continuity correction.
The assumptions are:
- the observations are independent of each other,
- all expected frequencies are greater than 5.
Both assumptions are met; the expected-frequency condition can be verified directly, as sketched below.
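A minimal sketch of that check (the smallest expected cell frequency must exceed 5; the full expected table is shown after the test):
tab <- table(mydata$Undergraduate.Major, mydata$Has.Management.Experience)
min(chisq.test(tab, correct = FALSE)$expected)  # ~347, well above 5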
The null hypothesis is that there is no association and H1 is that there is an association.
results <- chisq.test(mydata$Undergraduate.Major, mydata$Has.Management.Experience,
                      correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mydata$Undergraduate.Major and mydata$Has.Management.Experience
## X-squared = 6.9937, df = 4, p-value = 0.1362
Since p = 0.1362 > 0.05, we cannot reject H0; therefore we cannot state that there is an association between Undergraduate Major and management experience.
addmargins(results$observed)
##
## mydata$Undergraduate.Major No Yes Sum
## Arts 552 332 884
## Business 565 343 908
## Economics 517 370 887
## Engineering 518 351 869
## Science 551 401 952
## Sum 2703 1797 4500
round(results$expected, 2)
##
## mydata$Undergraduate.Major No Yes
## Arts 530.99 353.01
## Business 545.41 362.59
## Economics 532.79 354.21
## Engineering 521.98 347.02
## Science 571.83 380.17
Example of interpretation:
If there were no association between Undergraduate Major and management experience, we would expect 353.01 of the students who studied Arts as undergraduates to have management experience (and 530.99 not to have it).
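Each expected frequency can be reproduced by hand from the margins shown above, as expected = (row total * column total) / n; for the (Arts, Yes) cell:
884 * 1797 / 4500  # = 353.01, matching the expected table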
round(results$residuals, 2)
##
## mydata$Undergraduate.Major No Yes
## Arts 0.91 -1.12
## Business 0.84 -1.03
## Economics -0.68 0.84
## Engineering -0.17 0.21
## Science -0.87 1.07
Here we can see that all standardized residuals are below 1.96 in absolute value, so no individual cell deviates significantly from its expected frequency; we cannot single out any major as over- or under-represented.
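Each standardized (Pearson) residual is (observed - expected) / sqrt(expected); for the (Arts, Yes) cell:
(332 - 353.01) / sqrt(353.01)  # ~ -1.12, matching the table above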
addmargins(round(prop.table(results$observed, 1), 3), 2)
##
## mydata$Undergraduate.Major No Yes Sum
## Arts 0.624 0.376 1.000
## Business 0.622 0.378 1.000
## Economics 0.583 0.417 1.000
## Engineering 0.596 0.404 1.000
## Science 0.579 0.421 1.000
Example of interpretation:
Of the observed students who chose Arts as their Undergraduate Major, 37.6% have management experience and 62.4% do not.
addmargins(round(prop.table(results$observed, 2), 3), 1)
##
## mydata$Undergraduate.Major No Yes
## Arts 0.204 0.185
## Business 0.209 0.191
## Economics 0.191 0.206
## Engineering 0.192 0.195
## Science 0.204 0.223
## Sum 1.000 1.000
Example of interpretation:
Of the observed students who have management experience, 18.5% chose Arts as their Undergraduate Major (among those without management experience, the Arts share is 20.4%).
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$Undergraduate.Major, mydata$Has.Management.Experience)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.03 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.03)
## [1] "tiny"
## (Rules: funder2019)
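The unadjusted Cramer's V can be reproduced by hand as V = sqrt(X^2 / (n * (min(rows, cols) - 1))), with X^2 = 6.9937, n = 4500 and min(rows, cols) = 2:
sqrt(6.9937 / (4500 * (2 - 1)))  # ~ 0.039; the 0.03 reported above is the bias-adjusted version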
We could not reject H0 (p = 0.1362), and since all standardized residuals were below 1.96 in absolute value, no individual cell showed a significant deviation either. Cramer's V = 0.03 indicates a very weak ("tiny") association between the two categorical variables, i.e. the relationship is practically nonexistent. To answer the research question: the data provide no evidence that the choice of Undergraduate Major influences whether a person has management experience.