Question 1

Write the first research question. Perform the statistical hypothesis tests to answer your research question. Perform both the parametric test and the corresponding non-parametric test and explain the results. For the parametric test, check all necessary assumptions. Finally, describe which test (parametric or non-parametric) is more suitable for your particular case and why. Also calculate the effect size and explain it. Finally, answer your research question clearly.

file_path <- "~/Desktop/MVA/HW1/mba_decision_dataset.csv"
mydata <- read.csv(file_path)

#It didn't find the file, so I had to specify it a bit differently.

#mydata <- read.table("./mba_decision_dataset.csv", 
                     #header=TRUE, 
                    # sep=",")

head(mydata)
##   Person.ID Age Gender Undergraduate.Major Undergraduate.GPA Years.of.Work.Experience
## 1         1  27   Male                Arts              3.18                        8
## 2         2  24   Male                Arts              3.03                        4
## 3         3  33 Female            Business              3.66                        9
## 4         4  31   Male         Engineering              2.46                        1
## 5         5  28 Female            Business              2.75                        9
## 6         6  33   Male            Business              3.58                        3
##   Current.Job.Title Annual.Salary..Before.MBA. Has.Management.Experience GRE.GMAT.Score
## 1      Entrepreneur                      90624                        No            688
## 2           Analyst                      53576                       Yes            791
## 3          Engineer                      79796                        No            430
## 4           Manager                     105956                        No            356
## 5      Entrepreneur                      96132                        No            472
## 6           Manager                     101925                        No            409
##   Undergrad.University.Ranking Entrepreneurial.Interest Networking.Importance MBA.Funding.Source
## 1                          185                      7.9                   7.6               Loan
## 2                          405                      3.8                   4.1               Loan
## 3                          107                      6.7                   5.5        Scholarship
## 4                          257                      1.0                   5.3               Loan
## 5                          338                      9.5                   4.9               Loan
## 6                          280                      3.4                   7.1        Scholarship
##   Desired.Post.MBA.Role Expected.Post.MBA.Salary Location.Preference..Post.MBA.    Reason.for.MBA
## 1       Finance Manager                   156165                  International  Entrepreneurship
## 2       Startup Founder                   165612                  International     Career Growth
## 3            Consultant                   122248                       Domestic Skill Enhancement
## 4            Consultant                   123797                  International  Entrepreneurship
## 5            Consultant                   197509                       Domestic Skill Enhancement
## 6    Marketing Director                    99591                  International        Networking
##   Online.vs..On.Campus.MBA Decided.to.Pursue.MBA.
## 1                On-Campus                    Yes
## 2                   Online                     No
## 3                   Online                     No
## 4                On-Campus                     No
## 5                   Online                    Yes
## 6                On-Campus                     No

The unit of observation is one MBA student; the data set contains 4500 students.

Source of data: https://www.kaggle.com/datasets/ashaychoudhary/dataset-mba-decision-after-bachelors/data

Description:

Research question:

The imported data set records the expected post-MBA salary of every student. Can I conclude that the expected salaries are the same across genders, in other words that a student's gender does not affect their expected post-MBA salary?

library(psych)
psych::describe(mydata$Expected.Post.MBA.Salary)
##    vars    n     mean       sd   median  trimmed      mad   min    max  range skew kurtosis     se
## X1    1 4500 130292.9 40384.83 130085.5 130271.4 51845.04 60021 199999 139978 0.01     -1.2 602.02
describeBy(x = mydata$Expected.Post.MBA.Salary, group = mydata$Gender)
## 
##  Descriptive statistics by group 
## group: Female
##    vars    n   mean       sd median  trimmed      mad   min    max  range  skew kurtosis     se
## X1    1 1995 131479 40491.51 131817 131742.1 52408.43 60043 199940 139897 -0.04    -1.19 906.55
## --------------------------------------------------------------------------- 
## group: Male
##    vars    n   mean       sd   median  trimmed      mad   min    max  range skew kurtosis     se
## X1    1 2320 129205 40321.86 128373.5 128894.5 51501.82 60021 199999 139978 0.06    -1.19 837.14
## --------------------------------------------------------------------------- 
## group: Other
##    vars   n     mean      sd median  trimmed      mad   min    max  range  skew kurtosis      se
## X1    1 185 131144.9 39853.6 132622 131689.5 48937.66 60390 199801 139411 -0.09    -1.21 2930.09

Parameter estimations:

  • The minimal Expected post MBA Salary is 60021 and the maximal is 199999.

  • The average Expected post MBA Salary is 130292.9.

  • Half of the students expect a post-MBA salary of 130085.5 or less, while the other half expect a higher one.

Here I see that there are three gender groups to compare: “Male”, “Female” and “Other”. With more than two independent groups, I will use the one-way ANOVA as the parametric test and the Kruskal-Wallis rank sum test as its non-parametric counterpart.

The assumptions for the one-way ANOVA are:

  • the analyzed variable is numeric,

  • the variable is normally distributed in all populations,

  • homoskedasticity (equal variances across the groups).

My null hypothesis is that all three means are the same. H1 is that at least one is different.

Levene test

For the Levene test, our null hypothesis is homoskedasticity, H1 is heteroskedasticity.

library(car)
leveneTest(mydata$Expected.Post.MBA.Salary, group = mydata$Gender)
## Warning in leveneTest.default(mydata$Expected.Post.MBA.Salary, group = mydata$Gender):
## mydata$Gender coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    2  0.0728 0.9298
##       4497

The p-value (0.93) is well above 0.05, so we cannot reject H0: the homoskedasticity assumption is not violated.
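To make the Levene result less of a black box, here is a minimal base-R sketch of what the median-centred version of the test does: it runs a one-way ANOVA on the absolute deviations from each group's median. The data frame `toy` and its columns are simulated and purely hypothetical; they only stand in for `mydata` so that the sketch runs on its own.

```r
# Median-centred Levene test (Brown-Forsythe variant), written out in base R.
# `toy` is simulated, hypothetical data; it is NOT the MBA data set.
set.seed(1)
toy <- data.frame(
  salary = c(rnorm(50, 130000, 40000),
             rnorm(60, 129000, 40000),
             rnorm(20, 131000, 40000)),
  gender = factor(rep(c("Female", "Male", "Other"), times = c(50, 60, 20)))
)

# Absolute deviation of each observation from its own group median
group_median <- ave(toy$salary, toy$gender, FUN = median)
toy$abs_dev  <- abs(toy$salary - group_median)

# One-way ANOVA on the absolute deviations: a large F suggests unequal spread
levene_fit <- anova(lm(abs_dev ~ gender, data = toy))
levene_F   <- levene_fit[["F value"]][1]
levene_p   <- levene_fit[["Pr(>F)"]][1]
print(levene_fit)
```

With three groups the first degrees-of-freedom entry is 2, the same as in the `leveneTest()` output above.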

Shapiro test

For the Shapiro test, our null hypothesis is that Expected post MBA Salary is normally distributed in all groups, H1 is that it is not normally distributed.

library(dplyr)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(Gender) %>%  
  shapiro_test(Expected.Post.MBA.Salary)
## # A tibble: 3 × 4
##   Gender variable                 statistic        p
##   <chr>  <chr>                        <dbl>    <dbl>
## 1 Female Expected.Post.MBA.Salary     0.955 2.24e-24
## 2 Male   Expected.Post.MBA.Salary     0.955 3.11e-26
## 3 Other  Expected.Post.MBA.Salary     0.954 1.05e- 5

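We can reject H0 for all genders (p < 0.001): the expected salaries are not normally distributed in any group. Because the normality assumption is violated, the non-parametric Kruskal-Wallis test is the more suitable test here.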
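The text names the one-way ANOVA as the parametric test, but only its assumptions are checked in the output. A minimal sketch of how the parametric test itself and its effect size (eta squared) could be computed is shown here. The data frame `toy` is simulated and hypothetical; with the real data one would presumably replace it with `mydata`, `Expected.Post.MBA.Salary` and `Gender`.

```r
# One-way ANOVA and eta squared on simulated, hypothetical data
# (NOT the MBA data set, so the numbers printed here mean nothing for it).
set.seed(2)
toy <- data.frame(
  salary = c(rnorm(50, 130000, 40000),
             rnorm(60, 129000, 40000),
             rnorm(20, 131000, 40000)),
  gender = factor(rep(c("Female", "Male", "Other"), times = c(50, 60, 20)))
)

# Classical one-way ANOVA (assumes normality and equal variances)
fit <- aov(salary ~ gender, data = toy)
print(summary(fit))

# Welch's ANOVA, which drops the equal-variance assumption
welch <- oneway.test(salary ~ gender, data = toy, var.equal = FALSE)
print(welch)

# Effect size: eta squared = between-group sum of squares / total sum of squares
ss     <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
print(eta_sq)
```

Eta squared always lies between 0 and 1 and is read as the share of variance in the outcome explained by group membership.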

Kruskal-Wallis test

The null hypothesis is that the distribution location of Expected Post MBA Salary is the same for all three groups, and H1 is that at least one differs.

kruskal.test(Expected.Post.MBA.Salary ~ Gender, 
             data = mydata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Expected.Post.MBA.Salary by Gender
## Kruskal-Wallis chi-squared = 3.4757, df = 2, p-value = 0.1759

We cannot reject H0, because the p-value is 0.1759. Therefore we cannot say that the distribution location of the Expected Post MBA Salary differs between the three genders.
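As a quick check on the reported p-value, the Kruskal-Wallis statistic is compared against a chi-squared distribution with k - 1 degrees of freedom, where k is the number of groups. The sketch below only re-uses the statistic printed in the output above; the variable names are my own.

```r
# Recompute the Kruskal-Wallis p-value from the reported test statistic
H <- 3.4757   # Kruskal-Wallis chi-squared from the output above
k <- 3        # number of gender groups

# Upper-tail probability of a chi-squared distribution with k - 1 df
p_value <- pchisq(H, df = k - 1, lower.tail = FALSE)
print(round(p_value, 4))
```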

kruskal_effsize(Expected.Post.MBA.Salary ~ Gender, 
                data = mydata)
## # A tibble: 1 × 5
##   .y.                          n  effsize method  magnitude
## * <chr>                    <int>    <dbl> <chr>   <ord>    
## 1 Expected.Post.MBA.Salary  4500 0.000328 eta2[H] small

The effect size eta2[H] = 0.000328 tells us how large the differences in Expected Post MBA Salary between the three genders are: they are very small, well inside the lowest category (“small”).
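The eta2[H] value can be reproduced by hand from the Kruskal-Wallis statistic with the formula eta2[H] = (H - k + 1) / (n - k), where H is the test statistic, k the number of groups and n the total sample size. A short sketch using only the numbers printed above:

```r
# Eta squared based on the Kruskal-Wallis H statistic
H <- 3.4757   # Kruskal-Wallis chi-squared from the output above
k <- 3        # number of groups
n <- 4500     # total number of observations

eta2_h <- (H - k + 1) / (n - k)
print(signif(eta2_h, 3))
```

The result agrees with the `kruskal_effsize()` output to the printed precision.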

Wilcoxon test

Now we perform pairwise Wilcoxon rank sum tests with Bonferroni correction as post-hoc tests.

library(rstatix)

groups_nonpar <- wilcox_test(Expected.Post.MBA.Salary ~ Gender,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)

groups_nonpar
## # A tibble: 3 × 9
##   .y.                      group1 group2    n1    n2 statistic     p p.adj p.adj.signif
## * <chr>                    <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>       
## 1 Expected.Post.MBA.Salary Female Male    1995  2320  2389239  0.066 0.198 ns          
## 2 Expected.Post.MBA.Salary Female Other   1995   185   185427  0.914 1     ns          
## 3 Expected.Post.MBA.Salary Male   Other   2320   185   208562. 0.524 1     ns

We cannot reject H0 for any pair of genders because all adjusted p-values are well above 0.05, so we cannot say that there is a difference in Expected Post MBA Salary between any two groups. The differences are not statistically significant.
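The `p.adj` column follows from the raw p-values by the Bonferroni rule: each p-value is multiplied by the number of comparisons (three here) and capped at 1. A minimal sketch with the raw values copied from the table above; the vector names are my own labels.

```r
# Bonferroni adjustment of the three pairwise p-values from the table above
raw_p <- c(Female_Male = 0.066, Female_Other = 0.914, Male_Other = 0.524)

# p.adjust() multiplies by the number of tests and caps the result at 1
adj_p <- p.adjust(raw_p, method = "bonferroni")
print(adj_p)
```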

library(dplyr)
library(rstatix)
library(ggpubr)

# Perform Wilcoxon test
pwc <- mydata %>%
  wilcox_test(Expected.Post.MBA.Salary ~ Gender, 
              paired = FALSE, 
              p.adjust.method = "bonferroni") %>%
  add_y_position(fun = "median", step.increase = 0.35)

# Perform Kruskal-Wallis test
Kruskal_results <- kruskal_test(Expected.Post.MBA.Salary ~ Gender, 
                                data = mydata)

# Plot
ggboxplot(mydata, x = "Gender", y = "Expected.Post.MBA.Salary", add = "point") +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4, color = "darkred") +
  labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
       caption = get_pwc_label(pwc))

Conclusion:

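We found that the distribution of Expected Post MBA Salaries does not differ significantly between the three genders (male, female and other): chi squared = 3.48, p-value = 0.18, and the effect size was very small (eta2[H] = 0.0003). Post-hoc tests revealed no statistically significant difference between any pair of groups. To answer the research question: based on this data, we cannot say that gender affects the expected post-MBA salary.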

Question 2

Write the second research question. Using two numerical variables from your dataset, calculate the appropriate correlation coefficient and explain it. Justify your decision. Perform the appropriate statistical test and interpret the result obtained. Answer your research question clearly.

Research question:

The imported data set records each student's GRE/GMAT Score and Undergraduate GPA. I want to check whether there is a correlation between them, for example, whether a student with a higher Undergraduate GPA scored higher on the GMAT than a student with a lower Undergraduate GPA.

Scatter plot

library(car)

# Scatterplot matrix for Undergraduate GPA Score and GMAT Score
scatterplotMatrix(mydata[, c(5, 10)], smooth = FALSE,
                  main = "Scatterplot matrix for Undergraduate GPA Score and GRE/GMAT Score",
                  diagonal = list(method = "histogram"))

correlation <- cor(mydata[, 5], mydata[, 10], use = "complete.obs", method = "pearson")

# Print the correlation result
print(paste("Correlation between Undergraduate GPA Scores and GMAT scores:", correlation))
## [1] "Correlation between Undergraduate GPA Scores and GMAT scores: 0.0151703482617015"

As we can see from the histograms, neither variable is normally distributed. One possible explanation is that the MBA students come from different backgrounds and had different Undergraduate Majors: for example, those who studied Engineering may have a lower GPA than those in Arts, while students from more rigorous Undergraduate subjects may score higher on the GMAT or GRE. This is only a conjecture and is not tested here.

library(GGally)
ggpairs(mydata[, c(5, 10)],
        title = "Scatterplot matrix for Undergraduate GPA Score and GRE/GMAT Score")

Here we can see that the Pearson correlation coefficient is very small (0.015), so the linear relationship between Undergraduate GPA and GRE/GMAT Score is practically nonexistent.

Pearson Correlation matrix

library(Hmisc)
correlation_result <- rcorr(as.matrix(mydata[, c(5, 10)]), 
                            type = "pearson")

print(correlation_result$r)
##                   Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA        1.00000000     0.01517035
## GRE.GMAT.Score           0.01517035     1.00000000
print(correlation_result$P)
##                   Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA                NA      0.3089471
## GRE.GMAT.Score            0.3089471             NA

Here we can see that p = 0.3089471, so we cannot reject H0 (the correlation coefficient is zero) and we cannot say that there is a significant correlation.
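The p-value for a Pearson correlation comes from a t-test with n - 2 degrees of freedom, where t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below reproduces it from the printed coefficient. It assumes that all 4500 rows are complete for both variables, which the output above does not state explicitly.

```r
# Significance test for a Pearson correlation, computed by hand
r <- 0.01517035   # Pearson coefficient from the output above
n <- 4500         # ASSUMPTION: no missing values in either variable

# Test statistic and two-sided p-value from the t distribution
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 3))
print(round(p_value, 4))
```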

Spearman Correlation matrix for ordinal variables

library(Hmisc)
correlation_result <- rcorr(as.matrix(mydata[, c(5, 10)]), 
                            type = "spearman")

print(correlation_result$r)
##                   Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA        1.00000000     0.01469981
## GRE.GMAT.Score           0.01469981     1.00000000
print(correlation_result$P)
##                   Undergraduate.GPA GRE.GMAT.Score
## Undergraduate.GPA                NA      0.3241958
## GRE.GMAT.Score            0.3241958             NA

Conclusion

The Spearman correlation is appropriate when the variables are ordinal or, as here, not normally distributed. Even then, the correlation coefficient is 0.01469981, which suggests that there is no meaningful monotonic relationship between Undergraduate GPA and GRE or GMAT Score. That is further backed by the p-value = 0.3241958, which indicates that this result is not statistically significant, meaning that the observed weak correlation could easily have occurred by chance.

In this case either Spearman or Pearson leads to the same answer to the research question: there is no significant correlation between Undergraduate GPA and GRE/GMAT Score.
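One way to see why the two coefficients are so close is that Spearman's coefficient is simply Pearson's coefficient computed on the ranks of the data. The sketch below illustrates this on simulated values; `gpa` and `score` are hypothetical vectors, not columns of `mydata`.

```r
# Spearman's rho equals Pearson's r computed on ranks.
# `gpa` and `score` are simulated, hypothetical values (NOT the MBA data).
set.seed(3)
gpa   <- round(runif(200, min = 2, max = 4), 2)
score <- round(runif(200, min = 250, max = 800))

rho_builtin <- cor(gpa, score, method = "spearman")
rho_manual  <- cor(rank(gpa), rank(score), method = "pearson")

print(c(builtin = rho_builtin, manual = rho_manual))
```

Because only ranks enter the calculation, Spearman's coefficient is insensitive to the shape of the distributions, which is why it is the safer choice when normality fails.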

Question 3

Write the third research question. Using two categorical variables, perform the Pearson Chi2 test. Make sure that the necessary assumptions are met. Write down the null hypothesis and the alternative hypothesis as well as your findings based on the p-value of the test. Show empirical and theoretical frequencies and explain them. Also calculate the standardized residuals and interpret them. Calculate the effect size. Answer your research question clearly.

Research question:

The imported data set records each student's Undergraduate Major and whether they have management experience. I want to check whether the Undergraduate Major a person chose is associated with whether that person has management experience or not.

First, I have to create a table to summarize the data by Undergraduate Major and Management Experience.

library(dplyr)

undergrad_major_mgmt_experience <- mydata %>%
  group_by(Undergraduate.Major, Has.Management.Experience) %>%
  summarise(Count = n(), .groups = "drop")

print(undergrad_major_mgmt_experience)
## # A tibble: 10 × 3
##    Undergraduate.Major Has.Management.Experience Count
##    <chr>               <chr>                     <int>
##  1 Arts                No                          552
##  2 Arts                Yes                         332
##  3 Business            No                          565
##  4 Business            Yes                         343
##  5 Economics           No                          517
##  6 Economics           Yes                         370
##  7 Engineering         No                          518
##  8 Engineering         Yes                         351
##  9 Science             No                          551
## 10 Science             Yes                         401

Now I will do the Pearson chi-squared test to see if there is an association between the two categorical variables. I do not have a 2x2 table, so I will not use Yates correction.

The assumptions:

  • The observations are independent of each other

  • All expected frequencies are greater than 5.

Both of the assumptions are met.

The null hypothesis is that there is no association and H1 is that there is an association.

results <- chisq.test(mydata$Undergraduate.Major, mydata$Has.Management.Experience,
                      correct = FALSE)

results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Undergraduate.Major and mydata$Has.Management.Experience
## X-squared = 6.9937, df = 4, p-value = 0.1362

We cannot reject H0 (p = 0.1362), therefore we cannot state that there is an association between Undergraduate Major and having management experience.

addmargins(results$observed)
##                           
## mydata$Undergraduate.Major   No  Yes  Sum
##                Arts         552  332  884
##                Business     565  343  908
##                Economics    517  370  887
##                Engineering  518  351  869
##                Science      551  401  952
##                Sum         2703 1797 4500
round(results$expected, 2)
##                           
## mydata$Undergraduate.Major     No    Yes
##                Arts        530.99 353.01
##                Business    545.41 362.59
##                Economics   532.79 354.21
##                Engineering 521.98 347.02
##                Science     571.83 380.17

Example of interpretation:

If there was no association between Undergraduate Major and management experience, we would expect 530.99 students who studied Arts as undergraduates to not have management experience and 353.01 to have it.
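The expected (theoretical) frequencies and the test statistic can be reproduced from the observed counts alone: each expected count is row total × column total / n, and the chi-squared statistic sums (observed - expected)^2 / expected over all cells. A minimal sketch with the counts copied from the table above; the object names are my own.

```r
# Expected frequencies and Pearson chi-squared statistic, computed by hand
observed <- matrix(
  c(552, 332,
    565, 343,
    517, 370,
    518, 351,
    551, 401),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Major = c("Arts", "Business", "Economics", "Engineering", "Science"),
    Mgmt  = c("No", "Yes")
  )
)

n        <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(round(c(chi_sq = chi_sq, df = df, p_value = p_value), 4))
```

The smallest expected count is far above 5, which is what the second assumption requires.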

round(results$residuals, 2)
##                           
## mydata$Undergraduate.Major    No   Yes
##                Arts         0.91 -1.12
##                Business     0.84 -1.03
##                Economics   -0.68  0.84
##                Engineering -0.17  0.21
##                Science     -0.87  1.07

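Here we can see that all residuals are below 1.96 in absolute value, so no cell deviates significantly from its expected frequency. Note that results$residuals returns Pearson residuals; the standardized residuals are stored in results$stdres.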
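In R, `chisq.test()` returns two kinds of residuals: `$residuals` holds Pearson residuals, (observed - expected) / sqrt(expected), while `$stdres` holds standardized residuals, which additionally divide by the row and column margins and are the ones usually compared with ±1.96. The sketch below shows both for the observed counts copied from the table above; the object names are my own.

```r
# Pearson vs. standardized residuals from chisq.test()
observed <- matrix(
  c(552, 332,
    565, 343,
    517, 370,
    518, 351,
    551, 401),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Major = c("Arts", "Business", "Economics", "Engineering", "Science"),
    Mgmt  = c("No", "Yes")
  )
)

res <- chisq.test(observed, correct = FALSE)

pearson_res <- res$residuals   # (O - E) / sqrt(E)
std_res     <- res$stdres      # Pearson residuals rescaled by the margins

print(round(pearson_res, 2))
print(round(std_res, 2))

# Largest standardized residual in absolute value, to compare with 1.96
print(max(abs(std_res)))
```

Standardized residuals are never smaller in absolute value than Pearson residuals, so they are the stricter of the two checks.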

addmargins(round(prop.table(results$observed, 1), 3), 2)
##                           
## mydata$Undergraduate.Major    No   Yes   Sum
##                Arts        0.624 0.376 1.000
##                Business    0.622 0.378 1.000
##                Economics   0.583 0.417 1.000
##                Engineering 0.596 0.404 1.000
##                Science     0.579 0.421 1.000

Example of interpretation:

Out of the observed students who chose Arts as their Undergraduate Major, 37.6 % have management experience and 62.4 % do not.

addmargins(round(prop.table(results$observed, 2), 3), 1)
##                           
## mydata$Undergraduate.Major    No   Yes
##                Arts        0.204 0.185
##                Business    0.209 0.191
##                Economics   0.191 0.206
##                Engineering 0.192 0.195
##                Science     0.204 0.223
##                Sum         1.000 1.000

Example of interpretation:

Out of the observed students who do not have management experience, 20.4 % chose Arts as their Undergraduate Major.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(mydata$Undergraduate.Major, mydata$Has.Management.Experience)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.03              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.03)
## [1] "tiny"
## (Rules: funder2019)
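Cramér's V can also be computed by hand from the chi-squared statistic. The unadjusted version is V = sqrt(chi2 / (n * (min(r, c) - 1))). The sketch below additionally applies a bias correction; I am assuming this is the Bergsma-type adjustment behind the "(adj.)" label in the output above, which is not confirmed in the text. All variable names are my own.

```r
# Cramer's V from the chi-squared statistic, unadjusted and bias-corrected
chi_sq <- 6.9937   # Pearson chi-squared from the test above
n      <- 4500     # number of observations
r      <- 5        # number of Undergraduate Majors (rows)
k      <- 2        # number of management-experience levels (columns)

# Unadjusted Cramer's V
v_raw <- sqrt(chi_sq / (n * (min(r, k) - 1)))

# Bias-corrected version (ASSUMED to match the adjusted value reported above)
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(round(c(unadjusted = v_raw, adjusted = v_adj), 3))
```

Both versions are far below 0.1, so the interpretation as a tiny effect does not depend on which one is used.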

Conclusion:

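We could not reject H0 (chi squared = 6.99, df = 4, p-value = 0.14), so we cannot say that Undergraduate Major is associated with having management experience. All residuals were below 1.96 in absolute value, so no individual cell stands out either.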

Cramer’s V = 0.03 suggests that there is a very weak association between the two categorical variables, indicating that the relationship between them is almost nonexistent.