Homework Assignment 2

Task 1-3: Finding, importing and displaying the data

mydata <- read.table("./loan.csv", 
                     header = TRUE,
                     sep = ",", 
                     quote = "\"") # quote = "\"" makes only the double quote a quoting character, so apostrophes (e.g. in Bachelor's) are read as ordinary text

head(mydata)
##   age gender occupation education_level marital_status income credit_score loan_status
## 1  32   Male   Engineer      Bachelor's        Married  85000          720    Approved
## 2  45 Female    Teacher        Master's         Single  62000          680    Approved
## 3  28   Male    Student     High School         Single  25000          590      Denied
## 4  51 Female    Manager      Bachelor's        Married 105000          780    Approved
## 5  36   Male Accountant      Bachelor's        Married  75000          710    Approved
## 6  24 Female      Nurse     Associate's         Single  48000          640      Denied
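
The effect of the quote argument can be checked on a small inline sample. This is a minimal sketch: the three-row text below is made up for illustration and is not read from loan.csv.

```r
# Hypothetical three-row sample with apostrophes in a text column
csv_text <- "age,education_level\n32,Bachelor's\n45,Master's\n28,High School"

# Only the double quote is a quoting character, so apostrophes stay literal
demo <- read.table(text = csv_text,
                   header = TRUE,
                   sep = ",",
                   quote = "\"")

print(demo)
```

read.csv() already uses quote = "\"" as its default, so it would read such a file without the extra argument.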

 

Task 4: Explaining my data

Unit of the observation:

  • Individual person
     

Sample size:

  • 61 observations
     

Description of the variables:

  1. age
    • The age of the individual
    • Type of variable: Numerical ratio
    • Units: Years
       
  2. gender
    • The gender of the individual, either ‘Male’ or ‘Female’.
    • Type of variable: Categorical nominal
       
  3. occupation
    • The occupation of the individual
    • Type of variable: Categorical nominal
       
  4. education_level
    • The highest level of education attained by the individual, such as ‘High School’, ‘Associate’s’, ‘Bachelor’s’, ‘Master’s’, or ‘Doctoral’.
    • Type of variable: Categorical ordinal
       
  5. marital_status
    • The marital status of the individual, either ‘Single’ or ‘Married’.
    • Type of variable: Categorical nominal
       
  6. income
    • The annual income of the individual
    • Type of variable: Numerical ratio
    • Units: Dollars
       
  7. credit_score
    • The credit score of the individual, ranging from 300 to 850.
    • Type of variable: Numerical interval
       
  8. loan_status
    • The target variable, indicating whether the loan application was ‘Approved’ or ‘Denied’.
    • Type of variable: Categorical nominal
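
The distinction between ordinal and nominal variables can also be encoded in R. The sketch below uses made-up values (not the loan data) and assumes the level order given in the description of education_level.

```r
# Hypothetical values; the level order follows the variable description
education <- c("Bachelor's", "High School", "Master's", "Associate's", "Doctoral")

education_ord <- factor(education,
                        levels = c("High School", "Associate's", "Bachelor's",
                                   "Master's", "Doctoral"),
                        ordered = TRUE)

# Ordered factors support comparisons that respect the level order
at_least_bachelor <- education_ord >= "Bachelor's"
print(at_least_bachelor)

# Nominal variables such as gender stay unordered
gender <- factor(c("Male", "Female", "Female"))
print(is.ordered(gender))
```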
       

Task 5: Source of data

Source: Mandala, S. K. (n.d.). Simple Loan Classification Dataset. Kaggle. Retrieved 24 March 2025, from https://www.kaggle.com/datasets/sujithmandala/simple-loan-classification-dataset
 

Task 6: Data manipulation

 

6.1 Renaming variables, dropping all N/As, adding IDs and removing redundant columns

# install.packages("dplyr")
# install.packages("tidyr")
library(dplyr)
library(tidyr)
mydata <- mydata %>%
  rename(Income_USD = income, Age = age,
         Gender = gender, Occupation = occupation,
         Education_level = education_level, Marital_status = marital_status,
         Credit_score = credit_score, Loan_status = loan_status) %>%
  drop_na()

# Adding a new variable called "ID"
mydata$ID <- seq_len(nrow(mydata))  # One ID per remaining row (61 here)
mydata <- mydata[ , c(9, 1:8)]  # Moves the new ID column (column 9) to the front

# Removing columns that are not important for the future research question
mydata <- mydata[ , -c(2, 4:6, 8:9)]

head(mydata)
##   ID Gender Income_USD
## 1  1   Male      85000
## 2  2 Female      62000
## 3  3   Male      25000
## 4  4 Female     105000
## 5  5   Male      75000
## 6  6 Female      48000
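
Positional indices such as -c(2, 4:6, 8:9) break silently if the column order changes. As a hedged alternative, the same selection can be written with column names. The two-row data frame below is a hypothetical stand-in for mydata.

```r
# Hypothetical two-row stand-in with some of the renamed columns plus ID
toy <- data.frame(Age = c(32, 45),
                  Gender = c("Male", "Female"),
                  Occupation = c("Engineer", "Teacher"),
                  Income_USD = c(85000, 62000),
                  ID = 1:2)

# Keep the three columns of interest by name, in the desired order
toy <- toy[ , c("ID", "Gender", "Income_USD")]
print(toy)
```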

 

6.2 Descriptive statistics

library(psych)
describeBy(mydata$Income_USD, group = mydata$Gender) # ID is left out: descriptive statistics are meaningless for an identifier
## 
##  Descriptive statistics by group 
## group: Female
##    vars  n     mean       sd median  trimmed     mad   min    max  range skew kurtosis      se
## X1    1 30 77233.33 40331.97  66500 71958.33 34841.1 28000 180000 152000 0.97     0.19 7363.58
## ------------------------------------------------------------------------------------------------------------------------------------------------------ 
## group: Male
##    vars  n     mean       sd median trimmed   mad   min    max  range  skew kurtosis      se
## X1    1 31 80677.42 26507.09  85000   81640 22239 25000 130000 105000 -0.34    -0.68 4760.81

Comparing income between males and females in the sample shows that:

  • On average, males in the sample earn slightly more ($80,677.42) than females ($77,233.33). However, statistical significance should be tested before drawing strong conclusions
     
  • The standard deviation (SD), which measures income variability, is higher among females ($40,331.97) than males ($26,507.09). This suggests that female incomes are more widely spread out in the sample
     
  • These unequal sample variances may violate the homogeneity-of-variance assumption of the standard (Student's) t-test; if a formal test confirms this, Welch’s correction can be applied
     
  • The skewness values show that the female income distribution is moderately positively skewed (0.97): most females earn below the mean, and a few high earners pull the mean ($77,233.33) above the median ($66,500). Male income is slightly negatively skewed (−0.34), indicating a few lower earners pulling the mean ($80,677.42) below the median ($85,000)
     
  • Overall, skewness and unequal variances indicate that normality and homogeneity assumptions may not hold, and should be considered when choosing statistical tests
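
Two of the reported columns can be cross-checked from the summary statistics alone. The sketch below copies the values from the describeBy() output; the data frame and column names are chosen here for illustration. It recomputes the standard error as sd / sqrt(n) and uses the sign of mean minus median as a rough indicator of skew direction.

```r
# Summary statistics copied from the describeBy() output
stats <- data.frame(group  = c("Female", "Male"),
                    n      = c(30, 31),
                    mean   = c(77233.33, 80677.42),
                    sd     = c(40331.97, 26507.09),
                    median = c(66500, 85000))

# Standard error of the mean: sd / sqrt(n)
stats$se <- stats$sd / sqrt(stats$n)

# Mean above the median hints at a right tail, mean below it at a left tail
stats$mean_minus_median <- stats$mean - stats$median

print(stats)
```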
     

Task 7: Hypothesis testing

 

Research question: Is there a statistically significant difference in average income between males and females?

 

  • Null hypothesis (H0): There is no difference in average income between males and females
  • Alternative hypothesis (H1): There is a difference in average income between males and females
     

7.1: Parametric test (Independent Samples T-Test)

  • The hypothesis concerns the difference between two population arithmetic means, and each unit is measured once
  • Thus, the analysis will proceed with the Independent Samples T-Test
     

Required assumptions:

  1. Variable is numeric (true)
  2. Normality: The distribution of the variable is normal in both populations (should be tested)
  3. The data must come from two independent populations (assumed to be true)
  4. Variable has the same variance in both populations (should be tested)
     

7.1.1: Testing normality (Graphs and Shapiro-Wilk Test)

# install.packages("ggplot2")
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = Income_USD)) +
          geom_histogram(binwidth = 15000, colour="gray") +
  facet_wrap(~Gender, ncol = 1) +
  ylab("Frequency") +
  xlab("Annual Income in USD")

 
Histogram interpretation:

  • Female income appears more right-skewed, with many individuals earning lower incomes and a few high-income outliers
     
  • Male income is more symmetrical, and may be normally distributed
     
  • These patterns suggest that normality assumptions may be violated, particularly for the female group, and should be confirmed with statistical tests (e.g., Shapiro–Wilk test or Q-Q plots)
     
# install.packages("ggpubr")
library(ggpubr)
ggqqplot(mydata,
         "Income_USD",
         facet.by = "Gender")

 

Q-Q Plot Interpretation:

  • The female income distribution shows noticeable deviations from the diagonal (one point falls outside the gray confidence band and several others lie rather far from the line), suggesting that normality may be violated
  • The male income distribution closely follows the diagonal line, indicating that the data is approximately normally distributed
     
# install.packages("rstatix")
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(dplyr)

mydata %>%
  group_by(Gender) %>%
  shapiro_test(Income_USD)
## # A tibble: 2 × 4
##   Gender variable   statistic       p
##   <chr>  <chr>          <dbl>   <dbl>
## 1 Female Income_USD     0.903 0.00969
## 2 Male   Income_USD     0.976 0.707

 

Hypotheses for Shapiro-Wilk Test:

  • Null hypothesis (H0): Annual income in USD is normally distributed for male/female
  • Alternative hypothesis (H1): Annual income in USD is NOT normally distributed for male/female
     

Conclusions:

  • The null hypothesis for the female group is rejected (p = 0.0097 < 0.01), indicating that annual income in USD is not normally distributed for females
     
  • For the male group, the null hypothesis is not rejected (p = 0.71), so there is no evidence that male income departs from a normal distribution
     
  • Therefore, a non-parametric test is more appropriate because normality is violated (see Section 7.2). However, the task requires carrying out both the parametric test and the corresponding non-parametric test in full, so the next step tests whether the variable has the same variance in both populations
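
The per-group Shapiro-Wilk test can also be run with base R only, which may help when rstatix is not installed. This is a sketch on simulated data: the seed, sample sizes and distribution parameters are arbitrary assumptions, so the resulting numbers are not comparable to the output above.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated incomes: one right-skewed group and one roughly symmetric group
sim <- data.frame(
  Gender = rep(c("Female", "Male"), each = 30),
  Income_USD = c(rlnorm(30, meanlog = 11, sdlog = 0.6),
                 rnorm(30, mean = 80000, sd = 25000))
)

# Run shapiro.test() separately within each group
results <- lapply(split(sim$Income_USD, sim$Gender), shapiro.test)

# Collect the W statistic and p-value per group into one table
shapiro_tab <- data.frame(
  Gender = names(results),
  W = sapply(results, function(x) unname(x$statistic)),
  p = sapply(results, function(x) x$p.value)
)
print(shapiro_tab)
```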
     

7.1.2: Testing homogeneity of variance (Levene’s Test)

# install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(mydata$Income_USD, group = mydata$Gender)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  3.0923 0.08385 .
##       59                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 
Hypotheses for Levene's Test:

  • Null hypothesis (H0): The variance of income is the same in the male and female populations
  • Alternative hypothesis (H1): The variance of income differs between the male and female populations
     

Conclusions:

  • Since the p-value from Levene’s Test (p = 0.084) is greater than 0.05, we fail to reject the null hypothesis
     
  • This means that there is no statistically significant difference in income variances between males and females
     
  • Therefore, there is no evidence that the assumption of equal variances is violated
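
With center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each value from its group median (the Brown-Forsythe variant). The sketch below reproduces that logic in base R on made-up toy data; the variable names are illustrative only.

```r
# Hypothetical toy data: two groups with visibly different spread
income <- c(40, 45, 50, 55, 60,   10, 30, 50, 70, 90)
group  <- factor(rep(c("A", "B"), each = 5))

# Absolute deviation of each value from its own group median
abs_dev <- abs(income - ave(income, group, FUN = median))

# One-way ANOVA on the absolute deviations gives the Levene F statistic
levene_tab <- anova(lm(abs_dev ~ group))
print(levene_tab)
```

A large F value means the typical distance from the group median differs between groups, which is evidence of unequal variances.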
     

7.1.3: Independent Samples T-Test

  • Because the normality assumption is violated, this test should not strictly be used. However, the task requires me to “show my knowledge”, so the test is performed anyway
t.test(mydata$Income_USD ~ mydata$Gender,
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  mydata$Income_USD by mydata$Gender
## t = -0.39538, df = 59, p-value = 0.694
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -20874.26  13986.09
## sample estimates:
## mean in group Female   mean in group Male 
##             77233.33             80677.42

 
Hypotheses for Independent Samples T-Test:

  • Null hypothesis (H0): There is no difference in average income between the male and female populations
  • Alternative hypothesis (H1): There is a difference in average income between the male and female populations
     

Conclusion:

  • The null hypothesis cannot be rejected (p-value is greater than 0.05)
  • The sample provides no evidence of a statistically significant difference in average income between males and females
     
  • For further precision, the effect size could be calculated
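
The t statistic, p-value and confidence interval can be reproduced from the group summaries alone, which makes the role of the pooled variance explicit. This is a sketch using the rounded values from the describeBy() output in Section 6.2, so results may differ from the t.test() output in the last decimals; the variable names are chosen here for illustration.

```r
# Summary statistics copied from the describeBy() output in Section 6.2
n_f <- 30; mean_f <- 77233.33; sd_f <- 40331.97
n_m <- 31; mean_m <- 80677.42; sd_m <- 26507.09

# Pooled variance: each group's variance weighted by its degrees of freedom
dof <- n_f + n_m - 2
var_pooled <- ((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / dof

# Standard error of the difference in means, then the t statistic
se_diff <- sqrt(var_pooled * (1 / n_f + 1 / n_m))
t_stat <- (mean_f - mean_m) / se_diff

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), dof)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, dof) * se_diff

print(c(t = t_stat, df = dof, p = p_value))
print(ci)
```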
     

7.1.4: Effect size (Cohen’s d - Sawilowsky, 2009)

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata$Income_USD ~ mydata$Gender,
                     pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## -0.10     | [-0.60, 0.40]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.10, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)

 

  • The difference in average annual income between males and females is very small
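
As a rough cross-check, Cohen's d can be recomputed from the group summaries. The sketch assumes that the un-pooled denominator is the square root of the average of the two group variances, and it uses the rounded values from Section 6.2.

```r
# Summary statistics copied from the describeBy() output in Section 6.2
mean_f <- 77233.33; sd_f <- 40331.97
mean_m <- 80677.42; sd_m <- 26507.09

# Un-pooled SD (assumption: root of the mean of the two variances)
sd_unpooled <- sqrt((sd_f^2 + sd_m^2) / 2)

# Standardized mean difference, Female minus Male
d <- (mean_f - mean_m) / sd_unpooled
print(round(d, 2))
```

The sign only reflects the group order (Female minus Male); the interpretation rules are applied to the absolute value.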
     

Conclusion for the parametric test:

  • Based on the sample data, we found no statistically significant difference in average annual income between men and women (p > 0.05), and the difference itself is very small (effect size |d| = 0.10)
     

  • However, this result may be unreliable because the normality assumption is violated
     

  • Therefore, the corresponding non-parametric test (Wilcoxon Rank Sum Test) should be performed
     

7.2: Non-parametric test (Wilcoxon Rank Sum Test)

  • In this case, the corresponding non-parametric test would be Wilcoxon Rank Sum Test
wilcox.test(mydata$Income_USD ~ mydata$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Income_USD by mydata$Gender
## W = 392.5, p-value = 0.2954
## alternative hypothesis: true location shift is not equal to 0

 

Hypotheses for Wilcoxon Rank Sum Test:

  • Null hypothesis (H0): The distribution location of annual income is the same for males and females
  • Alternative hypothesis (H1): The distribution location of annual income differs between males and females
     

Conclusion:

  • The null hypothesis cannot be rejected (p-value is greater than 0.05)
     

  • This suggests that the location of the annual income distribution does not differ significantly between males and females
     

  • For further precision, the effect size could be calculated
     

7.2.1: Effect size (Rank-Biserial Correlation - Funder, 2019)

# install.packages("effectsize")
library(effectsize)
effectsize(wilcox.test(mydata$Income_USD ~ mydata$Gender,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.16             | [-0.42, 0.13]
interpret_rank_biserial(0.16)
## [1] "small"
## (Rules: funder2019)

 

  • The effect size is small (|r| = 0.16)
     

  • The locations of the annual income distributions differ only slightly between males and females
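
The rank-biserial correlation can be recovered directly from the W statistic and the group sizes, using r = 2W / (n1 * n2) - 1. The sketch below plugs in the values reported in Sections 6.2 and 7.2; the variable names are illustrative.

```r
# W from the wilcox.test() output; group sizes from the describeBy() output
W   <- 392.5
n_f <- 30
n_m <- 31

# Share of female-male pairs in which the female income is higher
# (ties count as one half)
prop_f_higher <- W / (n_f * n_m)

# Rank-biserial correlation
r_rb <- 2 * prop_f_higher - 1

print(round(c(prop_f_higher = prop_f_higher, r_rb = r_rb), 2))
```

A value near 0 means that in a randomly chosen female-male pair either person is about equally likely to have the higher income.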
     

7.3: Final conclusion

 

  • Based on the sample data, we found no statistically significant difference in annual income between men and women (p > 0.05), and the distribution locations differ only slightly between the two populations (the effect size is small, |r| = 0.16)
     

  • Both the t-test (parametric) and Wilcoxon test (non-parametric) led to similar conclusions
     

  • However, due to the violation of the normality assumption, the non-parametric test is more reliable and appropriate in this context