Homework Assignment 2

Name: Rostyslav Mykhalchuk
StudentID: 12321015
Course number: 5708

Task 1-3: Finding, importing and displaying the data

mydata <- read.table("./loan.csv", 
                     header = TRUE,
                     sep = ",", 
                     quote = "\"") # Only the double quote is treated as a quote character, so apostrophes in values such as Bachelor's are read literally

head(mydata)

##   age gender occupation education_level marital_status income credit_score loan_status
## 1  32   Male   Engineer      Bachelor's        Married  85000          720    Approved
## 2  45 Female    Teacher        Master's         Single  62000          680    Approved
## 3  28   Male    Student     High School         Single  25000          590      Denied
## 4  51 Female    Manager      Bachelor's        Married 105000          780    Approved
## 5  36   Male Accountant      Bachelor's        Married  75000          710    Approved
## 6  24 Female      Nurse     Associate's         Single  48000          640      Denied
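
The effect of the quote argument can be illustrated with a small sketch. By default, read.table() treats both " and ' as quote characters, so apostrophes in values such as Bachelor's can be misread as opening quotes. The inline text below is a made-up excerpt that only mimics the structure of loan.csv, and the object names demo_table and demo_csv are hypothetical. read.csv() already uses quote = "\"" by default, so it is shown as a shorter alternative.

```r
# Made-up excerpt mimicking the structure of loan.csv (illustration only)
csv_text <- "age,education_level,income
32,Bachelor's,85000
45,Master's,62000
28,High School,25000"

# Treat only the double quote as a quote character, as in the call above
demo_table <- read.table(text = csv_text, header = TRUE, sep = ",", quote = "\"")

# read.csv() uses header = TRUE, sep = "," and quote = "\"" by default
demo_csv <- read.csv(text = csv_text)

nrow(demo_table)             # all three rows are read
demo_table$education_level   # apostrophes are kept as ordinary characters
```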

Task 4: Explaining my data

Unit of the observation:

Individual person

Sample size:

61 observations

Description of the variables:

age
- The age of the individual
- Type of variable: Numerical ratio
- Units: Years
gender
- The gender of the individual, either ‘Male’ or ‘Female’.
- Type of variable: Categorical nominal
occupation
- The occupation of the individual
- Type of variable: Categorical nominal
education_level
- The highest level of education attained by the individual, such as ‘High School’, ‘Associate’s’, ‘Bachelor’s’, ‘Master’s’, or ‘Doctoral’.
- Type of variable: Categorical ordinal
marital_status
- The marital status of the individual, either ‘Single’ or ‘Married’.
- Type of variable: Categorical nominal
income
- The annual income of the individual
- Type of variable: Numerical ratio
- Units: Dollars
credit_score
- The credit score of the individual, ranging from 300 to 850.
- Type of variable: Numerical interval
loan_status
- The target variable, indicating whether the loan application was ‘Approved’ or ‘Denied’.
- Type of variable: Categorical nominal
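
Since education_level is described as ordinal, its order could be made explicit in R with an ordered factor. The following is a minimal sketch: the level order is taken from the description above, while the three example values and the object names are hypothetical and are not rows from the dataset.

```r
# Level order taken from the description of education_level above
education_order <- c("High School", "Associate's", "Bachelor's", "Master's", "Doctoral")

# Hypothetical example values, not rows from the dataset
education <- factor(c("Master's", "High School", "Bachelor's"),
                    levels = education_order,
                    ordered = TRUE)

education < "Bachelor's"   # comparisons respect the level order
max(education)             # highest level present in the vector
```

Nominal variables such as gender or occupation would stay unordered factors (or plain character vectors), because no ranking of their categories is meaningful.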

Task 5: Source of data

Source: Mandala, S. K. (n.d.). Simple Loan Classification Dataset. Kaggle. Retrieved 24 March 2025, from https://www.kaggle.com/datasets/sujithmandala/simple-loan-classification-dataset

Task 6: Data manipulation

6.1 Renaming variables, dropping all N/As, adding IDs and removing redundant columns

# install.packages("dplyr")
# install.packages("tidyr")
library(dplyr)
library(tidyr)
mydata <- mydata %>%
  rename(Income_USD = income,
         Age = age,
         Gender = gender,
         Occupation = occupation,
         Education_level = education_level,
         Marital_status = marital_status,
         Credit_score = credit_score,
         Loan_status = loan_status) %>%
  drop_na()

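# Adding a new variable called "ID"
mydata$ID <- seq_len(nrow(mydata))  # One ID per row left after drop_na(), instead of a hard-coded 1:61
mydata <- mydata[ , c(9, 1:8)]  # Moves column 9 (ID) to the front

# Removing columns that are not important for the future research question
mydata <- mydata[ , -c(2, 4:6, 8:9)]  # Keeps ID, Gender and Income_USD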

head(mydata)

##   ID Gender Income_USD
## 1  1   Male      85000
## 2  2 Female      62000
## 3  3   Male      25000
## 4  4 Female     105000
## 5  5   Male      75000
## 6  6 Female      48000
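
Selecting columns by position depends on the current column order, so the same step could also be written with column names. The sketch below uses a hypothetical data frame called toy that holds the first two rows shown by head(mydata) in Task 1-3. It is only an alternative formulation, not the code used in this assignment.

```r
# First two rows from head(mydata) in Task 1-3, restricted to four columns
toy <- data.frame(Age = c(32, 45),
                  Gender = c("Male", "Female"),
                  Income_USD = c(85000, 62000),
                  Credit_score = c(720, 680))

toy$ID <- seq_len(nrow(toy))                     # one ID per row
toy <- toy[ , c("ID", "Gender", "Income_USD")]   # keep and order columns by name

names(toy)
```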

6.2 Descriptive statistics

library(psych)
describeBy(mydata$Income_USD, group = mydata$Gender) # ID is left out, because descriptive statistics for an identifier are not meaningful

## 
##  Descriptive statistics by group 
## group: Female
##    vars  n     mean       sd median  trimmed     mad   min    max  range skew kurtosis      se
## X1    1 30 77233.33 40331.97  66500 71958.33 34841.1 28000 180000 152000 0.97     0.19 7363.58
## ------------------------------------------------------------------------------------------------------------------------------------------------------ 
## group: Male
##    vars  n     mean       sd median trimmed   mad   min    max  range  skew kurtosis      se
## X1    1 31 80677.42 26507.09  85000   81640 22239 25000 130000 105000 -0.34    -0.68 4760.81

Comparing income between males and females in the given data shows that:

On average, males in the sample earn slightly more ($80,677.42) than females ($77,233.33). However, statistical significance should be tested before drawing strong conclusions
The standard deviation (SD), which measures income variability, is higher among females ($40,331.97) than males ($26,507.09). This suggests that female incomes are more widely spread out in the sample
Such unequal variances may violate the homogeneity of variance assumption of the standard t-test. If a formal test confirms the violation, Welch’s correction can be applied
The skewness values show that the female income distribution is moderately positively skewed (0.97): most females earn below the mean ($77,233.33, compared with a median of $66,500), and a few high earners pull the mean up. Male income is slightly negatively skewed (−0.34): a few lower earners pull the mean ($80,677.42) below the median ($85,000)
Overall, the skewness and the unequal variances indicate that the normality and homogeneity assumptions may not hold, which should be considered when choosing statistical tests
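
The link between the sign of the skewness and the position of the mean relative to the median can be checked with a short sketch. The function skewness() below is a hypothetical helper that uses the plain moment formula, so its values may differ slightly from those of psych, which may apply a small-sample correction. The income values are illustrative and not taken from the dataset.

```r
# Moment-based sample skewness (hypothetical helper)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Illustrative incomes in thousands of USD (not taken from the dataset)
right_skewed <- c(28, 35, 40, 45, 50, 60, 180)
left_skewed  <- 200 - right_skewed   # mirror image of the same values

skewness(right_skewed) > 0                  # long right tail
mean(right_skewed) > median(right_skewed)   # mean pulled above the median

skewness(left_skewed) < 0                   # long left tail
mean(left_skewed) < median(left_skewed)     # mean pulled below the median
```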

Task 7: Hypothesis testing

Research question: Is there a statistically significant difference in average income between males and females?

Null hypothesis (H0): There is no difference in average income between males and females
Alternative hypothesis (H1): There is a difference in average income between males and females

7.1: Parametric test (Independent Samples T-Test)

The hypothesis concerns the difference between two population arithmetic means, and each unit is measured only once
Thus, the analysis proceeds with the Independent Samples T-Test

Required assumptions:

Variable is numeric (true)
Normality: The distribution of the variable is normal in both populations (should be tested)
The data must come from two independent populations (assumed to be true)
Variable has the same variance in both populations (should be tested)

7.1.1: Testing normality (Graphs and Shapiro-Wilk Test)

# install.packages("ggplot2")
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = Income_USD)) +
  geom_histogram(binwidth = 15000, colour = "gray") +
  facet_wrap(~Gender, ncol = 1) +
  ylab("Frequency") +
  xlab("Annual Income in USD")

Histograms interpretation:

Female income appears more right-skewed, with many individuals earning lower incomes and a few high-income outliers
Male income is more symmetrical, and may be normally distributed
These patterns suggest that the normality assumption may be violated, particularly for the female group. This should be checked with Q-Q plots and confirmed with a formal test (e.g., the Shapiro–Wilk test)

# install.packages("ggpubr")
library(ggpubr)
ggqqplot(mydata,
         "Income_USD",
         facet.by = "Gender")

Q-Q Plot Interpretation:

The female income distribution shows noticeable deviations from the diagonal (one point lies outside the gray confidence band and several others are close to its edge), suggesting that normality may be violated
The male income distribution closely follows the diagonal line, indicating that the data is approximately normally distributed

# install.packages("rstatix")
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(dplyr)

mydata %>%
  group_by(Gender) %>%
  shapiro_test(Income_USD)

## # A tibble: 2 × 4
##   Gender variable   statistic       p
##   <chr>  <chr>          <dbl>   <dbl>
## 1 Female Income_USD     0.903 0.00969
## 2 Male   Income_USD     0.976 0.707

Hypotheses for Shapiro-Wilk Test:

Null hypothesis (H0): Annual income in USD is normally distributed for male/female
Alternative hypothesis (H1): Annual income in USD is NOT normally distributed for male/female

Conclusions:

The null hypothesis for the female group is rejected (p = 0.0097 < 0.01), indicating that annual income in USD is not normally distributed for females
For the male group, the null hypothesis is not rejected (p = 0.71), so there is no evidence against a normal distribution of income for males
Because normality is violated in the female group, a non-parametric test is more appropriate (Section 7.2). However, the task requires carrying out both the parametric test and the corresponding non-parametric test in full, so the next step tests whether the variable has the same variance in both populations
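
The same group-wise check can be sketched with base R alone, without rstatix. The data below are simulated for illustration only (one right-skewed and one roughly normal group), and the object names sim and by_group are hypothetical.

```r
set.seed(1)  # simulated data, for illustration only

sim <- data.frame(
  Gender = rep(c("Female", "Male"), each = 30),
  Income_USD = c(rlnorm(30, meanlog = 11, sdlog = 0.5),  # right-skewed group
                 rnorm(30, mean = 80000, sd = 25000))    # roughly normal group
)

# One Shapiro-Wilk test per group, keeping W and the p-value
by_group <- sapply(split(sim$Income_USD, sim$Gender), function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p = res$p.value)
})

round(by_group, 3)
```

A small p-value in a column indicates that normality is rejected for that group, as in the interpretation above.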

7.1.2: Testing homogeneity of variance (Levene’s Test)

# install.packages("car")
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$Income_USD, group = mydata$Gender)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  3.0923 0.08385 .
##       59                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypotheses for Levene's Test:

Null hypothesis (H0): The variance of income is the same in the male and female populations
Alternative hypothesis (H1): The variance of income differs between the male and female populations

Conclusions:

Since the p-value from Levene’s Test (p = 0.084) is greater than 0.05, we fail to reject the null hypothesis
This means that there is no statistically significant difference in income variances between males and females at the 5% level
Therefore, there is no evidence that the assumption of equal variances is violated, although the result is borderline (it would be significant at the 10% level)
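
With center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each value from its group median. The sketch below shows this idea in base R. The function levene_median() is a hypothetical helper, not part of car, and the values are illustrative only.

```r
# Levene's test with the median as center: one-way ANOVA on absolute
# deviations from each group's median (hypothetical helper, not part of car)
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))  # |value - group median|
  anova(lm(z ~ g))
}

# Illustrative values only: group "b" is far more spread out than group "a"
y <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
g <- rep(c("a", "b"), each = 5)

tab <- levene_median(y, g)
tab$Df               # 1 and n - 2 degrees of freedom
tab[["Pr(>F)"]][1]   # p-value for the equality of variances
```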

7.1.3: Independent Samples T-Test

Because the normality assumption is violated, this test should strictly not be used. However, the task requires to “show my knowledge”, so the test is performed anyway

t.test(mydata$Income_USD ~ mydata$Gender,
       var.equal = TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  mydata$Income_USD by mydata$Gender
## t = -0.39538, df = 59, p-value = 0.694
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -20874.26  13986.09
## sample estimates:
## mean in group Female   mean in group Male 
##             77233.33             80677.42

Hypotheses for Independent Samples T-Test:

Null hypothesis (H0): There is no difference in average income between males and females
Alternative hypothesis (H1): There is a difference in average income between males and females

Conclusion:

The null hypothesis cannot be rejected (p = 0.694, which is greater than 0.05)
Based on the given sample, there is no statistically significant difference in average income between males and females
To quantify the size of the difference, the effect size is calculated next
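
As a cross-check, the t statistic, p-value and confidence interval can be recomputed from the group summaries in Section 6.2. This is a minimal sketch under the equal-variance assumption (var.equal = TRUE). The variable names are hypothetical, and the inputs are the rounded values from the describeBy() output, so the last digits may differ slightly from the t.test() output.

```r
# Summary statistics copied from the describeBy() output
n_f <- 30; mean_f <- 77233.33; sd_f <- 40331.97
n_m <- 31; mean_m <- 80677.42; sd_m <- 26507.09

# Pooled standard deviation (equal-variance assumption)
df_pooled <- n_f + n_m - 2
sd_pooled <- sqrt(((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / df_pooled)

# Standard error of the difference, t statistic and two-sided p-value
se_diff <- sd_pooled * sqrt(1 / n_f + 1 / n_m)
t_stat  <- (mean_f - mean_m) / se_diff
p_value <- 2 * pt(-abs(t_stat), df_pooled)

# 95% confidence interval for the difference in means (Female - Male)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_pooled) * se_diff

round(c(t = t_stat, df = df_pooled, p = p_value), 4)
round(ci, 2)
```

The results should agree with the reported t = -0.39538, df = 59 and p-value = 0.694 up to rounding.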

7.1.4: Effect size (Cohen’s d - Sawilowsky, 2009)

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(mydata$Income_USD ~ mydata$Gender,
                     pooled_sd = FALSE)

## Cohen's d |        95% CI
## -------------------------
## -0.10     | [-0.60, 0.40]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.10, rules = "sawilowsky2009")

## [1] "very small"
## (Rules: sawilowsky2009)

The difference in average annual income between males and females is very small
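
The reported value can be approximately reproduced from the group summaries in Section 6.2. The sketch assumes that pooled_sd = FALSE standardises the mean difference by the square root of the average of the two group variances. The variable names are hypothetical, and the inputs are the rounded values from the describeBy() output.

```r
# Summary statistics copied from the describeBy() output
mean_f <- 77233.33; sd_f <- 40331.97
mean_m <- 80677.42; sd_m <- 26507.09

# Un-pooled standardiser: square root of the mean of the two group variances
# (assumed to correspond to pooled_sd = FALSE)
sd_unpooled <- sqrt((sd_f^2 + sd_m^2) / 2)

d <- (mean_f - mean_m) / sd_unpooled   # Female minus Male, as in the output
round(d, 2)
```

The negative sign only reflects the order of the groups (Female minus Male). The magnitude is what gets interpreted.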

Conclusion for the parametric test:

Based on the sample data, there is no statistically significant difference in average annual income between men and women (p > 0.05), and the observed difference is very small (effect size d = 0.10 in absolute value)
However, this result may be unreliable because the normality assumption is violated
Therefore, the corresponding non-parametric test (the Wilcoxon Rank Sum Test) should be performed

7.2: Non-parametric test (Wilcoxon Rank Sum Test)

In this case, the corresponding non-parametric test would be Wilcoxon Rank Sum Test

wilcox.test(mydata$Income_USD ~ mydata$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Income_USD by mydata$Gender
## W = 392.5, p-value = 0.2954
## alternative hypothesis: true location shift is not equal to 0

Hypotheses for Wilcoxon Rank Sum Test:

Null hypothesis (H0): The distribution location of annual income is the same for males and females
Alternative hypothesis (H1): The distribution location of annual income differs between males and females
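
How the W statistic is obtained can be sketched on a small example: the pooled sample is ranked, the ranks of the first group are summed, and the smallest possible rank sum is subtracted. The values below are illustrative incomes in thousands of USD, not the full dataset, and the object names are hypothetical.

```r
# Illustrative incomes in thousands of USD (not the full dataset)
x <- c(28, 48, 62, 66, 105)   # first group
y <- c(25, 75, 85, 85, 130)   # second group

# Rank the pooled sample (ties get average ranks), then sum the ranks of x
ranks <- rank(c(x, y))
w_manual <- sum(ranks[seq_along(x)]) - length(x) * (length(x) + 1) / 2

# Same statistic from wilcox.test(), with the same settings as above
w_builtin <- unname(wilcox.test(x, y, correct = FALSE, exact = FALSE)$statistic)

c(manual = w_manual, builtin = w_builtin)
```

Because the test works on ranks rather than raw values, a few very high incomes affect it far less than they affect the t-test.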