2024-06-17

Setup and Data Cleaning

To begin, we import the dataset and load the libraries we’ll need.

data <- read.csv("C:/Users/schou/Downloads/dataset.csv", sep = ",",
                 header = TRUE)

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next, we check whether the dataset contains any NA values:

sum(is.na(data))
## [1] 0

Thankfully, this data has no NA values and is already clean! All the column names are accurate as well.
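A single total can hide where missing values sit, so a per-column count is a useful companion check. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the real dataset). It also counts empty strings, because `read.csv` by default reads blank fields in character columns as `""` rather than `NA`.

```r
# Hypothetical toy data frame standing in for `data`; all values are made up.
toy <- data.frame(
  Age = c(25, NA, 40),
  Total_income = c(50000, 62000, NA),
  Income_type = c("Working", "", "Pensioner")
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of empty strings in each column (NA comparisons are ignored)
empty_per_column <- colSums(toy == "", na.rm = TRUE)
empty_per_column
```

On the real data, the same two calls could be run with `data` in place of `toy`.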

Question

Now we can manipulate the data to answer our question: how are some of the variables that determine credit card eligibility related to one another?

In this data set, we have 20 variables:

##  [1] "ID"              "Gender"          "Own_car"         "Own_property"   
##  [5] "Work_phone"      "Phone"           "Email"           "Unemployed"     
##  [9] "Num_children"    "Num_family"      "Account_length"  "Total_income"   
## [13] "Age"             "Years_employed"  "Income_type"     "Education_type" 
## [17] "Family_status"   "Housing_type"    "Occupation_type" "Target"

The “Target” variable is a binary value indicating whether the individual is eligible for a credit card.
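If a literal correlation is wanted, base R's `cor()` computes pairwise Pearson correlations between numeric columns. The sketch below is only an illustration on made-up values (`toy` is hypothetical), reusing a few column names from the list above. With a 0/1 variable such as Target, the Pearson coefficient is the point-biserial correlation.

```r
# Hypothetical toy data reusing a few column names; all values are made up.
toy <- data.frame(
  Age = c(25, 35, 45, 55, 65),
  Total_income = c(40000, 60000, 80000, 100000, 120000),
  Target = c(0, 0, 1, 1, 1)
)

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)
```

On the real data, the non-numeric columns (for example Income_type) would need to be dropped first, since `cor()` only accepts numeric input.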

Among the men and women who were eligible, what was the average income for each gender?

To find this out, I used:

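# Eligible men (Gender == 0) and their mean total income
men <- filter(data, Gender == 0 & Target == 1)
avgsal_men <- mean(men$Total_income)
# Eligible women (Gender == 1) and their mean total income
women <- filter(data, Gender == 1 & Target == 1)
avgsal_women <- mean(women$Total_income)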

Result

avgsal_men
## [1] 168770
avgsal_women
## [1] 215723.5

Among eligible individuals, women have a higher average income (about 215,724) than men (168,770).
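The same comparison can be done in one step with `group_by()` and `summarise()` from dplyr. The sketch below runs on a made-up data frame (`toy` is hypothetical) and adds the group size and median, which help judge whether a few very large incomes are driving a mean.

```r
library(dplyr)

# Hypothetical toy data with the same column names; all values are made up.
toy <- data.frame(
  Gender = c(0, 0, 1, 1, 1, 0),
  Target = c(1, 1, 1, 1, 0, 0),
  Total_income = c(100000, 200000, 150000, 250000, 90000, 80000)
)

# Keep eligible rows only, then summarise income within each gender
income_by_gender <- toy %>%
  filter(Target == 1) %>%
  group_by(Gender) %>%
  summarise(
    n = n(),
    avg_income = mean(Total_income),
    median_income = median(Total_income)
  )
income_by_gender
```

Swapping `data` in for `toy` would apply the same summary to the real dataset.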

How Important is Account Length?

The average account length for an eligible individual is:

## [1] 30.57599
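The code that produced this value is not shown. One way it could be computed, assuming it is the mean of Account_length over rows with Target equal to 1, is sketched below on made-up values (`toy` and `avgacc_all` are hypothetical names).

```r
# Hypothetical toy data; all values are made up.
toy <- data.frame(
  Target = c(1, 1, 0, 1),
  Account_length = c(10, 20, 50, 30)
)

# Mean account length over eligible rows only (Target == 1)
avgacc_all <- mean(toy$Account_length[toy$Target == 1])
avgacc_all
```

On the real data the equivalent call would be `mean(data$Account_length[data$Target == 1])`.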

30 years is a long account length! Do men or women have longer account lengths?

avgacc_men <- mean(men$Account_length)
avgacc_women <- mean(women$Account_length)
avgacc_men
## [1] 30.99015
avgacc_women
## [1] 29.862

Eligible men have slightly longer account lengths on average (about 30.99) than eligible women (about 29.86).

Age vs. Total Income

Of those who are eligible, how does total income vary with age?