“Exploratory Data Analysis” on the Credit Card default dataset

Question 1

a) Write R code to read the data into a dataframe called “df”

## Load Libraries ##
library(readr)
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Load data ##
df <- read_csv("./DefaultData.csv")
## Parsed with column specification:
## cols(
##   default = col_character(),
##   student = col_character(),
##   balance = col_double(),
##   income = col_double()
## )
## Looking into structure of the data ##
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of  4 variables:
##  $ default: chr  "No" "No" "No" "No" ...
##  $ student: chr  "No" "Yes" "No" "No" ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   default = col_character(),
##   ..   student = col_character(),
##   ..   balance = col_double(),
##   ..   income = col_double()
##   .. )

b) Also write R code to read the data into a data table called “dt”.

## Load into data table ##
dt <- fread("./DefaultData.csv")
## Looking into structure of the data ##
str(dt)
## Classes 'data.table' and 'data.frame':   10000 obs. of  4 variables:
##  $ default: chr  "No" "No" "No" "No" ...
##  $ student: chr  "No" "Yes" "No" "No" ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Q.2 Write R code to get the dimensions of the dataframe “df”

dim(df)
## [1] 10000     4

Q.3 Write R code to list the column names of the dataframe “df”

colnames(df)
## [1] "default" "student" "balance" "income"

Q.4 Write R code to attach the dataframe “df”

attach(df)
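
attach() places the columns of “df” on the search path, so later code can write income instead of df$income. The convenience has a cost: an attached column can be hidden by another object of the same name, as the psych startup message in Question 12 reports for income. A minimal sketch of one alternative, with(), on a made-up three-row data frame (the name toy and its values are purely illustrative, not rows from the dataset):

```r
# Hypothetical toy data; not taken from DefaultData.csv
toy <- data.frame(
  default = c("No", "Yes", "No"),
  balance = c(730, 1800, 529)
)

# with() evaluates an expression using the data frame's columns
# without modifying the search path
counts <- with(toy, table(default))
avg_balance <- with(toy, mean(balance))

print(counts)
print(avg_balance)
```

The same pattern applies to any of the later one-liners, for example with(df, mean(income)).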

Question 5

a) Write R code to list the data types (classes) of the columns in the dataframe “df”

sapply(df, class)
##     default     student     balance      income 
## "character" "character"   "numeric"   "numeric"

b) Also write R code to list the data types (classes) of the columns in the data.table “dt”. Note whether the outputs differ.

sapply(dt, class)
##     default     student     balance      income 
## "character" "character"   "numeric"   "numeric"

The two outputs are identical: read_csv() and fread() infer the same column types for this file. “df” and “dt” differ only in the class of the object itself (tibble versus data.table), as the str() output in Question 1 shows.

Q.6 Write R code to count how many consumers default on their loan

default <- as.factor(default)
table(default)
## default
##   No  Yes 
## 9667  333

Q.7 Write R code to count how many consumers default on their loan, further broken down by whether or not they are students

df %>% group_by(student, default) %>% summarise(number=n())
## # A tibble: 4 x 3
## # Groups:   student [2]
##   student default number
##   <chr>   <chr>    <int>
## 1 No      No        6850
## 2 No      Yes        206
## 3 Yes     No        2817
## 4 Yes     Yes        127

Q.8 Write R code to create the complete contingency table of defaulters broken down by students

addmargins(table(student, default), c(1,2))
##        default
## student    No   Yes   Sum
##     No   6850   206  7056
##     Yes  2817   127  2944
##     Sum  9667   333 10000
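
The counts above can also be expressed as row-wise percentages, which gives the default rate within each student group. A minimal sketch that rebuilds the table from the printed counts rather than from the CSV file (the object names counts and row_pct are illustrative):

```r
# Rebuild the 2 x 2 table from the counts printed above
counts <- matrix(
  c(6850, 206,
    2817, 127),
  nrow = 2, byrow = TRUE,
  dimnames = list(student = c("No", "Yes"), default = c("No", "Yes"))
)

# margin = 1 scales each row to sum to 1, i.e. proportions within student status
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

With these counts, roughly 2.9% of non-students and 4.3% of students defaulted.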

Q.9 Write R code to calculate the percentage of Defaulters and non-Defaulters, rounded to 1 decimal place

round(100 * prop.table(table(default)), 1)
## default
##   No  Yes 
## 96.7  3.3

Q.10 Write R code to get the mean, standard deviation and variance of the income

## Mean of income ##
mean(income)
## [1] 33516.98
## Standard deviation of income ##
sd(income)
## [1] 13336.64
## Variance of income ##
var(income)
## [1] 177865955
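
The variance is the square of the standard deviation, which gives a quick consistency check on the two outputs above: 13336.64 squared is approximately 177.87 million, matching the printed variance up to rounding of the displayed values. A small sketch of the same check on made-up numbers (the vector x is illustrative, not income data):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values for illustration

# var() and sd() both use the sample (n - 1) denominator,
# so squaring sd() reproduces var() up to floating-point error
same <- isTRUE(all.equal(sd(x)^2, var(x)))
print(same)
print(c(sd = sd(x), var = var(x)))
```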

Q.11 Write R code to calculate the minimum and maximum income, rounded to 2 decimal places

## minimum income
round(min(income),2)
## [1] 771.97
## maximum income
round(max(income),2)
## [1] 73554.23

Question 12

a) Write R code to print the following descriptive statistics

library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'df':
## 
##     income
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
df$default <- as.factor(df$default)
df$student <- as.factor(df$student)
describe(df, na.rm=TRUE)[,c(1:5,8,9)]
##          vars     n     mean       sd   median    min      max
## default*    1 10000     1.03     0.18     1.00   1.00     2.00
## student*    2 10000     1.29     0.46     1.00   1.00     2.00
## balance     3 10000   835.37   483.71   823.64   0.00  2654.32
## income      4 10000 33516.98 13336.64 34552.64 771.97 73554.23

b) In the above output, interpret the meaning of the value 1.29 reported as the mean of the student column.

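describe() converts factor columns to their integer codes before computing statistics; the asterisk in “student*” marks such a column. Here No is coded as 1 and Yes as 2, so the mean equals 1 plus the proportion of students: 1 + 2944/10000 = 1.2944, which rounds to 1.29. In other words, about 29% of the consumers are students.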
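
The arithmetic behind that mean can be reproduced without psych. A minimal sketch, assuming only the student counts from the contingency table in Q.8 (7056 “No”, 2944 “Yes”); the vector is rebuilt from those counts, not read from the file:

```r
# Rebuild the student column from the counts in the Q.8 table
student <- factor(rep(c("No", "Yes"), times = c(7056, 2944)))

# Factor levels are sorted alphabetically, so No -> 1 and Yes -> 2
codes <- as.integer(student)

mean_code <- mean(codes)              # mean of the integer codes
share_yes <- mean(student == "Yes")   # proportion of students

print(round(mean_code, 2))
print(share_yes)
```

The mean of the codes is always 1 plus the share of the second level, which is why it carries no meaning beyond that proportion.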

Question 13

a) Write R code to get average of balance, broken down by whether consumers default on their loan

df %>% group_by(default) %>% summarise(avg_balance = mean(balance))
## # A tibble: 2 x 2
##   default avg_balance
##   <fct>         <dbl>
## 1 No             804.
## 2 Yes           1748.

b) Write R code to create a histogram of balance

plot <- df %>% ggplot(aes(x=balance)) + geom_histogram(binwidth = 100)
print(plot)

Q.14 Write R code to get the mean and standard deviation of the balance, broken down by whether someone is a student and whether they have defaulted on payment, as shown in the following output

df %>% group_by(default, student) %>% summarise(count=n(), mean_balance = mean(balance), sd_balance=sd(balance))
## # A tibble: 4 x 5
## # Groups:   default [2]
##   default student count mean_balance sd_balance
##   <fct>   <fct>   <int>        <dbl>      <dbl>
## 1 No      No       6850         745.       446.
## 2 No      Yes      2817         948.       451.
## 3 Yes     No        206        1678.       331.
## 4 Yes     Yes       127        1860.       329.

Q.15 Write R code to create a box plot for credit card balance

plot <- df %>%  ggplot(aes(y=balance)) + geom_boxplot()
print(plot)

Q.16 Write R code to create boxplots for credit card balance, broken down by whether a consumer is a student or not a student

## Map student to the x axis so each group gets its own labelled box ##
plot <- df %>% ggplot(aes(x=student, y=balance, fill=student)) + geom_boxplot()
print(plot)