Set up Rstudio

Setting up RMarkdown when opening it enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents, that can help you communicate your data analysis and research findings more effectively.

if(!require(dplyr)){install.packages('dplyr')} #installing the package if not
library(dplyr) #loading the library
library(gtsummary)
library(ggplot2)
library(dplyr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)
library(gtsummary)

Preliminary

As soon as you start any session in R you need to ensure that: * Your working directory is set-up correctly. If you don’t know how to do this refer to the instructions in the first workshop. Don’t proceed unless this is done. * Download from the folder on Moodle containing this week’s workshop the file Rfunctions.R and all data files (.csv files), and store these files in your working directory. * Source the script file Rfunctions.R by source(“Rfunctions.R”)

After sourcing the file plot_fcns.R you will be able to use the functions for plotting defined in this script. This will not work (and an error message will appear) if you have not set correctly your working directory, and/ or if you have not stored the files in that directory). Do not use the file you sourced as your script file. Use a different (preferably a new) R script for this workshop.

Loading a dataset

The dataset we use in this workshop is the “Credit card default” dataset contained in the textbook “An Introduction to Statistical Learning with Applications in R”. It is also discussed in the first video lectures for Chapter 4. Our first task is to load the dataset we will be using in this workshop. In general, data can come in many different formats from many different sources (e.g. text files, Excel, SPSS, SAS, etc). R can handle a very broad range of data files. In this workshop the dataset we will use is in the Comma Separated Value (CSV) format. CSV files are effectively text files that represent a data table by having entries in a row separated by commas (as the name suggests). CSV files are supported by every software package.

Read CSV file and store it in variable named: Default

data <- read.csv("C:\\Users\\user\\Downloads\\Default.csv")
head(data,5)
attach(data)

The function read.csv() is a variant of the more general function read.table(). In other words read.csv() is a call to the read.table() function with a number of optional arguments of the latter function set to values that are suitable for CSV files. To find out what is the type (class) of the variable Default, you can use the function class().

Eliminate the missing observations

data<-na.omit(data)
class(data)
[1] "data.frame"

Getting to know the data

Let’s find out more about this dataset. First we need to know how many observations and how many variables Default contains. Typically the rows of a data matrix (frame) correspond to observations, and the columns to variables.

Find out dimensions (size, and number of variables in dataset)

dim(data)
[1] 30000     8
Alternatively
nrow(data)
[1] 30000
ncol(data)
[1] 8

We can also find out the names of the variables (these names must be defined in the file we loaded).

Names of variables in data.frame

names(data)
[1] "ID"             "Income"         "Balance"        "Gender"        
[5] "Education"      "Marital.Status" "Age"            "Default"       

Create factors for categorical variables

data$Gender<-factor(data$Gender, levels = c(1,2),
                              labels = c("Male", "Female"))
data$Education<-factor(data$Education, levels = c(0,1,2,3,4,5,6),
                              labels = c("Doctoral Degree (e.g., PhD or EdD)", "Professional Degree (e.g., MD or JD)","Master's Degree (e.g., MA or MS)","Bachelor's Degree (e.g., BA or BS)","Associate's Degree (e.g., AA or AS)","Certificate/Diploma in a Trade or Vocational Program","High School Diploma"))
data$Marital.Status<-factor(data$Marital.Status, levels = c(0,1,2,3),
                              labels = c("Divorced", "Married", "Single", "Other"))
data$Default<-factor(data$Default, levels = c(0,1),
                              labels = c("yes", "no"))

Check class of various variables

We can use the class() function to find out the types of different variables (columns in Default).

class(data$Gender)
[1] "factor"
class(data$Education)
[1] "factor"

Factors are a special variable type in R and you need to be aware of their basic properties. Factors are described in the Introduction to R course in DataCamp.

Alternatively

### structure function
str(data)
'data.frame':   30000 obs. of  8 variables:
 $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Income        : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ Balance       : num  3913 2682 29239 46990 8617 ...
 $ Gender        : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 1 2 2 1 ...
 $ Education     : Factor w/ 7 levels "Doctoral Degree (e.g., PhD or EdD)",..: 3 3 3 3 3 2 2 3 4 4 ...
 $ Marital.Status: Factor w/ 4 levels "Divorced","Married",..: 2 3 3 2 2 3 3 3 2 3 ...
 $ Age           : int  24 26 34 37 57 37 29 23 28 35 ...
 $ Default       : Factor w/ 2 levels "yes","no": 2 2 1 1 1 1 1 1 1 1 ...

Basic summary statistics

summary(data)
       ID            Income           Balance           Gender     
 Min.   :    1   Min.   :  10000   Min.   :-165580   Male  :11888  
 1st Qu.: 7501   1st Qu.:  50000   1st Qu.:   3559   Female:18112  
 Median :15000   Median : 140000   Median :  22382                 
 Mean   :15000   Mean   : 167484   Mean   :  51223                 
 3rd Qu.:22500   3rd Qu.: 240000   3rd Qu.:  67091                 
 Max.   :30000   Max.   :1000000   Max.   : 964511                 
                                                                   
                                                Education      Marital.Status 
 Doctoral Degree (e.g., PhD or EdD)                  :   14   Divorced:   54  
 Professional Degree (e.g., MD or JD)                :10585   Married :13659  
 Master's Degree (e.g., MA or MS)                    :14030   Single  :15964  
 Bachelor's Degree (e.g., BA or BS)                  : 4917   Other   :  323  
 Associate's Degree (e.g., AA or AS)                 :  123                   
 Certificate/Diploma in a Trade or Vocational Program:  280                   
 High School Diploma                                 :   51                   
      Age        Default    
 Min.   :21.00   yes:23364  
 1st Qu.:28.00   no : 6636  
 Median :34.00              
 Mean   :35.49              
 3rd Qu.:41.00              
 Max.   :79.00              
                            

Answering Basic Questions About this data

In this dataset the objective is to model the probability of a customer defaulting on their credit card debt. Let’s try to obtain some basic insights about the “objects”, in this case credit card holders.

# Has an individual defaulted?
# Note that we use == and not a single =
head(data$Default=="yes",100)
  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [13]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [25]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [49]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
 [85]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [97]  TRUE  TRUE  TRUE FALSE

The above command returns a vector with entries either TRUE or FALSE depending on whether the condition (Default==“yes”) is satisfied or not. (A variable that can take only the values TRUE/FALSE is called a Boolean variable). Note that to test equality we use == and not single =. This is very important, because in the latter case we would assign all values for the class variable “defaulted” equal to “Yes”. If you made a mistake and used a single =, you can always restore the original version by reloading the dataset. Let’s ask slightly more interesting questions: How many customers defaulted, what is the proportion of default in our sample?

# How many customers defaulted and how many did not default?
sum(data$Default=="yes")
[1] 23364
sum(data$Default=="no")
[1] 6636
# Proportion of customers that defaulted = P ^ (Default = yes)
sum(data$Default=="yes")/nrow(data)
[1] 0.7788
#### or equivalently:
mean(data$Default=="yes")
[1] 0.7788

Take a moment to understand why the above is correct. If you can’t figure it our ask first the person next to you, and then the tutor. Only after you figure this out attempt the exercise below.

Exercise

How many of the customers in this dataset are students and how many are not. Estimate the corresponding proportions. Hint: In R the symbol for “not equal” is != It is possible to have conditional statements that involve more than one variable. We structure these using and: & or: |. This allows us to answer questions involving more than one variable. For example:

# How many post-graduate students have defaulted?
sum(data$Default == "yes" | data$Education=="PostGraduate")
[1] 23364
# How many female clients have defaulted?
sum(data$Default == "yes" | data$Gender=="Female")
[1] 27127
# How many male clients have defaulted?
sum(data$Default == "yes" | data$Gender=="Male")
[1] 26237

Summary statistics

data[,c(2,3,6,7)] %>% tbl_summary(by = Marital.Status) %>% add_p()
Characteristic Divorced, N = 541 Married, N = 13,6591 Single, N = 15,9641 Other, N = 3231 p-value2
Income 115,000 (70,000, 200,000) 160,000 (70,000, 260,000) 130,000 (50,000, 220,000) 60,000 (30,000, 130,000) <0.001
Balance 6,357 (1,095, 38,964) 22,357 (3,056, 70,464) 22,436 (4,113, 64,476) 23,496 (7,198, 51,152) 0.015
Age 37 (31, 45) 39 (34, 46) 29 (26, 35) 43 (37, 49) <0.001
1 Median (IQR)
2 Kruskal-Wallis rank sum test
if(!require(gtsummary)){install.packages('gtsummary')}
library(gtsummary)

data %>%
  select(Income,Balance,Age) %>%
  tbl_summary(
    #by = trt,
    label = list(Balance ~ "Balance"),
    statistic = list(all_continuous() ~ "{min} {median} {mean} {sd} {max}"),
    digits = list(c(Income, Balance, Age) ~ c(0, 0, 2, 2,0))
  )
Characteristic N = 30,0001
Income 10,000 140,000 167,484.32 129,747.66 1,000,000
Balance -165,580 22,382 51,223.33 73,635.86 964,511
Age 21 34 35.49 9.22 79
1 Minimum Median Mean SD Maximum
data[,c(4,5,6,8)] %>% tbl_summary()
Characteristic N = 30,0001
Gender
    Male 11,888 (40%)
    Female 18,112 (60%)
Education
    Doctoral Degree (e.g., PhD or EdD) 14 (<0.1%)
    Professional Degree (e.g., MD or JD) 10,585 (35%)
    Master's Degree (e.g., MA or MS) 14,030 (47%)
    Bachelor's Degree (e.g., BA or BS) 4,917 (16%)
    Associate's Degree (e.g., AA or AS) 123 (0.4%)
    Certificate/Diploma in a Trade or Vocational Program 280 (0.9%)
    High School Diploma 51 (0.2%)
Marital.Status
    Divorced 54 (0.2%)
    Married 13,659 (46%)
    Single 15,964 (53%)
    Other 323 (1.1%)
Default 23,364 (78%)
1 n (%)
n <- c(10585, 14030,4917,174,280,14)

perc <- paste0(n, " = ", round(100 * n/sum(n), 2), "%")
pie(n, labels = perc)

ggplot(data, aes(x=Income, y=Balance)) +
  geom_point()+
  geom_smooth(method=lm, se=FALSE)

pie(n, labels = perc, main = "Education",col = c("blue", "black","Red","Orange","Green"))
legend("topleft", c("Graduate School","University","High school","PostGraduate","Unknown"), cex = 0.8,
       fill = c("blue", "black","Red","Orange","Green"))

data[,c(2, 3, 7,8)] %>% tbl_summary(by = Default) %>% add_p()
Characteristic yes, N = 23,3641 no, N = 6,6361 p-value2
Income 150,000 (70,000, 250,000) 90,000 (50,000, 200,000) <0.001
Balance 23,120 (3,677, 69,027) 20,185 (2,988, 59,626) <0.001
Age 34 (28, 41) 34 (28, 42) 0.4
1 Median (IQR)
2 Wilcoxon rank sum test

Data Visualization

gender and purpose

data [,c(5,6)] %>%
  tbl_summary(by = Marital.Status) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
Characteristic Overall, N = 30,0001 Divorced, N = 541 Married, N = 13,6591 Single, N = 15,9641 Other, N = 3231 p-value
Education
    Doctoral Degree (e.g., PhD or EdD) 14 (<0.1%) 0 (0%) 4 (<0.1%) 10 (<0.1%) 0 (0%)
    Professional Degree (e.g., MD or JD) 10,585 (35%) 4 (7.4%) 3,722 (27%) 6,809 (43%) 50 (15%)
    Master's Degree (e.g., MA or MS) 14,030 (47%) 6 (11%) 6,842 (50%) 7,020 (44%) 162 (50%)
    Bachelor's Degree (e.g., BA or BS) 4,917 (16%) 44 (81%) 2,861 (21%) 1,909 (12%) 103 (32%)
    Associate's Degree (e.g., AA or AS) 123 (0.4%) 0 (0%) 52 (0.4%) 68 (0.4%) 3 (0.9%)
    Certificate/Diploma in a Trade or Vocational Program 280 (0.9%) 0 (0%) 150 (1.1%) 127 (0.8%) 3 (0.9%)
    High School Diploma 51 (0.2%) 0 (0%) 28 (0.2%) 21 (0.1%) 2 (0.6%)
1 n (%)
library(ggplot2)
ggplot(data, aes(x = Education))+
  geom_bar(aes(fill = Gender), 
           position = position_stack(reverse = FALSE)) +
  geom_text(aes(label = after_stat(count)),  stat='count'
            , color="green", size =3, nudge_y= 8, nudge_x=0,size=9)+
 theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Age and Education

data [,c(5,7)] %>%
  tbl_summary(by = Education,
              statistic = list(all_continuous() ~ "{mean} {median} {sd}"),
              digits = list(Age ~ 2)) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
Characteristic Overall, N = 30,0001 Doctoral Degree (e.g., PhD or EdD), N = 141 Professional Degree (e.g., MD or JD), N = 10,5851 Master's Degree (e.g., MA or MS), N = 14,0301 Bachelor's Degree (e.g., BA or BS), N = 4,9171 Associate's Degree (e.g., AA or AS), N = 1231 Certificate/Diploma in a Trade or Vocational Program, N = 2801 High School Diploma, N = 511 p-value2
Age 35.49 34.00 9.22 38.86 38.50 7.24 34.23 32.00 8.27 34.72 33.00 8.89 40.30 40.00 10.44 33.85 32.00 8.23 35.60 34.00 8.99 43.90 46.00 10.12 <0.001
1 Mean Median SD
2 Kruskal-Wallis rank sum test

Histogram (Age)

summary(Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   28.00   34.00   35.49   41.00   79.00 
ggplot(data=data, aes(Age)) +        
  geom_histogram(aes(y = ..density..),color="red")+
  scale_x_continuous(limits = c(15,85))+
  stat_function(fun = dnorm,
                args = list(mean = mean(Age),
                            sd = sd(Age)),
                col = "#1b98e0",
                size = 1)+
  labs(title = "Histogram Showing the Distribution of Age")

Age and Education

ggplot(data, aes(x = Education, y = Age))+
  labs(title = "", y = "Age", x = "Education")+
  geom_boxplot(aes(fill = Education)) +theme(legend.position="none")+
  labs(title = "Boxplot of Age Against Education", x = "Education", y = "Age")+
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5))

Income, Education and Default
ggplot(data=data, aes(x=Education, y =Income, fill=Default)) +
  geom_bar(position = "dodge",
           stat = "summary",
           fun = "mean")+
  theme_minimal()+
  labs(title = "Bar Plot of Income Against Education", x = "Education", y = "Average Income")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Additional Visuals

Barplots
data [,c(2,8)] %>%
  tbl_summary(by = Default,
              statistic = list(all_continuous() ~ "{mean} {median} {sd}"),
              digits = list(Income ~ 2)) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
Characteristic Overall, N = 30,0001 yes, N = 23,3641 no, N = 6,6361 p-value2
Income 167,484.32 140,000.00 129,747.66 178,099.73 150,000.00 131,628.36 130,109.66 90,000.00 115,378.54 <0.001
1 Mean Median SD
2 Wilcoxon rank sum test
library(ggplot2)

# Create some sample data
x <- c("yes", "No")
y <- c(23362,6636)
data_0 <- data.frame(Category = x, Value = y)

# Create a bar plot
ggplot(data_0, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Defaulted Clients", x = "Defaulted Clients", y = "Frequency")

Bar graph with Relative Frequency

library(ggplot2)

# Create some sample data
x <- c("yes", "No")
y <- c(23362,6636)
data_1 <- data.frame(Category = x, Value = y)

# Calculate relative frequencies
data_1$Freq <- data_1$Value / sum(data_1$Value)

# Create a bar plot
ggplot(data_1, aes(x = Category, y = Freq)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Relative Frequency of Defaulted", x = "Defaulted", y = "Relative Frequency")

Barplots of Class Conditional Distributions
Histogram
#### Histogram of balance
hist(data$Income)

#### Histogram of balance with 20 bins
hist(Balance, breaks = 20, col="red", main="", xlab="Credit Card Balance")

# create two histograms, separated by group, in one plot
par(mfrow = c(1,2)) # set the plotting area to two columns
hist(data$Income[data$Default == "yes"], main = "Defaulted (yes)", xlab = "Income", col = "blue")
hist(data$Income[data$Default == "no"], main = "Defaulted (no)", xlab = "Income", col = "green")

Combined Histograms (Using ggplot2)

ggplot(data, aes(x = Income, fill = Default)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 20) +
  scale_fill_manual(values = c("red", "blue")) +
  labs(title = "Histogram of Values by Group", x = "Income", y = "Frequency")

ggplot(data, aes(x = Income, fill = Default)) + 
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(x = "Value", y = "Frequency", title = "Histogram of Income by Defaulted Clients") +
  scale_fill_manual(values = c("blue", "green")) +
  facet_wrap(~Default, ncol = 2)

Average Income Across Eduacation categories

group_mean_age <- aggregate(Income ~ Education, data = data, mean)
group_mean_age

Histogram

# Create some sample data
x <- c("Doctoral Degree (e.g., PhD or EdD)", "Professional Degree (e.g., MD or JD)", "Master's Degree (e.g., MA or MS)","Bachelor's Degree (e.g., BA or BS)","Associate's Degree (e.g., AA or AS)","Certificate/Diploma in a Trade or Vocational Program","High School Diploma")
y <- c( 217142.9,212956.1,147062.4,126550.3,220894.3,168164.3,148235.3)
data_2 <- data.frame(Category = x, Value = y)

# Create a bar plot
ggplot(data_2, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Average Income Across Education", x = "Education", y = "Average Income")+
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5))

Chi-square Test of Independence (Marital Status and Default)

dat <- data[,c(6,8)] %>%
  tbl_summary(by = Default) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
dat
Characteristic Overall, N = 30,0001 yes, N = 23,3641 no, N = 6,6361 p-value2
Marital.Status <0.001
    Divorced 54 (0.2%) 49 (0.2%) 5 (<0.1%)
    Married 13,659 (46%) 10,453 (45%) 3,206 (48%)
    Single 15,964 (53%) 12,623 (54%) 3,341 (50%)
    Other 323 (1.1%) 239 (1.0%) 84 (1.3%)
1 n (%)
2 Pearson's Chi-squared test

The table presents data on marital status among a sample of 30,000 individuals, with 23,364 indicating “yes” and 6,636 indicating “no.” The p-value of <0.001 indicates that there is a statistically significant association between marital status and the “yes”/“no” outcome variable (Default).

Looking at the table, we can see that the majority of individuals in the sample are either single (53%) or married (46%). Divorced individuals make up only a small percentage (0.2%) of the sample, while those identifying as “other” comprise 1.1%.

When we compare the distribution of marital status between the “yes” and “no” groups, we can see that the proportion of married individuals is slightly higher among those who answered “no” (48%) than among those who answered “yes” (45%). Conversely, the proportion of single individuals is slightly higher among those who answered “yes” (54%) than among those who answered “no” (50%). These small differences may contribute to the statistically significant association between marital status and the outcome variable (Default).

Correlation

In statistics, correlation is a measure of the strength and direction of the linear relationship between two variables. It measures how closely related two variables are to each other. Correlation coefficients range from -1 to +1, where -1 represents a perfect negative correlation (i.e., as one variable increases, the other variable decreases), 0 represents no correlation, and +1 represents a perfect positive correlation (i.e., as one variable increases, the other variable increases).

A correlation coefficient of zero does not mean that there is no relationship between the variables, only that there is no linear relationship between them. It is possible for variables to have a nonlinear relationship or a relationship that is not captured by a linear model. Correlation can be used to explore relationships between variables, identify patterns, and make predictions. It is important to note, however, that correlation does not necessarily imply causation, and that other factors may be influencing the relationship between the variables. Consider the table below with the rule of thumb.

+==================================================+ |Correlation coefficient | Interpretation | ———————– | ————————– |0.90 - 1.00 | Very strong correlation | |0.70 - 0.89 | Strong correlation | |0.40 - 0.69 | Moderate correlation | |0.10 - 0.39 | Weak correlation | |0.00 - 0.10 | Negligible correlation | +==================================================+

dat <- data.frame(Income,Balance,Age)
dat %>% 
  dplyr::select(Income:Age) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper",
           title="Correlation - Income, Balance and Age",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.4)

plot(data[-1], col = Default)

Faceted Histogram of Income Across Marital Status

ggplot(data, aes(x = Income, fill = Marital.Status)) + 
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(x = "Value", y = "Frequency", title = "Histogram of Income by Marital Status of Clients") +
  scale_fill_manual(values = c("blue", "green","orange","purple")) +
  facet_wrap(~Marital.Status, ncol = 7)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))