Exploratory Data Analysis: Part two

Set up Rstudio

Setting up RMarkdown when opening it enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents, that can help you communicate your data analysis and research findings more effectively.

if(!require(dplyr)){install.packages('dplyr')} #installing the package if not
library(dplyr) #loading the library
library(gtsummary)
library(ggplot2)
library(dplyr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)
library(gtsummary)

Preliminary

As soon as you start any session in R you need to ensure that: * Your working directory is set-up correctly. If you don’t know how to do this refer to the instructions in the first workshop. Don’t proceed unless this is done. * Download from the folder on Moodle containing this week’s workshop the file Rfunctions.R and all data files (.csv files), and store these files in your working directory. * Source the script file Rfunctions.R by source(“Rfunctions.R”)

After sourcing the file plot_fcns.R you will be able to use the functions for plotting defined in this script. This will not work (and an error message will appear) if you have not set correctly your working directory, and/ or if you have not stored the files in that directory). Do not use the file you sourced as your script file. Use a different (preferably a new) R script for this workshop.

Loading a dataset

The dataset we use in this workshop is the “Credit card default” dataset contained in the textbook “An Introduction to Statistical Learning with Applications in R”. It is also discussed in the first video lectures for Chapter 4. Our first task is to load the dataset we will be using in this workshop. In general, data can come in many different formats from many different sources (e.g. text files, Excel, SPSS, SAS, etc). R can handle a very broad range of data files. In this workshop the dataset we will use is in the Comma Separated Value (CSV) format. CSV files are effectively text files that represent a data table by having entries in a row separated by commas (as the name suggests). CSV files are supported by every software package.

Read CSV file and store it in variable named: Default

data <- read.csv("C:\\Users\\user\\Downloads\\Default.csv")
head(data,5)

attach(data)

The function read.csv() is a variant of the more general function read.table(). In other words read.csv() is a call to the read.table() function with a number of optional arguments of the latter function set to values that are suitable for CSV files. To find out what is the type (class) of the variable Default, you can use the function class().

Eliminate the missing observations

data<-na.omit(data)

class(data)

[1] "data.frame"

Getting to know the data

Let’s find out more about this dataset. First we need to know how many observations and how many variables Default contains. Typically the rows of a data matrix (frame) correspond to observations, and the columns to variables.

Find out dimensions (size, and number of variables in dataset)

dim(data)

[1] 30000     8

Alternatively

nrow(data)

[1] 30000

ncol(data)

[1] 8

We can also find out the names of the variables (these names must be defined in the file we loaded).

Names of variables in data.frame

names(data)

[1] "ID"             "Income"         "Balance"        "Gender"        
[5] "Education"      "Marital.Status" "Age"            "Default"

Create factors for categorical variables

data$Gender<-factor(data$Gender, levels = c(1,2),
                              labels = c("Male", "Female"))
data$Education<-factor(data$Education, levels = c(0,1,2,3,4,5,6),
                              labels = c("Doctoral Degree (e.g., PhD or EdD)", "Professional Degree (e.g., MD or JD)","Master's Degree (e.g., MA or MS)","Bachelor's Degree (e.g., BA or BS)","Associate's Degree (e.g., AA or AS)","Certificate/Diploma in a Trade or Vocational Program","High School Diploma"))
data$Marital.Status<-factor(data$Marital.Status, levels = c(0,1,2,3),
                              labels = c("Divorced", "Married", "Single", "Other"))
data$Default<-factor(data$Default, levels = c(0,1),
                              labels = c("yes", "no"))

Check class of various variables

We can use the class() function to find out the types of different variables (columns in Default).

class(data$Gender)

[1] "factor"

class(data$Education)

[1] "factor"

Factors are a special variable type in R and you need to be aware of their basic properties. Factors are described in the Introduction to R course in DataCamp.

Alternatively

### structure function
str(data)

'data.frame':   30000 obs. of  8 variables:
 $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Income        : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ Balance       : num  3913 2682 29239 46990 8617 ...
 $ Gender        : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 1 2 2 1 ...
 $ Education     : Factor w/ 7 levels "Doctoral Degree (e.g., PhD or EdD)",..: 3 3 3 3 3 2 2 3 4 4 ...
 $ Marital.Status: Factor w/ 4 levels "Divorced","Married",..: 2 3 3 2 2 3 3 3 2 3 ...
 $ Age           : int  24 26 34 37 57 37 29 23 28 35 ...
 $ Default       : Factor w/ 2 levels "yes","no": 2 2 1 1 1 1 1 1 1 1 ...

Basic summary statistics

summary(data)

       ID            Income           Balance           Gender     
 Min.   :    1   Min.   :  10000   Min.   :-165580   Male  :11888  
 1st Qu.: 7501   1st Qu.:  50000   1st Qu.:   3559   Female:18112  
 Median :15000   Median : 140000   Median :  22382                 
 Mean   :15000   Mean   : 167484   Mean   :  51223                 
 3rd Qu.:22500   3rd Qu.: 240000   3rd Qu.:  67091                 
 Max.   :30000   Max.   :1000000   Max.   : 964511                 
                                                                   
                                                Education      Marital.Status 
 Doctoral Degree (e.g., PhD or EdD)                  :   14   Divorced:   54  
 Professional Degree (e.g., MD or JD)                :10585   Married :13659  
 Master's Degree (e.g., MA or MS)                    :14030   Single  :15964  
 Bachelor's Degree (e.g., BA or BS)                  : 4917   Other   :  323  
 Associate's Degree (e.g., AA or AS)                 :  123                   
 Certificate/Diploma in a Trade or Vocational Program:  280                   
 High School Diploma                                 :   51                   
      Age        Default    
 Min.   :21.00   yes:23364  
 1st Qu.:28.00   no : 6636  
 Median :34.00              
 Mean   :35.49              
 3rd Qu.:41.00              
 Max.   :79.00

Answering Basic Questions About this data

In this dataset the objective is to model the probability of a customer defaulting on their credit card debt. Let’s try to obtain some basic insights about the “objects”, in this case credit card holders.

# Has an individual defaulted?
# Note that we use == and not a single =
head(data$Default=="yes",100)

  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [13]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [25]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [49]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
 [85]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [97]  TRUE  TRUE  TRUE FALSE

The above command returns a vector with entries either TRUE or FALSE depending on whether the condition (Default==“yes”) is satisfied or not. (A variable that can take only the values TRUE/FALSE is called a Boolean variable). Note that to test equality we use == and not single =. This is very important, because in the latter case we would assign all values for the class variable “defaulted” equal to “Yes”. If you made a mistake and used a single =, you can always restore the original version by reloading the dataset. Let’s ask slightly more interesting questions: How many customers defaulted, what is the proportion of default in our sample?

# How many customers defaulted and how many did not default?
sum(data$Default=="yes")

[1] 23364

sum(data$Default=="no")

[1] 6636

# Proportion of customers that defaulted = P ^ (Default = yes)
sum(data$Default=="yes")/nrow(data)

[1] 0.7788

#### or equivalently:
mean(data$Default=="yes")

[1] 0.7788

Take a moment to understand why the above is correct. If you can’t figure it our ask first the person next to you, and then the tutor. Only after you figure this out attempt the exercise below.

Exercise

How many of the customers in this dataset are students and how many are not. Estimate the corresponding proportions. Hint: In R the symbol for “not equal” is != It is possible to have conditional statements that involve more than one variable. We structure these using and: & or: |. This allows us to answer questions involving more than one variable. For example:

# How many post-graduate students have defaulted?
sum(data$Default == "yes" | data$Education=="PostGraduate")

[1] 23364

# How many female clients have defaulted?
sum(data$Default == "yes" | data$Gender=="Female")

[1] 27127

# How many male clients have defaulted?
sum(data$Default == "yes" | data$Gender=="Male")

[1] 26237

Summary statistics

data[,c(2,3,6,7)] %>% tbl_summary(by = Marital.Status) %>% add_p()

Characteristic	Divorced, N = 54¹	Married, N = 13,659¹	Single, N = 15,964¹	Other, N = 323¹	p-value²
Income	115,000 (70,000, 200,000)	160,000 (70,000, 260,000)	130,000 (50,000, 220,000)	60,000 (30,000, 130,000)	<0.001
Balance	6,357 (1,095, 38,964)	22,357 (3,056, 70,464)	22,436 (4,113, 64,476)	23,496 (7,198, 51,152)	0.015
Age	37 (31, 45)	39 (34, 46)	29 (26, 35)	43 (37, 49)	<0.001
¹ Median (IQR)
² Kruskal-Wallis rank sum test

if(!require(gtsummary)){install.packages('gtsummary')}
library(gtsummary)

data %>%
  select(Income,Balance,Age) %>%
  tbl_summary(
    #by = trt,
    label = list(Balance ~ "Balance"),
    statistic = list(all_continuous() ~ "{min} {median} {mean} {sd} {max}"),
    digits = list(c(Income, Balance, Age) ~ c(0, 0, 2, 2,0))
  )

Characteristic	N = 30,000¹
Income	10,000 140,000 167,484.32 129,747.66 1,000,000
Balance	-165,580 22,382 51,223.33 73,635.86 964,511
Age	21 34 35.49 9.22 79
¹ Minimum Median Mean SD Maximum

data[,c(4,5,6,8)] %>% tbl_summary()

Characteristic	N = 30,000¹
Gender
Male	11,888 (40%)
Female	18,112 (60%)
Education
Doctoral Degree (e.g., PhD or EdD)	14 (<0.1%)
Professional Degree (e.g., MD or JD)	10,585 (35%)
Master's Degree (e.g., MA or MS)	14,030 (47%)
Bachelor's Degree (e.g., BA or BS)	4,917 (16%)
Associate's Degree (e.g., AA or AS)	123 (0.4%)
Certificate/Diploma in a Trade or Vocational Program	280 (0.9%)
High School Diploma	51 (0.2%)
Marital.Status
Divorced	54 (0.2%)
Married	13,659 (46%)
Single	15,964 (53%)
Other	323 (1.1%)
Default	23,364 (78%)
¹ n (%)

n <- c(10585, 14030,4917,174,280,14)

perc <- paste0(n, " = ", round(100 * n/sum(n), 2), "%")
pie(n, labels = perc)

ggplot(data, aes(x=Income, y=Balance)) +
  geom_point()+
  geom_smooth(method=lm, se=FALSE)

pie(n, labels = perc, main = "Education",col = c("blue", "black","Red","Orange","Green"))
legend("topleft", c("Graduate School","University","High school","PostGraduate","Unknown"), cex = 0.8,
       fill = c("blue", "black","Red","Orange","Green"))

data[,c(2, 3, 7,8)] %>% tbl_summary(by = Default) %>% add_p()

Characteristic	yes, N = 23,364¹	no, N = 6,636¹	p-value²
Income	150,000 (70,000, 250,000)	90,000 (50,000, 200,000)	<0.001
Balance	23,120 (3,677, 69,027)	20,185 (2,988, 59,626)	<0.001
Age	34 (28, 41)	34 (28, 42)	0.4
¹ Median (IQR)
² Wilcoxon rank sum test

Data Visualization

gender and purpose

data [,c(5,6)] %>%
  tbl_summary(by = Marital.Status) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()

Characteristic	Overall, N = 30,000¹	Divorced, N = 54¹	Married, N = 13,659¹	Single, N = 15,964¹	Other, N = 323¹
Education
Doctoral Degree (e.g., PhD or EdD)	14 (<0.1%)	0 (0%)	4 (<0.1%)	10 (<0.1%)	0 (0%)
Professional Degree (e.g., MD or JD)	10,585 (35%)	4 (7.4%)	3,722 (27%)	6,809 (43%)	50 (15%)
Master's Degree (e.g., MA or MS)	14,030 (47%)	6 (11%)	6,842 (50%)	7,020 (44%)	162 (50%)
Bachelor's Degree (e.g., BA or BS)	4,917 (16%)	44 (81%)	2,861 (21%)	1,909 (12%)	103 (32%)
Associate's Degree (e.g., AA or AS)	123 (0.4%)	0 (0%)	52 (0.4%)	68 (0.4%)	3 (0.9%)
Certificate/Diploma in a Trade or Vocational Program	280 (0.9%)	0 (0%)	150 (1.1%)	127 (0.8%)	3 (0.9%)
High School Diploma	51 (0.2%)	0 (0%)	28 (0.2%)	21 (0.1%)	2 (0.6%)
¹ n (%)

library(ggplot2)
ggplot(data, aes(x = Education))+
  geom_bar(aes(fill = Gender), 
           position = position_stack(reverse = FALSE)) +
  geom_text(aes(label = after_stat(count)),  stat='count'
            , color="green", size =3, nudge_y= 8, nudge_x=0,size=9)+
 theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Age and Education

data [,c(5,7)] %>%
  tbl_summary(by = Education,
              statistic = list(all_continuous() ~ "{mean} {median} {sd}"),
              digits = list(Age ~ 2)) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()

Characteristic	Overall, N = 30,000¹	Doctoral Degree (e.g., PhD or EdD), N = 14¹	Professional Degree (e.g., MD or JD), N = 10,585¹	Master's Degree (e.g., MA or MS), N = 14,030¹	Bachelor's Degree (e.g., BA or BS), N = 4,917¹	Associate's Degree (e.g., AA or AS), N = 123¹	Certificate/Diploma in a Trade or Vocational Program, N = 280¹	High School Diploma, N = 51¹	p-value²
Age	35.49 34.00 9.22	38.86 38.50 7.24	34.23 32.00 8.27	34.72 33.00 8.89	40.30 40.00 10.44	33.85 32.00 8.23	35.60 34.00 8.99	43.90 46.00 10.12	<0.001
¹ Mean Median SD
² Kruskal-Wallis rank sum test

Histogram (Age)

summary(Age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   28.00   34.00   35.49   41.00   79.00

ggplot(data=data, aes(Age)) +        
  geom_histogram(aes(y = ..density..),color="red")+
  scale_x_continuous(limits = c(15,85))+
  stat_function(fun = dnorm,
                args = list(mean = mean(Age),
                            sd = sd(Age)),
                col = "#1b98e0",
                size = 1)+
  labs(title = "Histogram Showing the Distribution of Age")

Age and Education

ggplot(data, aes(x = Education, y = Age))+
  labs(title = "", y = "Age", x = "Education")+
  geom_boxplot(aes(fill = Education)) +theme(legend.position="none")+
  labs(title = "Boxplot of Age Against Education", x = "Education", y = "Age")+
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5))

Income, Education and Default

ggplot(data=data, aes(x=Education, y =Income, fill=Default)) +
  geom_bar(position = "dodge",
           stat = "summary",
           fun = "mean")+
  theme_minimal()+
  labs(title = "Bar Plot of Income Against Education", x = "Education", y = "Average Income")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Additional Visuals

Barplots

data [,c(2,8)] %>%
  tbl_summary(by = Default,
              statistic = list(all_continuous() ~ "{mean} {median} {sd}"),
              digits = list(Income ~ 2)) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()

Characteristic	Overall, N = 30,000¹	yes, N = 23,364¹	no, N = 6,636¹	p-value²
Income	167,484.32 140,000.00 129,747.66	178,099.73 150,000.00 131,628.36	130,109.66 90,000.00 115,378.54	<0.001
¹ Mean Median SD
² Wilcoxon rank sum test

library(ggplot2)

# Create some sample data
x <- c("yes", "No")
y <- c(23362,6636)
data_0 <- data.frame(Category = x, Value = y)

# Create a bar plot
ggplot(data_0, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Defaulted Clients", x = "Defaulted Clients", y = "Frequency")

Bar graph with Relative Frequency

library(ggplot2)

# Create some sample data
x <- c("yes", "No")
y <- c(23362,6636)
data_1 <- data.frame(Category = x, Value = y)

# Calculate relative frequencies
data_1$Freq <- data_1$Value / sum(data_1$Value)

# Create a bar plot
ggplot(data_1, aes(x = Category, y = Freq)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Relative Frequency of Defaulted", x = "Defaulted", y = "Relative Frequency")

Barplots of Class Conditional Distributions

Histogram

#### Histogram of balance
hist(data$Income)

#### Histogram of balance with 20 bins
hist(Balance, breaks = 20, col="red", main="", xlab="Credit Card Balance")

# create two histograms, separated by group, in one plot
par(mfrow = c(1,2)) # set the plotting area to two columns
hist(data$Income[data$Default == "yes"], main = "Defaulted (yes)", xlab = "Income", col = "blue")
hist(data$Income[data$Default == "no"], main = "Defaulted (no)", xlab = "Income", col = "green")

Combined Histograms (Using ggplot2)

ggplot(data, aes(x = Income, fill = Default)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 20) +
  scale_fill_manual(values = c("red", "blue")) +
  labs(title = "Histogram of Values by Group", x = "Income", y = "Frequency")

ggplot(data, aes(x = Income, fill = Default)) + 
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(x = "Value", y = "Frequency", title = "Histogram of Income by Defaulted Clients") +
  scale_fill_manual(values = c("blue", "green")) +
  facet_wrap(~Default, ncol = 2)

Average Income Across Eduacation categories

group_mean_age <- aggregate(Income ~ Education, data = data, mean)
group_mean_age

Histogram

# Create some sample data
x <- c("Doctoral Degree (e.g., PhD or EdD)", "Professional Degree (e.g., MD or JD)", "Master's Degree (e.g., MA or MS)","Bachelor's Degree (e.g., BA or BS)","Associate's Degree (e.g., AA or AS)","Certificate/Diploma in a Trade or Vocational Program","High School Diploma")
y <- c( 217142.9,212956.1,147062.4,126550.3,220894.3,168164.3,148235.3)
data_2 <- data.frame(Category = x, Value = y)

# Create a bar plot
ggplot(data_2, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Average Income Across Education", x = "Education", y = "Average Income")+
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5))

Chi-square Test of Independence (Marital Status and Default)

dat <- data[,c(6,8)] %>%
  tbl_summary(by = Default) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
dat

Characteristic	Overall, N = 30,000¹	yes, N = 23,364¹	no, N = 6,636¹	p-value²
Marital.Status				<0.001
Divorced	54 (0.2%)	49 (0.2%)	5 (<0.1%)
Married	13,659 (46%)	10,453 (45%)	3,206 (48%)
Single	15,964 (53%)	12,623 (54%)	3,341 (50%)
Other	323 (1.1%)	239 (1.0%)	84 (1.3%)
¹ n (%)
² Pearson's Chi-squared test

The table presents data on marital status among a sample of 30,000 individuals, with 23,364 indicating “yes” and 6,636 indicating “no.” The p-value of <0.001 indicates that there is a statistically significant association between marital status and the “yes”/“no” outcome variable (Default).

Looking at the table, we can see that the majority of individuals in the sample are either single (53%) or married (46%). Divorced individuals make up only a small percentage (0.2%) of the sample, while those identifying as “other” comprise 1.1%.

When we compare the distribution of marital status between the “yes” and “no” groups, we can see that the proportion of married individuals is slightly higher among those who answered “no” (48%) than among those who answered “yes” (45%). Conversely, the proportion of single individuals is slightly higher among those who answered “yes” (54%) than among those who answered “no” (50%). These small differences may contribute to the statistically significant association between marital status and the outcome variable (Default).

Correlation

In statistics, correlation is a measure of the strength and direction of the linear relationship between two variables. It measures how closely related two variables are to each other. Correlation coefficients range from -1 to +1, where -1 represents a perfect negative correlation (i.e., as one variable increases, the other variable decreases), 0 represents no correlation, and +1 represents a perfect positive correlation (i.e., as one variable increases, the other variable increases).

A correlation coefficient of zero does not mean that there is no relationship between the variables, only that there is no linear relationship between them. It is possible for variables to have a nonlinear relationship or a relationship that is not captured by a linear model. Correlation can be used to explore relationships between variables, identify patterns, and make predictions. It is important to note, however, that correlation does not necessarily imply causation, and that other factors may be influencing the relationship between the variables. Consider the table below with the rule of thumb.

+==================================================+ |Correlation coefficient | Interpretation | ———————– | ————————– |0.90 - 1.00 | Very strong correlation | |0.70 - 0.89 | Strong correlation | |0.40 - 0.69 | Moderate correlation | |0.10 - 0.39 | Weak correlation | |0.00 - 0.10 | Negligible correlation | +==================================================+

dat <- data.frame(Income,Balance,Age)

dat %>% 
  dplyr::select(Income:Age) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper",
           title="Correlation - Income, Balance and Age",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.4)

plot(data[-1], col = Default)

Faceted Histogram of Income Across Marital Status

ggplot(data, aes(x = Income, fill = Marital.Status)) + 
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(x = "Value", y = "Frequency", title = "Histogram of Income by Marital Status of Clients") +
  scale_fill_manual(values = c("blue", "green","orange","purple")) +
  facet_wrap(~Marital.Status, ncol = 7)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Exploratory Data Analysis: Part two

Lumumba W. Vivtor

2023-03-19

Set up Rstudio

Preliminary

Loading a dataset

Read CSV file and store it in variable named: Default

Eliminate the missing observations

Getting to know the data

Find out dimensions (size, and number of variables in dataset)

Alternatively

Names of variables in data.frame

Create factors for categorical variables

Check class of various variables

Alternatively

Basic summary statistics

Answering Basic Questions About this data

Exercise

Summary statistics

Data Visualization

gender and purpose

Age and Education

Histogram (Age)

Age and Education

Income, Education and Default

Additional Visuals

Barplots

Bar graph with Relative Frequency

Barplots of Class Conditional Distributions

Histogram

Combined Histograms (Using ggplot2)

Average Income Across Eduacation categories

Histogram

Chi-square Test of Independence (Marital Status and Default)

Correlation

Faceted Histogram of Income Across Marital Status