Setting up RMarkdown when opening it enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents, that can help you communicate your data analysis and research findings more effectively.
if(!require(dplyr)){install.packages('dplyr')} #installing the package if not
library(dplyr) #loading the library
library(gtsummary)
library(ggplot2)
library(dplyr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)
library(gtsummary)
As soon as you start any session in R you need to ensure that: * Your working directory is set-up correctly. If you don’t know how to do this refer to the instructions in the first workshop. Don’t proceed unless this is done. * Download from the folder on Moodle containing this week’s workshop the file Rfunctions.R and all data files (.csv files), and store these files in your working directory. * Source the script file Rfunctions.R by source(“Rfunctions.R”)
After sourcing the file plot_fcns.R you will be able to use the functions for plotting defined in this script. This will not work (and an error message will appear) if you have not set correctly your working directory, and/ or if you have not stored the files in that directory). Do not use the file you sourced as your script file. Use a different (preferably a new) R script for this workshop.
The dataset we use in this workshop is the “Credit card default” dataset contained in the textbook “An Introduction to Statistical Learning with Applications in R”. It is also discussed in the first video lectures for Chapter 4. Our first task is to load the dataset we will be using in this workshop. In general, data can come in many different formats from many different sources (e.g. text files, Excel, SPSS, SAS, etc). R can handle a very broad range of data files. In this workshop the dataset we will use is in the Comma Separated Value (CSV) format. CSV files are effectively text files that represent a data table by having entries in a row separated by commas (as the name suggests). CSV files are supported by every software package.
data <- read.csv("C:\\Users\\user\\Downloads\\Default.csv")
head(data,5)
attach(data)
The function read.csv() is a variant of the more general function read.table(). In other words read.csv() is a call to the read.table() function with a number of optional arguments of the latter function set to values that are suitable for CSV files. To find out what is the type (class) of the variable Default, you can use the function class().
data<-na.omit(data)
class(data)
[1] "data.frame"
Let’s find out more about this dataset. First we need to know how many observations and how many variables Default contains. Typically the rows of a data matrix (frame) correspond to observations, and the columns to variables.
dim(data)
[1] 30000 8
nrow(data)
[1] 30000
ncol(data)
[1] 8
We can also find out the names of the variables (these names must be defined in the file we loaded).
names(data)
[1] "ID" "Income" "Balance" "Gender"
[5] "Education" "Marital.Status" "Age" "Default"
data$Gender<-factor(data$Gender, levels = c(1,2),
labels = c("Male", "Female"))
data$Education<-factor(data$Education, levels = c(0,1,2,3,4,5,6),
labels = c("Doctoral Degree (e.g., PhD or EdD)", "Professional Degree (e.g., MD or JD)","Master's Degree (e.g., MA or MS)","Bachelor's Degree (e.g., BA or BS)","Associate's Degree (e.g., AA or AS)","Certificate/Diploma in a Trade or Vocational Program","High School Diploma"))
data$Marital.Status<-factor(data$Marital.Status, levels = c(0,1,2,3),
labels = c("Divorced", "Married", "Single", "Other"))
data$Default<-factor(data$Default, levels = c(0,1),
labels = c("yes", "no"))
We can use the class() function to find out the types of different variables (columns in Default).
class(data$Gender)
[1] "factor"
class(data$Education)
[1] "factor"
Factors are a special variable type in R and you need to be aware of their basic properties. Factors are described in the Introduction to R course in DataCamp.
### structure function
str(data)
'data.frame': 30000 obs. of 8 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Income : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
$ Balance : num 3913 2682 29239 46990 8617 ...
$ Gender : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 1 2 2 1 ...
$ Education : Factor w/ 7 levels "Doctoral Degree (e.g., PhD or EdD)",..: 3 3 3 3 3 2 2 3 4 4 ...
$ Marital.Status: Factor w/ 4 levels "Divorced","Married",..: 2 3 3 2 2 3 3 3 2 3 ...
$ Age : int 24 26 34 37 57 37 29 23 28 35 ...
$ Default : Factor w/ 2 levels "yes","no": 2 2 1 1 1 1 1 1 1 1 ...
summary(data)
ID Income Balance Gender
Min. : 1 Min. : 10000 Min. :-165580 Male :11888
1st Qu.: 7501 1st Qu.: 50000 1st Qu.: 3559 Female:18112
Median :15000 Median : 140000 Median : 22382
Mean :15000 Mean : 167484 Mean : 51223
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.: 67091
Max. :30000 Max. :1000000 Max. : 964511
Education Marital.Status
Doctoral Degree (e.g., PhD or EdD) : 14 Divorced: 54
Professional Degree (e.g., MD or JD) :10585 Married :13659
Master's Degree (e.g., MA or MS) :14030 Single :15964
Bachelor's Degree (e.g., BA or BS) : 4917 Other : 323
Associate's Degree (e.g., AA or AS) : 123
Certificate/Diploma in a Trade or Vocational Program: 280
High School Diploma : 51
Age Default
Min. :21.00 yes:23364
1st Qu.:28.00 no : 6636
Median :34.00
Mean :35.49
3rd Qu.:41.00
Max. :79.00
In this dataset the objective is to model the probability of a customer defaulting on their credit card debt. Let’s try to obtain some basic insights about the “objects”, in this case credit card holders.
# Has an individual defaulted?
# Note that we use == and not a single =
head(data$Default=="yes",100)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
[25] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[37] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
[49] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[61] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
[73] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
[85] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[97] TRUE TRUE TRUE FALSE
The above command returns a vector with entries either TRUE or FALSE depending on whether the condition (Default==“yes”) is satisfied or not. (A variable that can take only the values TRUE/FALSE is called a Boolean variable). Note that to test equality we use == and not single =. This is very important, because in the latter case we would assign all values for the class variable “defaulted” equal to “Yes”. If you made a mistake and used a single =, you can always restore the original version by reloading the dataset. Let’s ask slightly more interesting questions: How many customers defaulted, what is the proportion of default in our sample?
# How many customers defaulted and how many did not default?
sum(data$Default=="yes")
[1] 23364
sum(data$Default=="no")
[1] 6636
# Proportion of customers that defaulted = P ^ (Default = yes)
sum(data$Default=="yes")/nrow(data)
[1] 0.7788
#### or equivalently:
mean(data$Default=="yes")
[1] 0.7788
Take a moment to understand why the above is correct. If you can’t figure it our ask first the person next to you, and then the tutor. Only after you figure this out attempt the exercise below.
How many of the customers in this dataset are students and how many are not. Estimate the corresponding proportions. Hint: In R the symbol for “not equal” is != It is possible to have conditional statements that involve more than one variable. We structure these using and: & or: |. This allows us to answer questions involving more than one variable. For example:
# How many post-graduate students have defaulted?
sum(data$Default == "yes" | data$Education=="PostGraduate")
[1] 23364
# How many female clients have defaulted?
sum(data$Default == "yes" | data$Gender=="Female")
[1] 27127
# How many male clients have defaulted?
sum(data$Default == "yes" | data$Gender=="Male")
[1] 26237
data[,c(2,3,6,7)] %>% tbl_summary(by = Marital.Status) %>% add_p()
| Characteristic | Divorced, N = 541 | Married, N = 13,6591 | Single, N = 15,9641 | Other, N = 3231 | p-value2 |
|---|---|---|---|---|---|
| Income | 115,000 (70,000, 200,000) | 160,000 (70,000, 260,000) | 130,000 (50,000, 220,000) | 60,000 (30,000, 130,000) | <0.001 |
| Balance | 6,357 (1,095, 38,964) | 22,357 (3,056, 70,464) | 22,436 (4,113, 64,476) | 23,496 (7,198, 51,152) | 0.015 |
| Age | 37 (31, 45) | 39 (34, 46) | 29 (26, 35) | 43 (37, 49) | <0.001 |
| 1 Median (IQR) | |||||
| 2 Kruskal-Wallis rank sum test | |||||
if(!require(gtsummary)){install.packages('gtsummary')}
library(gtsummary)
data %>%
select(Income,Balance,Age) %>%
tbl_summary(
#by = trt,
label = list(Balance ~ "Balance"),
statistic = list(all_continuous() ~ "{min} {median} {mean} {sd} {max}"),
digits = list(c(Income, Balance, Age) ~ c(0, 0, 2, 2,0))
)
| Characteristic | N = 30,0001 |
|---|---|
| Income | 10,000 140,000 167,484.32 129,747.66 1,000,000 |
| Balance | -165,580 22,382 51,223.33 73,635.86 964,511 |
| Age | 21 34 35.49 9.22 79 |
| 1 Minimum Median Mean SD Maximum | |
data[,c(4,5,6,8)] %>% tbl_summary()
| Characteristic | N = 30,0001 |
|---|---|
| Gender | |
| Male | 11,888 (40%) |
| Female | 18,112 (60%) |
| Education | |
| Doctoral Degree (e.g., PhD or EdD) | 14 (<0.1%) |
| Professional Degree (e.g., MD or JD) | 10,585 (35%) |
| Master's Degree (e.g., MA or MS) | 14,030 (47%) |
| Bachelor's Degree (e.g., BA or BS) | 4,917 (16%) |
| Associate's Degree (e.g., AA or AS) | 123 (0.4%) |
| Certificate/Diploma in a Trade or Vocational Program | 280 (0.9%) |
| High School Diploma | 51 (0.2%) |
| Marital.Status | |
| Divorced | 54 (0.2%) |
| Married | 13,659 (46%) |
| Single | 15,964 (53%) |
| Other | 323 (1.1%) |
| Default | 23,364 (78%) |
| 1 n (%) | |
n <- c(10585, 14030,4917,174,280,14)
perc <- paste0(n, " = ", round(100 * n/sum(n), 2), "%")
pie(n, labels = perc)
ggplot(data, aes(x=Income, y=Balance)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
pie(n, labels = perc, main = "Education",col = c("blue", "black","Red","Orange","Green"))
legend("topleft", c("Graduate School","University","High school","PostGraduate","Unknown"), cex = 0.8,
fill = c("blue", "black","Red","Orange","Green"))
data[,c(2, 3, 7,8)] %>% tbl_summary(by = Default) %>% add_p()
| Characteristic | yes, N = 23,3641 | no, N = 6,6361 | p-value2 |
|---|---|---|---|
| Income | 150,000 (70,000, 250,000) | 90,000 (50,000, 200,000) | <0.001 |
| Balance | 23,120 (3,677, 69,027) | 20,185 (2,988, 59,626) | <0.001 |
| Age | 34 (28, 41) | 34 (28, 42) | 0.4 |
| 1 Median (IQR) | |||
| 2 Wilcoxon rank sum test | |||
data [,c(5,6)] %>%
tbl_summary(by = Marital.Status) %>%
add_p() %>%
add_overall() %>%
bold_labels()
| Characteristic | Overall, N = 30,0001 | Divorced, N = 541 | Married, N = 13,6591 | Single, N = 15,9641 | Other, N = 3231 | p-value |
|---|---|---|---|---|---|---|
| Education | ||||||
| Doctoral Degree (e.g., PhD or EdD) | 14 (<0.1%) | 0 (0%) | 4 (<0.1%) | 10 (<0.1%) | 0 (0%) | |
| Professional Degree (e.g., MD or JD) | 10,585 (35%) | 4 (7.4%) | 3,722 (27%) | 6,809 (43%) | 50 (15%) | |
| Master's Degree (e.g., MA or MS) | 14,030 (47%) | 6 (11%) | 6,842 (50%) | 7,020 (44%) | 162 (50%) | |
| Bachelor's Degree (e.g., BA or BS) | 4,917 (16%) | 44 (81%) | 2,861 (21%) | 1,909 (12%) | 103 (32%) | |
| Associate's Degree (e.g., AA or AS) | 123 (0.4%) | 0 (0%) | 52 (0.4%) | 68 (0.4%) | 3 (0.9%) | |
| Certificate/Diploma in a Trade or Vocational Program | 280 (0.9%) | 0 (0%) | 150 (1.1%) | 127 (0.8%) | 3 (0.9%) | |
| High School Diploma | 51 (0.2%) | 0 (0%) | 28 (0.2%) | 21 (0.1%) | 2 (0.6%) | |
| 1 n (%) | ||||||
library(ggplot2)
ggplot(data, aes(x = Education))+
geom_bar(aes(fill = Gender),
position = position_stack(reverse = FALSE)) +
geom_text(aes(label = after_stat(count)), stat='count'
, color="green", size =3, nudge_y= 8, nudge_x=0,size=9)+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
data [,c(5,7)] %>%
tbl_summary(by = Education,
statistic = list(all_continuous() ~ "{mean} {median} {sd}"),
digits = list(Age ~ 2)) %>%
add_p() %>%
add_overall() %>%
bold_labels()
| Characteristic | Overall, N = 30,0001 | Doctoral Degree (e.g., PhD or EdD), N = 141 | Professional Degree (e.g., MD or JD), N = 10,5851 | Master's Degree (e.g., MA or MS), N = 14,0301 | Bachelor's Degree (e.g., BA or BS), N = 4,9171 | Associate's Degree (e.g., AA or AS), N = 1231 | Certificate/Diploma in a Trade or Vocational Program, N = 2801 | High School Diploma, N = 511 | p-value2 |
|---|---|---|---|---|---|---|---|---|---|
| Age | 35.49 34.00 9.22 | 38.86 38.50 7.24 | 34.23 32.00 8.27 | 34.72 33.00 8.89 | 40.30 40.00 10.44 | 33.85 32.00 8.23 | 35.60 34.00 8.99 | 43.90 46.00 10.12 | <0.001 |
| 1 Mean Median SD | |||||||||
| 2 Kruskal-Wallis rank sum test | |||||||||
summary(Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.00 28.00 34.00 35.49 41.00 79.00
ggplot(data=data, aes(Age)) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(15,85))+
stat_function(fun = dnorm,
args = list(mean = mean(Age),
sd = sd(Age)),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram Showing the Distribution of Age")
ggplot(data, aes(x = Education, y = Age))+
labs(title = "", y = "Age", x = "Education")+
geom_boxplot(aes(fill = Education)) +theme(legend.position="none")+
labs(title = "Boxplot of Age Against Education", x = "Education", y = "Age")+
theme(axis.text.x = element_text(angle = 60, vjust = 0.5))
ggplot(data=data, aes(x=Education, y =Income, fill=Default)) +
geom_bar(position = "dodge",
stat = "summary",
fun = "mean")+
theme_minimal()+
labs(title = "Bar Plot of Income Against Education", x = "Education", y = "Average Income")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
data [,c(2,8)] %>%
tbl_summary(by = Default,
statistic = list(all_continuous() ~ "{mean} {median} {sd}"),
digits = list(Income ~ 2)) %>%
add_p() %>%
add_overall() %>%
bold_labels()
| Characteristic | Overall, N = 30,0001 | yes, N = 23,3641 | no, N = 6,6361 | p-value2 |
|---|---|---|---|---|
| Income | 167,484.32 140,000.00 129,747.66 | 178,099.73 150,000.00 131,628.36 | 130,109.66 90,000.00 115,378.54 | <0.001 |
| 1 Mean Median SD | ||||
| 2 Wilcoxon rank sum test | ||||
library(ggplot2)
# Create some sample data
x <- c("yes", "No")
y <- c(23362,6636)
data_0 <- data.frame(Category = x, Value = y)
# Create a bar plot
ggplot(data_0, aes(x = Category, y = Value)) +
geom_bar(stat = "identity") +
labs(title = "Bar Plot of Defaulted Clients", x = "Defaulted Clients", y = "Frequency")
library(ggplot2)
# Create some sample data
x <- c("yes", "No")
y <- c(23362,6636)
data_1 <- data.frame(Category = x, Value = y)
# Calculate relative frequencies
data_1$Freq <- data_1$Value / sum(data_1$Value)
# Create a bar plot
ggplot(data_1, aes(x = Category, y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "Bar Plot of Relative Frequency of Defaulted", x = "Defaulted", y = "Relative Frequency")
#### Histogram of balance
hist(data$Income)
#### Histogram of balance with 20 bins
hist(Balance, breaks = 20, col="red", main="", xlab="Credit Card Balance")
# create two histograms, separated by group, in one plot
par(mfrow = c(1,2)) # set the plotting area to two columns
hist(data$Income[data$Default == "yes"], main = "Defaulted (yes)", xlab = "Income", col = "blue")
hist(data$Income[data$Default == "no"], main = "Defaulted (no)", xlab = "Income", col = "green")
ggplot(data, aes(x = Income, fill = Default)) +
geom_histogram(alpha = 0.5, position = "identity", bins = 20) +
scale_fill_manual(values = c("red", "blue")) +
labs(title = "Histogram of Values by Group", x = "Income", y = "Frequency")
ggplot(data, aes(x = Income, fill = Default)) +
geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
labs(x = "Value", y = "Frequency", title = "Histogram of Income by Defaulted Clients") +
scale_fill_manual(values = c("blue", "green")) +
facet_wrap(~Default, ncol = 2)
group_mean_age <- aggregate(Income ~ Education, data = data, mean)
group_mean_age
# Create some sample data
x <- c("Doctoral Degree (e.g., PhD or EdD)", "Professional Degree (e.g., MD or JD)", "Master's Degree (e.g., MA or MS)","Bachelor's Degree (e.g., BA or BS)","Associate's Degree (e.g., AA or AS)","Certificate/Diploma in a Trade or Vocational Program","High School Diploma")
y <- c( 217142.9,212956.1,147062.4,126550.3,220894.3,168164.3,148235.3)
data_2 <- data.frame(Category = x, Value = y)
# Create a bar plot
ggplot(data_2, aes(x = Category, y = Value)) +
geom_bar(stat = "identity") +
labs(title = "Bar Plot of Average Income Across Education", x = "Education", y = "Average Income")+
theme(axis.text.x = element_text(angle = 60, vjust = 0.5))
dat <- data[,c(6,8)] %>%
tbl_summary(by = Default) %>%
add_p() %>%
add_overall() %>%
bold_labels()
dat
| Characteristic | Overall, N = 30,0001 | yes, N = 23,3641 | no, N = 6,6361 | p-value2 |
|---|---|---|---|---|
| Marital.Status | <0.001 | |||
| Divorced | 54 (0.2%) | 49 (0.2%) | 5 (<0.1%) | |
| Married | 13,659 (46%) | 10,453 (45%) | 3,206 (48%) | |
| Single | 15,964 (53%) | 12,623 (54%) | 3,341 (50%) | |
| Other | 323 (1.1%) | 239 (1.0%) | 84 (1.3%) | |
| 1 n (%) | ||||
| 2 Pearson's Chi-squared test | ||||
The table presents data on marital status among a sample of 30,000 individuals, with 23,364 indicating “yes” and 6,636 indicating “no.” The p-value of <0.001 indicates that there is a statistically significant association between marital status and the “yes”/“no” outcome variable (Default).
Looking at the table, we can see that the majority of individuals in the sample are either single (53%) or married (46%). Divorced individuals make up only a small percentage (0.2%) of the sample, while those identifying as “other” comprise 1.1%.
When we compare the distribution of marital status between the “yes” and “no” groups, we can see that the proportion of married individuals is slightly higher among those who answered “no” (48%) than among those who answered “yes” (45%). Conversely, the proportion of single individuals is slightly higher among those who answered “yes” (54%) than among those who answered “no” (50%). These small differences may contribute to the statistically significant association between marital status and the outcome variable (Default).
In statistics, correlation is a measure of the strength and direction of the linear relationship between two variables. It measures how closely related two variables are to each other. Correlation coefficients range from -1 to +1, where -1 represents a perfect negative correlation (i.e., as one variable increases, the other variable decreases), 0 represents no correlation, and +1 represents a perfect positive correlation (i.e., as one variable increases, the other variable increases).
A correlation coefficient of zero does not mean that there is no relationship between the variables, only that there is no linear relationship between them. It is possible for variables to have a nonlinear relationship or a relationship that is not captured by a linear model. Correlation can be used to explore relationships between variables, identify patterns, and make predictions. It is important to note, however, that correlation does not necessarily imply causation, and that other factors may be influencing the relationship between the variables. Consider the table below with the rule of thumb.
+==================================================+ |Correlation coefficient | Interpretation | ———————– | ————————– |0.90 - 1.00 | Very strong correlation | |0.70 - 0.89 | Strong correlation | |0.40 - 0.69 | Moderate correlation | |0.10 - 0.39 | Weak correlation | |0.00 - 0.10 | Negligible correlation | +==================================================+
dat <- data.frame(Income,Balance,Age)
dat %>%
dplyr::select(Income:Age) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation - Income, Balance and Age",
mar=c(0,0,2,0),
tl.cex=0.5, number.cex = 0.4)
plot(data[-1], col = Default)
ggplot(data, aes(x = Income, fill = Marital.Status)) +
geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
labs(x = "Value", y = "Frequency", title = "Histogram of Income by Marital Status of Clients") +
scale_fill_manual(values = c("blue", "green","orange","purple")) +
facet_wrap(~Marital.Status, ncol = 7)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))