Credit risk analysis is an essential part of modern banking. Many banks use automated scoring systems to decide whether to grant an applicant credit. This project uses the credit dataset, which consists of 21 variables and 1000 observations.
1. Load the dataset and convert the character variables to factors. Preview the data.
# Load the credit_data dataset
credit_data <- read.csv("credit_data.csv")
# Convert the character variables to factors
credit_data[sapply(credit_data, is.character)] <-
  lapply(credit_data[sapply(credit_data, is.character)], as.factor)
str(credit_data)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_status : Factor w/ 4 levels "<0",">=200","0<=X<200",..: 4 3 4 4 1 1 3 1 3 1 ...
## $ duration : int 24 36 24 24 12 48 26 12 36 6 ...
## $ credit_history : Factor w/ 5 levels "all paid","critical/other existing credit",..: 4 3 4 4 2 4 4 4 4 4 ...
## $ purpose : Factor w/ 10 levels "business","domestic appliance",..: 10 5 7 4 5 7 10 7 7 4 ...
## $ credit_amount : int 5433 8086 1376 2835 2171 6758 7966 1107 2323 1374 ...
## $ savings_status : Factor w/ 5 levels "<100",">=1000",..: 5 3 4 4 1 1 1 1 1 1 ...
## $ employment : Factor w/ 5 levels "<1",">=7","1<=X<4",..: 5 2 4 2 3 3 1 3 4 3 ...
## $ installment_commitment: int 2 2 4 3 4 3 2 2 4 1 ...
## $ personal_status : Factor w/ 4 levels "femaledivdepmar",..: 1 4 1 4 4 1 4 4 4 4 ...
## $ other_parties : Factor w/ 3 levels "co applicant",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ residence_since : int 4 4 1 4 4 2 3 2 4 2 ...
## $ property_magnitude : Factor w/ 4 levels "car","life insurance",..: 2 1 1 2 2 1 1 4 1 4 ...
## $ age : int 26 42 28 53 38 31 30 20 24 36 ...
## $ other_payment_plans : Factor w/ 3 levels "bank","none",..: 2 2 2 2 1 2 2 2 2 1 ...
## $ housing : Factor w/ 3 levels "for free","own",..: 3 2 2 2 2 2 2 3 3 2 ...
## $ existing_credits : int 1 4 1 1 2 1 2 1 1 1 ...
## $ job : Factor w/ 4 levels "high qualif/self emp/mgmt",..: 1 1 2 2 4 2 2 1 2 4 ...
## $ num_dependents : int 1 1 1 1 1 1 1 2 1 1 ...
## $ own_telephone : Factor w/ 2 levels "none","yes": 2 2 1 1 1 2 1 2 1 2 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ class : Factor w/ 2 levels "bad","good": 2 1 2 2 2 1 2 2 2 2 ...
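The conversion idiom used above can be tried on a small data frame before running it on the full dataset. The following is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration and are not part of the credit dataset.

```r
# Toy data frame (made-up values) with one character and one integer column
toy <- data.frame(
  housing  = c("own", "rent", "own"),
  duration = c(24L, 36L, 12L),
  stringsAsFactors = FALSE
)

# Same idiom as above: convert only the character columns to factors
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], as.factor)

# Column classes after the conversion
col_classes <- sapply(toy, class)
print(col_classes)
```

Only the character column changes class; the integer column is left as it was, which matches the mix of Factor and int columns in the str() output.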
2. Using the dplyr package, transform the data frame as follows:
remove the following columns: checking_status, savings_status, installment_commitment, other_parties, other_payment_plans, num_dependents and own_telephone;
rename variable job to employment_status;
filter the data by the following criteria: include the observations with the credit_amount >=300 and exclude observations with credit_history “existing paid” and “no credits/all paid”;
save the modified version of the data frame into a new object and use it in subsequent analysis. (6 points)
library(dplyr)

new_credit_data <- credit_data %>%
  # Drop the columns that are not needed in the analysis
  select(-c(checking_status, savings_status, installment_commitment,
            other_parties, other_payment_plans, num_dependents, own_telephone)) %>%
  rename(employment_status = job) %>%
  # Keep amounts of at least 300 and exclude the two listed credit histories
  filter(credit_amount >= 300 & credit_history != "existing paid" &
           credit_history != "no credits/all paid")
3. Create your own theme with the preferred parameters and preset it to be automatically applied to all visualizations in the file (hint: can be done with function theme_set()).
library(ggplot2)

# Custom theme elements; sizes are numeric, not strings
mytheme <- theme(plot.title = element_text(face = "bold", size = 12, color = "red"),
                 axis.title = element_text(face = "bold", size = 10, color = "blue"),
                 axis.text = element_text(face = "bold", size = 9, color = "black"),
                 panel.background = element_rect(fill = "white", color = "black"),
                 panel.grid.major.y = element_line(color = "grey", linetype = 1),
                 panel.grid.minor.y = element_line(color = "grey", linetype = 2),
                 panel.grid.minor.x = element_blank(),
                 legend.position = "top")
# Layer the custom elements on a complete base theme so that no element is left undefined
theme_set(theme_grey() + mytheme)
4. Visualize the credit_amount using a histogram. Do not forget to add a title and axes labels where necessary. Describe the shape of the distribution.
histo <- ggplot(new_credit_data, aes(credit_amount)) +
  geom_histogram(color = "red", fill = "lightblue", bins = 30) +
  labs(title = "Distribution of the Credit Amount",
       x = "Credit amount", y = "Count")
histo
# The distribution is right-skewed: a long tail of large amounts
# pulls the mean above the median.
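The statement that the mean exceeds the median can also be checked numerically rather than read off the plot. Here is a minimal sketch in base R; the `amounts` vector is made up for illustration, and the commented line at the end shows how the same comparison could be applied to new_credit_data.

```r
# Made-up, right-skewed amounts (not taken from the credit dataset)
amounts <- c(500, 800, 1200, 1500, 2000, 2500, 4000, 9000, 15000)

amount_mean   <- mean(amounts)
amount_median <- median(amounts)

# TRUE hints at a longer right tail
right_skew_hint <- amount_mean > amount_median
print(c(mean = amount_mean, median = amount_median))

# On the real data the same check would be, for example:
# mean(new_credit_data$credit_amount) > median(new_credit_data$credit_amount)
```

A mean above the median is only a hint of right skew, not a formal test, so it is best read together with the histogram.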
5. Visualize loan’s duration with a density plot (hint: use geom_density()). Arrange this plot together with the preceding histogram in a grid.
library(gridExtra)

densit <- ggplot(new_credit_data, aes(duration)) +
  geom_density(color = "blue", fill = "lightblue") +
  labs(title = "Density of Loan Duration", x = "Duration", y = "Density")
# Stack the histogram and the density plot in one grid
grid.arrange(histo, densit)
# The peaks of the density plot show where the durations are concentrated.
# The density falls as the duration increases, which means that
# long-duration loans are taken less often.
6. Visualize the credit_amount by categories of employment_status using boxplots. Read the boxplots: is the data for the different employment status groups normally distributed? Is any of the distributions skewed (not symmetric)? Add means to the graph to check the direction of the skewness.
ggplot(new_credit_data, aes(x = employment_status, y = credit_amount)) +
  geom_boxplot(fill = "lightblue") +
  # Red points mark the group means (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  labs(title = "Credit Amount by Employment Status",
       x = "Employment status", y = "Credit amount")
# None of the boxplots looks normally distributed. The mean is greater than
# the median, so the distributions are right-skewed.
7. Visualize the credit_amount by categories of credit_history using violin plot(s). Interpret the result you obtained.
ggplot(new_credit_data, aes(x = credit_history, y = credit_amount)) +
  geom_violin(fill = "lightblue") +
  # Red points mark the group means
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  labs(title = "Credit Amount by Credit History",
       x = "Credit history", y = "Credit amount")
# None of the distributions looks normal. They are right-skewed:
# the mean is greater than the median.
8. Visualize the relationship between the credit_amount and duration with a scatterplot. If there is an overplotting issue, overcome it with any of the preferred method(s) (hint: you can use jittering, adding transparency and/or marginal plots, changing the size of points, etc). What kind of relationship do you see? Also, add a best fit line and facet the scatterplot by class. What kind of relationship do you see?
# Both panels show a positive correlation. The relationship is clearer in the
# right-hand panel, where the points lie closer to the line of best fit.
ggplot(new_credit_data, aes(x = credit_amount, y = duration)) +
  # Jittered, semi-transparent points reduce overplotting
  geom_jitter(shape = 21, color = "black", fill = "yellow", size = 3, alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  facet_wrap(~class) +
  labs(title = "Credit Amount vs. Duration by Class",
       x = "Credit amount", y = "Duration")
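How closely the points follow the fitted line in each facet can be quantified with a per-class correlation coefficient. Below is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration, and on the real data new_credit_data would take its place.

```r
# Made-up observations for two classes (not taken from the credit dataset)
toy <- data.frame(
  class         = c("bad", "bad", "bad", "good", "good", "good"),
  credit_amount = c(1000, 3000, 5000, 1000, 2000, 3000),
  duration      = c(12, 36, 24, 6, 12, 18)
)

# Pearson correlation between amount and duration, computed separately per class
cor_by_class <- sapply(
  split(toy, toy$class),
  function(d) cor(d$credit_amount, d$duration)
)
print(round(cor_by_class, 2))
```

A coefficient closer to 1 means the points of that class lie closer to a rising straight line, which is what a visual comparison of the two panels tries to judge.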
9. Calculate average credit_amount grouped by purpose and visualize the statistic in ascending order using bar charts. Add data labels with the average score to the bars. Rotate the text on the X-axis vertically so that it does not overlap. For what purpose do people take on average larger loans?
# On average, people take the largest loans for used cars and the "other" category,
# and the smallest loans for retraining.
new_credit_data %>%
  group_by(purpose) %>%
  summarise(avg_credit_amount = round(mean(credit_amount))) %>%
  ggplot(aes(x = reorder(purpose, avg_credit_amount), y = avg_credit_amount,
             label = avg_credit_amount)) +
  geom_col(fill = "lightblue", colour = "black") +
  geom_text(vjust = 1.5, colour = "black", size = 3) +
  labs(title = "Average Credit Amount by Purpose",
       x = NULL, y = "Average Credit Amount") +
  # Rotate the x-axis labels vertically so that they do not overlap
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
str(new_credit_data$credit_amount)
## int [1:429] 8086 2171 4455 1442 522 2978 2862 1554 958 1940 ...
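The group averages shown on the bars can be cross-checked without dplyr. This is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration, and only the column names follow the dataset.

```r
# Made-up loans for three purposes (not taken from the credit dataset)
toy <- data.frame(
  purpose       = c("used car", "used car", "retraining", "radio/tv", "radio/tv"),
  credit_amount = c(6000, 8000, 1000, 2000, 3000)
)

# Mean credit amount per purpose as a named numeric vector
avg <- vapply(split(toy$credit_amount, toy$purpose), mean, numeric(1))

# Round and sort in ascending order, as in the bar chart
avg_by_purpose <- sort(round(avg))
print(avg_by_purpose)
```

The last name in the sorted vector is the purpose with the largest average loan, which answers the question in the prompt directly.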
10. Relevel the factor variable employment, with a natural order (from “unemployed” to “>=7”). Visualize the employment variable using bar chart, map class to fill aesthetics. Add position = “fill” so that bars show relative rather than absolute frequencies.
new_credit_data$employment <- factor(new_credit_data$employment,
                                     levels = c("unemployed", "<1", "1<=X<4", "4<=X<7", ">=7"))
ggplot(new_credit_data, aes(x = employment, fill = class)) +
  geom_bar(position = "fill") +
  labs(title = "Class by Years of Employment",
       x = "Employment years", y = "Proportion")
# The 4<=X<7 group has the highest ratio of good to bad credit.
# The unemployed group has a better ratio than those employed for less than a year.
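The shares shown by the filled bars can also be computed as a table, which makes the comparison between groups exact. What follows is a minimal sketch in base R; the `toy` data frame is made up for illustration and uses only three of the five employment levels.

```r
# Made-up applicants (not taken from the credit dataset)
toy <- data.frame(
  employment = factor(
    c("unemployed", "unemployed", "<1", "<1", "<1", "<1", ">=7", ">=7"),
    levels = c("unemployed", "<1", ">=7")
  ),
  class = factor(
    c("good", "bad", "good", "bad", "bad", "bad", "good", "good"),
    levels = c("bad", "good")
  )
)

# Row-wise proportions: the share of each class within every employment group
class_share <- prop.table(table(toy$employment, toy$class), margin = 1)
print(round(class_share, 2))
```

With margin = 1 each row sums to 1, so the "good" column corresponds to the height of the good segment in a bar drawn with position = "fill".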