library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_frame = read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv',header=TRUE, sep = ",")
str(data_frame)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" NA ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr NA NA NA NA ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr NA NA NA NA ...
## $ y : chr "no" "no" "no" "no" ...
‘campaign’, ‘pdays’ and ‘previous’ are three columns whose meaning is ambiguous from the data and column names alone. The documentation defines them as follows:
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means client was not previously contacted).
previous: number of contacts performed before this campaign and for this client.
Without reading the documentation, I would not have known what the value -1 means in the pdays column and would have assumed it was an erroneous value.
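One way to keep the -1 sentinel from distorting numeric summaries is to recode it as NA and keep the "never contacted" information in a separate flag. This is a minimal sketch on hypothetical toy rows, not the full data set; the names `toy`, `toy_clean` and `contacted_before` are assumptions introduced for illustration.

```r
library(dplyr)

# Hypothetical toy rows standing in for data_frame; only pdays and previous matter here.
toy <- data.frame(
  pdays    = c(-1L, 151L, -1L, 92L),
  previous = c(0L, 3L, 0L, 1L)
)

toy_clean <- toy |>
  mutate(
    # Flag clients who were contacted in an earlier campaign (computed before recoding).
    contacted_before = pdays != -1L,
    # Replace the -1 sentinel with NA so it is no longer treated as a number of days.
    pdays = na_if(pdays, -1L)
  )

# The summary now describes only clients who were actually contacted before.
summary(toy_clean$pdays)
```

Keeping the flag as its own column means the "not previously contacted" group can still be analysed after the sentinel is removed.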
summary(data_frame$campaign)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.764 3.000 63.000
summary(data_frame$pdays)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0 -1.0 -1.0 40.2 -1.0 871.0
summary(data_frame$poutcome)
## Length Class Mode
## 45211 character character
‘poutcome’: outcome of the previous marketing campaign.
According to the documentation and the unique values in the data set, this column takes the values NA, “failure”, “other” and “success”. However, the documentation does not explain what ‘other’ signifies.
unique(data_frame$poutcome)
## [1] NA "failure" "other" "success"
Count of each value (table() omits NA by default):
table(data_frame$poutcome)
##
## failure other success
## 4901 1840 1511
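Because `table()` drops NA unless told otherwise, the counts above cover only the non-missing rows. A small sketch, using a hypothetical toy vector in place of `data_frame$poutcome`, shows how the missing values could be counted alongside the others with the `useNA` argument.

```r
# Hypothetical toy vector standing in for data_frame$poutcome.
poutcome <- c(NA, "failure", "other", NA, "success", NA)

# Counts per value, with NA shown as its own category.
counts <- table(poutcome, useNA = "ifany")
counts

# Share of each value, including the missing share.
round(prop.table(counts), 2)
```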
data_frame |>
  filter(!(is.na(duration) | is.na(poutcome))) |>
  ggplot() +
  # x is duration, matching the NA filter above and the plot title
  geom_boxplot(mapping = aes(x = duration, y = poutcome)) +
  ggtitle("poutcome vs Duration") +
  theme_minimal()
Exploring the relation between the values in poutcome and other columns in the data.
Relation between poutcome, balance and education
p <- data_frame |>
  filter(!(is.na(education) | is.na(poutcome))) |>
  ggplot(aes(x = poutcome, y = balance, fill = education)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  scale_fill_brewer(palette = 'Pastel2')
p
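With `stat = "identity"` and many rows per poutcome/education pair, the bars for a pair are drawn on top of one another, so the visible bar reflects the extreme balances rather than a typical one. A possible alternative, sketched here on hypothetical toy rows, is to summarise each group first and plot one bar per group. The choice of the median and the names `toy`, `balance_summary` and `p_summary` are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows standing in for data_frame.
toy <- data.frame(
  poutcome  = c("failure", "failure", "success", "success", "other", "other"),
  education = c("primary", "primary", "tertiary", "tertiary", "secondary", "secondary"),
  balance   = c(100, 900, 250, 350, -50, 450)
)

# One row per poutcome/education pair, holding the median balance.
balance_summary <- toy |>
  filter(!(is.na(education) | is.na(poutcome))) |>
  group_by(poutcome, education) |>
  summarise(median_balance = median(balance), .groups = "drop")

# geom_col() plots the summarised values directly, one bar per group.
p_summary <- ggplot(balance_summary,
                    aes(x = poutcome, y = median_balance, fill = education)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel2")
p_summary
```

The same pattern would apply to the other bar charts in this section by swapping the grouping and value columns.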
Relation between poutcome, age and whether the client has subscribed to a term deposit
p <- data_frame |>
  filter(!is.na(poutcome)) |>
  ggplot(aes(x = poutcome, y = age, fill = y)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  scale_fill_brewer(palette = 'Pastel1')
p
Relation between poutcome, balance and job
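p <- data_frame |>
  filter(!(is.na(poutcome) | is.na(job))) |>
  ggplot(aes(x = poutcome, y = balance, fill = job)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  # fill is the mapped aesthetic, so a fill scale is needed; "Set3" supports
  # up to 12 colours, whereas "Blues" supports at most 9
  scale_fill_brewer(palette = "Set3")
p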
Relation between poutcome, balance and whether the client has subscribed to a term deposit
p <- data_frame |>
  filter(!(is.na(poutcome) | is.na(job))) |>
  ggplot(aes(x = poutcome, y = balance, fill = y)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  # fill is the mapped aesthetic, so use a fill scale rather than a colour scale
  scale_fill_brewer(palette = "Blues")
p
One significant risk in this data set is the presence of missing data and of values whose meaning we do not know. For instance, the graphs above suggest a relationship between the ‘poutcome’ column and whether the client has subscribed to a term deposit, yet ‘poutcome’ is NA for 36,959 of the 45,211 rows and the meaning of ‘other’ is undocumented.
To reduce the negative consequences, we can clean the data set and be careful when performing any analysis that depends on these columns.
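A first cleaning step could be to measure how much of each column is missing before deciding how to handle it. The sketch below uses hypothetical toy rows rather than the real file, and the names `toy`, `na_count` and `na_share` are assumptions for illustration.

```r
# Hypothetical toy rows standing in for data_frame.
toy <- data.frame(
  age      = c(58L, 44L, 33L, 47L),
  contact  = c(NA, NA, "cellular", NA),
  poutcome = c(NA, "failure", NA, NA),
  stringsAsFactors = FALSE
)

# Number and share of missing values in each column.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)
data.frame(na_count, na_share)
```

Columns with a high missing share are the ones where conclusions drawn from the non-missing rows deserve the most caution.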