Week 5 Data Dive

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data_frame = read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv',header=TRUE, sep = ",")

str(data_frame)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" NA ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  NA NA NA NA ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  NA NA NA NA ...
##  $ y        : chr  "no" "no" "no" "no" ...

Questions

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

‘campaign’, ‘pdays’ and ‘previous’ are three columns that are ambiguous if we look only at the data and the column names.

Meanings of the columns as mentioned in the documentation

campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means client was not previously contacted).
previous: number of contacts performed before this campaign and for this client.

Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

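If I had not read the documentation, I would not have understood what the value -1 denotes in the pdays column and would have assumed it was an error. A likely reason for the encoding is that -1 can never be a real number of days, so it can mark "not previously contacted" while keeping the column an integer.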

summary(data_frame$campaign)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.764   3.000  63.000

summary(data_frame$pdays)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -1.0    -1.0    -1.0    40.2    -1.0   871.0

summary(data_frame$poutcome)

##    Length     Class      Mode 
##     45211 character character
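
Because -1 in pdays is a marker rather than a real number of days, the mean of 40.2 reported above mixes clients who were never contacted with real day counts. The following is a minimal sketch of one way to separate the two. It runs on a small invented data frame, and the names `toy`, `previously_contacted` and `pdays_clean` are illustrative, not part of the data set or its documentation.

```r
library(dplyr)

# Toy stand-in for data_frame; the values are invented for illustration.
toy <- data.frame(pdays = c(-1L, -1L, 10L, 30L))

toy_clean <- toy |>
  mutate(
    # TRUE when the client was reached in an earlier campaign
    previously_contacted = pdays != -1L,
    # Replace the -1 marker with NA so it is not treated as a day count
    pdays_clean = replace(pdays, pdays == -1L, NA)
  )

# Mean gap in days among previously contacted clients only
mean(toy_clean$pdays_clean, na.rm = TRUE)
```

The same mutate() call could be applied to data_frame itself, assuming pdays is read in as an integer column as shown by str().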

At least one element of your data that is unclear even after reading the documentation

The element that remains unclear even after going through the documentation is ‘poutcome’.

‘poutcome’ - ‘outcome of the previous marketing campaign’

According to the documentation and the unique values in the data set, the column takes the values NA, “other”, “failure” and “success”, but the documentation does not say what “other” signifies.

unique(data_frame$poutcome)

## [1] NA        "failure" "other"   "success"

Count of values

table(data_frame$poutcome)

## 
## failure   other success 
##    4901    1840    1511
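
Note that table() drops NA by default, so the three counts above add up to only 8,252 of the 45,211 rows. The following is a small sketch, on invented values, of how the missing entries could be included in the tally. `useNA` is a standard argument of base R's table(); the vector `toy_poutcome` is made up for illustration.

```r
# Invented values standing in for data_frame$poutcome
toy_poutcome <- c(NA, "failure", "other", NA, "success", "failure")

# useNA = "ifany" adds a separate count for NA when any are present
counts <- table(toy_poutcome, useNA = "ifany")
counts
```

Running the same call on data_frame$poutcome should show how many rows have no recorded previous outcome.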

Building a visualization of the column

data_frame |>
  filter(!(is.na(duration) | is.na(poutcome)))|>
  ggplot() +
  # plot duration so the axis matches the filter above and the title below
  geom_boxplot(mapping = aes(x = duration, y = poutcome)) +
  ggtitle("poutcome vs Duration") +
  theme_minimal()

Exploring the relation between the values in poutcome and other columns in the data.

Relation between poutcome, balance and education

p <- data_frame |>
  filter(!(is.na(education) | is.na(poutcome)))|>
  ggplot(aes(x = poutcome, y=balance, fill =education ) )+
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  scale_fill_brewer(palette = 'Pastel2')
  
p

Relation between poutcome, age and whether the client has subscribed to a term deposit

p <- data_frame |>
  filter(!is.na(poutcome))|>
  ggplot(aes(x = poutcome, y=age, fill =y ) )+
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  scale_fill_brewer(palette = 'Pastel1')
  
p

Relation between poutcome, balance and job

p <- data_frame |>
  filter(!(is.na(poutcome)|is.na(job)))|>
  ggplot(aes(x = poutcome, y=balance, fill =job ) )+
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
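  # the bars use the fill aesthetic, so a fill scale is needed;
  # "Blues" offers at most 9 shades, fewer than the job categories
  scale_fill_brewer(palette = "Set3")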
  
p

Relation between poutcome, balance and whether the client has subscribed to a term deposit

p <- data_frame |>
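  # job is not used in this plot, so only poutcome is checked for NA
  filter(!is.na(poutcome))|>
  ggplot(aes(x = poutcome, y=balance, fill =y ) )+
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  # the bars use the fill aesthetic, so a fill scale is needed
  scale_fill_brewer(palette = "Blues")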
  
p
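
A caveat on the bar charts above: with stat = "identity" and position = "dodge", ggplot2 draws one bar per row, and bars belonging to the same group are drawn on top of each other. Each visible bar therefore effectively shows the largest value in its group rather than a typical one. One possible alternative, sketched here on invented data, is to summarise first and plot the group means. The names `toy`, `balance_means`, `mean_balance` and `p_means` are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Invented rows standing in for data_frame
toy <- data.frame(
  poutcome = c("failure", "failure", "success", "success", "other", NA),
  y        = c("no", "yes", "yes", "yes", "no", "no"),
  balance  = c(100, 300, 500, 700, 200, 50)
)

# One row per poutcome/y combination, holding the mean balance
balance_means <- toy |>
  filter(!is.na(poutcome)) |>
  group_by(poutcome, y) |>
  summarise(mean_balance = mean(balance), .groups = "drop")

# geom_col() plots the summarised values exactly as given
p_means <- ggplot(balance_means, aes(x = poutcome, y = mean_balance, fill = y)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1")

p_means
```

Whether the mean, the median or another summary is the right choice depends on the question being asked; the mean is used here only as an example.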

Do you notice any significant risks? If so, what could you do to reduce negative consequences?

One significant risk in the data set is the presence of missing data and of values whose meaning we do not know. For instance, the graphs above suggest a relationship between the ‘poutcome’ column and whether the client subscribed to a term deposit, yet most ‘poutcome’ values are missing and the meaning of “other” is undocumented, so conclusions drawn from this column could be misleading.
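
One way to check a relationship like this numerically rather than by eye is a row-wise proportion table. The following is a sketch on invented values: `toy` stands in for data_frame, and NA is kept as its own row so that clients with no recorded previous outcome are not silently dropped.

```r
# Invented rows standing in for data_frame
toy <- data.frame(
  poutcome = c("failure", "failure", "success", "success", NA, NA, NA, "other"),
  y        = c("no", "no", "yes", "yes", "no", "no", "yes", "no")
)

# Cross-tabulate, keeping NA in poutcome as a separate row
tab <- table(toy$poutcome, toy$y, useNA = "ifany")

# margin = 1 turns each row into shares of "no" and "yes"
shares <- prop.table(tab, margin = 1)
round(shares, 2)
```

If the share of "yes" differs strongly between rows in the real data, that would support the pattern seen in the plots.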

To reduce the negative consequences, we can clean the data set and take care when performing any analysis that relies on these columns.
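
As a first step toward cleaning, it helps to know how much is missing in each column. The following is a minimal base-R sketch on an invented data frame; `toy`, `na_counts` and `na_share` are illustrative names, and `toy` stands in for data_frame.

```r
# Invented rows standing in for data_frame
toy <- data.frame(
  education = c("tertiary", NA, "secondary", NA),
  contact   = c(NA, NA, NA, "cellular"),
  balance   = c(2143, 29, 2, 1506)
)

# Number of NA values per column
na_counts <- colSums(is.na(toy))

# Share of rows that are missing, per column
na_share <- colMeans(is.na(toy))

na_counts
na_share
```

Columns with a high share of missing values may need to be analysed separately or excluded, depending on the question.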