---
title: "assignment 2"
author: "Mikaela Taylor"
date: "07/23/23"
output:
  html_document:
    toc: yes
    toc_depth: 4
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    theme: readable
    fig_height: 4
---
The goal with this data set is to predict which customers will subscribe to a term deposit, based on characteristics of the customer and the contact methods used by the campaign. The data set contains 45,211 observations and 17 variables. No missing or null values were found in any of the variables. The variables and their descriptions are listed below:
1 - age: age of customer (numeric)
2 - job: type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
3 - marital: marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
4 - education (categorical: “unknown”, “secondary”, “primary”, “tertiary”)
5 - default: has credit in default? (binary: “yes”, “no”)
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: “yes”, “no”)
8 - loan: has personal loan? (binary: “yes”, “no”)
Related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds (numeric)
Other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)
Output variable:
17 - y: has the client subscribed to a term deposit? (binary: “yes”, “no”)
Boxplots for EDA
Histograms for EDA
Histograms for Transformed Features
The campaign variable was more difficult because none of the transformations tried (log, square root, and cube root) normalized its distribution. Instead, the variable was grouped into categories (1, 2, 3, and 4-63 contacts) to remove the sparse values. For both pdays and previous, about 82% of the observations (36,954 of 45,211) were customers who had not been previously contacted, so each variable was collapsed into two groups: not contacted and contacted. This also removed the sparse groups.
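One way to put a number on "did not normalize" is to compare the skewness of the variable before and after each transformation. The sketch below is in Python rather than the R used for this report, and it is an illustration only: the skewness measure is my own choice (the report judged the histograms by eye), and the `campaign` values are hypothetical, not the real column.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed contact counts (NOT the real campaign column).
campaign = [1] * 40 + [2] * 25 + [3] * 12 + [4] * 8 + [6] * 6 + [10] * 5 + [25] * 3 + [63]

# The three transformations named in the text, plus the raw values.
transforms = {
    "raw": lambda v: v,
    "sqrt": math.sqrt,
    "cube root": lambda v: v ** (1 / 3),
    "log": math.log,  # only defined for v > 0
}

skew_by_transform = {
    name: skewness([f(v) for v in campaign]) for name, f in transforms.items()
}
for name, s in skew_by_transform.items():
    print(f"{name:>9}: skewness = {s:.2f}")
```

A value near 0 would indicate a roughly symmetric distribution. On data shaped like this, every transformation lowers the skewness, but it stays clearly positive, which is the situation where grouping becomes a reasonable fallback.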
##
##     0    1+
## 36954  8257
##
##    -1     0
## 36954  8257
##
##     1     2     3  4-63
## 17544 12505  5521  9641
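The grouping described above can be sketched as two small recoding functions. This is a Python illustration, not the R code that produced the tables. The function names and the sample rows are hypothetical, while the level labels (1, 2, 3, 4-63, and contacted or not contacted) follow the text.

```python
from collections import Counter

def group_campaign(n_contacts):
    """Keep 1, 2 and 3 as their own levels; pool 4 and above into '4-63'."""
    return str(n_contacts) if n_contacts <= 3 else "4-63"

def contacted_before(pdays):
    """pdays == -1 encodes 'client was not previously contacted'."""
    return "not contacted" if pdays == -1 else "contacted"

# Hypothetical rows of (campaign, pdays); not taken from the real data.
rows = [(1, -1), (2, -1), (5, 92), (1, 180), (63, -1), (3, -1)]

campaign_groups = Counter(group_campaign(c) for c, _ in rows)
contact_groups = Counter(contacted_before(p) for _, p in rows)

print(campaign_groups)
print(contact_groups)
```

The same two-level recoding applies to previous, with 0 as the "not contacted" value instead of -1.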
Next, pairwise comparison was performed with a pairwise scatterplot for each numeric variable, excluding the newly grouped variables campaign, previous, and pdays. The scatterplots all showed a similar pattern: the red curve diverged from the blue curve, indicating only weak correlation between each pair of numeric variables. Because no pair was strongly correlated, none of the variables is redundant with another, so all of them should be retained for subsequent models and algorithms.
## Warning: Removed 31 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 31 rows containing missing values
## Warning: Removed 31 rows containing missing values (`geom_point()`).
Pairwise Plots
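The correlations read off the pairwise plots can also be computed directly. The following is a minimal Python sketch (not the R code behind the plots) of the Pearson correlation coefficient for every pair of numeric columns. The column values are hypothetical and only show the mechanics.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for three of the numeric variables (not real rows).
columns = {
    "age": [30, 45, 52, 28, 61, 39],
    "balance": [1200, 300, 4500, 80, 950, 2100],
    "duration": [180, 95, 410, 60, 250, 130],
}

names = list(columns)
corr = {
    (a, b): pearson_r(columns[a], columns[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for (a, b), r in corr.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

Coefficients close to 0 correspond to the weak relationships described above, while values near 1 or -1 would flag a pair that carries largely the same information.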
None of the categorical variables contained missing or null values, so the next step was to group variables with many categories in order to reduce the number of levels. Both job and month had more than 10 categories, so month was grouped by season rather than by month, and job was collapsed into broader categories. The first new category, unknown/unemployed, included student, retired, unemployed, and unknown. The second, Blue-collar/Services, included services, housemaid, and technician. The last, Business, included admin., entrepreneur, management, and self-employed.
##
## Winter
##  45211
##
##        admin.   blue-collar  entrepreneur     housemaid    management
##          5171          9732          1487          1240          9458
##       retired self-employed      services       student    technician
##          2264          1579          4154           938          7597
##    unemployed       unknown
##          1303           288
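The job and month groupings can be written as explicit lookup tables. The sketch below is Python, not the R code used for this report, and it rests on two assumptions that the text does not state: blue-collar is placed in Blue-collar/Services, and months are assigned to seasons by the usual Northern Hemisphere convention (December to February as Winter, and so on). The function names are hypothetical.

```python
# Job groups as described in the text; placing "blue-collar" here is an assumption.
JOB_GROUPS = {
    "unknown/unemployed": ["student", "retired", "unemployed", "unknown"],
    "Blue-collar/Services": ["blue-collar", "services", "housemaid", "technician"],
    "Business": ["admin.", "entrepreneur", "management", "self-employed"],
}

# Season boundaries are an assumption; the report does not list them.
SEASONS = {
    "Winter": ["dec", "jan", "feb"],
    "Spring": ["mar", "apr", "may"],
    "Summer": ["jun", "jul", "aug"],
    "Fall": ["sep", "oct", "nov"],
}

# Invert both tables so each original level maps to its group.
JOB_TO_GROUP = {job: group for group, jobs in JOB_GROUPS.items() for job in jobs}
MONTH_TO_SEASON = {m: season for season, months in SEASONS.items() for m in months}

def group_job(job):
    """Map an original job level to its broader category."""
    return JOB_TO_GROUP[job]

def month_to_season(month):
    """Map a three-letter month code such as 'jul' to a season."""
    return MONTH_TO_SEASON[month.lower()]

print(group_job("technician"), month_to_season("jul"))
```

With dictionary lookups, an unexpected level raises a `KeyError` instead of being silently assigned to a default group, so a mapping mistake shows up immediately rather than in the frequency table.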
Next, pairwise comparison was performed for the categorical variables. Each variable was compared to the output variable y in a mosaic plot. The grouped variables previous and pdays were included in this comparison because grouping made them categorical. Most of the plots showed that whether a customer subscribed to the deposit depended on the variable, because the proportion of subscribers differed across that variable's categories. The only variable that appeared to be independent of y was default, so it should not be included in any subsequent models and algorithms.
Pairwise Comparison
Pairwise Comparison
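The visual reading of the mosaic plots can be backed by a chi-square test of independence on the same contingency table. This was not part of the analysis above. It is a Python sketch added as a possible numeric check, and the counts in `example` are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency
    table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows are levels of a predictor, columns are y = no / yes.
example = [[30, 10], [10, 30]]
stat, dof = chi_square_statistic(example)
print(f"chi-square = {stat:.2f} on {dof} degree(s) of freedom")
```

For a 2x2 table (1 degree of freedom), a statistic above 3.841 rejects independence at the 5% level. A variable such as default, whose mosaic plot looks evenly split, would be expected to give a much smaller statistic than the other predictors.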