Data Dive Week 11

Libraries and Data

# dplyr provides mutate(), which is used further down
library(dplyr)

data_frame_full <- read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv', header = TRUE, sep = ",")

Summary of data frame

summary(data_frame_full)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Columns of data frame

1 - age

2 - job

3 - marital: marital status

4 - education

5 - default: has credit in default?

6 - balance: average yearly balance, in euros

7 - housing: has housing loan?

8 - loan: has personal loan?

9 - contact: contact communication type

10 - day: last contact day of the month

11 - month: last contact month of the year

12 - duration: last contact duration, in seconds

13 - campaign: number of contacts performed during this campaign for this client

14 - pdays: number of days since the client was last contacted in a previous campaign

15 - previous: number of contacts performed before this campaign for this client

16 - poutcome: outcome of the previous marketing campaign

17 - y: has the client subscribed to a term deposit?

data_frame <- na.omit(data_frame_full)
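It is worth checking how many rows na.omit() removes, because it drops every row that has a missing value in any column. The raw data has 45211 rows, while the model summary further down reports 7840 residual degrees of freedom for a two-coefficient model, which implies that only 7842 rows are left. The sketch below shows one way to count missing values per column and the rows dropped. It uses a small made-up data frame (toy_df is hypothetical, not the bank data).

```r
# Hypothetical toy data standing in for the bank data frame
toy_df <- data.frame(
  age = c(30, 45, 52, 28),
  poutcome = c("success", NA, "failure", NA),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy_df))
print(na_per_column)

# na.omit() keeps only rows with no missing value in any column
toy_clean <- na.omit(toy_df)
rows_dropped <- nrow(toy_df) - nrow(toy_clean)
print(rows_dropped)
```

Running the same two checks on data_frame_full would show which columns are responsible for the lost rows.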

Tasks for the Data Dive

Build a linear (or generalized linear) model as you like
- Use whatever response variable and explanatory variables you prefer
Use the tools from previous weeks to diagnose the model
- Highlight any issues with the model
Interpret at least one of the coefficients

Selecting Response Variable

y - Response Variable

I have selected a binary response variable: y records whether the client subscribed to a term deposit. This is important to the bank because it is the outcome the marketing campaign aims for.

Converting y from categorical ("yes"/"no") to numeric (1/0)

data_frame <- data_frame |>
  mutate(y_value = ifelse(y %in% c("yes"),1, 0))
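A quick cross-tabulation can confirm that the recoding did what was intended. The sketch below repeats the same recoding with base R on a small made-up data frame (toy_df is hypothetical, not the bank data).

```r
# Hypothetical toy data standing in for the bank data frame
toy_df <- data.frame(y = c("yes", "no", "no", "yes", "no"))

# Same recoding as above, written with base R: 1 for "yes", 0 otherwise
toy_df$y_value <- ifelse(toy_df$y == "yes", 1, 0)

# Cross-tabulate the original label against the numeric code;
# every "yes" should fall under 1 and every "no" under 0
check <- table(toy_df$y, toy_df$y_value)
print(check)
```

All counts should sit on the "no"/0 and "yes"/1 cells, with zeros elsewhere.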

Selecting explanatory variables

I have selected the following explanatory variable:

campaign
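Because y_value is binary, one natural way to relate it to campaign is a logistic regression, which is a generalized linear model. The sketch below is not the model fitted in this document. It runs on simulated data (sim_df and the coefficients used to generate it are made up), so the numbers say nothing about the bank data and only illustrate the call and how to read the slope.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the bank data: number of contacts per client
sim_df <- data.frame(campaign = sample(1:10, n, replace = TRUE))

# Assumed (made-up) relationship: more contacts, lower chance of "yes"
p <- plogis(-0.5 - 0.2 * sim_df$campaign)
sim_df$y_value <- rbinom(n, size = 1, prob = p)

# Logistic regression: binary response, logit link
model_glm <- glm(y_value ~ campaign, data = sim_df, family = binomial)
print(coef(summary(model_glm)))

# exp(slope) is the multiplicative change in the odds of y_value = 1
# for one additional contact
odds_ratio <- exp(coef(model_glm)["campaign"])
print(odds_ratio)
```

An odds ratio below 1 would mean each extra contact is associated with lower odds of subscribing, and above 1 with higher odds.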

Building an lm model with age as the response and duration as the explanatory variable.

lm()

model_lm <- lm(age ~ duration, data = data_frame)

print(summary(model_lm),show.residuals=TRUE)

## 
## Call:
## lm(formula = age ~ duration, data = data_frame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.887  -8.310  -2.768   6.700  48.369 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.005e+01  1.897e-01 211.131  < 2e-16 ***
## duration    2.814e-03  5.385e-04   5.224 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.26 on 7840 degrees of freedom
## Multiple R-squared:  0.003469,   Adjusted R-squared:  0.003342 
## F-statistic:  27.3 on 1 and 7840 DF,  p-value: 1.791e-07

Diagnosing the model

Residuals vs fitted values

plot(model_lm, which=1)

Checking normality of the residuals using the Normal Q-Q plot

plot(model_lm, which=2)

Q-Q Plot

qqnorm(resid(model_lm))
# Reference line: normally distributed residuals would fall along it
qqline(resid(model_lm))

As seen from the above plot, the points do not fall along a straight line, so the residuals are not normally distributed.

Histogram of the residuals

hist(resid(model_lm))

The distribution of the residuals is not normal. It is right-skewed, which agrees with the residual summary above (median -2.768, maximum 48.369).
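The skew can also be expressed as a number. The sketch below computes the sample skewness of the residuals by hand (third central moment divided by the cubed standard deviation). It uses simulated data with deliberately right-skewed errors (sim_x, sim_y and sim_fit are made up), so the value printed is not the skewness of the bank model.

```r
set.seed(1)

# Simulated predictor and a response with right-skewed errors
# (a centred exponential has a long right tail)
sim_x <- runif(1000, 0, 1000)
sim_y <- 40 + 0.003 * sim_x + (rexp(1000, rate = 1 / 10) - 10)

sim_fit <- lm(sim_y ~ sim_x)
res <- resid(sim_fit)

# Sample skewness: positive means a longer right tail,
# values near 0 mean a roughly symmetric distribution
centred <- res - mean(res)
skew <- mean(centred^3) / mean(centred^2)^1.5
print(skew)
```

Replacing res with resid(model_lm) would give the corresponding value for the model in this document.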

rsquared <- summary(model_lm)$r.squared
rsquared

## [1] 0.003469446

The R-squared value ranges from 0 to 1. A higher value indicates a better fit of the model to the data. Here the value is 0.0035, meaning duration explains only about 0.35% of the variance in age, so the model is not a good fit.
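To make clear what this number measures, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on simulated data (sim_df and its generating values are made up) and compares the result with the value reported by summary().

```r
set.seed(7)

# Simulated stand-in with a weak linear relationship
sim_df <- data.frame(duration = runif(200, 0, 1000))
sim_df$age <- 40 + 0.003 * sim_df$duration + rnorm(200, sd = 11)

sim_fit <- lm(age ~ duration, data = sim_df)

# Residual sum of squares: variation left unexplained by the model
ss_res <- sum(resid(sim_fit)^2)
# Total sum of squares: variation of the response around its mean
ss_tot <- sum((sim_df$age - mean(sim_df$age))^2)

r2_manual <- 1 - ss_res / ss_tot
print(c(manual = r2_manual, summary = summary(sim_fit)$r.squared))
```

The two printed values should agree, since this is the definition summary() uses for a model with an intercept.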

Interpretation of the Coefficients

model_lm$coefficients

##  (Intercept)     duration 
## 40.048680414  0.002813632

For each additional second of call duration, the estimated age increases by 0.0028 years.

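When the duration of the marketing call is zero seconds (which is not practically meaningful), the estimated age is 40.05 years.

The coefficient suggests a positive linear relationship between the duration of the marketing call and the age of the individual receiving the call. The effect is statistically significant (p = 1.79e-07) but very small: a call that is 100 seconds longer corresponds to an estimated age only about 0.28 years higher.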
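To put the slope on a more readable scale, the fitted line can be evaluated at two call durations. The sketch below copies the two coefficients from the output above. The helper predict_age is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied from the model output above
b0 <- 40.048680414   # intercept, in years
b1 <- 0.002813632    # slope, in years per second of call duration

# Hypothetical helper: estimated age for a given call duration in seconds
predict_age <- function(duration_sec) b0 + b1 * duration_sec

age_at_0 <- predict_age(0)
age_at_600 <- predict_age(600)   # a 10-minute call

print(age_at_600)
# Difference in estimated age between a 10-minute call and a 0-second call
print(age_at_600 - age_at_0)     # about 1.69 years
```

A 10-minute difference in call length moves the estimated age by less than two years, which is small next to the residual standard error of 11.26.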