data_frame_full <- read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv', header = TRUE, sep = ",")
summary(data_frame_full)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
1 - age
2 - job
3 - marital: marital status
4 - education
5 - default: has credit in default?
6 - balance: average yearly balance, in euros
7 - housing: has housing loan?
8 - loan: has personal loan?
9 - contact: contact communication type
10 - day: last contact day of the month
11 - month: last contact month of year
12 - duration: last contact duration, in seconds
13 - campaign: number of contacts performed during this campaign and for this client
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
15 - previous: number of contacts performed before this campaign and for this client
16 - poutcome: outcome of the previous marketing campaign
17 - y: has the client subscribed to a term deposit?
data_frame <- na.omit(data_frame_full)
Build a linear (or generalized linear) model as you like
Use the tools from previous weeks to diagnose the model
Interpret at least one of the coefficients
y - Response Variable
I have selected y, a binary variable, as the response. It records whether the client subscribed to a term deposit, which is the outcome the bank's marketing campaign is trying to achieve.
Converting y from a character label to a numeric 0/1 indicator
library(dplyr)  # provides mutate()
# Encode the response as 1 for "yes" and 0 otherwise
data_frame <- data_frame |>
  mutate(y_value = ifelse(y == "yes", 1, 0))
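The encoding step can be sanity-checked by cross-tabulating the original labels against the new indicator. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the bank data) and an equivalent base R encoding, so it runs on its own.

```r
# Toy data standing in for the bank data; only the y column matters here.
toy <- data.frame(y = c("yes", "no", "no", "yes", "no"))

# Equivalent encoding in base R: TRUE/FALSE is converted to 1/0.
toy$y_value <- as.integer(toy$y == "yes")

# Cross-tabulate labels against the indicator.
# Every "yes" should land in column 1 and every "no" in column 0.
encoding_check <- table(toy$y, toy$y_value)
print(encoding_check)
```

Running the same `table()` call on `data_frame$y` and `data_frame$y_value` should show zeros in the off-diagonal cells if the encoding worked as intended.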
I have selected duration as the explanatory variable.
Building an lm model of age on duration.
model_lm <- lm(age ~ duration, data = data_frame)
print(summary(model_lm),show.residuals=TRUE)
##
## Call:
## lm(formula = age ~ duration, data = data_frame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.887 -8.310 -2.768 6.700 48.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.005e+01 1.897e-01 211.131 < 2e-16 ***
## duration 2.814e-03 5.385e-04 5.224 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.26 on 7840 degrees of freedom
## Multiple R-squared: 0.003469, Adjusted R-squared: 0.003342
## F-statistic: 27.3 on 1 and 7840 DF, p-value: 1.791e-07
plot(model_lm, which=1)
plot(model_lm, which=2)
qqnorm(resid(model_lm))
As seen in the Q-Q plot above, the points do not fall along a straight line, so the residuals depart from a normal distribution.
hist(resid(model_lm))
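The residuals are not normally distributed. They are right-skewed, with a median of -2.768 and a much longer upper tail (maximum 48.369) than lower tail (minimum -22.887).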
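Skewness in the residuals can also be checked numerically rather than only by eye. The sketch below is an illustration on simulated data, not the bank data: the variable names, the seed, and the noise distribution are all assumptions chosen to produce right-skewed residuals.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated predictor and a response with centred, right-skewed noise.
n <- 500
x <- runif(n, min = 0, max = 1000)
y <- 40 + 0.003 * x + (rexp(n, rate = 0.1) - 10)

fit <- lm(y ~ x)
r <- resid(fit)

# Sample skewness: third central moment divided by the variance to the power 1.5.
# Values well above 0 indicate a long right tail.
skewness <- mean((r - mean(r))^3) / mean((r - mean(r))^2)^1.5

# A median below the mean is another sign of right skew.
print(c(mean = mean(r), median = median(r), skewness = skewness))
```

Applying the same skewness formula to `resid(model_lm)` would give a single number to report alongside the histogram.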
rsquared <- summary(model_lm)$r.squared
rsquared
## [1] 0.003469446
R-squared ranges from 0 to 1, and a higher value indicates a better fit of the model to the data. Here it is about 0.0035, so duration explains only about 0.35% of the variance in age, which is a poor fit.
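To make the meaning of R-squared concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses R's built-in `cars` dataset as a stand-in (an assumption made so the example runs without the bank CSV).

```r
# Fit a simple linear model on a built-in dataset.
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(resid(fit)^2)

# Total sum of squares: variation of the response around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared is the share of total variation explained by the model.
r2_manual <- 1 - ss_res / ss_tot

# The manual value should agree with the one reported by summary().
print(all.equal(r2_manual, summary(fit)$r.squared))
```

The same three lines, with `model_lm` and `data_frame$age` substituted in, should reproduce the 0.003469 reported above.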
model_lm$coefficients
## (Intercept) duration
## 40.048680414 0.002813632
For each additional second of call duration, the expected age of the client increases by about 0.0028 years.
When the duration of the marketing call is zero seconds (which is not practically meaningful), the estimated age is 40.049 years.
The coefficient is positive and statistically significant (p = 1.79e-07), which suggests a positive linear relationship between the duration of the marketing call and the age of the client, although the effect is very small.
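The size of the slope is easier to judge on a larger scale than one second. The sketch below plugs the estimates printed above into a few lines of arithmetic. The 100-second difference and the 300-second call are illustrative choices, and the interval is an approximation built from the reported standard error rather than the output of `confint()`.

```r
# Estimates copied from the model summary above.
intercept <- 40.048680414
slope     <- 0.002813632
slope_se  <- 5.385e-04
resid_df  <- 7840

# Expected difference in age between two calls that differ by 100 seconds.
diff_100s <- slope * 100

# Fitted age for a hypothetical 300-second call.
fitted_300 <- intercept + slope * 300

# Approximate 95% confidence interval for the slope.
t_crit <- qt(0.975, df = resid_df)
ci <- c(lower = slope - t_crit * slope_se,
        upper = slope + t_crit * slope_se)

print(diff_100s)
print(fitted_300)
print(ci)
```

A call that is 100 seconds longer is associated with a client who is about 0.28 years older on average, which confirms that the relationship, while statistically detectable, is small in practical terms.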