data_frame = read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv',header=TRUE, sep = ",")
summary(data_frame)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
1 - age;
2 - job;
3 - marital(marital status);
4 - education;
5 - default: has credit in default?;
6 - balance: average yearly balance, in euros
7 - housing: has housing loan?;
8 - loan: has personal loan?;
9 - contact: contact communication type;
10 - day: last contact day of the month
11 - month: last contact month of year;
12 - duration: last contact duration, in seconds;
13 - campaign: number of contacts performed during this campaign and for this client
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
15 - previous: number of contacts performed before this campaign and for this client
16 - poutcome: outcome of the previous marketing campaign;
17 - y : has the client subscribed a term deposit?
Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R.
Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.
Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
Use linear regression to detect any upwards or downwards trends.
Do you need to subset the data for multiple trends?
Use smoothing to detect at least one season in your data, and interpret your results.
Can you illustrate the seasonality using ACF or PACF?
Converting month column to a column of continuous data
data_frame['month_value'] <- unclass(factor(data_frame$month, levels = month.name))
As mentioned in the Documentation, the data has been collected from May 2008 to November 2010. I am going to first add the year column to the data frame and then create a date column which is a combination of the ‘year’, ‘month’ and ‘day’ column
data_frame_2008 <- data_frame[1:27729,]
data_frame_2008 <- data_frame_2008 |> mutate(year = 2008)
tail(data_frame_2008)
## age job marital education default balance housing loan contact
## 27724 34 services married tertiary no 1442 no no cellular
## 27725 55 retired divorced secondary no 366 no no telephone
## 27726 38 management married tertiary no 0 yes no telephone
## 27727 32 blue-collar single secondary no 232 no no <NA>
## 27728 53 management divorced secondary no 1004 no yes telephone
## 27729 28 admin. single secondary no 6100 no no <NA>
## day month duration campaign pdays previous poutcome y month_value
## 27724 11 December 234 1 -1 0 <NA> no 12
## 27725 11 December 173 1 -1 0 <NA> no 12
## 27726 12 December 136 1 -1 0 <NA> no 12
## 27727 13 December 232 1 -1 0 <NA> no 12
## 27728 22 December 119 1 -1 0 <NA> yes 12
## 27729 27 December 333 1 -1 0 <NA> no 12
## year
## 27724 2008
## 27725 2008
## 27726 2008
## 27727 2008
## 27728 2008
## 27729 2008
data_frame_2009 <- data_frame[27730:42591,]
data_frame_2009<- data_frame_2009 |> mutate(year = 2009)
tail(data_frame_2009)
## age job marital education default balance housing loan contact
## 42586 61 management married secondary no 544 no no cellular
## 42587 30 management single tertiary no 2907 yes no cellular
## 42588 46 management divorced tertiary no 461 no no cellular
## 42589 30 housemaid single tertiary no 3185 yes no cellular
## 42590 40 admin. married secondary no 3126 yes no cellular
## 42591 32 technician married <NA> no 14533 no no cellular
## day month duration campaign pdays previous poutcome y month_value
## 42586 29 December 63 3 200 9 success yes 12
## 42587 30 December 146 2 218 1 failure no 12
## 42588 30 December 369 2 -1 0 <NA> yes 12
## 42589 30 December 158 3 -1 0 <NA> no 12
## 42590 30 December 110 3 232 2 failure no 12
## 42591 31 December 646 3 198 2 success no 12
## year
## 42586 2009
## 42587 2009
## 42588 2009
## 42589 2009
## 42590 2009
## 42591 2009
data_frame_2010 <- data_frame[42592:45211,]
data_frame_2010 <- data_frame_2010 |> mutate(year = 2010)
tail(data_frame_2010)
## age job marital education default balance housing loan
## 45206 25 technician single secondary no 505 no yes
## 45207 51 technician married tertiary no 825 no no
## 45208 71 retired divorced primary no 1729 no no
## 45209 72 retired married secondary no 5715 no no
## 45210 57 blue-collar married secondary no 668 no no
## 45211 37 entrepreneur married secondary no 2971 no no
## contact day month duration campaign pdays previous poutcome y
## 45206 cellular 17 November 386 2 -1 0 <NA> yes
## 45207 cellular 17 November 977 3 -1 0 <NA> yes
## 45208 cellular 17 November 456 2 -1 0 <NA> yes
## 45209 cellular 17 November 1127 5 184 3 success yes
## 45210 telephone 17 November 508 4 -1 0 <NA> no
## 45211 cellular 17 November 361 2 188 11 other no
## month_value year
## 45206 11 2010
## 45207 11 2010
## 45208 11 2010
## 45209 11 2010
## 45210 11 2010
## 45211 11 2010
data_frame <- rbind(data_frame_2008,data_frame_2009)
data_frame <- rbind(data_frame,data_frame_2010)
data_frame <- na.omit(data_frame)
data_frame$date<-as.Date(with(data_frame,paste(year,month_value,day,sep="-")),"%Y-%m-%d")
head(data_frame)
## age job marital education default balance housing loan contact
## 24061 33 admin. married tertiary no 882 no no telephone
## 24063 42 admin. single secondary no -247 yes yes telephone
## 24065 33 services married secondary no 3444 yes no telephone
## 24073 36 management married tertiary no 2415 yes no telephone
## 24078 36 management married tertiary no 0 yes no telephone
## 24087 44 blue-collar married secondary no 1324 yes no telephone
## day month duration campaign pdays previous poutcome y month_value
## 24061 21 October 39 1 151 3 failure no 10
## 24063 21 October 519 1 166 1 other yes 10
## 24065 21 October 144 1 91 4 failure yes 10
## 24073 22 October 73 1 86 4 other no 10
## 24078 23 October 140 1 143 3 failure yes 10
## 24087 25 October 119 1 89 2 other no 10
## year date
## 24061 2008 2008-10-21
## 24063 2008 2008-10-21
## 24065 2008 2008-10-21
## 24073 2008 2008-10-22
## 24078 2008 2008-10-23
## 24087 2008 2008-10-25
summary(data_frame)
## age job marital education
## Min. :18.00 Length:7842 Length:7842 Length:7842
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.78
## 3rd Qu.:47.00
## Max. :89.00
## default balance housing loan
## Length:7842 Min. :-1884 Length:7842 Length:7842
## Class :character 1st Qu.: 162 Class :character Class :character
## Mode :character Median : 595 Mode :character Mode :character
## Mean : 1552
## 3rd Qu.: 1734
## Max. :81204
## contact day month duration
## Length:7842 Min. : 1.00 Length:7842 Min. : 5.0
## Class :character 1st Qu.: 7.00 Class :character 1st Qu.: 113.0
## Mode :character Median :14.00 Mode :character Median : 194.0
## Mean :14.26 Mean : 261.3
## 3rd Qu.:20.00 3rd Qu.: 324.0
## Max. :31.00 Max. :2219.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 1.0 Min. : 1.000 Length:7842
## 1st Qu.: 1.000 1st Qu.:133.0 1st Qu.: 1.000 Class :character
## Median : 2.000 Median :195.0 Median : 2.000 Mode :character
## Mean : 2.064 Mean :223.3 Mean : 3.184
## 3rd Qu.: 2.000 3rd Qu.:326.0 3rd Qu.: 4.000
## Max. :16.000 Max. :871.0 Max. :275.000
## y month_value year date
## Length:7842 Min. : 1.000 Min. :2008 Min. :2008-10-21
## Class :character 1st Qu.: 4.000 1st Qu.:2009 1st Qu.:2009-04-06
## Mode :character Median : 5.000 Median :2009 Median :2009-05-12
## Mean : 5.807 Mean :2009 Mean :2009-07-09
## 3rd Qu.: 8.000 3rd Qu.:2009 3rd Qu.:2009-10-13
## Max. :12.000 Max. :2010 Max. :2010-11-17
The response variable I am picking is ‘y’. It denotes if the client has subscribed to a term deposit.
I am going to consider the number of people who have subscribed to a term deposit.
I am alos going to use ‘Group by’ to group by date, because there are cases where multiple clients have or have not subscribed to a term deposit on a single day
data_frame <- data_frame |>
mutate(y_value = ifelse(y %in% c("yes"),1, 0))
df <- select(data_frame, date, y_value)
df_grouped <- df|>
group_by(date,y_value) |>
summarise(y_value_avg = mean(y_value,na.rm=TRUE), size=n())
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
df_grouped <- df_grouped |>
filter(y_value==1)
df_grouped <- df_grouped[c('date','size')]
df_grouped <- as_data_frame(df_grouped)
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
## tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
df_grouped
## # A tibble: 346 × 2
## date size
## <date> <int>
## 1 2008-10-21 2
## 2 2008-10-23 1
## 3 2008-11-17 4
## 4 2008-11-18 3
## 5 2008-11-19 9
## 6 2008-11-20 10
## 7 2008-11-21 9
## 8 2009-01-28 1
## 9 2009-01-30 1
## 10 2009-02-02 7
## # ℹ 336 more rows
_
df_ts <- as_tsibble(df_grouped, index=date)
df_ts |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
labs(title = "Term Deposit Subscriptions Over Time") +
theme_hc()
Considering year wise data
df_ts |>
filter_index("2008") |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
labs(title = "Term Deposit Subscriptions Over Time") +
theme_hc()
df_ts |>
filter_index("2009") |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
labs(title = "Term Deposit Subscriptions Over Time") +
theme_hc()
df_ts |>
filter_index("2010") |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
labs(title = "Term Deposit Subscriptions Over Time") +
theme_hc()
From the above plots we can see that in the year 2009 has the highest number of term deposits made, with nearly 50 subscriptions made on a single day in May 2009.
model <- lm(size ~ date, data = df_ts)
summary(model)
##
## Call:
## lm(formula = size ~ date, data = df_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.111 -2.878 -0.939 1.391 45.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.997629 19.050039 1.942 0.0529 .
## date -0.002179 0.001304 -1.671 0.0956 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.635 on 344 degrees of freedom
## Multiple R-squared: 0.008055, Adjusted R-squared: 0.005171
## F-statistic: 2.793 on 1 and 344 DF, p-value: 0.09557
model$coefficients
## (Intercept) date
## 36.997629271 -0.002178935
ggplot(df_ts, aes(x = date, y = size)) +
geom_line() +
geom_smooth(method = "lm", color = "lightpink") +
labs(x = "date", y = "number of subscriptions", title = "Term Deposit Subscriptions Over Time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The model has a time coefficient which is very small (-0.00217) and a p-value of no statistic significance (p-value = 0.09557). This suggests that there is no significant linear trend in the number of term deposit subscriptions over time.
df_ts |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
geom_smooth(method = 'lm', color = 'lightpink', se=TRUE) +
labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2008-2010") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
df_ts |>
filter_index("2008") |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
geom_smooth(method = 'lm', color = 'lightpink', se=FALSE) +
labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2008") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
df_ts |>
filter_index("2009") |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2009") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
df_ts |>
filter_index("2010") |>
ggplot(mapping = aes(x = date, y = size)) +
geom_line() +
geom_smooth(method = 'lm', color = 'green', se=FALSE) +
labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2010") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
For the year 2010, we can see a wave pattern, with the number of subscriptions increasing between Feb - Mar, May - June, Aug-Sept.
df_ts |>
index_by(year = floor_date(date,'month')) |>
summarise(avg_subscriptions = mean(size, na.rm = TRUE)) |>
ggplot(mapping = aes(x = year, y = avg_subscriptions)) +
geom_line(color = '#ecb920') +
geom_smooth(span = 0.3, color = '#7d92a5', se=FALSE, ) +
labs(title = "Number of term deposit Over Time",
subtitle = "(yearwise)")+
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The middle point in this window holds the highest weight, and the weights decrease for data points further away from the middle. From the plot we can determine the number of subscriptions over the years. The resulting curve models the slopes for each of those lines from within the window.
To determine potential seasons, I am plotting a quarterly term deposit
df_ts |>
index_by(year = floor_date(date,'month')) |>
summarise(avg_subscriptions = mean(size, na.rm = TRUE)) |>
ggplot(mapping = aes(x = year, y = avg_subscriptions)) +
geom_line(color="lightblue",size=1.5) +
geom_smooth(span = 0.3, color = 'blue', se=FALSE, size=1.5) +
labs(title = 'Number of term deposit Over Time', subtitle = "by month",y="Average Subscriptions",x="Month") +
scale_x_date (breaks = "1 year", labels = \(x) year(x)) +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Like from 2009, the number of subscriptions increases till April and drops till July. Towards the end of 2010 we can see that there is a slight decrease.
acf(df_ts, ci = 0.95, na.action = na.exclude, main='Number of term deposit Over Time')
pacf(df_ts, na.action = na.exclude, main='Number of term deposit Over Time')
From the above graph , lag \(h = 3\) seems to be most correlated with the time periods.
The ACF seems has more value that is over the threshold.