Data Dive Week 12

Libraries and Data

data_frame = read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv',header=TRUE, sep = ",")

Summary of data frame

summary(data_frame)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Columns of data frame

1 - age;

2 - job;

3 - marital(marital status);

4 - education;

5 - default: has credit in default?;

6 - balance: average yearly balance, in euros

7 - housing: has housing loan?;

8 - loan: has personal loan?;

9 - contact: contact communication type;

10 - day: last contact day of the month

11 - month: last contact month of year;

12 - duration: last contact duration, in seconds;

13 - campaign: number of contacts performed during this campaign and for this client

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign

15 - previous: number of contacts performed before this campaign and for this client

16 - poutcome: outcome of the previous marketing campaign;

17 - y : has the client subscribed a term deposit?

Tasks To Be Performed

Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R.
Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.
Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
- What stands out immediately?
Use linear regression to detect any upwards or downwards trends.
Do you need to subset the data for multiple trends?
- How strong are these trends?
Use smoothing to detect at least one season in your data, and interpret your results.
Can you illustrate the seasonality using ACF or PACF?

Creating the Date column

Converting month column to a column of continuous data

data_frame['month_value'] <- unclass(factor(data_frame$month, levels = month.name))

As mentioned in the Documentation, the data has been collected from May 2008 to November 2010. I am going to first add the year column to the data frame and then create a date column which is a combination of the ‘year’, ‘month’ and ‘day’ column

Adding the year column

data_frame_2008 <- data_frame[1:27729,]
data_frame_2008 <- data_frame_2008 |> mutate(year = 2008)
tail(data_frame_2008)

##       age         job  marital education default balance housing loan   contact
## 27724  34    services  married  tertiary      no    1442      no   no  cellular
## 27725  55     retired divorced secondary      no     366      no   no telephone
## 27726  38  management  married  tertiary      no       0     yes   no telephone
## 27727  32 blue-collar   single secondary      no     232      no   no      <NA>
## 27728  53  management divorced secondary      no    1004      no  yes telephone
## 27729  28      admin.   single secondary      no    6100      no   no      <NA>
##       day    month duration campaign pdays previous poutcome   y month_value
## 27724  11 December      234        1    -1        0     <NA>  no          12
## 27725  11 December      173        1    -1        0     <NA>  no          12
## 27726  12 December      136        1    -1        0     <NA>  no          12
## 27727  13 December      232        1    -1        0     <NA>  no          12
## 27728  22 December      119        1    -1        0     <NA> yes          12
## 27729  27 December      333        1    -1        0     <NA>  no          12
##       year
## 27724 2008
## 27725 2008
## 27726 2008
## 27727 2008
## 27728 2008
## 27729 2008

data_frame_2009 <- data_frame[27730:42591,]
data_frame_2009<- data_frame_2009 |> mutate(year = 2009)
tail(data_frame_2009)

##       age        job  marital education default balance housing loan  contact
## 42586  61 management  married secondary      no     544      no   no cellular
## 42587  30 management   single  tertiary      no    2907     yes   no cellular
## 42588  46 management divorced  tertiary      no     461      no   no cellular
## 42589  30  housemaid   single  tertiary      no    3185     yes   no cellular
## 42590  40     admin.  married secondary      no    3126     yes   no cellular
## 42591  32 technician  married      <NA>      no   14533      no   no cellular
##       day    month duration campaign pdays previous poutcome   y month_value
## 42586  29 December       63        3   200        9  success yes          12
## 42587  30 December      146        2   218        1  failure  no          12
## 42588  30 December      369        2    -1        0     <NA> yes          12
## 42589  30 December      158        3    -1        0     <NA>  no          12
## 42590  30 December      110        3   232        2  failure  no          12
## 42591  31 December      646        3   198        2  success  no          12
##       year
## 42586 2009
## 42587 2009
## 42588 2009
## 42589 2009
## 42590 2009
## 42591 2009

data_frame_2010 <- data_frame[42592:45211,]
data_frame_2010 <- data_frame_2010 |> mutate(year = 2010)
tail(data_frame_2010)

##       age          job  marital education default balance housing loan
## 45206  25   technician   single secondary      no     505      no  yes
## 45207  51   technician  married  tertiary      no     825      no   no
## 45208  71      retired divorced   primary      no    1729      no   no
## 45209  72      retired  married secondary      no    5715      no   no
## 45210  57  blue-collar  married secondary      no     668      no   no
## 45211  37 entrepreneur  married secondary      no    2971      no   no
##         contact day    month duration campaign pdays previous poutcome   y
## 45206  cellular  17 November      386        2    -1        0     <NA> yes
## 45207  cellular  17 November      977        3    -1        0     <NA> yes
## 45208  cellular  17 November      456        2    -1        0     <NA> yes
## 45209  cellular  17 November     1127        5   184        3  success yes
## 45210 telephone  17 November      508        4    -1        0     <NA>  no
## 45211  cellular  17 November      361        2   188       11    other  no
##       month_value year
## 45206          11 2010
## 45207          11 2010
## 45208          11 2010
## 45209          11 2010
## 45210          11 2010
## 45211          11 2010

Recombining the data frames

data_frame <- rbind(data_frame_2008,data_frame_2009)
data_frame <- rbind(data_frame,data_frame_2010)

Removing NA values

data_frame <- na.omit(data_frame)

data_frame$date<-as.Date(with(data_frame,paste(year,month_value,day,sep="-")),"%Y-%m-%d")
head(data_frame)

##       age         job marital education default balance housing loan   contact
## 24061  33      admin. married  tertiary      no     882      no   no telephone
## 24063  42      admin.  single secondary      no    -247     yes  yes telephone
## 24065  33    services married secondary      no    3444     yes   no telephone
## 24073  36  management married  tertiary      no    2415     yes   no telephone
## 24078  36  management married  tertiary      no       0     yes   no telephone
## 24087  44 blue-collar married secondary      no    1324     yes   no telephone
##       day   month duration campaign pdays previous poutcome   y month_value
## 24061  21 October       39        1   151        3  failure  no          10
## 24063  21 October      519        1   166        1    other yes          10
## 24065  21 October      144        1    91        4  failure yes          10
## 24073  22 October       73        1    86        4    other  no          10
## 24078  23 October      140        1   143        3  failure yes          10
## 24087  25 October      119        1    89        2    other  no          10
##       year       date
## 24061 2008 2008-10-21
## 24063 2008 2008-10-21
## 24065 2008 2008-10-21
## 24073 2008 2008-10-22
## 24078 2008 2008-10-23
## 24087 2008 2008-10-25

summary(data_frame)

##       age            job              marital           education        
##  Min.   :18.00   Length:7842        Length:7842        Length:7842       
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.78                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :89.00                                                           
##    default             balance        housing              loan          
##  Length:7842        Min.   :-1884   Length:7842        Length:7842       
##  Class :character   1st Qu.:  162   Class :character   Class :character  
##  Mode  :character   Median :  595   Mode  :character   Mode  :character  
##                     Mean   : 1552                                        
##                     3rd Qu.: 1734                                        
##                     Max.   :81204                                        
##    contact               day           month              duration     
##  Length:7842        Min.   : 1.00   Length:7842        Min.   :   5.0  
##  Class :character   1st Qu.: 7.00   Class :character   1st Qu.: 113.0  
##  Mode  :character   Median :14.00   Mode  :character   Median : 194.0  
##                     Mean   :14.26                      Mean   : 261.3  
##                     3rd Qu.:20.00                      3rd Qu.: 324.0  
##                     Max.   :31.00                      Max.   :2219.0  
##     campaign          pdays          previous         poutcome        
##  Min.   : 1.000   Min.   :  1.0   Min.   :  1.000   Length:7842       
##  1st Qu.: 1.000   1st Qu.:133.0   1st Qu.:  1.000   Class :character  
##  Median : 2.000   Median :195.0   Median :  2.000   Mode  :character  
##  Mean   : 2.064   Mean   :223.3   Mean   :  3.184                     
##  3rd Qu.: 2.000   3rd Qu.:326.0   3rd Qu.:  4.000                     
##  Max.   :16.000   Max.   :871.0   Max.   :275.000                     
##       y              month_value          year           date           
##  Length:7842        Min.   : 1.000   Min.   :2008   Min.   :2008-10-21  
##  Class :character   1st Qu.: 4.000   1st Qu.:2009   1st Qu.:2009-04-06  
##  Mode  :character   Median : 5.000   Median :2009   Median :2009-05-12  
##                     Mean   : 5.807   Mean   :2009   Mean   :2009-07-09  
##                     3rd Qu.: 8.000   3rd Qu.:2009   3rd Qu.:2009-10-13  
##                     Max.   :12.000   Max.   :2010   Max.   :2010-11-17

Choosing the response variable

The response variable I am picking is ‘y’. It denotes if the client has subscribed to a term deposit.

I am going to consider the number of people who have subscribed to a term deposit.

I am alos going to use ‘Group by’ to group by date, because there are cases where multiple clients have or have not subscribed to a term deposit on a single day

data_frame <- data_frame |>
  mutate(y_value = ifelse(y %in% c("yes"),1, 0))

Creating a tibble of response variable and date column

df <- select(data_frame, date, y_value)

df_grouped <- df|>
    group_by(date,y_value) |>
      summarise(y_value_avg = mean(y_value,na.rm=TRUE), size=n())

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

df_grouped <- df_grouped |>
  filter(y_value==1)

df_grouped <- df_grouped[c('date','size')]
df_grouped <- as_data_frame(df_grouped)

## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
##   tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

df_grouped

## # A tibble: 346 × 2
##    date        size
##    <date>     <int>
##  1 2008-10-21     2
##  2 2008-10-23     1
##  3 2008-11-17     4
##  4 2008-11-18     3
##  5 2008-11-19     9
##  6 2008-11-20    10
##  7 2008-11-21     9
##  8 2009-01-28     1
##  9 2009-01-30     1
## 10 2009-02-02     7
## # ℹ 336 more rows

df_ts <- as_tsibble(df_grouped, index=date)

Plotting date vs number of people who subscribed to a term deposit.

df_ts |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  labs(title = "Term Deposit Subscriptions Over Time") +
  theme_hc()

Considering year wise data

df_ts |>
  filter_index("2008") |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  labs(title = "Term Deposit Subscriptions Over Time") +
  theme_hc()

df_ts |>
  filter_index("2009") |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  labs(title = "Term Deposit Subscriptions Over Time") +
  theme_hc()

df_ts |>
  filter_index("2010") |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  labs(title = "Term Deposit Subscriptions Over Time") +
  theme_hc()

From the above plots we can see that in the year 2009 has the highest number of term deposits made, with nearly 50 subscriptions made on a single day in May 2009.

Linear Regression model

model <- lm(size ~ date, data = df_ts)
summary(model)

## 
## Call:
## lm(formula = size ~ date, data = df_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.111 -2.878 -0.939  1.391 45.301 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 36.997629  19.050039   1.942   0.0529 .
## date        -0.002179   0.001304  -1.671   0.0956 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.635 on 344 degrees of freedom
## Multiple R-squared:  0.008055,   Adjusted R-squared:  0.005171 
## F-statistic: 2.793 on 1 and 344 DF,  p-value: 0.09557

model$coefficients

##  (Intercept)         date 
## 36.997629271 -0.002178935

ggplot(df_ts, aes(x = date, y = size)) +
  geom_line() +
  geom_smooth(method = "lm", color = "lightpink") +
  labs(x = "date", y = "number of subscriptions", title = "Term Deposit Subscriptions Over Time") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The model has a time coefficient which is very small (-0.00217) and a p-value of no statistic significance (p-value = 0.09557). This suggests that there is no significant linear trend in the number of term deposit subscriptions over time.

df_ts |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'lightpink', se=TRUE) +
  labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2008-2010") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

Sub setting data

df_ts |>
  filter_index("2008") |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'lightpink', se=FALSE) +
  labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2008") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

df_ts |>
  filter_index("2009") |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2009") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

df_ts |>
  filter_index("2010") |>
  ggplot(mapping = aes(x = date, y = size)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'green', se=FALSE) +
  labs(title = "Term Deposit Subscriptions Over Time",subtitle = "For year 2010") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

For the year 2010, we can see a wave pattern, with the number of subscriptions increasing between Feb - Mar, May - June, Aug-Sept.

SMOOTHING

df_ts |>
  index_by(year = floor_date(date,'month')) |>
  summarise(avg_subscriptions = mean(size, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = avg_subscriptions)) +
  geom_line(color = '#ecb920') +
  geom_smooth(span = 0.3, color = '#7d92a5', se=FALSE, ) +
  labs(title = "Number of term deposit Over Time",
       subtitle = "(yearwise)")+
  theme_hc()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The middle point in this window holds the highest weight, and the weights decrease for data points further away from the middle. From the plot we can determine the number of subscriptions over the years. The resulting curve models the slopes for each of those lines from within the window.

Illustrating the seasonality using ACF or PACF

To determine potential seasons, I am plotting a quarterly term deposit

df_ts |>
  index_by(year = floor_date(date,'month')) |>
  summarise(avg_subscriptions = mean(size, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = avg_subscriptions)) + 
  geom_line(color="lightblue",size=1.5) +
  geom_smooth(span = 0.3, color = 'blue', se=FALSE, size=1.5) +
  labs(title = 'Number of term deposit Over Time', subtitle = "by month",y="Average Subscriptions",x="Month") + 
  scale_x_date (breaks = "1 year", labels = \(x) year(x)) + 
  theme_bw()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Like from 2009, the number of subscriptions increases till April and drops till July. Towards the end of 2010 we can see that there is a slight decrease.

ACF

acf(df_ts, ci = 0.95, na.action = na.exclude, main='Number of term deposit Over Time')

PACF

pacf(df_ts, na.action = na.exclude, main='Number of term deposit Over Time')

From the above graph , lag \(h = 3\) seems to be most correlated with the time periods.

The ACF seems has more value that is over the threshold.