Initial preparation of data

Loading libraries and reading the data from CSV into a data frame

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

#path to the file = 'C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv'

Reading data into data_frame

data_frame <- read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv', header = TRUE, sep = ",")

head(data_frame)

##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no    <NA>   5
## 2  44   technician  single secondary      no      29     yes   no    <NA>   5
## 3  33 entrepreneur married secondary      no       2     yes  yes    <NA>   5
## 4  47  blue-collar married      <NA>      no    1506     yes   no    <NA>   5
## 5  33         <NA>  single      <NA>      no       1      no   no    <NA>   5
## 6  35   management married  tertiary      no     231     yes   no    <NA>   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0     <NA> no
## 2   may      151        1    -1        0     <NA> no
## 3   may       76        1    -1        0     <NA> no
## 4   may       92        1    -1        0     <NA> no
## 5   may      198        1    -1        0     <NA> no
## 6   may      139        1    -1        0     <NA> no

Exploring the data

Dimensions of the data and data types of the columns

Dimensions of the data

dim(data_frame)

## [1] 45211    17

Data types of the columns

str(data_frame)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" NA ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  NA NA NA NA ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  NA NA NA NA ...
##  $ y        : chr  "no" "no" "no" "no" ...

Sampling the data

I am considering the following columns while selecting the samples:

age
job
education
default
balance
housing
loan

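# Target sample size: half of the 45211 rows (the sampling calls below use 22605, the value rounded down)
size <- 0.5 * 45211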
size

## [1] 22605.5

First Sample

data_frame_selected <- select(data_frame, c('age','job','education','default','balance','housing','loan'))

data_frame_sample1 <- sample_n(data_frame_selected,22605, replace = TRUE)

head(data_frame_sample1)

##   age           job education default balance housing loan
## 1  39   blue-collar   primary      no     519     yes  yes
## 2  38        admin. secondary      no     439      no   no
## 3  31    management  tertiary      no     732     yes   no
## 4  49        admin. secondary      no     652      no  yes
## 5  26    technician secondary      no    -211     yes   no
## 6  29 self-employed  tertiary      no    8749      no   no
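Because no random seed is set, each run of the sampling code draws different rows, so the printed results change from run to run. A minimal sketch of a reproducible draw follows. It uses base R, a small made-up data frame (`toy`) in place of the bank data, and an arbitrary seed value of 42; all three are assumptions for illustration only.

```r
# Toy stand-in for data_frame_selected (hypothetical values, not the bank data)
toy <- data.frame(
  age     = c(25, 32, 41, 58, 36, 47),
  balance = c(120, -50, 2300, 870, 0, 415)
)

# Sample size: half of the rows, rounded down to a whole number
n_sample <- floor(0.5 * nrow(toy))

# Fixing the seed makes the same row indices come out on every run
set.seed(42)  # arbitrary seed, chosen only for illustration
rows_a <- sample(nrow(toy), n_sample, replace = TRUE)

set.seed(42)
rows_b <- sample(nrow(toy), n_sample, replace = TRUE)

sample_a <- toy[rows_a, ]
sample_b <- toy[rows_b, ]

# Same seed, same draw
identical(rows_a, rows_b)
```

The same idea carries over to dplyr: calling `set.seed()` before `sample_n()` (or before `slice_sample()`, which supersedes it in current dplyr releases) fixes the rows that are drawn.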

Summary of the Data Frame

summary(data_frame_sample1)

##       age            job             education           default         
##  Min.   :18.00   Length:22605       Length:22605       Length:22605      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :41.03                                                           
##  3rd Qu.:49.00                                                           
##  Max.   :89.00                                                           
##     balance         housing              loan          
##  Min.   : -8019   Length:22605       Length:22605      
##  1st Qu.:    78   Class :character   Class :character  
##  Median :   458   Mode  :character   Mode  :character  
##  Mean   :  1397                                        
##  3rd Qu.:  1455                                        
##  Max.   :102127

Finding the number of clients and the mean balance in each education group for sample1

# Drop rows with missing education, then summarise each education level
data_frame_sample1_group <- data_frame_sample1 |>
  filter(!is.na(education)) |>
  group_by(education) |>
  summarise(category_mean_balance = mean(balance, na.rm = TRUE), size = n())

data_frame_sample1_group

## # A tibble: 3 × 3
##   education category_mean_balance  size
##   <chr>                     <dbl> <int>
## 1 primary                   1263.  3490
## 2 secondary                 1212. 11619
## 3 tertiary                  1756.  6586

Second Sample

data_frame_sample2 <- sample_n(data_frame_selected,22605, replace = TRUE)

head(data_frame_sample2)

##   age        job education default balance housing loan
## 1  34   services secondary      no       0     yes   no
## 2  34 technician secondary      no     558      no   no
## 3  31     admin. secondary      no     525      no  yes
## 4  44 technician   primary      no    2887     yes   no
## 5  31 management  tertiary      no     197     yes   no
## 6  35 technician secondary      no    2151     yes   no

Summary of the Data Frame

summary(data_frame_sample2)

##       age           job             education           default         
##  Min.   :18.0   Length:22605       Length:22605       Length:22605      
##  1st Qu.:33.0   Class :character   Class :character   Class :character  
##  Median :39.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.9                                                           
##  3rd Qu.:48.0                                                           
##  Max.   :94.0                                                           
##     balance        housing              loan          
##  Min.   :-3313   Length:22605       Length:22605      
##  1st Qu.:   74   Class :character   Class :character  
##  Median :  453   Mode  :character   Mode  :character  
##  Mean   : 1322                                        
##  3rd Qu.: 1401                                        
##  Max.   :98417

Finding the number of clients and the mean balance in each education group for sample2

# Drop rows with missing education, then summarise each education level
data_frame_sample2_group <- data_frame_sample2 |>
  filter(!is.na(education)) |>
  group_by(education) |>
  summarise(category_mean_balance = mean(balance, na.rm = TRUE), size = n())

data_frame_sample2_group

## # A tibble: 3 × 3
##   education category_mean_balance  size
##   <chr>                     <dbl> <int>
## 1 primary                   1199.  3492
## 2 secondary                 1133. 11513
## 3 tertiary                  1690.  6669

Third Sample

data_frame_sample3 <- sample_n(data_frame_selected,22605, replace = TRUE)

head(data_frame_sample3)

##   age         job education default balance housing loan
## 1  33  management  tertiary      no   22867     yes   no
## 2  36 blue-collar secondary      no     134      no   no
## 3  40  management  tertiary      no     351     yes   no
## 4  33  technician  tertiary      no       0      no   no
## 5  53  management  tertiary      no     404      no   no
## 6  60     retired  tertiary      no     100      no   no

Summary of the Data Frame

summary(data_frame_sample3)

##       age            job             education           default         
##  Min.   :18.00   Length:22605       Length:22605       Length:22605      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.89                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##     balance        housing              loan          
##  Min.   :-3372   Length:22605       Length:22605      
##  1st Qu.:   69   Class :character   Class :character  
##  Median :  436   Mode  :character   Mode  :character  
##  Mean   : 1321                                        
##  3rd Qu.: 1406                                        
##  Max.   :71188

Finding the number of clients and the mean balance in each education group for sample3

# Drop rows with missing education, then summarise each education level
data_frame_sample3_group <- data_frame_sample3 |>
  filter(!is.na(education)) |>
  group_by(education) |>
  summarise(category_mean_balance = mean(balance, na.rm = TRUE), size = n())

data_frame_sample3_group

## # A tibble: 3 × 3
##   education category_mean_balance  size
##   <chr>                     <dbl> <int>
## 1 primary                   1223.  3350
## 2 secondary                 1146. 11635
## 3 tertiary                  1660.  6669

Fourth Sample

data_frame_sample4 <- sample_n(data_frame_selected,22605, replace = TRUE)

head(data_frame_sample4)

##   age         job education default balance housing loan
## 1  45  management  tertiary      no    1395      no   no
## 2  57     retired secondary      no    3783     yes   no
## 3  44  management  tertiary      no    3355     yes   no
## 4  47 blue-collar   primary      no     554     yes   no
## 5  19     student secondary      no    1803      no   no
## 6  33      admin.  tertiary      no     235     yes   no

Summary of the Data Frame

summary(data_frame_sample4)

##       age            job             education           default         
##  Min.   :18.00   Length:22605       Length:22605       Length:22605      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.89                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##     balance        housing              loan          
##  Min.   :-3313   Length:22605       Length:22605      
##  1st Qu.:   75   Class :character   Class :character  
##  Median :  447   Mode  :character   Mode  :character  
##  Mean   : 1372                                        
##  3rd Qu.: 1422                                        
##  Max.   :98417

Finding the number of clients and the mean balance in each education group for sample4

# Drop rows with missing education, then summarise each education level
data_frame_sample4_group <- data_frame_sample4 |>
  filter(!is.na(education)) |>
  group_by(education) |>
  summarise(category_mean_balance = mean(balance, na.rm = TRUE), size = n())

data_frame_sample4_group

## # A tibble: 3 × 3
##   education category_mean_balance  size
##   <chr>                     <dbl> <int>
## 1 primary                   1271.  3482
## 2 secondary                 1170. 11613
## 3 tertiary                  1753.  6597

Fifth Sample

data_frame_sample5 <- sample_n(data_frame_selected,22605, replace = TRUE)

head(data_frame_sample5)

##   age         job education default balance housing loan
## 1  53  management  tertiary      no    8563     yes   no
## 2  32  management  tertiary      no    6248     yes   no
## 3  43  unemployed secondary      no       0     yes   no
## 4  37    services secondary      no     731     yes   no
## 5  48 blue-collar secondary      no    -202     yes  yes
## 6  32 blue-collar secondary      no     290     yes   no

Summary of the Data Frame

summary(data_frame_sample5)

##       age            job             education           default         
##  Min.   :18.00   Length:22605       Length:22605       Length:22605      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.96                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##     balance         housing              loan          
##  Min.   : -8019   Length:22605       Length:22605      
##  1st Qu.:    70   Class :character   Class :character  
##  Median :   449   Mode  :character   Mode  :character  
##  Mean   :  1345                                        
##  3rd Qu.:  1381                                        
##  Max.   :102127

Finding the number of clients and the mean balance in each education group for sample5

# Drop rows with missing education, then summarise each education level
data_frame_sample5_group <- data_frame_sample5 |>
  filter(!is.na(education)) |>
  group_by(education) |>
  summarise(category_mean_balance = mean(balance, na.rm = TRUE), size = n())

data_frame_sample5_group

## # A tibble: 3 × 3
##   education category_mean_balance  size
##   <chr>                     <dbl> <int>
## 1 primary                   1295.  3373
## 2 secondary                 1130. 11717
## 3 tertiary                  1768.  6566
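The five sample sections above repeat the same draw-and-summarise steps. As a sketch of how that repetition could be folded into one helper and one loop, the base R code below uses a hypothetical toy data frame, an arbitrary seed and a helper name (`summarise_by_education`) introduced here; none of these come from the original analysis.

```r
# Toy stand-in for data_frame_selected (hypothetical values, not the bank data)
toy <- data.frame(
  education = c("primary", "secondary", "tertiary", NA,
                "secondary", "tertiary", "primary", "secondary"),
  balance   = c(100, 250, 900, 40, 310, 1200, 80, 500)
)

# Mean balance and row count per education level, ignoring missing education
summarise_by_education <- function(df) {
  df <- df[!is.na(df$education), ]
  data.frame(
    education             = sort(unique(df$education)),
    category_mean_balance = as.numeric(tapply(df$balance, df$education, mean)),
    size                  = as.integer(table(df$education))
  )
}

# Draw five samples with replacement and summarise each one
set.seed(1)  # arbitrary seed, for illustration only
samples <- lapply(1:5, function(i) {
  toy[sample(nrow(toy), 4, replace = TRUE), ]
})
names(samples) <- paste0("sample", 1:5)
group_tables <- lapply(samples, summarise_by_education)

# Summary of the full toy data, for comparison with the samples
full_table <- summarise_by_education(toy)
full_table
```

Keeping the samples in a named list means that later comparisons can loop over `samples` or `group_tables` instead of naming each object by hand.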

Findings among the Samples

Comparing the value of ‘balance’ among the samples

Mean balance of the samples

sample <- c("sample1", "sample2","sample3","sample4","sample5")
mean_balance <- c(mean(data_frame_sample1$balance), mean(data_frame_sample2$balance),mean(data_frame_sample3$balance),mean(data_frame_sample4$balance),mean(data_frame_sample5$balance))

sample_mean_balance <- data.frame(sample, mean_balance)
sample_mean_balance

##    sample mean_balance
## 1 sample1     1396.572
## 2 sample2     1322.086
## 3 sample3     1320.810
## 4 sample4     1372.111
## 5 sample5     1344.683

Plotting the mean balance of the samples

p <- sample_mean_balance |>
  ggplot(aes(x = sample, y=mean_balance) )+
  geom_bar(position = "dodge", stat = "identity",fill="lightblue") +
  theme_minimal()
  
p

Max balance of the samples

sample <- c("sample1", "sample2","sample3","sample4","sample5")
max_balance <- c(max(data_frame_sample1$balance), max(data_frame_sample2$balance),max(data_frame_sample3$balance),max(data_frame_sample4$balance),max(data_frame_sample5$balance))

sample_max_balance <- data.frame(sample, max_balance)
sample_max_balance

##    sample max_balance
## 1 sample1      102127
## 2 sample2       98417
## 3 sample3       71188
## 4 sample4       98417
## 5 sample5      102127

Plotting the max balance of each of the samples

p <- sample_max_balance |>
  ggplot(aes(x = sample, y=max_balance) )+
  geom_bar(position = "dodge", stat = "identity",fill="lightpink") +
  theme_minimal()
  
p

IQR balance of the samples

sample <- c("sample1", "sample2","sample3","sample4","sample5")
IQR_balance <- c(IQR(data_frame_sample1$balance), IQR(data_frame_sample2$balance),IQR(data_frame_sample3$balance),IQR(data_frame_sample4$balance),IQR(data_frame_sample5$balance))

sample_IQR_balance <- data.frame(sample, IQR_balance)
sample_IQR_balance

##    sample IQR_balance
## 1 sample1        1377
## 2 sample2        1327
## 3 sample3        1337
## 4 sample4        1347
## 5 sample5        1311

Plotting the IQR of balance of each of the samples

p <- sample_IQR_balance |>
  ggplot(aes(x = sample, y=IQR_balance) )+
  geom_bar(position = "dodge", stat = "identity",fill="yellow") +
  theme_minimal()
  
p
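The mean, max and IQR comparisons above follow the same pattern, so they could also be computed in one pass. A compact sketch, assuming the samples are stored in a named list; the `samples` list and its values below are hypothetical stand-ins, not the bank data.

```r
# Hypothetical samples, each with a balance column
samples <- list(
  sample1 = data.frame(balance = c(100, 250, 900, 40)),
  sample2 = data.frame(balance = c(310, 1200, 80, 500)),
  sample3 = data.frame(balance = c(0, 75, 660, 2100))
)

# One row per sample, one column per statistic
balance_stats <- data.frame(
  sample       = names(samples),
  mean_balance = sapply(samples, function(df) mean(df$balance)),
  max_balance  = sapply(samples, function(df) max(df$balance)),
  IQR_balance  = sapply(samples, function(df) IQR(df$balance)),
  row.names    = NULL
)
balance_stats
```

A single table like this can be passed to `ggplot()` for each statistic without rebuilding the `sample` vector every time.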

Calculating the probabilities

Calculating totals for each sample

total_sample1_size <- sum(data_frame_sample1_group$size)
total_sample2_size <- sum(data_frame_sample2_group$size)
total_sample3_size <- sum(data_frame_sample3_group$size)
total_sample4_size <- sum(data_frame_sample4_group$size)
total_sample5_size <- sum(data_frame_sample5_group$size)

Probability of having primary as education level for each sample

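# Note: the counts below are hard-coded and differ from the group tables printed above
# (for example, 3397 here versus 3490 primary clients in sample1), so the three
# probabilities for a sample do not sum exactly to 1. The same applies to the
# secondary and tertiary counts that follow.
primary_education_sample1 <- 3397 / total_sample1_size
primary_education_sample2 <- 3317 / total_sample2_size
primary_education_sample3 <- 3512 / total_sample3_size
primary_education_sample4 <- 3364 / total_sample4_size
primary_education_sample5 <- 3404 / total_sample5_size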

Probability of having secondary as education level for each sample

secondary_education_sample1 <- 11647/total_sample1_size
secondary_education_sample2 <- 11606/total_sample2_size
secondary_education_sample3 <- 11515/total_sample3_size
secondary_education_sample4 <- 11664/total_sample4_size
secondary_education_sample5 <- 11648/total_sample5_size

Probability of having tertiary as education level for each sample

tertiary_education_sample1 <- 6644/total_sample1_size
tertiary_education_sample2 <- 6770/total_sample2_size
tertiary_education_sample3 <- 6633/total_sample3_size
tertiary_education_sample4 <- 6637/total_sample4_size
tertiary_education_sample5 <- 6663/total_sample5_size

Combining the probabilities of each education level into one data frame

sample_names <- c("sample1", "sample2","sample3","sample4","sample5")
secondary_education <- c(secondary_education_sample1,secondary_education_sample2,secondary_education_sample3,secondary_education_sample4,secondary_education_sample5)
primary_education <- c(primary_education_sample1,primary_education_sample2,primary_education_sample3,primary_education_sample4,primary_education_sample5)
tertiary_education <- c(tertiary_education_sample1,tertiary_education_sample2,tertiary_education_sample3,tertiary_education_sample4,tertiary_education_sample5)


sample_education_level_probability <- data.frame(sample_names,primary_education,secondary_education,tertiary_education)
sample_education_level_probability

##   sample_names primary_education secondary_education tertiary_education
## 1      sample1         0.1565799           0.5368518          0.3062457
## 2      sample2         0.1530405           0.5354803          0.3123558
## 3      sample3         0.1621871           0.5317724          0.3063175
## 4      sample4         0.1550802           0.5377098          0.3059653
## 5      sample5         0.1571851           0.5378648          0.3076745
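The counts in the probability calculations above are entered by hand, so they can drift away from the group tables whenever the sampling is rerun. A sketch that derives the proportions directly from a group table is shown below. It re-enters the sample1 counts printed earlier (3490, 11619, 6586), and the column name `probability` is introduced here for illustration. Rows with missing education are excluded, so these are proportions among clients with a known education level.

```r
# Group table for sample1, re-entered from the output printed earlier
sample1_group <- data.frame(
  education = c("primary", "secondary", "tertiary"),
  size      = c(3490, 11619, 6586)
)

# Proportion of each education level among rows with known education
sample1_group$probability <- sample1_group$size / sum(sample1_group$size)
sample1_group

# The proportions of one sample always sum to 1
sum(sample1_group$probability)
```

Because numerator and denominator come from the same table, the three proportions for a sample are guaranteed to sum to 1.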

Plotting the probability and samples

primary_education

p <- sample_education_level_probability |>
  ggplot(aes(x = sample_names, y=(primary_education)) )+
  geom_bar(position = "dodge", stat = "identity",fill="yellow") +
  theme_minimal()
  
p

# Pie chart with percentages
slices <- c(primary_education_sample1, primary_education_sample2, primary_education_sample3, primary_education_sample4, primary_education_sample5)

# One label per slice
lbls <- c("primary_education_sample1", "primary_education_sample2", "primary_education_sample3", "primary_education_sample4", "primary_education_sample5")
pct <- round(slices / sum(slices) * 100)
# Add the percentages to the labels
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep = "") # add the % sign to the labels
pie(slices, labels = lbls, main = "Pie Chart - primary_education", col = topo.colors(5))

secondary_education

p <- sample_education_level_probability |>
  ggplot(aes(x = sample_names, y=(secondary_education)) )+
  geom_bar(position = "dodge", stat = "identity",fill="lightgreen") +
  theme_minimal()
  
p

# Pie chart with percentages
slices <- c(secondary_education_sample1, secondary_education_sample2, secondary_education_sample3, secondary_education_sample4, secondary_education_sample5)

# One label per slice
lbls <- c("secondary_education_sample1", "secondary_education_sample2", "secondary_education_sample3", "secondary_education_sample4", "secondary_education_sample5")
pct <- round(slices / sum(slices) * 100)
# Add the percentages to the labels
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep = "") # add the % sign to the labels
pie(slices, labels = lbls, main = "Pie Chart - secondary_education")

tertiary_education

p <- sample_education_level_probability |>
  ggplot(aes(x = sample_names, y=(tertiary_education)) )+
  geom_bar(position = "dodge", stat = "identity",fill="red") +
  theme_minimal()
  
p

library(RColorBrewer)
myPalette <- brewer.pal(5, "Set3")

# Pie chart with percentages
slices <- c(tertiary_education_sample1, tertiary_education_sample2, tertiary_education_sample3, tertiary_education_sample4, tertiary_education_sample5)

# One label per slice
lbls <- c("tertiary_education_sample1", "tertiary_education_sample2", "tertiary_education_sample3", "tertiary_education_sample4", "tertiary_education_sample5")
pct <- round(slices / sum(slices) * 100)
# Add the percentages to the labels
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep = "") # add the % sign to the labels
pie(slices, labels = lbls, main = "Pie Chart - tertiary_education", col = myPalette)

Questions

How different are they?

The five samples are very similar to each other in most respects: the mean balance ranges only from about 1321 to 1397, and the IQR of the balance from 1311 to 1377.

Anomaly in the Sub-sample

The maximum balance changes noticeably between samples. It is 71188 in the third sample, compared with 98417 in the second and fourth samples and 102127 in the first and fifth.
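One way to see why the maximum balance moves between samples: a sample of 22605 rows drawn with replacement from 45211 rows contains any one particular row only some of the time. The short calculation below is a sketch that assumes the largest balance belongs to a single row of the full data set.

```r
N <- 45211  # rows in the full data set
n <- 22605  # rows drawn per sample, with replacement

# Chance that one particular row is never picked in n independent draws
p_missing <- (1 - 1 / N)^n

# Chance that the row appears at least once in the sample
p_included <- 1 - p_missing
round(p_included, 3)
```

The result is roughly 0.39, so under that assumption the single largest balance would be missing from about three samples in five. The mean and the IQR depend on all rows rather than on one extreme row, which is why they are far more stable.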

Are there aspects of the data that are consistent among all sub-samples?

From the graphs above, the probability of picking a client with a given education level is nearly the same in every sample, and the mean and IQR of the balance are also nearly constant across samples.
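The stability of the mean is what the standard error formula, sd / sqrt(n), would predict for samples this large. A small simulation can illustrate the idea. The population below is a hypothetical skewed variable generated with `rexp()`, not the bank data, and the seed, sample size and number of repetitions are arbitrary choices.

```r
set.seed(123)  # arbitrary seed, for illustration only

# Hypothetical right-skewed "balance" population (not the bank data)
population <- rexp(10000, rate = 1 / 1300)

# Draw 200 samples of 5000 values with replacement and record each mean
n <- 5000
sample_means <- replicate(200, mean(sample(population, n, replace = TRUE)))

# Spread of the sample means versus the standard error formula
observed_spread <- sd(sample_means)
expected_spread <- sd(population) / sqrt(n)
c(observed = observed_spread, expected = expected_spread)
```

The two numbers should be close to each other and small relative to the population mean, which matches the behaviour seen in the five samples: larger samples give means that vary less from draw to draw.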

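How does this investigation affect the conclusions you draw about the data in the future?

Take extra care when testing any hypothesis: statistics that depend on a few extreme values, such as the maximum, can change substantially from one sample to the next. Base conclusions on multiple samples rather than on a single one.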

Data Dive Week 4

PreranaRajole

2023-09-17
