00 - Week 1 Data Dives Introduction

In this data dive, I am investigating (1) academic performance data and (2) marketing campaign outcome data to better understand their basic data structures, answer simple questions / queries about relationships amongst the data, and help determine which dataset I would like to use for my final project.

The following exploratory analysis is performed for each dataset in their respective areas – and a more concrete, summative findings section that meet the requirements of the course is directly below under the 01 - Output (Look Here) section. Specifically, per guidance from the TA – the focus will be on dataset #2 – bank campaign marketing data.

library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

01 - Output (Look Here)

To begin we will start by reading in the bank_marketing data set, which is composed of 45,211 observations and 17 columns. This dataset displays campaign marketing outcome on whether the contacted client subscribed for a term deposit (variable y). Variables can be divided into a couple of different categories and are described below:

The main goal of the dataset is to be able to predict if the client will subscribe (yes/no) to a term deposit (variable y) as directly stated in the data documentation here. Key questions asked during data exploration include:

## Read in File Location
bank_marketing <- read_delim("bank-marketing.csv", delim = ";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Return column names and information from dataset
class(bank_marketing)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
colnames(bank_marketing)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

Are higher educated individuals more likely to have access to a cellphone?

Yes, they are. 75% of individual with tertiary education – indicating that they have completed a college degree have a cellular phone versus 63% of those who only have completed up to high school. This falls into line with what expect–e.g., those who have completed more schooling likely have more resources and thus a greater mean to have access to a cellular device than those who do not.

# First display the raw number counts within each category
bank_marketing |>
  count(education, contact) |>
  pivot_wider(
    names_from = contact,
    values_from = n
  )
## # A tibble: 4 × 4
##   education cellular telephone unknown
##   <chr>        <int>     <int>   <int>
## 1 primary       3568       722    2561
## 2 secondary    14688      1365    7149
## 3 tertiary     10039       629    2633
## 4 unknown        990       190     677
# Now display these raw counts converted proportionally
bank_marketing |>
  count(education, contact) |>
  group_by(education) |>
  mutate(pct = round(n / sum(n), 2)) |>
  select(-n) |> # keeps all column but n for raw counts
  pivot_wider(
    names_from = contact,
    values_from = pct
  )
## # A tibble: 4 × 4
## # Groups:   education [4]
##   education cellular telephone unknown
##   <chr>        <dbl>     <dbl>   <dbl>
## 1 primary       0.52      0.11    0.37
## 2 secondary     0.63      0.06    0.31
## 3 tertiary      0.75      0.05    0.2 
## 4 unknown       0.53      0.1     0.36

Are “wealthier” clients more likely to subscribe to a term deposit?

Yes, there is evidence to suggest that “wealthier clients”– those defined by having a higher average yearly balance deposited with the bank – are more likely to subscribe to a term deposit. This is suggested by the box plot below where the median average balance for those who did not subscribe is $417, contrasted with $733 for those who did.

Another way to visualize this potential discrepancy is by using a scatterplot and segmenting out the colors on whether a term deposit was made. I also added age onto the y-axis to complete the scatterplot. I also needed to take a sample of the populations to allow for a proportional comparison in the scatteplot below, which I did by taking 200 data points and plotting. There is clear evidence of this phenomena.

# Create a Boxplot displaying the binary outcome variable mapped across income
bank_marketing |>
  ggplot(aes(x = y, y = balance)) +
  geom_boxplot() +
  stat_summary(
    fun = median,
    geom = 'text',
    aes(label = round(after_stat(y), 0)),
    vjust = -0.5
  ) +
  coord_cartesian(ylim = c(0, 5000)) + # adjust y-axis to avoid super outliers
  labs(
    title = 'Difference in Average Balance by Term Depsoit Subscription', 
    x = 'Term Deposit Subscription', 
    y = 'Average Yearly Balance')

# Take a sample from each group equal to each other to compare proprtionality
bank_marketing_sample <- bank_marketing |>
  group_by(y) |>
  slice_sample(n = 200) |>
  ungroup()

# Create a faceted scatter plot displaying average balance by color group (for outcome)
bank_marketing_sample |>
  ggplot(aes(x = age, y = balance)) +
  geom_point(data = mutate(bank_marketing_sample, y = NULL), color = 'grey85') +
  geom_point(color = 'darkred') +
  facet_wrap(~ y) +
  coord_cartesian(ylim = c(0, 5000)) +
  theme_classic() +
  labs(
    title = "Balance vs Age by Term Deposit Subscription Outcome",
    x = "Age",
    y = "Average Yearly Balance"
  )

Is there any connection between contact type (e.g., cellular vs telephone) and subscription to a term deposit?

There is a 2-percent positive differential between whether the client is contacted / has a cellular phone as opposed to a telephone. This outcome is less impressive than I would have expected given that those contacted cellular would likely have higher education levels and higher average balance levels which I figured would translate based on outcomes defined above into a greater difference. This is a finding that we can definitely explore deeper.

#

bank_marketing |>
  count(contact, y) |>
  group_by(contact) |>
  mutate(pct = round(n / sum(n), 2)) |>
  select(-n) |>
  pivot_wider(
    names_from = y,
    values_from = pct
  )
## # A tibble: 3 × 3
## # Groups:   contact [3]
##   contact      no   yes
##   <chr>     <dbl> <dbl>
## 1 cellular   0.85  0.15
## 2 telephone  0.87  0.13
## 3 unknown    0.96  0.04

02 - Academic Performance Data Exploratory Analysis

This academic performance dataset comes from the Univertisty of California Irvine to understand and predict academic performance – e.g., re-enrollment, graduation, or dropping out based on a set of variables. More information and data documentation can be found here

This following section is dedicated towards exploratory analysis and playing around with the data, please see above for the summary meeting all requirements of the Week 2 data dive.

Read in Dataset

## Read in File Location
acad_perf <- read_delim("academic-performance-data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Return column names from dataset
class(acad_perf)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
colnames(acad_perf)
##  [1] "Marital status"                                
##  [2] "Application mode"                              
##  [3] "Application order"                             
##  [4] "Course"                                        
##  [5] "Daytime/evening attendance\t"                  
##  [6] "Previous qualification"                        
##  [7] "Previous qualification (grade)"                
##  [8] "Nacionality"                                   
##  [9] "Mother's qualification"                        
## [10] "Father's qualification"                        
## [11] "Mother's occupation"                           
## [12] "Father's occupation"                           
## [13] "Admission grade"                               
## [14] "Displaced"                                     
## [15] "Educational special needs"                     
## [16] "Debtor"                                        
## [17] "Tuition fees up to date"                       
## [18] "Gender"                                        
## [19] "Scholarship holder"                            
## [20] "Age at enrollment"                             
## [21] "International"                                 
## [22] "Curricular units 1st sem (credited)"           
## [23] "Curricular units 1st sem (enrolled)"           
## [24] "Curricular units 1st sem (evaluations)"        
## [25] "Curricular units 1st sem (approved)"           
## [26] "Curricular units 1st sem (grade)"              
## [27] "Curricular units 1st sem (without evaluations)"
## [28] "Curricular units 2nd sem (credited)"           
## [29] "Curricular units 2nd sem (enrolled)"           
## [30] "Curricular units 2nd sem (evaluations)"        
## [31] "Curricular units 2nd sem (approved)"           
## [32] "Curricular units 2nd sem (grade)"              
## [33] "Curricular units 2nd sem (without evaluations)"
## [34] "Unemployment rate"                             
## [35] "Inflation rate"                                
## [36] "GDP"                                           
## [37] "Target"

Explore Categorical Data Distributions

While the dataset above displays numerous dbl columns, most of these are actually categorical and need to be decoded as shown below.

The categorical columns are as follows:

Application mode; Course; Daytime/evening attendance; Previous qualification; Nacionality; Mother’s qualification; Father’s qualification; Mother’s occupation; Father’s occupation; Displaced; Educational special needs; Debtor; Tuition fees up to date; Gender; Scholarship holder.

Marital Status

## Replace martial status column with the textual values
acad_perf <- acad_perf |>
  mutate(mar_stat_desc = recode(
    `Marital status`,
    `1` = "Single",
    `2` = "Married",
    `3` = "Widower",
    `4` = "Divorced",
    `5` = "De facto union",
    `6` = "Legally separated"
  )) 

## Now access updated data

acad_perf |>  
  ## Count based on the marital status code
  count(mar_stat_desc) |>
  
  ## Specify printing options (avoid truncation of output)
  print(n = Inf, width = Inf)
## # A tibble: 6 × 2
##   mar_stat_desc         n
##   <chr>             <int>
## 1 De facto union       25
## 2 Divorced             91
## 3 Legally separated     6
## 4 Married             379
## 5 Single             3919
## 6 Widower               4

Course

acad_perf <- acad_perf |>
  mutate(course_desc = recode(
    Course,
    `33`   = "Biofuel Production Technologies",
    `171`  = "Animation and Multimedia Design",
    `8014` = "Social Service (evening attendance)",
    `9003` = "Agronomy",
    `9070` = "Communication Design",
    `9085` = "Veterinary Nursing",
    `9119` = "Informatics Engineering",
    `9130` = "Equinculture",
    `9147` = "Management",
    `9238` = "Social Service",
    `9254` = "Tourism",
    `9500` = "Nursing",
    `9556` = "Oral Hygiene",
    `9670` = "Advertising and Marketing Management",
    `9773` = "Journalism and Communication",
    `9853` = "Basic Education",
    `9991` = "Management (evening attendance)"
  ))
  
acad_perf |>
  count(course_desc) |>
  print(n = Inf, width = Inf)
## # A tibble: 17 × 2
##    course_desc                              n
##    <chr>                                <int>
##  1 Advertising and Marketing Management   268
##  2 Agronomy                               210
##  3 Animation and Multimedia Design        215
##  4 Basic Education                        192
##  5 Biofuel Production Technologies         12
##  6 Communication Design                   226
##  7 Equinculture                           141
##  8 Informatics Engineering                170
##  9 Journalism and Communication           331
## 10 Management                             380
## 11 Management (evening attendance)        268
## 12 Nursing                                766
## 13 Oral Hygiene                            86
## 14 Social Service                         355
## 15 Social Service (evening attendance)    215
## 16 Tourism                                252
## 17 Veterinary Nursing                     337

Nationality

acad_perf <- acad_perf |>
  mutate(nationality_desc = recode(
    Nacionality,
    `1`   = "Portuguese",
    `2`   = "German",
    `6`   = "Spanish",
    `11`  = "Italian",
    `13`  = "Dutch",
    `14`  = "English",
    `17`  = "Lithuanian",
    `21`  = "Angolan",
    `22`  = "Cape Verdean",
    `24`  = "Guinean",
    `25`  = "Mozambican",
    `26`  = "Santomean",
    `32`  = "Turkish",
    `41`  = "Brazilian",
    `62`  = "Romanian",
    `100` = "Moldovan",
    `101` = "Mexican",
    `103` = "Ukrainian",
    `105` = "Russian",
    `108` = "Cuban",
    `109` = "Colombian"
  ))

acad_perf |>
  count(nationality_desc) |>
  print(n = Inf, width = Inf)
## # A tibble: 21 × 2
##    nationality_desc     n
##    <chr>            <int>
##  1 Angolan              2
##  2 Brazilian           38
##  3 Cape Verdean        13
##  4 Colombian            1
##  5 Cuban                1
##  6 Dutch                1
##  7 English              1
##  8 German               2
##  9 Guinean              5
## 10 Italian              3
## 11 Lithuanian           1
## 12 Mexican              2
## 13 Moldovan             3
## 14 Mozambican           2
## 15 Portuguese        4314
## 16 Romanian             2
## 17 Russian              2
## 18 Santomean           14
## 19 Spanish             13
## 20 Turkish              1
## 21 Ukrainian            3

Target vs Admissions Grade

## Box plot of Target Outcome vs Admissions grade

acad_perf |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Target, y = `Admission grade`))

Admissions grade seems to have little impact on whether a student re-enrolls; however, there may be some connection between whether a student ultimately graduates.

Explore Continuous Data Distributions

Curricular Units 1st Semester

## Determine Minimum, Median, Mean, and Maximum of Curricular Units for 1st Semester (Enrolled)
acad_perf |>
  select(`Curricular units 1st sem (enrolled)`) |>
  summarize(
    min = min(`Curricular units 1st sem (enrolled)`),
    med = median(`Curricular units 1st sem (enrolled)`),
    mean = mean(`Curricular units 1st sem (enrolled)`),
    max = max(`Curricular units 1st sem (enrolled)`)
    
  ) |>
  print(n = Inf, width = Inf)
## # A tibble: 1 × 4
##     min   med  mean   max
##   <dbl> <dbl> <dbl> <dbl>
## 1     0     6  6.27    26
## Box Plot for Curricular Units for 1st Semester (Enrolled)
acad_perf |>
  ggplot() +
  geom_boxplot(mapping = aes(y = `Curricular units 1st sem (enrolled)`)) +
  labs(title="Interquartile Ranges of Curricular Units for 1st Semester")

Admission Grade

## Box plot for admission grade by course
acad_perf |>
  ggplot() +
  geom_boxplot(mapping = aes(x = course_desc, y = `Admission grade`)) +
  coord_flip() +
  labs(title="Interquartile Ranges of Admission grade by Course")

Questions

Based on preliminary exploration of the data:

  • How does the number of curricular units taken for the 1st semester vary by marital status?

  • Is there a gender split on enrollment based on the course?

  • Are some courses more select on enrollment in terms of admissions grade than others?

03 - Bank Marketing

This bank marketing data set looks at outcomes for a Portuguese banking institution phone call marketing campaign to determine outcomes on whether a bank term deposit was subscribed to. More details on this dataset can be found here.

## Read in File Location
bank_marketing <- read_delim("bank-marketing.csv", delim = ";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Return column names and information from dataset
class(bank_marketing)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
colnames(bank_marketing)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

Explore Categorical Data Distributions

Education and Contact

# Generate a table of all education vs contact types
bank_marketing |>
  count(education, contact) |>
  pivot_wider(
    names_from = contact,
    values_from = n
  )
## # A tibble: 4 × 4
##   education cellular telephone unknown
##   <chr>        <int>     <int>   <int>
## 1 primary       3568       722    2561
## 2 secondary    14688      1365    7149
## 3 tertiary     10039       629    2633
## 4 unknown        990       190     677

Campaign Outcomes

# Generate a table to view the number of contacts for this campaign mapped against outcome
bank_marketing |>
  count(poutcome, y) |>
  pivot_wider(
    names_from = poutcome,
    values_from = n
  )
## # A tibble: 2 × 5
##   y     failure other success unknown
##   <chr>   <int> <int>   <int>   <int>
## 1 no       4283  1533     533   33573
## 2 yes       618   307     978    3386

Explore Continuous Data Distributions

Balance

# Display distribution of average balance variable
bank_marketing |>
  ggplot() +
  geom_boxplot(mapping = aes(y = balance))

We can see that the balance amounts are highly concentrated with significant outliers skewing our capability to truly visualize and understand the distribution. Let’s reformat the data into a histogram to view its distribution within bins of $100. Here we can start to see that the bulk of the data lies between $0 and $2,500.

# Display a bell curve chart to view the distribution of the average balance variable

bank_marketing |>
  ggplot() +
  geom_histogram(aes(x = balance, y = after_stat(density)), bins = 100) +
  coord_cartesian(xlim = c(0, 10000))

Questions

Based on preliminary exploration of the data:

  • Are “wealthier” clients more likely to subscribe to a term deposit?

  • Is there any connection between contact type and a term deposit?

  • Are prior campaign outcomes connected to current campaign success?