In this data dive, I am investigating (1) academic performance data and (2) marketing campaign outcome data to better understand their basic data structures, answer simple questions / queries about relationships amongst the data, and help determine which dataset I would like to use for my final project.
The following exploratory analysis is performed for each dataset in their respective areas – and a more concrete, summative findings section that meet the requirements of the course is directly below under the 01 - Output (Look Here) section. Specifically, per guidance from the TA – the focus will be on dataset #2 – bank campaign marketing data.
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.1
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")
To begin we will start by reading in the bank_marketing data set, which is composed of 45,211 observations and 17 columns. This dataset displays campaign marketing outcome on whether the contacted client subscribed for a term deposit (variable y). Variables can be divided into a couple of different categories and are described below:
Demographics: Age, job, marital status, education
Financial Indicators: Credit default, balance, housing loan, personal loan
Contact: Contact type, day of the week, month, duration
Campaign: Number of contacts performed for this campaign, and whether the client subscribed to a term deposit
Prior Campaigns: Number of days past since prior campaign contact, previous number of times contacted before this campaign, and the outcome of the prior campaign
The main goal of the dataset is to be able to predict if the client will subscribe (yes/no) to a term deposit (variable y) as directly stated in the data documentation here. Key questions asked during data exploration include:
Are higher educated individuals more likely to have access to a cellphone?
Are “wealthier” clients more likely to subscribe to a term deposit?
Is there any connection between contact type and a term deposit?
## Read in File Location
bank_marketing <- read_delim("bank-marketing.csv", delim = ";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl (7): age, balance, day, duration, campaign, pdays, previous
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Return column names and information from dataset
class(bank_marketing)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
colnames(bank_marketing)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
Yes, they are. 75% of individual with tertiary education – indicating that they have completed a college degree have a cellular phone versus 63% of those who only have completed up to high school. This falls into line with what expect–e.g., those who have completed more schooling likely have more resources and thus a greater mean to have access to a cellular device than those who do not.
# First display the raw number counts within each category
bank_marketing |>
count(education, contact) |>
pivot_wider(
names_from = contact,
values_from = n
)
## # A tibble: 4 × 4
## education cellular telephone unknown
## <chr> <int> <int> <int>
## 1 primary 3568 722 2561
## 2 secondary 14688 1365 7149
## 3 tertiary 10039 629 2633
## 4 unknown 990 190 677
# Now display these raw counts converted proportionally
bank_marketing |>
count(education, contact) |>
group_by(education) |>
mutate(pct = round(n / sum(n), 2)) |>
select(-n) |> # keeps all column but n for raw counts
pivot_wider(
names_from = contact,
values_from = pct
)
## # A tibble: 4 × 4
## # Groups: education [4]
## education cellular telephone unknown
## <chr> <dbl> <dbl> <dbl>
## 1 primary 0.52 0.11 0.37
## 2 secondary 0.63 0.06 0.31
## 3 tertiary 0.75 0.05 0.2
## 4 unknown 0.53 0.1 0.36
Yes, there is evidence to suggest that “wealthier clients”– those defined by having a higher average yearly balance deposited with the bank – are more likely to subscribe to a term deposit. This is suggested by the box plot below where the median average balance for those who did not subscribe is $417, contrasted with $733 for those who did.
Another way to visualize this potential discrepancy is by using a scatterplot and segmenting out the colors on whether a term deposit was made. I also added age onto the y-axis to complete the scatterplot. I also needed to take a sample of the populations to allow for a proportional comparison in the scatteplot below, which I did by taking 200 data points and plotting. There is clear evidence of this phenomena.
# Create a Boxplot displaying the binary outcome variable mapped across income
bank_marketing |>
ggplot(aes(x = y, y = balance)) +
geom_boxplot() +
stat_summary(
fun = median,
geom = 'text',
aes(label = round(after_stat(y), 0)),
vjust = -0.5
) +
coord_cartesian(ylim = c(0, 5000)) + # adjust y-axis to avoid super outliers
labs(
title = 'Difference in Average Balance by Term Depsoit Subscription',
x = 'Term Deposit Subscription',
y = 'Average Yearly Balance')
# Take a sample from each group equal to each other to compare proprtionality
bank_marketing_sample <- bank_marketing |>
group_by(y) |>
slice_sample(n = 200) |>
ungroup()
# Create a faceted scatter plot displaying average balance by color group (for outcome)
bank_marketing_sample |>
ggplot(aes(x = age, y = balance)) +
geom_point(data = mutate(bank_marketing_sample, y = NULL), color = 'grey85') +
geom_point(color = 'darkred') +
facet_wrap(~ y) +
coord_cartesian(ylim = c(0, 5000)) +
theme_classic() +
labs(
title = "Balance vs Age by Term Deposit Subscription Outcome",
x = "Age",
y = "Average Yearly Balance"
)
There is a 2-percent positive differential between whether the client is contacted / has a cellular phone as opposed to a telephone. This outcome is less impressive than I would have expected given that those contacted cellular would likely have higher education levels and higher average balance levels which I figured would translate based on outcomes defined above into a greater difference. This is a finding that we can definitely explore deeper.
#
bank_marketing |>
count(contact, y) |>
group_by(contact) |>
mutate(pct = round(n / sum(n), 2)) |>
select(-n) |>
pivot_wider(
names_from = y,
values_from = pct
)
## # A tibble: 3 × 3
## # Groups: contact [3]
## contact no yes
## <chr> <dbl> <dbl>
## 1 cellular 0.85 0.15
## 2 telephone 0.87 0.13
## 3 unknown 0.96 0.04
This academic performance dataset comes from the Univertisty of California Irvine to understand and predict academic performance – e.g., re-enrollment, graduation, or dropping out based on a set of variables. More information and data documentation can be found here
This following section is dedicated towards exploratory analysis and playing around with the data, please see above for the summary meeting all requirements of the Week 2 data dive.
## Read in File Location
acad_perf <- read_delim("academic-performance-data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Return column names from dataset
class(acad_perf)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
colnames(acad_perf)
## [1] "Marital status"
## [2] "Application mode"
## [3] "Application order"
## [4] "Course"
## [5] "Daytime/evening attendance\t"
## [6] "Previous qualification"
## [7] "Previous qualification (grade)"
## [8] "Nacionality"
## [9] "Mother's qualification"
## [10] "Father's qualification"
## [11] "Mother's occupation"
## [12] "Father's occupation"
## [13] "Admission grade"
## [14] "Displaced"
## [15] "Educational special needs"
## [16] "Debtor"
## [17] "Tuition fees up to date"
## [18] "Gender"
## [19] "Scholarship holder"
## [20] "Age at enrollment"
## [21] "International"
## [22] "Curricular units 1st sem (credited)"
## [23] "Curricular units 1st sem (enrolled)"
## [24] "Curricular units 1st sem (evaluations)"
## [25] "Curricular units 1st sem (approved)"
## [26] "Curricular units 1st sem (grade)"
## [27] "Curricular units 1st sem (without evaluations)"
## [28] "Curricular units 2nd sem (credited)"
## [29] "Curricular units 2nd sem (enrolled)"
## [30] "Curricular units 2nd sem (evaluations)"
## [31] "Curricular units 2nd sem (approved)"
## [32] "Curricular units 2nd sem (grade)"
## [33] "Curricular units 2nd sem (without evaluations)"
## [34] "Unemployment rate"
## [35] "Inflation rate"
## [36] "GDP"
## [37] "Target"
While the dataset above displays numerous dbl columns, most of these are actually categorical and need to be decoded as shown below.
The categorical columns are as follows:
Application mode; Course; Daytime/evening attendance; Previous qualification; Nacionality; Mother’s qualification; Father’s qualification; Mother’s occupation; Father’s occupation; Displaced; Educational special needs; Debtor; Tuition fees up to date; Gender; Scholarship holder.
## Replace martial status column with the textual values
acad_perf <- acad_perf |>
mutate(mar_stat_desc = recode(
`Marital status`,
`1` = "Single",
`2` = "Married",
`3` = "Widower",
`4` = "Divorced",
`5` = "De facto union",
`6` = "Legally separated"
))
## Now access updated data
acad_perf |>
## Count based on the marital status code
count(mar_stat_desc) |>
## Specify printing options (avoid truncation of output)
print(n = Inf, width = Inf)
## # A tibble: 6 × 2
## mar_stat_desc n
## <chr> <int>
## 1 De facto union 25
## 2 Divorced 91
## 3 Legally separated 6
## 4 Married 379
## 5 Single 3919
## 6 Widower 4
acad_perf <- acad_perf |>
mutate(course_desc = recode(
Course,
`33` = "Biofuel Production Technologies",
`171` = "Animation and Multimedia Design",
`8014` = "Social Service (evening attendance)",
`9003` = "Agronomy",
`9070` = "Communication Design",
`9085` = "Veterinary Nursing",
`9119` = "Informatics Engineering",
`9130` = "Equinculture",
`9147` = "Management",
`9238` = "Social Service",
`9254` = "Tourism",
`9500` = "Nursing",
`9556` = "Oral Hygiene",
`9670` = "Advertising and Marketing Management",
`9773` = "Journalism and Communication",
`9853` = "Basic Education",
`9991` = "Management (evening attendance)"
))
acad_perf |>
count(course_desc) |>
print(n = Inf, width = Inf)
## # A tibble: 17 × 2
## course_desc n
## <chr> <int>
## 1 Advertising and Marketing Management 268
## 2 Agronomy 210
## 3 Animation and Multimedia Design 215
## 4 Basic Education 192
## 5 Biofuel Production Technologies 12
## 6 Communication Design 226
## 7 Equinculture 141
## 8 Informatics Engineering 170
## 9 Journalism and Communication 331
## 10 Management 380
## 11 Management (evening attendance) 268
## 12 Nursing 766
## 13 Oral Hygiene 86
## 14 Social Service 355
## 15 Social Service (evening attendance) 215
## 16 Tourism 252
## 17 Veterinary Nursing 337
Nationality
acad_perf <- acad_perf |>
mutate(nationality_desc = recode(
Nacionality,
`1` = "Portuguese",
`2` = "German",
`6` = "Spanish",
`11` = "Italian",
`13` = "Dutch",
`14` = "English",
`17` = "Lithuanian",
`21` = "Angolan",
`22` = "Cape Verdean",
`24` = "Guinean",
`25` = "Mozambican",
`26` = "Santomean",
`32` = "Turkish",
`41` = "Brazilian",
`62` = "Romanian",
`100` = "Moldovan",
`101` = "Mexican",
`103` = "Ukrainian",
`105` = "Russian",
`108` = "Cuban",
`109` = "Colombian"
))
acad_perf |>
count(nationality_desc) |>
print(n = Inf, width = Inf)
## # A tibble: 21 × 2
## nationality_desc n
## <chr> <int>
## 1 Angolan 2
## 2 Brazilian 38
## 3 Cape Verdean 13
## 4 Colombian 1
## 5 Cuban 1
## 6 Dutch 1
## 7 English 1
## 8 German 2
## 9 Guinean 5
## 10 Italian 3
## 11 Lithuanian 1
## 12 Mexican 2
## 13 Moldovan 3
## 14 Mozambican 2
## 15 Portuguese 4314
## 16 Romanian 2
## 17 Russian 2
## 18 Santomean 14
## 19 Spanish 13
## 20 Turkish 1
## 21 Ukrainian 3
## Box plot of Target Outcome vs Admissions grade
acad_perf |>
ggplot() +
geom_boxplot(mapping = aes(x = Target, y = `Admission grade`))
Admissions grade seems to have little impact on whether a student re-enrolls; however, there may be some connection between whether a student ultimately graduates.
## Determine Minimum, Median, Mean, and Maximum of Curricular Units for 1st Semester (Enrolled)
acad_perf |>
select(`Curricular units 1st sem (enrolled)`) |>
summarize(
min = min(`Curricular units 1st sem (enrolled)`),
med = median(`Curricular units 1st sem (enrolled)`),
mean = mean(`Curricular units 1st sem (enrolled)`),
max = max(`Curricular units 1st sem (enrolled)`)
) |>
print(n = Inf, width = Inf)
## # A tibble: 1 × 4
## min med mean max
## <dbl> <dbl> <dbl> <dbl>
## 1 0 6 6.27 26
## Box Plot for Curricular Units for 1st Semester (Enrolled)
acad_perf |>
ggplot() +
geom_boxplot(mapping = aes(y = `Curricular units 1st sem (enrolled)`)) +
labs(title="Interquartile Ranges of Curricular Units for 1st Semester")
## Box plot for admission grade by course
acad_perf |>
ggplot() +
geom_boxplot(mapping = aes(x = course_desc, y = `Admission grade`)) +
coord_flip() +
labs(title="Interquartile Ranges of Admission grade by Course")
Based on preliminary exploration of the data:
How does the number of curricular units taken for the 1st semester vary by marital status?
Is there a gender split on enrollment based on the course?
Are some courses more select on enrollment in terms of admissions grade than others?
This bank marketing data set looks at outcomes for a Portuguese banking institution phone call marketing campaign to determine outcomes on whether a bank term deposit was subscribed to. More details on this dataset can be found here.
## Read in File Location
bank_marketing <- read_delim("bank-marketing.csv", delim = ";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl (7): age, balance, day, duration, campaign, pdays, previous
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Return column names and information from dataset
class(bank_marketing)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
colnames(bank_marketing)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
# Generate a table of all education vs contact types
bank_marketing |>
count(education, contact) |>
pivot_wider(
names_from = contact,
values_from = n
)
## # A tibble: 4 × 4
## education cellular telephone unknown
## <chr> <int> <int> <int>
## 1 primary 3568 722 2561
## 2 secondary 14688 1365 7149
## 3 tertiary 10039 629 2633
## 4 unknown 990 190 677
# Generate a table to view the number of contacts for this campaign mapped against outcome
bank_marketing |>
count(poutcome, y) |>
pivot_wider(
names_from = poutcome,
values_from = n
)
## # A tibble: 2 × 5
## y failure other success unknown
## <chr> <int> <int> <int> <int>
## 1 no 4283 1533 533 33573
## 2 yes 618 307 978 3386
# Display distribution of average balance variable
bank_marketing |>
ggplot() +
geom_boxplot(mapping = aes(y = balance))
We can see that the balance amounts are highly concentrated with significant outliers skewing our capability to truly visualize and understand the distribution. Let’s reformat the data into a histogram to view its distribution within bins of $100. Here we can start to see that the bulk of the data lies between $0 and $2,500.
# Display a bell curve chart to view the distribution of the average balance variable
bank_marketing |>
ggplot() +
geom_histogram(aes(x = balance, y = after_stat(density)), bins = 100) +
coord_cartesian(xlim = c(0, 10000))
Based on preliminary exploration of the data:
Are “wealthier” clients more likely to subscribe to a term deposit?
Is there any connection between contact type and a term deposit?
Are prior campaign outcomes connected to current campaign success?