Project 2

Introduction

This project focuses on data transformation and clean up of three datasets from various sources. Each dataset required initial analysis in order to prepare it for downstream analysis and to answer main questions:

Customer churn - how many customers have been lost or gained over a year for three company divisions?
College costs and grants - comparing four New York colleges, what are their tuition costs and maximum possible grants?
Religion and salary - Plot reglious affiations with salary earnings.

Customer Churn

This sample dataset contains three company divisions (A, B and C) along with their customers gained and lost over a year long period. Given this information, determine which division has the highest and lowest rate of customer churn. In this case, customer churn is defined as a ratio of customers lost to customers gained. A division with a low churn rate would be promoted as a model of good operations and customer relations from which other divisions can learn from.

Source: https://tinyurl.com/rnb87tbv

Original Dataset

The original dataset is in a wide format with each division’s performance on a performance chart.

customers <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/customer_churn.csv", 
                      n_max=6, 
                      show_col_types=FALSE)

**Monthly Customer Data**
Division	Description	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
A	Gained	70	80	100	110	70	45	50	99	112	99	55	110
A	Lost	0	-90	-30	-45	-95	-33	-110	-34	-34	-88	-65	-45
B	Gained	80	80	90	120	100	119	75	119	90	80	80	90
B	Lost	0	-15	-30	-25	-50	-77	-45	-77	-30	-15	-20	-30
C	Gained	60	85	80	90	120	45	75	45	80	85	60	80
C	Lost	0	-45	-27	-17	-33	-80	-45	-80	-27	-45	-35	-27

Clean up

The operations used to clean up this data includes: pivoting; filtering; mutating; summing. Without these functions, it would be impossible to normalize the data set since the gained and lost data are on different rows. Additionally, summing up the total customers is an aggregate function and requires all numbers to be in a single column.

# Create a longer format with the tidyr pivot_longer function
customers_long_format <- pivot_longer(customers,cols=3:14,names_to="month", values_to="num_customers")

customers_aggregate <- tibble (division = '', churn = 0)

for (division in LETTERS[1:3]) {
    # sum up the customers lost and gained
    customers_lost <- customers_long_format %>% filter(Division==division & Description == "Lost") %>% select(num_customers) %>% sum()
    customers_gained <- customers_long_format %>% filter(Division==division & Description == "Gained") %>% select(num_customers) %>% sum()
    
    # add a row to the aggregate tibble
    customers_aggregate <- add_row(
                                    customers_aggregate,
                                    division = division,
                                    churn = abs(customers_lost/customers_gained)
                                   )
} # division

# Remove the unneeded first row
customers_aggregate <- customers_aggregate[-1,]

Plot of Division Customer Churn

As evident from the below chart, Division B has the lowest churn rate of 37%, followed by Division at 51%.

The detailed data:

## # A tibble: 3 × 2
##   division churn
##   <chr>    <dbl>
## 1 A        0.669
## 2 B        0.369
## 3 C        0.509

New York College Tuitions and Grants

The National Center for Education Statistics maintains an online tool for families to research colleges across many different aspects. Its website allows visitors to pick up to four colleges and generate a CSV that compares them in a long format. For this dataset, we only want to know the different in state tuition costs and the maximum possible grants (e.g. federal, state, institutional) to offset those costs.

Sources:

Original Dataset

The original dataset is in a long format with each school occupying a column.

# Read the CSV from Github, skipping the first 3 lines and up to 57 lines
colleges <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/collegenavigator_compare.csv",
                     col_names = c("attribute", "Hunter CUNY", "Stony Brook","Morrisville CUNY","Syracuse"), 
                     skip=3, 
                     n_max=57,
                     show_col_types=FALSE
                     )

College Navigator Dataset
attribute	Hunter CUNY	Stony Brook	Morrisville CUNY	Syracuse
total_cost	$24,377	$27,743	$27,270	$75,652
total_grant	$9,347	$10,468	$9,154	$25,864
federal_grant	$4,453	$5,253	$4,794	$5,492
pell_grant	$5,460	$5,410	$4,687	$5,024
other_federal_grant	$724	$1,445	$1,557	$2,191
state_local_grant	$5,186	$4,644	$3,730	$3,887
institutional_grant	$3,231	$4,146	$2,222	$24,390
net_price	$4,014	$16,478	$18,198	$47,177

Clean up

The operations used to clean up this data includes: casting to numeric; mutate; melt. These functions allow for rotating the original dataset into a more useful form for plotting and aggregating costs.

# Transpose to wider format
colleges <- as.data.frame(t(colleges))

# Rename the column names from row 1 and then remove redundant row 1
colnames(colleges) <- colleges[1,]
colleges <- colleges[-1,]

# Copy the row names as separate column
colleges <- tibble::rownames_to_column(colleges, "name")

colleges <- as_tibble(colleges)

# view(colleges)

# Convert everything to numeric
for (i in 2:length(colleges)) {
  colleges[[i]] <- as.numeric(colleges[[i]])
}


# Mutate and make new columns
colleges <- colleges %>%
            mutate(
                  total_possible_grants = federal_grant+pell_grant+other_federal_grant+state_local_grant+institutional_grant,
                  possible_price_minus_grants = total_cost - total_possible_grants,
                  difference_net_and_possible_price = net_price - possible_price_minus_grants
                  )

dfm <- melt(colleges[,c("name","total_cost", "total_possible_grants","possible_price_minus_grants")],id.vars = "name")

College Costs and Grants

The college with the highest costs also had the highest possible grants. However, the grants did not provide enough of a discount as compared to other colleges with lower tuition rates. Both Hunter and Stony Brook had lower tuition and relatively large possible grants.

Religion and Salary

This section’s dataset came from a survey about people’s religion and their salary range. One analysis we can do is see which religion had the highest percentage in the two highest income brackets (50k to 99k and 100k and up).

Reference: - https://tinyurl.com/36cxk573 - https://www.pewforum.org/religious-landscape-study/income-distribution

Original Dataset

The dataset is in a wide format with each religion’s percentages spread across each bracket. Although, this format is informative for the reader, we’ll have to use the melt function to chart brackets with each row.

income <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/income_distribution_by_religious_group.csv", 
                   skip=0,
                   show_col_types=FALSE
                  )

College Navigator Dataset
Religious tradition	Less than $30,000	$30,000-$49,999	$50,000-$99,999	$100,000 or more	Sample Size
Buddhist	36%	18%	32%	13%	233
Catholic	36%	19%	26%	19%	6137
Evangelical Protestant	35%	22%	28%	14%	7462
Hindu	17%	13%	34%	36%	172
Historically Black Protestant	53%	22%	17%	8%	1704
Jehovah’s Witness	48%	25%	22%	4%	208
Jewish	16%	15%	24%	44%	708
Mainline Protestant	29%	20%	28%	23%	5208
Mormon	27%	20%	33%	20%	594
Muslim	34%	17%	29%	20%	205
Orthodox Christian	18%	17%	36%	29%	155
Unaffiliated (religious “nones”)	33%	20%	26%	21%	6790

Data Cleanup

We have to shorten some of the affiliation names, as well as use the str_remove function to remove non numeric characters otherwise math functions will not work. We then ‘melt’ each religion with the bracket percentages.

# Shorten the names
income[1] = c(
              "Buddhist",
              "Catholic",
              "Evan. Prot.",
              "Hindu",
              "Hist. Black Prot",
              "JW",
              "Jewish",
              "Main. Prot.","Mormon",
              "Muslim",
              "Orth. Christian",
              "Unaffiliated")

# Remove the percent signs and convert to integer for some columns
for (c in c("less_than_30","30_to_49","50_to_99","100_or_more")) {
  income[[c]] = as.numeric(
                          as.character(
                                      str_replace(income[[c]], "%", "")
                                      )
                          )
}

# make a temporary data frame by melting religion with the salary range columns
dfm <- melt(income[,c("religion","less_than_30", "30_to_49", "50_to_99", "100_or_more")],id.vars = "religion")

Plot of Religions and Salary

From the dataset above, we can see that the Jewish and Orthodox Christian faiths occupy the highest brackets with percentages of 44% and 36% respectively. The plot below futher illustrates the relationship of faith to income ranges.

Project 2

Cliff Lee

10/03/2021

Introduction

Customer Churn

Original Dataset

Clean up

Plot of Division Customer Churn

New York College Tuitions and Grants

Original Dataset

Clean up

College Costs and Grants

Religion and Salary

Original Dataset

Data Cleanup

Plot of Religions and Salary