Introduction

This project focuses on data transformation and clean up of three datasets from various sources. Each dataset required initial analysis in order to prepare it for downstream analysis and to answer main questions:

Customer Churn

This sample dataset contains three company divisions (A, B and C) along with their customers gained and lost over a year long period. Given this information, determine which division has the highest and lowest rate of customer churn. In this case, customer churn is defined as a ratio of customers lost to customers gained. A division with a low churn rate would be promoted as a model of good operations and customer relations from which other divisions can learn from.

Source: https://tinyurl.com/rnb87tbv

Original Dataset

The original dataset is in a wide format with each division’s performance on a performance chart.

customers <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/customer_churn.csv", 
                      n_max=6, 
                      show_col_types=FALSE)




Monthly Customer Data
Division Description Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
A Gained 70 80 100 110 70 45 50 99 112 99 55 110
A Lost 0 -90 -30 -45 -95 -33 -110 -34 -34 -88 -65 -45
B Gained 80 80 90 120 100 119 75 119 90 80 80 90
B Lost 0 -15 -30 -25 -50 -77 -45 -77 -30 -15 -20 -30
C Gained 60 85 80 90 120 45 75 45 80 85 60 80
C Lost 0 -45 -27 -17 -33 -80 -45 -80 -27 -45 -35 -27
Clean up

The operations used to clean up this data includes: pivoting; filtering; mutating; summing. Without these functions, it would be impossible to normalize the data set since the gained and lost data are on different rows. Additionally, summing up the total customers is an aggregate function and requires all numbers to be in a single column.

# Create a longer format with the tidyr pivot_longer function
customers_long_format <- pivot_longer(customers,cols=3:14,names_to="month", values_to="num_customers")

customers_aggregate <- tibble (division = '', churn = 0)

for (division in LETTERS[1:3]) {
    # sum up the customers lost and gained
    customers_lost <- customers_long_format %>% filter(Division==division & Description == "Lost") %>% select(num_customers) %>% sum()
    customers_gained <- customers_long_format %>% filter(Division==division & Description == "Gained") %>% select(num_customers) %>% sum()
    
    # add a row to the aggregate tibble
    customers_aggregate <- add_row(
                                    customers_aggregate,
                                    division = division,
                                    churn = abs(customers_lost/customers_gained)
                                   )
} # division

# Remove the unneeded first row
customers_aggregate <- customers_aggregate[-1,]
Plot of Division Customer Churn

As evident from the below chart, Division B has the lowest churn rate of 37%, followed by Division at 51%.

The detailed data:

## # A tibble: 3 × 2
##   division churn
##   <chr>    <dbl>
## 1 A        0.669
## 2 B        0.369
## 3 C        0.509

New York College Tuitions and Grants

The National Center for Education Statistics maintains an online tool for families to research colleges across many different aspects. Its website allows visitors to pick up to four colleges and generate a CSV that compares them in a long format. For this dataset, we only want to know the different in state tuition costs and the maximum possible grants (e.g. federal, state, institutional) to offset those costs.

Sources:

Original Dataset

The original dataset is in a long format with each school occupying a column.

# Read the CSV from Github, skipping the first 3 lines and up to 57 lines
colleges <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/collegenavigator_compare.csv",
                     col_names = c("attribute", "Hunter CUNY", "Stony Brook","Morrisville CUNY","Syracuse"), 
                     skip=3, 
                     n_max=57,
                     show_col_types=FALSE
                     )
College Navigator Dataset
attribute Hunter CUNY Stony Brook Morrisville CUNY Syracuse
total_cost $24,377 $27,743 $27,270 $75,652
total_grant $9,347 $10,468 $9,154 $25,864
federal_grant $4,453 $5,253 $4,794 $5,492
pell_grant $5,460 $5,410 $4,687 $5,024
other_federal_grant $724 $1,445 $1,557 $2,191
state_local_grant $5,186 $4,644 $3,730 $3,887
institutional_grant $3,231 $4,146 $2,222 $24,390
net_price $4,014 $16,478 $18,198 $47,177
Clean up

The operations used to clean up this data includes: casting to numeric; mutate; melt. These functions allow for rotating the original dataset into a more useful form for plotting and aggregating costs.

# Transpose to wider format
colleges <- as.data.frame(t(colleges))

# Rename the column names from row 1 and then remove redundant row 1
colnames(colleges) <- colleges[1,]
colleges <- colleges[-1,]

# Copy the row names as separate column
colleges <- tibble::rownames_to_column(colleges, "name")

colleges <- as_tibble(colleges)

# view(colleges)

# Convert everything to numeric
for (i in 2:length(colleges)) {
  colleges[[i]] <- as.numeric(colleges[[i]])
}


# Mutate and make new columns
colleges <- colleges %>%
            mutate(
                  total_possible_grants = federal_grant+pell_grant+other_federal_grant+state_local_grant+institutional_grant,
                  possible_price_minus_grants = total_cost - total_possible_grants,
                  difference_net_and_possible_price = net_price - possible_price_minus_grants
                  )

dfm <- melt(colleges[,c("name","total_cost", "total_possible_grants","possible_price_minus_grants")],id.vars = "name")
College Costs and Grants

The college with the highest costs also had the highest possible grants. However, the grants did not provide enough of a discount as compared to other colleges with lower tuition rates. Both Hunter and Stony Brook had lower tuition and relatively large possible grants.

Religion and Salary

This section’s dataset came from a survey about people’s religion and their salary range. One analysis we can do is see which religion had the highest percentage in the two highest income brackets (50k to 99k and 100k and up).

Reference: - https://tinyurl.com/36cxk573 - https://www.pewforum.org/religious-landscape-study/income-distribution

Original Dataset

The dataset is in a wide format with each religion’s percentages spread across each bracket. Although, this format is informative for the reader, we’ll have to use the melt function to chart brackets with each row.

income <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/income_distribution_by_religious_group.csv", 
                   skip=0,
                   show_col_types=FALSE
                  )
College Navigator Dataset
Religious tradition Less than $30,000 $30,000-$49,999 $50,000-$99,999 $100,000 or more Sample Size
Buddhist 36% 18% 32% 13% 233
Catholic 36% 19% 26% 19% 6137
Evangelical Protestant 35% 22% 28% 14% 7462
Hindu 17% 13% 34% 36% 172
Historically Black Protestant 53% 22% 17% 8% 1704
Jehovah’s Witness 48% 25% 22% 4% 208
Jewish 16% 15% 24% 44% 708
Mainline Protestant 29% 20% 28% 23% 5208
Mormon 27% 20% 33% 20% 594
Muslim 34% 17% 29% 20% 205
Orthodox Christian 18% 17% 36% 29% 155
Unaffiliated (religious “nones”) 33% 20% 26% 21% 6790
Data Cleanup

We have to shorten some of the affiliation names, as well as use the str_remove function to remove non numeric characters otherwise math functions will not work. We then ‘melt’ each religion with the bracket percentages.

# Shorten the names
income[1] = c(
              "Buddhist",
              "Catholic",
              "Evan. Prot.",
              "Hindu",
              "Hist. Black Prot",
              "JW",
              "Jewish",
              "Main. Prot.","Mormon",
              "Muslim",
              "Orth. Christian",
              "Unaffiliated")

# Remove the percent signs and convert to integer for some columns
for (c in c("less_than_30","30_to_49","50_to_99","100_or_more")) {
  income[[c]] = as.numeric(
                          as.character(
                                      str_replace(income[[c]], "%", "")
                                      )
                          )
}

# make a temporary data frame by melting religion with the salary range columns
dfm <- melt(income[,c("religion","less_than_30", "30_to_49", "50_to_99", "100_or_more")],id.vars = "religion")
Plot of Religions and Salary

From the dataset above, we can see that the Jewish and Orthodox Christian faiths occupy the highest brackets with percentages of 44% and 36% respectively. The plot below futher illustrates the relationship of faith to income ranges.