This project focuses on data transformation and clean up of three datasets from various sources. Each dataset required initial analysis in order to prepare it for downstream analysis and to answer main questions:
This sample dataset contains three company divisions (A, B and C) along with their customers gained and lost over a year long period. Given this information, determine which division has the highest and lowest rate of customer churn. In this case, customer churn is defined as a ratio of customers lost to customers gained. A division with a low churn rate would be promoted as a model of good operations and customer relations from which other divisions can learn from.
Source: https://tinyurl.com/rnb87tbv
The original dataset is in a wide format with each division’s performance on a performance chart.
customers <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/customer_churn.csv",
n_max=6,
show_col_types=FALSE)
| Division | Description | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Gained | 70 | 80 | 100 | 110 | 70 | 45 | 50 | 99 | 112 | 99 | 55 | 110 |
| A | Lost | 0 | -90 | -30 | -45 | -95 | -33 | -110 | -34 | -34 | -88 | -65 | -45 |
| B | Gained | 80 | 80 | 90 | 120 | 100 | 119 | 75 | 119 | 90 | 80 | 80 | 90 |
| B | Lost | 0 | -15 | -30 | -25 | -50 | -77 | -45 | -77 | -30 | -15 | -20 | -30 |
| C | Gained | 60 | 85 | 80 | 90 | 120 | 45 | 75 | 45 | 80 | 85 | 60 | 80 |
| C | Lost | 0 | -45 | -27 | -17 | -33 | -80 | -45 | -80 | -27 | -45 | -35 | -27 |
The operations used to clean up this data includes: pivoting; filtering; mutating; summing. Without these functions, it would be impossible to normalize the data set since the gained and lost data are on different rows. Additionally, summing up the total customers is an aggregate function and requires all numbers to be in a single column.
# Create a longer format with the tidyr pivot_longer function
customers_long_format <- pivot_longer(customers,cols=3:14,names_to="month", values_to="num_customers")
customers_aggregate <- tibble (division = '', churn = 0)
for (division in LETTERS[1:3]) {
# sum up the customers lost and gained
customers_lost <- customers_long_format %>% filter(Division==division & Description == "Lost") %>% select(num_customers) %>% sum()
customers_gained <- customers_long_format %>% filter(Division==division & Description == "Gained") %>% select(num_customers) %>% sum()
# add a row to the aggregate tibble
customers_aggregate <- add_row(
customers_aggregate,
division = division,
churn = abs(customers_lost/customers_gained)
)
} # division
# Remove the unneeded first row
customers_aggregate <- customers_aggregate[-1,]
As evident from the below chart, Division B has the lowest churn rate of 37%, followed by Division at 51%.
The detailed data:
## # A tibble: 3 × 2
## division churn
## <chr> <dbl>
## 1 A 0.669
## 2 B 0.369
## 3 C 0.509
The National Center for Education Statistics maintains an online tool for families to research colleges across many different aspects. Its website allows visitors to pick up to four colleges and generate a CSV that compares them in a long format. For this dataset, we only want to know the different in state tuition costs and the maximum possible grants (e.g. federal, state, institutional) to offset those costs.
Sources:
The original dataset is in a long format with each school occupying a column.
# Read the CSV from Github, skipping the first 3 lines and up to 57 lines
colleges <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/collegenavigator_compare.csv",
col_names = c("attribute", "Hunter CUNY", "Stony Brook","Morrisville CUNY","Syracuse"),
skip=3,
n_max=57,
show_col_types=FALSE
)
| attribute | Hunter CUNY | Stony Brook | Morrisville CUNY | Syracuse |
|---|---|---|---|---|
| total_cost | $24,377 | $27,743 | $27,270 | $75,652 |
| total_grant | $9,347 | $10,468 | $9,154 | $25,864 |
| federal_grant | $4,453 | $5,253 | $4,794 | $5,492 |
| pell_grant | $5,460 | $5,410 | $4,687 | $5,024 |
| other_federal_grant | $724 | $1,445 | $1,557 | $2,191 |
| state_local_grant | $5,186 | $4,644 | $3,730 | $3,887 |
| institutional_grant | $3,231 | $4,146 | $2,222 | $24,390 |
| net_price | $4,014 | $16,478 | $18,198 | $47,177 |
The operations used to clean up this data includes: casting to numeric; mutate; melt. These functions allow for rotating the original dataset into a more useful form for plotting and aggregating costs.
# Transpose to wider format
colleges <- as.data.frame(t(colleges))
# Rename the column names from row 1 and then remove redundant row 1
colnames(colleges) <- colleges[1,]
colleges <- colleges[-1,]
# Copy the row names as separate column
colleges <- tibble::rownames_to_column(colleges, "name")
colleges <- as_tibble(colleges)
# view(colleges)
# Convert everything to numeric
for (i in 2:length(colleges)) {
colleges[[i]] <- as.numeric(colleges[[i]])
}
# Mutate and make new columns
colleges <- colleges %>%
mutate(
total_possible_grants = federal_grant+pell_grant+other_federal_grant+state_local_grant+institutional_grant,
possible_price_minus_grants = total_cost - total_possible_grants,
difference_net_and_possible_price = net_price - possible_price_minus_grants
)
dfm <- melt(colleges[,c("name","total_cost", "total_possible_grants","possible_price_minus_grants")],id.vars = "name")
The college with the highest costs also had the highest possible grants. However, the grants did not provide enough of a discount as compared to other colleges with lower tuition rates. Both Hunter and Stony Brook had lower tuition and relatively large possible grants.
This section’s dataset came from a survey about people’s religion and their salary range. One analysis we can do is see which religion had the highest percentage in the two highest income brackets (50k to 99k and 100k and up).
Reference: - https://tinyurl.com/36cxk573 - https://www.pewforum.org/religious-landscape-study/income-distribution
The dataset is in a wide format with each religion’s percentages spread across each bracket. Although, this format is informative for the reader, we’ll have to use the melt function to chart brackets with each row.
income <- read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/project2/income_distribution_by_religious_group.csv",
skip=0,
show_col_types=FALSE
)
| Religious tradition | Less than $30,000 | $30,000-$49,999 | $50,000-$99,999 | $100,000 or more | Sample Size |
|---|---|---|---|---|---|
| Buddhist | 36% | 18% | 32% | 13% | 233 |
| Catholic | 36% | 19% | 26% | 19% | 6137 |
| Evangelical Protestant | 35% | 22% | 28% | 14% | 7462 |
| Hindu | 17% | 13% | 34% | 36% | 172 |
| Historically Black Protestant | 53% | 22% | 17% | 8% | 1704 |
| Jehovah’s Witness | 48% | 25% | 22% | 4% | 208 |
| Jewish | 16% | 15% | 24% | 44% | 708 |
| Mainline Protestant | 29% | 20% | 28% | 23% | 5208 |
| Mormon | 27% | 20% | 33% | 20% | 594 |
| Muslim | 34% | 17% | 29% | 20% | 205 |
| Orthodox Christian | 18% | 17% | 36% | 29% | 155 |
| Unaffiliated (religious “nones”) | 33% | 20% | 26% | 21% | 6790 |
We have to shorten some of the affiliation names, as well as use the str_remove function to remove non numeric characters otherwise math functions will not work. We then ‘melt’ each religion with the bracket percentages.
# Shorten the names
income[1] = c(
"Buddhist",
"Catholic",
"Evan. Prot.",
"Hindu",
"Hist. Black Prot",
"JW",
"Jewish",
"Main. Prot.","Mormon",
"Muslim",
"Orth. Christian",
"Unaffiliated")
# Remove the percent signs and convert to integer for some columns
for (c in c("less_than_30","30_to_49","50_to_99","100_or_more")) {
income[[c]] = as.numeric(
as.character(
str_replace(income[[c]], "%", "")
)
)
}
# make a temporary data frame by melting religion with the salary range columns
dfm <- melt(income[,c("religion","less_than_30", "30_to_49", "50_to_99", "100_or_more")],id.vars = "religion")
From the dataset above, we can see that the Jewish and Orthodox Christian faiths occupy the highest brackets with percentages of 44% and 36% respectively. The plot below futher illustrates the relationship of faith to income ranges.