Introduction

The goal of this assignment is to provide practice in preparing different datasets for downstream analysis work. The three datasets were chosen from the Discussion 5 discussion board.

Student Testing

This dataset was provided by me, Eric Lehmphul. It is a toy dataset that I created which holds student name, student’s testing scores, time studied for test, and gender. Analysis: Does study time impact testing scores?

The dataset can be found here: https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/DATA607_discussion5.csv

Data Preprocessing

Load packages:

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Read .csv file from github repository:

test.data <-read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/DATA607_discussion5.csv")

Separate the Test Scores and Time Studied for each Test, since they were stored in the same column:

test.data <- test.data %>% 
  separate(Test1..TimeStudiedTest1, c("Test1", "Test1_ST"), sep = ",")
test.data
##     Student Test1 Test1_ST Test2..TimeStudiedTest2 Test3..TimeStudiedTest3
## 1       Bob    95       45                 88, 40                   92, 50
## 2      John    85       35                   60, 8                  75, 10
## 3       Sam    78       15                  75, 16                  80, 17
## 4     Jenna    92       60                  94, 65                  84, 60
## 5      Sara    97       40                  98, 50                  95, 45
## 6     Jacob    50        5                   40, 2                    <NA>
## 7   Melinda  <NA>     <NA>                  90, 47                  92, 55
## 8     Billy    78       15                  80, 25                  81, 36
## 9     Kayla   100       40                  100,40                 100, 45
## 10     Nick    90       35                  94, 32                  94, 30
## 11 Nicolete    75       20                  80, 20                  85, 23
##    Test4..TimeStudiedTest4 Gender
## 1                  100, 70   Male
## 2                   87, 25   Male
## 3                   76, 10 Female
## 4                  100, 75 Female
## 5                   92, 60 Female
## 6                     <NA>   Male
## 7                   89, 60 Female
## 8                   86, 30   Male
## 9                  100, 50 Female
## 10                  90, 30   Male
## 11                  79, 10 Female
test.data <- test.data %>% 
  separate(Test2..TimeStudiedTest2, c("Test2", "Test2_ST"), sep = ",")
test.data
##     Student Test1 Test1_ST Test2 Test2_ST Test3..TimeStudiedTest3
## 1       Bob    95       45    88      40                   92, 50
## 2      John    85       35    60        8                  75, 10
## 3       Sam    78       15    75       16                  80, 17
## 4     Jenna    92       60    94       65                  84, 60
## 5      Sara    97       40    98       50                  95, 45
## 6     Jacob    50        5    40        2                    <NA>
## 7   Melinda  <NA>     <NA>    90       47                  92, 55
## 8     Billy    78       15    80       25                  81, 36
## 9     Kayla   100       40   100       40                 100, 45
## 10     Nick    90       35    94       32                  94, 30
## 11 Nicolete    75       20    80       20                  85, 23
##    Test4..TimeStudiedTest4 Gender
## 1                  100, 70   Male
## 2                   87, 25   Male
## 3                   76, 10 Female
## 4                  100, 75 Female
## 5                   92, 60 Female
## 6                     <NA>   Male
## 7                   89, 60 Female
## 8                   86, 30   Male
## 9                  100, 50 Female
## 10                  90, 30   Male
## 11                  79, 10 Female
test.data <- test.data %>% 
  separate(Test3..TimeStudiedTest3, c("Test3", "Test3_ST"), sep = ",")
test.data
##     Student Test1 Test1_ST Test2 Test2_ST Test3 Test3_ST
## 1       Bob    95       45    88      40     92       50
## 2      John    85       35    60        8    75       10
## 3       Sam    78       15    75       16    80       17
## 4     Jenna    92       60    94       65    84       60
## 5      Sara    97       40    98       50    95       45
## 6     Jacob    50        5    40        2  <NA>     <NA>
## 7   Melinda  <NA>     <NA>    90       47    92       55
## 8     Billy    78       15    80       25    81       36
## 9     Kayla   100       40   100       40   100       45
## 10     Nick    90       35    94       32    94       30
## 11 Nicolete    75       20    80       20    85       23
##    Test4..TimeStudiedTest4 Gender
## 1                  100, 70   Male
## 2                   87, 25   Male
## 3                   76, 10 Female
## 4                  100, 75 Female
## 5                   92, 60 Female
## 6                     <NA>   Male
## 7                   89, 60 Female
## 8                   86, 30   Male
## 9                  100, 50 Female
## 10                  90, 30   Male
## 11                  79, 10 Female
test.data <- test.data %>% 
  separate(Test4..TimeStudiedTest4, c("Test4", "Test4_ST"), sep = ",")
test.data
##     Student Test1 Test1_ST Test2 Test2_ST Test3 Test3_ST Test4 Test4_ST Gender
## 1       Bob    95       45    88      40     92       50   100       70   Male
## 2      John    85       35    60        8    75       10    87       25   Male
## 3       Sam    78       15    75       16    80       17    76       10 Female
## 4     Jenna    92       60    94       65    84       60   100       75 Female
## 5      Sara    97       40    98       50    95       45    92       60 Female
## 6     Jacob    50        5    40        2  <NA>     <NA>  <NA>     <NA>   Male
## 7   Melinda  <NA>     <NA>    90       47    92       55    89       60 Female
## 8     Billy    78       15    80       25    81       36    86       30   Male
## 9     Kayla   100       40   100       40   100       45   100       50 Female
## 10     Nick    90       35    94       32    94       30    90       30   Male
## 11 Nicolete    75       20    80       20    85       23    79       10 Female

Create two data frames. One data frame holds information about Time Studied for Test and the other data frame hold the Test score data:

test.time.data<-test.data[,c(1,3,5,7,9)]
test.data<-test.data[,-c(3,5,7,9)]

Pivot both data frames into a long format. The data frames are arranged by student name:

test.data <- test.data %>% 
  pivot_longer(Test1:Test4, names_to = "Assessment", values_to = "Score") %>% 
  arrange(Student)

test.time.data <- test.time.data %>% 
  pivot_longer(Test1_ST:Test4_ST, values_to = "StudyTime") %>% 
  arrange(Student)

Add the Study Time data to the test.data data frame. Remove all na values from the data more easily perform analysis:

test.data <- test.data %>%
  add_column(StudyTime_min = test.time.data$StudyTime)

test.data <- test.data %>%
  na.omit

Adjust data types to reflect the proper information:

test.data$Student <- as.factor(test.data$Student)
test.data$Gender <- as.factor(test.data$Gender)
test.data$Assessment <- as.factor(test.data$Assessment)
test.data$Score <- as.numeric(test.data$Score)
test.data$StudyTime_min <- as.integer(test.data$StudyTime_min)

Cleaned dataset:

test.data
## # A tibble: 41 x 5
##    Student Gender Assessment Score StudyTime_min
##    <fct>   <fct>  <fct>      <dbl>         <int>
##  1 Billy   Male   Test1         78            15
##  2 Billy   Male   Test2         80            25
##  3 Billy   Male   Test3         81            36
##  4 Billy   Male   Test4         86            30
##  5 Bob     Male   Test1         95            45
##  6 Bob     Male   Test2         88            40
##  7 Bob     Male   Test3         92            50
##  8 Bob     Male   Test4        100            70
##  9 Jacob   Male   Test1         50             5
## 10 Jacob   Male   Test2         40             2
## # ... with 31 more rows

Analysis: Student Testing

From this plot it is clear to see that study time positively impacts student testing scores. This plot also indicates that females, on average, scored better than the males.

ggplot(test.data,aes(x=Score, y=StudyTime_min, color = Gender)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Customer Churn

Mauricio Claudio posted an untidy dataset related to customer churn. The data is presented in a wide format where a long format is more suited for analysis. I analyzed the customer churn rate for each Division in the company.

The dataset can be found here: https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/customer_churn_data.csv

Data Preprocessing

Read .csv file from github repository:

churn.data <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/customer_churn_data.csv")

Make sure all rows are accounted for in the Division column. Eliminate the extra row (row 7) because it is empty:

churn.data$Division[1:2] <- "A"
churn.data$Division[3:4] <- "B"
churn.data$Division[5:6] <- "C"
churn.data$Division[8:9] <- "Total"

churn.data <- churn.data[-7,]

Transform data into a long format:

churn.data <- churn.data %>%
  gather(key = "Month",
         value = "Churn",
         c(-Division, -Description))
churn.data
##    Division Description Month Churn
## 1         A      Gained   Jan    70
## 2         A        Lost   Jan     0
## 3         B      Gained   Jan    80
## 4         B        Lost   Jan     0
## 5         C      Gained   Jan    60
## 6         C        Lost   Jan     0
## 7     Total      Gained   Jan   210
## 8     Total        Lost   Jan     0
## 9         A      Gained   Feb    80
## 10        A        Lost   Feb   -90
## 11        B      Gained   Feb    80
## 12        B        Lost   Feb   -15
## 13        C      Gained   Feb    85
## 14        C        Lost   Feb   -45
## 15    Total      Gained   Feb   245
## 16    Total        Lost   Feb  -150
## 17        A      Gained   Mar   100
## 18        A        Lost   Mar   -30
## 19        B      Gained   Mar    90
## 20        B        Lost   Mar   -30
## 21        C      Gained   Mar    80
## 22        C        Lost   Mar   -27
## 23    Total      Gained   Mar   270
## 24    Total        Lost   Mar   -87
## 25        A      Gained   Apr   110
## 26        A        Lost   Apr   -45
## 27        B      Gained   Apr   120
## 28        B        Lost   Apr   -25
## 29        C      Gained   Apr    90
## 30        C        Lost   Apr   -17
## 31    Total      Gained   Apr   320
## 32    Total        Lost   Apr   -87
## 33        A      Gained   May    70
## 34        A        Lost   May   -95
## 35        B      Gained   May   100
## 36        B        Lost   May   -50
## 37        C      Gained   May   120
## 38        C        Lost   May   -33
## 39    Total      Gained   May   290
## 40    Total        Lost   May  -178
## 41        A      Gained   Jun    45
## 42        A        Lost   Jun   -33
## 43        B      Gained   Jun   119
## 44        B        Lost   Jun   -77
## 45        C      Gained   Jun    45
## 46        C        Lost   Jun   -80
## 47    Total      Gained   Jun   209
## 48    Total        Lost   Jun  -190
## 49        A      Gained   Jul    50
## 50        A        Lost   Jul  -110
## 51        B      Gained   Jul    75
## 52        B        Lost   Jul   -45
## 53        C      Gained   Jul    75
## 54        C        Lost   Jul   -45
## 55    Total      Gained   Jul   200
## 56    Total        Lost   Jul  -200
## 57        A      Gained   Aug    99
## 58        A        Lost   Aug   -34
## 59        B      Gained   Aug   119
## 60        B        Lost   Aug   -77
## 61        C      Gained   Aug    45
## 62        C        Lost   Aug   -80
## 63    Total      Gained   Aug   263
## 64    Total        Lost   Aug  -191
## 65        A      Gained   Sep   112
## 66        A        Lost   Sep   -34
## 67        B      Gained   Sep    90
## 68        B        Lost   Sep   -30
## 69        C      Gained   Sep    80
## 70        C        Lost   Sep   -27
## 71    Total      Gained   Sep   282
## 72    Total        Lost   Sep   -91
## 73        A      Gained   Oct    99
## 74        A        Lost   Oct   -88
## 75        B      Gained   Oct    80
## 76        B        Lost   Oct   -15
## 77        C      Gained   Oct    85
## 78        C        Lost   Oct   -45
## 79    Total      Gained   Oct   264
## 80    Total        Lost   Oct  -148
## 81        A      Gained   Nov    55
## 82        A        Lost   Nov   -65
## 83        B      Gained   Nov    80
## 84        B        Lost   Nov   -20
## 85        C      Gained   Nov    60
## 86        C        Lost   Nov   -35
## 87    Total      Gained   Nov   195
## 88    Total        Lost   Nov  -120
## 89        A      Gained   Dec   110
## 90        A        Lost   Dec   -45
## 91        B      Gained   Dec    90
## 92        B        Lost   Dec   -30
## 93        C      Gained   Dec    80
## 94        C        Lost   Dec   -27
## 95    Total      Gained   Dec   280
## 96    Total        Lost   Dec  -102

Create two filter dataframes to allow for easy calculation of churn rate downstream:

Gained <- churn.data %>%
  filter(Description == "Gained")

Lost <- churn.data %>%
  filter(Description == "Lost")

Rename columns:

Gained <- rename(Gained, "Churn_Gained" = Churn)
Lost <- rename(Lost, "Churn_Lost" = Churn)

Create churn.calc.data to hold the necessary information to calculate customer churn downstream. Remove irrelevant columns:

churn.calc.data <- Gained %>%
  add_column(Churn_Lost = Lost$Churn_Lost)

churn.calc.data <- churn.calc.data[,-2]

Created subsets of the data filtered by Division. These subsets will be used to calculate the customer churn:

A <- churn.calc.data %>%
  filter(Division == "A")

B <- churn.calc.data %>%
  filter(Division == "B")

C <- churn.calc.data %>%
  filter(Division == "C")

Total <- churn.calc.data %>%
  filter(Division == "Total")

Analysis: Customer Churn

Calculated the Monthly Customer Churn:

A <- A %>%
  mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)

B <- B %>%
  mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)

C <- C %>%
  mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)

Total <- Total %>%
  mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)

Combined the subsets into one data to graph:

analysis.data <- rbind(A,B,C,Total)

Division B appears to have the lowest customer churn throughout the entire period (monthly and by year). Division A appears to have the most customer churn and the most volatility month to month. Division C is has slightly less compared to Division A, but is noticeably more susceptible to a higher churn rate compared than Division B. The Total churn rate is similar to Division C for the entire period.

analysis.data$Month = factor(analysis.data$Month, levels = month.abb)
# Visualization
ggplot(analysis.data, aes(x = Month, y = churn_rate, group = Division)) + 
  geom_line(aes(y = churn_rate, color = Division, linetype = Division)) + 
  scale_color_manual(values = c("darkred", "steelblue", "green", "black")) +
  ggtitle("Churn Rate by Month")

print("Entire Period Churn Rate:")
## [1] "Entire Period Churn Rate:"
cat("Division A: ", sum(abs(A$Churn_Lost))/sum(A$Churn_Gained))
## Division A:  0.669
cat("Division B: ", sum(abs(B$Churn_Lost))/sum(B$Churn_Gained))
## Division B:  0.3686554
cat("Division C: ", sum(abs(C$Churn_Lost))/sum(C$Churn_Gained))
## Division C:  0.5093923
cat("Division Total: ", sum(abs(Total$Churn_Lost))/sum(Total$Churn_Gained))
## Division Total:  0.5099075

Running Training Regimen

Hazal Gunduz posted a dataset related to runners participating in a training regimen. I analyzed the percent change of the runners’ times and compared the results to see which runner improved the most with their training regimen.

The dataset can be found here: https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/runner.data.csv

Data Preprocessing

Read .csv file from github repository:

run.data <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/runner.data.csv")

Convert the data to a long format, rather than a wide format. Data analysis will be easier in a long format:

run.data <- run.data %>%
  gather(key = "Month",
         value = "Time",
         c(-Athlete))

Organize and order the data by Athlete:

run.data <- run.data %>%
  arrange(Athlete)

Create a new data frame for each Athlete to make it easier to correctly calculate the percent change from each month:

Debi <- run.data %>%
  filter(Athlete == "Debi")

Joana <- run.data %>%
  filter(Athlete == "Joana")

Kerol <- run.data %>%
  filter(Athlete == "Kerol")

Sukhveer <- run.data %>%
  filter(Athlete == "Sukhveer")

Calculate the percent change for each Athlete:

Debi <- Debi %>%
  mutate(pct_change = (Time/lead(Time) - 1) * 100)

Joana <- Joana %>%
  mutate(pct_change = (Time/lead(Time) - 1) * 100)

Kerol <- Kerol %>%
  mutate(pct_change = (Time/lead(Time) - 1) * 100)

Sukhveer <- Sukhveer %>%
  mutate(pct_change = (Time/lead(Time) - 1) * 100)

Create a new data frame that contains all of the filtered Athletes:

run.analysis <- rbind(Debi, Joana, Kerol, Sukhveer)

Remove NA values produced from the percent change calculation, as the graph needs continuous data to sucessfully run:

run.analysis <- na.omit(run.analysis)

Update the run.analysis data frame factor names for month to better represent the percent change calculation:

run.analysis$Month[run.analysis$Month == "Month.0"] <- "Month 0 to Month 1"
run.analysis$Month[run.analysis$Month == "Month.1"] <- "Month 1 to Month 2"
run.analysis$Month[run.analysis$Month == "Month.2"] <- "Month 2 to Month 3"

Analysis: Runner Improvements

Based on the Percent Change plot below,

  • Joana had the best increase in her running time in the first month, but progress greatly decreased after Month 2.

  • Both Debi and Sukhveer, though slower in month 1 compared to month 0, managed to adapt to the new regimen and improve each month from month 1 to month 3.

  • Kerol, like Debi and Sukhveer, was slower in month 1 than month 0. Kerol showed tremendous improvement in month 1 to month 2, but progress receded after month 2.

It can be concluded that Kerol had the largest improvement, but Debi had the most consistent month to month improvement. This indicates that Debi’s running time could improve even more with more time (months) participating in the training regimen.

ggplot(run.analysis, aes(x = Month, y = pct_change, group = Athlete)) + 
  geom_line(aes(y = pct_change, color = Athlete, linetype = Athlete)) + 
  scale_color_manual(values = c("darkred", "steelblue", "purple", "black")) +
  ggtitle("Athlete Times: Percent Change")

cat("Debi's overall improvement was: ", max(Debi$Time) - min(Debi$Time))
## Debi's overall improvement was:  0.6
cat("Joana's overall improvement was: ", max(Joana$Time) - min(Joana$Time))
## Joana's overall improvement was:  0.52
cat("Kerol's overall improvement was: ", max(Kerol$Time) - min(Kerol$Time))
## Kerol's overall improvement was:  0.7
cat("Sukhveer's overall improvement was: ", max(Sukhveer$Time) - min(Sukhveer$Time))
## Sukhveer's overall improvement was:  0.3