The goal of this assignment is to provide practice in preparing different datasets for downstream analysis work. The three datasets were chosen from the Discussion 5 discussion board.
This dataset was provided by me, Eric Lehmphul. It is a toy dataset that I created which holds student name, student’s testing scores, time studied for test, and gender. Analysis: Does study time impact testing scores?
The dataset can be found here: https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/DATA607_discussion5.csv
Load packages:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Read .csv file from github repository:
test.data <-read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/DATA607_discussion5.csv")
Separate the Test Scores and Time Studied for each Test, since they were stored in the same column:
test.data <- test.data %>%
separate(Test1..TimeStudiedTest1, c("Test1", "Test1_ST"), sep = ",")
test.data
## Student Test1 Test1_ST Test2..TimeStudiedTest2 Test3..TimeStudiedTest3
## 1 Bob 95 45 88, 40 92, 50
## 2 John 85 35 60, 8 75, 10
## 3 Sam 78 15 75, 16 80, 17
## 4 Jenna 92 60 94, 65 84, 60
## 5 Sara 97 40 98, 50 95, 45
## 6 Jacob 50 5 40, 2 <NA>
## 7 Melinda <NA> <NA> 90, 47 92, 55
## 8 Billy 78 15 80, 25 81, 36
## 9 Kayla 100 40 100,40 100, 45
## 10 Nick 90 35 94, 32 94, 30
## 11 Nicolete 75 20 80, 20 85, 23
## Test4..TimeStudiedTest4 Gender
## 1 100, 70 Male
## 2 87, 25 Male
## 3 76, 10 Female
## 4 100, 75 Female
## 5 92, 60 Female
## 6 <NA> Male
## 7 89, 60 Female
## 8 86, 30 Male
## 9 100, 50 Female
## 10 90, 30 Male
## 11 79, 10 Female
test.data <- test.data %>%
separate(Test2..TimeStudiedTest2, c("Test2", "Test2_ST"), sep = ",")
test.data
## Student Test1 Test1_ST Test2 Test2_ST Test3..TimeStudiedTest3
## 1 Bob 95 45 88 40 92, 50
## 2 John 85 35 60 8 75, 10
## 3 Sam 78 15 75 16 80, 17
## 4 Jenna 92 60 94 65 84, 60
## 5 Sara 97 40 98 50 95, 45
## 6 Jacob 50 5 40 2 <NA>
## 7 Melinda <NA> <NA> 90 47 92, 55
## 8 Billy 78 15 80 25 81, 36
## 9 Kayla 100 40 100 40 100, 45
## 10 Nick 90 35 94 32 94, 30
## 11 Nicolete 75 20 80 20 85, 23
## Test4..TimeStudiedTest4 Gender
## 1 100, 70 Male
## 2 87, 25 Male
## 3 76, 10 Female
## 4 100, 75 Female
## 5 92, 60 Female
## 6 <NA> Male
## 7 89, 60 Female
## 8 86, 30 Male
## 9 100, 50 Female
## 10 90, 30 Male
## 11 79, 10 Female
test.data <- test.data %>%
separate(Test3..TimeStudiedTest3, c("Test3", "Test3_ST"), sep = ",")
test.data
## Student Test1 Test1_ST Test2 Test2_ST Test3 Test3_ST
## 1 Bob 95 45 88 40 92 50
## 2 John 85 35 60 8 75 10
## 3 Sam 78 15 75 16 80 17
## 4 Jenna 92 60 94 65 84 60
## 5 Sara 97 40 98 50 95 45
## 6 Jacob 50 5 40 2 <NA> <NA>
## 7 Melinda <NA> <NA> 90 47 92 55
## 8 Billy 78 15 80 25 81 36
## 9 Kayla 100 40 100 40 100 45
## 10 Nick 90 35 94 32 94 30
## 11 Nicolete 75 20 80 20 85 23
## Test4..TimeStudiedTest4 Gender
## 1 100, 70 Male
## 2 87, 25 Male
## 3 76, 10 Female
## 4 100, 75 Female
## 5 92, 60 Female
## 6 <NA> Male
## 7 89, 60 Female
## 8 86, 30 Male
## 9 100, 50 Female
## 10 90, 30 Male
## 11 79, 10 Female
test.data <- test.data %>%
separate(Test4..TimeStudiedTest4, c("Test4", "Test4_ST"), sep = ",")
test.data
## Student Test1 Test1_ST Test2 Test2_ST Test3 Test3_ST Test4 Test4_ST Gender
## 1 Bob 95 45 88 40 92 50 100 70 Male
## 2 John 85 35 60 8 75 10 87 25 Male
## 3 Sam 78 15 75 16 80 17 76 10 Female
## 4 Jenna 92 60 94 65 84 60 100 75 Female
## 5 Sara 97 40 98 50 95 45 92 60 Female
## 6 Jacob 50 5 40 2 <NA> <NA> <NA> <NA> Male
## 7 Melinda <NA> <NA> 90 47 92 55 89 60 Female
## 8 Billy 78 15 80 25 81 36 86 30 Male
## 9 Kayla 100 40 100 40 100 45 100 50 Female
## 10 Nick 90 35 94 32 94 30 90 30 Male
## 11 Nicolete 75 20 80 20 85 23 79 10 Female
Create two data frames. One data frame holds information about Time Studied for Test and the other data frame hold the Test score data:
test.time.data<-test.data[,c(1,3,5,7,9)]
test.data<-test.data[,-c(3,5,7,9)]
Pivot both data frames into a long format. The data frames are arranged by student name:
test.data <- test.data %>%
pivot_longer(Test1:Test4, names_to = "Assessment", values_to = "Score") %>%
arrange(Student)
test.time.data <- test.time.data %>%
pivot_longer(Test1_ST:Test4_ST, values_to = "StudyTime") %>%
arrange(Student)
Add the Study Time data to the test.data data frame. Remove all na values from the data more easily perform analysis:
test.data <- test.data %>%
add_column(StudyTime_min = test.time.data$StudyTime)
test.data <- test.data %>%
na.omit
Adjust data types to reflect the proper information:
test.data$Student <- as.factor(test.data$Student)
test.data$Gender <- as.factor(test.data$Gender)
test.data$Assessment <- as.factor(test.data$Assessment)
test.data$Score <- as.numeric(test.data$Score)
test.data$StudyTime_min <- as.integer(test.data$StudyTime_min)
Cleaned dataset:
test.data
## # A tibble: 41 x 5
## Student Gender Assessment Score StudyTime_min
## <fct> <fct> <fct> <dbl> <int>
## 1 Billy Male Test1 78 15
## 2 Billy Male Test2 80 25
## 3 Billy Male Test3 81 36
## 4 Billy Male Test4 86 30
## 5 Bob Male Test1 95 45
## 6 Bob Male Test2 88 40
## 7 Bob Male Test3 92 50
## 8 Bob Male Test4 100 70
## 9 Jacob Male Test1 50 5
## 10 Jacob Male Test2 40 2
## # ... with 31 more rows
From this plot it is clear to see that study time positively impacts student testing scores. This plot also indicates that females, on average, scored better than the males.
ggplot(test.data,aes(x=Score, y=StudyTime_min, color = Gender)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Mauricio Claudio posted an untidy dataset related to customer churn. The data is presented in a wide format where a long format is more suited for analysis. I analyzed the customer churn rate for each Division in the company.
The dataset can be found here: https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/customer_churn_data.csv
Read .csv file from github repository:
churn.data <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/customer_churn_data.csv")
Make sure all rows are accounted for in the Division column. Eliminate the extra row (row 7) because it is empty:
churn.data$Division[1:2] <- "A"
churn.data$Division[3:4] <- "B"
churn.data$Division[5:6] <- "C"
churn.data$Division[8:9] <- "Total"
churn.data <- churn.data[-7,]
Transform data into a long format:
churn.data <- churn.data %>%
gather(key = "Month",
value = "Churn",
c(-Division, -Description))
churn.data
## Division Description Month Churn
## 1 A Gained Jan 70
## 2 A Lost Jan 0
## 3 B Gained Jan 80
## 4 B Lost Jan 0
## 5 C Gained Jan 60
## 6 C Lost Jan 0
## 7 Total Gained Jan 210
## 8 Total Lost Jan 0
## 9 A Gained Feb 80
## 10 A Lost Feb -90
## 11 B Gained Feb 80
## 12 B Lost Feb -15
## 13 C Gained Feb 85
## 14 C Lost Feb -45
## 15 Total Gained Feb 245
## 16 Total Lost Feb -150
## 17 A Gained Mar 100
## 18 A Lost Mar -30
## 19 B Gained Mar 90
## 20 B Lost Mar -30
## 21 C Gained Mar 80
## 22 C Lost Mar -27
## 23 Total Gained Mar 270
## 24 Total Lost Mar -87
## 25 A Gained Apr 110
## 26 A Lost Apr -45
## 27 B Gained Apr 120
## 28 B Lost Apr -25
## 29 C Gained Apr 90
## 30 C Lost Apr -17
## 31 Total Gained Apr 320
## 32 Total Lost Apr -87
## 33 A Gained May 70
## 34 A Lost May -95
## 35 B Gained May 100
## 36 B Lost May -50
## 37 C Gained May 120
## 38 C Lost May -33
## 39 Total Gained May 290
## 40 Total Lost May -178
## 41 A Gained Jun 45
## 42 A Lost Jun -33
## 43 B Gained Jun 119
## 44 B Lost Jun -77
## 45 C Gained Jun 45
## 46 C Lost Jun -80
## 47 Total Gained Jun 209
## 48 Total Lost Jun -190
## 49 A Gained Jul 50
## 50 A Lost Jul -110
## 51 B Gained Jul 75
## 52 B Lost Jul -45
## 53 C Gained Jul 75
## 54 C Lost Jul -45
## 55 Total Gained Jul 200
## 56 Total Lost Jul -200
## 57 A Gained Aug 99
## 58 A Lost Aug -34
## 59 B Gained Aug 119
## 60 B Lost Aug -77
## 61 C Gained Aug 45
## 62 C Lost Aug -80
## 63 Total Gained Aug 263
## 64 Total Lost Aug -191
## 65 A Gained Sep 112
## 66 A Lost Sep -34
## 67 B Gained Sep 90
## 68 B Lost Sep -30
## 69 C Gained Sep 80
## 70 C Lost Sep -27
## 71 Total Gained Sep 282
## 72 Total Lost Sep -91
## 73 A Gained Oct 99
## 74 A Lost Oct -88
## 75 B Gained Oct 80
## 76 B Lost Oct -15
## 77 C Gained Oct 85
## 78 C Lost Oct -45
## 79 Total Gained Oct 264
## 80 Total Lost Oct -148
## 81 A Gained Nov 55
## 82 A Lost Nov -65
## 83 B Gained Nov 80
## 84 B Lost Nov -20
## 85 C Gained Nov 60
## 86 C Lost Nov -35
## 87 Total Gained Nov 195
## 88 Total Lost Nov -120
## 89 A Gained Dec 110
## 90 A Lost Dec -45
## 91 B Gained Dec 90
## 92 B Lost Dec -30
## 93 C Gained Dec 80
## 94 C Lost Dec -27
## 95 Total Gained Dec 280
## 96 Total Lost Dec -102
Create two filter dataframes to allow for easy calculation of churn rate downstream:
Gained <- churn.data %>%
filter(Description == "Gained")
Lost <- churn.data %>%
filter(Description == "Lost")
Rename columns:
Gained <- rename(Gained, "Churn_Gained" = Churn)
Lost <- rename(Lost, "Churn_Lost" = Churn)
Create churn.calc.data to hold the necessary information to calculate customer churn downstream. Remove irrelevant columns:
churn.calc.data <- Gained %>%
add_column(Churn_Lost = Lost$Churn_Lost)
churn.calc.data <- churn.calc.data[,-2]
Created subsets of the data filtered by Division. These subsets will be used to calculate the customer churn:
A <- churn.calc.data %>%
filter(Division == "A")
B <- churn.calc.data %>%
filter(Division == "B")
C <- churn.calc.data %>%
filter(Division == "C")
Total <- churn.calc.data %>%
filter(Division == "Total")
Calculated the Monthly Customer Churn:
A <- A %>%
mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)
B <- B %>%
mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)
C <- C %>%
mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)
Total <- Total %>%
mutate(churn_rate = abs(Churn_Lost)/Churn_Gained)
Combined the subsets into one data to graph:
analysis.data <- rbind(A,B,C,Total)
Division B appears to have the lowest customer churn throughout the entire period (monthly and by year). Division A appears to have the most customer churn and the most volatility month to month. Division C is has slightly less compared to Division A, but is noticeably more susceptible to a higher churn rate compared than Division B. The Total churn rate is similar to Division C for the entire period.
analysis.data$Month = factor(analysis.data$Month, levels = month.abb)
# Visualization
ggplot(analysis.data, aes(x = Month, y = churn_rate, group = Division)) +
geom_line(aes(y = churn_rate, color = Division, linetype = Division)) +
scale_color_manual(values = c("darkred", "steelblue", "green", "black")) +
ggtitle("Churn Rate by Month")
print("Entire Period Churn Rate:")
## [1] "Entire Period Churn Rate:"
cat("Division A: ", sum(abs(A$Churn_Lost))/sum(A$Churn_Gained))
## Division A: 0.669
cat("Division B: ", sum(abs(B$Churn_Lost))/sum(B$Churn_Gained))
## Division B: 0.3686554
cat("Division C: ", sum(abs(C$Churn_Lost))/sum(C$Churn_Gained))
## Division C: 0.5093923
cat("Division Total: ", sum(abs(Total$Churn_Lost))/sum(Total$Churn_Gained))
## Division Total: 0.5099075
Hazal Gunduz posted a dataset related to runners participating in a training regimen. I analyzed the percent change of the runners’ times and compared the results to see which runner improved the most with their training regimen.
The dataset can be found here: https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/runner.data.csv
Read .csv file from github repository:
run.data <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Project_2/Data/runner.data.csv")
Convert the data to a long format, rather than a wide format. Data analysis will be easier in a long format:
run.data <- run.data %>%
gather(key = "Month",
value = "Time",
c(-Athlete))
Organize and order the data by Athlete:
run.data <- run.data %>%
arrange(Athlete)
Create a new data frame for each Athlete to make it easier to correctly calculate the percent change from each month:
Debi <- run.data %>%
filter(Athlete == "Debi")
Joana <- run.data %>%
filter(Athlete == "Joana")
Kerol <- run.data %>%
filter(Athlete == "Kerol")
Sukhveer <- run.data %>%
filter(Athlete == "Sukhveer")
Calculate the percent change for each Athlete:
Debi <- Debi %>%
mutate(pct_change = (Time/lead(Time) - 1) * 100)
Joana <- Joana %>%
mutate(pct_change = (Time/lead(Time) - 1) * 100)
Kerol <- Kerol %>%
mutate(pct_change = (Time/lead(Time) - 1) * 100)
Sukhveer <- Sukhveer %>%
mutate(pct_change = (Time/lead(Time) - 1) * 100)
Create a new data frame that contains all of the filtered Athletes:
run.analysis <- rbind(Debi, Joana, Kerol, Sukhveer)
Remove NA values produced from the percent change calculation, as the graph needs continuous data to sucessfully run:
run.analysis <- na.omit(run.analysis)
Update the run.analysis data frame factor names for month to better represent the percent change calculation:
run.analysis$Month[run.analysis$Month == "Month.0"] <- "Month 0 to Month 1"
run.analysis$Month[run.analysis$Month == "Month.1"] <- "Month 1 to Month 2"
run.analysis$Month[run.analysis$Month == "Month.2"] <- "Month 2 to Month 3"
Based on the Percent Change plot below,
Joana had the best increase in her running time in the first month, but progress greatly decreased after Month 2.
Both Debi and Sukhveer, though slower in month 1 compared to month 0, managed to adapt to the new regimen and improve each month from month 1 to month 3.
Kerol, like Debi and Sukhveer, was slower in month 1 than month 0. Kerol showed tremendous improvement in month 1 to month 2, but progress receded after month 2.
It can be concluded that Kerol had the largest improvement, but Debi had the most consistent month to month improvement. This indicates that Debi’s running time could improve even more with more time (months) participating in the training regimen.
ggplot(run.analysis, aes(x = Month, y = pct_change, group = Athlete)) +
geom_line(aes(y = pct_change, color = Athlete, linetype = Athlete)) +
scale_color_manual(values = c("darkred", "steelblue", "purple", "black")) +
ggtitle("Athlete Times: Percent Change")
cat("Debi's overall improvement was: ", max(Debi$Time) - min(Debi$Time))
## Debi's overall improvement was: 0.6
cat("Joana's overall improvement was: ", max(Joana$Time) - min(Joana$Time))
## Joana's overall improvement was: 0.52
cat("Kerol's overall improvement was: ", max(Kerol$Time) - min(Kerol$Time))
## Kerol's overall improvement was: 0.7
cat("Sukhveer's overall improvement was: ", max(Sukhveer$Time) - min(Sukhveer$Time))
## Sukhveer's overall improvement was: 0.3