In Part III of the Project 2, the assignment presents the effort to tidy the CSV file suggested by colleague Justin Hsi based on the Medium article at https://medium.com/analytics-vidhya/untidy-data-a90b6e3ebe4c. The initial CSV required multiple steps to perform a proper tidy of the data along with the creation of two separate tables. The analysis of the input data attempts to identify the highest and lowest performing students in addition to identifying the student with the most improvement.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(stringr)
library(stringi)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load the csv file from the repo's URL
student_results_url <- getURL("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")
student_results_raw <-data.frame(read.csv(text=student_results_url, header=T))
student_results_raw
## id name phone sex.and.age test.number term.1 term.2 term.3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
## 7 2 Linda 270 f_13 test 2 87 82 94
## 8 3 Sam 210 m_11 test 2 80 87 80
## 9 4 Esther 617 f_12 test 2 70 75 78
## 10 5 Mary 114 f_14 test 2 68 70 63
First step is to split the column ‘sex.and.age’ into separate columns, as two values in one column does not meet the definition of tidy.
# Split column for sex and age
stu_res <- student_results_raw %>% separate(sex.and.age, c("sex","age"), sep = "_")
stu_res
## id name phone sex age test.number term.1 term.2 term.3
## 1 1 Mike 134 m 12 test 1 76 84 87
## 2 2 Linda 270 f 13 test 1 88 90 73
## 3 3 Sam 210 m 11 test 1 78 74 80
## 4 4 Esther 617 f 12 test 1 68 75 74
## 5 5 Mary 114 f 14 test 1 65 67 64
## 6 1 Mike 134 m 12 test 2 85 80 90
## 7 2 Linda 270 f 13 test 2 87 82 94
## 8 3 Sam 210 m 11 test 2 80 87 80
## 9 4 Esther 617 f 12 test 2 70 75 78
## 10 5 Mary 114 f 14 test 2 68 70 63
Next step is to melt the columns ‘term1’, ‘term2’, and ‘term3’ into a single column as those three columns represent one variable ‘term’, and thus should be a single column according to the definition of tidy data.
# Melt the 3 columns for term1, term2, term3
stu_res_long <- stu_res %>%
pivot_longer(c(`term.1`, `term.2`, `term.3`), names_to = "term", values_to = "result", values_drop_na = TRUE)
# Remove 'term' from the value, so the number of the term remains
stu_res_long$term <- stri_extract_first_regex(stu_res_long[["term"]], "[0-9]+")
stu_res_long$test.number <- str_replace_all(stu_res_long$test.number, ' ', '.')
stu_res_long
## # A tibble: 30 x 8
## id name phone sex age test.number term result
## <int> <fct> <int> <chr> <chr> <chr> <chr> <int>
## 1 1 Mike 134 m 12 test.1 1 76
## 2 1 Mike 134 m 12 test.1 2 84
## 3 1 Mike 134 m 12 test.1 3 87
## 4 2 Linda 270 f 13 test.1 1 88
## 5 2 Linda 270 f 13 test.1 2 90
## 6 2 Linda 270 f 13 test.1 3 73
## 7 3 Sam 210 m 11 test.1 1 78
## 8 3 Sam 210 m 11 test.1 2 74
## 9 3 Sam 210 m 11 test.1 3 80
## 10 4 Esther 617 f 12 test.1 1 68
## # … with 20 more rows
With the understanding of one row should be one observation, the ‘test 1’ and ‘test 2’ values represent variables for the single oberservation of one student’s results for one term. A pivot wider is performed to make each test result a single variable, and thus a single column each.
# Pivot wider to make the Test1 and Test2 variables
stu_res_wide <- stu_res_long %>%
pivot_wider(names_from = test.number, values_from = result)
stu_res_wide
## # A tibble: 15 x 8
## id name phone sex age term test.1 test.2
## <int> <fct> <int> <chr> <chr> <chr> <int> <int>
## 1 1 Mike 134 m 12 1 76 85
## 2 1 Mike 134 m 12 2 84 80
## 3 1 Mike 134 m 12 3 87 90
## 4 2 Linda 270 f 13 1 88 87
## 5 2 Linda 270 f 13 2 90 82
## 6 2 Linda 270 f 13 3 73 94
## 7 3 Sam 210 m 11 1 78 80
## 8 3 Sam 210 m 11 2 74 87
## 9 3 Sam 210 m 11 3 80 80
## 10 4 Esther 617 f 12 1 68 70
## 11 4 Esther 617 f 12 2 75 75
## 12 4 Esther 617 f 12 3 74 78
## 13 5 Mary 114 f 14 1 65 68
## 14 5 Mary 114 f 14 2 67 70
## 15 5 Mary 114 f 14 3 64 63
The form of the table now contains redundant information across two observations. One observation is the student and identifying features of the student.
# Create dataframe to represent the students (1 observation)
students_raw <- stu_res_wide %>% select(id:age)
# Dedupe the replicated rows of students
students_raw <- unique( students_raw[ , 1:5 ] )
# Final dataframe of the student observations
students_raw
## # A tibble: 5 x 5
## id name phone sex age
## <int> <fct> <int> <chr> <chr>
## 1 1 Mike 134 m 12
## 2 2 Linda 270 f 13
## 3 3 Sam 210 m 11
## 4 4 Esther 617 f 12
## 5 5 Mary 114 f 14
The second observation is the student’s result for one term.
# Create dataframe to represent the students' results per term (1 observation)
results_raw <- stu_res_wide %>% select(-(name:age))
# Final dataframe of the students' results observations
results_raw
## # A tibble: 15 x 4
## id term test.1 test.2
## <int> <chr> <int> <int>
## 1 1 1 76 85
## 2 1 2 84 80
## 3 1 3 87 90
## 4 2 1 88 87
## 5 2 2 90 82
## 6 2 3 73 94
## 7 3 1 78 80
## 8 3 2 74 87
## 9 3 3 80 80
## 10 4 1 68 70
## 11 4 2 75 75
## 12 4 3 74 78
## 13 5 1 65 68
## 14 5 2 67 70
## 15 5 3 64 63
With the data in a tidy form, the visualization step presents the performance of the students for each term in the form of a bar plot. The bar plots for Term 1 and Term 2 are presented side-by-side to allow for clear visualization of the relative performance between students for each term along with performance from Term 1 to Term 2.
# Plot
p1 <- ggplot(data=results_raw, aes(x=id, y=test.1, fill=term)) +
geom_bar(stat="identity", position=position_dodge()) +
theme(legend.position="bottom") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(subtitle="Student Vs Test 1 Result by Term",
y="Test 1 Result",
x="Student ID",
title="Bar Plot")
p2 <- ggplot(data=results_raw, aes(x=id, y=test.2, fill=term)) +
geom_bar(stat="identity", position=position_dodge()) +
theme(legend.position="bottom") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(subtitle="Student Vs Test 2 Result by Term",
y="Test 2 Result",
x="Student ID",
title="Bar Plot")
grid.arrange(p1, p2, ncol=2)
To analyze the data, an additional column is calculated to capture the test result average by term for each student. The term average for each student is presented to help visually identify the higher and lower performing students along with a clear method to see which students show positive or negative change from Term 1 to Term 3. The goal of the analysis is to identify the highest and lowest performing student in addition to the student with the highest improvement from start to finish.
# Adding column for average by term
results_with_avg <- results_raw %>% mutate(
term.avg = (test.1 + test.2) / 2
)
results_with_avg
## # A tibble: 15 x 5
## id term test.1 test.2 term.avg
## <int> <chr> <int> <int> <dbl>
## 1 1 1 76 85 80.5
## 2 1 2 84 80 82
## 3 1 3 87 90 88.5
## 4 2 1 88 87 87.5
## 5 2 2 90 82 86
## 6 2 3 73 94 83.5
## 7 3 1 78 80 79
## 8 3 2 74 87 80.5
## 9 3 3 80 80 80
## 10 4 1 68 70 69
## 11 4 2 75 75 75
## 12 4 3 74 78 76
## 13 5 1 65 68 66.5
## 14 5 2 67 70 68.5
## 15 5 3 64 63 63.5
p3 <- ggplot(data=results_with_avg, aes(x=id, y=term.avg, fill=term)) +
geom_bar(stat="identity", position=position_dodge()) +
theme(legend.position="bottom")
p3 + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(subtitle="Student Vs Term Average by Term",
y="Term Average",
x="Student ID",
title="Bar Plot")
In an exercise to determine the highest performing term by the students, the terms are grouped and an average calculated for each term. The box plot below presents the overall students’ performance by term. The dots are added on the box plot to mark each of the student averages for the given term. As only five students participated in this study, the red dots line up with the elements of each box: median, whiskers, and upper and lower quartile.
by_term_avg <- group_by(results_with_avg, term)
summarise(by_term_avg, overall_term_avg = mean(term.avg, na.rm = TRUE))
## # A tibble: 3 x 2
## term overall_term_avg
## <chr> <dbl>
## 1 1 76.5
## 2 2 78.4
## 3 3 78.3
# Basic box plot by Term
p4 <- ggplot(results_with_avg, aes(x=term, y=term.avg)) +
geom_boxplot() +
geom_dotplot(binaxis='y',
stackdir='center',
dotsize = .5,
fill="red") +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)
p4 + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(subtitle="Term Vs Term Average",
y="Term Average",
x="Term",
title="Box + Dot Plot")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Back to the students’ performance, the box plot below captures each student’s performance by term average. This plot clearly shows which student had the highest average across the three terms (Student 2) along with which student had the lowest average across the three terms (Student 5). The diamond for each student marks the overall average of the student’s results. The red dots identify the term average for each term by student. The highest term average was earned by Student 1. This diagram captures a great snapsnot of the students’ performance by term to give a robust view of the underlying data.
by_student_avg <- group_by(results_with_avg, id)
summarise(by_student_avg, overall_avg = mean(term.avg, na.rm = TRUE))
## # A tibble: 5 x 2
## id overall_avg
## <int> <dbl>
## 1 1 83.7
## 2 2 85.7
## 3 3 79.8
## 4 4 73.3
## 5 5 66.2
# Basic box plot with dot plot by Student
p5 <- ggplot(results_with_avg, aes(x=id, y=term.avg, group=id)) +
geom_boxplot() +
geom_dotplot(binaxis='y',
stackdir='center',
dotsize = .5,
fill="red") +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)
p5 + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(subtitle="Student Vs Term Average",
y="Term Average",
x="Student ID",
title="Box + Dot Plot")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
In an effort to determine the student with the most improvement, the tidy table is pivoted wide to include each student’s term average as a variable. This un-tidy modification is made to then calculate the delta for each student between the Term 1 average and the Term 3 average. The resulting delta for each student denotes which student showed the most improvement. The bar plot below identifies Student 1 as having the greatest positive change in term averages from Term 1 to Term 3. Interestingly, even though Student 2 earned the highest overall average across the three terms, Student 2 shows lack of improvement from Term 1 to Term 3 in regards to test results.
# Which student improved the most from term 1 to term 3
results_term_avg_only <- results_with_avg %>% select(-(test.1:test.2))
results_term_avg_only$term <- paste("term.", results_term_avg_only$term, sep = "")
results_term_avg_only <- results_term_avg_only %>%
pivot_wider(names_from = term, values_from = term.avg)
results_term_delta <- results_term_avg_only %>% mutate(
term.delta = (term.3 - term.1)
)
results_term_delta
## # A tibble: 5 x 5
## id term.1 term.2 term.3 term.delta
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 80.5 82 88.5 8
## 2 2 87.5 86 83.5 -4
## 3 3 79 80.5 80 1
## 4 4 69 75 76 7
## 5 5 66.5 68.5 63.5 -3
p6 <- ggplot(data=results_term_delta, aes(x=id, y=term.delta)) +
geom_bar(stat="identity", position=position_dodge()) +
theme(legend.position="bottom")
p6 + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(subtitle="Student Vs Result Delta",
y="Result Delta",
x="Student ID",
title="Bar Plot")
In conclusion, the visualization and analysis of the provided student data demonstrates Student 2 is the highest performing student across the three terms while Student 3 is the lowest performing student across the three terms. The tidy and dplyr packages were valuable tools in order to transform the data that allowed for easy identification of ther students’ performances while also identifying which students show improvement or not from Term 1 to Term 3. Overall, though the input data is small, the challenge to rectify the input into tidy forms did require several techniques to ensure the data matched the definition of Tidy Data.