I chose dataset that was posted by Jhalak Das and that includes the information about the students, their grades for tests for each term.
students <- rbind(c("id", "name", "phone", "sex and age", "test number","term 1", "term 2", "term 3"),
c(1, "Mike", 134, "m_12", "test 1", 76, 84, 87),
c(2, "Linda", 270, "f_13", "test 1", 88, 90, 73),
c(3, "Sam", 210, "m_11", "test 1", 78, 74, 80),
c(4, "Esther", 617, "f_12", "test 1", 68, 75, 74),
c(5, "Marya", 114, "f_14", "test 1", 65, 67, 64),
c(1, "Mike", 134, "m_12", "test 2", 85, 80, 90),
c(2, "Linda", 270, "f_13", "test 2", 87, 82, 94),
c(3, "Sam", 210, "m_11", "test 2", 80, 87, 80),
c(4, "Esther", 617, "f_12", "test 2", 70, 75, 78),
c(5, "Marya", 114, "f_14", "test 2", 68, 70, 63))
write.table(students, file = "students.csv", sep = ",", col.names=F, row.names=F)
The file will be downloaded from the Github repository to csv file using read.csv function.
students <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/Project-2/students.csv', header=TRUE, sep=",", check.names=FALSE)
By checking the data downloaded, we have 8 column with 10 rows.
summary(students)
## id name phone sex and age
## Min. :1 Length:10 Min. :114 Length:10
## 1st Qu.:2 Class :character 1st Qu.:134 Class :character
## Median :3 Mode :character Median :210 Mode :character
## Mean :3 Mean :269
## 3rd Qu.:4 3rd Qu.:270
## Max. :5 Max. :617
## test number term 1 term 2 term 3
## Length:10 Min. :65.00 Min. :67.00 Min. :63.00
## Class :character 1st Qu.:68.50 1st Qu.:74.25 1st Qu.:73.25
## Mode :character Median :77.00 Median :77.50 Median :79.00
## Mean :76.50 Mean :78.40 Mean :78.30
## 3rd Qu.:83.75 3rd Qu.:83.50 3rd Qu.:85.25
## Max. :88.00 Max. :90.00 Max. :94.00
head(students)
## id name phone sex and age test number term 1 term 2 term 3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Marya 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
Tidy data should follow the rules: each variable is its own column, each observation is its own row, and each value is its own cell. At the current moment, we have “wide data” an can transform it to “long data” as it is better to work with.
We are going to work with data from csv file and use tidyverse package to transform it.
First, the three columns with terms turn into one column called terms. Multiple variables should be stored in one column. pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
students <- students %>%
pivot_longer(c("term 1", "term 2", "term 3"), names_to = "term", values_to = "grade")
head(students)
## # A tibble: 6 x 7
## id name phone `sex and age` `test number` term grade
## <int> <chr> <int> <chr> <chr> <chr> <int>
## 1 1 Mike 134 m_12 test 1 term 1 76
## 2 1 Mike 134 m_12 test 1 term 2 84
## 3 1 Mike 134 m_12 test 1 term 3 87
## 4 2 Linda 270 f_13 test 1 term 1 88
## 5 2 Linda 270 f_13 test 1 term 2 90
## 6 2 Linda 270 f_13 test 1 term 3 73
Rename column “test number” to avoid spaces
students <- plyr::rename(students, c("test number" = "test_number"))
Then, instead of taking “test number” as variable, we can consider two columns with headers ‘test 1’ and ‘test 2’. The test number column is not a variable, it contains two variables that we will make as columns.
students <- students %>%
pivot_wider(names_from=test_number, values_from = grade)
head(students)
## # A tibble: 6 x 7
## id name phone `sex and age` term `test 1` `test 2`
## <int> <chr> <int> <chr> <chr> <int> <int>
## 1 1 Mike 134 m_12 term 1 76 85
## 2 1 Mike 134 m_12 term 2 84 80
## 3 1 Mike 134 m_12 term 3 87 90
## 4 2 Linda 270 f_13 term 1 88 87
## 5 2 Linda 270 f_13 term 2 90 82
## 6 2 Linda 270 f_13 term 3 73 94
In addition, “sex and age” should be two distinct columns; ‘sex’ and ‘age’.
students <- students %>%
separate("sex and age", into = c("sex", "age"))
head(students)
## # A tibble: 6 x 8
## id name phone sex age term `test 1` `test 2`
## <int> <chr> <int> <chr> <chr> <chr> <int> <int>
## 1 1 Mike 134 m 12 term 1 76 85
## 2 1 Mike 134 m 12 term 2 84 80
## 3 1 Mike 134 m 12 term 3 87 90
## 4 2 Linda 270 f 13 term 1 88 87
## 5 2 Linda 270 f 13 term 2 90 82
## 6 2 Linda 270 f 13 term 3 73 94
Also, we can substitute “m” and “f” for sex with “male” and “female” accordingly, and transform age column into numeric from character.
students["sex"][students["sex"] == "f"] <- "female"
students["sex"][students["sex"] == "m"] <- "male"
students$age = as.numeric(students$age)
head(students)
## # A tibble: 6 x 8
## id name phone sex age term `test 1` `test 2`
## <int> <chr> <int> <chr> <dbl> <chr> <int> <int>
## 1 1 Mike 134 male 12 term 1 76 85
## 2 1 Mike 134 male 12 term 2 84 80
## 3 1 Mike 134 male 12 term 3 87 90
## 4 2 Linda 270 female 13 term 1 88 87
## 5 2 Linda 270 female 13 term 2 90 82
## 6 2 Linda 270 female 13 term 3 73 94
summary(students)
## id name phone sex age
## Min. :1 Length:15 Min. :114 Length:15 Min. :11.0
## 1st Qu.:2 Class :character 1st Qu.:134 Class :character 1st Qu.:12.0
## Median :3 Mode :character Median :210 Mode :character Median :12.0
## Mean :3 Mean :269 Mean :12.4
## 3rd Qu.:4 3rd Qu.:270 3rd Qu.:13.0
## Max. :5 Max. :617 Max. :14.0
## term test 1 test 2
## Length:15 Min. :64.0 Min. :63.00
## Class :character 1st Qu.:70.5 1st Qu.:72.50
## Mode :character Median :75.0 Median :80.00
## Mean :76.2 Mean :79.27
## 3rd Qu.:82.0 3rd Qu.:86.00
## Max. :90.0 Max. :94.00
Finally, we will change values of column term from “term 1” to just “1” and make sure that we will get numeric values, not character.
students["term"][students["term"] == "term 1"] <- "1"
students["term"][students["term"] == "term 2"] <- "2"
students["term"][students["term"] == "term 3"] <- "3"
students$term = as.numeric(students$term)
head(students)
## # A tibble: 6 x 8
## id name phone sex age term `test 1` `test 2`
## <int> <chr> <int> <chr> <dbl> <dbl> <int> <int>
## 1 1 Mike 134 male 12 1 76 85
## 2 1 Mike 134 male 12 2 84 80
## 3 1 Mike 134 male 12 3 87 90
## 4 2 Linda 270 female 13 1 88 87
## 5 2 Linda 270 female 13 2 90 82
## 6 2 Linda 270 female 13 3 73 94
Since there were no questions for analysis in the post, we can create our own.
We can find out which age performs best on the tests using mean grade for test 1. We see that at age of 14 there is only 1 student, Maya who performs not so good as other students.
students %>%
group_by(age) %>%
summarise(test1Mean = mean(`test 1`),test2Mean = mean(`test 2`)) %>%
ggplot(data = ., aes(x = age, y = test1Mean,
fill = age)) +
geom_bar(stat = "identity") +
theme_light() +
labs(title="Test grades, mean by age", x= "Age", y = "Mean test grades") +
theme(plot.title = element_text(hjust = 0.5))
We can find out which sex performs better during the test. In general, there is no big difference between female and male mean test grades but male has a little gigher grade on average.
students %>%
group_by(sex) %>%
summarise(test1Mean = mean(`test 1`),test2Mean = mean(`test 2`))
## # A tibble: 2 x 3
## sex test1Mean test2Mean
## <chr> <dbl> <dbl>
## 1 female 73.8 76.3
## 2 male 79.8 83.7
ggplot(students, aes(x=sex, y=`test 1`, color=sex)) +
geom_boxplot() +
theme_light() +
labs(title="Test grades, mean by sex", x= "Sex", y = "Mean test grades") +
theme(plot.title = element_text(hjust = 0.5))
During the work, we have reviewed how to organize data in a way called “tidy data” using package tidyverse, how to transform wide format to long format.
By analyzing data, we see that in general all sex and ages are almost at the same level of knowledge based on the test results. But still students of age 13 and male students perform a little better.