The original Tidyverse assignment was done by Charles Ugiagbe. Code chunks that contain extensions or additions are wrapped in comments made of five hash signs. In this extension, I added several functions and arguments to explore the data, most of them from the Tidyverse: as_tibble(), str_replace_all(), contains(), the & logical operator, sum(), rowwise(), coord_flip(), xlab(), ylab(), and the caption argument of labs().
The Tidyverse is a collection of R packages that share the same design philosophy, grammar, and data structures. The core tidyverse bundle covered here includes 8 packages:
dplyr: a set of tools for efficiently manipulating datasets;
forcats: a package for manipulating categorical variables (factors);
ggplot2: a classic package for data visualization;
purrr: another set of tools for manipulating data, especially vectors, a complement to dplyr;
readr: a set of faster and more user-friendly functions for reading data than the R default functions;
stringr: a package for common string operations;
tibble: a package for reimagining data.frames in a modern way;
tidyr: a package for reshaping data, a complement to dplyr.
In this assignment, I will use some handy functions from the tidyverse to perform some analysis.
library(tidyverse)
The dataset in this project is called “student performance” and comes from https://www.kaggle.com/datasets. It contains a sample of 1000 observations of 8 variables.
I use the read.csv() function to import the CSV file into R.
url <- "https://raw.githubusercontent.com/omocharly/DATA607_PROJECTS/main/StudentsPerformance.csv"
data <- read.csv(url, header = TRUE)
#write.csv(data,"C:/Users/newma/OneDrive/Desktop/MSDS Fall 2021/DATA 607 - Data Acquisition and Mgt/AssignmentWeek9/TidyEXTEND/to_extend_c.csv", row.names = FALSE)
head(data)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
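read.csv() comes from base R rather than the tidyverse. As a hedged aside, the readr equivalent is read_csv(), which returns a tibble and keeps column names as written instead of converting them to syntactic names. The sketch below parses a two-row inline CSV that is made up purely for illustration; it is not the Kaggle file, and the names `csv_text` and `toy` are hypothetical.

```r
library(readr)  # part of the tidyverse

# Hypothetical two-row CSV, invented for this example only
csv_text <- "gender,math score\nfemale,72\nmale,47\n"

# I() marks the string as literal data rather than a file path
toy <- read_csv(I(csv_text))

names(toy)   # column names are kept as written, including the space
class(toy)   # a tibble rather than a plain data.frame

# Base R, by contrast, turns "math score" into "math.score" by default
names(read.csv(text = csv_text))
```

This difference in name handling is why the columns imported with read.csv() above contain dots.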
data1<-as_tibble(data)
head(data1)
## # A tibble: 6 x 8
## gender race.ethnicity parental.level.of.~ lunch test.preparation~ math.score
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 female group B bachelor's degree standa~ none 72
## 2 female group C some college standa~ completed 69
## 3 female group B master's degree standa~ none 90
## 4 male group A associate's degree free/r~ none 47
## 5 male group C some college standa~ none 76
## 6 female group B associate's degree standa~ none 71
## # ... with 2 more variables: reading.score <int>, writing.score <int>
glimpse() gives a quick look at the structure of the data: each column, its type, and its first few values.
glimpse(data)
## Rows: 1,000
## Columns: 8
## $ gender <chr> "female", "female", "female", "male", "mal~
## $ race.ethnicity <chr> "group B", "group C", "group B", "group A"~
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast~
## $ lunch <chr> "standard", "standard", "standard", "free/~
## $ test.preparation.course <chr> "none", "completed", "none", "none", "none~
## $ math.score <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58~
## $ reading.score <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54~
## $ writing.score <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52~
rename() changes the names of individual columns using the new_name = old_name syntax.
data1 <- data %>% rename(race = race.ethnicity, parental_Educatn_level= parental.level.of.education, test.prep = test.preparation.course)
head(data1)
## gender race parental_Educatn_level lunch test.prep math.score
## 1 female group B bachelor's degree standard none 72
## 2 female group C some college standard completed 69
## 3 female group B master's degree standard none 90
## 4 male group A associate's degree free/reduced none 47
## 5 male group C some college standard none 76
## 6 female group B associate's degree standard none 71
## reading.score writing.score
## 1 72 74
## 2 90 88
## 3 95 93
## 4 57 44
## 5 78 75
## 6 83 78
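When many columns follow the same naming pattern, renaming them one at a time gets tedious. As a sketch, rename_with() applies a function to the column names; here it swaps every dot for an underscore. The one-row tibble `toy` is illustrative only and reuses the scores from the first row shown above.

```r
library(dplyr)
library(stringr)

toy <- tibble(math.score = 72, reading.score = 72, writing.score = 74)

# rename(): new_name = old_name, one column at a time
toy %>% rename(math = math.score)

# rename_with(): apply a function to every column name
renamed <- toy %>%
  rename_with(~ str_replace_all(.x, fixed("."), "_"))

names(renamed)
```

fixed() is used so that the dot is treated as a literal character and not as the regular-expression wildcard.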
data1$lunch<- str_replace_all(data1$lunch,"/"," or ")
data1$test.prep<- str_replace_all(data1$test.prep,"none","uncompleted")
head(data1)
## gender race parental_Educatn_level lunch test.prep math.score
## 1 female group B bachelor's degree standard uncompleted 72
## 2 female group C some college standard completed 69
## 3 female group B master's degree standard uncompleted 90
## 4 male group A associate's degree free or reduced uncompleted 47
## 5 male group C some college standard uncompleted 76
## 6 female group B associate's degree standard uncompleted 71
## reading.score writing.score
## 1 72 74
## 2 90 88
## 3 95 93
## 4 57 44
## 5 78 75
## 6 83 78
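The two replacements above assign into columns with `$`. A sketch of the same cleanup written as a single dplyr pipeline with mutate() follows; the tibble `toy` and its three rows are made up to mimic the two columns being cleaned.

```r
library(dplyr)
library(stringr)

# Hypothetical rows that mimic the lunch and test.prep columns
toy <- tibble(
  lunch     = c("standard", "free/reduced", "standard"),
  test.prep = c("none", "completed", "none")
)

cleaned <- toy %>%
  mutate(
    lunch     = str_replace_all(lunch, "/", " or "),
    # fixed() treats the pattern as a literal string, not a regular expression
    test.prep = str_replace_all(test.prep, fixed("none"), "uncompleted")
  )

cleaned
```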
select() picks columns by name. It can also select a range of consecutive variables or take the complement of a set of variables.
#Original Syntax
data2 <- data1 %>%
select(gender, math.score, reading.score, writing.score)
head(data2)
## gender math.score reading.score writing.score
## 1 female 72 72 74
## 2 female 69 90 88
## 3 female 90 95 93
## 4 male 47 57 44
## 5 male 76 78 75
## 6 female 71 83 78
data3 <- data1 %>%
select(gender, contains("score"))
head(data3)
## gender math.score reading.score writing.score
## 1 female 72 72 74
## 2 female 69 90 88
## 3 female 90 95 93
## 4 male 47 57 44
## 5 male 76 78 75
## 6 female 71 83 78
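The description of select() mentions ranges and complements, which the chunks above do not use. A short sketch of both, on a made-up tibble whose column names mirror the renamed data (the values are the first two rows shown earlier, and `toy` is a hypothetical name):

```r
library(dplyr)

toy <- tibble(
  gender        = c("female", "female"),
  lunch         = c("standard", "standard"),
  math.score    = c(72L, 69L),
  reading.score = c(72L, 90L),
  writing.score = c(74L, 88L)
)

# A range of consecutive columns, from math.score through writing.score
scores <- toy %>% select(math.score:writing.score)

# The complement: every column except lunch
no_lunch <- toy %>% select(-lunch)

# Helpers match on names: ends_with(), starts_with(), contains()
by_suffix <- toy %>% select(gender, ends_with("score"))

names(scores)
names(no_lunch)
names(by_suffix)
```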
I use the filter() function to keep the students whose math score equals 100 and whose writing and reading scores are both greater than 95.
data3 <- data2 %>%
filter(math.score == 100, writing.score > 95, reading.score > 95)
data3
## gender math.score reading.score writing.score
## 1 female 100 100 100
## 2 male 100 97 99
## 3 male 100 100 100
## 4 female 100 100 100
data3 <- data2 %>%
filter(math.score == 100 & writing.score > 95 & reading.score > 95)
data3
## gender math.score reading.score writing.score
## 1 female 100 100 100
## 2 male 100 97 99
## 3 male 100 100 100
## 4 female 100 100 100
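Both filter() calls return the same four rows because comma-separated conditions are combined with a logical AND, exactly like `&`. A sketch on made-up scores (the tibble `toy` is hypothetical) that also shows `|` for OR:

```r
library(dplyr)

toy <- tibble(
  math.score    = c(100L, 100L, 90L, 60L),
  reading.score = c(100L, 80L, 99L, 70L)
)

# Comma-separated conditions are ANDed together
with_comma <- toy %>% filter(math.score == 100, reading.score > 95)

# The same filter written with an explicit &
with_and <- toy %>% filter(math.score == 100 & reading.score > 95)

# | keeps rows where at least one condition holds
with_or <- toy %>% filter(math.score == 100 | reading.score > 95)

nrow(with_comma)
nrow(with_and)
nrow(with_or)
```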
arrange() orders the rows of a data frame by the values of selected columns.
data4 <- data2 %>% arrange(desc(math.score))
head(data4)
## gender math.score reading.score writing.score
## 1 male 100 100 93
## 2 female 100 92 97
## 3 female 100 100 100
## 4 male 100 96 86
## 5 male 100 97 99
## 6 male 100 100 100
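In the output above, all six rows tie at 100 in math. A sketch of breaking such ties by passing further columns to arrange(), on a hypothetical three-row tibble named `toy`:

```r
library(dplyr)

toy <- tibble(
  gender        = c("male", "female", "male"),
  math.score    = c(100L, 100L, 95L),
  reading.score = c(96L, 100L, 99L)
)

# Sort by math score (descending), then by reading score (descending) within ties
sorted <- toy %>% arrange(desc(math.score), desc(reading.score))

sorted
```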
mutate() adds new variables that are functions of existing ones and preserves the existing variables.
data5 <- data4 %>%
mutate(avg.score = (math.score + writing.score + reading.score) / 3)
head(data5)
## gender math.score reading.score writing.score avg.score
## 1 male 100 100 93 97.66667
## 2 female 100 92 97 96.33333
## 3 female 100 100 100 100.00000
## 4 male 100 96 86 94.00000
## 5 male 100 97 99 98.66667
## 6 male 100 100 100 100.00000
data5 <- data4 %>% rowwise() %>%
mutate(avg.score = sum(c(math.score,writing.score,reading.score))/ 3)
head(data5)
## # A tibble: 6 x 5
## # Rowwise:
## gender math.score reading.score writing.score avg.score
## <chr> <int> <int> <int> <dbl>
## 1 male 100 100 93 97.7
## 2 female 100 92 97 96.3
## 3 female 100 100 100 100
## 4 male 100 96 86 94
## 5 male 100 97 99 98.7
## 6 male 100 100 100 100
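Note that the result of rowwise() stays row-wise (the `# Rowwise:` line in the printout), so later verbs keep operating one row at a time until ungroup() is called. A sketch of the same average using c_across(), followed by dropping the row-wise grouping; `toy` is an illustrative tibble holding the first two rows of the sorted output above.

```r
library(dplyr)

toy <- tibble(
  gender        = c("male", "female"),
  math.score    = c(100L, 100L),
  reading.score = c(100L, 92L),
  writing.score = c(93L, 97L)
)

averaged <- toy %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into one vector
  mutate(avg.score = mean(c_across(math.score:writing.score))) %>%
  ungroup()  # back to a regular tibble

averaged$avg.score
inherits(averaged, "rowwise_df")
```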
case_when() allows you to vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned, which is why the chunk below ends with a TRUE ~ 'Fail' catch-all.
data6 <- data5 %>%
mutate(pass_fail_grade = case_when(avg.score >= 85 ~ 'Pass'
,TRUE ~ 'Fail' )
)
head(data6)
## # A tibble: 6 x 6
## # Rowwise:
## gender math.score reading.score writing.score avg.score pass_fail_grade
## <chr> <int> <int> <int> <dbl> <chr>
## 1 male 100 100 93 97.7 Pass
## 2 female 100 92 97 96.3 Pass
## 3 female 100 100 100 100 Pass
## 4 male 100 96 86 94 Pass
## 5 male 100 97 99 98.7 Pass
## 6 male 100 100 100 100 Pass
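case_when() checks its conditions in order and uses the first one that matches, so branches should run from most specific to most general. A sketch with three bands follows; the 70 cutoff and the "Borderline" label are invented for illustration, and only the 85 pass mark comes from the chunk above.

```r
library(dplyr)

avg.score <- c(97.7, 85, 72.3, 40)

grade <- case_when(
  avg.score >= 85 ~ "Pass",        # checked first
  avg.score >= 70 ~ "Borderline",  # hypothetical middle band
  TRUE            ~ "Fail"         # catch-all for everything else
)

# Without a catch-all branch, unmatched values become NA
no_default <- case_when(avg.score >= 85 ~ "Pass")

grade
no_default
```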
data %>% group_by(gender) %>%
summarize( math_score = sum (math.score)/ n())
## # A tibble: 2 x 2
## gender math_score
## <chr> <dbl>
## 1 female 63.6
## 2 male 68.7
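The grouped summary above computes the mean by hand as sum() / n(). A sketch of the same calculation with mean(), plus a count per group, on a hypothetical four-row tibble named `toy`:

```r
library(dplyr)

toy <- tibble(
  gender     = c("female", "female", "male", "male"),
  math.score = c(72L, 69L, 47L, 76L)
)

by_gender <- toy %>%
  group_by(gender) %>%
  summarize(
    n_students  = n(),                    # rows in each group
    manual_mean = sum(math.score) / n(),  # the hand-written mean
    math_mean   = mean(math.score)        # the built-in equivalent
  )

by_gender
```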
ggplot2 is a system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
# Boxplot of the average score by gender, with a centred title
ggplot(data = data6, aes(x = gender, y = avg.score, col = gender)) + geom_boxplot() +
labs(title = "Distribution of Students Average score") + theme(plot.title = element_text(hjust = 0.5))
# Same boxplot with flipped axes, a caption, and explicit axis labels
ggplot(data = data6, aes(x = gender, y = avg.score, col = gender)) + geom_boxplot() + coord_flip() +
labs(title = "Distribution of Students Average score",
caption = "“student performance” from https://www.kaggle.com/datasets") +
xlab("Gender") + ylab("Average Score") + theme(plot.title = element_text(hjust = 0.5))
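As a sketch of how the layers combine, a plot can be stored in an object and built up step by step. labs() also accepts the axis labels directly, so x = and y = can stand in for separate xlab() and ylab() calls. The data frame `toy` and its values are made up for illustration.

```r
library(ggplot2)

# Hypothetical average scores, invented for this example only
toy <- data.frame(
  gender    = c("female", "female", "female", "male", "male", "male"),
  avg.score = c(72.7, 82.3, 92.7, 49.3, 76.3, 88.0)
)

p <- ggplot(toy, aes(x = gender, y = avg.score, colour = gender)) +
  geom_boxplot() +
  coord_flip() +  # swap the axes so the boxes run horizontally
  labs(
    title   = "Distribution of Students Average score",
    x       = "Gender",
    y       = "Average Score",
    caption = "Made-up data for illustration"
  ) +
  theme(plot.title = element_text(hjust = 0.5))  # centre the title

# print(p) draws the plot
```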
Other uses of the Tidyverse can be found in the textbook “R for Data Science” and in other online resources.