The tidyverse is a collection of R packages that share a common design philosophy, grammar, and data structures. At the time of writing, the tidyverse bundle contains 8 core packages:
dplyr: a set of tools for efficiently manipulating datasets;
forcats: a package for manipulating categorical variables (factors);
ggplot2: a classic package for data visualization;
purrr: a set of tools for working with functions and vectors, a complement to dplyr;
readr: a set of functions for reading data that are faster and more user-friendly than R's default functions;
stringr: a package for common string operations;
tibble: a package that reimagines data.frames in a modern way;
tidyr: a package for reshaping data, a complement to dplyr.
In this assignment, I will use some handy functions from the tidyverse to perform a short analysis.
library(tidyverse)
The dataset in this project is called “student performance” and comes from https://www.kaggle.com/datasets. It contains a sample of 1000 observations of 8 variables.
I use the read.csv() function to import the CSV file into R.
url <- "https://raw.githubusercontent.com/omocharly/DATA607_PROJECTS/main/StudentsPerformance.csv"
data <- read.csv(url, header = TRUE)
head(data)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
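Since readr is part of the tidyverse, the same kind of import could also be done with read_csv(). The following is only a minimal sketch, not the code used in this assignment: it reads a small inline CSV (the made-up rows in `csv_text`, a hypothetical name) so that it runs without a network connection, and it assumes readr 2.0 or later for the I() literal-data syntax.

```r
library(readr)

# Made-up sample rows for illustration only; not taken from the real file.
csv_text <- "gender,math score,reading score\nfemale,72,72\nmale,47,57\n"

# I() marks the string as literal data rather than a file path (readr >= 2.0).
toy <- read_csv(I(csv_text), show_col_types = FALSE)

# Column names are kept exactly as written in the header, spaces included.
names(toy)
```

Unlike read.csv(), which by default rewrites names that are not valid R identifiers (a space becomes a dot), read_csv() returns a tibble and leaves the header text unchanged, so such columns need backticks when referenced.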
glimpse() gives a quick view of the data structure: each column's name, its type, and its first few values.
glimpse(data)
## Rows: 1,000
## Columns: 8
## $ gender <chr> "female", "female", "female", "male", "mal~
## $ race.ethnicity <chr> "group B", "group C", "group B", "group A"~
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast~
## $ lunch <chr> "standard", "standard", "standard", "free/~
## $ test.preparation.course <chr> "none", "completed", "none", "none", "none~
## $ math.score <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58~
## $ reading.score <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54~
## $ writing.score <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52~
rename() changes the names of individual columns using the new_name = old_name syntax.
data1 <- data %>% rename(race = race.ethnicity, parental_Educatn_level= parental.level.of.education, test.prep = test.preparation.course)
head(data1)
## gender race parental_Educatn_level lunch test.prep math.score
## 1 female group B bachelor's degree standard none 72
## 2 female group C some college standard completed 69
## 3 female group B master's degree standard none 90
## 4 male group A associate's degree free/reduced none 47
## 5 male group C some college standard none 76
## 6 female group B associate's degree standard none 71
## reading.score writing.score
## 1 72 74
## 2 90 88
## 3 95 93
## 4 57 44
## 5 78 75
## 6 83 78
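When many columns follow the same naming pattern, renaming them one by one gets tedious. As a rough sketch (assuming dplyr 1.0 or later, and using a made-up data frame `toy` in place of the real data), rename_with() applies a function to the column names instead:

```r
library(dplyr)

# Toy data frame with dotted column names, standing in for `data`.
toy <- data.frame(math.score = c(72, 69), reading.score = c(72, 90))

# rename_with() applies a function to the column names;
# fixed = TRUE makes gsub() treat "." literally instead of as a regex wildcard.
toy_renamed <- toy %>% rename_with(~ gsub(".", "_", .x, fixed = TRUE))

names(toy_renamed)
```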
select() keeps only the columns you name; it can also select a range of consecutive variables or take the complement of a set of variables.
data2 <- data1 %>%
  select(gender, math.score, reading.score, writing.score)
head(data2)
## gender math.score reading.score writing.score
## 1 female 72 72 74
## 2 female 69 90 88
## 3 female 90 95 93
## 4 male 47 57 44
## 5 male 76 78 75
## 6 female 71 83 78
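The range and complement forms of select() can be sketched as follows. The data frame `toy` is a made-up stand-in with the same style of column names, not the real dataset:

```r
library(dplyr)

# Toy stand-in for the student data.
toy <- data.frame(
  gender = c("female", "male"),
  lunch = c("standard", "free/reduced"),
  math.score = c(72, 47),
  reading.score = c(72, 57),
  writing.score = c(74, 44)
)

# A range of consecutive columns, from math.score through writing.score.
scores <- toy %>% select(math.score:writing.score)

# The complement: every column except lunch.
no_lunch <- toy %>% select(-lunch)

names(scores)
names(no_lunch)
```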
I use the filter() function to keep only the students with a math score of 100 and with reading and writing scores above 95.
data3 <- data2 %>%
  filter(math.score == 100, writing.score > 95, reading.score > 95)
data3
## gender math.score reading.score writing.score
## 1 female 100 100 100
## 2 male 100 97 99
## 3 male 100 100 100
## 4 female 100 100 100
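Conditions separated by commas in filter() must all be true for a row to be kept. The sketch below, on a made-up data frame `toy`, contrasts that with the | operator, which keeps a row when either condition holds:

```r
library(dplyr)

# Toy scores for illustration only.
toy <- data.frame(
  math.score = c(100, 100, 80),
  reading.score = c(97, 90, 99)
)

# Comma-separated conditions must all hold (logical AND).
both <- toy %>% filter(math.score == 100, reading.score > 95)

# Use | when either condition is enough (logical OR).
either <- toy %>% filter(math.score == 100 | reading.score > 95)

nrow(both)
nrow(either)
```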
arrange() orders the rows of a data frame by the values of selected columns; wrapping a column in desc() sorts it in descending order.
data4 <- data2 %>% arrange(desc(math.score))
head(data4)
## gender math.score reading.score writing.score
## 1 male 100 100 93
## 2 female 100 92 97
## 3 female 100 100 100
## 4 male 100 96 86
## 5 male 100 97 99
## 6 male 100 100 100
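Several students are tied on a math score of 100, and arrange() leaves tied rows in their existing order. One possible way to break ties, sketched here on a made-up data frame `toy`, is to pass a second sorting column:

```r
library(dplyr)

# Toy scores with a tie on math.score.
toy <- data.frame(
  math.score = c(100, 90, 100),
  reading.score = c(92, 99, 100)
)

# Sort by math score first; ties are then ordered by reading score.
sorted <- toy %>% arrange(desc(math.score), desc(reading.score))

sorted
```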
mutate() adds new variables that are functions of existing ones and preserves the existing variables.
data5 <- data4 %>%
  mutate(avg.score = (math.score + writing.score + reading.score) / 3)
head(data5)
## gender math.score reading.score writing.score avg.score
## 1 male 100 100 93 97.66667
## 2 female 100 92 97 96.33333
## 3 female 100 100 100 100.00000
## 4 male 100 96 86 94.00000
## 5 male 100 97 99 98.66667
## 6 male 100 100 100 100.00000
case_when() allows you to vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned; here the TRUE ~ 'Fail' branch acts as a catch-all, so every student gets a grade.
data6 <- data5 %>%
  mutate(pass_fail_grade = case_when(
    avg.score >= 85 ~ 'Pass',
    TRUE ~ 'Fail'
  ))
head(data6)
## gender math.score reading.score writing.score avg.score pass_fail_grade
## 1 male 100 100 93 97.66667 Pass
## 2 female 100 92 97 96.33333 Pass
## 3 female 100 100 100 100.00000 Pass
## 4 male 100 96 86 94.00000 Pass
## 5 male 100 97 99 98.66667 Pass
## 6 male 100 100 100 100.00000 Pass
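group_by() and summarize() compute one summary row per group; here, the mean math score for each gender.
data %>%
  group_by(gender) %>%
  summarize(math_score = sum(math.score) / n())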
## # A tibble: 2 x 2
## gender math_score
## <chr> <dbl>
## 1 female 63.6
## 2 male 68.7
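summarize() can return several statistics at once. As a small sketch on a made-up data frame `toy`, the group size and the mean can be computed together, with mean() replacing the manual sum divided by n():

```r
library(dplyr)

# Toy data for illustration only.
toy <- data.frame(
  gender = c("female", "female", "male"),
  math.score = c(70, 90, 60)
)

by_gender <- toy %>%
  group_by(gender) %>%
  summarize(
    n_students = n(),             # number of rows in each group
    mean_math = mean(math.score)  # equals sum(math.score) / n() when there are no NAs
  )

by_gender
```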
ggplot2 is a system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
ggplot(data = data6, aes(x = gender, y = avg.score, col = gender)) +
  geom_boxplot() +
  labs(title = "Distribution of Students Average score") +
  theme(plot.title = element_text(hjust = 0.5))
Further uses of the tidyverse are covered in the textbook “R for Data Science” and other online resources.