DATA607 Tidyverse Extend Vignette

About the Assignment

The original Tidyverse assignment was done by Charles Ugiagbe. Code chunks that contain extensions or additions are wrapped in comments made of five hash marks. In this assignment, I added several functions and arguments to explore the data further: as_tibble(), str_replace_all(), contains(), the & logical operator, sum(), rowwise(), coord_flip(), xlab(), ylab(), and the caption argument of labs().

Introduction

The tidyverse is a collection of R packages that share an underlying design philosophy, grammar, and data structures. There are currently 8 packages in the core tidyverse bundle:

dplyr: a set of tools for efficiently manipulating datasets;
forcats: a package for manipulating categorical variables / factors;
ggplot2: a classic package for data visualization;
purrr: a set of tools for working with functions and vectors, a complement to dplyr;
readr: a set of faster and more user friendly functions to read data than R default functions;
stringr: a package for common string operations;
tibble: a package for reimagining data.frames in a modern way;
tidyr: a package for reshaping data, a complement to dplyr.

In this assignment, I will use some handy tidyverse functions to perform a short analysis.

library(tidyverse)

Dataset

The dataset in this project is called “Student Performance” and comes from https://www.kaggle.com/datasets. It contains a sample of 1000 observations of 8 variables.

I use the base R read.csv() function to import the CSV file into R.

url <- "https://raw.githubusercontent.com/omocharly/DATA607_PROJECTS/main/StudentsPerformance.csv"
data <- read.csv(url, header = TRUE)
#write.csv(data,"C:/Users/newma/OneDrive/Desktop/MSDS Fall 2021/DATA 607 - Data Acquisition and Mgt/AssignmentWeek9/TidyEXTEND/to_extend_c.csv", row.names = FALSE)

head(data)

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78
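For comparison, the tidyverse has its own reader, read_csv() from readr. The sketch below is only an illustration: it assumes readr 2.0 or later and uses a made-up three-row sample (csv_text is a hypothetical stand-in, not the real file) so that it runs without a network connection.

```r
library(readr)

# Hypothetical three-row sample standing in for the real CSV file
csv_text <- "gender,math score,reading score
female,72,72
female,69,90
male,47,57"

# I() tells read_csv() to treat the string as literal data, not a file path
sample_tbl <- read_csv(I(csv_text), show_col_types = FALSE)

# read_csv() returns a tibble and keeps the header text unchanged,
# whereas read.csv() would convert "math score" to "math.score"
names(sample_tbl)
class(sample_tbl)
```

If read_csv() were used on the real file, the column names would keep their original header text and the later code that refers to names such as math.score would need adjusting.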

Converting the data frame to a tibble is a good first step in data wrangling: a tibble keeps the important features of a data frame while dropping outdated behaviors such as partial matching of column names.

data1<-as_tibble(data)
head(data1)

## # A tibble: 6 x 8
##   gender race.ethnicity parental.level.of.~ lunch   test.preparation~ math.score
##   <chr>  <chr>          <chr>               <chr>   <chr>                  <int>
## 1 female group B        bachelor's degree   standa~ none                      72
## 2 female group C        some college        standa~ completed                 69
## 3 female group B        master's degree     standa~ none                      90
## 4 male   group A        associate's degree  free/r~ none                      47
## 5 male   group C        some college        standa~ none                      76
## 6 female group B        associate's degree  standa~ none                      71
## # ... with 2 more variables: reading.score <int>, writing.score <int>
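Two of the behaviors that differ between a data frame and a tibble can be seen on a small example. This is a minimal sketch on a hypothetical three-row table (df and tb are illustrative names, not objects from the analysis above).

```r
library(tibble)

# Hypothetical toy data frame
df <- data.frame(gender = c("female", "female", "male"),
                 math.score = c(72, 69, 90))
tb <- as_tibble(df)

# Selecting one column with [ : a data frame drops to a plain vector,
# a tibble stays a tibble
class(df[, "math.score"])
class(tb[, "math.score"])

# Partial matching with $ : a data frame silently matches "math" to
# "math.score", a tibble returns NULL (with a warning)
df$math
suppressWarnings(tb$math)
```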

glimpse() gives a compact, transposed view of the data: one line per column showing its name, type, and first few values.

glimpse(data)

## Rows: 1,000
## Columns: 8
## $ gender                      <chr> "female", "female", "female", "male", "mal~
## $ race.ethnicity              <chr> "group B", "group C", "group B", "group A"~
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast~
## $ lunch                       <chr> "standard", "standard", "standard", "free/~
## $ test.preparation.course     <chr> "none", "completed", "none", "none", "none~
## $ math.score                  <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58~
## $ reading.score               <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54~
## $ writing.score               <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52~

dplyr::rename()

rename() changes the names of individual columns using new_name = old_name pairs.

data1 <- data %>% rename(race = race.ethnicity, parental_Educatn_level= parental.level.of.education, test.prep = test.preparation.course)
head(data1)

##   gender    race parental_Educatn_level        lunch test.prep math.score
## 1 female group B      bachelor's degree     standard      none         72
## 2 female group C           some college     standard completed         69
## 3 female group B        master's degree     standard      none         90
## 4   male group A     associate's degree free/reduced      none         47
## 5   male group C           some college     standard      none         76
## 6 female group B     associate's degree     standard      none         71
##   reading.score writing.score
## 1            72            74
## 2            90            88
## 3            95            93
## 4            57            44
## 5            78            75
## 6            83            78
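When many columns need the same kind of change, rename_with() applies a function to the names instead of listing each pair. The following is a minimal sketch on a hypothetical one-row tibble; replacing dots with underscores is just an example convention, not something the original analysis does.

```r
library(dplyr)
library(stringr)

# Hypothetical one-row table with dotted column names
scores <- tibble(math.score = 72, reading.score = 72, writing.score = 74)

# Apply one renaming rule to every column: replace "." with "_"
renamed <- scores %>%
  rename_with(~ str_replace_all(.x, fixed("."), "_"))

names(renamed)
```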

The str_replace_all() function replaces every occurrence of the “/” character in the lunch variable with “ or ” and every none value in the test.prep variable with uncompleted. This makes the values more readable.

data1$lunch <- str_replace_all(data1$lunch, "/", " or ")
data1$test.prep <- str_replace_all(data1$test.prep, "none", "uncompleted")
head(data1)

##   gender    race parental_Educatn_level           lunch   test.prep math.score
## 1 female group B      bachelor's degree        standard uncompleted         72
## 2 female group C           some college        standard   completed         69
## 3 female group B        master's degree        standard uncompleted         90
## 4   male group A     associate's degree free or reduced uncompleted         47
## 5   male group C           some college        standard uncompleted         76
## 6 female group B     associate's degree        standard uncompleted         71
##   reading.score writing.score
## 1            72            74
## 2            90            88
## 3            95            93
## 4            57            44
## 5            78            75
## 6            83            78
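The same replacements can also be written inside mutate(), which keeps the whole cleaning step in one pipeline. This is a sketch on a hypothetical two-row tibble (toy is an illustrative name). The pattern is a regular expression by default, so fixed() is used for the literal “/” and the anchors ^ and $ restrict the second replacement to values that are exactly none.

```r
library(dplyr)
library(stringr)

# Hypothetical toy table with the two columns being cleaned
toy <- tibble(lunch = c("standard", "free/reduced"),
              test.prep = c("none", "completed"))

cleaned <- toy %>%
  mutate(
    # fixed() matches the literal character rather than a regex
    lunch = str_replace_all(lunch, fixed("/"), " or "),
    # ^none$ only matches values that are exactly "none"
    test.prep = str_replace_all(test.prep, "^none$", "uncompleted")
  )

cleaned
```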

dplyr::select()

select() keeps only the columns you name. It can also select a range of consecutive variables or take the complement of a set of variables.

#Original Syntax
data2 <- data1 %>%
  select(gender, math.score, reading.score, writing.score)
head(data2)

##   gender math.score reading.score writing.score
## 1 female         72            72            74
## 2 female         69            90            88
## 3 female         90            95            93
## 4   male         47            57            44
## 5   male         76            78            75
## 6 female         71            83            78

Using the contains() helper alongside select() picks up every column whose name contains a given string, so many variables can be selected with less code.

New syntax

data3 <- data1 %>%
  select(gender, contains("score"))
head(data3)

##   gender math.score reading.score writing.score
## 1 female         72            72            74
## 2 female         69            90            88
## 3 female         90            95            93
## 4   male         47            57            44
## 5   male         76            78            75
## 6 female         71            83            78
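A few other ways of selecting columns are worth seeing side by side: a range of consecutive columns, the complement of a column, and the ends_with() helper. This is a minimal sketch on a hypothetical one-row tibble (toy is an illustrative name).

```r
library(dplyr)

# Hypothetical one-row table with the same kind of columns as the dataset
toy <- tibble(gender = "female", lunch = "standard",
              math.score = 72, reading.score = 72, writing.score = 74)

# A range of consecutive columns
range_cols <- select(toy, math.score:writing.score)

# The complement: everything except lunch
no_lunch <- select(toy, -lunch)

# Columns whose names end with a given string
score_cols <- select(toy, ends_with("score"))

names(range_cols)
names(no_lunch)
names(score_cols)
```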

dplyr::filter()

I use the filter() function to keep students with a math score of exactly 100 and writing and reading scores greater than 95.

data3 <- data2 %>%
  filter(math.score == 100, writing.score > 95, reading.score > 95)
data3

##   gender math.score reading.score writing.score
## 1 female        100           100           100
## 2   male        100            97            99
## 3   male        100           100           100
## 4 female        100           100           100

The logical operator & can be used in place of “,” in filter(). The two forms are equivalent, and & makes the combined condition explicit.

New syntax with & logical operator

data3 <- data2 %>%
  filter(math.score == 100 & writing.score > 95 & reading.score > 95)
data3

##   gender math.score reading.score writing.score
## 1 female        100           100           100
## 2   male        100            97            99
## 3   male        100           100           100
## 4 female        100           100           100
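The equivalence of the comma and & forms, and the contrast with the | (or) operator, can be checked on a small example. This is a sketch on a hypothetical three-row tibble; the values are made up for illustration.

```r
library(dplyr)

# Hypothetical scores for three students
toy <- tibble(math.score    = c(100, 100, 90),
              writing.score = c(99, 80, 100),
              reading.score = c(97, 100, 100))

# Conditions separated by commas are combined with AND
with_comma <- filter(toy, math.score == 100, writing.score > 95, reading.score > 95)

# The same filter written with the & operator
with_and <- filter(toy, math.score == 100 & writing.score > 95 & reading.score > 95)

# | keeps rows where at least one condition holds
either <- filter(toy, math.score == 100 | writing.score > 95)

isTRUE(all.equal(with_comma, with_and))
nrow(with_comma)
nrow(either)
```

Only the first student meets all three conditions, while every student meets at least one of the two conditions in the | version.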

dplyr::arrange()

arrange() orders the rows of a data frame by the values of selected columns. Here desc() sorts math.score from highest to lowest.

data4 <- data2 %>% arrange(desc(math.score))
head(data4)

##   gender math.score reading.score writing.score
## 1   male        100           100            93
## 2 female        100            92            97
## 3 female        100           100           100
## 4   male        100            96            86
## 5   male        100            97            99
## 6   male        100           100           100
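Several students share the top math score, so a second sort column can be used to break ties. This is a minimal sketch on a hypothetical three-row tibble (toy and sorted are illustrative names).

```r
library(dplyr)

# Hypothetical rows with a tie on math.score
toy <- tibble(gender        = c("male", "female", "female"),
              math.score    = c(100, 100, 95),
              reading.score = c(96, 100, 99))

# Sort by math score first, then by reading score within ties,
# both from highest to lowest
sorted <- arrange(toy, desc(math.score), desc(reading.score))

sorted
```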

dplyr::mutate()

mutate() adds new variables that are functions of existing ones and preserves the existing variables.

data5 <- data4 %>%
  mutate(avg.score = (math.score + writing.score + reading.score) / 3)
head(data5)

##   gender math.score reading.score writing.score avg.score
## 1   male        100           100            93  97.66667
## 2 female        100            92            97  96.33333
## 3 female        100           100           100 100.00000
## 4   male        100            96            86  94.00000
## 5   male        100            97            99  98.66667
## 6   male        100           100           100 100.00000

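The rowwise() function can be combined with sum() inside mutate(). It computes the result one row at a time, so it is generally slower than the vectorized + operator, but it lets any function that takes a vector be applied across the columns of each row.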

New syntax with sum function

data5 <- data4 %>% rowwise() %>%
  mutate(avg.score = sum(c(math.score,writing.score,reading.score))/ 3)
head(data5)

## # A tibble: 6 x 5
## # Rowwise: 
##   gender math.score reading.score writing.score avg.score
##   <chr>       <int>         <int>         <int>     <dbl>
## 1 male          100           100            93      97.7
## 2 female        100            92            97      96.3
## 3 female        100           100           100     100  
## 4 male          100            96            86      94  
## 5 male          100            97            99      98.7
## 6 male          100           100           100     100
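As a hedged comparison, the sketch below computes the same average two ways on a hypothetical two-row tibble: row by row with rowwise() and c_across() (assuming dplyr 1.0 or later), and in a single vectorized step with base R's rowMeans(). The ungroup() call removes the row-wise grouping so that later steps treat the table as a whole again.

```r
library(dplyr)

# Hypothetical scores for two students
toy <- tibble(math.score    = c(100, 90),
              reading.score = c(92, 60),
              writing.score = c(97, 75))

# Row-by-row: c_across() gathers the selected columns of the current row
by_row <- toy %>%
  rowwise() %>%
  mutate(avg.score = mean(c_across(contains("score")))) %>%
  ungroup()

# Vectorized: rowMeans() averages across columns for all rows at once
vectorised <- toy %>%
  mutate(avg.score = rowMeans(cbind(math.score, reading.score, writing.score)))

all.equal(by_row$avg.score, vectorised$avg.score)
```

Both versions give the same numbers; on a table of this size the difference in speed is negligible, so the choice is mostly about readability.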

dplyr::case_when()

case_when() allows you to vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned; a final TRUE ~ value branch acts as the default.

data6 <- data5 %>% 
  mutate(pass_fail_grade = case_when(avg.score >= 85 ~ 'Pass'
                                     ,TRUE ~ 'Fail' )
         )
head(data6)

## # A tibble: 6 x 6
## # Rowwise: 
##   gender math.score reading.score writing.score avg.score pass_fail_grade
##   <chr>       <int>         <int>         <int>     <dbl> <chr>          
## 1 male          100           100            93      97.7 Pass           
## 2 female        100            92            97      96.3 Pass           
## 3 female        100           100           100     100   Pass           
## 4 male          100            96            86      94   Pass           
## 5 male          100            97            99      98.7 Pass           
## 6 male          100           100           100     100   Pass
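case_when() becomes more useful with more than two branches. The sketch below uses a hypothetical four-row tibble, and the 60-point threshold and the “Borderline” label are made up for illustration (only the 85 cutoff comes from the code above). Conditions are checked in order, and a row that matches none of them, here the missing score, gets NA.

```r
library(dplyr)

# Hypothetical average scores, including one missing value
scores <- tibble(avg.score = c(97.7, 72, 40, NA))

graded <- scores %>%
  mutate(grade = case_when(
    avg.score >= 85 ~ "Pass",        # checked first
    avg.score >= 60 ~ "Borderline",  # hypothetical middle band
    avg.score < 60  ~ "Fail"
    # no TRUE ~ ... default, so unmatched rows become NA
  ))

graded
```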

dplyr::summarize()

# Mean math score per gender: the group total divided by the group size n()
data %>% group_by(gender) %>%
  summarize(math_score = sum(math.score) / n())

## # A tibble: 2 x 2
##   gender math_score
##   <chr>       <dbl>
## 1 female       63.6
## 2 male         68.7
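Dividing sum() by n() is the same as calling mean(), and summarize() can return several statistics at once. This is a minimal sketch on a hypothetical four-row tibble, assuming dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical math scores for four students
toy <- tibble(gender     = c("female", "female", "male", "male"),
              math.score = c(72, 69, 47, 76))

by_gender <- toy %>%
  group_by(gender) %>%
  summarize(
    students    = n(),                      # group size
    manual_mean = sum(math.score) / n(),    # as in the code above
    math_mean   = mean(math.score),         # same result, shorter
    .groups     = "drop"                    # return an ungrouped tibble
  )

by_gender
```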

ggplot2::ggplot()

ggplot2 is a system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.

ggplot(data = data6, aes(x = gender, y = avg.score, col = gender)) + geom_boxplot() +
  labs(title = "Distribution of Students Average score") +
  theme(plot.title = element_text(hjust = 0.5))

A caption is useful for describing the source of the data behind a graphic. The coord_flip() function swaps the x and y axes, which helps when an axis has many values or long labels.

ggplot(data = data6, aes(x = gender, y = avg.score, col = gender)) + geom_boxplot() + coord_flip() +
  labs(title = "Distribution of Students Average score",
       caption = "Source: 'student performance' from https://www.kaggle.com/datasets") +
  xlab("Gender") + ylab("Average Score") + theme(plot.title = element_text(hjust = 0.5))
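The axis labels can also be set inside labs() together with the title and caption, which keeps all the text in one place. The following is a minimal sketch on hypothetical data (toy and p are illustrative names, and the scores are made up); storing the plot in an object means it is only drawn when printed.

```r
library(ggplot2)

# Hypothetical average scores for eight students
toy <- data.frame(
  gender    = rep(c("female", "male"), each = 4),
  avg.score = c(72.7, 82.3, 92.7, 76.3, 49.3, 77.3, 65.0, 58.7)
)

p <- ggplot(toy, aes(x = gender, y = avg.score, colour = gender)) +
  geom_boxplot() +
  coord_flip() +
  # title, axis labels, and caption set in a single labs() call
  labs(title   = "Distribution of Students Average score",
       x       = "Gender",
       y       = "Average Score",
       caption = "Hypothetical data for illustration") +
  theme(plot.title = element_text(hjust = 0.5))

# print(p) draws the plot
```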

Other uses of the tidyverse can be found in the textbook “R for Data Science” and other online resources.