TidyVerse Create Vignette

Title: CUNY SPS MDS DATA607_Tidyverse Create

Author: Charles Ugiagbe

Date: 10/23/2021

Introduction

The tidyverse is a collection of R packages that share an underlying design philosophy, grammar, and data structures. At the time of writing, the core tidyverse bundle includes 8 packages:

dplyr: a set of tools for efficiently manipulating datasets;
forcats: a package for manipulating categorical variables (factors);
ggplot2: a classic package for data visualization;
purrr: another set of tools for manipulating data, especially vectors and lists, a complement to dplyr;
readr: a set of functions for reading data that are faster and more user-friendly than the base R defaults;
stringr: a package for common string operations;
tibble: a package that reimagines data.frames in a modern way;
tidyr: a package for reshaping data, a complement to dplyr.

In this assignment, I use some handy functions from the tidyverse packages to perform a simple analysis.

library(tidyverse)

Dataset

The dataset used in this project is called “student performance” and comes from https://www.kaggle.com/datasets. It contains a sample of 1000 observations of 8 variables.

I use the base R read.csv() function to import the CSV file into R.

url <- "https://raw.githubusercontent.com/omocharly/DATA607_PROJECTS/main/StudentsPerformance.csv"
data <- read.csv(url, header = TRUE)

head(data)

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78
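
Because read.csv() is a base R function, a tidyverse alternative for the import step is readr's read_csv(), which returns a tibble and keeps column names as written. The sketch below is only an illustration: it assumes readr 2.0 or later (for literal data wrapped in I()), and the two-row sample is made up rather than taken from the real file.

```r
library(readr)

# Hypothetical two-row sample; the header mimics names that contain spaces.
csv_text <- "gender,math score,reading score\nfemale,72,72\nmale,47,57\n"

# Base R: check.names = TRUE (the default) rewrites "math score" to "math.score".
base_df <- read.csv(text = csv_text)
names(base_df)

# readr: names are kept as written and the result is a tibble.
tidy_df <- read_csv(I(csv_text), show_col_types = FALSE)
names(tidy_df)
```

With read_csv(), names that contain spaces have to be wrapped in backticks when used in code, which is one reason the dotted names produced by read.csv() can be convenient.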

glimpse() gives a quick look at the data so we can see its structure: the number of rows and columns, each column's type, and its first few values.

glimpse(data)

## Rows: 1,000
## Columns: 8
## $ gender                      <chr> "female", "female", "female", "male", "mal~
## $ race.ethnicity              <chr> "group B", "group C", "group B", "group A"~
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast~
## $ lunch                       <chr> "standard", "standard", "standard", "free/~
## $ test.preparation.course     <chr> "none", "completed", "none", "none", "none~
## $ math.score                  <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58~
## $ reading.score               <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54~
## $ writing.score               <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52~

dplyr::rename()

rename() changes the names of individual variables (columns) using the new_name = old_name syntax.

data1 <- data %>%
  rename(
    race = race.ethnicity,
    parental_Educatn_level = parental.level.of.education,
    test.prep = test.preparation.course
  )
head(data1)

##   gender    race parental_Educatn_level        lunch test.prep math.score
## 1 female group B      bachelor's degree     standard      none         72
## 2 female group C           some college     standard completed         69
## 3 female group B        master's degree     standard      none         90
## 4   male group A     associate's degree free/reduced      none         47
## 5   male group C           some college     standard      none         76
## 6 female group B     associate's degree     standard      none         71
##   reading.score writing.score
## 1            72            74
## 2            90            88
## 3            95            93
## 4            57            44
## 5            78            75
## 6            83            78
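
When many columns need the same kind of change, renaming them one by one gets tedious. A possible alternative is rename_with(), which applies a function to the column names. The sketch below assumes dplyr 1.0 or later and uses a small toy data frame (not the full dataset) to contrast the two approaches.

```r
library(dplyr)

# Toy data frame with dotted column names (values taken from the first rows above).
scores <- data.frame(
  gender = c("female", "female"),
  math.score = c(72, 69),
  reading.score = c(72, 90)
)

# rename(): new_name = old_name, one column at a time.
renamed_one <- scores %>% rename(math = math.score)

# rename_with(): apply a function to every column name.
# fixed = TRUE makes gsub() treat "." literally rather than as a regex wildcard.
renamed_all <- scores %>% rename_with(~ gsub(".", "_", .x, fixed = TRUE))

names(renamed_one)
names(renamed_all)
```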

dplyr::select()

select() keeps only the columns you name. It can also select a range of consecutive variables (with :) or take the complement of a set of variables (with a leading minus sign).

data2 <- data1 %>%
  select(gender, math.score, reading.score, writing.score)
head(data2)

##   gender math.score reading.score writing.score
## 1 female         72            72            74
## 2 female         69            90            88
## 3 female         90            95            93
## 4   male         47            57            44
## 5   male         76            78            75
## 6 female         71            83            78
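
The range and complement forms of select() are not used in the example above, so here is a minimal sketch of both, plus one selection helper. The toy data frame is made up for illustration and only mimics the columns of the real dataset.

```r
library(dplyr)

# Toy data frame that mimics a few columns of the dataset.
scores <- data.frame(
  gender = c("female", "male"),
  lunch = c("standard", "free/reduced"),
  math.score = c(72, 47),
  reading.score = c(72, 57),
  writing.score = c(74, 44)
)

# A range of consecutive columns, from math.score through writing.score.
range_cols <- scores %>% select(math.score:writing.score)

# The complement: everything except lunch.
without_lunch <- scores %>% select(-lunch)

# A selection helper: every column whose name ends in "score".
score_cols <- scores %>% select(ends_with("score"))

names(range_cols)
names(without_lunch)
names(score_cols)
```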

dplyr::filter()

I use the filter() function to keep only the rows where the math score equals 100 and both the writing and reading scores are greater than 95.

data3 <- data2 %>%
  filter(math.score == 100, writing.score > 95, reading.score > 95)
data3

##   gender math.score reading.score writing.score
## 1 female        100           100           100
## 2   male        100            97            99
## 3   male        100           100           100
## 4 female        100           100           100
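
Conditions separated by commas inside filter() are combined with AND. The sketch below, on a made-up toy data frame, shows that this is equivalent to using &, and how | behaves when either condition is enough.

```r
library(dplyr)

# Toy data frame; values are illustrative only.
scores <- data.frame(
  gender = c("female", "male", "male", "female"),
  math.score = c(100, 100, 47, 90),
  reading.score = c(100, 92, 57, 95)
)

# Comma-separated conditions are combined with AND ...
both_comma <- scores %>% filter(math.score == 100, reading.score > 95)

# ... which is equivalent to joining them with &.
both_amp <- scores %>% filter(math.score == 100 & reading.score > 95)

# Use | when either condition is enough.
either <- scores %>% filter(math.score == 100 | reading.score > 95)

nrow(both_comma)
nrow(either)
```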

dplyr::arrange()

arrange() orders the rows of a data frame by the values of selected columns. Wrapping a column in desc() sorts it in descending order.

data4 <- data2 %>% arrange(desc(math.score))
head(data4)

##   gender math.score reading.score writing.score
## 1   male        100           100            93
## 2 female        100            92            97
## 3 female        100           100           100
## 4   male        100            96            86
## 5   male        100            97            99
## 6   male        100           100           100
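
Many students share the same math score, so the order among them is not determined by math.score alone. One way to control it is to pass additional columns to arrange() as tie-breakers. The sketch below uses a made-up toy data frame to illustrate the idea.

```r
library(dplyr)

# Toy data frame with ties on math.score.
scores <- data.frame(
  gender = c("male", "female", "female", "male"),
  math.score = c(100, 100, 100, 47),
  reading.score = c(100, 92, 100, 57)
)

# Sort by math score (descending), then reading score (descending),
# then gender (ascending) to break any remaining ties.
sorted <- scores %>%
  arrange(desc(math.score), desc(reading.score), gender)

sorted
```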

dplyr::mutate()

mutate() adds new variables that are functions of existing ones and preserves the existing variables.

data5 <- data4 %>%
  mutate(avg.score = (math.score + writing.score + reading.score) / 3)
head(data5)

##   gender math.score reading.score writing.score avg.score
## 1   male        100           100            93  97.66667
## 2 female        100            92            97  96.33333
## 3 female        100           100           100 100.00000
## 4   male        100            96            86  94.00000
## 5   male        100            97            99  98.66667
## 6   male        100           100           100 100.00000
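
A single mutate() call can create several variables, and later ones may refer to those created earlier in the same call. The sketch below illustrates this on a made-up toy data frame; the total.score column is a hypothetical intermediate that does not appear in the analysis above.

```r
library(dplyr)

# Toy data frame; values are illustrative only.
scores <- data.frame(
  math.score = c(100, 47),
  reading.score = c(100, 57),
  writing.score = c(93, 44)
)

with_avg <- scores %>%
  mutate(
    # Hypothetical intermediate column: the sum of the three scores.
    total.score = math.score + reading.score + writing.score,
    # avg.score reuses total.score, which was defined on the line above.
    avg.score = round(total.score / 3, 1)
  )

with_avg
```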

dplyr::case_when()

case_when() lets you vectorise multiple if_else() statements. It is the R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned, so the code below ends with a catch-all TRUE condition.

data6 <- data5 %>%
  mutate(pass_fail_grade = case_when(
    avg.score >= 85 ~ "Pass",
    TRUE ~ "Fail"
  ))
head(data6)

##   gender math.score reading.score writing.score avg.score pass_fail_grade
## 1   male        100           100            93  97.66667            Pass
## 2 female        100            92            97  96.33333            Pass
## 3 female        100           100           100 100.00000            Pass
## 4   male        100            96            86  94.00000            Pass
## 5   male        100            97            99  98.66667            Pass
## 6   male        100           100           100 100.00000            Pass
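
case_when() evaluates its conditions in order and uses the first one that matches, which makes it suitable for more than two outcomes. The sketch below is an illustration only: the letter grades and the 85 and 70 thresholds are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy averages, including a missing value.
scores <- data.frame(avg.score = c(97.7, 85, 72.3, 49.3, NA))

graded <- scores %>%
  mutate(grade = case_when(
    is.na(avg.score) ~ "Missing", # handle NA first, before any comparison
    avg.score >= 85 ~ "A",        # hypothetical threshold
    avg.score >= 70 ~ "B",        # only reached when avg.score < 85
    TRUE ~ "C"                    # catch-all for everything else
  ))

graded$grade

# Without a catch-all, unmatched values become NA.
no_default <- case_when(c(90, 50) >= 85 ~ "Pass")
no_default
```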

dplyr::summarize()

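summarize() collapses each group created by group_by() into a single row. Here it computes the mean math score for each gender.

data %>%
  group_by(gender) %>%
  summarize(math_score = mean(math.score))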

## # A tibble: 2 x 2
##   gender math_score
##   <chr>       <dbl>
## 1 female       63.6
## 2 male         68.7
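
summarize() can also compute several statistics per group in one call. The sketch below uses a made-up toy data frame and assumes dplyr 1.0 or later for the .groups argument; the output column names are arbitrary choices for illustration.

```r
library(dplyr)

# Toy data frame; values are illustrative only.
scores <- data.frame(
  gender = c("female", "female", "male", "male"),
  math.score = c(72, 90, 47, 76)
)

by_gender <- scores %>%
  group_by(gender) %>%
  summarize(
    n_students = n(),             # number of rows in each group
    mean_math = mean(math.score), # average math score per group
    max_math = max(math.score),   # highest math score per group
    .groups = "drop"              # return an ungrouped tibble
  )

by_gender
```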

ggplot2::ggplot()

ggplot2 is a system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.

# Box plot of the average score by gender, with a centered title
ggplot(data = data6, aes(x = gender, y = avg.score, color = gender)) +
  geom_boxplot() +
  labs(title = "Distribution of Students' Average Scores") +
  theme(plot.title = element_text(hjust = 0.5))
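
To compare the three subjects in one plot instead of only the average, one option is to reshape the scores into long form with tidyr's pivot_longer() and then map the subject to the x axis. This is a sketch under stated assumptions rather than part of the original analysis: the toy data frame and the subject and score column names are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data frame; values are illustrative only.
scores <- data.frame(
  gender = c("female", "female", "male", "male"),
  math.score = c(72, 90, 47, 76),
  reading.score = c(72, 95, 57, 78),
  writing.score = c(74, 93, 44, 75)
)

# Reshape to one row per student and subject.
long_scores <- scores %>%
  pivot_longer(
    cols = ends_with("score"),
    names_to = "subject", # hypothetical name for the new key column
    values_to = "score"   # hypothetical name for the new value column
  )

# One box per subject and gender.
p <- ggplot(long_scores, aes(x = subject, y = score, color = gender)) +
  geom_boxplot() +
  labs(title = "Score Distribution by Subject and Gender")

p
```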

Other uses of the tidyverse can be found in the textbook “R for Data Science” and in other online resources.