GitHub: https://github.com/seung-m1nsong/607
rpubs: https://rpubs.com/seungm1nsong/963598
The tidyverse is a widely used alternative to base R syntax, with the
advantage that code written with it tends to read more intuitively. The
tidyverse package itself is a meta-package: installing and loading it
installs and loads the core packages of the tidy data ecosystem, such as
dplyr, tidyr, and ggplot2, in one step.
If I had to pick the single most important thing in the tidyverse
ecosystem, I would choose the %>% (pipe) operator from the magrittr
package. x %>% f(y) passes x as the first argument of f, so it is
equivalent to f(x, y). In RStudio, %>% can be typed with the shortcut
Ctrl + Shift + M (macOS: Cmd + Shift + M). With it you can write the
steps in the order you think of them, which makes the code easier to
read. Let's see the advantage of %>% with the head function.
library(tidyverse)
dfdata <- read.csv(file = 'car_purchasing.csv', header = TRUE)
## head(dfdata)
dfdata %>% head()
dfdata %>% head(n = 10)
The next command selects only the rows with gender 1 from dfdata and then shows the first rows. It reads left to right, which is much more intuitive than the nested base R call head(subset(dfdata, gender == "1")).
## head(subset(dfdata, gender == "1"))
dfdata %>% subset(gender == "1") %>% head()
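To make the comparison concrete, here is a minimal sketch showing that the nested call and the piped call return the same result. The toy data frame is a hypothetical stand-in for car_purchasing.csv (its values are made up); only the column names age, gender, and annual_Salary are taken from the code above.

```r
library(dplyr)  # re-exports the %>% operator from magrittr

# Hypothetical stand-in for car_purchasing.csv (values are made up)
toy <- data.frame(
  age           = c(42, 55, 61, 38, 50),
  gender        = c(1, 0, 1, 1, 0),
  annual_Salary = c(62000, 81000, 95000, 48000, 80000)
)

# Nested call: has to be read from the inside out
nested <- head(subset(toy, gender == 1), n = 2)

# Piped call: reads left to right, one step at a time
piped <- toy %>% subset(gender == 1) %>% head(n = 2)

# Both forms produce the same data frame
identical(nested, piped)
```

The benefit grows with the number of steps: each extra step adds one more %>% at the end instead of one more layer of parentheses.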
dplyr provides a set of functions for manipulating data effectively. Of
these, group_by and summarize stand out from base R syntax because they
make it easy to compute summary statistics for each group.
filter works like the base R subset function: it keeps the rows that
satisfy the given conditions. Below is an example of extracting only men
(gender 1) from the data.
dfdata %>% filter(gender == "1")
In filter, AND conditions can be separated with , as well as combined with &, which helps readability. You can also select a range of a continuous variable with the between function, which is more intuitive than chaining two comparisons with &; between(x, left, right) includes both endpoints. Let's look at an example of filtering ages between 50 and 60.
## Age between 50 and 60 (inclusive).
## dfdata %>% filter(age >= 50 & age <= 60)
dfdata %>% filter(between(age, 50, 60))
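The three ways of writing the same range condition can be checked against each other. This is a small sketch on a hypothetical toy data frame (made-up values), not the original data set.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  age    = c(42, 55, 61, 38, 50),
  gender = c(1, 0, 1, 1, 0)
)

# Three equivalent ways to keep ages from 50 to 60, endpoints included
with_amp     <- toy %>% filter(age >= 50 & age <= 60)
with_comma   <- toy %>% filter(age >= 50, age <= 60)
with_between <- toy %>% filter(between(age, 50, 60))

# All three keep the same rows
with_between$age
```

Which form to use is mostly a matter of taste; between states the intent in a single call, while the comma form scales better when the conditions involve different columns.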
arrange sorts the rows of the data by the given columns. Unlike the
order function, which only returns the row indices in sorted order, it
returns the sorted data itself.
## dfdata[order(dfdata$age), ]
dfdata %>% arrange(age)
If there are two or more sort keys, separate them with , and wrap a column in the desc function to sort it in descending order. Below is an example of sorting by age in ascending order and by gender in descending order.
## dfdata[order(dfdata$age, -dfdata$gender), ]
dfdata %>% arrange(age, desc(gender))
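The difference between order and arrange is easiest to see on a few rows. The sketch below uses a hypothetical toy data frame (made-up values) with a tie in age so that the second sort key actually matters.

```r
library(dplyr)

# Hypothetical toy data with a tie in age (values are made up)
toy <- data.frame(
  age    = c(50, 42, 50, 38),
  gender = c(0, 1, 1, 1)
)

# order() returns row indices, not data
idx <- order(toy$age)

# Base R: use the indices to reorder the rows yourself
base_sorted <- toy[order(toy$age, -toy$gender), ]

# dplyr: arrange() returns the sorted data frame directly
dplyr_sorted <- toy %>% arrange(age, desc(gender))

# Same row order in both results
identical(base_sorted$gender, dplyr_sorted$gender)
```

Note that the -toy$gender trick in base R only works for numeric columns, whereas desc also handles character and factor columns.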
mutate creates new variables. Let's create Old and middleclass
variables, which flag older age and middle-class income, from the age
and annual_Salary variables.
## dfdata$Old <- as.integer(dfdata$age >= 50); dfdata$middleclass <- as.integer(dfdata$annual_Salary >= 80000)
dfdata %>% mutate(Old = as.integer(age >= 50),
                  middleclass = as.integer(annual_Salary >= 80000)
                  )
To show only new variables, use transmute instead of mutate.
dfdata %>% transmute(Old = as.integer(age >= 50),
                     middleclass = as.integer(annual_Salary >= 80000)
                     )
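The following sketch compares the two on a hypothetical toy data frame (made-up values). It also shows two things the commands above do not: neither function modifies the input, so the result has to be assigned to be kept, and, assuming dplyr 1.0.0 or later, mutate with .keep = "none" gives the same columns as transmute.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  age           = c(42, 55, 61, 38, 50),
  gender        = c(1, 0, 1, 1, 0),
  annual_Salary = c(85000, 81000, 60000, 48000, 80000)
)

# mutate(): original columns plus the new ones
with_new <- toy %>%
  mutate(Old = as.integer(age >= 50),
         middleclass = as.integer(annual_Salary >= 80000))

# transmute(): only the new columns
only_new <- toy %>%
  transmute(Old = as.integer(age >= 50),
            middleclass = as.integer(annual_Salary >= 80000))

# Assumes dplyr >= 1.0.0: mutate() with .keep = "none" drops the old columns
keep_none <- toy %>%
  mutate(Old = as.integer(age >= 50),
         middleclass = as.integer(annual_Salary >= 80000),
         .keep = "none")

names(toy)       # unchanged: the pipeline never modified toy
names(with_new)
names(only_new)
```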
By using group_by and summarize, you can split the data into groups as
desired and obtain summary statistics for each group. In base R, the
aggregate function serves a similar purpose.
dfdata %>%
  group_by(age, gender) %>%
  summarize(count = n(),
            meanSalary = mean(annual_Salary))
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
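The message above appears because, by default, summarize drops only the last grouping level, so the result is still grouped by age. The sketch below, on a hypothetical toy data frame (made-up values), sets .groups = "drop" to return an ungrouped result without the message, and computes the same means with aggregate for comparison.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  age           = c(50, 42, 50, 38, 50),
  gender        = c(0, 1, 1, 1, 0),
  annual_Salary = c(81000, 62000, 95000, 48000, 80000)
)

# dplyr: one row per (age, gender) combination, ungrouped result
by_group <- toy %>%
  group_by(age, gender) %>%
  summarize(count = n(),
            meanSalary = mean(annual_Salary),
            .groups = "drop")

# Base R: the same group means via the formula interface of aggregate()
base_means <- aggregate(annual_Salary ~ age + gender, data = toy, FUN = mean)

by_group
base_means
```

The two results contain the same means, but the rows may come out in a different order, and aggregate needs a separate call to get the counts.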
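To group by a column whose name is given as a string such as "age", older tutorials use the group_by_ function with a trailing underscore (_). group_by_ is deprecated in current dplyr; use group_by(across(all_of("age"))) instead.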
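As a minimal sketch of grouping by a column name held in a string, assuming dplyr 1.0.0 or later, across combined with all_of can be used inside group_by. The toy data frame and the variable name group_col are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  age    = c(42, 55, 61, 38, 50),
  gender = c(1, 0, 1, 1, 0)
)

# Column name stored as a string (hypothetical variable name)
group_col <- "gender"

# Grouping by the bare column name
by_name <- toy %>%
  group_by(gender) %>%
  summarize(count = n(), .groups = "drop")

# Grouping by the string: all_of() looks the name up, across() applies it
by_string <- toy %>%
  group_by(across(all_of(group_col))) %>%
  summarize(count = n(), .groups = "drop")

by_string
```

Because all_of accepts a character vector, the same pattern works for several grouping columns at once, for example all_of(c("age", "gender")).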
So far, we have seen how to manipulate data with several packages in
the tidyverse ecosystem. As I said earlier, the most important thing in
this ecosystem is the %>% operator, which lets you write code in the
order you think about the steps. If you adopt the rest of the material
one piece at a time, at some point you will find yourself unable to
live without the tidyverse.