dplyr is part of the tidyverse and is mainly used for manipulating data frames. It provides many functions, but this vignette focuses on mutate() and the variants of mutate() that dplyr supports.
mutate(): adds new variables and preserves existing ones.
mutate_all(): affects every variable in the given data frame.
mutate_at(): affects variables selected with a character vector or vars().
mutate_if(): affects variables selected with a predicate function.
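As a minimal sketch of how the four variants differ, the snippet below applies each one to a small made-up data frame. The name toy_df and its columns are hypothetical and are not part of the sales data used later.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy_df <- data.frame(
  id    = c(1, 2, 3),
  name  = c("ann", "bob", "cy"),
  score = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# mutate(): add a new column and keep the existing ones
m1 <- mutate(toy_df, double_score = score * 2)

# mutate_all(): apply a function to every column
m2 <- mutate_all(toy_df, as.character)

# mutate_at(): apply a function to columns selected by name
m3 <- mutate_at(toy_df, vars(score), ~ . / 10)

# mutate_if(): apply a function to columns that satisfy a predicate
m4 <- mutate_if(toy_df, is.character, toupper)
```

Only mutate() adds a column here; the other three calls overwrite the selected columns in place because the function they receive is unnamed.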
Let’s use a sample data set of Black Friday sales.
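library(dplyr)
library(ggplot2)
# stringsAsFactors = FALSE keeps text columns as character on R versions before 4.0
sales_df <- read.csv("https://raw.githubusercontent.com/san123i/CUNY/master/Semester1/607/Tidyverse_assignment_data/BlackFriday.csv", stringsAsFactors = FALSE)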
head(sales_df)
##   User_ID Product_ID Gender   Age Purchase
## 1 1000001  P00069042      F  0-17     8370
## 2 1000001  P00248942      F  0-17    15200
## 3 1000001  P00087842      F  0-17     1422
## 4 1000001  P00085442      F  0-17     1057
## 5 1000002  P00285442      M   55+     7969
## 6 1000003  P00193542      M 26-35    15227
# Use mutate() to create a Premium_customer variable
sales_df <- mutate(sales_df, Premium_customer = ifelse(Purchase>10000, "Yes", "No"))
DT::datatable(sales_df)
# Use mutate_at() to pick one variable (Purchase) and create a new variable from it, expressed in millions
# funs() is deprecated since dplyr 0.8.0, so a named list of formulas is used instead
sales_df_2 <- mutate_at(sales_df, vars(Purchase), list(Million = ~ . / 1000000))
DT::datatable(sales_df_2)
# Use mutate_all() to add, for every existing variable, a new variable that records whether it is numeric
sales_df_mutate <- mutate_all(sales_df, list(isNumeric = ~ is.numeric(.)))
DT::datatable(sales_df_mutate)
# Use mutate_if() to apply a function only to the variables that meet a condition (here, character columns)
sales_df <- sales_df %>% mutate_if(is.character, toupper)
DT::datatable(sales_df)
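In dplyr 1.0.0 and later, the scoped verbs mutate_all(), mutate_at(), and mutate_if() are superseded by across() used inside mutate(). The sketch below shows rough equivalents of the calls above. It runs on demo_df, a hypothetical two-row stand-in for sales_df, and the output name pattern "{.col}_Million" is an illustrative choice rather than something the original code produces.

```r
library(dplyr)

# Hypothetical stand-in for sales_df that reuses a few of its column names
demo_df <- data.frame(
  User_ID  = c(1000001, 1000002),
  Gender   = c("f", "m"),
  Purchase = c(8370, 15227),
  stringsAsFactors = FALSE
)

# Rough equivalent of mutate_at(vars(Purchase), ...);
# .names controls what the new column is called
at_df <- mutate(demo_df, across(Purchase, ~ .x / 1000000, .names = "{.col}_Million"))

# Rough equivalent of mutate_if(is.character, toupper)
if_df <- mutate(demo_df, across(where(is.character), toupper))

# Rough equivalent of mutate_all(as.character)
all_df <- mutate(demo_df, across(everything(), as.character))
```

With across(), column selection and the function to apply are both arguments of a single call, so one mutate() can mix several selections.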
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
Below are some examples of how to use different features of ggplot2.
The ggplot() function takes the data frame through its ‘data’ argument and an aesthetic mapping through ‘aes()’. On top of the ggplot() call, we also add one or more of the layer functions below, all of which ggplot2 provides.
geom_bar(): makes the height of each bar proportional to the number of cases in each group.
geom_col(): makes the height of each bar represent the values in the data, so it needs a y aesthetic.
geom_point(): draws a scatter plot of points for two variables.
geom_smooth(): adds a smoothed trend line fitted with the chosen method; the example here uses a linear model (‘lm’).
labs(): sets the title, axis labels, and other labels of the plot.
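The difference between geom_bar() and geom_col() is easiest to see on a tiny data set. The following is a sketch on made-up data: toy_sales is hypothetical and only borrows the Age and Purchase column names. It uses layer_data() to read the computed bar heights without drawing anything.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy_sales <- data.frame(
  Age      = c("0-17", "0-17", "26-35", "55+"),
  Purchase = c(8370, 15200, 15227, 7969)
)

# geom_bar() counts the rows in each Age group; no y aesthetic is needed
p_count <- ggplot(toy_sales, aes(x = Age)) + geom_bar()

# geom_col() uses the y values as bar heights; bars that share an x value are stacked
p_value <- ggplot(toy_sales, aes(x = Age, y = Purchase)) + geom_col()

# Layers and labels are combined with `+`
p_value <- p_value + labs(title = "Purchase by age group", y = "Amount")

# Inspect the computed counts behind the geom_bar() layer
count_heights <- layer_data(p_count)$count
```

Here the "0-17" group has two rows, so its geom_bar() bar is twice as tall as the others, while its geom_col() bar stacks the two purchase amounts.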
# geom_bar() makes the height of each bar proportional to the number of cases in each group
ggplot(data=sales_df, aes(x=Age, fill=Gender)) + geom_bar()
# geom_col() makes the height of each bar represent the values (here, Purchase) in each group
ggplot(data = sales_df, aes(x=Age, y=Purchase)) + geom_col()
# geom_point() draws a scatter plot of points for two variables
ggplot(data=sales_df, aes(x=Age, y=Purchase)) + geom_point()
# geom_smooth() adds a fitted trend line, here from a linear model ("lm"); labs() sets the title and axis labels
ggplot(data = sales_df, aes(x = User_ID, y = Purchase)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Purchase Point chart", y = "Amount", x = "Customers")
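With method = "lm" and the default formula y ~ x, the line drawn by geom_smooth() is the one that lm() fits to the same two variables. The sketch below shows that fit in base R on made-up numbers; the toy data frame and its columns are hypothetical.

```r
# Hypothetical data lying exactly on the line y = 2x + 1
toy <- data.frame(x = 1:5, y = 2 * (1:5) + 1)

# Fit the same kind of model that geom_smooth(method = "lm") uses by default
fit <- lm(y ~ x, data = toy)

# Intercept and slope of the fitted line
fit_coef <- coef(fit)
```

Running the same lm() call on Purchase and User_ID from sales_df would give the intercept and slope of the line in the plot above.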