The purpose of this post is to provide an instruction for R beginning users to work on data cleaning, summary statistics, visualization, and regression analysis. The data used in this post is original from ‘Instacart Market Basket Analysis’ at https://www.kaggle.com/c/instacart-market-basket-analysis and has been processed to a simpler data set for the educational purpose. The code used to process the original data is documented at this repository in Github.
Load packages and data
Load the following packages using library() after installing the packages using either instal.package() or click on the Pacakges tab in the bootom-right section in RStudio and type pacakge name in the Install Packages dialog.
# Load "readr" package which provides a fast and friendly way to read rectangular data
library("readr")
# Load "dplyr" package which is a widely used package for working with data frames
library("dplyr")
# Load "stargazer" that creates well-formatted tables
library("stargazer")
# Load "reshape2" that can convert a wide table to a long table
library("reshape2")
# Load "ggplot2", a powerful visualization package
library("ggplot2")
# Load "RColorBrewer", which provides more color palettes for creating graphics
library("RColorBrewer")
# Load "ggThemeAssist", a RStudio-Addin that delivers a graphical interface for editing ggplot2 theme elements
library("ggThemeAssist")
# Load "scales" package which provide extension function for modifying labels in ggplot
library("scales")
Load a csv file from Github.
# Read a csv file stored in the designated Github repository.
DF_orders <- read_csv(url("https://raw.githubusercontent.com/ykaih/AGEC317/master/Random_500users_train.csv"))
# Convert "DF_orders" as a data.frame object
DF_orders <- data.frame(DF_orders)
Note: “()” in R is used to call a function. “[]” is used for subsetting vectors, arrays, matrices, and data frame (and other such objects). Besides that, Typing “#” allows R to distinguish text descriptions and R codes in R scripts. When we write our code, making comments is highly recommended. It helps us to know what we code and also helps other people to understand when they read our code.
The “nrow()” command with a data frame object allows us to know the number of rows.
nrow(DF_orders)
## [1] 500
The “ncol()” command shows the number of columns.
ncol(DF_orders)
## [1] 5
The “dim()” command tells us the dimension of the given data frame. The command outputs two numbers: the first one indicates the number of rows, and the second on indicate the number of columns.
dim(DF_orders)
## [1] 500 5
“names()” and “colnames()” both can retrieve names of columns, i.e., names of variables.
names(DF_orders)
## [1] "X1" "order_id" "reorders"
## [4] "products" "days_since_prior_order"
“X1” is a sequence of numbers. “order_id” indicates the number of each online grocery order. “reorders” indicates the number of reordered items per order. “products” is the number of products per order. “days_since_prior_order” is the purchase frequency measurmed by the number of days since prior order.
“str()” is a handy command to overview the structure of a data frame object. It summarizes data frame information, such as dimension, variable (column) names, types of an object for each variable, overviews of the first few observations.
str(DF_orders)
## 'data.frame': 500 obs. of 5 variables:
## $ X1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ order_id : num 4998 6880 12182 19405 39465 ...
## $ reorders : num 3 4 1 2 34 1 4 16 1 4 ...
## $ products : num 3 6 3 3 45 6 4 20 7 16 ...
## $ days_since_prior_order: num 24 23 3 20 12 6 30 9 2 5 ...
“head()” displays the first 6 observations of our data by default:
head(DF_orders)
## X1 order_id reorders products days_since_prior_order
## 1 1 4998 3 3 24
## 2 2 6880 4 6 23
## 3 3 12182 1 3 3
## 4 4 19405 2 3 20
## 5 5 39465 34 45 12
## 6 6 51257 1 6 6
Or display last 6 observations of our data by using:
tail(DF_orders)
## X1 order_id reorders products days_since_prior_order
## 495 495 3379605 13 14 7
## 496 496 3380661 6 9 6
## 497 497 3381975 3 10 24
## 498 498 3412247 1 1 7
## 499 499 3413328 0 8 27
## 500 500 3419629 4 5 7
The default setting for the above two commands is 6. We can also manually specify how many observations we want to display by:
head(DF_orders, n=10)
## X1 order_id reorders products days_since_prior_order
## 1 1 4998 3 3 24
## 2 2 6880 4 6 23
## 3 3 12182 1 3 3
## 4 4 19405 2 3 20
## 5 5 39465 34 45 12
## 6 6 51257 1 6 6
## 7 7 51912 4 4 30
## 8 8 56815 16 20 9
## 9 9 73206 1 7 2
## 10 10 100586 4 16 5
We can also view the entire data frame as we do in Excel by using the “View()” command.
View(DF_orders)
To remove variables by their column index numbers, we can input the column index number in the following code and tell R to drop variables positioned at first and second columns. The minus sign indicates to drop variables.
DF_orders[,-c(1,2)]
In addition to using the above method, we can also generate the same resulting output using the following Method 2 - 4.
subset.data.frame(DF_orders, select = -c(X1, order_id) )
# Create a character vector where we store column names that we want to drop. This statement assigns the character vector to a new object named "drop_list," which is an arbitrary name. We can name a new object, whichever way we think it is easier to be identified.
drop_list <- c("X1", "order_id")
# The following line tells R that we want to the drop variables specified in the "drop_list" character vector from the "DF_orders" dataframe.
DF_orders[,!names(DF_orders) %in% drop_list]
# Delete column by column index numbers with the "select" command
select(DF_orders, -c(1:2))
or
# Delete column by column names with the "select" command
select(DF_orders, -c("X1", "order_id"))
If we want to save the resulting output as a new dataframe object, we can add “DF_orders_selected” and “<-” in the above line as follows:
# Store the resulting data set in the new object named "DF_orders_selected"
DF_orders_selected <- select(DF_orders, -c("X1", "order_id"))
Note that the above “DF_orders_selected” is an arbitrary new object name.
There are several ways to show summary statistics in R. One straightforward way is to use “summary()” function:
summary(DF_orders_selected)
## reorders products days_since_prior_order
## Min. : 0.000 Min. : 1.00 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 4.75 1st Qu.: 7.00
## Median : 4.000 Median : 9.00 Median :16.00
## Mean : 6.332 Mean :10.51 Mean :17.25
## 3rd Qu.: 8.250 3rd Qu.:14.00 3rd Qu.:30.00
## Max. :40.000 Max. :45.00 Max. :30.00
To have a neater table for the summary statistics output, the “stargazer()” command can produce a descriptive summary output like:
stargazer(DF_orders_selected, type = "text")
##
## ====================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## --------------------------------------------------------------------
## reorders 500 6.332 6.275 0 2 8.2 40
## products 500 10.510 7.962 1 4.8 14 45
## days_since_prior_order 500 17.252 10.687 0 7 30 30
## --------------------------------------------------------------------
Alternatively, we can flip the descriptive summary output by setting the “flip” argument as TRUE.
stargazer(DF_orders_selected, type = "text", flip = TRUE)
##
## ==================================================
## Statistic reorders products days_since_prior_order
## --------------------------------------------------
## N 500 500 500
## Mean 6.332 10.510 17.252
## St. Dev. 6.275 7.962 10.687
## Min 0 1 0
## Pctl(25) 2 4.8 7
## Pctl(75) 8.2 14 30
## Max 40 45 30
## --------------------------------------------------
R can also produce a pivot table that summarizes useful statistics from complex a data set conditional on specific selection criteria. For instance, if we want to know the relationship between the purchase frequency and the number of reorders and products per order, we can use the “group_by()” function from the “dplyr” package as follows:
# Make a pivot table
DF_pivot <- DF_orders_selected %>%
group_by(days_since_prior_order) %>%
summarize(reorders = sum(reorders), products = sum(products))
# Output the resulting pivot table
DF_pivot
## # A tibble: 31 x 3
## days_since_prior_order reorders products
## <dbl> <dbl> <dbl>
## 1 0 44 48
## 2 1 92 132
## 3 2 61 99
## 4 3 52 72
## 5 4 112 165
## 6 5 122 204
## 7 6 117 178
## 8 7 342 457
## 9 8 142 212
## 10 9 146 263
## # ... with 21 more rows
The “ggplot2” package provides a variety of visualization options in R. The following function, “ggplot()” creates a plotting base where we input one variable from the “DF_orders_selected” data frame object. The “geom_histogram()” function can generate customized histograms. The “labs()” function can customize the plot title and x-axis/y-axis label. The “theme_bw()” function makes a black and white backgound. The “theme()” function adjusts different features of the resulting plot.
# Make a histogram
ggplot(DF_orders_selected, aes(x=reorders)) +
geom_histogram() +
labs(x = "Number of reordered items", y="Frequency", title="Reorders per order") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Utilize “ggplot theme assistant”
Instead of coding every detail in the “theme()” function, we can also use an RStudio-Addin that delivers a graphical interface for editing ggplot2 theme elements. First, we need to select the above chuck of the “ggplot()” and click on “Addins” on top of the Rstudio as shown in the following screenshot:
Next, click on “ggplot Theme Assistant” in the above drop-down list, and then we can see the following interactive panel which can manually customize the theme of the plot.
Suppose we want to show the relationship between the number of days since last order and the number of reorders and products per order, we can use the “DF_pivot” dataframe with the “geom_bar()” function to plot a barchart. Before we input the “DF_pivot” dataframe, we need to use “melt()” function to convert the wide table format to a long table format, which can be directly used in the “ggplot()” function.
# Covert a wide table to a long table
DF_pivot_long <- melt(DF_pivot, id.vars = "days_since_prior_order", value.name = "Numbers")
# Show the first 10 rows of the resulting long table
head(DF_pivot_long, 10)
## days_since_prior_order variable Numbers
## 1 0 reorders 44
## 2 1 reorders 92
## 3 2 reorders 61
## 4 3 reorders 52
## 5 4 reorders 112
## 6 5 reorders 122
## 7 6 reorders 117
## 8 7 reorders 342
## 9 8 reorders 142
## 10 9 reorders 146
# Show the last 10 rows of the resulting long table
tail(DF_pivot_long, 10)
## days_since_prior_order variable Numbers
## 53 21 products 50
## 54 22 products 73
## 55 23 products 57
## 56 24 products 40
## 57 25 products 71
## 58 26 products 71
## 59 27 products 150
## 60 28 products 134
## 61 29 products 55
## 62 30 products 1516
Then we can input the above long table in the “ggplot()” function with the “geom_bar()” function. The “scale_fill_brewer()” function provides more color palettes for creating graphics from the “RColorBrewer” package. The “scale_y_continuous()” and “scale_x_continuous()” function specify the desired breaks in the y-axis and x-axis, respectively.
# Make a barchart
ggplot(DF_pivot_long, aes(x=days_since_prior_order, y = Numbers, fill = variable)) +
geom_bar(stat = "identity", position="dodge", alpha = 0.8) +
scale_fill_brewer(palette = "Set1") +
scale_y_continuous(breaks = seq(0,1800, by=300), labels = comma) +
scale_x_continuous(breaks = seq(0,30, by=3)) +
labs(x = "Days since the last order", y="Number of reorders or products",
title="Breakdown of days since last order") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=18),
plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
legend.position = "bottom", legend.direction = "horizontal")
To draw a scatterplot, we can use the “geom_point()” function to display data points, and then the “geom_smooth()” function can plot a regression line. There are several alternativie methods to draw a regression line, such as “auto”, “lm”, “glm”, “gam”, “loess” or other functions. Here we use a linear function method by typing “lm” as the selected method.
# Make a scatterplot
ggplot(DF_orders_selected, aes(x=products, y =reorders)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
The “lm()” function can perform regression analyses. “reorders ~ products” tells R that we want to regress “the number of reorders” on “the number of products,” and “data = DF_orders_selected” indicates that we will use data from “DF_orders_selected” data set to perform the above regression command. Use the “summary()” function to show the regression result.
# Run regression
reg1 <- lm(reorders ~ products, data = DF_orders_selected)
# Output the regression result
summary(reg1)
##
## Call:
## lm(formula = reorders ~ products, data = DF_orders_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4887 -1.7826 0.3717 1.7656 12.2934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.80192 0.23646 -3.391 0.000751 ***
## products 0.67877 0.01794 37.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.191 on 498 degrees of freedom
## Multiple R-squared: 0.7419, Adjusted R-squared: 0.7414
## F-statistic: 1432 on 1 and 498 DF, p-value: < 2.2e-16
We can incorporate the above simple regression result in the previous scatter plot by coding the following lines:
# Extract the regression information from the "reg" regression object
Equa <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(unname(coef(reg1)[1]), digits = 2),
b = format(unname(coef(reg1)[2]), digits = 2),
r2 = format(summary(reg1)$r.squared, digits = 3)))
# "as.expression" attempts to coerce its argument into an expression object, which is not being used in its colloquial sense, that of mathematical expressions
Text_reg <- as.character(as.expression(Equa))
# Make a scatterplot with a regression line
ggplot(DF_orders_selected, aes(x=products, y =reorders)) +
geom_point() +
geom_smooth(method = "lm", alpha = .15) +
labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") +
geom_label(data = data.frame(x = 15, y = 35, label = Text_reg), aes(x = x, y = y, label = Text_reg),
parse = TRUE, color = "red", size =5) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
Suppose we want to investigate how the number of ordered products and purchase frequency affect the number of reorders, we can perform the following multiple regression:
# Run regression
reg2 <- lm(reorders ~ products + days_since_prior_order, data = DF_orders_selected)
# Output the regression result
summary(reg2)
##
## Call:
## lm(formula = reorders ~ products + days_since_prior_order, data = DF_orders_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6564 -1.7720 0.3281 1.9876 11.5775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34778 0.32091 1.084 0.279
## products 0.67964 0.01750 38.842 < 2e-16 ***
## days_since_prior_order -0.06717 0.01304 -5.152 3.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.112 on 497 degrees of freedom
## Multiple R-squared: 0.755, Adjusted R-squared: 0.754
## F-statistic: 765.8 on 2 and 497 DF, p-value: < 2.2e-16
To compare the multiple regression results in a table format, we can input the above regression result objects, “reg1” and “reg2,” in the “stargazer()” function, where type=“text” indicates the output format as a text file, and out = “Reg_result.txt” tell R to save the resulting output as “Reg_result.txt” in the working directory.
# Report the regression result in a table format
stargazer(reg1, reg2, type="text", out = "Reg_result.txt")
##
## ==========================================================================
## Dependent variable:
## ---------------------------------------------------
## reorders
## (1) (2)
## --------------------------------------------------------------------------
## products 0.679*** 0.680***
## (0.018) (0.017)
##
## days_since_prior_order -0.067***
## (0.013)
##
## Constant -0.802*** 0.348
## (0.236) (0.321)
##
## --------------------------------------------------------------------------
## Observations 500 500
## R2 0.742 0.755
## Adjusted R2 0.741 0.754
## Residual Std. Error 3.191 (df = 498) 3.112 (df = 497)
## F Statistic 1,431.560*** (df = 1; 498) 765.768*** (df = 2; 497)
## ==========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The comparison result suggests that a 10 item increase in the number of products per order is estimated to increase by 6.8 reorders per order. The purchase frequency has a negative relationship with the number of reorders based on the given data set. However, the purchase frequency may have a nonlinear effect on the number of reorders and varies across different products and/or users.
This post is meant for R beginners to learn some R functions related to data exploration, visualization, and linear regression analysis. There are some limitations to analysis per se. For instance, user information is missing in this data set, because the provided data is aggregated to the order-level (for the detailed data aggregation process, please refer to this Github repository). Besides, although there is a strong correlation between the number of reorders and products per order, the causality between reorders and products is arguable. Different grocery products likely have distinct consumption frequencies. The heterogeneous effect of different grocery products needs to be explored in the product-level data.