Data cleaning, summary statistics, visualization, and regression analysis in R

1. Introduction
2. Clean and explore data
3. Summary statistics
4. Visualization
5. Regression analysis
6. Endnote

1. Introduction

The purpose of this post is to provide an instruction for R beginning users to work on data cleaning, summary statistics, visualization, and regression analysis. The data used in this post is original from ‘Instacart Market Basket Analysis’ at https://www.kaggle.com/c/instacart-market-basket-analysis and has been processed to a simpler data set for the educational purpose. The code used to process the original data is documented at this repository in Github.

Load packages and data
Load the following packages using library() after installing the packages using either instal.package() or click on the Pacakges tab in the bootom-right section in RStudio and type pacakge name in the Install Packages dialog.

# Load "readr" package which provides a fast and friendly way to read rectangular data 
library("readr")

# Load "dplyr" package which is a widely used package for working with data frames
library("dplyr")

# Load "stargazer" that creates well-formatted tables
library("stargazer") 

# Load "reshape2" that can convert a wide table to a long table
library("reshape2") 

# Load "ggplot2", a powerful visualization package
library("ggplot2")

# Load "RColorBrewer", which provides more color palettes for creating graphics 
library("RColorBrewer")

# Load "ggThemeAssist", a RStudio-Addin that delivers a graphical interface for editing ggplot2 theme elements
library("ggThemeAssist")

# Load "scales" package which provide extension function for modifying labels in ggplot
library("scales")

Load a csv file from Github.

# Read a csv file stored in the designated Github repository. 
DF_orders <- read_csv(url("https://raw.githubusercontent.com/ykaih/AGEC317/master/Random_500users_train.csv"))

# Convert "DF_orders" as a data.frame object
DF_orders <- data.frame(DF_orders)

2. Clean and explore data

Note: “()” in R is used to call a function. “[]” is used for subsetting vectors, arrays, matrices, and data frame (and other such objects). Besides that, Typing “#” allows R to distinguish text descriptions and R codes in R scripts. When we write our code, making comments is highly recommended. It helps us to know what we code and also helps other people to understand when they read our code.

2.1 Check the dimension and structure of data

The “nrow()” command with a data frame object allows us to know the number of rows.

nrow(DF_orders)

## [1] 500

The “ncol()” command shows the number of columns.

ncol(DF_orders)

## [1] 5

The “dim()” command tells us the dimension of the given data frame. The command outputs two numbers: the first one indicates the number of rows, and the second on indicate the number of columns.

dim(DF_orders)

## [1] 500   5

“names()” and “colnames()” both can retrieve names of columns, i.e., names of variables.

names(DF_orders)

## [1] "X1"                     "order_id"               "reorders"              
## [4] "products"               "days_since_prior_order"

“X1” is a sequence of numbers. “order_id” indicates the number of each online grocery order. “reorders” indicates the number of reordered items per order. “products” is the number of products per order. “days_since_prior_order” is the purchase frequency measurmed by the number of days since prior order.

“str()” is a handy command to overview the structure of a data frame object. It summarizes data frame information, such as dimension, variable (column) names, types of an object for each variable, overviews of the first few observations.

str(DF_orders)

## 'data.frame':    500 obs. of  5 variables:
##  $ X1                    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ order_id              : num  4998 6880 12182 19405 39465 ...
##  $ reorders              : num  3 4 1 2 34 1 4 16 1 4 ...
##  $ products              : num  3 6 3 3 45 6 4 20 7 16 ...
##  $ days_since_prior_order: num  24 23 3 20 12 6 30 9 2 5 ...

2.2 View data

“head()” displays the first 6 observations of our data by default:

head(DF_orders)

##   X1 order_id reorders products days_since_prior_order
## 1  1     4998        3        3                     24
## 2  2     6880        4        6                     23
## 3  3    12182        1        3                      3
## 4  4    19405        2        3                     20
## 5  5    39465       34       45                     12
## 6  6    51257        1        6                      6

Or display last 6 observations of our data by using:

tail(DF_orders)

##      X1 order_id reorders products days_since_prior_order
## 495 495  3379605       13       14                      7
## 496 496  3380661        6        9                      6
## 497 497  3381975        3       10                     24
## 498 498  3412247        1        1                      7
## 499 499  3413328        0        8                     27
## 500 500  3419629        4        5                      7

The default setting for the above two commands is 6. We can also manually specify how many observations we want to display by:

head(DF_orders, n=10)

##    X1 order_id reorders products days_since_prior_order
## 1   1     4998        3        3                     24
## 2   2     6880        4        6                     23
## 3   3    12182        1        3                      3
## 4   4    19405        2        3                     20
## 5   5    39465       34       45                     12
## 6   6    51257        1        6                      6
## 7   7    51912        4        4                     30
## 8   8    56815       16       20                      9
## 9   9    73206        1        7                      2
## 10 10   100586        4       16                      5

We can also view the entire data frame as we do in Excel by using the “View()” command.

View(DF_orders)

2.3 Delete and select variables

Method 1. Delete column by column index numbers

To remove variables by their column index numbers, we can input the column index number in the following code and tell R to drop variables positioned at first and second columns. The minus sign indicates to drop variables.

DF_orders[,-c(1,2)]

In addition to using the above method, we can also generate the same resulting output using the following Method 2 - 4.

Method 2. Delete column by name using the “subset.data.frame()” command.

subset.data.frame(DF_orders, select = -c(X1, order_id) )

Method 3. Delete column by name using “!” negation sign.

# Create a character vector where we store column names that we want to drop. This statement assigns the character vector to a new object named "drop_list," which is an arbitrary name. We can name a new object, whichever way we think it is easier to be identified.
drop_list <- c("X1", "order_id")

# The following line tells R that we want to the drop variables specified in the "drop_list" character vector from the "DF_orders" dataframe.
DF_orders[,!names(DF_orders) %in% drop_list]

Method 4. Delete column with the “select()” command in the “dplyr” package

# Delete column by column index numbers with the "select" command
select(DF_orders, -c(1:2))

# Delete column by column names with the "select" command
select(DF_orders, -c("X1", "order_id"))

If we want to save the resulting output as a new dataframe object, we can add “DF_orders_selected” and “<-” in the above line as follows:

# Store the resulting data set in the new object named "DF_orders_selected"
DF_orders_selected <- select(DF_orders, -c("X1", "order_id"))

Note that the above “DF_orders_selected” is an arbitrary new object name.

3. Summary statistics

There are several ways to show summary statistics in R. One straightforward way is to use “summary()” function:

3.1 Use “summary()” command

summary(DF_orders_selected)

##     reorders         products     days_since_prior_order
##  Min.   : 0.000   Min.   : 1.00   Min.   : 0.00         
##  1st Qu.: 2.000   1st Qu.: 4.75   1st Qu.: 7.00         
##  Median : 4.000   Median : 9.00   Median :16.00         
##  Mean   : 6.332   Mean   :10.51   Mean   :17.25         
##  3rd Qu.: 8.250   3rd Qu.:14.00   3rd Qu.:30.00         
##  Max.   :40.000   Max.   :45.00   Max.   :30.00

3.2 Use “stargazer()” command

To have a neater table for the summary statistics output, the “stargazer()” command can produce a descriptive summary output like:

stargazer(DF_orders_selected, type = "text")

## 
## ====================================================================
## Statistic               N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
## --------------------------------------------------------------------
## reorders               500 6.332   6.275    0     2       8.2    40 
## products               500 10.510  7.962    1    4.8       14    45 
## days_since_prior_order 500 17.252  10.687   0     7        30    30 
## --------------------------------------------------------------------

Alternatively, we can flip the descriptive summary output by setting the “flip” argument as TRUE.

stargazer(DF_orders_selected, type = "text", flip = TRUE)

## 
## ==================================================
## Statistic reorders products days_since_prior_order
## --------------------------------------------------
## N           500      500             500          
## Mean       6.332    10.510          17.252        
## St. Dev.   6.275    7.962           10.687        
## Min          0        1               0           
## Pctl(25)     2       4.8              7           
## Pctl(75)    8.2       14              30          
## Max          40       45              30          
## --------------------------------------------------

3.3. Pivot table: advanced summary statistics

R can also produce a pivot table that summarizes useful statistics from complex a data set conditional on specific selection criteria. For instance, if we want to know the relationship between the purchase frequency and the number of reorders and products per order, we can use the “group_by()” function from the “dplyr” package as follows:

# Make a pivot table 
DF_pivot <- DF_orders_selected %>% 
  group_by(days_since_prior_order) %>%
  summarize(reorders = sum(reorders), products = sum(products))

# Output the resulting pivot table
DF_pivot

## # A tibble: 31 x 3
##    days_since_prior_order reorders products
##                     <dbl>    <dbl>    <dbl>
##  1                      0       44       48
##  2                      1       92      132
##  3                      2       61       99
##  4                      3       52       72
##  5                      4      112      165
##  6                      5      122      204
##  7                      6      117      178
##  8                      7      342      457
##  9                      8      142      212
## 10                      9      146      263
## # ... with 21 more rows

4. Visualization

4.1 Histogram

The “ggplot2” package provides a variety of visualization options in R. The following function, “ggplot()” creates a plotting base where we input one variable from the “DF_orders_selected” data frame object. The “geom_histogram()” function can generate customized histograms. The “labs()” function can customize the plot title and x-axis/y-axis label. The “theme_bw()” function makes a black and white backgound. The “theme()” function adjusts different features of the resulting plot.

# Make a histogram
ggplot(DF_orders_selected, aes(x=reorders)) + 
  geom_histogram() +
  labs(x = "Number of reordered items", y="Frequency", title="Reorders per order") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Utilize “ggplot theme assistant”
Instead of coding every detail in the “theme()” function, we can also use an RStudio-Addin that delivers a graphical interface for editing ggplot2 theme elements. First, we need to select the above chuck of the “ggplot()” and click on “Addins” on top of the Rstudio as shown in the following screenshot:

Next, click on “ggplot Theme Assistant” in the above drop-down list, and then we can see the following interactive panel which can manually customize the theme of the plot.

4.2 Barchart

Suppose we want to show the relationship between the number of days since last order and the number of reorders and products per order, we can use the “DF_pivot” dataframe with the “geom_bar()” function to plot a barchart. Before we input the “DF_pivot” dataframe, we need to use “melt()” function to convert the wide table format to a long table format, which can be directly used in the “ggplot()” function.

# Covert a wide table to a long table
DF_pivot_long <- melt(DF_pivot, id.vars = "days_since_prior_order", value.name = "Numbers")

# Show the first 10 rows of the resulting long table
head(DF_pivot_long, 10)

##    days_since_prior_order variable Numbers
## 1                       0 reorders      44
## 2                       1 reorders      92
## 3                       2 reorders      61
## 4                       3 reorders      52
## 5                       4 reorders     112
## 6                       5 reorders     122
## 7                       6 reorders     117
## 8                       7 reorders     342
## 9                       8 reorders     142
## 10                      9 reorders     146

# Show the last 10 rows of the resulting long table
tail(DF_pivot_long, 10)

##    days_since_prior_order variable Numbers
## 53                     21 products      50
## 54                     22 products      73
## 55                     23 products      57
## 56                     24 products      40
## 57                     25 products      71
## 58                     26 products      71
## 59                     27 products     150
## 60                     28 products     134
## 61                     29 products      55
## 62                     30 products    1516

Then we can input the above long table in the “ggplot()” function with the “geom_bar()” function. The “scale_fill_brewer()” function provides more color palettes for creating graphics from the “RColorBrewer” package. The “scale_y_continuous()” and “scale_x_continuous()” function specify the desired breaks in the y-axis and x-axis, respectively.

# Make a barchart
ggplot(DF_pivot_long, aes(x=days_since_prior_order, y = Numbers, fill = variable)) + 
  geom_bar(stat = "identity", position="dodge", alpha = 0.8) +
  scale_fill_brewer(palette = "Set1") + 
  scale_y_continuous(breaks = seq(0,1800, by=300), labels = comma) +
  scale_x_continuous(breaks = seq(0,30, by=3)) +
  labs(x = "Days since the last order", y="Number of reorders or products", 
       title="Breakdown of days since last order") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=18), 
        plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1), 
        legend.position = "bottom", legend.direction = "horizontal")

4.3 Scatterplot

To draw a scatterplot, we can use the “geom_point()” function to display data points, and then the “geom_smooth()” function can plot a regression line. There are several alternativie methods to draw a regression line, such as “auto”, “lm”, “glm”, “gam”, “loess” or other functions. Here we use a linear function method by typing “lm” as the selected method.

# Make a scatterplot
ggplot(DF_orders_selected, aes(x=products, y =reorders)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))

5. Regression analysis

The “lm()” function can perform regression analyses. “reorders ~ products” tells R that we want to regress “the number of reorders” on “the number of products,” and “data = DF_orders_selected” indicates that we will use data from “DF_orders_selected” data set to perform the above regression command. Use the “summary()” function to show the regression result.

# Run regression
reg1 <- lm(reorders ~ products, data = DF_orders_selected)

# Output the regression result
summary(reg1)

## 
## Call:
## lm(formula = reorders ~ products, data = DF_orders_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4887  -1.7826   0.3717   1.7656  12.2934 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.80192    0.23646  -3.391 0.000751 ***
## products     0.67877    0.01794  37.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.191 on 498 degrees of freedom
## Multiple R-squared:  0.7419, Adjusted R-squared:  0.7414 
## F-statistic:  1432 on 1 and 498 DF,  p-value: < 2.2e-16

We can incorporate the above simple regression result in the previous scatter plot by coding the following lines:

# Extract the regression information from the "reg" regression object
Equa <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
                 list(a = format(unname(coef(reg1)[1]), digits = 2),
                      b = format(unname(coef(reg1)[2]), digits = 2),
                      r2 = format(summary(reg1)$r.squared, digits = 3)))

# "as.expression" attempts to coerce its argument into an expression object, which is not being used in its colloquial sense, that of mathematical expressions
Text_reg <- as.character(as.expression(Equa))

# Make a scatterplot with a regression line
ggplot(DF_orders_selected, aes(x=products, y =reorders)) + 
  geom_point() +
  geom_smooth(method = "lm", alpha = .15) +
  labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") + 
  geom_label(data = data.frame(x = 15, y = 35, label = Text_reg), aes(x = x, y = y, label = Text_reg), 
             parse = TRUE, color = "red", size =5) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))

Suppose we want to investigate how the number of ordered products and purchase frequency affect the number of reorders, we can perform the following multiple regression:

# Run regression
reg2 <- lm(reorders ~ products + days_since_prior_order, data = DF_orders_selected)

# Output the regression result
summary(reg2)

## 
## Call:
## lm(formula = reorders ~ products + days_since_prior_order, data = DF_orders_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6564  -1.7720   0.3281   1.9876  11.5775 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.34778    0.32091   1.084    0.279    
## products                0.67964    0.01750  38.842  < 2e-16 ***
## days_since_prior_order -0.06717    0.01304  -5.152 3.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.112 on 497 degrees of freedom
## Multiple R-squared:  0.755,  Adjusted R-squared:  0.754 
## F-statistic: 765.8 on 2 and 497 DF,  p-value: < 2.2e-16

To compare the multiple regression results in a table format, we can input the above regression result objects, “reg1” and “reg2,” in the “stargazer()” function, where type=“text” indicates the output format as a text file, and out = “Reg_result.txt” tell R to save the resulting output as “Reg_result.txt” in the working directory.

# Report the regression result in a table format
stargazer(reg1, reg2, type="text", out = "Reg_result.txt")

## 
## ==========================================================================
##                                        Dependent variable:                
##                        ---------------------------------------------------
##                                             reorders                      
##                                   (1)                       (2)           
## --------------------------------------------------------------------------
## products                        0.679***                  0.680***        
##                                 (0.018)                   (0.017)         
##                                                                           
## days_since_prior_order                                   -0.067***        
##                                                           (0.013)         
##                                                                           
## Constant                       -0.802***                   0.348          
##                                 (0.236)                   (0.321)         
##                                                                           
## --------------------------------------------------------------------------
## Observations                      500                       500           
## R2                               0.742                     0.755          
## Adjusted R2                      0.741                     0.754          
## Residual Std. Error         3.191 (df = 498)          3.112 (df = 497)    
## F Statistic            1,431.560*** (df = 1; 498) 765.768*** (df = 2; 497)
## ==========================================================================
## Note:                                          *p<0.1; **p<0.05; ***p<0.01

The comparison result suggests that a 10 item increase in the number of products per order is estimated to increase by 6.8 reorders per order. The purchase frequency has a negative relationship with the number of reorders based on the given data set. However, the purchase frequency may have a nonlinear effect on the number of reorders and varies across different products and/or users.

6. Endnote

This post is meant for R beginners to learn some R functions related to data exploration, visualization, and linear regression analysis. There are some limitations to analysis per se. For instance, user information is missing in this data set, because the provided data is aggregated to the order-level (for the detailed data aggregation process, please refer to this Github repository). Besides, although there is a strong correlation between the number of reorders and products per order, the causality between reorders and products is arguable. Different grocery products likely have distinct consumption frequencies. The heterogeneous effect of different grocery products needs to be explored in the product-level data.