Introduction

This paper pertains to the Term project assigned in the Introduction to Data Management course offered at GSU. The paper will be structured in a manner that progresses from exploring the origins and background of the data, to obtaining an analytic dataset, and finally conducting an exploratory analysis of the data.The R language is used for data analysis and visualization in this project.

Install required packages:

Before performing the analysis, it’s very much required to install using “install.packages(”“)” and call on required packages using library function.

Data Origin and Background

The Olist Store, an ecommerce platform based in Brazil, has made available a public dataset consisting of 100,000 order records from 2016 to 2018. These orders were placed on multiple marketplaces across Brazil, and the dataset provides a comprehensive view of various aspects of each order, including order status, price, payment, freight performance, customer location, product attributes, and customer reviews. Please note that while Olist also released a geolocation dataset mapping Brazilian zip codes to lat/lng coordinates, this project does not incorporate that particular dataset. Source: Data was downloaded from kaggle. You can access it on Olist Data To improve comprehension and organization, the data was separated into several datasets.Data was originally in csv. format, converted to xls files and was read into r using the read_excel function.

#read in data sets.
customer_data = read_excel("C:/Users/bukol/Documents/olist_customers_dataset.xlsx")
order_itemdata = read_excel("C:/Users/bukol/Documents/olist_order_items_dataset.xlsx")
pay_data = read_excel("C:/Users/bukol/Documents/olist_order_payments_dataset.xlsx")
rev_data = read_excel("C:/Users/bukol/Documents/olist_order_reviews_dataset.xlsx")
orders = read_excel("C:/Users/bukol/Documents/olist_orders_dataset.xlsx")
order_prddata = read_excel("C:/Users/bukol/Documents/olist_products_dataset.xlsx")
order_selldata = read_excel("C:/Users/bukol/Documents/olist_sellers_dataset.xlsx")
order_prd_catdata = read_excel("C:/Users/bukol/Documents/product_category_name_translation.xlsx")

Data Description

After reading the CSV files, I examined the number of columns and rows along with the column names for each file. The descriptions for all columns are shown in the table below:

Number of missing values for each variable
variable missing_count
review_comment_title review 87659
review_comment_message review 58259
order_delivered_customer_date order 2965
order_delivered_carrier_date order 1783
product_category_name product 610
product_name_lenght product 610
product_description_lenght product 610
product_photos_qty product 610
order_approved_at order 160
product_weight_g product 2
product_length_cm product 2
product_height_cm product 2
product_width_cm product 2

Getting Analytic Set

Data merging

The seven data sets were combined into a main data set using the left_join function. Each row in the main data set represents a transaction, and if there are multiple rows with the same order id, then they belong to the same order. The common variables in the main data set are order id, order item id, product id, and seller id.

To connect the tables, the following steps were taken:

  1. Merge the order items dataset table with the orders dataset table using the order id key.
  2. Merge the resulting table with the customer dataset table using the customer id key.
  3. Merge the resulting table with the payment dataset table using the order id key.
  4. Merge the resulting table with the review dataset table using the order id key.
  5. Merge the resulting table with the products dataset table using the product id key.
  6. Merge the resulting table with the product category dataset table using the product category name key.
  7. Merge the resulting table with the sellers dataset table using the seller id key.

The resulting “data” data frame contains information from all 7 datasets and is linked by the common variables specified above.

There are 118310 observations and 40 variables in the resulting dataset, “data”.

Treating missing values and depulication data

I observed duplicate observations with order id,customer id, order purchase timestamp and order delivered customer date. These duplicated rows were dropped keeping the first. Imputation was done for missing values for some variables.

1.For missing values in review comment message, nao_reveja was imputed.

2.For missing values in product weight, median of product weight was imputed.

3.For missing values in product length, median of product length was imputed.

4.For missing values in product height, median of product height was imputed.

5.For missing values in product width, median of product width was imputed.

6.The remaining missing values were deleted.

The merged dataset consists of 34 columns and it has categorical features like order status,payment type, customer state, and product category name english. One column named review comment message has text data that is in Portuguese. There are few numerical features also. The descriptions for all columns are shown in the table below:

Summary statistics

Based on the table provided, it can be noted that:

The highest order price recorded is 4590 Brazilian real, while the maximum freight value is approximately 375.28 Brazilian real. On average, the average freight value is approximately 22.26 Brazilian real. The lowest-priced order is at 3.90 Brazilian real.

In terms of payment_value, the largest payment made for an order is 7274.88 Brazilian real. Other numerical features such as percentile values, mean, standard deviation, count can be seen in the table.

##      price         freight_value    payment_value    
##  Min.   :   3.90   Min.   :  0.00   Min.   :   0.27  
##  1st Qu.:  44.00   1st Qu.: 13.07   1st Qu.:  63.86  
##  Median :  79.99   Median : 18.27   Median : 112.42  
##  Mean   : 136.26   Mean   : 22.26   Mean   : 174.11  
##  3rd Qu.: 144.00   3rd Qu.: 23.36   3rd Qu.: 186.76  
##  Max.   :4590.00   Max.   :375.28   Max.   :7274.88


In this eCommerce dataset, four primary payment methods are used: credit card, Boleto Bancário (also known as Baleto), voucher, and debit card.

Based on the bar chart above, we can see that the majority of orders are paid for using a credit card, followed by Boleto as the second most commonly used payment method. The bar chart illustrates the proportion of each payment mode, where credit card is used by 76% of the users, Boleto is used by 18%, and voucher and debit card are used by only 5% of the users combined.

Focusing on the target variable - review score, it is worth noting that scores of 3 or higher are categorized as positive, denoted by 1, and anything lower is labeled negative, indicated by 0. Upon analyzing the distribution of the target variable, we can see that approximately 81.7% of the total reviews are classified as positive, while 18.3% are negative. Overall, we can say majority of the people are happy shopping with olist.

From the analysis of the data, we can see that the top three most ordered product categories between 2016 and 2018 were health and beauty products,bed and bath products, watches and gift products. These categories stood out as having the highest number of total orders during this period.

This heat map shows the different average time it takes for orders to be delivered. The shortest delivery time is from PR to DF with less than 10 days and longest delivery time is from MG to PR and MG to RS with 40 days and above.

Regression model

A multiple linear regression analysis on a dataset containing information about customer orders is done. First, the relevant variables for regression are selected from the original data, including the review score, price, freight value, payment value, product weight, and product dimensions. Outliers in these variables are removed using Tukey’s method.

A linear regression model is then fit using the model, with the review score as the dependent variable and the other selected variables as independent variables.

R-squared: 0.023
Adjusted R-squared: 0.023
F-statistic: 27.723 on 7 and 8116 DF
Variable Estimate Std. Error t-value Pr(>|t|)
(Intercept) 4.0738571 0.0652810 62.4049552 0.0000000
price 0.0058428 0.0005143 11.3596881 0.0000000
freight_value -0.0002378 0.0024797 -0.0958891 0.9236110
payment_value -0.0059773 0.0004412 -13.5475603 0.0000000
product_weight_g 0.0000116 0.0000286 0.4052078 0.6853354
product_length_cm 0.0040839 0.0021618 1.8891769 0.0589037
product_height_cm 0.0018721 0.0023766 0.7877014 0.4308944
product_width_cm 0.0003375 0.0027361 0.1233620 0.9018235

` Based on the output, the regression model is:

review_score = 4.0738571 + 0.0058428 * price - 0.0002378 * freight_value - 0.0059773 * payment_value + 0.0000116 * product_weight_g + 0.0040839 * product_length_cm + 0.0018721 * product_height_cm + 0.0003375 * product_width_cm

Where price, freight value, payment value, product weight, product length, product height, and product width are the independent variables, and their corresponding coefficients indicate the strength and direction of their association with the dependent variable, review score

The R-squared value of 0.023 indicates that only 2.3% of the variability in the response variable (review score) can be explained by the linear regression model using the predictor variables (price, freight value, payment value, product weight, product length, product height, and product width).

The Adjusted R-squared value is the same as the R-squared value, which means that there is no penalty for the number of predictor variables used in the model. The F-statistic of 27.723 with 7 and 8116 degrees of freedom indicates that the overall fit of the model is statistically significant, with a p-value less than 0.05.


Interpretation of the coefficients in the regression model:

Looking at the coefficients, we can see that the intercept is 4.0738571, which means that when all the independent variables are zero, the expected value of the dependent variable (review score) is 4.07 point.

The coefficient for the variable price is positive (0.0058428), indicating that as the price of the product increases, the review score will increase by 0.0058428 point .

The coefficient for the variable freight value is negative (-0.0002378), but it is not statistically significant (p-value = 0.9236110), suggesting that this variable does not have a significant effect on the review score.

The coefficient for the variable payment value is negative (-0.0059773), indicating that as the payment value increases, the review score will decrease by 0.0059773 points .

The coefficients for product weight, ’product height, and ’product width are all small and not statistically significant, indicating that these variables do not have a significant effect on the review score.

The coefficient for ’product length is positive (0.0040839), but it is only marginally significant (p-value = 0.0589037), suggesting that this variable may have a weak effect on the review score.

Overall, the model suggests that price and payment value may be important factors influencing review scores, but the model is not a good fit for the data, as indicated by the low R-squared value.