This paper pertains to the Term project assigned in the Introduction to Data Management course offered at GSU. The paper will be structured in a manner that progresses from exploring the origins and background of the data, to obtaining an analytic dataset, and finally conducting an exploratory analysis of the data.The R language is used for data analysis and visualization in this project.
Before performing the analysis, it’s very much required to install using “install.packages(”“)” and call on required packages using library function.
The Olist Store, an ecommerce platform based in Brazil, has made available a public dataset consisting of 100,000 order records from 2016 to 2018. These orders were placed on multiple marketplaces across Brazil, and the dataset provides a comprehensive view of various aspects of each order, including order status, price, payment, freight performance, customer location, product attributes, and customer reviews. Please note that while Olist also released a geolocation dataset mapping Brazilian zip codes to lat/lng coordinates, this project does not incorporate that particular dataset. Source: Data was downloaded from kaggle. You can access it on Olist Data To improve comprehension and organization, the data was separated into several datasets.Data was originally in csv. format, converted to xls files and was read into r using the read_excel function.
#read in data sets.
customer_data = read_excel("C:/Users/bukol/Documents/olist_customers_dataset.xlsx")
order_itemdata = read_excel("C:/Users/bukol/Documents/olist_order_items_dataset.xlsx")
pay_data = read_excel("C:/Users/bukol/Documents/olist_order_payments_dataset.xlsx")
rev_data = read_excel("C:/Users/bukol/Documents/olist_order_reviews_dataset.xlsx")
orders = read_excel("C:/Users/bukol/Documents/olist_orders_dataset.xlsx")
order_prddata = read_excel("C:/Users/bukol/Documents/olist_products_dataset.xlsx")
order_selldata = read_excel("C:/Users/bukol/Documents/olist_sellers_dataset.xlsx")
order_prd_catdata = read_excel("C:/Users/bukol/Documents/product_category_name_translation.xlsx")After reading the CSV files, I examined the number of columns and rows along with the column names for each file. The descriptions for all columns are shown in the table below:
| variable | missing_count | |
|---|---|---|
| review_comment_title | review | 87659 |
| review_comment_message | review | 58259 |
| order_delivered_customer_date | order | 2965 |
| order_delivered_carrier_date | order | 1783 |
| product_category_name | product | 610 |
| product_name_lenght | product | 610 |
| product_description_lenght | product | 610 |
| product_photos_qty | product | 610 |
| order_approved_at | order | 160 |
| product_weight_g | product | 2 |
| product_length_cm | product | 2 |
| product_height_cm | product | 2 |
| product_width_cm | product | 2 |
The seven data sets were combined into a main data set using the left_join function. Each row in the main data set represents a transaction, and if there are multiple rows with the same order id, then they belong to the same order. The common variables in the main data set are order id, order item id, product id, and seller id.
To connect the tables, the following steps were taken:
The resulting “data” data frame contains information from all 7 datasets and is linked by the common variables specified above.
There are 118310 observations and 40 variables in the resulting dataset, “data”.
I observed duplicate observations with order id,customer id, order purchase timestamp and order delivered customer date. These duplicated rows were dropped keeping the first. Imputation was done for missing values for some variables.
1.For missing values in review comment message, nao_reveja was imputed.
2.For missing values in product weight, median of product weight was imputed.
3.For missing values in product length, median of product length was imputed.
4.For missing values in product height, median of product height was imputed.
5.For missing values in product width, median of product width was imputed.
6.The remaining missing values were deleted.
The merged dataset consists of 34 columns and it has categorical features like order status,payment type, customer state, and product category name english. One column named review comment message has text data that is in Portuguese. There are few numerical features also. The descriptions for all columns are shown in the table below:
Based on the table provided, it can be noted that:
The highest order price recorded is 4590 Brazilian real, while the maximum freight value is approximately 375.28 Brazilian real. On average, the average freight value is approximately 22.26 Brazilian real. The lowest-priced order is at 3.90 Brazilian real.
In terms of payment_value, the largest payment made for an order is 7274.88 Brazilian real. Other numerical features such as percentile values, mean, standard deviation, count can be seen in the table.
## price freight_value payment_value
## Min. : 3.90 Min. : 0.00 Min. : 0.27
## 1st Qu.: 44.00 1st Qu.: 13.07 1st Qu.: 63.86
## Median : 79.99 Median : 18.27 Median : 112.42
## Mean : 136.26 Mean : 22.26 Mean : 174.11
## 3rd Qu.: 144.00 3rd Qu.: 23.36 3rd Qu.: 186.76
## Max. :4590.00 Max. :375.28 Max. :7274.88
In this eCommerce dataset, four primary payment methods are used: credit card, Boleto Bancário (also known as Baleto), voucher, and debit card.
Based on the bar chart above, we can see that the majority of orders are paid for using a credit card, followed by Boleto as the second most commonly used payment method. The bar chart illustrates the proportion of each payment mode, where credit card is used by 76% of the users, Boleto is used by 18%, and voucher and debit card are used by only 5% of the users combined.
Focusing on the target variable - review score, it is worth noting that scores of 3 or higher are categorized as positive, denoted by 1, and anything lower is labeled negative, indicated by 0. Upon analyzing the distribution of the target variable, we can see that approximately 81.7% of the total reviews are classified as positive, while 18.3% are negative. Overall, we can say majority of the people are happy shopping with olist.
From the analysis of the data, we can see that the top three most ordered product categories between 2016 and 2018 were health and beauty products,bed and bath products, watches and gift products. These categories stood out as having the highest number of total orders during this period.
This heat map shows the different average time it takes for orders to be delivered. The shortest delivery time is from PR to DF with less than 10 days and longest delivery time is from MG to PR and MG to RS with 40 days and above.
A multiple linear regression analysis on a dataset containing information about customer orders is done. First, the relevant variables for regression are selected from the original data, including the review score, price, freight value, payment value, product weight, and product dimensions. Outliers in these variables are removed using Tukey’s method.
A linear regression model is then fit using the model, with the review score as the dependent variable and the other selected variables as independent variables.
| Variable | Estimate | Std. Error | t-value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 4.0738571 | 0.0652810 | 62.4049552 | 0.0000000 |
| price | 0.0058428 | 0.0005143 | 11.3596881 | 0.0000000 |
| freight_value | -0.0002378 | 0.0024797 | -0.0958891 | 0.9236110 |
| payment_value | -0.0059773 | 0.0004412 | -13.5475603 | 0.0000000 |
| product_weight_g | 0.0000116 | 0.0000286 | 0.4052078 | 0.6853354 |
| product_length_cm | 0.0040839 | 0.0021618 | 1.8891769 | 0.0589037 |
| product_height_cm | 0.0018721 | 0.0023766 | 0.7877014 | 0.4308944 |
| product_width_cm | 0.0003375 | 0.0027361 | 0.1233620 | 0.9018235 |
` Based on the output, the regression model is:
review_score = 4.0738571 + 0.0058428 * price - 0.0002378 * freight_value - 0.0059773 * payment_value + 0.0000116 * product_weight_g + 0.0040839 * product_length_cm + 0.0018721 * product_height_cm + 0.0003375 * product_width_cm
Where price, freight value, payment value, product weight, product length, product height, and product width are the independent variables, and their corresponding coefficients indicate the strength and direction of their association with the dependent variable, review score
The R-squared value of 0.023 indicates that only 2.3% of the variability in the response variable (review score) can be explained by the linear regression model using the predictor variables (price, freight value, payment value, product weight, product length, product height, and product width).
The Adjusted R-squared value is the same as the R-squared value, which means that there is no penalty for the number of predictor variables used in the model. The F-statistic of 27.723 with 7 and 8116 degrees of freedom indicates that the overall fit of the model is statistically significant, with a p-value less than 0.05.
Looking at the coefficients, we can see that the intercept is 4.0738571, which means that when all the independent variables are zero, the expected value of the dependent variable (review score) is 4.07 point.
The coefficient for the variable price is positive (0.0058428), indicating that as the price of the product increases, the review score will increase by 0.0058428 point .
The coefficient for the variable freight value is negative (-0.0002378), but it is not statistically significant (p-value = 0.9236110), suggesting that this variable does not have a significant effect on the review score.
The coefficient for the variable payment value is negative (-0.0059773), indicating that as the payment value increases, the review score will decrease by 0.0059773 points .
The coefficients for product weight, ’product height, and ’product width are all small and not statistically significant, indicating that these variables do not have a significant effect on the review score.
The coefficient for ’product length is positive (0.0040839), but it is only marginally significant (p-value = 0.0589037), suggesting that this variable may have a weak effect on the review score.
Overall, the model suggests that price and payment value may be important factors influencing review scores, but the model is not a good fit for the data, as indicated by the low R-squared value.