Required packages
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(lubridate)
Executive Summary
In the first section of the report, all the required packages and the datasets were imported into the R workspace. Next, we performed the required type conversions to various variables in the dataset i.e., Char to Factor and Char to DateTime. Following this, the data was checked to make sure it adheres to the tidy data principles. After making sure each variable had it’s own column, each observation had it’s own row and each value had it’s own cell, we proceeded to add a new variable, Shipping days to the dataset. Next, the dataset was scanned to find any NA, NaN or infinite values. Since having these values could pose serious problems, they were subsequently removed from the dataset using the appropriate techniques. The outliers were then detected in the numeric variables with the help of box plots. In the outliers section of the report, we deal with them using various methods such as capping and removing. In the final section of the report, we perform a log transformation on a variable to get rid of it’s right skewness and normalize it.
Data
Data sets source - https://www.kaggle.com/olistbr/brazilian-ecommerce
The data sets used for this assignment were obtained from kaggle. They are a part of the Brazilian E-Commerce Public Dataset released by Olist, an e-commerce company made for sellers, that links merchants and their products to the major marketplaces of Brazil.
Below are the details of variables in olist_orders_dataset.csv order_id - unique identifier of the order. customer_id - key to the customer dataset. Each order has a unique customer_id. order_status - Reference to the order status (delivered, shipped, etc). order_purchase_timestamp - Shows the purchase timestamp. order_approved_at - Shows the payment approval timestamp. order_delivered_carrier_date - Shows the order posting timestamp. When it was handled to the logistic partner. order_delivered_customer_date - Shows the actual order delivery date to the customer. order_estimated_delivery_date - Shows the estimated delivery date that was informed to customer at the purchase moment.
Below are the details of variables in olist_order_payments_dataset.csv order_id - unique identifier of an order. payment_sequential - a customer may pay an order with more than one payment method. If he does so, a sequence will be created to accommodate all payments. payment_type - method of payment chosen by the customer. payment_installments - number of installments chosen by the customer. payment_value - transaction value.
First both the datasets were imported into using the read.csv. The option for converting strings to factors was set to false. Following this, both the datasets were merged based on the order_id column using the merge command. The merged dataset was named orders_final.
ord <- read.csv("olist_orders_dataset.csv", stringsAsFactors = FALSE)
ord
payment <- read.csv("olist_order_payments_dataset.csv", stringsAsFactors = FALSE)
payment
#Merging orders and payment into orders_final
orders_final <- merge(payment,ord, by = "order_id")
orders_final
Understand
The orders_final dataframe was summarized using the str() function. Following this, the categorical columns were converted to factors using the as.factor function and the levels were viewed using the levels function. After this, the columns containing dates were converted from character to datetime using the lubridate function ymd_hms. This was applied to columns 8 to 12 using the lapply function.
#checking structure of the merged data set
str(orders_final)
'data.frame': 103886 obs. of 12 variables:
$ order_id : chr "00010242fe8c5a6d1ba2dd792cb16214" "00018f77f2f0320c557190d7a144bdd3" "000229ec398224ef6ca0657da4fc703e" "00024acbcdf0a6daa1e931b038114c75" ...
$ payment_sequential : int 1 1 1 1 1 1 1 1 1 1 ...
$ payment_type : chr "credit_card" "credit_card" "credit_card" "credit_card" ...
$ payment_installments : int 2 3 5 2 3 1 1 10 3 1 ...
$ payment_value : num 72.2 259.8 216.9 25.8 218 ...
$ customer_id : chr "3ce436f183e68e07877b285a838db11a" "f6dd3ec061db4e3987629fe6b26e5cce" "6489ae5e4333f3693df5ad4372dab6d3" "d4eb9395c8c0431ee92fce09860c5a06" ...
$ order_status : chr "delivered" "delivered" "delivered" "delivered" ...
$ order_purchase_timestamp : chr "2017-09-13 08:59:02" "2017-04-26 10:53:06" "2018-01-14 14:33:31" "2018-08-08 10:00:35" ...
$ order_approved_at : chr "2017-09-13 09:45:35" "2017-04-26 11:05:13" "2018-01-14 14:48:30" "2018-08-08 10:10:18" ...
$ order_delivered_carrier_date : chr "2017-09-19 18:34:16" "2017-05-04 14:35:00" "2018-01-16 12:36:48" "2018-08-10 13:28:00" ...
$ order_delivered_customer_date: chr "2017-09-20 23:43:48" "2017-05-12 16:04:24" "2018-01-22 13:19:16" "2018-08-14 13:32:39" ...
$ order_estimated_delivery_date: chr "2017-09-29 00:00:00" "2017-05-15 00:00:00" "2018-02-05 00:00:00" "2018-08-20 00:00:00" ...
#Factoring variables payement type and order status
orders_final$payment_type <- as.factor(orders_final$payment_type)
levels(orders_final$payment_type)
[1] "boleto" "credit_card" "debit_card" "not_defined" "voucher"
orders_final$order_status <- as.factor(orders_final$order_status)
levels(orders_final$order_status)
[1] "approved" "canceled" "created" "delivered" "invoiced" "processing"
[7] "shipped" "unavailable"
#Converting character to date format for last five columns
orders_final[8:12] <- lapply(orders_final[8:12], ymd_hms)
sapply(orders_final[8:12], class)
order_purchase_timestamp order_approved_at order_delivered_carrier_date
[1,] "POSIXct" "POSIXct" "POSIXct"
[2,] "POSIXt" "POSIXt" "POSIXt"
order_delivered_customer_date order_estimated_delivery_date
[1,] "POSIXct" "POSIXct"
[2,] "POSIXt" "POSIXt"
Tidy & Manipulate Data I
Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.
The dataset was checked to make sure it adheres to the tidy data priniciples of 1) Each variable must have its own column. 2) Each observation must have its own row. 3) Each value must have its own cell.
Our dataset follows these priniciples and thereby is in a tidy format.
orders_final
Tidy & Manipulate Data II
The column shipping days was formed by finding the days difference between the order purchase day and the delivery day. It was created with the help of the mutate function with the units set to days. Following this, the floor function was applied to all observartions in the new column as days represent whole number.
#creating shipping days variable using mutate
orders_final <- mutate(orders_final,shipping_days = difftime(order_delivered_customer_date, order_purchase_timestamp, units = "days"))
#Rounf the date using floor for shipping days
orders_final$shipping_days <- floor(orders_final$shipping_days)
Scan I
- The number of NA values were obtained by using the is.na function on the dataset and then using the sum function on the result of the former.
- The total number of NA values were observed to be 8327. To get a more clearer picture, the NA values in each column was observed using the colSums function.
- The NA values were removed from the dataset using the na.omit function. The decision to remove them was made since they make up only 3 percent of the dataset and eliminating them won’t cause much of an impact to the dataset.
- The date and shipping days columns are all closely related to each other and missing values in one column could cause missing values in the others, this is why the number of NA values is shown at 8327.
- Validations were made to make sure the NA values were removed.
- checks were made to make sure if the dataset contained any NAN or infinite values using the is.nan and is.infinite functions respectively. This was applied to all the columns using the sapply function. From this, we found out there were no infinite or NaN values present in the dataset.
#Checking for NA values(Missing values)
sum(is.na(orders_final))
[1] 8327
colSums(is.na(orders_final))
order_id payment_sequential
0 0
payment_type payment_installments
0 0
payment_value customer_id
0 0
order_status order_purchase_timestamp
0 0
order_approved_at order_delivered_carrier_date
175 1888
order_delivered_customer_date order_estimated_delivery_date
3132 0
shipping_days
3132
##3132/103886 * 100 = 3.015 <- order_delivered customer date
#Percentage of missing values calculation
percentage_missing_values <- sum(is.na(orders_final$order_delivered_customer_date))/length(orders_final$order_delivered_customer_date) * 100
percentage_missing_values
[1] 3.014843
#After removing missing values
orders_final <- na.omit(orders_final)
#validating after updating the dataset
colSums(is.na(orders_final))
order_id payment_sequential
0 0
payment_type payment_installments
0 0
payment_value customer_id
0 0
order_status order_purchase_timestamp
0 0
order_approved_at order_delivered_carrier_date
0 0
order_delivered_customer_date order_estimated_delivery_date
0 0
shipping_days
0
#Code for checking any NaN (Not a number)
sum(sapply(orders_final, is.nan))
[1] 0
#Code for checking any infinite values
sum(sapply(orders_final, is.infinite))
[1] 0
Scan II
- Numeric variables like payment sequential, payment_installments and payment_value were checked for outliers.
- Payment sequential is handled by removing outlying observations as they contributed 4.3% of the dataset(calculation shown below).
- Capping approach was incorporated for payment value and payment installment by considering the nearest neighbor of the 95th or 5th percentile for the outlier value.
- Boxplots were plotted to visualize the outliers in these variables.
pay_sequence_outliers <- boxplot(orders_final$payment_sequential, main="Distribution of payment sequential", col = "grey")

#calculating percentage of missing values
percentage_missing_values <- length(pay_sequence_outliers$out)/length(orders_final$payment_sequential) * 100
percentage_missing_values
[1] 4.325038
#Payment sequence - removing outliers
orders_final <- orders_final[!(orders_final$payment_sequential %in% pay_sequence_outliers$out), ]
boxplot(orders_final$payment_sequential, main="Distribution of updated payment sequential", col = "lightblue")

#capping function
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
pay_installments_outliers <- boxplot(orders_final$payment_installments, main="Distribution of payment installments", col = "grey")

#payment installment - capping method
orders_final$payment_installments <- orders_final$payment_installments %>% cap()
boxplot(orders_final$payment_installments, main="Distribution of updated payment installments", col = "lightblue")

pay_value_outliers <- boxplot(orders_final$payment_value, main="Distribution of payment value", col = "grey")

#payment value - capping method
orders_final$payment_value <- orders_final$payment_value %>% cap()
boxplot(orders_final$payment_value, main="Distribution of updated payment value", col = "lightblue")

---
title: "MATH2349 Semester 2, 2019"
author: "Ashwin Anis - s3763476 / Deepak Prasad - s3759108"
subtitle: Assignment 3
output:
  html_notebook: default
---

```{r setup, message=FALSE, echo=FALSE}
require("knitr")
opts_knit$set(root.dir = "/Users/dprasadg/Desktop/DP/Assignment 3")
```

## Required packages 

```{r, echo = TRUE, message=FALSE}
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(lubridate)
```


## Executive Summary 

In the first section of the report, all the required packages and the datasets were imported into the R workspace. Next, we performed the required type conversions to various variables in the dataset i.e., Char to Factor and Char to DateTime. Following this, the data was checked to make sure it adheres to the tidy data principles. After making sure each variable had it's own column, each observation had it's own row and each value had it's own cell, we proceeded to add a new variable, Shipping days to the dataset. Next, the dataset was scanned to find any NA, NaN or infinite values. Since having these values could pose serious problems, they were subsequently removed from the dataset using the appropriate techniques. The outliers were then detected in the numeric variables with the help of box plots. In the outliers section of the report, we deal with them using various methods such as capping and removing. In the final section of the report, we perform a log transformation on a variable to get rid of it's right skewness and normalize it. 


## Data 

Data sets source - https://www.kaggle.com/olistbr/brazilian-ecommerce

The data sets used for this assignment were obtained from kaggle. They are a part of the Brazilian E-Commerce Public Dataset released by Olist, an e-commerce company made for sellers, that links merchants and their products to the major marketplaces of Brazil.  

Below are the details of variables in olist_orders_dataset.csv
order_id - unique identifier of the order.
customer_id - key to the customer dataset. Each order has a unique customer_id.
order_status - Reference to the order status (delivered, shipped, etc).
order_purchase_timestamp - Shows the purchase timestamp.
order_approved_at - Shows the payment approval timestamp.
order_delivered_carrier_date - Shows the order posting timestamp. When it was handled to the logistic partner.
order_delivered_customer_date - Shows the actual order delivery date to the customer.
order_estimated_delivery_date - Shows the estimated delivery date that was informed to customer at the purchase moment.

Below are the details of variables in olist_order_payments_dataset.csv
order_id - unique identifier of an order.
payment_sequential - a customer may pay an order with more than one payment method. If he does so, a sequence will be created to accommodate all payments.
payment_type - method of payment chosen by the customer.
payment_installments - number of installments chosen by the customer.
payment_value - transaction value.


First both the datasets were imported into using the read.csv. The option for converting strings to factors was set to false. Following this, both the datasets were merged based on the order_id column using the merge command. The merged dataset was named orders_final. 

```{r}
ord <- read.csv("olist_orders_dataset.csv", stringsAsFactors = FALSE)
ord
payment <- read.csv("olist_order_payments_dataset.csv", stringsAsFactors = FALSE)
payment

#Merging orders and payment into orders_final
orders_final <- merge(payment,ord, by = "order_id")
orders_final
```


## Understand 

The orders_final dataframe was summarized using the str() function. Following this, the categorical columns were converted to factors using the as.factor function and the levels were viewed using the levels function. After this, the columns containing dates were converted from character to datetime using the lubridate function ymd_hms. This was applied to columns 8 to 12 using the lapply function. 

```{r}
#checking structure of the merged data set
str(orders_final)

#Factoring variables payement type and order status
orders_final$payment_type <- as.factor(orders_final$payment_type)
levels(orders_final$payment_type)
orders_final$order_status <- as.factor(orders_final$order_status)
levels(orders_final$order_status)

#Converting character to date format for last five columns
orders_final[8:12] <- lapply(orders_final[8:12], ymd_hms)
sapply(orders_final[8:12], class)
```


##	Tidy & Manipulate Data I 

Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.

The dataset was checked to make sure it adheres to the tidy data priniciples of 
1) Each variable must have its own column.
2) Each observation must have its own row.
3) Each value must have its own cell.

Our dataset follows these priniciples and thereby is in a tidy format.

```{r}
orders_final
```


##	Tidy & Manipulate Data II 

The column shipping days was formed by finding the days difference between the order purchase day and the delivery day. It was created with the help of the mutate function with the units set to days. Following this, the floor function was applied to all observartions in the new column as days represent whole number.

```{r}
#creating shipping days variable using mutate
orders_final <- mutate(orders_final,shipping_days = difftime(order_delivered_customer_date, order_purchase_timestamp, units = "days"))
#Rounf the date using floor for shipping days
orders_final$shipping_days <- floor(orders_final$shipping_days)
```


##	Scan I 

* The number of NA values were obtained by using the is.na function on the dataset and then using the sum function on the result of the former. 
* The total number of NA values were observed to be 8327. To get a more clearer picture, the NA values in each column was observed using the colSums function. 
* The NA values were removed from the dataset using the na.omit function. The decision to remove them was made since they make up only 3 percent of the dataset and eliminating them won't cause much of an impact to the dataset. 
* The date and shipping days columns are all closely related to each other and missing values in one column could cause missing values in the others, this is why the number of NA values is shown at 8327. 
* Validations were made to make sure the NA values were removed. 
* checks were made to make sure if the dataset contained any NAN or infinite values using the is.nan and is.infinite functions respectively. This was applied to all the columns using the sapply function. From this, we found out there were no infinite or NaN values present in the dataset.

```{r}
#Checking for NA values(Missing values)
sum(is.na(orders_final))

colSums(is.na(orders_final))
##3132/103886 * 100 = 3.015 <- order_delivered customer date

#Percentage of missing values calculation
percentage_missing_values <- sum(is.na(orders_final$order_delivered_customer_date))/length(orders_final$order_delivered_customer_date) * 100
percentage_missing_values

#After removing missing values
orders_final <- na.omit(orders_final)
#validating after updating the dataset
colSums(is.na(orders_final))

#Code for checking any NaN (Not a number)
sum(sapply(orders_final, is.nan))

#Code for checking any infinite values
sum(sapply(orders_final, is.infinite))
```


##	Scan II

* Numeric variables like payment sequential, payment_installments and payment_value were checked for outliers. 
* Payment sequential is handled by removing outlying observations as they contributed 4.3% of the dataset(calculation shown below).
* Capping approach was incorporated for payment value and payment installment by considering the nearest neighbor of the 95th or 5th percentile for the outlier value.
* Boxplots were plotted to visualize the outliers in these variables.

```{r}

pay_sequence_outliers <- boxplot(orders_final$payment_sequential, main="Distribution of payment sequential", col = "grey")

#calculating percentage of missing values
percentage_missing_values <- length(pay_sequence_outliers$out)/length(orders_final$payment_sequential) * 100
percentage_missing_values

#Payment sequence - removing outliers
orders_final <- orders_final[!(orders_final$payment_sequential %in% pay_sequence_outliers$out), ]
boxplot(orders_final$payment_sequential, main="Distribution of updated payment sequential", col = "lightblue")

#capping function
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

pay_installments_outliers <- boxplot(orders_final$payment_installments, main="Distribution of payment installments", col = "grey")

#payment installment - capping method
orders_final$payment_installments <- orders_final$payment_installments %>% cap()
boxplot(orders_final$payment_installments, main="Distribution of updated payment installments", col = "lightblue")

pay_value_outliers <- boxplot(orders_final$payment_value, main="Distribution of payment value", col = "grey")

#payment value - capping method
orders_final$payment_value <- orders_final$payment_value %>% cap()
boxplot(orders_final$payment_value, main="Distribution of updated payment value", col = "lightblue")
```


##	Transform 

For the final task, we decided to use perfrom logarithmic transformation on the payment value variable. The reason for going with this transformation was to reduce the right skewness of the variable and normailize it. The transformation was performed using the log() function. Before and After histograms along with a distribution curve was plotted to visualize the transformation. 

```{r}
hist(orders_final$payment_value, main = "Histogram of payment value", xlab = "Payment value")
#logarithmic transformation of payement value variable as it was right skewed
orders_final <- orders_final %>% mutate(log_payment_value = log(orders_final$payment_value))

#after the transformation, the values are normalized.
h <- hist(orders_final$log_payment_value, breaks = 20, xlim=c(2,8), main = "Histogram of log  transformed payment value", xlab = "Payment value in terms of log")
xfit <- seq(min(orders_final$log_payment_value), max(orders_final$log_payment_value), length = length(orders_final$log_payment_value)) 
yfit <- dnorm(xfit, mean = mean(orders_final$log_payment_value), sd = sd(orders_final$log_payment_value)) 
yfit <- yfit * diff(h$mids[1:2]) * length(orders_final$log_payment_value) 
lines(xfit, yfit, col = "blue", lwd = 2)
```

<br>
<br>
