The data set available to us is a hypothetical consumer data set. Our primary objective is to clean it and then extract insights that could help the organisation improve its sales and service. More specifically, we will focus the cleaning on the columns required for the respective tasks.
We first set the working directory.
setwd("C:/Users/salil/Desktop/AllDocuments/AnalyticsEdgeFolder")
We then load in our data set.
ConsumerData <- read.csv("ConsumerData.csv", na.strings = "")
We use the ‘na.strings’ argument to replace blank cells with ‘NA’, which makes them easier to handle in the later steps. This is how our initial data set looks.
head(ConsumerData)
## order_id user_id address_id address_pincode address_city order_status
## 1 44990 106778 42862 122004 Gurugram DELIVERED
## 2 32971 121745 42561 110061 Delhi DELIVERED
## 3 42071 142399 45637 122001 Gurugram DELIVERED
## 4 29518 122805 50676 110071 Gurgaon CANCELLED
## 5 42112 135852 53282 110074 Delhi CANCELLED
## 6 31299 102901 42358 122015 Gurugram DELIVERED
## total_amount created_at payment_status
## 1 10.0 2020-07-10 16:23:51.884369+05:30 COD
## 2 467.5 2020-06-20 18:48:43.802539+05:30 COD
## 3 104.0 2020-07-05 21:25:14.544634+05:30 COD
## 4 0.0 2020-06-15 13:32:56.568548+05:30 COD
## 5 0.0 2020-07-05 23:08:51.535282+05:30 COD
## 6 77.0 2020-06-18 18:32:48.215438+05:30 COD
## team_leader coupon_code processing_at discount_amount commission
## 1 nageshwarmahto90157844 <NA> 10-07-2020 0 0.42
## 2 diipak72899731 <NA> 20-06-2020 0 18.17
## 3 deepakyadav80106581 SASTAKHAREEDO 06-07-2020 15 2.79
## 4 archana98689836 CM150 15-06-2020 0 0.00
## 5 ramburman81309118 <NA> 06-07-2020 0 0.00
## 6 prashant77038484 <NA> 18-06-2020 0 3.40
There are two date/time columns. The first records the date and the time at which the order was placed by the customer; the second records the date on which the order was processed. We will work with the former.
Since we will be dealing with dates, we load the ‘lubridate’ library.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
We split this column into separate date and time columns and drop the original. The time column keeps only the ‘hour’ and ‘minute’ values; the ‘second’ values would be of no significant use to us.
ConsumerData$created_on <- format(as.Date(ConsumerData$created_at), "%d-%m-%Y")
ConsumerData$time <- format(as.POSIXct(ConsumerData$created_at), format = "%H:%M")
ConsumerData$created_at <- NULL
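Equivalently, since lubridate is already loaded, the split could be done before dropping ‘created_at’ along the following lines (a sketch only, assuming the timestamps parse as ISO-8601 with their +05:30 offset):
# Hypothetical lubridate-based alternative to the base-R split above
ts <- ymd_hms(ConsumerData$created_at, tz = "Asia/Kolkata")
ConsumerData$created_on <- format(ts, "%d-%m-%Y")
ConsumerData$time <- format(ts, "%H:%M")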
We also need to check whether the date on which an order was placed and the date on which it was processed are always the same. If they are, we can afford to delete one of the two columns.
length(which(ConsumerData[,11]!=ConsumerData[,14]))
## [1] 6622
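To put this count in perspective, we can express it as a share of all rows; with 6,622 mismatches out of the 37,735 orders this comes to roughly 17.5%. A quick check:
# Percentage of orders processed on a date different from the order date
length(which(ConsumerData[,11] != ConsumerData[,14])) / nrow(ConsumerData) * 100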
This is a significant share of the orders; we therefore keep both columns.
Before we continue, let us rearrange the columns of the data set to make it easier to interpret.
col_order <- c("order_id", "user_id", "address_city", "address_id", "address_pincode", "created_on", "time", "processing_at", "total_amount", "payment_status", "order_status", "coupon_code", "discount_amount", "team_leader", "commission")
ConsumerData1 <- ConsumerData[, col_order]
This is how our data set looks so far.
head(ConsumerData1)
## order_id user_id address_city address_id address_pincode created_on time
## 1 44990 106778 Gurugram 42862 122004 10-07-2020 16:23
## 2 32971 121745 Delhi 42561 110061 20-06-2020 18:48
## 3 42071 142399 Gurugram 45637 122001 05-07-2020 21:25
## 4 29518 122805 Gurgaon 50676 110071 15-06-2020 13:32
## 5 42112 135852 Delhi 53282 110074 05-07-2020 23:08
## 6 31299 102901 Gurugram 42358 122015 18-06-2020 18:32
## processing_at total_amount payment_status order_status coupon_code
## 1 10-07-2020 10.0 COD DELIVERED <NA>
## 2 20-06-2020 467.5 COD DELIVERED <NA>
## 3 06-07-2020 104.0 COD DELIVERED SASTAKHAREEDO
## 4 15-06-2020 0.0 COD CANCELLED CM150
## 5 06-07-2020 0.0 COD CANCELLED <NA>
## 6 18-06-2020 77.0 COD DELIVERED <NA>
## discount_amount team_leader commission
## 1 0 nageshwarmahto90157844 0.42
## 2 0 diipak72899731 18.17
## 3 15 deepakyadav80106581 2.79
## 4 0 archana98689836 0.00
## 5 0 ramburman81309118 0.00
## 6 0 prashant77038484 3.40
unique(ConsumerData1$address_city)
## [1] Gurugram Delhi Gurgaon GURUGRAM Rewari NaN DELHI
## [8] Delhi NCR <NA>
## Levels: Delhi DELHI Delhi NCR Gurgaon Gurugram GURUGRAM NaN Rewari
length(unique(ConsumerData1$address_city))
## [1] 9
So there are a total of 9 unique values for ‘address_city’. To simplify the analysis, we assign a common name to levels that refer to the same city and convert the ‘NaN’ entries to NA as well.
ConsumerData1$address_city[ConsumerData1$address_city=="NaN"] <- NA
ConsumerData1$address_city[which(ConsumerData1$address_city=="GURUGRAM")] <- c("Gurugram")
ConsumerData1$address_city[which(ConsumerData1$address_city=="Gurgaon")] <- c("Gurugram")
ConsumerData1$address_city[which(ConsumerData1$address_city=="Delhi")] <- c("Delhi NCR")
ConsumerData1$address_city[which(ConsumerData1$address_city=="DELHI")] <- c("Delhi NCR")
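For reference, the same merges can be expressed in one pass on the factor levels themselves (a behavior-equivalent sketch in base R):
# Merge duplicate spellings by assigning them a common level name
f <- ConsumerData1$address_city
levels(f)[levels(f) %in% c("GURUGRAM", "Gurgaon")] <- "Gurugram"
levels(f)[levels(f) %in% c("Delhi", "DELHI")] <- "Delhi NCR"
ConsumerData1$address_city <- droplevels(f)  # drop now-unused levels such as "NaN"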
Now it is time to deal with the NAs.
colSums(is.na(ConsumerData1))
## order_id user_id address_city address_id address_pincode
## 0 0 64 40 853
## created_on time processing_at total_amount payment_status
## 0 0 0 0 0
## order_status coupon_code discount_amount team_leader commission
## 0 30937 0 0 0
We will try to remove the NAs from the address columns. First we need to check whether every ‘address_id’ has a unique ‘address_city’ and ‘address_pincode’ associated with it, or whether there are multiple values for the same id. If the answer is yes, we can use the ‘fill’ function from ‘tidyr’ to populate the NA entries from any populated row with the same ‘address_id’. If the answer is no, we will check what percentage of the total observations the NA values represent; if that percentage is small, we will simply delete the NAs of the column in question.
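Had the mapping turned out to be unique, the fill step might look like this (a sketch only, with a hypothetical ‘ConsumerDataFilled’ name; applying it to conflicting data would propagate wrong values):
library(dplyr)
library(tidyr)
# Hypothetical fill: within each address_id, copy known values into NA rows
ConsumerDataFilled <- ConsumerData1 %>%
  group_by(address_id) %>%
  fill(address_city, address_pincode, .direction = "downup") %>%
  ungroup()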
We start by tabulating, for every pincode, how often each address city appears. If no pincode is associated with more than one city, we can proceed as described above.
a <- data.frame(table(ConsumerData1$address_pincode, ConsumerData1$address_city))
aGGN <- a[a$Var2=="Gurugram",]
aDELNCR <- a[a$Var2=="Delhi NCR",]
aREW <- a[a$Var2=="Rewari",]
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Joina1 <- inner_join(aGGN, aDELNCR, by="Var1")
Joina2 <- inner_join(Joina1, aREW, by="Var1")
This is how the resulting data set looks.
head(Joina2)
## Var1 Var2.x Freq.x Var2.y Freq.y Var2 Freq
## 1 110001 Gurugram 4 Delhi NCR 44 Rewari 0
## 2 110003 Gurugram 0 Delhi NCR 5 Rewari 0
## 3 110005 Gurugram 0 Delhi NCR 17 Rewari 0
## 4 110006 Gurugram 1 Delhi NCR 8 Rewari 0
## 5 110007 Gurugram 0 Delhi NCR 1 Rewari 0
## 6 110008 Gurugram 2 Delhi NCR 5 Rewari 0
We now find the intersections, that is, the pincodes against which two or three different address cities have been recorded.
which(Joina2$Freq.x>0 & Joina2$Freq.y>0 & Joina2$Freq>0)
## [1] 67 98 145
which(Joina2$Freq.x>0 & Joina2$Freq.y>0)
## [1] 1 4 6 11 14 15 17 18 19 22 24 25 26 27 30 31 32 35 37
## [20] 38 39 40 41 44 45 49 50 51 53 54 59 60 61 63 64 65 67 68
## [39] 69 72 74 76 77 78 79 80 81 82 83 89 96 98 99 100 102 110 112
## [58] 113 119 121 130 145 156 161 169 172 175 176 213
which(Joina2$Freq.x>0 & Joina2$Freq>0)
## [1] 67 98 127 128 137 139 141 142 143 145 146 147 148 151
which(Joina2$Freq.y>0 & Joina2$Freq>0)
## [1] 67 98 145 157 185
There are a significant number of such conflicts.
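An equivalent, more direct way to count such conflicts with the dplyr verbs just loaded (a sketch):
# Pincodes associated with more than one distinct (non-NA) city
ConsumerData1 %>%
  filter(!is.na(address_pincode), !is.na(address_city)) %>%
  group_by(address_pincode) %>%
  summarise(n_cities = n_distinct(address_city)) %>%
  filter(n_cities > 1) %>%
  nrow()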
We now repeat the check using ‘address_id’ in place of ‘address_pincode’.
b <- data.frame(table(ConsumerData1$address_id, ConsumerData1$address_city))
bGGN <- b[b$Var2=="Gurugram",]
bDELNCR <- b[b$Var2=="Delhi NCR",]
bREW <- b[b$Var2=="Rewari",]
Joinb1 <- inner_join(bGGN, bDELNCR, by="Var1")
Joinb2 <- inner_join(Joinb1, bREW, by="Var1")
Again we find the points of intersection.
which(Joinb2$Freq.x>0 & Joinb2$Freq.y>0 & Joinb2$Freq>0)
## [1] 1148 1322
which(Joinb2$Freq.x>0 & Joinb2$Freq.y>0)
## [1] 8 11 13 19 33 34 36 42 44 47 49 52 53 54 55
## [16] 56 57 59 60 61 65 82 91 99 103 106 108 109 112 114
## [31] 115 116 117 119 120 121 122 124 134 150 151 158 161 163 167
## [46] 168 169 170 171 173 185 186 190 199 200 202 209 222 223 227
## [61] 238 247 251 256 261 262 263 268 272 275 277 278 287 295 297
## [76] 301 303 312 326 329 336 337 338 341 348 361 362 363 368 370
## [91] 383 387 389 403 414 423 443 447 501 518 547 621 666 673 730
## [106] 736 757 768 774 790 799 803 847 871 877 882 894 898 901 903
## [121] 905 915 954 956 970 978 988 993 1023 1028 1035 1039 1041 1050 1051
## [136] 1089 1095 1102 1104 1113 1123 1148 1253 1298 1322 1343 1367 1370 1372 1379
## [151] 1381 1387 1405 1453 1465 1518 1524 1541 1545 1562 1573 1576 1590 1600 1633
## [166] 1635 1642 1697 1708 1718 1746 1751 1758 1760
which(Joinb2$Freq.x>0 & Joinb2$Freq>0)
## [1] 35 39 614 950 985 1006 1019 1040 1047 1057 1070 1084 1090 1119 1128
## [16] 1132 1137 1139 1140 1143 1144 1145 1148 1149 1151 1166 1167 1168 1169 1183
## [31] 1185 1188 1191 1200 1205 1206 1210 1212 1214 1221 1222 1229 1234 1239 1242
## [46] 1246 1252 1254 1256 1258 1262 1264 1278 1281 1282 1288 1295 1301 1316 1318
## [61] 1322 1327 1331 1341 1354 1361 1363 1368 1389 1409 1414 1436 1441 1455 1459
## [76] 1460 1476 1526 1651 1669 1696 1732
which(Joinb2$Freq.y>0 & Joinb2$Freq>0)
## [1] 1148 1202 1296 1322 1452 1668
This too shows a significant number of conflicts.
As one last attempt, we check whether every ‘user_id’, that is, every user, is associated with a single address city, using the same method as for ‘address_id’ and ‘address_pincode’.
u <- data.frame(table(ConsumerData1$user_id, ConsumerData1$address_city))
uGGN <- u[u$Var2=="Gurugram",]
uDELNCR <- u[u$Var2=="Delhi NCR",]
uREW <- u[u$Var2=="Rewari",]
Joinc1 <- inner_join(uGGN, uDELNCR, by="Var1")
Joinc2 <- inner_join(Joinc1, uREW, by="Var1")
The following are the points of intersection.
which(Joinc2$Freq.x>0 & Joinc2$Freq.y>0 & Joinc2$Freq>0)
## [1] 5042 5147 6000 6324 6706 6972 7071
which(Joinc2$Freq.x>0 & Joinc2$Freq.y>0)
## [1] 1 5 7 9 12 23 55 61 63 77 86 87 91 92 99
## [16] 104 106 108 112 115 159 162 180 188 191 196 210 228 230 251
## [31] 255 281 283 288 292 300 301 302 303 320 353 359 361 371 374
## [46] 379 390 399 401 402 405 406 434 435 436 450 464 479 487 488
## [61] 489 492 493 498 501 504 505 511 512 518 520 524 525 531 537
## [76] 551 557 564 567 573 586 591 593 595 596 610 612 618 624 635
## [91] 646 650 651 658 661 663 667 676 678 689 702 711 714 715 726
## [106] 733 736 742 747 760 762 763 804 813 815 826 827 851 873 876
## [121] 887 899 902 903 906 909 910 915 917 931 940 941 948 950 952
## [136] 954 956 959 960 962 965 980 983 986 990 993 994 1007 1008 1014
## [151] 1020 1023 1024 1031 1033 1041 1043 1055 1058 1062 1068 1081 1084 1089 1090
## [166] 1101 1133 1134 1142 1157 1164 1166 1182 1187 1200 1221 1227 1232 1234 1246
## [181] 1258 1267 1270 1272 1275 1279 1281 1282 1287 1294 1303 1305 1306 1313 1314
## [196] 1317 1319 1330 1334 1336 1348 1350 1351 1357 1365 1366 1367 1372 1375 1400
## [211] 1401 1404 1409 1410 1415 1429 1433 1435 1444 1449 1460 1463 1464 1472 1473
## [226] 1475 1477 1479 1480 1488 1493 1524 1534 1553 1555 1560 1561 1567 1569 1574
## [241] 1575 1578 1579 1581 1585 1590 1591 1594 1596 1599 1600 1611 1626 1630 1631
## [256] 1660 1668 1678 1698 1705 1706 1709 1718 1748 1761 1787 1793 1798 1804 1816
## [271] 1823 1827 1836 1837 1845 1848 1852 1854 1857 1865 1880 1884 1890 1906 1908
## [286] 1915 1932 1933 1937 1946 1950 1956 1958 1962 1980 2025 2061 2066 2073 2076
## [301] 2078 2116 2127 2139 2142 2150 2156 2161 2178 2181 2187 2196 2200 2202 2275
## [316] 2316 2324 2391 2395 2432 2480 2487 2489 2500 2501 2508 2521 2644 2667 2840
## [331] 2862 2933 3008 3037 3091 3132 3157 3158 3186 3216 3247 3273 3278 3329 3342
## [346] 3356 3402 3405 3410 3414 3505 3548 3558 3559 3580 3582 3597 3599 3603 3616
## [361] 3659 3689 3708 3710 3721 3731 3745 3746 3757 3767 3796 3808 3848 3850 3853
## [376] 3862 3876 3906 3909 3915 3929 3939 3964 4010 4044 4058 4064 4095 4155 4206
## [391] 4319 4329 4330 4403 4409 4437 4446 4460 4466 4502 4503 4530 4552 4561 4575
## [406] 4576 4580 4599 4621 4638 4679 4803 4804 4807 4826 4839 4862 4864 4945 5030
## [421] 5042 5102 5147 5159 5162 5191 5301 5334 5484 5512 5567 5569 5605 5609 5611
## [436] 5628 5629 5649 5653 5814 5830 5833 5837 5877 5885 5956 5958 5966 6000 6030
## [451] 6044 6052 6165 6188 6269 6281 6283 6318 6324 6331 6434 6440 6445 6450 6494
## [466] 6504 6572 6587 6607 6625 6634 6657 6675 6695 6706 6820 6821 6896 6901 6921
## [481] 6927 6940 6957 6972 6976 7010 7038 7071 7081 7129 7131 7132 7137 7138 7170
## [496] 7184 7205 7223 7248 7256 7403 7405 7407 7414 7427 7428 7491 7504 7513 7642
## [511] 7656 7707 7741 7748 7752 7884 7923 7932 7973 7982 7988 8064 8137 8138 8231
## [526] 8284 9053
which(Joinc2$Freq.x>0 & Joinc2$Freq>0)
## [1] 114 133 598 2975 4468 4548 4729 4748 4784 4819 4843 4861 4874 4886 4892
## [16] 4900 4908 4918 4919 4920 4943 4944 4947 4950 4961 4963 4964 4969 4972 4974
## [31] 4975 4978 4980 4985 5021 5037 5038 5042 5050 5059 5075 5081 5082 5087 5090
## [46] 5111 5113 5114 5132 5147 5150 5177 5216 5233 5239 5289 5294 5363 5370 5389
## [61] 5394 5436 5459 5476 5491 5510 5517 5518 5558 5564 5591 5599 5625 5703 5806
## [76] 5817 5827 5828 5834 5843 5858 5860 5867 5886 5887 5895 5907 5910 5911 5922
## [91] 5928 5929 5930 5935 5938 6000 6002 6015 6027 6050 6054 6056 6208 6280 6308
## [106] 6324 6325 6351 6361 6382 6383 6384 6415 6433 6435 6488 6493 6527 6560 6566
## [121] 6577 6626 6639 6649 6651 6669 6674 6680 6702 6706 6707 6731 6733 6752 6767
## [136] 6778 6785 6799 6806 6824 6830 6831 6837 6918 6922 6929 6972 7071 7195 7208
## [151] 7222 7305 7392 7467 7563 7598 7798 7814 7917 7959 7980 8633
which(Joinc2$Freq.y>0 & Joinc2$Freq>0)
## [1] 4445 5042 5147 6000 6324 6706 6854 6972 6982 7071 7078 7192 7495 7680 7720
## [16] 8781 8811 8850
This probably happens because the same user has placed orders to more than one address.
It is therefore not possible to fill in the NAs from other rows with matching ids. The conflicts most likely arise because some people entered wrong address details while placing orders, which also means that entries without any conflict may contain wrong details as well.
To rectify this problem, we would need a lookup table from a credible source that correctly maps address ids to their pincodes and address cities.
Nor can we use a classification or clustering algorithm, as there is no well-defined relationship between the columns: pincodes and address ids are at times assigned arbitrarily, and we cannot be sure of their credibility.
So, as mentioned earlier, we check what percentage of the total number of entries the NAs represent, for ‘address_city’, ‘address_id’ and ‘address_pincode’ respectively.
(64/37735)*100
## [1] 0.1696038
(40/37735)*100
## [1] 0.1060024
(853/37735)*100
## [1] 2.260501
A common rule of thumb is that NAs can be removed comfortably when they make up less than 5% of the data, and as we can see, these values are far below that. The analyst can therefore afford to drop the NAs of whichever column is needed for the analysis.
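For instance, an analyst working with ‘address_city’ could drop its NA rows as follows (a sketch; the name ‘ConsumerDataCity’ is hypothetical):
# Keep only rows where address_city is populated
ConsumerDataCity <- ConsumerData1[!is.na(ConsumerData1$address_city), ]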
The first thing we do is tabulate every user's total number of orders per month. For this we first create a separate column recording the month in which each order was placed.
ConsumerData1$created_on <- strptime(ConsumerData1$created_on, format = "%d-%m-%Y")
ConsumerData1$Month <- months(ConsumerData1$created_on)
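Since lubridate is attached, an equivalent formulation (a sketch) uses month() with labels:
# month(..., label = TRUE, abbr = FALSE) returns the full month name
ConsumerData1$Month <- as.character(month(ConsumerData1$created_on, label = TRUE, abbr = FALSE))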
This is how the modified data set looks. There are now a total of 16 columns, and the last one records the month in which the order was placed.
head(ConsumerData1)
## order_id user_id address_city address_id address_pincode created_on time
## 1 44990 106778 Gurugram 42862 122004 2020-07-10 16:23
## 2 32971 121745 Delhi NCR 42561 110061 2020-06-20 18:48
## 3 42071 142399 Gurugram 45637 122001 2020-07-05 21:25
## 4 29518 122805 Gurugram 50676 110071 2020-06-15 13:32
## 5 42112 135852 Delhi NCR 53282 110074 2020-07-05 23:08
## 6 31299 102901 Gurugram 42358 122015 2020-06-18 18:32
## processing_at total_amount payment_status order_status coupon_code
## 1 10-07-2020 10.0 COD DELIVERED <NA>
## 2 20-06-2020 467.5 COD DELIVERED <NA>
## 3 06-07-2020 104.0 COD DELIVERED SASTAKHAREEDO
## 4 15-06-2020 0.0 COD CANCELLED CM150
## 5 06-07-2020 0.0 COD CANCELLED <NA>
## 6 18-06-2020 77.0 COD DELIVERED <NA>
## discount_amount team_leader commission Month
## 1 0 nageshwarmahto90157844 0.42 July
## 2 0 diipak72899731 18.17 June
## 3 15 deepakyadav80106581 2.79 July
## 4 0 archana98689836 0.00 June
## 5 0 ramburman81309118 0.00 July
## 6 0 prashant77038484 3.40 June
We now create a new data set that counts each user's orders per month.
MMFreq <- data.frame(table(ConsumerData1$user_id, ConsumerData1$Month))
MMFreqMay <- MMFreq[MMFreq$Var2=="May",]
MMFreqJune <- MMFreq[MMFreq$Var2=="June",]
MMFreqJuly <- MMFreq[MMFreq$Var2=="July",]
JoinMMFreq1 <- inner_join(MMFreqMay, MMFreqJune, by="Var1")
JoinMMFreq2 <- inner_join(JoinMMFreq1, MMFreqJuly, by="Var1")
We now make some slight changes to make the table look a little more presentable.
colnames(JoinMMFreq2) <- c("user_id", "May", "Freq in May", "June", "Freq in June", "July", "Freq in July")
JoinMMFreq2[,c(2,4,6)] <- NULL
This is how the resulting data set looks, showing the first and last few rows.
rbind(head(JoinMMFreq2), tail(JoinMMFreq2))
## user_id Freq in May Freq in June Freq in July
## 1 55 4 6 3
## 2 158 0 2 0
## 3 371 1 0 1
## 4 382 0 2 0
## 5 1480 6 7 5
## 6 4970 1 0 2
## 9749 156317 0 0 1
## 9750 156320 0 0 2
## 9751 156348 0 0 1
## 9752 156354 0 0 1
## 9753 156369 0 0 1
## 9754 156386 0 0 1
Let’s find the number of users who were active in any two consecutive months and in all three months.
length(which(JoinMMFreq2$`Freq in May`>0 & JoinMMFreq2$`Freq in June`>0))
## [1] 1135
length(which(JoinMMFreq2$`Freq in June`>0 & JoinMMFreq2$`Freq in July`>0))
## [1] 1601
length(which(JoinMMFreq2$`Freq in May`>0 & JoinMMFreq2$`Freq in June`>0 & JoinMMFreq2$`Freq in July`>0))
## [1] 474
So, there are 1135 users who ordered in both May and June, 1601 users who ordered in both June and July, and 474 users who ordered in all three months.
But we have to keep in mind that these counts also include orders that were later cancelled, and cancellations account for a significant number of orders.
Let us create a data set that excludes the ‘CANCELLED’ orders.
ConsumerData2 <- ConsumerData1[ConsumerData1$order_status != "CANCELLED", ]
This is how the filtered data set looks (first and last few rows).
rbind(head(ConsumerData2), tail(ConsumerData2))
## order_id user_id address_city address_id address_pincode created_on time
## 1 44990 106778 Gurugram 42862 122004 2020-07-10 16:23
## 2 32971 121745 Delhi NCR 42561 110061 2020-06-20 18:48
## 3 42071 142399 Gurugram 45637 122001 2020-07-05 21:25
## 6 31299 102901 Gurugram 42358 122015 2020-06-18 18:32
## 7 31025 99071 Gurugram 42280 122015 2020-06-18 11:32
## 8 14495 104199 Gurugram 39585 122002 2020-05-06 16:49
## 37730 28535 107330 Gurugram 44369 110030 2020-06-12 21:40
## 37731 28536 121976 Gurugram 39723 110077 2020-06-12 21:48
## 37732 28538 111983 Gurugram 47902 122016 2020-06-12 22:01
## 37733 28539 115290 Gurugram 47806 122003 2020-06-12 22:23
## 37734 49563 146591 Rewari 50874 123401 2020-07-17 22:46
## 37735 50703 127286 Rewari 51110 123401 2020-07-20 11:50
## processing_at total_amount payment_status order_status
## 1 10-07-2020 10.0 COD DELIVERED
## 2 20-06-2020 467.5 COD DELIVERED
## 3 06-07-2020 104.0 COD DELIVERED
## 6 18-06-2020 77.0 COD DELIVERED
## 7 18-06-2020 0.0 COD RTO
## 8 06-05-2020 0.0 COD RTO
## 37730 13-06-2020 111.0 COD DELIVERED
## 37731 13-06-2020 0.0 COD RTO
## 37732 13-06-2020 843.0 COD PARTIALLY_DELIVERED
## 37733 13-06-2020 65.0 PREPAID PARTIALLY_DELIVERED
## 37734 18-07-2020 156.0 COD DELIVERED
## 37735 20-07-2020 175.0 COD DISPATCHED
## coupon_code discount_amount team_leader commission Month
## 1 <NA> 0 nageshwarmahto90157844 0.42 July
## 2 <NA> 0 diipak72899731 18.17 June
## 3 SASTAKHAREEDO 15 deepakyadav80106581 2.79 July
## 6 <NA> 0 prashant77038484 3.40 June
## 7 <NA> 0 salendra95554459 0.00 June
## 8 <NA> 0 shankar70531010 0.00 May
## 37730 <NA> 0 sonu98712237 3.05 June
## 37731 <NA> 0 CITYMALL_OFFICI 0.00 June
## 37732 <NA> 0 mnorma99587922 43.46 June
## 37733 <NA> 0 ajit95829726 8.00 June
## 37734 <NA> 0 ramkishanchauhan81680460 8.94 July
## 37735 <NA> 0 princekumar97292156 0.00 July
We can then repeat the same tabulation on this filtered data set to obtain the cancellation-free frequencies, as sketched below.
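For example, the monthly tabulation can be rebuilt on ‘ConsumerData2’ with the identical steps (a sketch; the ‘NC’ names are ours):
MMFreqNC <- data.frame(table(ConsumerData2$user_id, ConsumerData2$Month))
JoinNC <- inner_join(MMFreqNC[MMFreqNC$Var2=="May",], MMFreqNC[MMFreqNC$Var2=="June",], by="Var1")
JoinNC <- inner_join(JoinNC, MMFreqNC[MMFreqNC$Var2=="July",], by="Var1")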
Let us now plot a graph which shows the frequency of orders at different hours of the day.
We first make a separate column highlighting the hours.
ConsumerData1$Hour <- format(strptime(ConsumerData1$time,"%H:%M"),'%H')
We now plot the graph.
HourlyOrders <- data.frame(table(ConsumerData1$Hour))
library(ggplot2)
ggplot(HourlyOrders, aes(x = Var1, y = Freq, size = Freq)) +
  geom_point() +
  xlab("Hour of the Day") +
  ylab("Number of Orders Placed")
So the maximum number of orders were placed in the evening, between 4 p.m. and 6 p.m.
As we checked earlier, the date on which an order was placed and the date on which it was processed were not always the same: some orders were processed only on the next day. So the next thing we do is compare the percentage of orders that were cancelled when processing happened on the same day with the percentage cancelled when processing was delayed.
We create a separate column that contains 1 if the order was processed on the same day and 0 if it was processed on a later day.
ConsumerData1$created_on <- format(as.Date(ConsumerData1$created_on), "%d-%m-%Y")
ConsumerData1$xx <- ifelse(ConsumerData1$created_on==ConsumerData1$processing_at, 1, 0)
Now we create a table of the order statuses. A caution on indexing: columns 11 and 17 of ‘ConsumerData1’ are ‘order_status’ and ‘Hour’ (the new flag ‘xx’ is column 18), so the table below breaks the statuses down by hour of the day; we select the columns by name to make that explicit. The comparison against ‘xx’ follows after it.
ConsumerData1Dates <- data.frame(table(ConsumerData1[, c("order_status", "Hour")]))
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.3.0 --
## ✓ tibble  2.1.3     ✓ purrr   0.3.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
ConsumerData1Dates <- spread(data = ConsumerData1Dates, key = order_status, value = Freq)
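Note that spread() has since been superseded in tidyr; the equivalent call with the newer interface, applied to the long-format table (i.e. as an alternative to the line above, not a follow-on step), would be:
# Hypothetical pivot_wider() equivalent of the spread() call above
pivot_wider(ConsumerData1Dates, names_from = order_status, values_from = Freq)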
ConsumerData1Dates
## Hour CANCELLED DELIVERED DISPATCHED FULFILLED NOT_FULFILLED PARTIAL_RTO
## 1 00 72 127 0 1 3 0
## 2 01 36 24 0 0 0 0
## 3 02 21 22 0 0 0 0
## 4 03 12 17 0 0 0 1
## 5 04 13 10 0 0 0 1
## 6 05 26 41 1 1 0 1
## 7 06 79 165 1 1 3 1
## 8 07 140 446 1 0 10 0
## 9 08 217 730 5 1 7 3
## 10 09 330 938 16 2 9 4
## 11 10 355 1140 8 6 16 0
## 12 11 681 1226 13 5 20 4
## 13 12 591 1372 14 4 23 7
## 14 13 551 1517 16 10 18 6
## 15 14 440 1469 13 0 17 10
## 16 15 478 1695 18 0 18 6
## 17 16 634 2273 23 0 21 14
## 18 17 779 2772 29 0 41 7
## 19 18 745 3852 47 0 55 9
## 20 19 591 1604 12 15 31 3
## 21 20 461 1021 9 16 15 7
## 22 21 352 781 11 14 12 2
## 23 22 295 572 6 6 5 2
## 24 23 159 328 9 1 5 0
## PARTIALLY_DELIVERED PARTIALLY_DISPATCHED PARTIALLY_FULFILLED
## 1 15 0 0
## 2 4 0 0
## 3 2 0 0
## 4 3 0 0
## 5 0 0 0
## 6 5 0 0
## 7 17 0 1
## 8 42 0 0
## 9 71 0 0
## 10 75 0 3
## 11 88 0 3
## 12 117 0 3
## 13 123 1 1
## 14 117 1 3
## 15 94 0 0
## 16 147 0 0
## 17 175 1 0
## 18 217 2 0
## 19 341 0 0
## 20 148 0 13
## 21 93 1 15
## 22 77 0 15
## 23 70 0 0
## 24 29 0 3
## PARTIALLY_RETURNED PENDING RETURNED RTO
## 1 0 0 0 16
## 2 0 0 0 7
## 3 0 0 0 3
## 4 0 0 0 9
## 5 0 0 0 2
## 6 0 0 0 3
## 7 0 0 1 21
## 8 1 0 3 41
## 9 0 1 2 78
## 10 0 3 3 97
## 11 0 2 10 101
## 12 0 2 3 138
## 13 1 1 5 170
## 14 0 1 5 145
## 15 0 0 6 208
## 16 0 1 2 162
## 17 0 1 8 214
## 18 1 1 9 239
## 19 0 2 8 425
## 20 1 6 4 156
## 21 0 3 2 114
## 22 0 0 3 80
## 23 0 3 1 72
## 24 0 0 0 40
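The same-day comparison itself comes from tabulating ‘order_status’ against the flag ‘xx’ (a sketch; the counts quoted below, 1734 cancelled and 3852 delivered for next-day processing versus 6324 and 20290 for same-day, are read off this table):
# Order status by same-day flag: xx = 1 (same day) vs xx = 0 (later)
table(ConsumerData1$order_status, ConsumerData1$xx)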
If we consider only the ‘CANCELLED’ and the ‘DELIVERED’ orders from that table, the cancellation percentages for next-day and same-day processing are:
(1734/(1734+3852))*100
## [1] 31.04189
(6324/(6324+20290))*100
## [1] 23.76193
So about 31% of orders were cancelled when processing happened on the next day, against about 24% when processing happened on the same day. The company should therefore speed up its order processing so that it does not lose customers.