The data set available to us is a hypothetical consumer data set. Our primary objective is to clean it and then extract insights that could help the organisation improve its sales and service. More specifically, we will focus the cleaning on the columns required for the respective tasks.
We first set the working directory.
setwd("C:/Users/salil/Desktop/AllDocuments/AnalyticsEdgeFolder")
We then load in our data set.
ConsumerData <- read.csv("ConsumerData.csv", na.strings = "")
We use the ‘na.strings’ argument to replace blank cells with ‘NA’, which makes them easier to handle in the later steps. This is how our initial data set looks.
head(ConsumerData)
## order_id user_id address_id address_pincode address_city order_status
## 1 44990 106778 42862 122004 Gurugram DELIVERED
## 2 32971 121745 42561 110061 Delhi DELIVERED
## 3 42071 142399 45637 122001 Gurugram DELIVERED
## 4 29518 122805 50676 110071 Gurgaon CANCELLED
## 5 42112 135852 53282 110074 Delhi CANCELLED
## 6 31299 102901 42358 122015 Gurugram DELIVERED
## total_amount created_at payment_status
## 1 10.0 2020-07-10 16:23:51.884369+05:30 COD
## 2 467.5 2020-06-20 18:48:43.802539+05:30 COD
## 3 104.0 2020-07-05 21:25:14.544634+05:30 COD
## 4 0.0 2020-06-15 13:32:56.568548+05:30 COD
## 5 0.0 2020-07-05 23:08:51.535282+05:30 COD
## 6 77.0 2020-06-18 18:32:48.215438+05:30 COD
## team_leader coupon_code processing_at discount_amount commission
## 1 nageshwarmahto90157844 <NA> 10-07-2020 0 0.42
## 2 diipak72899731 <NA> 20-06-2020 0 18.17
## 3 deepakyadav80106581 SASTAKHAREEDO 06-07-2020 15 2.79
## 4 archana98689836 CM150 15-06-2020 0 0.00
## 5 ramburman81309118 <NA> 06-07-2020 0 0.00
## 6 prashant77038484 <NA> 18-06-2020 0 3.40
There are two date/time columns. The first records the date and the time at which the order was placed by the customer; the second records the date on which the order was processed. We will work with the former.
Since we will be dealing with dates, we load the ‘lubridate’ library.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
We split this column into separate date and time columns and drop the original. The time column keeps only the ‘hour’ and ‘minute’ values; the ‘second’ values would be of no significant use to us.
ConsumerData$created_on <- format(as.Date(ConsumerData$created_at), "%d-%m-%Y")
ConsumerData$time <- format(as.POSIXct(ConsumerData$created_at), format = "%H:%M")
ConsumerData$created_at <- NULL
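Equivalently, since lubridate is already loaded, the split could be done before dropping ‘created_at’ along the following lines (a sketch only, assuming the timestamps parse as ISO-8601 with their +05:30 offset):
# Hypothetical lubridate-based alternative to the base-R split above
ts <- ymd_hms(ConsumerData$created_at, tz = "Asia/Kolkata")
ConsumerData$created_on <- format(ts, "%d-%m-%Y")
ConsumerData$time <- format(ts, "%H:%M")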
We also need to check whether the date on which an order was placed and the date on which it was processed are always the same. If they are, we can afford to delete one of the two columns.
length(which(ConsumerData[,11]!=ConsumerData[,14]))
## [1] 6622
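To put this count in perspective, we can express it as a share of all rows; with 6,622 mismatches out of the 37,735 orders this comes to roughly 17.5%. A quick check:
# Percentage of orders processed on a date different from the order date
length(which(ConsumerData[,11] != ConsumerData[,14])) / nrow(ConsumerData) * 100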
This is a significant share of the orders; we therefore keep both columns.
Before we continue, let us rearrange the columns of the data set to make it easier to interpret.
col_order <- c("order_id", "user_id", "address_city", "address_id", "address_pincode", "created_on", "time", "processing_at", "total_amount", "payment_status", "order_status", "coupon_code", "discount_amount", "team_leader", "commission")
ConsumerData1 <- ConsumerData[, col_order]
This is how our data set looks so far.
head(ConsumerData1)
## order_id user_id address_city address_id address_pincode created_on time
## 1 44990 106778 Gurugram 42862 122004 10-07-2020 16:23
## 2 32971 121745 Delhi 42561 110061 20-06-2020 18:48
## 3 42071 142399 Gurugram 45637 122001 05-07-2020 21:25
## 4 29518 122805 Gurgaon 50676 110071 15-06-2020 13:32
## 5 42112 135852 Delhi 53282 110074 05-07-2020 23:08
## 6 31299 102901 Gurugram 42358 122015 18-06-2020 18:32
## processing_at total_amount payment_status order_status coupon_code
## 1 10-07-2020 10.0 COD DELIVERED <NA>
## 2 20-06-2020 467.5 COD DELIVERED <NA>
## 3 06-07-2020 104.0 COD DELIVERED SASTAKHAREEDO
## 4 15-06-2020 0.0 COD CANCELLED CM150
## 5 06-07-2020 0.0 COD CANCELLED <NA>
## 6 18-06-2020 77.0 COD DELIVERED <NA>
## discount_amount team_leader commission
## 1 0 nageshwarmahto90157844 0.42
## 2 0 diipak72899731 18.17
## 3 15 deepakyadav80106581 2.79
## 4 0 archana98689836 0.00
## 5 0 ramburman81309118 0.00
## 6 0 prashant77038484 3.40
unique(ConsumerData1$address_city)
## [1] Gurugram Delhi Gurgaon GURUGRAM Rewari NaN DELHI
## [8] Delhi NCR <NA>
## Levels: Delhi DELHI Delhi NCR Gurgaon Gurugram GURUGRAM NaN Rewari
length(unique(ConsumerData1$address_city))
## [1] 9
So there are a total of 9 unique values for ‘address_city’. To simplify the analysis, we assign a common name to levels that refer to the same city and convert the ‘NaN’ entries to NA as well.
ConsumerData1$address_city[ConsumerData1$address_city=="NaN"] <- NA
ConsumerData1$address_city[which(ConsumerData1$address_city=="GURUGRAM")] <- c("Gurugram")
ConsumerData1$address_city[which(ConsumerData1$address_city=="Gurgaon")] <- c("Gurugram")
ConsumerData1$address_city[which(ConsumerData1$address_city=="Delhi")] <- c("Delhi NCR")
ConsumerData1$address_city[which(ConsumerData1$address_city=="DELHI")] <- c("Delhi NCR")
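For reference, the same merges can be expressed in one pass on the factor levels themselves (a behavior-equivalent sketch in base R):
# Merge duplicate spellings by assigning them a common level name
f <- ConsumerData1$address_city
levels(f)[levels(f) %in% c("GURUGRAM", "Gurgaon")] <- "Gurugram"
levels(f)[levels(f) %in% c("Delhi", "DELHI")] <- "Delhi NCR"
ConsumerData1$address_city <- droplevels(f)  # drop now-unused levels such as "NaN"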
Now it is time to deal with the NAs.
colSums(is.na(ConsumerData1))
## order_id user_id address_city address_id address_pincode
## 0 0 64 40 853
## created_on time processing_at total_amount payment_status
## 0 0 0 0 0
## order_status coupon_code discount_amount team_leader commission
## 0 30937 0 0 0
We will try to remove the NAs from the address columns. First we need to check whether every ‘address_id’ has a unique ‘address_city’ and ‘address_pincode’ associated with it, or whether there are multiple values for the same id. If the answer is yes, we can use the ‘fill’ function from ‘tidyr’ to populate the NA entries from any populated row with the same ‘address_id’. If the answer is no, we will check what percentage of the total observations the NA values represent; if that percentage is small, we will simply delete the NAs of the column in question.
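Had the mapping turned out to be unique, the fill step might look like this (a sketch only, with a hypothetical ‘ConsumerDataFilled’ name; applying it to conflicting data would propagate wrong values):
library(dplyr)
library(tidyr)
# Hypothetical fill: within each address_id, copy known values into NA rows
ConsumerDataFilled <- ConsumerData1 %>%
  group_by(address_id) %>%
  fill(address_city, address_pincode, .direction = "downup") %>%
  ungroup()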
We start by tabulating, for every pincode, how often each address city appears. If no pincode is associated with more than one city, we can proceed as described above.
a <- data.frame(table(ConsumerData1$address_pincode, ConsumerData1$address_city))
aGGN <- a[a$Var2=="Gurugram",]
aDELNCR <- a[a$Var2=="Delhi NCR",]
aREW <- a[a$Var2=="Rewari",]
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Joina1 <- inner_join(aGGN, aDELNCR, by="Var1")
Joina2 <- inner_join(Joina1, aREW, by="Var1")
This is how the resulting data set looks.
head(Joina2)
## Var1 Var2.x Freq.x Var2.y Freq.y Var2 Freq
## 1 110001 Gurugram 4 Delhi NCR 44 Rewari 0
## 2 110003 Gurugram 0 Delhi NCR 5 Rewari 0
## 3 110005 Gurugram 0 Delhi NCR 17 Rewari 0
## 4 110006 Gurugram 1 Delhi NCR 8 Rewari 0
## 5 110007 Gurugram 0 Delhi NCR 1 Rewari 0
## 6 110008 Gurugram 2 Delhi NCR 5 Rewari 0
We now find the intersections, that is, the pincodes against which two or three different address cities have been recorded.
which(Joina2$Freq.x>0 & Joina2$Freq.y>0 & Joina2$Freq>0)
## [1] 67 98 145
which(Joina2$Freq.x>0 & Joina2$Freq.y>0)
## [1] 1 4 6 11 14 15 17 18 19 22 24 25 26 27 30 31 32 35 37
## [20] 38 39 40 41 44 45 49 50 51 53 54 59 60 61 63 64 65 67 68
## [39] 69 72 74 76 77 78 79 80 81 82 83 89 96 98 99 100 102 110 112
## [58] 113 119 121 130 145 156 161 169 172 175 176 213
which(Joina2$Freq.x>0 & Joina2$Freq>0)
## [1] 67 98 127 128 137 139 141 142 143 145 146 147 148 151
which(Joina2$Freq.y>0 & Joina2$Freq>0)
## [1] 67 98 145 157 185
There are a significant number of such conflicts.
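An equivalent, more direct way to count such conflicts with the dplyr verbs just loaded (a sketch):
# Pincodes associated with more than one distinct (non-NA) city
ConsumerData1 %>%
  filter(!is.na(address_pincode), !is.na(address_city)) %>%
  group_by(address_pincode) %>%
  summarise(n_cities = n_distinct(address_city)) %>%
  filter(n_cities > 1) %>%
  nrow()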
We now repeat the check using ‘address_id’ in place of ‘address_pincode’.
b <- data.frame(table(ConsumerData1$address_id, ConsumerData1$address_city))
bGGN <- b[b$Var2=="Gurugram",]
bDELNCR <- b[b$Var2=="Delhi NCR",]
bREW <- b[b$Var2=="Rewari",]
Joinb1 <- inner_join(bGGN, bDELNCR, by="Var1")
Joinb2 <- inner_join(Joinb1, bREW, by="Var1")
Again we find the points of intersection.
which(Joinb2$Freq.x>0 & Joinb2$Freq.y>0 & Joinb2$Freq>0)
## [1] 1148 1322
which(Joinb2$Freq.x>0 & Joinb2$Freq.y>0)
## [1] 8 11 13 19 33 34 36 42 44 47 49 52 53 54 55
## [16] 56 57 59 60 61 65 82 91 99 103 106 108 109 112 114
## [31] 115 116 117 119 120 121 122 124 134 150 151 158 161 163 167
## [46] 168 169 170 171 173 185 186 190 199 200 202 209 222 223 227
## [61] 238 247 251 256 261 262 263 268 272 275 277 278 287 295 297
## [76] 301 303 312 326 329 336 337 338 341 348 361 362 363 368 370
## [91] 383 387 389 403 414 423 443 447 501 518 547 621 666 673 730
## [106] 736 757 768 774 790 799 803 847 871 877 882 894 898 901 903
## [121] 905 915 954 956 970 978 988 993 1023 1028 1035 1039 1041 1050 1051
## [136] 1089 1095 1102 1104 1113 1123 1148 1253 1298 1322 1343 1367 1370 1372 1379
## [151] 1381 1387 1405 1453 1465 1518 1524 1541 1545 1562 1573 1576 1590 1600 1633
## [166] 1635 1642 1697 1708 1718 1746 1751 1758 1760
which(Joinb2$Freq.x>0 & Joinb2$Freq>0)
## [1] 35 39 614 950 985 1006 1019 1040 1047 1057 1070 1084 1090 1119 1128
## [16] 1132 1137 1139 1140 1143 1144 1145 1148 1149 1151 1166 1167 1168 1169 1183
## [31] 1185 1188 1191 1200 1205 1206 1210 1212 1214 1221 1222 1229 1234 1239 1242
## [46] 1246 1252 1254 1256 1258 1262 1264 1278 1281 1282 1288 1295 1301 1316 1318
## [61] 1322 1327 1331 1341 1354 1361 1363 1368 1389 1409 1414 1436 1441 1455 1459
## [76] 1460 1476 1526 1651 1669 1696 1732
which(Joinb2$Freq.y>0 & Joinb2$Freq>0)
## [1] 1148 1202 1296 1322 1452 1668
This too shows a significant number of conflicts.
As one last attempt, we check whether every ‘user_id’, that is, every user, is associated with a single address city, using the same method as for ‘address_id’ and ‘address_pincode’.
u <- data.frame(table(ConsumerData1$user_id, ConsumerData1$address_city))
uGGN <- u[u$Var2=="Gurugram",]
uDELNCR <- u[u$Var2=="Delhi NCR",]
uREW <- u[u$Var2=="Rewari",]
Joinc1 <- inner_join(uGGN, uDELNCR, by="Var1")
Joinc2 <- inner_join(Joinc1, uREW, by="Var1")
The following are the points of intersection.
which(Joinc2$Freq.x>0 & Joinc2$Freq.y>0 & Joinc2$Freq>0)
## [1] 5042 5147 6000 6324 6706 6972 7071
which(Joinc2$Freq.x>0 & Joinc2$Freq.y>0)
## [1] 1 5 7 9 12 23 55 61 63 77 86 87 91 92 99
## [16] 104 106 108 112 115 159 162 180 188 191 196 210 228 230 251
## [31] 255 281 283 288 292 300 301 302 303 320 353 359 361 371 374
## [46] 379 390 399 401 402 405 406 434 435 436 450 464 479 487 488
## [61] 489 492 493 498 501 504 505 511 512 518 520 524 525 531 537
## [76] 551 557 564 567 573 586 591 593 595 596 610 612 618 624 635
## [91] 646 650 651 658 661 663 667 676 678 689 702 711 714 715 726
## [106] 733 736 742 747 760 762 763 804 813 815 826 827 851 873 876
## [121] 887 899 902 903 906 909 910 915 917 931 940 941 948 950 952
## [136] 954 956 959 960 962 965 980 983 986 990 993 994 1007 1008 1014
## [151] 1020 1023 1024 1031 1033 1041 1043 1055 1058 1062 1068 1081 1084 1089 1090
## [166] 1101 1133 1134 1142 1157 1164 1166 1182 1187 1200 1221 1227 1232 1234 1246
## [181] 1258 1267 1270 1272 1275 1279 1281 1282 1287 1294 1303 1305 1306 1313 1314
## [196] 1317 1319 1330 1334 1336 1348 1350 1351 1357 1365 1366 1367 1372 1375 1400
## [211] 1401 1404 1409 1410 1415 1429 1433 1435 1444 1449 1460 1463 1464 1472 1473
## [226] 1475 1477 1479 1480 1488 1493 1524 1534 1553 1555 1560 1561 1567 1569 1574
## [241] 1575 1578 1579 1581 1585 1590 1591 1594 1596 1599 1600 1611 1626 1630 1631
## [256] 1660 1668 1678 1698 1705 1706 1709 1718 1748 1761 1787 1793 1798 1804 1816
## [271] 1823 1827 1836 1837 1845 1848 1852 1854 1857 1865 1880 1884 1890 1906 1908
## [286] 1915 1932 1933 1937 1946 1950 1956 1958 1962 1980 2025 2061 2066 2073 2076
## [301] 2078 2116 2127 2139 2142 2150 2156 2161 2178 2181 2187 2196 2200 2202 2275
## [316] 2316 2324 2391 2395 2432 2480 2487 2489 2500 2501 2508 2521 2644 2667 2840
## [331] 2862 2933 3008 3037 3091 3132 3157 3158 3186 3216 3247 3273 3278 3329 3342
## [346] 3356 3402 3405 3410 3414 3505 3548 3558 3559 3580 3582 3597 3599 3603 3616
## [361] 3659 3689 3708 3710 3721 3731 3745 3746 3757 3767 3796 3808 3848 3850 3853
## [376] 3862 3876 3906 3909 3915 3929 3939 3964 4010 4044 4058 4064 4095 4155 4206
## [391] 4319 4329 4330 4403 4409 4437 4446 4460 4466 4502 4503 4530 4552 4561 4575
## [406] 4576 4580 4599 4621 4638 4679 4803 4804 4807 4826 4839 4862 4864 4945 5030
## [421] 5042 5102 5147 5159 5162 5191 5301 5334 5484 5512 5567 5569 5605 5609 5611
## [436] 5628 5629 5649 5653 5814 5830 5833 5837 5877 5885 5956 5958 5966 6000 6030
## [451] 6044 6052 6165 6188 6269 6281 6283 6318 6324 6331 6434 6440 6445 6450 6494
## [466] 6504 6572 6587 6607 6625 6634 6657 6675 6695 6706 6820 6821 6896 6901 6921
## [481] 6927 6940 6957 6972 6976 7010 7038 7071 7081 7129 7131 7132 7137 7138 7170
## [496] 7184 7205 7223 7248 7256 7403 7405 7407 7414 7427 7428 7491 7504 7513 7642
## [511] 7656 7707 7741 7748 7752 7884 7923 7932 7973 7982 7988 8064 8137 8138 8231
## [526] 8284 9053
which(Joinc2$Freq.x>0 & Joinc2$Freq>0)
## [1] 114 133 598 2975 4468 4548 4729 4748 4784 4819 4843 4861 4874 4886 4892
## [16] 4900 4908 4918 4919 4920 4943 4944 4947 4950 4961 4963 4964 4969 4972 4974
## [31] 4975 4978 4980 4985 5021 5037 5038 5042 5050 5059 5075 5081 5082 5087 5090
## [46] 5111 5113 5114 5132 5147 5150 5177 5216 5233 5239 5289 5294 5363 5370 5389
## [61] 5394 5436 5459 5476 5491 5510 5517 5518 5558 5564 5591 5599 5625 5703 5806
## [76] 5817 5827 5828 5834 5843 5858 5860 5867 5886 5887 5895 5907 5910 5911 5922
## [91] 5928 5929 5930 5935 5938 6000 6002 6015 6027 6050 6054 6056 6208 6280 6308
## [106] 6324 6325 6351 6361 6382 6383 6384 6415 6433 6435 6488 6493 6527 6560 6566
## [121] 6577 6626 6639 6649 6651 6669 6674 6680 6702 6706 6707 6731 6733 6752 6767
## [136] 6778 6785 6799 6806 6824 6830 6831 6837 6918 6922 6929 6972 7071 7195 7208
## [151] 7222 7305 7392 7467 7563 7598 7798 7814 7917 7959 7980 8633
which(Joinc2$Freq.y>0 & Joinc2$Freq>0)
## [1] 4445 5042 5147 6000 6324 6706 6854 6972 6982 7071 7078 7192 7495 7680 7720
## [16] 8781 8811 8850
This probably happens because the same user has placed orders to more than one address.
It is therefore not possible to fill in the NAs from other rows with matching ids. The conflicts most likely arise because some people entered wrong address details while placing orders, which also means that entries without any conflict may contain wrong details as well.
To rectify this problem, we would need a lookup table from a credible source that correctly maps address ids to their pincodes and address cities.
Nor can we use a classification or clustering algorithm, as there is no well-defined relationship between the columns: pincodes and address ids are at times assigned arbitrarily, and we cannot be sure of their credibility.
So, as mentioned earlier, we check what percentage of the total number of entries the NAs represent, for ‘address_city’, ‘address_id’ and ‘address_pincode’ respectively.
(64/37735)*100
## [1] 0.1696038
(40/37735)*100
## [1] 0.1060024
(853/37735)*100
## [1] 2.260501
A common rule of thumb is that NAs can be removed comfortably when they make up less than 5% of the data, and as we can see, these values are far below that. The analyst can therefore afford to drop the NAs of whichever column is needed for the analysis.
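For instance, an analyst working with ‘address_city’ could drop its NA rows as follows (a sketch; the name ‘ConsumerDataCity’ is hypothetical):
# Keep only rows where address_city is populated
ConsumerDataCity <- ConsumerData1[!is.na(ConsumerData1$address_city), ]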
The first thing we do is tabulate every user's total number of orders per month. For this we first create a separate column recording the month in which each order was placed.
ConsumerData1$created_on <- strptime(ConsumerData1$created_on, format = "%d-%m-%Y")
ConsumerData1$Month <- months(ConsumerData1$created_on)
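Since lubridate is attached, an equivalent formulation (a sketch) uses month() with labels:
# month(..., label = TRUE, abbr = FALSE) returns the full month name
ConsumerData1$Month <- as.character(month(ConsumerData1$created_on, label = TRUE, abbr = FALSE))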
This is how the modified data set looks. There are now a total of 16 columns, and the last one records the month in which the order was placed.
head(ConsumerData1)
## order_id user_id address_city address_id address_pincode created_on time
## 1 44990 106778 Gurugram 42862 122004 2020-07-10 16:23
## 2 32971 121745 Delhi NCR 42561 110061 2020-06-20 18:48
## 3 42071 142399 Gurugram 45637 122001 2020-07-05 21:25
## 4 29518 122805 Gurugram 50676 110071 2020-06-15 13:32
## 5 42112 135852 Delhi NCR 53282 110074 2020-07-05 23:08
## 6 31299 102901 Gurugram 42358 122015 2020-06-18 18:32
## processing_at total_amount payment_status order_status coupon_code
## 1 10-07-2020 10.0 COD DELIVERED <NA>
## 2 20-06-2020 467.5 COD DELIVERED <NA>
## 3 06-07-2020 104.0 COD DELIVERED SASTAKHAREEDO
## 4 15-06-2020 0.0 COD CANCELLED CM150
## 5 06-07-2020 0.0 COD CANCELLED <NA>
## 6 18-06-2020 77.0 COD DELIVERED <NA>
## discount_amount team_leader commission Month
## 1 0 nageshwarmahto90157844 0.42 July
## 2 0 diipak72899731 18.17 June
## 3 15 deepakyadav80106581 2.79 July
## 4 0 archana98689836 0.00 June
## 5 0 ramburman81309118 0.00 July
## 6 0 prashant77038484 3.40 June
We now create a new data set that counts each user's orders per month.
MMFreq <- data.frame(table(ConsumerData1$user_id, ConsumerData1$Month))
MMFreqMay <- MMFreq[MMFreq$Var2=="May",]
MMFreqJune <- MMFreq[MMFreq$Var2=="June",]
MMFreqJuly <- MMFreq[MMFreq$Var2=="July",]
JoinMMFreq1 <- inner_join(MMFreqMay, MMFreqJune, by="Var1")
JoinMMFreq2 <- inner_join(JoinMMFreq1, MMFreqJuly, by="Var1")
We now make some slight changes to make the table look a little more presentable.
colnames(JoinMMFreq2) <- c("user_id", "May", "Freq in May", "June", "Freq in June", "July", "Freq in July")
JoinMMFreq2[,c(2,4,6)] <- NULL
This is how the resulting data set looks, showing the first and last few rows.
rbind(head(JoinMMFreq2), tail(JoinMMFreq2))
## user_id Freq in May Freq in June Freq in July
## 1 55 4 6 3
## 2 158 0 2 0
## 3 371 1 0 1
## 4 382 0 2 0
## 5 1480 6 7 5
## 6 4970 1 0 2
## 9749 156317 0 0 1
## 9750 156320 0 0 2
## 9751 156348 0 0 1
## 9752 156354 0 0 1
## 9753 156369 0 0 1
## 9754 156386 0 0 1
Let’s find the number of users who were active in any two consecutive months and in all three months.
length(which(JoinMMFreq2$`Freq in May`>0 & JoinMMFreq2$`Freq in June`>0))
## [1] 1135
length(which(JoinMMFreq2$`Freq in June`>0 & JoinMMFreq2$`Freq in July`>0))
## [1] 1601
length(which(JoinMMFreq2$`Freq in May`>0 & JoinMMFreq2$`Freq in June`>0 & JoinMMFreq2$`Freq in July`>0))
## [1] 474
So, there are 1135 users who ordered in both May and June, 1601 users who ordered in both June and July, and 474 users who ordered in all three months.
But we have to keep in mind that these counts also include orders that were later cancelled, and cancellations account for a significant number of orders.
Let us create a data set that excludes the ‘CANCELLED’ orders.
ConsumerData2 <- ConsumerData1[ConsumerData1$order_status != "CANCELLED", ]
This is how the filtered data set looks (first and last few rows).
rbind(head(ConsumerData2), tail(ConsumerData2))
## order_id user_id address_city address_id address_pincode created_on time
## 1 44990 106778 Gurugram 42862 122004 2020-07-10 16:23
## 2 32971 121745 Delhi NCR 42561 110061 2020-06-20 18:48
## 3 42071 142399 Gurugram 45637 122001 2020-07-05 21:25
## 6 31299 102901 Gurugram 42358 122015 2020-06-18 18:32
## 7 31025 99071 Gurugram 42280 122015 2020-06-18 11:32
## 8 14495 104199 Gurugram 39585 122002 2020-05-06 16:49
## 37730 28535 107330 Gurugram 44369 110030 2020-06-12 21:40
## 37731 28536 121976 Gurugram 39723 110077 2020-06-12 21:48
## 37732 28538 111983 Gurugram 47902 122016 2020-06-12 22:01
## 37733 28539 115290 Gurugram 47806 122003 2020-06-12 22:23
## 37734 49563 146591 Rewari 50874 123401 2020-07-17 22:46
## 37735 50703 127286 Rewari 51110 123401 2020-07-20 11:50
## processing_at total_amount payment_status order_status
## 1 10-07-2020 10.0 COD DELIVERED
## 2 20-06-2020 467.5 COD DELIVERED
## 3 06-07-2020 104.0 COD DELIVERED
## 6 18-06-2020 77.0 COD DELIVERED
## 7 18-06-2020 0.0 COD RTO
## 8 06-05-2020 0.0 COD RTO
## 37730 13-06-2020 111.0 COD DELIVERED
## 37731 13-06-2020 0.0 COD RTO
## 37732 13-06-2020 843.0 COD PARTIALLY_DELIVERED
## 37733 13-06-2020 65.0 PREPAID PARTIALLY_DELIVERED
## 37734 18-07-2020 156.0 COD DELIVERED
## 37735 20-07-2020 175.0 COD DISPATCHED
## coupon_code discount_amount team_leader commission Month
## 1 <NA> 0 nageshwarmahto90157844 0.42 July
## 2 <NA> 0 diipak72899731 18.17 June
## 3 SASTAKHAREEDO 15 deepakyadav80106581 2.79 July
## 6 <NA> 0 prashant77038484 3.40 June
## 7 <NA> 0 salendra95554459 0.00 June
## 8 <NA> 0 shankar70531010 0.00 May
## 37730 <NA> 0 sonu98712237 3.05 June
## 37731 <NA> 0 CITYMALL_OFFICI 0.00 June
## 37732 <NA> 0 mnorma99587922 43.46 June
## 37733 <NA> 0 ajit95829726 8.00 June
## 37734 <NA> 0 ramkishanchauhan81680460 8.94 July
## 37735 <NA> 0 princekumar97292156 0.00 July
We can then repeat the same tabulation on this filtered data set to obtain the cancellation-free frequencies, as sketched below.
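For example, the monthly tabulation can be rebuilt on ‘ConsumerData2’ with the identical steps (a sketch; the ‘NC’ names are ours):
MMFreqNC <- data.frame(table(ConsumerData2$user_id, ConsumerData2$Month))
JoinNC <- inner_join(MMFreqNC[MMFreqNC$Var2=="May",], MMFreqNC[MMFreqNC$Var2=="June",], by="Var1")
JoinNC <- inner_join(JoinNC, MMFreqNC[MMFreqNC$Var2=="July",], by="Var1")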
Let us now plot a graph which shows the frequency of orders at different hours of the day.
We first make a separate column highlighting the hours.
ConsumerData1$Hour <- format(strptime(ConsumerData1$time,"%H:%M"),'%H')
We now plot the graph.
HourlyOrders <- data.frame(table(ConsumerData1$Hour))
library(ggplot2)
ggplot(HourlyOrders, aes(x = Var1, y = Freq, size = Freq)) +
  geom_point() +
  xlab("Hour of the Day") +
  ylab("Number of Orders Placed")
So the maximum number of orders were placed in the evening, between 4 p.m. and 6 p.m.
As we checked earlier, the date on which an order was placed and the date on which it was processed were not always the same: some orders were processed only on the next day. So the next thing we do is compare the percentage of orders that were cancelled when processing happened on the same day with the percentage cancelled when processing was delayed.
We create a separate column that contains 1 if the order was processed on the same day and 0 if it was processed on a later day.
ConsumerData1$created_on <- format(as.Date(ConsumerData1$created_on), "%d-%m-%Y")
ConsumerData1$xx <- ifelse(ConsumerData1$created_on==ConsumerData1$processing_at, 1, 0)
Now we create a table of the order statuses. A caution on indexing: columns 11 and 17 of ‘ConsumerData1’ are ‘order_status’ and ‘Hour’ (the new flag ‘xx’ is column 18), so the table below breaks the statuses down by hour of the day; we select the columns by name to make that explicit. The comparison against ‘xx’ follows after it.
ConsumerData1Dates <- data.frame(table(ConsumerData1[, c("order_status", "Hour")]))
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.3.0 --
## ✓ tibble  2.1.3     ✓ purrr   0.3.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
ConsumerData1Dates <- spread(data = ConsumerData1Dates, key = order_status, value = Freq)
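Note that spread() has since been superseded in tidyr; the equivalent call with the newer interface, applied to the long-format table (i.e. as an alternative to the line above, not a follow-on step), would be:
# Hypothetical pivot_wider() equivalent of the spread() call above
pivot_wider(ConsumerData1Dates, names_from = order_status, values_from = Freq)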
ConsumerData1Dates
## Hour CANCELLED DELIVERED DISPATCHED FULFILLED NOT_FULFILLED PARTIAL_RTO
## 1 00 72 127 0 1 3 0
## 2 01 36 24 0 0 0 0
## 3 02 21 22 0 0 0 0
## 4 03 12 17 0 0 0 1
## 5 04 13 10 0 0 0 1
## 6 05 26 41 1 1 0 1
## 7 06 79 165 1 1 3 1
## 8 07 140 446 1 0 10 0
## 9 08 217 730 5 1 7 3
## 10 09 330 938 16 2 9 4
## 11 10 355 1140 8 6 16 0
## 12 11 681 1226 13 5 20 4
## 13 12 591 1372 14 4 23 7
## 14 13 551 1517 16 10 18 6
## 15 14 440 1469 13 0 17 10
## 16 15 478 1695 18 0 18 6
## 17 16 634 2273 23 0 21 14
## 18 17 779 2772 29 0 41 7
## 19 18 745 3852 47 0 55 9
## 20 19 591 1604 12 15 31 3
## 21 20 461 1021 9 16 15 7
## 22 21 352 781 11 14 12 2
## 23 22 295 572 6 6 5 2
## 24 23 159 328 9 1 5 0
## PARTIALLY_DELIVERED PARTIALLY_DISPATCHED PARTIALLY_FULFILLED
## 1 15 0 0
## 2 4 0 0
## 3 2 0 0
## 4 3 0 0
## 5 0 0 0
## 6 5 0 0
## 7 17 0 1
## 8 42 0 0
## 9 71 0 0
## 10 75 0 3
## 11 88 0 3
## 12 117 0 3
## 13 123 1 1
## 14 117 1 3
## 15 94 0 0
## 16 147 0 0
## 17 175 1 0
## 18 217 2 0
## 19 341 0 0
## 20 148 0 13
## 21 93 1 15
## 22 77 0 15
## 23 70 0 0
## 24 29 0 3
## PARTIALLY_RETURNED PENDING RETURNED RTO
## 1 0 0 0 16
## 2 0 0 0 7
## 3 0 0 0 3
## 4 0 0 0 9
## 5 0 0 0 2
## 6 0 0 0 3
## 7 0 0 1 21
## 8 1 0 3 41
## 9 0 1 2 78
## 10 0 3 3 97
## 11 0 2 10 101
## 12 0 2 3 138
## 13 1 1 5 170
## 14 0 1 5 145
## 15 0 0 6 208
## 16 0 1 2 162
## 17 0 1 8 214
## 18 1 1 9 239
## 19 0 2 8 425
## 20 1 6 4 156
## 21 0 3 2 114
## 22 0 0 3 80
## 23 0 3 1 72
## 24 0 0 0 40
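The same-day comparison itself comes from tabulating ‘order_status’ against the flag ‘xx’ (a sketch; the counts quoted below, 1734 cancelled and 3852 delivered for next-day processing versus 6324 and 20290 for same-day, are read off this table):
# Order status by same-day flag: xx = 1 (same day) vs xx = 0 (later)
table(ConsumerData1$order_status, ConsumerData1$xx)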
If we consider only the ‘CANCELLED’ and the ‘DELIVERED’ orders from that table, the cancellation percentages for next-day and same-day processing are:
(1734/(1734+3852))*100
## [1] 31.04189
(6324/(6324+20290))*100
## [1] 23.76193
So about 31% of orders were cancelled when processing happened on the next day, against about 24% when processing happened on the same day. The company should therefore speed up its order processing so that it does not lose customers.