The following analysis was done in R, where the data was read in through the provided spreadsheet saved as “shopify.csv”. Below is code to read in the dataset and to output column summaries.
dataset <- read.csv("shopify.csv")
summary(dataset)
## order_id shop_id user_id order_amount
## Min. : 1 Min. : 1.00 Min. :607.0 Min. : 90
## 1st Qu.:1251 1st Qu.: 24.00 1st Qu.:775.0 1st Qu.: 163
## Median :2500 Median : 50.00 Median :849.0 Median : 284
## Mean :2500 Mean : 50.08 Mean :849.1 Mean : 3145
## 3rd Qu.:3750 3rd Qu.: 75.00 3rd Qu.:925.0 3rd Qu.: 390
## Max. :5000 Max. :100.00 Max. :999.0 Max. :704000
## total_items payment_method created_at
## Min. : 1.000 Length:5000 Length:5000
## 1st Qu.: 1.000 Class :character Class :character
## Median : 2.000 Mode :character Mode :character
## Mean : 8.787
## 3rd Qu.: 3.000
## Max. :2000.000
dataset[dataset$order_amount > 10000,c("order_id", "shop_id", "user_id", "order_amount", "total_items")]
## order_id shop_id user_id order_amount total_items
## 16 16 42 607 704000 2000
## 61 61 42 607 704000 2000
## 161 161 78 990 25725 1
## 491 491 78 936 51450 2
## 494 494 78 983 51450 2
## 512 512 78 967 51450 2
## 521 521 42 607 704000 2000
## 618 618 78 760 51450 2
## 692 692 78 878 154350 6
## 1057 1057 78 800 25725 1
## 1105 1105 42 607 704000 2000
## 1194 1194 78 944 25725 1
## 1205 1205 78 970 25725 1
## 1260 1260 78 775 77175 3
## 1363 1363 42 607 704000 2000
## 1385 1385 78 867 25725 1
## 1420 1420 78 912 25725 1
## 1437 1437 42 607 704000 2000
## 1453 1453 78 812 25725 1
## 1530 1530 78 810 51450 2
## 1563 1563 42 607 704000 2000
## 1603 1603 42 607 704000 2000
## 2154 2154 42 607 704000 2000
## 2271 2271 78 855 25725 1
## 2298 2298 42 607 704000 2000
## 2453 2453 78 709 51450 2
## 2493 2493 78 834 102900 4
## 2496 2496 78 707 51450 2
## 2513 2513 78 935 51450 2
## 2549 2549 78 861 25725 1
## 2565 2565 78 915 77175 3
## 2691 2691 78 962 77175 3
## 2774 2774 78 890 25725 1
## 2819 2819 78 869 51450 2
## 2822 2822 78 814 51450 2
## 2836 2836 42 607 704000 2000
## 2907 2907 78 817 77175 3
## 2923 2923 78 740 25725 1
## 2970 2970 42 607 704000 2000
## 3086 3086 78 910 25725 1
## 3102 3102 78 855 51450 2
## 3152 3152 78 745 25725 1
## 3168 3168 78 927 51450 2
## 3333 3333 42 607 704000 2000
## 3404 3404 78 928 77175 3
## 3441 3441 78 982 25725 1
## 3706 3706 78 828 51450 2
## 3725 3725 78 766 77175 3
## 3781 3781 78 889 25725 1
## 4041 4041 78 852 25725 1
## 4057 4057 42 607 704000 2000
## 4080 4080 78 946 51450 2
## 4193 4193 78 787 77175 3
## 4312 4312 78 960 51450 2
## 4413 4413 78 756 51450 2
## 4421 4421 78 969 77175 3
## 4506 4506 78 866 25725 1
## 4585 4585 78 997 25725 1
## 4647 4647 42 607 704000 2000
## 4716 4716 78 818 77175 3
## 4869 4869 42 607 704000 2000
## 4883 4883 42 607 704000 2000
## 4919 4919 78 823 25725 1
Looking at the results, we notice that the shops with shop_id 42 and 78 are the only shops that have orders greater than $1100. Customers who order from the shop with shop_id 42 typically order in large quantities, while the shop with shop_id 78 sells sneakers for $25725 a pair. These two outliers significantly increase the average order value across the 100 different shops and thus this method of calculating the AOV is an unreliable measure to use.
b. The metric I would use to report for this dataset is the average order value on a per shop_id basis. In R, we can aggregate the mean order amount by shop_id with the following R command:
aggregate(dataset[,c("order_amount"),drop=FALSE], by=list(shop_id=dataset$shop_id), FUN=mean)
## shop_id order_amount
## 1 1 308.8182
## 2 2 174.3273
## 3 3 305.2500
## 4 4 258.5098
## 5 5 290.3111
## 6 6 383.5085
## 7 7 218.0000
## 8 8 241.0435
## 9 9 234.0000
## 10 10 332.3019
## 11 11 356.7347
## 12 12 352.6981
## 13 13 345.3968
## 14 14 242.0000
## 15 15 308.9423
## 16 16 270.1463
## 17 17 332.0755
## 18 18 342.5882
## 19 19 320.9062
## 20 20 251.5577
## 21 21 308.6957
## 22 22 273.7500
## 23 23 317.6727
## 24 24 320.7273
## 25 25 232.9167
## 26 26 341.2245
## 27 27 334.8704
## 28 28 320.3721
## 29 29 331.6207
## 30 30 295.0714
## 31 31 268.9787
## 32 32 189.9762
## 33 33 376.2750
## 34 34 234.2400
## 35 35 328.0000
## 36 36 254.8000
## 37 37 340.2083
## 38 38 390.8571
## 39 39 268.0000
## 40 40 295.1667
## 41 41 254.0000
## 42 42 235101.4902
## 43 43 333.9138
## 44 44 262.1538
## 45 45 269.3103
## 46 46 347.4419
## 47 47 259.1489
## 48 48 242.7750
## 49 49 279.9057
## 50 50 403.5455
## 51 51 361.8043
## 52 52 316.9268
## 53 53 214.1176
## 54 54 276.6400
## 55 55 327.7500
## 56 56 218.1892
## 57 57 296.7736
## 58 58 254.9492
## 59 59 358.9667
## 60 60 350.2340
## 61 61 344.4400
## 62 62 308.8372
## 63 63 264.9655
## 64 64 272.1860
## 65 65 330.8148
## 66 66 312.8868
## 67 67 272.6216
## 68 68 254.6383
## 69 69 264.1833
## 70 70 343.0678
## 71 71 323.0303
## 72 72 309.5652
## 73 73 335.6897
## 74 74 306.0000
## 75 75 240.7619
## 76 76 321.0714
## 77 77 280.8000
## 78 78 49213.0435
## 79 79 328.4815
## 80 80 299.6667
## 81 81 384.0000
## 82 82 349.7857
## 83 83 248.7857
## 84 84 342.3051
## 85 85 329.2571
## 86 86 277.5000
## 87 87 292.2692
## 88 88 355.5200
## 89 89 379.1475
## 90 90 403.2245
## 91 91 325.9259
## 92 92 162.8571
## 93 93 214.4746
## 94 94 297.7778
## 95 95 318.7692
## 96 96 330.0000
## 97 97 324.0000
## 98 98 245.3621
## 99 99 339.4444
## 100 100 213.6750
c. From the results of the command in part b above, we see that most shops have an average order value in the $200-$400 range. If we aggregate by total_items instead of order_amount, we see that most customers order 1-2 pairs of shoes per order.
aggregate(dataset$total_items, by=list(shop_id=dataset$shop_id), FUN=mean)
## shop_id x
## 1 1 1.954545
## 2 2 1.854545
## 3 3 2.062500
## 4 4 2.019608
## 5 5 2.044444
## 6 6 2.050847
## 7 7 1.946429
## 8 8 1.826087
## 9 9 1.983051
## 10 10 2.245283
## 11 11 1.938776
## 12 12 1.754717
## 13 13 2.158730
## 14 14 2.086207
## 15 15 2.019231
## 16 16 1.731707
## 17 17 1.886792
## 18 18 2.196078
## 19 19 1.968750
## 20 20 1.980769
## 21 21 2.173913
## 22 22 1.875000
## 23 23 2.036364
## 24 24 2.290909
## 25 25 1.791667
## 26 26 1.938776
## 27 27 1.981481
## 28 28 1.953488
## 29 29 2.034483
## 30 30 1.928571
## 31 31 2.085106
## 32 32 1.880952
## 33 33 2.175000
## 34 34 1.920000
## 35 35 2.000000
## 36 36 1.960000
## 37 37 2.395833
## 38 38 2.057143
## 39 39 2.000000
## 40 40 1.833333
## 41 41 2.152542
## 42 42 667.901961
## 43 43 1.844828
## 44 44 1.820513
## 45 45 1.896552
## 46 46 2.093023
## 47 47 1.787234
## 48 48 2.075000
## 49 49 2.169811
## 50 50 2.090909
## 51 51 1.934783
## 52 52 2.170732
## 53 53 1.911765
## 54 54 2.080000
## 55 55 1.916667
## 56 56 1.864865
## 57 57 2.018868
## 58 58 1.847458
## 59 59 2.016667
## 60 60 1.978723
## 61 61 2.180000
## 62 62 1.930233
## 63 63 1.948276
## 64 64 2.046512
## 65 65 2.148148
## 66 66 1.943396
## 67 67 2.081081
## 68 68 1.872340
## 69 69 2.016667
## 70 70 1.983051
## 71 71 1.969697
## 72 72 1.934783
## 73 73 2.034483
## 74 74 2.000000
## 75 75 1.880952
## 76 76 2.071429
## 77 77 1.800000
## 78 78 1.913043
## 79 79 1.814815
## 80 80 2.066667
## 81 81 2.169492
## 82 82 1.976190
## 83 83 1.928571
## 84 84 2.237288
## 85 85 1.914286
## 86 86 2.134615
## 87 87 1.961538
## 88 88 2.020000
## 89 89 1.934426
## 90 90 2.265306
## 91 91 2.037037
## 92 92 1.809524
## 93 93 1.881356
## 94 94 2.222222
## 95 95 1.897436
## 96 96 2.156863
## 97 97 2.000000
## 98 98 1.844828
## 99 99 1.740741
## 100 100 1.925000
These are reasonable numbers which would not have been obvious if the AOV was calculated over all 100 shops.
a.
select count(orderid)
from orders o
inner join shippers s on s.shipperid = o.shipperid
where s.shipperid = 1
Output: 54
b.
select top 1 productname
from ((products p
inner join orderdetails od on od.productid = p.productid)
inner join orders o on o.orderid = od.orderid)
inner join customers c on c.customerid = o.customerid
where country = 'Germany'
group by productname
order by count(productname) desc
Output: Gorgonzola Telino
c.
select top 1 e.lastname
from (employees e
inner join orders o on e.employeeid = o.employeeid)
group by e.lastname
order by count(orderid) desc
Output: Peacock