This document illustrates how formatted text can be combined with code
and the output of that code. It is intended to be used with the
bakedAppleSampleData.csv file, which contains information
about how many units of various products are sold at stores where the
product is given to customers as a sample. Let’s start by reading in the
data and returning the first six rows.
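One way to sketch this step in R is with read.csv() and head(). So that the sketch runs without the real file, it writes three of the rows shown in the output below (with only a subset of the columns) to a temporary CSV file. With the real data you would pass the path to bakedAppleSampleData.csv instead, assuming the file is in your working directory. The data frame name ba matches the name that appears in the warning messages later in this document.

```r
# Stand-in for bakedAppleSampleData.csv: a few rows and columns copied
# from the output shown in this document.
csv_lines <- c(
  "Invoice,Whs,EventDate,Div,UnitsSold",
  "I-1007258,Whs-1,2019-12-29,A,32",
  "I-1007258,Whs-2,2019-12-29,A,44",
  "I-1007258,Whs-3,2019-12-29,A,45"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# With the real file (assumed location): ba <- read.csv("bakedAppleSampleData.csv")
ba <- read.csv(tmp)

# head() returns the first six rows by default (fewer if the data is shorter).
head(ba)
```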
## Invoice Whs EventDate EventType Div SplitVendors
## 1 I-1007258 Whs-1 2019-12-29 Regular A <NA>
## 2 I-1007258 Whs-2 2019-12-29 Regular A <NA>
## 3 I-1007258 Whs-3 2019-12-29 Regular A <NA>
## 4 I-1007258 Whs-4 2019-12-29 Regular A <NA>
## 5 I-1007258 Whs-5 2019-12-29 Regular A <NA>
## 6 I-1007258 Whs-6 2019-12-29 Regular A <NA>
## ItemorServiceDescription ItemDept DoorCount MembersSampled UnitsPurchased
## 1 Apple Scones Dept-01 5074 0 9
## 2 Apple Scones Dept-01 2362 0 10
## 3 Apple Scones Dept-01 4124 0 14
## 4 Apple Scones Dept-01 5305 0 12
## 5 Apple Scones Dept-01 2584 0 10
## 6 Apple Scones Dept-01 3303 0 9
## UnitsSold UnitCost UnitTax ProductChargeplusTax LaborCharge SupplyCharge
## 1 32 9.99 0 89.91 163 8.5
## 2 44 9.99 0 99.90 163 8.5
## 3 45 9.99 0 139.86 163 8.5
## 4 58 9.99 0 119.88 163 8.5
## 5 38 9.99 0 99.90 163 8.5
## 6 44 10.39 0 93.51 163 8.5
## LaborandSupplyTax EnhancementMiscOther Total
## 1 0 0.00 261.41
## 2 0 10.78 282.18
## 3 0 0.00 311.36
## 4 0 0.00 291.38
## 5 0 0.00 271.40
## 6 0 0.00 265.01
This data can help us better understand the health of Baked Apples products by evaluating the number of units sold. The most important columns are:
Let’s use the summary() function to return summary
statistics about all of the columns. Please note that the summary
information for character string columns is not very informative.
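As a small illustration of why character columns are less informative, the sketch below calls summary() on a stand-in data frame built from values in the first rows shown earlier. The name ba_small is hypothetical; on the full data you would call summary(ba).

```r
# Small stand-in data frame; values copied from the first rows shown earlier.
ba_small <- data.frame(
  ItemDept  = c("Dept-01", "Dept-01", "Dept-01"),
  UnitsSold = c(32, 44, 45)
)

# summary() on a data frame summarizes every column at once.
summary(ba_small)

# A character column only reports its length, class, and mode ...
char_summary <- summary(ba_small$ItemDept)
char_summary

# ... while a numeric column reports the minimum, quartiles, median,
# mean, and maximum.
num_summary <- summary(ba_small$UnitsSold)
num_summary
```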
## Invoice Whs EventDate EventType
## Length:11533 Length:11533 Length:11533 Length:11533
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Div SplitVendors ItemorServiceDescription
## Length:11533 Length:11533 Length:11533
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## ItemDept DoorCount MembersSampled UnitsPurchased
## Length:11533 Min. : 0 Min. : 0.000 Min. : 0.00
## Class :character 1st Qu.: 2733 1st Qu.: 0.000 1st Qu.: 8.00
## Mode :character Median : 3427 Median : 0.000 Median :11.00
## Mean : 3524 Mean : 2.224 Mean :11.39
## 3rd Qu.: 4235 3rd Qu.: 0.000 3rd Qu.:14.00
## Max. :36661 Max. :1200.000 Max. :45.00
## UnitsSold UnitCost UnitTax ProductChargeplusTax
## Min. : -3.00 Min. : 0.01 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 25.00 1st Qu.: 9.99 1st Qu.: 0.0000 1st Qu.: 84.72
## Median : 38.00 Median : 9.99 Median : 0.0000 Median :116.49
## Mean : 42.55 Mean :10.43 Mean : 0.7383 Mean :124.05
## 3rd Qu.: 55.00 3rd Qu.:10.59 3rd Qu.: 0.0000 3rd Qu.:155.04
## Max. :590.00 Max. :59.99 Max. :26.8900 Max. :611.55
## LaborCharge SupplyCharge LaborandSupplyTax EnhancementMiscOther
## Min. : 0.0 Min. : 0.000 Min. : 0.0000 Min. :-129.870
## 1st Qu.:163.0 1st Qu.: 8.500 1st Qu.: 0.0000 1st Qu.: 0.000
## Median :163.0 Median : 8.500 Median : 0.0000 Median : 0.000
## Mean :148.7 Mean : 7.448 Mean : 0.4902 Mean : -4.251
## 3rd Qu.:176.0 3rd Qu.: 8.500 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :354.0 Max. :16.800 Max. :25.3268 Max. : 221.800
## Total
## Min. : 0.0
## 1st Qu.:247.6
## Median :280.1
## Mean :276.5
## 3rd Qu.:316.4
## Max. :677.7
The UnitsSold column indicates how many units of the product were sold at the store, which makes it one of the most important columns. The summary statistics show that its values range from -3 to 590. Let’s use a histogram to get a more detailed view of the distribution of UnitsSold.
This histogram shows that about 8,000 of the 11,533 observations have between 0 and 50 units sold.
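A histogram like this can be sketched with base R's hist(). The example below uses only the six UnitsSold values from the first rows shown earlier, so its shape will not match the full data; the vector name, title, and axis label are illustrative choices. On the full data you would pass ba$UnitsSold.

```r
# UnitsSold values from the first six rows shown earlier.
units_sold <- c(32, 44, 45, 58, 38, 44)

# Draw a histogram of the values.
hist(units_sold, main = "Distribution of UnitsSold", xlab = "UnitsSold")

# Count how many observations fall between 0 and 50 units, which is the
# kind of figure you would otherwise read off the histogram bars.
n_0_to_50 <- sum(units_sold >= 0 & units_sold <= 50)
n_0_to_50
```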
Let’s look at the distribution of UnitsSold using a Box and Whisker
Plot, which is an alternative plot for evaluating the distribution of a
column of data.
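As a rough sketch, base R's boxplot() draws this kind of plot, and boxplot.stats() returns the numbers behind it. The example again uses only the six UnitsSold values from the first rows shown earlier; on the full data you would pass ba$UnitsSold.

```r
# UnitsSold values from the first six rows shown earlier.
units_sold <- c(32, 44, 45, 58, 38, 44)

# Draw the box-and-whisker plot.
boxplot(units_sold, ylab = "UnitsSold")

# boxplot.stats() returns the values the plot is built from:
# $stats holds the lower whisker, lower hinge, median, upper hinge,
# and upper whisker; $out holds points drawn individually as outliers.
bp <- boxplot.stats(units_sold)
bp$stats
bp$out
```

Even in this tiny sample, one value lies more than 1.5 times the interquartile range above the upper hinge, so it is reported in bp$out instead of extending the whisker.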
The outliers make it hard to see the range of the majority of the data. You could filter the data to remove the outliers, but that will be discussed in another lesson.
Another common way to reduce the effect of outliers is to compress the scale with a logarithmic transformation. If we use a base 10 logarithm, then we will convert values of 10 to 1, values of 100 to 2, values of 1000 to 3, and so on.
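The mapping described above can be checked directly with log10(). The vector names are illustrative, and the second half of the sketch applies the same transformation to the six UnitsSold values from the first rows shown earlier before plotting them.

```r
# Base 10 logarithm: 10 -> 1, 100 -> 2, 1000 -> 3.
x <- c(10, 100, 1000)
log_x <- log10(x)
log_x

# The same transformation applied before drawing a box plot.
units_sold <- c(32, 44, 45, 58, 38, 44)
boxplot(log10(units_sold), ylab = "log10(UnitsSold)")
```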
## Warning in boxplot(log10(ba$UnitsSold)): NaNs produced
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
Notice that this makes it easier to see the values of the box. To convert these values back to the original scale, raise 10 to the power of the value indicated in the plot.
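A minimal sketch of that back-transformation, using the six UnitsSold values from the first rows shown earlier: it takes the box plot statistics on the log scale and raises 10 to each of them. The variable names are illustrative.

```r
# UnitsSold values from the first six rows shown earlier.
units_sold <- c(32, 44, 45, 58, 38, 44)

# Box plot statistics on the log10 scale (whiskers, hinges, median).
log_stats <- boxplot.stats(log10(units_sold))$stats
log_stats

# Raise 10 to each value to return to the original units.
original_scale <- 10^log_stats
original_scale
```

The third element of original_scale is the median in original units, which matches the median of the untransformed values.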
What do you think the warning messages mean? Should they concern us?
Let’s take a look at the distribution of UnitsSold for each of the Divisions.
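One common way to compare a numeric column across groups in base R is the formula interface to boxplot(); the original plot may have been produced differently. The sketch below uses a toy data frame: the Div "A" values come from the rows shown earlier, while the Div "B" values are made up purely for illustration. On the full data you would use data = ba.

```r
# Toy data: Div "A" values are from the rows shown earlier;
# Div "B" values are invented for illustration only.
toy <- data.frame(
  Div       = c("A", "A", "A", "B", "B", "B"),
  UnitsSold = c(32, 44, 45, 20, 25, 30)
)

# One box per division, side by side.
boxplot(UnitsSold ~ Div, data = toy, xlab = "Div", ylab = "UnitsSold")

# The median per division, as a numeric companion to the plot.
medians <- tapply(toy$UnitsSold, toy$Div, median)
medians
```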
This plot provides insight into the distribution of units sold for each division.
Let’s explore the relationship between units sold and three other variables.
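A relationship between two numeric columns can be sketched with a scatter plot and a correlation coefficient. The three variables are not named here, so the choice of DoorCount below is an assumption made for illustration; the values are the six DoorCount and UnitsSold pairs from the first rows shown earlier. With only six rows, the result says little about the full data.

```r
# DoorCount and UnitsSold from the first six rows shown earlier.
door_count <- c(5074, 2362, 4124, 5305, 2584, 3303)
units_sold <- c(32, 44, 45, 58, 38, 44)

# Scatter plot: one point per row.
plot(door_count, units_sold, xlab = "DoorCount", ylab = "UnitsSold")

# Pearson correlation between the two columns (always between -1 and 1).
r <- cor(door_count, units_sold)
r
```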
Some questions to consider:
?nrow ba(6) head(ba) range(ba$LaborCharge)