library(lavaan)
library(semPlot)
library(corrplot)
library(multcomp)
satData <- read.csv('sat.csv')
satData$Segment <- factor(satData$Segment)
head(satData)
iProdSAT iSalesSAT Segment iProdREC iSalesREC
1 6 2 1 4 3
2 4 5 3 4 4
3 5 3 4 5 4
4 3 3 2 4 4
5 3 3 3 2 2
6 4 4 4 5 4
summary(satData)
iProdSAT iSalesSAT Segment
Min. :1.00 Min. :1.000 1: 54
1st Qu.:3.00 1st Qu.:3.000 2:131
Median :4.00 Median :4.000 3:154
Mean :4.13 Mean :3.802 4:161
3rd Qu.:5.00 3rd Qu.:5.000
Max. :7.00 Max. :7.000
iProdREC iSalesREC
Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:3.000
Median :4.000 Median :3.000
Mean :4.044 Mean :3.444
3rd Qu.:5.000 3rd Qu.:4.000
Max. :7.000 Max. :7.000
corrplot.mixed(cor(satData[, -3]))
aggregate(iProdSAT ~ Segment, satData, mean)
Segment iProdSAT
1 1 3.462963
2 2 3.725191
3 3 4.103896
4 4 4.708075
sat.anova <- aov(iProdSAT ~ -1 + Segment, satData)
summary(sat.anova)
Df Sum Sq Mean Sq F value Pr(>F)
Segment 4 8628 2157 2161 <2e-16 ***
Residuals 496 495 1
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
par(mar=c(4,8,4,2))
plot(glht(sat.anova))
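# Because the "-1" removed the intercept, each Segment coefficient is simply
# that segment's mean, and plot(glht()) displays those means with confidence
# intervals. A hedged numeric check of the same intervals, using multcomp's
# confint() method for glht objects:
confint(glht(sat.anova))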
satModel <- "SAT =~ iProdSAT + iSalesSAT
REC =~ iProdREC + iSalesREC
REC ~ SAT"
sat.fit <- cfa(satModel, data=satData)
Found more than one class "Model" in cache; using the first, from namespace 'lavaan'
summary(sat.fit, fit.m=TRUE)
lavaan (0.5-20) converged normally after 31 iterations
Number of observations 500
Estimator ML
Minimum Function Test Statistic 2.319
Degrees of freedom 1
P-value (Chi-square) 0.128
Model test baseline model:
Minimum Function Test Statistic 278.557
Degrees of freedom 6
P-value 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 0.995
Tucker-Lewis Index (TLI) 0.971
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -3040.385
Loglikelihood unrestricted model (H1) -3039.225
Number of free parameters 9
Akaike (AIC) 6098.769
Bayesian (BIC) 6136.701
Sample-size adjusted Bayesian (BIC) 6108.134
Root Mean Square Error of Approximation:
RMSEA 0.051
90 Percent Confidence Interval 0.000 0.142
P-value RMSEA <= 0.05 0.347
Standardized Root Mean Square Residual:
SRMR 0.012
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  SAT =~
    iProdSAT          1.000
    iSalesSAT         1.067    0.173    6.154    0.000
  REC =~
    iProdREC          1.000
    iSalesREC         0.900    0.138    6.528    0.000

Regressions:
                   Estimate  Std.Err  Z-value  P(>|z|)
  REC ~
    SAT               0.758    0.131    5.804    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    iProdSAT          0.706    0.088    7.994    0.000
    iSalesSAT         0.793    0.100    7.918    0.000
    iProdREC          0.892    0.129    6.890    0.000
    iSalesREC         0.808    0.107    7.533    0.000
    SAT               0.483    0.097    4.970    0.000
    REC               0.516    0.115    4.505    0.000
semPaths(sat.fit, what="est",
residuals=FALSE, intercepts=FALSE, nCharNodes=9)
xSeq <- 1:(4*2)
xSeq
[1] 1 2 3 4 5 6 7 8
xSeq <- 1:4*2
xSeq
[1] 2 4 6 8
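# Note the precedence: ":" binds tighter than "*", so 1:4*2 means (1:4)*2.
# seq() makes such intent explicit; a minimal equivalent of the line above:
seq(from=2, to=8, by=2)
[1] 2 4 6 8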
# Sequences are useful for indexing and you can use sequences inside [ ]:
1:300
[1] 1 2 3 4 5 6 7 8 9 10 11
[12] 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33
[34] 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50 51 52 53 54 55
[56] 56 57 58 59 60 61 62 63 64 65 66
[67] 67 68 69 70 71 72 73 74 75 76 77
[78] 78 79 80 81 82 83 84 85 86 87 88
[89] 89 90 91 92 93 94 95 96 97 98 99
[100] 100 101 102 103 104 105 106 107 108 109 110
[111] 111 112 113 114 115 116 117 118 119 120 121
[122] 122 123 124 125 126 127 128 129 130 131 132
[133] 133 134 135 136 137 138 139 140 141 142 143
[144] 144 145 146 147 148 149 150 151 152 153 154
[155] 155 156 157 158 159 160 161 162 163 164 165
[166] 166 167 168 169 170 171 172 173 174 175 176
[177] 177 178 179 180 181 182 183 184 185 186 187
[188] 188 189 190 191 192 193 194 195 196 197 198
[199] 199 200 201 202 203 204 205 206 207 208 209
[210] 210 211 212 213 214 215 216 217 218 219 220
[221] 221 222 223 224 225 226 227 228 229 230 231
[232] 232 233 234 235 236 237 238 239 240 241 242
[243] 243 244 245 246 247 248 249 250 251 252 253
[254] 254 255 256 257 258 259 260 261 262 263 264
[265] 265 266 267 268 269 270 271 272 273 274 275
[276] 276 277 278 279 280 281 282 283 284 285 286
[287] 287 288 289 290 291 292 293 294 295 296 297
[298] 298 299 300
my.test.scores <- c(91, NA, NA)
mean(my.test.scores)
[1] NA
max(my.test.scores)
[1] NA
mean(my.test.scores, na.rm = TRUE)
[1] 91
max(my.test.scores, na.rm = TRUE)
[1] 91
# A second approach is to remove NA values explicitly before calculating on them or assigning them elsewhere. This may be done most easily with the command na.omit():
mean(na.omit(my.test.scores))
[1] 91
# A third and more cumbersome alternative is to test for NA using the is.na() function, and then index data for the values that are not NA by adding the ! (“not”) operator:
is.na(my.test.scores)
[1] FALSE TRUE TRUE
my.test.scores[!is.na(my.test.scores)]
[1] 91
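# is.na() also supports quick checks, e.g. counting the missing values:
sum(is.na(my.test.scores))
[1] 2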
# One thing never to do in R is to use an actual numeric value such as −999 to indicate missing data. That will cause headaches at best and wrong answers at worst. Instead, as soon as you load such data into R, replace those values with NA using indices:
my.test.scores <- c(91, -999, -999)
mean(my.test.scores)
[1] -635.6667
my.test.scores[my.test.scores < -900] <- NA
mean(my.test.scores, na.rm=TRUE)
[1] 91
log(c(-1,0,1))
Warning in log(c(-1, 0, 1)) : NaNs produced
[1] NaN -Inf 0
# We get a warning because log() is undefined for negative numbers and log(-1) gives a value of NaN. Note also that log(0) = −∞ (-Inf).
xNum <- c(1, 3.14159, 5, 7)
xLog <- c(TRUE, FALSE, TRUE, TRUE)
xChar <- c("foo", "bar", "boo", "far")
xMix <- c(1, TRUE, 3, "Hello, world!")
xList <- list(xNum, xChar)
xList
[[1]]
[1] 1.00000 3.14159 5.00000 7.00000
[[2]]
[1] "foo" "bar" "boo" "far"
str(xNum)
num [1:4] 1 3.14 5 7
summary(xList[[1]])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.606 4.071 4.035 5.500 7.000
lapply(xList, summary)
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.606 4.071 4.035 5.500 7.000
[[2]]
Length Class Mode
4 character character
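# sapply() is a close cousin of lapply() that simplifies the result when it can:
sapply(xList, length)
[1] 4 4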
x.df <- data.frame(xNum, xLog, xChar)
x.df
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
3 5.00000 TRUE boo
4 7.00000 TRUE far
# [ROW, COLUMN] notation:
x.df[2,1]
[1] 3.14159
x.df[1,3]
[1] foo
Levels: bar boo far foo
# Four classic scales of measurement:
# Nominal  - unordered categories
# Ordinal  - ordered, supports less/greater comparisons
# Interval - equal spacing between values, but no true zero
# Ratio    - equal spacing with a true zero
Converting character strings to factors is a good thing for data that you might use in a statistical model because it tells R to handle it appropriately in the model, but it’s inconvenient when the data really is simple text such as an address or comments on a survey. You can prevent the conversion to factors by adding an option to data.frame() that sets stringsAsFactors=FALSE:
x.df <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)
x.df
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
3 5.00000 TRUE boo
4 7.00000 TRUE far
x.df[1,3]
[1] "foo"
# observation 2, all columns
x.df[2, ]
xNum xLog xChar
2 3.14159 FALSE bar
# column 2, all rows
x.df[, 2]
[1] TRUE FALSE TRUE TRUE
x.df[ ,3] # all of column 3
[1] "foo" "bar" "boo" "far"
# Index data frames by using vectors or ranges for the elements you want. Use negative indices to omit elements:
x.df[2:3, ]
xNum xLog xChar
2 3.14159 FALSE bar
3 5.00000 TRUE boo
print(x.df[ ,1:2]) # two columns
xNum xLog
1 1.00000 TRUE
2 3.14159 FALSE
3 5.00000 TRUE
4 7.00000 TRUE
x.df[-3, ] # omit the third observation
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
4 7.00000 TRUE far
x.df[, -2] # omit the second column
xNum xChar
1 1.00000 foo
2 3.14159 bar
3 5.00000 boo
4 7.00000 far
rm(list=ls()) # caution, deletes all objects! See explanation below
store.num <- factor(c(3, 14, 21, 32, 54)) # store id
store.rev <- c(543, 654, 345, 678, 234) # store revenue, $1000
store.visits <- c(45, 78, 32, 56, 34) # visits, 1000s
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")
(store.df <- data.frame(store.num, store.rev, store.visits, store.manager, stringsAsFactors=F)) # F = FALSE
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
# Pearson's r
cor(store.df$store.rev, store.df$store.visits)
[1] 0.8291032
summary(store.df)
store.num store.rev store.visits
3 :1 Min. :234.0 Min. :32
14:1 1st Qu.:345.0 1st Qu.:34
21:1 Median :543.0 Median :45
32:1 Mean :490.8 Mean :49
54:1 3rd Qu.:654.0 3rd Qu.:56
Max. :678.0 Max. :78
store.manager
Length:5
Class :character
Mode :character
save(store.df, file="store-df-backup.RData")
rm(store.df)
# mean(store.df$store.rev) # would fail now: store.df was removed by rm() above
load("store-df-backup.RData")
mean(store.df$store.rev)
[1] 490.8
# save() can also take a group of objects as an argument; just replace the single object name with list=c() and fill in c() with a character vector. For instance:
save(list=c("store.df","store.visits"), file="store-df-backup.RData")
se <- function(x) { sd(x) / sqrt(length(x)) }
se(store.df$store.visits)
[1] 8.42615
mean(store.df$store.visits) + 1.96 * se(store.df$store.visits)
[1] 65.51525
• Put braces around the body using { and }, even if it’s just a one-line function
• Create temporary values to hold results along the way inside the function
• Comment the function profusely
• Use the keyword return() to show the explicit value returned by the function
se <- function(x) {
# computes standard error of the mean
tmp.sd <- sd(x) # standard deviation
tmp.N <- length(x) # sample size
tmp.se <- tmp.sd / sqrt(tmp.N) # std error of the mean
return(tmp.se)
}
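# A quick sanity check with known values: sd(c(2, 4, 6, 8)) is 2.582, so the
# standard error should be 2.582 / sqrt(4):
se(c(2, 4, 6, 8))
[1] 1.290994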
# EXPR = an R expression
# if (TEST) EXPR [else EXPR.b] # do EXPR if TEST is true, else EXPR.b
# while (TEST) EXPR # repeat EXPR as long as TEST is true
#
# for (NAME in VECTOR) EXPR # iterate expr for values of name from vector
#
# switch(INDEX, LIST) # INDEXth statement or matching argument name from LIST
# type <- 'trimmed'
# switch(type,
# mean = mean(x),
# median = median(x),
# trimmed = mean(x, trim = .1))
# repeat EXPR # repeats forever until break; not recommended
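# A minimal runnable sketch of the for() and if/else constructs listed above
# (the values are arbitrary):
for (i in 1:3) {
  if (i %% 2 == 0) cat(i, "is even\n") else cat(i, "is odd\n")
}
1 is odd
2 is even
3 is odd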
?switch
library(devtools)
x <- -2:2
log(x)
Warning in log(x) : NaNs produced
[1] NaN NaN -Inf 0.0000000
[5] 0.6931472
ifelse(x > 0, x, NA)
[1] NA NA NA 1 2
log(ifelse(x > 0, x, NA))
[1] NA NA NA 0.0000000
[5] 0.6931472
my.data <- matrix(runif(100), ncol = 5) # 100 rand nums in 5 cols
median(my.data[,1])/2
[1] 0.2877563
apply(my.data, 2, median) / 2
[1] 0.2877563 0.2235877 0.2258379 0.2197439
[5] 0.1710344
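# MARGIN=1 applies the function across rows instead of columns; for the
# 20 x 5 matrix above this gives 20 row medians (random values, output omitted):
apply(my.data, 1, median)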
ls() # to see what you have in memory
[1] "my.data" "se" "store.df"
[4] "store.manager" "store.num" "store.rev"
[7] "store.visits" "x"
rm(my.data) # to remove a specific object
rm(list=ls())
Most of the present chapter is foundational to R, yet there are a few especially important points:
• For work that you want to preserve or edit, use a text editor and run commands from there (Sect. 2.3).
• Create vectors using c() for enumerated values, seq() for sequences, and rep() for repeated values (Sects. 2.4.1 and 2.4.3).
• Use the constant NA for missing values, not an arbitrary value such as −999 (Sect. 2.4.5).
• In R, data sets are most commonly data.frame objects created with a command such as my.df <- data.frame(vector1, vector2, …) (Sect. 2.5) or by reading a data file.
• Vectors and data frames are most often indexed with specific numbers (x[1]), ranges (x[2:4]), negative indices (x[-3]) to omit data, and by Boolean selection (x[x>3]) (Sects. 2.5 and 2.4.3).
• Data frames are indexed by [ROW, COLUMN], where a blank value means “all of that dimension”, such as my.df[2, ] for row 2, all columns (Sect. 2.5).
• You can also index a data frame with $ and a column name, such as my.df$id (Sect. 2.5).
• Read and write data in CSV format with read.csv() and write.csv() (Sect. 2.6.2); see the sketch after this list.
• Functions are straightforward to write and extend R’s capabilities. When you write a function, organize the code well and comment it profusely (Sect. 2.7).
• Clean up your workspace regularly to avoid clutter and bugs from obsolete variables (Sect. 2.8).
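# A hedged round-trip sketch for the read.csv()/write.csv() point above
# (the file name is illustrative):
demo.df <- data.frame(id=1:3, score=c(90, 85, 88))
write.csv(demo.df, file="demo.csv", row.names=FALSE)
read.csv("demo.csv") # returns the same values as a new data frame
rm(demo.df)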
k.stores <- 20 # 20 stores, using "k." for "constant"
k.weeks <- 104 # 2 years of data each
# create a data frame of initially missing values to hold the data
store.df <- data.frame(matrix(NA, ncol=10, nrow=k.stores*k.weeks))
names(store.df) <- c("storeNum", "Year", "Week", "p1sales", "p2sales","p1price", "p2price", "p1prom", "p2prom","country")
dim(store.df)
[1] 2080 10
# we create two vectors that will represent the store number and country for each observation
store.num <- 101:(100+k.stores)
(store.cty <- c(rep("US", 3), rep("DE", 5), rep("GB", 3), rep("BR", 2), rep("JP", 4), rep("AU", 1), rep("CN", 2)))
[1] "US" "US" "US" "DE" "DE" "DE" "DE" "DE" "GB"
[10] "GB" "GB" "BR" "BR" "JP" "JP" "JP" "JP" "AU"
[19] "CN" "CN"
length(store.cty)
[1] 20
# now we replace the appropriate columns in the data frame with those values, using rep() to expand the vectors to match the number of stores and weeks
store.df$storeNum <- rep(store.num, each = k.weeks)
store.df$country <- rep(store.cty, each = k.weeks)
rm(store.num, store.cty) # clean up
str(store.df)
'data.frame': 2080 obs. of 10 variables:
$ storeNum: int 101 101 101 101 101 101 101 101 101 101 ...
$ Year : logi NA NA NA NA NA NA ...
$ Week : logi NA NA NA NA NA NA ...
$ p1sales : logi NA NA NA NA NA NA ...
$ p2sales : logi NA NA NA NA NA NA ...
$ p1price : logi NA NA NA NA NA NA ...
$ p2price : logi NA NA NA NA NA NA ...
$ p1prom : logi NA NA NA NA NA NA ...
$ p2prom : logi NA NA NA NA NA NA ...
$ country : chr "US" "US" "US" "US" ...
# check the types; note the chr vectors. Columns that hold discrete values, like country, should be converted with factor(). storeNum is a label rather than a quantity we would ever compute on, so we treat it as a factor too.
# next we fill in the Week and Year columns
(store.df$Week <- rep(1:52, times = k.stores*2))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[46] 46 47 48 49 50 51 52 1 2 3 4 5 6 7 8
[61] 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[76] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[91] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1
[106] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[121] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
[136] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[151] 47 48 49 50 51 52 1 2 3 4 5 6 7 8 9
[166] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[181] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
[196] 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2
[211] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[226] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[241] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[256] 48 49 50 51 52 1 2 3 4 5 6 7 8 9 10
[271] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[286] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[301] 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3
[316]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[331] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[346] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
[361] 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11
[376] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[391] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
[406] 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4
[421] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
[436] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[451] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
[466] 50 51 52 1 2 3 4 5 6 7 8 9 10 11 12
[481] 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[496] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[511] 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
[526] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[541] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
[556] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[571] 51 52 1 2 3 4 5 6 7 8 9 10 11 12 13
[586] 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[601] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[616] 44 45 46 47 48 49 50 51 52 1 2 3 4 5 6
[631] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
[646] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[661] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[676] 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[691] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[706] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[721] 45 46 47 48 49 50 51 52 1 2 3 4 5 6 7
[736] 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[751] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[766] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[781] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[796] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[811] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[826] 46 47 48 49 50 51 52 1 2 3 4 5 6 7 8
[841] 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[856] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[871] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1
[886] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[901] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
[916] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[931] 47 48 49 50 51 52 1 2 3 4 5 6 7 8 9
[946] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[961] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
[976] 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2
[991] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[1006] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[1021] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[1036] 48 49 50 51 52 1 2 3 4 5 6 7 8 9 10
[1051] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[1066] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[1081] 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3
[1096] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1111] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[1126] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
[1141] 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11
[1156] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[1171] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
[1186] 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4
[1201] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
[1216] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[1231] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
[1246] 50 51 52 1 2 3 4 5 6 7 8 9 10 11 12
[1261] 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[1276] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[1291] 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
[1306] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[1321] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
[1336] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[1351] 51 52 1 2 3 4 5 6 7 8 9 10 11 12 13
[1366] 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[1381] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[1396] 44 45 46 47 48 49 50 51 52 1 2 3 4 5 6
[1411] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
[1426] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[1441] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[1456] 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[1471] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[1486] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[1501] 45 46 47 48 49 50 51 52 1 2 3 4 5 6 7
[1516] 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[1531] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[1546] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[1561] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[1576] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[1591] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[1606] 46 47 48 49 50 51 52 1 2 3 4 5 6 7 8
[1621] 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[1636] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[1651] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1
[1666] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[1681] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
[1696] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[1711] 47 48 49 50 51 52 1 2 3 4 5 6 7 8 9
[1726] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[1741] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
[1756] 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2
[1771] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[1786] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[1801] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[1816] 48 49 50 51 52 1 2 3 4 5 6 7 8 9 10
[1831] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[1846] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[1861] 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3
[1876] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1891] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[1906] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
[1921] 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11
[1936] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[1951] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
[1966] 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4
[1981] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
[1996] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[2011] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
[2026] 50 51 52 1 2 3 4 5 6 7 8 9 10 11 12
[2041] 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[2056] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[2071] 43 44 45 46 47 48 49 50 51 52
# try the inner parts of the next line to figure out how we use rep
(store.df$Year <- rep(rep(1:2, each=k.weeks/2), times=k.stores))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[24] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[47] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[70] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[93] 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
[116] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[162] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[185] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[208] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[231] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[254] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[277] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[300] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
[323] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[346] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
[369] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[392] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[415] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[438] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[461] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[484] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[507] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
[530] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[553] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
[576] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[599] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[622] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[645] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[668] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[691] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[714] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
[737] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[760] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
[783] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[806] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[829] 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[852] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[875] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[898] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[921] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
[944] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[967] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[990] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1013] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1036] 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1059] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1082] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
[1105] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1128] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
[1151] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1174] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1197] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1220] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1243] 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1266] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1289] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[1312] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1335] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
[1358] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1381] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1404] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1427] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1450] 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1473] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1496] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
[1519] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1542] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[1565] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1588] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1611] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1634] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1657] 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1680] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1703] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
[1726] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1749] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1
[1772] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1795] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1818] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1841] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1864] 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1887] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1910] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
[1933] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1956] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
[1979] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[2002] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[2025] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[2048] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[2071] 2 2 2 2 2 2 2 2 2 2
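# Decomposing the nested rep() above with small numbers (an illustrative check):
rep(1:2, each=3)
[1] 1 1 1 2 2 2
rep(rep(1:2, each=3), times=2)
[1] 1 1 1 2 2 2 1 1 1 2 2 2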
store.df$storeNum <- factor(store.df$storeNum)
store.df$country <- factor(store.df$country)
str(store.df)
'data.frame': 2080 obs. of 10 variables:
$ storeNum: Factor w/ 20 levels "101","102","103",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 1 1 1 1 1 1 1 1 1 1 ...
$ Week : int 1 2 3 4 5 6 7 8 9 10 ...
$ p1sales : logi NA NA NA NA NA NA ...
$ p2sales : logi NA NA NA NA NA NA ...
$ p1price : logi NA NA NA NA NA NA ...
$ p2price : logi NA NA NA NA NA NA ...
$ p1prom : logi NA NA NA NA NA NA ...
$ p2prom : logi NA NA NA NA NA NA ...
$ country : Factor w/ 7 levels "AU","BR","CN",..: 7 7 7 7 7 7 7 7 7 7 ...
# inspect the data frame at its first and last rows
head(store.df, 120) # 120 rows is enough to check 2 stores
storeNum Year Week p1sales p2sales p1price
1 101 1 1 NA NA NA
2 101 1 2 NA NA NA
3 101 1 3 NA NA NA
4 101 1 4 NA NA NA
5 101 1 5 NA NA NA
6 101 1 6 NA NA NA
7 101 1 7 NA NA NA
8 101 1 8 NA NA NA
9 101 1 9 NA NA NA
10 101 1 10 NA NA NA
11 101 1 11 NA NA NA
12 101 1 12 NA NA NA
13 101 1 13 NA NA NA
14 101 1 14 NA NA NA
15 101 1 15 NA NA NA
16 101 1 16 NA NA NA
17 101 1 17 NA NA NA
18 101 1 18 NA NA NA
19 101 1 19 NA NA NA
20 101 1 20 NA NA NA
21 101 1 21 NA NA NA
22 101 1 22 NA NA NA
23 101 1 23 NA NA NA
24 101 1 24 NA NA NA
25 101 1 25 NA NA NA
26 101 1 26 NA NA NA
27 101 1 27 NA NA NA
28 101 1 28 NA NA NA
29 101 1 29 NA NA NA
30 101 1 30 NA NA NA
31 101 1 31 NA NA NA
32 101 1 32 NA NA NA
33 101 1 33 NA NA NA
34 101 1 34 NA NA NA
35 101 1 35 NA NA NA
36 101 1 36 NA NA NA
37 101 1 37 NA NA NA
38 101 1 38 NA NA NA
39 101 1 39 NA NA NA
40 101 1 40 NA NA NA
41 101 1 41 NA NA NA
42 101 1 42 NA NA NA
43 101 1 43 NA NA NA
44 101 1 44 NA NA NA
45 101 1 45 NA NA NA
46 101 1 46 NA NA NA
47 101 1 47 NA NA NA
48 101 1 48 NA NA NA
49 101 1 49 NA NA NA
50 101 1 50 NA NA NA
51 101 1 51 NA NA NA
52 101 1 52 NA NA NA
53 101 2 1 NA NA NA
54 101 2 2 NA NA NA
55 101 2 3 NA NA NA
56 101 2 4 NA NA NA
57 101 2 5 NA NA NA
58 101 2 6 NA NA NA
59 101 2 7 NA NA NA
60 101 2 8 NA NA NA
61 101 2 9 NA NA NA
62 101 2 10 NA NA NA
63 101 2 11 NA NA NA
64 101 2 12 NA NA NA
65 101 2 13 NA NA NA
66 101 2 14 NA NA NA
67 101 2 15 NA NA NA
68 101 2 16 NA NA NA
69 101 2 17 NA NA NA
70 101 2 18 NA NA NA
71 101 2 19 NA NA NA
72 101 2 20 NA NA NA
73 101 2 21 NA NA NA
74 101 2 22 NA NA NA
75 101 2 23 NA NA NA
76 101 2 24 NA NA NA
77 101 2 25 NA NA NA
78 101 2 26 NA NA NA
79 101 2 27 NA NA NA
80 101 2 28 NA NA NA
81 101 2 29 NA NA NA
82 101 2 30 NA NA NA
83 101 2 31 NA NA NA
84 101 2 32 NA NA NA
85 101 2 33 NA NA NA
86 101 2 34 NA NA NA
87 101 2 35 NA NA NA
88 101 2 36 NA NA NA
89 101 2 37 NA NA NA
90 101 2 38 NA NA NA
91 101 2 39 NA NA NA
92 101 2 40 NA NA NA
93 101 2 41 NA NA NA
94 101 2 42 NA NA NA
95 101 2 43 NA NA NA
96 101 2 44 NA NA NA
97 101 2 45 NA NA NA
98 101 2 46 NA NA NA
99 101 2 47 NA NA NA
100 101 2 48 NA NA NA
101 101 2 49 NA NA NA
102 101 2 50 NA NA NA
103 101 2 51 NA NA NA
104 101 2 52 NA NA NA
105 102 1 1 NA NA NA
106 102 1 2 NA NA NA
107 102 1 3 NA NA NA
108 102 1 4 NA NA NA
109 102 1 5 NA NA NA
110 102 1 6 NA NA NA
111 102 1 7 NA NA NA
112 102 1 8 NA NA NA
113 102 1 9 NA NA NA
114 102 1 10 NA NA NA
115 102 1 11 NA NA NA
116 102 1 12 NA NA NA
117 102 1 13 NA NA NA
118 102 1 14 NA NA NA
119 102 1 15 NA NA NA
120 102 1 16 NA NA NA
p2price p1prom p2prom country
1 NA NA NA US
2 NA NA NA US
3 NA NA NA US
4 NA NA NA US
5 NA NA NA US
6 NA NA NA US
7 NA NA NA US
8 NA NA NA US
9 NA NA NA US
10 NA NA NA US
11 NA NA NA US
12 NA NA NA US
13 NA NA NA US
14 NA NA NA US
15 NA NA NA US
16 NA NA NA US
17 NA NA NA US
18 NA NA NA US
19 NA NA NA US
20 NA NA NA US
21 NA NA NA US
22 NA NA NA US
23 NA NA NA US
24 NA NA NA US
25 NA NA NA US
26 NA NA NA US
27 NA NA NA US
28 NA NA NA US
29 NA NA NA US
30 NA NA NA US
31 NA NA NA US
32 NA NA NA US
33 NA NA NA US
34 NA NA NA US
35 NA NA NA US
36 NA NA NA US
37 NA NA NA US
38 NA NA NA US
39 NA NA NA US
40 NA NA NA US
41 NA NA NA US
42 NA NA NA US
43 NA NA NA US
44 NA NA NA US
45 NA NA NA US
46 NA NA NA US
47 NA NA NA US
48 NA NA NA US
49 NA NA NA US
50 NA NA NA US
51 NA NA NA US
52 NA NA NA US
53 NA NA NA US
54 NA NA NA US
55 NA NA NA US
56 NA NA NA US
57 NA NA NA US
58 NA NA NA US
59 NA NA NA US
60 NA NA NA US
61 NA NA NA US
62 NA NA NA US
63 NA NA NA US
64 NA NA NA US
65 NA NA NA US
66 NA NA NA US
67 NA NA NA US
68 NA NA NA US
69 NA NA NA US
70 NA NA NA US
71 NA NA NA US
72 NA NA NA US
73 NA NA NA US
74 NA NA NA US
75 NA NA NA US
76 NA NA NA US
77 NA NA NA US
78 NA NA NA US
79 NA NA NA US
80 NA NA NA US
81 NA NA NA US
82 NA NA NA US
83 NA NA NA US
84 NA NA NA US
85 NA NA NA US
86 NA NA NA US
87 NA NA NA US
88 NA NA NA US
89 NA NA NA US
90 NA NA NA US
91 NA NA NA US
92 NA NA NA US
93 NA NA NA US
94 NA NA NA US
95 NA NA NA US
96 NA NA NA US
97 NA NA NA US
98 NA NA NA US
99 NA NA NA US
100 NA NA NA US
101 NA NA NA US
102 NA NA NA US
103 NA NA NA US
104 NA NA NA US
105 NA NA NA US
106 NA NA NA US
107 NA NA NA US
108 NA NA NA US
109 NA NA NA US
110 NA NA NA US
111 NA NA NA US
112 NA NA NA US
113 NA NA NA US
114 NA NA NA US
115 NA NA NA US
116 NA NA NA US
117 NA NA NA US
118 NA NA NA US
119 NA NA NA US
120 NA NA NA US
tail(store.df, 120) # make sure the end looks OK too
storeNum Year Week p1sales p2sales p1price
1961 119 2 37 NA NA NA
1962 119 2 38 NA NA NA
1963 119 2 39 NA NA NA
1964 119 2 40 NA NA NA
1965 119 2 41 NA NA NA
1966 119 2 42 NA NA NA
1967 119 2 43 NA NA NA
1968 119 2 44 NA NA NA
1969 119 2 45 NA NA NA
1970 119 2 46 NA NA NA
1971 119 2 47 NA NA NA
1972 119 2 48 NA NA NA
1973 119 2 49 NA NA NA
1974 119 2 50 NA NA NA
1975 119 2 51 NA NA NA
1976 119 2 52 NA NA NA
1977 120 1 1 NA NA NA
1978 120 1 2 NA NA NA
1979 120 1 3 NA NA NA
1980 120 1 4 NA NA NA
1981 120 1 5 NA NA NA
1982 120 1 6 NA NA NA
1983 120 1 7 NA NA NA
1984 120 1 8 NA NA NA
1985 120 1 9 NA NA NA
1986 120 1 10 NA NA NA
1987 120 1 11 NA NA NA
1988 120 1 12 NA NA NA
1989 120 1 13 NA NA NA
1990 120 1 14 NA NA NA
1991 120 1 15 NA NA NA
1992 120 1 16 NA NA NA
1993 120 1 17 NA NA NA
1994 120 1 18 NA NA NA
1995 120 1 19 NA NA NA
1996 120 1 20 NA NA NA
1997 120 1 21 NA NA NA
1998 120 1 22 NA NA NA
1999 120 1 23 NA NA NA
2000 120 1 24 NA NA NA
2001 120 1 25 NA NA NA
2002 120 1 26 NA NA NA
2003 120 1 27 NA NA NA
2004 120 1 28 NA NA NA
2005 120 1 29 NA NA NA
2006 120 1 30 NA NA NA
2007 120 1 31 NA NA NA
2008 120 1 32 NA NA NA
2009 120 1 33 NA NA NA
2010 120 1 34 NA NA NA
2011 120 1 35 NA NA NA
2012 120 1 36 NA NA NA
2013 120 1 37 NA NA NA
2014 120 1 38 NA NA NA
2015 120 1 39 NA NA NA
2016 120 1 40 NA NA NA
2017 120 1 41 NA NA NA
2018 120 1 42 NA NA NA
2019 120 1 43 NA NA NA
2020 120 1 44 NA NA NA
2021 120 1 45 NA NA NA
2022 120 1 46 NA NA NA
2023 120 1 47 NA NA NA
2024 120 1 48 NA NA NA
2025 120 1 49 NA NA NA
2026 120 1 50 NA NA NA
2027 120 1 51 NA NA NA
2028 120 1 52 NA NA NA
2029 120 2 1 NA NA NA
2030 120 2 2 NA NA NA
2031 120 2 3 NA NA NA
2032 120 2 4 NA NA NA
2033 120 2 5 NA NA NA
2034 120 2 6 NA NA NA
2035 120 2 7 NA NA NA
2036 120 2 8 NA NA NA
2037 120 2 9 NA NA NA
2038 120 2 10 NA NA NA
2039 120 2 11 NA NA NA
2040 120 2 12 NA NA NA
2041 120 2 13 NA NA NA
2042 120 2 14 NA NA NA
2043 120 2 15 NA NA NA
2044 120 2 16 NA NA NA
2045 120 2 17 NA NA NA
2046 120 2 18 NA NA NA
2047 120 2 19 NA NA NA
2048 120 2 20 NA NA NA
2049 120 2 21 NA NA NA
2050 120 2 22 NA NA NA
2051 120 2 23 NA NA NA
2052 120 2 24 NA NA NA
2053 120 2 25 NA NA NA
2054 120 2 26 NA NA NA
2055 120 2 27 NA NA NA
2056 120 2 28 NA NA NA
2057 120 2 29 NA NA NA
2058 120 2 30 NA NA NA
2059 120 2 31 NA NA NA
2060 120 2 32 NA NA NA
2061 120 2 33 NA NA NA
2062 120 2 34 NA NA NA
2063 120 2 35 NA NA NA
2064 120 2 36 NA NA NA
2065 120 2 37 NA NA NA
2066 120 2 38 NA NA NA
2067 120 2 39 NA NA NA
2068 120 2 40 NA NA NA
2069 120 2 41 NA NA NA
2070 120 2 42 NA NA NA
2071 120 2 43 NA NA NA
2072 120 2 44 NA NA NA
2073 120 2 45 NA NA NA
2074 120 2 46 NA NA NA
2075 120 2 47 NA NA NA
2076 120 2 48 NA NA NA
2077 120 2 49 NA NA NA
2078 120 2 50 NA NA NA
2079 120 2 51 NA NA NA
2080 120 2 52 NA NA NA
p2price p1prom p2prom country
1961 NA NA NA CN
1962 NA NA NA CN
1963 NA NA NA CN
1964 NA NA NA CN
1965 NA NA NA CN
1966 NA NA NA CN
1967 NA NA NA CN
1968 NA NA NA CN
1969 NA NA NA CN
1970 NA NA NA CN
1971 NA NA NA CN
1972 NA NA NA CN
1973 NA NA NA CN
1974 NA NA NA CN
1975 NA NA NA CN
1976 NA NA NA CN
1977 NA NA NA CN
1978 NA NA NA CN
1979 NA NA NA CN
1980 NA NA NA CN
1981 NA NA NA CN
1982 NA NA NA CN
1983 NA NA NA CN
1984 NA NA NA CN
1985 NA NA NA CN
1986 NA NA NA CN
1987 NA NA NA CN
1988 NA NA NA CN
1989 NA NA NA CN
1990 NA NA NA CN
1991 NA NA NA CN
1992 NA NA NA CN
1993 NA NA NA CN
1994 NA NA NA CN
1995 NA NA NA CN
1996 NA NA NA CN
1997 NA NA NA CN
1998 NA NA NA CN
1999 NA NA NA CN
2000 NA NA NA CN
2001 NA NA NA CN
2002 NA NA NA CN
2003 NA NA NA CN
2004 NA NA NA CN
2005 NA NA NA CN
2006 NA NA NA CN
2007 NA NA NA CN
2008 NA NA NA CN
2009 NA NA NA CN
2010 NA NA NA CN
2011 NA NA NA CN
2012 NA NA NA CN
2013 NA NA NA CN
2014 NA NA NA CN
2015 NA NA NA CN
2016 NA NA NA CN
2017 NA NA NA CN
2018 NA NA NA CN
2019 NA NA NA CN
2020 NA NA NA CN
2021 NA NA NA CN
2022 NA NA NA CN
2023 NA NA NA CN
2024 NA NA NA CN
2025 NA NA NA CN
2026 NA NA NA CN
2027 NA NA NA CN
2028 NA NA NA CN
2029 NA NA NA CN
2030 NA NA NA CN
2031 NA NA NA CN
2032 NA NA NA CN
2033 NA NA NA CN
2034 NA NA NA CN
2035 NA NA NA CN
2036 NA NA NA CN
2037 NA NA NA CN
2038 NA NA NA CN
2039 NA NA NA CN
2040 NA NA NA CN
2041 NA NA NA CN
2042 NA NA NA CN
2043 NA NA NA CN
2044 NA NA NA CN
2045 NA NA NA CN
2046 NA NA NA CN
2047 NA NA NA CN
2048 NA NA NA CN
2049 NA NA NA CN
2050 NA NA NA CN
2051 NA NA NA CN
2052 NA NA NA CN
2053 NA NA NA CN
2054 NA NA NA CN
2055 NA NA NA CN
2056 NA NA NA CN
2057 NA NA NA CN
2058 NA NA NA CN
2059 NA NA NA CN
2060 NA NA NA CN
2061 NA NA NA CN
2062 NA NA NA CN
2063 NA NA NA CN
2064 NA NA NA CN
2065 NA NA NA CN
2066 NA NA NA CN
2067 NA NA NA CN
2068 NA NA NA CN
2069 NA NA NA CN
2070 NA NA NA CN
2071 NA NA NA CN
2072 NA NA NA CN
2073 NA NA NA CN
2074 NA NA NA CN
2075 NA NA NA CN
2076 NA NA NA CN
2077 NA NA NA CN
2078 NA NA NA CN
2079 NA NA NA CN
2080 NA NA NA CN
?set.seed
set.seed(98250) # a favorite US postal code
# now draw the random data; each row is one week of one year for one store
# we set the status of whether each product was promoted (value 1) by drawing from the binomial distribution that counts the number of “heads” in a collection of coin tosses (where the coin can have any proportion of heads, not just 50 %).
n <- nrow(store.df)
n
[1] 2080
To detail that process: we use the rbinom(n, size, p) (read as “random binomial”) function to draw from the binomial distribution. For every row of the store data, as noted by n=nrow(store.df), we draw from a distribution representing the number of heads in a single coin toss (size=1) with a coin that has probability p=0.1 for product 1 and p=0.15 for product 2. In other words, we arbitrarily assign a 10% likelihood of promotion for product 1 and a 15% likelihood for product 2, and then randomly determine which weeks have promotions.
store.df$p1prom <- rbinom(n=nrow(store.df), size=1, p=0.1) # 10% promoted
store.df$p2prom <- rbinom(n=nrow(store.df), size=1, p=0.15) # 15% promoted
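# A quick check that the promotion frequencies land near their 10% and 15%
# targets (these are random draws, so exact counts will vary):
table(store.df$p1prom)
table(store.df$p2prom)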
head(store.df) # how does it look so far?
storeNum Year Week p1sales p2sales p1price
1 101 1 1 NA NA NA
2 101 1 2 NA NA NA
3 101 1 3 NA NA NA
4 101 1 4 NA NA NA
5 101 1 5 NA NA NA
6 101 1 6 NA NA NA
p2price p1prom p2prom country
1 NA 0 0 US
2 NA 0 0 US
3 NA 1 0 US
4 NA 0 0 US
5 NA 0 1 US
6 NA 0 0 US
# Next we set a price for each product in each row of the data. We suppose that each product is sold at one of five distinct price points ranging from $2.19 to $3.19 overall. We randomly draw a price for each week by defining a vector with the five price points and using sample(x, size, replace) to draw from it as many times as we have rows of data (size=nrow(store.df)). The five prices are sampled many times, so we sample with replacement (replace=TRUE):
store.df$p1price <- sample(x=c(2.19, 2.29, 2.49, 2.79, 2.99), size=nrow(store.df), replace=TRUE)
store.df$p2price <- sample(x=c(2.29, 2.49, 2.59, 2.99, 3.19),
size=nrow(store.df), replace=TRUE)
head(store.df) # now how does it look?
storeNum Year Week p1sales p2sales p1price
1 101 1 1 NA NA 2.29
2 101 1 2 NA NA 2.49
3 101 1 3 NA NA 2.99
4 101 1 4 NA NA 2.99
5 101 1 5 NA NA 2.49
6 101 1 6 NA NA 2.79
p2price p1prom p2prom country
1 2.29 0 0 US
2 2.49 0 0 US
3 2.99 1 0 US
4 3.19 0 0 US
5 2.59 0 1 US
6 2.49 0 0 US
# Question: if price occurs at five discrete levels, does that make it a factor variable? That depends on the analytic question, but in general probably not. We often perform math on price, such as subtracting cost in order to find gross margin, multiplying by units to find total sales, and so forth. Thus, even though it may have only a few unique values, price is a number, not a factor.
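# A hedged sketch of that gross-margin arithmetic (unit.cost is a hypothetical value):
unit.cost <- 1.50
head(store.df$p1price - unit.cost) # gross margin per unit, week by week
rm(unit.cost)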
# Our last step is to simulate the sales figures for each week. We calculate sales as a function of the relative prices of the two products along with the promotional status of each.
# Item sales are in unit counts, so we use the Poisson distribution to generate count data: rpois(n, lambda), where n is the number of draws and lambda is the mean value of units per week. We draw a random Poisson count for each row (nrow(store.df)), and set the mean sales (lambda) of Product 1 to be higher than that of Product 2:
# sales data, using poisson (counts) distribution, rpois()
?rpois
ot <- rpois(5, lambda = 10)
p1price <- sample(x=c(2.19, 2.29, 2.49, 2.79, 2.99), size=5, replace=TRUE)
unit_price <- 20
units_sold <- 5
unit_s_price <- 15
sales <- unit_price*units_sold
sales
[1] 100
sales/unit_s_price # units you would need to sell at unit_s_price to match the revenue earned at unit_price
[1] 6.666667
ot
[1] 12 15 10 7 13
p1price
[1] 2.99 2.49 2.19 2.99 2.79
tt <- ot * p1price
tt
[1] 35.88 37.35 21.90 20.93 36.27
tt/p1price
[1] 12 15 10 7 13
ot1 <- rpois(5, lambda = 15)
mean(ot)
[1] 11.4
mean(ot1)
[1] 16.8
xa <- c(2,3,4,5)
ya <- c(4,6,8,10)
xa
[1] 2 3 4 5
log(xa)
[1] 0.6931472 1.0986123 1.3862944 1.6094379
ya
[1] 4 6 8 10
log(ya)
[1] 1.386294 1.791759 2.079442 2.302585
# sales data, using poisson (counts) distribution, rpois()
tmp.sales1 <- rpois(nrow(store.df), lambda = 120)
tmp.sales2 <- rpois(nrow(store.df), lambda = 100)
# Now we scale those counts up or down according to the relative prices. Price effects often follow a logarithmic function rather than a linear function, so we use log(price) here:
# scale sales according to the ratio of log(price)
tmp.sales1 <- tmp.sales1 * log(store.df$p2price) / log(store.df$p1price)
tmp.sales2 <- tmp.sales2 * log(store.df$p1price) / log(store.df$p2price)
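# Intuition check using two of the price points above: a competitor price that
# is high relative to our own yields a multiplier above 1, inflating own sales
# (roughly 1.48 here):
log(3.19) / log(2.19)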
[1761] 0.9516579 1.0952734 1.1600209 0.9122827
[1765] 0.9516579 0.8285518 1.1600209 0.9122827
[1769] 0.9516579 1.0952734 1.0952734 0.8285518
[1773] 0.9516579 0.9516579 0.8285518 1.0952734
[1777] 0.9516579 0.8285518 0.8285518 0.9516579
[1781] 1.1600209 1.1600209 0.9516579 1.1600209
[1785] 1.1600209 0.9122827 1.0952734 1.0952734
[1789] 0.8285518 1.0952734 0.8285518 0.8285518
[1793] 1.1600209 1.1600209 0.9516579 0.9516579
[1797] 1.0952734 0.9122827 0.9122827 0.9122827
[1801] 1.0952734 1.0952734 1.0952734 0.9516579
[1805] 1.0952734 1.0952734 0.8285518 0.9122827
[1809] 0.9122827 0.9122827 1.0952734 0.9122827
[1813] 1.1600209 0.8285518 0.9122827 0.9516579
[1817] 0.8285518 0.9122827 1.0952734 1.0952734
[1821] 1.1600209 0.9516579 1.0952734 0.9122827
[1825] 1.1600209 0.9122827 0.9516579 1.0952734
[1829] 0.8285518 0.9516579 0.9516579 0.9122827
[1833] 1.0952734 0.9516579 0.9122827 1.0952734
[1837] 1.1600209 0.9122827 1.0952734 0.9516579
[1841] 1.1600209 1.0952734 1.1600209 1.0952734
[1845] 1.1600209 0.9516579 0.9516579 1.1600209
[1849] 1.0952734 1.1600209 1.0952734 1.0952734
[1853] 1.0952734 1.1600209 0.8285518 1.1600209
[1857] 0.9122827 0.8285518 1.0952734 0.8285518
[1861] 0.9516579 1.0952734 0.8285518 0.8285518
[1865] 0.9516579 0.8285518 0.9516579 1.0952734
[1869] 1.1600209 1.0952734 0.9122827 0.9122827
[1873] 0.9516579 0.9122827 0.9516579 0.9122827
[1877] 1.0952734 1.0952734 0.9122827 1.0952734
[1881] 0.8285518 0.9516579 0.8285518 0.8285518
[1885] 0.9516579 0.9516579 1.0952734 0.9516579
[1889] 1.0952734 0.8285518 0.8285518 0.8285518
[1893] 0.8285518 1.1600209 1.1600209 1.1600209
[1897] 1.0952734 0.9122827 1.1600209 1.1600209
[1901] 1.0952734 1.1600209 1.0952734 0.8285518
[1905] 0.8285518 0.9122827 0.9122827 1.1600209
[1909] 1.1600209 0.9122827 0.9516579 1.1600209
[1913] 0.8285518 0.8285518 1.1600209 0.9122827
[1917] 0.8285518 0.8285518 0.8285518 0.8285518
[1921] 0.9122827 0.9516579 0.9122827 1.0952734
[1925] 1.1600209 0.9122827 0.9516579 1.0952734
[1929] 0.9516579 0.9516579 0.9122827 1.1600209
[1933] 0.9122827 0.9122827 1.1600209 1.1600209
[1937] 0.8285518 0.8285518 0.9122827 1.1600209
[1941] 1.0952734 1.0952734 1.0952734 0.8285518
[1945] 0.8285518 1.1600209 1.0952734 0.9516579
[1949] 1.0952734 0.8285518 0.9122827 1.0952734
[1953] 0.8285518 1.0952734 0.9122827 1.0952734
[1957] 0.8285518 1.1600209 0.9516579 0.9516579
[1961] 1.0952734 1.0952734 0.9122827 0.9516579
[1965] 0.8285518 0.9122827 1.0952734 0.9516579
[1969] 1.0952734 0.9122827 0.9516579 1.1600209
[1973] 1.0952734 1.1600209 0.8285518 0.8285518
[1977] 0.8285518 1.0952734 0.8285518 0.9122827
[1981] 1.0952734 0.9516579 0.8285518 1.1600209
[1985] 0.9122827 0.9516579 1.1600209 0.9516579
[1989] 0.8285518 0.9516579 1.1600209 0.9122827
[1993] 1.1600209 0.9122827 0.9516579 1.0952734
[1997] 0.8285518 0.9516579 0.8285518 0.8285518
[2001] 1.0952734 0.9516579 0.8285518 1.0952734
[2005] 0.9516579 0.9122827 0.9516579 0.8285518
[2009] 0.9516579 0.9122827 0.8285518 1.0952734
[2013] 0.9516579 0.9516579 1.0952734 0.9122827
[2017] 1.1600209 1.1600209 1.0952734 1.0952734
[2021] 1.0952734 0.9516579 0.9516579 0.8285518
[2025] 1.1600209 1.1600209 1.0952734 0.9516579
[2029] 0.9122827 0.9122827 0.9516579 1.0952734
[2033] 0.9122827 0.9122827 1.1600209 1.0952734
[2037] 0.9122827 0.8285518 1.0952734 0.9122827
[2041] 0.9122827 1.0952734 0.9516579 1.1600209
[2045] 0.9122827 1.1600209 0.9516579 1.1600209
[2049] 0.9122827 0.9516579 0.8285518 0.9122827
[2053] 0.9516579 0.8285518 0.9122827 0.8285518
[2057] 0.9516579 0.9516579 1.0952734 1.1600209
[2061] 0.9122827 0.9122827 0.9122827 0.9516579
[2065] 0.9122827 0.9122827 0.9122827 0.9516579
[2069] 1.1600209 1.1600209 0.9516579 1.1600209
[2073] 0.9122827 0.9122827 0.9516579 1.0952734
[2077] 0.9516579 0.8285518 0.8285518 0.9122827
# final sales get a 30% or 40% lift when promoted
store.df$p1sales <- floor(tmp.sales1 * (1 + store.df$p1prom*0.3))
store.df$p2sales <- floor(tmp.sales2 * (1 + store.df$p2prom*0.4))
# head(store.df)
library(car)
Attaching package: ‘car’
The following object is masked from ‘package:purrr’:
some
# gives us a random set to inspect
some(store.df, 10)
storeNum Year Week p1sales p2sales p1price
481 105 2 13 136 140 2.79
594 106 2 22 98 101 2.79
701 107 2 25 152 95 2.19
765 108 1 37 114 100 2.79
773 108 1 45 135 98 2.19
875 109 1 43 112 104 2.49
1070 111 1 30 107 102 2.49
1175 112 1 31 94 126 2.99
1735 117 2 19 129 104 2.99
1814 118 1 46 134 93 2.29
p2price p1prom p2prom country
481 2.99 0 1 DE
594 2.59 0 0 DE
701 2.49 0 0 DE
765 2.99 0 1 DE
773 2.59 0 0 DE
875 2.29 0 0 GB
1070 2.29 0 0 GB
1175 2.49 0 0 BR
1735 2.99 0 0 JP
1814 2.29 0 0 AU
The sample variance measures dispersion around the mean:
\[s^{2} = \frac{\sum(x - \bar{x})^2}{n - 1} \]
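As a quick sanity check, the formula can be computed by hand and compared with R's built-in var(); a minimal sketch using p1sales:
# sample variance by the formula above, then via var(); the results match
x <- store.df$p1sales
sum((x - mean(x))^2) / (length(x) - 1)
var(x)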
hist(store.df$p1price)
p1.table <- table(store.df$p1price)
p1.table
2.19 2.29 2.49 2.79 2.99
395 444 423 443 375
plot(p1.table)
object.size(store.df)
120432 bytes
object.size(p1.table)
1136 bytes
# table(var1, var2) places var1 on the rows and var2 on the columns
# this lets us see how often product 1 sold at each price with and without promotion
table(store.df$p1price, store.df$p1prom)
0 1
2.19 354 41
2.29 398 46
2.49 381 42
2.79 396 47
2.99 343 32
plot((table(store.df$p1price, store.df$p1prom)))
plot((table(store.df$p2price, store.df$p2prom)))
# At each price level, Product 1 is observed to have been promoted approximately 10% of the time (as expected, given how we created the data in Sect. 3.1.1). In fact, we can compute the exact fraction of times product 1 is on promotion at each price point, if we assign the table to a variable and then divide the second column of the table by the sum of the first and second columns:
p1.table2 <- table(store.df$p1price, store.df$p1prom)
p1.table2
0 1
2.19 354 41
2.29 398 46
2.49 381 42
2.79 396 47
2.99 343 32
# divide the promotion count (second column) by the row total (first + second columns) to get the fraction of weeks promoted at each price point
p1.table2[, 2] / (p1.table2[, 1] + p1.table2[, 2])
2.19 2.29 2.49 2.79
0.10379747 0.10360360 0.09929078 0.10609481
2.99
0.08533333
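The same row-wise proportions can be obtained in one step with prop.table(); margin=1 scales each row to sum to 1, so the second column is the promotion fraction computed above:
prop.table(p1.table2, margin=1)[, 2]  # fraction of weeks promoted, by price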
describes <- c("Extremes", "Central Tendency", "Dispersion", "Points")
functions <- c("min(x), max(x)", "mean(x), median(x)",
               "var(x), sd(x), IQR(x), mad(x)", "quantile(x, probs=c(...))")
values <- c("Minimum / maximum value", "Arithmetic mean / median",
            "Variance around the mean; standard deviation (sqrt(var(x))); interquartile range (75th-25th percentile); median absolute deviation (a robust variance estimator)",
            "Percentiles")
distfuncs <- data.frame(describes, functions, values)
# print the data frame itself as a reference table; calling table(distfuncs)
# would cross-tabulate every combination of values, which is not useful here
distfuncs
As a reference table:
Extremes           min(x), max(x)                  minimum / maximum value
Central Tendency   mean(x), median(x)              arithmetic mean / median
Dispersion         var(x), sd(x), IQR(x), mad(x)   variance around the mean; standard deviation; interquartile range (75th-25th percentile); median absolute deviation (robust)
Points             quantile(x, probs=c(...))       percentiles
min(store.df$p1sales)
[1] 71
max(store.df$p2sales)
[1] 221
mean(store.df$p1prom)
[1] 0.1
median(store.df$p2sales)
[1] 97
var(store.df$p1sales)
[1] 820.4308
sd(store.df$p1sales)
[1] 28.64316
IQR(store.df$p1sales)
[1] 37
mad(store.df$p1sales)
[1] 26.6868
quantile(store.df$p1sales, probs=c(0.25, 0.5, 0.75))
25% 50% 75%
113 130 150
# ':' binds tighter than '/', so 0:10/10 means (0:10)/10
0:10/10
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# created 2 by 2 dataframe
mysummary.df <- data.frame(matrix(NA, nrow=2, ncol=2))
# label the column names
names(mysummary.df) <- c("Median Sales", "IQR")
# label the indexes of the obs
rownames(mysummary.df) <- c("Product 1", "Product 2")
# assign by row and column label: median sales for product 1...
mysummary.df["Product 1", "Median Sales"] <- median(store.df$p1sales)
# ...and for product 2
mysummary.df["Product 2", "Median Sales"] <- median(store.df$p2sales)
mysummary.df["Product 1", "IQR"] <- IQR(store.df$p1sales)
mysummary.df["Product 2", "IQR"] <- IQR(store.df$p2sales)
mysummary.df
Median Sales IQR
Product 1 130 37
Product 2 97 30
summary(store.df$Year, digits = 1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 2 2 2 2
The psych package's describe() adds trimmed means, skew, and kurtosis, which are useful when interpreting data with regard to normal distributions.
library(psych)
describe(store.df)
vars n mean sd median trimmed mad min max range skew kurtosis se
storeNum* 1 2080 10.50 5.77 10.50 10.50 7.41 1.00 20.00 19.0 0.00 -1.21 0.13
Year 2 2080 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.0 0.00 -2.00 0.01
Week 3 2080 26.50 15.01 26.50 26.50 19.27 1.00 52.00 51.0 0.00 -1.20 0.33
p1sales 4 2080 133.07 28.64 130.00 131.13 26.69 71.00 254.00 183.0 0.70 0.57 0.63
p2sales 5 2080 100.06 24.11 97.00 97.88 22.24 52.00 221.00 169.0 1.00 1.56 0.53
p1price 6 2080 2.54 0.29 2.49 2.53 0.44 2.19 2.99 0.8 0.28 -1.44 0.01
p2price 7 2080 2.70 0.33 2.59 2.69 0.44 2.29 3.19 0.9 0.32 -1.40 0.01
p1prom 8 2080 0.10 0.30 0.00 0.00 0.00 0.00 1.00 1.0 2.66 5.10 0.01
p2prom 9 2080 0.14 0.35 0.00 0.05 0.00 0.00 1.00 1.0 2.09 2.38 0.01
country* 10 2080 4.55 1.72 4.50 4.62 2.22 1.00 7.00 6.0 -0.29 -0.81 0.04
print("")
[1] ""
# describe() can produce errors or meaningless rows for non-numeric (factor) variables; to avoid that, pass only the specific columns of interest
describe(store.df[ , c(2, 4:9)])
vars n mean sd median trimmed mad min max range skew kurtosis se
Year 1 2080 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.0 0.00 -2.00 0.01
p1sales 2 2080 133.07 28.64 130.00 131.13 26.69 71.00 254.00 183.0 0.70 0.57 0.63
p2sales 3 2080 100.06 24.11 97.00 97.88 22.24 52.00 221.00 169.0 1.00 1.56 0.53
p1price 4 2080 2.54 0.29 2.49 2.53 0.44 2.19 2.99 0.8 0.28 -1.44 0.01
p2price 5 2080 2.70 0.33 2.59 2.69 0.44 2.29 3.19 0.9 0.32 -1.40 0.01
p1prom 6 2080 0.10 0.30 0.00 0.00 0.00 0.00 1.00 1.0 2.66 5.10 0.01
p2prom 7 2080 0.14 0.35 0.00 0.05 0.00 0.00 1.00 1.0 2.09 2.38 0.01
# general form: apply(X=DATA, MARGIN=MARGIN, FUN=FUNCTION)
apply(store.df[,2:9], MARGIN=2, FUN=mean)
Year Week p1sales p2sales p1price
1.5000000 26.5000000 133.0668269 100.0634615 2.5443750
p2price p1prom p2prom
2.6995192 0.1000000 0.1384615
# with MARGIN=1 the function is applied across rows instead of columns,
# here giving the mean of columns 2-9 for each week
apply(store.df[,2:9], 1, mean)
# Similarly, we might find the sum() or sd() for multiple columns with MARGIN=2:
apply(store.df[,2:9], 2, sum)
Year Week p1sales p2sales p1price p2price
3120.0 55120.0 276779.0 208132.0 5292.3 5615.0
p1prom p2prom
208.0 288.0
apply(store.df[,2:9], 2, sd)
Year Week p1sales p2sales p1price
0.5001202 15.0119401 28.6431639 24.1071803 0.2948819
p2price p1prom p2prom
0.3292181 0.3000721 0.3454668
apply(store.df[,2:9], 2, function(x) { mean(x) - median(x) } )
Year Week p1sales p2sales p1price
0.0000000 0.0000000 3.0668269 3.0634615 0.0543750
p2price p1prom p2prom
0.1095192 0.1000000 0.1384615
apply() is a functional idiom. People who are used to procedural programming might envision solving this with a for loop, but R is primarily a functional programming language, so try to think in terms of functions applied across data, as we do here.
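For comparison, a procedural version of the column means computed earlier might look like the following sketch; the apply() call is the idiomatic equivalent:
# for-loop equivalent of apply(store.df[, 2:9], 2, mean)
col.means <- numeric(8)
names(col.means) <- names(store.df)[2:9]
for (i in 2:9) {
  col.means[i - 1] <- mean(store.df[, i])
}
col.means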
mysummary2.df <- data.frame(matrix(NA, nrow=2,ncol=2))
names(mysummary2.df) <- c("Median Sales", "IQR")
rownames(mysummary2.df) <- names(store.df)[4:5] # names from the data frame
mysummary2.df
Median Sales IQR
p1sales NA NA
p2sales NA NA
mysummary2.df[, "Median Sales"] <- apply(store.df[,4:5], 2, median)
mysummary2.df[, "IQR"] <- apply(store.df[, 4:5], 2, IQR)
mysummary2.df
Median Sales IQR
p1sales 130 37
p2sales 97 30
If there were many products instead of just two, the code would still work if we changed the number of allocated rows, and apply() would run automatically across all of them.
mysummary2.df <- data.frame(matrix(NA, nrow=2,ncol=4))
names(mysummary2.df) <- c("Median Sales", "IQR", "Mean Sales", "kurtosi")
rownames(mysummary2.df) <- names(store.df)[4:5] # names from the data frame
mysummary2.df
Median Sales IQR Mean Sales kurtosi
p1sales NA NA NA NA
p2sales NA NA NA NA
mysummary2.df[, "Median Sales"] <- apply(store.df[,4:5], 2, median)
mysummary2.df[, "IQR"] <- apply(store.df[, 4:5], 2, IQR)
mysummary2.df[, "Mean Sales"] <- apply(store.df[, 4:5], 2, mean)
mysummary2.df[, "kurtosi"] <- apply(store.df[, 4:5], 2, psych::kurtosi)
mysummary2.df
Median Sales IQR Mean Sales kurtosi
p1sales 130 37 133.0668 0.5653844
p2sales 97 30 100.0635 1.5617362
hist(store.df$p1sales)
#add title and axis labels
hist(store.df$p1sales,
main="Product 1 Weekly Sales Frequencies, All Stores",
xlab="Product 1 Sales (Units)",
ylab="Count" )
# increase granularity by adding breaks
hist(store.df$p1sales,
main="Product 1 Weekly Sales Frequencies, All Stores",
xlab="Product 1 Sales (Units)",
ylab="Count",
breaks=30, # more columns
col="lightblue") # color the bars
hist(store.df$p1sales,
main="Product 1 Weekly Sales Frequencies, All Stores",
xlab="Product 1 Sales (Units)",
ylab="Relative frequency",
breaks=30,
col="lightblue",
freq=FALSE, # freq=FALSE means plot density, not counts
xaxt="n") # xaxt="n" means "x axis tick marks == no"
lines(density(store.df$p1sales, bw=10), # "bw= ..." adjusts the smoothing
type="l", col="darkred", lwd=2) # lwd = line width
boxplot(store.df$p2sales, xlab="Weekly sales", ylab="P2",
main="Weekly sales of P2, All stores", horizontal=TRUE)
# boxplot p2sales by Store.
boxplot(store.df$p2sales ~ store.df$storeNum, horizontal=TRUE,
ylab="Store", xlab="Weekly unit sales", las=1,
main="Weekly Sales of P2 by Store")
# but do p2 sales differ in relation to in-store promotion
boxplot(p2sales ~ p2prom, data=store.df, horizontal=TRUE, yaxt="n",
ylab="P2 promoted in store?", xlab="Weekly sales",
main="Weekly sales of P2 with and without promotion")
axis(side=2, at=c(1,2), labels=c("No", "Yes")) # re-draw the suppressed y-axis with text labels
To evaluate a distribution more formally, we can check it against a theoretical distribution.
Quantile–quantile (QQ) plots are a good way to check one’s data against a distribution that you think it should come from. Some common statistics such as the correlation coefficient r (to be precise, the Pearson product-moment correlation coefficient) are interpreted under an assumption that data are normally distributed. A QQ plot can confirm that the distribution is, in fact, normal by plotting the observed quantiles of your data against the quantiles that would be expected for a normal distribution.
To do this, the qqnorm() command compares data vs. a normal distribution; you can use qqline() to add a diagonal line for easier reading. We check p1sales to see whether it is normally distributed:
qqnorm(store.df$p1sales)
#make sure you call a plot before calling qqline
qqline(store.df$p1sales)
The upward-curving shape is typical of data with high positive skew.
What can you do? You might transform your data, e.g., with a logarithm:
qqnorm(log(store.df$p1sales))
?qqnorm
#make sure you call a plot before calling qqline
qqline(log(store.df$p1sales))
This is formally called the empirical cumulative distribution function (ECDF).
We plot the ECDF of p1sales by combining a few steps. First, we use the ecdf() function to find the ECDF of the data. Then we wrap plot() around that, adding options such as titles. Next we put some nicer-looking labels on the Y axis that relabel the proportions as percentiles. The paste() function combines a number vector (0, 10, 20, …) with the “%” symbol to make each label.
plot(ecdf(store.df$p1sales))
plot(ecdf(store.df$p1sales),
main="Cumulative distribution of P1 Weekly Sales",
ylab="Cumulative Proportion",
xlab=c("P1 weekly sales, all stores", "90% of weeks sold <= 171 units"),
yaxt="n")
axis(side=2, at=seq(0, 1, by=0.1), las=1,
labels=paste(seq(0,100,by=10), "%", sep=""))
abline(h=0.9, lty=3) # "h=" for horizontal line; "lty=3" for dotted
abline(v=quantile(store.df$p1sales, pr=0.9), lty=3) # "v=" for vertical line
What should we do if we want to break out data by factors and summarize it, a process you might know as “cross-tabs” or “pivot tables”?
For example, how can we compute the mean sales by store?
by and aggregate can help!
Suppose we wish to find the average sales of P1 by store
The DATA would be the weekly sales for each store, store.df$p1sales. We wish to split this by store, so the INDICES (actually, "index" in this case) would be store.df$storeNum. Finally, we get the average of each of those groups by using the mean function.
# Suppose we wish to find the average sales of P1 by store
by(store.df$p1sales, store.df$storeNum, mean)
by(store.df$p1sales, list(store.df$storeNum, store.df$Year), mean)
How does this work? Just as with by(), aggregate(x=DATA, by=BY, FUN=FUNCTION) applies a particular function (FUN) according to divisions of the data specified by a factor (by). We want to find the total sales by country, so we apply the sum function by store.df$country.
aggregate(store.df$p1sales, by=list(country=store.df$country), sum)
country x
1 AU 14583
2 BR 27864
3 CN 27411
4 DE 68928
5 GB 41103
6 JP 55303
7 US 41587
Marketing data on a map: the choropleth map.
Here is a routine example. Suppose that we want to chart the total sales by country. We use aggregate() as in Sect. 3.4.5 to find the total sales of P1 by country:
p1sales.sum <- aggregate(store.df$p1sales,
by=list(country=store.df$country), sum)
install.packages(c("rworldmap", "RColorBrewer")) # if needed
library(rworldmap)
library(RColorBrewer)
First, we have to associate the aggregated data with specific map regions using the country codes. This can be done with the joinCountryData2Map() function, which matches country locations (store.df$country) for data points with the corresponding international standard names (ISO names) and returns a map object:
p1sales.map <- joinCountryData2Map(p1sales.sum, joinCode = "ISO2",
nameJoinColumn = "country")
Error: could not find function "joinCountryData2Map"
(this error appears when rworldmap has not been installed and loaded; run the install.packages() and library() calls above first)
The chapter closes with guidelines and pointers that help you describe data accurately and quickly.
Next, we create a data set for 1,000 customers of a retailer who sells products in stores and online. This data is typical of what one might sample from a company's customer relationship management (CRM) system. We begin by setting a random number seed to make the process repeatable (as described in Sect. 3.1.2) and creating a data frame to store the data:
set.seed(21821)
ncust <- 1000
cust.df <- data.frame(cust.id=as.factor(c(1:ncust)))
Next we create a number of variables describing the customers, add those variables to the cust.df data frame, and inspect them with summary():
cust.df$age <- rnorm(n=ncust, mean=35, sd=5)
cust.df$credit.score <- rnorm(n=ncust, mean=3*cust.df$age+620, sd=50)
cust.df$email <- factor(sample(c("yes", "no"), size=ncust, replace=TRUE,
prob=c(0.8, 0.2)))
cust.df$distance.to.store <- exp(rnorm(n=ncust, mean=2, sd=1.2))
summary(cust.df)
cust.id age credit.score email distance.to.store
1 : 1 Min. :19.34 Min. :543.0 no :186 Min. : 0.2136
2 : 1 1st Qu.:31.43 1st Qu.:691.7 yes:814 1st Qu.: 3.3383
3 : 1 Median :35.10 Median :725.5 Median : 7.1317
4 : 1 Mean :34.92 Mean :725.5 Mean : 14.6553
5 : 1 3rd Qu.:38.20 3rd Qu.:757.2 3rd Qu.: 16.6589
6 : 1 Max. :51.86 Max. :880.8 Max. :267.0864
(Other):994
Our final variable for the basic CRM data is distance.to.store, which we assume follows the exponential of the normal distribution. That gives distances that are all positive, with many distances that are relatively close to the nearest store and fewer that are far from a store. To see the distribution for yourself, try hist(cust.df$distance.to.store).
Formally, we say that distance.to.store follows a lognormal distribution. (This is sufficiently common that there is a built-in function called rlnorm(n, meanlog, sdlog) that does the same thing as taking the exponential of rnorm().)
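A minimal sketch of that equivalence (the seed value is arbitrary; with the same seed, rlnorm() should reproduce the exponentiated normal draws, since it exponentiates the same underlying normal stream):
set.seed(99)                      # arbitrary seed for illustration
exp(rnorm(5, mean=2, sd=1.2))
set.seed(99)
rlnorm(5, meanlog=2, sdlog=1.2)   # the same five values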
Our next step is to create data for the online store: 1 year totals for each customer for online visits and transactions, plus total spending. We simulate the number of visits with a negative binomial distribution, a discrete distribution often used to model counts of events over time. Like the lognormal distribution, the negative binomial distribution generates positive values and has a long right-hand tail, meaning that in our data most customers make relatively few visits and a few customers make many visits. Data from the negative binomial distribution can be generated using rnbinom():
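The call that generates online.visits is not shown in these notes; here is a sketch consistent with the description that follows (the size, i.e. dispersion, value and the age coefficient are assumptions):
# negative binomial visit counts: baseline 15, +15 if email on file,
# and more visits for customers younger than the median age
cust.df$online.visits <- rnbinom(ncust, size=0.3,
    mu = 15 + ifelse(cust.df$email=="yes", 15, 0)
         - 0.7 * (cust.df$age - median(cust.df$age)))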
cust.df$online.visits
[1] 20 121 39 1 35 1 1 48 0 14 2 0 0 108 0 1 0 33 16 13 0
[22] 24 0 14 3 83 127 8 30 40 0 13 197 4 0 0 21 0 1 0 0 0
[43] 17 0 0 43 1 111 0 1 0 24 33 24 19 0 85 0 7 0 67 0 0
[64] 1 34 14 21 11 0 0 2 4 0 1 12 5 1 3 0 2 0 0 0 0
[85] 45 4 2 2 0 40 6 0 7 12 5 144 1 143 5 186 52 2 0 30 4
[106] 112 1 0 5 42 18 0 1 323 30 3 0 7 0 34 0 15 64 46 2 19
[127] 71 10 4 77 106 79 2 218 5 0 19 3 36 0 99 6 0 0 0 22 0
[148] 0 0 26 12 0 118 5 0 0 0 0 102 20 6 9 0 5 0 13 18 0
[169] 0 0 0 26 6 236 0 9 1 0 59 89 17 28 88 0 5 13 62 13 21
[190] 5 134 10 0 12 1 5 27 32 0 122 0 1 0 73 3 11 0 16 50 3
[211] 0 10 1 28 17 0 5 25 55 0 20 11 0 18 156 2 0 32 0 5 36
[232] 18 0 53 19 3 2 2 81 22 1 9 3 1 31 6 0 0 2 2 2 27
[253] 6 0 4 2 10 167 15 9 43 0 62 121 2 0 78 98 98 2 97 10 12
[274] 4 3 47 10 8 0 87 10 39 9 0 13 1 28 28 1 4 16 323 0 184
[295] 20 0 11 16 21 0 151 0 1 147 5 113 31 13 0 2 84 40 0 1 156
[316] 0 3 3 2 0 0 26 64 10 2 84 6 0 0 14 83 10 55 8 43 186
[337] 3 2 0 0 5 11 0 12 0 0 18 7 180 14 0 4 2 1 0 16 18
[358] 111 19 1 44 4 10 231 152 29 44 102 111 10 4 0 6 3 2 0 39 0
[379] 0 3 75 0 0 18 7 28 0 40 13 2 0 1 0 33 30 2 14 86 10
[400] 0 9 0 10 77 87 86 26 52 124 5 0 3 62 11 1 0 15 16 1 0
[421] 8 50 266 55 24 0 5 1 62 0 0 0 0 7 0 15 86 26 41 0 0
[442] 1 2 48 145 4 28 60 52 2 2 42 12 0 27 0 0 0 131 115 2 7
[463] 48 114 291 7 99 94 2 10 0 49 2 1 1 46 0 7 12 2 2 194 104
[484] 12 162 0 1 1 0 4 0 9 0 0 14 14 0 18 0 14 0 1 6 9
[505] 0 0 0 0 15 2 11 58 4 3 14 19 7 0 1 0 68 2 0 1 25
[526] 4 31 3 37 15 0 40 6 50 2 108 0 1 4 49 2 58 24 12 0 9
[547] 0 0 0 0 4 0 0 0 1 35 5 2 0 61 65 0 95 1 48 42 0
[568] 8 12 1 4 0 97 72 100 22 0 62 6 40 25 0 0 14 0 0 0 1
[589] 4 0 110 2 1 4 23 11 85 17 138 109 0 28 27 5 0 100 0 65 9
[610] 93 2 0 95 0 606 0 0 2 7 31 108 111 0 0 4 530 31 73 56 0
[631] 33 6 2 58 14 78 52 2 15 6 0 9 27 54 44 3 0 2 0 17 0
[652] 6 185 12 95 3 3 0 9 6 40 119 132 25 0 5 11 0 0 136 11 16
[673] 5 6 6 0 1 0 0 5 64 0 12 66 2 20 15 1 7 3 2 16 131
[694] 0 118 28 59 42 1 4 23 24 0 0 6 0 97 35 143 26 0 0 0 3
[715] 99 0 50 3 3 0 0 1 0 1 48 0 95 12 0 1 21 3 8 31 2
[736] 11 0 53 153 56 0 1 328 0 3 139 108 0 40 8 4 1 22 0 14 113
[757] 29 1 3 7 93 11 0 42 29 3 3 3 378 30 0 85 18 0 130 73 12
[778] 0 2 0 181 0 27 1 16 25 217 42 0 0 6 62 0 0 0 0 19 15
[799] 0 0 123 37 0 0 4 2 6 0 0 31 8 72 5 145 0 0 315 1 0
[820] 59 1 18 19 161 18 87 5 52 3 1 0 84 84 2 167 0 35 53 241 24
[841] 2 139 26 0 75 0 0 0 0 8 3 25 11 0 146 22 13 28 181 3 5
[862] 0 78 0 0 23 0 1 12 1 0 0 93 178 11 0 0 8 31 0 8 50
[883] 1 1 16 7 23 0 0 6 13 7 88 10 9 0 0 16 2 0 0 10 8
[904] 10 101 17 2 6 0 2 1 9 2 0 36 15 21 5 7 0 37 93 0 0
[925] 0 61 7 7 0 0 2 24 0 62 3 0 71 24 0 12 4 2 1 0 1
[946] 7 0 2 5 0 3 13 45 0 2 22 51 107 39 17 59 17 80 39 0 3
[967] 0 0 121 28 12 8 102 5 20 2 85 0 1 6 3 11 0 1 75 83 0
[988] 28 0 0 28 1 49 5 3 9 46 16 47 37
Above: We model the mean (mu) of the negative binomial with a baseline value of 15. The size argument sets the degree of dispersion (variation) for the samples. We add an average of 15 online visits for customers who have an email on file, using ifelse() to generate a vector of 0 or 15 as appropriate. Finally, we add or subtract visits from the target mean based on the customer's age relative to the sample median; younger customers are simulated to make more online visits. To see exactly how this works, try cutting and pasting pieces of the code above into the R console.
For each online visit that a customer makes, we assume there is a 30% chance of placing an order and use rbinom() to create the variable online.trans. We assume that amounts spent in those orders (the variable online.spend) are lognormally distributed:
cust.df$online.trans <- rbinom(ncust, size=cust.df$online.visits, prob=0.3)
cust.df$online.spend <- exp(rnorm(ncust, mean=3, sd=0.1)) *
cust.df$online.trans
hist(cust.df$online.trans)
hist(cust.df$online.spend)
Above: The random value for amount spent per transaction, sampled with exp(rnorm()), is multiplied by the variable for number of transactions to get the total amount spent.
Next we generate in-store sales data similarly, except that we don’t generate a count of store visits; few customers visit a physical store without making a purchase and even if customers did visit without buying, the company probably couldn’t track the visit. We assume that transactions follow a negative binomial distribution, with lower average numbers of visits for customers who live farther away. We model in-store spending as a lognormally distributed variable simply multiplied by the number of transactions:
# lower average number of visits if they live further away
cust.df$store.trans <- rnbinom(ncust, size=5,
mu=3 / sqrt(cust.df$distance.to.store))
# lognormally distributed var * by the number of transactions
cust.df$store.spend <- exp(rnorm(ncust, mean=3.5, sd=0.4)) *
cust.df$store.trans
hist(cust.df$store.trans, plot = TRUE)
summary(cust.df)
cust.id age credit.score email distance.to.store
1 : 1 Min. :19.34 Min. :543.0 no :186 Min. : 0.2136
2 : 1 1st Qu.:31.43 1st Qu.:691.7 yes:814 1st Qu.: 3.3383
3 : 1 Median :35.10 Median :725.5 Median : 7.1317
4 : 1 Mean :34.92 Mean :725.5 Mean : 14.6553
5 : 1 3rd Qu.:38.20 3rd Qu.:757.2 3rd Qu.: 16.6589
6 : 1 Max. :51.86 Max. :880.8 Max. :267.0864
(Other):994
online.visits online.trans online.spend store.trans store.spend
Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
Median : 6.00 Median : 2.000 Median : 38.05 Median : 1.000 Median : 28.63
Mean : 28.29 Mean : 8.458 Mean : 170.44 Mean : 1.238 Mean : 44.56
3rd Qu.: 31.00 3rd Qu.: 9.000 3rd Qu.: 176.51 3rd Qu.: 2.000 3rd Qu.: 59.63
Max. :606.00 Max. :184.000 Max. :3630.99 Max. :19.000 Max. :617.97
hist(cust.df$store.spend, plot = TRUE)
To simulate survey responses, we assume that each customer has an unobserved overall satisfaction with the brand. We generate this overall satisfaction from a normal distribution (a sketch of the generating code follows the next paragraph).
We assume that overall satisfaction is a psychological construct that is not directly observable. Instead, the survey collects information on two items: satisfaction with service, and satisfaction with the selection of products. We assume that customers' responses to the survey items are based on unobserved levels of satisfaction overall (sometimes called the "halo" in survey response) plus the specific levels of satisfaction with the service and product selection.
To create such a score from a halo variable, we add sat.overall (the halo) to a random value specific to the item, drawn using rnorm(). Because survey responses are typically given on a discrete, ordinal scale (i.e., “very unsatisfied”, “unsatisfied”, etc.), we convert our continuous random values to discrete integers using the floor() function.
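The generating code is not shown in these notes; here is a sketch consistent with the summaries below (the specific means and standard deviations are assumptions):
sat.overall <- rnorm(ncust, mean=3.1, sd=0.7)                       # unobserved halo
sat.service <- floor(sat.overall + rnorm(ncust, mean=0.5, sd=0.4))  # halo + item-specific noise
sat.selection <- floor(sat.overall + rnorm(ncust, mean=-0.2, sd=0.6))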
summary(cbind(sat.service, sat.selection))
sat.service sat.selection
Min. :1.000 Min. :0.00
1st Qu.:3.000 1st Qu.:2.00
Median :3.000 Median :2.00
Mean :3.087 Mean :2.45
3rd Qu.:4.000 3rd Qu.:3.00
Max. :6.000 Max. :5.00
hist(sat.service)
hist(sat.selection)
Note that we use cbind() to temporarily combine our two vectors of data into a matrix, so that we can get a combined summary with a single line of code. The summary shows that our data now extends beyond the intended 1-5 rating scale (here from 0 to 6). However, a typical satisfaction item might be given on a 5-point scale. To fit that, we replace values that are greater than 5 with 5, and values that are less than 1 with 1. This enforces the floor and ceiling effects often noted in the survey response literature.
We set the ceiling by indexing with a vector that tests whether each element of sat.service is greater than 5: sat.service[sat.service > 5]. This might be read as "sat.service, where sat.service is greater than 5." For the elements that are selected, which means the expression evaluates as TRUE, we replace the current values with the ceiling value of 5. We do the same for the floor effects (< 1, replacing with 1) and likewise for the ceiling and floor of sat.selection. While this sounds quite complicated, the code is simple:
# should be read as sat.service, where sat.service is greater than 5, set to 5
sat.service[sat.service > 5] <- 5
# should be read as sat.service, where sat.service is less than 1, set to 1
sat.service[sat.service < 1] <- 1
sat.selection[sat.selection > 5] <- 5
sat.selection[sat.selection < 1] <- 1
summary(cbind(sat.service, sat.selection))
sat.service sat.selection
Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:2.000
Median :3.000 Median :2.000
Mean :3.086 Mean :2.462
3rd Qu.:4.000 3rd Qu.:3.000
Max. :5.000 Max. :5.000
Because some customers do not respond to surveys, we eliminate the simulated answers for a subset of respondents who are modeled as not answering. We do this by creating a variable of TRUE and FALSE values called no.response and then assigning a value of NA for the survey response for customers whose no.response is TRUE. As we have discussed, NA is R's built-in constant for missing data.
We model non-response as a function of age, with higher likelihood of not responding to the survey for older customers:
no.response <- as.logical(rbinom(ncust, size=1, prob=cust.df$age/100))
sat.service[no.response] <- NA
sat.selection[no.response] <- NA
summary(cbind(sat.service, sat.selection))
sat.service sat.selection
Min. :1.000 Min. :1.00
1st Qu.:2.000 1st Qu.:2.00
Median :3.000 Median :2.00
Mean :3.077 Mean :2.43
3rd Qu.:4.000 3rd Qu.:3.00
Max. :5.000 Max. :5.00
NA's :353 NA's :353
summary() recognizes the 353 customers with NA values and excludes them from the statistics.
Then we add the survey responses to cust.df and clean up the workspace.
cust.df$sat.service <- sat.service
cust.df$sat.selection <- sat.selection
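A sketch of the cleanup, removing the temporary vectors now that the responses live in cust.df (assuming the working objects named above):
rm(sat.overall, sat.service, sat.selection, no.response)  # tidy the workspace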
summary(cust.df)
We are ready for analysis!
str(cust.df)
'data.frame': 1000 obs. of 10 variables:
$ cust.id : Factor w/ 1000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 22.9 28 35.9 30.5 38.7 ...
$ credit.score : num 631 749 733 830 734 ...
$ email : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
$ distance.to.store: num 2.58 48.18 1.29 5.25 25.04 ...
$ online.visits : num 20 121 39 1 35 1 1 48 0 14 ...
$ online.trans : int 7 36 13 1 4 0 0 14 0 4 ...
$ online.spend : num 135.1 739.2 219.7 18.6 79.3 ...
$ store.trans : num 0 0 3 1 0 1 3 0 0 0 ...
$ store.spend : num 0 0 99.8 26.1 0 ...
The additional variables report 1-year totals: visits to the online site, online and in-store transactions, and online and in-store spending.
plot(x=cust.df$age, y=cust.df$credit.score)
plot(cust.df$age, cust.df$credit.score,
col="blue",
xlim=c(15, 55), ylim=c(500, 900),
main="Active Customers as of June 2014",
xlab="Customer Age (years)", ylab="Customer Credit Score ")
abline(h=mean(cust.df$credit.score), col="dark blue", lty="dotted")
abline(v=mean(cust.df$age), col="dark blue", lty="dotted")
R looks at what type of data you are trying to plot and, based on the data type, R will choose a specific lower-level plotting function, known as a method, that is appropriate to the data you are trying to plot. When we call plot() with vectors of x and y coordinates, R uses the plot.default() function. However, there are many other plotting functions for different data types. For example, if you plot the cust.df data frame by typing plot(cust.df) into the console, R will use plot.data.frame() instead of plot.default(). This produces one of several plot types depending on the number of dimensions in the data frame; in this case, it produces a scatterplot matrix, which we review in Sect. 4.4.2.
If a plot does not look the way you expect, it may be because R has selected a different method than you expected. If so, check the data types of the variables you're sending to plot(), because R uses those to select a plot method.
methods(plot)
[1] plot,allcategorical_missing_data.frame,binary-method
[2] plot,allcategorical_missing_data.frame,categorical-method
[3] plot,ANY,ANY-method
[4] plot,color,ANY-method
[5] plot,mi_list,ANY-method
[6] plot,mi,ANY-method
[7] plot,missing_data.frame,binary-method
[8] plot,missing_data.frame,categorical-method
[9] plot,missing_data.frame,missing_variable-method
[10] plot,missing_data.frame,semi-continuous-method
[11] plot,profile.mle,missing-method
[12] plot.aareg*
[13] plot.acf*
[14] plot.ACF*
[15] plot.agnes*
[16] plot.areg*
[17] plot.areg.boot*
[18] plot.aregImpute*
[19] plot.augPred*
[20] plot.balance*
[21] plot.biVar*
[22] plot.boot*
[23] plot.cld*
[24] plot.clusGap*
[25] plot.coef.mer*
[26] plot.cohesiveBlocks*
[27] plot.communities*
[28] plot.compareFits*
[29] plot.confint.glht*
[30] plot.correspondence*
[31] plot.cox.zph*
[32] plot.curveRep*
[33] plot.data.frame*
[34] plot.decomposed.ts*
[35] plot.default
[36] plot.dendrogram*
[37] plot.density*
[38] plot.diana*
[39] plot.drawPlot*
[40] plot.ecdf
[41] plot.factor*
[42] plot.formula*
[43] plot.function
[44] plot.gam*
[45] plot.gbayes*
[46] plot.ggplot*
[47] plot.glht*
[48] plot.gls*
[49] plot.grenander*
[50] plot.gtable*
[51] plot.hclust*
[52] plot.histogram*
[53] plot.HoltWinters*
[54] plot.huge*
[55] plot.igraph*
[56] plot.InformativeTesting*
[57] plot.intervals.lmList*
[58] plot.irt
[59] plot.isoreg*
[60] plot.jam*
[61] plot.lda*
[62] plot.lm*
[63] plot.lme*
[64] plot.lmList*
[65] plot.lmList4*
[66] plot.lmList4.confint*
[67] plot.mca*
[68] plot.mcmc*
[69] plot.mcmc.list*
[70] plot.medpolish*
[71] plot.merMod*
[72] plot.mlm*
[73] plot.mona*
[74] plot.monoreg*
[75] plot.nffGroupedData*
[76] plot.nfnGroupedData*
[77] plot.nls*
[78] plot.nmGroupedData*
[79] plot.partition*
[80] plot.PBmodcomp*
[81] plot.pdMat*
[82] plot.poly
[83] plot.poly.parallel
[84] plot.powerTransform*
[85] plot.ppr*
[86] plot.prcomp*
[87] plot.princomp*
[88] plot.profile*
[89] plot.profile.nls*
[90] plot.psych
[91] plot.qgraph*
[92] plot.qss1*
[93] plot.qss2*
[94] plot.Quantile2*
[95] plot.ranef.lme*
[96] plot.ranef.lmList*
[97] plot.ranef.mer*
[98] plot.raster*
[99] plot.residuals
[100] plot.ridgelm*
[101] plot.rm.boot*
[102] plot.roc*
[103] plot.rpart*
[104] plot.rq.process*
[105] plot.rqs*
[106] plot.rqss*
[107] plot.select*
[108] plot.shingle*
[109] plot.silhouette*
[110] plot.sim*
[111] plot.simulate.lme*
[112] plot.sir*
[113] plot.skewpowerTransform*
[114] plot.spec*
[115] plot.spline*
[116] plot.stepfun
[117] plot.stl*
[118] plot.summary.crqs*
[119] plot.summary.formula.response*
[120] plot.summary.formula.reverse*
[121] plot.summary.rqs*
[122] plot.summary.rqss*
[123] plot.summaryM*
[124] plot.summaryP*
[125] plot.summaryS*
[126] plot.survfit*
[127] plot.table*
[128] plot.table.rq*
[129] plot.testSlopes*
[130] plot.times*
[131] plot.transcan*
[132] plot.trellis*
[133] plot.ts
[134] plot.tskernel*
[135] plot.TukeyHSD*
[136] plot.varclus*
[137] plot.Variogram*
[138] plot.xyVector*
[139] plot.zoo*
see '?methods' for accessing help and source code
In our data, do customers who buy more online buy less in stores? We start by plotting online sales against in-store sales:
plot(cust.df$store.spend, cust.df$online.spend,
main="Customers as of June 2014",
xlab="Prior 12 months in-store sales ($)",
ylab="Prior 12 months online sales ($)",
cex=0.7)
Above: The resulting plot in Fig. 4.2 is typical of the skewed distributions that are common in behavioral data such as sales or transaction counts; most customers purchase rarely so the data is dense near zero. The resulting plot has a lot of points along the axes; we use the cex option, which scales down the plotted points to 0.7 of their default size so that we can see the points a bit more clearly. The plot shows that there are a large number of customers who didn’t buy anything on one of the two channels (the points along the axes), along with a smaller number of customers who purchase fairly large amounts on one of the channels.
Because of the skewness of the data, we use a histogram in hopes of better clarity.
hist(cust.df$store.spend,
breaks=(0:ceiling(max(cust.df$store.spend)/10))*10,
main="Customers as of June 2014",
xlab="Prior 12 months online sales ($)",
ylab="Count of customers")
The histogram in Fig. 4.3 shows clearly that a large number of customers bought nothing in the store (about 400 out of 1,000). The distribution of sales among those who do buy has a mode around $20 and a long right-hand tail with a few customers whose 12-month spending was high. Such distributions are typical of spending and transaction counts in customer data.
Another question is whether the propensity to buy online versus in store is related to our email efforts (as reflected by whether or not a customer has an email address on file).
We can add the email dimension to the plot in Fig. 4.2 by coloring in the points for customers whose email address is known to us. To do this, we use plot() arguments that allow us to draw different colors (col=) and symbols for the points (pch=). Each argument takes a vector that specifies the option—the color or symbol—that you want for each individual point. Thus, if we provide a vector of colors of the same length as the vectors of x and y values, col= will use the corresponding colors for each point. Constructing such vectors can be tricky, so we will build them up slowly.
To begin, we first declare vectors for the color and point types that we want to use:
my.col <- c("black", "green3")
my.pch <- c(1, 19) # 1 = open circle, 19 = solid circle (see ?points)
my.col[as.numeric(head(cust.df$email))]
[1] "green3" "green3" "green3" "green3" "black" "green3"
Now that we have a vector of colors, we can pass it as the col option in plot() to get a plot where customers with emails on file are plotted in green and customers without email addresses on file are plotted in black. We use a similar strategy for setting the point styles using the pch option, such that customers without email addresses have open circles instead of solid.
plot(cust.df$store.spend, cust.df$online.spend,
cex=0.7,
col=my.col[cust.df$email], pch=my.pch[cust.df$email],
main="Customers as of June 2014",
xlab="Prior 12 months in-store sales ($)",
ylab="Prior 12 months online sales ($)")
legend(x="topright", legend=paste("email on file:", levels(cust.df$email)), col=my.col, pch=my.pch)
When we created Fig. 4.1 earlier, we used an option col="blue" and it turned all of the points blue. This is because if the vector you pass for col is shorter than the length of x and y, then R recycles the values. Thus, if your col vector has one element, all the points will be that single color. Similarly, if you were to pass the vector c("black", "green3"), then plot would simply make alternating points black or green, which might not be what you want. Usually what you'll want is to create a vector that exactly matches the length of your data: start with a shorter vector as we did here, then index it so that you extract a value for each one of your data points. That can be difficult to get right in practice, so we encourage you to experiment with these examples until you understand how it works.
With raw values as plotted in the left panel of Fig. 4.4, it is still difficult to see whether there is a different relationship between in-store and online purchases for those with and without emails on file, because of the heavy skew in sales figures. A common solution for such scatterplots with skewed data is to plot the data on a logarithmic scale. This is easy to do with the log= argument of plot(): set log="x" to plot the x-axis on the log scale, log="y" for the y-axis, or log="xy" for both axes.
Since both values are skewed, we use log="xy" to apply the log scale to both variables. (We add 1 to each value so the many zeros remain plottable; log(0) is undefined.)
plot(cust.df$store.spend + 1, cust.df$online.spend + 1,
log="xy", cex=0.7,
col=my.col[cust.df$email], pch=my.pch[cust.df$email],
main="Customers as of June 2014",
xlab="Prior 12 months in-store sales ($)",
ylab="Prior 12 months online sales ($)" )
legend(x="topright", legend=paste("email on file:", levels(cust.df$email)),
col=my.col, pch=my.pch)
Thus, there is no evidence here to suggest that online sales have cannibalized in-store sales (a formal test of that would be complex, but the present data do not argue for such an effect in any obvious way).
We also see in Fig. 4.4 that customers with no email address on file show slightly lower online sales than those with addresses; there are somewhat more black circles in the lower half of the plot than the upper half. If we have been sending email promotions to customers, then this suggests that the promotions might be working. An experiment to confirm that hypothesis could be an appropriate next step.
Sometimes we want to visualize several relationships at once. For instance, suppose we wish to examine whether customers who live closer to stores spend more in store, and whether those who live further away spend more online. Those involve different spending variables and thus need separate plots. If we plot several such things individually, we end up with many individual charts. Luckily, R can produce a single graphic that consists of multiple plots. You do this by telling R that you want multiple plots in a single graphical object with the par(mfrow=…) command; then simply plot each one with plot() as usual.
par(mfrow=c(2, 2))
plot(cust.df$distance.to.store, cust.df$store.spend, main="store")
plot(cust.df$distance.to.store, cust.df$online.spend, main="online")
plot(cust.df$distance.to.store, cust.df$store.spend+1, log="xy",
main="store, log")
plot(cust.df$distance.to.store, cust.df$online.spend+1, log="xy",
main="online, log")
Above: We set the graphical parameter mfrow to c(2, 2), which instructs R to create a single graphic comprising a two-by-two arrangement of plots, filled from the first row, moving left to right.
Above: We see a possible negative relationship between spend and distance to store
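When you are done with a multi-panel figure, reset the layout so later plots fill the device again:
par(mfrow=c(1, 1)) # back to one plot per figure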
In our customer data, we have a number of variables that might be associated with each other; age, distance.to.store, and email all might be related to online and offline transactions and to spending. When you have several variables such as these, it is good practice to examine scatterplots between all pairs of variables before moving on to more complex analyses.
pairs(formula = ~ age + credit.score + email +
distance.to.store + online.visits + online.trans +
online.spend + store.trans + store.spend,
data=cust.df)
#same as
pairs(cust.df[ , c(2:10)]) #output not repeated; same as above
The car package adds several features to pairs()-style scatterplot matrices:
library(car)
scatterplotMatrix(formula = ~ age + credit.score + email + distance.to.store + online.visits + online.trans + online.spend + store.trans + store.spend, data = cust.df, diagonal="histogram")
Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE, spread = spread, :
could not fit smooth
(the warning is repeated once for each panel in which a smooth line could not be fit)
A limitation of Figs. 4.6 and 4.7 concerns the display of the email variable. email is a binary factor with values yes and no, and a scatterplot is not ideal to visualize a discrete variable. For such variables, the gpairs, or Generalized Pair Plots, package [41] provides a function called gpairs() that produces a scatterplot matrix that includes better visualizations for both discrete and continuous variables. For example, if we want to look more closely at the relationship between email and online.visits, online.trans and online.spend, we can use gpairs() as follows:
library(gpairs)
Error in library(gpairs) : there is no package called ‘gpairs’
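Once the package is available, a sketch of the call described above (the column selection is an assumption):
install.packages("gpairs") # install first, since library() failed above
library(gpairs)
gpairs(cust.df[ , c(2:10)])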
gpairs() produces scatterplots for pairs of continuous variables. However, for the factor email, gpairs includes a boxplot that compares the distribution of continuous variables for those who do and do not have email addresses in the data. A boxplot shows that the distributions of visits, transactions, and spending have longer tails among customers who have email addresses on file than those who don’t. We discuss boxplots in depth in Chap. 5, which focuses on comparisons between groups.
Because it selects individual plots to fit the data types, gpairs() is useful for marketing data sets that include continuous and discrete variables. Note that gpairs() relies on the data types in R to determine how to construct its plots; if we had stored a discrete variable such as email differently (for example, as a character vector rather than a factor), the resulting plots would differ.
It is often useful to assess the relationship between each pair of variables with a single number. One measure of the relationship between two variables is the covariance, which can be computed for any two variables using the cov() function:
cov(cust.df$age, cust.df$credit.score)
[1] 63.23443
However, it is difficult to interpret the magnitude of covariance because the scale depends on the variables involved. Covariance will be different if the variables are measured in cents versus dollars or in inches versus centimeters. So, it is helpful to scale the covariance by the standard deviation for each variable, which results in a standardized, rescaled correlation coefficient known as the Pearson product-moment correlation coefficient, often abbreviated as the symbol r.
cor(cust.df$age, cust.df$credit.score)
[1] 0.2545045
Computing r directly is identical to rescaling the covariance by the two standard deviations, but more convenient:
cov(cust.df$age, cust.df$credit.score) / (sd(cust.df$age)*sd(cust.df$credit.score))
[1] 0.2545045
What value of r signifies an important correlation between two variables in marketing? In engineering and physical sciences, physical measurements may demonstrate extremely high correlations; for instance, r between the lengths and weights of pieces of steel rod might be 0.9, 0.95, or even 0.999, depending on the uniformity of the rods and the precision of measurement. However, in social sciences such as marketing, we are concerned with human behavior, which is less consistent and more difficult to measure. This results in lower correlations, but they are still important.
We often use Cohen's Rules of Thumb, which come out of the psychology tradition [27]. Cohen proposed that for correlations between variables describing people, r = 0.1 should be considered a small or weak association, r = 0.3 might be considered to be medium in strength, and r = 0.5 or higher could be considered to be large or strong. Cohen's interpretation of a large effect was that such an association would be easily noticed by casual observers. A small effect would require careful measurement to detect yet might be important to our understanding and to statistical models.
Importantly, interpretation of r according to Cohen's rules of thumb depends on the assumption that the variables are normally distributed (also known as Gaussian), or approximately so.
Is it statistically significant?
cor.test(cust.df$age, cust.df$credit.score)
Pearson's product-moment correlation
data: cust.df$age and cust.df$credit.score
t = 8.3138, df = 998, p-value = 3.008e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1955974 0.3115816
sample estimates:
cor
0.2545045
Above: This tells us that r = 0.25 and the 95% confidence interval is r = 0.196 to 0.312. Because the confidence interval for r does not include 0 (and thus the p-value is p < 0.05), the association is statistically significant. Such a correlation, showing a medium-sized effect and statistical significance, probably should not be ignored in subsequent analyses.
For more than two variables, you can compute the correlations between all pairs at once; the result is a correlation matrix, built as sketched below.
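A sketch of building that matrix (the column selection matches the eight numeric columns used above and is an assumption):
corrmat <- cor(cust.df[ , c(2, 3, 5:10)]) # 8x8 matrix of pairwise r values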
corrmat[corrmat > .20]
[1] 1.0000000 0.2545045 0.2545045 1.0000000 1.0000000 1.0000000 0.9881555 0.9802214 0.9881555 1.0000000 0.9940306 0.9802214 0.9940306 1.0000000
[15] 1.0000000 0.8815039 0.8815039 1.0000000
library(corrplot) # for correlation plot, install if needed
library(gplots) # color interpolation, install if needed
corrplot.mixed(corr=cor(cust.df[ , c(2, 3, 5:12)], use="complete.obs"),
upper="ellipse", tl.pos="lt",
col = colorpanel(50, "red", "gray60", "blue4"))
Numeric values of r are shown in the lower triangle of the matrix. The upper triangle displays ellipses (because we used the argument upper="ellipse"). These ellipses are tighter, progressively closer to being lines, for larger values of r, and are rounder, more like circles, for r near zero. They are also shaded blue for positive correlations and red for negative, with corresponding positive or negative slopes.
This makes it easy to find the larger correlations in the data: age is positively correlated with credit.score; distance.to.store is negatively correlated with store.trans and store.spend; online.visits, online.trans, and online.spend are all strongly correlated with one another, as are store.trans and store.spend. In the survey items, sat.service is positively correlated with sat.selection.
Notice that comparing x and x^2 shows the lack of "linear" correlation even though we know the two are exactly related (see the sketch below).
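A minimal sketch of that demonstration (the seed and range are arbitrary choices):
set.seed(4) # arbitrary seed
x <- runif(1000, min=-10, max=10)
cor(x, x^2) # near zero, despite the exact quadratic relationship
plot(x, x^2)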
Above: Many relationships in marketing data are nonlinear. The number of trips a customer makes to a store may be inversely related to distance from the store.
cor(cust.df$distance.to.store, cust.df$store.spend)