We are going to perform exploratory data analysis to a marketing campaign dataset of 2,240 customers of a wholesaler company.
So we first have a glimpse at the dataset, and to the structure of our dataset -using the R STR command and checking for missing values:
tabla1 <- as.data.frame(head(marketing_data, 10))
rmarkdown::paged_table(tabla1)
tabla2 <- as.data.frame(str(marketing_data))
## spec_tbl_df[,28] [2,240 x 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:2240] 1826 1 10476 1386 5371 ...
## $ Year_Birth : num [1:2240] 1970 1961 1958 1967 1989 ...
## $ Education : chr [1:2240] "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr [1:2240] "Divorced" "Single" "Married" "Together" ...
## $ Income : num [1:2240] 84835 57091 67267 32474 21474 ...
## $ Kidhome : num [1:2240] 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : num [1:2240] 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : Date[1:2240], format: "2014-06-16" "2014-06-15" ...
## $ Recency : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : num [1:2240] 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : num [1:2240] 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : num [1:2240] 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : num [1:2240] 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : num [1:2240] 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : num [1:2240] 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : num [1:2240] 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : num [1:2240] 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: num [1:2240] 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : num [1:2240] 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : num [1:2240] 1 5 2 7 7 2 6 5 4 4 ...
## $ AcceptedCmp3 : num [1:2240] 0 0 0 0 1 0 1 0 0 0 ...
## $ AcceptedCmp4 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
## $ Response : num [1:2240] 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ Country : chr [1:2240] "Spain" "Canada" "USA" "Australia" ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Year_Birth = col_double(),
## .. Education = col_character(),
## .. Marital_Status = col_character(),
## .. Income = col_double(),
## .. Kidhome = col_double(),
## .. Teenhome = col_double(),
## .. Dt_Customer = col_date(format = "%Y-%m-%d"),
## .. Recency = col_double(),
## .. MntWines = col_double(),
## .. MntFruits = col_double(),
## .. MntMeatProducts = col_double(),
## .. MntFishProducts = col_double(),
## .. MntSweetProducts = col_double(),
## .. MntGoldProds = col_double(),
## .. NumDealsPurchases = col_double(),
## .. NumWebPurchases = col_double(),
## .. NumCatalogPurchases = col_double(),
## .. NumStorePurchases = col_double(),
## .. NumWebVisitsMonth = col_double(),
## .. AcceptedCmp3 = col_double(),
## .. AcceptedCmp4 = col_double(),
## .. AcceptedCmp5 = col_double(),
## .. AcceptedCmp1 = col_double(),
## .. AcceptedCmp2 = col_double(),
## .. Response = col_double(),
## .. Complain = col_double(),
## .. Country = col_character()
## .. )
rmarkdown::paged_table(tabla2)
Each record represents one customer and contains fields that can be classified as follows:
ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise
Country: Customer’s location
MntWines: Amount spent on wine in the last 2 years
MntFruits: Amount spent on fruits in the last 2 years
MntMeatProducts: Amount spent on meat in the last 2 years
MntFishProducts: Amount spent on fish in the last 2 years
MntSweetProducts: Amount spent on sweets in the last 2 years
MntGoldProds: Amount spent on gold in the last 2 years
NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month
NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
sapply(marketing_data, function(x) sum(is.na(x) ) )
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 24 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Response Complain Country
## 0 0 0 0
# highlight <- color_tile("yellow","yellow")
# formattable(tabla3, list( area(col = 5, row = 1 ) ~ highlight ))
And so firstly we see that:
most of the variables in the dataset are numeric (integer type)
the only variable with missing values is INCOME with 24 NA values.
the MARITAL_STATUS, EDUCATION & COUNTYR variables are of type character but the first 2 can easily be considered FACTOR variables.
summary(marketing_data)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median : 51382 Median :0.0000 Median :0.0000 Median :2013-07-08
## Mean : 52247 Mean :0.4442 Mean :0.5062 Mean :2013-07-10
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :666666 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Response Complain Country
## Min. :0.00000 Min. :0.0000 Min. :0.000000 Length:2240
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000000 Class :character
## Median :0.00000 Median :0.0000 Median :0.000000 Mode :character
## Mean :0.01339 Mean :0.1491 Mean :0.009375
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.0000 Max. :1.000000
##
INCOME has a couple of abnormally low values.
YEAR_BIRTH has values <1900 –that is certainly incorrect.
Let’s do an initial plot to see the distribution of each variable:
Obviously, this plot is not a good one. There are different orders of magnitude between several variables.
But it certainly shows that INCOME is the most disperse variable
There is a high value which is an extreme outlier
The value is 666666
We remove it and draw the histogram again:
marketing_data <- marketing_data %>% filter(Income < 200000)
hist(marketing_data$Income, col = 'blue', main = "Histogram of INCOME variable", xlab = "INCOME in AUD")
hist(marketing_data$Year_Birth, col = 'red', main = "Histogram of YEAR of BIRTH variable", xlab = "Decade of Birth")
As for YEAR of BIRTH there are 3 outliers that are wrong -since there can’t be customers aged >100 years old.
So we plot without those:
INCOME has a non-Normal distribution –as expected
We reinstate the 3 rows that we took out -we cannot impute using AVERAGE (since the distribution of this variable is not normal) -we will impute using the mode:
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))] }
getmode(marketing_data$Year_Birth)
## [1] 1976
### The mode is '1976'
marketing_data<- marketing_data %>% mutate(Year_Birth = replace(Year_Birth, Year_Birth <1901, 1976))
Now we look at the distribution of the Total Nr. of Purchases and TOTAL PURCHASED AMOUNT variables:
### DISTRIBUTION of 'TOTAL NR of PURCHASES' variables ####
Nr_ventas <- melt(marketing_data[ , c(16:19)])
ggplot(Nr_ventas,aes(x = value)) + facet_wrap(~variable,scales = "free_x") + geom_histogram() + ggtitle("HISTOGRAMS of variables containing the TOTAL # of PURCHASES")
### DISTRIBUTION of 'AMOUNT' variables #####
monto_ventas <- melt(marketing_data[ , c(10:15)])
ggplot(monto_ventas,aes(x = value)) + facet_wrap(~variable,scales = "free_x") + geom_histogram() + ggtitle("DISTRIBUTION - 'TOTAL AMOUNT SPENT per PURCHASE'")
tabla5 <- as.data.frame(colSums(marketing_data[ , 10:15]))
highlight <- color_tile("yellow","yellow")
formattable(tabla5, list(
area(col = 1, row = 1 ) ~ highlight,
area(col = 1, row = 3 ) ~ highlight
))
| colSums(marketing_data[, 10:15]) | |
|---|---|
| MntWines | 676074 |
| MntFruits | 58391 |
| MntMeatProducts | 370045 |
| MntFishProducts | 83397 |
| MntSweetProducts | 59895 |
| MntGoldProds | 97415 |
MEAT and WINES are the best performers
The last 3 campaigns were the most successful ones.
marketing_play2 <- marketing_data %>% select(ID,Education,Marital_Status, Country)
### If we want to display all 3 variables at once, we have to modify the dataset from "wide" to "long"
marketing_long <- gather(marketing_play2, key="measure", value="value", -ID)
head(marketing_long,10)
## # A tibble: 10 x 3
## ID measure value
## <dbl> <chr> <chr>
## 1 1826 Education Graduation
## 2 1 Education Graduation
## 3 10476 Education Graduation
## 4 1386 Education Graduation
## 5 5371 Education Graduation
## 6 7348 Education PhD
## 7 4073 Education 2n Cycle
## 8 1991 Education Graduation
## 9 4047 Education PhD
## 10 9477 Education PhD
ggplot(marketing_long, aes(x = value, fill = ID)) + geom_bar(fill='green') + facet_wrap(~ measure, scales = "free_x") + theme(axis.text.x = element_text(angle = 90, face = "bold"))
For the INCOME variable, we need to plot differently since this is a integer variable (not a categorial one):
## [1] "The average income of the customers is AUD"
## [2] "51969.8613995485"
## [1] "We can't use the mean because the distribution of income is not normal"
Is there a correlation between the customers’ INCOME and AMOUNT PURCHASED?
ggplot(marketing_data,aes(x = Income, y = MntWines)) + geom_point()
ggplot(marketing_data,aes(x = Income, y = MntFruits)) + geom_point()
ggplot(marketing_data,aes(x = Income, y = MntMeatProducts)) + geom_point()
ggplot(marketing_data,aes(x = Income, y = MntGoldProds)) + geom_point()
ggplot(marketing_data,aes(x = Income, y = MntSweetProducts)) + geom_point()
There seems to be positive correlation between INCOME and AMOUNT PURCHASED
Another way of visualizing it -there appears to be an exponential correlation:
marketing_play <- marketing_data %>% filter(Income < 100000)
ggplot(data = marketing_play, mapping = aes(x = Income, y = MntWines)) + geom_boxplot(mapping = aes(group = cut_width(Income, 2000)))
Fitting an exponential model between INCOME and any one of the AMOUNT PURCHASED channels produces a not very solid model:
##
## Call:
## lm(formula = log(MntWines) ~ Income, data = marketing_play2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6505 -0.6433 0.0968 0.7998 3.0364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.18109733 0.06655073 17.75 <0.0000000000000002 ***
## Income 0.00006662 0.00000118 56.46 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.186 on 2200 degrees of freedom
## Multiple R-squared: 0.5916, Adjusted R-squared: 0.5915
## F-statistic: 3187 on 1 and 2200 DF, p-value: < 0.00000000000000022
With only 59% of explained of values explained by it.