marketDf <- read.csv("C:/Users/PC/Documents/marketing_data.csv")
str(marketDf)
## 'data.frame': 2240 obs. of 28 variables:
## $ ï..ID : int 1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 3 5 1 3 5 5 ...
## $ Marital_Status : Factor w/ 8 levels "Absurd","Alone",..: 3 5 4 6 5 5 4 6 4 4 ...
## $ Income : Factor w/ 1975 levels "","$1,730.00 ",..: 1891 1141 1447 395 122 1579 1332 790 1388 1388 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : Factor w/ 663 levels "1/1/13","1/1/14",..: 473 471 410 406 398 298 40 18 6 6 ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ AcceptedCmp3 : int 0 0 0 0 1 0 1 0 0 0 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Country : Factor w/ 8 levels "AUS","CA","GER",..: 7 2 8 1 7 7 3 7 8 4 ...
colSums(is.na(marketDf)) ## This result is Inaccurate and as we will see later some actual NA values are been masked.
## ï..ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 0 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Response Complain Country
## 0 0 0 0
In the following chunk we are going to: 1. Make a Copy 1. Drop the id column 1. Strip the Income Variable of the $ and reClassify it. 1. Reclassify Dt_Customer. 1. Create a new Variable Age from Year_Birth variable. 1. Drop Missing Values
## Make a copy
df <- marketDf
#1
## drop Id
# marketDf <- marketDf[,!(names(marketDf) %in% "ï..ID")]
df = df[-c(1)]
#2
## Reclassify Income and strip the "$" sign
df[, "Income"] <- as.numeric(df[, "Income"] %>%
gsub("[$,]", "",.))
#3
## Reclassify Dt_Customer to a date format
library(lubridate)
df$Dt_Customer <- as.Date(parse_date_time(df$Dt_Customer, "mdy"))
#4
## Create Age Variable from YOB
df <- df %>% mutate(df, Age = 2020 - df$Year_Birth)
str(df)
## 'data.frame': 2240 obs. of 28 variables:
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 3 5 1 3 5 5 ...
## $ Marital_Status : Factor w/ 8 levels "Absurd","Alone",..: 3 5 4 6 5 5 4 6 4 4 ...
## $ Income : num 84835 57091 67267 32474 21474 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : Date, format: "2014-06-16" "2014-06-15" ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ AcceptedCmp3 : int 0 0 0 0 1 0 1 0 0 0 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Country : Factor w/ 8 levels "AUS","CA","GER",..: 7 2 8 1 7 7 3 7 8 4 ...
## $ Age : num 50 59 62 53 31 62 66 53 66 66 ...
#5
## Check again for missing values
na_var <- colnames(df)[colSums(is.na(df)) > 0]
cat(na_var, "contains missing data")
## Income contains missing data
## Drop Missing Income Data
df <- df[!(is.na(df$Income)),]
Later on the Analysis Age had some spectacular outlier, the following distribution shows age going well ahead of 100 years. They are a small and quite outragoeus outlier, we will remove it.
df %>%
ggplot(aes(x = Age)) +
geom_histogram(color = "white") +
theme_minimal()
summary(df$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 43.00 50.00 51.18 61.00 127.00
df[(df$Age > 100),]$Age
## [1] 127 121 120
Now we can see the more clearly the outliers in Age especially @ ages127, 121, 120, let us remove them.
#Remove datapoints where Age>100
paste0("Number of Datapoints to remove for Age>100: ",count(df%>%filter(.,Age>100)))
## [1] "Number of Datapoints to remove for Age>100: 3"
df <- df[!(df$Age > 100),]
Phew, our data should be sparkling now. However, we can still make our data not just clean but also Neat, let us merge repetitive data together in Feature Engineering.
## Total Amount Spent
df<- df %>%
mutate(Total_amt_spent = MntWines+ MntFruits+ MntMeatProducts+ MntFishProducts+ MntSweetProducts+ MntGoldProds)
summary(df$Total_amt_spent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 69 397 607 1048 2525
## Total Campaigns Accepted
df <- df %>%
mutate(Num_campaigns_Accepted = round((AcceptedCmp1+ AcceptedCmp2 +AcceptedCmp3+ AcceptedCmp4+ AcceptedCmp5)/4))
summary(df$Num_campaigns_Accepted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02485 0.00000 1.00000
## Total Number of Purchases
df <- df %>%
mutate(Num_purchases= NumDealsPurchases+ NumCatalogPurchases+ NumStorePurchases+ NumWebPurchases)
summary(df$Num_purchases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.00 15.00 14.89 21.00 44.00
## Total Number of Dependents
df <- df %>%
mutate(Num_dependents = Kidhome + Teenhome)
summary(df$Num_dependents)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.9476 1.0000 3.0000
Now having Merged what we need, let us drop the parent variables.
## Drop Irrelevant Columns
drp <- c("Kidhome","Teenhome","Dt_Customer","Recency","MntWines" ,"MntFruits","MntMeatProducts","MntFishProducts","MntSweetProducts" ,"MntGoldProds","NumDealsPurchases","NumWebPurchases","NumCatalogPurchases","NumStorePurchases","NumWebVisitsMonth","AcceptedCmp3", "AcceptedCmp4" ,"AcceptedCmp5","AcceptedCmp1","AcceptedCmp2", "Year_Birth")
df <- df[,!(names(df) %in% drp)]
## Move Total_Amt_Spent to the tail
df <- df %>% select(-Total_amt_spent, Total_amt_spent)
str(df)
## 'data.frame': 2213 obs. of 11 variables:
## $ Education : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 3 5 1 3 5 5 ...
## $ Marital_Status : Factor w/ 8 levels "Absurd","Alone",..: 3 5 4 6 5 5 4 6 4 4 ...
## $ Income : num 84835 57091 67267 32474 21474 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Country : Factor w/ 8 levels "AUS","CA","GER",..: 7 2 8 1 7 7 3 7 8 4 ...
## $ Age : num 50 59 62 53 31 62 66 53 66 66 ...
## $ Num_campaigns_Accepted: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Num_purchases : int 15 18 11 4 8 17 28 7 20 20 ...
## $ Num_dependents : int 0 0 1 2 1 0 0 1 1 1 ...
## $ Total_amt_spent : int 1190 577 251 11 91 1192 1215 96 544 544 ...
As far as Cleaning the data, it is now tidy enough for Visualisations and managing
df %>%
ggplot(aes(x = Num_purchases)) +
geom_histogram(color = "white", bins = 15) +
labs(title = "Participants by Total Purchases",
x = "Total Amount Spent") +
theme_minimal()
library(scales)
#2
## for loop to reclassify factor Variables
for(var in names(df)){
if(length(unique(df[[var]])) <= 2){
## Assign the Boolean Values
df[,var] <- ifelse(df[,var]==0, "False", "True")
## Convert to Factpr
df[,var] <- as.factor(df[,var])
}
}
## NumericVars
factorVars <- select_if(df, is.factor)
numericVars <- select_if(df, is.numeric)
## function to tidy up to a summary table
tidy_eval_arrange <- function(.data, var) {
plotdata <- .data %>%
count({{var}}) %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct * 100), "%"))
print(plotdata)
}
tidy_eval_arrange(df, Response)
## Response n pct pctlabel
## 1 False 1880 0.8495255 85%
## 2 True 333 0.1504745 15%
## function to Plot Graphs for Factor Variables
auto_factor_plot <- function(.data) {
## Type of Variable
nm <- names(.data)
## loop to plot through
for (i in seq_along(nm)) {
plot <- .data %>%
ggplot(aes_string(x = nm[i])) +
geom_histogram(alpha = .5,fill = "black", stat = "count") +
theme_minimal() +
ggtitle ("Percentage of Market Data by Categorical Variables") +
coord_flip() +
theme_minimal()
print(plot)
# print(plot)
}
}
auto_factor_plot(factorVars)
# calulate the correlations
library(kableExtra)
df_num <- select_if(df, is.numeric)
crr <- cor(df_num, use="complete.obs")
kable(round(crr,2))
| Income | Age | Num_purchases | Num_dependents | Total_amt_spent | |
|---|---|---|---|---|---|
| Income | 1.00 | 0.16 | 0.57 | -0.29 | 0.67 |
| Age | 0.16 | 1.00 | 0.18 | 0.09 | 0.12 |
| Num_purchases | 0.57 | 0.18 | 1.00 | -0.25 | 0.76 |
| Num_dependents | -0.29 | 0.09 | -0.25 | 1.00 | -0.50 |
| Total_amt_spent | 0.67 | 0.12 | 0.76 | -0.50 | 1.00 |
One thing we can already see if that those with more Dependents will tend to spend less. Let us visualize this relationship using a p-value.
crr %>%
corrplot(method = "color", type = "lower", tl.col = "black", tl.srt = 45,
p.mat = cor.mtest(crr)$p,
insig = "p-value", sig.level = -1)
## Dummify Factor Variables
linR <- lm(Num_purchases ~ ., data = df)
summary(linR)
##
## Call:
## lm(formula = Num_purchases ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.857 -3.219 -0.455 3.382 16.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.565e+00 3.446e+00 1.325 0.1854
## EducationBasic -1.672e+00 7.490e-01 -2.233 0.0257 *
## EducationGraduation -5.190e-01 3.725e-01 -1.393 0.1637
## EducationMaster -7.313e-01 4.284e-01 -1.707 0.0879 .
## EducationPhD -5.458e-01 4.139e-01 -1.319 0.1874
## Marital_StatusAlone 1.220e+00 4.384e+00 0.278 0.7808
## Marital_StatusDivorced 1.756e-01 3.413e+00 0.051 0.9590
## Marital_StatusMarried 2.519e-01 3.403e+00 0.074 0.9410
## Marital_StatusSingle -3.536e-01 3.405e+00 -0.104 0.9173
## Marital_StatusTogether -1.018e-01 3.405e+00 -0.030 0.9761
## Marital_StatusWidow 2.983e-01 3.446e+00 0.087 0.9310
## Marital_StatusYOLO 6.056e+00 4.805e+00 1.260 0.2076
## Income 2.682e-05 5.546e-06 4.836 1.42e-06 ***
## ResponseTrue -3.244e-01 3.115e-01 -1.041 0.2979
## ComplainTrue 6.173e-01 1.080e+00 0.572 0.5675
## CountryCA 2.271e-01 4.945e-01 0.459 0.6461
## CountryGER -1.834e-02 5.975e-01 -0.031 0.9755
## CountryIND 2.981e-01 5.633e-01 0.529 0.5967
## CountryME 1.389e+00 2.801e+00 0.496 0.6201
## CountrySA 1.748e-01 4.760e-01 0.367 0.7134
## CountrySP -2.903e-04 4.229e-01 -0.001 0.9995
## CountryUS 1.275e+00 6.108e-01 2.087 0.0370 *
## Age 3.702e-02 9.320e-03 3.972 7.35e-05 ***
## Num_campaigns_AcceptedTrue -3.958e+00 7.072e-01 -5.598 2.45e-08 ***
## Num_dependents 1.518e+00 1.616e-01 9.392 < 2e-16 ***
## Total_amt_spent 1.006e-02 2.602e-04 38.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.786 on 2187 degrees of freedom
## Multiple R-squared: 0.615, Adjusted R-squared: 0.6106
## F-statistic: 139.7 on 25 and 2187 DF, p-value: < 2.2e-16
Using P-values, we can observe variables that hold Significant Relationships with the Outcome Variable. p-val < 0.05 indicates a significant relationship.
#Retrieve all the p-values
pvaluedf <- data.frame(summary(linR)$coefficients[,c('Pr(>|t|)', 'Estimate')])
colnames(pvaluedf) <- c('pvalue', 'coefficient')
pvaluedf$variables <- rownames(pvaluedf)
#Plot the variables and their significance
pvaluedf %>%
ggplot(aes(x = reorder(variables, pvalue),y = pvalue)) +
geom_col() +
geom_hline(yintercept = 0.05, color = "white") + ## level of Significance
geom_text(aes(x = 2, y = 0.10),label = "5%", color = "darkred") +
scale_x_discrete(name = "Predictors") +
ggtitle("X-Variables and their P-Values") +
coord_flip() +
theme_minimal()
Using this Logic, from our graph US downwards hold signifiant relationships with the Number of Purchase made in the market.
Notice: The inspiration for this notebook, is another notebook on this subject-matter, a far illustrative and masterful work by Oi Zhen, you may want to check it out here Kaggle Marketing Analytics.