library(UsingR)
library(ggplot2)
library(skimr)
library(dplyr)
library(plotly)
library(vcd)
library(sampling)
options(scipen= 200)


a <- read.csv("https://raw.githubusercontent.com/LiaoliaoQJ/mkt/main/marketing_campaign%202.csv", header = TRUE, sep = "\t")

# Prepare dataset: keep the five standard marital-status categories
b <- subset(a, Marital_Status %in% c("Divorced", "Married", "Single", "Together", "Widow"))

b <- data.frame(b)
years <- table(b$Year_Birth)
# Age as of 2021, the reference year used in this report
actual.years <- 2021 - b$Year_Birth
b['age'] <- actual.years

# Children at home = young kids + teenagers
kh <- b$Kidhome
th <- b$Teenhome
childhome <- kh + th
b$childhome <- childhome

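# Systematic sampling: every k-th row from a random start
set.seed(101)
N <- nrow(b)
k <- ceiling(N/50)
num.sys <- sample(k, 1)
s <- seq(num.sys, by = k, length = 50)
s <- s[s <= N]  # drop indices past the last row (they would yield NA rows)
system.sample.1 <- b[s,]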

# Stratified sampling with proportional allocation by Education
set.seed(10)
# strata() expects the data sorted by the stratification variable, with the
# sizes given in the order the strata appear in the data
b.sorted <- b[order(b$Education),]
b.22 <- table(b.sorted$Education)
edu.naes <- names(b.22)
strata.size <- round(100*b.22/sum(b.22))
strat_sample.num <- strata(b.sorted, stratanames = c("Education"), size = strata.size,
       method = "srswor", description = T)
## Stratum 1 
## 
## Population total and number of selected units: 203 9 
## Stratum 2 
## 
## Population total and number of selected units: 54 2 
## Stratum 3 
## 
## Population total and number of selected units: 1125 50 
## Stratum 4 
## 
## Population total and number of selected units: 368 16 
## Stratum 5 
## 
## Population total and number of selected units: 483 22 
## Number of strata  5 
## Total number of selected units 99
strat_sample.all <- getdata(b.sorted, strat_sample.num)

Data Set Overview

Customer Personality and Purchasing Pattern Analysis is a detailed analysis of a company’s ideal customers. The data set was retrieved from Kaggle and contains information collected from 2,240 customers. It includes each customer’s age, education level, marital status, yearly household income, and number of children in the household, as well as the amount spent over the last two years on different product categories: wines, fruits, meat, fish, sweets, and gold.

Goal of Analysis

This analysis helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors, and concerns of different customers. It can also help a business adjust its products to target customers in different segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can identify which customer segment is most likely to buy the product and market it only to that segment.

Data Preparation

The data set holds a lot of valuable information, and most of it can be used directly. The exception is age, since the data records only each customer’s birth year. We therefore subtract the birth year from 2021 to get each customer’s current age. In addition, we created new data frames to store this information. We grouped customers aged 18 to 25 as Young Adults, customers aged 25 to 55 as Senior Adults, and customers older than 55 as Elderly.
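The age grouping described above can be sketched with base R’s `cut()`. This is a minimal illustration on hypothetical birth years, not the report’s original code. The handling of the boundary ages (25 counted as Young Adults, 55 counted as Senior Adults) is an assumption, since the text does not say which group they fall into.

```r
# Hypothetical birth years; the report derives these from b$Year_Birth
year_birth <- c(1996, 1990, 1975, 1966, 1950, 1940)
age <- 2021 - year_birth

# Assumed boundaries: (17, 25] young, (25, 55] senior adult, (55, Inf) elderly
age_group <- cut(age,
                 breaks = c(17, 25, 55, Inf),
                 labels = c("Young Adults", "Senior Adults", "Elderly"))

# Percentage of customers in each group, as plotted in a pie chart
age_share <- round(100 * prop.table(table(age_group)), 2)
print(age_share)
```

Changing `right = FALSE` in `cut()` would move the boundary ages into the older group instead.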

Age Group Distribution

The pie chart shows the customer age distribution. The majority of our customers are senior adults between 25 and 55 years old, at 59.3% of the total population. Elderly customers account for 40.6% of the population. The number of young adults is very small, only 0.0893%. The stratified sample shows the same pattern.

Marital Status Distribution

Married customers form the largest group, at 38.7% of the total population, followed by customers who live together with a partner and single customers. Divorced and widowed customers are fewer than the other three groups, at only 10.4% and 3.45% of the population, respectively.

The Age Distribution of Each Marital Status

The Single group has the widest age range, while the Widow group has the narrowest. This means single customers are more diverse in age. In contrast, widowed customers’ ages are concentrated between 50 and 70 years old.
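The spread comparison above can be checked numerically rather than read off a plot. The sketch below uses a small hypothetical data frame whose column names mirror the report’s `Marital_Status` and `age`. The values are invented for illustration only.

```r
# Hypothetical customers; the report would use b$Marital_Status and b$age
customers <- data.frame(
  Marital_Status = c("Single", "Single", "Single", "Widow", "Widow", "Married", "Married"),
  age            = c(25, 48, 80, 55, 68, 35, 60)
)

# Width of the age range (max - min) within each marital status
age_span <- tapply(customers$age, customers$Marital_Status,
                   function(x) diff(range(x)))

# Interquartile range is less sensitive to a single extreme age
age_iqr <- tapply(customers$age, customers$Marital_Status, IQR)

print(sort(age_span, decreasing = TRUE))
print(age_iqr)
```

Reporting the interquartile range alongside the full range guards against one unusually old or young customer driving the conclusion.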

Central Limit Theorem

The Constitution of Customers’ Age

Next, we test the Central Limit Theorem on our dataset using the means of random samples of customers’ ages. The histogram of the age distribution shows that the population is right-skewed. However, the sample means are still approximately normally distributed.

Central Limit Theorem of Single People’s Age

We then repeat the test for single customers only. We extract the single population and draw random samples of their ages with sample sizes of 10, 20, 30, and 40, computing the mean of each sample. The result aligns with the theorem: the sample means in all four graphs are approximately normally distributed, and their standard deviation shrinks as the sample size grows.

## Sample Size =  10  Mean =  49.6717  SD =  4.015649
## Sample Size =  20  Mean =  49.5285  SD =  2.757888
## Sample Size =  30  Mean =  49.45733  SD =  2.199519

## Sample Size =  40  Mean =  49.57073  SD =  1.942451
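The sampling experiment behind these figures can be sketched as follows. This is not the report’s original code. The population is a hypothetical right-skewed age vector generated with `rgamma()`, and the helper `sample_means()` and the 1,000 repetitions are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical right-skewed "age" population standing in for the real ages
population <- 18 + rgamma(5000, shape = 2, scale = 8)

# Draw `reps` random samples of size n and return the mean of each sample
sample_means <- function(pop, n, reps = 1000) {
  replicate(reps, mean(sample(pop, n)))
}

means_by_size <- list()
for (n in c(10, 20, 30, 40)) {
  m <- sample_means(population, n)
  means_by_size[[as.character(n)]] <- m
  cat("Sample Size = ", n, " Mean = ", mean(m), " SD = ", sd(m), "\n")
}

# hist(means_by_size[["40"]]) would show the roughly bell-shaped distribution
```

The mean of the sample means stays close to the population mean at every sample size, while their standard deviation falls roughly in proportion to one over the square root of the sample size.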

Education Level

The pie chart below shows the education level of the people who took the survey. Most respondents have a Graduation education level. Only 2.42% have a Basic education level.

# Share of customers at each education level, used as legend labels
edutable <- table(b$Education)
label_education <- paste('(', round(prop.table(edutable)*100, 2), '%)', sep = '')

# Take the level names from the table so labels and percentages stay aligned
Edu_type <- names(edutable)
label_Edu <- paste(Edu_type, label_education, sep = ' ')


ggplot(data = b, mapping = aes(x = Education, fill = Education)) + 
  geom_bar(width = 1) + 
  coord_polar(theta = "x")+
  labs(x = '', y = '', title = 'Education Level')+
  scale_fill_discrete(labels = label_Edu)

Income Distribution

Below is the distribution of the Income attribute. As expected with incomes, the distribution is skewed to the right. The mean yearly household income is $52,243.98. The maximum annual income is $666,666 and the minimum is $1,730, so incomes span a range of $664,936.

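bIncome <- na.omit(b$Income)

# Density-scaled histogram of income with a kernel density overlay
# (after_stat() and linewidth need a recent ggplot2 release)
p <- ggplot(b, aes(x=Income)) + 
  geom_histogram(aes(y=after_stat(density)), bins=30, colour="black", fill="#E2E2E2")+
  geom_density(alpha=.5, fill="#FC6B76")

# Dashed line marks the mean income (missing values excluded)
p+geom_vline(xintercept=mean(bIncome),
             linetype="dashed",color= "purple",linewidth = 1)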
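The summary figures quoted for the income distribution can be computed in a few lines. The sketch below uses a hypothetical income vector with one missing value. Only its minimum and maximum are taken from the text above, and the remaining values are invented.

```r
# Hypothetical incomes; NA mimics the missing values present in b$Income
income <- c(1730, 35000, 52000, NA, 61000, 666666)
income_clean <- income[!is.na(income)]

income_stats <- c(
  mean   = mean(income_clean),
  median = median(income_clean),
  min    = min(income_clean),
  max    = max(income_clean),
  range  = diff(range(income_clean))  # max - min
)
print(income_stats)

# A mean well above the median is a quick indicator of right skew
is_right_skewed <- income_stats[["mean"]] > income_stats[["median"]]
```

Because a single very large income pulls the mean upward, reporting the median alongside it gives a more robust picture of a typical household.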

The Relationship between Income and Education Level

Below is the income distribution of customers at each education level. The graph indicates that income tends to rise with education level. Customers with Ph.D. degrees have the highest average annual income, while customers with a Basic education earn the least. All reported incomes fall between $1,730 and $666,666.

ggplot(b, aes(x=Education, y=Income, color=Education, shape = Education )) + 
  geom_boxplot(alpha = 0)+
  geom_jitter(position=position_jitter(0.2))

The Relationship among Marital Status, Education Level and Income

As the figure below shows, married customers and customers living with a partner who hold higher degrees have the most purchasing power, because they have the highest incomes.

In contrast, single people with low education have the lowest income.

The Relationship of Recency, Marital Status and Education

This animated chart shows the relationship among four variables: education level, marital status, number of children in the household, and recency, which we use as a measure of how often customers visit the store. Surprisingly, the more children a household has, the less often it visits the store. (This pattern continues in how much they spend, as discussed in the Number of Children in the Customer’s Household section.) Customers with a Basic education have the lowest recency.

Number of Children in the Customer’s Household

We calculated the number of children in each customer’s household by adding the number of kids and the number of teenagers at home. The bar plot below shows that 28.48% of customers in the dataset have no children at home. About 50.38% of customers have one child, and 18.76% have two. A small group, about 2.37%, has three children.

b['Children']=b['Kidhome']+b['Teenhome']
table_cd=sort(table(b$Children),decreasing=T)
per_cd = as.vector(prop.table(table_cd)*100)
per_cd <- round(per_cd,2)
labels_cd = names(table_cd)  # labels follow the sorted table order
df_cd = data.frame(labels_cd,per_cd)
df_cd
##   labels_cd per_cd
## 1         1  50.38
## 2         0  28.48
## 3         2  18.76
## 4         3   2.37
p3 = plot_ly(df_cd, y=~labels_cd,x=~per_cd,type='bar',width=500,height=300,orientation='h',
               marker= list(color=c('#5DAAFC','#A8E88C','#FEE07A','#FE9C7A')))
p3 = p3 %>% layout(title="Percentage of Number of Children in customer's home",xaxis = list(title='Percentage'),
                       yaxis = list(title='Number of Children',categoryorder = "array",
                                    categoryarray = rev(labels_cd)))
p3

The Relationship of Income Level and Food Category Purchase

lessthan_25k <- b %>% filter(Income <= 25000) %>% select(MntWines, MntFruits, MntMeatProducts,MntSweetProducts,MntFishProducts,MntGoldProds)
avg_lessthan_25k_wine = mean(lessthan_25k$MntWines)
avg_lessthan_25k_friuts = mean(lessthan_25k$MntFruits)
avg_lessthan_25k_meat = mean(lessthan_25k$MntMeatProducts)
avg_lessthan_25k_fish = mean(lessthan_25k$MntFishProducts)
avg_lessthan_25k_sweet = mean(lessthan_25k$MntSweetProducts)
avg_lessthan_25k_gold = mean(lessthan_25k$MntGoldProds)

between_25k_50k = b %>% filter(25000 < Income & Income <= 50000) %>% select(MntWines,MntFruits, MntMeatProducts,MntSweetProducts,MntFishProducts,MntGoldProds)
avg_bw_25k_50k_wine = mean(between_25k_50k$MntWines)
avg_bw_25k_50k_friuts = mean(between_25k_50k$MntFruits)
avg_bw_25k_50k_meat = mean(between_25k_50k$MntMeatProducts)
avg_bw_25k_50k_fish = mean(between_25k_50k$MntFishProducts)
avg_bw_25k_50k_sweet = mean(between_25k_50k$MntSweetProducts)
avg_bw_25k_50k_gold = mean(between_25k_50k$MntGoldProds)



between_50k_75k = b %>% filter(50000 < Income & Income <= 75000) %>% select(MntWines,MntFruits, MntMeatProducts,MntSweetProducts,MntFishProducts,MntGoldProds)
avg_bw_50k_75k_wine = mean(between_50k_75k$MntWines)
avg_bw_50k_75k_friuts = mean(between_50k_75k$MntFruits)
avg_bw_50k_75k_meat = mean(between_50k_75k$MntMeatProducts)
avg_bw_50k_75k_fish = mean(between_50k_75k$MntFishProducts)
avg_bw_50k_75k_sweet = mean(between_50k_75k$MntSweetProducts)
avg_bw_50k_75k_gold = mean(between_50k_75k$MntGoldProds)



between_75k_100k = b %>% filter(75000 < Income & Income <= 100000) %>% select(MntWines,MntFruits, MntMeatProducts,MntSweetProducts,MntFishProducts,MntGoldProds)
avg_bw_75k_100k_wine = mean(between_75k_100k$MntWines)
avg_bw_75k_100k_friuts = mean(between_75k_100k$MntFruits)
avg_bw_75k_100k_meat = mean(between_75k_100k$MntMeatProducts)
avg_bw_75k_100k_fish = mean(between_75k_100k$MntFishProducts)
avg_bw_75k_100k_sweet = mean(between_75k_100k$MntSweetProducts)
avg_bw_75k_100k_gold = mean(between_75k_100k$MntGoldProds)



avg_wine=c(avg_lessthan_25k_wine,avg_bw_25k_50k_wine ,avg_bw_50k_75k_wine, avg_bw_75k_100k_wine)
avg_friuts=c(avg_lessthan_25k_friuts,avg_bw_25k_50k_friuts,avg_bw_50k_75k_friuts, avg_bw_75k_100k_friuts)
avg_meat=c(avg_lessthan_25k_meat ,avg_bw_25k_50k_meat ,avg_bw_50k_75k_meat, avg_bw_75k_100k_meat)
avg_fish=c(avg_lessthan_25k_fish,avg_bw_25k_50k_fish,avg_bw_50k_75k_fish, avg_bw_75k_100k_fish)
avg_sweet=c(avg_lessthan_25k_sweet,avg_bw_25k_50k_sweet,avg_bw_50k_75k_sweet, avg_bw_75k_100k_sweet)
avg_gold=c(avg_lessthan_25k_gold,avg_bw_25k_50k_gold ,avg_bw_50k_75k_gold, avg_bw_75k_100k_gold)
labels_avg = c('0-25k','25k-50k','50k-75k','75k-100k')
df_prods = data.frame(labels_avg,avg_wine,avg_friuts,avg_meat,avg_fish,avg_sweet,avg_gold)
df_prods
##   labels_avg  avg_wine avg_friuts  avg_meat  avg_fish avg_sweet avg_gold
## 1      0-25k  11.10744   6.095041  21.69835  7.892562  6.326446 18.85537
## 2    25k-50k  83.97174   7.087224  37.84275 11.678133  7.211302 22.38084
## 3    50k-75k 461.46038  35.605031 205.26415 48.704403 35.467925 61.22893
## 4   75k-100k 676.49855  64.478261 476.09565 94.023188 68.107246 71.83188
p5 = plot_ly(df_prods, x = ~labels_avg, y = ~avg_wine, type = 'bar', name = 'Wines',width=600,height=300)
p5 = p5 %>% add_trace(y = ~avg_friuts, name = 'Fruits')
p5 = p5 %>% add_trace(y = ~avg_meat, name = 'Meat')
p5 = p5 %>% add_trace(y = ~avg_fish, name = 'Fish')
p5 = p5 %>% add_trace(y = ~avg_sweet, name = 'Sweets')
p5 = p5 %>% add_trace(y = ~avg_gold, name = 'Gold')
# barmode is a layout-level option, not an x-axis one; brackets shown in ascending order
p5 = p5 %>% layout(barmode = 'group',
                   yaxis = list(title = 'Average'),
                   xaxis = list(title = 'Income', categoryorder = 'array',
                                categoryarray = labels_avg))


p5

Findings

There is a clear positive relationship between a customer’s income and the amount spent on every product category. Customers with higher incomes tend to spend more on wine and meat in particular. In the highest income group, spending on wine and meat exceeds the combined spending on all other product categories. Wine and meat are examples of normal goods, whose demand increases as people’s incomes and purchasing power rise. Wine spending varies the most across income groups.
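The per-bracket averages behind these findings can also be computed in one step instead of one variable per bracket and product. The sketch below is an alternative formulation using base R’s `cut()` and `aggregate()` on a hypothetical data frame. The column names follow the dataset, but the values are invented.

```r
# Hypothetical purchases; the report would use the data frame b
purchases <- data.frame(
  Income          = c(12000, 24000, 30000, 48000, 60000, 70000, 80000, 95000),
  MntWines        = c(5, 15, 60, 100, 400, 500, 650, 700),
  MntMeatProducts = c(20, 24, 30, 46, 190, 220, 450, 500)
)

# Right-closed intervals match the filters used above, e.g. 25000 < Income <= 50000
purchases$bracket <- cut(purchases$Income,
                         breaks = c(0, 25000, 50000, 75000, 100000),
                         labels = c("0-25k", "25k-50k", "50k-75k", "75k-100k"))

# Mean spend per product within each income bracket
bracket_means <- aggregate(cbind(MntWines, MntMeatProducts) ~ bracket,
                           data = purchases, FUN = mean)
print(bracket_means)
```

With this layout, adding another product column or another bracket means editing one vector rather than adding a new block of variables.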

The Relationship of Education Level and Food Category Purchase

##    labels_ed  avg_ed_mw avg_ed_mf avg_ed_mp avg_ed_fp avg_ed_sp avg_ed_gp
## 1 Graduation 284.351111  30.73244 179.67556  43.02133  31.35644  50.70400
## 2        PhD 405.643892  20.16149 169.42650  26.86957  20.33126  32.29607
## 3     Master 332.782609  21.57609 162.77989  31.73370  21.27717  40.06250
## 4   2n Cycle 198.182266  28.95567 141.25616  47.48276  34.25123  46.39901
## 5      Basic   7.240741  11.11111  11.44444  17.05556  12.11111  22.83333

Findings

From this graph, we found that customers with a Ph.D. degree tend to buy more wine than customers at other education levels. They spend relatively little on all other products except wine and meat. The same buying pattern appears among customers with master’s and bachelor’s degrees. Another important finding is that customers with a Basic education spend very little on every kind of product.
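The code that produced the education table is not shown in this report, so the following is only a sketch of one way to build a comparable table and to pick out each group’s top product. The data frame is hypothetical, and the names `avg_by_edu` and `top_product` are introduced here for illustration.

```r
# Hypothetical spending records; column names follow the dataset
spend <- data.frame(
  Education       = c("PhD", "PhD", "Graduation", "Graduation", "Basic", "Basic"),
  MntWines        = c(400, 410, 280, 290, 5, 9),
  MntMeatProducts = c(170, 168, 180, 178, 11, 12),
  MntFishProducts = c(27, 26, 43, 44, 17, 18)
)
product_cols <- c("MntWines", "MntMeatProducts", "MntFishProducts")

# Average spend per product for each education level
avg_by_edu <- aggregate(spend[product_cols],
                        by = list(Education = spend$Education), FUN = mean)

# Product with the highest average spend in each education level
top_idx <- max.col(as.matrix(avg_by_edu[product_cols]), ties.method = "first")
avg_by_edu$top_product <- product_cols[top_idx]
print(avg_by_edu)
```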

The Relationship of Number of Children and Food Category Purchase

##   labels_cd avg_cd_mw avg_cd_mf avg_cd_mp avg_cd_fp avg_cd_sp avg_cd_gp
## 1         1  267.1760 19.392889  98.78133 26.688000 20.331556  40.79111
## 2         0  487.7201 52.256289 372.79874 76.141509 53.132075  63.70912
## 3         2  140.9570  7.904535  51.41289 11.431981  8.393795  25.39618
## 4         3  171.3774  6.905660  64.01887  7.075472  6.622642  18.60377

Findings

From this bar plot, we noticed that customers with no children tend to spend more on wines, sweets, and non-vegetarian products such as meat and fish than customers with children. Moreover, customers with two or three children spend less on products such as fruits, fish, and sweets. We can also see a negative relationship between the number of children and the amount spent on sweets and fish.
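The negative relationship described above can be quantified with a rank correlation, which suits a small integer variable such as the number of children. The sketch below uses hypothetical household records. Using Spearman’s coefficient is a choice made here, not something the report specifies.

```r
# Hypothetical households; the report would use b$Children and the Mnt* columns
households <- data.frame(
  Children         = c(0, 0, 1, 1, 2, 2, 3, 3),
  MntSweetProducts = c(60, 50, 22, 18, 9, 8, 7, 5),
  MntFishProducts  = c(80, 70, 28, 25, 12, 11, 8, 6)
)

# Spearman's rho measures monotonic association and tolerates ties in Children
rho_sweets <- cor(households$Children, households$MntSweetProducts, method = "spearman")
rho_fish   <- cor(households$Children, households$MntFishProducts,  method = "spearman")

cat("Sweets rho =", rho_sweets, " Fish rho =", rho_fish, "\n")
```

A value close to -1 indicates that spending falls steadily as the number of children rises, while a value near 0 would indicate no monotonic pattern.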

Summing up

  1. Most of our customers are aged above 35.
  2. Customers with a Ph.D. degree spend more than customers at other education levels. Ph.D. customers make up about 22% of the population. They prefer products such as meat and wine.
  3. Customers with no children spend heavily on products such as wine and meat. They make up 28% of customers.
  4. The most preferred products are wine and meat. This suggests that we need to improve the quality and quantity of the remaining products.
  5. There is a strong correlation between income level and customer spending.
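The strength of the income and spending relationship in the last point can be expressed as a single coefficient. The sketch below sums the product columns into a total spend and correlates it with income. The data and the name `total_spend` are hypothetical, and one missing income is included to show how it is handled.

```r
# Hypothetical customers; NA mimics a missing Income value
customers <- data.frame(
  Income          = c(20000, 40000, NA, 60000, 80000),
  MntWines        = c(10, 80, 200, 450, 680),
  MntMeatProducts = c(20, 40, 100, 200, 470)
)

# Total spend across the product columns
customers$total_spend <- rowSums(customers[c("MntWines", "MntMeatProducts")])

# Pearson correlation, ignoring rows where Income is missing
r_income_spend <- cor(customers$Income, customers$total_spend, use = "complete.obs")
cat("Correlation between income and total spend:", round(r_income_spend, 3), "\n")
```

Extreme incomes such as the $666,666 maximum can distort a Pearson coefficient, so a rank-based coefficient or a trimmed data set is worth checking as well.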