---
title: "Market Analysis: Household Income and Spending P1"
output: html_notebook
---

OBJECTIVE 
This study examines the factors related to household income and spending. Does education affect income? If so, which education level earns more? Is it better to have a higher education?

On spending, which marital status spends more? What is the relationship between income and spending, and between spending and marital status?

These are the questions we will try to answer, along with whether the answers can help us decide which spending market to target.

We will use data from a food and grocery company that serves several countries and see how its customer base responds in terms of spending.

This is a two-part analysis: 1. income vs. spending, and 2. spending habits (what participants spend the most on).

DATA STRUCTURE AND RETRIEVAL
The data for this study came from Kaggle, an online repository and community site for data science enthusiasts. https://www.kaggle.com/jackdaoud/marketing-data

```{r}
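# dplyr provides %>%, filter(), group_by() and summarise(); ggplot2 provides the plots
library(dplyr)
library(ggplot2)

household <- read.csv("household.csv")
head(household)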
```
DEFINITION

ID        : Customer id

Year_Birth: Customer birth year

Est_Age   : Estimated age, from birth year to 2021

Education : Educational attainment

Marital_Status : Marital status

Income    : Annual income

Kidhome   : Household with kids, est. 12 years old or younger

Teenhome  : Household with teens, est. over 12 and under 18 years old

MntWines  : Wine spending

MntFruits : Fruit spending

MntMeat.. : Meat spending

MntFish...: Fish Spending

MntSweet  : Sweet or Snack spending

MntGold...: Non-essential spending

NumDeals  : purchase count with discounts or promos

NumWeb    : Online purchase count

NumStore  : Physical store purchase count

NumVisit  : Website visit count
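
The estimated age in the definitions above is a subtraction from the reference year 2021. The block below is a minimal sketch on made-up birth years. The names `toy_birth` and `reference_year` are assumptions for illustration, and this notebook does not show how the column in household.csv was actually derived.

```{r}
# Made-up birth years, not taken from household.csv
toy_birth <- data.frame(ID = c(1, 2, 3), Year_Birth = c(1970, 1985, 1996))

# Estimated age as of the reference year named in the definition
reference_year <- 2021
toy_birth$Est_Age <- reference_year - toy_birth$Year_Birth
toy_birth
```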



TOTAL PARTICIPANTS
```{r}
length(household$ID)

```
We have a total of 2,240 participants.


AGE DISTRIBUTION
```{r}
hist(household$Est_Age, main="Age distribution of Participants")
mean(household$Est_Age)
```
The average age of the participants is about 52 years.


PARTICIPANTS BY COUNTRY
```{r}
household %>% ggplot(aes(Country)) + geom_bar(fill="blue") + ggtitle("Household Participants by Country")
```
Spain has the most participants, with over 1,000; the other countries average about 150 participants each.


PARTICIPANTS BY EDUCATION
```{r}
household %>% ggplot(aes(Education)) + geom_bar(fill="blue") + ggtitle("Household Participants by Education")
```
Most participants have at least a college (Graduation) education: over 1,000 are at the Graduation level, more than 300 have a Master's degree, and more than 500 have a PhD.


INCOME DISTRIBUTION
```{r}
household %>% filter(Est_Age<100) %>% ggplot(aes(x= Income, fill=Education)) + geom_density(alpha=0.2) +  scale_x_continuous(trans = "log10", labels=scales::comma) + ggtitle("Income Density by Education")
household %>% filter(Est_Age<100) %>%  ggplot(aes(Education, Income)) + geom_boxplot() + scale_y_continuous(trans = "log10", labels=scales::comma) + ylab("income in $") + ggtitle("Income by Education of Participants")
mean(household$Income)

```

```{r}
all_participants <- length(household$ID)
group_educ <- household %>% group_by(Education) %>% summarise(total_per_education=length(Education), percent = length(Education)/all_participants*100,avg_income=mean(Income))
group_educ
```
Most participants have an income between 30,000 and 100,000 USD, with an average of 51,687. It is surprising (at least to me) that both the highest and the lowest income outliers are in the Graduation (college) category.

Participants at the Graduation and Master's levels have about the same average income, and PhD holders average about 3,000 USD more. Basic education has the lowest average income, while a second-level education still earns a competitive income.
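
Because a group average is sensitive to outliers like the ones noted above, it can help to look at the median next to the mean. The block below is a minimal sketch on made-up incomes. The data frame `toy_income` and its values are assumptions for illustration and are not results from household.csv.

```{r}
library(dplyr)

# Made-up incomes with one extreme value in the "Graduation" group
toy_income <- data.frame(
  Education = c("Graduation", "Graduation", "Graduation", "PhD", "PhD", "PhD"),
  Income    = c(30000, 50000, 600000, 52000, 55000, 58000)
)

toy_income_summary <- toy_income %>%
  group_by(Education) %>%
  summarise(
    n             = n(),
    avg_income    = mean(Income),    # pulled up by the 600,000 outlier
    median_income = median(Income),  # much less affected by it
    .groups = "drop"
  )
toy_income_summary
```

In this toy data the Graduation mean is far above its median, while the PhD mean and median coincide, which is the pattern to look for when outliers sit in one category.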



INCOME VS MARITAL STATUS
```{r}
household %>% filter(Est_Age<100) %>%  ggplot(aes(Marital_Status, Income)) + geom_boxplot() + scale_y_continuous(trans = "log10", labels=scales::comma) + ylab("income in $") + ggtitle("Income by Marital Status of Participants")
```

```{r}
all_participants <- length(household$ID)
group_marital <- household %>% group_by(Marital_Status) %>% summarise(total_per_maritalstat=length(Marital_Status), percent_all = length(Marital_Status)/all_participants*100, avg_income=mean(Income))
group_marital
```
While income differs little across marital statuses, since income is a result of education, skill, or experience, marital status may still affect spending, especially for those with kids or teens to support. Let's find out how spending differs by marital status.




SPENDING
```{r}
# Total spending per participant across the six product categories
total_spend <- household %>%
  mutate(totalspend = MntWines + MntFruits + MntMeatProducts +
           MntFishProducts + MntSweetProducts + MntGoldProds) %>%
  select(Marital_Status, totalspend)

# Spending per marital status and its share of all spending, in percent
# (kept numeric so the bar heights are drawn to scale)
total_spend_sum <- total_spend %>%
  group_by(Marital_Status) %>%
  summarise(total_per_mstat = n(),
            spending_per_mstat = sum(totalspend)) %>%
  mutate(percent_spend = spending_per_mstat / sum(spending_per_mstat) * 100)
total_spend_sum

total_spend_sum %>%
  filter(Marital_Status != "Alone" & Marital_Status != "YOLO") %>%
  ggplot(aes(Marital_Status, percent_spend)) +
  geom_bar(stat = "identity", fill = "red") +
  ggtitle("Percent Spending by Marital Status")
```
Overall, the Married category has the most spending, which is not surprising, followed by Singles. Their reasons for spending may differ, though: Married participants may focus on needs and Singles on wants. Let's find out by breaking spending down into essential and non-essential.



ESSENTIAL VS NON-ESSENTIAL SPENDING

We have six major spending categories: Wine, Sweets, Fruits, Meats, Fish, and Gold.

Fruits, Meats, and Fish are considered ESSENTIALS and the rest are NON-ESSENTIALS.
```{r}
# Essential and non-essential spending for each participant
essential_comp <- household %>%
  mutate(Essentials = MntFruits + MntMeatProducts + MntFishProducts,
         Non_essential = MntWines + MntSweetProducts + MntGoldProds) %>%
  select(Marital_Status, Essentials, Non_essential)

# Totals and per-participant averages for each marital status
essential_comp_sum <- essential_comp %>%
  group_by(Marital_Status) %>%
  summarise(total_per_mstat = n(),
            sum_Essential = sum(Essentials),
            avg_Essential = sum_Essential / total_per_mstat,
            sum_NonEssential = sum(Non_essential),
            avg_NonEssential = sum_NonEssential / total_per_mstat)
essential_comp_sum
```

```{r}
# Same y-axis range on both plots so the bar heights can be compared
plot_data <- essential_comp_sum %>%
  filter(Marital_Status != "Alone" & Marital_Status != "YOLO")
y_max <- max(plot_data$avg_Essential, plot_data$avg_NonEssential)

plot_data %>%
  ggplot(aes(Marital_Status, avg_Essential)) +
  geom_bar(stat = "identity", fill = "red") +
  ggtitle("Average Spending on Essential: Fruits, Fish, Meats") +
  ylim(0, y_max)

plot_data %>%
  ggplot(aes(Marital_Status, avg_NonEssential)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Spending on Non Essential: Wine, Sweets, Non-food") +
  ylim(0, y_max)
```
Spending is compared as the average per participant (the group's total divided by its number of participants), so that groups of different sizes can be compared directly. Overall, average spending on essential items is lower than on non-essential items, regardless of marital status. Divorced and Married participants have the lowest average essential spending.

For non-essential spending, Single participants have the lowest average, but not by much.
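
When marital-status groups differ in size, group totals and per-participant figures can tell different stories. The block below is a minimal sketch of two size-independent measures: the average spending per participant, and the share of a group's spending that goes to essentials. The data frame `toy_spend` and its values are made up for illustration and are not results from household.csv.

```{r}
library(dplyr)

# Made-up essential and non-essential spending for five participants
toy_spend <- data.frame(
  Marital_Status = c("Married", "Married", "Married", "Single", "Single"),
  Essentials     = c(100, 200, 300, 120, 180),
  Non_essential  = c(400, 300, 500, 200, 250)
)

toy_spend_summary <- toy_spend %>%
  group_by(Marital_Status) %>%
  summarise(
    participants      = n(),
    avg_essential     = mean(Essentials),
    avg_non_essential = mean(Non_essential),
    # Share of the group's spending that goes to essentials, in percent
    share_essential   = sum(Essentials) / sum(Essentials + Non_essential) * 100,
    .groups = "drop"
  )
toy_spend_summary
```

Both measures divide by something that grows with group size (the participant count or the group's own total), so a large group does not look like a heavier spender only because it has more members.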



INCOME vs SPENDING

Here we look at the relationship between income and spending. We will use correlation to evaluate how strong or weak that relationship is.

The correlation coefficient measures how strong or weak the linear relationship between two values is (here, income and spending), and it ranges from -1 to 1.

A coefficient close to 1 or -1 means a strong relationship (positive or negative), and a coefficient close to 0 means a weak relationship.
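
To make the coefficient concrete, the block below is a minimal sketch that computes the Pearson correlation from its definition (covariance divided by the product of the standard deviations) and compares it with R's built-in `cor()`. The vectors `toy_income_x` and `toy_spend_y` are made-up values for illustration, not data from household.csv.

```{r}
# Made-up income and spending values, for illustration only
toy_income_x <- c(20000, 35000, 50000, 65000, 80000)
toy_spend_y  <- c(100, 150, 600, 900, 1500)

# Pearson correlation from its definition: covariance divided by the
# product of the two standard deviations
r_manual <- cov(toy_income_x, toy_spend_y) / (sd(toy_income_x) * sd(toy_spend_y))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(toy_income_x, toy_spend_y)

all.equal(r_manual, r_builtin)
```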

```{r}
# One row per participant: income and total spending across the six categories
IncomexSpending <- household %>%
  transmute(house_income = Income,
            totalspend = MntWines + MntFruits + MntMeatProducts +
              MntFishProducts + MntSweetProducts + MntGoldProds)
head(IncomexSpending)

cor(IncomexSpending$house_income, IncomexSpending$totalspend)

# The income range is set in filter() so the log2 x scale is not replaced by xlim()
IncomexSpending %>%
  filter(house_income > 0 & house_income <= 100000) %>%
  ggplot(aes(house_income, totalspend)) +
  geom_jitter() +
  scale_x_continuous(trans = "log2", labels = scales::comma) +
  geom_smooth(method = "lm") +
  ggtitle("Overall Spending x Income Analysis")
```
Right off the bat, we can see lower spending among participants with income below 50,000 and more spending among higher-income participants. The overall correlation coefficient is about 0.65.



```{r}
# The income range is set in filter(); only the y axis needs an explicit limit
IncomexSpending %>%
  filter(house_income > 0 & house_income < 40000) %>%
  ggplot(aes(house_income, totalspend)) +
  geom_point() +
  scale_x_continuous(trans = "log2", labels = scales::comma) +
  geom_smooth(method = "lm") +
  ylim(0, 2500) +
  ggtitle("Spending x Income < 40,000")
```
Zooming in, we see that for income below 40,000 there is far less spending, and the distribution is compact: many more participants are spending less.

```{r}
# Incomes from 40,000 up to 100,000, matching the group described in the text
IncomexSpending %>%
  filter(house_income >= 40000 & house_income < 100000) %>%
  ggplot(aes(house_income, totalspend)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(trans = "log2", labels = scales::comma) +
  ggtitle("Spending x Income > 40,000")
```
For incomes above 40,000, we see a more upward spread: the greater the income, the greater the spending.

```{r}
# Correlation within each income group (the split is at 40,000)
Incomeless40k <- IncomexSpending %>% filter(house_income > 0 & house_income < 40000)
cor(Incomeless40k$house_income, Incomeless40k$totalspend)

Incomegreat40k <- IncomexSpending %>% filter(house_income >= 40000 & house_income < 100000)
cor(Incomegreat40k$house_income, Incomegreat40k$totalspend)
```
Computing the correlation coefficient for each group gives about 0.14 for incomes below 40,000 and about 0.78 for incomes between 40,000 and 100,000. The relationship is weak below 40,000, meaning spending changes little as income rises within that group. Above 40,000 the relationship is strong: spending rises clearly with income.
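
A correlation coefficient describes how closely spending moves with income inside a group, which is a separate question from how much that group spends. The block below is a minimal sketch that reports both per income band. The data frame `toy_band`, its values, and the band labels are made up for illustration and are not results from household.csv.

```{r}
library(dplyr)

# Made-up incomes and spending: flat spending below 40,000, rising above it
toy_band <- data.frame(
  house_income = c(10000, 20000, 30000, 39000, 45000, 60000, 75000, 90000),
  totalspend   = c(50, 60, 40, 55, 300, 700, 1100, 1600)
)

toy_band_cor <- toy_band %>%
  mutate(income_band = if_else(house_income < 40000, "below 40k", "40k and above")) %>%
  group_by(income_band) %>%
  summarise(
    participants = n(),
    avg_spend    = mean(totalspend),               # level of spending in the band
    correlation  = cor(house_income, totalspend),  # how spending moves with income
    .groups = "drop"
  )
toy_band_cor
```

In this toy data the lower band has both a low average and a correlation near zero, while the upper band has a high average and a correlation near 1. The two columns answer different questions and are worth reading side by side.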

NEXT UP
Now that we know who's spending more, let's find out which specific item categories our participants spend the most on, to validate spending priorities. This will help us focus our market targeting and advertising as a business entity.

STAY TUNED!


Prepared by Dodgecarl Incila















