Final Project Report

Introduction

Introduction

This dataset is taken from Bureau of Labor Statistics (BLS). It contains data about family expenditure on various categories such as Housing, Food, Transportation. The dataset also provides significant outlook on the income and the education level of the samples surveyed. We strongly believe there should be a strong correlation between the expenses on alcohol and tobacco consumption, and income,other expenses and the education level. We would like to verify this using multiple linear regression model over this data set and find the most significant categories. We also felt the dataset being considered had adequate potential to address the objectives we are considering.

Objective:

We first had to find the relevant dataset which will have potential to address this concern. We went through the explanation sheets of multiple relevant dataset and picked the current data due to its abundance of relevant variables which can provide significant contribution in addressing our concern.

We have decided to extract and merge the data set in such a way that we can compare various expenses and the education level to predict the income of a sample.

Technique Followed:

We are planning study and understand the variables present in the data and select those variables which are basically relevant for the analysis (removing irrelevant variables like serial number, iteration number, document number etc). Following which we are planning to check the adequacy of the data under each variable, if the variables does not contain a certain threshold amount of responses those variables need to removed from the data being generated for the analysis.

Benefit:

This analysis will provide an outlook for the marketing team in Tobacco and Alcohol industry about the influence of various factors over the monthly expense on alcohol and cigarette of an individual. Using this model, the sales and marketing team can identify and group customers into different baskets for promoting their products.

Packages Used

The following packages where used during the process:

library(dplyr)
library(doBy)
library(ggplot2)
library(fiftystater)
library(mapproj)
library(ggalt)

Datasource:

The following R code connects to the URL. Downloads the zipped content, unzips and extracts the required data into a local variable for further analysis:

LocalFile <- tempfile()
download.file("https://www.bls.gov/cex/pumd/data/comma/intrvw16.zip",LocalFile)
Expenses <- read.csv(unz(LocalFile, "intrvw16/fmli161x.csv"))
unlink(LocalFile)

About data

The Consumer Expenditure Survey (CE) program provides data on the buying habits of American consumers. These data are primarily used as weights for the Consumer Price Index.

The original dataset has 926 variables and the detailed definition of each of the variables are available at the follwing link:

https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx

Variable Overlook:

Among 926 variables, after riguourous discussion and analysis we decided to go ahead with the following subset of variables for our analysis.

We have formed a comprehensive table to describe each variable’s characteristic:

FINCBTAX
Total amount of family income before taxes. It is a of type Integer. 
VEHQ Total number of owned vehicles. It is of type Integer.
FAM_SIZE Number of Members in CU SUM OF MEMBERS BY CU_NUM. It is of type Integer.
ALCBEVPQ Alcoholic beverages last quarter. It is of type numeric.
HEALTHPQ Health care last quarter. It is of type numeric.
TOBACCPQ Tobacco and smoking supplies last quarter. It is of type numeric.
BLS_URBN Urban/Rural. It is of type numeric.
HLTHINPQ Health insurance last quarter. It is of type numeric.
FOODPQ Total food expense last quarter. It is of type numeric.
HIGH_EDU Highest level of education within the CU. It is of type Integer.
IRAX As of today, what is the total value of all retirement accounts, such as 401(k)s, IRAs. It is of type factor.
BEDROOMQ Number of bedrooms in CU. It is of type factor.
HOUSPQ Housing last quarter. It is of type numeric. It is of type numeric.
SHELTPQ Shelter last quarter. It is of type numeric. It is of type numeric.
PROPTXPQ Property taxes last quarter. It is of type numeric.
MRPINSPQ Maintenance, repairs, insurance, and other expenses last quarter. It is of type numeric.
RENDWEPQ Rented dwelling last quarter
RNTXRPPQ Rent excluding rent as pay last quarter. It is of type numeric.
UTILPQ Utilities, fuels and public services last quarter NTLGASPQ + ELCTRCPQ + ALLFULPQ + TELEPHPQ + WATRPSPQ
ELCTRCPQ Electricity last quarter. It is of type numeric.
TELEPHPQ Telephone services last quarter. It is of type numeric.
WATRPSPQ Water and other public services last quarter. It is of type int.
HOUSOPPQ Household operations last quarter.  It is of type int.
HOUSEQPQ House furnishings and equipment last quarter.  It is of type num.
FURNTRPQ Furniture last quarter.  It is of type num.
TRANSPQ Transportation last quarter.  It is of type int.
GASMOPQ Gasoline and motor oil last quarter.  It is of type num. 

Data Preparation Process

Importing the data into our work environment.

After importing the data,few of the cells were not identified as NA by R. So we manually identified the character and converted them to NA to analyse the dataset.

Expenses_sub[Expenses_sub == '.'] = NA

We felt few of the columns had more NA values than expected. So we decided to removed the observations which had more than 10% of its content as NAs.

which(colMeans(is.na(Expenses_sub)) > 0.10)
Expenses_sub <- select(Expenses_sub,-12,-29,-30)

We removed the observations with NA values.

Expenses_sub <- na.omit(Expenses_sub)
dim(Expenses_sub)
## [1] 5698   29

Outlier Analysis:

We created a comprehensive Boxplot of all the variables to analyse the outliers.

boxplot(Expenses_sub,header = TRUE)

We found Varaibles 1,5,13,14,24,25,26 had outliers which needed a fix.

Analysing the first variable with outliers

# Removing outlier for
boxplot(Expenses_sub[,1],labels = TRUE)

Removing the outliers from first column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,1], n = 15),]
boxplot(Expenses_sub[,1],labels = TRUE)

Analysing variable 5 which has outliers

boxplot(Expenses_sub[,5],labels = TRUE)

Removing the outliers from 5th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,5], n = 3),]
boxplot(Expenses_sub[,5],labels = TRUE)

Analysing variable 13 which has outliers

boxplot(Expenses_sub[,13],labels = TRUE)

Removing the outliers from 13th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,13], n = 6),]
boxplot(Expenses_sub[,13],labels = TRUE)

Analysing variable 14 which has outliers.

boxplot(Expenses_sub[,14],labels = TRUE)

Removing the outliers from 14th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,14], n = 8),]
boxplot(Expenses_sub[,14],labels = TRUE)

Analysing variable 24 which has outliers.

boxplot(Expenses_sub[,24],labels = TRUE)

Removing the outliers from 24th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,24], n = 4),]
boxplot(Expenses_sub[,24],labels = TRUE)

Analysing variable 25 which has outliers.

boxplot(Expenses_sub[,25],labels = TRUE)

Removing the outliers from 25th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,25], n = 15),]
boxplot(Expenses_sub[,25],labels = TRUE)

Analysing variable 26 which has outliers.

boxplot(Expenses_sub[,26],labels = TRUE)

Removing the outliers from 26th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,26], n = 5),]
boxplot(Expenses_sub[,26],labels = TRUE)

Final Boxplot after removing all the outlier

#General Box Plot After removing ouliers
boxplot(Expenses_sub,labels = TRUE)

Final look of our cleaned data, ready for our analysis.

glimpse(Expenses_sub)
## Observations: 5,642
## Variables: 29
## $ FINCBTAX <int> 76000, 55002, 5537, 54256, 14686, 185050, 12000, 6000...
## $ VEHQ     <int> 2, 2, 0, 3, 0, 3, 1, 2, 1, 2, 1, 2, 0, 1, 1, 0, 3, 1,...
## $ FAM_SIZE <int> 4, 1, 3, 2, 1, 2, 1, 2, 4, 1, 5, 2, 7, 1, 1, 1, 5, 2,...
## $ ALCBEVPQ <dbl> 30, 300, 0, 75, 0, 360, 12, 465, 259, 0, 0, 0, 0, 0, ...
## $ HEALTHPQ <dbl> 489.0, 1179.0, 0.0, 1577.7, 100.0, 0.0, 0.0, 3833.0, ...
## $ TOBACCPQ <dbl> 195, 0, 0, 65, 0, 65, 0, 0, 182, 0, 0, 0, 0, 0, 0, 0,...
## $ BLS_URBN <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ HLTHINPQ <dbl> 489.0, 975.0, 0.0, 1376.7, 0.0, 0.0, 0.0, 3606.0, 259...
## $ FJSSDEDX <int> 5815, 4208, 0, 2601, 0, 13210, 918, 0, 11295, 0, 1989...
## $ FOODPQ   <dbl> 1480, 1820, 1755, 949, 1300, 2860, 4030, 2103, 3033, ...
## $ HIGH_EDU <int> 15, 15, 14, 13, 12, 15, 15, 15, 15, 12, 12, 12, 13, 1...
## $ BEDROOMQ <fctr> 3, 4, 2, 3, 3, 5, 1, 3, 4, 3, 5, 2, 5, 2, 2, 1, 3, 2...
## $ HOUSPQ   <dbl> 4440, 2693, 4578, 2295, 4976, 14935, 2523, 10628, 100...
## $ SHELTPQ  <dbl> 3075, 1638, 2925, 863, 4153, 9838, 1872, 3304, 5319, ...
## $ PROPTXPQ <dbl> 0, 550, 0, 300, 300, 1750, 0, 925, 1575, 475, 0, 250,...
## $ MRPINSPQ <dbl> 0, 0, 0, 413, 100, 243, 0, 588, 0, 0, 0, 0, 0, 643, 0...
## $ RENDWEPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 0, 0, 0, 1305, 0, 2475,...
## $ RNTXRPPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 0, 0, 0, 1305, 0, 2475,...
## $ UTILPQ   <dbl> 1215, 1055, 1151, 966, 468, 1087, 645, 1040, 1852, 57...
## $ ELCTRCPQ <int> 278, 150, 641, 186, 391, 255, 105, 207, 373, 258, 420...
## $ TELEPHPQ <dbl> 600, 249, 510, 356, 0, 114, 540, 602, 485, 184, 123, ...
## $ WATRPSPQ <int> 165, 266, 0, 242, 77, 508, 0, 189, 788, 136, 0, 116, ...
## $ HOUSOPPQ <dbl> 150, 0, 502, 326, 300, 1850, 0, 2499, 2457, 153, 198,...
## $ HOUSEQPQ <dbl> 0, 0, 0, 140, 55, 2160, 6, 3785, 412, 214, 0, 566, 0,...
## $ FURNTRPQ <int> 0, 0, 0, 140, 0, 2000, 0, 1323, 0, 0, 0, 286, 0, 0, 0...
## $ TRANSPQ  <dbl> 1158, 17549, 600, 2615, 860, 751, 257, 1407, 4654, 85...
## $ GASMOPQ  <dbl> 600.00, 900.00, 0.00, 1050.00, 600.00, 120.00, 192.75...
## $ REGION   <int> 4, 1, 4, 2, 3, 4, 3, 4, 2, 3, 1, 2, 2, 1, 4, 3, 3, 1,...
## $ STATE    <int> 8, 42, 15, 26, 12, 41, 48, 6, 26, 24, 42, 26, 26, 25,...

Merging New Variables

# link for state data : https://www2.census.gov/geo/docs/reference/state.txt
#state_code_data  <-  read.csv("State_codes.csv")
state_code_data  <-  read.csv("https://www.dropbox.com/s/65nbigkn3k7juyx/State_codes.csv?raw=1")


# New data set with state information
Updated_exp_data <- merge(Expenses_sub,state_code_data,by="STATE")

#Creating new variable ALC_TOB_PQ 

Updated_exp_data <- Updated_exp_data %>% 
  mutate(ALC_TOB_PQ = ALCBEVPQ + TOBACCPQ)

#Creating New dataframe for averages
State_exp_average <- aggregate(Updated_exp_data$ALC_TOB_PQ, by=list(State_Name = Updated_exp_data$STATE_NAME), FUN=mean)
State_exp_average <- aggregate(list(Updated_exp_data$ALC_TOB_PQ,Updated_exp_data$ALCBEVPQ,Updated_exp_data$TOBACCPQ), by=list(State_Name = Updated_exp_data$STATE_NAME), FUN=mean)
colnames(State_exp_average) <- c('STATE_NAME','Avg_Expense','Avg_Exp_Alcohol','Avg_Exp_Tobacco')
#order the dataset based on top 10 expenses
State_exp_average <- arrange(State_exp_average, desc(Avg_Expense))
#State_exp_average$State <- tolower(State_exp_average$State)
Updated_exp_data_map <- merge(x = state_code_data, y = State_exp_average, by = "STATE_NAME", all.x = TRUE)
Updated_exp_data_map[is.na(Updated_exp_data_map)] <- 0
Updated_exp_data_map$STATE_NAME <-tolower(Updated_exp_data_map$STATE_NAME)
#Removing States without adequate data and rounding off to two digits

Updated_exp_data_map$Avg_Expense <-  as.numeric(format(round(Updated_exp_data_map$Avg_Expense, 2), nsmall = 2))
Updated_exp_data_map$Avg_Exp_Alcohol <-  as.numeric(format(round(Updated_exp_data_map$Avg_Exp_Alcohol, 2), nsmall = 2))
Updated_exp_data_map$Avg_Exp_Tobacco <-  as.numeric(format(round(Updated_exp_data_map$Avg_Exp_Tobacco, 2), nsmall = 2))

Data Exploration

Spatial Mapping the Tobacco Consumption

People of the states of Indiana and New Hampshire spend the most on Tobacco throughout United States. New Hampshire people was amongst the top states when Alcohol expenditure was concerned but in Indiana the expenditure on alcohol was on the lower side.

data("fifty_states") 
p <- ggplot(Updated_exp_data_map, aes(map_id = STATE_NAME)) + 
  # map points to the fifty_states shape data
  geom_map(aes(fill = Avg_Exp_Tobacco), map = fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(legend.position = "bottom", 
        panel.background = element_blank())

p <- p + guides(fill=guide_legend(title="Monthly Average Spending on Tobacco($)"))
p

Spatial Mapping the Alcohol Consumption

The following map was generated to show the Average alcohol expense in all the states. The STATE variable in the Expenses_Sub data set contains the code numbers of the various states and data set was merged with a Data set, States code, which has name of the states associated with State Code number.

The Darker blue color shows the states with lowest alcohol consumption and as the consumption increases the blue color continues to get lighter. Colorado and New Hampshire are the states which have maximum expenditure on alcohol.

data("fifty_states") 
p <- ggplot(Updated_exp_data_map, aes(map_id = STATE_NAME)) + 
  # map points to the fifty_states shape data
  geom_map(aes(fill = Avg_Exp_Alcohol), map = fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(legend.position = "bottom", 
        panel.background = element_blank())

p <- p + guides(fill=guide_legend(title="Monthly Average Spending on Alcohol($)"))
p

Spatial Mapping the Alcohol & Tobacco Consumption

New Hampshire was the leader in the combined expenditure on Alcohol and Tobacco. This comes as no surprise as expenditure on individual categories of Alcohol and Tobacco was also highest in New Hampshire. Colorado due to its high expenditure in Alcohol was at the second spot in the combined expenditure.

data("fifty_states") 
p <- ggplot(Updated_exp_data_map, aes(map_id = STATE_NAME)) + 
  # map points to the fifty_states shape data
  geom_map(aes(fill = Avg_Expense), map = fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(legend.position = "bottom", 
        panel.background = element_blank())

p <- p + guides(fill=guide_legend(title="Monthly Average Spending on Tobacco and Alcohol ($)"))
p

Distribution

The average expense is pretty much equally distributed across the national average. The only state which has exceptional consumption is New Hampshire.

#create US average bar chart
Updated_exp_data_map<-Updated_exp_data_map[!(Updated_exp_data_map$Avg_Expense==0),]

ggplot(data=Updated_exp_data_map, aes(x=STUSAB, y=Updated_exp_data_map$Avg_Expense)) +
  geom_bar(stat="identity", fill="steelblue")+
  labs(x = "States",srt = 90)+
  labs(y = "Average Expense on Alcohol and Tobacco($)")+
  geom_hline(yintercept = mean(Updated_exp_data_map$Avg_Expense))+
  geom_text(aes('HI',mean(Updated_exp_data_map$Avg_Expense)+10,label = format(round(mean(Updated_exp_data_map$Avg_Expense), 2), nsmall = 2)))+
  theme_minimal()

A look at the average

This visualization has given a comprehensive overview of the distribution of the consumption across various states of the United States. It is valuable to note that more than 60% of the cities have consumption less than the national average. This gives us the learning that the national average of alcohol and tobacco consumption is partially dominated by the outliers like Colorado and Oregon.

theme_set(theme_bw())
ggplot(Updated_exp_data_map, aes(x=`STUSAB`, y=Avg_Expense, label=Avg_Expense)) + 
  geom_point(stat='identity', fill="black", size=8)  +
  geom_segment(aes(y = mean(Avg_Expense), 
                   x = `STUSAB`, 
                   yend = Avg_Expense, 
                   xend = `STUSAB`), 
               color = "black") +
  geom_text(color="grey", size=2) +
  labs(title="Diverging Lollipop Chart", 
       subtitle="Consumption variance of states from national average", x = "Average Expense($)", y = "State") + 
  ylim(10, 300) +
  coord_flip()

Understanding the regional trend within the United States

To understand the relation between Alcohol and Tobacco expenses combined with Food Expenses for all the four regions a Scatter plot with ggplot and geom_point is created and a regression line is added with the geom_smooth to understand the relationship. As the plot shoes Region 2, 3 and 4 namely Midwestern, Southern, Western respectively had no co-relation whereas in case of Region 1 i.e. Northeastern a minimal correlation was discovered.

ggplot(data = Updated_exp_data_Scatter, aes(x = FOODPQ, y = ALCBEVPQ, col = REGION)) +
  geom_point(alpha = 0.1) +
  geom_jitter() +
  geom_smooth(method = "loess", se =  F) +
  xlim(c(0,10000)) + 
  ylim(c(0,2500)) +
  labs(x="Expense on Food",y="Expense on Alcohol")

Difference between Tobacco and Alcohol consumption

To demonstrate the difference between the Alcohol and Tobacco expenditure in different states a dumbbell chart was created. The red Dot in the chart describes the Alcohol expense and the black dot depicts the Tobacco Expense. The line between these dots shoes the magnitude of the difference in the expenditures in the two categories. If red dot is on the right side it can be concluded that consumption of Alcohol was more and black dot on the right side shows higher expenditure on tobacco for that particular State.

#Dumbell Chart
gg <- ggplot(Updated_exp_data_map, aes(x=Avg_Exp_Alcohol, xend=Avg_Exp_Tobacco, y=`STUSAB`, group=`STUSAB`)) + 
  geom_dumbbell(color="#a3c4dc", colour_x = "red", colour_xend = "black",
                size=0.75, 
                point.colour.l="#0e668b",show.legend = TRUE) + 
  scale_x_continuous(limits=c(0, 250)) + 
  labs(x="Average Spending($)", 
       y="State", 
       title="Dumbbell Chart", 
       subtitle="Differentiates the consumption of Tobacco(Black) and Alcohol(Red)") +
  theme(plot.title = element_text(hjust=0.5, face="bold"),
        plot.background=element_rect(fill="#f7f7f7"),
        panel.background=element_rect(fill="#f7f7f7"),
        panel.grid.minor=element_blank(),
        panel.grid.major.y=element_blank(),
        panel.grid.major.x=element_line(),
        axis.ticks=element_blank(),
        legend.position="right",
        panel.border=element_blank())
plot(gg)

Summary

Summary

The main motive behind the project was to understand the underling relationship behind the expenditure on Alcohol and Tobacco from different dimensions. The data chosen (Bureau of Labor Statistics (BLS)) contained all the required potential for addressing all our questions and concerns.

Though the dataset had the required potential, it wasn’t in the right format for carrying out the analysis procedures. We cleaned the dataset multiple times to get it into proper shape without losing much of its contents. Also, we created fields like Average Total Expense based on variables present in the dataset. The dataset also lacked few information related to State Codes. To fulfill this, we had to merge couple of another file to get the required field in the main dataset being analyzed. It didn’t end here, this is just an overview of the work we performed. All steps are mentioned detailed in the data preparation tab of this report.

During the analysis, we found a lot of interesting facts hidden behind the data. First and fore most we were surprised to see that New Hampshire had an exceptional consumption when it came to Alcohol and Tobacco. There is a possibility that the cost and taxes involved in purchasing these products in NH can be exceptionally different from other states. But we couldn’t perform research on that part as it was beyond our scope for this project. Also, we found that the national average of expenditure was partially dominated by few states in the list(Colorado and Oregon). There is a higher possibility that is due to the population of the state which can affect the national average household expense in general.

The motive of the project was to identify the states which has higher consumption of Alcohol and Tobacco and find the reason for this activity. End users can use this report to understand the state wise information about this consumption. Also, they can benefit by comparing their observations quickly with the other states in the country.

Last but not the least, there are few limitations in this dataset. The dataset doesn’t contain data related to population and Taxes of the states being analyzed. Also, this study has been performed over a limit time frame of data to address the memory and processing capabilities of the system used. The information can differ while studied under a bigger range of time frame.

About us