U.S. Household Expense Analysis

Mid Term Review

Introduction:

This dataset is taken from Bureau of Labor Statistics (BLS). It contains data about family expenditure on various categories such as Housing, Food, Transportation. The dataset also provides significant outlook on the income and the education level of the samples surveyed. We strongly believe there should be a strong correlation between the expenses on alcohol and tobacco consumption, and income,other expenses and the education level. We would like to verify this using multiple linear regression model over this data set and find the most significant categories. We also felt the dataset being considered had adequate potential to address the objectives we are considering.

Objective:

We first had to find the relevant dataset which will have potential to address this concern. We went through the explanation sheets of multiple relevant dataset and picked the current data due to its abundance of relevant variables which can provide significant contribution in addressing our concern.

We have decided to extract and merge the data set in such a way that we can compare various expenses and the education level to predict the income of a sample.

Technique Followed:

We are planning study and understand the variables present in the data and select those variables which are basically relevant for the analysis (removing irrelevant variables like serial number, iteration number, document number etc). Following which we are planning to check the adequacy of the data under each variable, if the variables does not contain a certain threshold amount of responses those variables need to removed from the data being generated for the analysis.

Benefit:

This analysis will provide an outlook for the marketing team in Tobacco and Alcohol industry about the influence of various factors over the monthly expense on alcohol and cigarette of an individual. Using this model, the sales and marketing team can identify and group customers into different baskets for promoting their products.

Packages Used:

The following packages where used during the process:

library(dplyr)
library(stringr)
library(doBy)

Datasource:

The following R code connects to the URL. Downloads the zipped content, unzips and extracts the required data into a local variable for further analysis:

LocalFile <- tempfile()
download.file("https://www.bls.gov/cex/pumd/data/comma/intrvw16.zip",LocalFile)
Expenses <- read.csv(unz(LocalFile, "intrvw16/fmli161x.csv"))
unlink(LocalFile)

About the data:

The Consumer Expenditure Survey (CE) program provides data on the buying habits of American consumers. These data are primarily used as weights for the Consumer Price Index.

The original dataset has 926 variables and the detailed definition of each of the variables are available at the follwing link:

https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx

Variable Overlook:

Among 926 variables, after riguourous discussion and analysis we decided to go ahead with the following subset of variables for our analysis.

We have formed a comprehensive table to describe each variable’s characteristic:

FINCBTAX	Total amount of family income before taxes. It is a of type Integer.
VEHQ	Total number of owned vehicles. It is of type Integer.
FAM_SIZE	Number of Members in CU SUM OF MEMBERS BY CU_NUM. It is of type Integer.
ALCBEVPQ	Alcoholic beverages last quarter. It is of type numeric.
HEALTHPQ	Health care last quarter. It is of type numeric.
TOBACCPQ	Tobacco and smoking supplies last quarter. It is of type numeric.
BLS_URBN	Urban/Rural. It is of type numeric.
HLTHINPQ	Health insurance last quarter. It is of type numeric.
FOODPQ	Total food expense last quarter. It is of type numeric.
HIGH_EDU	Highest level of education within the CU. It is of type Integer.
IRAX	As of today, what is the total value of all retirement accounts, such as 401(k)s, IRAs. It is of type factor.
BEDROOMQ	Number of bedrooms in CU. It is of type factor.
HOUSPQ	Housing last quarter. It is of type numeric. It is of type numeric.
SHELTPQ	Shelter last quarter. It is of type numeric. It is of type numeric.
PROPTXPQ	Property taxes last quarter. It is of type numeric.
MRPINSPQ	Maintenance, repairs, insurance, and other expenses last quarter. It is of type numeric.
RENDWEPQ	Rented dwelling last quarter
RNTXRPPQ	Rent excluding rent as pay last quarter. It is of type numeric.
UTILPQ	Utilities, fuels and public services last quarter NTLGASPQ + ELCTRCPQ + ALLFULPQ + TELEPHPQ + WATRPSPQ
ELCTRCPQ	Electricity last quarter. It is of type numeric.
TELEPHPQ	Telephone services last quarter. It is of type numeric.
WATRPSPQ	Water and other public services last quarter. It is of type int.
HOUSOPPQ	Household operations last quarter. It is of type int.
HOUSEQPQ	House furnishings and equipment last quarter. It is of type num.
FURNTRPQ	Furniture last quarter. It is of type num.
TRANSPQ	Transportation last quarter. It is of type int.
GASMOPQ	Gasoline and motor oil last quarter. It is of type num.

Data Preparation Process:

Importing the data into our work environment.

After importing the data,few of the cells were not identified as NA by R. So we manually identified the character and converted them to NA to analyse the dataset.

Expenses_sub[Expenses_sub == '.'] = NA

We felt few of the columns had more NA values than expected. So we decided to removed the observations which had more than 10% of its content as NAs.

which(colMeans(is.na(Expenses_sub)) > 0.10)
Expenses_sub <- select(Expenses_sub,-12,-29,-30)

We removed the observations with NA values.

Expenses_sub <- na.omit(Expenses_sub)
dim(Expenses_sub)

## [1] 6111   28

Outlier Analysis:

We created a comprehensive Boxplot of all the variables to analyse the outliers.

boxplot(Expenses_sub,header = TRUE)

We found Varaibles 1,5,13,14,24,25,26 had outliers which needed a fix.

Analysing the first variable with outliers

# Removing outlier for
boxplot(Expenses_sub[,1],labels = TRUE)

Removing the outliers from first column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,1], n = 15),]
boxplot(Expenses_sub[,1],labels = TRUE)

Analysing variable 5 which has outliers

boxplot(Expenses_sub[,5],labels = TRUE)

Removing the outliers from 5th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,5], n = 3),]
boxplot(Expenses_sub[,5],labels = TRUE)

Analysing variable 13 which has outliers

boxplot(Expenses_sub[,13],labels = TRUE)

Removing the outliers from 13th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,13], n = 6),]
boxplot(Expenses_sub[,13],labels = TRUE)

Analysing variable 14 which has outliers.

boxplot(Expenses_sub[,14],labels = TRUE)

Removing the outliers from 14th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,14], n = 8),]
boxplot(Expenses_sub[,14],labels = TRUE)

Analysing variable 24 which has outliers.

boxplot(Expenses_sub[,24],labels = TRUE)

Removing the outliers from 24th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,24], n = 4),]
boxplot(Expenses_sub[,24],labels = TRUE)

Analysing variable 25 which has outliers.

boxplot(Expenses_sub[,25],labels = TRUE)

Removing the outliers from 25th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,25], n = 15),]
boxplot(Expenses_sub[,25],labels = TRUE)

Analysing variable 26 which has outliers.

boxplot(Expenses_sub[,26],labels = TRUE)

Removing the outliers from 26th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,26], n = 5),]
boxplot(Expenses_sub[,26],labels = TRUE)

Final Boxplot after removing all the outlier

#General Box Plot After removing ouliers
boxplot(Expenses_sub,labels = TRUE)

Final look of our cleaned data, ready for our analysis.

glimpse(Expenses_sub)

## Observations: 6,055
## Variables: 28
## $ FINCBTAX <int> 76000, 55002, 5537, 54256, 14686, 185050, 12000, 2700...
## $ VEHQ     <int> 2, 2, 0, 3, 0, 3, 1, 1, 2, 1, 2, 1, 2, 0, 1, 1, 0, 3,...
## $ FAM_SIZE <int> 4, 1, 3, 2, 1, 2, 1, 1, 2, 4, 1, 5, 2, 7, 1, 1, 1, 5,...
## $ ALCBEVPQ <dbl> 30, 300, 0, 75, 0, 360, 12, 0, 465, 259, 0, 0, 0, 0, ...
## $ HEALTHPQ <dbl> 489.0, 1179.0, 0.0, 1577.7, 100.0, 0.0, 0.0, 0.0, 383...
## $ TOBACCPQ <dbl> 195, 0, 0, 65, 0, 65, 0, 0, 0, 182, 0, 0, 0, 0, 0, 0,...
## $ BLS_URBN <int> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ HLTHINPQ <dbl> 489.0, 975.0, 0.0, 1376.7, 0.0, 0.0, 0.0, 0.0, 3606.0...
## $ FJSSDEDX <int> 5815, 4208, 0, 2601, 0, 13210, 918, 2066, 0, 11295, 0...
## $ FOODPQ   <dbl> 1480, 1820, 1755, 949, 1300, 2860, 4030, 1495, 2103, ...
## $ HIGH_EDU <int> 15, 15, 14, 13, 12, 15, 15, 12, 15, 15, 12, 12, 12, 1...
## $ BEDROOMQ <fctr> 3, 4, 2, 3, 3, 5, 1, 2, 3, 4, 3, 5, 2, 5, 2, 2, 1, 3...
## $ HOUSPQ   <dbl> 4440, 2693, 4578, 2295, 4976, 14935, 2523, 3132, 1062...
## $ SHELTPQ  <dbl> 3075, 1638, 2925, 863, 4153, 9838, 1872, 2400, 3304, ...
## $ PROPTXPQ <dbl> 0, 550, 0, 300, 300, 1750, 0, 0, 925, 1575, 475, 0, 2...
## $ MRPINSPQ <dbl> 0, 0, 0, 413, 100, 243, 0, 0, 588, 0, 0, 0, 0, 0, 643...
## $ RENDWEPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 2400, 0, 0, 0, 1305, 0,...
## $ RNTXRPPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 2400, 0, 0, 0, 1305, 0,...
## $ UTILPQ   <dbl> 1215, 1055, 1151, 966, 468, 1087, 645, 732, 1040, 185...
## $ ELCTRCPQ <int> 278, 150, 641, 186, 391, 255, 105, 0, 207, 373, 258, ...
## $ TELEPHPQ <dbl> 600, 249, 510, 356, 0, 114, 540, 267, 602, 485, 184, ...
## $ WATRPSPQ <int> 165, 266, 0, 242, 77, 508, 0, 0, 189, 788, 136, 0, 11...
## $ HOUSOPPQ <dbl> 150, 0, 502, 326, 300, 1850, 0, 0, 2499, 2457, 153, 1...
## $ HOUSEQPQ <dbl> 0, 0, 0, 140, 55, 2160, 6, 0, 3785, 412, 214, 0, 566,...
## $ FURNTRPQ <int> 0, 0, 0, 140, 0, 2000, 0, 0, 1323, 0, 0, 0, 286, 0, 0...
## $ TRANSPQ  <dbl> 1158, 17549, 600, 2615, 860, 751, 257, 410, 1407, 465...
## $ GASMOPQ  <dbl> 600.00, 900.00, 0.00, 1050.00, 600.00, 120.00, 192.75...
## $ REGION   <int> 4, 1, 4, 2, 3, 4, 3, 1, 4, 2, 3, 1, 2, 2, 1, 4, 3, 3,...

Slicing and Dicing:

We have planned to slice the data region wise and analyse its trend of Alcohol and Tobacco consumption.

What next? :

A geospatial visualisation of the data would be great. But we don’t have adequate information like Latitude and Longitude in the dataset to entertain this requirement. But still we have a generalised value over most of the observations which provides the region of the observation (Eg: North,East, West etc).

We have plans of generating scattered charts against income,education and expenses on alcohol and Tobacco products to identify the presence of any linear relationship in the date.

The things we need to work on:

We are not sure, if geospatial visualisation is possible in R and even if it supports geospatial visualization we are still not sure if our data will fit the requirement.

Machine Learning Techniques:

We are planning to build a multiple linear regression model that would predict the family consumption of Alcohol and Tobacco with other household expense variables and certain other factors such as higher education level and size of the family.