Mid Term Review

Introduction:

This dataset is taken from Bureau of Labor Statistics (BLS). It contains data about family expenditure on various categories such as Housing, Food, Transportation. The dataset also provides significant outlook on the income and the education level of the samples surveyed. We strongly believe there should be a strong correlation between the expenses on alcohol and tobacco consumption, and income,other expenses and the education level. We would like to verify this using multiple linear regression model over this data set and find the most significant categories. We also felt the dataset being considered had adequate potential to address the objectives we are considering.

Objective:

We first had to find the relevant dataset which will have potential to address this concern. We went through the explanation sheets of multiple relevant dataset and picked the current data due to its abundance of relevant variables which can provide significant contribution in addressing our concern.

We have decided to extract and merge the data set in such a way that we can compare various expenses and the education level to predict the income of a sample.

Technique Followed:

We are planning study and understand the variables present in the data and select those variables which are basically relevant for the analysis (removing irrelevant variables like serial number, iteration number, document number etc). Following which we are planning to check the adequacy of the data under each variable, if the variables does not contain a certain threshold amount of responses those variables need to removed from the data being generated for the analysis.

Benefit:

This analysis will provide an outlook for the marketing team in Tobacco and Alcohol industry about the influence of various factors over the monthly expense on alcohol and cigarette of an individual. Using this model, the sales and marketing team can identify and group customers into different baskets for promoting their products.

Packages Used:

The following packages where used during the process:

library(dplyr)
library(stringr)
library(doBy)

Datasource:

The following R code connects to the URL. Downloads the zipped content, unzips and extracts the required data into a local variable for further analysis:

LocalFile <- tempfile()
download.file("https://www.bls.gov/cex/pumd/data/comma/intrvw16.zip",LocalFile)
Expenses <- read.csv(unz(LocalFile, "intrvw16/fmli161x.csv"))
unlink(LocalFile)

About the data:

The Consumer Expenditure Survey (CE) program provides data on the buying habits of American consumers. These data are primarily used as weights for the Consumer Price Index.

The original dataset has 926 variables and the detailed definition of each of the variables are available at the follwing link:

https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx

Variable Overlook:

Among 926 variables, after riguourous discussion and analysis we decided to go ahead with the following subset of variables for our analysis.

We have formed a comprehensive table to describe each variable’s characteristic:

FINCBTAX
Total amount of family income before taxes. It is a of type Integer. 
VEHQ Total number of owned vehicles. It is of type Integer.
FAM_SIZE Number of Members in CU SUM OF MEMBERS BY CU_NUM. It is of type Integer.
ALCBEVPQ Alcoholic beverages last quarter. It is of type numeric.
HEALTHPQ Health care last quarter. It is of type numeric.
TOBACCPQ Tobacco and smoking supplies last quarter. It is of type numeric.
BLS_URBN Urban/Rural. It is of type numeric.
HLTHINPQ Health insurance last quarter. It is of type numeric.
FOODPQ Total food expense last quarter. It is of type numeric.
HIGH_EDU Highest level of education within the CU. It is of type Integer.
IRAX As of today, what is the total value of all retirement accounts, such as 401(k)s, IRAs. It is of type factor.
BEDROOMQ Number of bedrooms in CU. It is of type factor.
HOUSPQ Housing last quarter. It is of type numeric. It is of type numeric.
SHELTPQ Shelter last quarter. It is of type numeric. It is of type numeric.
PROPTXPQ Property taxes last quarter. It is of type numeric.
MRPINSPQ Maintenance, repairs, insurance, and other expenses last quarter. It is of type numeric.
RENDWEPQ Rented dwelling last quarter
RNTXRPPQ Rent excluding rent as pay last quarter. It is of type numeric.
UTILPQ Utilities, fuels and public services last quarter NTLGASPQ + ELCTRCPQ + ALLFULPQ + TELEPHPQ + WATRPSPQ
ELCTRCPQ Electricity last quarter. It is of type numeric.
TELEPHPQ Telephone services last quarter. It is of type numeric.
WATRPSPQ Water and other public services last quarter. It is of type int.
HOUSOPPQ Household operations last quarter.  It is of type int.
HOUSEQPQ House furnishings and equipment last quarter.  It is of type num.
FURNTRPQ Furniture last quarter.  It is of type num.
TRANSPQ Transportation last quarter.  It is of type int.
GASMOPQ Gasoline and motor oil last quarter.  It is of type num. 

Data Preparation Process:

Importing the data into our work environment.

After importing the data,few of the cells were not identified as NA by R. So we manually identified the character and converted them to NA to analyse the dataset.

Expenses_sub[Expenses_sub == '.'] = NA

We felt few of the columns had more NA values than expected. So we decided to removed the observations which had more than 10% of its content as NAs.

which(colMeans(is.na(Expenses_sub)) > 0.10)
Expenses_sub <- select(Expenses_sub,-12,-29,-30)

We removed the observations with NA values.

Expenses_sub <- na.omit(Expenses_sub)
dim(Expenses_sub)
## [1] 6111   28

Outlier Analysis:

We created a comprehensive Boxplot of all the variables to analyse the outliers.

boxplot(Expenses_sub,header = TRUE)

We found Varaibles 1,5,13,14,24,25,26 had outliers which needed a fix.

Analysing the first variable with outliers

# Removing outlier for
boxplot(Expenses_sub[,1],labels = TRUE)

Removing the outliers from first column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,1], n = 15),]
boxplot(Expenses_sub[,1],labels = TRUE)

Analysing variable 5 which has outliers

boxplot(Expenses_sub[,5],labels = TRUE)

Removing the outliers from 5th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,5], n = 3),]
boxplot(Expenses_sub[,5],labels = TRUE)

Analysing variable 13 which has outliers

boxplot(Expenses_sub[,13],labels = TRUE)

Removing the outliers from 13th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,13], n = 6),]
boxplot(Expenses_sub[,13],labels = TRUE)

Analysing variable 14 which has outliers.

boxplot(Expenses_sub[,14],labels = TRUE)

Removing the outliers from 14th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,14], n = 8),]
boxplot(Expenses_sub[,14],labels = TRUE)

Analysing variable 24 which has outliers.

boxplot(Expenses_sub[,24],labels = TRUE)

Removing the outliers from 24th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,24], n = 4),]
boxplot(Expenses_sub[,24],labels = TRUE)

Analysing variable 25 which has outliers.

boxplot(Expenses_sub[,25],labels = TRUE)

Removing the outliers from 25th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,25], n = 15),]
boxplot(Expenses_sub[,25],labels = TRUE)

Analysing variable 26 which has outliers.

boxplot(Expenses_sub[,26],labels = TRUE)

Removing the outliers from 26th column.

Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,26], n = 5),]
boxplot(Expenses_sub[,26],labels = TRUE)

Final Boxplot after removing all the outlier

#General Box Plot After removing ouliers
boxplot(Expenses_sub,labels = TRUE)

Final look of our cleaned data, ready for our analysis.

glimpse(Expenses_sub)
## Observations: 6,055
## Variables: 28
## $ FINCBTAX <int> 76000, 55002, 5537, 54256, 14686, 185050, 12000, 2700...
## $ VEHQ     <int> 2, 2, 0, 3, 0, 3, 1, 1, 2, 1, 2, 1, 2, 0, 1, 1, 0, 3,...
## $ FAM_SIZE <int> 4, 1, 3, 2, 1, 2, 1, 1, 2, 4, 1, 5, 2, 7, 1, 1, 1, 5,...
## $ ALCBEVPQ <dbl> 30, 300, 0, 75, 0, 360, 12, 0, 465, 259, 0, 0, 0, 0, ...
## $ HEALTHPQ <dbl> 489.0, 1179.0, 0.0, 1577.7, 100.0, 0.0, 0.0, 0.0, 383...
## $ TOBACCPQ <dbl> 195, 0, 0, 65, 0, 65, 0, 0, 0, 182, 0, 0, 0, 0, 0, 0,...
## $ BLS_URBN <int> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ HLTHINPQ <dbl> 489.0, 975.0, 0.0, 1376.7, 0.0, 0.0, 0.0, 0.0, 3606.0...
## $ FJSSDEDX <int> 5815, 4208, 0, 2601, 0, 13210, 918, 2066, 0, 11295, 0...
## $ FOODPQ   <dbl> 1480, 1820, 1755, 949, 1300, 2860, 4030, 1495, 2103, ...
## $ HIGH_EDU <int> 15, 15, 14, 13, 12, 15, 15, 12, 15, 15, 12, 12, 12, 1...
## $ BEDROOMQ <fctr> 3, 4, 2, 3, 3, 5, 1, 2, 3, 4, 3, 5, 2, 5, 2, 2, 1, 3...
## $ HOUSPQ   <dbl> 4440, 2693, 4578, 2295, 4976, 14935, 2523, 3132, 1062...
## $ SHELTPQ  <dbl> 3075, 1638, 2925, 863, 4153, 9838, 1872, 2400, 3304, ...
## $ PROPTXPQ <dbl> 0, 550, 0, 300, 300, 1750, 0, 0, 925, 1575, 475, 0, 2...
## $ MRPINSPQ <dbl> 0, 0, 0, 413, 100, 243, 0, 0, 588, 0, 0, 0, 0, 0, 643...
## $ RENDWEPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 2400, 0, 0, 0, 1305, 0,...
## $ RNTXRPPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 2400, 0, 0, 0, 1305, 0,...
## $ UTILPQ   <dbl> 1215, 1055, 1151, 966, 468, 1087, 645, 732, 1040, 185...
## $ ELCTRCPQ <int> 278, 150, 641, 186, 391, 255, 105, 0, 207, 373, 258, ...
## $ TELEPHPQ <dbl> 600, 249, 510, 356, 0, 114, 540, 267, 602, 485, 184, ...
## $ WATRPSPQ <int> 165, 266, 0, 242, 77, 508, 0, 0, 189, 788, 136, 0, 11...
## $ HOUSOPPQ <dbl> 150, 0, 502, 326, 300, 1850, 0, 0, 2499, 2457, 153, 1...
## $ HOUSEQPQ <dbl> 0, 0, 0, 140, 55, 2160, 6, 0, 3785, 412, 214, 0, 566,...
## $ FURNTRPQ <int> 0, 0, 0, 140, 0, 2000, 0, 0, 1323, 0, 0, 0, 286, 0, 0...
## $ TRANSPQ  <dbl> 1158, 17549, 600, 2615, 860, 751, 257, 410, 1407, 465...
## $ GASMOPQ  <dbl> 600.00, 900.00, 0.00, 1050.00, 600.00, 120.00, 192.75...
## $ REGION   <int> 4, 1, 4, 2, 3, 4, 3, 1, 4, 2, 3, 1, 2, 2, 1, 4, 3, 3,...

Slicing and Dicing:

We have planned to slice the data region wise and analyse its trend of Alcohol and Tobacco consumption.

What next? :

A geospatial visualisation of the data would be great. But we don’t have adequate information like Latitude and Longitude in the dataset to entertain this requirement. But still we have a generalised value over most of the observations which provides the region of the observation (Eg: North,East, West etc).

We have plans of generating scattered charts against income,education and expenses on alcohol and Tobacco products to identify the presence of any linear relationship in the date.

The things we need to work on:

We are not sure, if geospatial visualisation is possible in R and even if it supports geospatial visualization we are still not sure if our data will fit the requirement.

Machine Learning Techniques:

We are planning to build a multiple linear regression model that would predict the family consumption of Alcohol and Tobacco with other household expense variables and certain other factors such as higher education level and size of the family.