|
|
Mid Term Review |
This dataset is taken from Bureau of Labor Statistics (BLS). It contains data about family expenditure on various categories such as Housing, Food, Transportation. The dataset also provides significant outlook on the income and the education level of the samples surveyed. We strongly believe there should be a strong correlation between the expenses on alcohol and tobacco consumption, and income,other expenses and the education level. We would like to verify this using multiple linear regression model over this data set and find the most significant categories. We also felt the dataset being considered had adequate potential to address the objectives we are considering.
We first had to find the relevant dataset which will have potential to address this concern. We went through the explanation sheets of multiple relevant dataset and picked the current data due to its abundance of relevant variables which can provide significant contribution in addressing our concern.
We have decided to extract and merge the data set in such a way that we can compare various expenses and the education level to predict the income of a sample.
We are planning study and understand the variables present in the data and select those variables which are basically relevant for the analysis (removing irrelevant variables like serial number, iteration number, document number etc). Following which we are planning to check the adequacy of the data under each variable, if the variables does not contain a certain threshold amount of responses those variables need to removed from the data being generated for the analysis.
This analysis will provide an outlook for the marketing team in Tobacco and Alcohol industry about the influence of various factors over the monthly expense on alcohol and cigarette of an individual. Using this model, the sales and marketing team can identify and group customers into different baskets for promoting their products.
The following packages where used during the process:
library(dplyr)
library(stringr)
library(doBy)
The following R code connects to the URL. Downloads the zipped content, unzips and extracts the required data into a local variable for further analysis:
LocalFile <- tempfile()
download.file("https://www.bls.gov/cex/pumd/data/comma/intrvw16.zip",LocalFile)
Expenses <- read.csv(unz(LocalFile, "intrvw16/fmli161x.csv"))
unlink(LocalFile)
The Consumer Expenditure Survey (CE) program provides data on the buying habits of American consumers. These data are primarily used as weights for the Consumer Price Index.
The original dataset has 926 variables and the detailed definition of each of the variables are available at the follwing link:
https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx
Among 926 variables, after riguourous discussion and analysis we decided to go ahead with the following subset of variables for our analysis.
We have formed a comprehensive table to describe each variable’s characteristic:
|
FINCBTAX
|
Total amount of family income before taxes. It is a of type Integer. | |
| VEHQ | Total number of owned vehicles. It is of type Integer. | |
| FAM_SIZE | Number of Members in CU SUM OF MEMBERS BY CU_NUM. It is of type Integer. | |
| ALCBEVPQ | Alcoholic beverages last quarter. It is of type numeric. | |
| HEALTHPQ | Health care last quarter. It is of type numeric. | |
| TOBACCPQ | Tobacco and smoking supplies last quarter. It is of type numeric. | |
| BLS_URBN | Urban/Rural. It is of type numeric. | |
| HLTHINPQ | Health insurance last quarter. It is of type numeric. | |
| FOODPQ | Total food expense last quarter. It is of type numeric. | |
| HIGH_EDU | Highest level of education within the CU. It is of type Integer. | |
| IRAX | As of today, what is the total value of all retirement accounts, such as 401(k)s, IRAs. It is of type factor. | |
| BEDROOMQ | Number of bedrooms in CU. It is of type factor. | |
| HOUSPQ | Housing last quarter. It is of type numeric. It is of type numeric. | |
| SHELTPQ | Shelter last quarter. It is of type numeric. It is of type numeric. | |
| PROPTXPQ | Property taxes last quarter. It is of type numeric. | |
| MRPINSPQ | Maintenance, repairs, insurance, and other expenses last quarter. It is of type numeric. | |
| RENDWEPQ | Rented dwelling last quarter | |
| RNTXRPPQ | Rent excluding rent as pay last quarter. It is of type numeric. | |
| UTILPQ | Utilities, fuels and public services last quarter NTLGASPQ + ELCTRCPQ + ALLFULPQ + TELEPHPQ + WATRPSPQ | |
| ELCTRCPQ | Electricity last quarter. It is of type numeric. | |
| TELEPHPQ | Telephone services last quarter. It is of type numeric. | |
| WATRPSPQ | Water and other public services last quarter. It is of type int. | |
| HOUSOPPQ | Household operations last quarter. Â It is of type int. | |
| HOUSEQPQ | House furnishings and equipment last quarter. Â It is of type num. | |
| FURNTRPQ | Furniture last quarter. Â It is of type num. | |
| TRANSPQ | Transportation last quarter. Â It is of type int. | |
| GASMOPQ | Gasoline and motor oil last quarter.  It is of type num. | |
Importing the data into our work environment.
After importing the data,few of the cells were not identified as NA by R. So we manually identified the character and converted them to NA to analyse the dataset.
Expenses_sub[Expenses_sub == '.'] = NA
We felt few of the columns had more NA values than expected. So we decided to removed the observations which had more than 10% of its content as NAs.
which(colMeans(is.na(Expenses_sub)) > 0.10)
Expenses_sub <- select(Expenses_sub,-12,-29,-30)
We removed the observations with NA values.
Expenses_sub <- na.omit(Expenses_sub)
dim(Expenses_sub)
## [1] 6111 28
We created a comprehensive Boxplot of all the variables to analyse the outliers.
boxplot(Expenses_sub,header = TRUE)
We found Varaibles 1,5,13,14,24,25,26 had outliers which needed a fix.
Analysing the first variable with outliers
# Removing outlier for
boxplot(Expenses_sub[,1],labels = TRUE)
Removing the outliers from first column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,1], n = 15),]
boxplot(Expenses_sub[,1],labels = TRUE)
Analysing variable 5 which has outliers
boxplot(Expenses_sub[,5],labels = TRUE)
Removing the outliers from 5th column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,5], n = 3),]
boxplot(Expenses_sub[,5],labels = TRUE)
Analysing variable 13 which has outliers
boxplot(Expenses_sub[,13],labels = TRUE)
Removing the outliers from 13th column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,13], n = 6),]
boxplot(Expenses_sub[,13],labels = TRUE)
Analysing variable 14 which has outliers.
boxplot(Expenses_sub[,14],labels = TRUE)
Removing the outliers from 14th column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,14], n = 8),]
boxplot(Expenses_sub[,14],labels = TRUE)
Analysing variable 24 which has outliers.
boxplot(Expenses_sub[,24],labels = TRUE)
Removing the outliers from 24th column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,24], n = 4),]
boxplot(Expenses_sub[,24],labels = TRUE)
Analysing variable 25 which has outliers.
boxplot(Expenses_sub[,25],labels = TRUE)
Removing the outliers from 25th column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,25], n = 15),]
boxplot(Expenses_sub[,25],labels = TRUE)
Analysing variable 26 which has outliers.
boxplot(Expenses_sub[,26],labels = TRUE)
Removing the outliers from 26th column.
Expenses_sub <- Expenses_sub[-which.maxn(Expenses_sub[,26], n = 5),]
boxplot(Expenses_sub[,26],labels = TRUE)
Final Boxplot after removing all the outlier
#General Box Plot After removing ouliers
boxplot(Expenses_sub,labels = TRUE)
Final look of our cleaned data, ready for our analysis.
glimpse(Expenses_sub)
## Observations: 6,055
## Variables: 28
## $ FINCBTAX <int> 76000, 55002, 5537, 54256, 14686, 185050, 12000, 2700...
## $ VEHQ <int> 2, 2, 0, 3, 0, 3, 1, 1, 2, 1, 2, 1, 2, 0, 1, 1, 0, 3,...
## $ FAM_SIZE <int> 4, 1, 3, 2, 1, 2, 1, 1, 2, 4, 1, 5, 2, 7, 1, 1, 1, 5,...
## $ ALCBEVPQ <dbl> 30, 300, 0, 75, 0, 360, 12, 0, 465, 259, 0, 0, 0, 0, ...
## $ HEALTHPQ <dbl> 489.0, 1179.0, 0.0, 1577.7, 100.0, 0.0, 0.0, 0.0, 383...
## $ TOBACCPQ <dbl> 195, 0, 0, 65, 0, 65, 0, 0, 0, 182, 0, 0, 0, 0, 0, 0,...
## $ BLS_URBN <int> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ HLTHINPQ <dbl> 489.0, 975.0, 0.0, 1376.7, 0.0, 0.0, 0.0, 0.0, 3606.0...
## $ FJSSDEDX <int> 5815, 4208, 0, 2601, 0, 13210, 918, 2066, 0, 11295, 0...
## $ FOODPQ <dbl> 1480, 1820, 1755, 949, 1300, 2860, 4030, 1495, 2103, ...
## $ HIGH_EDU <int> 15, 15, 14, 13, 12, 15, 15, 12, 15, 15, 12, 12, 12, 1...
## $ BEDROOMQ <fctr> 3, 4, 2, 3, 3, 5, 1, 2, 3, 4, 3, 5, 2, 5, 2, 2, 1, 3...
## $ HOUSPQ <dbl> 4440, 2693, 4578, 2295, 4976, 14935, 2523, 3132, 1062...
## $ SHELTPQ <dbl> 3075, 1638, 2925, 863, 4153, 9838, 1872, 2400, 3304, ...
## $ PROPTXPQ <dbl> 0, 550, 0, 300, 300, 1750, 0, 0, 925, 1575, 475, 0, 2...
## $ MRPINSPQ <dbl> 0, 0, 0, 413, 100, 243, 0, 0, 588, 0, 0, 0, 0, 0, 643...
## $ RENDWEPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 2400, 0, 0, 0, 1305, 0,...
## $ RNTXRPPQ <dbl> 3075, 0, 2925, 0, 0, 0, 1872, 2400, 0, 0, 0, 1305, 0,...
## $ UTILPQ <dbl> 1215, 1055, 1151, 966, 468, 1087, 645, 732, 1040, 185...
## $ ELCTRCPQ <int> 278, 150, 641, 186, 391, 255, 105, 0, 207, 373, 258, ...
## $ TELEPHPQ <dbl> 600, 249, 510, 356, 0, 114, 540, 267, 602, 485, 184, ...
## $ WATRPSPQ <int> 165, 266, 0, 242, 77, 508, 0, 0, 189, 788, 136, 0, 11...
## $ HOUSOPPQ <dbl> 150, 0, 502, 326, 300, 1850, 0, 0, 2499, 2457, 153, 1...
## $ HOUSEQPQ <dbl> 0, 0, 0, 140, 55, 2160, 6, 0, 3785, 412, 214, 0, 566,...
## $ FURNTRPQ <int> 0, 0, 0, 140, 0, 2000, 0, 0, 1323, 0, 0, 0, 286, 0, 0...
## $ TRANSPQ <dbl> 1158, 17549, 600, 2615, 860, 751, 257, 410, 1407, 465...
## $ GASMOPQ <dbl> 600.00, 900.00, 0.00, 1050.00, 600.00, 120.00, 192.75...
## $ REGION <int> 4, 1, 4, 2, 3, 4, 3, 1, 4, 2, 3, 1, 2, 2, 1, 4, 3, 3,...
We have planned to slice the data region wise and analyse its trend of Alcohol and Tobacco consumption.
A geospatial visualisation of the data would be great. But we don’t have adequate information like Latitude and Longitude in the dataset to entertain this requirement. But still we have a generalised value over most of the observations which provides the region of the observation (Eg: North,East, West etc).
We have plans of generating scattered charts against income,education and expenses on alcohol and Tobacco products to identify the presence of any linear relationship in the date.
We are not sure, if geospatial visualisation is possible in R and even if it supports geospatial visualization we are still not sure if our data will fit the requirement.
We are planning to build a multiple linear regression model that would predict the family consumption of Alcohol and Tobacco with other household expense variables and certain other factors such as higher education level and size of the family.