Problem description

A health company wants to understand the needs of its customers better. We want to help them predict two outcomes for given details of any US zipcode:

Predicting revenue that can be generated
Predicting which of the below products would be most popular in each zipcode:

College
Life
Retirement

Summary of Insights and Conclusions:

Revenue:

Revenue is linearly dependent on below variables and therefore can be estimated for any zipcode using below formula : \[Revenue = 31740 + 108.1*number_of_establishments + 0.08215*annual_payroll - 0.1162*male_population - 120.5*males_5_to_14_years + 59.21*females_15_to_44_years + 80.02*females_60_over_years - 124.9*males_under_5_years - 83.73*male_median_age \]

Popular Product:

Using multinomial logistic regression, Popular product in any given zipcode can be predicted with an accuracy of 99% using below 6 predictors out of the 25 potential predictors:

female_population
males_under_5_years
males_45_to_49_years
females_45_to_49_years
males_15_to_44_years
male_median_age

Odds of popularity of different products can be calculated using formula:

\[ Log(P(Life)/P(College)) = -55.818868 + female_population(0.00003791523) - males_under_5_years(9.06659) - males_45_to_49_years(0.09553038) - females_45_to_49_years(0.0718931) + males_15_to_44_years(3.06368448) -0.09443131(male_median_age) \]

\[ Log (P(Retirement)/P(College)) = -7.114253 + female_population(0.00004654075) - males_under_5_years(8.943701) + males_45_to_49_years(9.49303753) - females_45_to_49_years(0.2191584) + males_15_to_44_years(0.02681199) - male_median_age(0.10628488) \]

The odds of life plan w.r.t. college increase with increasing values of:
- female_population
- proportion of males between 15 to 44 years of age
The odds of life plan w.r.t. college decrease with increasing values of:
- proportion of males under 5 years of age
- proportion of males between 45 to 49 years of age
- proportion of females between 45 to 49 years of age
- Male median age
The odds of retirement plan w.r.t. college increase with increasing values of:
- female_population
- proportion of males between 45 to 49 years of age
- proportion of males between 15 to 44 years of age
The odds of retirement plan w.r.t. college decrease with increasing values of:
- proportion of males under 5 years of age
- proportion of females between 45 to 49 years of age
- Male median age

Exploratory Data Analysis

Importing Data

First, we will create a master data set containing all variables in one table.

library("car") #contains vif function to check for multicolliearity amongst predictors
library("nnet") #contains multino function for multinomial regression
library("GGally") #Vizualize correlations
library("dplyr") #Data Manipulations
library("ggplot2") #Create Vizualizations
library("tidyr") #Transformation of data
library("leaps") #For best subset selection
library("purrr")
library("knitr")
library("kableExtra")

#Data import
setwd("C:/Users/Vivobook/Documents/Data/clarity")
geo_info <- read.csv("geo_info.csv")
census <- read.csv("census_train.csv", na.strings = "-")
dependent_variables <- read.csv("revenues_products.csv")

#creating a master table with independent and dependent variables in one  
dependent_variables$Geographic.area.name <- as.character(dependent_variables$Geographic.area.name)
geo_info$Geographic.area.name <- as.character(geo_info$Geographic.area.name)
dependent_variables_2 <- left_join(dependent_variables, geo_info, by = c("Geographic.area.name" = "Geographic.area.name"))
master <- left_join(census, dependent_variables_2, by = "Id2")

Observation: In raw data missing numeric values were denoted by “-” placeholder. This has been changed to NA

Renaming and formatting data.

#Renaming columns
names(master) <- c("id", "total_population", "male_population", "female_population", "males_under_5_years", "females_under_5_years", "males_45_to_49_years", "females_45_to_49_years", "males_50_to_54_years", "females_50_to_54_years", "males_55_to_59_years", "females_55_to_59_years", "males_5_to_14_years",  "females_5_to_14_years", "males_15_to_44_years", "females_15_to_44_years", "males_60_over_years",  "females_60_over_years", "male_median_age", "female_median_age", "number_of_establishments",  "number_of_paid_employees", "first_quarter_payroll", "annual_payroll", "Geographic.area.name", "revenue",  "pop_product" )

#Correcting format
master$id <- as.numeric(master$id)
master$males_under_5_years <- as.numeric(master$males_under_5_years)
master$females_under_5_years <- as.numeric(master$females_under_5_years)
master$males_45_to_49_years <- as.numeric(master$males_45_to_49_years)
master$females_45_to_49_years <- as.numeric(master$females_45_to_49_years)
master$males_50_to_54_years <- as.numeric(master$males_50_to_54_years)
master$females_50_to_54_years <- as.numeric(master$females_50_to_54_years)
master$males_55_to_59_years <- as.numeric(master$males_55_to_59_years)
master$females_55_to_59_years <- as.numeric(master$females_55_to_59_years)
master$males_5_to_14_years <- as.numeric(master$males_5_to_14_years)
master$females_5_to_14_years <- as.numeric(master$females_5_to_14_years)
master$males_15_to_44_years <- as.numeric(master$males_15_to_44_years)
master$females_15_to_44_years <- as.numeric(master$females_15_to_44_years)
master$males_60_over_years <- as.numeric(master$males_60_over_years)
master$females_60_over_years <- as.numeric(master$females_60_over_years)
master$male_median_age <- as.numeric(master$male_median_age)
master$female_median_age <- as.numeric(master$female_median_age)

Data Description

Variable Description
Variable	Description
id	Area Zipcode
males_under_5_years	Proportion of males under 5 years of age
males_50_to_54_years	Proportion of males between 50 to 54 years
males_5_to_14_years	Proportion of males between 5 to 14 years
males_60_over_years	Proportion of males over 60 years
number_of_establishments	number of establishments in the area
Geographic.area.name	Area name along with zipcode
total_population	total_population of zipcode
females_under_5_years	Area Zipcode
females_50_to_54_years	Proportion of females under 5 years of age
females_5_to_14_years	Proportion of females between 50 to 54 years
females_60_over_years	Proportion of females between 5 to 14 years
number_of_paid_employees	Proportion of females over 60 years
revenue	Revenue generated. Dependent Variable
male_population	Total males population
males_45_to_49_years	Proportion of males between 45 and 49 years of age
males_55_to_59_years	Proportion of males between 55 and 59 years of age
males_15_to_44_years	Proportion of males between 15 and 44 years of age
male_median_age	Male median age
first_quarter_payroll	Income in first quarter of the year
pop_product	Popular product in zipcode. Categorical dependent variable
female_population	total female population of area
females_45_to_49_years	Proportion of females between 45 and 49 years of age
females_55_to_59_years	Proportion of females between 55 and 59 years of age
females_15_to_44_years	Proportion of females between 15 and 44 years of age
female_median_age	female median age of zipcode
annual_payroll	Average annual income of zipcode

Cleaning Data

Checking for duplicates

anyDuplicated(master)

## [1] 0

Observation: No duplicate rows in data

checking for missing values

sapply(master, function(x) {sum(is.na(x))})

##                       id         total_population          male_population 
##                        0                        0                        0 
##        female_population      males_under_5_years    females_under_5_years 
##                        0                       47                       51 
##     males_45_to_49_years   females_45_to_49_years     males_50_to_54_years 
##                       47                       51                       47 
##   females_50_to_54_years     males_55_to_59_years   females_55_to_59_years 
##                       51                       47                       51 
##      males_5_to_14_years    females_5_to_14_years     males_15_to_44_years 
##                       47                       51                       47 
##   females_15_to_44_years      males_60_over_years    females_60_over_years 
##                       51                       47                       51 
##          male_median_age        female_median_age number_of_establishments 
##                       91                       99                        0 
## number_of_paid_employees    first_quarter_payroll           annual_payroll 
##                        0                        0                        0 
##     Geographic.area.name                  revenue              pop_product 
##                        0                        0                        0

Observation: Most of the variables have missing values in them. Lets see how these missing values are distributed across rows:

missing_cols <- NULL
for(i in 1:nrow(master)){
  count1 <- 0
  for(j in 1:length(master)){
    count1 <- count1 + is.na(master[i,j])
  }
  missing_cols[i] <- count1
}
abc<- as.data.frame(cbind(master$id, missing_cols))
names(abc)[1] <- "id"
abc <- abc %>% 
  filter(missing_cols != 0) %>% 
  arrange(-missing_cols)
kable(abc[1:20,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive","bordered")) %>%
  scroll_box(width = "100%", height = "500px")

id	missing_cols
28019	16
40363	16
11424	16
13138	16
10153	16
93943	16
20240	16
38132	16
83826	16
99164	16
48397	16
98158	16
10119	16
5740	16
19109	16
93064	16
19736	16
10199	16
23250	16
66115	16

Observations:

110 out of total 6842 rows have at least one missing value column. Of these 56 are mostly NA in all variables
The remaining 54 rows have at most 2 missing values. We can impute or regress for these values but since they are insignificant compared to 6842, we will omit them from our analysis

census_data <- anti_join(master, abc, by = "id")
nrow(census_data)

## [1] 6731

6841 - 110 = 6731 rows remaining

Data Exploration - Revenue

Summarizing data:

kable(round(cbind(
  "Min" =  apply(census_data[,-c(1,25,27)], 2, min),
  "Q1" = apply(census_data[,-c(1,25,27)], 2, quantile)[1,],
  "Q2" = apply(census_data[,-c(1,25,27)], 2, quantile)[2,],
  "Mean" = apply(census_data[,-c(1,25,27)], 2, mean),
  "Median" = apply(census_data[,-c(1,25,27)], 2, median),
  "Q3" = apply(census_data[,-c(1,25,27)], 2, quantile)[3,],
  "Q4" = apply(census_data[,-c(1,25,27)], 2, quantile)[4,],
  "Max" = apply(census_data[,-c(1,25,27)], 2, max)
),2))

	Min	Q1	Q2	Mean	Median	Q3	Q4	Max
total_population	22.0	22.0	1143.5	10969.77	3797.0	3797.0	15875.0	112709.0
male_population	11.0	11.0	580.5	5398.79	1904.0	1904.0	7807.0	60595.0
female_population	11.0	11.0	560.0	5570.98	1885.0	1885.0	8039.0	52550.0
males_under_5_years	0.0	0.0	3.8	5.87	5.7	5.7	7.5	47.4
females_under_5_years	0.0	0.0	3.6	5.57	5.4	5.4	7.2	47.0
males_45_to_49_years	0.0	0.0	5.1	6.77	6.7	6.7	8.2	55.2
females_45_to_49_years	0.0	0.0	5.1	6.68	6.5	6.5	8.0	50.0
males_50_to_54_years	0.0	0.0	5.7	7.55	7.3	7.3	8.9	52.2
females_50_to_54_years	0.0	0.0	5.8	7.62	7.3	7.3	8.9	83.9
males_55_to_59_years	0.0	0.0	5.5	7.42	7.0	7.0	8.8	72.7
females_55_to_59_years	0.0	0.0	5.6	7.55	7.1	7.1	8.8	80.3
males_5_to_14_years	0.0	0.0	9.9	12.74	12.9	12.9	15.6	57.1
females_5_to_14_years	0.0	0.0	9.3	12.08	12.1	12.1	14.7	62.1
males_15_to_44_years	0.0	0.0	31.6	37.01	36.7	36.7	41.8	100.0
females_15_to_44_years	0.0	0.0	30.4	35.31	35.1	35.1	39.7	100.0
males_60_over_years	0.0	0.0	16.5	22.64	21.0	21.0	26.5	100.0
females_60_over_years	0.0	0.0	19.1	25.19	24.0	24.0	29.4	95.7
male_median_age	6.7	6.7	35.3	40.58	40.1	40.1	45.3	83.9
female_median_age	6.9	6.9	37.2	42.44	42.3	42.3	47.3	84.2
number_of_establishments	3.0	3.0	17.0	261.80	68.0	68.0	346.0	6723.0
number_of_paid_employees	1.0	1.0	103.0	4014.39	655.0	655.0	4434.5	163895.0
first_quarter_payroll	1.0	1.0	735.5	51145.04	5414.0	5414.0	39540.0	8176054.0
annual_payroll	19.0	19.0	3338.0	201949.57	23499.0	23499.0	168815.5	22165514.0
revenue	14098.0	14098.0	32904.0	74537.95	41287.0	41287.0	82411.0	2686772.0

Observations

Average revenue is approx. $75,000
In population columns, total_population, male and female population have a much higher mean compared to median, indicating a positive skew
number_of establishments, number_of paid_employees, first_quarter_payroll and annual_payroll have a very high positive skew indicting presence of outlying values
Revenue has a very high positive skew indicating presence of outliers in dependent variables

Vizualizing Data:

census_data[,-c(1, 25, 27)] %>% 
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_density()

Observations

Normally distributed features:
- female_median_age
- females_15_to_44_years
- females_5_to_14_years
- male_median_age
- male_15_to_15_years
All other features are positively skewed

Now, we will look at the distribution of all features against dependent variable, Revenue:

census_data[,-c(1, 25, 27)] %>% 
  gather(-revenue, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = revenue)) +
  geom_point() + 
  geom_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Observation: All variables are follow approximately linear trend with respect to the dependent variable so we will predict revenue using linear regression modeling

Correlation w.r.t revenue:

M <- round(cor(census_data[,-c(1, 25,27)]),1)
ggcorr(M, nbreaks=8, palette='RdGy', label=TRUE, label_color='white')

Observations:

The below variables have a high correlation with revenue and with each other:
- annual_payroll
- first_quarter_payroll
- number_of_paid_employees
- number_of_establishments
- population(male and female)
Revenue are high in areas with higher overall, male or female population, possibly due to bigger potential customer pool
With increasing number of establishments, number of employees goes up. Also areas with more number of salaried employees have higher revenue
payrolls are higher in areas with more establishments in a zipcode. The revenue is higher in zipcodes with high payroll
All age groups except 15-44 years in both male and females have a positive correlation with revenue. This indicates that revenue is higher in areas with a high proportion of middle aged men and women

Data Exploration - popular product

Removing unused levels

census_data$pop_product <- droplevels(census_data$pop_product)

Distribution of pop_product classes

barplot(table(census_data$pop_product))

Observation: Dependent variable is almost equally distributed between the 3 factors.

Density of variables w.r.t. pop_product

census_data[,-c(1,25,26)] %>% 
  gather(-pop_product, key = "var", value = "value") %>% 
ggplot(aes(x = value, fill = pop_product)) + facet_wrap(~var, scales = "free") + geom_density()

Observations:

The following variables have the most discernable difference in densities of popular products:
- males_under_5_years
- males_45_to_49_years
- make_15_to_44_years
- male_median_age
- females_over_60_years
- female_median_age

Modeling Revenue (Regression)

Data Preperation

set.seed(100)
index <- sample(nrow(census_data), nrow(census_data)*0.8)
regression.train <- census_data[index,-c(1,25,27)] 
regression.test <- census_data[-index,-c(1,25,27)]

Feature Selection

We will use stepwise selection for feature selection. We will use CP as the selection criteria since it tends to favor fewer variables and is less likely to overfit our data with so may predictors having a high correlation.

nullmodel=lm(revenue~1, data=regression.train)
fullmodel=lm(revenue ~  ., data=regression.train)

model.regression <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
             direction = c("forward"), select = "CP", trace = FALSE)
summary(model.regression)

## 
## Call:
## lm(formula = revenue ~ number_of_paid_employees + first_quarter_payroll + 
##     number_of_establishments + annual_payroll + male_population + 
##     males_5_to_14_years + females_15_to_44_years + females_60_over_years + 
##     males_15_to_44_years + males_60_over_years + males_under_5_years + 
##     male_median_age + total_population, data = regression.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102215   -3558     126    3619  130942 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.765e+04  2.666e+03  10.371  < 2e-16 ***
## number_of_paid_employees  4.764e-01  5.002e-02   9.524  < 2e-16 ***
## first_quarter_payroll     4.471e-02  3.325e-03  13.449  < 2e-16 ***
## number_of_establishments  1.068e+02  6.733e-01 158.628  < 2e-16 ***
## annual_payroll            6.723e-02  1.298e-03  51.790  < 2e-16 ***
## male_population          -7.259e-01  2.325e-01  -3.122 0.001803 ** 
## males_5_to_14_years      -1.026e+02  3.111e+01  -3.297 0.000982 ***
## females_15_to_44_years    8.776e+01  1.666e+01   5.269 1.43e-07 ***
## females_60_over_years     6.967e+01  1.654e+01   4.213 2.56e-05 ***
## males_15_to_44_years      4.229e+01  2.539e+01   1.665 0.095899 .  
## males_60_over_years       6.076e+01  2.042e+01   2.976 0.002938 ** 
## males_under_5_years      -1.175e+02  3.844e+01  -3.056 0.002252 ** 
## male_median_age          -7.816e+01  3.394e+01  -2.303 0.021318 *  
## total_population          2.212e-01  1.142e-01   1.937 0.052793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7019 on 5370 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9946 
## F-statistic: 7.56e+04 on 13 and 5370 DF,  p-value: < 2.2e-16

Observations:

Out of potential 24 variables, 13 have been shortlisted by stepwise selection
The model is statistically significant as per the p-value
adj-R2 is 99.46%

Diagnosing Initial Model

par(mfrow = c(2,2))
plot(model.regression)

Observations:

Residual Vs Fitted plot: Residuals seems to be independent but the trend line deviates due to outliers
Normal Q-Q: residuals are normally distributed
Homoscadascacity: In scale-location plot the variance trend changes on approaching outliers so outlier treatment is needed

Let’s have a closer look at outliers:

par(mfrow = c(1,1))
plot(model.regression,4)

Removing Outliers using cooks distance:

A general rule of thumb is that observations with a Cook’s Distance of more than 3 times the mean cooks distance are outliers

cooksd <- cooks.distance(model.regression)
cooks.data <- regression.train[which(cooksd < 3*mean(cooksd)),]

Checking for multicollinearity

vif(model.regression)

## number_of_paid_employees    first_quarter_payroll number_of_establishments 
##                18.445215                51.067082                 8.910991 
##           annual_payroll          male_population      males_5_to_14_years 
##                82.176014               311.389271                 3.081838 
##   females_15_to_44_years    females_60_over_years     males_15_to_44_years 
##                 3.135534                 3.262169                 7.991069 
##      males_60_over_years      males_under_5_years          male_median_age 
##                 5.081531                 2.093824                 8.870817 
##         total_population 
##               311.712791

Observation: Some of predictors have multicollinearity so we will remove them

Corrected Model

model.regression <- lm(revenue ~ number_of_establishments +
                         annual_payroll +
                         male_population + 
                         males_5_to_14_years +
                         females_15_to_44_years +
                         females_60_over_years +
                         males_under_5_years +
                         male_median_age,
                       data = cooks.data)

summary(model.regression)

## 
## Call:
## lm(formula = revenue ~ number_of_establishments + annual_payroll + 
##     male_population + males_5_to_14_years + females_15_to_44_years + 
##     females_60_over_years + males_under_5_years + male_median_age, 
##     data = cooks.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34773  -3562     71   3527  44962 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.174e+04  1.204e+03  26.367  < 2e-16 ***
## number_of_establishments  1.081e+02  5.058e-01 213.726  < 2e-16 ***
## annual_payroll            8.215e-02  3.598e-04 228.313  < 2e-16 ***
## male_population          -1.162e-01  1.999e-02  -5.813 6.49e-09 ***
## males_5_to_14_years      -1.205e+02  1.729e+01  -6.973 3.49e-12 ***
## females_15_to_44_years    5.921e+01  1.279e+01   4.629 3.75e-06 ***
## females_60_over_years     8.002e+01  1.199e+01   6.672 2.78e-11 ***
## males_under_5_years      -1.249e+02  2.479e+01  -5.039 4.83e-07 ***
## male_median_age          -8.373e+01  1.697e+01  -4.934 8.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5758 on 5347 degrees of freedom
## Multiple R-squared:  0.9931, Adjusted R-squared:  0.9931 
## F-statistic: 9.592e+04 on 8 and 5347 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))

plot(model.regression)

Observations:

As can be observed from Residuals Vs fitted plot, residuals are equally spread around horizontal line without any distinct pattern. This is an indication that we don’t have any non-linear relationships
Residuals follow a straight line for the most part on Q-Q plot therefore they are normally distributed
The assumption of homoscedasticity also holds true based on scale-location plot
As can be observed in the last plot, There are no high leverage points

Checking for overfitting

#predicted values for test data
test.pred.regression <- predict(model.regression, newdata = regression.test)

# Train RMSE
sqrt(mean((model.regression$fitted.values - cooks.data$revenue)^2))

## [1] 5753.413

# Test RMSE
sqrt(mean((test.pred.regression - regression.test$revenue)^2))

## [1] 6298.322

Observations: Since the test and train RMSE are close, the model does not overfit.

Interpretation - Revenue Prediction

summary(model.regression)

## 
## Call:
## lm(formula = revenue ~ number_of_establishments + annual_payroll + 
##     male_population + males_5_to_14_years + females_15_to_44_years + 
##     females_60_over_years + males_under_5_years + male_median_age, 
##     data = cooks.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34773  -3562     71   3527  44962 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.174e+04  1.204e+03  26.367  < 2e-16 ***
## number_of_establishments  1.081e+02  5.058e-01 213.726  < 2e-16 ***
## annual_payroll            8.215e-02  3.598e-04 228.313  < 2e-16 ***
## male_population          -1.162e-01  1.999e-02  -5.813 6.49e-09 ***
## males_5_to_14_years      -1.205e+02  1.729e+01  -6.973 3.49e-12 ***
## females_15_to_44_years    5.921e+01  1.279e+01   4.629 3.75e-06 ***
## females_60_over_years     8.002e+01  1.199e+01   6.672 2.78e-11 ***
## males_under_5_years      -1.249e+02  2.479e+01  -5.039 4.83e-07 ***
## male_median_age          -8.373e+01  1.697e+01  -4.934 8.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5758 on 5347 degrees of freedom
## Multiple R-squared:  0.9931, Adjusted R-squared:  0.9931 
## F-statistic: 9.592e+04 on 8 and 5347 DF,  p-value: < 2.2e-16

Interpretation

The F-static measures whether there is a significant relationship between our predictors and response variable on whole. A large F value in this case indicates a strong relationship
The outcome revenue predominantly depends on 7 independent variables:
- number_of_establishments
- annual_payroll
- males_5_to_14_years
- females_15_to_44
- females_60_over_years
- males_under_5_years
- male_median_age
The model has an ajusted R-square value of 0.99 which signifies that 99% of all variability in outcome can be explained by the model
Each of the independent variable has p-value less than 5% which signifies that it is unlikely to observe relationship between predictors and dependent variable by pure coincidence

Modeling pop_product (Classification)

classification.train <- census_data[index,-c(1,25,26)]
classification.test <- census_data[-index,-c(1,25,26)]

Data Preperation:

multinom.model <- multinom(pop_product~. ,
                           data = classification.train)
step.model <- step(multinom.model, trace = FALSE)
summary(step.model)

Observations: Below variables are selected in our model:

total_population
male_population
female_population
males_under_5_years
males_45_to_49_years
females_45_to_49_years
females_55_to_59_years
males_15_to_44_years
females_15_to_44_years
females_60_over_years
male_median_age

Diagnostics

Conducting a 2-tailed z test to measure significance of coefficients

z <- summary(step.model)$coefficients/summary(step.model)$standard.errors
(1-pnorm(abs(z),0,1))*2

##            (Intercept) total_population male_population female_population
## life                 0        0.3596849       0.1518196         0.1363326
## retirement           0        0.3346846       0.1945903         0.1749857
##            males_under_5_years males_45_to_49_years females_45_to_49_years
## life                         0                    0                      0
## retirement                   0                    0                      0
##            males_15_to_44_years male_median_age
## life                 0.00000000               0
## retirement           0.01962579               0

Observation: some predictors do have significant effect on odds of either “life” or “retirement” with respect to “college” so we will remove some of them.

Corrected model

## # weights:  24 (14 variable)
## initial  value 5914.928562 
## iter  10 value 2799.583413
## iter  20 value 572.241697
## iter  30 value 338.922701
## iter  40 value 331.651033
## iter  50 value 329.990961
## final  value 329.982646 
## converged

## Call:
## multinom(formula = pop_product ~ female_population + males_under_5_years + 
##     males_45_to_49_years + females_45_to_49_years + males_15_to_44_years + 
##     male_median_age, data = classification.train)
## 
## Coefficients:
##            (Intercept) female_population males_under_5_years
## life        -55.818868      3.791523e-05           -9.066590
## retirement   -7.114253      4.654075e-05           -8.943701
##            males_45_to_49_years females_45_to_49_years
## life                -0.09553038             -0.0718931
## retirement           9.49303753             -0.2191584
##            males_15_to_44_years male_median_age
## life                 3.06368448     -0.09443131
## retirement           0.02681199     -0.10628488
## 
## Std. Errors:
##             (Intercept) female_population males_under_5_years
## life       0.0003247091      1.779576e-05          0.02054906
## retirement 0.0005119870      2.090553e-05          0.03809443
##            males_45_to_49_years females_45_to_49_years
## life                 0.01512256             0.03729657
## retirement           0.01570786             0.02652953
##            males_15_to_44_years male_median_age
## life                 0.01318473      0.01225210
## retirement           0.01523920      0.01047576
## 
## Residual Deviance: 659.9653 
## AIC: 687.9653

Observations:

The residual deviance has reduced from 718 to 687 in the corrected model
Only significant predictors remain in the model

Measuring model effectiveness

Confusion Matrix

#Confusion matrix for training data:
table(classification.train$pop_product, predict(step.model))

##             
##              college life retirement
##   college       1758   15         18
##   life             5 1666          8
##   retirement      20    8       1886

#Confusion matrix for training data:
table(classification.test$pop_product, predict(step.model, newdata = classification.test))

##             
##              college life retirement
##   college        446    1          2
##   life             2  428          0
##   retirement       3    4        461

Observation: As per the conusion matrix, in case of both training and testing data prediction is almost accurate with very few misclassifications

Misclassification Rate

#Missclassification rate of training data
mean(classification.train$pop_product != predict(step.model))

## [1] 0.01374443

#Missclassification rate of testing data
mean(classification.test$pop_product != predict(step.model, newdata = classification.test))

## [1] 0.008908686

Observation The model has a very high accuracy (1- Misclassification rate) of close to 99%

Interpretation

all_data <- census_data[-index,-c(1,25,26)]
abc <- data.frame(female_population = all_data$female_population,
                  males_under_5_years = all_data$males_under_5_years, 
                  males_45_to_49_years = all_data$males_45_to_49_years, 
                  females_45_to_49_years = all_data$females_45_to_49_years, 
                  males_15_to_44_years = all_data$males_15_to_44_years, 
                  male_median_age = all_data$male_median_age, 
           predict(step.model, newdata = all_data, type = "probs"))

abc %>% 
  gather(c(college, life, retirement), key = "pop_product", value = "prob") %>% 
  gather(-c(pop_product,prob), key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = prob, color = pop_product)) + facet_wrap(~var, scales = "free") + 
  geom_smooth(se = FALSE) + geom_point() + theme_minimal()

summary(step.model)

## Call:
## multinom(formula = pop_product ~ female_population + males_under_5_years + 
##     males_45_to_49_years + females_45_to_49_years + males_15_to_44_years + 
##     male_median_age, data = classification.train)
## 
## Coefficients:
##            (Intercept) female_population males_under_5_years
## life        -55.818868      3.791523e-05           -9.066590
## retirement   -7.114253      4.654075e-05           -8.943701
##            males_45_to_49_years females_45_to_49_years
## life                -0.09553038             -0.0718931
## retirement           9.49303753             -0.2191584
##            males_15_to_44_years male_median_age
## life                 3.06368448     -0.09443131
## retirement           0.02681199     -0.10628488
## 
## Std. Errors:
##             (Intercept) female_population males_under_5_years
## life       0.0003247091      1.779576e-05          0.02054906
## retirement 0.0005119870      2.090553e-05          0.03809443
##            males_45_to_49_years females_45_to_49_years
## life                 0.01512256             0.03729657
## retirement           0.01570786             0.02652953
##            males_15_to_44_years male_median_age
## life                 0.01318473      0.01225210
## retirement           0.01523920      0.01047576
## 
## Residual Deviance: 659.9653 
## AIC: 687.9653

Observations:

The odds of life plan w.r.t. college increase with increasing values of:
- female_population
- proportion of males between 15 to 44 years of age
The odds of life plan w.r.t. college decrease with increasing values of:
- proportion of males under 5 years of age
- proportion of males between 45 to 49 years of age
- proportion of females between 45 to 49 years of age
- Male median age
The odds of retirement plan w.r.t. college increase with increasing values of:
- female_population
- proportion of males between 45 to 49 years of age
- proportion of males between 15 to 44 years of age
The odds of retirement plan w.r.t. college decrease with increasing values of:
- proportion of males under 5 years of age
- proportion of females between 45 to 49 years of age
- Male median age

Predicting Revenue and Popular product

Palash Arora

June 20, 2019

Problem description

Summary of Insights and Conclusions:

Revenue:

Popular Product:

Exploratory Data Analysis

Importing Data

Data Description

Cleaning Data

Checking for duplicates

checking for missing values

Data Exploration - Revenue

Summarizing data:

Vizualizing Data:

Now, we will look at the distribution of all features against dependent variable, Revenue:

Correlation w.r.t revenue:

Data Exploration - popular product

Distribution of pop_product classes

Density of variables w.r.t. pop_product

Modeling Revenue (Regression)

Data Preperation

Feature Selection

Diagnosing Initial Model

Removing Outliers using cooks distance:

Checking for multicollinearity

Corrected Model

Checking for overfitting

Interpretation - Revenue Prediction

Modeling pop_product (Classification)

Data Preperation:

Diagnostics

Corrected model

Measuring model effectiveness

Confusion Matrix

Misclassification Rate

Interpretation