Problem description

A health company wants to understand the needs of its customers better. We want to help them predict two outcomes for given details of any US zipcode:

  1. Predicting revenue that can be generated
  2. Predicting which of the below products would be most popular in each zipcode:

Summary of Insights and Conclusions:

Revenue:

Revenue is linearly dependent on below variables and therefore can be estimated for any zipcode using below formula : \[Revenue = 31740 + 108.1*number_of_establishments + 0.08215*annual_payroll - 0.1162*male_population - 120.5*males_5_to_14_years + 59.21*females_15_to_44_years + 80.02*females_60_over_years - 124.9*males_under_5_years - 83.73*male_median_age \]

Exploratory Data Analysis

Importing Data

First, we will create a master data set containing all variables in one table.

library("car") #contains vif function to check for multicolliearity amongst predictors
library("nnet") #contains multino function for multinomial regression
library("GGally") #Vizualize correlations
library("dplyr") #Data Manipulations
library("ggplot2") #Create Vizualizations
library("tidyr") #Transformation of data
library("leaps") #For best subset selection
library("purrr")
library("knitr")
library("kableExtra")

#Data import
setwd("C:/Users/Vivobook/Documents/Data/clarity")
geo_info <- read.csv("geo_info.csv")
census <- read.csv("census_train.csv", na.strings = "-")
dependent_variables <- read.csv("revenues_products.csv")

#creating a master table with independent and dependent variables in one  
dependent_variables$Geographic.area.name <- as.character(dependent_variables$Geographic.area.name)
geo_info$Geographic.area.name <- as.character(geo_info$Geographic.area.name)
dependent_variables_2 <- left_join(dependent_variables, geo_info, by = c("Geographic.area.name" = "Geographic.area.name"))
master <- left_join(census, dependent_variables_2, by = "Id2")

Observation: In raw data missing numeric values were denoted by “-” placeholder. This has been changed to NA

Renaming and formatting data.

#Renaming columns
names(master) <- c("id", "total_population", "male_population", "female_population", "males_under_5_years", "females_under_5_years", "males_45_to_49_years", "females_45_to_49_years", "males_50_to_54_years", "females_50_to_54_years", "males_55_to_59_years", "females_55_to_59_years", "males_5_to_14_years",  "females_5_to_14_years", "males_15_to_44_years", "females_15_to_44_years", "males_60_over_years",  "females_60_over_years", "male_median_age", "female_median_age", "number_of_establishments",  "number_of_paid_employees", "first_quarter_payroll", "annual_payroll", "Geographic.area.name", "revenue",  "pop_product" )

#Correcting format
master$id <- as.numeric(master$id)
master$males_under_5_years <- as.numeric(master$males_under_5_years)
master$females_under_5_years <- as.numeric(master$females_under_5_years)
master$males_45_to_49_years <- as.numeric(master$males_45_to_49_years)
master$females_45_to_49_years <- as.numeric(master$females_45_to_49_years)
master$males_50_to_54_years <- as.numeric(master$males_50_to_54_years)
master$females_50_to_54_years <- as.numeric(master$females_50_to_54_years)
master$males_55_to_59_years <- as.numeric(master$males_55_to_59_years)
master$females_55_to_59_years <- as.numeric(master$females_55_to_59_years)
master$males_5_to_14_years <- as.numeric(master$males_5_to_14_years)
master$females_5_to_14_years <- as.numeric(master$females_5_to_14_years)
master$males_15_to_44_years <- as.numeric(master$males_15_to_44_years)
master$females_15_to_44_years <- as.numeric(master$females_15_to_44_years)
master$males_60_over_years <- as.numeric(master$males_60_over_years)
master$females_60_over_years <- as.numeric(master$females_60_over_years)
master$male_median_age <- as.numeric(master$male_median_age)
master$female_median_age <- as.numeric(master$female_median_age)

Data Description

Variable Description
Variable Description
id Area Zipcode
males_under_5_years Proportion of males under 5 years of age
males_50_to_54_years Proportion of males between 50 to 54 years
males_5_to_14_years Proportion of males between 5 to 14 years
males_60_over_years Proportion of males over 60 years
number_of_establishments number of establishments in the area
Geographic.area.name Area name along with zipcode
total_population total_population of zipcode
females_under_5_years Area Zipcode
females_50_to_54_years Proportion of females under 5 years of age
females_5_to_14_years Proportion of females between 50 to 54 years
females_60_over_years Proportion of females between 5 to 14 years
number_of_paid_employees Proportion of females over 60 years
revenue Revenue generated. Dependent Variable
male_population Total males population
males_45_to_49_years Proportion of males between 45 and 49 years of age
males_55_to_59_years Proportion of males between 55 and 59 years of age
males_15_to_44_years Proportion of males between 15 and 44 years of age
male_median_age Male median age
first_quarter_payroll Income in first quarter of the year
pop_product Popular product in zipcode. Categorical dependent variable
female_population total female population of area
females_45_to_49_years Proportion of females between 45 and 49 years of age
females_55_to_59_years Proportion of females between 55 and 59 years of age
females_15_to_44_years Proportion of females between 15 and 44 years of age
female_median_age female median age of zipcode
annual_payroll Average annual income of zipcode

Cleaning Data

Checking for duplicates

anyDuplicated(master)
## [1] 0

Observation: No duplicate rows in data

checking for missing values

sapply(master, function(x) {sum(is.na(x))})
##                       id         total_population          male_population 
##                        0                        0                        0 
##        female_population      males_under_5_years    females_under_5_years 
##                        0                       47                       51 
##     males_45_to_49_years   females_45_to_49_years     males_50_to_54_years 
##                       47                       51                       47 
##   females_50_to_54_years     males_55_to_59_years   females_55_to_59_years 
##                       51                       47                       51 
##      males_5_to_14_years    females_5_to_14_years     males_15_to_44_years 
##                       47                       51                       47 
##   females_15_to_44_years      males_60_over_years    females_60_over_years 
##                       51                       47                       51 
##          male_median_age        female_median_age number_of_establishments 
##                       91                       99                        0 
## number_of_paid_employees    first_quarter_payroll           annual_payroll 
##                        0                        0                        0 
##     Geographic.area.name                  revenue              pop_product 
##                        0                        0                        0

Observation: Most of the variables have missing values in them. Lets see how these missing values are distributed across rows:

missing_cols <- NULL
for(i in 1:nrow(master)){
  count1 <- 0
  for(j in 1:length(master)){
    count1 <- count1 + is.na(master[i,j])
  }
  missing_cols[i] <- count1
}
abc<- as.data.frame(cbind(master$id, missing_cols))
names(abc)[1] <- "id"
abc <- abc %>% 
  filter(missing_cols != 0) %>% 
  arrange(-missing_cols)
kable(abc[1:20,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive","bordered")) %>%
  scroll_box(width = "100%", height = "500px")
id missing_cols
28019 16
40363 16
11424 16
13138 16
10153 16
93943 16
20240 16
38132 16
83826 16
99164 16
48397 16
98158 16
10119 16
5740 16
19109 16
93064 16
19736 16
10199 16
23250 16
66115 16

Observations:

  • 110 out of total 6842 rows have at least one missing value column. Of these 56 are mostly NA in all variables
  • The remaining 54 rows have at most 2 missing values. We can impute or regress for these values but since they are insignificant compared to 6842, we will omit them from our analysis
census_data <- anti_join(master, abc, by = "id")
nrow(census_data)
## [1] 6731
  • 6841 - 110 = 6731 rows remaining

Data Exploration - Revenue

Summarizing data:

kable(round(cbind(
  "Min" =  apply(census_data[,-c(1,25,27)], 2, min),
  "Q1" = apply(census_data[,-c(1,25,27)], 2, quantile)[1,],
  "Q2" = apply(census_data[,-c(1,25,27)], 2, quantile)[2,],
  "Mean" = apply(census_data[,-c(1,25,27)], 2, mean),
  "Median" = apply(census_data[,-c(1,25,27)], 2, median),
  "Q3" = apply(census_data[,-c(1,25,27)], 2, quantile)[3,],
  "Q4" = apply(census_data[,-c(1,25,27)], 2, quantile)[4,],
  "Max" = apply(census_data[,-c(1,25,27)], 2, max)
),2))
Min Q1 Q2 Mean Median Q3 Q4 Max
total_population 22.0 22.0 1143.5 10969.77 3797.0 3797.0 15875.0 112709.0
male_population 11.0 11.0 580.5 5398.79 1904.0 1904.0 7807.0 60595.0
female_population 11.0 11.0 560.0 5570.98 1885.0 1885.0 8039.0 52550.0
males_under_5_years 0.0 0.0 3.8 5.87 5.7 5.7 7.5 47.4
females_under_5_years 0.0 0.0 3.6 5.57 5.4 5.4 7.2 47.0
males_45_to_49_years 0.0 0.0 5.1 6.77 6.7 6.7 8.2 55.2
females_45_to_49_years 0.0 0.0 5.1 6.68 6.5 6.5 8.0 50.0
males_50_to_54_years 0.0 0.0 5.7 7.55 7.3 7.3 8.9 52.2
females_50_to_54_years 0.0 0.0 5.8 7.62 7.3 7.3 8.9 83.9
males_55_to_59_years 0.0 0.0 5.5 7.42 7.0 7.0 8.8 72.7
females_55_to_59_years 0.0 0.0 5.6 7.55 7.1 7.1 8.8 80.3
males_5_to_14_years 0.0 0.0 9.9 12.74 12.9 12.9 15.6 57.1
females_5_to_14_years 0.0 0.0 9.3 12.08 12.1 12.1 14.7 62.1
males_15_to_44_years 0.0 0.0 31.6 37.01 36.7 36.7 41.8 100.0
females_15_to_44_years 0.0 0.0 30.4 35.31 35.1 35.1 39.7 100.0
males_60_over_years 0.0 0.0 16.5 22.64 21.0 21.0 26.5 100.0
females_60_over_years 0.0 0.0 19.1 25.19 24.0 24.0 29.4 95.7
male_median_age 6.7 6.7 35.3 40.58 40.1 40.1 45.3 83.9
female_median_age 6.9 6.9 37.2 42.44 42.3 42.3 47.3 84.2
number_of_establishments 3.0 3.0 17.0 261.80 68.0 68.0 346.0 6723.0
number_of_paid_employees 1.0 1.0 103.0 4014.39 655.0 655.0 4434.5 163895.0
first_quarter_payroll 1.0 1.0 735.5 51145.04 5414.0 5414.0 39540.0 8176054.0
annual_payroll 19.0 19.0 3338.0 201949.57 23499.0 23499.0 168815.5 22165514.0
revenue 14098.0 14098.0 32904.0 74537.95 41287.0 41287.0 82411.0 2686772.0

Observations

  • Average revenue is approx. $75,000
  • In population columns, total_population, male and female population have a much higher mean compared to median, indicating a positive skew
  • number_of establishments, number_of paid_employees, first_quarter_payroll and annual_payroll have a very high positive skew indicting presence of outlying values
  • Revenue has a very high positive skew indicating presence of outliers in dependent variables

Vizualizing Data:

census_data[,-c(1, 25, 27)] %>% 
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_density()

Observations

  • Normally distributed features:
    • female_median_age
    • females_15_to_44_years
    • females_5_to_14_years
    • male_median_age
    • male_15_to_15_years
  • All other features are positively skewed

Now, we will look at the distribution of all features against dependent variable, Revenue:

census_data[,-c(1, 25, 27)] %>% 
  gather(-revenue, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = revenue)) +
  geom_point() + 
  geom_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Observation: All variables are follow approximately linear trend with respect to the dependent variable so we will predict revenue using linear regression modeling

Correlation w.r.t revenue:

M <- round(cor(census_data[,-c(1, 25,27)]),1)
ggcorr(M, nbreaks=8, palette='RdGy', label=TRUE, label_color='white')

Observations:

  • The below variables have a high correlation with revenue and with each other:
    • annual_payroll
    • first_quarter_payroll
    • number_of_paid_employees
    • number_of_establishments
    • population(male and female)
  • Revenue are high in areas with higher overall, male or female population, possibly due to bigger potential customer pool
  • With increasing number of establishments, number of employees goes up. Also areas with more number of salaried employees have higher revenue
  • payrolls are higher in areas with more establishments in a zipcode. The revenue is higher in zipcodes with high payroll
  • All age groups except 15-44 years in both male and females have a positive correlation with revenue. This indicates that revenue is higher in areas with a high proportion of middle aged men and women

Modeling Revenue (Regression)

Data Preperation

set.seed(100)
index <- sample(nrow(census_data), nrow(census_data)*0.8)
regression.train <- census_data[index,-c(1,25,27)] 
regression.test <- census_data[-index,-c(1,25,27)] 

Feature Selection

We will use stepwise selection for feature selection. We will use CP as the selection criteria since it tends to favor fewer variables and is less likely to overfit our data with so may predictors having a high correlation.

nullmodel=lm(revenue~1, data=regression.train)
fullmodel=lm(revenue ~  ., data=regression.train)

model.regression <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
             direction = c("forward"), select = "CP", trace = FALSE)
summary(model.regression)
## 
## Call:
## lm(formula = revenue ~ number_of_paid_employees + first_quarter_payroll + 
##     number_of_establishments + annual_payroll + male_population + 
##     males_5_to_14_years + females_15_to_44_years + females_60_over_years + 
##     males_15_to_44_years + males_60_over_years + males_under_5_years + 
##     male_median_age + total_population, data = regression.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102215   -3558     126    3619  130942 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.765e+04  2.666e+03  10.371  < 2e-16 ***
## number_of_paid_employees  4.764e-01  5.002e-02   9.524  < 2e-16 ***
## first_quarter_payroll     4.471e-02  3.325e-03  13.449  < 2e-16 ***
## number_of_establishments  1.068e+02  6.733e-01 158.628  < 2e-16 ***
## annual_payroll            6.723e-02  1.298e-03  51.790  < 2e-16 ***
## male_population          -7.259e-01  2.325e-01  -3.122 0.001803 ** 
## males_5_to_14_years      -1.026e+02  3.111e+01  -3.297 0.000982 ***
## females_15_to_44_years    8.776e+01  1.666e+01   5.269 1.43e-07 ***
## females_60_over_years     6.967e+01  1.654e+01   4.213 2.56e-05 ***
## males_15_to_44_years      4.229e+01  2.539e+01   1.665 0.095899 .  
## males_60_over_years       6.076e+01  2.042e+01   2.976 0.002938 ** 
## males_under_5_years      -1.175e+02  3.844e+01  -3.056 0.002252 ** 
## male_median_age          -7.816e+01  3.394e+01  -2.303 0.021318 *  
## total_population          2.212e-01  1.142e-01   1.937 0.052793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7019 on 5370 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9946 
## F-statistic: 7.56e+04 on 13 and 5370 DF,  p-value: < 2.2e-16

Observations:

  • Out of potential 24 variables, 13 have been shortlisted by stepwise selection
  • The model is statistically significant as per the p-value
  • adj-R2 is 99.46%

Diagnosing Initial Model

par(mfrow = c(2,2))
plot(model.regression)

Observations:

  • Residual Vs Fitted plot: Residuals seems to be independent but the trend line deviates due to outliers
  • Normal Q-Q: residuals are normally distributed
  • Homoscadascacity: In scale-location plot the variance trend changes on approaching outliers so outlier treatment is needed

Let’s have a closer look at outliers:

par(mfrow = c(1,1))
plot(model.regression,4)

Removing Outliers using cooks distance:

A general rule of thumb is that observations with a Cook’s Distance of more than 3 times the mean cooks distance are outliers

cooksd <- cooks.distance(model.regression)
cooks.data <- regression.train[which(cooksd < 3*mean(cooksd)),]

Checking for multicollinearity

vif(model.regression)
## number_of_paid_employees    first_quarter_payroll number_of_establishments 
##                18.445215                51.067082                 8.910991 
##           annual_payroll          male_population      males_5_to_14_years 
##                82.176014               311.389271                 3.081838 
##   females_15_to_44_years    females_60_over_years     males_15_to_44_years 
##                 3.135534                 3.262169                 7.991069 
##      males_60_over_years      males_under_5_years          male_median_age 
##                 5.081531                 2.093824                 8.870817 
##         total_population 
##               311.712791

Observation: Some of predictors have multicollinearity so we will remove them

Corrected Model

model.regression <- lm(revenue ~ number_of_establishments +
                         annual_payroll +
                         male_population + 
                         males_5_to_14_years +
                         females_15_to_44_years +
                         females_60_over_years +
                         males_under_5_years +
                         male_median_age,
                       data = cooks.data)

summary(model.regression)
## 
## Call:
## lm(formula = revenue ~ number_of_establishments + annual_payroll + 
##     male_population + males_5_to_14_years + females_15_to_44_years + 
##     females_60_over_years + males_under_5_years + male_median_age, 
##     data = cooks.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34773  -3562     71   3527  44962 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.174e+04  1.204e+03  26.367  < 2e-16 ***
## number_of_establishments  1.081e+02  5.058e-01 213.726  < 2e-16 ***
## annual_payroll            8.215e-02  3.598e-04 228.313  < 2e-16 ***
## male_population          -1.162e-01  1.999e-02  -5.813 6.49e-09 ***
## males_5_to_14_years      -1.205e+02  1.729e+01  -6.973 3.49e-12 ***
## females_15_to_44_years    5.921e+01  1.279e+01   4.629 3.75e-06 ***
## females_60_over_years     8.002e+01  1.199e+01   6.672 2.78e-11 ***
## males_under_5_years      -1.249e+02  2.479e+01  -5.039 4.83e-07 ***
## male_median_age          -8.373e+01  1.697e+01  -4.934 8.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5758 on 5347 degrees of freedom
## Multiple R-squared:  0.9931, Adjusted R-squared:  0.9931 
## F-statistic: 9.592e+04 on 8 and 5347 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model.regression) 

Observations:

  • As can be observed from Residuals Vs fitted plot, residuals are equally spread around horizontal line without any distinct pattern. This is an indication that we don’t have any non-linear relationships
  • Residuals follow a straight line for the most part on Q-Q plot therefore they are normally distributed
  • The assumption of homoscedasticity also holds true based on scale-location plot
  • As can be observed in the last plot, There are no high leverage points

Checking for overfitting

#predicted values for test data
test.pred.regression <- predict(model.regression, newdata = regression.test)

# Train RMSE
sqrt(mean((model.regression$fitted.values - cooks.data$revenue)^2))
## [1] 5753.413
# Test RMSE
sqrt(mean((test.pred.regression - regression.test$revenue)^2))
## [1] 6298.322

Observations: Since the test and train RMSE are close, the model does not overfit.

Interpretation - Revenue Prediction

summary(model.regression)
## 
## Call:
## lm(formula = revenue ~ number_of_establishments + annual_payroll + 
##     male_population + males_5_to_14_years + females_15_to_44_years + 
##     females_60_over_years + males_under_5_years + male_median_age, 
##     data = cooks.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34773  -3562     71   3527  44962 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.174e+04  1.204e+03  26.367  < 2e-16 ***
## number_of_establishments  1.081e+02  5.058e-01 213.726  < 2e-16 ***
## annual_payroll            8.215e-02  3.598e-04 228.313  < 2e-16 ***
## male_population          -1.162e-01  1.999e-02  -5.813 6.49e-09 ***
## males_5_to_14_years      -1.205e+02  1.729e+01  -6.973 3.49e-12 ***
## females_15_to_44_years    5.921e+01  1.279e+01   4.629 3.75e-06 ***
## females_60_over_years     8.002e+01  1.199e+01   6.672 2.78e-11 ***
## males_under_5_years      -1.249e+02  2.479e+01  -5.039 4.83e-07 ***
## male_median_age          -8.373e+01  1.697e+01  -4.934 8.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5758 on 5347 degrees of freedom
## Multiple R-squared:  0.9931, Adjusted R-squared:  0.9931 
## F-statistic: 9.592e+04 on 8 and 5347 DF,  p-value: < 2.2e-16

Interpretation

  • The F-static measures whether there is a significant relationship between our predictors and response variable on whole. A large F value in this case indicates a strong relationship
  • The outcome revenue predominantly depends on 7 independent variables:
    • number_of_establishments
    • annual_payroll
    • males_5_to_14_years
    • females_15_to_44
    • females_60_over_years
    • males_under_5_years
    • male_median_age
  • The model has an ajusted R-square value of 0.99 which signifies that 99% of all variability in outcome can be explained by the model
  • Each of the independent variable has p-value less than 5% which signifies that it is unlikely to observe relationship between predictors and dependent variable by pure coincidence

Modeling pop_product (Classification)

classification.train <- census_data[index,-c(1,25,26)]
classification.test <- census_data[-index,-c(1,25,26)]

Data Preperation:

multinom.model <- multinom(pop_product~. ,
                           data = classification.train)
step.model <- step(multinom.model, trace = FALSE)
summary(step.model)

Observations: Below variables are selected in our model:

  • total_population
  • male_population
  • female_population
  • males_under_5_years
  • males_45_to_49_years
  • females_45_to_49_years
  • females_55_to_59_years
  • males_15_to_44_years
  • females_15_to_44_years
  • females_60_over_years
  • male_median_age

Diagnostics

Conducting a 2-tailed z test to measure significance of coefficients

z <- summary(step.model)$coefficients/summary(step.model)$standard.errors
(1-pnorm(abs(z),0,1))*2
##            (Intercept) total_population male_population female_population
## life                 0        0.3596849       0.1518196         0.1363326
## retirement           0        0.3346846       0.1945903         0.1749857
##            males_under_5_years males_45_to_49_years females_45_to_49_years
## life                         0                    0                      0
## retirement                   0                    0                      0
##            males_15_to_44_years male_median_age
## life                 0.00000000               0
## retirement           0.01962579               0

Observation: some predictors do have significant effect on odds of either “life” or “retirement” with respect to “college” so we will remove some of them.

Corrected model

## # weights:  24 (14 variable)
## initial  value 5914.928562 
## iter  10 value 2799.583413
## iter  20 value 572.241697
## iter  30 value 338.922701
## iter  40 value 331.651033
## iter  50 value 329.990961
## final  value 329.982646 
## converged
## Call:
## multinom(formula = pop_product ~ female_population + males_under_5_years + 
##     males_45_to_49_years + females_45_to_49_years + males_15_to_44_years + 
##     male_median_age, data = classification.train)
## 
## Coefficients:
##            (Intercept) female_population males_under_5_years
## life        -55.818868      3.791523e-05           -9.066590
## retirement   -7.114253      4.654075e-05           -8.943701
##            males_45_to_49_years females_45_to_49_years
## life                -0.09553038             -0.0718931
## retirement           9.49303753             -0.2191584
##            males_15_to_44_years male_median_age
## life                 3.06368448     -0.09443131
## retirement           0.02681199     -0.10628488
## 
## Std. Errors:
##             (Intercept) female_population males_under_5_years
## life       0.0003247091      1.779576e-05          0.02054906
## retirement 0.0005119870      2.090553e-05          0.03809443
##            males_45_to_49_years females_45_to_49_years
## life                 0.01512256             0.03729657
## retirement           0.01570786             0.02652953
##            males_15_to_44_years male_median_age
## life                 0.01318473      0.01225210
## retirement           0.01523920      0.01047576
## 
## Residual Deviance: 659.9653 
## AIC: 687.9653

Observations:

  • The residual deviance has reduced from 718 to 687 in the corrected model
  • Only significant predictors remain in the model

Measuring model effectiveness

Confusion Matrix

#Confusion matrix for training data:
table(classification.train$pop_product, predict(step.model))
##             
##              college life retirement
##   college       1758   15         18
##   life             5 1666          8
##   retirement      20    8       1886
#Confusion matrix for training data:
table(classification.test$pop_product, predict(step.model, newdata = classification.test))
##             
##              college life retirement
##   college        446    1          2
##   life             2  428          0
##   retirement       3    4        461

Observation: As per the conusion matrix, in case of both training and testing data prediction is almost accurate with very few misclassifications

Misclassification Rate

#Missclassification rate of training data
mean(classification.train$pop_product != predict(step.model))
## [1] 0.01374443
#Missclassification rate of testing data
mean(classification.test$pop_product != predict(step.model, newdata = classification.test))
## [1] 0.008908686

Observation The model has a very high accuracy (1- Misclassification rate) of close to 99%

Interpretation

all_data <- census_data[-index,-c(1,25,26)]
abc <- data.frame(female_population = all_data$female_population,
                  males_under_5_years = all_data$males_under_5_years, 
                  males_45_to_49_years = all_data$males_45_to_49_years, 
                  females_45_to_49_years = all_data$females_45_to_49_years, 
                  males_15_to_44_years = all_data$males_15_to_44_years, 
                  male_median_age = all_data$male_median_age, 
           predict(step.model, newdata = all_data, type = "probs"))

abc %>% 
  gather(c(college, life, retirement), key = "pop_product", value = "prob") %>% 
  gather(-c(pop_product,prob), key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = prob, color = pop_product)) + facet_wrap(~var, scales = "free") + 
  geom_smooth(se = FALSE) + geom_point() + theme_minimal()

summary(step.model)
## Call:
## multinom(formula = pop_product ~ female_population + males_under_5_years + 
##     males_45_to_49_years + females_45_to_49_years + males_15_to_44_years + 
##     male_median_age, data = classification.train)
## 
## Coefficients:
##            (Intercept) female_population males_under_5_years
## life        -55.818868      3.791523e-05           -9.066590
## retirement   -7.114253      4.654075e-05           -8.943701
##            males_45_to_49_years females_45_to_49_years
## life                -0.09553038             -0.0718931
## retirement           9.49303753             -0.2191584
##            males_15_to_44_years male_median_age
## life                 3.06368448     -0.09443131
## retirement           0.02681199     -0.10628488
## 
## Std. Errors:
##             (Intercept) female_population males_under_5_years
## life       0.0003247091      1.779576e-05          0.02054906
## retirement 0.0005119870      2.090553e-05          0.03809443
##            males_45_to_49_years females_45_to_49_years
## life                 0.01512256             0.03729657
## retirement           0.01570786             0.02652953
##            males_15_to_44_years male_median_age
## life                 0.01318473      0.01225210
## retirement           0.01523920      0.01047576
## 
## Residual Deviance: 659.9653 
## AIC: 687.9653

Observations:

  • The odds of life plan w.r.t. college increase with increasing values of:
    • female_population
    • proportion of males between 15 to 44 years of age
  • The odds of life plan w.r.t. college decrease with increasing values of:
    • proportion of males under 5 years of age
    • proportion of males between 45 to 49 years of age
    • proportion of females between 45 to 49 years of age
    • Male median age
  • The odds of retirement plan w.r.t. college increase with increasing values of:
    • female_population
    • proportion of males between 45 to 49 years of age
    • proportion of males between 15 to 44 years of age
  • The odds of retirement plan w.r.t. college decrease with increasing values of:
    • proportion of males under 5 years of age
    • proportion of females between 45 to 49 years of age
    • Male median age