Introduction

Problem Description

A financial institution wants to analyze the demographic information of the US population to predict:

  1. Revenue earned by the company for a given ZIP code
  2. The most popular product sold in a given ZIP code:
    • college fund
    • retirement fund
    • life insurance

Data

The sample data given is a portion of the American Community Survey 2015 from census.gov. It contains several fields detailing population, gender, and business metrics grouped by selected ZIP codes throughout the United States.

Importing Data

Libraries Required

require(Rmisc) # draw multiple plots 
require(readxl) # read excel input
require(DT) # display table on HTML page
require(tidyverse) # perform data manipulation
require(kableExtra)  # display table on HTML page
require(ggplot2) # perform data visualization
require(FNN) # KNN
require(GGally) # plot correlation plots
require(car) # check variance inflation factor
require(nnet) # perform multinomial logistic regression
options(scipen = 999) # do not display numbers in scientific format

Reading Data

# reading predictor variable data
data <- read_excel("american_fact_finder.xlsx", sheet = "census_train")

# reading response variable data
data_response <- read_excel("american_fact_finder.xlsx", sheet = "response_variable")

# reading id-geographic mapping data
data_mapping <- read_excel("american_fact_finder.xlsx", sheet = "id_geography_mapping")

# renaming predictor variable data
colnames(data) <- c("Id2", 
                    "tot_population",
                    "tot_male_population",
                    "tot_female_population",
                    "perc_male_under_5yrs",
                    "perc_male_45_49yrs",
                    "perc_male_50_54yrs",
                    "perc_male_55_59yrs",
                    "perc_male_5_14yrs",
                    "perc_male_15_44yrs",
                    "perc_male_over_60yrs",
                    "perc_female_under_5yrs",
                    "perc_female_45_49yrs",
                    "perc_female_50_54yrs",
                    "perc_female_55_59yrs",
                    "perc_female_5_14yrs",
                    "perc_female_15_44yrs",
                    "perc_female_over_60yrs",
                    "male_median_age",
                    "female_median_age",
                    "no_establishments",
                    "paid_employees",
                    "payroll_quarter1",
                    "annual_payroll")

# renaming column names of response variable data
colnames(data_response) <- c("geographic_area_name",
                             "revenue",
                             "popular_product")

# renaming column names of id-geographic mapping data
colnames(data_mapping) <- c("Id2",
                             "geographic_area_name")

Data Dictionary

Predictor Variable Data
Variable Description
Id2 Unique Zipcode ID
tot_population Total Population
tot_male_population Total Male Population
tot_female_population Total Female Population
perc_male_under_5yrs Percentage of Males Under 5 Years of Age
perc_male_45_49yrs Percentage of Males Between 45-49 Years of Age
perc_male_50_54yrs Percentage of Males Between 50-54 Years of Age
perc_male_55_59yrs Percentage of Males Between 55-59 Years of Age
perc_male_5_14yrs Percentage of Males Between 5-14 Years of Age
perc_male_15_44yrs Percentage of Males Between 15-44 Years of Age
perc_male_over_60yrs Percentage of Males Over 60 Years of Age
perc_female_under_5yrs Percentage of Females Under 5 Years of Age
perc_female_45_49yrs Percentage of Females Between 45-49 Years of Age
perc_female_50_54yrs Percentage of Females Between 50-54 Years of Age
perc_female_55_59yrs Percentage of Females Between 55-59 Years of Age
perc_female_5_14yrs Percentage of Females Between 5-14 Years of Age
perc_female_15_44yrs Percentage of Females Between 15-44 Years of Age
perc_female_over_60yrs Percentage of Females Over 60 Years of Age
male_median_age Median Age of Males in Years
female_median_age Median Age of Females in Years
no_establishments Number of Establishments
paid_employees Paid employees for pay period including March 12 (number)
payroll_quarter1 First-quarter payroll ($1,000)
annual_payroll Annual payroll ($1,000)

Response Variable Data
Variable Description
geographic_area_name Geographic Area Name
revenue Revenue
popular_product Popular Product

ID-Geography Mapping Data
Variable Description
Id2 Unique Zipcode ID
geographic_area_name Geographic Area Name

Viewing Data

Predictor Variable Data

# display first 100 rows
kable(head(data, 100)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "100%", height = "500px")
[Scrollable table: first 100 rows of the predictor data, one row per ZIP code across the 24 columns listed in the data dictionary above.]

Response Variable Data

# display first 100 rows
kable(head(data_response, 100)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "100%", height = "500px")
geographic_area_name         revenue popular_product
ZIP 44319 (Akron, OH)         110168 retirement
ZIP 73067 (Ninnekah, OK)       32785 retirement
ZIP 01851 (Lowell, MA)        124838 life
ZIP 67335 (Cherryvale, KS)     48629 college
ZIP 72524 (Cord, AR)           23694 college
[… 95 more rows shown in the original scrollable table]

ID-Geography Mapping Data

# display first 100 rows
kable(head(data_mapping, 100)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "100%", height = "500px")
Id2  geographic_area_name
1001 ZIP 01001 (Agawam, MA)
1002 ZIP 01002 (Amherst, MA)
1003 ZIP 01003 (Amherst, MA)
1005 ZIP 01005 (Barre, MA)
[… 96 more rows shown in the original scrollable table]

Data Cleaning and Exploration

Data Cleaning

Missing Value Table

# all column indices except ID column
index <- 2:ncol(data)

# convert above columns to numeric
data[index] <- lapply(data[index], as.numeric) 

# calculate missing values
na_table <-
  map_dbl(data, function(x) sum(is.na(x))) %>% 
  sort(decreasing = TRUE) %>% 
  data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
datatable(na_table)

Removing rows with 8 or more missing columns

# index of rows having 8 or more missing columns
index <- apply(data, 1, function(x) sum(is.na(x)) >= 8) 

# removing above rows
data <- data[!index,]

# calculate missing values
na_table <-
  map_dbl(data, function(x) sum(is.na(x))) %>% 
  sort(decreasing = TRUE) %>% 
  data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
datatable(na_table)

KNN Imputation of Male and Female Median Ages

###### 1. male_median_age ######

# data with output variable and variables on which knn needs to be trained
tr <- na.omit(select(data, tot_population, tot_male_population, 
                      perc_male_under_5yrs:perc_male_over_60yrs, male_median_age))

# data with variables on which knn needs to be trained
tr1 <- select(tr, tot_population, tot_male_population, 
              perc_male_under_5yrs:perc_male_over_60yrs)

# missing data: data on which knn would provide predictions
te1 <- 
  data %>% 
  filter(is.na(male_median_age)) %>% 
    select(tot_population, tot_male_population, 
          perc_male_under_5yrs:perc_male_over_60yrs)
    

# output variable
output_tr <- select(tr, male_median_age)$male_median_age  

# training data for knn without output variable
train <- as.data.frame(tr1) 

# test data to be imputed using knn
test <- as.data.frame(te1)

# output variable
output_var_train <- as.numeric(output_tr)

# perform k-nearest neighbour with k=20
knn_prediction_male <- knn.reg(train=train, test=test, y=output_var_train, k = 20)

# impute missing values with above algorithm
data[is.na(data$male_median_age),]$male_median_age <- knn_prediction_male$pred

###### 2. female_median_age ######

# data with output variable and variables on which knn needs to be trained
tr <- na.omit(select(data, tot_population, tot_female_population, 
                      perc_female_under_5yrs:perc_female_over_60yrs, female_median_age))

# data with variables on which knn needs to be trained
tr1 <- select(tr, tot_population, tot_female_population, 
              perc_female_under_5yrs:perc_female_over_60yrs)

# missing data: data on which knn would provide predictions
te1 <- 
  data %>% 
  filter(is.na(female_median_age)) %>% 
    select(tot_population, tot_female_population, 
          perc_female_under_5yrs:perc_female_over_60yrs)
    

# output variable
output_tr <- select(tr, female_median_age)$female_median_age  

# training data for knn without output variable
train <- as.data.frame(tr1) 

# test data to be imputed using knn
test <- as.data.frame(te1)

# output variable
output_var_train <- as.numeric(output_tr)

# perform k-nearest neighbour with k=20
knn_prediction_female <- knn.reg(train=train, test=test, y=output_var_train, k = 20)

# impute missing values with above algorithm
data[is.na(data$female_median_age),]$female_median_age <- knn_prediction_female$pred

# calculate missing values
na_table <-
  map_dbl(data, function(x) sum(is.na(x))) %>% 
  sort(decreasing = TRUE) %>% 
  data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
datatable(na_table)
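
Since the male and female blocks above differ only in column names, a small reusable helper would remove the duplication (a sketch; predictors and target are character vectors of the column names created during renaming):

# sketch: generic KNN imputation of a numeric column
impute_median_age <- function(df, predictors, target, k = 20) {
  # rows usable for training: target and all predictors observed
  train_rows <- complete.cases(df[c(predictors, target)])
  # rows to impute: target missing (predictors assumed complete, as above)
  test_rows <- is.na(df[[target]])
  pred <- knn.reg(train = as.data.frame(df[train_rows, predictors]),
                  test  = as.data.frame(df[test_rows, predictors]),
                  y     = df[[target]][train_rows],
                  k     = k)$pred
  df[[target]][test_rows] <- pred
  df
}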

Summary Table

# summary statistics for numerical variables
summary <- data.frame()
for(i in 2:ncol(data))
{
  name = colnames(data)[i]
  min = min(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  mean = mean(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  median = median(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  max = max(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  count = sum(!is.na(data[,i]))
  df = data.frame(name, min, mean=mean, median=median, max, count)
  summary <- rbind(summary, df)
}

# display summary table
datatable(summary)
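
The same table could be produced more compactly with dplyr/tidyr (a sketch; pivot_longer ships with the tidyverse loaded above):

# equivalent summary: one row per variable with min/mean/median/max/count
data %>%
  select(-Id2) %>%
  pivot_longer(everything(), names_to = "name") %>%
  group_by(name) %>%
  summarise(min    = round(min(value, na.rm = TRUE), 3),
            mean   = round(mean(value, na.rm = TRUE), 3),
            median = round(median(value, na.rm = TRUE), 3),
            max    = round(max(value, na.rm = TRUE), 3),
            count  = sum(!is.na(value)))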

Minimum population greater than or equal to 100

# filtering on population
data <-
  data %>% 
    filter(tot_population >=100)

# summary statistics for numerical variables
summary <- data.frame() 
for(i in 2:ncol(data))
{
  name = colnames(data)[i]
  min = min(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  mean = mean(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  median = median(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  max = max(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  count = sum(!is.na(data[,i]))
  df = data.frame(name, min, mean=mean, median=median, max, count)
  summary <- rbind(summary, df)
}

# display summary table
datatable(summary)

Data Exploration

Joining Response and Predictor Variables

# joining response and predictor variables
data <-
  data %>% 
  left_join(data_mapping, by = "Id2") %>% 
  left_join(data_response, by = "geographic_area_name")

# converting popular product to factor
data$popular_product <- as.factor(data$popular_product)

Univariate Analysis-I

The following variables are heavily right-skewed:

  • tot_population
  • tot_male_population
  • tot_female_population
  • no_establishments
  • paid_employees

The following variables are slightly right-skewed:

  • payroll_quarter1
  • annual_payroll
  • revenue

The following variables are approximately normally distributed:

  • male_median_age
  • female_median_age
edaPlot <- function(value, name)
{
  hist <- ggplot(data, aes(x=value)) +
    geom_histogram() +
    ggtitle(name)
  
  hist
}

p1 <- edaPlot(data$tot_population,"tot_population")
p2 <- edaPlot(data$tot_male_population,"tot_male_population")
p3 <- edaPlot(data$tot_female_population,"tot_female_population")
p4 <- edaPlot(data$male_median_age,"male_median_age")
p5 <- edaPlot(data$female_median_age,"female_median_age")
p6 <- edaPlot(data$no_establishments,"no_establishments")
p7 <- edaPlot(data$paid_employees,"paid_employees")
p8 <- edaPlot(data$payroll_quarter1,"payroll_quarter1")
p9 <- edaPlot(data$annual_payroll,"annual_payroll")
p10 <- edaPlot(data$revenue,"revenue")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 2)


Univariate Analysis-II

In most ZIP codes, the largest shares of males are:

  1. Between 15-44 years of age
  2. Over 60 years of age
  3. Between 5-14 years of age
p1 <- edaPlot(data$perc_male_under_5yrs,"perc_male_under_5yrs")
p2 <- edaPlot(data$perc_male_5_14yrs,"perc_male_5_14yrs")
p3 <- edaPlot(data$perc_male_15_44yrs,"perc_male_15_44yrs")
p4 <- edaPlot(data$perc_male_45_49yrs,"perc_male_45_49yrs")
p5 <- edaPlot(data$perc_male_50_54yrs,"perc_male_50_54yrs")
p6 <- edaPlot(data$perc_male_55_59yrs,"perc_male_55_59yrs")
p7 <- edaPlot(data$perc_male_over_60yrs,"perc_male_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)


Univariate Analysis-III

In most ZIP codes, the largest shares of females are:

  1. Between 15-44 years of age
  2. Over 60 years of age
  3. Between 5-14 years of age
p1 <- edaPlot(data$perc_female_under_5yrs,"perc_female_under_5yrs")
p2 <- edaPlot(data$perc_female_5_14yrs,"perc_female_5_14yrs")
p3 <- edaPlot(data$perc_female_15_44yrs,"perc_female_15_44yrs")
p4 <- edaPlot(data$perc_female_45_49yrs,"perc_female_45_49yrs")
p5 <- edaPlot(data$perc_female_50_54yrs,"perc_female_50_54yrs")
p6 <- edaPlot(data$perc_female_55_59yrs,"perc_female_55_59yrs")
p7 <- edaPlot(data$perc_female_over_60yrs,"perc_female_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)


Univariate Analysis-IV

Popular products are roughly equally distributed among college, life, and retirement (a numeric check follows the plot below).

ggplot(data, aes(x=popular_product)) +
  geom_bar()
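
As a quick numeric check of the claim above (a small addition, not part of the original output):

# share of ZIP codes in each popular-product class
prop.table(table(data$popular_product))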


Revenue Bivariate Analysis-I

Note: See the last row only

Revenue has high positive correlation with:

  • annual_payroll
  • payroll_quarter1
  • paid_employees
  • no_establishments
df1 <- select(data, tot_population:tot_female_population
              , male_median_age:annual_payroll, revenue)

pairs(df1)

Note: See the last column only

The above inference is confirmed by the correlation plot below.

ggcorr(df1, label = TRUE, angle = 20)


Revenue Bivariate Analysis-II

Note: See the last row only

There do not appear to be any strong correlations.

df2 <- select(data, perc_male_under_5yrs:perc_male_over_60yrs, revenue)

pairs(df2)

Note: See the last column only

Observations:

  • Revenue is negatively correlated with perc_male_over_60yrs. This makes sense: the older the population, the lower the revenue generated.
  • Revenue is positively correlated with perc_male_15_44yrs. This makes sense: the younger the population, the higher the revenue generated.
ggcorr(df2, label = TRUE, angle = 20)


Revenue Bivariate Analysis-III

Note: See the last row only

There do not appear to be any strong correlations.

df3 <- select(data, perc_female_under_5yrs:perc_female_over_60yrs, revenue)

pairs(df3)

Observations:

  • Revenue is negatively correlated with perc_female_over_60yrs. This makes sense: the older the population, the lower the revenue generated.
  • Revenue is positively correlated with perc_female_15_44yrs. This makes sense: the younger the population, the higher the revenue generated.
ggcorr(df3, label = TRUE, angle = 20)


Popularity Bivariate Analysis-I

Variables that show separation in the distribution of Popular Products:

  • tot_population
  • tot_male_population
  • tot_female_population
  • male_median_age
  • female_median_age
distribution_plot <- function(x, name)
{
  ggplot(data, aes(x=x, fill=popular_product)) +
    geom_histogram(position = 'dodge', aes(y=..density..)) +
    labs(x=name)
}

p1 <- distribution_plot(data$tot_population,"tot_population")
p2 <- distribution_plot(data$tot_male_population,"tot_male_population")
p3 <- distribution_plot(data$tot_female_population,"tot_female_population")
p4 <- distribution_plot(data$male_median_age,"male_median_age")
p5 <- distribution_plot(data$female_median_age,"female_median_age")
p6 <- distribution_plot(data$no_establishments,"no_establishments")
p7 <- distribution_plot(data$paid_employees,"paid_employees")
p8 <- distribution_plot(data$payroll_quarter1,"payroll_quarter1")
p9 <- distribution_plot(data$annual_payroll,"annual_payroll")
p10 <- distribution_plot(data$revenue,"revenue")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 2)


Popularity Bivariate Analysis-II

Variables that show separation in the distribution of Popular Products:

  • perc_male_under_5yrs
  • perc_male_15_44yrs
  • perc_male_45_49yrs
p1 <- distribution_plot(data$perc_male_under_5yrs,"perc_male_under_5yrs")
p2 <- distribution_plot(data$perc_male_5_14yrs,"perc_male_5_14yrs")
p3 <- distribution_plot(data$perc_male_15_44yrs,"perc_male_15_44yrs")
p4 <- distribution_plot(data$perc_male_45_49yrs,"perc_male_45_49yrs")
p5 <- distribution_plot(data$perc_male_50_54yrs,"perc_male_50_54yrs")
p6 <- distribution_plot(data$perc_male_55_59yrs,"perc_male_55_59yrs")
p7 <- distribution_plot(data$perc_male_over_60yrs,"perc_male_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)


Popularity Bivariate Analysis-III

Variables that show separation in the distribution of Popular Products:

  • perc_female_15_44yrs
  • perc_female_45_49yrs
p1 <- distribution_plot(data$perc_female_under_5yrs,"perc_female_under_5yrs")
p2 <- distribution_plot(data$perc_female_5_14yrs,"perc_female_5_14yrs")
p3 <- distribution_plot(data$perc_female_15_44yrs,"perc_female_15_44yrs")
p4 <- distribution_plot(data$perc_female_45_49yrs,"perc_female_45_49yrs")
p5 <- distribution_plot(data$perc_female_50_54yrs,"perc_female_50_54yrs")
p6 <- distribution_plot(data$perc_female_55_59yrs,"perc_female_55_59yrs")
p7 <- distribution_plot(data$perc_female_over_60yrs,"perc_female_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)

Model

Model Building

Dividing data into train and test set

# 80-20 train vs test split
set.seed(1)
index<-sample(nrow(data),nrow(data)*0.8)
train<-data[index,]
test<-data[-index,]

Building the model to predict Revenue

  • The following model has been built using Linear Regression
  • The variable selection has been done using forward selection (BIC)
# Null Model - Regress revenue on the intercept only
nullmodel=lm(revenue~1, data=train)

# Full Model - Regress revenue on all predictor variables
fullmodel=lm(revenue ~  .-Id2-geographic_area_name-popular_product
             , data=train)

# Final Model built using forward variable selection (BIC)
model_revenue <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
                         direction = c("forward"), k=log(nrow(train)))

Model Summary

The adjusted R-squared of the model is ~99%.

summary(model_revenue)
## 
## Call:
## lm(formula = revenue ~ paid_employees + payroll_quarter1 + no_establishments + 
##     annual_payroll + tot_male_population + perc_male_5_14yrs + 
##     perc_female_15_44yrs + perc_female_over_60yrs + perc_male_15_44yrs + 
##     perc_male_over_60yrs, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -106003   -3512     160    3656  129563 
## 
## Coefficients:
##                            Estimate   Std. Error t value
## (Intercept)            21666.572597  1139.057724  19.021
## paid_employees             0.537635     0.049095  10.951
## payroll_quarter1           0.039493     0.003333  11.849
## no_establishments        105.430307     0.669318 157.519
## annual_payroll             0.068797     0.001306  52.694
## tot_male_population       -0.249488     0.022758 -10.963
## perc_male_5_14yrs        -52.203677    22.976294  -2.272
## perc_female_15_44yrs      81.041546    15.869788   5.107
## perc_female_over_60yrs    66.357749    16.549801   4.010
## perc_male_15_44yrs        84.371563    16.939855   4.981
## perc_male_over_60yrs      69.693437    19.040925   3.660
##                                    Pr(>|t|)    
## (Intercept)            < 0.0000000000000002 ***
## paid_employees         < 0.0000000000000002 ***
## payroll_quarter1       < 0.0000000000000002 ***
## no_establishments      < 0.0000000000000002 ***
## annual_payroll         < 0.0000000000000002 ***
## tot_male_population    < 0.0000000000000002 ***
## perc_male_5_14yrs                  0.023122 *  
## perc_female_15_44yrs            0.000000339 ***
## perc_female_over_60yrs          0.000061655 ***
## perc_male_15_44yrs              0.000000654 ***
## perc_male_over_60yrs               0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6940 on 5353 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9946 
## F-statistic: 9.808e+04 on 10 and 5353 DF,  p-value: < 0.00000000000000022

Calculating Mean Square Error on Train and Test Set

The train and test MSEs are comparable, so there appears to be no overfitting.

# prediction on train set
linear_regression_pred_train <- predict(model_revenue)

# prediction on test set
linear_regression_pred_test <- predict(model_revenue,
                                      newdata=test)

# train MSE
linear_regression_train_mse <- mean((linear_regression_pred_train-train$revenue)^2)

# test MSE
linear_regression_test_mse <- mean((linear_regression_pred_test-test$revenue)^2)

cat("Train MSE:", linear_regression_train_mse, "| Test MSE:", linear_regression_test_mse)
## Train MSE: 48065318 | Test MSE: 45795763

Building the model to predict the popularity of product

  • The following model has been built using Multinomial Logistic Regression
  • The variable selection has been done using stepwise selection (AIC)
# Multinomial Logistic Regression
model_pop_product <- multinom(popular_product ~ 
                              male_median_age
                              +female_median_age
                              +tot_population
                              +tot_male_population
                              +tot_female_population
                              +perc_male_under_5yrs
                              +perc_male_15_44yrs
                              +perc_male_45_49yrs
                              +perc_male_over_60yrs
                              +perc_female_under_5yrs
                              +perc_female_15_44yrs
                              +perc_female_45_49yrs
                              +perc_female_over_60yrs
                              , select(train, -geographic_area_name, -Id2, -revenue), maxit=200)

# performing stepwise variable selection (AIC)
model_pop_product_step <- step(model_pop_product)
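
Note that step() called without a scope argument performs stepwise elimination from the full model. A literal forward selection from the null model, mirroring the BIC approach used for revenue, would look like this (a sketch, not run in this report):

# hypothetical forward-selection variant
null_pop <- multinom(popular_product ~ 1,
                     data = select(train, -geographic_area_name, -Id2, -revenue),
                     maxit = 200)
model_pop_product_fwd <- step(null_pop,
                              scope = list(lower = null_pop,
                                           upper = model_pop_product),
                              direction = "forward")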

Model Summary

summary(model_pop_product_step)
## Call:
## multinom(formula = popular_product ~ male_median_age + tot_population + 
##     tot_male_population + tot_female_population + perc_male_under_5yrs + 
##     perc_male_15_44yrs + perc_male_45_49yrs + perc_female_15_44yrs + 
##     perc_female_45_49yrs + perc_female_over_60yrs, data = select(train, 
##     -geographic_area_name, -Id2, -revenue), maxit = 200)
## 
## Coefficients:
##            (Intercept) male_median_age tot_population tot_male_population
## life         -63.96765      -0.1192173 0.000007577717       -0.0002616518
## retirement   -10.82925      -0.1518854 0.000009655579       -0.0003251257
##            tot_female_population perc_male_under_5yrs perc_male_15_44yrs
## life                0.0002682052            -9.712368         3.24700966
## retirement          0.0003353229            -9.637377         0.05265874
##            perc_male_45_49yrs perc_female_15_44yrs perc_female_45_49yrs
## life               -0.0580214           0.08844885           0.04047339
## retirement         10.0168383           0.07618278          -0.11984784
##            perc_female_over_60yrs
## life                   0.08778008
## retirement             0.07599648
## 
## Std. Errors:
##              (Intercept) male_median_age tot_population
## life       0.00007243927      0.01532116 0.000007927697
## retirement 0.00004870983      0.01328432 0.000009170222
##            tot_male_population tot_female_population perc_male_under_5yrs
## life              0.0002276797          0.0002232655          0.004143988
## retirement        0.0002778930          0.0002720334          0.004521807
##            perc_male_15_44yrs perc_male_45_49yrs perc_female_15_44yrs
## life               0.01308191        0.005274154           0.01542477
## retirement         0.01605085        0.002400195           0.01779370
##            perc_female_45_49yrs perc_female_over_60yrs
## life                0.008113711             0.01753665
## retirement          0.004736040             0.01402347
## 
## Residual Deviance: 599.6059 
## AIC: 639.6059

Confusion matrix on train set

Accuracy Rate: 98.84%

conf_matrix_train <-
  table(train$popular_product,predict(model_pop_product))

conf_matrix_train
##             
##              college life retirement
##   college       1777   11         14
##   life             7 1660          7
##   retirement      19    4       1865

Confusion matrix on test set

Accuracy Rate: 97.83%

conf_matrix_test <-
  table(test$popular_product,predict(model_pop_product, newdata = test))

conf_matrix_test
##             
##              college life retirement
##   college        415    1          9
##   life             3  424          0
##   retirement      11    5        474
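
The accuracy rates quoted above are the share of diagonal (correctly classified) entries in each confusion matrix:

# overall accuracy = correctly classified / total observations
sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
sum(diag(conf_matrix_test)) / sum(conf_matrix_test)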

Model Diagnostics

  1. Errors are normally distributed with mean=0

The Q-Q plot suggests the errors are roughly normally distributed; however, both tails deviate heavily because of outliers.

# Constructing a dataframe containing model attributes
model_attributes1 <-
  data.frame(index=1:nrow(train),
             residuals = model_revenue$residuals, 
             fitted_values = model_revenue$fitted.values)

# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')


  2. Uncorrelated Errors

There is no apparent pattern in the errors over the index, so we can reasonably assume the errors are uncorrelated.

# Plotting Residuals over Time
model_attributes1 %>%
  ggplot(aes(x=index,y=residuals)) +
  geom_point()


  3. Constant Variance

The residuals show roughly constant variance across the majority of fitted values. However, for very large fitted values the variance fans out; removing the outliers should remove this fan-shaped pattern as well.

# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = 3, color = "blue") +
  geom_hline(yintercept = -3, color = "blue")


  4. Predictor Variables are independent of each other

The Variance Inflation Factor is high for a few variables, namely annual_payroll, payroll_quarter1, and paid_employees. We will need to remove some of these to reduce multicollinearity.

vif(model_revenue)
##         paid_employees       payroll_quarter1      no_establishments 
##              17.593212              52.967413               8.692075 
##         annual_payroll    tot_male_population      perc_male_5_14yrs 
##              84.343784               3.042816               1.653595 
##   perc_female_15_44yrs perc_female_over_60yrs     perc_male_15_44yrs 
##               2.944912               3.355291               3.635962 
##   perc_male_over_60yrs 
##               4.494583

  5. No influential outliers

Almost all standardized errors are below 5 in absolute value. However, a few observations exceed this threshold and can be considered influential outliers; ideally, these should be removed.

# Plotting Studentized/Standardized Errors
rstan <- rstandard(model_revenue)  
plot(rstan)

Corrected Model

Removing Influential Outliers

# calculating cooks distance
cooksd <- cooks.distance(model_revenue)

# influential outliers row numbers
influential <- as.numeric(names(cooksd)[(cooksd > mean(cooksd, na.rm=T))]) 

# remove influential outliers
train <- train[-influential,]
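
The mean of Cook's distance is used as the cutoff here; a common alternative is the 4/n rule. A quick sketch to visualize both thresholds:

# Cook's distance with the mean cutoff used above (red) and 4/n (blue)
plot(cooksd, ylab = "Cook's distance")
abline(h = mean(cooksd, na.rm = TRUE), col = "red")
abline(h = 4 / length(cooksd), col = "blue")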

Building the model to predict Revenue

  • The following model has been built using Linear Regression
  • The variable selection has been done using forward selection (BIC)
  • annual_payroll and paid_employees are removed to reduce multicollinearity
# Null Model - Regress revenue on the intercept only
nullmodel=lm(revenue~1, data=train)

# Full Model - Regress revenue on all predictor variables except annual_payroll and paid_employees
fullmodel=lm(revenue ~  .-Id2-geographic_area_name-popular_product-annual_payroll-paid_employees
             , data=train)

# Final Model built using forward variable selection (BIC)
model_revenue <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
                         direction = c("forward"), k=log(nrow(train)))

Model Summary

The adjusted R-squared of the model is ~99%.

summary(model_revenue)
## 
## Call:
## lm(formula = revenue ~ no_establishments + payroll_quarter1 + 
##     perc_male_5_14yrs + tot_population + perc_male_15_44yrs + 
##     perc_female_over_60yrs + perc_female_15_44yrs, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56485  -3607    107   3626  34374 
## 
## Coefficients:
##                            Estimate   Std. Error t value
## (Intercept)            24568.791530   835.112612  29.420
## no_establishments        113.863276     0.540313 210.736
## payroll_quarter1           0.306316     0.001617 189.485
## perc_male_5_14yrs        -59.928591    18.092538  -3.312
## tot_population            -0.060523     0.010521  -5.753
## perc_male_15_44yrs        58.524142    12.570366   4.656
## perc_female_over_60yrs    87.225537    12.370295   7.051
## perc_female_15_44yrs      54.477925    14.006055   3.890
##                                    Pr(>|t|)    
## (Intercept)            < 0.0000000000000002 ***
## no_establishments      < 0.0000000000000002 ***
## payroll_quarter1       < 0.0000000000000002 ***
## perc_male_5_14yrs                  0.000931 ***
## tot_population               0.000000009279 ***
## perc_male_15_44yrs           0.000003307055 ***
## perc_female_over_60yrs       0.000000000002 ***
## perc_female_15_44yrs               0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6067 on 5312 degrees of freedom
## Multiple R-squared:  0.9913, Adjusted R-squared:  0.9912 
## F-statistic: 8.605e+04 on 7 and 5312 DF,  p-value: < 0.00000000000000022

Calculating Mean Square Error on Train and Test Set

The test MSE is now an order of magnitude larger than the train MSE. This is partly expected: the influential outliers were removed from the training set only, so the test set still contains extreme observations that inflate its MSE.

# prediction on train set
linear_regression_pred_train <- predict(model_revenue)

# prediction on test set
linear_regression_pred_test <- predict(model_revenue,
                                      newdata=test)

# train MSE
linear_regression_train_mse <- mean((linear_regression_pred_train-train$revenue)^2)

# test MSE
linear_regression_test_mse <- mean((linear_regression_pred_test-test$revenue)^2)

cat("Train MSE:", linear_regression_train_mse, "| Test MSE:", linear_regression_test_mse)
## Train MSE: 36752185 | Test MSE: 321499258

Building the model to predict the popularity of product

  • The following model has been built using Multinomial Logistic Regression
  • The variable selection has been done using stepwise selection (AIC)
# Multinomial Logistic Regression
model_pop_product <- multinom(popular_product ~ 
                              male_median_age
                              +female_median_age
                              +tot_population
                              +tot_male_population
                              +tot_female_population
                              +perc_male_under_5yrs
                              +perc_male_15_44yrs
                              +perc_male_45_49yrs
                              +perc_male_over_60yrs
                              +perc_female_under_5yrs
                              +perc_female_15_44yrs
                              +perc_female_45_49yrs
                              +perc_female_over_60yrs
                              , select(train, -geographic_area_name, -Id2, -revenue), maxit=200)

# performing stepwise variable selection (AIC)
model_pop_product_step <- step(model_pop_product)

Model Summary

summary(model_pop_product_step)
## Call:
## multinom(formula = popular_product ~ male_median_age + tot_population + 
##     tot_male_population + tot_female_population + perc_male_under_5yrs + 
##     perc_male_15_44yrs + perc_male_45_49yrs + perc_female_15_44yrs + 
##     perc_female_45_49yrs + perc_female_over_60yrs, data = select(train, 
##     -geographic_area_name, -Id2, -revenue), maxit = 200)
## 
## Coefficients:
##            (Intercept) male_median_age tot_population tot_male_population
## life         -63.97387      -0.1186000 0.000007263478       -0.0002651740
## retirement   -10.77528      -0.1520542 0.000010230867       -0.0003063879
##            tot_female_population perc_male_under_5yrs perc_male_15_44yrs
## life                0.0002721347            -9.720721          3.2471242
## retirement          0.0003166178            -9.650487          0.0560999
##            perc_male_45_49yrs perc_female_15_44yrs perc_female_45_49yrs
## life              -0.05536281           0.09030834           0.03580886
## retirement         9.99796531           0.07939600          -0.13179365
##            perc_female_over_60yrs
## life                    0.0871428
## retirement              0.0761664
## 
## Std. Errors:
##              (Intercept) male_median_age tot_population
## life       0.00007348353      0.01536158 0.000007966623
## retirement 0.00004912882      0.01327730 0.000009400856
##            tot_male_population tot_female_population perc_male_under_5yrs
## life              0.0002295425          0.0002250732          0.004113861
## retirement        0.0002871352          0.0002809541          0.004432349
##            perc_male_15_44yrs perc_male_45_49yrs perc_female_15_44yrs
## life               0.01312008        0.005267846           0.01544974
## retirement         0.01604569        0.002397547           0.01782996
##            perc_female_45_49yrs perc_female_over_60yrs
## life                0.008229996             0.01756903
## retirement          0.004837625             0.01399654
## 
## Residual Deviance: 596.7282 
## AIC: 636.7282

Confusion matrix on train set

Accuracy Rate: 98.87%

conf_matrix_train <-
  table(train$popular_product,predict(model_pop_product))

conf_matrix_train
##             
##              college life retirement
##   college       1772   11         12
##   life             7 1634          7
##   retirement      18    5       1854

Confusion matrix on test set

Accuracy Rate: 97.76%

conf_matrix_test <-
  table(test$popular_product,predict(model_pop_product, newdata = test))

conf_matrix_test
##             
##              college life retirement
##   college        414    2          9
##   life             3  424          0
##   retirement      11    5        474

Corrected Model Diagnostics

  1. Errors are normally distributed with mean=0

The Q-Q plot suggests the errors are normally distributed.

# Constructing a dataframe containing model attributes
model_attributes1 <-
  data.frame(index=1:nrow(train),
             residuals = model_revenue$residuals, 
             fitted_values = model_revenue$fitted.values)

# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')


  2. Uncorrelated Errors

There is no apparent pattern in the errors over the index, so we can reasonably assume the errors are uncorrelated.

# Plotting Residuals over Time
model_attributes1 %>%
  ggplot(aes(x=index,y=residuals)) +
  geom_point()


  3. Constant Variance

The residuals now show roughly constant variance across the fitted values.

# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = 3, color = "blue") +
  geom_hline(yintercept = -3, color = "blue")


  4. Predictor Variables are independent of each other

The Variance Inflation Factor is below 10 for all variables, so there is no serious multicollinearity.

vif(model_revenue)
##      no_establishments       payroll_quarter1      perc_male_5_14yrs 
##               5.844737               2.785597               1.318786 
##         tot_population     perc_male_15_44yrs perc_female_over_60yrs 
##               3.459487               2.494747               2.396186 
##   perc_female_15_44yrs 
##               2.850943

  5. No influential outliers

Almost all standardized errors are below 5 in absolute value.

# Plotting Studentized/Standardized Errors
rstan <- rstandard(model_revenue)  
plot(rstan)

Model Interpretation

Linear Regression - Predicting Revenue

  • Revenue = output variable

  • ~99% of the variance in the output variable is explained by:
    • no_establishments
    • payroll_quarter1
    • perc_male_5_14yrs
    • tot_population
    • perc_male_15_44yrs
    • perc_female_over_60yrs
    • perc_female_15_44yrs
  • Holding all else constant, a 1 unit increase in no_establishments increases average revenue by $113.86.

  • More concretely: holding all else constant, we are 95% confident that a 1 unit increase in no_establishments increases average revenue by between $112.78 and $114.94 (see the confint() call after this list).

  • All other predictor variables can be interpreted in the same way.

  • The t-tests correspond to the following hypothesis test:
    • H0: Beta = 0
    • HA: Beta !=0
    • For all p-values < 0.05, we reject H0
    • All predictor variables have p-value < 0.05
  • The F-test corresponds to the following hypothesis test:
    • H0: All Beta’s = 0
    • HA: At least one Beta != 0
    • As p-value < 0.05, we reject H0. Thus our model as a whole is significant.
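
The 95% interval quoted above can be reproduced (up to rounding of the critical value) directly from the fitted model:

# 95% confidence interval for the no_establishments coefficient
confint(model_revenue, "no_establishments", level = 0.95)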

Multinomial Logistic Regression

  • Final variables selected are as follows:
    • male_median_age
    • tot_population
    • tot_male_population
    • tot_female_population
    • perc_male_under_5yrs
    • perc_male_15_44yrs
    • perc_male_45_49yrs
    • perc_female_15_44yrs
    • perc_female_45_49yrs
    • perc_female_over_60yrs
  • The interpretation of the model is as follows (the odds ratios are computed after this list):
    • With a 1 unit increase in perc_male_15_44yrs:
      • the odds of pop product = life relative to pop product = college increase by a factor of exp(3.247) ≈ 25.71
      • the odds of pop product = retirement relative to pop product = college increase by a factor of exp(0.056) ≈ 1.06
    • All other variables can be interpreted in the same way
  • AIC, which is a measure of the quality of the model, is 636.72

  • The accuracy rate on the test set, i.e. the percentage of popular products correctly classified, is 97.76%
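
These multiplicative factors are the exponentiated coefficients of the final multinomial model:

# odds ratios relative to the baseline category (college)
exp(coef(model_pop_product_step))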

Insights

Predicting Revenue Generated

  • Revenue generated in each zipcode is predicted using a linear regression model.

  • We obtain a near-perfect in-sample fit (99% adjusted R-squared) for predicting the revenue generated in each zipcode, so we can predict revenue accurately and act to increase or decrease it in selected zip codes.

  • Revenue highly depends upon these factors:
    • no_establishments
    • payroll_quarter1
    • perc_male_5_14yrs
    • tot_population
    • perc_male_15_44yrs
    • perc_female_over_60yrs
    • perc_female_15_44yrs
  • MSE(Train): 36752185 | MSE(Test): 321499258

Future Work

  1. Perform cross-validation to find the optimal k in the k-nearest-neighbour imputation (a sketch follows this list).
  2. Perform cross-validation on both the regression and classification models to verify that the models are not overfitting.
  3. Transform the response and/or predictor variables to better satisfy the assumptions of the linear regression model.
  4. Try other variable selection techniques such as backward, stepwise, and LASSO selection, using criteria such as adjusted R-squared, AIC, BIC, deviance, and p-values.
  5. Use more advanced models like decision trees, random forest, bagging and boosting to increase accuracy.
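
For item 1, a minimal sketch of 5-fold cross-validation for choosing k in the KNN imputation (hypothetical code, reusing the male_median_age setup from the imputation step and ideally run on the originally observed rows; not run in this report):

# sketch: 5-fold cross-validation over candidate k for FNN::knn.reg
set.seed(1)
complete_df <- na.omit(select(data, tot_population, tot_male_population,
                              perc_male_under_5yrs:perc_male_over_60yrs,
                              male_median_age))
folds <- sample(rep(1:5, length.out = nrow(complete_df)))
ks <- c(5, 10, 15, 20, 25, 30)

cv_mse <- sapply(ks, function(k) {
  mean(sapply(1:5, function(f) {
    tr <- as.data.frame(complete_df[folds != f, ])
    te <- as.data.frame(complete_df[folds == f, ])
    pred <- knn.reg(train = select(tr, -male_median_age),
                    test  = select(te, -male_median_age),
                    y     = tr$male_median_age, k = k)$pred
    mean((pred - te$male_median_age)^2)
  }))
})

ks[which.min(cv_mse)]  # candidate optimal k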