Introduction

Problem Description

A financial institution wants to analyze the demographic information of the US population to predict:

  1. Revenue earned by the company for a given ZIP code
  2. The most popular product sold in a given ZIP code:
    • college fund
    • retirement fund
    • life insurance

Data

The sample data given is a portion of the American Community Survey 2015 from census.gov. It contains several fields detailing population, gender, and business metrics grouped by selected ZIP codes throughout the United States.

Importing Data

Libraries Required

require(Rmisc) # draw multiple plots 
require(readxl) # read excel input
require(DT) # display table on HTML page
require(tidyverse) # perform data manipulation
require(kableExtra)  # display table on HTML page
require(ggplot2) # perform data visualization
require(FNN) # KNN
require(GGally) # plot correlation plots
require(car) # check variance inflation factor
require(nnet) # perform multinomial logistic regression
options(scipen = 999) # do not display numbers in scientific format

Reading Data

# reading predictor variable data
data <- read_excel("american_fact_finder.xlsx", sheet = "census_train")

# reading response variable data
data_response <- read_excel("american_fact_finder.xlsx", sheet = "response_variable")

# reading id-geographic mapping data
data_mapping <- read_excel("american_fact_finder.xlsx", sheet = "id_geography_mapping")

# renaming predictor variable data
colnames(data) <- c("Id2", 
                    "tot_population",
                    "tot_male_population",
                    "tot_female_population",
                    "perc_male_under_5yrs",
                    "perc_male_45_49yrs",
                    "perc_male_50_54yrs",
                    "perc_male_55_59yrs",
                    "perc_male_5_14yrs",
                    "perc_male_15_44yrs",
                    "perc_male_over_60yrs",
                    "perc_female_under_5yrs",
                    "perc_female_45_49yrs",
                    "perc_female_50_54yrs",
                    "perc_female_55_59yrs",
                    "perc_female_5_14yrs",
                    "perc_female_15_44yrs",
                    "perc_female_over_60yrs",
                    "male_median_age",
                    "female_median_age",
                    "no_establishments",
                    "paid_employees",
                    "payroll_quarter1",
                    "annual_payroll")

# renaming column names of response variable data
colnames(data_response) <- c("geographic_area_name",
                             "revenue",
                             "popular_product")

# renaming column names of id-geographic mapping data
colnames(data_mapping) <- c("Id2",
                             "geographic_area_name")

Data Dictionary

Predictor Variable Data
Variable Description
Id2 Unique Zipcode ID
tot_population Total Population
tot_male_population Total Male Population
tot_female_population Total Female Population
perc_male_under_5yrs Percentage of Males Under 5 Years of Age
perc_male_45_49yrs Percentage of Males Between 45-49 Years of Age
perc_male_50_54yrs Percentage of Males Between 50-54 Years of Age
perc_male_55_59yrs Percentage of Males Between 55-59 Years of Age
perc_male_5_14yrs Percentage of Males Between 5-14 Years of Age
perc_male_15_44yrs Percentage of Males Between 15-44 Years of Age
perc_male_over_60yrs Percentage of Males Over 60 Years of Age
perc_female_under_5yrs Percentage of Females Under 5 Years of Age
perc_female_45_49yrs Percentage of Females Between 45-49 Years of Age
perc_female_50_54yrs Percentage of Females Between 50-54 Years of Age
perc_female_55_59yrs Percentage of Females Between 55-59 Years of Age
perc_female_5_14yrs Percentage of Females Between 5-14 Years of Age
perc_female_15_44yrs Percentage of Females Between 15-44 Years of Age
perc_female_over_60yrs Percentage of Females Over 60 Years of Age
male_median_age Median Age of Males in Years
female_median_age Median Age of Females in Years
no_establishments Number of Establishments
paid_employees Paid employees for pay period including March 12 (number)
payroll_quarter1 First-quarter payroll ($1,000)
annual_payroll Annual payroll ($1,000)

Response Variable Data
Variable Description
geographic_area_name Geographic Area Name
revenue Revenue
popular_product Popular Product

ID-Geography Mapping Data
Variable Description
Id2 Unique Zipcode ID
geographic_area_name Geographic Area Name

Viewing Data

Predictor Variable Data

# display first 100 rows
kable(head(data, 100)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "100%", height = "500px")
[Scrollable table: first 100 rows of the predictor data, one row per ZIP code across the 24 columns listed in the data dictionary above.]

Response Variable Data

# display first 100 rows
kable(head(data_response, 100)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "100%", height = "500px")
geographic_area_name         revenue popular_product
ZIP 44319 (Akron, OH)         110168 retirement
ZIP 73067 (Ninnekah, OK)       32785 retirement
ZIP 01851 (Lowell, MA)        124838 life
ZIP 67335 (Cherryvale, KS)     48629 college
ZIP 72524 (Cord, AR)           23694 college
[… 95 more rows shown in the original scrollable table]

ID-Geography Mapping Data

# display first 100 rows
kable(head(data_mapping, 100)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
  scroll_box(width = "100%", height = "500px")
Id2  geographic_area_name
1001 ZIP 01001 (Agawam, MA)
1002 ZIP 01002 (Amherst, MA)
1003 ZIP 01003 (Amherst, MA)
1005 ZIP 01005 (Barre, MA)
[… 96 more rows shown in the original scrollable table]

Data Cleaning and Exploration

Data Cleaning

Missing Value Table

# all column indices except ID column
index <- 2:ncol(data)

# convert above columns to numeric
data[index] <- lapply(data[index], as.numeric) 

# calculate missing values
na_table <-
  map_dbl(data, function(x) sum(is.na(x))) %>% 
  sort(decreasing = TRUE) %>% 
  data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
datatable(na_table)

Removing rows with 8 or more missing columns

# index of rows having 8 or more missing columns
index <- apply(data, 1, function(x) sum(is.na(x)) >= 8) 

# removing above rows
data <- data[!index,]

# calculate missing values
na_table <-
  map_dbl(data, function(x) sum(is.na(x))) %>% 
  sort(decreasing = TRUE) %>% 
  data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
datatable(na_table)

KNN Imputation of Male and Female Median Ages

###### 1. male_median_age ######

# data with output variable and variables on which knn needs to be trained
tr <- na.omit(select(data, tot_population, tot_male_population, 
                      perc_male_under_5yrs:perc_male_over_60yrs, male_median_age))

# data with variables on which knn needs to be trained
tr1 <- select(tr, tot_population, tot_male_population, 
              perc_male_under_5yrs:perc_male_over_60yrs)

# missing data: data on which knn would provide predictions
te1 <- 
  data %>% 
  filter(is.na(male_median_age)) %>% 
    select(tot_population, tot_male_population, 
          perc_male_under_5yrs:perc_male_over_60yrs)
    

# output variable
output_tr <- select(tr, male_median_age)$male_median_age  

# training data for knn without output variable
train <- as.data.frame(tr1) 

# test data to be imputed using knn
test <- as.data.frame(te1)

# output variable
output_var_train <- as.numeric(output_tr)

# perform k-nearest neighbour with k=20
knn_prediction_male <- knn.reg(train=train, test=test, y=output_var_train, k = 20)

# impute missing values with above algorithm
data[is.na(data$male_median_age),]$male_median_age <- knn_prediction_male$pred

###### 2. female_median_age ######

# data with output variable and variables on which knn needs to be trained
tr <- na.omit(select(data, tot_population, tot_female_population, 
                      perc_female_under_5yrs:perc_female_over_60yrs, female_median_age))

# data with variables on which knn needs to be trained
tr1 <- select(tr, tot_population, tot_female_population, 
              perc_female_under_5yrs:perc_female_over_60yrs)

# missing data: data on which knn would provide predictions
te1 <- 
  data %>% 
  filter(is.na(female_median_age)) %>% 
    select(tot_population, tot_female_population, 
          perc_female_under_5yrs:perc_female_over_60yrs)
    

# output variable
output_tr <- select(tr, female_median_age)$female_median_age  

# training data for knn without output variable
train <- as.data.frame(tr1) 

# test data to be imputed using knn
test <- as.data.frame(te1)

# output variable
output_var_train <- as.numeric(output_tr)

# perform k-nearest neighbour with k=20
knn_prediction_female <- knn.reg(train=train, test=test, y=output_var_train, k = 20)

# impute missing values with above algorithm
data[is.na(data$female_median_age),]$female_median_age <- knn_prediction_female$pred

# calculate missing values
na_table <-
  map_dbl(data, function(x) sum(is.na(x))) %>% 
  sort(decreasing = TRUE) %>% 
  data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
datatable(na_table)
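
Since the male and female blocks above differ only in column names, a small reusable helper would remove the duplication (a sketch; predictors and target are character vectors of the column names created during renaming):

# sketch: generic KNN imputation of a numeric column
impute_median_age <- function(df, predictors, target, k = 20) {
  # rows usable for training: target and all predictors observed
  train_rows <- complete.cases(df[c(predictors, target)])
  # rows to impute: target missing (predictors assumed complete, as above)
  test_rows <- is.na(df[[target]])
  pred <- knn.reg(train = as.data.frame(df[train_rows, predictors]),
                  test  = as.data.frame(df[test_rows, predictors]),
                  y     = df[[target]][train_rows],
                  k     = k)$pred
  df[[target]][test_rows] <- pred
  df
}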

Summary Table

# summary statistics for numerical variables
summary <- data.frame()
for(i in 2:ncol(data))
{
  name = colnames(data)[i]
  min = min(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  mean = mean(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  median = median(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  max = max(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  count = sum(!is.na(data[,i]))
  df = data.frame(name, min, mean=mean, median=median, max, count)
  summary <- rbind(summary, df)
}

# display summary table
datatable(summary)
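
The same table could be produced more compactly with dplyr/tidyr (a sketch; pivot_longer ships with the tidyverse loaded above):

# equivalent summary: one row per variable with min/mean/median/max/count
data %>%
  select(-Id2) %>%
  pivot_longer(everything(), names_to = "name") %>%
  group_by(name) %>%
  summarise(min    = round(min(value, na.rm = TRUE), 3),
            mean   = round(mean(value, na.rm = TRUE), 3),
            median = round(median(value, na.rm = TRUE), 3),
            max    = round(max(value, na.rm = TRUE), 3),
            count  = sum(!is.na(value)))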

Minimum population greater than or equal to 100

# filtering on population
data <-
  data %>% 
    filter(tot_population >=100)

# summary statistics for numerical variables
summary <- data.frame() 
for(i in 2:ncol(data))
{
  name = colnames(data)[i]
  min = min(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  mean = mean(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  median = median(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  max = max(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
  count = sum(!is.na(data[,i]))
  df = data.frame(name, min, mean=mean, median=median, max, count)
  summary <- rbind(summary, df)
}

# display summary table
datatable(summary)

Data Exploration

Joining Response and Predictor Variables

# joining response and predictor variables
data <-
  data %>% 
  left_join(data_mapping, by = "Id2") %>% 
  left_join(data_response, by = "geographic_area_name")

# converting popular product to factor
data$popular_product <- as.factor(data$popular_product)

Univariate Analysis-I

The following variables are heavily right-skewed:

  • tot_population
  • tot_male_population
  • tot_female_population
  • no_establishments
  • paid_employees

The following variables are slightly right-skewed:

  • payroll_quarter1
  • annual_payroll
  • revenue

The following variables are approximately normally distributed:

  • male_median_age
  • female_median_age
edaPlot <- function(value, name)
{
  hist <- ggplot(data, aes(x=value)) +
    geom_histogram() +
    ggtitle(name)
  
  hist
}

p1 <- edaPlot(data$tot_population,"tot_population")
p2 <- edaPlot(data$tot_male_population,"tot_male_population")
p3 <- edaPlot(data$tot_female_population,"tot_female_population")
p4 <- edaPlot(data$male_median_age,"male_median_age")
p5 <- edaPlot(data$female_median_age,"female_median_age")
p6 <- edaPlot(data$no_establishments,"no_establishments")
p7 <- edaPlot(data$paid_employees,"paid_employees")
p8 <- edaPlot(data$payroll_quarter1,"payroll_quarter1")
p9 <- edaPlot(data$annual_payroll,"annual_payroll")
p10 <- edaPlot(data$revenue,"revenue")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 2)


Univariate Analysis-II

In most ZIP codes, the largest shares of males are:

  1. Between 15-44 years of age
  2. Over 60 years of age
  3. Between 5-14 years of age
p1 <- edaPlot(data$perc_male_under_5yrs,"perc_male_under_5yrs")
p2 <- edaPlot(data$perc_male_5_14yrs,"perc_male_5_14yrs")
p3 <- edaPlot(data$perc_male_15_44yrs,"perc_male_15_44yrs")
p4 <- edaPlot(data$perc_male_45_49yrs,"perc_male_45_49yrs")
p5 <- edaPlot(data$perc_male_50_54yrs,"perc_male_50_54yrs")
p6 <- edaPlot(data$perc_male_55_59yrs,"perc_male_55_59yrs")
p7 <- edaPlot(data$perc_male_over_60yrs,"perc_male_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)


Univariate Analysis-III

In most ZIP codes, the largest shares of females are:

  1. Between 15-44 years of age
  2. Over 60 years of age
  3. Between 5-14 years of age
p1 <- edaPlot(data$perc_female_under_5yrs,"perc_female_under_5yrs")
p2 <- edaPlot(data$perc_female_5_14yrs,"perc_female_5_14yrs")
p3 <- edaPlot(data$perc_female_15_44yrs,"perc_female_15_44yrs")
p4 <- edaPlot(data$perc_female_45_49yrs,"perc_female_45_49yrs")
p5 <- edaPlot(data$perc_female_50_54yrs,"perc_female_50_54yrs")
p6 <- edaPlot(data$perc_female_55_59yrs,"perc_female_55_59yrs")
p7 <- edaPlot(data$perc_female_over_60yrs,"perc_female_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)


Univariate Analysis-IV

Popular products are roughly equally distributed among college, life, and retirement (a numeric check follows the plot below).

ggplot(data, aes(x=popular_product)) +
  geom_bar()
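
As a quick numeric check of the claim above (a small addition, not part of the original output):

# share of ZIP codes in each popular-product class
prop.table(table(data$popular_product))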


Revenue Bivariate Analysis-I

Note: See the last row only

Revenue has high positive correlation with:

  • annual_payroll
  • payroll_quarter1
  • paid_employees
  • no_establishments
df1 <- select(data, tot_population:tot_female_population
              , male_median_age:annual_payroll, revenue)

pairs(df1)

Note: See the last column only

The above inference is confirmed by the correlation plot below.

ggcorr(df1, label = TRUE, angle = 20)


Revenue Bivariate Analysis-II

Note: See the last row only

There do not appear to be any strong correlations.

df2 <- select(data, perc_male_under_5yrs:perc_male_over_60yrs, revenue)

pairs(df2)

Note: See the last column only

Observations:

  • Revenue is negatively correlated with perc_male_over_60yrs. This makes sense: the older the population, the lower the revenue generated.
  • Revenue is positively correlated with perc_male_15_44yrs. This makes sense: the younger the population, the higher the revenue generated.
ggcorr(df2, label = TRUE, angle = 20)


Revenue Bivariate Analysis-III

Note: See the last row only

There do not appear to be any strong correlations.

df3 <- select(data, perc_female_under_5yrs:perc_female_over_60yrs, revenue)

pairs(df3)

Observations:

  • Revenue is negatively correlated with perc_female_over_60yrs. This makes sense: the older the population, the lower the revenue generated.
  • Revenue is positively correlated with perc_female_15_44yrs. This makes sense: the younger the population, the higher the revenue generated.
ggcorr(df3, label = TRUE, angle = 20)


Popularity Bivariate Analysis-I

Variables that show separation in the distribution of Popular Products:

  • tot_population
  • tot_male_population
  • tot_female_population
  • male_median_age
  • female_median_age
distribution_plot <- function(x, name)
{
  ggplot(data, aes(x=x, fill=popular_product)) +
    geom_histogram(position = 'dodge', aes(y=..density..)) +
    labs(x=name)
}

p1 <- distribution_plot(data$tot_population,"tot_population")
p2 <- distribution_plot(data$tot_male_population,"tot_male_population")
p3 <- distribution_plot(data$tot_female_population,"tot_female_population")
p4 <- distribution_plot(data$male_median_age,"male_median_age")
p5 <- distribution_plot(data$female_median_age,"female_median_age")
p6 <- distribution_plot(data$no_establishments,"no_establishments")
p7 <- distribution_plot(data$paid_employees,"paid_employees")
p8 <- distribution_plot(data$payroll_quarter1,"payroll_quarter1")
p9 <- distribution_plot(data$annual_payroll,"annual_payroll")
p10 <- distribution_plot(data$revenue,"revenue")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 2)


Popularity Bivariate Analysis-II

Variables that show separation in the distribution of Popular Products:

  • perc_male_under_5yrs
  • perc_male_15_44yrs
  • perc_male_45_49yrs
p1 <- distribution_plot(data$perc_male_under_5yrs,"perc_male_under_5yrs")
p2 <- distribution_plot(data$perc_male_5_14yrs,"perc_male_5_14yrs")
p3 <- distribution_plot(data$perc_male_15_44yrs,"perc_male_15_44yrs")
p4 <- distribution_plot(data$perc_male_45_49yrs,"perc_male_45_49yrs")
p5 <- distribution_plot(data$perc_male_50_54yrs,"perc_male_50_54yrs")
p6 <- distribution_plot(data$perc_male_55_59yrs,"perc_male_55_59yrs")
p7 <- distribution_plot(data$perc_male_over_60yrs,"perc_male_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)


Popularity Bivariate Analysis-III

Variables that show separation in the distribution of Popular Products:

  • perc_female_15_44yrs
  • perc_female_45_49yrs
p1 <- distribution_plot(data$perc_female_under_5yrs,"perc_female_under_5yrs")
p2 <- distribution_plot(data$perc_female_5_14yrs,"perc_female_5_14yrs")
p3 <- distribution_plot(data$perc_female_15_44yrs,"perc_female_15_44yrs")
p4 <- distribution_plot(data$perc_female_45_49yrs,"perc_female_45_49yrs")
p5 <- distribution_plot(data$perc_female_50_54yrs,"perc_female_50_54yrs")
p6 <- distribution_plot(data$perc_female_55_59yrs,"perc_female_55_59yrs")
p7 <- distribution_plot(data$perc_female_over_60yrs,"perc_female_over_60yrs")

Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)

Model

Model Building

Dividing data into train and test set

# 80-20 train vs test split
set.seed(1)
index<-sample(nrow(data),nrow(data)*0.8)
train<-data[index,]
test<-data[-index,]

Building the model to predict Revenue

  • The following model has been built using Linear Regression
  • The variable selection has been done using forward selection (BIC)
# Null Model - Regress revenue on the intercept only
nullmodel=lm(revenue~1, data=train)

# Full Model - Regress revenue on all predictor variables
fullmodel=lm(revenue ~  .-Id2-geographic_area_name-popular_product
             , data=train)

# Final Model built using forward variable selection (BIC)
model_revenue <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
                         direction = c("forward"), k=log(nrow(train)))

Model Summary

The adjusted R-squared of the model is ~99%.

summary(model_revenue)
## 
## Call:
## lm(formula = revenue ~ paid_employees + payroll_quarter1 + no_establishments + 
##     annual_payroll + tot_male_population + perc_male_5_14yrs + 
##     perc_female_15_44yrs + perc_female_over_60yrs + perc_male_15_44yrs + 
##     perc_male_over_60yrs, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -106003   -3512     160    3656  129563 
## 
## Coefficients:
##                            Estimate   Std. Error t value
## (Intercept)            21666.572597  1139.057724  19.021
## paid_employees             0.537635     0.049095  10.951
## payroll_quarter1           0.039493     0.003333  11.849
## no_establishments        105.430307     0.669318 157.519
## annual_payroll             0.068797     0.001306  52.694
## tot_male_population       -0.249488     0.022758 -10.963
## perc_male_5_14yrs        -52.203677    22.976294  -2.272
## perc_female_15_44yrs      81.041546    15.869788   5.107
## perc_female_over_60yrs    66.357749    16.549801   4.010
## perc_male_15_44yrs        84.371563    16.939855   4.981
## perc_male_over_60yrs      69.693437    19.040925   3.660
##                                    Pr(>|t|)    
## (Intercept)            < 0.0000000000000002 ***
## paid_employees         < 0.0000000000000002 ***
## payroll_quarter1       < 0.0000000000000002 ***
## no_establishments      < 0.0000000000000002 ***
## annual_payroll         < 0.0000000000000002 ***
## tot_male_population    < 0.0000000000000002 ***
## perc_male_5_14yrs                  0.023122 *  
## perc_female_15_44yrs            0.000000339 ***
## perc_female_over_60yrs          0.000061655 ***
## perc_male_15_44yrs              0.000000654 ***
## perc_male_over_60yrs               0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6940 on 5353 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9946 
## F-statistic: 9.808e+04 on 10 and 5353 DF,  p-value: < 0.00000000000000022

Calculating Mean Square Error on Train and Test Set

The train and test MSEs are comparable, so there appears to be no overfitting.

# prediction on train set
linear_regression_pred_train <- predict(model_revenue)

# prediction on test set
linear_regression_pred_test <- predict(model_revenue,
                                      newdata=test)

# train MSE
linear_regression_train_mse <- mean((linear_regression_pred_train-train$revenue)^2)

# test MSE
linear_regression_test_mse <- mean((linear_regression_pred_test-test$revenue)^2)

cat("Train MSE:", linear_regression_train_mse, "| Test MSE:", linear_regression_test_mse)
## Train MSE: 48065318 | Test MSE: 45795763

Building the model to predict the popularity of product

  • The following model has been built using Multinomial Logistic Regression
  • The variable selection has been done using stepwise selection (AIC)
# Multinomial Logistic Regression
model_pop_product <- multinom(popular_product ~ 
                              male_median_age
                              +female_median_age
                              +tot_population
                              +tot_male_population
                              +tot_female_population
                              +perc_male_under_5yrs
                              +perc_male_15_44yrs
                              +perc_male_45_49yrs
                              +perc_male_over_60yrs
                              +perc_female_under_5yrs
                              +perc_female_15_44yrs
                              +perc_female_45_49yrs
                              +perc_female_over_60yrs
                              , select(train, -geographic_area_name, -Id2, -revenue), maxit=200)

# performing stepwise variable selection (AIC)
model_pop_product_step <- step(model_pop_product)
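
Note that step() called without a scope argument performs stepwise elimination from the full model. A literal forward selection from the null model, mirroring the BIC approach used for revenue, would look like this (a sketch, not run in this report):

# hypothetical forward-selection variant
null_pop <- multinom(popular_product ~ 1,
                     data = select(train, -geographic_area_name, -Id2, -revenue),
                     maxit = 200)
model_pop_product_fwd <- step(null_pop,
                              scope = list(lower = null_pop,
                                           upper = model_pop_product),
                              direction = "forward")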

Model Summary

summary(model_pop_product_step)
## Call:
## multinom(formula = popular_product ~ male_median_age + tot_population + 
##     tot_male_population + tot_female_population + perc_male_under_5yrs + 
##     perc_male_15_44yrs + perc_male_45_49yrs + perc_female_15_44yrs + 
##     perc_female_45_49yrs + perc_female_over_60yrs, data = select(train, 
##     -geographic_area_name, -Id2, -revenue), maxit = 200)
## 
## Coefficients:
##            (Intercept) male_median_age tot_population tot_male_population
## life         -63.96765      -0.1192173 0.000007577717       -0.0002616518
## retirement   -10.82925      -0.1518854 0.000009655579       -0.0003251257
##            tot_female_population perc_male_under_5yrs perc_male_15_44yrs
## life                0.0002682052            -9.712368         3.24700966
## retirement          0.0003353229            -9.637377         0.05265874
##            perc_male_45_49yrs perc_female_15_44yrs perc_female_45_49yrs
## life               -0.0580214           0.08844885           0.04047339
## retirement         10.0168383           0.07618278          -0.11984784
##            perc_female_over_60yrs
## life                   0.08778008
## retirement             0.07599648
## 
## Std. Errors:
##              (Intercept) male_median_age tot_population
## life       0.00007243927      0.01532116 0.000007927697
## retirement 0.00004870983      0.01328432 0.000009170222
##            tot_male_population tot_female_population perc_male_under_5yrs
## life              0.0002276797          0.0002232655          0.004143988
## retirement        0.0002778930          0.0002720334          0.004521807
##            perc_male_15_44yrs perc_male_45_49yrs perc_female_15_44yrs
## life               0.01308191        0.005274154           0.01542477
## retirement         0.01605085        0.002400195           0.01779370
##            perc_female_45_49yrs perc_female_over_60yrs
## life                0.008113711             0.01753665
## retirement          0.004736040             0.01402347
## 
## Residual Deviance: 599.6059 
## AIC: 639.6059

Confusion matrix on train set

Accuracy Rate: 98.84%

conf_matrix_train <-
  table(train$popular_product,predict(model_pop_product))

conf_matrix_train
##             
##              college life retirement
##   college       1777   11         14
##   life             7 1660          7
##   retirement      19    4       1865

Confusion matrix on test set

Accuracy Rate: 97.83%

conf_matrix_test <-
  table(test$popular_product,predict(model_pop_product, newdata = test))

conf_matrix_test
##             
##              college life retirement
##   college        415    1          9
##   life             3  424          0
##   retirement      11    5        474
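
The accuracy rates quoted above are the share of diagonal (correctly classified) entries in each confusion matrix:

# overall accuracy = correctly classified / total observations
sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
sum(diag(conf_matrix_test)) / sum(conf_matrix_test)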

Model Diagnostics

  1. Errors are normally distributed with mean=0

The Q-Q plot suggests the errors are roughly normally distributed; however, both tails deviate heavily because of outliers.

# Constructing a dataframe containing model attributes
model_attributes1 <-
  data.frame(index=1:nrow(train),
             residuals = model_revenue$residuals, 
             fitted_values = model_revenue$fitted.values)

# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')


  2. Uncorrelated Errors

There is no apparent pattern in the errors over the index, so we can reasonably assume the errors are uncorrelated.

# Plotting Residuals over Time
model_attributes1 %>%
  ggplot(aes(x=index,y=residuals)) +
  geom_point()


  3. Constant Variance

The residuals show roughly constant variance across the majority of fitted values. However, for very large fitted values the variance fans out; removing the outliers should remove this fan-shaped pattern as well.

# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = 3, color = "blue") +
  geom_hline(yintercept = -3, color = "blue")


  4. Predictor Variables are independent of each other

The Variance Inflation Factor is high for a few variables, namely annual_payroll, payroll_quarter1, and paid_employees. We will need to remove some of these to reduce multicollinearity.

vif(model_revenue)
##         paid_employees       payroll_quarter1      no_establishments 
##              17.593212              52.967413               8.692075 
##         annual_payroll    tot_male_population      perc_male_5_14yrs 
##              84.343784               3.042816               1.653595 
##   perc_female_15_44yrs perc_female_over_60yrs     perc_male_15_44yrs 
##               2.944912               3.355291               3.635962 
##   perc_male_over_60yrs 
##               4.494583

  5. No influential outliers

Almost all standardized errors are below 5 in absolute value. However, a few observations exceed this threshold and can be considered influential outliers; ideally, these should be removed.

# Plotting Studentized/Standardized Errors
rstan <- rstandard(model_revenue)  
plot(rstan)

Corrected Model

Removing Influential Outliers

# calculating cooks distance
cooksd <- cooks.distance(model_revenue)

# influential outliers row numbers
influential <- as.numeric(names(cooksd)[(cooksd > mean(cooksd, na.rm=T))]) 

# remove influential outliers
train <- train[-influential,]
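
The mean of Cook's distance is used as the cutoff here; a common alternative is the 4/n rule. A quick sketch to visualize both thresholds:

# Cook's distance with the mean cutoff used above (red) and 4/n (blue)
plot(cooksd, ylab = "Cook's distance")
abline(h = mean(cooksd, na.rm = TRUE), col = "red")
abline(h = 4 / length(cooksd), col = "blue")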

Building the model to predict Revenue

  • The following model has been built using Linear Regression
  • The variable selection has been done using forward selection (BIC)
  • annual_payroll and paid_employees are removed to reduce multicollinearity
# Null Model - Regress revenue on the intercept only
nullmodel=lm(revenue~1, data=train)

# Full Model - Regress revenue on all predictor variables except annual_payroll and paid_employees
fullmodel=lm(revenue ~  .-Id2-geographic_area_name-popular_product-annual_payroll-paid_employees
             , data=train)

# Final Model built using forward variable selection (BIC)
model_revenue <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
                         direction = c("forward"), k=log(nrow(train)))

Model Summary

The adjusted R-squared of the model is ~99%.

summary(model_revenue)
## 
## Call:
## lm(formula = revenue ~ no_establishments + payroll_quarter1 + 
##     perc_male_5_14yrs + tot_population + perc_male_15_44yrs + 
##     perc_female_over_60yrs + perc_female_15_44yrs, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56485  -3607    107   3626  34374 
## 
## Coefficients:
##                            Estimate   Std. Error t value
## (Intercept)            24568.791530   835.112612  29.420
## no_establishments        113.863276     0.540313 210.736
## payroll_quarter1           0.306316     0.001617 189.485
## perc_male_5_14yrs        -59.928591    18.092538  -3.312
## tot_population            -0.060523     0.010521  -5.753
## perc_male_15_44yrs        58.524142    12.570366   4.656
## perc_female_over_60yrs    87.225537    12.370295   7.051
## perc_female_15_44yrs      54.477925    14.006055   3.890
##                                    Pr(>|t|)    
## (Intercept)            < 0.0000000000000002 ***
## no_establishments      < 0.0000000000000002 ***
## payroll_quarter1       < 0.0000000000000002 ***
## perc_male_5_14yrs                  0.000931 ***
## tot_population               0.000000009279 ***
## perc_male_15_44yrs           0.000003307055 ***
## perc_female_over_60yrs       0.000000000002 ***
## perc_female_15_44yrs               0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6067 on 5312 degrees of freedom
## Multiple R-squared:  0.9913, Adjusted R-squared:  0.9912 
## F-statistic: 8.605e+04 on 7 and 5312 DF,  p-value: < 0.00000000000000022

Calculating Mean Square Error on Train and Test Set

The test MSE is now an order of magnitude larger than the train MSE. This is partly expected: the influential outliers were removed from the training set only, so the test set still contains extreme observations that inflate its MSE.

# prediction on train set
linear_regression_pred_train <- predict(model_revenue)

# prediction on test set
linear_regression_pred_test <- predict(model_revenue,
                                      newdata=test)

# train MSE
linear_regression_train_mse <- mean((linear_regression_pred_train-train$revenue)^2)

# test MSE
linear_regression_test_mse <- mean((linear_regression_pred_test-test$revenue)^2)

cat("Train MSE:", linear_regression_train_mse, "| Test MSE:", linear_regression_test_mse)
## Train MSE: 36752185 | Test MSE: 321499258

Building the model to predict the popularity of product

  • The following model has been built using Multinomial Logistic Regression
  • The variable selection has been done using stepwise selection (AIC)
# Multinomial Logistic Regression
model_pop_product <- multinom(popular_product ~ 
                              male_median_age
                              +female_median_age
                              +tot_population
                              +tot_male_population
                              +tot_female_population
                              +perc_male_under_5yrs
                              +perc_male_15_44yrs
                              +perc_male_45_49yrs
                              +perc_male_over_60yrs
                              +perc_female_under_5yrs
                              +perc_female_15_44yrs
                              +perc_female_45_49yrs
                              +perc_female_over_60yrs
                              , select(train, -geographic_area_name, -Id2, -revenue), maxit=200)

# performing stepwise variable selection (AIC)
model_pop_product_step <- step(model_pop_product)

Model Summary

summary(model_pop_product_step)
## Call:
## multinom(formula = popular_product ~ male_median_age + tot_population + 
##     tot_male_population + tot_female_population + perc_male_under_5yrs + 
##     perc_male_15_44yrs + perc_male_45_49yrs + perc_female_15_44yrs + 
##     perc_female_45_49yrs + perc_female_over_60yrs, data = select(train, 
##     -geographic_area_name, -Id2, -revenue), maxit = 200)
## 
## Coefficients:
##            (Intercept) male_median_age tot_population tot_male_population
## life         -63.97387      -0.1186000 0.000007263478       -0.0002651740
## retirement   -10.77528      -0.1520542 0.000010230867       -0.0003063879
##            tot_female_population perc_male_under_5yrs perc_male_15_44yrs
## life                0.0002721347            -9.720721          3.2471242
## retirement          0.0003166178            -9.650487          0.0560999
##            perc_male_45_49yrs perc_female_15_44yrs perc_female_45_49yrs
## life              -0.05536281           0.09030834           0.03580886
## retirement         9.99796531           0.07939600          -0.13179365
##            perc_female_over_60yrs
## life                    0.0871428
## retirement              0.0761664
## 
## Std. Errors:
##              (Intercept) male_median_age tot_population
## life       0.00007348353      0.01536158 0.000007966623
## retirement 0.00004912882      0.01327730 0.000009400856
##            tot_male_population tot_female_population perc_male_under_5yrs
## life              0.0002295425          0.0002250732          0.004113861
## retirement        0.0002871352          0.0002809541          0.004432349
##            perc_male_15_44yrs perc_male_45_49yrs perc_female_15_44yrs
## life               0.01312008        0.005267846           0.01544974
## retirement         0.01604569        0.002397547           0.01782996
##            perc_female_45_49yrs perc_female_over_60yrs
## life                0.008229996             0.01756903
## retirement          0.004837625             0.01399654
## 
## Residual Deviance: 596.7282 
## AIC: 636.7282

Confusion matrix on train set

Accuracy Rate: 98.87%

conf_matrix_train <-
  table(train$popular_product,predict(model_pop_product))

conf_matrix_train
##             
##              college life retirement
##   college       1772   11         12
##   life             7 1634          7
##   retirement      18    5       1854

Confusion matrix on test set

Accuracy Rate: 97.76%

conf_matrix_test <-
  table(test$popular_product,predict(model_pop_product, newdata = test))

conf_matrix_test
##             
##              college life retirement
##   college        414    2          9
##   life             3  424          0
##   retirement      11    5        474

Corrected Model Diagnostics

  1. Errors are normally distributed with mean=0

The Q-Q plot suggests the errors are normally distributed.

# Constructing a dataframe containing model attributes
model_attributes1 <-
  data.frame(index=1:nrow(train),
             residuals = model_revenue$residuals, 
             fitted_values = model_revenue$fitted.values)

# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')


  2. Uncorrelated Errors

There is no apparent pattern in the errors over the index, so we can reasonably assume the errors are uncorrelated.

# Plotting Residuals over Time
model_attributes1 %>%
  ggplot(aes(x=index,y=residuals)) +
  geom_point()


  3. Constant Variance

The residuals now show roughly constant variance across the fitted values.

# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = 3, color = "blue") +
  geom_hline(yintercept = -3, color = "blue")


  4. Predictor Variables are independent of each other

The Variance Inflation Factor is below 10 for all variables, so there is no serious multicollinearity.

vif(model_revenue)
##      no_establishments       payroll_quarter1      perc_male_5_14yrs 
##               5.844737               2.785597               1.318786 
##         tot_population     perc_male_15_44yrs perc_female_over_60yrs 
##               3.459487               2.494747               2.396186 
##   perc_female_15_44yrs 
##               2.850943

  5. No influential outliers

Almost all standardized errors are below 5 in absolute value.

# Plotting Studentized/Standardized Errors
rstan <- rstandard(model_revenue)  
plot(rstan)

Model Interpretation

Linear Regression - Predicting Revenue

  • Revenue = output variable

  • ~99% of the variance in the output variable is explained by:
    • no_establishments
    • payroll_quarter1
    • perc_male_5_14yrs
    • tot_population
    • perc_male_15_44yrs
    • perc_female_over_60yrs
    • perc_female_15_44yrs
  • Holding all else constant, a 1 unit increase in no_establishments increases average revenue by $113.86.

  • More concretely: holding all else constant, we are 95% confident that a 1 unit increase in no_establishments increases average revenue by between $112.78 and $114.94 (see the confint() call after this list).

  • All other predictor variables can be interpreted in the same way.

  • The t-tests correspond to the following hypothesis test:
    • H0: Beta = 0
    • HA: Beta !=0
    • For all p-values < 0.05, we reject H0
    • All predictor variables have p-value < 0.05
  • The F-test corresponds to the following hypothesis test:
    • H0: All Beta’s = 0
    • HA: At least one Beta != 0
    • As p-value < 0.05, we reject H0. Thus our model as a whole is significant.
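
The 95% interval quoted above can be reproduced (up to rounding of the critical value) directly from the fitted model:

# 95% confidence interval for the no_establishments coefficient
confint(model_revenue, "no_establishments", level = 0.95)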

Multinomial Logistic Regression

  • Final variables selected are as follows:
    • male_median_age
    • tot_population
    • tot_male_population
    • tot_female_population
    • perc_male_under_5yrs
    • perc_male_15_44yrs
    • perc_male_45_49yrs
    • perc_female_15_44yrs
    • perc_female_45_49yrs
    • perc_female_over_60yrs
  • The interpretation of the model is as follows (the odds ratios are computed after this list):
    • With a 1 unit increase in perc_male_15_44yrs:
      • the odds of pop product = life relative to pop product = college increase by a factor of exp(3.247) ≈ 25.71
      • the odds of pop product = retirement relative to pop product = college increase by a factor of exp(0.056) ≈ 1.06
    • All other variables can be interpreted in the same way
  • AIC, which is a measure of the quality of the model, is 636.72

  • The accuracy rate on the test set, i.e. the percentage of popular products correctly classified, is 97.76%
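
These multiplicative factors are the exponentiated coefficients of the final multinomial model:

# odds ratios relative to the baseline category (college)
exp(coef(model_pop_product_step))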

Insights

Predicting Revenue Generated

  • Revenue generated in each zipcode is predicted using a linear regression model.

  • We obtain a near-perfect in-sample fit (99% adjusted R-squared) for predicting the revenue generated in each zipcode, so we can predict revenue accurately and act to increase or decrease it in selected zip codes.

  • Revenue highly depends upon these factors:
    • no_establishments
    • payroll_quarter1
    • perc_male_5_14yrs
    • tot_population
    • perc_male_15_44yrs
    • perc_female_over_60yrs
    • perc_female_15_44yrs
  • MSE(Train): 36752185 | MSE(Test): 321499258

Future Work

  1. Perform cross-validation to find the optimal k in the k-nearest-neighbour imputation (a sketch follows this list).
  2. Perform cross-validation on both the regression and classification models to verify that the models are not overfitting.
  3. Transform the response and/or predictor variables to better satisfy the assumptions of the linear regression model.
  4. Try other variable selection techniques such as backward, stepwise, and LASSO selection, using criteria such as adjusted R-squared, AIC, BIC, deviance, and p-values.
  5. Use more advanced models like decision trees, random forest, bagging and boosting to increase accuracy.
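
For item 1, a minimal sketch of 5-fold cross-validation for choosing k in the KNN imputation (hypothetical code, reusing the male_median_age setup from the imputation step and ideally run on the originally observed rows; not run in this report):

# sketch: 5-fold cross-validation over candidate k for FNN::knn.reg
set.seed(1)
complete_df <- na.omit(select(data, tot_population, tot_male_population,
                              perc_male_under_5yrs:perc_male_over_60yrs,
                              male_median_age))
folds <- sample(rep(1:5, length.out = nrow(complete_df)))
ks <- c(5, 10, 15, 20, 25, 30)

cv_mse <- sapply(ks, function(k) {
  mean(sapply(1:5, function(f) {
    tr <- as.data.frame(complete_df[folds != f, ])
    te <- as.data.frame(complete_df[folds == f, ])
    pred <- knn.reg(train = select(tr, -male_median_age),
                    test  = select(te, -male_median_age),
                    y     = tr$male_median_age, k = k)$pred
    mean((pred - te$male_median_age)^2)
  }))
})

ks[which.min(cv_mse)]  # candidate optimal k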