A financial institution wants to analyze the demographic information of the US population to predict:
The sample data given is a portion of the American Community Survey 2015 from census.gov. It contains several fields detailing population, gender, and business metrics grouped by selected ZIP codes throughout the United States.
require(Rmisc) # draw multiple plots
require(readxl) # read excel input
require(DT) # display table on HTML page
require(tidyverse) # perform data manipulation
require(kableExtra) # display table on HTML page
require(ggplot2) # perform data visualization
require(FNN) # KNN
require(GGally) # plot correlation plots
require(car) # check variance inflation factor
require(nnet) # perform multinomial logistic regression
options(scipen = 999) # do not display numbers in scientific format
Importing Data
# reading predictor variable data
data <- read_excel("american_fact_finder.xlsx", sheet = "census_train")
# reading response variable data
data_response <- read_excel("american_fact_finder.xlsx", sheet = "response_variable")
# reading id-geographic mapping data
data_mapping <- read_excel("american_fact_finder.xlsx", sheet = "id_geography_mapping")
# renaming predictor variable data
colnames(data) <- c("Id2",
"tot_population",
"tot_male_population",
"tot_female_population",
"perc_male_under_5yrs",
"perc_male_45_49yrs",
"perc_male_50_54yrs",
"perc_male_55_59yrs",
"perc_male_5_14yrs",
"perc_male_15_44yrs",
"perc_male_over_60yrs",
"perc_female_under_5yrs",
"perc_female_45_49yrs",
"perc_female_50_54yrs",
"perc_female_55_59yrs",
"perc_female_5_14yrs",
"perc_female_15_44yrs",
"perc_female_over_60yrs",
"male_median_age",
"female_median_age",
"no_establishments",
"paid_employees",
"payroll_quarter1",
"annual_payroll")
# renaming column names of response variable data
colnames(data_response) <- c("geographic_area_name",
"revenue",
"popular_product")
# renaming column names of id-geographic mapping data
colnames(data_mapping) <- c("Id2",
"geographic_area_name")
Data Dictionary
Predictor Variable Data| Variable | Description |
|---|---|
| Id2 | Unique Zipcode ID |
| tot_population | Total Population |
| tot_male_population | Total Male Population |
| tot_female_population | Total Female Population |
| perc_male_under_5yrs | Percentage of Males Under 5 Years of Age |
| perc_male_45_49yrs | Percentage of Males Between 45-49 Years of Age |
| perc_male_50_54yrs | Percentage of Males Between 50-54 Years of Age |
| perc_male_55_59yrs | Percentage of Males Between 55-59 Years of Age |
| perc_male_5_14yrs | Percentage of Males Between 5-14 Years of Age |
| perc_male_15_44yrs | Percentage of Males Between 15-44 Years of Age |
| perc_male_over_60yrs | Percentage of Males Over 60 Years of Age |
| perc_female_under_5yrs | Percentage of Females Under 5 Years of Age |
| perc_female_45_49yrs | Percentage of Females Between 45-49 Years of Age |
| perc_female_50_54yrs | Percentage of Females Between 50-54 Years of Age |
| perc_female_55_59yrs | Percentage of Females Between 55-59 Years of Age |
| perc_female_5_14yrs | Percentage of Females Between 5-14 Years of Age |
| perc_female_15_44yrs | Percentage of Females Between 15-44 Years of Age |
| perc_female_over_60yrs | Percentage of Females Over 60 Years of Age |
| male_median_age | Median Age of Males in Years |
| female_median_age | Median Age of Females in Years |
| no_establishments | Number of Establishments |
| paid_employees | Paid employees for pay period including March 12 (number) |
| payroll_quarter1 | First-quarter payroll ($1,000) |
| annual_payroll | Annual payroll ($1,000) |
| Variable | Description |
|---|---|
| geographic_area_name | Geographic Area Name |
| revenue | Revenue |
| popular_product | Popular Product |
| Variable | Description |
|---|---|
| Id2 | Unique Zipcode ID |
| geographic_area_name | Geographic Area Name |
Viewing Data
Predictor Variable Data
# display first 100 rows
kable(head(data, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")
| Id2 | tot_population | tot_male_population | tot_female_population | perc_male_under_5yrs | perc_male_45_49yrs | perc_male_50_54yrs | perc_male_55_59yrs | perc_male_5_14yrs | perc_male_15_44yrs | perc_male_over_60yrs | perc_female_under_5yrs | perc_female_45_49yrs | perc_female_50_54yrs | perc_female_55_59yrs | perc_female_5_14yrs | perc_female_15_44yrs | perc_female_over_60yrs | male_median_age | female_median_age | no_establishments | paid_employees | payroll_quarter1 | annual_payroll |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44319 | 22592 | 11419 | 11173 | 4.2 | 7.3 | 8.6999999999999993 | 9.4 | 9.4 | 36.200000000000003 | 24.8 | 4.5 | 6.5 | 8.1 | 8.4 | 9.1999999999999993 | 34.9 | 28.5 | 45.1 | 45.8 | 582 | 6810 | 43116 | 196189 |
| 73067 | 2146 | 1171 | 975 | 2.8 | 7.9 | 8.4 | 7.9 | 13.1 | 39.1 | 20.8 | 5.2 | 8.1 | 9.3000000000000007 | 8.1 | 11.1 | 31.7 | 26.5 | 41.4 | 45.7 | 33 | 205 | 1724 | 6045 |
| 1851 | 30022 | 14858 | 15164 | 8.1 | 7.8 | 7.3 | 6.3 | 11.2 | 46.1 | 13 | 6.1 | 7.4 | 6.8 | 6.4 | 13.6 | 43.6 | 16.100000000000001 | 32.799999999999997 | 34.4 | 450 | 8568 | 147660 | 551236 |
| 67335 | 3332 | 1803 | 1529 | 7 | 6.2 | 8.8000000000000007 | 9.1999999999999993 | 12.9 | 34.1 | 22 | 2 | 5.6 | 6.1 | 8.6 | 13.7 | 33.6 | 30.3 | 41.4 | 45.8 | 52 | 491 | 2852 | 12186 |
| 72524 | 280 | 162 | 118 | 17.3 | 12.3 | 0 | 13 | 15.4 | 30.9 | 11.1 | 0 | 16.100000000000001 | 0 | 0 | 0 | 49.2 | 34.700000000000003 | 30.4 | 46.1 | 9 | 93 | 470 | 2592 |
| 35901 | 19573 | 9359 | 10214 | 6.9 | 6.3 | 6.2 | 7.8 | 13.9 | 38.799999999999997 | 20.100000000000001 | 5.8 | 7.4 | 5.9 | 8.3000000000000007 | 9.9 | 36.4 | 26.3 | 38.200000000000003 | 43.5 | 714 | 9219 | 66462 | 289032 |
| 8752 | 2367 | 1229 | 1138 | 4.4000000000000004 | 3.4 | 6 | 9.3000000000000007 | 7.9 | 23 | 46 | 0 | 8.8000000000000007 | 5.7 | 8.5 | 3.8 | 25 | 48.2 | 58.8 | 58.4 | 58 | 273 | 1705 | 10276 |
| 35063 | 3900 | 1945 | 1955 | 6.4 | 6.9 | 8.3000000000000007 | 9.1999999999999993 | 12.6 | 32.4 | 24.2 | 3.7 | 7.1 | 10.5 | 4.8 | 13.5 | 29 | 31.5 | 42.8 | 47.5 | 13 | 47 | 263 | 1146 |
| 48114 | 20600 | 10152 | 10448 | 4.3 | 7.8 | 10 | 9.1 | 13.3 | 34.4 | 21.1 | 4.0999999999999996 | 8.1 | 8.6 | 9.1999999999999993 | 12.8 | 33.299999999999997 | 24 | 43.2 | 44.9 | 505 | 4086 | 41436 | 188834 |
| 20037 | 15687 | 7384 | 8303 | 1.5 | 4 | 2.5 | 4.5999999999999996 | 0 | 71.7 | 15.7 | 1.4 | 3 | 2.8 | 1.6 | 1 | 74.5 | 15.6 | 30.7 | 24.3 | 829 | 25487 | 609829 | 2389256 |
| 54421 | 3270 | 1583 | 1687 | 5.9 | 7.3 | 8.1 | 6.4 | 18.600000000000001 | 34.4 | 19.3 | 7.5 | 5.9 | 7.8 | 4.4000000000000004 | 16.7 | 30.7 | 27 | 36.799999999999997 | 40.5 | 71 | 997 | 8410 | 34853 |
| 79070 | 10333 | 5282 | 5051 | 9.8000000000000007 | 6 | 6.6 | 6.3 | 18.3 | 40.799999999999997 | 12.2 | 8.1999999999999993 | 7.5 | 5.3 | 5.7 | 18.399999999999999 | 40.200000000000003 | 14.7 | 30.4 | 32.5 | 348 | 3862 | 53937 | 200034 |
| 49676 | 3441 | 1736 | 1705 | 6 | 5.6 | 6 | 8.6999999999999993 | 10.5 | 31 | 32 | 3 | 8.3000000000000007 | 7.8 | 7.1 | 8.3000000000000007 | 33.700000000000003 | 31.7 | 48.4 | 46.9 | 51 | 312 | 2928 | 12681 |
| 3583 | 933 | 424 | 509 | 3.8 | 8.3000000000000007 | 11.3 | 8 | 8.3000000000000007 | 22.6 | 37.700000000000003 | 3.5 | 11.2 | 10.199999999999999 | 10.4 | 8.1 | 21.2 | 35.4 | 52.3 | 53.8 | 25 | 175 | 1459 | 9473 |
| 54454 | 1442 | 742 | 700 | 5.3 | 8.1 | 9.6999999999999993 | 11.6 | 11.9 | 35.200000000000003 | 18.3 | 6.9 | 5.3 | 8.1 | 11.1 | 15.1 | 37.4 | 16 | 42.5 | 38.5 | 17 | 92 | 548 | 2438 |
| 8019 | 1012 | 516 | 496 | 3.1 | 11.4 | 12.2 | 6.2 | 5.6 | 43.4 | 18 | 4.2 | 9.1 | 9.3000000000000007 | 6.3 | 10.1 | 36.1 | 25 | 42.3 | 44.7 | 14 | 45 | 439 | 2337 |
| 36003 | 1716 | 873 | 843 | 4.8 | 1.9 | 12.7 | 8.1 | 13.6 | 36 | 22.8 | 1.9 | 9.8000000000000007 | 13.6 | 5.6 | 6.6 | 38.6 | 23.8 | 40.9 | 48 | 15 | 128 | 751 | 3257 |
| 13335 | 1606 | 851 | 755 | 6.3 | 15.9 | 8.5 | 6.2 | 9.1999999999999993 | 33.299999999999997 | 20.7 | 7.3 | 7.9 | 7.8 | 5.3 | 13.9 | 34.799999999999997 | 22.9 | 45.5 | 41.4 | 31 | 1086 | 10991 | 47243 |
| 62215 | 1731 | 894 | 837 | 7.2 | 9.1 | 8.4 | 5.9 | 15.9 | 38.5 | 15.1 | 4.9000000000000004 | 7.6 | 9.1 | 5.5 | 12.1 | 42.3 | 18.5 | 34.4 | 37 | 42 | 299 | 1776 | 8537 |
| 53575 | 16070 | 8358 | 7712 | 4.8 | 10.3 | 8.1 | 7 | 14.3 | 40.299999999999997 | 15.1 | 4.5999999999999996 | 9.1 | 7.9 | 6.3 | 18.3 | 33.9 | 19.899999999999999 | 39.6 | 41.1 | 319 | 2548 | 20206 | 94328 |
| 79934 | 24062 | 12189 | 11873 | 9.5 | 4.5 | 4.2 | 5.9 | 19.600000000000001 | 49.6 | 6.7 | 9.4 | 6.8 | 5.7 | 4 | 17 | 49.3 | 7.8 | 28 | 29.9 | 92 | 1123 | 9648 | 44193 |
| 37355 | 25493 | 12334 | 13159 | 6 | 7.5 | 6.5 | 7.2 | 13.8 | 38 | 21 | 5 | 5.6 | 9 | 5.8 | 13.3 | 36.799999999999997 | 24.6 | 39.200000000000003 | 40.200000000000003 | 505 | 8020 | 66808 | 279814 |
| 15610 | 3667 | 1794 | 1873 | 2.2000000000000002 | 6.4 | 10.8 | 11.3 | 11.6 | 32.6 | 25.1 | 5.6 | 8.5 | 13.1 | 9.8000000000000007 | 13 | 27 | 22.9 | 48.5 | 47.8 | 45 | 359 | 2840 | 13443 |
| 73106 | 13242 | 6977 | 6265 | 8.1 | 7.3 | 6.7 | 9.1999999999999993 | 11.3 | 46.4 | 11.1 | 6.8 | 5.0999999999999996 | 5.0999999999999996 | 7.8 | 12.1 | 51.4 | 11.7 | 33.1 | 30.3 | 479 | 8560 | 112417 | 450773 |
| 1129 | 7377 | 3466 | 3911 | 7.2 | 7.6 | 8.8000000000000007 | 6.8 | 15.6 | 35.5 | 18.600000000000001 | 3.7 | 6.9 | 7.2 | 6.8 | 9.5 | 38 | 27.9 | 35.799999999999997 | 43.7 | 162 | 2657 | 15369 | 63623 |
| 4926 | 36 | 18 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |
|
|
11 | 59 | 382 | 1773 |
| 28269 | 76360 | 37057 | 39303 | 8.1 | 7.5 | 6.5 | 4.8 | 15.7 | 46.9 | 10.6 | 7.2 | 7.6 | 5.3 | 5.7 | 14.2 | 47.6 | 12.4 | 32.6 | 34.4 | 1266 | 20659 | 249180 | 1002725 |
| 75691 | 2817 | 1435 | 1382 | 3.8 | 9.5 | 2.9 | 5.9 | 17.3 | 36.6 | 24 | 4.2 | 6.2 | 4.5999999999999996 | 7.7 | 22.1 | 31.8 | 23.4 | 40.4 | 36.9 | 71 | 1481 | 31346 | 98008 |
| 25670 | 6074 | 3019 | 3055 | 7.5 | 7.6 | 4.8 | 7.3 | 12.5 | 38.700000000000003 | 21.6 | 6.8 | 8.1999999999999993 | 7.8 | 6.9 | 10.5 | 38.1 | 21.6 | 35.200000000000003 | 42.1 | 48 | 463 | 2676 | 10959 |
| 48184 | 17251 | 8354 | 8897 | 6.5 | 8 | 8 | 7.1 | 13.7 | 39.700000000000003 | 16.899999999999999 | 3.9 | 6 | 6.1 | 9.1999999999999993 | 12.2 | 41.2 | 21.5 | 36.700000000000003 | 39.799999999999997 | 369 | 11888 | 170286 | 666154 |
| 2535 | 1511 | 706 | 805 | 5.5 | 2.8 | 8.1 | 9.3000000000000007 | 11 | 36.799999999999997 | 26.3 | 2.9 | 3.7 | 11.8 | 9.1999999999999993 | 7.6 | 37.5 | 27.3 | 41.9 | 48.5 | 49 | 166 | 1690 | 9080 |
| 15746 | 106 | 47 | 59 | 0 | 0 | 6.4 | 19.100000000000001 | 0 | 53.2 | 21.3 | 10.199999999999999 | 0 | 6.8 | 0 | 13.6 | 22 | 47.5 | 39.6 | 53.6 | 9 | 127 | 1011 | 4049 |
| 88008 | 6881 | 3521 | 3360 | 12.6 | 6.1 | 7.2 | 7.1 | 18.600000000000001 | 34.299999999999997 | 14.1 | 5.3 | 7.7 | 7.5 | 5.7 | 16.8 | 45.7 | 11.4 | 29.6 | 31.2 | 103 | 1952 | 16335 | 65725 |
| 39217 | 1138 | 418 | 720 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 20.5 | 20.100000000000001 | 12 | 192 | 806 | 3360 |
| 97540 | 8061 | 3893 | 4168 | 2.6 | 7.2 | 5.4 | 5.8 | 16.600000000000001 | 35.5 | 27 | 4.8 | 7.2 | 7.1 | 7.1 | 11.1 | 33.700000000000003 | 29 | 42.7 | 45.1 | 160 | 893 | 5819 | 26660 |
| 99169 | 2577 | 1315 | 1262 | 9.1 | 7.7 | 7.5 | 10 | 13.9 | 29.2 | 22.7 | 11.1 | 5.9 | 8.1999999999999993 | 9.1 | 9.9 | 27.4 | 28.4 | 43.4 | 45.9 | 80 | 606 | 4716 | 23112 |
| 32955 | 37745 | 17729 | 20016 | 5 | 7 | 9.9 | 7.2 | 12.7 | 32.4 | 25.8 | 4.4000000000000004 | 6.6 | 8.1 | 6.2 | 11.9 | 32.6 | 30.1 | 44.9 | 45.8 | 918 | 11235 | 103807 | 432061 |
| 4360 | 517 | 266 | 251 | 0.8 | 11.7 | 9.4 | 6.8 | 9.4 | 30.1 | 32 | 1.2 | 9.1999999999999993 | 4.8 | 14.7 | 5.6 | 28.7 | 35.9 | 48.8 | 55.2 | 12 | 27 | 127 | 743 |
| 31408 | 10611 | 5919 | 4692 | 6.8 | 9 | 4.7 | 4.9000000000000004 | 8 | 54.1 | 12.7 | 10.8 | 7.2 | 6.6 | 4.2 | 12.5 | 38.6 | 20.100000000000001 | 33.6 | 35.200000000000003 | 591 | 15300 | 152484 | 622458 |
| 97503 | 12592 | 6309 | 6283 | 7.1 | 7.9 | 6.4 | 5.7 | 12.9 | 42.2 | 17.8 | 10.5 | 5.7 | 2.8 | 6.5 | 14.3 | 42.9 | 17.2 | 35.9 | 30.4 | 240 | 4858 | 47896 | 196024 |
| 61448 | 3706 | 1852 | 1854 | 4.7 | 6.2 | 7.7 | 7.1 | 12.7 | 34.4 | 27.3 | 3.6 | 6.4 | 6.6 | 7.1 | 12.6 | 29.9 | 33.799999999999997 | 43 | 47.5 | 42 | 216 | 798 | 4582 |
| 76177 | 6470 | 3321 | 3149 | 7.2 | 3.6 | 7.6 | 5.5 | 8.6 | 58.4 | 9.1999999999999993 | 14 | 4.5 | 6.5 | 4.4000000000000004 | 20.5 | 41.1 | 8.9 | 35.4 | 29.9 | 421 | 24624 | 358954 | 1390697 |
| 37324 | 5145 | 2578 | 2567 | 9.8000000000000007 | 4.4000000000000004 | 6.2 | 5.7 | 14 | 35.4 | 24.6 | 6.7 | 6.2 | 5.0999999999999996 | 9.6999999999999993 | 7.3 | 38.6 | 26.4 | 37.200000000000003 | 43.8 | 137 | 3201 | 40872 | 185889 |
| 47923 | 3519 | 1682 | 1837 | 6.4 | 6.8 | 7.5 | 6.1 | 16.2 | 40 | 17.100000000000001 | 6.8 | 9.3000000000000007 | 5.4 | 6.8 | 16.8 | 34.299999999999997 | 20.6 | 39.5 | 40.4 | 76 | 561 | 4186 | 17733 |
| 15663 | 418 | 197 | 221 | 8.1 | 9.6 | 11.7 | 12.2 | 2 | 16.2 | 40.1 | 2.2999999999999998 | 1.8 | 6.8 | 14.9 | 8.1 | 26.2 | 39.799999999999997 | 56.4 | 56.2 | 19 | 796 | 16185 | 65337 |
| 95120 | 37644 | 18723 | 18921 | 4.5999999999999996 | 8.4 | 10.3 | 8 | 17.600000000000001 | 29.4 | 21.9 | 3.6 | 10.4 | 8.6 | 7.5 | 14.5 | 31.1 | 24.3 | 44.3 | 45.4 | 512 | 3361 | 64597 | 265542 |
| 27012 | 25699 | 12259 | 13440 | 6.6 | 10 | 7.2 | 8.5 | 15 | 31.4 | 21.2 | 4.5999999999999996 | 9.4 | 8.3000000000000007 | 8.4 | 11.8 | 33.4 | 24.2 | 43.3 | 45.1 | 632 | 6456 | 47670 | 199328 |
| 24620 | 2992 | 1444 | 1548 | 9.6 | 7 | 7.8 | 5.6 | 10.8 | 40 | 19.2 | 4.7 | 9.9 | 10.9 | 13.5 | 11.4 | 34.799999999999997 | 14.8 | 38.299999999999997 | 42.8 | 25 | 219 | 2157 | 8701 |
| 46617 | 9403 | 4546 | 4857 | 6.6 | 5.0999999999999996 | 4.8 | 5.8 | 12.4 | 47.5 | 17.8 | 6.8 | 5.8 | 5.8 | 4.5 | 10.9 | 41.7 | 24.6 | 30.9 | 35.4 | 269 | 4948 | 52919 | 204581 |
| 28520 | 413 | 166 | 247 | 24.1 | 0 | 6 | 0 | 9.6 | 33.700000000000003 | 26.5 | 9.3000000000000007 | 0 | 4.5 | 14.6 | 19.399999999999999 | 33.6 | 18.600000000000001 | 39.4 | 36.6 | 3 | 4 | 14 | 60 |
| 7444 | 10811 | 4643 | 6168 | 4.4000000000000004 | 8.8000000000000007 | 12.4 | 6.8 | 10.3 | 27.6 | 29.7 | 2.7 | 7.8 | 9.6 | 5.6 | 11 | 29 | 34.200000000000003 | 49.1 | 49.4 | 351 | 5188 | 55156 | 228858 |
| 57031 | 733 | 367 | 366 | 3.5 | 7.9 | 7.9 | 8.4 | 10.6 | 42.2 | 19.3 | 5.5 | 4.0999999999999996 | 10.7 | 11.7 | 6.6 | 42.6 | 18.899999999999999 | 38.9 | 39.299999999999997 | 9 | 52 | 394 | 1811 |
| 37118 | 1173 | 551 | 622 | 1.5 | 8.6999999999999993 | 17.100000000000001 | 0 | 10.199999999999999 | 39.4 | 23.2 | 3.7 | 10.8 | 7.6 | 8.8000000000000007 | 18.3 | 34.1 | 16.7 | 34.9 | 40.200000000000003 | 5 | 16 | 152 | 611 |
| 5252 | 245 | 114 | 131 | 0 | 6.1 | 14 | 17.5 | 7 | 21.9 | 33.299999999999997 | 3.1 | 11.5 | 9.1999999999999993 | 17.600000000000001 | 3.1 | 26.7 | 29 | 55.3 | 53.4 | 19 | 97 | 1271 | 5283 |
| 7601 | 44035 | 22132 | 21903 | 6.5 | 8 | 7.6 | 6.5 | 9.1 | 46.1 | 16.3 | 6.9 | 7.3 | 7 | 7.6 | 6.8 | 42.8 | 21.6 | 37.9 | 40.299999999999997 | 2273 | 40913 | 538677 | 2340830 |
| 12928 | 1882 | 962 | 920 | 2.2999999999999998 | 6.3 | 10.199999999999999 | 8.9 | 12.7 | 33.1 | 26.5 | 2.6 | 5.8 | 8.9 | 7.2 | 12.9 | 36.700000000000003 | 25.9 | 47.6 | 41.4 | 29 | 151 | 1054 | 4284 |
| 18818 | 1281 | 673 | 608 | 4 | 9.6999999999999993 | 11 | 7.9 | 12.3 | 27 | 28.1 | 3.1 | 5.6 | 12.5 | 11.3 | 7.4 | 31.4 | 28.6 | 48.4 | 51.3 | 26 | 202 | 1580 | 7545 |
| 59936 | 487 | 255 | 232 | 0 | 7.1 | 0 | 5.5 | 8.1999999999999993 | 60 | 19.2 | 0 | 7.3 | 8.1999999999999993 | 2.6 | 8.1999999999999993 | 41.8 | 31.9 | 32.299999999999997 | 44.5 | 37 | 90 | 480 | 6205 |
| 98037 | 27994 | 13631 | 14363 | 5.8 | 6.9 | 7.4 | 6.7 | 12.4 | 44 | 16.899999999999999 | 5.7 | 6.2 | 7.6 | 6.9 | 12.9 | 41.7 | 19 | 36.5 | 37.5 | 926 | 11958 | 103085 | 427303 |
| 25245 | 622 | 279 | 343 | 0 | 13.6 | 0 | 4.3 | 12.5 | 31.5 | 38 | 10.199999999999999 | 11.4 | 5.2 | 13.7 | 9.9 | 21.9 | 27.7 | 49.4 | 47.6 | 6 | 6 | 39 | 132 |
| 96712 | 8441 | 4590 | 3851 | 4.5999999999999996 | 3.2 | 4.3 | 6.9 | 15.6 | 47.8 | 17.600000000000001 | 4.0999999999999996 | 5.5 | 6.5 | 4.7 | 18.2 | 38.1 | 22.8 | 34.6 | 39.299999999999997 | 218 | 1756 | 11341 | 47992 |
| 54160 | 303 | 154 | 149 | 0.6 | 3.9 | 10.4 | 5.2 | 15.6 | 49.4 | 14.9 | 3.4 | 7.4 | 11.4 | 6.7 | 9.4 | 43.6 | 18.100000000000001 | 39.700000000000003 | 44.3 | 9 | 99 | 713 | 3060 |
| 77360 | 4949 | 2445 | 2504 | 5.8 | 7.5 | 2.2000000000000002 | 4.5999999999999996 | 12.1 | 28.1 | 39.6 | 3.1 | 4.5 | 9.1999999999999993 | 5.5 | 11.3 | 30.4 | 36 | 46.7 | 50.3 | 61 | 398 | 3093 | 15488 |
| 54495 | 7997 | 4061 | 3936 | 9.8000000000000007 | 6.5 | 7.8 | 6.1 | 13.9 | 35.5 | 20.399999999999999 | 5.0999999999999996 | 6.8 | 7.3 | 9.1 | 12.7 | 35.200000000000003 | 23.9 | 35.4 | 42.5 | 190 | 3966 | 54022 | 199622 |
| 30559 | 4410 | 2251 | 2159 | 0.8 | 5.2 | 7.8 | 6.5 | 13.1 | 35.200000000000003 | 31.3 | 4.8 | 4.9000000000000004 | 9.5 | 10.1 | 6.9 | 31.4 | 32.4 | 46.3 | 51.1 | 53 | 228 | 1600 | 6525 |
| 35592 | 4831 | 2326 | 2505 | 5.5 | 4.8 | 9.6 | 8.9 | 10.9 | 39.4 | 20.9 | 5.4 | 6.2 | 6.8 | 7.5 | 13.3 | 36.5 | 24.4 | 42.6 | 40 | 118 | 1464 | 11288 | 49373 |
| 41831 | 1001 | 490 | 511 | 1.4 | 11.4 | 7.3 | 19.2 | 9.8000000000000007 | 39.6 | 11.2 | 1 | 16.600000000000001 | 11 | 8.4 | 7.4 | 39.1 | 16.399999999999999 | 43.6 | 46.4 | 7 | 47 | 526 | 2071 |
| 82639 | 779 | 419 | 360 | 11.7 | 5.5 | 13.8 | 5.5 | 11.5 | 35.799999999999997 | 16.2 | 3.1 | 13.1 | 4.7 | 1.7 | 14.7 | 42.8 | 20 | 40.4 | 32.9 | 33 | 91 | 611 | 3242 |
| 20732 | 9973 | 4522 | 5451 | 7.2 | 10.6 | 7.9 | 5.7 | 14.3 | 37.6 | 16.7 | 7.4 | 4.5 | 7.4 | 7.3 | 16.2 | 43.9 | 13.3 | 37.200000000000003 | 33 | 106 | 773 | 4925 | 23494 |
| 67837 | 890 | 429 | 461 | 4.7 | 7.7 | 6.5 | 4.4000000000000004 | 16.600000000000001 | 40.6 | 19.600000000000001 | 6.7 | 6.5 | 3.3 | 6.7 | 13.4 | 41 | 22.3 | 35.4 | 33.1 | 37 | 136 | 1290 | 5098 |
| 16435 | 1904 | 945 | 959 | 6 | 5.3 | 7 | 11.5 | 11.7 | 37.799999999999997 | 20.6 | 7 | 7 | 10.5 | 7 | 12.3 | 36.799999999999997 | 19.399999999999999 | 40.9 | 41.3 | 25 | 94 | 731 | 3602 |
| 17870 | 15098 | 7382 | 7716 | 5.5 | 6.9 | 6.8 | 7.1 | 11.9 | 42.8 | 19 | 2.7 | 6.8 | 5.4 | 8 | 10.1 | 42.7 | 24.5 | 35.299999999999997 | 40.5 | 414 | 7571 | 47432 | 199494 |
| 77454 | 188 | 111 | 77 | 0 | 0 | 0 | 15.3 | 0 | 57.7 | 27 | 0 | 0 | 0 | 22.1 | 0 | 0 | 77.900000000000006 | 25.8 | 60.6 | 6 | 10 | 41 | 194 |
| 54947 | 2539 | 1269 | 1270 | 2 | 9.6 | 12.1 | 9.1999999999999993 | 7.2 | 31.6 | 28.1 | 4.5999999999999996 | 9.8000000000000007 | 10.199999999999999 | 7.4 | 7 | 30 | 30.9 | 49.9 | 49.6 | 32 | 110 | 706 | 3462 |
| 61475 | 363 | 160 | 203 | 2.5 | 18.8 | 8.1 | 2.5 | 8.8000000000000007 | 41.3 | 18.100000000000001 | 1 | 14.3 | 11.8 | 0 | 17.2 | 37.9 | 17.7 | 43.4 | 35.9 | 5 | 15 | 221 | 2292 |
| 94572 | 9724 | 4335 | 5389 | 9.1999999999999993 | 5 | 6.1 | 10.6 | 10.4 | 39.4 | 19.2 | 4 | 5 | 7.6 | 6.2 | 12.8 | 41.9 | 22.6 | 35.700000000000003 | 37.5 | 79 | 996 | 24776 | 77658 |
| 64643 | 1391 | 675 | 716 | 12 | 9.1999999999999993 | 4.3 | 6.7 | 14.2 | 30.8 | 22.8 | 5.3 | 7.4 | 4.9000000000000004 | 4.7 | 19.100000000000001 | 36 | 22.5 | 38.9 | 39.4 | 14 | 69 | 445 | 2152 |
| 70466 | 8560 | 4557 | 4003 | 6.9 | 4.2 | 6 | 7.2 | 15.2 | 49.4 | 11.1 | 5.2 | 7.4 | 5.8 | 7.1 | 11.5 | 42.5 | 20.5 | 31.8 | 38.799999999999997 | 66 | 325 | 1844 | 8450 |
| 92887 | 20292 | 10080 | 10212 | 5.8 | 7 | 9.5 | 9.3000000000000007 | 14.9 | 34.799999999999997 | 18.7 | 6.1 | 8.9 | 9.6999999999999993 | 9.4 | 11.8 | 34.799999999999997 | 19.3 | 40.5 | 43.3 | 625 | 6794 | 95375 | 376988 |
| 45389 | 564 | 280 | 284 | 5.7 | 3.2 | 18.2 | 5.4 | 15.7 | 30.4 | 21.4 | 4.5999999999999996 | 6 | 5.3 | 10.199999999999999 | 12 | 34.9 | 27.1 | 39 | 42.6 | 10 | 28 | 98 | 375 |
| 58538 | 2669 | 1385 | 1284 | 13.9 | 5.3 | 5.8 | 4.0999999999999996 | 21 | 41 | 8.8000000000000007 | 10.4 | 7 | 6.8 | 3.7 | 20.100000000000001 | 41.1 | 11 | 26 | 29.2 | 17 | 932 | 6505 | 27357 |
| 18829 | 888 | 452 | 436 | 5.8 | 11.9 | 8.8000000000000007 | 6.4 | 14.2 | 35.6 | 17.3 | 6.9 | 7.8 | 10.8 | 4.5999999999999996 | 13.3 | 30 | 26.6 | 42 | 44.8 | 12 | 61 | 276 | 1252 |
| 60443 | 21184 | 9728 | 11456 | 5.5 | 7.1 | 8.1 | 8.1 | 13.4 | 36.9 | 20.9 | 3.4 | 5 | 6.2 | 9.9 | 16.100000000000001 | 34.9 | 24.5 | 40 | 41.6 | 346 | 6291 | 45487 | 182411 |
| 21778 | 1077 | 532 | 545 | 2.6 | 11.8 | 5.0999999999999996 | 11.3 | 1.7 | 53 | 14.5 | 0 | 8.8000000000000007 | 6.8 | 11.6 | 22.6 | 35.200000000000003 | 15 | 39.799999999999997 | 40.200000000000003 | 18 | 109 | 867 | 4789 |
| 5251 | 1399 | 638 | 761 | 1.3 | 6.4 | 11.8 | 6.6 | 9.1 | 21.8 | 43.1 | 5.0999999999999996 | 9.3000000000000007 | 10.1 | 12.2 | 7.9 | 19.399999999999999 | 35.9 | 54.9 | 53.6 | 73 | 479 | 3763 | 16779 |
| 32216 | 37190 | 18268 | 18922 | 9.1 | 7.8 | 5.5 | 5.4 | 9.6 | 46.6 | 16.100000000000001 | 4.7 | 6.2 | 7.6 | 4.5999999999999996 | 12.4 | 40.799999999999997 | 23.7 | 34.6 | 37 | 1797 | 32050 | 413167 | 1535742 |
| 80465 | 15986 | 7934 | 8052 | 4.5999999999999996 | 7.8 | 9 | 8.9 | 11.9 | 34.4 | 23.4 | 5.8 | 8.6999999999999993 | 9 | 9.5 | 12.3 | 34.299999999999997 | 20.5 | 44.5 | 43.5 | 336 | 1606 | 12446 | 60620 |
| 67332 | 168 | 89 | 79 | 0 | 11.2 | 22.5 | 5.6 | 9 | 30.3 | 21.3 | 0 | 17.7 | 15.2 | 13.9 | 1.3 | 35.4 | 16.5 | 49.9 | 49.3 | 4 | 73 | 807 | 2977 |
| 82215 | 171 | 107 | 64 | 0 | 2.8 | 13.1 | 14 | 4.7 | 36.4 | 29 | 0 | 4.7 | 0 | 25 | 4.7 | 10.9 | 54.7 | 50.7 | 63.3 | 7 | 27 | 138 | 668 |
| 27863 | 11877 | 5898 | 5979 | 10.199999999999999 | 7.5 | 5.5 | 8 | 14.3 | 39.4 | 15.1 | 5.5 | 6.4 | 7 | 3.6 | 15.8 | 40.200000000000003 | 21.6 | 36.4 | 36.700000000000003 | 100 | 672 | 3985 | 18454 |
| 67732 | 437 | 226 | 211 | 2.2000000000000002 | 2.2000000000000002 | 11.1 | 12.4 | 12.4 | 25.2 | 34.5 | 7.1 | 2.8 | 14.7 | 9 | 8.5 | 28.4 | 29.4 | 53 | 50.7 | 16 | 104 | 895 | 4228 |
| 91405 | 53347 | 26612 | 26735 | 7.3 | 7.3 | 6.6 | 6 | 14.6 | 47.4 | 10.8 | 7.6 | 7.3 | 7.1 | 5.6 | 13.2 | 44.4 | 14.8 | 31.6 | 34.9 | 957 | 12807 | 126812 | 549159 |
| 25121 | 678 | 316 | 362 | 0 | 28.5 | 3.8 | 0 | 0 | 38.6 | 29.1 | 0 | 0 | 6.9 | 19.100000000000001 | 16.600000000000001 | 45 | 12.4 | 47.3 | 43.5 | 8 | 53 | 505 | 2014 |
| 19009 | 955 | 491 | 464 | 0.6 | 0.8 | 9 | 13.6 | 16.100000000000001 | 44.6 | 15.3 | 5.8 | 5.4 | 12.9 | 6.9 | 10.6 | 38.799999999999997 | 19.600000000000001 | 28.1 | 39.299999999999997 | 27 | 778 | 5335 | 22277 |
| 93110 | 17382 | 8997 | 8385 | 5.4 | 8.1 | 6.6 | 6.8 | 9.3000000000000007 | 41.4 | 22.5 | 3.6 | 8.5 | 6.2 | 8.8000000000000007 | 10.3 | 33.5 | 29.1 | 38.6 | 47.1 | 326 | 3721 | 69125 | 256148 |
| 40870 | 396 | 206 | 190 | 0 | 0 | 4.9000000000000004 | 12.6 | 5.8 | 48.5 | 28.2 | 11.6 | 7.4 | 5.8 | 12.6 | 13.7 | 31.1 | 17.899999999999999 | 43.9 | 38.200000000000003 | 4 | 52 | 418 | 1882 |
| 27939 | 2592 | 1209 | 1383 | 11.8 | 5.0999999999999996 | 6.6 | 5.5 | 17.8 | 30.4 | 22.7 | 6.5 | 2.5 | 6.9 | 5.0999999999999996 | 22.4 | 39.9 | 16.600000000000001 | 36.700000000000003 | 31.5 | 44 | 354 | 1646 | 7561 |
| 63033 | 42635 | 18843 | 23792 | 7 | 5.3 | 7.6 | 8.1 | 16 | 37.6 | 18.399999999999999 | 5.4 | 5.3 | 7.3 | 6.5 | 11 | 39.4 | 25.2 | 34.9 | 40.6 | 909 | 7856 | 42228 | 179903 |
| 25241 | 1612 | 910 | 702 | 6.7 | 8.4 | 4.3 | 5.6 | 15.7 | 41.3 | 18 | 8.1 | 7.3 | 7.1 | 5.4 | 16.5 | 27.2 | 28.3 | 35.799999999999997 | 43.8 | 20 | 65 | 345 | 1819 |
| 68532 | 812 | 484 | 328 | 9.6999999999999993 | 12 | 1.4 | 8.3000000000000007 | 15.5 | 42.4 | 10.7 | 2.4 | 25 | 3.7 | 5.8 | 28.4 | 18.899999999999999 | 15.9 | 17.899999999999999 | 45 | 10 | 27 | 184 | 1103 |
Response Variable Data
# display first 100 rows
kable(head(data_response, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")
| geographic_area_name | revenue | popular_product |
|---|---|---|
| ZIP 44319 (Akron, OH) | 110168 | retirement |
| ZIP 73067 (Ninnekah, OK) | 32785 | retirement |
| ZIP 01851 (Lowell, MA) | 124838 | life |
| ZIP 67335 (Cherryvale, KS) | 48629 | college |
| ZIP 72524 (Cord, AR) | 23694 | college |
| ZIP 35901 (Gadsden, AL) | 131639 | college |
| ZIP 08752 (Seaside Park, NJ) | 37291 | college |
| ZIP 35063 (Empire, AL) | 34062 | college |
| ZIP 48114 (Brighton, MI) | 96213 | retirement |
| ZIP 20037 (Washington, DC) | 367039 | life |
| ZIP 54421 (Colby, WI) | 43943 | retirement |
| ZIP 79070 (Perryton, TX) | 74486 | college |
| ZIP 49676 (Rapid City, MI) | 38341 | college |
| ZIP 03583 (Jefferson, NH) | 26785 | retirement |
| ZIP 54454 (Milladore, WI) | 32692 | retirement |
| ZIP 08019 (Chatsworth, NJ) | 33492 | retirement |
| ZIP 36003 (Autaugaville, AL) | 33899 | life |
| ZIP 13335 (Edmeston, NY) | 37718 | retirement |
| ZIP 62215 (Albers, IL) | 31449 | retirement |
| ZIP 53575 (Oregon, WI) | 72434 | retirement |
| ZIP 79934 (El Paso, TX) | 35526 | life |
| ZIP 37355 (Manchester, TN) | 118132 | retirement |
| ZIP 15610 (Acme, PA) | 42750 | retirement |
| ZIP 73106 (Oklahoma City, OK) | 121778 | life |
| ZIP 01129 (Springfield, MA) | 51292 | college |
| ZIP 04926 (China Village, ME) | 31080 | college |
| ZIP 28269 (Charlotte, NC) | 245501 | life |
| ZIP 75691 (Tatum, TX) | 51806 | retirement |
| ZIP 25670 (Delbarton, WV) | 46552 | college |
| ZIP 48184 (Wayne, MI) | 123227 | retirement |
| ZIP 02535 (Chilmark, MA) | 32454 | life |
| ZIP 15746 (Hillsdale, PA) | 34060 | life |
| ZIP 88008 (Santa Teresa, NM) | 50212 | college |
| ZIP 39217 (Jackson, MS) | 26521 | life |
| ZIP 97540 (Talent, OR) | 46696 | retirement |
| ZIP 99169 (Ritzville, WA) | 49698 | college |
| ZIP 32955 (Rockledge, FL) | 164311 | retirement |
| ZIP 04360 (Vienna, ME) | 35425 | retirement |
| ZIP 31408 (Savannah, GA) | 153679 | life |
| ZIP 97503 (White City, OR) | 73316 | life |
| ZIP 61448 (Knoxville, IL) | 29695 | retirement |
| ZIP 76177 (Fort Worth, TX) | 193817 | life |
| ZIP 37324 (Decherd, TN) | 63595 | college |
| ZIP 47923 (Brookston, IN) | 44056 | life |
| ZIP 15663 (Madison, PA) | 37042 | retirement |
| ZIP 95120 (San Jose, CA) | 102039 | retirement |
| ZIP 27012 (Clemmons, NC) | 113884 | retirement |
| ZIP 24620 (Hurley, VA) | 36325 | college |
| ZIP 46617 (South Bend, IN) | 76865 | life |
| ZIP 28520 (Cedar Island, NC) | 21887 | college |
| ZIP 07444 (Pompton Plains, NJ) | 84611 | retirement |
| ZIP 57031 (Gayville, SD) | 23241 | life |
| ZIP 37118 (Milton, TN) | 25822 | retirement |
| ZIP 05252 (East Arlington, VT) | 23640 | retirement |
| ZIP 07601 (Hackensack, NJ) | 463202 | life |
| ZIP 12928 (Crown Point, NY) | 31085 | retirement |
| ZIP 18818 (Friendsville, PA) | 31768 | retirement |
| ZIP 59936 (West Glacier, MT) | 31508 | life |
| ZIP 98037 (Lynnwood, WA) | 168494 | life |
| ZIP 25245 (Given, WV) | 34714 | retirement |
| ZIP 96712 (Haleiwa, HI) | 50489 | life |
| ZIP 54160 (Potter, WI) | 32849 | life |
| ZIP 77360 (Onalaska, TX) | 40400 | retirement |
| ZIP 54495 (Wisconsin Rapids, WI) | 66623 | college |
| ZIP 30559 (Mineral Bluff, GA) | 27135 | life |
| ZIP 35592 (Vernon, AL) | 50156 | life |
| ZIP 41831 (Leburn, KY) | 34033 | retirement |
| ZIP 82639 (Kaycee, WY) | 32708 | college |
| ZIP 20732 (Chesapeake Beach, MD) | 36165 | retirement |
| ZIP 67837 (Copeland, KS) | 33355 | life |
| ZIP 16435 (Springboro, PA) | 37128 | life |
| ZIP 17870 (Selinsgrove, PA) | 86302 | life |
| ZIP 77454 (Lissie, TX) | 33712 | life |
| ZIP 54947 (Larsen, WI) | 35079 | retirement |
| ZIP 61475 (Sciota, IL) | 25766 | retirement |
| ZIP 94572 (Rodeo, CA) | 42240 | college |
| ZIP 64643 (Hale, MO) | 23710 | college |
| ZIP 70466 (Tickfaw, LA) | 36292 | life |
| ZIP 92887 (Yorba Linda, CA) | 121839 | retirement |
| ZIP 45389 (Christiansburg, OH) | 27194 | college |
| ZIP 58538 (Fort Yates, ND) | 50303 | college |
| ZIP 18829 (Le Raysville, PA) | 36727 | retirement |
| ZIP 60443 (Matteson, IL) | 85899 | retirement |
| ZIP 21778 (Rocky Ridge, MD) | 38034 | retirement |
| ZIP 05251 (Dorset, VT) | 41918 | retirement |
| ZIP 32216 (Jacksonville, FL) | 343459 | life |
| ZIP 80465 (Morrison, CO) | 70866 | retirement |
| ZIP 67332 (Bartlett, KS) | 25024 | retirement |
| ZIP 82215 (Hartville, WY) | 32281 | life |
| ZIP 27863 (Pikeville, NC) | 40999 | college |
| ZIP 67732 (Brewster, KS) | 29529 | college |
| ZIP 91405 (Van Nuys, CA) | 177474 | life |
| ZIP 25121 (Lake, WV) | 37739 | retirement |
| ZIP 19009 (Bryn Athyn, PA) | 38188 | life |
| ZIP 93110 (Santa Barbara, CA) | 70880 | retirement |
| ZIP 40870 (Totz, KY) | 25584 | life |
| ZIP 27939 (Grandy, NC) | 42647 | college |
| ZIP 63033 (Florissant, MO) | 142530 | college |
| ZIP 25241 (Evans, WV) | 32792 | retirement |
| ZIP 68532 (Lincoln, NE) | 37814 | retirement |
ID-Geography Mapping Data
# display first 100 rows
kable(head(data_mapping, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")
| Id2 | geographic_area_name |
|---|---|
| 1001 | ZIP 01001 (Agawam, MA) |
| 1002 | ZIP 01002 (Amherst, MA) |
| 1003 | ZIP 01003 (Amherst, MA) |
| 1005 | ZIP 01005 (Barre, MA) |
| 1007 | ZIP 01007 (Belchertown, MA) |
| 1008 | ZIP 01008 (Blandford, MA) |
| 1009 | ZIP 01009 (Bondsville, MA) |
| 1010 | ZIP 01010 (Brimfield, MA) |
| 1011 | ZIP 01011 (Chester, MA) |
| 1012 | ZIP 01012 (Chesterfield, MA) |
| 1013 | ZIP 01013 (Chicopee, MA) |
| 1020 | ZIP 01020 (Chicopee, MA) |
| 1022 | ZIP 01022 (Chicopee, MA) |
| 1026 | ZIP 01026 (Cummington, MA) |
| 1027 | ZIP 01027 (Easthampton, MA) |
| 1028 | ZIP 01028 (East Longmeadow, MA) |
| 1029 | ZIP 01029 (East Otis, MA) |
| 1030 | ZIP 01030 (Feeding Hills, MA) |
| 1031 | ZIP 01031 (Gilbertville, MA) |
| 1032 | ZIP 01032 (Goshen, MA) |
| 1033 | ZIP 01033 (Granby, MA) |
| 1034 | ZIP 01034 (Granville, MA) |
| 1035 | ZIP 01035 (Hadley, MA) |
| 1036 | ZIP 01036 (Hampden, MA) |
| 1037 | ZIP 01037 (Hardwick, MA) |
| 1038 | ZIP 01038 (Hatfield, MA) |
| 1039 | ZIP 01039 (Haydenville, MA) |
| 1040 | ZIP 01040 (Holyoke, MA) |
| 1050 | ZIP 01050 (Huntington, MA) |
| 1053 | ZIP 01053 (Leeds, MA) |
| 1054 | ZIP 01054 (Leverett, MA) |
| 1056 | ZIP 01056 (Ludlow, MA) |
| 1057 | ZIP 01057 (Monson, MA) |
| 1060 | ZIP 01060 (Northampton, MA) |
| 1062 | ZIP 01062 (Florence, MA) |
| 1068 | ZIP 01068 (Oakham, MA) |
| 1069 | ZIP 01069 (Palmer, MA) |
| 1070 | ZIP 01070 (Plainfield, MA) |
| 1071 | ZIP 01071 (Russell, MA) |
| 1072 | ZIP 01072 (Shutesbury, MA) |
| 1073 | ZIP 01073 (Southampton, MA) |
| 1074 | ZIP 01074 (South Barre, MA) |
| 1075 | ZIP 01075 (South Hadley, MA) |
| 1077 | ZIP 01077 (Southwick, MA) |
| 1079 | ZIP 01079 (Thorndike, MA) |
| 1080 | ZIP 01080 (Three Rivers, MA) |
| 1081 | ZIP 01081 (Wales, MA) |
| 1082 | ZIP 01082 (Ware, MA) |
| 1083 | ZIP 01083 (Warren, MA) |
| 1084 | ZIP 01084 (West Chesterfield, MA) |
| 1085 | ZIP 01085 (Westfield, MA) |
| 1086 | ZIP 01086 (Westfield, MA) |
| 1088 | ZIP 01088 (West Hatfield, MA) |
| 1089 | ZIP 01089 (West Springfield, MA) |
| 1092 | ZIP 01092 (West Warren, MA) |
| 1093 | ZIP 01093 (Whately, MA) |
| 1095 | ZIP 01095 (Wilbraham, MA) |
| 1096 | ZIP 01096 (Williamsburg, MA) |
| 1098 | ZIP 01098 (Worthington, MA) |
| 1103 | ZIP 01103 (Springfield, MA) |
| 1104 | ZIP 01104 (Springfield, MA) |
| 1105 | ZIP 01105 (Springfield, MA) |
| 1106 | ZIP 01106 (Longmeadow, MA) |
| 1107 | ZIP 01107 (Springfield, MA) |
| 1108 | ZIP 01108 (Springfield, MA) |
| 1109 | ZIP 01109 (Springfield, MA) |
| 1118 | ZIP 01118 (Springfield, MA) |
| 1119 | ZIP 01119 (Springfield, MA) |
| 1128 | ZIP 01128 (Springfield, MA) |
| 1129 | ZIP 01129 (Springfield, MA) |
| 1151 | ZIP 01151 (Indian Orchard, MA) |
| 1201 | ZIP 01201 (Pittsfield, MA) |
| 1220 | ZIP 01220 (Adams, MA) |
| 1222 | ZIP 01222 (Ashley Falls, MA) |
| 1223 | ZIP 01223 (Becket, MA) |
| 1224 | ZIP 01224 (Berkshire, MA) |
| 1225 | ZIP 01225 (Cheshire, MA) |
| 1226 | ZIP 01226 (Dalton, MA) |
| 1229 | ZIP 01229 (Glendale, MA) |
| 1230 | ZIP 01230 (Great Barrington, MA) |
| 1235 | ZIP 01235 (Hinsdale, MA) |
| 1236 | ZIP 01236 (Housatonic, MA) |
| 1237 | ZIP 01237 (Lanesboro, MA) |
| 1238 | ZIP 01238 (Lee, MA) |
| 1240 | ZIP 01240 (Lenox, MA) |
| 1242 | ZIP 01242 (Lenox Dale, MA) |
| 1243 | ZIP 01243 (Middlefield, MA) |
| 1245 | ZIP 01245 (Monterey, MA) |
| 1247 | ZIP 01247 (North Adams, MA) |
| 1253 | ZIP 01253 (Otis, MA) |
| 1254 | ZIP 01254 (Richmond, MA) |
| 1255 | ZIP 01255 (Sandisfield, MA) |
| 1256 | ZIP 01256 (Savoy, MA) |
| 1257 | ZIP 01257 (Sheffield, MA) |
| 1258 | ZIP 01258 (South Egremont, MA) |
| 1259 | ZIP 01259 (Southfield, MA) |
| 1260 | ZIP 01260 (South Lee, MA) |
| 1262 | ZIP 01262 (Stockbridge, MA) |
| 1266 | ZIP 01266 (West Stockbridge, MA) |
| 1267 | ZIP 01267 (Williamstown, MA) |
Missing Value Table
# all column indices except ID column
index <- 2:ncol(data)
# convert above columns to numeric
data[index] <- lapply(data[index], as.numeric)
# calculate missing values
na_table <-
map_dbl(data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()
# rename column
colnames(na_table) <- c("total_missing")
# display missing value table
datatable(na_table)
Removing rows with 8 or more missing columns
# index for row having 8 or more missing columns
index <- apply(data, 1, function(x) sum(is.na(x)) >= 8)
# removing above rows
data <- data[!index,]
# calculate missing values
na_table <-
map_dbl(data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()
# rename column
colnames(na_table) <- c("total_missing")
# display missing value table
datatable(na_table)
KNN-Imputation of median ages of males and females
###### 1. male_median_age ######
# data with output variable and variables on which knn needs to be trained
tr <- na.omit(select(data, tot_population, tot_male_population,
perc_male_under_5yrs:perc_male_over_60yrs, male_median_age))
# data with variables on which knn needs to be trained
tr1 <- select(tr, tot_population, tot_male_population,
perc_male_under_5yrs:perc_male_over_60yrs)
# missing data: data on which knn would provide predictions
te1 <-
data %>%
filter(is.na(male_median_age)) %>%
select(tot_population, tot_male_population,
perc_male_under_5yrs:perc_male_over_60yrs)
# output variable
output_tr <- select(tr, male_median_age)$male_median_age
# training data for knn without output variable
train <- as.data.frame(tr1)
# test data to be imputed using knn
test <- as.data.frame(te1)
# output variable
output_var_train <- as.numeric(output_tr)
# perform k-nearest neighbour with k=20
knn_prediction_male <- knn.reg(train=train, test=test, y=output_var_train, k = 20)
# impute missing values with above algorithm
data[is.na(data$male_median_age),]$male_median_age <- knn_prediction_male$pred
###### 2. female_median_age ######
# data with output variable and variables on which knn needs to be trained
tr <- na.omit(select(data, tot_population, tot_female_population,
perc_female_under_5yrs:perc_female_over_60yrs, female_median_age))
# data with variables on which knn needs to be trained
tr1 <- select(tr, tot_population, tot_female_population,
perc_female_under_5yrs:perc_female_over_60yrs)
# missing data: data on which knn would provide predictions
te1 <-
data %>%
filter(is.na(female_median_age)) %>%
select(tot_population, tot_female_population,
perc_female_under_5yrs:perc_female_over_60yrs)
# output variable
output_tr <- select(tr, female_median_age)$female_median_age
# training data for knn without output variable
train <- as.data.frame(tr1)
# test data to be imputed using knn
test <- as.data.frame(te1)
# output variable
output_var_train <- as.numeric(output_tr)
# perform k-nearest neighbour with k=20
knn_prediction_female <- knn.reg(train=train, test=test, y=output_var_train, k = 20)
# impute missing values with above algorithm
data[is.na(data$female_median_age),]$female_median_age <- knn_prediction_female$pred
# calculate missing values
na_table <-
map_dbl(data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()
# rename column
colnames(na_table) <- c("total_missing")
# display missing value table
datatable(na_table)
Summary Table
# summary statistics for numerical variables
summary <- data.frame()
for(i in 2:ncol(data))
{
name = colnames(data)[i]
min = min(data[,i], na.rm=TRUE) %>% round(3)
mean = mean(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
median = median(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
max = max(data[,i], na.rm=TRUE) %>% round(3) %>% round(3)
count = sum(!is.na(data[,i]))
df = data.frame(name, min, mean=mean, median=median, max, count)
summary <- rbind(summary, df)
}
# display summary table
datatable(summary)
Minimum population greater than equal to 100
# filtering on population
data <-
data %>%
filter(tot_population >=100)
# summary statistics for numerical variables
summary <- data.frame()
for(i in 2:ncol(data))
{
name = colnames(data)[i]
min = min(data[,i], na.rm=TRUE) %>% round(3)
mean = mean(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
median = median(data[,i,drop=TRUE], na.rm=TRUE) %>% round(3)
max = max(data[,i], na.rm=TRUE) %>% round(3) %>% round(3)
count = sum(!is.na(data[,i]))
df = data.frame(name, min, mean=mean, median=median, max, count)
summary <- rbind(summary, df)
}
# display summary table
datatable(summary)
Joining Response and Predictor Variables
# joining response and predictor variables
data <-
data %>%
left_join(data_mapping, by = "Id2") %>%
left_join(data_response, by = "geographic_area_name")
# converting popular product to factor
data$popular_product <- as.factor(data$popular_product)
Univariate Analysis-I
Following variables are heavily right-skewed:
Following variables are slighly right-skewed:
Following variables are normally distributed:
edaPlot <- function(value, name)
{
hist <- ggplot(data, aes(x=value)) +
geom_histogram() +
ggtitle(name)
hist
}
p1 <- edaPlot(data$tot_population,"tot_population")
p2 <- edaPlot(data$tot_male_population,"tot_male_population")
p3 <- edaPlot(data$tot_female_population,"tot_female_population")
p4 <- edaPlot(data$male_median_age,"male_median_age")
p5 <- edaPlot(data$female_median_age,"female_median_age")
p6 <- edaPlot(data$no_establishments,"no_establishments")
p7 <- edaPlot(data$paid_employees,"paid_employees")
p8 <- edaPlot(data$payroll_quarter1,"payroll_quarter1")
p9 <- edaPlot(data$annual_payroll,"annual_payroll")
p10 <- edaPlot(data$revenue,"revenue")
Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 2)
Univariate Analysis-II
Most of the zipcodes have males:
p1 <- edaPlot(data$perc_male_under_5yrs,"perc_male_under_5yrs")
p2 <- edaPlot(data$perc_male_5_14yrs,"perc_male_5_14yrs")
p3 <- edaPlot(data$perc_male_15_44yrs,"perc_male_15_44yrs")
p4 <- edaPlot(data$perc_male_45_49yrs,"perc_male_45_49yrs")
p5 <- edaPlot(data$perc_male_50_54yrs,"perc_male_50_54yrs")
p6 <- edaPlot(data$perc_male_55_59yrs,"perc_male_55_59yrs")
p7 <- edaPlot(data$perc_male_over_60yrs,"perc_male_over_60yrs")
Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)
Univariate Analysis-III
Most of the zipcodes have females:
p1 <- edaPlot(data$perc_female_under_5yrs,"perc_female_under_5yrs")
p2 <- edaPlot(data$perc_female_5_14yrs,"perc_female_5_14yrs")
p3 <- edaPlot(data$perc_female_15_44yrs,"perc_female_15_44yrs")
p4 <- edaPlot(data$perc_female_45_49yrs,"perc_female_45_49yrs")
p5 <- edaPlot(data$perc_female_50_54yrs,"perc_female_50_54yrs")
p6 <- edaPlot(data$perc_female_55_59yrs,"perc_female_55_59yrs")
p7 <- edaPlot(data$perc_female_over_60yrs,"perc_female_over_60yrs")
Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)
Univariate Analysis-IV
Popular Products are equally distributed among college, life and retirement.
ggplot(data, aes(x=popular_product)) +
geom_bar()
Revenue Bivariate Analysis-I
Note: See the last row only
Revenue has high positive correlation with:
df1 <- select(data, tot_population:tot_female_population
, male_median_age:annual_payroll, revenue)
pairs(df1)
Note: See the last column only
The above inference is verified by the correlation plot
ggcorr(df1, label = TRUE, angle = 20)
Revenue Bivariate Analysis-II
Note: See the last row only
There does not seem any significant correlations.
df2 <- select(data, perc_male_under_5yrs:perc_male_over_60yrs, revenue)
pairs(df2)
Note: See the last column only
Observations:
ggcorr(df2, label = TRUE, angle = 20)
Revenue Bivariate Analysis-III
Note: See the last row only
There does not seem any significant correlations.
df3 <- select(data, perc_female_under_5yrs:perc_female_over_60yrs, revenue)
pairs(df3)
Observations:
ggcorr(df3, label = TRUE, angle = 20)
Popularity Bivariate Analysis-I
Variables which have separation in distribution of Popular Products:
distribution_plot <- function(x, name)
{
ggplot(data, aes(x=x, fill=popular_product)) +
geom_histogram(position = 'dodge', aes(y=..density..)) +
labs(x=name)
}
p1 <- distribution_plot(data$tot_population,"tot_population")
p2 <- distribution_plot(data$tot_male_population,"tot_male_population")
p3 <- distribution_plot(data$tot_female_population,"tot_female_population")
p4 <- distribution_plot(data$male_median_age,"male_median_age")
p5 <- distribution_plot(data$female_median_age,"female_median_age")
p6 <- distribution_plot(data$no_establishments,"no_establishments")
p7 <- distribution_plot(data$paid_employees,"paid_employees")
p8 <- distribution_plot(data$payroll_quarter1,"payroll_quarter1")
p9 <- distribution_plot(data$annual_payroll,"annual_payroll")
p10 <- distribution_plot(data$revenue,"revenue")
Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 2)
Popularity Bivariate Analysis-II
Variables which have separation in distribution of Popular Products:
p1 <- distribution_plot(data$perc_male_under_5yrs,"perc_male_under_5yrs")
p2 <- distribution_plot(data$perc_male_5_14yrs,"perc_male_5_14yrs")
p3 <- distribution_plot(data$perc_male_15_44yrs,"perc_male_15_44yrs")
p4 <- distribution_plot(data$perc_male_45_49yrs,"perc_male_45_49yrs")
p5 <- distribution_plot(data$perc_male_50_54yrs,"perc_male_50_54yrs")
p6 <- distribution_plot(data$perc_male_55_59yrs,"perc_male_55_59yrs")
p7 <- distribution_plot(data$perc_male_over_60yrs,"perc_male_over_60yrs")
Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)
Popularity Bivariate Analysis-III
Variables which have separation in distribution of Popular Products:
p1 <- distribution_plot(data$perc_female_under_5yrs,"perc_female_under_5yrs")
p2 <- distribution_plot(data$perc_female_5_14yrs,"perc_female_5_14yrs")
p3 <- distribution_plot(data$perc_female_15_44yrs,"perc_female_15_44yrs")
p4 <- distribution_plot(data$perc_female_45_49yrs,"perc_female_45_49yrs")
p5 <- distribution_plot(data$perc_female_50_54yrs,"perc_female_50_54yrs")
p6 <- distribution_plot(data$perc_female_55_59yrs,"perc_female_55_59yrs")
p7 <- distribution_plot(data$perc_female_over_60yrs,"perc_female_over_60yrs")
Rmisc::multiplot(p1, p2, p3, p4, p5, p6, p7, cols = 2)
Dividing data into train and test set
# 80-20 train vs test split
set.seed(1)
index<-sample(nrow(data),nrow(data)*0.8)
train<-data[index,]
test<-data[-index,]
Building the model to predict Revenue
# Null Model - Regress square feet on only the intercept
nullmodel=lm(revenue~1, data=train)
# Full Model - Regress square feet on all predictor variables
fullmodel=lm(revenue ~ .-Id2-geographic_area_name-popular_product
, data=train)
# Final Model built using stepwise variable selection (BIC)
model_revenue <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
direction = c("forward"), k=log(nrow(train)))
Model Summary
The adjusted R square of the model is ~99%
summary(model_revenue)
##
## Call:
## lm(formula = revenue ~ paid_employees + payroll_quarter1 + no_establishments +
## annual_payroll + tot_male_population + perc_male_5_14yrs +
## perc_female_15_44yrs + perc_female_over_60yrs + perc_male_15_44yrs +
## perc_male_over_60yrs, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106003 -3512 160 3656 129563
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 21666.572597 1139.057724 19.021
## paid_employees 0.537635 0.049095 10.951
## payroll_quarter1 0.039493 0.003333 11.849
## no_establishments 105.430307 0.669318 157.519
## annual_payroll 0.068797 0.001306 52.694
## tot_male_population -0.249488 0.022758 -10.963
## perc_male_5_14yrs -52.203677 22.976294 -2.272
## perc_female_15_44yrs 81.041546 15.869788 5.107
## perc_female_over_60yrs 66.357749 16.549801 4.010
## perc_male_15_44yrs 84.371563 16.939855 4.981
## perc_male_over_60yrs 69.693437 19.040925 3.660
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## paid_employees < 0.0000000000000002 ***
## payroll_quarter1 < 0.0000000000000002 ***
## no_establishments < 0.0000000000000002 ***
## annual_payroll < 0.0000000000000002 ***
## tot_male_population < 0.0000000000000002 ***
## perc_male_5_14yrs 0.023122 *
## perc_female_15_44yrs 0.000000339 ***
## perc_female_over_60yrs 0.000061655 ***
## perc_male_15_44yrs 0.000000654 ***
## perc_male_over_60yrs 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6940 on 5353 degrees of freedom
## Multiple R-squared: 0.9946, Adjusted R-squared: 0.9946
## F-statistic: 9.808e+04 on 10 and 5353 DF, p-value: < 0.00000000000000022
Calculating Mean Square Error on Train and Test Set
Train and Test MSEs are comparable, thus there seems to be no overfitting
# prediction on train set
linear_regression_pred_train <- predict(model_revenue)
# prediction on test set
linear_regression_pred_test <- predict(model_revenue,
newdata=test)
# train MSE
linear_regression_train_mse <- mean((linear_regression_pred_train-train$revenue)^2)
# test MSE
linear_regression_test_mse <- mean((linear_regression_pred_test-test$revenue)^2)
cat("Train MSE:", linear_regression_train_mse, "| Test MSE:", linear_regression_test_mse)
## Train MSE: 48065318 | Test MSE: 45795763
Building the model to predict the popularity of product
# Multinomial Logistics Regression
model_pop_product <- multinom(popular_product ~
male_median_age
+female_median_age
+tot_population
+tot_male_population
+tot_female_population
+perc_male_under_5yrs
+perc_male_15_44yrs
+perc_male_45_49yrs
+perc_male_over_60yrs
+perc_female_under_5yrs
+perc_female_15_44yrs
+perc_female_45_49yrs
+perc_female_over_60yrs
, select(train, -geographic_area_name, -Id2, -revenue), maxit=200)
# performing forward selection (AIC)
model_pop_product_step <- step(model_pop_product)
Model Summary
summary(model_pop_product_step)
## Call:
## multinom(formula = popular_product ~ male_median_age + tot_population +
## tot_male_population + tot_female_population + perc_male_under_5yrs +
## perc_male_15_44yrs + perc_male_45_49yrs + perc_female_15_44yrs +
## perc_female_45_49yrs + perc_female_over_60yrs, data = select(train,
## -geographic_area_name, -Id2, -revenue), maxit = 200)
##
## Coefficients:
## (Intercept) male_median_age tot_population tot_male_population
## life -63.96765 -0.1192173 0.000007577717 -0.0002616518
## retirement -10.82925 -0.1518854 0.000009655579 -0.0003251257
## tot_female_population perc_male_under_5yrs perc_male_15_44yrs
## life 0.0002682052 -9.712368 3.24700966
## retirement 0.0003353229 -9.637377 0.05265874
## perc_male_45_49yrs perc_female_15_44yrs perc_female_45_49yrs
## life -0.0580214 0.08844885 0.04047339
## retirement 10.0168383 0.07618278 -0.11984784
## perc_female_over_60yrs
## life 0.08778008
## retirement 0.07599648
##
## Std. Errors:
## (Intercept) male_median_age tot_population
## life 0.00007243927 0.01532116 0.000007927697
## retirement 0.00004870983 0.01328432 0.000009170222
## tot_male_population tot_female_population perc_male_under_5yrs
## life 0.0002276797 0.0002232655 0.004143988
## retirement 0.0002778930 0.0002720334 0.004521807
## perc_male_15_44yrs perc_male_45_49yrs perc_female_15_44yrs
## life 0.01308191 0.005274154 0.01542477
## retirement 0.01605085 0.002400195 0.01779370
## perc_female_45_49yrs perc_female_over_60yrs
## life 0.008113711 0.01753665
## retirement 0.004736040 0.01402347
##
## Residual Deviance: 599.6059
## AIC: 639.6059
Confusion matrix on train set
Accuracy Rate: 98.84%
conf_matrix_train <-
table(train$popular_product,predict(model_pop_product))
conf_matrix_train
##
## college life retirement
## college 1777 11 14
## life 7 1660 7
## retirement 19 4 1865
Confusion matrix on test set
Accuracy Rate: 97.83%
conf_matrix_test <-
table(test$popular_product,predict(model_pop_product, newdata = test))
conf_matrix_test
##
## college life retirement
## college 415 1 9
## life 3 424 0
## retirement 11 5 474
Using a Q-Q Plot, errors seem to be normally distributed. However, both the tails seem to be heavily skewed due to outliers.
# Constructing a dataframe containing model attributes
model_attributes1 <-
data.frame(index=1:nrow(train),
residuals = model_revenue$residuals,
fitted_values = model_revenue$fitted.values)
# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')
There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors are uncorrelated.
# Plotting Residuals over Time
model_attributes1 %>%
ggplot(aes(x=index,y=residuals)) +
geom_point()
We can clearly see that the residuals are constantly varied across the majority of the fitted values. However, for very large fitted values, the variation seems to fan out. If we remove the outliers, we would remove the fan shaped variance as well.
# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_hline(yintercept = 3, color = "blue") +
geom_hline(yintercept = -3, color = "blue")
Variation Inflation Factor is high for a few variables like annual_payroll, payroll_quarter1 and paid_employees. We will need to remove some of these to reduce multi-collinearity
vif(model_revenue)
## paid_employees payroll_quarter1 no_establishments
## 17.593212 52.967413 8.692075
## annual_payroll tot_male_population perc_male_5_14yrs
## 84.343784 3.042816 1.653595
## perc_female_15_44yrs perc_female_over_60yrs perc_male_15_44yrs
## 2.944912 3.355291 3.635962
## perc_male_over_60yrs
## 4.494583
Almost all standardized errors are below the absolute value of 5. However, there are a few observation that are over the absolute value of 5 and hence can be considered as influential outliers. In an ideal case, these should be removed.
# Plotting Studentized/Standardized Errors
rstan <- rstandard(model_revenue)
plot(rstan)
Removing Influential Outliers
# calculating cooks distance
cooksd <- cooks.distance(model_revenue)
# influential outliers row numbers
influential <- as.numeric(names(cooksd)[(cooksd > mean(cooksd, na.rm=T))])
# remove influential outliers
train <- train[-influential,]
Building the model to predict Revenue
# Null Model - Regress square feet on only the intercept
nullmodel=lm(revenue~1, data=train)
# Full Model - Regress square feet on all predictor variables except annual_payroll
fullmodel=lm(revenue ~ .-Id2-geographic_area_name-popular_product-annual_payroll-paid_employees
, data=train)
# Final Model built using stepwise variable selection (BIC)
model_revenue <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
direction = c("forward"), k=log(nrow(train)))
Model Summary
The adjusted R square of the model is ~99%
summary(model_revenue)
##
## Call:
## lm(formula = revenue ~ no_establishments + payroll_quarter1 +
## perc_male_5_14yrs + tot_population + perc_male_15_44yrs +
## perc_female_over_60yrs + perc_female_15_44yrs, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56485 -3607 107 3626 34374
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 24568.791530 835.112612 29.420
## no_establishments 113.863276 0.540313 210.736
## payroll_quarter1 0.306316 0.001617 189.485
## perc_male_5_14yrs -59.928591 18.092538 -3.312
## tot_population -0.060523 0.010521 -5.753
## perc_male_15_44yrs 58.524142 12.570366 4.656
## perc_female_over_60yrs 87.225537 12.370295 7.051
## perc_female_15_44yrs 54.477925 14.006055 3.890
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## no_establishments < 0.0000000000000002 ***
## payroll_quarter1 < 0.0000000000000002 ***
## perc_male_5_14yrs 0.000931 ***
## tot_population 0.000000009279 ***
## perc_male_15_44yrs 0.000003307055 ***
## perc_female_over_60yrs 0.000000000002 ***
## perc_female_15_44yrs 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6067 on 5312 degrees of freedom
## Multiple R-squared: 0.9913, Adjusted R-squared: 0.9912
## F-statistic: 8.605e+04 on 7 and 5312 DF, p-value: < 0.00000000000000022
Calculating Mean Square Error on Train and Test Set
Train and Test MSEs are comparable, thus there seems to be no overfitting
# prediction on train set
linear_regression_pred_train <- predict(model_revenue)
# prediction on test set
linear_regression_pred_test <- predict(model_revenue,
newdata=test)
# train MSE
linear_regression_train_mse <- mean((linear_regression_pred_train-train$revenue)^2)
# test MSE
linear_regression_test_mse <- mean((linear_regression_pred_test-test$revenue)^2)
cat("Train MSE:", linear_regression_train_mse, "| Test MSE:", linear_regression_test_mse)
## Train MSE: 36752185 | Test MSE: 321499258
Building the model to predict the popularity of product
# Multinomial Logistics Regression
model_pop_product <- multinom(popular_product ~
male_median_age
+female_median_age
+tot_population
+tot_male_population
+tot_female_population
+perc_male_under_5yrs
+perc_male_15_44yrs
+perc_male_45_49yrs
+perc_male_over_60yrs
+perc_female_under_5yrs
+perc_female_15_44yrs
+perc_female_45_49yrs
+perc_female_over_60yrs
, select(train, -geographic_area_name, -Id2, -revenue), maxit=200)
# performing forward selection (AIC)
model_pop_product_step <- step(model_pop_product)
Model Summary
summary(model_pop_product_step)
## Call:
## multinom(formula = popular_product ~ male_median_age + tot_population +
## tot_male_population + tot_female_population + perc_male_under_5yrs +
## perc_male_15_44yrs + perc_male_45_49yrs + perc_female_15_44yrs +
## perc_female_45_49yrs + perc_female_over_60yrs, data = select(train,
## -geographic_area_name, -Id2, -revenue), maxit = 200)
##
## Coefficients:
## (Intercept) male_median_age tot_population tot_male_population
## life -63.97387 -0.1186000 0.000007263478 -0.0002651740
## retirement -10.77528 -0.1520542 0.000010230867 -0.0003063879
## tot_female_population perc_male_under_5yrs perc_male_15_44yrs
## life 0.0002721347 -9.720721 3.2471242
## retirement 0.0003166178 -9.650487 0.0560999
## perc_male_45_49yrs perc_female_15_44yrs perc_female_45_49yrs
## life -0.05536281 0.09030834 0.03580886
## retirement 9.99796531 0.07939600 -0.13179365
## perc_female_over_60yrs
## life 0.0871428
## retirement 0.0761664
##
## Std. Errors:
## (Intercept) male_median_age tot_population
## life 0.00007348353 0.01536158 0.000007966623
## retirement 0.00004912882 0.01327730 0.000009400856
## tot_male_population tot_female_population perc_male_under_5yrs
## life 0.0002295425 0.0002250732 0.004113861
## retirement 0.0002871352 0.0002809541 0.004432349
## perc_male_15_44yrs perc_male_45_49yrs perc_female_15_44yrs
## life 0.01312008 0.005267846 0.01544974
## retirement 0.01604569 0.002397547 0.01782996
## perc_female_45_49yrs perc_female_over_60yrs
## life 0.008229996 0.01756903
## retirement 0.004837625 0.01399654
##
## Residual Deviance: 596.7282
## AIC: 636.7282
Confusion matrix on train set
Accuracy Rate: 97.76%
conf_matrix_train <-
table(train$popular_product,predict(model_pop_product))
conf_matrix_train
##
## college life retirement
## college 1772 11 12
## life 7 1634 7
## retirement 18 5 1854
Confusion matrix on test set
Accuracy Rate: 97.83%
conf_matrix_test <-
table(test$popular_product,predict(model_pop_product, newdata = test))
conf_matrix_test
##
## college life retirement
## college 414 2 9
## life 3 424 0
## retirement 11 5 474
Using a Q-Q Plot, errors seem to be normally distributed.
# Constructing a dataframe containing model attributes
model_attributes1 <-
data.frame(index=1:nrow(train),
residuals = model_revenue$residuals,
fitted_values = model_revenue$fitted.values)
# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')
There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors are uncorrelated.
# Plotting Residuals over Time
model_attributes1 %>%
ggplot(aes(x=index,y=residuals)) +
geom_point()
We can clearly see that the residuals are constantly varied across the fitted values.
# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_hline(yintercept = 3, color = "blue") +
geom_hline(yintercept = -3, color = "blue")
Variation Inflation Factor <10 for all variables. Thus their is no multi-collinearity.
vif(model_revenue)
## no_establishments payroll_quarter1 perc_male_5_14yrs
## 5.844737 2.785597 1.318786
## tot_population perc_male_15_44yrs perc_female_over_60yrs
## 3.459487 2.494747 2.396186
## perc_female_15_44yrs
## 2.850943
Almost all standardized errors are below the absolute value of 5.
# Plotting Studentized/Standardized Errors
rstan <- rstandard(model_revenue)
plot(rstan)
Linear Regression - Predicting Revenue
Revenue = output variable
All held constant, with 1 unit increase in no_establishments, the average revenue increases by $113.86.
A more concrete way of elaborating the above point would be: All held constant, we are 95 % confident that with 1 unit increase in no_establishments, the average revenue increases by $112.78 - $114.94.
All other predictor variables can be interpretted in the same way.
Multinomial Logistic Regression
AIC - which a measure of the quality of the model - is 636.72
Accuracy Rate on test set i.e. percentage of popular products correctly classified are 97.83%
Revenue generated in each zipcode is predicted using a linear regression model.
We are getting an almost perfect model (99% Adjusted R-Square) for predicting the revenue generated in each zipcode. Thus, we can very accurately predict the same and take actions to increase/decrease the revenue in selected zip codes.
MSE(Train): 36752185 | MSE(Test): 321499258
Popular product in each zipcode is predicted using a multinomial logistic regression model.
Model AIC: 636.7282