Data Analysis On Zip Codes

class: center, middle, inverse, title-slide

.title[
# Data Analysis On Zip Codes 
]
.subtitle[
## Three way sample test 
]
.author[
### Yuanqi
]

---

# <center>Data</center>

```r
loan01 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational09.csv", header = TRUE)[, -1]
bankLoan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)
```
---
# <center>Data</center>
This is taking a sample of 1000 from the whole population of 899164. For this procedure, we are only taking a subset of 1000 Zip codes randomly, each with the same probability of being chosen. In the Sample, we have a variance of 30.

```r
Zip =as.character(bankLoan$Zip)  # make a character vector
N=length(Zip)                      # find the size of the data. 
f.table = -sort(-table(Zip))       # sort the vector in descending order
n = length(f.table)                       # find the number of distinct industries
kable(cbind(Population.size = N))
```

| Population.size|
|---------------:|
| 899164|
---
# <center>Data</center>
I first tried to extract the first two digits from the zipcodes.

```r
Zip.2.digits = substr(bankLoan$Zip, 1, 2)   # extract the first two digits of the Zip code
bankLoan$Zip2Digit = Zip.2.digits           # add the above two-digit variable the loan data
ftable = table(bankLoan$Zip2Digit)
kable(t(ftable))
```

| 0| 1| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 2| 20| 21| 22| 23| 24| 25| 26| 27| 28| 29| 3| 30| 31| 32| 33| 34| 35| 36| 37| 38| 39| 4| 40| 41| 42| 43| 44| 45| 46| 47| 48| 49| 5| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 6| 60| 61| 62| 63| 64| 65| 66| 67| 68| 69| 7| 70| 71| 72| 73| 74| 75| 76| 77| 78| 79| 8| 80| 81| 82| 83| 84| 85| 86| 87| 88| 89| 9| 90| 91| 92| 93| 94| 95| 96| 97| 98| 99|
|---:|--:|-----:|-----:|----:|----:|-----:|-----:|----:|----:|----:|-----:|--:|-----:|-----:|----:|----:|----:|----:|----:|----:|-----:|----:|--:|-----:|----:|-----:|-----:|----:|----:|----:|----:|----:|----:|--:|----:|----:|----:|-----:|-----:|----:|-----:|----:|-----:|----:|--:|----:|----:|----:|-----:|-----:|-----:|----:|----:|----:|----:|--:|-----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|--:|-----:|----:|----:|----:|----:|-----:|-----:|-----:|-----:|----:|--:|-----:|----:|----:|-----:|-----:|-----:|----:|----:|----:|----:|--:|-----:|-----:|-----:|-----:|-----:|-----:|----:|-----:|-----:|----:|
| 283| 24| 14759| 19211| 6749| 6397| 14310| 10221| 6293| 6181| 7530| 15093| 11| 10208| 13148| 5263| 6829| 2888| 1965| 2631| 8414| 13556| 8861| 5| 21577| 5178| 14010| 23899| 6252| 5748| 3143| 7174| 7549| 6829| 5| 6244| 2007| 2392| 10843| 14441| 8705| 11399| 3300| 13097| 8183| 5| 5993| 2395| 4656| 11261| 11936| 17969| 7381| 5128| 5808| 8778| 4| 23790| 5167| 4090| 8926| 8814| 8312| 6508| 6481| 7153| 1139| 6| 12487| 4003| 5524| 5510| 6082| 17706| 12580| 20879| 15928| 9346| 15| 20308| 3792| 3370| 10088| 18872| 16923| 2035| 4524| 5419| 8308| 24| 25034| 20052| 32356| 12858| 17673| 21038| 5010| 11083| 20000| 5832|
---
# <center>Data</center>
Afterwards, I grouped the first first 2 digits of the zipcodes into groups of 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, and 90s. The first two digits of the zipcoes represents a region in the U.S. For example if 19 is extracted fom 19446, then 19 would represent a zip code from a Northeastern state like Pennsylvania.

```r
cate.10.19=c("10","11","12","13","14","15","16","17","18","19") 
cate.20.29 = c("20","21","22","23","24","25","26","27","28","29")
cate.30.39 = c("30","31","32","33","34","35","36","37","38","39")
cate.40.49=c("40","41","42","43","44","45","46","47","48","49") 
cate.50.59=c("50","51","52","53","54","55","56","57","58","59") 
cate.60.69 = c("60","61","62","63","64","65","66","67","68","69")
cate.70.79 = c("70","71","72","73","74","75","76","77","78","79")
cate.80.89=c("80","81","82","83","84","85","86","87","88","89") 
cate.90.99 = c("90","91","92","93","94","95","96","97","98","99")

Zip2Digit0 = bankLoan$Zip2Digit                   # extract the 2-digit Zip
Zip2Digit =  bankLoan$Zip2Digit                   # extract the 2-digit Zip-copy
logic.10.19=Zip2Digit %in% cate.10.19           # identify the three categories: 31, 32, 33.
Zip2Digit[logic.10.19] = 109                      # replace 31, 32, 33 with 313
logic.20.29=Zip2Digit %in% cate.20.29           # identify the three categories: 44 and 45.
Zip2Digit[logic.20.29] = 209
logic.30.39=Zip2Digit %in% cate.30.39           # identify the three categories: 48 and 49.
Zip2Digit[logic.30.39] = 309
logic.40.49=Zip2Digit %in% cate.40.49           
Zip2Digit[logic.40.49] = 409                      
logic.50.59=Zip2Digit %in% cate.50.59           
Zip2Digit[logic.50.59] = 509
logic.60.69=Zip2Digit %in% cate.60.69           
Zip2Digit[logic.60.69] = 609
logic.70.79=Zip2Digit %in% cate.70.79           
Zip2Digit[logic.70.79] = 709                      
logic.80.89=Zip2Digit %in% cate.80.89           
Zip2Digit[logic.80.89] = 809
logic.90.99=Zip2Digit %in% cate.90.99           
Zip2Digit[logic.90.99] = 909

bankLoan$strZip = Zip2Digit
```

---
# <center>Data</center>
We have defined a new population by deleting observations ranging 0 to 9 because some states states do not have a zip code or the size is relatively small in order to get the new smaller population.

```r
del.categories = c("0", "1", "2","3","4","5","6","7","8", "9")       # categories to be deleted in 
                                                      # the original population
del.obs.status = !(bankLoan$strZip %in% del.categories) # deletion status. ! negation operator
study.pop = bankLoan[del.obs.status,]                     # excluding the categories
kable(t(table(study.pop$strZip)))  
```

|    109|   209|    309|   409|   509|   609|    709|   809|    909|
|------:|-----:|------:|-----:|-----:|-----:|------:|-----:|------:|
| 106744| 73763| 101359| 80611| 81305| 80380| 110045| 93639| 170936|

---
# <center>sample</center>

```r
study.pop$sampling.frame = 1:length(study.pop$Zip)   # sampling list
names(study.pop)                                      # checking the sampling list variable
```

```
##  [1] "LoanNr_ChkDgt"     "Name"              "City"             
##  [4] "State"             "Zip"               "Bank"             
##  [7] "BankState"         "NAICS"             "ApprovalDate"     
## [10] "ApprovalFY"        "Term"              "NoEmp"            
## [13] "NewExist"          "CreateJob"         "RetainedJob"      
## [16] "FranchiseCode"     "UrbanRural"        "RevLineCr"        
## [19] "LowDoc"            "ChgOffDate"        "DisbursementDate" 
## [22] "DisbursementGross" "BalanceGross"      "MIS_Status"       
## [25] "ChgOffPrinGr"      "GrAppv"            "SBA_Appv"         
## [28] "Zip2Digit"         "strZip"            "sampling.frame"
```

```r
sampled.list = sample(1:length(study.pop$Zip), 1000) # sampling the list
SRS.sample = study.pop[sampled.list,]                   # extract the sampling units (observations)
## dimension check
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
kable(t(dimension.SRS))                                        # checking the sample size
```

| Size| Var.count|
|----:|---------:|
| 1000|        30|

---
# <center>Samples</center>
When this code is preforming systematic sampling where it is taking every 1th observation in the column for the variable Zip, the sample will consist of everything 1th observation in the column ranging from n=1 t0 n= 899164. 
The jump size is calculated to find the appropriate jump size from our population in order to obtain a sample of 1000 when we perform systematic sampling.

```r
jump.size = dim(study.pop)[1]%/%1000   # find the jump size in the systematic sampling
jump.size
```

```
## [1] 898
```

```r
rand.starting.pt=sample(1:jump.size,1) # find the random starting value
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)  # sampling IDs
length(sampling.id)
```

```
## [1] 1001
```

```r
sys.sample=study.pop[sampling.id,]    # extract the sampling units of systematic samples
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
```

| Size| Var.count|
|----:|---------:|
| 1001| 30|
---
# <center>samples</center>
Then, I ran a program to create a table for the stratified zipcodes. Later, this is creating a sample of 1000 by taking clusters of a particular number group in the population, which are listed below.

```r
freq.table = table(study.pop$strZip) # frequency table of strZip
rel.freq = freq.table/sum(freq.table) # relative frequency 
strata.size = round(rel.freq*1000) # strata size allocation
strata.names=names(strata.size) # extract strZip names for accuracy checking
```
---
# <center>Data</center>

```r
kable(t(strata.size))  # make a nice-looking table using kable().
```

| 109| 209| 309| 409| 509| 609| 709| 809| 909|
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 119| 82| 113| 90| 90| 89| 122| 104| 190|
---
# <center>sample</center>
The final code is taking 119 zip codes from the 109 group, 82 zip codes from the 209 group, 113 zip codes from the 309 group, 91 zip coes from the 409 group, 90 zip codes from the 509 group, 89 zip codes from the 609 group, 122 zip codes from the 809 group, and 190 zip codes from the 909 group to create a stratified sample of 1000.

```r
strata.sample = study.pop[1,]    # create a reference data frame
strata.sample$add.id = 1         # add a temporary ID to because in the loop
# i =2                           # testing a single iteration
for (i in 1:length(strata.names)){
   ith.strata.names = strata.names[i]   # extract data frame names
   ith.strata.size = strata.size[i]     # allocated stratum size
   # The following code identifies observations to be selected
   ith.sampling.id = which(study.pop$strZip==ith.strata.names) 
   ith.strata = study.pop[ith.sampling.id,]  # i-th stratified population
   ith.strata$add.id = 1:dim(ith.strata)[1]  # add sampling list/frame
   # The following code generates a subset of random ID
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) 
   ## Create a selection status -- pay attention to the operator: %in% 
   ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
   ## dim(ith.sample)         $ check the sample
   strata.sample = rbind(strata.sample, ith.sample)  # stack all data frame!
 }
 dim(strata.sample)
```

```
## [1] 1000   31
```

```r
 strat.sample.final = strata.sample[,-31]     # drop the temporary stratum ID
 kable(head(strat.sample.final))                    # accuracy check!
```

| | LoanNr_ChkDgt|Name |City |State | Zip|Bank |BankState | NAICS|ApprovalDate |ApprovalFY | Term| NoEmp| NewExist| CreateJob| RetainedJob| FranchiseCode| UrbanRural|RevLineCr |LowDoc |ChgOffDate |DisbursementDate |DisbursementGross |BalanceGross |MIS_Status |ChgOffPrinGr |GrAppv |SBA_Appv |Zip2Digit |strZip | sampling.frame|
|:-----|-------------:|:------------------------------|:----------|:-----|-----:|:-----------------------------|:---------|------:|:------------|:----------|----:|-----:|--------:|---------:|-----------:|-------------:|----------:|:---------|:------|:----------|:----------------|:-----------------|:------------|:----------|:------------|:-----------|:-----------|:---------|:------|--------------:|
|1 | 1000014003|ABC HOBBYCRAFT |EVANSVILLE |IN | 47711|FIFTH THIRD BANK |OH | 451120|28-Feb-97 |1997 | 84| 4| 2| 0| 0| 1| 0|N |Y | |28-Feb-99 |$60,000.00 |$0.00 |P I F |$0.00 |$60,000.00 |$48,000.00 |47 |409 | 1|
|398 | 1003035000|CICERO GAS & GROCERY INC |CICERO |NY | 13039|KEYBANK NATIONAL ASSOCIATION |OH | 0|17-Jul-98 |1998 | 84| 6| 1| 0| 0| 1| 0|N |N | |31-Jul-98 |$25,000.00 |$0.00 |P I F |$0.00 |$25,000.00 |$12,500.00 |13 |109 | 396|
|10809 | 1077154008|CARTER BROTHERS, INC. |CICERO |NY | 13039|HSBC BK USA NATL ASSOC |NY | 0|31-Mar-97 |1997 | 84| 20| 2| 0| 0| 1| 0|N |N | |8-May-97 |$90,000.00 |$0.00 |P I F |$0.00 |$90,000.00 |$72,000.00 |13 |109 | 10804|
|13691 | 1097904005|MARBLE,TILE,TERRAZZO & GRANITE |BROOKLYN |NY | 11217|GE CAP. SMALL BUS. FINAN CORP |TX | 0|4-Apr-97 |1997 | 241| 18| 1| 0| 0| 1| 0|N |N | |8-Jul-97 |$290,000.00 |$0.00 |P I F |$0.00 |$325,000.00 |$243,750.00 |11 |109 | 13685|
|28023 | 1210725009|R & R Garber Partnership |Boyertown |PA | 19512|METRO BANK |PA | 531120|28-Feb-05 |2005 | 182| 7| 1| 2| 7| 1| 2|N |N | |31-Mar-05 |$306,000.00 |$0.00 |P I F |$0.00 |$306,000.00 |$229,500.00 |19 |109 | 28013|
|30366 | 1229104004|MARIE I SOTO DBA SOTOMAR AGENC |BRONX |NY | 10459|HSBC BK USA NATL ASSOC |NY | 0|13-May-97 |1997 | 61| 3| 2| 0| 0| 1| 0|N |Y | |22-May-97 |$50,000.00 |$0.00 |P I F |$0.00 |$50,000.00 |$40,000.00 |10 |109 | 30356|
---
# <center>tables</center>
We have calculated the default rate across the states previously. We will use the population level rates to compare them with sample-level industry specific rates. For the table, the MIS Status enables us to see how many people who live in particular zip codes defaulted on their loans and how many did not default on their loans on the 3rd and 4th columns. The last column gives us the percentage of people living within a particular set of zip codes who defaulted on their loans.

```r
x.table = table(bankLoan$strZip, bankLoan$MIS_Status)
no.lab = x.table[,1]   # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
 default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
kable(default.status.rate, caption = "Population size, default counts, 
                                      and population default rates")
```

Table: Population size, default counts, 
                                      and population default rates

|    | no.lab| default| no.default| default.rate|
|:---|------:|-------:|----------:|------------:|
|0   |      0|     222|         61|         78.4|
|1   |      0|       1|         23|          4.2|
|109 |    758|   18366|      87620|         17.3|
|2   |      0|       4|          7|         36.4|
|209 |    273|   12726|      60764|         17.3|
|3   |      0|       1|          4|         20.0|
|309 |    139|   22512|      78708|         22.2|
|4   |      0|       0|          5|          0.0|
|409 |    236|   14465|      65910|         18.0|
|5   |      0|       0|          5|          0.0|
|509 |     87|    8542|      72676|         10.5|
|6   |      0|       1|          3|         25.0|
|609 |    218|   13756|      66406|         17.2|
|7   |      0|       1|          5|         16.7|
|709 |     83|   20702|      89260|         18.8|
|8   |      0|       6|          9|         40.0|
|809 |     49|   16347|      77243|         17.5|
|9   |      0|       4|         20|         16.7|
|909 |    154|   29902|     140880|         17.5|

---
# <center>Data</center>
This is a comparison between the Industry default rates between the population and SRS

```r
x.table = table(SRS.sample$strZip, SRS.sample$MIS_Status)
no.lab.srs = x.table[,1]      # first column consists of unknown default label
default.srs = x.table[,2]
no.default.srs = x.table[,3]
default.rate.srs = round(100*default.srs/(default.srs+no.default.srs),1)
##
industry.code = names(default.rate.srs)    # extract NSICS code
industry.name=c("upper North East","Lower North East",
                "Lower South East", "Upper Mid East", "Upper Middle",
                "Center of U.S","South of U.S", "Mid West",
                "West Coast")  # actual industry names!
default.rate.pop = default.rate[industry.code]
cbind(industry.code,industry.name)
```

```
##       industry.code industry.name     
##  [1,] "109"         "upper North East"
##  [2,] "209"         "Lower North East"
##  [3,] "309"         "Lower South East"
##  [4,] "409"         "Upper Mid East"  
##  [5,] "509"         "Upper Middle"    
##  [6,] "609"         "Center of U.S"   
##  [7,] "709"         "South of U.S"    
##  [8,] "809"         "Mid West"        
##  [9,] "909"         "West Coast"
```

```r
SRS.pop.rates = cbind(default.rate.pop,default.rate.srs)
rownames(SRS.pop.rates) = industry.name
kable(SRS.pop.rates, caption="Comparison of industry-specific default rates 
                               between population and the SRS.")
```

Table: Comparison of industry-specific default rates 
                               between population and the SRS.

| | default.rate.pop| default.rate.srs|
|:----------------|----------------:|----------------:|
|upper North East | 17.3| 14.8|
|Lower North East | 17.3| 20.4|
|Lower South East | 22.2| 23.2|
|Upper Mid East | 18.0| 16.7|
|Upper Middle | 10.5| 8.0|
|Center of U.S | 17.2| 17.2|
|South of U.S | 18.8| 15.3|
|Mid West | 17.5| 9.1|
|West Coast | 17.5| 19.5|
---
# <center>tables</center>
We will use sample stratification variable to find the industry-specific rates based on the systematic sample. The table will have rates of population, SRS, and systematic random samples.

```r
x.table = table(sys.sample$strZip, sys.sample$MIS_Status)
no.lab.sys = x.table[,1]      # first column consists of unknown default label
default.sys = x.table[,2]
no.default.sys = x.table[,3]
default.rate.sys = round(100*default.sys/(default.sys+no.default.sys),1)
sys.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys)
rownames(SRS.pop.rates) = industry.name
kable(sys.SRS.pop.rates, caption="Comparison of industry-specific default rates 
                               between population, SRS, and Systematic Sample.")
```

Table: Comparison of industry-specific default rates 
                               between population, SRS, and Systematic Sample.

|    | default.rate.pop| default.rate.srs| default.rate.sys|
|:---|----------------:|----------------:|----------------:|
|109 |             17.3|             14.8|             18.6|
|209 |             17.3|             20.4|             21.3|
|309 |             22.2|             23.2|             21.1|
|409 |             18.0|             16.7|             19.8|
|509 |             10.5|              8.0|              5.4|
|609 |             17.2|             17.2|             15.1|
|709 |             18.8|             15.3|             20.0|
|809 |             17.5|              9.1|             20.0|
|909 |             17.5|             19.5|             15.2|

---
# <center>tables</center>
This is a comparison of industry specific default rates between the population, SRS, and stratified samples.

```r
x.table = table(strat.sample.final$strZip, strat.sample.final$MIS_Status)
no.lab.str = x.table[,1]      # first column consists of unknown default label
default.str = x.table[,2]
no.default.str = x.table[,3]
default.rate.str = round(100*default.str/(default.str+no.default.str),1)
str.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys, default.rate.str)
rownames(str.SRS.pop.rates) = industry.name
kable(str.SRS.pop.rates, caption="Comparison of industry-specific default rates 
                               between population, SRS, Systematic Sample, 
                               and Stratified Samples.")
```

Table: Comparison of industry-specific default rates 
                               between population, SRS, Systematic Sample, 
                               and Stratified Samples.

| | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str|
|:----------------|----------------:|----------------:|----------------:|----------------:|
|upper North East | 17.3| 14.8| 18.6| 16.9|
|Lower North East | 17.3| 20.4| 21.3| 20.7|
|Lower South East | 22.2| 23.2| 21.1| 21.2|
|Upper Mid East | 18.0| 16.7| 19.8| 9.9|
|Upper Middle | 10.5| 8.0| 5.4| 11.1|
|Center of U.S | 17.2| 17.2| 15.1| 11.4|
|South of U.S | 18.8| 15.3| 20.0| 26.2|
|Mid West | 17.5| 9.1| 20.0| 18.3|
|West Coast | 17.5| 19.5| 15.2| 16.8|
---
# <center>tables</center>
I have concluded that Stratification sample may be the best fit for the model because the default rates that we got from the stratified sample are closest to the default rates from the population .

```r
n=length(default.rate.pop)
plot(NULL, xlim=c(0,n), ylim=c(0, 50), 
     xlab="Industry Classification Code", 
     ylab ="Default Rates (Percentage)", axes=FALSE) # empty plot
title("Comparison of Industry-specific Default Rates")
points(1:n, as.vector(default.rate.pop), pch=16, col=2, cex = 0.8)
lines(1:n, as.vector(default.rate.pop), lty=1, col=2, cex = 0.8)
#
points(1:n, as.vector(default.rate.srs), pch=17, col=3, cex = 0.8)
lines(1:n, as.vector(default.rate.srs), lty=2, col=3, cex = 0.8)
#
points(1:n, as.vector(default.rate.sys), pch=19, col=4, cex = 0.8)
lines(1:n, as.vector(default.rate.sys), lty=3, col=4, cex = 0.8)
#
points(1:n, as.vector(default.rate.str), pch=20, col=5, cex = 0.8)
lines(1:n, as.vector(default.rate.str), lty=4, col=5, cex = 0.8)
#
axis(1,at=1:n, label=industry.code, las = 2)
axis(2)
#
rowMax=apply(str.SRS.pop.rates, 1, max) # max default rate in each industry
segments(1:n, rep(0,n), 1:n, rowMax, lty=2, col="lightgray", lwd = 0.5)
legend("topright", c("Pop", "SRS", "Sys", "Str"), lty=1:4, col=2:5, pch=c(16,17,19,20), cex=0.6, bty="n")
```

![](data:image/png;base64,#w04-html-presentation--3-_files/figure-html/unnamed-chunk-13-1.png)