class: center, middle, inverse, title-slide .title[ #
Data Analysis On Zip Codes
] .subtitle[ ##
Three way sample test
] .author[ ###
Yuanqi
] --- # <center>Data</center> ```r loan01 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational01.csv", header = TRUE)[, -1] loan02 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational02.csv", header = TRUE)[, -1] loan03 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational03.csv", header = TRUE)[, -1] loan04 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational04.csv", header = TRUE)[, -1] loan05 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational05.csv", header = TRUE)[, -1] loan06 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational06.csv", header = TRUE)[, -1] loan07 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational07.csv", header = TRUE)[, -1] loan08 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational08.csv", header = TRUE)[, -1] loan09 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational09.csv", header = TRUE)[, -1] bankLoan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09) # dim(bankLoan) #names(bankLoan) ``` --- # <center>Data</center> This is taking a sample of 1000 from the whole population of 899164. For this procedure, we are only taking a subset of 1000 Zip codes randomly, each with the same probability of being chosen. In the Sample, we have a variance of 30. ```r Zip =as.character(bankLoan$Zip) # make a character vector N=length(Zip) # find the size of the data. f.table = -sort(-table(Zip)) # sort the vector in descending order n = length(f.table) # find the number of distinct industries kable(cbind(Population.size = N)) ``` | Population.size| |---------------:| | 899164| --- # <center>Data</center> I first tried to extract the first two digits from the zipcodes. ```r Zip.2.digits = substr(bankLoan$Zip, 1, 2) # extract the first two digits of the Zip code bankLoan$Zip2Digit = Zip.2.digits # add the above two-digit variable the loan data ftable = table(bankLoan$Zip2Digit) kable(t(ftable)) ``` | 0| 1| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 2| 20| 21| 22| 23| 24| 25| 26| 27| 28| 29| 3| 30| 31| 32| 33| 34| 35| 36| 37| 38| 39| 4| 40| 41| 42| 43| 44| 45| 46| 47| 48| 49| 5| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 6| 60| 61| 62| 63| 64| 65| 66| 67| 68| 69| 7| 70| 71| 72| 73| 74| 75| 76| 77| 78| 79| 8| 80| 81| 82| 83| 84| 85| 86| 87| 88| 89| 9| 90| 91| 92| 93| 94| 95| 96| 97| 98| 99| |---:|--:|-----:|-----:|----:|----:|-----:|-----:|----:|----:|----:|-----:|--:|-----:|-----:|----:|----:|----:|----:|----:|----:|-----:|----:|--:|-----:|----:|-----:|-----:|----:|----:|----:|----:|----:|----:|--:|----:|----:|----:|-----:|-----:|----:|-----:|----:|-----:|----:|--:|----:|----:|----:|-----:|-----:|-----:|----:|----:|----:|----:|--:|-----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|--:|-----:|----:|----:|----:|----:|-----:|-----:|-----:|-----:|----:|--:|-----:|----:|----:|-----:|-----:|-----:|----:|----:|----:|----:|--:|-----:|-----:|-----:|-----:|-----:|-----:|----:|-----:|-----:|----:| | 283| 24| 14759| 19211| 6749| 6397| 14310| 10221| 6293| 6181| 7530| 15093| 11| 10208| 13148| 5263| 6829| 2888| 1965| 2631| 8414| 13556| 8861| 5| 21577| 5178| 14010| 23899| 6252| 5748| 3143| 7174| 7549| 6829| 5| 6244| 2007| 2392| 10843| 14441| 8705| 11399| 3300| 13097| 8183| 5| 5993| 2395| 4656| 11261| 11936| 17969| 7381| 5128| 5808| 8778| 4| 23790| 5167| 4090| 8926| 8814| 8312| 6508| 6481| 7153| 1139| 6| 12487| 4003| 5524| 5510| 6082| 17706| 12580| 20879| 15928| 9346| 15| 20308| 3792| 3370| 10088| 18872| 16923| 2035| 4524| 5419| 8308| 24| 25034| 20052| 32356| 12858| 17673| 21038| 5010| 11083| 20000| 5832| --- # <center>Data</center> Afterwards, I grouped the first first 2 digits of the zipcodes into groups of 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, and 90s. The first two digits of the zipcoes represents a region in the U.S. For example if 19 is extracted fom 19446, then 19 would represent a zip code from a Northeastern state like Pennsylvania. ```r cate.10.19=c("10","11","12","13","14","15","16","17","18","19") cate.20.29 = c("20","21","22","23","24","25","26","27","28","29") cate.30.39 = c("30","31","32","33","34","35","36","37","38","39") cate.40.49=c("40","41","42","43","44","45","46","47","48","49") cate.50.59=c("50","51","52","53","54","55","56","57","58","59") cate.60.69 = c("60","61","62","63","64","65","66","67","68","69") cate.70.79 = c("70","71","72","73","74","75","76","77","78","79") cate.80.89=c("80","81","82","83","84","85","86","87","88","89") cate.90.99 = c("90","91","92","93","94","95","96","97","98","99") Zip2Digit0 = bankLoan$Zip2Digit # extract the 2-digit Zip Zip2Digit = bankLoan$Zip2Digit # extract the 2-digit Zip-copy logic.10.19=Zip2Digit %in% cate.10.19 # identify the three categories: 31, 32, 33. Zip2Digit[logic.10.19] = 109 # replace 31, 32, 33 with 313 logic.20.29=Zip2Digit %in% cate.20.29 # identify the three categories: 44 and 45. Zip2Digit[logic.20.29] = 209 logic.30.39=Zip2Digit %in% cate.30.39 # identify the three categories: 48 and 49. Zip2Digit[logic.30.39] = 309 logic.40.49=Zip2Digit %in% cate.40.49 Zip2Digit[logic.40.49] = 409 logic.50.59=Zip2Digit %in% cate.50.59 Zip2Digit[logic.50.59] = 509 logic.60.69=Zip2Digit %in% cate.60.69 Zip2Digit[logic.60.69] = 609 logic.70.79=Zip2Digit %in% cate.70.79 Zip2Digit[logic.70.79] = 709 logic.80.89=Zip2Digit %in% cate.80.89 Zip2Digit[logic.80.89] = 809 logic.90.99=Zip2Digit %in% cate.90.99 Zip2Digit[logic.90.99] = 909 bankLoan$strZip = Zip2Digit ``` --- # <center>Data</center> We have defined a new population by deleting observations ranging 0 to 9 because some states states do not have a zip code or the size is relatively small in order to get the new smaller population. ```r del.categories = c("0", "1", "2","3","4","5","6","7","8", "9") # categories to be deleted in # the original population del.obs.status = !(bankLoan$strZip %in% del.categories) # deletion status. ! negation operator study.pop = bankLoan[del.obs.status,] # excluding the categories kable(t(table(study.pop$strZip))) ``` | 109| 209| 309| 409| 509| 609| 709| 809| 909| |------:|-----:|------:|-----:|-----:|-----:|------:|-----:|------:| | 106744| 73763| 101359| 80611| 81305| 80380| 110045| 93639| 170936| --- # <center>sample</center> ```r study.pop$sampling.frame = 1:length(study.pop$Zip) # sampling list names(study.pop) # checking the sampling list variable ``` ``` ## [1] "LoanNr_ChkDgt" "Name" "City" ## [4] "State" "Zip" "Bank" ## [7] "BankState" "NAICS" "ApprovalDate" ## [10] "ApprovalFY" "Term" "NoEmp" ## [13] "NewExist" "CreateJob" "RetainedJob" ## [16] "FranchiseCode" "UrbanRural" "RevLineCr" ## [19] "LowDoc" "ChgOffDate" "DisbursementDate" ## [22] "DisbursementGross" "BalanceGross" "MIS_Status" ## [25] "ChgOffPrinGr" "GrAppv" "SBA_Appv" ## [28] "Zip2Digit" "strZip" "sampling.frame" ``` ```r sampled.list = sample(1:length(study.pop$Zip), 1000) # sampling the list SRS.sample = study.pop[sampled.list,] # extract the sampling units (observations) ## dimension check dimension.SRS = dim(SRS.sample) names(dimension.SRS) = c("Size", "Var.count") kable(t(dimension.SRS)) # checking the sample size ``` | Size| Var.count| |----:|---------:| | 1000| 30| --- # <center>Samples</center> When this code is preforming systematic sampling where it is taking every 1th observation in the column for the variable Zip, the sample will consist of everything 1th observation in the column ranging from n=1 t0 n= 899164. The jump size is calculated to find the appropriate jump size from our population in order to obtain a sample of 1000 when we perform systematic sampling. ```r jump.size = dim(study.pop)[1]%/%1000 # find the jump size in the systematic sampling jump.size ``` ``` ## [1] 898 ``` ```r rand.starting.pt=sample(1:jump.size,1) # find the random starting value sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size) # sampling IDs length(sampling.id) ``` ``` ## [1] 1001 ``` ```r sys.sample=study.pop[sampling.id,] # extract the sampling units of systematic samples sys.Sample.dim = dim(sys.sample) names(sys.Sample.dim) = c("Size", "Var.count") kable(t(sys.Sample.dim)) ``` | Size| Var.count| |----:|---------:| | 1001| 30| --- # <center>samples</center> Then, I ran a program to create a table for the stratified zipcodes. Later, this is creating a sample of 1000 by taking clusters of a particular number group in the population, which are listed below. ```r freq.table = table(study.pop$strZip) # frequency table of strZip rel.freq = freq.table/sum(freq.table) # relative frequency strata.size = round(rel.freq*1000) # strata size allocation strata.names=names(strata.size) # extract strZip names for accuracy checking ``` --- # <center>Data</center> ```r kable(t(strata.size)) # make a nice-looking table using kable(). ``` | 109| 209| 309| 409| 509| 609| 709| 809| 909| |---:|---:|---:|---:|---:|---:|---:|---:|---:| | 119| 82| 113| 90| 90| 89| 122| 104| 190| --- # <center>sample</center> The final code is taking 119 zip codes from the 109 group, 82 zip codes from the 209 group, 113 zip codes from the 309 group, 91 zip coes from the 409 group, 90 zip codes from the 509 group, 89 zip codes from the 609 group, 122 zip codes from the 809 group, and 190 zip codes from the 909 group to create a stratified sample of 1000. ```r strata.sample = study.pop[1,] # create a reference data frame strata.sample$add.id = 1 # add a temporary ID to because in the loop # i =2 # testing a single iteration for (i in 1:length(strata.names)){ ith.strata.names = strata.names[i] # extract data frame names ith.strata.size = strata.size[i] # allocated stratum size # The following code identifies observations to be selected ith.sampling.id = which(study.pop$strZip==ith.strata.names) ith.strata = study.pop[ith.sampling.id,] # i-th stratified population ith.strata$add.id = 1:dim(ith.strata)[1] # add sampling list/frame # The following code generates a subset of random ID ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) ## Create a selection status -- pay attention to the operator: %in% ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,] ## dim(ith.sample) $ check the sample strata.sample = rbind(strata.sample, ith.sample) # stack all data frame! } dim(strata.sample) ``` ``` ## [1] 1000 31 ``` ```r strat.sample.final = strata.sample[,-31] # drop the temporary stratum ID kable(head(strat.sample.final)) # accuracy check! ``` | | LoanNr_ChkDgt|Name |City |State | Zip|Bank |BankState | NAICS|ApprovalDate |ApprovalFY | Term| NoEmp| NewExist| CreateJob| RetainedJob| FranchiseCode| UrbanRural|RevLineCr |LowDoc |ChgOffDate |DisbursementDate |DisbursementGross |BalanceGross |MIS_Status |ChgOffPrinGr |GrAppv |SBA_Appv |Zip2Digit |strZip | sampling.frame| |:-----|-------------:|:------------------------------|:----------|:-----|-----:|:-----------------------------|:---------|------:|:------------|:----------|----:|-----:|--------:|---------:|-----------:|-------------:|----------:|:---------|:------|:----------|:----------------|:-----------------|:------------|:----------|:------------|:-----------|:-----------|:---------|:------|--------------:| |1 | 1000014003|ABC HOBBYCRAFT |EVANSVILLE |IN | 47711|FIFTH THIRD BANK |OH | 451120|28-Feb-97 |1997 | 84| 4| 2| 0| 0| 1| 0|N |Y | |28-Feb-99 |$60,000.00 |$0.00 |P I F |$0.00 |$60,000.00 |$48,000.00 |47 |409 | 1| |398 | 1003035000|CICERO GAS & GROCERY INC |CICERO |NY | 13039|KEYBANK NATIONAL ASSOCIATION |OH | 0|17-Jul-98 |1998 | 84| 6| 1| 0| 0| 1| 0|N |N | |31-Jul-98 |$25,000.00 |$0.00 |P I F |$0.00 |$25,000.00 |$12,500.00 |13 |109 | 396| |10809 | 1077154008|CARTER BROTHERS, INC. |CICERO |NY | 13039|HSBC BK USA NATL ASSOC |NY | 0|31-Mar-97 |1997 | 84| 20| 2| 0| 0| 1| 0|N |N | |8-May-97 |$90,000.00 |$0.00 |P I F |$0.00 |$90,000.00 |$72,000.00 |13 |109 | 10804| |13691 | 1097904005|MARBLE,TILE,TERRAZZO & GRANITE |BROOKLYN |NY | 11217|GE CAP. SMALL BUS. FINAN CORP |TX | 0|4-Apr-97 |1997 | 241| 18| 1| 0| 0| 1| 0|N |N | |8-Jul-97 |$290,000.00 |$0.00 |P I F |$0.00 |$325,000.00 |$243,750.00 |11 |109 | 13685| |28023 | 1210725009|R & R Garber Partnership |Boyertown |PA | 19512|METRO BANK |PA | 531120|28-Feb-05 |2005 | 182| 7| 1| 2| 7| 1| 2|N |N | |31-Mar-05 |$306,000.00 |$0.00 |P I F |$0.00 |$306,000.00 |$229,500.00 |19 |109 | 28013| |30366 | 1229104004|MARIE I SOTO DBA SOTOMAR AGENC |BRONX |NY | 10459|HSBC BK USA NATL ASSOC |NY | 0|13-May-97 |1997 | 61| 3| 2| 0| 0| 1| 0|N |Y | |22-May-97 |$50,000.00 |$0.00 |P I F |$0.00 |$50,000.00 |$40,000.00 |10 |109 | 30356| --- # <center>tables</center> We have calculated the default rate across the states previously. We will use the population level rates to compare them with sample-level industry specific rates. For the table, the MIS Status enables us to see how many people who live in particular zip codes defaulted on their loans and how many did not default on their loans on the 3rd and 4th columns. The last column gives us the percentage of people living within a particular set of zip codes who defaulted on their loans. ```r x.table = table(bankLoan$strZip, bankLoan$MIS_Status) no.lab = x.table[,1] # first column consists of unknown default label default = x.table[,2] no.default = x.table[,3] default.rate = round(100*default/(default+no.default),1) default.status.rate = cbind(no.lab = no.lab, default = default, no.default = no.default, default.rate=default.rate) kable(default.status.rate, caption = "Population size, default counts, and population default rates") ``` Table: Population size, default counts, and population default rates | | no.lab| default| no.default| default.rate| |:---|------:|-------:|----------:|------------:| |0 | 0| 222| 61| 78.4| |1 | 0| 1| 23| 4.2| |109 | 758| 18366| 87620| 17.3| |2 | 0| 4| 7| 36.4| |209 | 273| 12726| 60764| 17.3| |3 | 0| 1| 4| 20.0| |309 | 139| 22512| 78708| 22.2| |4 | 0| 0| 5| 0.0| |409 | 236| 14465| 65910| 18.0| |5 | 0| 0| 5| 0.0| |509 | 87| 8542| 72676| 10.5| |6 | 0| 1| 3| 25.0| |609 | 218| 13756| 66406| 17.2| |7 | 0| 1| 5| 16.7| |709 | 83| 20702| 89260| 18.8| |8 | 0| 6| 9| 40.0| |809 | 49| 16347| 77243| 17.5| |9 | 0| 4| 20| 16.7| |909 | 154| 29902| 140880| 17.5| --- # <center>Data</center> This is a comparison between the Industry default rates between the population and SRS ```r x.table = table(SRS.sample$strZip, SRS.sample$MIS_Status) no.lab.srs = x.table[,1] # first column consists of unknown default label default.srs = x.table[,2] no.default.srs = x.table[,3] default.rate.srs = round(100*default.srs/(default.srs+no.default.srs),1) ## industry.code = names(default.rate.srs) # extract NSICS code industry.name=c("upper North East","Lower North East", "Lower South East", "Upper Mid East", "Upper Middle", "Center of U.S","South of U.S", "Mid West", "West Coast") # actual industry names! default.rate.pop = default.rate[industry.code] cbind(industry.code,industry.name) ``` ``` ## industry.code industry.name ## [1,] "109" "upper North East" ## [2,] "209" "Lower North East" ## [3,] "309" "Lower South East" ## [4,] "409" "Upper Mid East" ## [5,] "509" "Upper Middle" ## [6,] "609" "Center of U.S" ## [7,] "709" "South of U.S" ## [8,] "809" "Mid West" ## [9,] "909" "West Coast" ``` ```r SRS.pop.rates = cbind(default.rate.pop,default.rate.srs) rownames(SRS.pop.rates) = industry.name kable(SRS.pop.rates, caption="Comparison of industry-specific default rates between population and the SRS.") ``` Table: Comparison of industry-specific default rates between population and the SRS. | | default.rate.pop| default.rate.srs| |:----------------|----------------:|----------------:| |upper North East | 17.3| 14.8| |Lower North East | 17.3| 20.4| |Lower South East | 22.2| 23.2| |Upper Mid East | 18.0| 16.7| |Upper Middle | 10.5| 8.0| |Center of U.S | 17.2| 17.2| |South of U.S | 18.8| 15.3| |Mid West | 17.5| 9.1| |West Coast | 17.5| 19.5| --- # <center>tables</center> We will use sample stratification variable to find the industry-specific rates based on the systematic sample. The table will have rates of population, SRS, and systematic random samples. ```r x.table = table(sys.sample$strZip, sys.sample$MIS_Status) no.lab.sys = x.table[,1] # first column consists of unknown default label default.sys = x.table[,2] no.default.sys = x.table[,3] default.rate.sys = round(100*default.sys/(default.sys+no.default.sys),1) sys.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys) rownames(SRS.pop.rates) = industry.name kable(sys.SRS.pop.rates, caption="Comparison of industry-specific default rates between population, SRS, and Systematic Sample.") ``` Table: Comparison of industry-specific default rates between population, SRS, and Systematic Sample. | | default.rate.pop| default.rate.srs| default.rate.sys| |:---|----------------:|----------------:|----------------:| |109 | 17.3| 14.8| 18.6| |209 | 17.3| 20.4| 21.3| |309 | 22.2| 23.2| 21.1| |409 | 18.0| 16.7| 19.8| |509 | 10.5| 8.0| 5.4| |609 | 17.2| 17.2| 15.1| |709 | 18.8| 15.3| 20.0| |809 | 17.5| 9.1| 20.0| |909 | 17.5| 19.5| 15.2| --- # <center>tables</center> This is a comparison of industry specific default rates between the population, SRS, and stratified samples. ```r x.table = table(strat.sample.final$strZip, strat.sample.final$MIS_Status) no.lab.str = x.table[,1] # first column consists of unknown default label default.str = x.table[,2] no.default.str = x.table[,3] default.rate.str = round(100*default.str/(default.str+no.default.str),1) str.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys, default.rate.str) rownames(str.SRS.pop.rates) = industry.name kable(str.SRS.pop.rates, caption="Comparison of industry-specific default rates between population, SRS, Systematic Sample, and Stratified Samples.") ``` Table: Comparison of industry-specific default rates between population, SRS, Systematic Sample, and Stratified Samples. | | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str| |:----------------|----------------:|----------------:|----------------:|----------------:| |upper North East | 17.3| 14.8| 18.6| 16.9| |Lower North East | 17.3| 20.4| 21.3| 20.7| |Lower South East | 22.2| 23.2| 21.1| 21.2| |Upper Mid East | 18.0| 16.7| 19.8| 9.9| |Upper Middle | 10.5| 8.0| 5.4| 11.1| |Center of U.S | 17.2| 17.2| 15.1| 11.4| |South of U.S | 18.8| 15.3| 20.0| 26.2| |Mid West | 17.5| 9.1| 20.0| 18.3| |West Coast | 17.5| 19.5| 15.2| 16.8| --- # <center>tables</center> I have concluded that Stratification sample may be the best fit for the model because the default rates that we got from the stratified sample are closest to the default rates from the population . ```r n=length(default.rate.pop) plot(NULL, xlim=c(0,n), ylim=c(0, 50), xlab="Industry Classification Code", ylab ="Default Rates (Percentage)", axes=FALSE) # empty plot title("Comparison of Industry-specific Default Rates") points(1:n, as.vector(default.rate.pop), pch=16, col=2, cex = 0.8) lines(1:n, as.vector(default.rate.pop), lty=1, col=2, cex = 0.8) # points(1:n, as.vector(default.rate.srs), pch=17, col=3, cex = 0.8) lines(1:n, as.vector(default.rate.srs), lty=2, col=3, cex = 0.8) # points(1:n, as.vector(default.rate.sys), pch=19, col=4, cex = 0.8) lines(1:n, as.vector(default.rate.sys), lty=3, col=4, cex = 0.8) # points(1:n, as.vector(default.rate.str), pch=20, col=5, cex = 0.8) lines(1:n, as.vector(default.rate.str), lty=4, col=5, cex = 0.8) # axis(1,at=1:n, label=industry.code, las = 2) axis(2) # rowMax=apply(str.SRS.pop.rates, 1, max) # max default rate in each industry segments(1:n, rep(0,n), 1:n, rowMax, lty=2, col="lightgray", lwd = 0.5) legend("topright", c("Pop", "SRS", "Sys", "Str"), lty=1:4, col=2:5, pch=c(16,17,19,20), cex=0.6, bty="n") ``` <!-- -->