HTML Presentation

class: center, middle, inverse, title-slide

.title[
# HTML Presentation 
]
.subtitle[
## Data analysis 
]
.author[
### Yuanqi Zhang 
]

---

# Introduction
We compare three sampling plans: simple random sampling, systematic sampling, and stratified sampling. The bank loan data, which comes as 9 subsets, is treated as the population. We combine these 9 subsets into a single data set called bankloan.

---

# Stratification variable
A stratification variable is a categorical variable whose values split the population into non-overlapping groups (strata).

| Population.size|
|---------------:|
|          899164|

|Var1 |  Freq|
|:----|-----:|
|0    |   283|
|1    |    24|
|10   | 14759|
|11   | 19211|
|12   |  6749|
|13   |  6397|
|14   | 14310|
|15   | 10221|
|16   |  6293|
|17   |  6181|
|18   |  7530|
|19   | 15093|
|2    |    11|
|20   | 10208|
|21   | 13148|
|22   |  5263|
|23   |  6829|
|24   |  2888|
|25   |  1965|
|26   |  2631|
|27   |  8414|
|28   | 13556|
|29   |  8861|
|3    |     5|
|30   | 21577|
|31   |  5178|
|32   | 14010|
|33   | 23899|
|34   |  6252|
|35   |  5748|
|36   |  3143|
|37   |  7174|
|38   |  7549|
|39   |  6829|
|4    |     5|
|40   |  6244|
|41   |  2007|
|42   |  2392|
|43   | 10843|
|44   | 14441|
|45   |  8705|
|46   | 11399|
|47   |  3300|
|48   | 13097|
|49   |  8183|
|5    |     5|
|50   |  5993|
|51   |  2395|
|52   |  4656|
|53   | 11261|
|54   | 11936|
|55   | 17969|
|56   |  7381|
|57   |  5128|
|58   |  5808|
|59   |  8778|
|6    |     4|
|60   | 23790|
|61   |  5167|
|62   |  4090|
|63   |  8926|
|64   |  8814|
|65   |  8312|
|66   |  6508|
|67   |  6481|
|68   |  7153|
|69   |  1139|
|7    |     6|
|70   | 12487|
|71   |  4003|
|72   |  5524|
|73   |  5510|
|74   |  6082|
|75   | 17706|
|76   | 12580|
|77   | 20879|
|78   | 15928|
|79   |  9346|
|8    |    15|
|80   | 20308|
|81   |  3792|
|82   |  3370|
|83   | 10088|
|84   | 18872|
|85   | 16923|
|86   |  2035|
|87   |  4524|
|88   |  5419|
|89   |  8308|
|9    |    24|
|90   | 25034|
|91   | 20052|
|92   | 32356|
|93   | 12858|
|94   | 17673|
|95   | 21038|
|96   |  5010|
|97   | 11083|
|98   | 20000|
|99   |  5832|

---

# Study Population

|    109|   209|    309|   409|   509|   609|    709|   809|    909|
|------:|-----:|------:|-----:|-----:|-----:|------:|-----:|------:|
| 106744| 73763| 101359| 80611| 81305| 80380| 110045| 93639| 170936|
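
The nine column labels above appear to group the two-digit codes of the frequency table by tens: the counts for codes 10 to 19 add up to the size of stratum 109, and codes 0 to 9 are not part of the study population. A minimal Python sketch of that grouping follows; the function name `stratum_label` is hypothetical, and the rule itself is inferred from the two tables rather than taken from the original program.

```python
from collections import Counter

def stratum_label(code):
    """Map a two-digit code to its stratum label, e.g. 10-19 -> '109'.

    Codes 0-9 return None: they do not appear in the study population
    table (assumption: they were dropped because they are so small).
    """
    if code < 10:
        return None
    return f"{code // 10}09"

# Frequencies for codes 10-19, copied from the frequency table.
freq_10_19 = {10: 14759, 11: 19211, 12: 6749, 13: 6397, 14: 14310,
              15: 10221, 16: 6293, 17: 6181, 18: 7530, 19: 15093}

stratum_sizes = Counter()
for code, count in freq_10_19.items():
    stratum_sizes[stratum_label(code)] += count

print(stratum_sizes["109"])  # 106744, the first column of the table above
```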
---

# Simple Random Sampling
We will take a sample of 1000 from the whole population of 899164.
<img src="data:image/png;base64,#download.png" width="100%" />

| Size| Var.count|
|----:|---------:|
| 1000|        30|
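
As an illustration, simple random sampling without replacement can be sketched in Python as below. The sketch draws row positions only; the seed and the variable names are arbitrary choices, not part of the original analysis.

```python
import random

POPULATION_SIZE = 899164  # population size reported above
SAMPLE_SIZE = 1000

rng = random.Random(2024)  # arbitrary seed, used only for reproducibility

# Draw 1000 distinct 0-based row positions; every subset of size 1000
# has the same chance of being selected.
srs_rows = rng.sample(range(POPULATION_SIZE), SAMPLE_SIZE)

print(len(srs_rows), len(set(srs_rows)))  # 1000 1000
```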
---

# Systematic sampling
For systematic sampling we need a jump size that yields a sample of about 1000 from the population. Dividing 694216 by 1000 and rounding down gives a jump size of 694.
<img src="data:image/png;base64,#Systematic-Sampling-main-image-1.jpg" width="100%" />

| Size| Var.count|
|----:|---------:|
| 1001|        30|
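
The jump-size arithmetic can be checked with a short Python sketch. It assumes the usual recipe (integer division for the jump size, a random start between 1 and the jump size, then every jump-th row); the function name `systematic_rows` is hypothetical.

```python
import random

POP_SIZE = 694216  # figure used in the jump-size calculation above
TARGET = 1000      # desired sample size

def systematic_rows(pop_size, target, start=None, rng=random):
    """Return (jump, rows), where rows are 1-based row positions."""
    jump = pop_size // target             # 694216 // 1000 = 694
    if start is None:
        start = rng.randint(1, jump)      # random start in 1..jump
    return jump, list(range(start, pop_size + 1, jump))

jump, rows = systematic_rows(POP_SIZE, TARGET)
print(jump, len(rows))
```

Because 694216 is not an exact multiple of 694, the sample holds 1001 rows when the random start is 216 or less and 1000 rows otherwise, which is consistent with the size of 1001 in the table above.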

---

# Stratified Sample
I first ran a program to tabulate the zip-code strata. The sample of about 1000 is then built by sampling separately within each zip-code group (stratum), with subsample sizes roughly proportional to the stratum sizes.
<img src="data:image/png;base64,#HHHH.png" width="60%" />

| 109| 209| 309| 409| 509| 609| 709| 809| 909|
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 119|  82| 113|  90|  90|  89| 122| 104| 190|

| Size| Var.count|
|----:|---------:|
| 1001|        30|
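
The per-stratum sample sizes in the first table above are consistent with proportional allocation. The Python sketch below assumes that rule (stratum share of the study population times 1000, rounded); it is not taken from the original program, and the names are illustrative.

```python
# Stratum sizes copied from the Study Population table.
stratum_sizes = {"109": 106744, "209": 73763, "309": 101359,
                 "409": 80611, "509": 81305, "609": 80380,
                 "709": 110045, "809": 93639, "909": 170936}
TARGET = 1000

total = sum(stratum_sizes.values())

# Proportional allocation: each stratum gets its population share of
# the target sample size, rounded to the nearest integer.
allocation = {s: round(TARGET * n / total) for s, n in stratum_sizes.items()}

print(allocation)
print(sum(allocation.values()))
```

Because each size is rounded separately, the allocated sizes need not add up to exactly 1000.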
---

# Industry-specific Rates
The tables below compare the default rates in the population, the SRS, and the systematic sample.

Table: Comparison of industry-specific default rates between population and the SRS.

|                 | default.rate.pop| default.rate.srs|
|:----------------|----------------:|----------------:|
|Upper North East |             17.3|             10.9|
|Lower North East |             17.3|             18.5|
|Lower South East |             22.2|             19.5|
|Upper Mid East   |             18.0|             18.9|
|Upper Middle     |             10.5|              4.4|
|Center of U.S    |             17.2|             10.1|
|South of U.S     |             18.8|             17.2|
|Mid West         |             17.5|             19.1|
|West Coast       |             17.5|             18.1|

Table: Comparison of industry-specific default rates between population, SRS, and Systematic Sample.

|    | default.rate.pop| default.rate.srs| default.rate.sys|
|:---|----------------:|----------------:|----------------:|
|109 |             17.3|             10.9|             27.6|
|209 |             17.3|             18.5|             12.9|
|309 |             22.2|             19.5|             16.5|
|409 |             18.0|             18.9|             24.1|
|509 |             10.5|              4.4|             11.2|
|609 |             17.2|             10.1|             16.0|
|709 |             18.8|             17.2|             13.9|
|809 |             17.5|             19.1|             16.4|
|909 |             17.5|             18.1|             19.1|
---

| | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str|
|:----------------|----------------:|----------------:|----------------:|----------------:|
|Upper North East | 17.3| 10.9| 27.6| 15.5|
|Lower North East | 17.3| 18.5| 12.9| 18.3|
|Lower South East | 22.2| 19.5| 16.5| 24.8|
|Upper Mid East | 18.0| 18.9| 24.1| 20.9|
|Upper Middle | 10.5| 4.4| 11.2| 8.9|
|Center of U.S | 17.2| 10.1| 16.0| 18.0|
|South of U.S | 18.8| 17.2| 13.9| 26.2|
|Mid West | 17.5| 19.1| 16.4| 19.2|
|West Coast | 17.5| 18.1| 19.1| 18.9|
---
# Visual Comparison
The stratified sample appears to perform best: overall, its default rates are closest to the population default rates.
![](data:image/png;base64,#PRES_files/figure-html/unnamed-chunk-16-1.png)
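
One way to put a number on "closest" is the mean squared error (MSE) between each sample's default rates and the population rates. The Python sketch below uses the rates from the comparison table; the short names `srs`, `sys` and `str` are just labels for the three sampling plans.

```python
# Default rates (%) copied from the comparison table, in row order.
pop = [17.3, 17.3, 22.2, 18.0, 10.5, 17.2, 18.8, 17.5, 17.5]
samples = {
    "srs": [10.9, 18.5, 19.5, 18.9, 4.4, 10.1, 17.2, 19.1, 18.1],
    "sys": [27.6, 12.9, 16.5, 24.1, 11.2, 16.0, 13.9, 16.4, 19.1],
    "str": [15.5, 18.3, 24.8, 20.9, 8.9, 18.0, 26.2, 19.2, 18.9],
}

def mse(estimates, truth):
    """Mean of the squared differences between two equal-length lists."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)

mse_by_plan = {name: mse(rates, pop) for name, rates in samples.items()}
for name, value in mse_by_plan.items():
    print(f"{name}: {value:.2f}")
```

For this single set of samples the MSE is about 16.0 for the SRS, 25.0 for the systematic sample, and 9.1 for the stratified sample.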
---
# Mean squared error
The following line plot shows the patterns of industry-specific default rates described above.
![](data:image/png;base64,#PRES_files/figure-html/unnamed-chunk-17-1.png)
These comparisons are based on a single sample from each plan, so the results could vary considerably from one draw to the next. A more reliable way to assess the overall performance of the three sampling plans is to take multiple samples and compare the mean squared errors.
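
The repeated-sampling idea can be sketched in Python as follows. Everything here is an assumption for illustration: the population is synthetic (stratum sizes of roughly 1/100 of the study population, defaults simulated from the population rates), not the bank loan data, and the sample size, number of repetitions, and function names are arbitrary.

```python
import random

# Synthetic stand-in population (assumption), NOT the bank loan data.
STRATA = ["109", "209", "309", "409", "509", "609", "709", "809", "909"]
SIZES = [1067, 738, 1014, 806, 813, 804, 1100, 936, 1709]
RATES = [0.173, 0.173, 0.222, 0.180, 0.105, 0.172, 0.188, 0.175, 0.175]

rng = random.Random(1)  # arbitrary seed
population = [(s, rng.random() < p)
              for s, n, p in zip(STRATA, SIZES, RATES) for _ in range(n)]
rng.shuffle(population)
by_stratum = {s: [r for r in population if r[0] == s] for s in STRATA}

def rates(rows):
    """Default rate (%) per stratum; None if the stratum has no rows."""
    out = {}
    for s in STRATA:
        flags = [d for t, d in rows if t == s]
        out[s] = 100 * sum(flags) / len(flags) if flags else None
    return out

POP_RATES = rates(population)

def mse(sample):
    """MSE between sample and population rates over observed strata."""
    r = rates(sample)
    errs = [(r[s] - POP_RATES[s]) ** 2 for s in STRATA if r[s] is not None]
    return sum(errs) / len(errs)

def draw_srs(n):
    return rng.sample(population, n)

def draw_systematic(n):
    jump = len(population) // n
    return population[rng.randrange(jump)::jump]

def draw_stratified(n):
    sample = []
    for rows in by_stratum.values():
        k = round(n * len(rows) / len(population))  # proportional allocation
        sample.extend(rng.sample(rows, k))
    return sample

def average_mse(draw, n=1000, reps=100):
    """Average the MSE over `reps` independent samples from one plan."""
    return sum(mse(draw(n)) for _ in range(reps)) / reps

results = {name: average_mse(draw) for name, draw in
           [("srs", draw_srs), ("systematic", draw_systematic),
            ("stratified", draw_stratified)]}
print(results)
```

Averaging over many draws smooths out the luck of any single sample, so the three averages are a steadier basis for ranking the plans than one set of rates.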