class: center, middle, inverse, title-slide

.title[
# HTML Presentation
]
.subtitle[
## Data analysis
]
.author[
### Yuanqi Zhang
]

---

# Introduction

We compare three sampling plans: simple random sampling, systematic sampling, and stratified sampling. The bank loan data is treated as a population that comes in 9 subsets, which we combine into a single data set called bankloan.

---

# Stratification variable

A stratification variable is a categorical variable whose values are used to divide the population into strata.

| Population.size|
|---------------:|
|          899164|

|Var1 |  Freq|
|:----|-----:|
|0    |   283|
|1    |    24|
|10   | 14759|
|11   | 19211|
|12   |  6749|
|13   |  6397|
|14   | 14310|
|15   | 10221|
|16   |  6293|
|17   |  6181|
|18   |  7530|
|19   | 15093|
|2    |    11|
|20   | 10208|
|21   | 13148|
|22   |  5263|
|23   |  6829|
|24   |  2888|
|25   |  1965|
|26   |  2631|
|27   |  8414|
|28   | 13556|
|29   |  8861|
|3    |     5|
|30   | 21577|
|31   |  5178|
|32   | 14010|
|33   | 23899|
|34   |  6252|
|35   |  5748|
|36   |  3143|
|37   |  7174|
|38   |  7549|
|39   |  6829|
|4    |     5|
|40   |  6244|
|41   |  2007|
|42   |  2392|
|43   | 10843|
|44   | 14441|
|45   |  8705|
|46   | 11399|
|47   |  3300|
|48   | 13097|
|49   |  8183|
|5    |     5|
|50   |  5993|
|51   |  2395|
|52   |  4656|
|53   | 11261|
|54   | 11936|
|55   | 17969|
|56   |  7381|
|57   |  5128|
|58   |  5808|
|59   |  8778|
|6    |     4|
|60   | 23790|
|61   |  5167|
|62   |  4090|
|63   |  8926|
|64   |  8814|
|65   |  8312|
|66   |  6508|
|67   |  6481|
|68   |  7153|
|69   |  1139|
|7    |     6|
|70   | 12487|
|71   |  4003|
|72   |  5524|
|73   |  5510|
|74   |  6082|
|75   | 17706|
|76   | 12580|
|77   | 20879|
|78   | 15928|
|79   |  9346|
|8    |    15|
|80   | 20308|
|81   |  3792|
|82   |  3370|
|83   | 10088|
|84   | 18872|
|85   | 16923|
|86   |  2035|
|87   |  4524|
|88   |  5419|
|89   |  8308|
|9    |    24|
|90   | 25034|
|91   | 20052|
|92   | 32356|
|93   | 12858|
|94   | 17673|
|95   | 21038|
|96   |  5010|
|97   | 11083|
|98   | 20000|
|99   |  5832|

---

# Study Population

Codes 0 to 9 hold only 382 records in total and are left out. The remaining two-digit codes are grouped by their leading digit into nine strata, labelled 109 to 909, which gives a study population of 898782.

|    109|   209|    309|   409|   509|   609|    709|   809|    909|
|------:|-----:|------:|-----:|-----:|-----:|------:|-----:|------:|
| 106744| 73763| 101359| 80611| 81305| 80380| 110045| 93639| 170936|

---

# Simple Random Sampling

We will take a sample of 1000 from the whole population of 899164.
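A minimal sketch of this step in Python, assuming the population can be addressed by row index 0 to N-1. The function name, the seed, and the index-based layout are illustrative assumptions, not the code used for this analysis.

```python
import random

POPULATION_SIZE = 899164  # population size reported above
SAMPLE_SIZE = 1000        # target sample size


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw row indices without replacement; every row is equally likely."""
    rng = random.Random(seed)  # seed is an assumption, used only for reproducibility
    return sorted(rng.sample(range(population_size), sample_size))


srs_rows = simple_random_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=123)
print(len(srs_rows), "distinct rows selected")
```

The returned indices would then be used to subset the rows of bankloan.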
<img src="data:image/png;base64,#download.png" width="100%" />

| Size| Var.count|
|----:|---------:|
| 1000|        30|

---

# Systematic sampling

The jump size is the population size divided by the target sample size, so that stepping through the population at that interval yields a sample of about 1000. Here 694216/1000 gives a jump size of 694.

<img src="data:image/png;base64,#Systematic-Sampling-main-image-1.jpg" width="100%" />

| Size| Var.count|
|----:|---------:|
| 1001|        30|

---

# Stratified Sample

I first ran a program to tabulate the population by zip-code stratum. A sample of about 1000 is then drawn by taking a random sample within each stratum, sized in proportion to that stratum's share of the population.

<img src="data:image/png;base64,#HHHH.png" width="60%" />

| 109| 209| 309| 409| 509| 609| 709| 809| 909|
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 119|  82| 113|  90|  90|  89| 122| 104| 190|

| Size| Var.count|
|----:|---------:|
| 1001|        30|

---

# Industry-specific Rates

The tables below compare the default rates of the population, the SRS, and the systematic sample.

Table: Comparison of industry-specific default rates between population and the SRS.

|                 | default.rate.pop| default.rate.srs|
|:----------------|----------------:|----------------:|
|Upper North East |             17.3|             10.9|
|Lower North East |             17.3|             18.5|
|Lower South East |             22.2|             19.5|
|Upper Mid East   |             18.0|             18.9|
|Upper Middle     |             10.5|              4.4|
|Center of U.S    |             17.2|             10.1|
|South of U.S     |             18.8|             17.2|
|Mid West         |             17.5|             19.1|
|West Coast       |             17.5|             18.1|

Table: Comparison of industry-specific default rates between population, SRS, and Systematic Sample.

|    | default.rate.pop| default.rate.srs| default.rate.sys|
|:---|----------------:|----------------:|----------------:|
|109 |             17.3|             10.9|             27.6|
|209 |             17.3|             18.5|             12.9|
|309 |             22.2|             19.5|             16.5|
|409 |             18.0|             18.9|             24.1|
|509 |             10.5|              4.4|             11.2|
|609 |             17.2|             10.1|             16.0|
|709 |             18.8|             17.2|             13.9|
|809 |             17.5|             19.1|             16.4|
|909 |             17.5|             18.1|             19.1|

---

|                 | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str|
|:----------------|----------------:|----------------:|----------------:|----------------:|
|Upper North East |             17.3|             10.9|             27.6|             15.5|
|Lower North East |             17.3|             18.5|             12.9|             18.3|
|Lower South East |             22.2|             19.5|             16.5|             24.8|
|Upper Mid East   |             18.0|             18.9|             24.1|             20.9|
|Upper Middle     |             10.5|              4.4|             11.2|              8.9|
|Center of U.S    |             17.2|             10.1|             16.0|             18.0|
|South of U.S     |             18.8|             17.2|             13.9|             26.2|
|Mid West         |             17.5|             19.1|             16.4|             19.2|
|West Coast       |             17.5|             18.1|             19.1|             18.9|

---

# Visual Comparison

The stratified sample may be the most representative of the three, because its default rates are, overall, the closest to the population default rates.

[Figure: visual comparison of the default rates]

---

# Mean squared error

The patterns in the industry-specific default rates above are shown in the following line plot.

[Figure: line plot of the industry-specific default rates]

These comparisons are based on a single sample from each plan, so they could vary considerably from one draw to the next. A more reliable way to assess the overall performance of the three sampling plans is to take multiple samples and compare the mean squared errors.
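The repeated-sampling idea can be sketched in Python as follows. This is only an illustration: it rebuilds a synthetic population from the stratum sizes and population default rates in the tables above (the real bankloan records are not used), shuffles it on the assumption that record order is unrelated to default status, and averages the MSE over an arbitrary 20 repetitions. All function names are hypothetical.

```python
import random
from collections import defaultdict

# Stratum sizes and population default rates (%) copied from the tables above.
SIZES = {109: 106744, 209: 73763, 309: 101359, 409: 80611, 509: 81305,
         609: 80380, 709: 110045, 809: 93639, 909: 170936}
POP_RATES = {109: 17.3, 209: 17.3, 309: 22.2, 409: 18.0, 509: 10.5,
             609: 17.2, 709: 18.8, 809: 17.5, 909: 17.5}
SAMPLE_SIZE = 1000


def build_population(rng):
    """Synthetic stand-in for bankloan: one (stratum, default flag) pair per record."""
    records = []
    for stratum, size in SIZES.items():
        defaults = round(size * POP_RATES[stratum] / 100)
        records += [(stratum, 1)] * defaults + [(stratum, 0)] * (size - defaults)
    rng.shuffle(records)  # assumption: record order carries no information
    return records


def srs(records, by_stratum, rng):
    """Simple random sample without replacement."""
    return rng.sample(records, SAMPLE_SIZE)


def systematic(records, by_stratum, rng):
    """Every jump-th record after a random start."""
    jump = len(records) // SAMPLE_SIZE
    return records[rng.randrange(jump)::jump]


def stratified(records, by_stratum, rng):
    """Random sample within each stratum, proportional to stratum size."""
    sample = []
    for members in by_stratum.values():
        n = round(SAMPLE_SIZE * len(members) / len(records))
        sample += rng.sample(members, n)
    return sample


def default_rates(sample):
    """Default rate (%) for each stratum present in the sample."""
    counts, defaults = defaultdict(int), defaultdict(int)
    for stratum, flag in sample:
        counts[stratum] += 1
        defaults[stratum] += flag
    return {s: 100 * defaults[s] / counts[s] for s in counts}


def mse(sample, true_rates):
    """Mean squared difference between sample and population rates."""
    rates = default_rates(sample)
    return sum((rates[s] - true_rates[s]) ** 2 for s in rates) / len(rates)


def compare(reps=20, seed=2024):
    """Average MSE of each plan over `reps` independent samples."""
    rng = random.Random(seed)
    records = build_population(rng)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[0]].append(record)
    true_rates = default_rates(records)
    plans = {"srs": srs, "systematic": systematic, "stratified": stratified}
    totals = {name: 0.0 for name in plans}
    for _ in range(reps):
        for name, draw in plans.items():
            totals[name] += mse(draw(records, by_stratum, rng), true_rates)
    return {name: total / reps for name, total in totals.items()}


average_mse = compare()
for name, value in average_mse.items():
    print(f"{name}: {value:.2f}")
```

A smaller average MSE means the plan's default rates stay closer to the population rates across repeated draws. Because the population here is synthetic, the printed values only illustrate the procedure and are not results for bankloan.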