class: center, middle, inverse, title-slide .title[ #
Comparing Potential Sampling Plans.
] .author[ ###
Kyle Weber and Ian Vanwright
] .institute[ ###
West Chester University of Pennsylvania
] .date[ ###
Prepared for
STA490: Data Visualization
Slides available at:
https://rpubs.com/KW986324
AND
https://github.com/Kyle-Weber/STA490
]

---
class: top, center

## Table of Contents:

.left[
- Data Cleaning
- Combining Categories
- Default Rate
- Analyzing Four Sampling Plans
- Comparing Default Rates
- Default Rate Visualization
- Conclusion
]

---
class: top, center

## Data Cleaning:

.left[
- Removed observations with missing values for MIS_Status or State
- Converted currency values to numeric
- Remaining observations: 897153
]

---
class: top, center

## Combining Categories:

.left[
- Variable "State" combined into 5 regions
- Regions represent geography (Northeast, Southeast, Midwest, Southwest, and West)
- Frequency table of regions
- Southwest has the fewest observations of the five regions
]

|Region | Frequency|
|:------|---------:|
|       |        14|
|MW     |    202581|
|NE     |    202423|
|SE     |    140841|
|SW     |     90202|
|W      |    263103|

---

## Calculating Default Rates:

.left[
- Original default rates for the data set
- Provide a baseline for comparing the sampling plans
- Southeast has the highest rate
]

<!--If need time can talk about why they are important-->

|Region | Default Rate|
|:------|------------:|
|       |    0.1428571|
|MW     |    0.1583021|
|NE     |    0.1611872|
|SE     |    0.2141990|
|SW     |    0.1934325|
|W      |    0.1719593|

---
class: top, center

## Simple Random Sampling Process:

.left[
- Random sample of 4000 <!-- why 4000 less than 5% of original data set -->
- Taken from the cleaned data set
- Sample size and default rate tables
]

| Size| Var.count|
|----:|---------:|
| 4000|        28|

|Region | Sample_Default_Rate|
|:------|-------------------:|
|MW     |           0.1699196|
|NE     |           0.1767289|
|SE     |           0.2003339|
|SW     |           0.2029340|
|W      |           0.1669421|

---
class: top, center

## Systematic Sample:

.left[
- Jump size chosen to give a sample of about 4000 observations
- Selects observations at fixed intervals
- Representative of the entire population
- Rounding error: the sample has 4014 observations rather than 4000
]

| Size| Var.count|
|----:|---------:|
| 4014|        28|

|Region | Sample_Default_Rate|
|:------|-------------------:|
|W      |           0.1569620|
|NE     |           0.1518152|
|MW     |           0.1512876|
|SW     |           0.1714286|
|SE     |           0.1973466|

---
class: top, center

## Stratified Sampling:
.left[
- Stratum sizes by region
]

|   |  MW|  NE|  SE|  SW|    W|
|--:|---:|---:|---:|---:|----:|
|  0| 901| 900| 627| 401| 1170|

|Region | Sample_Default_Rate|
|:------|-------------------:|
|MW     |           0.1675916|
|NE     |           0.1533333|
|SE     |           0.2137161|
|SW     |           0.1820449|
|W      |           0.1598291|

---
class: top, center

## Cluster Sampling

.left[
- ZipCode used to define clusters
- A cluster sample of ZIP codes is then drawn; its size and variable count are shown below
]

| Size| Var.count|
|----:|---------:|
|  423|        28|

|Region | Default Rate|
|:------|------------:|
|MW     |    0.3448276|
|NE     |    0.1707317|
|SE     |    0.1966527|
|SW     |    0.1666667|
|W      |    0.0298507|

---
class: top, center

## Default Rate Comparison:

.left[
- Comparison of default rates across regions
- Sample default rate vs. subpopulation default rate
- Systematic or Stratified
]

<div class="figure">
<p class="caption">Summary of inferential statistics of the full model</p>
</div>

---

## Comparison Between Sampling Plans

- Default rate for each region, by sampling plan

<img src="490Grp2_files/figure-html/interactive-comparison-1.png" width="120%" />

```r
library(ggplot2)

# Assumes all_region_default_rates (columns Region, Sampling_Plan, Default_Rate)
# was built earlier in the document.

# Map region abbreviations to full region names
expand_region_names <- function(region_codes) {
  full_names <- c("MW" = "Midwest", "NE" = "Northeast", "W" = "West",
                  "SE" = "Southeast", "SW" = "Southwest")
  # as.character() guards against factor input being indexed by integer code
  unname(full_names[as.character(region_codes)])
}

# Add full region names to the data frame
all_region_default_rates$Region_Name <-
  expand_region_names(all_region_default_rates$Region)

# Split the data into two halves, two sampling plans each
half1 <- subset(all_region_default_rates,
                Sampling_Plan %in% c("Simple Random", "Systematic"))
half2 <- subset(all_region_default_rates,
                Sampling_Plan %in% c("Clustering", "Stratified"))

# Dodged bar plot of default rate by region, one bar per sampling plan
plot_half <- function(data, title) {
  ggplot(data, aes(x = Region_Name, y = Default_Rate, fill = Sampling_Plan)) +
    geom_col(position = "dodge") +
    labs(title = title, x = "Region", y = "Default Rate",
         fill = "Sampling Plan") +
    theme_minimal() +
    theme(legend.position = "bottom")  # legend below the plot
}

# Output the plots
plot_half(half1, "Default Rates for Each Region by Sampling Plan (Half 1)")
plot_half(half2, "Default Rates for Each Region by Sampling Plan (Half 2)")
```

---

## Visualization for Final Choice

<img src="490Grp2_files/figure-html/interactive sample defaul-1.png" width="120%" />

---

<img src="490Grp2_files/figure-html/interactive sample default-1.png" width="120%" />

---

## Visualization for Final Choice

<img src="490Grp2_files/figure-html/interactive-sample default 2-1.png" width="130%" />

---

## Conclusion

- Why this is important
- Uses
- Questions
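
---

## Appendix: Sketch of the Four Sampling Plans

The four plans compared in these slides can be sketched in base R as below. This is a minimal illustration, not the code behind the tables above: the `loans` data frame, its columns (`Region`, `Zip`, `Default`), the sizes `N` and `n`, and the number of ZIP codes drawn are all made-up assumptions for demonstration.

```r
# Synthetic stand-in for the cleaned loan data (hypothetical values only)
set.seed(490)
N <- 20100
regions <- c("MW", "NE", "SE", "SW", "W")
loans <- data.frame(
  Region  = factor(sample(regions, N, replace = TRUE,
                          prob = c(0.22, 0.22, 0.16, 0.10, 0.30)),
                   levels = regions),
  Zip     = sample(sprintf("%05d", 1:400), N, replace = TRUE),
  Default = rbinom(N, 1, 0.17)
)
n <- 1000  # target sample size (assumed)

# Default rate per region for any data frame with Region and Default columns
default_rate <- function(df) tapply(df$Default, df$Region, mean)

# 1. Simple random sample: n rows drawn without replacement
srs <- loans[sample(N, n), ]

# 2. Systematic sample: random start, then every k-th row.
#    Rounding k down can overshoot n when N is not a multiple of n.
k <- floor(N / n)
start <- sample(k, 1)
sys <- loans[seq(start, N, by = k), ]

# 3. Stratified sample: proportional allocation within each region
strata <- split(loans, loans$Region)
strat <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), round(n * nrow(s) / N)), ]
}))

# 4. Cluster sample: draw whole ZIP codes and keep every loan in them
zips <- sample(unique(loans$Zip), 20)
clus <- loans[loans$Zip %in% zips, ]

# Compare sample default rates against the population rates
comparison <- round(rbind(Population = default_rate(loans),
                          SRS        = default_rate(srs),
                          Systematic = default_rate(sys),
                          Stratified = default_rate(strat),
                          Cluster    = default_rate(clus)), 3)
comparison
```

Each row of `comparison` is one plan and each column one region, so the rows can be read against the `Population` row in the same way as the default rate tables on the earlier slides.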