Comparing Potential Sampling Plans.

.title[
# <font size="7" color="white">Comparing Potential Sampling Plans.</font>
]
.author[
### <font size="5" color="white"> Kyle Weber and Ian Vanwright </font>
]
.institute[
### <font size="6" color="white">West Chester University of Pennsylvania</font><br>
]
.date[
### <font color="white" size="4"> Prepared for<br> </font> <font color="gold" size="6"> STA490: Data Visualization </font> <br><br> <font color="white" size="3"> Slides available at: <a href="https://rpubs.com/KW986324" class="uri">https://rpubs.com/KW986324</a> AND <a href="https://github.com/Kyle-Weber/STA490" class="uri">https://github.com/Kyle-Weber/STA490</a></font>
]

---

- Data Cleaning

- Combining Categories

- Default Rate

- Analyzing Four Sampling Plans

- Comparing Default Rates

- Default Rate Visualization

- Conclusion

---

##    Data Cleaning:

- Removed observations with missing values for MIS_Status or State

- Converted currency values to numeric

- Remaining observations: 897153
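The cleaning steps above could be sketched roughly as follows. The toy data frame, the `DisbursementGross` column, and the treatment of empty strings as missing are assumptions for illustration, not the authors' actual code.

```r
# Toy stand-in for the loan data; column names and values are assumptions.
loans <- data.frame(
  State = c("PA", "NY", "", "CA", "TX"),
  MIS_Status = c("P I F", "CHGOFF", "P I F", "", "CHGOFF"),
  DisbursementGross = c("$60,000.00 ", "$40,000.00 ", "$287,000.00 ",
                        "$35,000.00 ", "$229,000.00 "),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing, then drop rows missing either field.
loans$State[loans$State == ""] <- NA
loans$MIS_Status[loans$MIS_Status == ""] <- NA
cleaned <- loans[!is.na(loans$State) & !is.na(loans$MIS_Status), ]

# Strip "$", "," and spaces so the currency text parses as a number.
cleaned$DisbursementGross <- as.numeric(gsub("[$, ]", "", cleaned$DisbursementGross))

nrow(cleaned)  # number of remaining observations
```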

---

##    Combining Categories:

- Variable "State" combined into 5 regions

- Regions are geographic: Northeast, Southeast, Midwest, Southwest, and West

- Frequency table of regions

- Southwest has the lowest frequency

|Region | Frequency|
|:------|---------:|
|       |        14|
|MW     |    202581|
|NE     |    202423|
|SE     |    140841|
|SW     |     90202|
|W      |    263103|
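One way to collapse states into regions is a named lookup vector. The lookup below is hypothetical and covers only a few states; the authors' full state-to-region assignment is not shown in the slides.

```r
# Hypothetical lookup covering only a few states (an assumption for illustration).
region_lookup <- c(PA = "NE", NY = "NE", FL = "SE", OH = "MW", TX = "SW", CA = "W")

states <- c("PA", "CA", "TX", "CA", "OH", "FL", "NY", "CA")

# Index the lookup by state code; unname() drops the state labels.
region <- unname(region_lookup[states])

# Frequency table of regions, in the same spirit as the table above.
region_freq <- table(region)
region_freq
```

A state code that is absent from the lookup comes back as `NA`, so it is worth checking for unmapped rows after this step.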

---

##    Calculating Default Rates:

- Original rates for data set

- Provides a comparison to the sampling plans

- Southeast has the highest rate

|Region | Default Rate|
|:-----|------------:|
|      |    0.1428571|
|MW    |    0.1583021|
|NE    |    0.1611872|
|SE    |    0.2141990|
|SW    |    0.1934325|
|W     |    0.1719593|
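A regional default rate is the share of loans in that region that defaulted. The sketch below assumes that a default is coded as `"CHGOFF"` in `MIS_Status` and uses made-up rows; both are assumptions.

```r
# Toy data; the "CHGOFF" coding for a defaulted loan is an assumption.
loans <- data.frame(
  Region = c("NE", "NE", "SE", "SE", "SE", "W", "W", "W", "W", "MW"),
  MIS_Status = c("P I F", "CHGOFF", "CHGOFF", "P I F", "CHGOFF",
                 "P I F", "P I F", "P I F", "CHGOFF", "P I F")
)

# 1 if the loan defaulted, 0 otherwise.
loans$Default <- as.integer(loans$MIS_Status == "CHGOFF")

# The mean of a 0/1 indicator within each region is that region's default rate.
default_rates <- aggregate(Default ~ Region, data = loans, FUN = mean)
names(default_rates)[2] <- "Default_Rate"
default_rates
```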

---

##    Simple Random Sampling Process:

- Random sample of 4000

- Taken from the cleaned data set

- Sample size and default rate tables

]

| Size| Var.count|
|----:|---------:|
| 4000|        28|

|Region | Sample_Default_Rate|
|:------|-------------------:|
|MW     |           0.1699196|
|NE     |           0.1767289|
|SE     |           0.2003339|
|SW     |           0.2029340|
|W      |           0.1669421|
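A simple random sample draws row indices without replacement. This is a minimal sketch on a toy population; the seed, the population size, and the column names are assumptions.

```r
set.seed(490)  # arbitrary seed so the draw is reproducible

# Toy population standing in for the cleaned data set.
population <- data.frame(
  id = 1:10000,
  Region = sample(c("MW", "NE", "SE", "SW", "W"), 10000, replace = TRUE)
)

n <- 4000

# sample() draws without replacement by default, so no row is picked twice.
srs_rows <- sample(nrow(population), size = n)
srs <- population[srs_rows, ]

dim(srs)  # sample size and variable count
```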

---

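##    Systematic Sampling Process:

- Jump size chosen for a target sample of 4000 observations

- Selects observations at fixed intervals

- Representative of the entire population

- Rounding error: the sample contains 4014 observations rather than 4000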

| Size| Var.count|
|----:|---------:|
| 4014|        28|

|Region | Sample_Default_Rate|
|:------|-------------------:|
|W      |           0.1569620|
|NE     |           0.1518152|
|MW     |           0.1512876|
|SW     |           0.1714286|
|SE     |           0.1973466|
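A systematic sample picks a random start and then takes every `jump`-th row. Because the jump size is rounded down to a whole number, the realized sample can be slightly larger than the target. The toy sizes below are assumptions chosen to make that visible.

```r
set.seed(490)  # arbitrary seed

N <- 1000  # toy population size (assumption)
n <- 30    # target sample size (assumption)

# Whole-number interval between selected rows; rounding down loses the fraction.
jump <- floor(N / n)

# Random starting row between 1 and jump.
start <- sample(jump, 1)

# Every jump-th row from the start to the end of the population.
sys_rows <- seq(from = start, to = N, by = jump)

length(sys_rows)  # can exceed n because jump was rounded down
```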

---

##    Stratified Sampling Process:

- Strata sizes for each region, proportional to region frequency

|  MW|  NE|  SE|  SW|    W|
|---:|---:|---:|---:|----:|
| 901| 900| 627| 401| 1170|

|Region | Sample_Default_Rate|
|:------|-------------------:|
|MW     |           0.1675916|
|NE     |           0.1533333|
|SE     |           0.2137161|
|SW     |           0.1820449|
|W      |           0.1598291|
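Proportional allocation with a target of 4000, applied to the region frequencies from the earlier frequency table, gives the same strata sizes as the table above. Whether the authors used exactly this rule is an assumption; the name `unlabeled` for the blank row is also ours.

```r
# Region frequencies from the frequency table; "unlabeled" is the blank row.
region_freq <- c(unlabeled = 14, MW = 202581, NE = 202423,
                 SE = 140841, SW = 90202, W = 263103)

target_n <- 4000

# Each stratum receives a share of the sample equal to its share of the population.
alloc <- round(target_n * region_freq / sum(region_freq))
alloc
```

Rounding each stratum separately means the allocations need not sum exactly to the target.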

---

##    Cluster Sampling Process:

- ZIP code used to define clusters

- Cluster sample of ZIP codes taken, with the size and variable count shown below

| Size| Var.count|
|----:|---------:|
|  423|        28|

|Region | Sample_Default_Rate|
|:------|------------:|
|MW     |    0.3448276|
|NE     |    0.1707317|
|SE     |    0.1966527|
|SW     |    0.1666667|
|W      |    0.0298507|
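In a cluster sample, whole clusters are drawn at random and every observation in a chosen cluster is kept. The ZIP codes, cluster sizes, and number of clusters drawn below are made up for illustration.

```r
set.seed(490)  # arbitrary seed

# Toy loans grouped by ZIP code; values are assumptions.
loans <- data.frame(
  Zip = rep(c("19380", "19382", "19383", "10001", "33101", "60601"),
            times = c(5, 3, 4, 6, 2, 4)),
  Default = rbinom(24, 1, 0.2)
)

# Draw clusters (ZIP codes), not individual loans.
zips <- unique(loans$Zip)
chosen <- sample(zips, size = 2)

# Keep every loan that belongs to a chosen ZIP code.
cluster_sample <- loans[loans$Zip %in% chosen, ]

dim(cluster_sample)
```

The number of rows depends on which clusters are drawn, so the sample size is not fixed in advance.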

---

##    Comparing Default Rates:

- Sample default rate vs. subpopulation default rate

- Systematic or stratified

|Sampling_Plan |Region | Sample_Default_Rate| Subpopulation_Default_Rate|
|:-------------|:------|-------------------:|--------------------------:|
|Simple Random |MW     |             0.16992|                    0.15830|
|Simple Random |NE     |             0.17673|                    0.16119|
|Simple Random |SE     |             0.20033|                    0.21420|
|Simple Random |SW     |             0.20293|                    0.19343|
|Simple Random |W      |             0.16694|                    0.17196|
|Systematic    |W      |             0.15696|                    0.17196|
|Systematic    |NE     |             0.15182|                    0.16119|
|Systematic    |MW     |             0.15129|                    0.15830|
|Systematic    |SW     |             0.17143|                    0.19343|
|Systematic    |SE     |             0.19735|                    0.21420|
|Clustering    |W      |             0.02985|                    0.17196|
|Clustering    |SE     |             0.19665|                    0.21420|
|Clustering    |NE     |             0.17073|                    0.16119|
|Clustering    |SW     |             0.16667|                    0.19343|
|Clustering    |MW     |             0.34483|                    0.15830|
|Stratified    |MW     |             0.16759|                    0.15830|
|Stratified    |NE     |             0.15333|                    0.16119|
|Stratified    |SE     |             0.21372|                    0.21420|
|Stratified    |SW     |             0.18204|                    0.19343|
|Stratified    |W      |             0.15983|                    0.17196|

Sample and subpopulation default rates by sampling plan and region
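One way to summarise how closely each plan tracks the subpopulation is the mean absolute difference between the sample and subpopulation default rates across regions. The rates below are copied from the comparison table on this slide; the choice of this particular metric is ours, not the authors'.

```r
# Rates copied from the comparison table on this slide.
rates <- data.frame(
  Sampling_Plan = rep(c("Simple Random", "Systematic", "Clustering", "Stratified"),
                      each = 5),
  Region = c("MW", "NE", "SE", "SW", "W",
             "W", "NE", "MW", "SW", "SE",
             "W", "SE", "NE", "SW", "MW",
             "MW", "NE", "SE", "SW", "W"),
  Sample_Default_Rate = c(0.16992, 0.17673, 0.20033, 0.20293, 0.16694,
                          0.15696, 0.15182, 0.15129, 0.17143, 0.19735,
                          0.02985, 0.19665, 0.17073, 0.16667, 0.34483,
                          0.16759, 0.15333, 0.21372, 0.18204, 0.15983),
  Subpopulation_Default_Rate = c(0.1583, 0.16119, 0.2142, 0.19343, 0.17196,
                                 0.17196, 0.16119, 0.1583, 0.19343, 0.2142,
                                 0.17196, 0.2142, 0.16119, 0.19343, 0.1583,
                                 0.1583, 0.16119, 0.2142, 0.19343, 0.17196)
)

# Absolute gap between the sample rate and the subpopulation rate, per row.
rates$Abs_Diff <- abs(rates$Sample_Default_Rate - rates$Subpopulation_Default_Rate)

# Average gap per sampling plan, sorted from closest to farthest.
mad_by_plan <- aggregate(Abs_Diff ~ Sampling_Plan, data = rates, FUN = mean)
ranked <- mad_by_plan[order(mad_by_plan$Abs_Diff), ]
ranked
```

Smaller values mean the plan's regional rates sit closer to the subpopulation rates.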

---

##    Comparison Between Sampling Plans

- Default Rate for each Region Based on Sampling Plan

```r
library(ggplot2)

# Assumes `all_region_default_rates` already exists with the columns
# Region, Sampling_Plan, and Default_Rate.

# Map region abbreviations to full region names
expand_region_names <- function(region_codes) {
  full_names <- c("MW" = "Midwest", "NE" = "Northeast", "W" = "West",
                  "SE" = "Southeast", "SW" = "Southwest")
  # as.character() keeps factor codes from being used as positions
  unname(full_names[as.character(region_codes)])
}

# Add full region names to the data frame
all_region_default_rates$Region_Name <- expand_region_names(all_region_default_rates$Region)

# Split data into two halves
half1 <- subset(all_region_default_rates, Sampling_Plan %in% c("Simple Random", "Systematic"))
half2 <- subset(all_region_default_rates, Sampling_Plan %in% c("Clustering", "Stratified"))

# Dodged bar plot of default rate by region for one half of the plans
plot_half <- function(data, title) {
  ggplot(data, aes(x = Region_Name, y = Default_Rate, fill = Sampling_Plan)) +
    geom_col(position = "dodge") +
    labs(title = title, x = "Region", y = "Default Rate", fill = "Sampling Plan") +
    theme_minimal() +
    theme(legend.position = "bottom")  # Adjust legend position
}

plot1 <- plot_half(half1, "Default Rates for Each Region by Sampling Plan (Half 1)")
plot2 <- plot_half(half2, "Default Rates for Each Region by Sampling Plan (Half 2)")

# Output the plots
plot1
plot2
```

---

##    Visualization for final choice


---

##     Conclusion

- Why this is important

- Uses

- Questions