Introduction

In this report, we will be looking at different methods of sampling to perform an empirical comparison of mean, standard deviation, median, minimum and maximum. The discrepancies between the default population rate and each of the sampling plans shows the goodness of the sampling plans.

Data Description

The data set that we will be using was put together from the General Social Survey over the years 1972-2022 and can be found at https://raw.githubusercontent.com/LPyott/Capstone/main/GSS.csv. The variables are as follows:

The variables we will be using for analysis are health, hrs1, and age. Health is a categorical variable with the survey question being “Would you say your own health, in general, is excellent, good, fair, or poor?”, coded as 1, 2, 3, and 4 respectively for analysis purposes. We filtered out all health values less than 0, the hours worked values less than 0, and also the age values less than 18. After filtering the missing values out, we changed age into a categorical variable with the youngest age group being 18 and 19 year olds, then 20 to 29, and breaks every 10 years after.

GSS = read.csv('https://raw.githubusercontent.com/LPyott/Capstone/main/GSS.csv')
 eval = GSS %>%
  filter(AGE>17, HRS1 >= 0, HEALTH >0) #filtering out missing values
eval=eval %>%
  mutate(agecat=case_when(AGE<18~"No answer",
                          AGE==18 | AGE==19~"18 and 19",
                          AGE>19 & AGE<30~"20 to 29",
                          AGE>29 & AGE<40~"30 to 39",
                          AGE>39 & AGE<50~"40 to 49",
                          AGE>49 & AGE<60~"50 to 59",
                          AGE>59 & AGE<70~"60 to 69",
                          AGE>69 & AGE<80~"70 to 79",
                          AGE>79 & AGE<90~"80 to 89",
  ))

The histograms and barplot show the frequencies of each of our variables.

par(mfrow = c(1,3))
hist(eval$AGE)
hist(eval$HRS1)
barplot(popHEALTH, main="Barplot of Population Health")

kable(popHEALTH)
Var1 Freq
1 0.3344966
2 0.4981905
3 0.1490242
4 0.0182887

Simple Random Sample

In a simple random sample, a sample is drawn from the population randomly with replacement. When drawing this type of sample, we assume all possible combinations of data points have an equally likely chance to be selected as a random sample for analysis. In this analysis, we are using a random sample of 500 people.

rand_df <- eval[sample(nrow(eval), size=500), ]

The histograms and barplot show the frequencies of each of our random sample variables.

par(mfrow = c(1,3))
hist(rand_df$AGE)
hist(rand_df$HRS1)
barplot(randHEALTH, main="Barplot of Random Sample Health")

kable(randHEALTH)
Var1 Freq
1 0.338
2 0.478
3 0.164
4 0.020

Systematic Sample

In systematic sampling, the key step is finding the jump size (m). This is equal to the population size divided by the size of the random sample to be drawn from the population. The next step is taking a random sample from the first “m-th” record, and choose every “m-th” record for the sample. Below is the equation to find the jump size \[ jump \:size= \frac{population\:size}{random \:sample\: size} \]

obtain_sys = function(N,n){
  k = ceiling(N/(n-1))
  r = sample(1:k, 1)
  seq(r, r + k*(n-1), k)
}
#apply function
sys_sample_df = eval[obtain_sys(nrow(eval), 500), ]

The histograms and barplot show the frequencies of each of our systematic sample variables.

par(mfrow = c(1,3))
hist(sys_sample_df$AGE)
hist(sys_sample_df$HRS1)
barplot(sysHEALTH, main="Barplot of Sys Sample Health")

kable(sysHEALTH)
Var1 Freq
1 0.3299389
2 0.4969450
3 0.1547862
4 0.0183299

Stratified Sampling – Method 1

Stratified sampling occurs when the data is put into several subpopulations when a simple random sample is difficult to obtain. It is important to note here that the subsample must be proportional to the subpopulation size in order to obtain a combined sample similar to the simple random sample and systematic samples. In method 1 of the stratified sampling, we are using a stratification variable. In this sample, we made the age variable categorical, making 18 an 19 year olds in one group and each decade grouped together afterwards. We used the sample number to be 62.

strat_sample_1 <- eval %>%
  group_by(agecat) %>%
  sample_n(size=62)

The histograms and barplot show the frequencies of each of our method 1 stratification sample variables.

par(mfrow = c(1,3))
hist(strat_sample_1$AGE)
hist(strat_sample_1$HRS1)
barplot(strat1HEALTH, main="Barplot of Strat 1 Health")

kable(strat1HEALTH)
Var1 Freq
1 0.3165323
2 0.4979839
3 0.1693548
4 0.0161290

Stratified Sampling – Method 2

In the second method of stratified sampling, we keep age as a categorical variable. In this method, for us to have proportional samples we look at how many people are in each category. Based on this, we take a proportion of the strata and apply it to the sample, weighting it. Within these strata is where the random sample is drawn from.

strat_sample_2= eval %>%
  arrange(agecat)
agecat_table=table(eval$agecat)
agecat_dataframe=as.data.frame(agecat_table)
#Proportions of each strata
agecat_dataframe$weight=agecat_dataframe$Freq/nrow(eval)
#Sample size by prooprtion to reach n=500 new variabke to makwe the weights 500
agecat_dataframe$n=ceiling(500*agecat_dataframe$weight)
agecat_dataframe
##        Var1 Freq      weight   n
## 1 18 and 19  393 0.012698720   7
## 2  20 to 29 6576 0.212485459 107
## 3  30 to 39 8201 0.264992891 133
## 4  40 to 49 7152 0.231097325 116
## 5  50 to 59 5473 0.176845030  89
## 6  60 to 69 2524 0.081556159  41
## 7  70 to 79  547 0.017674809   9
## 8  80 to 89   82 0.002649606   2
st=strata(strat_sample_2,
          stratanames=c("agecat"),
          size=c(9, 93, 105, 90, 78, 66, 44, 20), method="srswor")
#method means within the strata we're taking a simple random sample
sample2=getdata(eval, st)

The histograms and barplot show the frequencies of each of our method 2 stratification sample variables.

par(mfrow = c(1,3))
hist(strat_sample_2$AGE)
hist(strat_sample_2$HRS1)
barplot(strat2HEALTH, main="Barplot of Strat 2 Health")

kable(strat2HEALTH)
Var1 Freq
1 0.3344966
2 0.4981905
3 0.1490242
4 0.0182887

Analysis

Below is each variable and a table summary that compares each random sample to the population. The closer each number is to the population shows us the goodness of the model.

AGE <- cbind(eval_AGE, rand_AGE, sys_AGE, strat1_AGE, strat2_AGE)
rownames(AGE)<- c("Mean", "Std Dev", "Med", "Min", "Max")
kable(AGE)
eval_AGE rand_AGE sys_AGE strat1_AGE strat2_AGE
Mean 41.22 40.83600 41.40122 49.47782 41.21675
Std Dev 13.29 13.89832 9.00000 21.76318 13.28968
Med 40.00 39.00000 40.00000 49.50000 40.00000
Min 18.00 18.00000 19.00000 18.00000 18.00000
Max 89.00 84.00000 77.00000 89.00000 89.00000
HRS1 <- cbind(eval_HRS1, rand_HRS1, sys_HRS1, strat1_HRS1, strat2_HRS1)
rownames(HRS1)<- c("Mean", "Std Dev", "Med", "Min", "Max")
kable(HRS1)
eval_HRS1 rand_HRS1 sys_HRS1 strat1_HRS1 strat2_HRS1
Mean 40.00000 41.0540 40.55397 36.76008 41.13888
Std Dev 48.00000 13.9835 9.00000 17.22603 14.04410
Med 37.00000 40.0000 40.00000 40.00000 40.00000
Min 0.00000 0.0000 1.00000 0.00000 0.00000
Max 41.13888 89.0000 89.00000 89.00000 89.00000
allHEALTH <- cbind(popHEALTH, randHEALTH, sysHEALTH, strat1HEALTH, strat2HEALTH)
kable(allHEALTH)
popHEALTH randHEALTH sysHEALTH strat1HEALTH strat2HEALTH
|Var1 | Freq| |Var1 | Freq| |Var1 | Freq| |Var1 | Freq| |Var1 | Freq|
|:—-|———:| |:—-|—–:| |:—-|———:| |:—-|———:| |:—-|———:|
|1 | 0.3344966| |1 | 0.338| |1 | 0.3299389| |1 | 0.3165323| |1 | 0.3344966|
|2 | 0.4981905| |2 | 0.478| |2 | 0.4969450| |2 | 0.4979839| |2 | 0.4981905|
|3 | 0.1490242| |3 | 0.164| |3 | 0.1547862| |3 | 0.1693548| |3 | 0.1490242|
|4 | 0.0182887| |4 | 0.020| |4 | 0.0183299| |4 | 0.0161290| |4 | 0.0182887|

Discussion and Conclusion

We have looked at three different types of sampling plans to decide which one will give us the sample closest to the population. In this data set, the best sampling plan to use is weighted stratified samples, or method 2 stratified sampling. There is room for error in this project, because we only took one random sample for each type of sampling. This experiment shows us that for this data set, the best way to sample is the second method of stratified sampling. We know this because these numbers are the closest to the default population sample.