Introduction

In this report, we will be looking at different methods of sampling to perform an empirical comparison of mean, standard deviation, median, minimum and maximum. The discrepancies between the default population rate and each of the sampling plans shows the goodness of the sampling plans.

Data Description

The data set that we will be using was put together from the General Social Survey over the years 1972-2022 and can be found at https://raw.githubusercontent.com/LPyott/Capstone/main/GSS.csv. The variables are as follows:

year: GSS year for respondent
id_: respondent id number
hrs1: number of hours you worked last week
age: age of respondent
race: race of respondent
income: total family income
health: condition of health
ballot: ballot used for interview

The variables we will be using for analysis are health, hrs1, and age. Health is a categorical variable with the survey question being “Would you say your own health, in general, is excellent, good, fair, or poor?”, coded as 1, 2, 3, and 4 respectively for analysis purposes. We filtered out all health values less than 0, the hours worked values less than 0, and also the age values less than 18. After filtering the missing values out, we changed age into a categorical variable with the youngest age group being 18 and 19 year olds, then 20 to 29, and breaks every 10 years after.

GSS = read.csv('https://raw.githubusercontent.com/LPyott/Capstone/main/GSS.csv')
 eval = GSS %>%
  filter(AGE>17, HRS1 >= 0, HEALTH >0) #filtering out missing values
eval=eval %>%
  mutate(agecat=case_when(AGE<18~"No answer",
                          AGE==18 | AGE==19~"18 and 19",
                          AGE>19 & AGE<30~"20 to 29",
                          AGE>29 & AGE<40~"30 to 39",
                          AGE>39 & AGE<50~"40 to 49",
                          AGE>49 & AGE<60~"50 to 59",
                          AGE>59 & AGE<70~"60 to 69",
                          AGE>69 & AGE<80~"70 to 79",
                          AGE>79 & AGE<90~"80 to 89",
  ))

The histograms and barplot show the frequencies of each of our variables.

par(mfrow = c(1,3))
hist(eval$AGE)
hist(eval$HRS1)
barplot(popHEALTH, main="Barplot of Population Health")

kable(popHEALTH)

Var1	Freq
1	0.3344966
2	0.4981905
3	0.1490242
4	0.0182887

Simple Random Sample

In a simple random sample, a sample is drawn from the population randomly with replacement. When drawing this type of sample, we assume all possible combinations of data points have an equally likely chance to be selected as a random sample for analysis. In this analysis, we are using a random sample of 500 people.

rand_df <- eval[sample(nrow(eval), size=500), ]

The histograms and barplot show the frequencies of each of our random sample variables.

par(mfrow = c(1,3))
hist(rand_df$AGE)
hist(rand_df$HRS1)
barplot(randHEALTH, main="Barplot of Random Sample Health")

kable(randHEALTH)

Var1	Freq
1	0.338
2	0.478
3	0.164
4	0.020

Systematic Sample

In systematic sampling, the key step is finding the jump size (m). This is equal to the population size divided by the size of the random sample to be drawn from the population. The next step is taking a random sample from the first “m-th” record, and choose every “m-th” record for the sample. Below is the equation to find the jump size \[ jump \:size= \frac{population\:size}{random \:sample\: size} \]

obtain_sys = function(N,n){
  k = ceiling(N/(n-1))
  r = sample(1:k, 1)
  seq(r, r + k*(n-1), k)
}
#apply function
sys_sample_df = eval[obtain_sys(nrow(eval), 500), ]

The histograms and barplot show the frequencies of each of our systematic sample variables.

par(mfrow = c(1,3))
hist(sys_sample_df$AGE)
hist(sys_sample_df$HRS1)
barplot(sysHEALTH, main="Barplot of Sys Sample Health")

kable(sysHEALTH)

Var1	Freq
1	0.3299389
2	0.4969450
3	0.1547862
4	0.0183299

Stratified Sampling – Method 1

Stratified sampling occurs when the data is put into several subpopulations when a simple random sample is difficult to obtain. It is important to note here that the subsample must be proportional to the subpopulation size in order to obtain a combined sample similar to the simple random sample and systematic samples. In method 1 of the stratified sampling, we are using a stratification variable. In this sample, we made the age variable categorical, making 18 an 19 year olds in one group and each decade grouped together afterwards. We used the sample number to be 62.

strat_sample_1 <- eval %>%
  group_by(agecat) %>%
  sample_n(size=62)

The histograms and barplot show the frequencies of each of our method 1 stratification sample variables.

par(mfrow = c(1,3))
hist(strat_sample_1$AGE)
hist(strat_sample_1$HRS1)
barplot(strat1HEALTH, main="Barplot of Strat 1 Health")

kable(strat1HEALTH)

Var1	Freq
1	0.3165323
2	0.4979839
3	0.1693548
4	0.0161290

Stratified Sampling – Method 2

In the second method of stratified sampling, we keep age as a categorical variable. In this method, for us to have proportional samples we look at how many people are in each category. Based on this, we take a proportion of the strata and apply it to the sample, weighting it. Within these strata is where the random sample is drawn from.

strat_sample_2= eval %>%
  arrange(agecat)
agecat_table=table(eval$agecat)
agecat_dataframe=as.data.frame(agecat_table)
#Proportions of each strata
agecat_dataframe$weight=agecat_dataframe$Freq/nrow(eval)
#Sample size by prooprtion to reach n=500 new variabke to makwe the weights 500
agecat_dataframe$n=ceiling(500*agecat_dataframe$weight)
agecat_dataframe

##        Var1 Freq      weight   n
## 1 18 and 19  393 0.012698720   7
## 2  20 to 29 6576 0.212485459 107
## 3  30 to 39 8201 0.264992891 133
## 4  40 to 49 7152 0.231097325 116
## 5  50 to 59 5473 0.176845030  89
## 6  60 to 69 2524 0.081556159  41
## 7  70 to 79  547 0.017674809   9
## 8  80 to 89   82 0.002649606   2

st=strata(strat_sample_2,
          stratanames=c("agecat"),
          size=c(9, 93, 105, 90, 78, 66, 44, 20), method="srswor")
#method means within the strata we're taking a simple random sample
sample2=getdata(eval, st)

The histograms and barplot show the frequencies of each of our method 2 stratification sample variables.

par(mfrow = c(1,3))
hist(strat_sample_2$AGE)
hist(strat_sample_2$HRS1)
barplot(strat2HEALTH, main="Barplot of Strat 2 Health")

kable(strat2HEALTH)

Var1	Freq
1	0.3344966
2	0.4981905
3	0.1490242
4	0.0182887

Analysis

Below is each variable and a table summary that compares each random sample to the population. The closer each number is to the population shows us the goodness of the model.

AGE <- cbind(eval_AGE, rand_AGE, sys_AGE, strat1_AGE, strat2_AGE)
rownames(AGE)<- c("Mean", "Std Dev", "Med", "Min", "Max")
kable(AGE)

	eval_AGE	rand_AGE	sys_AGE	strat1_AGE	strat2_AGE
Mean	41.22	40.83600	41.40122	49.47782	41.21675
Std Dev	13.29	13.89832	9.00000	21.76318	13.28968
Med	40.00	39.00000	40.00000	49.50000	40.00000
Min	18.00	18.00000	19.00000	18.00000	18.00000
Max	89.00	84.00000	77.00000	89.00000	89.00000

HRS1 <- cbind(eval_HRS1, rand_HRS1, sys_HRS1, strat1_HRS1, strat2_HRS1)
rownames(HRS1)<- c("Mean", "Std Dev", "Med", "Min", "Max")
kable(HRS1)

	eval_HRS1	rand_HRS1	sys_HRS1	strat1_HRS1	strat2_HRS1
Mean	40.00000	41.0540	40.55397	36.76008	41.13888
Std Dev	48.00000	13.9835	9.00000	17.22603	14.04410
Med	37.00000	40.0000	40.00000	40.00000	40.00000
Min	0.00000	0.0000	1.00000	0.00000	0.00000
Max	41.13888	89.0000	89.00000	89.00000	89.00000

allHEALTH <- cbind(popHEALTH, randHEALTH, sysHEALTH, strat1HEALTH, strat2HEALTH)
kable(allHEALTH)

popHEALTH	randHEALTH	sysHEALTH	strat1HEALTH	strat2HEALTH
\|Var1 \| Freq\|	\|Var1 \| Freq\|	\|Var1 \| Freq\|	\|Var1 \| Freq\|	\|Var1 \| Freq\|
\|:—-\|———:\|	\|:—-\|—–:\|	\|:—-\|———:\|	\|:—-\|———:\|	\|:—-\|———:\|
\|1 \| 0.3344966\|	\|1 \| 0.338\|	\|1 \| 0.3299389\|	\|1 \| 0.3165323\|	\|1 \| 0.3344966\|
\|2 \| 0.4981905\|	\|2 \| 0.478\|	\|2 \| 0.4969450\|	\|2 \| 0.4979839\|	\|2 \| 0.4981905\|
\|3 \| 0.1490242\|	\|3 \| 0.164\|	\|3 \| 0.1547862\|	\|3 \| 0.1693548\|	\|3 \| 0.1490242\|
\|4 \| 0.0182887\|	\|4 \| 0.020\|	\|4 \| 0.0183299\|	\|4 \| 0.0161290\|	\|4 \| 0.0182887\|

Discussion and Conclusion

We have looked at three different types of sampling plans to decide which one will give us the sample closest to the population. In this data set, the best sampling plan to use is weighted stratified samples, or method 2 stratified sampling. There is room for error in this project, because we only took one random sample for each type of sampling. This experiment shows us that for this data set, the best way to sample is the second method of stratified sampling. We know this because these numbers are the closest to the default population sample.

popHEALTH	randHEALTH	sysHEALTH	strat1HEALTH	strat2HEALTH
\|Var1 \| Freq\|	\|Var1 \| Freq\|	\|Var1 \| Freq\|	\|Var1 \| Freq\|	\|Var1 \| Freq\|
\|:—-\|———:\|	\|:—-\|—–:\|	\|:—-\|———:\|	\|:—-\|———:\|	\|:—-\|———:\|
\|1 \| 0.3344966\|	\|1 \| 0.338\|	\|1 \| 0.3299389\|	\|1 \| 0.3165323\|	\|1 \| 0.3344966\|
\|2 \| 0.4981905\|	\|2 \| 0.478\|	\|2 \| 0.4969450\|	\|2 \| 0.4979839\|	\|2 \| 0.4981905\|
\|3 \| 0.1490242\|	\|3 \| 0.164\|	\|3 \| 0.1547862\|	\|3 \| 0.1693548\|	\|3 \| 0.1490242\|
\|4 \| 0.0182887\|	\|4 \| 0.020\|	\|4 \| 0.0183299\|	\|4 \| 0.0161290\|	\|4 \| 0.0182887\|

Sampling Distribution Report

Emma Laughlin