In this report, we will be looking at different methods of sampling to perform an empirical comparison of mean, standard deviation, median, minimum and maximum. The discrepancies between the default population rate and each of the sampling plans shows the goodness of the sampling plans.
The data set that we will be using was put together from the General Social Survey over the years 1972-2022 and can be found at https://raw.githubusercontent.com/LPyott/Capstone/main/GSS.csv. The variables are as follows:
year: GSS year for respondent
id_: respondent id number
hrs1: number of hours you worked last week
age: age of respondent
race: race of respondent
income: total family income
health: condition of health
ballot: ballot used for interview
The variables we will be using for analysis are health, hrs1, and age. Health is a categorical variable with the survey question being “Would you say your own health, in general, is excellent, good, fair, or poor?”, coded as 1, 2, 3, and 4 respectively for analysis purposes. We filtered out all health values less than 0, the hours worked values less than 0, and also the age values less than 18. After filtering the missing values out, we changed age into a categorical variable with the youngest age group being 18 and 19 year olds, then 20 to 29, and breaks every 10 years after.
GSS = read.csv('https://raw.githubusercontent.com/LPyott/Capstone/main/GSS.csv')
eval = GSS %>%
filter(AGE>17, HRS1 >= 0, HEALTH >0) #filtering out missing values
eval=eval %>%
mutate(agecat=case_when(AGE<18~"No answer",
AGE==18 | AGE==19~"18 and 19",
AGE>19 & AGE<30~"20 to 29",
AGE>29 & AGE<40~"30 to 39",
AGE>39 & AGE<50~"40 to 49",
AGE>49 & AGE<60~"50 to 59",
AGE>59 & AGE<70~"60 to 69",
AGE>69 & AGE<80~"70 to 79",
AGE>79 & AGE<90~"80 to 89",
))
The histograms and barplot show the frequencies of each of our variables.
par(mfrow = c(1,3))
hist(eval$AGE)
hist(eval$HRS1)
barplot(popHEALTH, main="Barplot of Population Health")
kable(popHEALTH)
| Var1 | Freq |
|---|---|
| 1 | 0.3344966 |
| 2 | 0.4981905 |
| 3 | 0.1490242 |
| 4 | 0.0182887 |
In a simple random sample, a sample is drawn from the population randomly with replacement. When drawing this type of sample, we assume all possible combinations of data points have an equally likely chance to be selected as a random sample for analysis. In this analysis, we are using a random sample of 500 people.
rand_df <- eval[sample(nrow(eval), size=500), ]
The histograms and barplot show the frequencies of each of our random sample variables.
par(mfrow = c(1,3))
hist(rand_df$AGE)
hist(rand_df$HRS1)
barplot(randHEALTH, main="Barplot of Random Sample Health")
kable(randHEALTH)
| Var1 | Freq |
|---|---|
| 1 | 0.338 |
| 2 | 0.478 |
| 3 | 0.164 |
| 4 | 0.020 |
In systematic sampling, the key step is finding the jump size (m). This is equal to the population size divided by the size of the random sample to be drawn from the population. The next step is taking a random sample from the first “m-th” record, and choose every “m-th” record for the sample. Below is the equation to find the jump size \[ jump \:size= \frac{population\:size}{random \:sample\: size} \]
obtain_sys = function(N,n){
k = ceiling(N/(n-1))
r = sample(1:k, 1)
seq(r, r + k*(n-1), k)
}
#apply function
sys_sample_df = eval[obtain_sys(nrow(eval), 500), ]
The histograms and barplot show the frequencies of each of our systematic sample variables.
par(mfrow = c(1,3))
hist(sys_sample_df$AGE)
hist(sys_sample_df$HRS1)
barplot(sysHEALTH, main="Barplot of Sys Sample Health")
kable(sysHEALTH)
| Var1 | Freq |
|---|---|
| 1 | 0.3299389 |
| 2 | 0.4969450 |
| 3 | 0.1547862 |
| 4 | 0.0183299 |
Stratified sampling occurs when the data is put into several subpopulations when a simple random sample is difficult to obtain. It is important to note here that the subsample must be proportional to the subpopulation size in order to obtain a combined sample similar to the simple random sample and systematic samples. In method 1 of the stratified sampling, we are using a stratification variable. In this sample, we made the age variable categorical, making 18 an 19 year olds in one group and each decade grouped together afterwards. We used the sample number to be 62.
strat_sample_1 <- eval %>%
group_by(agecat) %>%
sample_n(size=62)
The histograms and barplot show the frequencies of each of our method 1 stratification sample variables.
par(mfrow = c(1,3))
hist(strat_sample_1$AGE)
hist(strat_sample_1$HRS1)
barplot(strat1HEALTH, main="Barplot of Strat 1 Health")
kable(strat1HEALTH)
| Var1 | Freq |
|---|---|
| 1 | 0.3165323 |
| 2 | 0.4979839 |
| 3 | 0.1693548 |
| 4 | 0.0161290 |
In the second method of stratified sampling, we keep age as a categorical variable. In this method, for us to have proportional samples we look at how many people are in each category. Based on this, we take a proportion of the strata and apply it to the sample, weighting it. Within these strata is where the random sample is drawn from.
strat_sample_2= eval %>%
arrange(agecat)
agecat_table=table(eval$agecat)
agecat_dataframe=as.data.frame(agecat_table)
#Proportions of each strata
agecat_dataframe$weight=agecat_dataframe$Freq/nrow(eval)
#Sample size by prooprtion to reach n=500 new variabke to makwe the weights 500
agecat_dataframe$n=ceiling(500*agecat_dataframe$weight)
agecat_dataframe
## Var1 Freq weight n
## 1 18 and 19 393 0.012698720 7
## 2 20 to 29 6576 0.212485459 107
## 3 30 to 39 8201 0.264992891 133
## 4 40 to 49 7152 0.231097325 116
## 5 50 to 59 5473 0.176845030 89
## 6 60 to 69 2524 0.081556159 41
## 7 70 to 79 547 0.017674809 9
## 8 80 to 89 82 0.002649606 2
st=strata(strat_sample_2,
stratanames=c("agecat"),
size=c(9, 93, 105, 90, 78, 66, 44, 20), method="srswor")
#method means within the strata we're taking a simple random sample
sample2=getdata(eval, st)
The histograms and barplot show the frequencies of each of our method 2 stratification sample variables.
par(mfrow = c(1,3))
hist(strat_sample_2$AGE)
hist(strat_sample_2$HRS1)
barplot(strat2HEALTH, main="Barplot of Strat 2 Health")
kable(strat2HEALTH)
| Var1 | Freq |
|---|---|
| 1 | 0.3344966 |
| 2 | 0.4981905 |
| 3 | 0.1490242 |
| 4 | 0.0182887 |
Below is each variable and a table summary that compares each random sample to the population. The closer each number is to the population shows us the goodness of the model.
AGE <- cbind(eval_AGE, rand_AGE, sys_AGE, strat1_AGE, strat2_AGE)
rownames(AGE)<- c("Mean", "Std Dev", "Med", "Min", "Max")
kable(AGE)
| eval_AGE | rand_AGE | sys_AGE | strat1_AGE | strat2_AGE | |
|---|---|---|---|---|---|
| Mean | 41.22 | 40.83600 | 41.40122 | 49.47782 | 41.21675 |
| Std Dev | 13.29 | 13.89832 | 9.00000 | 21.76318 | 13.28968 |
| Med | 40.00 | 39.00000 | 40.00000 | 49.50000 | 40.00000 |
| Min | 18.00 | 18.00000 | 19.00000 | 18.00000 | 18.00000 |
| Max | 89.00 | 84.00000 | 77.00000 | 89.00000 | 89.00000 |
HRS1 <- cbind(eval_HRS1, rand_HRS1, sys_HRS1, strat1_HRS1, strat2_HRS1)
rownames(HRS1)<- c("Mean", "Std Dev", "Med", "Min", "Max")
kable(HRS1)
| eval_HRS1 | rand_HRS1 | sys_HRS1 | strat1_HRS1 | strat2_HRS1 | |
|---|---|---|---|---|---|
| Mean | 40.00000 | 41.0540 | 40.55397 | 36.76008 | 41.13888 |
| Std Dev | 48.00000 | 13.9835 | 9.00000 | 17.22603 | 14.04410 |
| Med | 37.00000 | 40.0000 | 40.00000 | 40.00000 | 40.00000 |
| Min | 0.00000 | 0.0000 | 1.00000 | 0.00000 | 0.00000 |
| Max | 41.13888 | 89.0000 | 89.00000 | 89.00000 | 89.00000 |
allHEALTH <- cbind(popHEALTH, randHEALTH, sysHEALTH, strat1HEALTH, strat2HEALTH)
kable(allHEALTH)
| popHEALTH | randHEALTH | sysHEALTH | strat1HEALTH | strat2HEALTH |
|---|---|---|---|---|
| |Var1 | Freq| | |Var1 | Freq| | |Var1 | Freq| | |Var1 | Freq| | |Var1 | Freq| |
| |:—-|———:| | |:—-|—–:| | |:—-|———:| | |:—-|———:| | |:—-|———:| |
| |1 | 0.3344966| | |1 | 0.338| | |1 | 0.3299389| | |1 | 0.3165323| | |1 | 0.3344966| |
| |2 | 0.4981905| | |2 | 0.478| | |2 | 0.4969450| | |2 | 0.4979839| | |2 | 0.4981905| |
| |3 | 0.1490242| | |3 | 0.164| | |3 | 0.1547862| | |3 | 0.1693548| | |3 | 0.1490242| |
| |4 | 0.0182887| | |4 | 0.020| | |4 | 0.0183299| | |4 | 0.0161290| | |4 | 0.0182887| |
We have looked at three different types of sampling plans to decide which one will give us the sample closest to the population. In this data set, the best sampling plan to use is weighted stratified samples, or method 2 stratified sampling. There is room for error in this project, because we only took one random sample for each type of sampling. This experiment shows us that for this data set, the best way to sample is the second method of stratified sampling. We know this because these numbers are the closest to the default population sample.