In this part we analyze distribution of wages of White males and females from North Center.
Firstly we save information about population mean and population standard deviation of wages.
Then we filter out data so all we have left are white males and females.
Distribution of both on 2 histograms:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the plot above:
in blue- Male Wage distribution,
in red- Female Wage distribution.
## Means SDs
## Population 5.896103 3.693086
## Males 7.106885 3.849507
## Females 4.395714 2.101629
It is worth to notice that on average Females from North Center earn less than population average and Males from North Center on average earn more than population average.
Plot above shows us distribution of wages.
Width line inside boxplots represents median.
Firstly let’s check how big sample size we would need to have to meet assumptions
error<-0.1
conf_level<-0.95
pop_size_M<-nrow(whiteM)
pop_size_F<-nrow(whiteF)
sum_all_wage<-sum(malefemales$value)
sum_M<-sum(whiteM$wage)
sum_F<-sum(whiteF$wage)
p_M<-sum_M/sum_all_wage
p_F<-sum_F/sum_all_wage
p_SD<-sqrt(p_M*p_F)
sample.size.prop(error,p_SD,pop_size_M,conf_level)
##
## sample.size.prop object: Sample size for proportion estimate
## With finite population correction: N=61, precision e=0.1 and expected proportion P=0.4877
##
## Sample size needed: 38
sample.size.prop(error,p_SD,pop_size_F,conf_level)
##
## sample.size.prop object: Sample size for proportion estimate
## With finite population correction: N=63, precision e=0.1 and expected proportion P=0.4877
##
## Sample size needed: 39
So for males we need 38 entries in a sample, and 39 for females, so lets set sample size to 39.
Let’s now create 50 samples, and save proportions, lower and upper boundaries for later.
p_size_W<-nrow(malefemales)
s_size<-39
n<-50
samples_p<-rep(NA,n)
samples_sd<-rep(NA,n)
samples_lower<-rep(NA,n)
samples_upper<-rep(NA,n)
for (i in 1:n){
s_rows<- sample(1:p_size_W, s_size)
sampl<-malefemales[s_rows,]
sample_sum_all<-sum(sampl$value)
sample_Males<-filter(sampl,sampl$name=="Male")
sample_sum_M<-sum(sample_Males$value)
sampl_p<-sample_sum_M/sample_sum_all
sampl_sd<-sqrt(sampl_p*(1-sampl_p))
sampl_lower<-sampl_p-1.96*(sampl_sd/sqrt(s_size))
sampl_upper<-sampl_p+1.96*(sampl_sd/sqrt(s_size))
samples_p[i]<-sampl_p
samples_sd[i]<-sampl_sd
samples_lower[i]<-sampl_lower
samples_upper[i]<-sampl_upper
}
Lets print 5 random confidence intervals (lower and upper bounds) we obtained above.
## [1] " Lower " " Upper "
## [1] 0.3993044 0.7112323
## [1] 0.4923564 0.7931446
## [1] 0.4406216 0.7487913
## [1] 0.4071491 0.7185125
## [1] 0.521151 0.816563
## Where true value is 0.6102048
We can see that true proportion value fits between these confidence interval.
Just as checkup, lets see if true value fits between average confidence interval.
## Sample Lower Bound: 0.4532764
## True Mean: 5.896103
## Sample Upper Bound 0.756677
Lets visualize obtained CI (confidence intervals).
Now lets compare average proportion from samples and true proportion.
## Estimated proportion: 0.6049767
## True proportion: 0.6102048
## Percentage error between estimation and true value: -0.8567784 %
Thanks to the CLT, estimated average is close to the real value.
If we were about to estimate means for white male and/or female from North Center, our calculations would be the same as above.
At last, lets check how would our results differ for other sample sizes and number of samples.
1) half a size (39/2)
## Estimated proportion: 0.490861
## True proportion: 0.6102048
## Percentage error between estimation and true value: -19.558 %
Result may differ for different sampling, if error in this case is as low as before, we are lucky.
2) 1.5 * sample size
## Estimated proportion: 0.6050753
## True proportion: 0.6102048
## Percentage error between estimation and true value: -0.8406258 %
In this case, error should be the similar or lower than in our first try.
Lets also try to estimate whole population mean of wage from sample of white people from North Center.
s_size<-80
n<-100
p_size<-nrow(wage1)
samples_mean4<-rep(NA,n)
samples_sd4<-rep(NA,n)
samples_lower4<-rep(NA,n)
samples_upper4<-rep(NA,n)
for (i in 1:n){
s_rows<- sample(1:p_size_W, s_size)
sampl<-malefemales[s_rows,]
sampl_mean<-mean(sampl$value)
sampl_sd<-sd(sampl$value)
sampl_lower<-sampl_mean-1.96*(sampl_sd/sqrt(s_size))
sampl_upper<-sampl_mean+1.96*(sampl_sd/sqrt(s_size))
samples_mean4[i]<-sampl_mean
samples_sd4[i]<-sampl_sd
samples_lower4[i]<-sampl_lower
samples_upper4[i]<-sampl_upper
}
## Estimated mean: 5.732019
## True mean: 5.896103
## Percentage error between estimation and true value: -2.782922 %
## Sample Lower Bound: 5.003083
## True Mean: 5.896103
## Sample Upper Bound 6.460955
Lets put North Center away and have some fun with whole population!
pop_female<-filter(wage1,wage1$female=="Female")
pop_male<-filter(wage1,wage1$female=="Male")
pop_female_mean<-mean(pop_female$wage)
pop_female_sd<-sd(pop_female$wage)
pop_male_sd<-sd(pop_male$wage)
pop_male_mean<-mean(pop_male$wage)
E<-abs(pop_female_mean-pop_male_mean)
Ds<-(pop_female_sd^2) + (pop_male_sd^2)
D<-sqrt(Ds)
shadeDist(xshade=0,ddist="dnorm",parm1=E,parm2=D,lower.tail = FALSE)
E2<-abs(pop_female_mean-meanP)
Ds2<-sdP^2 + pop_female_sd^2
D2<-sqrt(Ds2)
shadeDist(xshade=0,ddist="dnorm",parm1=E2,parm2=D2,lower.tail = FALSE)