Report 1

0.1 Distribution of wages
0.2 Calculate True Means and SDs
0.3 Violin Plots
0.4 Sampling Estimation
- 0.4.1 Proportions
- 0.4.2 Mean
0.5 Probability
- 0.5.1 What is the probability that a female have higher wage than a male?
- 0.5.2 What is the probability for a woman to have wage higher than population average?

0.1 Distribution of wages

In this part we analyze distribution of wages of White males and females from North Center.
Firstly we save information about population mean and population standard deviation of wages.

Then we filter out data so all we have left are white males and females.

Distribution of both on 2 histograms:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the plot above:
in blue- Male Wage distribution,
in red- Female Wage distribution.

0.2 Calculate True Means and SDs

##               Means      SDs
## Population 5.896103 3.693086
## Males      7.106885 3.849507
## Females    4.395714 2.101629

It is worth to notice that on average Females from North Center earn less than population average and Males from North Center on average earn more than population average.

0.3 Violin Plots

Plot above shows us distribution of wages.
Width line inside boxplots represents median.

0.4 Sampling Estimation

0.4.1 Proportions

Firstly let’s check how big sample size we would need to have to meet assumptions

error<-0.1
conf_level<-0.95

pop_size_M<-nrow(whiteM)
pop_size_F<-nrow(whiteF)

sum_all_wage<-sum(malefemales$value)
sum_M<-sum(whiteM$wage)
sum_F<-sum(whiteF$wage)

p_M<-sum_M/sum_all_wage
p_F<-sum_F/sum_all_wage
p_SD<-sqrt(p_M*p_F)

sample.size.prop(error,p_SD,pop_size_M,conf_level)

## 
## sample.size.prop object: Sample size for proportion estimate
## With finite population correction: N=61, precision e=0.1 and expected proportion P=0.4877
## 
## Sample size needed: 38

sample.size.prop(error,p_SD,pop_size_F,conf_level)

## 
## sample.size.prop object: Sample size for proportion estimate
## With finite population correction: N=63, precision e=0.1 and expected proportion P=0.4877
## 
## Sample size needed: 39

So for males we need 38 entries in a sample, and 39 for females, so lets set sample size to 39.

Let’s now create 50 samples, and save proportions, lower and upper boundaries for later.

p_size_W<-nrow(malefemales)
s_size<-39
n<-50
samples_p<-rep(NA,n)
samples_sd<-rep(NA,n)
samples_lower<-rep(NA,n)
samples_upper<-rep(NA,n)
for (i in 1:n){
  s_rows<- sample(1:p_size_W, s_size)
  sampl<-malefemales[s_rows,]
  
  sample_sum_all<-sum(sampl$value)
  sample_Males<-filter(sampl,sampl$name=="Male")
  sample_sum_M<-sum(sample_Males$value)

  sampl_p<-sample_sum_M/sample_sum_all
  sampl_sd<-sqrt(sampl_p*(1-sampl_p))
  sampl_lower<-sampl_p-1.96*(sampl_sd/sqrt(s_size))
  sampl_upper<-sampl_p+1.96*(sampl_sd/sqrt(s_size))
  
  samples_p[i]<-sampl_p
  samples_sd[i]<-sampl_sd
  samples_lower[i]<-sampl_lower
  samples_upper[i]<-sampl_upper
}

Lets print 5 random confidence intervals (lower and upper bounds) we obtained above.

## [1] " Lower " " Upper "

## [1] 0.3993044 0.7112323
## [1] 0.4923564 0.7931446
## [1] 0.4406216 0.7487913
## [1] 0.4071491 0.7185125
## [1] 0.521151 0.816563

## Where true value is 0.6102048

We can see that true proportion value fits between these confidence interval.

Just as checkup, lets see if true value fits between average confidence interval.

## Sample Lower Bound: 0.4532764 
## True Mean: 5.896103 
## Sample Upper Bound 0.756677

Lets visualize obtained CI (confidence intervals).

Now lets compare average proportion from samples and true proportion.

## Estimated proportion: 0.6049767

## True proportion: 0.6102048

## Percentage error between estimation and true value: -0.8567784 %

Thanks to the CLT, estimated average is close to the real value.

If we were about to estimate means for white male and/or female from North Center, our calculations would be the same as above.

At last, lets check how would our results differ for other sample sizes and number of samples.

1) half a size (39/2)

## Estimated proportion: 0.490861

## True proportion: 0.6102048

## Percentage error between estimation and true value: -19.558 %

Result may differ for different sampling, if error in this case is as low as before, we are lucky.

2) 1.5 * sample size

## Estimated proportion: 0.6050753

## True proportion: 0.6102048

## Percentage error between estimation and true value: -0.8406258 %

In this case, error should be the similar or lower than in our first try.

0.4.2 Mean

Lets also try to estimate whole population mean of wage from sample of white people from North Center.

s_size<-80
n<-100
p_size<-nrow(wage1)

samples_mean4<-rep(NA,n)
samples_sd4<-rep(NA,n)
samples_lower4<-rep(NA,n)
samples_upper4<-rep(NA,n)

for (i in 1:n){
  s_rows<- sample(1:p_size_W, s_size)
  sampl<-malefemales[s_rows,]
  
  sampl_mean<-mean(sampl$value)
  sampl_sd<-sd(sampl$value)
  sampl_lower<-sampl_mean-1.96*(sampl_sd/sqrt(s_size))
  sampl_upper<-sampl_mean+1.96*(sampl_sd/sqrt(s_size))
  
  samples_mean4[i]<-sampl_mean
  samples_sd4[i]<-sampl_sd
  samples_lower4[i]<-sampl_lower
  samples_upper4[i]<-sampl_upper
}

## Estimated mean: 5.732019

## True mean: 5.896103

## Percentage error between estimation and true value: -2.782922 %

## Sample Lower Bound: 5.003083 
## True Mean: 5.896103 
## Sample Upper Bound 6.460955

0.5 Probability

Lets put North Center away and have some fun with whole population!

0.5.1 What is the probability that a female have higher wage than a male?

pop_female<-filter(wage1,wage1$female=="Female")
pop_male<-filter(wage1,wage1$female=="Male")
pop_female_mean<-mean(pop_female$wage)
pop_female_sd<-sd(pop_female$wage)
pop_male_sd<-sd(pop_male$wage)
pop_male_mean<-mean(pop_male$wage)
E<-abs(pop_female_mean-pop_male_mean)
Ds<-(pop_female_sd^2) + (pop_male_sd^2)
D<-sqrt(Ds)
shadeDist(xshade=0,ddist="dnorm",parm1=E,parm2=D,lower.tail = FALSE)

0.5.2 What is the probability for a woman to have wage higher than population average?

E2<-abs(pop_female_mean-meanP)
Ds2<-sdP^2 + pop_female_sd^2
D2<-sqrt(Ds2)
shadeDist(xshade=0,ddist="dnorm",parm1=E2,parm2=D2,lower.tail = FALSE)