Estimation
Point & Interval Estimation
The data I use is the “wage1” data set in the “np” library: cross-sectional wage data consisting of a random sample taken from the U.S. Current Population Survey for the year 1976. There are 526 observations in total.
1. Descriptive statistics
My main goal is to study the relationship between wage and marital status. First, I use summary() to get basic information on these two columns. Wage ranges from 0.53 to 24.98, with a mean of 5.896 and a median of 4.650. Second, the histogram and box plot show that wage does not look normally distributed: it is right-skewed, with most cases concentrated at the lower wages and a long tail of high wages. Third, the plot of the married column shows that married people are the larger group (320 versus 206). This made me curious about the relationship between wage and marital status. As a first look, a box plot of wage by marital status shows that married professionals tend to have higher wages.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.530 3.330 4.650 5.896 6.880 24.980
## Married Notmarried
## 320 206
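The shape of the wage distribution can also be checked from the summary numbers alone. The sketch below copies the quartiles, median, and mean from the output above and computes two simple indicators; the variable names are illustrative and are not part of the original analysis.

```r
# Values copied from the summary() output above
q1  <- 3.330
med <- 4.650
avg <- 5.896
q3  <- 6.880

# A mean above the median points to a long right tail
mean_minus_median <- avg - med

# Quartile (Bowley) skewness: positive when the upper half of the
# distribution is more spread out than the lower half
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

print(c(mean_minus_median = mean_minus_median, bowley = bowley))
```

Both values are positive, which is consistent with most wages sitting at the low end and a tail of high wages.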
2. Estimation
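2.1.1 Estimate the overall mean wage and the mean wage of Married/Notmarried professionals in the U.S.
First I estimate the mean wage for the whole sample. Then, after splitting the data into married and not married, I estimate the mean wage of each group. The three outputs below are, in order, for the whole sample (n = 526), the married group (n = 320), and the not-married group (n = 206). The married professionals have the higher mean wage (6.57 versus 4.84), and the two 95% confidence intervals, [6.13, 7.01] and [4.45, 5.24], do not overlap.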
## Two-Sided 95% Confidence Interval: Arithmetic Mean
##
## n nNA pNA M SD Low Upp
## 526 0 0% 5.90 3.69 5.58 6.21
## Two-Sided 95% Confidence Interval: Arithmetic Mean
##
## n nNA pNA M SD Low Upp
## 320 0 0% 6.57 3.99 6.13 7.01
## Two-Sided 95% Confidence Interval: Arithmetic Mean
##
## n nNA pNA M SD Low Upp
## 206 0 0% 4.84 2.89 4.45 5.24
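The intervals above can be approximately reproduced from the reported n, M, and SD with the usual t-based formula, M ± t × SD/√n. This is a minimal sketch in base R; `ci_mean` is a hypothetical helper written for illustration, not the function that produced the output, and small differences in the last digit can come from the rounded inputs.

```r
# t-based confidence interval for a mean, from summary statistics only
ci_mean <- function(m, s, n, level = 0.95) {
  se    <- s / sqrt(n)                          # standard error of the mean
  tcrit <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  c(low = m - tcrit * se, upp = m + tcrit * se)
}

# Inputs copied from the outputs above (mean of all wages from summary())
all_ci        <- ci_mean(5.896, 3.69, 526)
married_ci    <- ci_mean(6.57, 3.99, 320)
notmarried_ci <- ci_mean(4.84, 2.89, 206)

print(round(rbind(all = all_ci, married = married_ci,
                  notmarried = notmarried_ci), 2))
```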
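2.1.2 Estimate the difference in mean wage of Married/Notmarried professionals
Now I estimate the difference directly to make the comparison above clearer. I calculate the confidence interval for the difference in mean wage between married and not-married people and use a plot to show the details. The estimated difference is 1.73, with a 95% confidence interval of [1.14, 2.32]. Because the interval lies entirely above zero, the data support the claim that married professionals have a higher mean wage.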
## Two-Sided 95% Confidence Interval: Difference in Means from Independent Samples
##
## Between n nNA M SD M.Diff Low Upp
## 1 206 0 4.84 2.89
## 2 320 0 6.57 3.99 1.73 1.14 2.32
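The reported bounds are consistent with a Welch-type interval, which does not assume equal variances in the two groups. The sketch below recomputes it from the group summaries; `ci_mean_diff_welch` is a hypothetical helper, and whether the original function uses exactly this method is an assumption based on the numbers matching.

```r
# Welch confidence interval for a difference in means (group 2 minus group 1),
# computed from summary statistics only
ci_mean_diff_welch <- function(m1, s1, n1, m2, s2, n2, level = 0.95) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  se <- sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  tcrit <- qt(1 - (1 - level) / 2, df = df)
  d <- m2 - m1
  c(diff = d, low = d - tcrit * se, upp = d + tcrit * se)
}

# Group 1 = Notmarried, group 2 = Married (values from the output above)
welch_ci <- ci_mean_diff_welch(4.84, 2.89, 206, 6.57, 3.99, 320)
print(round(welch_ci, 2))
```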
2.2.1 Estimate the proportion of married professionals in different wage intervals
Next, I look at the question from another angle. I divide wage into 5 intervals and estimate the proportion of married people in each one. If the proportion increases as the wage interval increases, this supports the same conclusion from another side. First, a plot of marital status by interval suggests that the idea is correct. Then I estimate confidence intervals to look more closely. The first output below is for the whole sample (n = 526); the following five are for the five wage intervals. For the last two intervals (n = 11 and n = 5) I need the “wilson” method instead of “wald”. The Wald method simply plugs the sample proportion into the standard error in place of the population proportion, which is unreliable for small samples, and here it would push the upper limit above 1.
## Two-Sided 95% Confidence Interval: Proportion
##
## n nNA pNA Prop Low Upp
## 526 0 0% 0.608 0.566 0.649
## Two-Sided 95% Confidence Interval: Proportion
##
## n nNA pNA Prop Low Upp
## 292 0 0% 0.517 0.460 0.574
## Two-Sided 95% Confidence Interval: Proportion
##
## n nNA pNA Prop Low Upp
## 182 0 0% 0.709 0.639 0.770
## Two-Sided 95% Confidence Interval: Proportion
##
## n nNA pNA Prop Low Upp
## 36 0 0% 0.722 0.576 0.869
## Two-Sided 95% Confidence Interval: Proportion
##
## n nNA pNA Prop Low Upp
## 11 0 0% 0.909 0.623 0.984
## Two-Sided 95% Confidence Interval: Proportion
##
## n nNA pNA Prop Low Upp
## 5 0 0% 0.800 0.376 0.964
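To illustrate why the Wald interval breaks down in the two smallest groups, the sketch below implements both formulas in base R. `ci_prop` is a hypothetical helper, not the function used for the output above, and the counts 10 of 11 and 4 of 5 are inferred from the reported proportions 0.909 and 0.800.

```r
# Wald and Wilson confidence intervals for a proportion, x successes out of n
ci_prop <- function(x, n, method = c("wilson", "wald"), level = 0.95) {
  method <- match.arg(method)
  p <- x / n
  z <- qnorm(1 - (1 - level) / 2)
  if (method == "wald") {
    # Wald: sample proportion plugged directly into the standard error
    half <- z * sqrt(p * (1 - p) / n)
    return(c(low = p - half, upp = p + half))
  }
  # Wilson: score interval, always stays inside [0, 1]
  center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(low = center - half, upp = center + half)
}

wald_11   <- ci_prop(10, 11, method = "wald")    # upper limit exceeds 1
wilson_11 <- ci_prop(10, 11, method = "wilson")
wilson_5  <- ci_prop(4, 5, method = "wilson")

print(round(rbind(wald_11, wilson_11, wilson_5), 3))
```

With these inputs the Wilson limits agree with the last two outputs above, while the Wald upper limit for the n = 11 group is greater than 1.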
I use ggplot() to plot the results. Although the proportion does tend to increase with wage, the margin of error is too large for some intervals because their samples are small (n = 11 and n = 5 for the two highest intervals).
library(ggplot2)
# groups, proportion, marginError_Low and marginError_Upper hold the interval
# labels, estimates, and confidence limits computed above
tmp <- data.frame(groups, proportion, marginError_Low, marginError_Upper)
# Keep the intervals in their original order on the x axis
tmp$groups <- factor(tmp$groups, levels = groups)
ggplot(tmp, aes(x = groups, y = proportion, color = groups)) +
  geom_errorbar(aes(ymin = marginError_Low, ymax = marginError_Upper), width = 0.1) +
  geom_point(size = 4, shape = 17, col = "hotpink") +
  xlab("wage interval") +
  ylab("married proportion estimation") +
  ggtitle("Estimation of proportion of married in different intervals of wage")
2.2.2 Estimate the difference in proportions of married professionals above/below the mean wage
Now I estimate the difference in the proportion of married people above and below the mean wage, which simplifies the comparison above and enlarges the sample in each group. The fourfold plot shows that the higher-wage group contains a larger share of married professionals. I then calculate the confidence interval for the difference in proportions. The margin of error is still about 0.08, which is 0.08/0.21 = 38% of the estimated difference.
library(np)     # provides the wage1 data set
library(misty)  # assumed source of ci.prop.diff(); the report does not name the package
data("wage1")
# Recode married as 1 = Married, 0 = Notmarried
wage1$married <- as.numeric(wage1$married == "Married")
meanWage <- mean(wage1$wage)
# Split the sample at the mean wage
above <- subset(wage1, wage > meanWage)
below <- subset(wage1, wage < meanWage)
ci.prop.diff(below$married, above$married)
## Two-Sided 95% Confidence Interval: Difference in Proportions from Independent Samples
##
## Between n nNA p p.Diff Low Upp
## 1 325 0 0.53
## 2 201 0 0.74 0.21 0.12 0.29
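The margin of error of about 0.08 quoted above can be roughly reproduced from the rounded proportions with a simple Wald-type approximation. This is only a sketch: the function that produced the output may use a different interval method, so its bounds need not match this calculation exactly, and the variable names are illustrative.

```r
# Rounded group summaries copied from the output above
p_below <- 0.53; n_below <- 325
p_above <- 0.74; n_above <- 201

z  <- qnorm(0.975)  # two-sided 95% critical value
# Wald standard error of the difference between two independent proportions
se <- sqrt(p_below * (1 - p_below) / n_below +
           p_above * (1 - p_above) / n_above)

margin   <- z * se                        # half-width of the interval
relative <- margin / (p_above - p_below)  # margin relative to the estimate

print(round(c(margin = margin, relative = relative), 3))
```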
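3. Summary
All in all, higher wages are associated with being married: married professionals earn more on average, and the proportion of married people is higher above the mean wage than below it. One possible reason is that people with a stable and better income are more willing to marry and are more attractive as partners.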