1. Introducing a new variable
In order to analyze and compare the wages depending on the region of living, a new variable “region” has been introduced. It takes three Boolean values from the “wage1” dataset (northcen, south, west) and merges them into one which gives information about the region where the particular person from the dataset lives.
data("wage1")
wage1_region <- wage1 %>% mutate(region = if_else(northcen == "1","northcen",
if_else(west == "1", "west",
if_else(south == "1", "south",NULL)))) %>% na.omit()
married_west_male <- wage1_region %>% filter(married == "Married", region == "west", female == "Male")
married_male <- wage1_region %>% filter(married == "Married", female == "Male")
summary(wage1_region)
## wage educ exper tenure
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.297 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.600 Median :12.00 Median :14.00 Median : 2.000
## Mean : 5.759 Mean :12.48 Mean :17.40 Mean : 4.975
## 3rd Qu.: 6.820 3rd Qu.:14.00 3rd Qu.:27.25 3rd Qu.: 7.000
## Max. :22.200 Max. :18.00 Max. :51.00 Max. :33.000
## nonwhite female married numdep smsa
## Nonwhite: 44 Female:197 Married :254 Min. :0.000 Min. :0.0000
## White :364 Male :211 Notmarried:154 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.0000
## Mean :1.054 Mean :0.6936
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.000 Max. :1.0000
## northcen south west construc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.3235 Mean :0.4583 Mean :0.2181 Mean :0.04657
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## ndurman trcommpu trade services
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.1225 Mean :0.04412 Mean :0.2917 Mean :0.1005
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## profserv profocc clerocc servocc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2451 Mean :0.3529 Mean :0.1691 Mean :0.1397
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## lwage expersq tenursq region
## Min. :-0.6349 Min. : 1.0 Min. : 0.00 Length:408
## 1st Qu.: 1.1932 1st Qu.: 25.0 1st Qu.: 0.00 Class :character
## Median : 1.5260 Median : 196.0 Median : 4.00 Mode :character
## Mean : 1.6075 Mean : 486.1 Mean : 71.64
## 3rd Qu.: 1.9198 3rd Qu.: 742.8 3rd Qu.: 49.00
## Max. : 3.1001 Max. :2601.0 Max. :1089.00
2. General Overview
Basic boxplot of wages distribution
boxplot(wage1_region$wage, main = "General overiew on hourly earnings", ylab = "avarage hourly earnings")
median(wage1_region$wage)
## [1] 4.6
A general overview shows the wages of the cleaned dataset which median is equal to 4.6. As we can see, there are also many outliers of people that earns much more than avarage.
Frequency table of the wages
attach(wage1_region)
interval <- (max(wage)-min(wage))/10
wage_limits <- cut(wage, seq(min(wage), max(wage), by = interval))
wage_tab <- table(wage_limits)
wage_tab <- as.data.frame(wage_tab)
kable(wage_tab, format = "html", col.names=c("Wages range", "Frequency"))
| Wages range | Frequency |
|---|---|
| (0.53,2.7] | 19 |
| (2.7,4.86] | 198 |
| (4.86,7.03] | 94 |
| (7.03,9.2] | 43 |
| (9.2,11.4] | 24 |
| (11.4,13.5] | 12 |
| (13.5,15.7] | 7 |
| (15.7,17.9] | 3 |
| (17.9,20] | 5 |
| (20,22.2] | 2 |
Tabular Accurancy Index
wage_tai <- classIntervals(wage, n = 10, style="fixed", fixedBreaks=seq(min(wage), max(wage), by = interval) )
tai <- jenks.tests(wage_tai)
kable(tai, format = "html")
| x | |
|---|---|
| # classes | 10.0000000 |
| Goodness of fit | 0.9712185 |
| Tabular accuracy | 0.8015681 |
3. Detailed wages distribution
Wage & Marriage & Region
Let’s investigate the height of wage depending of the fact that someone is married
Boxplot of relationship between wages and the fact of beeing married
ggplot(wage1_region, aes(married, wage)) + geom_boxplot()
A results show that people that are married earns on average more than not married ones.
Addtional factor - region
ggplot(wage1_region, aes(married, wage, fill=region)) + geom_boxplot()
When a region is taken into account, it turns out that the tendency of the married people earning more is kept in every region. Moreover, people from west earns the most, when the ones from south the least.
Density plots
Here we take into account also a relationship between wage and sex. Plots are also divided for each region.
On the first plot, there is a % value, whie on the second histogram, there are discete values
qplot(wage, data = wage1_region, geom = "density", color = female) + theme_bw() + facet_wrap(~region)
qplot(wage, data = wage1_region, geom = "histogram", color = female) + theme_bw() + facet_wrap(~region)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on sensity plots, it can be assumed that the distribution of the wages is right skewed. Moreover, he intensity of the extremes is greater than that of the normal distribution, so there are thicker tails.
Another assumption is that, no matter the region, men earns significantly more.
4. Summarized measures
our_summary1 <-
list("Central tednecy measures" =
list("mean" = ~ mean(wage),
"median" = ~ median(wage),
"variance" = ~ var(wage),
"sd" = ~ sd(wage),
"IQR" = ~ IQR(wage)),
"Shape measures" =
list("Skewness" = ~ skewness(wage),
"Kurthosis" = ~ kurtosis(wage)),
"Extreme values" =
list("min" = ~ min(wage),
"max" = ~ max(wage))
)
summary_table(group_by(wage1_region,region), our_summary1) %>% kbl(format = "html") %>% pack_rows("Central tednecy measures",1, 5) %>% pack_rows("Shape measures",6, 7) %>% pack_rows("Extreme values",8, 9)
| northcen (N = 132) | south (N = 187) | west (N = 89) | |
|---|---|---|---|
| Central tednecy measures | |||
| mean | 5.710455 | 5.386898 | 6.613371 |
| median | 4.685000 | 4.500000 | 5.300000 |
| variance | 11.055488 | 9.602000 | 17.784848 |
| sd | 3.324979 | 3.098709 | 4.217208 |
| IQR | 3.542500 | 3.135000 | 4.670000 |
| Shape measures | |||
| Skewness | 1.909692 | 1.974116 | 1.488316 |
| Kurthosis | 7.856008 | 7.763816 | 5.082929 |
| Extreme values | |||
| min | 1.500000 | 1.500000 | 0.530000 |
| max | 21.860001 | 20.000000 | 22.200001 |
5. Proportion Interval Estimaion
We are calclutaing the proportion of married men that lives in west and all married men.
Manual Solution
n <- nrow(married_male)
k <- nrow(married_west_male)
p_hat <- k/n
p_hat
## [1] 0.1946309
se_prop <- sqrt(p_hat*(1-p_hat)/n)
se_prop
## [1] 0.03243472
E = qnorm(.975)*se_prop
p_hat + c(-E, E)
## [1] 0.1310600 0.2582018
standard_error <- sd(married_male$wage)/sqrt(n)
standard_error
## [1] 0.3375064
The proportion (p hat) of married men that lives in west and all married men is equal to 0.19
The estimated confidence interval of the proportion of married men that lives in west and all married men is between 0.13 and 0.25.
Test check
p_hat_interval <- prop.test(k,n)
p_hat_interval
##
## 1-sample proportions test with continuity correction
##
## data: k out of n, null probability 0.5
## X-squared = 54.362, df = 1, p-value = 1.667e-13
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.1362006 0.2692053
## sample estimates:
## p
## 0.1946309
A test confirms the results of calculations
6. Mean Estimation
We are estiamting the mean of the population based on the sample of married men.
Manual Solution
attach(married_male)
## Następujące obiekty zostały zakryte z wage1_region:
##
## clerocc, construc, educ, exper, expersq, female, lwage, married,
## ndurman, nonwhite, northcen, numdep, profocc, profserv, region,
## services, servocc, smsa, south, tenure, tenursq, trade, trcommpu,
## wage, west
n <- length(wage)
xbar <- mean(wage)
s <- sd(wage)
margin <- qt(0.975,df=n-1)*s/sqrt(n)
low <- xbar - margin
high <- xbar + margin
low
## [1] 6.919556
high
## [1] 8.253464
Based on calculations, a mean is estimated (with 95% confidence) to lie in the interval between 6,92 and 8,25
Test check
t_test <- t.test(wage)
t_test
##
## One Sample t-test
##
## data: wage
## t = 22.478, df = 148, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 6.919556 8.253464
## sample estimates:
## mean of x
## 7.58651
A test confirms the results of calculations
Plots
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
x1 <- summarySE(married_male, measurevar="wage", groupvars=c("region"))
x1
## region N wage sd se ci
## 1 northcen 44 7.687955 4.108166 0.6193294 1.2489967
## 2 south 76 6.615132 3.405525 0.3906406 0.7781959
## 3 west 29 9.978276 4.922120 0.9140147 1.8722743
# Error bars represent standard error of the mean
ggplot(x1, aes(x=region, y=wage, fill = region)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=wage-se, ymax=wage+se),
width=.2, # Width of the error bars
position=position_dodge(.9))
# Use 95% confidence intervals instead of SEM
ggplot(x1, aes(x=region, y=wage, fill = region)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=wage-ci, ymax=wage+ci),
width=.2, # Width of the error bars
position=position_dodge(.9))