Point & Interval Estimation

Michał Rejmak

14 11 2021

1. Introducing a new variable

In order to analyze and compare the wages depending on the region of living, a new variable “region” has been introduced. It takes three Boolean values from the “wage1” dataset (northcen, south, west) and merges them into one which gives information about the region where the particular person from the dataset lives.

data("wage1")
wage1_region <- wage1 %>% mutate(region = if_else(northcen == "1","northcen",
                                          if_else(west == "1", "west",
                                          if_else(south == "1", "south",NULL)))) %>% na.omit()

married_west_male <- wage1_region %>% filter(married == "Married", region == "west", female == "Male")
married_male <- wage1_region %>% filter(married == "Married", female == "Male")

summary(wage1_region)
##       wage             educ           exper           tenure      
##  Min.   : 0.530   Min.   : 0.00   Min.   : 1.00   Min.   : 0.000  
##  1st Qu.: 3.297   1st Qu.:12.00   1st Qu.: 5.00   1st Qu.: 0.000  
##  Median : 4.600   Median :12.00   Median :14.00   Median : 2.000  
##  Mean   : 5.759   Mean   :12.48   Mean   :17.40   Mean   : 4.975  
##  3rd Qu.: 6.820   3rd Qu.:14.00   3rd Qu.:27.25   3rd Qu.: 7.000  
##  Max.   :22.200   Max.   :18.00   Max.   :51.00   Max.   :33.000  
##      nonwhite      female          married        numdep           smsa       
##  Nonwhite: 44   Female:197   Married   :254   Min.   :0.000   Min.   :0.0000  
##  White   :364   Male  :211   Notmarried:154   1st Qu.:0.000   1st Qu.:0.0000  
##                                               Median :1.000   Median :1.0000  
##                                               Mean   :1.054   Mean   :0.6936  
##                                               3rd Qu.:2.000   3rd Qu.:1.0000  
##                                               Max.   :6.000   Max.   :1.0000  
##     northcen          south             west           construc      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.3235   Mean   :0.4583   Mean   :0.2181   Mean   :0.04657  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##     ndurman          trcommpu           trade           services     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.1225   Mean   :0.04412   Mean   :0.2917   Mean   :0.1005  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##     profserv         profocc          clerocc          servocc      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.2451   Mean   :0.3529   Mean   :0.1691   Mean   :0.1397  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      lwage            expersq          tenursq           region         
##  Min.   :-0.6349   Min.   :   1.0   Min.   :   0.00   Length:408        
##  1st Qu.: 1.1932   1st Qu.:  25.0   1st Qu.:   0.00   Class :character  
##  Median : 1.5260   Median : 196.0   Median :   4.00   Mode  :character  
##  Mean   : 1.6075   Mean   : 486.1   Mean   :  71.64                     
##  3rd Qu.: 1.9198   3rd Qu.: 742.8   3rd Qu.:  49.00                     
##  Max.   : 3.1001   Max.   :2601.0   Max.   :1089.00

2. General Overview

Basic boxplot of wages distribution

boxplot(wage1_region$wage, main = "General overiew on hourly earnings", ylab = "avarage hourly earnings")

median(wage1_region$wage)
## [1] 4.6

A general overview shows the wages of the cleaned dataset which median is equal to 4.6. As we can see, there are also many outliers of people that earns much more than avarage.

Frequency table of the wages

attach(wage1_region)

interval <- (max(wage)-min(wage))/10
wage_limits <- cut(wage, seq(min(wage), max(wage), by = interval))
wage_tab <- table(wage_limits)

wage_tab <- as.data.frame(wage_tab)
kable(wage_tab, format = "html", col.names=c("Wages range", "Frequency"))
Wages range Frequency
(0.53,2.7] 19
(2.7,4.86] 198
(4.86,7.03] 94
(7.03,9.2] 43
(9.2,11.4] 24
(11.4,13.5] 12
(13.5,15.7] 7
(15.7,17.9] 3
(17.9,20] 5
(20,22.2] 2

Tabular Accurancy Index

wage_tai <- classIntervals(wage, n = 10, style="fixed", fixedBreaks=seq(min(wage), max(wage), by = interval) )
tai <- jenks.tests(wage_tai)
kable(tai, format = "html")
x
# classes 10.0000000
Goodness of fit 0.9712185
Tabular accuracy 0.8015681

3. Detailed wages distribution

Wage & Marriage & Region

Let’s investigate the height of wage depending of the fact that someone is married

Boxplot of relationship between wages and the fact of beeing married

ggplot(wage1_region, aes(married, wage)) + geom_boxplot()

A results show that people that are married earns on average more than not married ones.

Addtional factor - region

ggplot(wage1_region, aes(married, wage, fill=region)) + geom_boxplot()

When a region is taken into account, it turns out that the tendency of the married people earning more is kept in every region. Moreover, people from west earns the most, when the ones from south the least.

Density plots

Here we take into account also a relationship between wage and sex. Plots are also divided for each region.

On the first plot, there is a % value, whie on the second histogram, there are discete values

qplot(wage, data = wage1_region, geom = "density", color = female) + theme_bw() + facet_wrap(~region)

qplot(wage, data = wage1_region, geom = "histogram", color = female) + theme_bw() + facet_wrap(~region)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on sensity plots, it can be assumed that the distribution of the wages is right skewed. Moreover, he intensity of the extremes is greater than that of the normal distribution, so there are thicker tails.

Another assumption is that, no matter the region, men earns significantly more.

4. Summarized measures

our_summary1 <-
  list("Central tednecy measures" =
       list("mean"           = ~ mean(wage),
            "median"       = ~ median(wage),
            "variance"        = ~ var(wage),
            "sd" =               ~ sd(wage),
            "IQR" =            ~ IQR(wage)),
       "Shape measures" =
       list("Skewness"       = ~ skewness(wage),
            "Kurthosis"    = ~ kurtosis(wage)),
       "Extreme values" =
       list("min"       = ~ min(wage),
            "max"       = ~ max(wage))
       )
summary_table(group_by(wage1_region,region), our_summary1) %>% kbl(format = "html") %>% pack_rows("Central tednecy measures",1, 5) %>% pack_rows("Shape measures",6, 7) %>% pack_rows("Extreme values",8, 9)
northcen (N = 132) south (N = 187) west (N = 89)
Central tednecy measures
mean 5.710455 5.386898 6.613371
median 4.685000 4.500000 5.300000
variance 11.055488 9.602000 17.784848
sd 3.324979 3.098709 4.217208
IQR 3.542500 3.135000 4.670000
Shape measures
Skewness 1.909692 1.974116 1.488316
Kurthosis 7.856008 7.763816 5.082929
Extreme values
min 1.500000 1.500000 0.530000
max 21.860001 20.000000 22.200001

5. Proportion Interval Estimaion

We are calclutaing the proportion of married men that lives in west and all married men.

Manual Solution

n <- nrow(married_male) 
k <- nrow(married_west_male) 
p_hat <- k/n
p_hat
## [1] 0.1946309
se_prop <- sqrt(p_hat*(1-p_hat)/n)
se_prop
## [1] 0.03243472
E = qnorm(.975)*se_prop


p_hat + c(-E, E)
## [1] 0.1310600 0.2582018
standard_error <- sd(married_male$wage)/sqrt(n)
standard_error
## [1] 0.3375064

The proportion (p hat) of married men that lives in west and all married men is equal to 0.19

The estimated confidence interval of the proportion of married men that lives in west and all married men is between 0.13 and 0.25.

Test check

p_hat_interval <- prop.test(k,n)
p_hat_interval
## 
##  1-sample proportions test with continuity correction
## 
## data:  k out of n, null probability 0.5
## X-squared = 54.362, df = 1, p-value = 1.667e-13
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.1362006 0.2692053
## sample estimates:
##         p 
## 0.1946309

A test confirms the results of calculations

6. Mean Estimation

We are estiamting the mean of the population based on the sample of married men.

Manual Solution

attach(married_male)
## Następujące obiekty zostały zakryte z wage1_region:
## 
##     clerocc, construc, educ, exper, expersq, female, lwage, married,
##     ndurman, nonwhite, northcen, numdep, profocc, profserv, region,
##     services, servocc, smsa, south, tenure, tenursq, trade, trcommpu,
##     wage, west
n <- length(wage)
xbar <- mean(wage)
s <- sd(wage)


margin <- qt(0.975,df=n-1)*s/sqrt(n)


low <- xbar - margin
high <- xbar + margin


low
## [1] 6.919556
high
## [1] 8.253464

Based on calculations, a mean is estimated (with 95% confidence) to lie in the interval between 6,92 and 8,25

Test check

t_test <- t.test(wage)
t_test
## 
##  One Sample t-test
## 
## data:  wage
## t = 22.478, df = 148, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  6.919556 8.253464
## sample estimates:
## mean of x 
##   7.58651

A test confirms the results of calculations

Plots

summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column    
    datac <- rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}



x1 <- summarySE(married_male, measurevar="wage", groupvars=c("region"))
x1
##     region  N     wage       sd        se        ci
## 1 northcen 44 7.687955 4.108166 0.6193294 1.2489967
## 2    south 76 6.615132 3.405525 0.3906406 0.7781959
## 3     west 29 9.978276 4.922120 0.9140147 1.8722743
# Error bars represent standard error of the mean
ggplot(x1, aes(x=region, y=wage, fill = region)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=wage-se, ymax=wage+se),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))

# Use 95% confidence intervals instead of SEM
ggplot(x1, aes(x=region, y=wage, fill = region)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=wage-ci, ymax=wage+ci),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))