Introducing data

Using IPUMS CPS, I extracted data on almost 125,000 people: their basic characteristics and income information. In this report I examine the yearly incomes of people working as IT specialists. To obtain this population, I filtered the dataset by the numerical codes of the occupation performed last year. To get some insight into the extracted data, let’s start with some descriptive statistics.
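The filtering step described above could look roughly like the sketch below. The toy data frame `cps`, the column names `OCCLY` and `INCWAGE`, and the occupation codes in `it_codes` are placeholders chosen for illustration; the real extract and the actual set of IT codes are not shown in this report.

```r
# Toy stand-in for the CPS extract (hypothetical values, not real data)
cps <- data.frame(
  OCCLY   = c(1010, 4700, 1020, 2310, 1006),
  INCWAGE = c(85000, 28000, 92000, 51000, 110000)
)

# Hypothetical set of occupation codes treated as "IT" in this sketch
it_codes <- c(1006, 1010, 1020)

# Keep only the rows whose last-year occupation code is in the IT set
IT_population <- cps[cps$OCCLY %in% it_codes, ]

# Vector of incomes for the filtered population
population_incwage <- IT_population$INCWAGE

nrow(IT_population)
```

Using `%in%` keeps the filter readable when the list of codes is long, and it never returns `NA` for rows with a missing code.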

Data come from: IPUMS CPS, University of Minnesota, www.ipums.org

Descriptive statistics

IT occupations included in the dataset:

It would be good to know whether income in IT is influenced by the specialist’s gender. We can roughly check this with a violin plot.

The plot reveals only a slight difference between genders. However, the incomes of most women are distributed more tightly, clustering around the mean.

In the next step we will compare the number of male and female IT professionals.
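A quick numeric counterpart to such a comparison is a frequency table. The sketch below uses a made-up vector `sex` rather than the real data; the label 'man' is an assumption, since only the label 'woman' appears in the code later in this report.

```r
# Toy stand-in for the SEX column (hypothetical values, not the real data)
sex <- c("man", "woman", "man", "man", "woman", "man")

counts <- table(sex)          # number of observations per category
shares <- prop.table(counts)  # the same counts expressed as proportions

counts
round(shares, 3)
```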

Let’s also examine how income depends on the age and specialization of the professional. The scatterplot also shows how numerous particular groups of specialists are in this population.

Having got to know the main features of the dataset, we can move on to the main part of this analysis.

Point & Interval estimation for the MEAN

I’ll perform an estimation procedure for 50 simple random samples of size n = 50:

First, let’s prepare empty vectors that will store the means and standard deviations of all samples:

sample_incwage_mean <- rep(NA, n)
sample_incwage_sd <- rep(NA, n)

In a loop I’ll calculate means and standard deviations of all samples:

for (i in 1:50) {
  
 sample_incwage <- sample(population_incwage, n) # obtain a sample of size n = 50 from the population of incwages
 sample_incwage_mean[i] <- mean(sample_incwage)
 sample_incwage_sd[i] <- sd(sample_incwage)

}

Then, calculate boundaries of confidence intervals:

# the interval is centred on each sample mean, not on the last raw sample
lower_incwage <- sample_incwage_mean - 1.96 * (sample_incwage_sd / sqrt(n))
upper_incwage <- sample_incwage_mean + 1.96 * (sample_incwage_sd / sqrt(n))
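For reference, the interval intended for each sample is the usual large-sample interval for a mean. In the notation below (introduced here only for illustration), $\bar{x}_i$ and $s_i$ are the mean and standard deviation of sample $i$, and 1.96 is the standard normal quantile for a confidence level of roughly 95%:

```latex
\bar{x}_i \pm 1.96 \cdot \frac{s_i}{\sqrt{n}}, \qquad n = 50
```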

Finally, I can plot the sample means and the confidence intervals for those means. For comparison, I also marked the population mean on the graph (red line).

The graph suggests that the wider confidence intervals tend to accompany sample means that lie closer to the population mean.
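A visual impression like this can be checked numerically, along with the share of intervals that actually capture the population mean. The sketch below is not run on the CPS data: it uses simulated right-skewed values as a stand-in for `population_incwage`, and all names in it are illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed "incomes" standing in for population_incwage
population <- rlnorm(5000, meanlog = 11, sdlog = 0.6)
pop_mean <- mean(population)

n_samples <- 50  # number of samples drawn
n <- 50          # size of each sample

means <- rep(NA, n_samples)
sds <- rep(NA, n_samples)
for (i in seq_len(n_samples)) {
  s <- sample(population, n)  # simple random sample without replacement
  means[i] <- mean(s)
  sds[i] <- sd(s)
}

lower <- means - 1.96 * sds / sqrt(n)
upper <- means + 1.96 * sds / sqrt(n)

# Share of intervals that contain the population mean
covered <- lower <= pop_mean & pop_mean <= upper
mean(covered)

# Association between distance from the population mean and interval width
cor(abs(means - pop_mean), upper - lower)
```

With a nominal 95% level, the coverage share should land near 0.95, though with only 50 intervals it will vary noticeably from run to run.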

Point & Interval estimation for PROPORTION

In this section I will calculate the proportion of women among IT specialists. First, let’s take a look at the whole population:

sum_women <- sum(IT_population$SEX == 'woman')
sum_all <- nrow(IT_population)
p_women_population <-sum_women/sum_all

The proportion of women in the population of IT specialists is 0.2428991. Later, we will compare this value with the results of the proportion estimation.

Let’s begin this estimation with a sample size of 50 observations.

samp_rn<- sample(1:nrow(IT_population), 50)
samp<- IT_population[samp_rn,]

sum_women_samp <- sum(samp$SEX =='woman')
sum_all_samp <- nrow(samp)
p_women_samp <- sum_women_samp / sum_all_samp

lb_women <- p_women_samp - 1.96*(sqrt((p_women_population*(1-p_women_population))/n))
ub_women <- p_women_samp + 1.96*(sqrt((p_women_population*(1-p_women_population))/n))

For this random sample, the confidence interval is (0.1411331, 0.3788669).
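Written out, the interval computed in the code above has the form below, where $\hat{p}$ is the sample proportion and $p$ is the population proportion (0.2428991). The symbols are introduced here only for illustration. Note that the standard error uses the population value $p$, which is possible only because the whole population is available:

```latex
\hat{p} \pm 1.96 \sqrt{\frac{p(1-p)}{n}},
\qquad
1.96 \sqrt{\frac{0.2428991 \cdot (1 - 0.2428991)}{50}} \approx 0.1189
```

The reported bounds are symmetric around 0.26, which is consistent with this half-width: 0.26 - 0.1189 ≈ 0.1411 and 0.26 + 0.1189 ≈ 0.3789.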

Again, I will perform estimation for many samples. Let “many” be equal to 50:

n <- 50

Here I create empty vectors to save data:

prop_women_samp <- rep(NA, 50)
population_prop_sd <- rep(NA, 50)

Now, in a loop, I calculate the sample proportion for each of 50 random samples, together with the standard deviation based on the population proportion.

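for (i in 1:n) {
  # draw a simple random sample of n rows (n serves as both the number of samples and the sample size)
  samp_rn <- sample(1:nrow(IT_population), n)
  samp <- IT_population[samp_rn, ]

  # population proportion of women (the same in every iteration)
  sum_women <- sum(IT_population$SEX == 'woman')
  sum_all <- nrow(IT_population)
  p_women <- sum_women / sum_all

  # sample proportion of women
  sum_women_samp <- sum(samp$SEX == 'woman')
  sum_all_samp <- nrow(samp)

  prop_women_samp[i] <- sum_women_samp / sum_all_samp
  # standard deviation of a single 0/1 observation, based on the population proportion
  population_prop_sd[i] <- sqrt(p_women * (1 - p_women))
}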

Then, I can construct confidence intervals…

lower_vector <- prop_women_samp - 1.96 * population_prop_sd / sqrt(n) 
upper_vector <- prop_women_samp + 1.96 * population_prop_sd / sqrt(n) 

… and plot them. In this case I also included population proportion (red line) on the graph.

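# plotCI() is not part of base R; it is provided by add-on packages such as gplots or plotrix, one of which must be loaded first
plotCI(1:50, prop_women_samp, uiw = 1.96 * population_prop_sd / sqrt(n), xlab = "Sample No.", ylab = "Sample proportion", main = "Sample proportion & confidence interval for the proportion")
# draw the population proportion on the same axes, so the red line shares the y-scale of the intervals
abline(h = p_women_population, col = "red")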
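The intervals above rely on the population proportion in the standard error. When it is unknown, which is the usual situation, a common alternative is to estimate the standard error from each sample's own proportion (the Wald interval). The sketch below illustrates that variant on simulated data; the vector `is_woman` and the other names are illustrative stand-ins, not the CPS data.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Simulated 0/1 indicator standing in for IT_population$SEX == 'woman'
is_woman <- rbinom(4000, size = 1, prob = 0.24)
p_pop <- mean(is_woman)

n_samples <- 50  # number of samples drawn
n <- 50          # size of each sample

p_hat <- rep(NA, n_samples)
for (i in seq_len(n_samples)) {
  p_hat[i] <- mean(sample(is_woman, n))  # sample proportion of women
}

# Standard error estimated from the sample itself
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
lower <- p_hat - 1.96 * se_hat
upper <- p_hat + 1.96 * se_hat

# Share of intervals that contain the population proportion
mean(lower <= p_pop & p_pop <= upper)
```

Unlike the intervals above, these have different widths from sample to sample, because each one uses its own estimated standard error.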