Libraries


library(Rmisc)
library(dplyr)
library(knitr)
library(ggplot2)
library(plotrix)


About data


In this research I use data from CBOS. It comes from a monthly study called “Aktualne problemy i wydarzenia” (“Current problems and events”). This is the 276th edition, published on 2021-04-18, and it is based on questionnaires conducted in Warsaw. In total, 1101 participants took part in the survey.

It describes the attitudes of people from Warsaw towards the economic and political situation, viewed from their individual, subjective perspective.

In total there were 162 questions asked. I decided to focus on one of them.

More information about the studies can be found on the website:
https://rds.icm.edu.pl/dataset.xhtml?persistentId=doi:10.18150/0DD0HX&version=1.1

I started by loading the data table from the CBOS website and selecting the columns I needed: year of birth (URODZONY), gender (PLEC), and the question of interest (q27). I recoded gender from its numeric representation (1, 2) to a nominal one (K, M), where K stands for Kobiety (Female) and M stands for Mezczyzni (Male).

problemy_wydarzenia <- read.delim(url("https://rds.icm.edu.pl/api/access/datafile/542"), row.names=1)


standard_zycia <- problemy_wydarzenia[,c("URODZONY", "PLEC", "q27")]

colnames(standard_zycia) = c("urodzony", "plec", "sytuacja_za_rok")



# Recode gender: 1 becomes "M" (Mezczyzni), any other code becomes "K" (Kobiety)
zmien_oznaczenie_plec <- function(plec) 
{
  # ifelse() is vectorized, so no explicit loop over the rows is needed
  return(ifelse(plec == 1, "M", "K"))
}

standard_zycia$plec <- zmien_oznaczenie_plec(standard_zycia$plec)


Research problem


Now is finally a good moment to explain which question I chose.

In Polish it was asked as follows:
A jak Pan(i) sądzi, czy w ciągu najbliższego roku sytuacja w Polsce poprawi się, pogorszy czy też się nie zmieni?

Which means:
Do you think that over the next year the situation in Poland will improve, get worse, or stay the same?

Possible answers were as follows:

  • 1 - Zdecydowanie poprawi się - It will definitely improve
  • 2 - Raczej poprawi się - It will probably improve
  • 3 - Nie zmieni się - It will not change
  • 4 - Raczej pogorszy się - It will probably get worse
  • 5 - Zdecydowanie pogorszy się - It will definitely get worse
  • 7 - Nie udzielono odpowiedzi - No answer was given

Substantive answers to this question are recorded as values from 1 to 5 (7 marks a missing answer): the higher the value, the more pessimistic the respondent's forecast.
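Because code 7 is not a point on the 1 to 5 scale, it has to be dropped before any mean is computed. The report's own filtering code is not shown, so the following is only a minimal sketch on toy data; the data frame `toy` and the name `toy_answered` are hypothetical.

```r
# Toy answers, NOT CBOS data: codes 1-5 are substantive, 7 means "no answer"
toy <- data.frame(
  plec = c("K", "M", "K", "M", "K"),
  sytuacja_za_rok = c(3, 7, 4, 2, 7)
)

# Keep only the rows with a substantive answer (1 to 5)
toy_answered <- toy[toy$sytuacja_za_rok %in% 1:5, ]

nrow(toy_answered)                  # number of respondents kept
mean(toy_answered$sytuacja_za_rok)  # mean computed without the code 7 rows
```

Leaving the 7s in would pull the mean upwards and make the respondents look more pessimistic than they are.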

I decided to estimate the mean value of these answers separately for women and men, as well as for people in different age categories.

That is why I created an extra column with the age of respondents (wiek) and, based on it, divided people into six age categories (kategoria_wiekowa). I chose the category boundaries so that the categories are roughly uniformly populated, that is, a similar number of people falls into each of them, as can be seen on the histogram below. I did this because I wanted confidence intervals of similar width for all categories: as we know from the lectures, sample size affects the width of a confidence interval.
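The two derived columns could be built roughly as follows. This is a sketch under assumptions, not the report's actual code: it assumes that urodzony holds a four-digit year of birth, that age is counted relative to the survey year 2021, and that the last category starts at 71; the data frame `toy` is hypothetical.

```r
survey_year <- 2021  # assumption: age is measured in the year of the survey

# Hypothetical years of birth, one per age category
toy <- data.frame(urodzony = c(2000, 1985, 1975, 1965, 1955, 1940))
toy$wiek <- survey_year - toy$urodzony

# cut() assigns each age to a right-closed interval: (17,30], (30,40], ..., (70,Inf]
toy$kategoria_wiekowa <- cut(
  toy$wiek,
  breaks = c(17, 30, 40, 50, 60, 70, Inf),
  labels = c("18-30", "31-40", "41-50", "51-60", "61-70", "70+")
)

# Group sizes; on the real data these counts show how evenly the categories are filled
table(toy$kategoria_wiekowa)
```

Moving the values in `breaks` is the way to rebalance the categories if one of them turns out much smaller than the others.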


Table with sample statistics


Next I calculated statistics for 12 groups (2 sexes × 6 age categories = 12). They include the sample size, sample mean, standard deviation, standard error, and the margin of error for the default 95% confidence level (ci, shown in the table as Margin of error).

Sample Statistics

Sex  Age    Size  Mean    SD      SE      Margin of error (95%)
K    18-30  105   3.2952  0.6640  0.0648  0.1285
K    31-40  100   3.3900  0.6340  0.0634  0.1258
K    41-50   90   3.4222  0.7028  0.0741  0.1472
K    51-60   91   3.5055  0.7207  0.0756  0.1501
K    61-70   92   3.3152  0.7252  0.0756  0.1502
K    70+     80   3.3250  0.8385  0.0938  0.1866
M    18-30   98   3.2959  0.6767  0.0684  0.1357
M    31-40   71   3.4507  0.7707  0.0915  0.1824
M    41-50   76   3.5000  0.8246  0.0946  0.1884
M    51-60  108   3.5185  0.6901  0.0664  0.1316
M    61-70   72   3.6111  0.7230  0.0852  0.1699
M    70+     38   3.3684  0.8194  0.1329  0.2693
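The columns of this table are related by simple formulas: the standard error is sd / sqrt(n), and the margin of error is the standard error multiplied by a critical value. The sketch below assumes a t-based critical value, which appears consistent with the numbers in the table; the helper `group_stats` is hypothetical and is not the code used to produce the table.

```r
# Summary statistics for one group of answers, with a t-based margin of error
group_stats <- function(x, conf_level = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  c(size = n, mean = mean(x), sd = sd(x), se = se, margin = t_crit * se)
}

# Cross-check against the first table row (K, 18-30): sd = 0.6640, n = 105
se_check     <- 0.6640 / sqrt(105)
margin_check <- qt(0.975, df = 104) * se_check
round(c(se = se_check, margin = margin_check), 4)
```

Applied to a vector of answers from one sex and age group, `group_stats` returns one row of the table; the cross-check should agree with the se and Margin of error columns of the first row up to rounding.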


Calculating sufficient sample size


Now let's see whether the groups I have chosen are large enough to estimate the mean with a given margin of error and confidence level. I decided that I would like a margin of error no greater than 10% of the population mean at a confidence level of 95%. Because I don't know the whole population, I assumed that the population parameters can be estimated from all the data I have, that is, from the 1021 respondents who answered question q27.

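sufficient_sample_size <- function(conf_level, population_sd, margin_of_error) 
{
  # two-sided critical value of the standard normal distribution
  z <- qnorm(conf_level + (1 - conf_level) / 2)
  # use the margin_of_error argument, not the global variable margin_of_err
  return((z * population_sd / margin_of_error) ^ 2)
}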

mean_za_rok <- mean(s_z_za_rok_bez_NULL$sytuacja_za_rok)
margin_of_err <- 0.10 * mean_za_rok

sd_za_rok <- sd(s_z_za_rok_bez_NULL$sytuacja_za_rok)


suff_samp_size <- sufficient_sample_size(0.95, sd_za_rok, margin_of_err)
suff_samp_size
## [1] 17.44202


The calculation shows that the sufficient sample size for a 10% margin of error at a 95% confidence level is about 17.4, that is, at least 18 respondents. Because our smallest group has 38 respondents, more than twice as many, we can certainly reduce our margin of error. I evaluate it below.
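For reference, both calculations in this section rest on the same normal-approximation relationship. The symbols are introduced here only for illustration: \(\sigma\) stands for the standard deviation (sd_za_rok), \(E\) for the absolute margin of error, \(n\) for the sample size, and \(z_{1-\alpha/2}\) for the critical value (qnorm(0.975) at the 95% level):

\[
n = \left( \frac{z_{1-\alpha/2}\,\sigma}{E} \right)^{2}
\quad\Longleftrightarrow\quad
E = \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}}
\]

The left-hand form gives the sufficient sample size for a chosen margin of error; the right-hand form, evaluated at \(n = 38\), gives the margin of error available for the smallest group.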

better_margin_of_err <- qnorm(0.975) * sd_za_rok / sqrt(38)

relative_error <- better_margin_of_err / mean_za_rok * 100
relative_error
## [1] 6.774958


So now we know that for the smallest group, which consists of 38 respondents, the margin of error at the 95% confidence level is about 6.77% of the mean. The other groups are larger, so their margins of error will be even smaller. But what does this result tell us? It means that for samples of size 38, about 95% of them will estimate the mean with a deviation of no more than 6.77% from the population mean.

Plot with sample means


Next I constructed a plot that shows all 12 of these means with their confidence intervals. It can be seen below.
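A plot of this kind can be rebuilt from the Sample Statistics table alone. The sketch below uses base R graphics rather than the ggplot2 or plotrix packages loaded at the top, so it is not the code that produced the original figure; the colours and the horizontal offsets are arbitrary choices.

```r
# Group means and 95% margins of error copied from the Sample Statistics table
stats <- data.frame(
  sex    = rep(c("K", "M"), each = 6),
  age    = rep(c("18-30", "31-40", "41-50", "51-60", "61-70", "70+"), times = 2),
  mean   = c(3.2952, 3.3900, 3.4222, 3.5055, 3.3152, 3.3250,
             3.2959, 3.4507, 3.5000, 3.5185, 3.6111, 3.3684),
  margin = c(0.1285, 0.1258, 0.1472, 0.1501, 0.1502, 0.1866,
             0.1357, 0.1824, 0.1884, 0.1316, 0.1699, 0.2693)
)
stats$lower <- stats$mean - stats$margin
stats$upper <- stats$mean + stats$margin

# Shift the two sexes slightly apart so their intervals are not drawn on top of each other
x_pos <- rep(1:6, times = 2) + ifelse(stats$sex == "K", -0.1, 0.1)
cols  <- ifelse(stats$sex == "K", "darkorange", "steelblue")

plot(x_pos, stats$mean, pch = 19, col = cols, xaxt = "n",
     xlim = c(0.5, 6.5), ylim = range(stats$lower, stats$upper),
     xlab = "Age category", ylab = "Mean answer (1-5)")
axis(1, at = 1:6, labels = stats$age[1:6])
# arrows() with angle = 90 and code = 3 draws error bars with flat caps at both ends
arrows(x_pos, stats$lower, x_pos, stats$upper,
       angle = 90, code = 3, length = 0.05, col = cols)
legend("topleft", legend = c("K", "M"),
       col = c("darkorange", "steelblue"), pch = 19)
```

Placing both sexes side by side within each age category makes it easy to see where their intervals overlap.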


Conclusions


Overall, it seems that men have a more pessimistic view when it comes to forecasting the future of Poland. Another thing to notice is that the youngest and oldest respondents tend to see reality in a more favorable light. For me, the results of this study are another argument that the common perception of young people as idealists is not accidental. As I am not old yet, why older people tend to perceive reality more favorably will remain an open question for now.


Does the sample confidence interval contain the population mean?


At the end I decided to draw 50 samples, show them on a graph, and mark in green those whose confidence intervals do not contain the population mean. I remember that I didn't understand this during the lecture, which is why I wanted to do the task once again. I checked the “Report #2 - Estimation for the Mean and differences” file and realized that the uiw parameter in line 171 wasn't divided by the square root of the sample size. That is why, on our plots, all sample means with their confidence intervals contained the population mean. Below I did a similar analysis, but on the data of interest to me.
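The experiment can be sketched as follows. Everything here is illustrative: the population is synthetic (NOT the CBOS answers), the probabilities, the seed and the sample size of 38 are arbitrary assumptions, and base graphics stand in for whatever plotting function the lecture file used. The key detail is that the half-width of each interval divides the standard deviation by the square root of the sample size.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Synthetic population of answers on the 1-5 scale (NOT the CBOS data)
population <- sample(1:5, size = 1021, replace = TRUE,
                     prob = c(0.02, 0.08, 0.45, 0.35, 0.10))
pop_mean <- mean(population)

n_samples   <- 50
sample_size <- 38

# For each sample: its mean and the half-width of its 95% confidence interval
sim <- t(replicate(n_samples, {
  x <- sample(population, sample_size)
  half_width <- qt(0.975, df = sample_size - 1) * sd(x) / sqrt(sample_size)
  c(mean = mean(x), half_width = half_width)
}))
lower <- sim[, "mean"] - sim[, "half_width"]
upper <- sim[, "mean"] + sim[, "half_width"]

contains_mean <- lower <= pop_mean & pop_mean <= upper
sum(!contains_mean)  # how many of the 50 intervals miss the population mean

# Intervals that miss the population mean are drawn in green
cols <- ifelse(contains_mean, "grey40", "green3")
plot(seq_len(n_samples), sim[, "mean"], pch = 19, col = cols,
     ylim = range(lower, upper), xlab = "Sample", ylab = "Sample mean")
segments(seq_len(n_samples), lower, seq_len(n_samples), upper, col = cols)
abline(h = pop_mean, lty = 2)  # dashed line marks the population mean
```

With a 95% confidence level one would expect only a few of the 50 intervals to miss the dashed line. If the division by sqrt(sample_size) is left out, the intervals become so wide that all of them contain the population mean, which is the effect described above.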