RTeam&DA

Area of responsibility of each team member:

Artyushin Alexey:

Variable: Occupation
Plots: Bar plot (Occupation)
Descriptions: Bar plot (Occupation) description, Scatter plot description

Bykova Nadezhda:

Variable: Difference in contract/real work hours
Plots: Histogram, Density plot
Descriptions: Box plot description, Histogram (Difference in contract/real work hours) description, Histogram (People responsible for in job) description
Additional: Tables

Vlasenko Anastasia:

Variable: People responsible for in job
Plots: Scatter plot, Box plot, Stacked bar plot, Histogram (People responsible for in job)
Descriptions: -
Additional: Finalizing project file

Kulikov Artyom:

Variable: Establishment size
Plots: Bar plot (Establishment size)
Descriptions: Bar plot (Establishment size) description

Hello! We are thrilled to present our findings on socio-demographics in Germany. In this project our group focused on the work and occupations of people in Germany. We examined the following variables: the difference between contract and real work hours, the size of the establishment, the number of people a person is responsible for, and the type of occupation.

library(haven)
library(dplyr)
library(plotly)
library(ggplot2)
library(foreign)
library(gridExtra)
library(knitr)
library(rmarkdown)
#detach("package:plyr", unload=TRUE)

##Uploading and reading the data base
ESS <- read_spss("ESS8DE.sav")

## We take three columns from the database: hours of work according to the contract ("wkhct"), real working hours ("wkhtot") and the number of people a person is responsible for in their job ("njbspv")
ESS1 <- select(ESS, "wkhct", "wkhtot", "njbspv")

## We delete rows where total working hours are zero or carry a missing-value code (666, 777, 888, 999).
## We believe that a difference between contract and real work hours doesn't make sense for these people (because they don't work)
ESS2 <- ESS1[!(ESS1$wkhtot==0 | ESS1$wkhtot==666 | ESS1$wkhtot==777 | 
                 ESS1$wkhtot==888| ESS1$wkhtot==999),]

## The same filter for contract hours. It is applied to ESS2 (not ESS1), so the logical index and the data have the same number of rows
ESS3 <- ESS2[!(ESS2$wkhct==0 | ESS2$wkhct==666 | ESS2$wkhct==777 | 
                 ESS2$wkhct==888| ESS2$wkhct==999),]
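The filtering step can also be written so that both conditions are always built from the same data frame, which avoids any mismatch between the length of the index and the number of rows. The following is only a minimal sketch on invented toy data; the helper `is_valid_hours` and the list `invalid_codes` are hypothetical names introduced for illustration, and the codes themselves are simply copied from the filter above.

```r
## Toy data standing in for the two ESS columns (values are invented)
toy <- data.frame(
  wkhct  = c(40, 0, 38, 666, 20, 888, 35),
  wkhtot = c(45, 30, 0, 40, 777, 25, 35)
)

## Codes treated as "no usable answer"; this list is an assumption taken from the filter above
invalid_codes <- c(0, 666, 777, 888, 999)

## Hypothetical helper: TRUE where a value is a usable number of hours
is_valid_hours <- function(x) !is.na(x) & !(x %in% invalid_codes)

## Both conditions come from the same data frame, so the index has one entry per row
keep <- is_valid_hours(toy$wkhct) & is_valid_hours(toy$wkhtot)
toy_clean <- toy[keep, ]
toy_clean
```

Using `%in%` together with `!is.na()` also means that rows with missing values are dropped explicitly instead of turning into rows of NAs.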

ESS2$wkhct = as.character(ESS2$wkhct)
ESS2$wkhct = as.numeric(ESS2$wkhct)
ESS2$wkhtot = as.character(ESS2$wkhtot)
ESS2$wkhtot = as.numeric(ESS2$wkhtot)
ESS2 = mutate(ESS2, diff = wkhtot - wkhct)
ggplot() +
  geom_point(data = ESS2, aes(x = njbspv, y = diff), col="cornflowerblue") + xlim(-50, 900) + 
  xlab("Number of people responsible for in job") +
  ylab("Difference in working hours") +
  ggtitle("Connection between working hours and number of people responsible for")

## Warning: Removed 1937 rows containing missing values (geom_point).

The scatter plot shows that the sample is heavily skewed: almost all respondents either have no subordinates or are responsible for up to about 100 people, while very few are responsible for a very large number of people (250-700). Most respondents are responsible for between 0 and 50 people, and for them the difference in hours is more often positive than negative. The plot also shows that respondents responsible for 250 to 500 people never have a negative difference between their actual hours and the hours specified in their contract. Because the sample is so uneven, we cannot detect a correlation.
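One way to look at the association numerically in such a skewed sample is a rank-based correlation, which is less sensitive to a few respondents with hundreds of subordinates. This is only a sketch on invented toy data, not a result from the ESS file; the column names mirror the ones used above.

```r
## Toy data (invented): most people supervise few others, one supervises very many
toy <- data.frame(
  njbspv = c(0, 0, 1, 2, 3, 5, 8, 12, 40, 300, NA),
  diff   = c(0, 2, 1, 3, 0, 5, 4, 6, 8, 10, 3)
)

## Spearman's rho is computed on ranks, so a single extreme value
## cannot dominate the result the way it can with Pearson's r
rho <- cor(toy$njbspv, toy$diff, method = "spearman", use = "complete.obs")
r   <- cor(toy$njbspv, toy$diff, method = "pearson",  use = "complete.obs")
round(c(spearman = rho, pearson = r), 2)
```

With the real data, `cor.test()` with `method = "spearman"` would additionally give a p-value, although the many tied values would need to be kept in mind when reading it.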

## We create vectors and subtract them. 
ContractWorkHours <- c(as.numeric(as.character(ESS2$wkhct)))
RealWorkHours <- c(as.numeric(as.character(ESS2$wkhtot)))
DifferenceInWorkHours <- RealWorkHours - ContractWorkHours


## Deleting NAs from our vector
DifferenceInWorkHours <- DifferenceInWorkHours[!is.na(DifferenceInWorkHours)]

## Creating a histogram and a density plot for our variable
par(mfrow=c(1,2)) 

## Creating a function for the mode (note that this masks base::mode)
mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

## Combining them on one plot
hist(DifferenceInWorkHours,
     main = "Histogram for difference in work hours",
     xlab = "Difference in hours", ylab = "Number of people reporting such",
     border = "navyblue", col="cornflowerblue",
     xlim= c(-60, 60),
     ylim= c(0, 1200))
breaks = seq(-60, 60)

## Adding lines for mode, mean and median
abline(v = mean(DifferenceInWorkHours), col = "red", lwd = 2)
abline(v = median(DifferenceInWorkHours), col = "black",lwd = 2)
abline(v = mode(DifferenceInWorkHours), col = "green", lwd = 2)
legend(x = 12, y = 1200, 
       c("Mean", "Median", "Mode"),
       col = c("red", "black", "green"),
       lwd = c(2, 2, 2),
       cex = 0.65)

legend(x = -58, y = 1200,
       c(round(mean(DifferenceInWorkHours),3), median(DifferenceInWorkHours),mode(DifferenceInWorkHours)),
       col = c("red", "black", "green"),
       lwd = c(2, 2, 2),
       cex = 0.65)


mode.result = mode(DifferenceInWorkHours) 

## Creating a Density plot 
Dens <- density(DifferenceInWorkHours)
plot(Dens, 
     main = "Density plot for difference in work hours", col="cornflowerblue", 
     xlim= c(-60, 60))

breaks = seq(-60, 60)

## To obtain the actual range we subtract the min value from the max value 
rangeVect <- range(DifferenceInWorkHours)
rangenum <- rangeVect[2] - rangeVect[1]


ESS2 = na.omit(ESS2)
rangeV2 <- range(ESS2$njbspv)
rangenum2 <- rangeV2[2] - rangeV2[1]

Are there many people who overwork in Germany? The histogram shows quite clearly that the largest group of people work exactly the number of hours stated in their contracts (mode = 0). The number of people who work more than they should is quite high for differences of 10 to 20 hours, but it drops once the difference exceeds 20 hours. Far fewer people work less than their contracts state than work more. However, among people whose hours differ only slightly from their contract, more work slightly less than slightly more. So the conclusion is: most people in Germany work exactly as much as they have to, but when it comes to slight differences in working hours, more people work less.
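The comparison between people who work less, exactly as much, or more than their contract states can also be expressed as shares rather than read off the histogram. The sketch below uses an invented toy vector in place of DifferenceInWorkHours, so the numbers are purely illustrative.

```r
## Toy differences (real minus contract hours); values are invented
toy_diff <- c(0, 0, 0, 0, 5, 10, 15, -2, -3, 25, 0, -1)

## sign() returns -1, 0 or 1, which maps directly onto the three groups
group <- factor(sign(toy_diff), levels = c(-1, 0, 1),
                labels = c("Works less", "As in contract", "Works more"))

## Share of each group, in percent
shares <- round(100 * prop.table(table(group)), 1)
shares
```

Applied to the real vector, the same three lines would show directly whether "exactly as in the contract" is a majority or only the single most common value.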

ESS <- read.spss("ESS8DE.sav", use.value.labels=T, to.data.frame=T)

##Establishment size - ordinal variable - the number of people employed at the place where the respondent usually works

##Here is the barplot and summary showing numbers of different establishment sizes
barplot(table(ESS$estsz), border = "navyblue", col="cornflowerblue")

summary(ESS$estsz)

##    Under 10    10 to 24    25 to 99  100 to 499 500 or more        NA's 
##         706         409         582         498         482         175

##We construct a new variable for establishment size, remove NAs and transform it into a character vector
estsz1 = ESS$estsz
useless <- is.na(estsz1)
estsz2 <- estsz1[!useless]
estsz3 <- c(as.character(estsz2))
class(estsz3)

## [1] "character"

##Here is the median of the variable
median(estsz3, na.rm = TRUE)

## [1] "25 to 99"
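A word of caution about this step: estsz3 is a character vector, so, as far as we can tell, median() picks the middle value after sorting the labels alphabetically rather than by size. A more robust approach for an ordinal variable is to make the order explicit with an ordered factor. The sketch below uses invented toy answers; the level order is taken from the summary output above, and the variable names are hypothetical.

```r
## Toy answers (invented), stored as text in the same way as estsz3
sizes <- c("Under 10", "10 to 24", "Under 10", "500 or more",
           "25 to 99", "100 to 499", "Under 10")

## The category order, from smallest to largest establishment
size_levels <- c("Under 10", "10 to 24", "25 to 99", "100 to 499", "500 or more")
sizes_ord <- factor(sizes, levels = size_levels, ordered = TRUE)

## Median of the integer codes; type = 1 always returns an observed value
## (never an average of two categories), then the code is mapped back to its label
median_code <- quantile(as.integer(sizes_ord), probs = 0.5, type = 1, names = FALSE)
ordinal_median <- size_levels[median_code]
ordinal_median
```

In this toy example the ordinal median is the second-smallest category, whereas alphabetical sorting would put a different label in the middle.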

##We construct a function to find mode and discover the mode for establishment size
Mode <- function(estsz3) {
  ux <- unique(estsz3)
  ux[which.max(tabulate(match(estsz3, ux)))]
}
Mode.result2=Mode(estsz3)
Mode.result2

## [1] "Under 10"

people = select(ESS, "njbspv")
people = people[!is.na(people)]
people = as.numeric(people)
people = as.data.frame(people)
ggplot() +
  geom_histogram(data = people, aes(x = people), col="cornflowerblue", fill="cornflowerblue") +
  xlim(0, 100) +
  xlab("Number of people responsible for") +
  ylab("Frequency") +
  ggtitle("Distribution of the number of people an employee is responsible for")

people = na.omit(people)

Are there many more people who have almost no one under their responsibility?

As the histogram clearly shows, most respondents reported being responsible for only a few people. Values close to zero are the most common responses, and there is a sharp decrease once the number exceeds approx. 4-5 people. Strangely enough, quite a lot of respondents (approx. 140) reported having 5 to 12 employees in their area of responsibility, compared with only 120 respondents with 4-5 people. We would expect the number of responses to fall as the number of people under one's responsibility grows; however, we see some fluctuations: for instance, almost as many respondents are responsible for nearly 50 people as for only 33. To sum up, people with almost no responsibility over others remain the dominant group, but there is no clear downward trend as the number of people under one's responsibility grows.
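Counts like the ones quoted above are easier to check from a table of binned values than from the plot itself. The following sketch bins an invented toy vector with cut(); the bin edges are an assumption chosen to mirror the ranges discussed in the text, not the bins ggplot2 used.

```r
## Toy counts of subordinates (invented for illustration)
toy_people <- c(0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 8, 10, 12, 20, 33, 50, 50)

## Bin edges are an assumption; include.lowest = TRUE keeps the zeros in the first bin
bins <- cut(toy_people,
            breaks = c(0, 3, 5, 12, 49, Inf),
            labels = c("0-3", "4-5", "6-12", "13-49", "50+"),
            include.lowest = TRUE)
bin_counts <- table(bins)
bin_counts
```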

RQ: What is the most frequent number of people in a single workplace in Germany? As the bar chart shows, small (under 10) and medium (25 to 99) establishments are the most common in Germany. Large establishments (‘100 to 499’ and ‘500 or more’) are less frequent, possibly because there are not many large corporations that gather a lot of people in a single place. Establishments of 10 to 24 people are the least common in respondents’ answers, which may mean that this size is less convenient for a workplace.

ESS <- read.spss("ESS8DE.sav", use.value.labels=T, to.data.frame=T)
Mode <- function(people) {
  ux <- unique(people$people)
  ux[which.max(tabulate(match(people$people, ux)))]
}
Mode3=Mode(people)
Mode3

## [1] 2

ESS = na.omit(ESS$isco08)

Mode <- function(isco08) {
  ux <- unique(isco08)
  ux[which.max(tabulate(match(isco08, ux)))]
}
Mode4=Mode(ESS)
Mode4

## [1] Shop sales assistants
## 590 Levels: Armed forces occupations ... Elementary workers not elsewhere classified

## Creating the first table to describe variables used in a project 
tt <- ttheme_minimal(
  core=list(bg_params = list(fill = blues9[1:4], col=NA),
            fg_params=list(fontface=3)),
  colhead=list(fg_params=list(col="navyblue", fontface="bold.italic")),
  rowhead=list(fg_params=list(col="navyblue", fontface="bold.italic")))

Table1<- matrix(1:16, byrow = TRUE, nrow = 4)
Table1[,1]<- c("Number of people responsible for in job", "Establishment size", "Difference in contract/real work hours", "Occupation")
Table1[,2]<- c("Quantitative", "Qualitative", "Quantitative", "Qualitative")
Table1[,3]<- c("Ratio scale", "Ordinal scale","Interval scale","Nominal scale")
Table1[,4]<- c("Discrete", "Discrete", "Continuous", "-")
colnames(Table1) <- c("Variable","Qualitative or Quantitative","Level of measurement", "Continuous or discrete")
grid.table(Table1, theme = tt)

ESS <- read.spss("ESS8DE.sav", use.value.labels=T, to.data.frame=T)
data1 <- ESS %>% 
select(isco08, njbspv, estsz, gndr) %>% na.omit() 
## Creating descriptives' table 
Table2<- matrix(1:32, byrow = TRUE, nrow = 4)
Table2[,1]<- c("Number of people responsible for in job", "Establishment size", "Difference in contract/real work hours", "Occupation")
Table2[,2]<- c(Mode3,Mode.result2, mode.result,Mode4)
## The median is not defined for a nominal variable, so Occupation gets NA here
Table2[,3]<- c(median(people$people),median(estsz3, na.rm = TRUE), median(DifferenceInWorkHours, na.rm = TRUE), NA)
Table2[,4]<- c(mean(people$people),NA, mean(DifferenceInWorkHours),NA)
Table2[,5]<- c(rangenum2,NA, rangenum, NA)
Table2[,6]<- c(IQR(people$people),NA,IQR(DifferenceInWorkHours),NA)
Table2[,7]<- c(var(people$people),NA,var(DifferenceInWorkHours), NA)
Table2[,8]<- c(sd(people$people),NA,sd(DifferenceInWorkHours), NA)
colnames(Table2) <- c("Variable","Mode","Median", "Mean", "Range", "IQR", "Variance", "SD")
grid.table(Table2, theme = tt)

grid.arrange(
  tableGrob(Table1, rows = rownames(Table1), cols = colnames(Table1),
            theme = tt, vp = NULL),
  tableGrob(Table2, rows = rownames(Table2), cols = colnames(Table2),
            theme = tt, vp = NULL),
  nrow=2)

ESS <- read.spss("ESS8DE.sav", use.value.labels=T, to.data.frame=T)
data1 <- ESS %>% 
select(isco08, njbspv, estsz, gndr) %>% na.omit() 

box <- ggplot() + 
geom_boxplot(data = data1, aes(x = as.factor(estsz), y = as.numeric(njbspv)), col = "cornflowerblue") + 
xlab("Establishment size") + ylab("Number of people responsible for") + ggtitle("Relation between establishment size and people being in charge")
box

Do bigger companies always mean that a person is responsible for a lot of people? The medians of three groups (10-24, 25-99 and 100-499) are relatively similar, but the boxes themselves look quite different: the box for the 100-499 group is the most skewed of the three, since the part above the median is much larger than the part below it. The box for the 10-24 group, by contrast, suggests that responses were distributed fairly evenly. The median of the fourth box plot is a bit higher than in the previous three groups, and its maximal value is the highest of all groups. Strangely enough, the maximal value reported by people from companies with fewer than 10 employees is also quite high in comparison with other groups. Both of these box plots are skewed, judging by the position of the median within the box. Overall, many people from big companies (25-99, 100-499, 500 or more) reported being responsible for a large number of people, while small companies (fewer than 10) included both people responsible for large numbers of people (probably CEOs) and ordinary workers with only a few people under their responsibility.
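The reading of the box plot can be backed up with the numbers behind each box. The sketch below computes the quartiles and the maximum per group with tapply() on invented toy data; the column names follow data1, but the values are not from the ESS file.

```r
## Toy data (invented): establishment size and number of people supervised
toy <- data.frame(
  estsz  = c("Under 10", "Under 10", "Under 10", "10 to 24", "10 to 24",
             "10 to 24", "500 or more", "500 or more", "500 or more"),
  njbspv = c(1, 2, 30, 3, 5, 7, 4, 10, 60)
)

## One row per group: lower quartile, median, upper quartile and maximum
group_stats <- cbind(
  q1     = tapply(toy$njbspv, toy$estsz, quantile, probs = 0.25, names = FALSE),
  median = tapply(toy$njbspv, toy$estsz, median),
  q3     = tapply(toy$njbspv, toy$estsz, quantile, probs = 0.75, names = FALSE),
  max    = tapply(toy$njbspv, toy$estsz, max)
)
group_stats
```

A box is skewed upwards when q3 minus the median is clearly larger than the median minus q1, which is the comparison made visually in the paragraph above.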

sum_table = data1 %>% 
dplyr::group_by(isco08) %>% 
dplyr::summarise(n = n())%>% 
dplyr::arrange(desc(n)) %>% 
dplyr::top_n(10)

## Selecting by n

data1 = right_join(x = data1, y = sum_table, by = "isco08")

ggplot() +
  geom_bar(data=data1, aes(x=as.factor(isco08), fill=as.factor(gndr))) +
  guides(fill=guide_legend(title="Gender of a worker")) +
  ylab("How many of them") +
  xlab("Occupation") +
  coord_flip()

This graph shows the 12 most popular occupations in our data. The most frequent of them appears more than 40 times, and the least frequent about 15 times. In addition, the color indicates the proportion of men and women in each occupation.
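Note that dplyr::top_n() keeps all rows that tie at the cut-off, so asking for the top 10 can return more than 10 occupations; we assume, without having checked it against the data, that this is why the chart can show more bars than requested. The base R sketch below reproduces this tie-keeping behaviour on invented toy occupations.

```r
## Toy occupations (invented for illustration)
jobs <- c("Teacher", "Teacher", "Teacher", "Nurse", "Nurse", "Nurse",
          "Driver", "Driver", "Cook", "Cook", "Clerk")

counts <- sort(table(jobs), decreasing = TRUE)

## Keep every occupation whose count is at least as large as the n-th largest count,
## so ties at the cut-off are all kept and the result can have more than n entries
n_top <- 3
cutoff <- sort(as.vector(counts), decreasing = TRUE)[n_top]
top_jobs <- counts[counts >= cutoff]
top_jobs
```

Here the top 3 contains four occupations, because two of them share the third-largest count.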
