Homework 3: Valentina L.

1. Sample description

Our dataset contains salary (€), length of employment (years), and gender for 200 public service employees. There are 100 female and 100 male individuals, so each gender makes up half of the sample. Length of employment ranges from about 0.007 years (0.007167 × 365 ≈ 2.6 days) to 29.67 years, with a mean of 15.73 years. Annual salary ranges from 30,203 to 331,348 Euros, with a mean of 122,304 Euros. The standard deviation (sd) of length of employment is 9.04 years, which is more than half of the mean, so length of employment varies considerably within the sample. The standard deviation of salary is 79,030.12 Euros, which is also more than half of the mean, so income levels are widely spread within the dataset.
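The "more than half of the mean" comparison and the conversion of the shortest employment length into days can be checked with a few lines of R. This is a minimal sketch that uses the values printed by `summary()` and `sd()` in this section; the variable names (`cv_years`, `cv_salary`, `min_days`) are introduced here for illustration only.

```r
# Values copied from the summary() and sd() output reported in this section
mean_years  <- 15.734362
sd_years    <- 9.035618
mean_salary <- 122304
sd_salary   <- 79030.12
min_years   <- 0.007167

# Ratio of standard deviation to mean (coefficient of variation);
# a value above 0.5 means the sd is more than half of the mean
cv_years  <- sd_years / mean_years
cv_salary <- sd_salary / mean_salary

# Shortest length of employment expressed in days (assuming 365 days per year)
min_days <- min_years * 365

round(c(cv_years = cv_years, cv_salary = cv_salary), 2)  # 0.57 and 0.65
round(min_days, 1)                                       # 2.6
```

Both ratios are above 0.5, which matches the statement that each standard deviation exceeds half of its mean.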

# basic sample description

# list of the variables (columns) in my dataset
names(df)

## [1] "years_empl" "salary"     "gender"

## --> years_empl
## --> salary
## --> gender (nominal data)

# number of entries 
nrow(df)

## [1] 200

## --> the dataset shows 200 entries

# summary statistics (show all variables)
summary(df)

##    years_empl            salary          gender         
##  Min.   : 0.007167   Min.   : 30203   Length:200        
##  1st Qu.: 7.790195   1st Qu.: 54208   Class :character  
##  Median :16.191430   Median : 97496   Mode  :character  
##  Mean   :15.734362   Mean   :122304                     
##  3rd Qu.:22.908421   3rd Qu.:179447                     
##  Max.   :29.666752   Max.   :331348

# sd for numeric columns (years_empl & salary)
sd(df$years_empl)

## [1] 9.035618

sd(df$salary)

## [1] 79030.12

## --> measures of how spread out the values are

# Frequency table for categorical variable (gender)
table(df$gender)

## 
## Female   Male 
##    100    100

## --> out of 200 entries there are 100 female, 100 male individuals in the dataset
## the next lines are just to try out some coding
prop.table(table(df$gender))

## 
## Female   Male 
##    0.5    0.5

## --> again: relative frequency shows that 0.5 are female, 0.5 are male
round(100 * prop.table(table(df$gender)), 1)

## 
## Female   Male 
##     50     50

## --> percentage: 50 % female, 50 % male

2. Association between years (length of employment) and salary as scatterplot.

The scatterplot shows the association between the length of employment (in years) and the annual salary (in Euros) that individuals earn. There is a positive association between these two variables, indicating that the longer someone has been employed, the higher their annual salary tends to be. The slope is positive but becomes steeper at higher levels of employment length, suggesting that the relationship is not linear. Instead, salary appears to increase more rapidly with more years of employment.

# There is a positive association: the longer the length of employment in years, the higher the annual salary in Euros that individuals receive.

# The points run from bottom left to top right, so employees with more years of employment generally earn higher salaries. Salaries still vary among individuals with the same years of employment, which may reflect other factors (e.g., role, industry, education).
# The increase in salary is not constant across all years (the slope is steeper at higher employment length) --> positive association, but NOT LINEAR. A straight-line model does not fit well, which is visible in the plot.
# plot(x,y)
plot(df$years_empl, df$salary,
     main = "How the length of employment is associated with annual salary", 
     xlab = "Length of employment in years", 
     ylab = "Annual salary in Euros")
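
One way to make the curvature easier to see is to overlay a smoothed trend line on the scatterplot. The sketch below uses simulated stand-in data: the data frame `sim`, its generating parameters, and the seed are assumptions for illustration and are not the homework dataset. With the real data, `df` would take the place of `sim`.

```r
# Simulated stand-in data (NOT the homework dataset): salary grows
# roughly exponentially with years of employment, plus noise
set.seed(1)
sim <- data.frame(years_empl = runif(200, min = 0, max = 30))
sim$salary <- 10^(4.5 + 0.03 * sim$years_empl + rnorm(200, sd = 0.15))

# Scatterplot with a lowess smoother to show the shape of the trend
plot(sim$years_empl, sim$salary,
     xlab = "Length of employment in years",
     ylab = "Annual salary in Euros")
lines(lowess(sim$years_empl, sim$salary), lwd = 2)

# Rough check of curvature: compare straight-line slopes fitted
# separately to the lower and upper half of employment length
lower <- sim$years_empl < 15
slope_lower <- coef(lm(salary ~ years_empl, data = sim[lower, ]))[["years_empl"]]
slope_upper <- coef(lm(salary ~ years_empl, data = sim[!lower, ]))[["years_empl"]]
c(lower = slope_lower, upper = slope_upper)
```

If the slope in the upper half is clearly larger than in the lower half, a single straight line is unlikely to describe the whole range well.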

3. Estimate salary by years of employment

As seen before, there is a positive association between years of employment and salary, but the relationship is not linear, which is evident from the scatterplot. To better understand this association, I linearized the relationship by transforming only the dependent variable, salary, by taking its base-10 logarithm. The model now estimates how years of employment relate to the logarithm of salary, so the slope can be read as an approximately constant percentage change in salary per additional year.
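
The reason the logarithm straightens the point cloud can be written out as a short derivation. It assumes that salary grows roughly exponentially with years of employment, which the scatterplot suggests but which is not formally tested here. The symbols $a$, $b$, $\beta_0$, and $\beta_1$ are introduced only for illustration.

```latex
\begin{align*}
\text{salary} &= a \cdot b^{\,\text{years}} \\
\log_{10}(\text{salary}) &= \log_{10}(a) + \log_{10}(b) \cdot \text{years} \\
                         &= \beta_0 + \beta_1 \cdot \text{years}
\end{align*}
```

Under this assumption, $\log_{10}(\text{salary})$ is a straight line in years, and the yearly growth factor is recovered as $b = 10^{\beta_1}$.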

# To understand this association, linearize it (make the point cloud linear).
# Option: transform the dependent variable "salary"; adding some "average" points could help illustrate the argument.

# Regression Analysis
# The higher the length of employment in years (independent) the higher the salary in Euros (dependent) per year individuals receive.
# estimate salary by years of employment
## estimate the regression constant (intercept) and the regression coefficient.
#! lm(y ~ x)
lm(df$salary ~ df$years_empl) # estimate a linear model of how salary depends on years of employment.

## 
## Call:
## lm(formula = df$salary ~ df$years_empl)
## 
## Coefficients:
##   (Intercept)  df$years_empl  
##         -2684           7944

plot(df$years_empl, df$salary)
abline(lm(df$salary ~ df$years_empl))

# use logarithm (base 10) to transform exponential growth into a linear relationship (only salary)
plot(df$years_empl, log10(df$salary),
     main = "How the length of employment is associated with annual salary", 
     xlab = "Length of employment in years", 
     ylab = "Log 10 of Annual salary in Euros")

# estimate a linear model:
# est. log10(salary) = 4.50 + 0.03 * years_empl
# The estimated logarithm (base 10) of salary is modeled as increasing linearly with years of employment.
# the intercept is 4.50: the expected logarithm base 10 of salary is 4.50 when years of employment is 0.
# the coefficient is 0.03: for every additional year of employment, log10(salary) increases by 0.03 (10^0.03 ≈ 1.07, i.e. about a 7% increase in salary for every year of employment).

lm(log10(df$salary) ~ df$years_empl)

## 
## Call:
## lm(formula = log10(df$salary) ~ df$years_empl)
## 
## Coefficients:
##   (Intercept)  df$years_empl  
##       4.50918        0.03083

abline(lm(log10(df$salary) ~ df$years_empl))

4. Interpretation

The longer the length of employment in years, the higher the salary in Euros: a positive association, which is now linearized. The model shows how years of employment relate to the base-10 logarithm of salary, so the slope corresponds to a percentage change in salary. As the intercept is 4.50, the expected log10 of salary is 4.50 when years of employment is 0 (10^4.50 ≈ 31,623 Euros according to this model; with the unrounded intercept 4.509, about 32,300 Euros). Further, for every additional year of employment, log10 of salary increases by 0.03, which corresponds to an increase of approx. 7% in salary for every additional year of employment.
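
The back-transformation from the log10 scale to Euros can be reproduced with the coefficients printed by `lm()`. This is a minimal sketch; the helper `predict_salary()` and the variable names are hypothetical and not part of the original analysis.

```r
# Coefficients copied from the lm() output for log10(salary) ~ years_empl
b0 <- 4.50918  # intercept on the log10 scale
b1 <- 0.03083  # slope on the log10 scale, per year of employment

# Back-transform: salary = 10^b0 * (10^b1)^years
start_salary  <- 10^b0                      # model-implied salary at 0 years
growth_factor <- 10^b1                      # multiplicative change per year
pct_per_year  <- 100 * (growth_factor - 1)  # percentage change per year

# Hypothetical helper: model-implied salary for given years of employment
predict_salary <- function(years) 10^(b0 + b1 * years)

round(start_salary)                  # about 32,300 Euros
round(pct_per_year, 1)               # 7.4
round(predict_salary(c(0, 10, 20)))  # model-implied salaries at 0, 10, 20 years
```

With the unrounded slope, the yearly increase is about 7.4%, consistent with the approx. 7% obtained from the rounded coefficient 0.03.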


2025-05-23
