HW 3

1. Sample description

The dataset contains information on workers’ years of experience and the salaries they receive. It includes a total of 200 observations, as returned by nrow(df).

To explore the distribution of the variables, histograms were generated for both years of experience (years_exp) and salary (salary). These plots give a general overview of each distribution, showing where values cluster and how widely they are spread.

Descriptive statistics (minimum, first quartile, median, mean, third quartile, and maximum) were also computed for both variables. Salary ranges from 30,203 to 331,348, and its mean (122,304) lies well above its median (97,496), which indicates a right-skewed distribution. Years of experience ranges from about 0 to 29.7, with a mean (15.7) close to the median (16.2).

nrow(df)

## [1] 200

hist(df$years_exp)

hist(df$salary)

summary(df$salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30203   54208   97496  122304  179447  331348

summary(df$years_exp)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.007167  7.790195 16.191430 15.734362 22.908421 29.666752

2. Association between years of experience and salary (scatterplot)

The scatterplot below does not show a purely linear pattern, but there is a clear positive association: in this dataset, more years of experience tend to go with a higher salary. The Pearson and Spearman correlation coefficients are 0.908 and 0.961, respectively. The higher Spearman value is consistent with a relationship that is monotonic but not strictly linear.

plot(df$years_exp, df$salary)
cor(df$years_exp, df$salary, use = "complete.obs", method = "pearson")

## [1] 0.908204

cor(df$years_exp, df$salary, use = "complete.obs", method = "spearman")

## [1] 0.9608431
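The gap between the two coefficients can be illustrated with a minimal sketch. It uses simulated data in which salary grows exponentially with experience; the simulation parameters are assumptions chosen for illustration and are not estimated from the homework dataset. Spearman depends only on ranks, so taking the logarithm of salary leaves it unchanged, while Pearson increases once the relationship is made linear.

```r
# Simulated stand-in data (an assumption, NOT the homework dataset):
# salary grows exponentially with experience, with multiplicative noise.
set.seed(42)
years_exp <- runif(200, min = 0, max = 30)
salary <- exp(10.3 + 0.08 * years_exp + rnorm(200, sd = 0.15))

# Pearson measures linear association, so it changes under log().
pearson_raw <- cor(years_exp, salary, method = "pearson")
pearson_log <- cor(years_exp, log(salary), method = "pearson")

# Spearman uses ranks only, so a monotone transform leaves it unchanged.
spearman_raw <- cor(years_exp, salary, method = "spearman")
spearman_log <- cor(years_exp, log(salary), method = "spearman")

print(c(pearson_raw = pearson_raw, pearson_log = pearson_log))
print(c(spearman_raw = spearman_raw, spearman_log = spearman_log))
```

If the real data behave similarly, comparing cor(df$years_exp, log(df$salary)) with the Pearson value above would be a quick check.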

plot(df$years_exp, df$salary)
lm(df$salary ~ df$years_exp)

## 
## Call:
## lm(formula = df$salary ~ df$years_exp)
## 
## Coefficients:
##  (Intercept)  df$years_exp  
##        -2684          7944

abline(lm(df$salary ~ df$years_exp))
points(x=10, y=50000, cex = 2, pch= 21, col = "white", bg="blue")
points(x=20, y=150000, cex = 2, pch= 21, col = "white", bg="blue")
points(x=30, y=250000, cex = 2, pch= 21, col = "white", bg="blue")

3. Estimate salary by years of employment

The fitted model is salary = -2684 + 7944 * years_exp. That is, each additional year of experience is associated with an estimated salary increase of approximately €7,944.

lm(df$salary ~ df$years_exp)

## 
## Call:
## lm(formula = df$salary ~ df$years_exp)
## 
## Coefficients:
##  (Intercept)  df$years_exp  
##        -2684          7944
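To make the estimate concrete, the following sketch plugs the printed (rounded) coefficients into the regression equation for 10, 20, and 30 years of experience. The helper name predict_salary is hypothetical and only used for illustration.

```r
# Coefficients as printed by lm() above (rounded in the output).
intercept <- -2684
slope <- 7944

# Hypothetical helper: predicted salary for given years of experience.
predict_salary <- function(years) intercept + slope * years

years <- c(10, 20, 30)
predicted <- predict_salary(years)

# Predicted salaries: 76756, 156196, 235636
print(data.frame(years_exp = years, predicted_salary = predicted))
```

When the model is fitted as lm(salary ~ years_exp, data = df) and stored in an object, predict() with a newdata data frame should give the same values up to rounding of the coefficients.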

4. Interpretation

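In this model, the intercept of -2,684 is the estimated salary at zero years of experience. A negative salary is not realistic; the value is simply where the best-fitting straight line crosses the vertical axis. Since the lowest observed salary is 30,203, it also suggests that a straight line does not describe workers with very little experience well.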

The slope of 7,944 indicates that, on average, each additional year of experience is associated with an increase of approximately €7,944 in salary. This is in line with the strong positive relationship between experience and salary observed in the dataset.
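One possible way to avoid a negative predicted salary is to model the logarithm of salary instead. The sketch below is not the analysis reported here; it uses simulated data (an assumption, with made-up parameters) to show how such a model could be fitted and read. Predictions are back-transformed with exp(), so they are always positive, and the slope becomes a percentage change per year rather than a fixed amount.

```r
# Simulated stand-in data (an assumption, NOT the homework dataset).
set.seed(42)
df_sim <- data.frame(years_exp = runif(200, min = 0, max = 30))
df_sim$salary <- exp(10.3 + 0.08 * df_sim$years_exp + rnorm(200, sd = 0.15))

# Linear model on the raw scale versus on the log scale.
fit_raw <- lm(salary ~ years_exp, data = df_sim)
fit_log <- lm(log(salary) ~ years_exp, data = df_sim)

# Predictions at zero experience. The log model is back-transformed
# with exp(), so its prediction cannot be negative.
new_worker <- data.frame(years_exp = 0)
pred_raw <- unname(predict(fit_raw, newdata = new_worker))
pred_log <- unname(exp(predict(fit_log, newdata = new_worker)))

# In the log model, exp(slope) - 1 is the percentage change in
# predicted salary per additional year of experience.
pct_per_year <- unname(100 * (exp(coef(fit_log)["years_exp"]) - 1))

print(c(pred_raw = pred_raw, pred_log = pred_log, pct_per_year = pct_per_year))
```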

5. (Voluntary) Gender effects

The multiple linear regression model estimates salary based on both years of experience and gender:

The intercept (-15,847) represents the estimated salary for a female worker with 0 years of experience.

The coefficient for years of experience (7,944) indicates that, holding gender constant, salary increases by approximately €7,944 for each additional year of experience.

The coefficient for genderMale (26,325) suggests that, holding experience constant, male workers earn on average €26,325 more than female workers.

lm(df$salary ~ df$years_exp + df$gender)

## 
## Call:
## lm(formula = df$salary ~ df$years_exp + df$gender)
## 
## Coefficients:
##   (Intercept)   df$years_exp  df$genderMale  
##        -15847           7944          26325
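Written out with the printed (rounded) coefficients, the model describes two parallel lines, one per gender. Here x denotes years of experience and Male is an indicator equal to 1 for male workers and 0 for female workers; both symbols are introduced only for this illustration.

```latex
\begin{aligned}
\widehat{\text{salary}} &= -15847 + 7944\,x + 26325\,\text{Male} \\
\text{Female } (\text{Male}=0):\quad \widehat{\text{salary}} &= -15847 + 7944\,x \\
\text{Male } (\text{Male}=1):\quad \widehat{\text{salary}} &= 10478 + 7944\,x
\end{aligned}
```

For example, at x = 10 the model predicts €63,593 for a female worker and €89,918 for a male worker. The difference is the genderMale coefficient of €26,325, and it is the same at every level of experience because the model contains no interaction term.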