Practical task 13

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'effects' was built under R version 4.3.2
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
## Rows: 552 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gender, occupation, union
## dbl (3): wage, experience, education
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary statistics

summary(PSID1982)
##       wage          experience      education        gender         
##  Min.   : 313.0   Min.   : 7.00   Min.   : 4.00   Length:552        
##  1st Qu.: 824.5   1st Qu.:13.00   1st Qu.:12.00   Class :character  
##  Median :1100.0   Median :20.00   Median :12.00   Mode  :character  
##  Mean   :1174.5   Mean   :22.73   Mean   :12.94                     
##  3rd Qu.:1381.2   3rd Qu.:32.00   3rd Qu.:16.00                     
##  Max.   :5100.0   Max.   :51.00   Max.   :17.00                     
##   occupation           union          
##  Length:552         Length:552        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Wage

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   313.0   824.5  1100.0  1174.5  1381.2  5100.0

The histogram shows the distribution of the original wage variable. The tallest bar sits near wage = 1000, so observations cluster around that level, close to the median of 1100. The distribution is right-skewed: the mean (1174.5) exceeds the median, and a small number of high earners stretch the upper tail up to the maximum of 5100.

Transforming wage into log(wage)

After the logarithmic transformation, the density of wage is unimodal with a single peak near log(wage) = 7, which corresponds to a wage of roughly 1100. The log(wage) distribution is approximately symmetric and close to normal, which is helpful when it serves as the response in a linear regression.
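The effect of the transformation on skewness can be illustrated with a small sketch. It uses simulated log-normal values as a stand-in for wage (an assumption, not the PSID1982 data), and the helper `skewness()` is defined here for illustration rather than taken from a package.

```r
# Simulated stand-in for a right-skewed wage variable (NOT the PSID1982 data).
set.seed(13)
wage_sim <- rlnorm(552, meanlog = 7, sdlog = 0.4)

# Sample skewness: mean of the cubed standardized values.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(wage_sim)       # clearly positive for right-skewed data
skew_log <- skewness(log(wage_sim))  # should be much closer to zero

print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero on the log scale is what the roughly symmetric density plot suggests for the real data.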

The potential benefits of log-transforming the wage variable for linear regression modeling

Linear Relationship: Linear regression assumes that the dependent variable is a linear function of the independent variables. Taking the logarithm of wage reduces right skew and pulls in extreme high values, which often makes the relationship with the predictors closer to linear.

Homoscedasticity: Employing log transformation frequently stabilizes the variability of residuals across different levels of the independent variable. This is crucial for satisfying the homoscedasticity assumption in linear regression.

Interpretability: With log(wage) as the response, coefficients have a relative interpretation. A one-unit increase in a predictor changes log(wage) by its coefficient b, which corresponds to a change in wage of approximately 100 * b percent when b is small, and exactly 100 * (exp(b) - 1) percent.
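The percentage interpretation can be made concrete with a short sketch. The coefficient values below are hypothetical and chosen only to show how the rule of thumb (100 * b) drifts from the exact back-transformation as b grows; `pct_change()` is an illustrative helper.

```r
# Exact percentage change in wage implied by a coefficient b on the log scale.
pct_change <- function(b) 100 * (exp(b) - 1)

# Hypothetical coefficient values, for illustration only.
b <- c(0.05, 0.10, 0.40)

comparison <- data.frame(
  coefficient = b,
  approx_pct  = 100 * b,        # rule-of-thumb reading
  exact_pct   = pct_change(b)   # exact back-transformation
)
print(comparison)
```

For small coefficients the two columns nearly agree; for larger ones, such as a gender effect around 0.4, the exact value is noticeably larger than the approximation.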

Estimating a linear regression model (via OLS)

## 
## Call:
## lm(formula = log(wage) ~ gender + union + education + experience + 
##     I(experience^2), data = PSID1982)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03000 -0.23413  0.01157  0.22003  1.18308 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.1607606  0.1195829  43.156  < 2e-16 ***
## gendermale       0.4078465  0.0511955   7.966 9.56e-15 ***
## unionyes         0.0854963  0.0322775   2.649 0.008312 ** 
## education        0.0805124  0.0056836  14.166  < 2e-16 ***
## experience       0.0281727  0.0072177   3.903 0.000107 ***
## I(experience^2) -0.0004198  0.0001423  -2.950 0.003313 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3459 on 546 degrees of freedom
## Multiple R-squared:  0.3479, Adjusted R-squared:  0.3419 
## F-statistic: 58.25 on 5 and 546 DF,  p-value: < 2.2e-16
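The positive coefficient on experience together with the negative coefficient on I(experience^2) describes a concave profile: wages rise with experience at a decreasing rate. As a rough sketch using only the coefficients printed above (the names `marginal_effect` and `turning_point` are illustrative), the experience level at which the profile peaks can be computed as follows.

```r
# Coefficients copied from the OLS output above.
b_exp  <- 0.0281727    # experience
b_exp2 <- -0.0004198   # I(experience^2)

# Marginal effect of one more year of experience on log(wage):
# derivative of b_exp * x + b_exp2 * x^2 with respect to x.
marginal_effect <- function(x) b_exp + 2 * b_exp2 * x

# Experience level at which the marginal effect reaches zero (peak of the profile).
turning_point <- -b_exp / (2 * b_exp2)

print(turning_point)
# Marginal effects at the minimum, mean, and maximum experience in the sample.
print(marginal_effect(c(7, 22.73, 51)))
```

The peak falls at roughly 33.6 years of experience, which is inside the observed range of 7 to 51 years.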

Performing a stepwise model selection

## Start:  AIC=-1165.93
## log(wage) ~ gender + union + education + experience + I(experience^2)
## 
##                    Df Sum of Sq    RSS     AIC
## + education:union   1    3.7906 61.550 -1196.9
## <none>                          65.340 -1165.9
## + gender:education  1    0.1682 65.172 -1165.4
## 
## Step:  AIC=-1196.92
## log(wage) ~ gender + union + education + experience + I(experience^2) + 
##     union:education
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          61.550 -1196.9
## + gender:education  1    0.0955 61.454 -1195.8
## - union:education   1    3.7906 65.340 -1165.9
## 
## Call:
## lm(formula = log(wage) ~ gender + union + education + experience + 
##     I(experience^2) + union:education, data = PSID1982)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10138 -0.22652  0.02306  0.21949  1.17973 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8983374  0.1246874  39.285  < 2e-16 ***
## gendermale          0.3873380  0.0498597   7.769 3.97e-14 ***
## unionyes            0.9275040  0.1486811   6.238 8.89e-10 ***
## education           0.1006744  0.0065266  15.425  < 2e-16 ***
## experience          0.0299238  0.0070181   4.264 2.37e-05 ***
## I(experience^2)    -0.0004717  0.0001385  -3.405 0.000711 ***
## unionyes:education -0.0679519  0.0117290  -5.793 1.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3361 on 545 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3789 
## F-statistic: 57.03 on 6 and 545 DF,  p-value: < 2.2e-16
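The AIC values in the stepwise trace can be checked from the reported residual sums of squares. The sketch below assumes the form that step() uses for lm fits, n * log(RSS / n) + 2 * k, where k is the number of estimated regression coefficients; `aic_step()` is an illustrative helper, and the RSS values are the rounded ones from the trace.

```r
n <- 552  # number of observations

# AIC in the form used in the stepwise trace for linear models.
aic_step <- function(rss, k) n * log(rss / n) + 2 * k

aic_start <- aic_step(rss = 65.340, k = 6)  # starting model, 6 coefficients
aic_final <- aic_step(rss = 61.550, k = 7)  # after adding union:education

print(c(start = aic_start, final = aic_final, drop = aic_start - aic_final))
```

Adding union:education costs one extra parameter (a penalty of 2) but lowers the RSS enough to reduce the AIC by about 31, which is why that term is kept, while gender:education is not.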

The influence of education on wage in the selected model:

The estimated coefficient for education is 0.1006744. Because the selected model includes a union:education interaction, this is the expected change in log(wage) for one additional year of education among non-union workers, holding the other variables constant. It corresponds to a wage increase of about 10.6% per year of education. For union members the slope is 0.1006744 - 0.0679519 = 0.0327225, or about 3.3% per year.

The education coefficient is positive and highly significant (t = 15.425, p < 2e-16). The interaction term is also significant (t = -5.793, p = 1.17e-08), so the return to education differs by union status.

In conclusion, under the selected model more education is associated with significantly higher wages, and this association is considerably stronger for non-union workers than for union members.
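Since the selected model contains a union:education interaction, the education slope depends on union status. A minimal sketch, using only the coefficients from the summary above (variable names are illustrative), computes both slopes and their percentage equivalents.

```r
# Coefficients copied from the selected model's summary.
b_educ       <- 0.1006744    # education (slope for non-union workers)
b_union_educ <- -0.0679519   # unionyes:education (shift in slope for union members)

slope_nonunion <- b_educ
slope_union    <- b_educ + b_union_educ

# Exact percentage change in wage per additional year of education.
pct <- function(b) 100 * (exp(b) - 1)

print(data.frame(
  group     = c("non-union", "union"),
  slope     = c(slope_nonunion, slope_union),
  pct_wage  = c(pct(slope_nonunion), pct(slope_union))
))
```

This gives roughly 10.6% per year of education for non-union workers and roughly 3.3% for union members.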

Specific regression equations according to the final model:

Group 1: Male, union member, average work experience

## [1] "Predicted log(wage): 7.07303772387518"
## [1] "95% Confidence Interval: 6.41033205120099 7.73574339654937"

For men who are union members with average work experience, the predicted log(wage) is 7.073, or roughly 1,180 on the original wage scale. The reported 95% interval is [6.410, 7.736]. Its half-width of about 0.66 is roughly twice the residual standard error (0.3361), which is the width expected of a prediction interval for an individual worker rather than a confidence interval for the group mean. Read that way, about 95% of individual workers with this profile are expected to have log(wage) within this range; it is not a statement about the group average.
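One way to check what kind of interval was reported is to compare its half-width with the residual standard error of the final model. The sketch below uses only numbers printed in the output; it assumes the usual prediction-interval formula, half-width = t * sqrt(sigma^2 + se_fit^2), and the variable names are illustrative.

```r
sigma_hat <- 0.3361   # residual standard error of the final model
df_resid  <- 545      # residual degrees of freedom

# Half-width of the reported Group 1 interval.
half_width <- (7.73574339654937 - 6.41033205120099) / 2

# Smallest half-width a 95% prediction interval could have,
# i.e. if the fitted mean were known without error.
t_crit   <- qt(0.975, df_resid)
floor_pi <- t_crit * sigma_hat

# Standard error of the fitted mean implied by a prediction-interval reading.
se_fit_implied <- sqrt((half_width / t_crit)^2 - sigma_hat^2)

print(c(half_width = half_width, floor_pi = floor_pi,
        se_fit_implied = se_fit_implied))
```

The reported half-width sits just above the prediction-interval floor, and the implied standard error of the fitted mean is small, as expected with 552 observations. A confidence interval for the mean alone would therefore be far narrower than the one shown.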

Group 2: Female, union member, average work experience

## [1] "Predicted log(wage): 6.68569971233507"
## [1] "95% Confidence Interval: 6.01670003037552 7.35469939429462"

For women who are union members with average work experience, the predicted log(wage) is 6.686, or roughly 800 on the original wage scale, with a reported 95% interval of [6.017, 7.355]. The prediction is lower than that for the otherwise identical male group by 0.387, which is the gendermale coefficient.

Group 3: Male, not a union member, average work experience

## [1] "Predicted log(wage): 7.02472286497077"
## [1] "95% Confidence Interval: 6.36270831536138 7.68673741458016"

For men who are not union members with average work experience, the predicted log(wage) is 7.025, or roughly 1,124 on the original wage scale, with a reported 95% interval of [6.363, 7.687]. This is only about 0.048 below the prediction for male union members, because at the education level used for these predictions the positive union coefficient is almost offset by the negative union:education interaction.

Group 4: Female, not a union member, average work experience

## [1] "Predicted log(wage): 6.63738485343066"
## [1] "95% Confidence Interval: 5.9699088951008 7.30486081176052"

For women who are not union members with average work experience, the predicted log(wage) is 6.637, or roughly 763 on the original wage scale, with a reported 95% interval of [5.970, 7.305]. This is the lowest prediction of the four groups.
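The four predictions can be approximately reconstructed by hand from the coefficients of the final model. The sketch below rests on an assumption not stated in the text: that education, like experience, is held at its sample mean. It uses the rounded means from the summary table (12.94 and 22.73), so results agree with the reported values only to about three decimals. The names `predict_log_wage` and `groups` are illustrative.

```r
# Coefficients of the final model, copied from the summary output.
b <- c(intercept = 4.8983374, male = 0.3873380, union = 0.9275040,
       educ = 0.1006744, exper = 0.0299238, exper2 = -0.0004717,
       union_educ = -0.0679519)

# Assumption: education and experience are both held at their rounded sample means.
educ_bar  <- 12.94
exper_bar <- 22.73

# Linear predictor for a given gender (male = 1/0) and union status (union = 1/0).
predict_log_wage <- function(male, union) {
  unname(b["intercept"] + b["male"] * male + b["union"] * union +
           b["educ"] * educ_bar + b["union_educ"] * union * educ_bar +
           b["exper"] * exper_bar + b["exper2"] * exper_bar^2)
}

groups <- data.frame(
  group = c("male, union", "female, union", "male, non-union", "female, non-union"),
  male  = c(1, 0, 1, 0),
  union = c(1, 1, 0, 0)
)
groups$log_wage <- mapply(predict_log_wage, groups$male, groups$union)
groups$wage     <- exp(groups$log_wage)  # back-transformed to the wage scale
print(groups)
```

The close agreement with the reported predictions supports the assumption about education. The back-transformed column gives a more directly readable wage figure for each group.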