# Set a random seed for reproducibilityset.seed(123)# Prepare needed librariespackages<-c("stargazer", "visdat","skimr","tidyverse","glmnet")for(iin1:length(packages)){if(!packages[i]%in%rownames(installed.packages())){install.packages(packages[i] , repos ="http://cran.rstudio.com/" , dependencies =TRUE)}library(packages[i], character.only =TRUE)}
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-8
=======================================================
Statistic N Mean St. Dev. Min Max
-------------------------------------------------------
yrs.since.phd 397 22.315 12.887 1 56
yrs.service 397 17.615 13.006 0 60
salary 397 113,706.500 30,289.040 57,800 231,545
-------------------------------------------------------
===================================================================
Statistic N Mean St. Dev. Min Max
-------------------------------------------------------------------
yrs.since.phd 397 22.315 12.887 1 56
yrs.service 397 17.615 13.006 0 60
salary 397 11,370,646.000 3,028,904.000 5,780,000 23,154,500
-------------------------------------------------------------------
=================================================
Statistic N Mean St. Dev. Min Max
-------------------------------------------------
yrs.since.phd 397 22.315 12.887 1 56
yrs.service 397 17.615 13.006 0 60
salary 397 113.706 30.289 57.800 231.545
-------------------------------------------------
OLS (Ordinary Least Squares)
Is scale invariant. The choice of scale is purely for convenience and clarity in interpretation.
The regression equation:
\[
rank_i = \beta_0 + \beta_1 Years \ Since \ PhD_i + \beta_2 Years \ of \ Service_i + \beta_3 Sex_i + \beta_4 Salary_i + \epsilon _i
\tag{1}\]
includes the variable Salary, which could be measured in units such as dollars, thousands of dollars, or cents. Changing the scale of the salary variable (e.g., dividing by 1,000 to express salary in thousands) does not alter the underlying relationship between salary and rank. Instead, it affects the interpretation of the coefficient for Salary.
OLS is Scale Invariant
Scale invariance refers to the property of regression models where changing the scale of an independent variable affects only the coefficient of that variable, not its overall effect on the dependent variable.
Specifically:
If Salary is measured in dollars, the coefficient \(\beta_4\) represents the change in rank for a one-dollar increase in salary.
If Salary is measured in thousands of dollars, the coefficient \(\beta_4\) would adjust accordingly, representing the change in rank for a one-thousand-dollar increase in salary.
The model’s predictions (\(\hat{rank}_i\)) and the overall fit (e.g., \(R^2\), significance of \(\beta_4\)) remain unchanged because the relationship between salary and rank is preserved; only the scale of the coefficient adapts.
reg1<-lm(data =df, formula =rank~yrs.since.phd+yrs.service+salary+sex)reg1_cents<-lm(data =df_cents, formula =rank~yrs.since.phd+yrs.service+salary+sex)reg1_thousands<-lm(data =df_thousands, formula =rank~yrs.since.phd+yrs.service+salary+sex)stargazer(reg1,reg1_cents, reg1_thousands, type="text", column.labels =c("Original","In Cents","In Thousands"), order =("salary"), digits =10)
Lasso (Least Absolute Shrinkage and Selection Operator)
Now, suppose we are interested in the estimating equation:
\[
rank_i = \beta_0 + \beta_1 Years \ Since \ PhD_i + \beta_2 Years \ of \ Service_i + \beta_3 Salary_i + \epsilon _i
\tag{2}\]
We will run lasso on this functional form.
We will then show that if you do not scale a variable, it is more likely to get eliminated (if it is relatively large compared to scaled variables), and more unlikely to get eliminated (if it is relatively small compared to scaled variables).
LASSO is not scale-invariant.
This means that the selection of variables and the magnitude of their coefficients can be affected by the scale of the predictors
=======================================================
Statistic N Mean St. Dev. Min Max
-------------------------------------------------------
yrs.since.phd 397 22.315 12.887 1 56
yrs.service 397 17.615 13.006 0 60
salary 397 113,706.500 30,289.040 57,800 231,545
-------------------------------------------------------
==============================================
Statistic N Mean St. Dev. Min Max
----------------------------------------------
yrs.since.phd 397 0.000 1.000 -1.654 2.614
yrs.service 397 -0.000 1.000 -1.354 3.259
salary 397 0.000 1.000 -1.846 3.890
----------------------------------------------
Run OLS on the estimating equation, which is equivalent to \(\lambda=0\) in the glmnet command. \[
rank_i = \beta_0 + \beta_1 Years \ Since \ PhD_i + \beta_2 Years \ of \ Service_i + \beta_3 Salary_i + \epsilon _i
\]
reg2<-lm(data =df, formula =as.numeric(rank)~scale(yrs.since.phd)+scale(yrs.service)+scale(salary))summary(reg2)# same coefficients as la.eq
Call:
lm(formula = as.numeric(rank) ~ scale(yrs.since.phd) + scale(yrs.service) +
scale(salary), data = df)
Residuals:
Min 1Q Median 3Q Max
-1.30020 -0.37101 0.02817 0.34410 1.02497
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50126 0.02387 104.790 < 2e-16 ***
scale(yrs.since.phd) 0.45802 0.06018 7.611 2.04e-13 ***
scale(yrs.service) -0.04959 0.05798 -0.855 0.393
scale(salary) 0.29829 0.02653 11.244 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4756 on 393 degrees of freedom
Multiple R-squared: 0.6189, Adjusted R-squared: 0.616
F-statistic: 212.7 on 3 and 393 DF, p-value: < 2.2e-16
Output matches for reg2 and la.eq.
Implement Lasso
Choosing a non-zero \(\lambda\) ensures we are implementing an Lasso.
la.eq selects all three vars.
la.eq_cents selects should eliminate salary variable as it would give us a bigger coefficient corresponding to salary variable. Lasso coefficient on salary in la.eq_cents model should be thus smaller than la.eq model, suggesting we strongly eliminate this variables.
la.eq_thousands selects should keep salary variable as it would give us a smaller coefficient corresponding to salary variable. Lasso coefficient on salary in la.eq_thousands model should be thus bigger than la.eq model, suggesting we strongly keep this variables.
?glmnet# lassoscaledX<-df_scaled[,c("salary", "yrs.since.phd", "yrs.service")]scaledX_cents<-scaledX|>mutate(salary =salary*1000000)scaledX_thousands<-scaledX|>mutate(salary =salary/10000000)la.eq<-glmnet(x =scaledX, y =as.numeric(df_scaled$rank), alpha=1, lambda =.342)la.eq_cents<-glmnet(x =scaledX_cents, y =as.numeric(df_scaled$rank), alpha=1, lambda =.342)la.eq_thousands<-glmnet(x =scaledX_thousands, y =as.numeric(df_scaled$rank), alpha=1, lambda =.342)data.frame( Lasso_standard =la.eq$beta[,1], Lasso_cents =la.eq_cents$beta[,1], Lasso_thousands =la.eq_thousands$beta[,1])