Slides for Week 6 are available here.
In this tutorial, you will learn how to do the basic set-up (install and load all necessary libraries), recode and transform variables, and fit and compare regression models.
Check the CPS2015_Description.pdf in your project folder.
Each month the Bureau of Labor Statistics in the U.S. Department of Labor conducts the “Current Population Survey” (CPS), which provides data on labor force characteristics of the population, including the level of employment, unemployment, and earnings. The file CPS2015 contains the data for 2015 (from the 2016 surveys). These data are for full-time workers, defined as workers employed more than 35 hours per week for at least 48 weeks in the previous year. Data are provided for workers whose highest educational achievement is (1) a high school diploma, and (2) a bachelor’s degree.
Variables in the dataset:
Your filename.csv needs to be exact and indicate the file format, which here is a .csv file (Comma-Separated Values file): a delimited text file that uses a comma to separate values. It is one type of data-friendly spreadsheet format besides the common .xlsx Excel format.
Now, in your global environment, you should see cps and cps1 stored. The dataset has 7098 observations (i.e., 7098 rows) and 5 variables (i.e., 5 columns).
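The loading code is not shown above; here is a minimal sketch of how the data could be read in and inspected (the file name CPS2015.csv is an assumption based on the description file; how cps1 was created is not shown here).

# Read the CSV file into a data frame; the file name must match the file in your project folder
cps <- read.csv("CPS2015.csv")

# Summary statistics, structure, and correlations (output shown below)
summary(cps)
str(cps)
cor(cps[c("ahe", "bachelor", "female", "age")])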
## year ahe bachelor female age
## Min. :2015 Min. : 0.0 Min. :0.000 Min. :0.000 Min. :25.0
## 1st Qu.:2015 1st Qu.: 13.0 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:27.0
## Median :2015 Median : 18.3 Median :1.000 Median :0.000 Median :30.0
## Mean :2015 Mean : 21.2 Mean :0.526 Mean :0.417 Mean :29.6
## 3rd Qu.:2015 3rd Qu.: 26.4 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:32.0
## Max. :2015 Max. :105.8 Max. :1.000 Max. :1.000 Max. :34.0
## 'data.frame': 7098 obs. of 5 variables:
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ ahe : num 2.04 2.39 0 0 2.88 ...
## $ bachelor: int 1 1 0 0 1 0 1 1 0 0 ...
## $ female : int 0 1 0 0 0 1 1 0 1 0 ...
## $ age : int 34 33 25 28 27 30 25 29 25 29 ...
## ahe bachelor female age
## ahe 1.000 0.38034 -0.1122 0.13101
## bachelor 0.380 1.00000 0.1487 -0.00113
## female -0.112 0.14867 1.0000 -0.03184
## age 0.131 -0.00113 -0.0318 1.00000
ifelse() for Dummy and Conditional Coding
Dummy coding refers to the process of coding a categorical variable into dichotomous variables (assuming mutually exclusive groups). We can use the ifelse() function to recode and create a new variable based on certain conditions.
Here is the syntax of the ifelse() function: ifelse(logical_expression, a, b)
The arguments of ifelse() are:
1. logical_expression: the test you specify on a variable (e.g., age is greater than or equal to 30).
2. a: the value returned when the logical_expression is TRUE.
3. b: the value returned when the logical_expression is FALSE.
Review the operators for indicating one or multiple conditions for specifying the test (e.g., or, and, less than, greater than and equal to): See Lab 2 RPubs Note.
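For a single condition, a sketch of how the dummy below could be created (the variable name age30 is illustrative, not from the original):

# Check the distribution of age first
summary(cps$age)

# Code 1 if the worker is 30 or older, 0 otherwise
cps$age30 <- ifelse(cps$age >= 30, 1, 0)
table(cps$age30)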
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 27.0 30.0 29.6 32.0 34.0
## 0 1
## 3404 3694
Example for two conditions: create a variable coded 1 if the worker is 30 years old or older and holds a bachelor's degree as the highest educational level (meets both criteria), and coded 0 otherwise.
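A sketch of this two-condition recoding (the variable name age30_bach is illustrative):

# 1 if age >= 30 AND highest degree is a bachelor's; 0 otherwise
cps$age30_bach <- ifelse(cps$age >= 30 & cps$bachelor == 1, 1, 0)
table(cps$age30_bach)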
## 0 1
## 5158 1940
Example for either of two conditions: the worker is female or holds a bachelor's degree (meets either of the two criteria).
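A sketch of this either-or recoding (the variable name fem_or_bach is illustrative):

# 1 if female OR bachelor's degree; 0 otherwise
cps$fem_or_bach <- ifelse(cps$female == 1 | cps$bachelor == 1, 1, 0)
table(cps$fem_or_bach)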
## 0 1
## 2222 4876
In this class, we don’t have time to cover outliers. If you want to learn more about how to handle outliers, check out my past tutorial.
Centering is the process of subtracting the variable mean (average) from each of the values of that same variable; in other words, it is a linear rescaling of the variable. Centering is sometimes done before including variables as predictors in a regression model, generally for one or both of the following purposes: (a) to make the intercept value more interpretable, and (b) to reduce collinearity between two or more predictor variables that are subsequently multiplied to create an interaction term (product term), for example when estimating a moderated multiple linear regression model or a polynomial regression model.
Two steps: first create a new age variable centered around its mean (ageC) by deducting the mean from each value, then square it to create ageC2.
Step 1: Create ageC by deducting the mean from each value of age. The mean of age is:
## [1] 29.6
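A minimal sketch of the centering step (the value printed above is presumably mean(cps$age)):

# Step 1: center age by subtracting its mean; the new variable has mean 0
cps$ageC <- cps$age - mean(cps$age)
summary(cps$ageC)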
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.63 -2.63 0.37 0.00 2.37 4.37
Step 2: Square ageC to create ageC2.
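A sketch of the squaring step:

# Step 2: square the centered variable to create the quadratic term
cps$ageC2 <- cps$ageC^2
summary(cps$ageC2)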
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.14 1.88 5.61 8.27 13.18 21.44
## [1] -0.0674
Log transformation: cps$age and cps$ahe
Log transformation is a data transformation method that replaces each value x with log(x). A log-regression model is a regression equation in which one or more of the variables are linearized via a log transformation. It allows us to transform a complex nonlinear relationship into a simpler linear model that can be evaluated using direct and standard techniques.
The laws of logarithms
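As a quick refresher, the key laws are:
\[ \log(xy) = \log(x) + \log(y) \]
\[ \log(x/y) = \log(x) - \log(y) \]
\[ \log(x^k) = k\,\log(x) \]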
Data cleaning with ifelse()
For the variable: ahe
In the context of analyzing average hourly earnings (ahe), we need to ensure that the data contain no values of 0 (or negative values), because the logarithm of 0 is undefined (it diverges to negative infinity), and logarithms of negative numbers are undefined.
Using ifelse() to change variable values from 0 to 1
Since the minimum value of ahe is 0, we need to clean the data by replacing 0 with 1. To do this, we can use the ifelse() function in R. In the following code, if cps$ahe is equal to 0, cps$ahe1 will be assigned the value 1; if cps$ahe is not equal to 0, cps$ahe1 will be assigned the original value of cps$ahe. This way, we have replaced all 0 values in ahe with 1, which allows us to perform the logarithmic transformation without encountering undefined values.
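A sketch of that cleaning step, following the description above:

# Replace 0 with 1 so the log transformation is defined; keep all other values unchanged
cps$ahe1 <- ifelse(cps$ahe == 0, 1, cps$ahe)
summary(cps$ahe1)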
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 13.0 18.3 21.2 26.4 105.8
# Now, we can do a log transformation on `cps$ahe1`
cps$log_ahe1 <- log(cps$ahe1)
summary(cps$log_ahe1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.56 2.91 2.91 3.27 4.66
For the variable: age
In addition to cleaning the data for ahe, we can also perform a logarithmic transformation on the age variable. Unlike ahe, the age variable does not contain any observations coded as 0, so we don’t need to worry about undefined values when taking the logarithm.
Performing a logarithmic transformation on age can be useful in certain analyses, as it can help to reduce the impact of outliers and better approximate a normal distribution. To do this in R, we can use the log() function to create a new variable, log_age, for any subsequent analyses or modeling.
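A sketch of the transformation described above:

# Log-transform age (no zeros here, so no cleaning is needed)
cps$log_age <- log(cps$age)

summary(cps$age)
summary(cps$log_age)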
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 27.0 30.0 29.6 32.0 34.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.22 3.30 3.40 3.38 3.47 3.53
If you click and view the dataset in the global environment, you should be able to see these new variables created on the right end of the dataset (scroll or use the arrow keys to view the entire dataset) and confirm that the new variables have been added.
We can generate a scatterplot of the logged average hourly earnings (log_ahe1) against the centered age variable (x = ageC). Specifically, the poly(x, 2) argument specifies that a second-order polynomial (quadratic) function should be used to model the relationship between the variables.
# Scatterplot of log_ahe1 against ageC with a fitted quadratic trend line
library(ggplot2)
ggplot(cps, aes(x = ageC, y = log_ahe1)) + geom_point() +
  stat_smooth(se = FALSE, method = 'lm', formula = y ~ poly(x, 2)) +
  labs(title = "Non-linear Relationship between Age and Average Hourly Earnings")
Hierarchical regression is a way to show if variables of your interest explain a statistically significant amount of variance in your Dependent Variable (DV) after accounting for all other variables. This is a framework for model comparison rather than a statistical method. In this framework, you build several regression models by adding variables to a previous model at each step; later models always include smaller models in previous steps. In many cases, our interest is to determine whether newly added variables show a significant improvement in R2 (the proportion of explained variance in DV by the model). (Read more here)
\(Adjusted R^2\) measures how close the data are to the fitted regression line; 0% indicates that the model explains none of the variability of the response data around its mean. How much of the variance in the DV is explained by our independent variables? Based on the adjusted R-squared value, model lm1 accounts for approximately 2.86% of the variance in average hourly earnings.
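A sketch of the code behind this first model (the name lm1 matches the model-comparison code later in the tutorial; vif() is assumed to come from the car package):

# Model 1: regress average hourly earnings on sex and age
lm1 <- lm(ahe ~ female + age, data = cps)
summary(lm1)

# Adjusted R-squared and variance inflation factors (car package assumed)
summary(lm1)$adj.r.squared
car::vif(lm1)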
##
## Call:
## lm(formula = ahe ~ female + age, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.65 -8.12 -2.77 5.14 82.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.4117 1.4775 4.34 0.000014 ***
## female -2.6586 0.2879 -9.24 < 0.0000000000000002 ***
## age 0.5377 0.0493 10.90 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 7095 degrees of freedom
## Multiple R-squared: 0.0288, Adjusted R-squared: 0.0286
## F-statistic: 105 on 2 and 7095 DF, p-value: <0.0000000000000002
## [1] 0.0286
## female age
## 1 1
To learn more about heteroskedasticity-robust standard errors, see: https://www.r-econometrics.com/methods/hcrobusterrors/
Omitted variable bias arises when an important variable is left out of a model. It occurs when the effect of an omitted variable on the dependent variable is correlated with one or more of the independent variables included in the model. As a result, the estimated coefficients of the included variables become biased and can lead to incorrect conclusions. In the lecture example, if an omitted variable such as work experience is positively correlated with both education and income, then the estimated coefficient for education will be biased upwards. In other words, the model will overestimate the effect of education on income because it does not account for the effect of work experience.
Let's add bachelor to lm2, since education level can be an important factor in explaining earnings.
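A sketch of this second model (the name lm2 follows the text):

# Model 2: add bachelor to the earnings equation
lm2 <- lm(ahe ~ female + age + bachelor, data = cps)
summary(lm2)

# Adjusted R-squared and variance inflation factors (car package assumed)
summary(lm2)$adj.r.squared
car::vif(lm2)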
##
## Call:
## lm(formula = ahe ~ female + age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.91 -6.65 -1.86 4.25 83.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0342 1.3548 1.5 0.13
## female -4.1424 0.2659 -15.6 <0.0000000000000002 ***
## age 0.5316 0.0451 11.8 <0.0000000000000002 ***
## bachelor 9.8471 0.2624 37.5 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 553 on 3 and 7094 DF, p-value: <0.0000000000000002
## [1] 0.189
## female age bachelor
## 1.02 1.00 1.02
We can take further steps to improve the model's fit to the data by tackling issues of nonlinearity and non-normality. This may involve transforming the variables or applying more advanced modeling techniques such as polynomial and log regression to model nonlinear relationships.
A linear regression with ageC and the quadratic term ageC2
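A sketch of this polynomial model (the name lm3 is inferred from the model-comparison code later):

# Model 3: quadratic regression using the centered age terms
lm3 <- lm(ahe ~ female + ageC + ageC2 + bachelor, data = cps)
summary(lm3)
summary(lm3)$adj.r.squared
car::vif(lm3)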
##
## Call:
## lm(formula = ahe ~ female + ageC + ageC2 + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.62 -6.66 -1.92 4.28 83.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.9913 0.2541 70.80 <0.0000000000000002 ***
## female -4.1389 0.2659 -15.56 <0.0000000000000002 ***
## ageC 0.5272 0.0452 11.67 <0.0000000000000002 ***
## ageC2 -0.0252 0.0177 -1.43 0.15
## bachelor 9.8488 0.2624 37.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7093 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 416 on 4 and 7093 DF, p-value: <0.0000000000000002
## [1] 0.189
## female ageC ageC2 bachelor
## 1.02 1.01 1.00 1.02
Log-transformed OLS regression models are estimated in the same way as regular OLS regression models, but the interpretation of the coefficients changes. When a variable is log-transformed, its coefficient is interpreted in (approximate) percentage-change terms rather than in raw units; exactly how depends on which variables are logged (see the rules of thumb after the list below).
Log-regression models fall into four categories:
linear model, which is the traditional linear model without making any log transformations, \[ Y = b_0 + b_1*x + \mu \]
linear-log model, where we transform the x-explanatory variables using logs, \[ Y = b_0 + b_1*log(x) + \mu \]
log-linear model, where we transform the y-dependent variable using logs; and \[ log(Y) = b_0 + b_1*x + \mu \]
log-log model, where both the y-dependent variable and the x-explanatory factors are transformed using logs. \[ log(Y) = b_0 + b_1*log(x) + \mu \]
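As a rough guide for reading the slope \(b_1\) in each case:
linear-log model: a 1% increase in x is associated with a change of about \(b_1/100\) units in Y;
log-linear model: a one-unit increase in x is associated with an approximate \(100 \cdot b_1\) percent change in Y (more precisely, \(100 \cdot (e^{b_1}-1)\) percent);
log-log model: \(b_1\) is an elasticity, the approximate percent change in Y for a 1% change in x.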
A linear-log model: regressing ahe on log_age
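A sketch of this linear-log model (the name lm4 is inferred from the model-comparison code later):

# Model 4 (linear-log): regress ahe on the log of age
lm4 <- lm(ahe ~ female + log_age + bachelor, data = cps)
summary(lm4)
summary(lm4)$adj.r.squared
car::vif(lm4)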
##
## Call:
## lm(formula = ahe ~ female + log_age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.82 -6.66 -1.89 4.30 83.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.240 4.484 -7.86 0.0000000000000045 ***
## female -4.141 0.266 -15.57 < 0.0000000000000002 ***
## log_age 15.669 1.323 11.85 < 0.0000000000000002 ***
## bachelor 9.848 0.262 37.53 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 554 on 3 and 7094 DF, p-value: <0.0000000000000002
## [1] 0.189
## female log_age bachelor
## 1.02 1.00 1.02
A log-linear model: regressing log_ahe1 on the quadratic age terms
Note the sign of ageC2 in the model: the sign is positive when the curve is convex and negative when the curve is concave. Learn more: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-the-sign-of-the-quadratic-term-in-a-polynomial-regression/
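A sketch of this log-linear model (the name lm5 is inferred from the model-comparison code later):

# Model 5 (log-linear): regress the logged earnings on the centered age terms
lm5 <- lm(log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
summary(lm5)
summary(lm5)$adj.r.squared
car::vif(lm5)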
##
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7147 -0.2862 0.0129 0.3044 2.0602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.758755 0.011139 247.68 <0.0000000000000002 ***
## female -0.176939 0.011657 -15.18 <0.0000000000000002 ***
## ageC 0.023976 0.001980 12.11 <0.0000000000000002 ***
## ageC2 -0.001875 0.000776 -2.42 0.016 *
## bachelor 0.462164 0.011504 40.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared: 0.209, Adjusted R-squared: 0.208
## F-statistic: 467 on 4 and 7093 DF, p-value: <0.0000000000000002
## [1] 0.208
## female ageC ageC2 bachelor
## 1.02 1.01 1.00 1.02
A log-log model: regressing the DV log_ahe1 on the IV log_age
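A sketch of this log-log model (the name lm6 is inferred from the model-comparison code later):

# Model 6 (log-log): both the DV and age are log-transformed
lm6 <- lm(log_ahe1 ~ female + log_age + bachelor, data = cps)
summary(lm6)
summary(lm6)$adj.r.squared
car::vif(lm6)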
##
## Call:
## lm(formula = log_ahe1 ~ female + log_age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7061 -0.2870 0.0105 0.3029 2.0631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3118 0.1966 1.59 0.11
## female -0.1771 0.0117 -15.19 <0.0000000000000002 ***
## log_age 0.7185 0.0580 12.39 <0.0000000000000002 ***
## bachelor 0.4621 0.0115 40.16 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7094 degrees of freedom
## Multiple R-squared: 0.208, Adjusted R-squared: 0.208
## F-statistic: 621 on 3 and 7094 DF, p-value: <0.0000000000000002
## [1] 0.208
## female log_age bachelor
## 1.02 1.00 1.02
Information criteria, unlike significance tests, are indices that allow for model comparison while accounting for model complexity and sample size. More parsimonious models are considered better and more efficient, and smaller values of AIC and BIC are better.
Having said that, BIC is more conservative and penalizes model complexity more heavily, so it can serve as a useful tiebreaker for multiple regression models when the AIC values are inconclusive. Another consideration is theoretical: which model makes more sense based on the prior literature and the hypothesized relationships? You might opt for the model that makes better theoretical sense when two models have a similar fit.
We used AIC and BIC model selection to distinguish among a set of possible models describing the relationship between average hourly earnings, sex, holding a bachelor's degree, and age.
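A sketch of the comparison code that produces the tables below:

# Compare all six models; smaller values indicate a better trade-off between fit and complexity
AIC(lm1, lm2, lm3, lm4, lm5, lm6)
BIC(lm1, lm2, lm3, lm4, lm5, lm6)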
## df AIC
## lm1 4 55366
## lm2 5 54083
## lm3 6 54083
## lm4 5 54082
## lm5 6 9688
## lm6 5 9690
## df BIC
## lm1 4 55394
## lm2 5 54118
## lm3 6 54125
## lm4 5 54116
## lm5 6 9729
## lm6 5 9724
.rda (or .RData) is a file format used in R for saving and loading data and objects. It is a compressed and efficient way to store large datasets.
# export a csv copy
write.csv(cps, file = "cps_lab6.csv")
# export data in .rda format
save(cps, file = "cps_lab6.rda")
The Stargazer package
The Stargazer package in R is a popular tool designed for creating well-formatted tables that summarize the results of statistical models. It supports a wide range of model types from various packages beyond base R's statistical models, including lm (linear models) and glm (generalized linear models), among others. Stargazer tables can be exported in multiple formats, including HTML, LaTeX, and plain text, making it easier to integrate statistical analysis results into reports, research papers, and presentations. Full documentation: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf
If you use the stargazer package in your research publications, please remember to include the following citation: Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
**Note: Open the tables from your folder using a Word document.**
cps_new <- (cps[c("ahe","age","bachelor","female")])
# install.packages("stargazer")
library(stargazer)
stargazer(cps_new, type= "html", out = "summary table.html",
title="Table 1. Summary Statistics", align=TRUE)
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| ahe | 7,098 | 21.200 | 12.100 | 0.000 | 106.000 |
| age | 7,098 | 29.600 | 2.880 | 25 | 34 |
| bachelor | 7,098 | 0.526 | 0.499 | 0 | 1 |
| female | 7,098 | 0.417 | 0.493 | 0 | 1 |
**Note: Open the tables from your folder using a Word document.**
# Create a matrix by selecting some of the variables
cps_new$bachelor <- as.numeric(cps_new$bachelor)
# Remove categorical variables from your correlation matrix
# since correlation is only useful for continuous/numeric variables
cor.matrix1 <- cor(cps_new[c("ahe","age")])
stargazer(cor.matrix1, type= "html", out = "cor.table1.html",
title="Table 2. Correlation Matrix")
| | ahe | age |
|---|---|---|
| ahe | 1 | 0.131 |
| age | 0.131 | 1 |
**Note: Open the tables from your folder using a Word document.**
For more settings, see cheatsheet https://www.jakeruss.com/cheatsheets/stargazer/
# Store AIC and BIC in each model object so they appear in the stargazer table below
lm1$AIC <- AIC(lm1)
lm2$AIC <- AIC(lm2)
lm3$AIC <- AIC(lm3)
lm4$AIC <- AIC(lm4)
lm5$AIC <- AIC(lm5)
lm6$AIC <- AIC(lm6)
lm1$BIC <- BIC(lm1)
lm2$BIC <- BIC(lm2)
lm3$BIC <- BIC(lm3)
lm4$BIC <- BIC(lm4)
lm5$BIC <- BIC(lm5)
lm6$BIC <- BIC(lm6)
stargazer(lm1, lm2, lm3, lm4, lm5, lm6, type= "html", out = "regression.table.html",
dep.var.labels = "Average Hourly Earnings",
column.labels = c("Basic", "Bachelor", "Polynomial", "Linear-log", "Log-linear", "Log-Log"),
intercept.bottom = FALSE,
digits = 2 , digits.extra = 2,
single.row= FALSE,
keep.stat = c("n","adj.rsq","aic","bic"),
notes.append = FALSE,
star.char = c("+", "*", "**", "***"),
star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"),
title="Table 3. Regression Results", align=TRUE, no.space=FALSE)
Dependent variable: Average Hourly Earnings (models 1 to 4) and log_ahe1 (models 5 and 6).

| | Basic (1) | Bachelor (2) | Polynomial (3) | Linear-log (4) | Log-linear (5) | Log-Log (6) |
|---|---|---|---|---|---|---|
| Constant | 6.41*** (1.48) | 2.03 (1.35) | 18.00*** (0.25) | -35.20*** (4.48) | 2.76*** (0.01) | 0.31 (0.20) |
| female | -2.66*** (0.29) | -4.14*** (0.27) | -4.14*** (0.27) | -4.14*** (0.27) | -0.18*** (0.01) | -0.18*** (0.01) |
| age | 0.54*** (0.05) | 0.53*** (0.05) | | | | |
| ageC | | | 0.53*** (0.05) | | 0.02*** (0.002) | |
| ageC2 | | | -0.03 (0.02) | | -0.002* (0.001) | |
| log_age | | | | 15.70*** (1.32) | | 0.72*** (0.06) |
| bachelor | | 9.85*** (0.26) | 9.85*** (0.26) | 9.85*** (0.26) | 0.46*** (0.01) | 0.46*** (0.01) |
| Observations | 7,098 | 7,098 | 7,098 | 7,098 | 7,098 | 7,098 |
| Adjusted R2 | 0.03 | 0.19 | 0.19 | 0.19 | 0.21 | 0.21 |
| Akaike Inf. Crit. | 55,366.00 | 54,083.00 | 54,083.00 | 54,082.00 | 9,688.00 | 9,690.00 |
| Bayesian Inf. Crit. | 55,394.00 | 54,118.00 | 54,125.00 | 54,116.00 | 9,729.00 | 9,724.00 |

Note: + p<0.1; * p<0.05; ** p<0.01; *** p<0.001