Introduction

Regression analysis is a group of statistical methods used to estimate the relationship between variables in a dataset. Essentially, there is an outcome variable (Y), the quantity you want to explain or predict, and one or more explanatory variables (X), the parameters you use to predict or explain that outcome. Many statistical software packages can run a regression analysis, but the data wrangling and data visualization tools in R make it particularly well suited to this kind of work. So, this code-through tutorial shows what a regression looks like visually and how packages within R can be used to estimate the relationships between variables in a dataset.


Content Overview

This code-through analyzes the wage1 dataset, which comes from the wooldridge data package available publicly on CRAN. The main objective of the analysis is to see how education affects earnings. From the wage1 dataset, the main outcome or dependent variable is wage, and the main explanatory or independent variable is educ. Other variables such as female and married will also be included in the regression so that we can account for demographic differences. These analyses rely on a few R packages, most notably: tidyverse, which contains dplyr and ggplot2; stats, which provides the regression routines; broom, which tidies model output; and modelsummary, which displays the various regression models in a neat, professional table.
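The code below loads the packages used throughout this tutorial. It is a minimal setup sketch and assumes each package has already been installed; any missing package can be added with install.packages().

# Load the packages used in this code-through
# (assumes each package has already been installed with install.packages())
library(wooldridge)    # provides the wage1 dataset
library(tidyverse)     # dplyr for wrangling, ggplot2 for visualization
library(broom)         # tidy() for clean regression output
library(modelsummary)  # side-by-side regression tables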


Why You Should Care

This topic is valuable to a broad audience, ranging from undergraduate economics students to high-level policymakers. In academic settings, students of econometrics or data science often learn the theory behind regression and the statistical relationships between variables. Although they are well versed in the technique, they are often unsure how to apply that knowledge to real datasets in a programming environment. So, this code-through tutorial is a simple introduction to the world of regression using the programming tools available in R. For policymakers and new data scientists, regression analysis serves as a fundamental method behind many statistical and machine learning techniques, so knowing how to visualize and run a regression should be a key component of any data scientist's area of expertise. Policymakers will also learn how to empirically evaluate the policy recommendations they make to governing authorities.

Learning Objectives

This tutorial teaches the reader how to code a regression in R. More specifically, this code-through will illustrate how to prepare data for regression, test for correlation between variables in a dataset, visually represent that correlation, run a regression with the appropriate parameters, plot the line of best fit, and finally interpret and present the models.



Effect of Education on Income

The wooldridge package contains various cross-sectional datasets based on labor force surveys. The package has been installed into the R environment using the install.packages function. The dataset we will use is called wage1. It contains cross-sectional data with several variables; the main variables of concern are wage, which records hourly income, and educ, which records years of education. Generally, we expect an individual's income to increase as years of education increase. This relationship can be examined using a correlation function in R and can also be represented with a scatterplot. Furthermore, if we control for additional demographic variables, we obtain a more accurate estimate of the partial effect of education on income. The following sections elaborate on the distribution of the variables in the dataset and work through the core code before running the final multiple regression.
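As a quick orientation, the sketch below loads wage1 from the wooldridge package and summarizes the two key variables. The exact output will depend on your installed package version.

# Load the wage1 dataset that ships with the wooldridge package
data("wage1", package = "wooldridge")

# Quick numeric summaries of the two key variables
summary(wage1$wage)   # hourly wage in dollars
summary(wage1$educ)   # years of education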


Data Preparation and Visualization

ggplot(wage1, aes(wage)) +
  geom_histogram() +
  labs(title = "Wage Distribution")

ggplot(wage1, aes(educ)) +
  geom_histogram() +
  labs(title = "Education Distribution")

Using ggplot2 from the tidyverse, we can visualize the distribution of our main variables of concern with histograms. This allows us to check for outliers, extreme values that may distort the relationship when running the regression. It appears that the wage variable has one extreme hourly wage value. Thus, using dplyr, also from the tidyverse, we can filter out this extreme wage value.

wage_analytical <- wage1 %>%
  filter(wage <= 22.8)

So, with the help of dplyr, we have dropped the extreme wage values and created a new analytical dataset. When working with real, raw data, datasets often need to be filtered and edited to fit the administrator's needs and turned into an analytical dataset before any analysis can be done.
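As a quick sanity check, we can compare the number of observations before and after the filter. The sketch below assumes both data frames are still in the environment.

# Confirm that the extreme wage observations were dropped
nrow(wage1)                      # observations in the raw dataset
nrow(wage_analytical)            # observations remaining after the filter
summary(wage_analytical$wage)    # maximum wage should now be at most 22.8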


Testing for Correlation

When analyzing statistical relationships between variables, the first step is to see whether there is any correlation between them. In R, this can be done using the cor function from the stats package, so we carry out a correlation test between wage and educ. This relationship can also be visualized with a scatterplot created using ggplot2 from the tidyverse. The ggplot function also lets the user customize the graphic by adding titles and labels and by modifying the aesthetics (colors, shapes, sizes) of its components.

cor(wage_analytical$wage, wage_analytical$educ)
## [1] 0.3960886

ggplot(wage_analytical, aes(x = educ, y = wage)) +
  geom_point(size = 4, alpha = 1, color = "red") +
  labs(title = "Correlation: Educ vs. Wage")

So, both the correlation test and the scatterplot show a positive correlation between education and wage. The correlation coefficient is 0.396, and the scatterplot shows an upward trend: as years of education increase, the points sit higher on the wage axis, indicating higher wages.
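If we also want a formal test of whether this correlation is statistically significant, the stats package provides cor.test(); a minimal sketch is shown below.

# Significance test of the correlation between wage and educ
# (cor.test() returns the correlation estimate, a t-statistic, and a p-value)
cor.test(wage_analytical$wage, wage_analytical$educ)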


Simple Regression

Now that we have established correlation between the two variables, we can run a regression to estimate the statistical relationship between educ and wage.

model <- lm(wage ~ educ, data = wage_analytical)
tidy(model)

Using the stats package, we run the simple regression of wage on education. The lm function fits the model, and the results are displayed in a neat table with the help of the tidy function from the broom package. The coefficient on educ is 0.506, so the results show that each additional year of education is associated with an increase of about $0.50 in hourly wage.
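Beyond the coefficient table, it is often useful to look at overall model fit. A quick sketch using glance() from the broom package (or base R's summary()) is shown below.

# Overall fit statistics for the simple regression
glance(model)    # R-squared, F-statistic, AIC, etc.

# Base-R alternative that also prints coefficient t-tests
summary(model)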


Graphical Representation of Regression Line

As mentioned before, the regression line is a line of best fit: the line that best describes the relationship between the two variables. With R, we can plot this line using the following code, which relies on the geom_point and geom_smooth functions from ggplot2.

ggplot(wage_analytical, aes(x = educ, y = wage)) +
  geom_point(size = 4, alpha = 1) +
  geom_smooth(method = "lm") +
  labs(title = "Regression Line of Best Fit")

The regression line has a positive slope, indicating that each additional year of education is associated with an increase of about $0.50 in hourly wage.


Multiple Regression

Now, for the final stage, we run a multiple regression with several control variables. As mentioned earlier, adding control variables accounts for demographic differences and gives a better estimate of the partial effect of education on wages.

model_multiple <- lm(wage ~ educ + exper + female + married, data = wage_analytical)
tidy(model_multiple)

In this multiple regression, we have controlled for exper, the number of years of work experience; female, a dummy variable equal to 1 if the individual is female; and married, a dummy variable equal to 1 if the individual is married. After controlling for these demographics, the coefficient on educ rises to roughly 0.55 compared with the simple regression model. Thus, in the multiple regression model, each additional year of education is associated with an increase of about $0.55 in hourly wage.
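To see what these coefficients imply in practice, we can predict the hourly wage for a hypothetical individual. The profile below (16 years of education, 5 years of experience, female, married) is purely illustrative.

# Predicted hourly wage for a hypothetical individual
# (the profile values below are made up for illustration)
new_person <- data.frame(educ = 16, exper = 5, female = 1, married = 1)
predict(model_multiple, newdata = new_person)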


Summary of Models

Finally, it is very important to know how to display regression tables in a professional setting. The modelsummary() function takes several regression models and presents them in a neat side-by-side table. In a typical report or analysis, we would include all of the models at once rather than running and presenting them individually.

modelsummary(list("Simple Regression" = model, "Multiple Regression" = model_multiple))

                 Simple Regression    Multiple Regression
(Intercept)          -0.523               -1.374
                     (0.660)              (0.725)
educ                  0.506                0.549
                     (0.051)              (0.050)
exper                                      0.052
                                          (0.011)
female                                    -1.989
                                          (0.262)
married                                    0.627
                                          (0.285)
Num.Obs.             524                  524
R2                   0.157                0.307
R2 Adj.              0.155                0.302
AIC                  2723.6               2626.5
BIC                  2736.4               2652.1
Log.Lik.             -1358.786            -1307.268
F                    97.133               57.585
RMSE                 3.24                 2.95
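modelsummary() can also write the table directly to a file for use in a report; the file name below is just an example.

# Export the same side-by-side table to a Word document
# ("regression_results.docx" is an example file name)
modelsummary(list("Simple Regression" = model,
                  "Multiple Regression" = model_multiple),
             output = "regression_results.docx")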

So, in conclusion, R is very well suited to running and presenting many types of regression analysis and other quantitative statistical methods. This code-through has covered the basic theory behind regression and has guided the reader through using R code to prepare a dataset, visualize relationships between variables within the dataset, run simple and multiple regressions, and finally display the results in a professional manner.


Further Resources

If you are interested in learning more about regression analysis in R and the tidyverse, please see the resources below:




Works Cited

This code-through references and cites the following sources: