Regression analysis is a set of statistical methods for estimating the relationship between variables in a dataset. There is an outcome variable (Y), the quantity you want to explain or predict, and one or more explanatory variables (X), the quantities you use to explain or predict the outcome. Many statistical software packages can run a regression, but R's data wrangling and visualization tools make it particularly well suited to this kind of analysis. This code-through tutorial shows what a regression looks like visually and how R packages can be used to estimate the relationships between variables in a dataset.
The code-through analyzes the wage1 dataset from the wooldridge data package, which is publicly available on CRAN. The main objective of the analysis is to see how education affects earnings. In the wage1 dataset, the outcome (dependent) variable is wage and the main explanatory (independent) variable is educ. Other variables such as female and married will also be included in the regression so that we can account for demographic differences. The analysis relies on a few R packages: tidyverse, which includes dplyr and ggplot2; stats, which ships with R and provides the cor and lm functions; broom, which provides the tidy function for turning model output into a table; and modelsummary, which displays several regression models in a neat, professional table.
This topic is valuable to a broad audience, ranging from undergraduate
economics students to high-level policymakers. In academic settings,
students of econometrics or data science often learn the theory behind
regression and the statistical relationship between variables. Although
they are well versed in the technique, they are often unsure how to
apply it to real datasets in a programming setting. This code-through
tutorial is a simple introduction to regression using the programming
tools available in R. For policymakers and new data scientists,
regression analysis is a fundamental method underlying many statistical
and machine learning techniques, so knowing how to visualize and run a
regression should be part of any data scientist's skill set.
Policymakers will also learn how to empirically evaluate the policy
recommendations they make to governing authorities.
This code-through teaches the reader how to code a regression in R. More specifically, it illustrates how to prepare data for regression, measure the correlation between variables in a dataset, visualize that correlation, run a regression with the appropriate parameters, plot the line of best fit, and finally interpret and present the models.
The wooldridge package contains a collection of econometrics datasets, including cross-sectional data based on labor force surveys. The package is installed with the install.packages function and loaded with library, which makes its datasets available in the R session. The dataset we will use is called wage1. It is cross-sectional and contains several variables. The main variables of concern are wage, which records hourly wages in dollars, and educ, which records years of education. Generally, we expect an individual's income to increase as years of education increase. This relationship can be measured with a correlation function in R and can also be shown in a scatterplot. Furthermore, if we control for additional demographic variables, we obtain the partial effect of education on income, holding those characteristics fixed, which is usually a more accurate estimate. The following sections describe the distribution of the variables in the dataset and walk through the core code before running the final multiple regression.
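The setup step is not shown in this code-through, so here is a minimal sketch of one. It assumes the wooldridge, tidyverse, broom, and modelsummary packages are installed from CRAN (the commented install.packages line is only a reminder), and the object name wage_vars is illustrative.

```r
# One-time installation (assumption: packages come from CRAN); uncomment if needed.
# install.packages(c("wooldridge", "tidyverse", "broom", "modelsummary"))

library(wooldridge)    # provides the wage1 dataset
library(tidyverse)     # attaches dplyr and ggplot2, among others
library(broom)         # tidy() for model output
library(modelsummary)  # side-by-side regression tables

# Load wage1 into the session explicitly.
data("wage1", package = "wooldridge")

# Keep only the variables this code-through uses and inspect them.
wage_vars <- wage1 %>% select(wage, educ, exper, female, married)
glimpse(wage_vars)
summary(wage_vars$wage)
```

Looking at the summary of wage before plotting gives a first hint of how far the maximum sits from the bulk of the data.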
# Histogram of hourly wage
ggplot(wage1, aes(wage)) +
  geom_histogram() +
  labs(title = "Wage Distribution")

# Histogram of years of education
ggplot(wage1, aes(educ)) +
  geom_histogram() +
  labs(title = "Education Distribution")

Using ggplot2 from the tidyverse, we can visualize the distribution of our main variables with histograms. This lets us spot outliers, extreme values that may distort the relationship when running the regression. The wage variable appears to have extreme values at the top of its distribution. Using dplyr, also from the tidyverse, we can filter these extreme wage values out.
# Keep only observations with an hourly wage of at most 22.8
wage_analytical <- wage1 %>%
  filter(wage <= 22.8)

With the help of dplyr, we have dropped the extreme wage values and created a new analytical dataset. When working with raw data, datasets often need to be filtered and edited to fit the analyst's needs and shaped into an analytical dataset before any analysis can be done.
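Before dropping observations, it is worth checking exactly which rows a filter removes. The sketch below assumes the same 22.8 cutoff used in the text and simply counts and lists the excluded rows; the object names wage_dropped and n_dropped are illustrative.

```r
library(wooldridge)
library(dplyr)

data("wage1", package = "wooldridge")

# Rows kept for the analysis (same rule as in the text).
wage_analytical <- wage1 %>% filter(wage <= 22.8)

# Rows removed by that rule, largest wage first (illustrative helper object).
wage_dropped <- wage1 %>%
  filter(wage > 22.8) %>%
  select(wage, educ, exper, female, married) %>%
  arrange(desc(wage))

n_dropped <- nrow(wage1) - nrow(wage_analytical)
n_dropped
wage_dropped

# Compare the upper tail of the distribution before and after filtering.
quantile(wage1$wage, probs = c(0.5, 0.95, 0.99, 1))
quantile(wage_analytical$wage, probs = c(0.5, 0.95, 0.99, 1))
```

Recording how many rows were dropped, and why, makes the analytical dataset easier to defend in a report.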
When analyzing statistical relationships between variables, the first step is to see whether they are correlated. In R, this can be done using the cor function from the stats package, so we compute the correlation between wage and educ. The relationship can also be visualized with a scatterplot created using ggplot2 from the tidyverse. The ggplot function also allows the user to customize the graphic by adding titles and labels and by modifying the aesthetics (colors, shapes, sizes) of its components.
cor(wage_analytical$wage, wage_analytical$educ)
## [1] 0.3960886

# Scatterplot of hourly wage against years of education
ggplot(wage_analytical, aes(x = educ, y = wage)) +
  geom_point(size = 4, alpha = 1, color = "red") +
  labs(title = "Correlation: Educ vs. Wage")

Both the correlation coefficient and the scatterplot point to a positive relationship between education and wage. The correlation is 0.396, and in the scatterplot the points tend to sit higher as years of education increase, indicating higher wages.
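The cor function returns only the coefficient. If a formal test is wanted, cor.test from the stats package also reports a confidence interval and a p-value. A minimal sketch, rebuilding wage_analytical with base R so the block runs on its own (the object name ct is illustrative):

```r
library(wooldridge)

data("wage1", package = "wooldridge")
wage_analytical <- subset(wage1, wage <= 22.8)  # base R equivalent of the dplyr filter

# Pearson correlation test of wage against educ.
ct <- cor.test(wage_analytical$wage, wage_analytical$educ)

ct$estimate   # the correlation coefficient, the same value cor() returns
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for the null hypothesis of zero correlation
```

A small p-value here says the correlation is unlikely to be zero; it says nothing about whether education causes higher wages.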
Now that we have established a correlation between the two variables, we can run the regression to estimate the statistical relationship between educ and wage.
library(broom)  # tidy() comes from broom, which library(tidyverse) does not attach

# Simple regression of wage on educ
model <- lm(wage ~ educ, data = wage_analytical)
tidy(model)

Using the lm function from the stats package, we run a simple regression of wage on education. The results can then be displayed in a neat table with the help of the tidy function from the broom package. The coefficient on educ is 0.506, so each additional year of education is associated with an increase of about $0.51 in hourly wage.
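Written as an equation, the simple regression estimates the intercept and slope of a linear model. The symbols for the coefficients and the error term are standard notation introduced here for illustration; the numeric values are the rounded estimates reported in this code-through.

```latex
\begin{align*}
\text{wage} &= \beta_0 + \beta_1\,\text{educ} + u \\
\widehat{\text{wage}} &= -0.523 + 0.506\,\text{educ}
\end{align*}
```

Under this fitted line, a person with 12 years of education has a predicted hourly wage of about -0.523 + 0.506 × 12 ≈ 5.55 dollars.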
The regression line is the line of best fit: the line that minimizes the sum of squared vertical distances between itself and the data points, and so best describes how these variables move together. In R, we can draw this line with the following code, which uses the geom_point and geom_smooth functions from ggplot2.
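# Scatterplot with the fitted least-squares line overlaid
ggplot(wage_analytical, aes(x = educ, y = wage)) +
  geom_point(size = 4, alpha = 1) +
  geom_smooth(method = "lm") +
  labs(title = "Regression Line of Best Fit")

The regression line slopes upward: each additional year of education is associated with an increase of about $0.51 in hourly wage. The shaded band around the line is the confidence interval that geom_smooth draws by default.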
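The fitted line can also be read off numerically with predict from the stats package. The sketch below refits the simple model so it runs on its own and evaluates the line at two hypothetical education levels, 12 and 16 years, which are chosen only for illustration.

```r
library(wooldridge)

data("wage1", package = "wooldridge")
wage_analytical <- subset(wage1, wage <= 22.8)  # base R equivalent of the dplyr filter

model <- lm(wage ~ educ, data = wage_analytical)

# Hypothetical education levels at which to evaluate the fitted line.
new_people <- data.frame(educ = c(12, 16))

# Points on the regression line, with 95% confidence intervals for the mean wage.
fitted_wages <- predict(model, newdata = new_people, interval = "confidence")
cbind(new_people, fitted_wages)

# The gap between 16 and 12 years equals four times the educ coefficient.
gap <- unname(fitted_wages[2, "fit"] - fitted_wages[1, "fit"])
gap
4 * unname(coef(model)["educ"])
```

Because the model is linear, the predicted difference between any two education levels depends only on how many years apart they are.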
For the final stage, we run a multiple regression with several control variables. As mentioned earlier, adding control variables accounts for demographic differences and gives the partial effect of education on wages, holding those characteristics fixed.
# Multiple regression of wage on educ with demographic controls
model_multiple <- lm(wage ~ educ + exper + female + married, data = wage_analytical)
tidy(model_multiple)

In this multiple regression, we have controlled for exper, the number of years of work experience; female, which equals 1 if the individual is female; and married, which equals 1 if the individual is married. After controlling for these characteristics, the coefficient on educ changes to 0.549, compared with 0.506 in the simple regression model. Thus, holding experience, sex, and marital status fixed, each additional year of education is associated with an increase of about $0.55 in hourly wage.
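To see how the educ estimate moves once controls are added, the two sets of tidy output can be stacked and filtered to the educ row. This is a sketch that refits both models so it runs on its own; the object name educ_compare and the model labels are illustrative.

```r
library(wooldridge)
library(dplyr)
library(broom)

data("wage1", package = "wooldridge")
wage_analytical <- wage1 %>% filter(wage <= 22.8)

model          <- lm(wage ~ educ, data = wage_analytical)
model_multiple <- lm(wage ~ educ + exper + female + married, data = wage_analytical)

# Stack the coefficient tables from both models, with 95% confidence intervals,
# then keep only the educ row from each.
educ_compare <- bind_rows(
  "Simple"   = tidy(model, conf.int = TRUE),
  "Multiple" = tidy(model_multiple, conf.int = TRUE),
  .id = "model"
) %>%
  filter(term == "educ") %>%
  select(model, estimate, std.error, conf.low, conf.high)

educ_compare
```

If the two confidence intervals overlap heavily, the change in the educ coefficient between models is small relative to its sampling uncertainty.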
Finally, it is important to know how to present regression tables in a professional setting. The modelsummary() function takes a list of regression models and places them in a neat side-by-side table. In a typical report or analysis, we would present all of the models together this way instead of running and showing each one individually.
modelsummary(list("Simple Regression" = model, "Multiple Regression" = model_multiple))

| | Simple Regression | Multiple Regression |
|---|---|---|
| (Intercept) | -0.523 | -1.374 |
| | (0.660) | (0.725) |
| educ | 0.506 | 0.549 |
| | (0.051) | (0.050) |
| exper | | 0.052 |
| | | (0.011) |
| female | | -1.989 |
| | | (0.262) |
| married | | 0.627 |
| | | (0.285) |
| Num.Obs. | 524 | 524 |
| R2 | 0.157 | 0.307 |
| R2 Adj. | 0.155 | 0.302 |
| AIC | 2723.6 | 2626.5 |
| BIC | 2736.4 | 2652.1 |
| Log.Lik. | -1358.786 | -1307.268 |
| F | 97.133 | 57.585 |
| RMSE | 3.24 | 2.95 |
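The default table can be adjusted for a report. The sketch below refits both models so it runs on its own and uses three modelsummary arguments: coef_rename for readable row labels, stars for significance markers, and gof_omit to hide some fit statistics. The label text and the choice of statistics to drop are illustrative, not part of the original analysis.

```r
library(wooldridge)
library(modelsummary)

data("wage1", package = "wooldridge")
wage_analytical <- subset(wage1, wage <= 22.8)  # base R equivalent of the dplyr filter

models <- list(
  "Simple Regression"   = lm(wage ~ educ, data = wage_analytical),
  "Multiple Regression" = lm(wage ~ educ + exper + female + married, data = wage_analytical)
)

# Readable row labels (the label text is an illustrative choice).
labels <- c(
  "educ"    = "Years of education",
  "exper"   = "Years of experience",
  "female"  = "Female",
  "married" = "Married"
)

# Add significance stars and drop some goodness-of-fit rows.
modelsummary(
  models,
  coef_rename = labels,
  stars       = TRUE,
  gof_omit    = "AIC|BIC|Log.Lik|RMSE"
)

# The same table as a data frame, for example for further processing.
summary_df <- modelsummary(models, output = "data.frame")
```

Keeping the models in a named list means the column headers come from the list names, so adding a third specification later only requires one more list entry.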
In conclusion, R is well suited to running and presenting regression analyses and other quantitative statistical methods. This code-through has introduced the idea behind regression and guided the reader through using R code to prepare a dataset, visualize relationships between its variables, run simple and multiple regressions, and display the results in a professional manner.
If you are interested in learning more about regression analysis in R and the tidyverse, please see the resources below:
Resource I [https://www.geeksforgeeks.org/regression-analysis-in-r-programming/]
Resource II [https://www.tidyverse.org/]
This code through references and cites the following sources:
Jeffrey M. Wooldridge, 2001. “Econometric Analysis of Cross Section and Panel Data,” MIT Press Books, The MIT Press, edition 1, volume 1, number 0262232197, December. [https://cran.r-project.org/web/packages/wooldridge/index.html]
Andrew Heiss (2022). Program Evaluation, PMAP 8521, Andrew Young School of Policy Studies, Georgia State University, Spring 2022. [https://evalsp22.classes.andrewheiss.com/example/regression/]