For my first blog post for Data 621, I will be showcasing a simple example of how to use multiple linear regression, using salary data that I sourced from Kaggle.
First, I will load the packages that I will be using to load my data and to run diagnostic plots for my multiple linear regression model.
library(readr) # readr helps me load my data
library(ggfortify) # ggfortify gives me access to the autoplot() function for diagnostic plots
## Loading required package: ggplot2
I sourced my data from Kaggle (data found here: https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer?resource=download).
salary = read_csv('https://raw.githubusercontent.com/cocodono/Data-621---Blog-1---Simple-Regression/main/Salary%20Data.csv')
## Rows: 375 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Gender, Education Level, Job Title
## dbl (3): Age, Years of Experience, Salary
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The “read_csv()” function from the readr package allows me to read a CSV file into my RStudio environment and begin working with the data.
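As the message above suggests, I could also quiet that column-specification printout by passing the optional show_col_types argument to read_csv() (a purely cosmetic change; nothing about the analysis differs):
salary = read_csv('https://raw.githubusercontent.com/cocodono/Data-621---Blog-1---Simple-Regression/main/Salary%20Data.csv', show_col_types = FALSE) # suppresses the column-type message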
head(salary)
## # A tibble: 6 × 6
## Age Gender `Education Level` `Job Title` `Years of Experience` Salary
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 32 Male Bachelor's Software Engineer 5 90000
## 2 28 Female Master's Data Analyst 3 65000
## 3 45 Male PhD Senior Manager 15 150000
## 4 36 Female Bachelor's Sales Associate 7 60000
## 5 52 Male Master's Director 20 200000
## 6 29 Male Bachelor's Marketing Analyst 2 55000
The “head()” function lets me quickly look at the first six rows of my dataset. From its output, I can also see what data type each variable is, which provides me with some helpful context.
summary(salary)
## Age Gender Education Level Job Title
## Min. :23.00 Length:375 Length:375 Length:375
## 1st Qu.:31.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :37.43
## 3rd Qu.:44.00
## Max. :53.00
## NA's :2
## Years of Experience Salary
## Min. : 0.00 Min. : 350
## 1st Qu.: 4.00 1st Qu.: 55000
## Median : 9.00 Median : 95000
## Mean :10.03 Mean :100577
## 3rd Qu.:15.00 3rd Qu.:140000
## Max. :25.00 Max. :250000
## NA's :2 NA's :2
Using the “summary()” function gives me a summary of each column in my data frame “salary.” For numerical variables, summary() provides the minimum, first quartile, median, mean, third quartile, and maximum, along with the count of NA’s in the column. For non-numerical variables, summary() instead reports the Length, Class, and Mode. From using “summary()” and “head()” I can see that all of my variables are either double or character data types.
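If I want to double-check those NA counts directly, a quick base-R one-liner (just a convenience check, not something the analysis requires) counts the missing values in each column:
colSums(is.na(salary)) # counts the NA's in each column of the data frame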
For the purposes of multiple linear regression, I can only use numerical data types directly. If I wanted to use the non-numerical variables in my regression model, I would have to convert them to dummy variables: for each categorical variable, this creates columns representing the values a cell could take on, with a binary 1 indicating that an observation has that specific value and a 0 indicating that it takes on some other value. Using dummy variables is entirely possible in regression models, but I will not be working with them in this blog.
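As a quick sketch of what dummy coding would look like here (using base R’s model.matrix(), the same machinery lm() uses internally to encode factors), the Gender column could be expanded into indicator columns like this:
head(model.matrix(~ Gender, data = salary)) # one binary 0/1 column per non-reference level of Gender
Note that R drops one “reference” level rather than creating a column for every value, which avoids redundancy among the dummy columns.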
Given that I cannot use the categorical variables directly in my regression model, I am limited to “Age,” “Years of Experience,” and “Salary” for building my multiple regression model. Two independent variables and one dependent variable are enough for me to construct the model.
I’ll use the built-in “lm()” R function to build my model.
model_1 = lm(`Salary` ~ `Age` + `Years of Experience`, salary)
The format of the “lm()” function’s input is lm(dependent_variable ~ independent_variable(s), data_name). I chose to build a model that predicts Salary based on Age and Years of Experience.
Now, to assess the quality of my model, I use the “summary()” function. Its output reports the residual quantiles, the coefficient estimates with their standard errors, t statistics, and p-values, the residual standard error, the \(R^2\) values, and the F-test.
summary(model_1)
##
## Call:
## lm(formula = Salary ~ Age + `Years of Experience`, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64540 -7436 678 9304 78062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18700.4 17060.3 -1.096 0.27373
## Age 1885.8 632.5 2.981 0.00306 **
## `Years of Experience` 4853.8 681.9 7.118 5.74e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17530 on 370 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.868
## F-statistic: 1224 on 2 and 370 DF, p-value: < 2.2e-16
What does this model tell me about the relationship between Salary and the combination of Age and Years of Experience?
Well, immediately the model tells me that the relationship between Salary and the combination of Years of Experience and Age can be modeled with the following equation:
\(\text{Salary} = -18700.4 + 1885.8 \times \text{Age} + 4853.8 \times \text{Years of Experience}\).
I found this equation by reading the estimate for the intercept and the estimates for the coefficients of both Age and Years of Experience.
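To make the equation concrete, I can plug in a hypothetical person, say someone 30 years old with 5 years of experience, and let R’s “predict()” function apply the fitted coefficients for me:
new_obs = data.frame(Age = 30, `Years of Experience` = 5, check.names = FALSE) # hypothetical observation
predict(model_1, newdata = new_obs) # -18700.4 + 1885.8*30 + 4853.8*5, roughly 62143
So the model predicts a salary of roughly $62,000 for that hypothetical person.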
When looking at the summary of my model, my eyes immediately go to the bottom of the output, as it tells me both the p-value of the whole model and the \(R^2\) values (multiple and adjusted).
The p-value tells me the statistical significance of the model as a whole. A common threshold is anything below 0.05, and given that this model has a p-value of “< 2.2e-16,” I would say that there is a statistically significant relationship between my independent variables and my dependent variable.
\(R^2\) values represent how well a model accounts for the variability in the data used to build it. Standards for a good \(R^2\) differ based on circumstance (just as p-value standards do); with that said, an \(R^2\) of about 0.868 is quite high, meaning the model explains roughly 87% of the variation in Salary.
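For reference, both \(R^2\) values can also be pulled directly out of the summary object rather than read off the printout:
summary(model_1)$r.squared # multiple R-squared, about 0.8687
summary(model_1)$adj.r.squared # adjusted R-squared, about 0.868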
Based on this model’s \(R^2\) values and p-value, I would say that the combination of Age and Years of Experience serves as a pretty solid predictor of Salary for this data.
The “summary()” function also reports the individual p-values associated with my predictor variables. From the output, I can see that the p-value for Years of Experience is lower than the p-value for Age (5.74e-12 vs. 0.00306), though both are below 0.05. In the model improvement process, you will often remove variables with high p-values to try to improve model performance, but I will not do that here for two reasons: 1. both variables have p-values below 0.05, and 2. if I removed a predictor variable, this model would no longer be a multiple regression but rather a simple regression (it would have only one predictor variable).
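Purely for illustration (I am not adopting this model, for the reasons above), dropping the weaker predictor would reduce the model to a simple regression, fit like so:
model_reduced = lm(Salary ~ `Years of Experience`, salary) # simple regression: one predictor only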
Now, I’ll use the “autoplot()” function from the ggfortify R package to run diagnostic plots on my model.
autoplot(model_1)
When building a linear model, I am operating under some assumptions about my model. Those assumptions are the standard linear regression assumptions:
- Linearity: the relationship between the predictors and the response is linear.
- Independence: the errors are independent of one another.
- Homoscedasticity: the errors have constant variance across the fitted values.
- Normality: the errors are approximately normally distributed.
I will use the results of calling the “autoplot()” function on my regression model to check these assumptions.
The “autoplot()” function generates four diagnostic plots for my multiple regression model: the “Residuals vs Fitted” plot, the “Normal Q-Q” plot, the “Scale-Location” plot, and the “Residuals vs Leverage” plot.
To explain these plots, I’ll provide quick working definitions for some terms that will be very useful. I am well aware that these definitions are not fully comprehensive, but hopefully they allow for some understanding of what I am discussing in this section.
Residuals: the differences between the observed values of the dependent variable and the values predicted by the regression equation.
Leverage: a measure of how much influence an individual observation has on the overall fit of the regression model.
Fitted Values: the predicted values of the dependent variable based on the regression equation.
Standardized Residuals: residuals that have been transformed to have a mean of zero and a standard deviation of one.
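Each of these quantities can be extracted from the fitted model with standard base-R accessor functions, which is a nice way to connect the definitions to the plots below:
res = residuals(model_1) # residuals: observed Salary minus predicted Salary
fit = fitted(model_1) # fitted values: the model's predicted Salary for each observation
std_res = rstandard(model_1) # standardized residuals: rescaled to mean ~0, sd ~1
lev = hatvalues(model_1) # leverage: each observation's pull on the fit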
I’ll start with the “Residuals vs Fitted” plot. The y-axis shows the residuals and the x-axis shows the fitted (predicted) values for the observations. If my assumptions hold, the points should be scattered evenly above and below the zero line, and after looking at this plot, I would say that they generally are. The plot also includes a fitted (in this case, blue) line; the goal for a successful model is for this line to be rather smooth and close to zero, which seems to be the case.
The chart directly below the “Residuals vs Fitted” plot is the “Scale-Location” plot. The two plots share the same x-axis (the fitted values), but the y-axis has been re-scaled: it now shows the square root of the absolute value of the standardized residuals. The goal for this plot is for the points to hover around 1 with a fairly smooth fitted line, indicating constant variance. Here, the fitted line is not perfectly flat but is fairly smooth, and the points do seem to hover around one. I will note that these plots will rarely perfectly validate my assumptions; they are very much open to interpretation.
For the “Normal Q-Q” plot, there is some deviation at the tails, but for the most part the standardized residuals adhere to the normal line (implying that they are fairly normally distributed), which is ideal for the validity of the model.
Finally, I come to the “Residuals vs Leverage” plot. It is ideal to have the points close to the zero line, without any high-leverage points pulling on the fit, which generally seems to be the case with my plot.
These plots indicate that the assumptions are fairly well met; however, they also do a good job of pointing out observations that seem to be more heavily impacting the model’s fit. If you notice, some points with less-than-ideal placement have numbers attached to them: those numbers are the observations’ row indices. A logical next step after building the initial model would be to engage in model improvement. Model improvement may look like removing predictor variables altogether (although that would not be possible here while preserving this model as a multiple regression). It can also look like removing observations that do not serve the model’s fit: I could assess these plots, recognize certain outlier observations that do not benefit my model’s performance, and remove them. For right now, though, I will leave this model in its current form (model improvement may be the task for a later blog!).
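If I do take up model improvement in a later post, one common starting point (a sketch using the conventional 4/n rule of thumb, not a hard rule) would be to flag high-influence observations with Cook’s distance:
cooks = cooks.distance(model_1) # each observation's influence on the fitted coefficients
which(cooks > 4 / length(cooks)) # indices of observations above the common 4/n cutoff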