The pacman package was new to me last week, and I found it very helpful! I’ve always found it annoying to install and load everything separately, so it’s nice to know there’s an alternative. I’ll also use the same packages we used in the Week 4 code walkthrough, since we’re doing similar work with linear regression.
install.packages("pacman")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library("pacman")
pacman::p_load(  # installs any missing packages, then loads them all
  tidyverse, readr, magrittr, dplyr,  # tidyverse already bundles readr, dplyr, and ggplot2, but listing them is harmless
  psych, ggplot2, caTools, caret
)
Finally, I’ll read in the Salary Data dataset using the read.csv() function and assign it to an object called Salary_Data.
Salary_Data <- read.csv("/cloud/project/Data/Salary_Data.csv")
view(Salary_Data)  # tibble's view(), loaded via tidyverse; the base R equivalent is View()
The two columns in this dataset are ‘YearsExperience’ (how many years of work experience an individual has) and ‘Salary’ (the individual’s annual salary in dollars, which serves as our dependent variable).
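For anyone following along, a quick str() call confirms that structure (a sketch; I’m assuming both columns read in as numeric):
str(Salary_Data)  # expect two numeric columns: YearsExperience and Salary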
The next step is to perform basic EDA and summarize the variables, particularly the dependent variable. Looking at the summary statistics, the mean is higher than the median, which suggests that the salary distribution is positively skewed. I also want to take a quick look at the correlation, which shows a strong positive relationship (0.978) between years of experience and salary.
hist(Salary_Data$Salary)
summary(Salary_Data$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37731 56721 65237 76003 100545 122391
cor(Salary_Data[c("YearsExperience", "Salary")])
## YearsExperience Salary
## YearsExperience 1.0000000 0.9782416
## Salary 0.9782416 1.0000000
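Since psych is already loaded via p_load, its describe() function reports a skew statistic, which gives a quick numeric confirmation of the right skew suggested by the mean/median comparison (a sketch; I haven’t shown the output here):
psych::describe(Salary_Data$Salary)  # the 'skew' column should come out positive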
Let’s set a seed to make the code reproducible, and split the dataset into training/testing sets.
set.seed(777)
in_train <- createDataPartition(Salary_Data$Salary, p = .7, list = FALSE)
train_SalaryData <- Salary_Data[in_train,]
test_SalaryData <- Salary_Data[-in_train,]
The p argument determines what proportion of the data goes to the training set versus the test set. From what I remember, the general recommendation is that somewhere around 70-75% of the dataset should go to the training set, so I set p = 0.7.
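As a quick sanity check (a sketch; the exact counts depend on the seed and on how createDataPartition rounds), the row counts should split roughly 70/30:
nrow(train_SalaryData) / nrow(Salary_Data)  # should be close to 0.7
nrow(test_SalaryData) / nrow(Salary_Data)   # should be close to 0.3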
Time to start building a model for simple linear regression!
SDModel <- lm(Salary ~ YearsExperience,  # dependent variable ~ independent variable
              data = train_SalaryData)   # fit on the training portion of the overall dataset
summary(SDModel)
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = train_SalaryData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8079.3 -4131.8 153.4 2800.3 11374.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27145.3 2518.6 10.78 8.85e-10 ***
## YearsExperience 9182.0 415.8 22.08 1.62e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5736 on 20 degrees of freedom
## Multiple R-squared: 0.9606, Adjusted R-squared: 0.9586
## F-statistic: 487.7 on 1 and 20 DF, p-value: 1.615e-15
As commented in the code above, the dependent variable is Salary, the independent variable is YearsExperience, and the data argument is the training section of our overall dataset (train_SalaryData).
The model output shows that, on average, predicted salary increases by about $9,182 for each additional year of experience; we know this because the coefficient for YearsExperience is 9182.
The intercept of 27,145 is the model’s predicted salary for a researcher with zero years of experience.
The p-value for YearsExperience is extremely small (1.62e-15), which indicates that it would be extremely unlikely to see a slope this large if YearsExperience had no relationship with Salary. The residual quartiles are roughly balanced around zero (median 153.4), with no dramatic asymmetry between the negative and positive tails. The adjusted R-squared value is 0.9586, which means that our model explains close to 96% of the variance in the dependent variable (which seems pretty solid to me).
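If you’d rather pull these numbers out of the model programmatically than read them off the summary, base R’s coef() and confint() work on any lm object (a sketch):
coef(SDModel)                   # intercept and YearsExperience slope
confint(SDModel, level = 0.95)  # 95% confidence intervals for both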
First, let’s predict the salary for researchers with 5, 10, and 15 years of experience based on the model we created. From the outputs below, we can see the predicted salary rising steadily as years of experience increase; because the model is linear, each 5-year step adds the same amount (5 × 9,182 ≈ $45,910).
predict(SDModel, data.frame(YearsExperience = 5))
## 1
## 73055.32
predict(SDModel, data.frame(YearsExperience = 10))
## 1
## 118965.3
predict(SDModel, data.frame(YearsExperience = 15))
## 1
## 164875.3
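As an aside, those three calls can be collapsed into one, since predict() accepts a data frame with multiple rows (a sketch that should produce the same three numbers):
predict(SDModel, data.frame(YearsExperience = c(5, 10, 15)))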
Next, let’s use the model fit on the training dataset to predict the salaries of researchers in the testing dataset. We can see that the predicted values have a correlation of about 0.98 with the actual values, which means they track the actual salaries closely.
test_SalaryData$pred_values <- predict(SDModel, test_SalaryData)
cor(test_SalaryData$pred_values, test_SalaryData$Salary)
## [1] 0.9795333
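Correlation alone can mask systematic over- or under-prediction, so as a final check we could put the error in dollar terms with caret’s postResample(), which reports RMSE, R-squared, and MAE (a sketch using the objects created above):
caret::postResample(pred = test_SalaryData$pred_values,
                    obs = test_SalaryData$Salary)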