Prepare the Environment

The pacman() package last week was new to me, and I found it very helpful! I’ve always found it annoying to install and load everything separately, nice to know there’s an alternative. I’ll also use the same packages we used in the Week 4 code walkthrough, since we’re doing similar work with linear regression.

install.packages("pacman")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library("pacman")

pacman::p_load(
  tidyverse,
  readr,
  magrittr,
  dplyr,
  psych,
  ggplot2,
  caTools,
  caret
)

Finally, I’ll read in the Salary Data dataset using the read.csv function and label it ‘Salary Data’.

Salary_Data <- read.csv("/cloud/project/Data/Salary_Data.csv")

Understand the Dataset

  1. What are the two columns used in the dataset for this analysis, and what do they represent?
view(Salary_Data)

The two columns in this dataset are ‘YearsExperience’ (which represents how many years of work experience an individual has) and “Salary” (which represents an individual’s annual salary in dollars and serves as our dependent variable.)

The next step is to perform basic EDA and summarize the variables, particularly in regards to the dependent variable. Looking at our summary statistics, the mean is higher than the median which means that our dataset is positively skewed. I also want to take a quick look at the correlation as well, which shows that there is a strong positive correlation (.978) between years of experience and salary.

hist(Salary_Data$Salary)

summary(Salary_Data$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37731   56721   65237   76003  100545  122391
cor(Salary_Data[c("YearsExperience", "Salary")])
##                 YearsExperience    Salary
## YearsExperience       1.0000000 0.9782416
## Salary                0.9782416 1.0000000

Data Splitting

Let’s set a seed to make the code reproducible, and split the dataset into training/testing sets.

set.seed(777)

in_train <- createDataPartition(Salary_Data$Salary, p = .7, list = FALSE)

train_SalaryData <- Salary_Data[in_train,]
test_SalaryData <- Salary_Data[-in_train,]
  1. Explain what value you used for the “p =” argument, and wh is it set to this value?

The “p =” argument determines how much of our dataset will go to the training dataset vs. the test dataset. From what I remember, the general recommendation is somewhere around 70-75% of the dataset should go to the training set, so I set my value to .7.

Model Building

Time to start building a model for simple linear regression!

SDModel <- lm(Salary #Dependentvariable 
              ~ YearsExperience #Independentvariable
              , data = train_SalaryData) #the training section of the overall dataset)

summary(SDModel)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = train_SalaryData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8079.3 -4131.8   153.4  2800.3 11374.3 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      27145.3     2518.6   10.78 8.85e-10 ***
## YearsExperience   9182.0      415.8   22.08 1.62e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5736 on 20 degrees of freedom
## Multiple R-squared:  0.9606, Adjusted R-squared:  0.9586 
## F-statistic: 487.7 on 1 and 20 DF,  p-value: 1.615e-15

As commented in the code above, the dependent variable is Salary, the independent variable is YearsExperience, and the data argument is the training section of our overall dataset (train_SalaryData).

Model Interpretation

  1. Describe the relationship between years of experience and salary based on model output.

The model output demonstrates that for every additional year of experience, a person’s salary increases by $9,182. We know this because the coefficient for YearsExperience is 9182.

  1. What is the starting salary provided that the researcher does not have any experience?

The starting salary for a researcher with no experience is $27,145 as demonstrated by the intercept.

  1. How much on average is a person earning per additional years of experience?

On average, a person earns $9,182 for each additional year of experience.

  1. Evaluate model performance using 1) p-values, 2) residuals, and 3) R-square.

The p-value for this model is extremely small (8.85e-10), which indicates that it would be extremely unlikely for YearsExperience to NOT HAVE a relationship with Salary. Our residuals are positive for the 3rd/4th quartiles and negative for the 1st/2nd quartiles, but all seem relatively evenly distributed across the board. The adjusted R-square value is .9586, which means that our model explains close to 96% of the dependent variable (which seems pretty solid to me.)

Predict

First, let’s predict the salary for researchers with 5, 10, and 15 years of experience based on the model we created. From the outputs below, we can see it steadily rising as years experience increases.

predict(SDModel, data.frame(YearsExperience = 5))
##        1 
## 73055.32
predict(SDModel, data.frame(YearsExperience = 10))
##        1 
## 118965.3
predict(SDModel, data.frame(YearsExperience = 15))
##        1 
## 164875.3

Next, let’s use the training dataset to predict the salaries of researchers in the testing dataset. We can see that the predicted values are 97% correlated with the actual values, which means that they are not very different.

test_SalaryData$pred_values <- predict(SDModel, test_SalaryData)

cor(test_SalaryData$pred_values, test_SalaryData$Salary)
## [1] 0.9795333