A Simple Linear Regression Analysis of the Relationship of Employee Salary Against Employee Years of Experience

Hamza Al Katranji - S3696451

# Introduction

# Problem statement

# Data

The dataset has two columns, YearsExperience and Salary.

# Descriptive Statistics

Data Summary

##  YearsExperience      Salary      
##  Min.   : 1.100   Min.   : 37731  
##  1st Qu.: 3.450   1st Qu.: 57019  
##  Median : 5.300   Median : 81363  
##  Mean   : 6.309   Mean   : 83946  
##  3rd Qu.: 9.250   3rd Qu.:113224  
##  Max.   :13.500   Max.   :139465

# Histogram of Salary and Years of Experience

# Data cleansing - Checking for outliers and missing values

There are no outliers in the dataset.

any(is.na(data))
## [1] FALSE

There are no missing values in the dataset.
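
The no-outlier statement can be cross-checked from the quartiles in the descriptive statistics. The sketch below assumes Tukey's 1.5 × IQR rule, since the report does not say which rule was used, and the numbers are copied from the summary output rather than recomputed from the raw data.

```r
# Quartiles and extremes copied from the summary() output in Descriptive Statistics
q <- data.frame(
  variable = c("YearsExperience", "Salary"),
  min = c(1.100, 37731),
  q1  = c(3.450, 57019),
  q3  = c(9.250, 113224),
  max = c(13.500, 139465)
)

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged
# (assumed rule; the report does not name its outlier criterion)
iqr <- q$q3 - q$q1
q$lower_fence <- q$q1 - 1.5 * iqr
q$upper_fence <- q$q3 + 1.5 * iqr

# If even the minimum and maximum sit inside the fences, no point is flagged
q$any_outlier <- q$min < q$lower_fence | q$max > q$upper_fence
print(q)
```

Under this rule neither column has a value outside its fences, which agrees with the statement above.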

# Scatter plot of salary and years of experience

# Correlation between salary and years of experience

## 
##  Pearson's product-moment correlation
## 
## data:  data$YearsExperience and data$Salary
## t = 30.237, df = 33, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9651673 0.9911731
## sample estimates:
##       cor 
## 0.9824273
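
As a consistency check, the t statistic and confidence interval can be recomputed from the reported correlation and degrees of freedom alone. This is a minimal sketch: it assumes the interval comes from the Fisher z-transformation, which is the method R documents for Pearson's correlation in `cor.test`.

```r
# Values copied from the cor.test() output above
r  <- 0.9824273
df <- 33            # df = n - 2, so the test used n = 35 observations
n  <- df + 2

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)
print(ci)
```

Both results should agree with the printed output up to rounding of the reported correlation.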

# Splitting the dataset

For further analysis, the dataset is split into two sets: a training set and a test set.

The first few rows of each set are as follows:
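
The report does not show how the split was made. The degrees of freedom in the outputs imply 35 observations in total (33 + 2 in the correlation test) and 28 in the training set (26 + 2 in the regression), which is an 80/20 split. The sketch below reproduces a split of that shape in base R. The data frame is simulated as a stand-in for the real file, and the seed and sampling method are assumptions.

```r
set.seed(123)  # arbitrary seed (assumption), only for reproducibility

# Simulated stand-in for the real dataset: 35 rows, same column names.
# These values are made up and are NOT the report's data.
n <- 35
data <- data.frame(YearsExperience = round(runif(n, 1, 13.5), 1))
data$Salary <- round(28864 + 8811 * data$YearsExperience + rnorm(n, sd = 6000))

# Draw 80% of the row indices at random for training; the rest form the test set
train_idx    <- sample(seq_len(n), size = round(0.8 * n))
training_set <- data[train_idx, ]
test_set     <- data[-train_idx, ]

print(head(training_set))
print(head(test_set))
```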

# Fitting a simple regression model

## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8470  -4293    244   3200  12700 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      28864.3     2349.0   12.29 2.47e-12 ***
## YearsExperience   8810.6      343.2   25.67  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6235 on 26 degrees of freedom
## Multiple R-squared:  0.9621, Adjusted R-squared:  0.9606 
## F-statistic: 659.1 on 1 and 26 DF,  p-value: < 2.2e-16
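
Read off the coefficient table, the fitted line is Salary ≈ 28864.3 + 8810.6 × YearsExperience, so each additional year of experience is associated with a predicted salary about 8,811 higher. The sketch below turns the reported estimates into a prediction function and re-derives the t, F and R-squared values from them. The helper name `predict_salary` is hypothetical, and small differences from the printed output are expected because the inputs are rounded.

```r
# Estimates and standard errors copied from the summary(lm) output above
est <- c(intercept = 28864.3, slope = 8810.6)
se  <- c(intercept = 2349.0,  slope = 343.2)
residual_df <- 26

# Fitted line: Salary = intercept + slope * YearsExperience
# (predict_salary is a hypothetical helper, not part of the report)
predict_salary <- function(years) {
  unname(est["intercept"] + est["slope"] * years)
}

# Each t value is the estimate divided by its standard error
t_values <- est / se

# With a single predictor, F equals the squared slope t value,
# and R-squared follows from F and the residual degrees of freedom
f_stat    <- unname(t_values["slope"]^2)
r_squared <- f_stat / (f_stat + residual_df)

print(predict_salary(c(1.1, 5, 13.5)))
print(round(t_values, 2))
print(c(F = f_stat, R2 = r_squared))
```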

# Check for heteroscedasticity

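The Breusch-Pagan test checks the residuals of the fitted model for heteroscedasticity. Its null hypothesis is constant residual variance; with a p-value of 0.5376, the output below gives no evidence against that hypothesis at the 5% level.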

## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 0.37999, df = 1, p-value = 0.5376
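
The reported p-value can be recovered from the statistic itself: under the null hypothesis the studentized Breusch-Pagan statistic is approximately chi-squared, with degrees of freedom equal to the number of predictors (one here). A minimal sketch using only base R:

```r
# Statistic and degrees of freedom copied from the test output above
bp_stat <- 0.37999
bp_df   <- 1

# Upper-tail chi-squared probability gives the p-value
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(p_value)

# A p-value above 0.05 means constant variance is not rejected at the 5% level
print(p_value > 0.05)
```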

# Regression line for the training and test datasets

# Regression line for the full dataset

# Evaluation

##  Mean Absolute Error: 5151.473 
##  Mean Square Error: 32305371 
##  Root Mean Square Error: 5683.781 
##  R-squared: 0.9692507
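
The report does not show how these metrics were computed. The sketch below gives the usual definitions as one plausible reading. The function name `regression_metrics` and the toy vectors are hypothetical, and R-squared is assumed to be 1 - SSE/SST, which may differ from the definition the report used.

```r
# Hypothetical helper: standard error metrics for a set of predictions
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(
    mae       = mean(abs(err)),   # Mean Absolute Error
    mse       = mse,              # Mean Square Error
    rmse      = sqrt(mse),        # Root Mean Square Error
    r_squared = 1 - sum(err^2) / sum((actual - mean(actual))^2)  # assumed definition
  )
}

# Toy values for illustration only; these are NOT the report's test set
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
m <- regression_metrics(actual, predicted)
print(m)

# Consistency check on the reported figures: RMSE should equal sqrt(MSE)
print(sqrt(32305371))
```

The last line should agree with the reported Root Mean Square Error of 5683.781 up to rounding.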

# Discussion

# References