SRout Discussion 11

Salary of employees based on their years of working experience.
Download the dataset from Kaggle using the link below.
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression

The dataset has only two columns, YearsExperience and Salary, and both are of type double. Its shape is 30 x 2: 30 rows and 2 columns. There are no null values in the dataset.
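The column types and the absence of null values can be checked directly in R. The sketch below is only illustrative: it assumes a small data frame, here called `sal_head` (a hypothetical name), rebuilt from the six rows that `head(sal_df)` prints in this post, so that it runs without the CSV file. With the full data the same calls would be applied to `sal_df`.

```r
# Six rows copied from the head(sal_df) output in this post;
# the full data set has 30 rows. `sal_head` is a hypothetical name.
sal_head <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642)
)

col_types <- sapply(sal_head, typeof)   # storage type of each column
na_counts <- colSums(is.na(sal_head))   # number of missing values per column

print(col_types)
print(na_counts)
```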

First, load the CSV file into R, then look at the relationship between the two columns.

# load csv
sal_df <- read.csv("/Users/subhalaxmirout/DATA 605/Salary_Data.csv")
# scatterplot matrix to show the correlation between the columns
pairs(sal_df)

The two variables are positively correlated. This means that employees with more experience tend to be paid a higher salary.
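The strength of this relationship can also be expressed as a number with the Pearson correlation coefficient. The following is a minimal sketch that assumes only the six rows printed by `head(sal_df)`, stored in a hypothetical data frame `sal_head`. The coefficient for the full 30-row data set will differ, so no value is quoted here.

```r
# Six rows copied from the head(sal_df) output in this post (not the full data).
sal_head <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642)
)

# Pearson correlation between experience and salary; a value above 0
# indicates a positive linear relationship.
r <- cor(sal_head$YearsExperience, sal_head$Salary)
print(r)
```

On the full data the equivalent call would be `cor(sal_df$YearsExperience, sal_df$Salary)`.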

# overview of data
head(sal_df)

##   YearsExperience Salary
## 1             1.1  39343
## 2             1.3  46205
## 3             1.5  37731
## 4             2.0  43525
## 5             2.2  39891
## 6             2.9  56642

The shape of the data set is shown below.

# shape of data
dim(sal_df)

## [1] 30  2

The summary of the data shows the statistics of each column, i.e. minimum, maximum, mean, median, etc.

# Summary of data
summary(sal_df)

##  YearsExperience      Salary      
##  Min.   : 1.100   Min.   : 37731  
##  1st Qu.: 3.200   1st Qu.: 56721  
##  Median : 4.700   Median : 65237  
##  Mean   : 5.313   Mean   : 76003  
##  3rd Qu.: 7.700   3rd Qu.:100545  
##  Max.   :10.500   Max.   :122391

Linear regression is a way to model the relationship between two variables. The equation has the form \(Y = c + mX\), where \(Y\) is the dependent variable, \(X\) is the independent variable, \(m\) is the slope of the line and \(c\) is the y-intercept.
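For a single predictor, the least-squares estimates of \(m\) and \(c\) have a closed form: \(m = \mathrm{cov}(X, Y) / \mathrm{var}(X)\) and \(c = \bar{Y} - m\bar{X}\). The sketch below computes both by hand and compares them with `lm()`. It assumes the six rows printed by `head(sal_df)` only (hypothetical data frame `sal_head`), so the numbers are not the coefficients of the full 30-row model.

```r
# Six rows copied from the head(sal_df) output in this post (not the full data).
sal_head <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642)
)
x <- sal_head$YearsExperience
y <- sal_head$Salary

# Closed-form least-squares estimates for Y = c + mX
m  <- cov(x, y) / var(x)        # slope
c0 <- mean(y) - m * mean(x)     # intercept (named c0 to avoid masking c())

# The same estimates from lm(), for comparison
fit <- lm(Salary ~ YearsExperience, data = sal_head)

print(c(intercept = c0, slope = m))
print(coef(fit))
```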

There are four assumptions associated with a linear regression model:

Linearity: The relationship between X and the mean of Y is linear
Homoscedasticity: The variance of the residuals is the same for any value of X
Independence: Observations are independent of each other
Normality: For any fixed value of X, Y is normally distributed

# create linear model
lm <- lm(Salary ~ YearsExperience, data = sal_df)

The plot below shows the positive correlation between YearsExperience and Salary, with the fitted regression line.

plot(sal_df$YearsExperience,sal_df$Salary, pch=16,cex=1.3, col="blue",
     xlab="Years of experience",ylab="Salary",main="Linear regression model")
abline(lm)

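# histogram of the residuals to check normality
residuals <- residuals(lm)
hist(residuals)

# normal Q-Q plot of the residuals
qqnorm(residuals)
qqline(residuals)

# residuals against the predictor to check for constant variance
plot(sal_df$YearsExperience, residuals)
abline(0, 0)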

# Summary of model
summary(lm)

## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = sal_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7958.0 -4088.5  -459.9  3372.6 11448.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      25792.2     2273.1   11.35 5.51e-12 ***
## YearsExperience   9450.0      378.8   24.95  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.9554 
## F-statistic: 622.5 on 1 and 28 DF,  p-value: < 2.2e-16

The model summary and the diagnostic plots show:

High \(R^2\) (0.957) and adjusted \(R^2\) (0.9554)
The p-value is less than 0.05
The residuals look approximately normally distributed in the histogram
The Q-Q plot shows most observations falling on the line, with few outliers
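As a reminder of what \(R^2\) measures, it can be recomputed from the residuals as \(R^2 = 1 - SS_{res} / SS_{tot}\), the share of the variance in Salary explained by the model. The following is a minimal sketch on the six rows printed by `head(sal_df)` (hypothetical data frame `sal_head`), so its value is not the 0.957 reported for the full model.

```r
# Six rows copied from the head(sal_df) output in this post (not the full data).
sal_head <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642)
)
fit <- lm(Salary ~ YearsExperience, data = sal_head)

ss_res <- sum(residuals(fit)^2)                              # unexplained variation
ss_tot <- sum((sal_head$Salary - mean(sal_head$Salary))^2)   # total variation
r2 <- 1 - ss_res / ss_tot

print(r2)
print(summary(fit)$r.squared)   # value reported by summary(), for comparison
```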

We can say that this model is a good fit. The equation can be written as \(Y = 25792.2 + 9450X\), where \(Y\) is Salary and \(X\) is YearsExperience.
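The fitted equation can be used to estimate a salary for a given number of years of experience. The sketch below plugs the coefficients reported by `summary(lm)` into a small helper, `predict_salary` (a hypothetical name, not part of the original analysis). Estimates are only reasonable inside the observed range of 1.1 to 10.5 years.

```r
# Coefficients taken from the summary(lm) output in this post.
intercept <- 25792.2
slope <- 9450.0

# Hypothetical helper: estimated salary for a given number of years of experience.
predict_salary <- function(years) {
  intercept + slope * years
}

print(predict_salary(5))   # 25792.2 + 9450 * 5 = 73042.2
```

With the fitted model object available, `predict(lm, newdata = data.frame(YearsExperience = 5))` should give essentially the same estimate, up to rounding of the printed coefficients.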