Salary of employees based on their years of working experience.
Download the dataset from Kaggle using the link below.
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression
The dataset has only two columns, YearsExperience and Salary, and both are of type double. Its shape is 30 by 2, that is, 30 rows and 2 columns. The dataset contains no null values.
First, load the CSV file into R, then look at the relationship between the two columns.
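The column types, shape, and absence of null values described above can be checked directly in R. The sketch below is illustrative only: it assumes a small data frame, here called `sample_df` (a hypothetical name), built from the six rows that `head(sal_df)` prints in this document, not the full 30-row file.

```r
# Illustrative subset: only the six rows printed by head(sal_df) in this document.
# sample_df is a hypothetical name; the full dataset has 30 rows.
sample_df <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary          = c(39343, 46205, 37731, 43525, 39891, 56642)
)

col_types <- sapply(sample_df, typeof)   # storage type of each column
na_counts <- colSums(is.na(sample_df))   # number of missing values per column

print(col_types)
print(na_counts)
print(dim(sample_df))                    # rows, columns
```

Running the same three checks on `sal_df` after loading the CSV should confirm the properties stated above for the full dataset.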
# load csv
sal_df <- read.csv("/Users/subhalaxmirout/DATA 605/Salary_Data.csv")
# Scatterplot matrix showing the relationship between the columns
pairs(sal_df)
The plot shows that the two columns are positively correlated: employees with more years of experience tend to earn a higher salary.
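The visual impression of a positive relationship can also be expressed as a number with the Pearson correlation coefficient. This is a minimal sketch, assuming only the six rows that `head(sal_df)` prints in this document (stored in a hypothetical `sample_df`), so the value it prints is not the correlation for the full dataset.

```r
# Illustrative subset: the six rows printed by head(sal_df) in this document.
sample_df <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary          = c(39343, 46205, 37731, 43525, 39891, 56642)
)

# cor() computes the Pearson correlation by default; values near +1
# indicate a strong positive linear relationship.
r <- cor(sample_df$YearsExperience, sample_df$Salary)
print(r)
```

With the full data loaded, `cor(sal_df$YearsExperience, sal_df$Salary)` would give the corresponding value for all 30 rows.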
# overview of data
head(sal_df)
## YearsExperience Salary
## 1 1.1 39343
## 2 1.3 46205
## 3 1.5 37731
## 4 2.0 43525
## 5 2.2 39891
## 6 2.9 56642
The shape of the dataset is shown below.
# shape of data
dim(sal_df)
## [1] 30 2
The summary of the data shows descriptive statistics for each column, i.e. the minimum, maximum, mean, median, and quartiles.
# Summary of data
summary(sal_df)
## YearsExperience Salary
## Min. : 1.100 Min. : 37731
## 1st Qu.: 3.200 1st Qu.: 56721
## Median : 4.700 Median : 65237
## Mean : 5.313 Mean : 76003
## 3rd Qu.: 7.700 3rd Qu.:100545
## Max. :10.500 Max. :122391
Linear regression is a way to model the relationship between two variables. The equation has the form \(Y = c + mX\), where \(Y\) is the dependent variable, \(X\) is the independent variable, \(m\) is the slope of the line, and \(c\) is the y-intercept.
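For a single predictor, the least-squares slope and intercept have a closed form: \(m = \mathrm{cov}(X, Y) / \mathrm{var}(X)\) and \(c = \bar{Y} - m\bar{X}\). The sketch below computes them by hand and compares them with `lm()`. It assumes only the six rows that `head(sal_df)` prints in this document (a hypothetical `sample_df`), so the coefficients it prints differ from those of the full 30-row model.

```r
# Illustrative subset: the six rows printed by head(sal_df) in this document.
sample_df <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary          = c(39343, 46205, 37731, 43525, 39891, 56642)
)

x <- sample_df$YearsExperience
y <- sample_df$Salary

m  <- cov(x, y) / var(x)        # slope of the least-squares line
c0 <- mean(y) - m * mean(x)     # y-intercept (named c0 to avoid masking c())

# Same fit using R's built-in linear model function
fit <- lm(Salary ~ YearsExperience, data = sample_df)

print(c(intercept = c0, slope = m))
print(coef(fit))                # should agree with the hand computation
```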
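There are four assumptions associated with a linear regression model: linearity (the relationship between \(X\) and the mean of \(Y\) is linear), homoscedasticity (the variance of the residuals is the same for any value of \(X\)), independence (the observations are independent of each other), and normality (for any fixed value of \(X\), \(Y\) is normally distributed).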
# create linear model
lm <- lm(Salary ~ YearsExperience, data = sal_df)
The plot below shows the positive correlation between YearsExperience and Salary, with the fitted regression line drawn over the data points.
plot(sal_df$YearsExperience,sal_df$Salary, pch=16,cex=1.3, col="blue",
xlab="Years of experience",ylab="Salary",main="Linear regression model")
abline(lm)
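# residuals of the fitted model
residuals <- residuals(lm)
# histogram of the residuals (check for an approximately normal shape)
hist(residuals)
# normal Q-Q plot of the residuals with a reference line
qqnorm(residuals)
qqline(residuals)
# residuals against the predictor, with a horizontal line at zero
plot(sal_df$YearsExperience, residuals)
abline(0, 0)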
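The residuals examined in these plots are the observed salaries minus the values fitted by the model. A minimal sketch of that definition follows, assuming only the six rows that `head(sal_df)` prints in this document (a hypothetical `sample_df`) and a model object named `fit` for illustration.

```r
# Illustrative subset: the six rows printed by head(sal_df) in this document.
sample_df <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9),
  Salary          = c(39343, 46205, 37731, 43525, 39891, 56642)
)

fit <- lm(Salary ~ YearsExperience, data = sample_df)

# Residual = observed value minus fitted value
res_manual <- sample_df$Salary - fitted(fit)

# residuals() returns the same quantities
res_builtin <- residuals(fit)

print(cbind(manual = res_manual, builtin = res_builtin))
# For a model with an intercept, the residuals sum to (numerically) zero
print(sum(res_manual))
```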
# Summary of model
summary(lm)
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = sal_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.2 2273.1 11.35 5.51e-12 ***
## YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
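The model summary shows that both the intercept (25792.2) and the YearsExperience coefficient (9450.0) are statistically significant (p < 0.001), that the multiple R-squared is 0.957, so the model explains about 95.7% of the variance in Salary, and that the overall F-statistic is 622.5 on 1 and 28 degrees of freedom (p-value < 2.2e-16).
We can therefore say that the model fits the data well. The fitted equation can be written as \(Y = 25792.2 + 9450X\), where \(X\) is YearsExperience and \(Y\) is the predicted Salary.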
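To illustrate how the fitted equation is used, the sketch below plugs a value of YearsExperience into it. The coefficients are the rounded estimates reported in the model summary; `predict_salary` is a hypothetical helper introduced only for this example.

```r
# Rounded coefficient estimates as reported in the summary(lm) output
intercept <- 25792.2
slope     <- 9450.0

# Hypothetical helper: predicted salary for a given number of years of experience
predict_salary <- function(years) {
  intercept + slope * years
}

# 25792.2 + 9450 * 5 = 73042.2
print(predict_salary(5))
```

With the fitted model object itself, `predict(lm, newdata = data.frame(YearsExperience = 5))` should give essentially the same prediction, differing only by the rounding of the coefficients.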