I created my own salary data set and fit a linear regression model to it. The data has two columns: years of experience and salary.
#Salary data
salary <- read.csv('https://raw.githubusercontent.com/Riteshlohiya/Data605_Discussion11/master/Salary_Data.csv')
salary
## YearsExperience Salary
## 1 1.1 39343
## 2 1.3 46205
## 3 1.5 37731
## 4 2.0 43525
## 5 2.2 39891
## 6 2.9 56642
## 7 3.0 60150
## 8 3.2 54445
## 9 3.2 64445
## 10 3.7 57189
## 11 3.9 63218
## 12 4.0 55794
## 13 4.0 56957
## 14 4.1 57081
## 15 4.5 61111
## 16 4.9 67938
## 17 5.1 66029
## 18 5.3 83088
## 19 5.9 81363
## 20 6.0 93940
## 21 6.8 91738
## 22 7.1 98273
## 23 7.9 101302
## 24 8.2 113812
## 25 8.7 109431
## 26 9.0 105582
## 27 9.5 116969
## 28 9.6 112635
## 29 10.3 122391
## 30 10.5 121872
summary(salary)
## YearsExperience Salary
## Min. : 1.100 Min. : 37731
## 1st Qu.: 3.200 1st Qu.: 56721
## Median : 4.700 Median : 65237
## Mean : 5.313 Mean : 76003
## 3rd Qu.: 7.700 3rd Qu.:100545
## Max. :10.500 Max. :122391
#Distribution
hist(salary$YearsExperience, main = "Histogram of Years of Experience")
hist(salary$Salary, main = "Histogram of Salary")
plot(salary$YearsExperience ~ salary$Salary, main = "Years of Experience vs Salary")
Create a simple regression model:
# Simple linear regression model
slm <- lm(salary$YearsExperience ~ salary$Salary)
summary(slm)
##
## Call:
## lm(formula = salary$YearsExperience ~ salary$Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12974 -0.46457 0.04105 0.54311 0.79669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.383e+00 3.273e-01 -7.281 6.3e-08 ***
## salary$Salary 1.013e-04 4.059e-06 24.950 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5992 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
plot(salary$YearsExperience ~ salary$Salary,
xlab='Salary',
ylab='Years of Experience',
main='Years of Experience vs Salary')
abline(slm)
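The formula above refers to the columns as `salary$...`, which works for fitting but makes `predict()` awkward to use with new data. Below is a minimal sketch of the same fit written with bare column names and a `data` argument. The data frame is retyped from the printout above so the block runs without network access, and `slm2` and `new_obs` are hypothetical names used only for illustration.

```r
# Data retyped from the printout above (30 rows), so no download is needed
salary <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                      3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

# Same model as slm, but with bare column names and a data argument
slm2 <- lm(YearsExperience ~ Salary, data = salary)

# Hypothetical new observation: a salary of 70000
new_obs <- data.frame(Salary = 70000)

# Predicted years of experience for that salary
pred <- predict(slm2, newdata = new_obs)
pred
```

Because the model is fit with `data = salary`, `predict()` can match the `Salary` column in `new_obs` by name, which it cannot do reliably when the formula contains `salary$Salary`.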
Residual plots:
# Residuals
plot(slm$residuals, ylab='Residuals')
abline(a=0, b=0)
# Q-Q plot
qqnorm(slm$residuals)
qqline(slm$residuals)
The R-squared value is 0.957, which is good. It means the model explains 95.7% of the variability in the response (years of experience) using the predictor (salary). The residual plot shows roughly constant variability around zero and no obvious pattern. The Q-Q plot also looks good, with some points departing from the line at the tails. I think this linear model is appropriate.
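Two quick checks can back up this reading. The sketch below plots the residuals against the fitted values (a common complement to plotting them in observation order) and confirms that, for a simple linear regression with one predictor, R-squared equals the squared correlation between the two variables. The data frame is retyped from the printout above, and `fit`, `r2_model`, and `r2_cor` are hypothetical names.

```r
# Data retyped from the printout above (30 rows), so no download is needed
salary <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                      3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

fit <- lm(YearsExperience ~ Salary, data = salary)

# Residuals against fitted values; a patternless band around 0 supports
# the linearity and constant-variance assumptions
plot(fitted(fit), resid(fit),
     xlab = "Fitted years of experience",
     ylab = "Residuals",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# R-squared reported by the model
r2_model <- summary(fit)$r.squared

# Squared Pearson correlation between the two columns
r2_cor <- cor(salary$YearsExperience, salary$Salary)^2

# TRUE when the two quantities agree up to numerical tolerance
isTRUE(all.equal(r2_model, r2_cor))
```

Since the squared correlation is symmetric in the two variables, regressing salary on years of experience would give the same R-squared, although the coefficients would differ.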