Discussion 11

IS 605 FUNDAMENTALS OF COMPUTATIONAL MATHEMATICS

Linear Regression Model

Assignment: Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I created my own salary data set and fit a linear regression model to it. The data has two columns: years of experience and salary.

#Salary data
salary <- read.csv('https://raw.githubusercontent.com/Riteshlohiya/Data605_Discussion11/master/Salary_Data.csv')
salary
##    YearsExperience Salary
## 1              1.1  39343
## 2              1.3  46205
## 3              1.5  37731
## 4              2.0  43525
## 5              2.2  39891
## 6              2.9  56642
## 7              3.0  60150
## 8              3.2  54445
## 9              3.2  64445
## 10             3.7  57189
## 11             3.9  63218
## 12             4.0  55794
## 13             4.0  56957
## 14             4.1  57081
## 15             4.5  61111
## 16             4.9  67938
## 17             5.1  66029
## 18             5.3  83088
## 19             5.9  81363
## 20             6.0  93940
## 21             6.8  91738
## 22             7.1  98273
## 23             7.9 101302
## 24             8.2 113812
## 25             8.7 109431
## 26             9.0 105582
## 27             9.5 116969
## 28             9.6 112635
## 29            10.3 122391
## 30            10.5 121872
summary(salary)
##  YearsExperience      Salary      
##  Min.   : 1.100   Min.   : 37731  
##  1st Qu.: 3.200   1st Qu.: 56721  
##  Median : 4.700   Median : 65237  
##  Mean   : 5.313   Mean   : 76003  
##  3rd Qu.: 7.700   3rd Qu.:100545  
##  Max.   :10.500   Max.   :122391
#Distribution
hist(salary$YearsExperience, main = "Histogram of Years of Experience")

hist(salary$Salary, main = "Histogram of Salary")

plot(salary$YearsExperience ~ salary$Salary, main = "Years of Experience vs Salary")

Create a simple regression model:

# Simple linear regression model: YearsExperience (response) regressed on Salary (predictor)
slm <- lm(salary$YearsExperience ~ salary$Salary)
summary(slm)
## 
## Call:
## lm(formula = salary$YearsExperience ~ salary$Salary)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12974 -0.46457  0.04105  0.54311  0.79669 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.383e+00  3.273e-01  -7.281  6.3e-08 ***
## salary$Salary  1.013e-04  4.059e-06  24.950  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5992 on 28 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.9554 
## F-statistic: 622.5 on 1 and 28 DF,  p-value: < 2.2e-16
plot(salary$YearsExperience ~ salary$Salary, 
     xlab='Salary',
     ylab='Years of Experience',
     main='Years of Experience vs Salary')
abline(slm)
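
The model above treats years of experience as the response and salary as the predictor. As a hedged sketch of the more common framing, the block below fits salary as a function of experience using the `data =` argument of `lm()`. The data is retyped inline from the table printed earlier so the block runs without a network connection; the names `sal` and `fit_rev` are illustrative and not part of the original post. In simple linear regression, R-squared equals the squared Pearson correlation, so it is the same in either direction.

```r
# Data retyped from the table printed above (assumption: values match the CSV)
sal <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                      3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Salary as the response, years of experience as the predictor
fit_rev <- lm(Salary ~ YearsExperience, data = sal)
print(coef(fit_rev))

# In simple regression, R-squared is the squared correlation,
# so it does not depend on which variable is the response
r2_rev <- summary(fit_rev)$r.squared
r2_cor <- cor(sal$YearsExperience, sal$Salary)^2
print(all.equal(r2_rev, r2_cor))
```

Using `data = sal` instead of `sal$...` inside the formula also gives cleaner coefficient names and lets `predict()` accept new data frames.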

Residual plots:

# Residuals plotted against observation index
plot(slm$residuals, ylab='Residuals')
abline(a=0, b=0)

# Q-Q plot
qqnorm(slm$residuals)
qqline(slm$residuals)
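
As a possible complement to the two plots above, the sketch below plots residuals against fitted values (the usual view for spotting non-constant variance or curvature) and runs a Shapiro-Wilk test as a numeric companion to the Q-Q plot. It refits the same model from data retyped inline; the names `sal`, `fit`, `res`, and `sw` are illustrative assumptions, not from the original post.

```r
# Data retyped from the table printed above (assumption: values match the CSV)
sal <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                      3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Same model as in the post: experience regressed on salary
fit <- lm(YearsExperience ~ Salary, data = sal)
res <- resid(fit)

# Residuals vs fitted values: look for funnels (changing spread) or curves
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Shapiro-Wilk test of normality on the residuals;
# a small p-value would suggest departure from normality
sw <- shapiro.test(res)
print(sw)
```

With only 30 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.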

Conclusion:

The R-squared value is 0.957, so the model explains about 95.7% of the variance in the response (years of experience) using the predictor (salary). The residual plot shows roughly constant variability around zero with no obvious pattern. The Q-Q plot also stays close to the reference line, with some deviation at the tails. Based on these checks, I think the linear model is appropriate.
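
The R-squared figure quoted in the conclusion can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for the same model, with the data retyped inline; `sal`, `fit`, `ss_res`, `ss_tot`, and `r2_manual` are illustrative names, not from the original post.

```r
# Data retyped from the table printed above (assumption: values match the CSV)
sal <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                      3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Same model as in the post: experience regressed on salary
fit <- lm(YearsExperience ~ Salary, data = sal)

# Residual sum of squares: variation left unexplained by the model
ss_res <- sum(resid(fit)^2)

# Total sum of squares: variation of the response around its mean
ss_tot <- sum((sal$YearsExperience - mean(sal$YearsExperience))^2)

# R-squared is the share of total variation that the model explains
r2_manual <- 1 - ss_res / ss_tot
print(r2_manual)
print(summary(fit)$r.squared)
```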