Question

Using the salary_dataset from Kaggle, build a linear model for salary vs years of experience. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

Load required libraries

library(tidyverse)

Read the data and take a glimpse at the dataset

Read the data

salary_url = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/discussions/discussion11/salary_dataset.csv"
# read the data from github url
salary_dataset_raw = read_csv(salary_url) 

Take a glimpse at the data

salary_dataset = salary_dataset_raw %>% select("YearsExperience", "Salary")
glimpse(salary_dataset)
## Rows: 30
## Columns: 2
## $ YearsExperience <dbl> 1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8, 4.0,…
## $ Salary          <dbl> 39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446…

We can see that the dataset has 30 rows and 2 columns (Salary and YearsExperience).

Summary for the dataset

summary(salary_dataset)
##  YearsExperience      Salary      
##  Min.   : 1.200   Min.   : 37732  
##  1st Qu.: 3.300   1st Qu.: 56722  
##  Median : 4.800   Median : 65238  
##  Mean   : 5.413   Mean   : 76004  
##  3rd Qu.: 7.800   3rd Qu.:100546  
##  Max.   :10.600   Max.   :122392

Visualize the data

p = ggplot(salary_dataset, aes(x = YearsExperience, y = Salary)) +
    geom_point() +
    labs(title = "Salary vs YearsExperience",
         x = "YearsExperience", y = "Salary") +
    theme(panel.grid.major = element_line(colour = "lemonchiffon3"),
          panel.grid.minor = element_line(colour = "lemonchiffon3"),
          axis.title = element_text(size = 13),
          axis.text = element_text(size = 11),
          axis.text.x = element_text(family = "sans", size = 11),
          axis.text.y = element_text(family = "sans", size = 11),
          plot.title = element_text(size = 15, hjust = 0.5),
          panel.background = element_rect(fill = "gray85"),
          plot.background = element_rect(fill = "antiquewhite"))
p

From the scatter plot, Salary tends to increase as YearsExperience increases. The relationship between Salary and YearsExperience appears clearly linear in the given data.

Build the Model

lm_salary = lm(salary_dataset$Salary ~ salary_dataset$YearsExperience)
lm_salary
## 
## Call:
## lm(formula = salary_dataset$Salary ~ salary_dataset$YearsExperience)
## 
## Coefficients:
##                    (Intercept)  salary_dataset$YearsExperience  
##                          24848                            9450

From the output of the model, the linear function is given by:
\(Salary = 24848 + 9450 \times YearsExperience\)
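
As a quick sanity check of the fitted line, the sketch below evaluates it at a few values of YearsExperience. It uses the rounded coefficients printed by lm_salary (intercept 24848, slope 9450); the helper name predict_salary is hypothetical and exists only for this illustration.

```r
# Rounded coefficients copied from the printed lm_salary output
intercept <- 24848   # estimated Salary at 0 years of experience
slope     <- 9450    # estimated Salary increase per additional year

# Hypothetical helper: evaluate the fitted line at given years of experience
predict_salary <- function(years) {
  intercept + slope * years
}

years <- c(2, 5, 10)
data.frame(YearsExperience = years, PredictedSalary = predict_salary(years))
```

Because the coefficients are rounded, these predictions are approximate. Using predict() with newdata would likely require refitting the model as lm(Salary ~ YearsExperience, data = salary_dataset), since the salary_dataset$... form of the formula does not let predict() match new rows to the predictor.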

Evaluate the Model - Quality Evaluation

# Check the summary of the model
summary(lm_salary)
## 
## Call:
## lm(formula = salary_dataset$Salary ~ salary_dataset$YearsExperience)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7958.0 -4088.5  -459.9  3372.6 11448.0 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     24848.2     2306.7   10.77 1.82e-11 ***
## salary_dataset$YearsExperience   9450.0      378.8   24.95  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.9554 
## F-statistic: 622.5 on 1 and 28 DF,  p-value: < 2.2e-16

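The summary shows that the model fits the data closely: the multiple R-squared of 0.957 means that YearsExperience explains about 95.7% of the variance in Salary, and the residual standard error is 5788 on 28 degrees of freedom. The slope is statistically significant (p-value < 2e-16).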
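
To make the summary table easier to read, the sketch below recomputes several of its entries from the reported estimates and standard errors. All input numbers are copied from the output above; because they are rounded, the recomputed values should agree with the table only up to rounding. The 95% confidence intervals are an addition that summary() does not print.

```r
# Estimates and standard errors copied from the summary(lm_salary) output
est <- c(intercept = 24848.2, slope = 9450.0)
se  <- c(intercept = 2306.7,  slope = 378.8)
df_resid <- 28     # residual degrees of freedom
r2 <- 0.957        # multiple R-squared
n  <- 30           # number of observations

# t value = estimate / standard error
t_val <- est / se

# Two-sided p-values from the t distribution
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)

# 95% confidence intervals: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# With a single predictor, the F-statistic equals the squared slope t value
f_stat <- unname(t_val["slope"]^2)

# Adjusted R-squared with one predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(round(t_val, 2))
print(signif(p_val, 3))
print(round(ci))
print(round(f_stat, 1))
print(round(adj_r2, 4))
```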

Residual Analysis
Residual vs Fitted plot

plot(fitted(lm_salary),resid(lm_salary), main="Residuals vs Fitted", xlab = "Fitted", ylab = "Residuals")
abline(0, 0)

From this plot, the residuals appear to have roughly constant variance around zero, so there is no clear sign of heteroscedasticity.
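
The visual impression of constant variance can be backed up with a formal test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R: it regresses the squared residuals on the predictors and compares n times the R-squared of that regression with a chi-squared distribution. The helper name bp_test is hypothetical, and the demonstration uses R's built-in cars data rather than the salary data.

```r
# Hypothetical helper: studentized Breusch-Pagan test computed by hand.
# Null hypothesis: the residual variance is constant (homoscedasticity).
bp_test <- function(model) {
  e2 <- resid(model)^2                            # squared residuals
  X  <- model.matrix(model)[, -1, drop = FALSE]   # predictors without the intercept
  aux  <- lm(e2 ~ X)                              # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared     # LM statistic = n * R^2
  df   <- ncol(X)                                 # one degree of freedom per predictor
  list(statistic = stat,
       df = df,
       p_value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Demonstration on the built-in cars data (not the salary data)
demo_fit <- lm(dist ~ speed, data = cars)
bp_result <- bp_test(demo_fit)
print(bp_result)
```

For the model in this analysis the call would be bp_test(lm_salary); a large p-value would be consistent with the constant-variance reading of the plot. With only 30 observations the test has limited power, so it complements the plot rather than replacing it.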

QQ Plot

qqnorm(resid(lm_salary))
qqline(resid(lm_salary))

From the Q-Q plot, we can see that the residuals follow a nearly normal distribution.
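
A numerical companion to the Q-Q plot is the Shapiro-Wilk test, available in base R as shapiro.test(). The sketch below wraps it in a hypothetical helper, check_normality, applied to the standardized residuals; the demonstration again uses the built-in cars data rather than the salary data.

```r
# Hypothetical helper: Shapiro-Wilk test on the standardized residuals.
# Null hypothesis: the residuals come from a normal distribution.
check_normality <- function(model) {
  test <- shapiro.test(rstandard(model))
  c(W = unname(test$statistic), p_value = test$p.value)
}

# Demonstration on the built-in cars data (not the salary data)
demo_fit <- lm(dist ~ speed, data = cars)
normality <- check_normality(demo_fit)
print(normality)
```

For this analysis the call would be check_normality(lm_salary). A p-value well above 0.05 would give no evidence against normality, which is what the Q-Q plot suggests.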

All diagnostic plots

par(mfrow=c(2,2))
plot(lm_salary)

From the residual analysis, the conditions for linear regression appear to be satisfied for this data, so it can be fitted with a linear model. The R-squared value is about 0.957, although what counts as an acceptable R-squared depends heavily on the use case of the model. Based on these results, I believe that a simple linear regression model is appropriate for this data.
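
One further way to probe whether a straight line is adequate is to test it against a slightly more flexible model. The sketch below compares a linear fit with a quadratic fit using an F-test via anova(). The helper name compare_linear_quadratic is hypothetical, and the data are simulated for illustration only; they are not the salary data.

```r
# Hypothetical helper: F-test of a straight line against a quadratic fit.
# Null hypothesis: the quadratic term adds nothing (the line is adequate).
compare_linear_quadratic <- function(x, y) {
  linear    <- lm(y ~ x)
  quadratic <- lm(y ~ x + I(x^2))
  anova(linear, quadratic)
}

# Simulated example (not the salary data): a truly linear relationship plus noise
set.seed(1)
x <- seq(1, 10, length.out = 30)
y <- 25000 + 9500 * x + rnorm(30, sd = 5000)

comparison <- compare_linear_quadratic(x, y)
print(comparison)
```

For this analysis the call would be compare_linear_quadratic(salary_dataset$YearsExperience, salary_dataset$Salary). A large p-value in the second row would support keeping the simple linear model.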