Using the salary_dataset from Kaggle, build a linear model for salary vs. years of experience. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Load required libraries
library(tidyverse)
Read the data and take a glimpse at the dataset
Read the data
salary_url = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/discussions/discussion11/salary_dataset.csv"
# read the data from github url
salary_dataset_raw = read_csv(salary_url)
Take a glimpse at the data
salary_dataset = salary_dataset_raw %>% select("YearsExperience", "Salary")
glimpse(salary_dataset)
## Rows: 30
## Columns: 2
## $ YearsExperience <dbl> 1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8, 4.0,…
## $ Salary <dbl> 39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446…
We can see that the dataset has 30 rows and 2 columns (Salary and YearsExperience).
Summary for the dataset
summary(salary_dataset)
## YearsExperience Salary
## Min. : 1.200 Min. : 37732
## 1st Qu.: 3.300 1st Qu.: 56722
## Median : 4.800 Median : 65238
## Mean : 5.413 Mean : 76004
## 3rd Qu.: 7.800 3rd Qu.:100546
## Max. :10.600 Max. :122392
Visualize the data
p = ggplot(salary_dataset, aes(x = YearsExperience, y = Salary)) +
  geom_point() +
  theme(
    panel.grid.major = element_line(colour = "lemonchiffon3"),
    panel.grid.minor = element_line(colour = "lemonchiffon3"),
    axis.title = element_text(size = 13),
    axis.text = element_text(size = 11),
    axis.text.x = element_text(family = "sans", size = 11),
    axis.text.y = element_text(family = "sans", size = 11),
    plot.title = element_text(size = 15, hjust = 0.5),
    panel.background = element_rect(fill = "gray85"),
    plot.background = element_rect(fill = "antiquewhite")
  ) +
  labs(title = "Salary vs YearsExperience", x = "YearsExperience", y = "Salary")
p
From the scatter plot, Salary tends to increase as YearsExperience increases. The points follow a clear, roughly linear trend in the given data.
Build the Model
lm_salary = lm(salary_dataset$Salary ~ salary_dataset$YearsExperience)
lm_salary
##
## Call:
## lm(formula = salary_dataset$Salary ~ salary_dataset$YearsExperience)
##
## Coefficients:
## (Intercept) salary_dataset$YearsExperience
## 24848 9450
From the output of the model, the linear function is given by:
\(Salary = 24848 + 9450 \times YearsExperience\)
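As a quick worked example of reading the fitted line, the sketch below plugs a hypothetical 5 years of experience into the rounded coefficients printed in the model output (intercept 24848, slope 9450). The variable names and the input value are illustrative and not part of the original analysis.

```r
# Rounded coefficients copied from the lm() output above
intercept <- 24848   # estimated salary at 0 years of experience
slope <- 9450        # estimated salary increase per additional year

# Hypothetical input chosen for illustration only
years <- 5

# Point prediction from the fitted line: intercept + slope * x
predicted_salary <- intercept + slope * years
predicted_salary
```

This evaluates to 72098. Note that `predict()` with `newdata` on the fitted object would only behave as expected if the model were specified as `lm(Salary ~ YearsExperience, data = salary_dataset)` rather than with `$` inside the formula.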
Evaluate the Model - Quality Evaluation
# Check the summary of the model
summary(lm_salary)
##
## Call:
## lm(formula = salary_dataset$Salary ~ salary_dataset$YearsExperience)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24848.2 2306.7 10.77 1.82e-11 ***
## salary_dataset$YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
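The summary shows that the slope is highly significant (p < 2e-16) and that the model explains about 95.7% of the variance in Salary (R-squared = 0.957). The residual standard error of 5788 on 28 degrees of freedom is small relative to the salaries in the data, which range from 37732 to 122392.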
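To make the link between the columns of the coefficient table explicit, here is a minimal sketch that recomputes the slope's t value and a 95% confidence interval from the numbers printed in the summary. The variable names are illustrative, and the inputs are the rounded values from the output, so results may differ slightly from `confint(lm_salary)`.

```r
# Values copied from the summary(lm_salary) output above
slope_est <- 9450.0   # Estimate for YearsExperience
slope_se  <- 378.8    # Std. Error for YearsExperience
df_resid  <- 28       # residual degrees of freedom

# t value for the slope: estimate divided by its standard error
t_slope <- slope_est / slope_se

# 95% confidence interval for the slope, based on the t distribution
t_crit   <- qt(0.975, df = df_resid)
ci_slope <- slope_est + c(-1, 1) * t_crit * slope_se

round(t_slope, 2)
round(ci_slope)
```

The recomputed t value agrees with the 24.95 reported in the table, and the interval lies well above zero, which is consistent with the very small p-value.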
Residual Analysis
Residual vs Fitted plot
plot(fitted(lm_salary),resid(lm_salary), main="Residuals vs Fitted", xlab = "Fitted", ylab = "Residuals")
abline(0, 0)
From this plot, the residuals appear to have roughly constant variance across the fitted values, so there is no clear evidence of heteroscedasticity.
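A visual check can be complemented by a formal test. The sketch below is one possible way to do that: a studentized (Koenker) Breusch-Pagan test written with base R only. The helper name `bp_check` is hypothetical, and the example runs on R's built-in `cars` data as a stand-in so that the block is self-contained; it is not part of the original analysis.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test in base R.
# Idea: regress the squared residuals on the model's predictors;
# n * R^2 of that auxiliary regression is compared with a chi-square distribution.
bp_check <- function(model) {
  X   <- model.matrix(model)        # design matrix, including the intercept
  e2  <- resid(model)^2             # squared residuals
  aux <- lm.fit(X, e2)              # auxiliary regression
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1               # number of predictors (intercept excluded)
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Stand-in example on built-in data (not the salary data)
fit <- lm(dist ~ speed, data = cars)
bp_result <- bp_check(fit)
bp_result
```

For the model in this discussion, the call would be `bp_check(lm_salary)`. A large p-value would be consistent with the constant-variance reading of the plot, although with only 30 observations the test has limited power.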
QQ Plot
qqnorm(resid(lm_salary))
qqline(resid(lm_salary))
From the Q-Q plot, we can see that the residuals follow a nearly normal distribution.
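The Q-Q plot can likewise be backed up with a numerical check. A minimal sketch using the Shapiro-Wilk test from base R follows; it again uses the built-in `cars` data as a stand-in so that it runs on its own, and the object names are illustrative.

```r
# Stand-in model on R's built-in `cars` data (not the salary data)
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution
sw <- shapiro.test(resid(fit))
sw$statistic
sw$p.value
```

For the salary model, the equivalent call would be `shapiro.test(resid(lm_salary))`. A p-value above the chosen significance level would mean that normality of the residuals is not rejected, which would agree with the Q-Q plot.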
All diagnostic plots
par(mfrow=c(2,2))
plot(lm_salary)
From the residual analysis, the conditions for linear regression appear to be satisfied for this data, so it can reasonably be fitted with a linear model. The R-squared value is about 0.957; what counts as an acceptable R-squared depends heavily on the use case of the model. Based on these results, I believe that a simple linear regression model is appropriate for this data.