Harsha Kumar(S3752953), Soumya Hiremath(S3746319)
June 2, 2019
Simple linear regression is used to examine the relationship between two quantitative variables by predicting the value of a dependent variable y, assuming the predictor variable x provides information about it.
Steps involved are:
Import the dataset
Perform exploratory analysis
Check for outliers
Check correlation
Create a relationship model using the lm() function
Find the coefficients of the fitted model
Plot the relationship graph
Can years of experience of employees be used to predict the salary?
This can be achieved using simple linear regression, in which we predict salary (Salary) by establishing a statistically significant linear relationship with years of experience (YearsExperience), and using correlation to measure the strength of the linear relationship between the two variables.
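In standard notation, the model to be fitted takes the form below, where α is the intercept, β is the slope, and ε is a random error term:
Salary = α + β × YearsExperience + ε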
The dataset used is Open Data from Kaggle.com, chosen from the link below:
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression/version/1
The dataset contains 30 observations of two variables:
YearsExperience: continuous numerical values
Salary: continuous numerical values
The data is loaded and evaluated using the str() function to check the data types and values of the attributes. Missing values are often encountered when performing data analysis; in our dataset there are none.
#Importing the dataset
Salary <- read.csv("C:/Users/soumya/Desktop/is/Salary_Data.csv")
#Evaluating dataset using the str function
str(Salary)
## 'data.frame': 30 obs. of 2 variables:
## $ YearsExperience: num 1.1 1.3 1.5 2 2.2 2.9 3 3.2 3.2 3.7 ...
## $ Salary : num 39343 46205 37731 43525 39891 ...
#Checking for missing values
print(Salary$Salary[is.na(Salary$Salary)])
## numeric(0)
print(Salary$YearsExperience[is.na(Salary$YearsExperience)])
## numeric(0)
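As a quick alternative check (a minimal sketch using base R only), missing values can be counted for the whole data frame in one call:
#Count missing values in each column (both counts should be 0)
colSums(is.na(Salary))
#Single TRUE/FALSE check for any missing value
anyNA(Salary)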
The tables below show statistical summaries of the variables in the dataset. Box plots are used to visualize outliers for each variable. From the plots below we can see that no outliers are present.
#Load dplyr for the pipe and summarise verbs
library(dplyr)

Salary %>% summarise(Min = min(YearsExperience, na.rm = TRUE),
                     Q1 = quantile(YearsExperience, probs = .25, na.rm = TRUE),
                     Median = median(YearsExperience, na.rm = TRUE),
                     Q3 = quantile(YearsExperience, probs = .75, na.rm = TRUE),
                     Max = max(YearsExperience, na.rm = TRUE),
                     Mean = mean(YearsExperience, na.rm = TRUE),
                     SD = sd(YearsExperience, na.rm = TRUE),
                     n = n(),
                     Missing = sum(is.na(YearsExperience))) -> table1
knitr::kable(table1)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 1.1 | 3.2 | 4.7 | 7.7 | 10.5 | 5.313333 | 2.837888 | 30 | 0 |
Salary %>% summarise(Min = min(Salary, na.rm = TRUE),
                     Q1 = quantile(Salary, probs = .25, na.rm = TRUE),
                     Median = median(Salary, na.rm = TRUE),
                     Q3 = quantile(Salary, probs = .75, na.rm = TRUE),
                     Max = max(Salary, na.rm = TRUE),
                     Mean = mean(Salary, na.rm = TRUE),
                     SD = sd(Salary, na.rm = TRUE),
                     n = n(),
                     Missing = sum(is.na(Salary))) -> table2
knitr::kable(table2)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 37731 | 56720.75 | 65237 | 100544.8 | 122391 | 76003 | 27414.43 | 30 | 0 |
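As a quick cross-check (a minimal sketch using only base R), summary() reports the five-number summary and mean for both variables at once:
#Base R cross-check of both summary tables
summary(Salary)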
#Divide the graph area into 2 columns
par(mfrow=c(1, 2))
#Check outliers in YearsExperience
boxplot(Salary$YearsExperience, main="Experience Outliers",
        sub=paste("Outlier rows:", boxplot.stats(Salary$YearsExperience)$out))
#Check outliers in Salary
boxplot(Salary$Salary, main="Salary Outliers",
        sub=paste("Outlier rows:", boxplot.stats(Salary$Salary)$out))

A Pearson's correlation, r, was calculated to measure the strength of the linear relationship between YearsExperience and Salary. Correlation can take values between -1 and +1. The positive correlation was statistically significant, r = 0.978, p < .001, 95% CI [0.954, 0.989]. This means that as YearsExperience increases, Salary also increases. A scatter plot is used to visualize what the relationship between the two variables looks like.
# correlation test
cor.test(Salary$Salary, Salary$YearsExperience)
##
## Pearson's product-moment correlation
##
## data: Salary$Salary and Salary$YearsExperience
## t = 24.95, df = 28, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9542949 0.9897078
## sample estimates:
## cor
## 0.9782416

#Scatter plot with the predictor on the x-axis and the response on the y-axis
plot(Salary ~ YearsExperience, data = Salary, xlab = "Years of Experience", ylab = "Salary")
# Visualize correlation
pairs(Salary)

Simple linear regression is performed by checking assumptions, interpreting the important output and testing the statistical hypotheses of a linear regression model.
The first step is to plot the relationship between the x and y variables, YearsExperience and Salary, to determine whether linear regression is suitable. The data exhibit a positive linear trend, so we proceed to fit the linear regression model using the lm() function.
From the summary table, we can see that the R-squared value is 0.957, which is close to 1. There was statistically significant evidence that the data fit a linear regression model.
The model summary also reports an F statistic, which is used to test the overall regression model. The F-test for the linear regression has the following statistical hypotheses:
H0: The data do not fit the linear regression model
HA: The data fit the linear regression model
Under H0 (the data do not fit a linear model in the population), the F statistic follows an F distribution with df1 = 1 and df2 = n - 2 = 30 - 2 = 28. The reported value, F = 622.5, has a p-value far below the 0.05 level of significance, so we reject H0. There is a statistically significant positive relationship between YearsExperience and Salary; hence, the data fit the linear regression model.
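The reported p-value for the F test can be reproduced directly from the F distribution (a minimal sketch; pf() is base R and 622.5 is the F statistic from the model summary):
#Upper-tail probability of F = 622.5 on (1, 28) degrees of freedom
#(matches the slope t-test p-value of about 1.14e-20 reported below)
pf(622.5, df1 = 1, df2 = 28, lower.tail = FALSE)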
model1 <- lm(Salary ~ YearsExperience, data = Salary)
model1 %>% summary()
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.2 2273.1 11.35 5.51e-12 ***
## YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16

model1 %>% summary() %>% coef()
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.200 2273.0534 11.34694 5.511950e-12
## YearsExperience 9449.962 378.7546 24.95009 1.143068e-20

model1 %>% confint()
## 2.5 % 97.5 %
## (Intercept) 21136.061 30448.34
## YearsExperience 8674.119 10225.81

#Two-tailed p-value for the slope t statistic
2*pt(q = 24.95, df = 30 - 2, lower.tail = FALSE)
## [1] 1.143184e-20
To test the statistical significance of the intercept/constant, we set the following statistical hypotheses:
H0: α = 0
HA: α ≠ 0
The intercept/constant is reported as α = 25792.200, which represents the average Salary when YearsExperience is equal to 0. This hypothesis is tested using a t statistic, reported as t = 11.35, p < .001. The constant is statistically significant at the 0.05 level, meaning there is statistically significant evidence that the constant is not 0. R reports the 95% CI for α to be [21136.061, 30448.34]. The null value α = 0 is clearly not captured by this interval; hence H0 was rejected.
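The reported interval can be reproduced by hand from the estimate and its standard error (a sketch; qt() is base R, and the critical value uses n - 2 = 28 degrees of freedom):
#95% CI for the intercept: estimate ± t(0.975, 28) * standard error
25792.200 + c(-1, 1) * qt(0.975, df = 28) * 2273.0534
## [1] 21136.06 30448.34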
The hypothesis test of the slope, b, was as follows:
H0: β = 0
HA: β ≠ 0
The slope of the regression line was reported as b = 9449.962. The slope represents the average increase in Salary following a one-unit increase in YearsExperience; that is, each additional year of experience is associated with an average salary increase of about 9450. This is a positive change. The hypothesis is tested using a t statistic, reported as t = 24.95, p < .001. R reports the 95% CI for b to be [8674.119, 10225.81]. This 95% CI does not capture the null value β = 0, therefore H0 was rejected. There was statistically significant evidence that YearsExperience was positively related to Salary.
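With the fitted model, we can also generate predictions for new experience values (a sketch; predict() is base R, and the 5-year input is only an illustrative value, not one from the report):
#Predicted average Salary at 5 years of experience,
#with a 95% confidence interval for the mean response
predict(model1, newdata = data.frame(YearsExperience = 5),
        interval = "confidence")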
Before reporting the final regression model, we must validate all the following assumptions for linear regression.
Independence
Linearity
Normality of residuals
Homoscedasticity
Residuals vs Fitted:
Independence is checked through the research design: we must ensure that all measurements between observations are independent. Linearity is checked with this plot: if the relationship between the fitted values and the residuals is flat (look at the red line), this is a good indication of a linear relationship.
Normal Q-Q:
We check the normal Q-Q plot to determine whether there are any gross deviations from normality (e.g. obvious S shapes or non-linear trends). The plot suggests there are no major deviations from normality.
Scale-Location:
This plot is used to check homoscedasticity (the assumption of homogeneity of variance). Here the red line is close to flat and the variance in the square root of the standardised residuals is consistent across the predicted (fitted) values. Hence, the homoscedasticity assumption holds.
Residuals vs Leverage:
This plot is used to identify cases that might be unduly influencing the fit of the regression model, for example outliers. We look for values that fall beyond the red bands in the plot; these bands are based on Cook's distance. In the diagnostic plots below, no values fall outside the bands, and therefore there is no evidence of influential cases.
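The visual checks can be supplemented with formal tests (a sketch; shapiro.test() is base R, while bptest() assumes the lmtest package is installed):
#Shapiro-Wilk test for normality of the residuals (H0: residuals are normal)
shapiro.test(residuals(model1))
#Breusch-Pagan test for homoscedasticity (H0: constant error variance)
lmtest::bptest(model1)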
#Load ggplot2 for the regression plot
library(ggplot2)
#Scatter plot with the fitted regression line (predictor on the x-axis)
ggplot(Salary, aes(x=YearsExperience, y=Salary)) +
  geom_point(shape=19, colour="red") +
  geom_smooth(method='lm', formula=y~x) +
  labs(title="Salary and Years of Experience Regression") +
  labs(x="Years of Experience") +
  labs(y="Salary")

#Divide the graph area into a 2 x 2 grid for the diagnostic plots
par(mfrow=c(2, 2))
plot(model1)

The scatter plot demonstrated evidence of a positive linear relationship. Other non-linear trends were ruled out.
The overall regression model was statistically significant, F(1, 28) = 622.5, p < .001, and explained 95.7% of the variability in Salary, R-squared = 0.957.
Final inspection of the residuals supported normality and homoscedasticity.
Decision:
Overall model: Reject H0.
Intercept: Reject H0.
Slope: Reject H0.
Conclusion:
There was a statistically significant positive linear relationship between YearsExperience and Salary.
A linear regression model was fitted to predict the dependent variable, Salary, using measures of YearsExperience as a single predictor.
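Using the estimated coefficients reported above, the fitted regression equation is:
Salary = 25792.20 + 9449.96 × YearsExperience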
Strength:
In-depth model analysis was performed to support accurate prediction.
Limitations:
The analysis is limited to employees of a single firm.
The data are open data published by an individual and may contain data entry errors.
Future investigations:
Is there any salary difference among different firms? Can we create regression models for each firm?
Are there other attributes that affect Salary besides years of experience?