Harsha Kumar(S3752953), Soumya Hiremath(S3746319)
June 2, 2019
Simple linear regression is used to examine the relationship between two quantitative variables by predicting the value of a dependent variable y, assuming the predictor variable x provides information about it.
Steps involved are:
Import the dataset
Perform exploratory analysis
Check for outliers
Check correlation
Create a relationship model using the lm() function
Find the coefficients of the fitted model
Plot the relationship graph
Can years of experience of employees be used to predict the salary?
This can be achieved using simple linear regression, in which we predict salary (Salary) by establishing a statistically significant linear relationship with years of experience (YearsExperience), and using correlation to measure the strength of the linear relationship between the two variables.
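In standard notation, the model to be fitted takes the form below, where α is the intercept, β is the slope, and ε is a random error term:
Salary = α + β × YearsExperience + ε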
The dataset used is Open Data from Kaggle.com, chosen from the link below:
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression/version/1
The dataset contains 30 observations of two variables:
YearsExperience: continuous numerical values
Salary: continuous numerical values
The data is loaded and evaluated using the str() function to check the data types and values of the attributes. Missing values are often encountered when performing data analysis; in our dataset there are none.
#Importing the dataset
Salary <- read.csv("C:/Users/soumya/Desktop/is/Salary_Data.csv")
#Evaluating dataset using the str function
str(Salary)
## 'data.frame': 30 obs. of 2 variables:
## $ YearsExperience: num 1.1 1.3 1.5 2 2.2 2.9 3 3.2 3.2 3.7 ...
## $ Salary : num 39343 46205 37731 43525 39891 ...
#Checking for missing values
print(Salary$Salary[is.na(Salary$Salary)])
## numeric(0)
print(Salary$YearsExperience[is.na(Salary$YearsExperience)])
## numeric(0)
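As a quick alternative check (a minimal sketch using base R only), missing values can be counted for the whole data frame in one call:
#Count missing values in each column (both counts should be 0)
colSums(is.na(Salary))
#Single TRUE/FALSE check for any missing value
anyNA(Salary)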
The tables below show statistical summaries of the variables in the dataset. Box plots are used to visualize outliers for each variable. From the plots below we can see that no outliers are present.
#Load dplyr for the pipe and summarise verbs
library(dplyr)

Salary %>% summarise(Min = min(YearsExperience, na.rm = TRUE),
                     Q1 = quantile(YearsExperience, probs = .25, na.rm = TRUE),
                     Median = median(YearsExperience, na.rm = TRUE),
                     Q3 = quantile(YearsExperience, probs = .75, na.rm = TRUE),
                     Max = max(YearsExperience, na.rm = TRUE),
                     Mean = mean(YearsExperience, na.rm = TRUE),
                     SD = sd(YearsExperience, na.rm = TRUE),
                     n = n(),
                     Missing = sum(is.na(YearsExperience))) -> table1
knitr::kable(table1)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 1.1 | 3.2 | 4.7 | 7.7 | 10.5 | 5.313333 | 2.837888 | 30 | 0 |
Salary %>% summarise(Min = min(Salary, na.rm = TRUE),
                     Q1 = quantile(Salary, probs = .25, na.rm = TRUE),
                     Median = median(Salary, na.rm = TRUE),
                     Q3 = quantile(Salary, probs = .75, na.rm = TRUE),
                     Max = max(Salary, na.rm = TRUE),
                     Mean = mean(Salary, na.rm = TRUE),
                     SD = sd(Salary, na.rm = TRUE),
                     n = n(),
                     Missing = sum(is.na(Salary))) -> table2
knitr::kable(table2)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 37731 | 56720.75 | 65237 | 100544.8 | 122391 | 76003 | 27414.43 | 30 | 0 |
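As a quick cross-check (a minimal sketch using only base R), summary() reports the five-number summary and mean for both variables at once:
#Base R cross-check of both summary tables
summary(Salary)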
#Divide the graph area into 2 columns
par(mfrow=c(1, 2))
#Check outliers in YearsExperience
boxplot(Salary$YearsExperience, main="Experience Outliers",
        sub=paste("Outlier rows:", boxplot.stats(Salary$YearsExperience)$out))
#Check outliers in Salary
boxplot(Salary$Salary, main="Salary Outliers",
        sub=paste("Outlier rows:", boxplot.stats(Salary$Salary)$out))

A Pearson's correlation, r, was calculated to measure the strength of the linear relationship between YearsExperience and Salary. Correlation can take values between -1 and +1. The positive correlation was statistically significant, r = 0.978, p < .001, 95% CI [0.954, 0.989]. This means that as YearsExperience increases, Salary also increases. A scatter plot is used to visualize what the relationship between the two variables looks like.
# correlation test
cor.test(Salary$Salary, Salary$YearsExperience)
##
## Pearson's product-moment correlation
##
## data: Salary$Salary and Salary$YearsExperience
## t = 24.95, df = 28, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9542949 0.9897078
## sample estimates:
## cor
## 0.9782416

#Scatter plot with the predictor on the x-axis and the response on the y-axis
plot(Salary ~ YearsExperience, data = Salary, xlab = "Years of Experience", ylab = "Salary")
# Visualize correlation
pairs(Salary)

Simple linear regression is performed by checking assumptions, interpreting the important output and testing the statistical hypotheses of a linear regression model.
The first step is to plot the relationship between the x and y variables, YearsExperience and Salary, to determine whether linear regression is suitable. The data exhibit a positive linear trend, so we proceed to fit the linear regression model using the lm() function.
From the summary table, we can see that the R-squared value is 0.957, which is close to 1. There was statistically significant evidence that the data fit a linear regression model.
The model summary also reports an F statistic, which is used to test the overall regression model. The F-test for the linear regression has the following statistical hypotheses:
H0: The data do not fit the linear regression model
HA: The data fit the linear regression model
Under H0 (the data do not fit a linear model in the population), the F statistic follows an F distribution with df1 = 1 and df2 = n - 2 = 30 - 2 = 28. The reported value, F = 622.5, has a p-value far below the 0.05 level of significance, so we reject H0. There is a statistically significant positive relationship between YearsExperience and Salary; hence, the data fit the linear regression model.
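The reported p-value for the F test can be reproduced directly from the F distribution (a minimal sketch; pf() is base R and 622.5 is the F statistic from the model summary):
#Upper-tail probability of F = 622.5 on (1, 28) degrees of freedom
#(matches the slope t-test p-value of about 1.14e-20 reported below)
pf(622.5, df1 = 1, df2 = 28, lower.tail = FALSE)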
model1 <- lm(Salary ~ YearsExperience, data = Salary)
model1 %>% summary()
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.2 2273.1 11.35 5.51e-12 ***
## YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16

model1 %>% summary() %>% coef()
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.200 2273.0534 11.34694 5.511950e-12
## YearsExperience 9449.962 378.7546 24.95009 1.143068e-20

model1 %>% confint()
## 2.5 % 97.5 %
## (Intercept) 21136.061 30448.34
## YearsExperience 8674.119 10225.81

#Two-tailed p-value for the slope t statistic
2*pt(q = 24.95, df = 30 - 2, lower.tail = FALSE)
## [1] 1.143184e-20
To test the statistical significance of the intercept/constant, we set the following statistical hypotheses:
H0: α = 0
HA: α ≠ 0
The intercept/constant is reported as α = 25792.200, which represents the average Salary when YearsExperience is equal to 0. This hypothesis is tested using a t statistic, reported as t = 11.35, p < .001. The constant is statistically significant at the 0.05 level, meaning there is statistically significant evidence that the constant is not 0. R reports the 95% CI for α to be [21136.061, 30448.34]. The null value α = 0 is clearly not captured by this interval; hence H0 was rejected.
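The reported interval can be reproduced by hand from the estimate and its standard error (a sketch; qt() is base R, and the critical value uses n - 2 = 28 degrees of freedom):
#95% CI for the intercept: estimate ± t(0.975, 28) * standard error
25792.200 + c(-1, 1) * qt(0.975, df = 28) * 2273.0534
## [1] 21136.06 30448.34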
The hypothesis test of the slope, b, was as follows:
H0: β = 0
HA: β ≠ 0
The slope of the regression line was reported as b = 9449.962. The slope represents the average increase in Salary following a one-unit increase in YearsExperience; that is, each additional year of experience is associated with an average salary increase of about 9450. This is a positive change. The hypothesis is tested using a t statistic, reported as t = 24.95, p < .001. R reports the 95% CI for b to be [8674.119, 10225.81]. This 95% CI does not capture the null value β = 0, therefore H0 was rejected. There was statistically significant evidence that YearsExperience was positively related to Salary.
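With the fitted model, we can also generate predictions for new experience values (a sketch; predict() is base R, and the 5-year input is only an illustrative value, not one from the report):
#Predicted average Salary at 5 years of experience,
#with a 95% confidence interval for the mean response
predict(model1, newdata = data.frame(YearsExperience = 5),
        interval = "confidence")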
Before reporting the final regression model, we must validate all the following assumptions for linear regression.
Independence
Linearity
Normality of residuals
Homoscedasticity
Residuals vs Fitted:
Independence is checked through the research design: we must ensure that all measurements between observations are independent. Linearity is checked with this plot: if the relationship between the fitted values and the residuals is flat (look at the red line), this is a good indication of a linear relationship.
Normal Q-Q:
We check the normal Q-Q plot to determine whether there are any gross deviations from normality (e.g. obvious S shapes or non-linear trends). The plot suggests there are no major deviations from normality.
Scale-Location:
This plot is used to check homoscedasticity (the assumption of homogeneity of variance). Here the red line is close to flat and the variance in the square root of the standardised residuals is consistent across the predicted (fitted) values. Hence, the homoscedasticity assumption holds.
Residuals vs Leverage:
This plot is used to identify cases that might be unduly influencing the fit of the regression model, for example outliers. We look for values that fall beyond the red bands in the plot; these bands are based on Cook's distance. In the diagnostic plots below, no values fall outside the bands, and therefore there is no evidence of influential cases.
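The visual checks can be supplemented with formal tests (a sketch; shapiro.test() is base R, while bptest() assumes the lmtest package is installed):
#Shapiro-Wilk test for normality of the residuals (H0: residuals are normal)
shapiro.test(residuals(model1))
#Breusch-Pagan test for homoscedasticity (H0: constant error variance)
lmtest::bptest(model1)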
#Load ggplot2 for the regression plot
library(ggplot2)
#Scatter plot with the fitted regression line (predictor on the x-axis)
ggplot(Salary, aes(x=YearsExperience, y=Salary)) +
  geom_point(shape=19, colour="red") +
  geom_smooth(method='lm', formula=y~x) +
  labs(title="Salary and Years of Experience Regression") +
  labs(x="Years of Experience") +
  labs(y="Salary")

#Divide the graph area into a 2 x 2 grid for the diagnostic plots
par(mfrow=c(2, 2))
plot(model1)

The scatter plot demonstrated evidence of a positive linear relationship. Other non-linear trends were ruled out.
The overall regression model was statistically significant, F(1, 28) = 622.5, p < .001, and explained 95.7% of the variability in Salary, R-squared = 0.957.
Final inspection of the residuals supported normality and homoscedasticity.
Decision:
Overall model: Reject H0.
Intercept: Reject H0.
Slope: Reject H0.
Conclusion:
There was a statistically significant positive linear relationship between YearsExperience and Salary.
A linear regression model was fitted to predict the dependent variable, Salary, using measures of YearsExperience as a single predictor.
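Using the estimated coefficients reported above, the fitted regression equation is:
Salary = 25792.20 + 9449.96 × YearsExperience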
Strength:
In-depth model analysis was performed to support accurate prediction.
Limitations:
The analysis is limited to employees of a single firm.
The data are open data published by an individual and may contain data entry errors.
Future investigations:
Is there any salary difference among different firms? Can we create regression models for each firm?
Are there other attributes that affect Salary besides years of experience?