Abstract
In this project I analyze a data set from an engineering firm and test for a simple linear regression (SLR). The data, contained in the CIVILENG.csv file, gives the salaries of 100 executives at a large engineering firm along with 10 independent variables. A more in-depth description of this data can be found in the textbook, Statistics for Engineering and the Sciences, Sixth Edition.
The variables introduced in this data set are X1 through X10, which are outlined in the table above.
However, I will only be looking at the variable X1 and how it relates to the quantity of interest in the data, LNSalary (the natural log of salary).
library(ggplot2)
# Read in the executive salary data
d <- read.csv("CIVILENG.csv")
# Scatterplot of LNSalary against X1 (experience) with a loess smoother
g <- ggplot(d, aes(x = X1, y = LNSalary, color = X1)) + geom_point()
g <- g + geom_smooth(method = "loess")
g
## `geom_smooth()` using formula 'y ~ x'
The data was gathered to explain an executive engineer’s salary in terms of several independent variables.
My interest is in whether the data contain a linear relationship.
The question I will be researching in this project is whether there is a linear relationship between LNSalary and X1 (experience in years).
In order to show that this data follows a linear relationship of the form \[y_i=\beta_0+\beta_1x_i+\epsilon_i\] where \(\epsilon_i\) is the error term in the model, there are a few assumptions we will need to check.
The relationship between X1 and LNSalary must be approximately linear. This will be checked with trendscatter() by examining the trendline for the data (1).
The residuals should have an approximately normal distribution. This will be checked with a normal quantile plot and the Shapiro-Wilk test; if most of the points fall close to the line, the residuals are approximately normal (1).
The variance of the residuals should be roughly constant across the values of X1, and the residuals should have mean zero. This will be checked by plotting the residuals against the fitted values; if they are centered at zero and spread in roughly equal amounts above and below it, we can say that the mean is zero and the variance is constant.
Now that we know which assumptions must hold, we should check them for the data.
The following tests check the data against the previously mentioned assumptions required for the relationship to be linear.
To assess the rough linearity of our data we will make a scatterplot and add a trendline. This will give us an idea of whether the data is roughly linear and whether it is worth continuing.
library(s20x)
# Scatterplot of LNSalary against experience with a smoothed trendline
trendscatter(d$X1, d$LNSalary, xlab = "Experience", ylab = "LNSalary")
The trendline indicates that as experience increases so does LNSalary, which is consistent with a linear relationship with positive slope. We should continue with the tests to confirm this.
For SLR the errors involved should have a distribution as follows.
\[\epsilon_i \sim N(0,\sigma^2)\]
To check that the errors are normally distributed we will use the Shapiro-Wilk test. This test has a null hypothesis that the data are normally distributed. Thus, if the output p-value is larger than \(0.05\), we fail to reject normality at the 5% significance level.
library(s20x)
# Fit the simple linear regression of LNSalary on X1
d.lm <- with(d, lm(LNSalary ~ X1))
d.lm
##
## Call:
## lm(formula = LNSalary ~ X1)
##
## Coefficients:
## (Intercept) X1
## 11.09089 0.02784
# Normal quantile plot of the residuals plus the Shapiro-Wilk test
normcheck(d.lm, shapiro.wilk = TRUE)
The points do mainly fall on the line, and because our \(P_{value}=0.166>0.05\) we fail to reject the null hypothesis and conclude that the errors are approximately normally distributed (2).
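As a quick cross-check (a sketch, not part of the original output), base R’s shapiro.test() can be applied directly to the residuals of the fitted model; it should report essentially the same p-value that normcheck() displays.
# Cross-check: Shapiro-Wilk test applied directly to the model residuals
shapiro.test(residuals(d.lm))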
This check is more a matter of eyeballing the data to see whether the residuals are spread with roughly equal variance on both sides of the mean when plotted against the fitted values.
# Extract residuals and fitted values from the linear model
Salary.res <- residuals(d.lm)
Salary.fit <- fitted(d.lm)
# Plot the residuals against the fitted values, with and without a trendline
plot(Salary.fit, Salary.res)
trendscatter(Salary.fit, Salary.res)
plot(d.lm, which = 1)
As can be seen, the residuals are distributed fairly evenly on either side of the mean value (red line). Thus, we can say that the errors have a constant variance.
To test whether the mean of the residuals is zero we will first have R calculate the mean of the residuals and then run a one-sample t-test, at the 5% significance level, of the null hypothesis that the mean is zero.
mean(Salary.res)
## [1] 4.464338e-18
# One-sample t-test of the null hypothesis that the mean of the residuals is zero
t.test(Salary.res, mu = 0)
##
## One Sample t-test
##
## data: Salary.res
## t = 2.7837e-16, df = 99, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.0318215 0.0318215
## sample estimates:
## mean of x
## 4.464338e-18
Therefore, according to the t-test, we fail to reject the null hypothesis that the mean of the errors is zero.
To show the independence of the data we must rely on how the data was obtained. First, the original problem states that each variable is independent. In addition, the X1 and LNSalary observations are independent of one another because each data value refers to a different engineering executive.
The object d.lm is the fitted linear model for our data. Now that we have confirmed that the errors in the linear model are independent and approximately normal with mean zero and constant variance, we can reasonably say that X1 and LNSalary have a linear relationship. Next, we need to know what form this relationship takes. Using the command summary() we can see the specifics of the linear model that best fits our data. This command also runs hypothesis tests on \(\beta_1\) and \(\beta_0\) with the null hypotheses that each equals 0.
summary(d.lm)
##
## Call:
## lm(formula = LNSalary ~ X1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51010 -0.08148 0.01533 0.09007 0.34663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.090887 0.033055 335.52 <2e-16 ***
## X1 0.027839 0.002206 12.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1612 on 98 degrees of freedom
## Multiple R-squared: 0.619, Adjusted R-squared: 0.6151
## F-statistic: 159.2 on 1 and 98 DF, p-value: < 2.2e-16
Based on the summary, \(\hat \beta_0=11.09\) and \(\hat \beta_1=0.028\), and because the p-values are so small we reject the null hypotheses that the coefficients equal zero.
Just to recap what we have covered in the project so far: we started with the assumption that our data follows an SLR. Thus, the equation for this relationship was
\[y_i=\beta_0 + \beta_1x_i + \epsilon_i\]
Where, we assumed
\[\epsilon_i \sim N(0,\sigma^2)\]
We checked normality using the Shapiro-Wilk test, constant variance by plotting the residuals vs. the fitted values, mean zero with a t-test, and independence by considering how the data was collected.
Then, because the errors have mean zero, the mean of our response can be described by
\[E(y_i)=\beta_0 + \beta_1x_i\]
Then, using summary() we found estimates of \(\beta_0\) and \(\beta_1\) called \(\hat \beta_0\) and \(\hat \beta_1\), giving us the final approximation of our equation
\[\hat y_i=11.09+0.028x_i\]
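To illustrate how this fitted equation is used, the sketch below (not part of the original analysis) predicts LNSalary for a hypothetical executive with 10 years of experience; predict() performs the same \(\hat \beta_0+\hat \beta_1 x\) calculation as the hand computation underneath it.
# Predicted LNSalary for a hypothetical executive with X1 = 10 years of experience
predict(d.lm, newdata = data.frame(X1 = 10))
# Equivalent hand calculation from the rounded fitted coefficients
11.09 + 0.028 * 10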
One output value from summary() was the multiple \(R^2\) value.
\[Mult.\ R^2=r^2=\frac{SS_{yy}-SSE}{SS_{yy}}=1-\frac{SSE}{SS_{yy}}=0.619\]
This value means that the straight line model that we developed can explain \(61.9 \%\) of the variation present in the sample of LNSalary values.
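As a sanity check on this interpretation (a sketch, not part of the original output), the Multiple \(R^2\) can be recomputed directly from the residuals and the LNSalary values using the formula above.
# Recompute R^2 = 1 - SSE/SSyy from the fitted model
SSE <- sum(residuals(d.lm)^2)
SSyy <- sum((d$LNSalary - mean(d$LNSalary))^2)
1 - SSE/SSyy # should match the Multiple R-squared of 0.619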
The point estimates \(\hat \beta_0\) and \(\hat \beta_1\) mean that our linear equation to describe the data will have an approximate y-intercept of 11.09 and a slope of 0.028.
In order to understand how well our parameter estimates explain the data, we need to form confidence intervals for them. This matters because even a small error in our estimate of \(\beta_1\) (slope) can translate into a large error in our estimate of \(\beta_0\) (intercept). The formulas for \(\hat \beta_1\) and \(\hat \beta_0\) are
\[\hat \beta_1=\frac{SS_{xy}}{SS_{xx}}=\frac{\sum_{i=1}^n (x_i-\bar x)y_i}{\sum_{i=1}^n (x_i-\bar x)x_i}\]
and
\[\hat \beta_0=\bar y-\hat \beta_1 \bar x\] To calculate a confidence interval for both \(\hat \beta_1\) and \(\hat \beta_0\), all we need to know is that a confidence interval is the point estimate plus or minus the critical t-value (here \(t_{0.025,\,98}\approx 1.98\)) times the standard error, both of which are given by summary(). Thus, the 95% confidence interval for \(\beta_1\) is approximately \((0.023,\ 0.032)\) and the 95% confidence interval for \(\beta_0\) is approximately \((11.025,\ 11.156)\).
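A minimal sketch of this calculation in R (not part of the original output): the intervals can be built by hand with qt(), or obtained directly with confint(), and the two approaches should agree.
# 95% confidence intervals for the coefficients, built by hand
est <- coef(summary(d.lm)) # estimates and standard errors
tcrit <- qt(0.975, df = df.residual(d.lm))
est[, "Estimate"] - tcrit * est[, "Std. Error"] # lower bounds
est[, "Estimate"] + tcrit * est[, "Std. Error"] # upper bounds
# The same intervals directly
confint(d.lm)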
# Flag points whose Cook's distance exceeds 4/n
n <- length(d$X1)
cd <- cooks.distance(d.lm)
mycol <- ifelse(cd > 4/n, "Red", "Black")
plot(cd, col = mycol)
If a point’s Cook’s distance is larger than \(\frac{4}{n}\) then the point is said to be an outlier with a strong influence on the fitted values (3). Thus, following this rule there are seven points in my data that are outliers.
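A short sketch (not part of the original output) to verify that count under the \(\frac{4}{n}\) cutoff described above:
# Number and positions of points whose Cook's distance exceeds 4/n
sum(cd > 4/n)
which(cd > 4/n)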
According to all of the tests that have been run and the results that have come back, I would say the variables X1 and LNSalary do share a linear relationship even given the low Multiple \(R^2\) value, which is likely due to the outlier points.
One improvement would be to investigate what caused the outliers, whether they simply reflect natural variation in the data or were entered incorrectly.
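One way to gauge how much those points matter (a sketch, not something done in the original analysis) is to refit the model without the observations flagged by the \(\frac{4}{n}\) rule and compare the coefficients to the original fit.
# Refit the model excluding points flagged by the Cook's distance rule
d.lm2 <- lm(LNSalary ~ X1, data = d[cd <= 4/n, ])
coef(d.lm) # original fit
coef(d.lm2) # fit without the flagged points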
“9.2.3 - Assumptions for the SLR Model: Stat 500.” PennState: Statistics Online Courses, https://online.stat.psu.edu/stat500/lesson/9/9.2/9.2.3.
“Is This Normal? Shapiro-Wilk Test in R to the Rescue.” ProgrammingR, https://www.programmingr.com/shapiro-wilk-test-in-r/.
Zach. “How to Identify Influential Data Points Using Cook’s Distance.” Statology, 30 Nov. 2020, https://www.statology.org/how-to-identify-influential-data-points-using-cooks-distance/.