Abstract
In this project I analyze a data set from an engineering firm and test for a simple linear regression (SLR). The data, contained in the CIVILENG.csv file, gives the salaries of 100 executives at a large engineering firm along with 10 independent variables. A more in-depth description of this data can be found in the textbook, Statistics for Engineering and the Sciences, Sixth Edition.
The variables introduced in this data set are X1 through X10, which are outlined in the table above.
However, I will only be looking at the variable X1 and how it relates to the quantity of interest in the data, LNSalary (the natural log of salary).
library(ggplot2)
# Read in the executive salary data
d <- read.csv("CIVILENG.csv")
# Scatterplot of LNSalary against X1 (experience) with a loess smoother
g <- ggplot(d, aes(x = X1, y = LNSalary, color = X1)) + geom_point()
g <- g + geom_smooth(method = "loess")
g
## `geom_smooth()` using formula 'y ~ x'
The data was gathered to explain an executive engineer’s salary in terms of several independent variables.
My interest is in whether the data contain a linear relationship.
The question I will be researching in this project is whether there is a linear relationship between LNSalary and X1 (experience in years).
In order to show that this data follows a linear relationship of the form \[y_i=\beta_0+\beta_1x_i+\epsilon_i\] where \(\epsilon_i\) is the error term in the model, there are a few assumptions we will need to check.
The relationship between X1 and LNSalary must be approximately linear. This will be checked with trendscatter() by examining the trendline for the data (1).
The residuals should have an approximately normal distribution. This will be checked with a normal quantile plot and the Shapiro-Wilk test; if most of the points fall close to the line, the residuals are approximately normal (1).
The variance of the residuals should be roughly constant across the values of X1, and the residuals should have mean zero. This will be checked by plotting the residuals against the fitted values; if they are centered at zero and spread in roughly equal amounts above and below it, we can say that the mean is zero and the variance is constant.
Now that we know which assumptions must hold, we should check them for the data.
The following tests check the data against the previously mentioned assumptions required for the relationship to be linear.
To assess the rough linearity of our data we will make a scatterplot and add a trendline. This will give us an idea of whether the data is roughly linear and whether it is worth continuing.
library(s20x)
# Scatterplot of LNSalary against experience with a smoothed trendline
trendscatter(d$X1, d$LNSalary, xlab = "Experience", ylab = "LNSalary")
The trendline indicates that as experience increases so does LNSalary, which is consistent with a linear relationship with positive slope. We should continue with the tests to confirm this.
For SLR the errors involved should have a distribution as follows.
\[\epsilon_i \sim N(0,\sigma^2)\]
To check that the errors are normally distributed we will use the Shapiro-Wilk test. This test has a null hypothesis that the data are normally distributed. Thus, if the output p-value is larger than \(0.05\), we fail to reject normality at the 5% significance level.
library(s20x)
# Fit the simple linear regression of LNSalary on X1
d.lm <- with(d, lm(LNSalary ~ X1))
d.lm
##
## Call:
## lm(formula = LNSalary ~ X1)
##
## Coefficients:
## (Intercept) X1
## 11.09089 0.02784
# Normal quantile plot of the residuals plus the Shapiro-Wilk test
normcheck(d.lm, shapiro.wilk = TRUE)
The points do mainly fall on the line, and because our \(P_{value}=0.166>0.05\) we fail to reject the null hypothesis and conclude that the errors are approximately normally distributed (2).
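As a quick cross-check (a sketch, not part of the original output), base R’s shapiro.test() can be applied directly to the residuals of the fitted model; it should report essentially the same p-value that normcheck() displays.
# Cross-check: Shapiro-Wilk test applied directly to the model residuals
shapiro.test(residuals(d.lm))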
This check is more a matter of eyeballing the data to see whether the residuals are spread with roughly equal variance on both sides of the mean when plotted against the fitted values.
# Extract residuals and fitted values from the linear model
Salary.res <- residuals(d.lm)
Salary.fit <- fitted(d.lm)
# Plot the residuals against the fitted values, with and without a trendline
plot(Salary.fit, Salary.res)
trendscatter(Salary.fit, Salary.res)
plot(d.lm, which = 1)
As can be seen, the residuals are distributed fairly evenly on either side of the mean value (red line). Thus, we can say that the errors have a constant variance.
To test whether the mean of the residuals is zero we will first have R calculate the mean of the residuals and then run a one-sample t-test, at the 5% significance level, of the null hypothesis that the mean is zero.
mean(Salary.res)
## [1] 4.464338e-18
# One-sample t-test of the null hypothesis that the mean of the residuals is zero
t.test(Salary.res, mu = 0)
##
## One Sample t-test
##
## data: Salary.res
## t = 2.7837e-16, df = 99, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.0318215 0.0318215
## sample estimates:
## mean of x
## 4.464338e-18
Therefore, according to the t-test, we fail to reject the null hypothesis that the mean of the errors is zero.
To show the independence of the data we must rely on how the data was obtained. First, the original problem states that each variable is independent. In addition, the X1 and LNSalary observations are independent of one another because each data value refers to a different engineering executive.
The object d.lm is the fitted linear model for our data. Now that we have confirmed that the errors in the linear model are independent and approximately normal with mean zero and constant variance, we can reasonably say that X1 and LNSalary have a linear relationship. Next, we need to know what form this relationship takes. Using the command summary() we can see the specifics of the linear model that best fits our data. This command also runs hypothesis tests on \(\beta_1\) and \(\beta_0\) with the null hypotheses that each equals 0.
summary(d.lm)
##
## Call:
## lm(formula = LNSalary ~ X1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51010 -0.08148 0.01533 0.09007 0.34663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.090887 0.033055 335.52 <2e-16 ***
## X1 0.027839 0.002206 12.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1612 on 98 degrees of freedom
## Multiple R-squared: 0.619, Adjusted R-squared: 0.6151
## F-statistic: 159.2 on 1 and 98 DF, p-value: < 2.2e-16
Based on the summary, \(\hat \beta_0=11.09\) and \(\hat \beta_1=0.028\), and because the p-values are so small we reject the null hypotheses that the coefficients equal zero.
Just to recap what we have covered in the project so far: we started with the assumption that our data follows an SLR. Thus, the equation for this relationship was
\[y_i=\beta_0 + \beta_1x_i + \epsilon_i\]
Where, we assumed
\[\epsilon_i \sim N(0,\sigma^2)\]
We checked normality using the Shapiro-Wilk test, constant variance by plotting the residuals vs. the fitted values, mean zero with a t-test, and independence by considering how the data was collected.
Then, because the errors have mean zero, the mean of our response can be described by
\[E(y_i)=\beta_0 + \beta_1x_i\]
Then, using summary() we found estimates of \(\beta_0\) and \(\beta_1\) called \(\hat \beta_0\) and \(\hat \beta_1\), giving us the final approximation of our equation
\[\hat y_i=11.09+0.028x_i\]
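To illustrate how this fitted equation is used, the sketch below (not part of the original analysis) predicts LNSalary for a hypothetical executive with 10 years of experience; predict() performs the same \(\hat \beta_0+\hat \beta_1 x\) calculation as the hand computation underneath it.
# Predicted LNSalary for a hypothetical executive with X1 = 10 years of experience
predict(d.lm, newdata = data.frame(X1 = 10))
# Equivalent hand calculation from the rounded fitted coefficients
11.09 + 0.028 * 10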
One output value from summary() was the multiple \(R^2\) value.
\[Mult.\ R^2=r^2=\frac{SS_{yy}-SSE}{SS_{yy}}=1-\frac{SSE}{SS_{yy}}=0.619\]
This value means that the straight line model that we developed can explain \(61.9 \%\) of the variation present in the sample of LNSalary values.
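As a sanity check on this interpretation (a sketch, not part of the original output), the Multiple \(R^2\) can be recomputed directly from the residuals and the LNSalary values using the formula above.
# Recompute R^2 = 1 - SSE/SSyy from the fitted model
SSE <- sum(residuals(d.lm)^2)
SSyy <- sum((d$LNSalary - mean(d$LNSalary))^2)
1 - SSE/SSyy # should match the Multiple R-squared of 0.619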
The point estimates \(\hat \beta_0\) and \(\hat \beta_1\) mean that our linear equation to describe the data will have an approximate y-intercept of 11.09 and a slope of 0.028.
In order to understand how well our parameter estimates explain the data, we need to form confidence intervals for them. This matters because even a small error in our estimate of \(\beta_1\) (slope) can translate into a large error in our estimate of \(\beta_0\) (intercept). The formulas for \(\hat \beta_1\) and \(\hat \beta_0\) are
\[\hat \beta_1=\frac{SS_{xy}}{SS_{xx}}=\frac{\sum_{i=1}^n (x_i-\bar x)y_i}{\sum_{i=1}^n (x_i-\bar x)x_i}\]
and
\[\hat \beta_0=\bar y-\hat \beta_1 \bar x\] To calculate a confidence interval for both \(\hat \beta_1\) and \(\hat \beta_0\), all we need to know is that a confidence interval is the point estimate plus or minus the critical t-value (here \(t_{0.025,\,98}\approx 1.98\)) times the standard error, both of which are given by summary(). Thus, the 95% confidence interval for \(\beta_1\) is approximately \((0.023,\ 0.032)\) and the 95% confidence interval for \(\beta_0\) is approximately \((11.025,\ 11.156)\).
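A minimal sketch of this calculation in R (not part of the original output): the intervals can be built by hand with qt(), or obtained directly with confint(), and the two approaches should agree.
# 95% confidence intervals for the coefficients, built by hand
est <- coef(summary(d.lm)) # estimates and standard errors
tcrit <- qt(0.975, df = df.residual(d.lm))
est[, "Estimate"] - tcrit * est[, "Std. Error"] # lower bounds
est[, "Estimate"] + tcrit * est[, "Std. Error"] # upper bounds
# The same intervals directly
confint(d.lm)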
# Flag points whose Cook's distance exceeds 4/n
n <- length(d$X1)
cd <- cooks.distance(d.lm)
mycol <- ifelse(cd > 4/n, "Red", "Black")
plot(cd, col = mycol)
If a point’s Cook’s distance is larger than \(\frac{4}{n}\) then the point is said to be an outlier with a strong influence on the fitted values (3). Thus, following this rule there are seven points in my data that are outliers.
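A short sketch (not part of the original output) to verify that count under the \(\frac{4}{n}\) cutoff described above:
# Number and positions of points whose Cook's distance exceeds 4/n
sum(cd > 4/n)
which(cd > 4/n)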
According to all of the tests that have been run and the results that have come back, I would say the variables X1 and LNSalary do share a linear relationship even given the low Multiple \(R^2\) value, which is likely due to the outlier points.
One improvement would be to investigate what caused the outliers, whether they simply reflect natural variation in the data or were entered incorrectly.
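One way to gauge how much those points matter (a sketch, not something done in the original analysis) is to refit the model without the observations flagged by the \(\frac{4}{n}\) rule and compare the coefficients to the original fit.
# Refit the model excluding points flagged by the Cook's distance rule
d.lm2 <- lm(LNSalary ~ X1, data = d[cd <= 4/n, ])
coef(d.lm) # original fit
coef(d.lm2) # fit without the flagged points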
“9.2.3 - Assumptions for the SLR Model: Stat 500.” PennState: Statistics Online Courses, https://online.stat.psu.edu/stat500/lesson/9/9.2/9.2.3.
“Is This Normal? Shapiro-Wilk Test in R to the Rescue.” ProgrammingR, https://www.programmingr.com/shapiro-wilk-test-in-r/.
Zach. “How to Identify Influential Data Points Using Cook’s Distance.” Statology, 30 Nov. 2020, https://www.statology.org/how-to-identify-influential-data-points-using-cooks-distance/.