Pick any two quantitative variables from any data set that interests you. If you are at a loss, look at the R datasets and find one. Then conduct both correlation and simple regression analysis. Interpret the residuals. Did the assumptions hold?

Data is from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset. I want to see whether there is a correlation between Age and Hourly Rate, and whether Age can be used to predict Hourly Rate (in USD).

First, load the data.

setwd("C:/Users/hrall/OneDrive/Documents/School/Spring 2022/ADEC5310.02/Homework")
jobsatisfaction <- read.csv("InputData_HayleeAllen.csv", header=TRUE)
attach(jobsatisfaction)

Next, plot the data, set up a linear model, add a line of best fit to the graph, and calculate the correlation coefficient.

plot(Age,HourlyRate,xlab="Age", ylab = "Hourly Rate")
m1 <-lm(HourlyRate ~ Age)
abline(m1)

cor(Age, HourlyRate)
## [1] 0.02428654
summary(m1)
## 
## Call:
## lm(formula = HourlyRate ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.868 -17.517   0.064  17.483  35.078 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.89557    2.20855  28.931   <2e-16 ***
## Age          0.05405    0.05806   0.931    0.352    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.33 on 1468 degrees of freedom
## Multiple R-squared:  0.0005898,  Adjusted R-squared:  -9.096e-05 
## F-statistic: 0.8664 on 1 and 1468 DF,  p-value: 0.3521

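Per the data, there is essentially no linear relationship between the two variables:
1. The correlation coefficient (about 0.02) is close to 0.
2. Per the graph, the line of best fit is almost flat. The slope is about 0.05 and is not statistically significant (p = 0.352), and the Multiple R-squared shows that Age explains less than 0.1% of the variance in Hourly Rate.
3. The data is widely scattered and does not appear to follow a curve either.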
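
The link between the correlation coefficient and the regression output can be checked directly. The sketch below uses simulated stand-in data (the variables `age` and `rate` are hypothetical, not the IBM data) to illustrate that, in a simple regression, the squared correlation equals the Multiple R-squared and `cor.test()` reports the same p-value as the t-test on the slope.

```r
# Simulated stand-in data (hypothetical; not the IBM HR data set)
set.seed(1)
age  <- sample(18:60, 200, replace = TRUE)
rate <- runif(200, min = 30, max = 100)   # generated independently of age

fit <- lm(rate ~ age)
r   <- cor(age, rate)
ct  <- cor.test(age, rate)                # Pearson test of H0: correlation = 0

# The squared correlation matches the model's R-squared
r_squared_match <- all.equal(r^2, summary(fit)$r.squared)

# The correlation test and the slope t-test give the same p-value
slope_p <- summary(fit)$coefficients["age", "Pr(>|t|)"]
p_match <- all.equal(unname(ct$p.value), slope_p)

print(c(r = r, r_squared = r^2, p_value = slope_p))
```

With the real data, the same comparison could be made with `cor.test(Age, HourlyRate)` and `summary(m1)`.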

To look for a stronger relationship, I am going to plot four other pairs of variables and calculate the correlation coefficient for each pair.

par(mfrow=c(2,2))
plot(Age,NumCompaniesWorked,xlab="Age", ylab = "Number of Companies Worked") 
plot(HourlyRate,JobSatisfaction,xlab="Hourly Rate", ylab = "Job Satisfaction")
plot(Age,TotalWorkingYears,xlab="Age", ylab = "Total Working Years")
plot(PercentSalaryHike,JobSatisfaction,xlab="Percent Salary Hike", ylab = "Job Satisfaction")

cor(Age, NumCompaniesWorked)
## [1] 0.2996348
cor(HourlyRate, JobSatisfaction)
## [1] -0.07133462
cor(Age, TotalWorkingYears)
## [1] 0.6803805
cor(PercentSalaryHike, JobSatisfaction)
## [1] 0.02000204
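
Several pairwise correlations can also be computed in one call by passing a data frame to `cor()`. This is a minimal sketch using a small simulated data frame: the column names mirror the ones above, but the values are made up for illustration.

```r
# Simulated stand-in data frame (hypothetical values; column names mirror the data set)
set.seed(2)
n <- 100
demo <- data.frame(Age = sample(18:60, n, replace = TRUE))
demo$TotalWorkingYears  <- pmax(0, round(0.6 * demo$Age - 10 + rnorm(n, sd = 5)))
demo$NumCompaniesWorked <- sample(0:9, n, replace = TRUE)

# cor() on a data frame returns the full matrix of pairwise Pearson correlations
cor_matrix <- round(cor(demo), 2)
print(cor_matrix)
```

Each off-diagonal entry is the correlation for one pair of columns, so the strongest pair can be read off the matrix directly.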

Based on the plots and the correlation values, I want to explore the third plot, Age vs. Total Working Years, which has the strongest correlation (about 0.68).

cor(Age, TotalWorkingYears)
## [1] 0.6803805
par(mfrow=c(1,1))
plot(Age,TotalWorkingYears,xlab="Age", ylab = "Total Working Years")
m2 <-lm( TotalWorkingYears ~ Age)
abline(m2)

summary(m2)
## 
## Call:
## lm(formula = TotalWorkingYears ~ Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4931  -3.8468  -0.0518   3.4712  16.5069 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.11753    0.61966  -16.33   <2e-16 ***
## Age           0.57949    0.01629   35.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.704 on 1468 degrees of freedom
## Multiple R-squared:  0.4629, Adjusted R-squared:  0.4626 
## F-statistic:  1265 on 1 and 1468 DF,  p-value: < 2.2e-16

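The correlation coefficient is approximately 0.68, which indicates a moderately strong positive linear relationship. Visually, the data has a clear direction: as age increases, so does the total number of working years. The slope estimates that each additional year of age is associated with about 0.58 more working years. It is also clear that the spread of the points around the line grows as age increases, which suggests that the constant-variance assumption for the residuals may not hold. The Multiple R-squared (0.4629) is the proportion of the variance in Total Working Years that is explained by the model; because there is only one predictor in this model, Multiple R-squared is nearly the same as Adjusted R-squared (0.4626) and equals the square of the correlation coefficient. Lastly, the p-value for Age is very low (< 2e-16), meaning that Age is a statistically significant predictor of Total Working Years.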
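
The residual checks could be sketched as follows. The data here are simulated so that the noise grows with `age`, an assumption chosen to mimic the growing spread noted above; `age` and `years` are hypothetical stand-ins, not the IBM data.

```r
# Simulated stand-in data whose noise grows with age (hypothetical)
set.seed(3)
age   <- sample(18:60, 300, replace = TRUE)
years <- -10 + 0.58 * age + rnorm(300, sd = 0.15 * age)

fit <- lm(years ~ age)
res <- resid(fit)

par(mfrow = c(1, 2))
# Residuals vs fitted: a funnel shape suggests non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res)
qqline(res)
par(mfrow = c(1, 1))

# Rough numeric check: does the size of the residuals rise with age?
spread_cor <- cor(age, abs(res))
print(spread_cor)
```

With the real data, the same plots could be drawn from `fitted(m2)` and `resid(m2)`. A clearly positive correlation between the predictor and the absolute residuals is one informal sign that the constant-variance assumption is violated.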