Pick any two quantitative variables from any data set that interests you. If you are at a loss, look at the R datasets and find one. Then conduct both correlation and simple regression analysis. Interpret the residuals. Did the assumptions hold?
The data is from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset. I want to see whether there is a correlation between Age and Hourly Rate, and whether Age can be used to predict Hourly Rate (in USD).
First, load the data.
setwd("C:/Users/hrall/OneDrive/Documents/School/Spring 2022/ADEC5310.02/Homework")
jobsatisfaction <- read.csv("InputData_HayleeAllen.csv", header=TRUE)
attach(jobsatisfaction)
Next, plot the data, set up a linear model, add a line of best fit to the graph, and calculate the correlation coefficient.
plot(Age,HourlyRate,xlab="Age", ylab = "Hourly Rate")
m1 <-lm(HourlyRate ~ Age)
abline(m1)
cor(Age, HourlyRate)
## [1] 0.02428654
summary(m1)
##
## Call:
## lm(formula = HourlyRate ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.868 -17.517 0.064 17.483 35.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.89557 2.20855 28.931 <2e-16 ***
## Age 0.05405 0.05806 0.931 0.352
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.33 on 1468 degrees of freedom
## Multiple R-squared: 0.0005898, Adjusted R-squared: -9.096e-05
## F-statistic: 0.8664 on 1 and 1468 DF, p-value: 0.3521
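Per the data, there does not appear to be a meaningful linear relationship between the two variables:
1. The correlation coefficient (0.024) is close to 0, and the R-squared (0.0006) shows that Age explains essentially none of the variation in Hourly Rate.
2. Per the graph, the line of best fit is almost flat. The estimated slope is only 0.054 USD per year of age and is not statistically significant (p = 0.352).
3. The data is scattered and does not follow any visible pattern, whether a straight line or a curve.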
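As a quick sanity check, the significance of the correlation can be recomputed by hand from the values reported above. This is only a sketch: it assumes a sample size of n = 1470 (the 1468 residual degrees of freedom plus the 2 estimated coefficients), and the names r, n, t_stat, and p_value are introduced here for illustration.

```r
# Values copied from the output above
r <- 0.02428654                 # cor(Age, HourlyRate)
n <- 1468 + 2                   # assumed sample size: residual df + 2 coefficients

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 3)
round(p_value, 3)
```

These should agree with the t value and p-value reported for Age in summary(m1) (0.931 and 0.352), because in a simple regression the test of the slope and the test of the correlation are equivalent.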
To look for a stronger relationship, I am going to plot four other pairs of variables and calculate the correlation coefficient for each pair.
par(mfrow=c(2,2))
plot(Age, NumCompaniesWorked, xlab = "Age", ylab = "Number of Companies Worked")
plot(HourlyRate, JobSatisfaction, xlab = "Hourly Rate", ylab = "Job Satisfaction")
plot(Age, TotalWorkingYears, xlab = "Age", ylab = "Total Working Years")
plot(PercentSalaryHike, JobSatisfaction, xlab = "Percent Salary Hike", ylab = "Job Satisfaction")
cor(Age, NumCompaniesWorked)
## [1] 0.2996348
cor(HourlyRate, JobSatisfaction)
## [1] -0.07133462
cor(Age, TotalWorkingYears)
## [1] 0.6803805
cor(PercentSalaryHike, JobSatisfaction)
## [1] 0.02000204
Based on the plots and the correlation values, I want to explore graph #3, Age vs. Total Working Years, in more detail.
cor(Age, TotalWorkingYears)
## [1] 0.6803805
par(mfrow=c(1,1))
plot(Age,TotalWorkingYears,xlab="Age", ylab = "Total Working Years")
m2 <-lm( TotalWorkingYears ~ Age)
abline(m2)
summary(m2)
##
## Call:
## lm(formula = TotalWorkingYears ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4931 -3.8468 -0.0518 3.4712 16.5069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.11753 0.61966 -16.33 <2e-16 ***
## Age 0.57949 0.01629 35.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.704 on 1468 degrees of freedom
## Multiple R-squared: 0.4629, Adjusted R-squared: 0.4626
## F-statistic: 1265 on 1 and 1468 DF, p-value: < 2.2e-16
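The two R-squared values in the summary above can be reproduced from the correlation coefficient alone. The sketch below assumes n = 1470 observations (1468 residual degrees of freedom plus 2 coefficients) and one predictor; the variable names are my own.

```r
# Values copied from the output above
r <- 0.6803805                  # cor(Age, TotalWorkingYears)
n <- 1468 + 2                   # assumed sample size: residual df + 2 coefficients
k <- 1                          # number of predictors in the model

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

# Adjusted R-squared applies a penalty for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

round(r_squared, 4)
round(adj_r_squared, 4)
```

With a sample this large and a single predictor, the adjustment factor (n - 1) / (n - k - 1) is very close to 1, which is why the two values differ only in the fourth decimal place.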
The correlation coefficient is approximately 0.68, which indicates a moderately strong positive linear relationship. Visually, the data has a clear direction: as age increases, so do total working years. The spread of the points around the line of best fit also appears to widen as age increases, which suggests that the constant-variance assumption for the residuals may not fully hold. The Multiple R-squared is the proportion of the variance in Total Working Years that is explained by the model, here about 46%. Because the model has only one predictor, the Adjusted R-squared (0.4626) is almost identical to the Multiple R-squared (0.4629). Lastly, the p-value for Age is very low (< 2e-16), meaning that Age is a statistically significant predictor of Total Working Years: each additional year of age is associated with about 0.58 more working years on average.
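One way the residuals and the regression assumptions could be examined more directly is sketched below. So that it runs on its own, the example uses R's built-in cars data as a stand-in, not the IBM HR data; for this analysis the model would instead be lm(TotalWorkingYears ~ Age, data = jobsatisfaction). The names fit, res, fitted_vals, sw, and spread_trend are illustrative, and the last check is an informal heuristic rather than a formal test.

```r
# Stand-in model on the built-in `cars` data (NOT the IBM HR data).
# For this report, the equivalent would be:
#   fit <- lm(TotalWorkingYears ~ Age, data = jobsatisfaction)
fit <- lm(dist ~ speed, data = cars)

res <- resid(fit)           # observed minus fitted values
fitted_vals <- fitted(fit)

par(mfrow = c(1, 2))
# Residuals vs. fitted: a funnel shape suggests non-constant variance,
# a curved band suggests the relationship is not linear
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res)
qqline(res)
par(mfrow = c(1, 1))

# Shapiro-Wilk test of normality (a small p-value suggests non-normality)
sw <- shapiro.test(res)
sw$p.value

# Informal check: do absolute residuals grow with the fitted values?
spread_trend <- cor(abs(res), fitted_vals)
spread_trend
```

A clearly positive spread_trend, together with a funnel shape in the first plot, would support the visual impression that the residual spread grows with the predictor.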