This week should be relatively simple. We will look at the relationship between Salary and Age in our “Bank Salaries” data set. First, we will check whether there is any relationship between the two, using the correlation coefficient and a test of whether it is equal to zero. Then we will turn to simple regression to try to quantify the relationship. Let’s get started.
First, import the data:
library(readxl)
Bank_Salaries <- read_excel("G:/My Drive/Data Analysis/Week 5/Bank Salaries.xlsx")
View(Bank_Salaries)
First, let’s take a look at the graph. I will generate a scatter plot with Age on the x-axis and Salary on the y-axis. Remember that the plot command has the structure (x, y), which, at least to me, is sometimes inconvenient, as I tend to think in terms of y = f(x), as we will when we work on regression. Just something to keep in mind. So, let’s look at the simple scatter plot:
plot(Bank_Salaries$Age, Bank_Salaries$Salary)
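Side note: if the (x, y) ordering bothers you as it does me, plot also accepts a formula, which reads like y = f(x). A minimal equivalent:
# same scatter plot via the formula interface: response ~ predictor
plot(Salary ~ Age, data = Bank_Salaries)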
Looking at the graph, we can sort of see a positive relationship, but you can also notice some outliers at really high ages (for this sample). Let’s take a quick look. It seems the outliers are all above $70K, so let’s check:
table(Bank_Salaries$Age[Bank_Salaries$Salary>=70000], Bank_Salaries$Salary[Bank_Salaries$Salary>=70000])
##
## 74000 88000 94000 95000 97000
## 59 0 0 1 0 0
## 60 0 0 0 1 0
## 61 0 0 0 0 1
## 62 0 1 0 0 0
## 65 1 0 0 0 0
Sure enough, there are exactly 5 observations above $70K, and they all belong to the oldest employees (ages 59–65).
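If you just want the count rather than the full table, a quick check like this works too:
# count the observations with salaries of $70K or more
sum(Bank_Salaries$Salary >= 70000)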
Let’s draw the plot excluding these:
plot(Bank_Salaries$Age[Bank_Salaries$Salary<=70000], Bank_Salaries$Salary[Bank_Salaries$Salary<=70000])
Still looks positive. So let’s forget about the outliers for a second and go back to the raw data.
One bug to put in your ear: maybe Grade plays a role, in the sense that older people will be in higher grades and earn more (next week we will talk about how to control for this). So, let’s plot the data with color representing the different grades. It is possible that the relationship depends on grade (for reference, try this exercise with the iris dataset and look at the correlation of Sepal.Length and Sepal.Width overall and then by species, as in the sketch below).
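Here is a minimal sketch of that iris exercise (iris ships with R, so this runs as-is); the overall correlation comes out slightly negative, while within each species it is positive:
# overall correlation across all 150 flowers
cor(iris$Sepal.Length, iris$Sepal.Width)
# correlation within each species
by(iris, iris$Species, function(d) cor(d$Sepal.Length, d$Sepal.Width))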
So, again, let’s start with a scatter plot with the dots colored based on Grade:
plot(Bank_Salaries$Age, Bank_Salaries$Salary, col=Bank_Salaries$Grade)
legend("topright", legend=unique(Bank_Salaries$Grade), fill=unique(Bank_Salaries$Grade))
Not very informative. Let’s see if separate scatter plots for each grade look any different:
library(ggplot2)
ggplot(Bank_Salaries, aes(x = Age, y = Salary)) +
  geom_point() +
  facet_wrap(~ Grade) +
  theme_minimal() +
  ggtitle("Scatter Plots of Age vs Salary by Grade")
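As an alternative to faceting, you can keep one panel and map Grade to color in ggplot; wrapping Grade in factor() makes the color scale discrete (this assumes Grade is stored as a number):
ggplot(Bank_Salaries, aes(x = Age, y = Salary, color = factor(Grade))) +
  geom_point() +
  theme_minimal() +
  labs(color = "Grade")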
It seems there are different patterns. So let’s examine the correlation coefficient: let’s estimate the Pearson correlation coefficient and test whether it is different from zero.
cor(Bank_Salaries$Salary, Bank_Salaries$Age)
## [1] 0.383798
cortest<-cor.test(Bank_Salaries$Salary, Bank_Salaries$Age)
print(cortest)
##
## Pearson's product-moment correlation
##
## data: Bank_Salaries$Salary and Bank_Salaries$Age
## t = 5.9654, df = 206, p-value = 1.05e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2614054 0.4940439
## sample estimates:
## cor
## 0.383798
So, we see a positive correlation of 0.384 (roughly), and it is highly significant, with a p-value of \(1.05\times10^{-8}\).
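A side note: cor.test returns a list, so you can pull out the individual pieces if you need them later (these component names are standard for htest objects):
cortest$estimate   # the correlation coefficient
cortest$p.value    # the p-value of the test
cortest$conf.int   # the 95 percent confidence interval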
Just out of curiosity (and to show you more fun in R :)), let’s look at correlation coefficients by Grade. The first step is to convert the data set into a data.table, so I will create a separate object called df, as I want to keep using Bank_Salaries later.
library(data.table)
df <- data.table(Bank_Salaries)
# Calculate correlation for each group
result_cor <- df[, cor(Salary, Age), by = "Grade"]
#additional fun - calculate means of each variable for each group
mresult<-df[, mean(Salary), by="Grade"]
mresult_age<-df[,mean(Age), by="Grade"]
# Display the results
print(result_cor)
## Grade V1
## 1: 1 0.4723709
## 2: 2 -0.1269764
## 3: 3 0.1291289
## 4: 4 0.2076582
## 5: 5 0.3815624
## 6: 6 0.2916440
print(mresult)
## Grade V1
## 1: 1 32335.17
## 2: 2 34664.52
## 3: 3 38664.65
## 4: 4 44152.14
## 5: 5 50328.57
## 6: 6 68000.00
print(mresult_age)
## Grade V1
## 1: 1 39.53333
## 2: 2 37.71429
## 3: 3 39.25581
## 4: 4 40.42857
## 5: 5 40.28571
## 6: 6 55.71429
#UPDATE ON 4/17
#I was bothered by reporting the two means separately and wanted to put them in the same table. Here is the code
means_together<-aggregate(x=cbind(Bank_Salaries$Salary, Bank_Salaries$Age)~Bank_Salaries$Grade, FUN=mean)
I was bothered by creating multiple tables, etc., but I wanted to push the Rmd out in the interest of time. Since then, I have played with how to organize everything better, and below are some suggestions. I am not including this to confuse you (I promise :) ), but, from past experience, some of these tricks can be useful when dealing with different data sets, particularly the renaming commands.
#NOTE: if you use the "full names" (Bank_Salaries$Salary, etc.) as above, results will be reported as V1 and V2. If you go with the option below, and use data= within the command, it will use the names of the variables when reporting.
means_together2<-aggregate(x=cbind(Salary, Age)~Grade, data=Bank_Salaries, FUN=mean)
means_together2
## Grade Salary Age
## 1 1 32335.17 39.53333
## 2 2 34664.52 37.71429
## 3 3 38664.65 39.25581
## 4 4 44152.14 40.42857
## 5 5 50328.57 40.28571
## 6 6 68000.00 55.71429
Another way to rename variables (that can be useful in other applications, say where you have a codebook):
#starting with correlations and means produced earlier with resulting tables using V1 and V2
setnames(result_cor, c("V1"), c("SalAgeCorr"))
result_cor
## Grade SalAgeCorr
## 1: 1 0.4723709
## 2: 2 -0.1269764
## 3: 3 0.1291289
## 4: 4 0.2076582
## 5: 5 0.3815624
## 6: 6 0.2916440
setnames(means_together, c("Bank_Salaries$Grade","V1","V2"), c("Grade", "Salary","Age"))
means_together
## Grade Salary Age
## 1 1 32335.17 39.53333
## 2 2 34664.52 37.71429
## 3 3 38664.65 39.25581
## 4 4 44152.14 40.42857
## 5 5 50328.57 40.28571
## 6 6 68000.00 55.71429
Finally, let’s say I want the correlation and both means by grade in one big, happy table :)
result_cor_all <- df[, .(Correlation = cor(Salary, Age), Mean_Salary = mean(Salary), Mean_Age=mean(Age)), by = "Grade"]
result_cor_all
## Grade Correlation Mean_Salary Mean_Age
## 1: 1 0.4723709 32335.17 39.53333
## 2: 2 -0.1269764 34664.52 37.71429
## 3: 3 0.1291289 38664.65 39.25581
## 4: 4 0.2076582 44152.14 40.42857
## 5: 5 0.3815624 50328.57 40.28571
## 6: 6 0.2916440 68000.00 55.71429
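If you prefer the tidyverse, here is a sketch of the same one-big-happy-table summary using dplyr (assuming you have the package installed):
library(dplyr)
Bank_Salaries %>%
  group_by(Grade) %>%
  summarise(Correlation = cor(Salary, Age),
            Mean_Salary = mean(Salary),
            Mean_Age = mean(Age))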
As with any detour, we are back to where we were originally headed. As you can see in any of the tables above containing correlations, the correlation is not the same across grades (as in, not as strong), especially in grades 2 and 3. It is even negative (but really small) in grade 2. If you want to chase this further, you can run the test for each grade, so let’s just do it for grade 2.
cortestG2<-cor.test(Bank_Salaries$Salary[Bank_Salaries$Grade==2], Bank_Salaries$Age[Bank_Salaries$Grade==2])
print(cortestG2)
##
## Pearson's product-moment correlation
##
## data: Bank_Salaries$Salary[Bank_Salaries$Grade == 2] and Bank_Salaries$Age[Bank_Salaries$Grade == 2]
## t = -0.80962, df = 40, p-value = 0.4229
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4148960 0.1840582
## sample estimates:
## cor
## -0.1269764
Notice that this correlation coefficient is not significant (p = 0.4229). So, in grade 2, age does not really seem correlated with salary (probably because grade determines salary much more). You can examine the average ages in the previous output.
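If you want to run the test for every grade at once rather than one at a time, here is a minimal sketch using by(), which applies cor.test within each grade and extracts the p-value:
# p-value of the Salary-Age correlation test within each grade
by(Bank_Salaries, Bank_Salaries$Grade, function(d) cor.test(d$Salary, d$Age)$p.value)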
OK, I will leave it alone now. Let’s examine this relationship more closely and see if we can quantify the effect of age on salary, so let’s turn to simple regression.
Let’s run the simple regression model of salary as a function of age. That is, we are trying to estimate \(\beta_0\) and \(\beta_1\) in the following model, where \(\varepsilon\) is the error term:
\(Salary = \beta_0 + \beta_1 \cdot Age + \varepsilon\)
We are mainly interested in the \(\beta_1\) estimate.
reg1<-lm(Bank_Salaries$Salary ~ Bank_Salaries$Age)
summary(reg1)
##
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20155 -6652 -1251 5144 48451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23010.70 2925.50 7.866 2.04e-13 ***
## Bank_Salaries$Age 418.65 70.18 5.965 1.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10420 on 206 degrees of freedom
## Multiple R-squared: 0.1473, Adjusted R-squared: 0.1432
## F-statistic: 35.59 on 1 and 206 DF, p-value: 1.05e-08
So, our estimate of \(\beta_1\) is 418.65, which means we expect each additional year of age to bring in an extra $418.65 in salary (try this with the years1 and years2 variables to see if those differ). The coefficient is significant at the 1% level, and given the \(R^2\) of 0.1473, we can say that 14.73% of the variation in Salary is explained by Age. There are a lot of issues with this model, but let’s ignore them for now.
NOTE: there is no universally good or bad \(R^2\). Sometimes 0.1 is a cause for celebration and 0.8 a cause for despair. The value of \(R^2\) tends to depend mostly on the nature of the data: data on individuals will yield VERY low \(R^2\), while aggregate data, such as for the economy as a whole, will yield very high values. \(R^2\) and, actually, \(adj. R^2\) are really only useful for comparing different models, and we can use them for comparison as long as the DEPENDENT VARIABLE IS THE SAME, i.e., we have not applied some transformation to it (such as taking a log).
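A side note: coef and confint pull the estimates straight out of the fitted object, and the data= form of the lm call is equivalent to the $-notation above but gives cleaner labels in the output:
# equivalent model, with cleaner variable names in the output
reg1b <- lm(Salary ~ Age, data = Bank_Salaries)
coef(reg1b)      # estimated intercept and slope
confint(reg1b)   # 95% confidence intervals for the coefficients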
You can read more about the F-test, etc. If you want to see the ANOVA table for this regression (which other software provides by default), you can use the anova() command on the results object.
anova(reg1)
## Analysis of Variance Table
##
## Response: Bank_Salaries$Salary
## Df Sum Sq Mean Sq F value Pr(>F)
## Bank_Salaries$Age 1 3.8633e+09 3863276320 35.586 1.05e-08 ***
## Residuals 206 2.2364e+10 108562286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OK, now let’s examine the residuals.
#let's look at histogram of residuals
hist(reg1$residuals)
#let's look at other plots:
par(mfrow = c(2, 2)) # this sets the grid to show all plots (there are 4 in total) so that we do not have to hit enter each time
plot(reg1)
We see potential heteroskedasticity in the first plot (residuals vs. fitted) and non-normality (in the Q-Q plot). My sneaking suspicion is that the outliers we identified above (and you can see them clearly in the Q-Q plot) may be driving some of this. So, let’s run the regression without those 5 observations with salaries over $70K. NOTE: for other transformations, see the video; we will talk more next week.
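Before running that, a side note: if you want formal tests to back up the eyeball diagnostics, here is a sketch. shapiro.test is built into R and tests normality of the residuals; bptest, the Breusch-Pagan test for heteroskedasticity, comes from the lmtest package (assuming you have it installed):
# Shapiro-Wilk test: the null hypothesis is that residuals are normal
shapiro.test(reg1$residuals)
# Breusch-Pagan test: the null hypothesis is constant error variance
library(lmtest)
bptest(reg1)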
reg2<-lm(Bank_Salaries$Salary[Bank_Salaries$Salary<=70000]~Bank_Salaries$Age[Bank_Salaries$Salary<=70000])
summary(reg2)
##
## Call:
## lm(formula = Bank_Salaries$Salary[Bank_Salaries$Salary <= 70000] ~
## Bank_Salaries$Age[Bank_Salaries$Salary <= 70000])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13883 -5873 -1221 4478 25275
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 30706.98 2299.09 13.356
## Bank_Salaries$Age[Bank_Salaries$Salary <= 70000] 200.40 55.97 3.581
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Bank_Salaries$Age[Bank_Salaries$Salary <= 70000] 0.00043 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7867 on 201 degrees of freedom
## Multiple R-squared: 0.05996, Adjusted R-squared: 0.05529
## F-statistic: 12.82 on 1 and 201 DF, p-value: 0.0004298
hist(reg2$residuals)
par(mfrow=c(2,2))
plot(reg2)
Some notes: age seems much less important, both in terms of the effect ($200 per year vs. $418) and the significance (\(p = 4.3\times10^{-4}\) vs. \(1.05\times10^{-8}\)). This is not surprising, as the salary outliers are also age outliers, and they were driving a lot of what we saw in the initial regression. Honestly, I would leave them out.
The histogram still shows non-normality, in the form of skew in the residuals, although the Q-Q plot looks better. Again, I think I would leave these 5 people out, given how extreme they are as outliers; with 208 observations, I can afford to lose those degrees of freedom.
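One last side note: all that subsetting inside the reg2 call gets verbose. The subset argument to lm fits the identical model more readably (and with cleaner labels):
# same regression, using the data and subset arguments
reg2b <- lm(Salary ~ Age, data = Bank_Salaries, subset = Salary <= 70000)
summary(reg2b)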
Of course, this model is still not great, as other factors (such as grade) may be at play, but that is the topic for next week.