This week should be relatively simple. We will look at the relationship between Salary and Age in our “Bank Salaries” data set. First, we will check whether there is any relationship between the two, using the correlation coefficient and a test of whether it is equal to zero. Then we will turn to simple regression to try to quantify the relationship. Let’s get started.
First, import the data:
library(readxl)
Bank_Salaries <- read_excel("G:/My Drive/Data Analysis/Week 5/Bank Salaries.xlsx")
View(Bank_Salaries)
First, let’s take a look at the graph. I will generate a scatter plot with Age on the x-axis and Salary on the y-axis. Remember that the plot command has the structure (x, y), which, at least to me, is sometimes inconvenient, as I tend to think in terms of y = f(x), as we will when we work on regression. Just something to keep in mind. So, let’s look at the simple scatter plot:
plot(Bank_Salaries$Age, Bank_Salaries$Salary)
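Side note: if the (x, y) ordering bothers you as it does me, plot also accepts a formula, which reads like y = f(x). A minimal equivalent:
# same scatter plot via the formula interface: response ~ predictor
plot(Salary ~ Age, data = Bank_Salaries)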
Looking at the graph, we can sort of see a positive relationship, but you can also notice some outliers at really high ages (for this sample). Let’s take a quick look. It seems the outliers are all above $70K, so let’s check:
table(Bank_Salaries$Age[Bank_Salaries$Salary>=70000], Bank_Salaries$Salary[Bank_Salaries$Salary>=70000])
##
## 74000 88000 94000 95000 97000
## 59 0 0 1 0 0
## 60 0 0 0 1 0
## 61 0 0 0 0 1
## 62 0 1 0 0 0
## 65 1 0 0 0 0
Sure enough, there are exactly 5 observations above $70K, and they all belong to the oldest employees (ages 59–65).
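If you just want the count rather than the full table, a quick check like this works too:
# count the observations with salaries of $70K or more
sum(Bank_Salaries$Salary >= 70000)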
Let’s draw the plot excluding these:
plot(Bank_Salaries$Age[Bank_Salaries$Salary<=70000], Bank_Salaries$Salary[Bank_Salaries$Salary<=70000])
Still looks positive. So let’s forget about the outliers for a second and go back to the raw data.
One bug to put in your ear: maybe Grade plays a role, in the sense that older people will be in higher grades and earn more (next week we will talk about how to control for this). So, let’s plot the data with color representing the different grades. It is possible that the relationship depends on grade (for reference, try this exercise with the iris dataset and look at the correlation of Sepal.Length and Sepal.Width overall and then by species, as in the sketch below).
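Here is a minimal sketch of that iris exercise (iris ships with R, so this runs as-is); the overall correlation comes out slightly negative, while within each species it is positive:
# overall correlation across all 150 flowers
cor(iris$Sepal.Length, iris$Sepal.Width)
# correlation within each species
by(iris, iris$Species, function(d) cor(d$Sepal.Length, d$Sepal.Width))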
So, again, let’s start with a scatter plot with the dots colored based on Grade:
plot(Bank_Salaries$Age, Bank_Salaries$Salary, col=Bank_Salaries$Grade)
legend("topright", legend=unique(Bank_Salaries$Grade), fill=unique(Bank_Salaries$Grade))
Not very informative. Let’s see if separate scatter plots for each grade look any different:
library(ggplot2)
ggplot(Bank_Salaries, aes(x = Age, y = Salary)) +
  geom_point() +
  facet_wrap(~ Grade) +
  theme_minimal() +
  ggtitle("Scatter Plots of Age vs Salary by Grade")
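As an alternative to faceting, you can keep one panel and map Grade to color in ggplot; wrapping Grade in factor() makes the color scale discrete (this assumes Grade is stored as a number):
ggplot(Bank_Salaries, aes(x = Age, y = Salary, color = factor(Grade))) +
  geom_point() +
  theme_minimal() +
  labs(color = "Grade")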
It seems there are different patterns. So let’s examine the correlation coefficient: let’s estimate the Pearson correlation coefficient and test whether it is different from zero.
cor(Bank_Salaries$Salary, Bank_Salaries$Age)
## [1] 0.383798
cortest<-cor.test(Bank_Salaries$Salary, Bank_Salaries$Age)
print(cortest)
##
## Pearson's product-moment correlation
##
## data: Bank_Salaries$Salary and Bank_Salaries$Age
## t = 5.9654, df = 206, p-value = 1.05e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2614054 0.4940439
## sample estimates:
## cor
## 0.383798
So, we see a positive correlation of 0.384 (roughly), and it is highly significant, with a p-value of \(1.05\times10^{-8}\).
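A side note: cor.test returns a list, so you can pull out the individual pieces if you need them later (these component names are standard for htest objects):
cortest$estimate   # the correlation coefficient
cortest$p.value    # the p-value of the test
cortest$conf.int   # the 95 percent confidence interval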
Just out of curiosity (and to show you more fun in R :)), let’s look at correlation coefficients by Grade. The first step is to convert the data set into a data.table, so I will create a separate object called df, as I want to keep using Bank_Salaries later.
library(data.table)
df <- data.table(Bank_Salaries)
# Calculate correlation for each group
result_cor <- df[, cor(Salary, Age), by = "Grade"]
#additional fun - calculate means of each variable for each group
mresult<-df[, mean(Salary), by="Grade"]
mresult_age<-df[,mean(Age), by="Grade"]
# Display the results
print(result_cor)
## Grade V1
## 1: 1 0.4723709
## 2: 2 -0.1269764
## 3: 3 0.1291289
## 4: 4 0.2076582
## 5: 5 0.3815624
## 6: 6 0.2916440
print(mresult)
## Grade V1
## 1: 1 32335.17
## 2: 2 34664.52
## 3: 3 38664.65
## 4: 4 44152.14
## 5: 5 50328.57
## 6: 6 68000.00
print(mresult_age)
## Grade V1
## 1: 1 39.53333
## 2: 2 37.71429
## 3: 3 39.25581
## 4: 4 40.42857
## 5: 5 40.28571
## 6: 6 55.71429
#UPDATE ON 4/17
#I was bothered by reporting the two means separately and wanted to put them in the same table. Here is the code
means_together<-aggregate(x=cbind(Bank_Salaries$Salary, Bank_Salaries$Age)~Bank_Salaries$Grade, FUN=mean)
I was bothered by creating multiple tables, etc., but I wanted to push the Rmd out in the interest of time. Since then, I have played with how to organize everything better, and below are some suggestions. I am not including this to confuse you (I promise :) ), but, from past experience, some of these tricks can be useful when dealing with different data sets, particularly the renaming commands.
#NOTE: if you use the "full names" (Bank_Salaries$Salary, etc.) as above, results will be reported as V1 and V2. If you go with the option below, and use data= within the command, it will use the names of the variables when reporting.
means_together2<-aggregate(x=cbind(Salary, Age)~Grade, data=Bank_Salaries, FUN=mean)
means_together2
## Grade Salary Age
## 1 1 32335.17 39.53333
## 2 2 34664.52 37.71429
## 3 3 38664.65 39.25581
## 4 4 44152.14 40.42857
## 5 5 50328.57 40.28571
## 6 6 68000.00 55.71429
Another way to rename variables (that can be useful in other applications, say where you have a codebook):
#starting with correlations and means produced earlier with resulting tables using V1 and V2
setnames(result_cor, c("V1"), c("SalAgeCorr"))
result_cor
## Grade SalAgeCorr
## 1: 1 0.4723709
## 2: 2 -0.1269764
## 3: 3 0.1291289
## 4: 4 0.2076582
## 5: 5 0.3815624
## 6: 6 0.2916440
setnames(means_together, c("Bank_Salaries$Grade","V1","V2"), c("Grade", "Salary","Age"))
means_together
## Grade Salary Age
## 1 1 32335.17 39.53333
## 2 2 34664.52 37.71429
## 3 3 38664.65 39.25581
## 4 4 44152.14 40.42857
## 5 5 50328.57 40.28571
## 6 6 68000.00 55.71429
Finally, let’s say I want the correlation and both means by grade in one big, happy table :)
result_cor_all <- df[, .(Correlation = cor(Salary, Age), Mean_Salary = mean(Salary), Mean_Age=mean(Age)), by = "Grade"]
result_cor_all
## Grade Correlation Mean_Salary Mean_Age
## 1: 1 0.4723709 32335.17 39.53333
## 2: 2 -0.1269764 34664.52 37.71429
## 3: 3 0.1291289 38664.65 39.25581
## 4: 4 0.2076582 44152.14 40.42857
## 5: 5 0.3815624 50328.57 40.28571
## 6: 6 0.2916440 68000.00 55.71429
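If you prefer the tidyverse, here is a sketch of the same one-big-happy-table summary using dplyr (assuming you have the package installed):
library(dplyr)
Bank_Salaries %>%
  group_by(Grade) %>%
  summarise(Correlation = cor(Salary, Age),
            Mean_Salary = mean(Salary),
            Mean_Age = mean(Age))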
As with any detour, we are back to where we were originally headed. As you can see in any of the tables above containing correlations, the correlation is not the same across grades (as in, not as strong), especially in grades 2 and 3. It is even negative (but really small) in grade 2. If you want to chase this further, you can run the test for each grade, so let’s just do it for grade 2.
cortestG2<-cor.test(Bank_Salaries$Salary[Bank_Salaries$Grade==2], Bank_Salaries$Age[Bank_Salaries$Grade==2])
print(cortestG2)
##
## Pearson's product-moment correlation
##
## data: Bank_Salaries$Salary[Bank_Salaries$Grade == 2] and Bank_Salaries$Age[Bank_Salaries$Grade == 2]
## t = -0.80962, df = 40, p-value = 0.4229
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4148960 0.1840582
## sample estimates:
## cor
## -0.1269764
Notice that this correlation coefficient is not significant (p = 0.4229). So, in grade 2, age does not really seem correlated with salary (probably because grade determines salary much more). You can examine the average ages in the previous output.
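If you want to run the test for every grade at once rather than one at a time, here is a minimal sketch using by(), which applies cor.test within each grade and extracts the p-value:
# p-value of the Salary-Age correlation test within each grade
by(Bank_Salaries, Bank_Salaries$Grade, function(d) cor.test(d$Salary, d$Age)$p.value)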
OK, I will leave it alone now. Let’s examine this relationship more closely and see if we can quantify the effect of age on salary, so let’s turn to simple regression.
Let’s run the simple regression model of salary as a function of age. That is, we are trying to estimate \(\beta_0\) and \(\beta_1\) in the following model, where \(\varepsilon\) is the error term:
\(Salary = \beta_0 + \beta_1 \cdot Age + \varepsilon\)
We are mainly interested in the \(\beta_1\) estimate.
reg1<-lm(Bank_Salaries$Salary ~ Bank_Salaries$Age)
summary(reg1)
##
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20155 -6652 -1251 5144 48451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23010.70 2925.50 7.866 2.04e-13 ***
## Bank_Salaries$Age 418.65 70.18 5.965 1.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10420 on 206 degrees of freedom
## Multiple R-squared: 0.1473, Adjusted R-squared: 0.1432
## F-statistic: 35.59 on 1 and 206 DF, p-value: 1.05e-08
So, our estimate of \(\beta_1\) is 418.65, which means we expect each additional year of age to bring in an extra $418.65 in salary (try this with the years1 and years2 variables to see if those differ). The coefficient is significant at the 1% level, and given the \(R^2\) of 0.1473, we can say that 14.73% of the variation in Salary is explained by Age. There are a lot of issues with this model, but let’s ignore them for now.
NOTE: there is no universally good or bad \(R^2\). Sometimes 0.1 is a cause for celebration and 0.8 a cause for despair. The value of \(R^2\) tends to depend mostly on the nature of the data: data on individuals will yield VERY low \(R^2\), while aggregate data, such as for the economy as a whole, will yield very high values. \(R^2\) and, actually, \(adj. R^2\) are really only useful for comparing different models, and we can use them for comparison as long as the DEPENDENT VARIABLE IS THE SAME, i.e., we have not applied some transformation to it (such as taking a log).
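A side note: coef and confint pull the estimates straight out of the fitted object, and the data= form of the lm call is equivalent to the $-notation above but gives cleaner labels in the output:
# equivalent model, with cleaner variable names in the output
reg1b <- lm(Salary ~ Age, data = Bank_Salaries)
coef(reg1b)      # estimated intercept and slope
confint(reg1b)   # 95% confidence intervals for the coefficients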
You can read more about the F-test, etc. If you want to see the ANOVA table for this regression (which other software provides by default), you can use the anova() command on the results object.
anova(reg1)
## Analysis of Variance Table
##
## Response: Bank_Salaries$Salary
## Df Sum Sq Mean Sq F value Pr(>F)
## Bank_Salaries$Age 1 3.8633e+09 3863276320 35.586 1.05e-08 ***
## Residuals 206 2.2364e+10 108562286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OK, now let’s examine the residuals.
#let's look at histogram of residuals
hist(reg1$residuals)
#let's look at other plots:
par(mfrow = c(2, 2)) # this sets the grid to show all plots (there are 4 in total) so that we do not have to hit enter each time
plot(reg1)
We see potential heteroskedasticity in the first plot (residuals vs. fitted) and non-normality (in the Q-Q plot). My sneaking suspicion is that the outliers we identified above (and you can see them clearly in the Q-Q plot) may be driving some of this. So, let’s run the regression without those 5 observations with salaries over $70K. NOTE: for other transformations, see the video; we will talk more next week.
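Before running that, a side note: if you want formal tests to back up the eyeball diagnostics, here is a sketch. shapiro.test is built into R and tests normality of the residuals; bptest, the Breusch-Pagan test for heteroskedasticity, comes from the lmtest package (assuming you have it installed):
# Shapiro-Wilk test: the null hypothesis is that residuals are normal
shapiro.test(reg1$residuals)
# Breusch-Pagan test: the null hypothesis is constant error variance
library(lmtest)
bptest(reg1)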
reg2<-lm(Bank_Salaries$Salary[Bank_Salaries$Salary<=70000]~Bank_Salaries$Age[Bank_Salaries$Salary<=70000])
summary(reg2)
##
## Call:
## lm(formula = Bank_Salaries$Salary[Bank_Salaries$Salary <= 70000] ~
## Bank_Salaries$Age[Bank_Salaries$Salary <= 70000])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13883 -5873 -1221 4478 25275
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 30706.98 2299.09 13.356
## Bank_Salaries$Age[Bank_Salaries$Salary <= 70000] 200.40 55.97 3.581
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Bank_Salaries$Age[Bank_Salaries$Salary <= 70000] 0.00043 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7867 on 201 degrees of freedom
## Multiple R-squared: 0.05996, Adjusted R-squared: 0.05529
## F-statistic: 12.82 on 1 and 201 DF, p-value: 0.0004298
hist(reg2$residuals)
par(mfrow=c(2,2))
plot(reg2)
Some notes: age seems much less important, both in terms of the effect ($200 per year vs. $418) and the significance (\(p = 4.3\times10^{-4}\) vs. \(1.05\times10^{-8}\)). This is not surprising, as the salary outliers are also age outliers, and they were driving a lot of what we saw in the initial regression. Honestly, I would leave them out.
The histogram still shows non-normality, in the form of skew in the residuals, although the Q-Q plot looks better. Again, I think I would leave these 5 people out, given how extreme they are as outliers; with 208 observations, I can afford to lose those degrees of freedom.
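One last side note: all that subsetting inside the reg2 call gets verbose. The subset argument to lm fits the identical model more readably (and with cleaner labels):
# same regression, using the data and subset arguments
reg2b <- lm(Salary ~ Age, data = Bank_Salaries, subset = Salary <= 70000)
summary(reg2b)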
Of course, this model is still not great, as other factors (such as grade) may be at play, but that is the topic for next week.