Using Linear Regression to predict the total 3-month salary for 100 respondents from the ANZ datasets

Data from InsideSherpa Virtual Internship https://www.insidesherpa.com/modules/ZLJCsrpkHo9pZBJNY/BnwCTubtx8NW5W5Kr

Summary of Results

Model 2 has an F-statistic of 6011.491 and a p-value of < 2.2e-16, which is below the alpha level of 0.05. We therefore conclude that the model as a whole is a statistically significant predictor of ln(salary) and that these results are unlikely to be due to chance.

Model2 Multiple Linear Regression

l_salary = 1.233 + 0.111(no_payments) + 0.983(l_avg_salary) + 0.0001(age) + 0.011(genderM)

Subset of Dataset used for regression analysis

Variables chosen for Analysis

After filtering the data to include only observations with credit transactions, 883 observations for 100 unique customers remained.

The data was then summarized to find the total number of credit transactions for each individual customer_id.

[1]  2 14
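The filtering and per-customer count described above could be reproduced along the following lines. This is only a minimal sketch on made-up data: the `movement` and `amount` columns and all values are assumptions for illustration, and only `customer_id` is named in the text.

```r
# Toy transactions table (hypothetical values; every column name other than
# customer_id is an assumption, not taken from the ANZ data)
transactions <- data.frame(
  customer_id = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  movement    = c("credit", "debit", "credit", "credit", "credit",
                  "debit", "credit", "credit", "credit"),
  amount      = c(1000, 50, 1000, 3000, 3000, 20, 800, 800, 800)
)

# Keep only credit transactions
credits <- subset(transactions, movement == "credit")

# Count credit transactions per customer
counts <- table(credits$customer_id)
no_payments <- data.frame(customer_id = names(counts),
                          no_payments = as.integer(counts))

# Smallest and largest number of credit transactions per customer
print(range(no_payments$no_payments))
```

Base R is used here so the sketch has no package dependencies; the original analysis may well have used a different approach.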

The range output ([1] 2 14) indicates that the respondents had between 2 and 14 credit transactions.

Questions guiding my analysis

1. Why are some people paid more than once a month? E.g., Abigail had 13 credit transactions while Alexander had 6.

2. Why did some people not receive a payment in a given month? E.g., Diana was paid 1,013.67 AUD weekly during the 3-month period, a total of 13 payments, while Ronald had only two transactions: 6,111.57 AUD in September and 10,753.02 AUD in October.

3. Are the respondents with fewer credit payments freelance consultants, and the respondents with more credit payments salaried employees?

4. Do these people work in the same place?

Performing Simple Linear Regression Salary~No_payments

Explanation of model1 Salary versus Number of Payments

Beta-nought = 17,434: when no_payments is zero, the average salary is 17,434 AUD.

Beta1 = -55.84: for every one-unit increase in no_payments, the average salary decreases by 55.84 AUD.

R-squared = 0.08%

The very low F-statistic (0.7174) and the p-value of 0.397 indicate that, at the alpha level of 0.05, the slope is not significantly different from zero.

Thus, we conclude that there is no significant linear relationship between no_payments and the salary of respondents.

model1=lm(salary~no_payments, data=pay2)
summary(model1)

Call:
lm(formula = salary ~ no_payments, data = pay2)

Residuals:
   Min     1Q Median     3Q    Max 
 -9608  -5108  -1904   4569  18133 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17434.25     717.24  24.307   <2e-16 ***
no_payments   -55.84      65.93  -0.847    0.397    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6714 on 881 degrees of freedom
Multiple R-squared:  0.0008136, Adjusted R-squared:  -0.0003205 
F-statistic: 0.7174 on 1 and 881 DF,  p-value: 0.3972

Confidence Interval for Model1

The 95% confidence interval of beta1 (no_payments) for Model1 is (-185.24, 73.56). Since zero lies inside the interval, the slope is not statistically significant. The low t-statistic (-0.847) and the p-value of 0.3972 say the same thing: an association of this size between no_payments and salary could easily be observed by chance.

                 2.5 %      97.5 %
(Intercept) 16026.5556 18841.94903
no_payments  -185.2392    73.55536
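As a cross-check, the interval and p-value for no_payments can be recomputed by hand from the estimate, standard error and residual degrees of freedom printed in the model1 summary. The sketch below uses only those rounded values, so the last decimals can differ slightly from the confint() output.

```r
# Values copied from the summary(model1) output above
estimate  <- -55.84   # slope for no_payments
std_error <- 65.93    # its standard error
df_resid  <- 881      # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

# Two-sided p-value for the t statistic
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(ci, 2))
print(round(p_value, 3))
```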

Residuals versus Fitted plot

The Residuals versus Fitted plot for the simple linear regression model above shows that the number of credit payments is not linearly related to the total 3-month salary.

The model summary above also gives an R-squared value of 0.08%, meaning that the number of payments explains only 0.08% of the variation in salary.

Thus, the next step towards an appropriate model is to add more independent variables to explain the variation in salary.
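The R-squared quoted above is tied directly to the F-statistic in the model1 summary, so one can be recovered from the other. A short sketch using only the printed values:

```r
# Values copied from the summary(model1) output above
f_stat <- 0.7174  # F-statistic
df_num <- 1       # numerator degrees of freedom (one predictor)
df_den <- 881     # residual degrees of freedom

# R-squared recovered from the F-statistic
r_squared <- (f_stat * df_num) / (f_stat * df_num + df_den)

# Adjusted R-squared penalises the number of predictors
n <- df_den + df_num + 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_den

print(signif(r_squared, 4))
print(signif(adj_r_squared, 4))
```

The negative adjusted R-squared is another sign that no_payments on its own adds essentially nothing over an intercept-only model.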


Plotting histograms of salary, average 3-month salary, age and number of credit transactions showed that these variables needed to be transformed using the natural log to approximate a normal distribution.
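To illustrate why a natural log transformation helps with right-skewed variables like these, the sketch below compares the skewness of a simulated variable before and after taking logs. The data are simulated (not the ANZ data) and the `skewness` helper is defined here for illustration only.

```r
set.seed(1)

# Simulated right-skewed "salary" values (hypothetical, not the ANZ data)
salary_sim <- rlnorm(1000, meanlog = 9.5, sdlog = 0.5)

# Sample skewness: near 0 for a symmetric distribution,
# positive for a long right tail
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(salary_sim)
skew_log <- skewness(log(salary_sim))

print(c(raw = skew_raw, log = skew_log))
```

A skewness much closer to zero after the transformation corresponds to the more symmetric, bell-shaped histogram one hopes to see.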

Histograms after natural log transformations of total 3-month Salary and average 3-month salary

Histograms after natural log transformations of Age and Number of Credit payments

Multiple Linear Regression

Model formula, as fitted in the lm call: l_salary ~ no_payments + l_avg_salary + age + gender, where l_salary = ln(total 3-month salary) and l_avg_salary = ln(average 3-month salary)

l_salary = 1.233 + 0.111(no_payments) + 0.983(l_avg_salary) + 0.0001(age) + 0.011(genderM)

model2=lm(l_salary~no_payments+l_avg_salary+age+gender, data=pay2)

summary(model2)

Call:
lm(formula = l_salary ~ no_payments + l_avg_salary + age + gender, 
    data = pay2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.62821 -0.00931  0.00337  0.01709  0.07246 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.2330488  0.0568215  21.700   <2e-16 ***
no_payments  0.1113100  0.0010231 108.799   <2e-16 ***
l_avg_salary 0.9830141  0.0064506 152.390   <2e-16 ***
age          0.0001063  0.0002020   0.526   0.5990    
genderM      0.0109763  0.0049655   2.211   0.0273 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07179 on 878 degrees of freedom
Multiple R-squared:  0.9648,    Adjusted R-squared:  0.9646 
F-statistic:  6011 on 4 and 878 DF,  p-value: < 2.2e-16
plot(model2)

Explanation of Multiple Linear Regression results

Residuals versus Fitted plot: The points are scattered randomly around the residual = 0 line, indicating that the assumption of a linear relationship between the response variable (ln(salary)) and the predictor variables is reasonable.

However, the points around (10, -0.6) show that there are outliers in the data.

Normal Q-Q Plot: The Normal Q-Q plot shows that there are many negative residual values. Since the standardized residuals do not fall on a straight line against the theoretical quantiles, the linear regression assumption that the error terms are normally distributed is not met.


The comparison of the Residuals vs Fitted plots for Model1 and Model2 shows that Model2 is a better fit for predicting salary.

Model2 has an adjusted R-squared value of 96.46%, meaning that 96.46% of the variation in ln(salary) is explained by the predictor variables: ln(average 3-month salary), age, gender and number of credit payments.
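Because the response in Model2 is on the log scale, its coefficients are easier to read after converting them to percentage changes in salary. The sketch below does this for the coefficients printed in the model2 summary. It is an illustrative reading, each effect holding the other predictors fixed, and it assumes no_payments enters the model untransformed, as the lm call suggests.

```r
# Coefficients copied from the summary(model2) output above
b_no_payments  <- 0.1113100
b_l_avg_salary <- 0.9830141
b_genderM      <- 0.0109763

# Untransformed predictor with a log response:
# a one-unit increase multiplies the response by exp(b)
pct_per_payment <- (exp(b_no_payments) - 1) * 100
pct_male        <- (exp(b_genderM) - 1) * 100

# Log predictor with a log response: b acts as an elasticity, so a 1%
# increase in average salary scales total salary by 1.01^b
pct_per_1pct_avg <- (1.01^b_l_avg_salary - 1) * 100

print(round(c(per_payment  = pct_per_payment,
              male_vs_f    = pct_male,
              per_1pct_avg = pct_per_1pct_avg), 3))
```

Read this way, each additional credit payment is associated with roughly a 12% higher total 3-month salary, and a 1% higher average salary with a little under 1% higher total salary.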

Model 2 has an F-statistic of 6011.491 and a p-value of < 2.2e-16, which is below the alpha level of 0.05. We therefore conclude that the model as a whole is a statistically significant predictor of ln(salary) and that these results are unlikely to be due to chance.


