Summary of Results
Model 2 has an F-statistic of 6011.491 with a p-value below 2.2e-16, far less than the alpha level of 0.05. We therefore conclude that the predictors are jointly significant, that the model is a good predictor of salary, and that the results are unlikely to be due to chance.
Model2 Multiple Linear Regression
l_salary = 1.233 + 0.111(no_payments) + 0.983(l_avg_salary) + 0.0001(age) + 0.011(genderM)
Subset of Dataset used for regression analysis
Variables chosen for Analysis
After filtering the data to include only observations with credit transactions, 883 observations for 100 unique customers remained.
The data were then summarized to find the total number of credit transactions for each customer_id.
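The filtering and per-customer count described here could be sketched roughly as follows. This is a minimal illustration on invented toy data, not the code used for the report: the data frame `transactions` and its columns `movement` and `amount` are assumed names.

```r
# Toy transaction log. The customers, the amounts and the column names
# `movement` and `amount` are assumptions made for illustration only.
transactions <- data.frame(
  customer_id = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  movement    = c("credit", "debit", "credit", "credit", "debit",
                  "credit", "credit", "credit", "debit"),
  amount      = c(1000, 50, 1000, 3000, 20, 800, 800, 800, 10)
)

# Keep only the credit transactions.
credits <- transactions[transactions$movement == "credit", ]

# Count the credit transactions for each customer.
no_payments <- table(credits$customer_id)

# Smallest and largest number of credit transactions per customer.
payment_range <- range(as.integer(no_payments))
print(payment_range)
```

Because the toy customers are invented, the range printed by this sketch differs from the one obtained on the real data.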
[1] 2 14
The range shows that the respondents had between 2 and 14 credit transactions.
Questions guiding my analysis
1. Why are some people paid more than once a month? E.g., Abigail had 13 credit transactions while Alexander had 6.
2. Why did some people not receive a payment in some months? E.g., Diana was paid 1,013.67 AUD weekly during the 3-month period, a total of 13 payments, while Ronald had only two transactions: 6,111.57 AUD in September and 10,753.02 AUD in October.
3. Are the respondents with fewer credit payments freelance consultants, and those with more credit payments salaried employees?
4. Do these people work in the same place?
Confidence Interval for Model1
The 95% confidence interval for beta1, the coefficient of no_payments in Model1, is (-185.24, 73.56). Since the interval contains zero, the coefficient is not significantly different from zero at the 0.05 level. This agrees with the small t-statistic (-0.847) and the p-value of 0.3972: if there were no true association between no_payments and salary, an association at least this strong would still be observed by chance about 40% of the time.
                 2.5 %      97.5 %
(Intercept) 16026.5556 18841.94903
no_payments  -185.2392    73.55536
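As a consistency check, the t-statistic and p-value quoted for no_payments can be recovered from the confidence limits alone. The sketch below assumes that Model1 was fitted to all 883 observations with an intercept and one slope (881 residual degrees of freedom); that assumption is mine and is not stated in the output.

```r
# 95% confidence limits for no_payments, copied from the confint() output.
ci <- c(lower = -185.2392, upper = 73.55536)

# Assumption: 883 observations minus 2 estimated coefficients.
df_resid <- 883 - 2

# The point estimate is the midpoint of a symmetric t-interval.
estimate <- mean(ci)

# Half-width = critical t value * standard error, so solve for the SE.
t_crit    <- qt(0.975, df = df_resid)
std_error <- (ci[["upper"]] - ci[["lower"]]) / (2 * t_crit)

# Two-sided test of beta1 = 0.
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(round(c(estimate = estimate, std_error = std_error,
              t_stat = t_stat, p_value = p_value), 4))
```

Under that assumption the recovered t-statistic and p-value should agree with the reported -0.847 and 0.3972 up to rounding.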

Residuals versus Fitted plot
The Residuals versus Fitted plot for the simple linear regression model (Model1) shows that the number of credit payments is not linearly related to the total 3-month salary.
Also, from the Model1 summary, the R-squared value was 0.08%, meaning that the number of payments explained only 0.08% of the variation in salary.
Thus, the next step in finding an appropriate model is to add more independent variables to explain the variation in salary.
Plotting histograms of salary, average 3-month salary, age and number of credit transactions showed that these variables needed to be transformed using the natural log to approximate a normal distribution.
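The effect of the natural log transformation on a right-skewed variable can be illustrated with simulated data. The values below are generated, not taken from the dataset, and the helper `skewness()` is a hypothetical function defined only for this sketch.

```r
set.seed(42)

# Simulated right-skewed "salary" values (illustrative only).
salary <- rlnorm(1000, meanlog = 10, sdlog = 0.6)

# Sample skewness: mean of the cubed standardized values.
# Values near 0 indicate a roughly symmetric distribution.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(salary)
skew_log <- skewness(log(salary))
print(c(raw = skew_raw, log = skew_log))

# Histograms before and after the transformation.
hist(salary, breaks = 30, main = "Raw values")
hist(log(salary), breaks = 30, main = "After natural log")
```

The raw values have a long right tail, while the logged values are much closer to symmetric, which is the pattern the histograms in this report were used to check.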

Histograms after natural log transformations of total 3-month Salary and average 3-month salary

Histograms after natural log transformations of Age and Number of Credit payments
Multiple Linear Regression
Model2: ln(total_3_month_salary) ~ number of credit transactions + ln(average_3_month_salary) + age + gender (in the fitted call, the number of credit transactions and age enter untransformed)
l_salary = 1.233 + 0.111(no_payments) + 0.983(l_avg_salary) + 0.0001(age) + 0.011(genderM)
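Because the response is on the log scale, the coefficients can be read as approximate percentage effects on salary. The sketch below uses the estimates from the model2 summary and assumes, as the lm() call suggests, that no_payments and age enter the model untransformed while l_avg_salary is logged.

```r
# Coefficient estimates copied from the model2 summary.
coefs <- c(no_payments = 0.1113100, l_avg_salary = 0.9830141,
           age = 0.0001063, genderM = 0.0109763)

# For untransformed predictors, 100 * (exp(b) - 1) is the percentage
# change in salary for a one-unit increase (or for genderM = 1 versus 0).
pct_change <- 100 * (exp(coefs[c("no_payments", "age", "genderM")]) - 1)
print(round(pct_change, 2))

# l_avg_salary is itself logged, so its coefficient is an elasticity:
# the percentage change in salary for a 1% change in average salary.
elasticity <- coefs[["l_avg_salary"]]
print(elasticity)
```

On this reading, each additional credit payment is associated with roughly a 12% higher total salary, holding the other predictors fixed.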
# Fit Model2: log total salary on payment count, log average salary, age and gender
model2 <- lm(l_salary ~ no_payments + l_avg_salary + age + gender, data = pay2)
summary(model2)
Call:
lm(formula = l_salary ~ no_payments + l_avg_salary + age + gender,
    data = pay2)

Residuals:
     Min       1Q   Median       3Q      Max
-0.62821 -0.00931  0.00337  0.01709  0.07246

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.2330488  0.0568215  21.700   <2e-16 ***
no_payments  0.1113100  0.0010231 108.799   <2e-16 ***
l_avg_salary 0.9830141  0.0064506 152.390   <2e-16 ***
age          0.0001063  0.0002020   0.526   0.5990
genderM      0.0109763  0.0049655   2.211   0.0273 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07179 on 878 degrees of freedom
Multiple R-squared:  0.9648,    Adjusted R-squared:  0.9646
F-statistic:  6011 on 4 and 878 DF,  p-value: < 2.2e-16
plot(model2)




Explanation of Multiple Linear Regression results
Residuals versus Fitted plot: The points are scattered randomly around the residual = 0 line, indicating that the assumption of a linear relationship between the response variable (ln(salary)) and the predictor variables is reasonable.
However, the points around (10, -0.6) show that there are outliers in the data.
Normal Q-Q plot: The Normal Q-Q plot shows a heavy lower tail, with several large negative residuals. Since the standardized residuals do not fall on a straight line against the theoretical quantiles, the linear regression assumption that the error terms are normally distributed is not met.
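To show how a single large negative residual appears in these two diagnostics, here is a small sketch on simulated data. The data are invented and only loosely mimic the pattern described (one point far below the line); this is not a reproduction of the plots for Model2.

```r
set.seed(1)

# Toy data (illustrative only): a clean linear relationship plus one
# observation pushed far below the line.
x <- runif(200, min = 8, max = 11)
y <- 1 + 0.9 * x + rnorm(200, sd = 0.05)
y[1] <- y[1] - 0.6

fit     <- lm(y ~ x)
std_res <- rstandard(fit)

# Residuals versus fitted values: the outlier sits well below the zero line.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: the outlier falls far below the reference line.
qqnorm(std_res)
qqline(std_res)

# Index of the most negative standardized residual.
worst <- which.min(std_res)
print(std_res[worst])
```

Inspecting the observations with the most extreme standardized residuals in this way is one option for identifying which records drive the departure from normality.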
Comparing the Residuals versus Fitted plots for Model1 and Model2 shows that Model2 is a better fit for predicting salary.
Model2 has an adjusted R-squared value of 96.46%, meaning that 96.46% of the variation in ln(salary) is explained by the predictor variables: ln(average 3-month salary), age, gender and number of credit payments.
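The adjusted R-squared can be checked against the multiple R-squared in the summary. The sketch below takes n = 883 (878 residual degrees of freedom plus 5 estimated coefficients) and p = 4 predictors from the model output.

```r
r_squared <- 0.9648   # multiple R-squared from the model2 summary
n <- 883              # observations: 878 residual df + 5 coefficients
p <- 4                # predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

With so many observations relative to predictors, the adjustment is small, which is why the two values are nearly identical.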
Model 2 has an F-statistic of 6011.491 with a p-value below 2.2e-16, far less than the alpha level of 0.05. We therefore conclude that the predictors are jointly significant, that the model is a good predictor of salary, and that the results are unlikely to be due to chance.
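The p-value for the overall F-test can be computed directly from the F-statistic and its degrees of freedom. As far as I understand, R reports it as "< 2.2e-16" because the value is smaller than machine epsilon, the default cutoff used when formatting p-values; the sketch below checks this.

```r
f_stat <- 6011.491   # F-statistic for Model2
df1    <- 4          # numerator degrees of freedom (predictors)
df2    <- 878        # denominator degrees of freedom (residuals)

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)

# Compare with the alpha level and with machine epsilon (about 2.2e-16).
print(p_value < 0.05)
print(p_value < .Machine$double.eps)
```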
