'data.frame': 498 obs. of 3 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ income : num 3.86 4.98 4.92 3.21 7.2 ...
$ happiness: num 2.31 3.43 4.6 2.79 5.6 ...
summary(mydata)
X income happiness
Min. : 1.0 Min. :1.506 Min. :0.266
1st Qu.:125.2 1st Qu.:3.006 1st Qu.:2.266
Median :249.5 Median :4.424 Median :3.473
Mean :249.5 Mean :4.467 Mean :3.393
3rd Qu.:373.8 3rd Qu.:5.992 3rd Qu.:4.503
Max. :498.0 Max. :7.482 Max. :6.863
library(psych)
describe(mydata)
vars n mean sd median trimmed mad min max range skew
X 1 498 249.50 143.90 249.50 249.50 184.58 1.00 498.00 497.00 0.00
income 2 498 4.47 1.74 4.42 4.46 2.25 1.51 7.48 5.98 0.02
happiness 3 498 3.39 1.43 3.47 3.39 1.66 0.27 6.86 6.60 -0.01
kurtosis se
X -1.21 6.45
income -1.24 0.08
happiness -0.78 0.06
Average income is about $4,470 (mean = 4.47, with income measured in thousands of dollars), and the standard deviation of 1.74 indicates moderate variability. The income distribution is nearly symmetric (skewness = 0.02) and has lighter tails than a normal distribution (kurtosis = -1.24), so extreme values are rare.
The average happiness score is 3.39, also with moderate variability (standard deviation = 1.43). Its distribution is likewise nearly symmetric (skewness = -0.01) and has lighter tails than a normal distribution (kurtosis = -0.78).
In general, both income and happiness are roughly symmetric, with minimal skew and a low likelihood of extreme outliers. Their negative kurtosis means both are flatter than a normal distribution; income in particular (kurtosis = -1.24) is close to the value of -1.2 expected for a uniform distribution.
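To illustrate how a kurtosis value near -1.2 relates to the shape of a distribution, here is a minimal sketch that compares simulated flat and bell-shaped samples. The helper `excess_kurtosis` is a hypothetical name, it uses the plain moment formula (which may differ slightly from the estimator `describe()` applies), and the simulated samples only stand in for the real `mydata`.

```r
# Hypothetical helper: moment-based excess kurtosis (about 0 for normal data).
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

set.seed(1)                                  # reproducible simulated samples (not the real data)
flat <- runif(1e5, min = 1.5, max = 7.5)     # uniform over an income-like range
bell <- rnorm(1e5)                           # normal reference sample

k_flat <- excess_kurtosis(flat)              # theoretical value: -1.2
k_bell <- excess_kurtosis(bell)              # theoretical value: 0
print(round(c(uniform = k_flat, normal = k_bell), 2))
```

A sample kurtosis close to the uniform value, as reported for income, suggests a flat distribution rather than a bell-shaped one.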
plot(mydata$income, mydata$happiness,
     xlab = "Income (in thousands)", ylab = "Happiness Score",
     main = "Scatter Plot of Income vs Happiness")
abline(lm(happiness ~ income, data = mydata), col = "blue")           # linear regression line
lines(lowess(mydata$income, mydata$happiness), col = "red", lty = 2)  # smoothed trend line
To visually test the linearity assumption between Income and Happiness, we can create a scatter plot where:
Blue line shows the linear regression line.
Red dashed line shows a smoothed trend line to highlight any potential non-linear relationships.
The two lines stay close to each other, so the relationship between income and happiness appears approximately linear.
model <- lm(happiness ~ income, data = mydata)
summary(model)
Call:
lm(formula = happiness ~ income, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-2.02479 -0.48526 0.04078 0.45898 2.37805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20427 0.08884 2.299 0.0219 *
income 0.71383 0.01854 38.505 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7181 on 496 degrees of freedom
Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
Because the p-value for income is below 0.001, income has a positive and highly statistically significant effect on happiness in this dataset.
For each additional $1,000 in income, predicted happiness increases by about 0.71 points.
The model also explains 74.9% of the variation in happiness (R-squared = 0.7493), so it describes the relationship well. The residual standard error is 0.72, meaning observed happiness scores typically differ from the fitted values by about 0.72 points.
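As a worked illustration of the fitted line, the sketch below plugs a few income levels into the estimated equation. The coefficients are copied from the output above; the income values in `income_new` are hypothetical, chosen only to lie inside the observed range.

```r
# Coefficients copied from the summary(model) output.
b0 <- 0.20427   # intercept
b1 <- 0.71383   # slope on income (income in thousands of dollars)

# Hypothetical income levels inside the observed range (about 1.5 to 7.5).
income_new <- c(2, 5, 7)
happiness_hat <- b0 + b1 * income_new
print(data.frame(income = income_new, predicted_happiness = happiness_hat))

# With a single predictor, the F-statistic equals the squared t-statistic of the slope.
t_slope <- 38.505
print(t_slope^2)   # close to the reported F-statistic of 1483
```

With the fitted object available, `predict(model, newdata = data.frame(income = income_new))` should give the same predictions up to rounding of the coefficients.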
model2 <- lm(salary ~ sales, data = ceosal1)
summary(model2)
Call:
lm(formula = salary ~ sales, data = ceosal1)
Residuals:
Min 1Q Median 3Q Max
-1463.7 -478.9 -241.4 123.1 13614.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.174e+03 1.128e+02 10.407 <2e-16 ***
sales 1.547e-02 8.906e-03 1.737 0.0838 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1366 on 207 degrees of freedom
Multiple R-squared: 0.01437, Adjusted R-squared: 0.009607
F-statistic: 3.018 on 1 and 207 DF, p-value: 0.08385
From these statistics, the model explains only 1.44% of the variation in salary (R-squared = 0.01437), indicating poor predictive power. The slope is estimated imprecisely (standard error = 0.0089 against an estimate of 0.0155), and the p-value for sales (0.0838) is not significant at the 5% level.
The residual standard error is high (1366), and the F-statistic (3.018, p-value = 0.08385) is not significant at the 5% level, so sales has at most a minimal impact on salary and the model does not fit the data well.
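The link between the low R-squared and the weak F-statistic can be checked by hand. This is a small sketch using only the numbers printed in the output above; the variable names are illustrative.

```r
# Values copied from the summary(model2) output.
r2       <- 0.01437   # Multiple R-squared
k        <- 1         # number of predictors
df_resid <- 207       # residual degrees of freedom

# Overall F-statistic implied by R-squared.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)
print(round(c(F = f_stat, p = p_value), 4))
```

The results should agree with the reported F-statistic and p-value up to rounding of the inputs.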
# Level-level model
model_level_level <- lm(salary ~ sales, data = ceosal1)
summary(model_level_level)
Call:
lm(formula = salary ~ sales, data = ceosal1)
Residuals:
Min 1Q Median 3Q Max
-1463.7 -478.9 -241.4 123.1 13614.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.174e+03 1.128e+02 10.407 <2e-16 ***
sales 1.547e-02 8.906e-03 1.737 0.0838 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1366 on 207 degrees of freedom
Multiple R-squared: 0.01437, Adjusted R-squared: 0.009607
F-statistic: 3.018 on 1 and 207 DF, p-value: 0.08385
The level-level model shows a positive relationship between sales and salary that is significant only at the 10% level (p-value = 0.0838), not at the 5% level. Each one-unit increase in sales is associated with a 0.01547-unit increase in salary; because ceosal1 records salary in thousands of dollars and sales in millions of dollars, this is about $15.47 of salary per additional $1 million in sales. The model explains only 1.44% of the variation in salary (R-squared = 0.0144), so sales has a small effect on salary in this specification.
# Log-level model
model_log_level <- lm(log(salary) ~ sales, data = ceosal1)
summary(model_log_level)
Call:
lm(formula = log(salary) ~ sales, data = ceosal1)
Residuals:
Min 1Q Median 3Q Max
-1.44220 -0.29159 -0.02837 0.28323 2.72487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.847e+00 4.500e-02 152.138 < 2e-16 ***
sales 1.498e-05 3.553e-06 4.217 3.7e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5448 on 207 degrees of freedom
Multiple R-squared: 0.07912, Adjusted R-squared: 0.07467
F-statistic: 17.79 on 1 and 207 DF, p-value: 3.696e-05
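The log-level model shows a statistically significant relationship between sales and salary (p-value < 0.001): each one-unit increase in sales is associated with roughly a 0.0015% increase in salary (100 × 1.498e-05). However, the model explains only 7.91% of the variation in log(salary), so its fit is weak. The effect of sales on salary is statistically significant but very small.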
# Level-log model
model_level_log <- lm(salary ~ log(sales), data = ceosal1)
summary(model_level_log)
Call:
lm(formula = salary ~ log(sales), data = ceosal1)
Residuals:
Min 1Q Median 3Q Max
-1072.1 -447.6 -222.8 41.7 13702.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -898.93 771.50 -1.165 0.24529
log(sales) 262.90 92.36 2.847 0.00486 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1349 on 207 degrees of freedom
Multiple R-squared: 0.03767, Adjusted R-squared: 0.03302
F-statistic: 8.103 on 1 and 207 DF, p-value: 0.004863
The level-log model shows a significant positive relationship between log(sales) and salary (p-value = 0.00486). A one-unit increase in log(sales) is associated with a 262.90-unit increase in salary, so a 1% increase in sales corresponds to about 2.63 units of salary (262.90 / 100), or roughly $2,629 given that ceosal1 records salary in thousands of dollars. However, the model explains only 3.77% of the variation in salary, indicating a weak fit. The effect is statistically significant, but the model's explanatory power is low.
# Log-log model
model_log_log <- lm(log(salary) ~ log(sales), data = ceosal1)
summary(model_log_log)
Call:
lm(formula = log(salary) ~ log(sales), data = ceosal1)
Residuals:
Min 1Q Median 3Q Max
-1.01038 -0.28140 -0.02723 0.21222 2.81128
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.82200 0.28834 16.723 < 2e-16 ***
log(sales) 0.25667 0.03452 7.436 2.7e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5044 on 207 degrees of freedom
Multiple R-squared: 0.2108, Adjusted R-squared: 0.207
F-statistic: 55.3 on 1 and 207 DF, p-value: 2.703e-12
The log-log model shows a positive relationship between sales and salary (p-value < 0.001), with each 1% increase in sales corresponding to about a 0.257% increase in salary. The model explains 21.08% of the variation in log(salary), which is a moderate fit.
To sum up, the log-log model fits best of the four, but with an R-squared of 0.21 it is still too weak to be a good predictor. Note that R-squared values are directly comparable only between models with the same dependent variable (salary or log(salary)).
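The four slopes are measured on different scales, so a side-by-side conversion can help when comparing them. The sketch below applies the usual approximations for logged variables to the estimates printed above; the names in `effects` are illustrative.

```r
# Slope estimates copied from the four summaries above.
b_level_level <- 1.547e-02   # salary ~ sales
b_log_level   <- 1.498e-05   # log(salary) ~ sales
b_level_log   <- 262.90      # salary ~ log(sales)
b_log_log     <- 0.25667     # log(salary) ~ log(sales)

effects <- c(
  level_level = b_level_level,       # change in salary per one-unit increase in sales
  log_level   = 100 * b_log_level,   # approx. % change in salary per one-unit increase in sales
  level_log   = b_level_log / 100,   # approx. change in salary per 1% increase in sales
  log_log     = b_log_log            # approx. % change in salary per 1% increase in sales
)
print(effects)
```

The percentage interpretations are approximations that work best for small changes.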
Task 3
data(k401k)
model3 <- lm(prate ~ mrate + age, data = k401k)
summary(model3)
Call:
lm(formula = prate ~ mrate + age, data = k401k)
Residuals:
Min 1Q Median 3Q Max
-81.162 -8.067 4.787 12.474 18.256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.1191 0.7790 102.85 < 2e-16 ***
mrate 5.5213 0.5259 10.50 < 2e-16 ***
age 0.2432 0.0447 5.44 6.21e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.94 on 1531 degrees of freedom
Multiple R-squared: 0.09225, Adjusted R-squared: 0.09106
F-statistic: 77.79 on 2 and 1531 DF, p-value: < 2.2e-16
We examined the effect of a 401k plan's match rate (mrate) and the age of the plan (age) on the participation rate (prate). The estimated coefficients are positive for both mrate and age, suggesting that higher match rates and older plans are associated with higher participation rates. The intercept, 80.1191, is the estimated baseline participation rate when both mrate and age are zero, though it is unlikely that a real plan would have these values.
The coefficient for mrate is 5.5213, so for each one-unit increase in the match rate, the participation rate is expected to increase by about 5.52 percentage points, holding plan age fixed. The coefficient for age is 0.2432, suggesting that for each additional year a plan has been in place, the participation rate is expected to increase by about 0.24 percentage points, holding the match rate fixed. Both coefficients are statistically significant at conventional levels, as shown by the very low p-values (p < 2e-16 for mrate and p = 6.21e-08 for age), so we have strong evidence against the null hypothesis of no effect for each variable.
The residual standard error of 15.94 is the typical deviation of the observed prate values from the values predicted by this model, indicating considerable variability not captured. The R^2 value of 0.09225 tells us that about 9.23% of the variation in prate is explained by the model, which is relatively low. This low R^2 suggests that other factors not included in the model may play a significant role in determining participation rates. Finally, the overall F-statistic of 77.79 and its very small p-value (< 2.2e-16) indicate that the model is statistically significant, meaning that at least one of the predictors (mrate or age) is significantly associated with prate.
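To make the coefficients concrete, the sketch below computes the predicted participation rate for one plan. The coefficients are copied from the output above; the plan itself (match rate 0.5, age 10 years) is a hypothetical example, not an observation from k401k.

```r
# Coefficients copied from the summary(model3) output.
b0      <- 80.1191   # intercept
b_mrate <- 5.5213    # slope on match rate
b_age   <- 0.2432    # slope on plan age (years)

# Hypothetical plan used only for illustration.
plan <- data.frame(mrate = 0.5, age = 10)

# Predicted participation rate for that plan.
prate_hat <- b0 + b_mrate * plan$mrate + b_age * plan$age
print(prate_hat)

# Predicted change from raising the match rate by 0.5 while holding age fixed.
delta_mrate <- b_mrate * 0.5
print(delta_mrate)
```

With the fitted object available, `predict(model3, newdata = plan)` should return the same prediction up to rounding of the coefficients.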