Sections
- Packages Required
- Introduction
- Questions
Required Packages
The packages required for this markdown are:
1. One-Way Anova
A. Effect of Teleworking on Income
The coefficient of the independent variable telecommute is significant (p-value < 0.05). It is the only predictor in the model.
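| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 1183 | 15.69 | 75.43 | 0 | * * * |
| telecommute2 | -350.8 | 18.84 | -18.61 | 4.572e-75 | * * * |

Fitting linear model: weekly_earnings ~ telecommute

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 5542 | 647.1 | 0.05886 | 0.05869 |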
B. Plot
The mean weekly income of telecommuters is about $351 higher than that of non-telecommuters.
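The $351 gap can be read straight off the coefficient table. The short sketch below does that arithmetic. It assumes that level 1 of telecommute (the baseline) is "telecommuter" and level 2 is "non-telecommuter", which is what the statement above implies but is not shown explicitly in the output.

```python
# Coefficients copied from the one-way model table (weekly_earnings ~ telecommute).
intercept = 1183.0      # mean weekly earnings of the baseline level
telecommute2 = -350.8   # shift of level 2 relative to the baseline

# ASSUMPTION: level 1 = telecommuter (baseline), level 2 = non-telecommuter.
mean_telecommuter = intercept
mean_non_telecommuter = intercept + telecommute2

gap = mean_telecommuter - mean_non_telecommuter

print(f"telecommuters:     {mean_telecommuter:.1f}")
print(f"non-telecommuters: {mean_non_telecommuter:.1f}")
print(f"gap:               {gap:.1f}")
```

With a single two-level factor, the intercept is the mean of the baseline group and the coefficient is the difference between the two group means.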

C. Naive Explanation
- It is simple. It uses only one variable to build the model and does not explain the underlying causal relationships that produce the variable being investigated.
- For example, this model does not capture the number of hours worked or whether an employee is paid hourly.
2. Two-Way Anova
A. Effect of Teleworking + Hourly Paid on Income
Explanation of model design:
When designing the model, we assumed, based on real-life observations, that hourly paid employees earn less than exempt employees.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 788.6 | 22.52 | 35.02 | 7.508e-243 | * * * |
| telecommute2 | -125.7 | 25.44 | -4.941 | 7.993e-07 | * * * |
| hourly_non_hourly2 | 663 | 29.19 | 22.71 | 3.025e-109 | * * * |
| telecommute2:hourly_non_hourly2 | -179.6 | 35.38 | -5.078 | 3.949e-07 | * * * |

Fitting linear model: weekly_earnings ~ telecommute + hourly_non_hourly + telecommute:hourly_non_hourly

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 5542 | 591.1 | 0.2149 | 0.2144 |
B. Model result explanation
- The p-values of telecommute, hourly_non_hourly and their interaction are all significant (less than 0.05).
- Adjusted R-squared increased to about 21%, compared with about 6% for the one-way ANOVA model.
- This is still a naive model. Many factors that could better explain income are not considered, such as age, industry and geography.
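The adjusted R-squared values compared above can be approximately recovered from the reported R-squared, the number of observations (5542) and the number of estimated slopes. The sketch below uses the standard adjustment formula. The function name `adjusted_r2` is only illustrative, and small differences from the tables are expected because the inputs are already rounded.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 5542  # observations reported under both model tables

# One-way model: one slope (telecommute2).
one_way = adjusted_r2(0.05886, n, p=1)

# Two-way model: three slopes (two main effects plus the interaction).
two_way = adjusted_r2(0.2149, n, p=3)

print(f"one-way adjusted R-squared: {one_way:.4f}")
print(f"two-way adjusted R-squared: {two_way:.4f}")
```

The adjustment barely changes either value because the sample is large relative to the number of predictors.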
C. Model Visualization
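Mean income is highest for employees who are both telecommuters and exempt. Weekly earnings are $125.72 higher for telecommuters among hourly employees, and $662.95 higher for exempt employees among telecommuters. Because the interaction term is significant, these differences vary across groups.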
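To make the interaction concrete, the four group means implied by the coefficient table can be computed directly. This is a sketch under an assumed coding: level 2 of telecommute is taken to be "non-telecommuter" and level 2 of hourly_non_hourly is taken to be "exempt", so the baseline cell is an hourly telecommuter. The helper `cell_mean` is a hypothetical name.

```python
# Coefficients copied from the two-way model table.
b0 = 788.6      # (Intercept): baseline cell
b_tel2 = -125.7 # telecommute2
b_hr2 = 663.0   # hourly_non_hourly2
b_int = -179.6  # telecommute2:hourly_non_hourly2

def cell_mean(tel2: int, hr2: int) -> float:
    """Fitted mean for a cell; tel2 and hr2 are 0/1 dummy indicators."""
    return b0 + b_tel2 * tel2 + b_hr2 * hr2 + b_int * tel2 * hr2

# ASSUMPTION: tel2 = 1 means non-telecommuter, hr2 = 1 means exempt.
labels = {
    (0, 0): "telecommuter, hourly",
    (1, 0): "non-telecommuter, hourly",
    (0, 1): "telecommuter, exempt",
    (1, 1): "non-telecommuter, exempt",
}
means = {cell: cell_mean(*cell) for cell in labels}
for cell, label in labels.items():
    print(f"{label:26s} {means[cell]:8.1f}")

# The telecommuting gap differs by pay type because of the interaction term.
gap_hourly = means[(0, 0)] - means[(1, 0)]
gap_exempt = means[(0, 1)] - means[(1, 1)]
print(f"gap among hourly: {gap_hourly:.1f}, gap among exempt: {gap_exempt:.1f}")
```

Under this coding the telecommuting gap is larger among exempt employees than among hourly employees, which is what a negative interaction coefficient means here.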

D. Comparison LM 1 vs LM 2
The model with telecommute and hourly_non_hourly (model 2) fits better than the simple linear model (model 1) in explaining changes in weekly earnings. The F-test comparing the two models is significant (p-value < 0.05).
Analysis of Variance Table

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| 5540 | 2.32e+09 | NA | NA | NA | NA | NA |
| 5538 | 1.935e+09 | 2 | 384505502 | 550.2 | 1.166e-218 | |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
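The F value in the table can be reproduced from the other entries: it is the reduction in residual sum of squares per added parameter, divided by the residual mean square of the larger model. The sketch below uses the rounded values printed in the table, so it matches only to the displayed precision.

```python
# Values copied from the Analysis of Variance table (rounded as printed).
rss_model_2 = 1.935e9     # residual sum of squares of the larger model
res_df_model_2 = 5538     # its residual degrees of freedom
sum_of_sq = 384505502     # reduction in RSS from model 1 to model 2
df_added = 2              # extra parameters in model 2

# F = (reduction in RSS per added parameter) / (residual mean square of model 2)
mean_sq_added = sum_of_sq / df_added
mean_sq_resid = rss_model_2 / res_df_model_2
f_stat = mean_sq_added / mean_sq_resid

print(f"F = {f_stat:.1f}")
```

A large F means the two extra terms remove far more residual variation than would be expected by chance.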
3. Weekly earnings by hours worked
A. $$\text{weekly\_earnings} = b_1 \cdot \text{hours\_worked} + b_0$$
B. Model
Weekly Earnings = 22.5887 hours worked + 66.0433
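| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 66.04 | 28.58 | 2.311 | 0.02086 | * |
| hours_worked | 22.59 | 0.7072 | 31.94 | 1.169e-205 | * * * |

Fitting linear model: weekly_earnings ~ hours_worked

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 5542 | 613 | 0.1555 | 0.1554 |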
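As a quick illustration of how the fitted line in part B is used, the sketch below evaluates it at a few hour counts. The values 20, 40 and 60 are hypothetical inputs chosen for illustration, and `predict_weekly_earnings` is an illustrative name, not a function from the original analysis.

```python
# Fitted line from part B: Weekly Earnings = 22.5887 * hours worked + 66.0433
SLOPE = 22.5887
INTERCEPT = 66.0433

def predict_weekly_earnings(hours_worked: float) -> float:
    """Point prediction from the simple hours-worked model."""
    return SLOPE * hours_worked + INTERCEPT

# Hypothetical hour counts, for illustration only.
for hours in (20, 40, 60):
    print(f"{hours} hours -> ${predict_weekly_earnings(hours):.2f}")
```

Each additional hour adds about $22.59 to the predicted weekly earnings. With a residual standard error of 613, individual predictions remain very uncertain.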
C. Naive Explanation
- It is simple. It uses only one variable to build the model and does not explain the underlying causal relationships that produce the variable being investigated.
- It does not consider the effect of being hourly paid or exempt, which is an important factor in the number of hours reported.
- Hours worked explains only about 16% of the variability in weekly earnings.
D. Model Analysis
- We cannot see a clear linear relationship between hours worked and weekly earnings. The data are very spread out.
- One reason could be that exempt employees earn a fixed salary regardless of the number of hours worked, so their earnings would not scale linearly with hours.
- The first recommendation is to exclude exempt employees and build a model that explains weekly earnings by hours worked using only hourly paid employees.
- The second recommendation is to try other variables that may better explain weekly earnings.

E. Recommendation
- First recommendation: Excluding exempt employees did not improve the model much (adjusted R-squared rose only from 0.1554 to 0.1898), so we can discard the first recommendation.
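| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 15.15 | 25.83 | 0.5864 | 0.5576 | |
| hours_worked | 18.43 | 0.6746 | 27.32 | 8.611e-148 | * * * |

Fitting linear model: weekly_earnings ~ hours_worked

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 3182 | 424.6 | 0.1901 | 0.1898 |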

- Second recommendation: Adding more variables, for example detailed_occupation_group, helps explain more of the variability in weekly earnings. Hours worked alone is not enough to explain weekly earnings. Combining further variables such as geography, profession, sex and industry should improve the model more.
For example, adding detailed_occupation_group increases the adjusted R-squared to about 32%.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 568.2 | 35.58 | 15.97 | 3.716e-56 | * * * |
| hours_worked | 18.6 | 0.6629 | 28.06 | 4.479e-162 | * * * |
| as.factor(detailed_occupation_group)2 | -13.7 | 38.45 | -0.3563 | 0.7217 | |
| as.factor(detailed_occupation_group)3 | 74.54 | 46.22 | 1.613 | 0.1068 | |
| as.factor(detailed_occupation_group)4 | 94.24 | 49.55 | 1.902 | 0.05723 | |
| as.factor(detailed_occupation_group)5 | -154.7 | 67.95 | -2.277 | 0.02283 | * |
| as.factor(detailed_occupation_group)6 | -362 | 62.3 | -5.811 | 6.561e-09 | * * * |
| as.factor(detailed_occupation_group)7 | 245.2 | 68.81 | 3.563 | 0.0003697 | * * * |
| as.factor(detailed_occupation_group)8 | -139.9 | 42.43 | -3.298 | 0.0009804 | * * * |
| as.factor(detailed_occupation_group)9 | -195.3 | 61.14 | -3.194 | 0.001411 | * * |
| as.factor(detailed_occupation_group)10 | -149.4 | 36.34 | -4.11 | 4.008e-05 | * * * |
| as.factor(detailed_occupation_group)11 | -735.2 | 48.78 | -15.07 | 2.487e-50 | * * * |
| as.factor(detailed_occupation_group)12 | -358.8 | 51.87 | -6.917 | 5.12e-12 | * * * |
| as.factor(detailed_occupation_group)13 | -746.2 | 39.92 | -18.69 | 1.193e-75 | * * * |
| as.factor(detailed_occupation_group)14 | -717.1 | 49.68 | -14.43 | 2.204e-46 | * * * |
| as.factor(detailed_occupation_group)15 | -750.2 | 50.12 | -14.97 | 1.096e-49 | * * * |
| as.factor(detailed_occupation_group)16 | -442.7 | 31.96 | -13.85 | 6.572e-43 | * * * |
| as.factor(detailed_occupation_group)17 | -570.5 | 29.1 | -19.61 | 8.15e-83 | * * * |
| as.factor(detailed_occupation_group)18 | -719.7 | 108.2 | -6.65 | 3.218e-11 | * * * |
| as.factor(detailed_occupation_group)19 | -397.6 | 42.66 | -9.321 | 1.626e-20 | * * * |
| as.factor(detailed_occupation_group)20 | -313.6 | 44.11 | -7.111 | 1.303e-12 | * * * |
| as.factor(detailed_occupation_group)21 | -551.7 | 37.81 | -14.59 | 2.479e-47 | * * * |
| as.factor(detailed_occupation_group)22 | -536.3 | 37.58 | -14.27 | 2.064e-45 | * * * |

Fitting linear model: weekly_earnings ~ hours_worked + as.factor(detailed_occupation_group)

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 5542 | 551.1 | 0.32 | 0.3173 |
4. Weekly earnings by age
A. $$\text{weekly\_earnings} = b_1 \cdot \text{age} + b_0$$
B. Model
Weekly Earnings = 9.1941 age + 548.9457
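| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 548.9 | 28.24 | 19.44 | 1.681e-81 | * * * |
| age | 9.194 | 0.6306 | 14.58 | 2.787e-47 | * * * |

Fitting linear model: weekly_earnings ~ age

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 5542 | 654.6 | 0.03696 | 0.03678 |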
C. Naive Explanation
- It uses only age to explain weekly earnings.
- It does not use other variables that can influence weekly earnings. For example, the effect of age may differ by gender.
- It does not look at any interaction between age and other variables.
D. Visualization

E. Additional Explanation
- The visualization in part D shows that the relationship between weekly earnings and age is not clearly linear. The points are widely scattered, even though the p-value indicates that the relationship is significant.
- The variability explained by age alone is very small, only about 3.7%.
- The residual standard error is 654.6. Relative to the intercept of 548.9457 (the predicted weekly earnings at age 0, not the sample mean), that is 654.6/548.95 = 1.19, so a typical prediction error is about 119% of the intercept. Predictions from this model are therefore very imprecise.
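The ratio in the last bullet can be reproduced in a few lines. Note that 548.9457 is the fitted intercept of the age model, so the sketch also computes the same ratio against a prediction at a hypothetical age of 40. That age is an assumption chosen only for illustration, since the sample mean of weekly earnings is not reported in this section.

```python
# Values copied from the age model in part B.
residual_std_error = 654.6
intercept = 548.9457
slope = 9.1941

# Ratio used in the text: residual standard error relative to the intercept.
ratio_vs_intercept = residual_std_error / intercept

# Same ratio relative to a prediction at a HYPOTHETICAL age of 40.
hypothetical_age = 40
prediction_at_40 = slope * hypothetical_age + intercept
ratio_vs_prediction = residual_std_error / prediction_at_40

print(f"RSE / intercept:        {ratio_vs_intercept:.2f}")
print(f"prediction at age 40:   {prediction_at_40:.2f}")
print(f"RSE / prediction at 40: {ratio_vs_prediction:.2f}")
```

Either way, the typical error is of the same order as the prediction itself, which supports the conclusion that age alone predicts weekly earnings poorly.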
5. Aggregated model
A. weekly_earnings = b0 + b1(age) + b2(education) + b3(hours_worked) + b4(hourly_non_hourly) + b5(occupation_group) + b6(industry)
B. Model
| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | -222.1 | 229.9 | -0.9661 | 0.334 | |
| age | 5.691 | 0.5049 | 11.27 | 3.788e-29 | * * * |
| as.factor(education)32 | -151.4 | 306.1 | -0.4945 | 0.621 | |
| as.factor(education)33 | -92.8 | 230 | -0.4034 | 0.6867 | |
| as.factor(education)34 | -27.13 | 237.3 | -0.1143 | 0.909 | |
| as.factor(education)35 | -32.97 | 224.9 | -0.1466 | 0.8834 | |
| as.factor(education)36 | 98.48 | 217.3 | 0.4531 | 0.6505 | |
| as.factor(education)37 | 50.51 | 214.3 | 0.2357 | 0.8136 | |
| as.factor(education)38 | 24.02 | 222.7 | 0.1079 | 0.9141 | |
| as.factor(education)39 | 109.9 | 207.1 | 0.5304 | 0.5959 | |
| as.factor(education)40 | 123.5 | 207.4 | 0.5957 | 0.5514 | |
| as.factor(education)41 | 157.2 | 208.9 | 0.7526 | 0.4517 | |
| as.factor(education)42 | 191.6 | 208.5 | 0.9191 | 0.3581 | |
| as.factor(education)43 | 367 | 207.5 | 1.769 | 0.07702 | |
| as.factor(education)44 | 469.9 | 208.4 | 2.255 | 0.02416 | * |
| as.factor(education)45 | 732.2 | 213.7 | 3.426 | 0.0006159 | * * * |
| as.factor(education)46 | 655.3 | 213 | 3.076 | 0.002108 | * * |
| hours_worked | 15.99 | 0.619 | 25.83 | 5.511e-139 | * * * |
| as.factor(hourly_non_hourly)2 | 245.2 | 16.24 | 15.1 | 1.715e-50 | * * * |
| as.factor(occupation_group)2 | -72.59 | 23.34 | -3.111 | 0.001877 | * * |
| as.factor(occupation_group)3 | -297.1 | 27.41 | -10.84 | 4.173e-27 | * * * |
| as.factor(occupation_group)4 | -220.4 | 30.75 | -7.167 | 8.696e-13 | * * * |
| as.factor(occupation_group)5 | -344.9 | 25.66 | -13.45 | 1.42e-40 | * * * |
| as.factor(occupation_group)6 | -237.7 | 128.1 | -1.855 | 0.0636 | |
| as.factor(occupation_group)7 | -99.26 | 49.19 | -2.018 | 0.04365 | * |
| as.factor(occupation_group)8 | -63.14 | 41.08 | -1.537 | 0.1243 | |
| as.factor(occupation_group)9 | -285.3 | 37.97 | -7.514 | 6.677e-14 | * * * |
| as.factor(occupation_group)10 | -275.6 | 36.79 | -7.491 | 7.908e-14 | * * * |
| as.factor(industry)2 | 413.1 | 114.7 | 3.6 | 0.0003205 | * * * |
| as.factor(industry)3 | 192.2 | 100 | 1.921 | 0.05474 | |
| as.factor(industry)4 | 239.3 | 93.98 | 2.546 | 0.01092 | * |
| as.factor(industry)5 | 79.46 | 94.14 | 0.844 | 0.3987 | |
| as.factor(industry)6 | 269.3 | 97.2 | 2.77 | 0.005619 | * * |
| as.factor(industry)7 | 221.2 | 103.2 | 2.142 | 0.0322 | * |
| as.factor(industry)8 | 179.7 | 95.11 | 1.89 | 0.05885 | |
| as.factor(industry)9 | 231.7 | 93.66 | 2.474 | 0.01341 | * |
| as.factor(industry)10 | 53.33 | 92.7 | 0.5753 | 0.5651 | |
| as.factor(industry)11 | 26.67 | 94.33 | 0.2828 | 0.7774 | |
| as.factor(industry)12 | -7.591 | 97.4 | -0.07794 | 0.9379 | |
| as.factor(industry)13 | 198 | 94.78 | 2.089 | 0.03678 | * |

Fitting linear model: weekly_earnings ~ age + as.factor(education) + hours_worked + as.factor(hourly_non_hourly) + as.factor(occupation_group) + as.factor(industry)

| Observations | Residual Std. Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| 5542 | 504.6 | 0.4316 | 0.4276 |
C. Validation of Collinearity
Some pairs of variables show problematically high correlations: hourly_non_hourly & occupation_group, hourly_non_hourly & education, and occupation_group & industry.
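The text does not say how these associations were measured. For pairs of categorical variables like the ones listed, one common choice is Cramér's V, computed from a contingency table of counts. The sketch below is an assumed approach, not necessarily the one used here, and the 2x2 table in it is made-up illustration data rather than counts from this dataset.

```python
import math

def cramers_v(table: list[list[float]]) -> float:
    """Cramér's V for a contingency table of counts (0 = none, 1 = perfect).

    Assumes every row and every column has a nonzero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# HYPOTHETICAL counts, e.g. rows = pay type, columns = two occupation groups.
example_table = [[30, 10],
                 [10, 30]]
print(f"Cramér's V = {cramers_v(example_table):.2f}")
```

Values near 1 for a pair of predictors suggest that they carry overlapping information, which can inflate coefficient standard errors when both are included in the model.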
D. Ranges Judgement
- The model is most informative for education levels equal to or higher than a Master's degree. In the output, the significant education p-values are those for codes 44 and above, which correspond to Master's, Professional School and PhD.
- Industries 2, 4, 6, 7, 9 and 13 are significant at the 0.05 level, with industries 2, 4, 6 and 9 having the smallest p-values. However, it may be too early to say that industry 2 (mining) is the most relevant, because it has few observations compared with the other industries.
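Which industry levels count as significant depends on the cutoff. The sketch below filters the industry p-values copied from the coefficient table in part B at the 0.05 level used elsewhere in this report, and at a stricter 0.02 level that is included only as a hypothetical comparison.

```python
# p-values for as.factor(industry) levels, copied from the part B table.
industry_p = {
    2: 0.0003205, 3: 0.05474, 4: 0.01092, 5: 0.3987,
    6: 0.005619, 7: 0.0322, 8: 0.05885, 9: 0.01341,
    10: 0.5651, 11: 0.7774, 12: 0.9379, 13: 0.03678,
}

def significant_levels(p_values: dict[int, float], alpha: float) -> list[int]:
    """Return the factor levels whose p-value is below alpha, in order."""
    return sorted(level for level, p in p_values.items() if p < alpha)

at_005 = significant_levels(industry_p, alpha=0.05)
at_002 = significant_levels(industry_p, alpha=0.02)  # hypothetical stricter cutoff

print("significant at 0.05:", at_005)
print("significant at 0.02:", at_002)
```

Each p-value compares that industry with the baseline industry (level 1), so significance here means "different from the baseline", not "important overall".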
E. Hypothetical Observation
Observations that fit the ranges in part D: education = 45 and industry = 4.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|
| 1794.791 | 1947.052 | 1880.151 | 1827.548 | 1912.365 | 2027.276 | 2175.233 | 1674.571 | 2044.931 |
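To show how a value of this kind arises from the aggregated model, the sketch below adds up the relevant coefficients for one made-up employee with education = 45 and industry = 4. The age, hours worked, pay type and occupation group are hypothetical assumptions chosen for illustration. They are not the inputs behind the nine values listed above, which are not given in this section.

```python
# Coefficients copied from the aggregated model table in part B.
coef = {
    "(Intercept)": -222.1,
    "age": 5.691,
    "hours_worked": 15.99,
    "education45": 732.2,
    "hourly_non_hourly2": 245.2,
    "industry4": 239.3,
}

# HYPOTHETICAL employee: only education = 45 and industry = 4 come from the text.
age = 40                 # assumed
hours_worked = 40        # assumed
is_level_2_pay_type = 1  # assumed: hourly_non_hourly = 2
# occupation_group is assumed to be the baseline level (1), which adds nothing.

prediction = (
    coef["(Intercept)"]
    + coef["age"] * age
    + coef["hours_worked"] * hours_worked
    + coef["education45"]
    + coef["hourly_non_hourly2"] * is_level_2_pay_type
    + coef["industry4"]
)

print(f"predicted weekly earnings: ${prediction:.2f}")
```

Dummy-coded factors contribute their coefficient only when the observation is at that level. Baseline levels are absorbed into the intercept.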