Intro to GLM and Linearity
Introduction
This evaluation uses data from a Census Bureau survey about internet and technology use.
Data source: http://asayanalytics.com/telework_csv
1) Simple One-Way ANOVA
a. Effect of teleworking on income
Question: Does Teleworking appear to have a significant effect on income? How do you know? Explain the effect (if it exists)
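As shown in the ANOVA results below, there is a significant relationship between income and whether one telecommutes. The p-value for telecommute2 is 4.572e-75, far below 0.001, so the result is highly significant. The effect itself is the telecommute2 coefficient: the group coded 2 earns on average $350.80 less than the baseline group, whose mean income is the intercept of $1,183. Thus, teleworking appears to have a significant effect on income, although the model explains only about 5.9% of the variation in income (\(R^2\) = 0.05886).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |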
|---|---|---|---|---|
| (Intercept) | 1183 | 15.69 | 75.43 | 0 |
| telecommute2 | -350.8 | 18.84 | -18.61 | 4.572e-75 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 647.1 | 0.05886 | 0.05869 |
b. Visualization
The box plot below compares the income distributions of the two groups. Consistent with the model above, the mean income of telecommuters is $350.80 higher than that of non-telecommuters. Note that the one-way ANOVA is a naive model: it only compares mean income across the levels of a single variable. It does not incorporate other variables that might also have an effect, such as the number of hours worked and whether the employee is paid hourly.
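To make the reading of the coefficient table concrete, here is a minimal Python sketch of how the two group means follow from the fitted coefficients. It assumes the usual treatment (dummy) coding, in which the intercept is the mean of the baseline level and telecommute2 is the shift for the level coded 2. The variable names are illustrative and not part of the original analysis.

```python
# Coefficients copied from the one-way ANOVA table above.
intercept = 1183.0      # mean income of the baseline level (telecommute = 1)
telecommute2 = -350.8   # shift for the level coded 2, relative to the baseline
std_error = 18.84       # standard error of the telecommute2 estimate

# Under treatment coding, each group mean is the intercept plus its shift.
mean_level_1 = intercept
mean_level_2 = intercept + telecommute2

# The t value is the estimate divided by its standard error
# (the table's -18.61 was computed from unrounded values).
t_value = telecommute2 / std_error

print(f"level 1 mean: {mean_level_1:.1f}")
print(f"level 2 mean: {mean_level_2:.1f}")
print(f"t value: {t_value:.2f}")
```

The gap between the two printed means is exactly the $350.80 quoted in the text.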
2) Simple Regression Model: Earnings by Hours Worked
Instruction: Build a simple regression model estimating weekly earnings by hours worked.
a. Regression model using Beta Notation
The generalized form of the regression using beta notation is weekly_earnings = b_0 + b_1(hours_worked).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 66.04 | 28.58 | 2.311 | 0.02086 |
| hours_worked | 22.59 | 0.7072 | 31.94 | 1.169e-205 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 613 | 0.1555 | 0.1554 |
b. Estimated form of the regression based on results
Based on the results from part A, the estimated form of the regression model is weekly_earnings = 66.04 + 22.59(hours_worked)
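As a small illustration of how the estimated equation is used, the following Python sketch plugs hours into the fitted line. The function name and the example hour values are assumptions made for illustration; only the two coefficients come from the table above.

```python
def predict_weekly_earnings(hours_worked: float) -> float:
    """Estimated weekly earnings from the fitted simple regression."""
    b0 = 66.04   # intercept: estimated earnings at zero hours
    b1 = 22.59   # estimated dollars per additional hour worked
    return b0 + b1 * hours_worked


# Example inputs (hypothetical): a 20-hour and a 40-hour week.
for hours in (20, 40):
    print(hours, round(predict_weekly_earnings(hours), 2))
```

Doubling the hours does not double the estimate, because the intercept of 66.04 is added regardless of hours worked.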
c. Naive Explanations
Provide at least 3 explanations for why we would consider this a naïve model.
- Only about 15.5% of the variation in weekly earnings is explained by the number of hours worked (\(R^2\) = 0.1555). The remaining 84.5% or so is unexplained.
- It is a simple model involving only two variables (one dependent and one independent). It does not explain the underlying causal relationships that produce weekly earnings.
- The model does not take into account other independent variables that may also have an effect on weekly earnings (e.g. hourly or non-hourly pay and full-time or part-time status).
d. Model Improvement Proposal
Based on the scatter plot shown below, there is no clear linear relationship between weekly earnings and hours worked. To improve this model, I would add other variables such as full-time/part-time status and hourly/non-hourly pay. If employees are not paid hourly, it makes little sense to model earnings as a function of hours: their salary is already fixed, so hours worked would not drive it.
Thus, my first suggestion is to separate out employees who are not paid hourly by modeling weekly earnings on hours worked with pay type (hourly or non-hourly) as a second independent variable. The second recommendation is to look at other variables that may also have an effect on the model, specifically full-time/part-time status.
e1. Model Improvement Execution - 1st Recommendation
First recommendation: account for employees who are not paid hourly by modeling weekly earnings as a function of hours worked, with pay type as a second independent variable. This is done by adding the hourly_non_hourly variable as a factor. All 5,542 observations are kept, so non-hourly employees are controlled for rather than dropped.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 23.06 | 26.31 | 0.8764 | 0.3809 |
| hours_worked | 18.21 | 0.6645 | 27.41 | 3.215e-155 |
| hourly_non_hourly2 | 498.5 | 15.65 | 31.86 | 1.082e-204 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 563.5 | 0.2863 | 0.2861 |
As shown in the table above, both independent variables (hours_worked and hourly_non_hourly) appear to have a significant effect on weekly earnings, since their p-values are far below 0.05. Adding pay type also raises \(R^2\) from 0.1555 in the hours-only model to 0.2863.
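The role of the factor term can be sketched in Python as follows. The sketch assumes treatment coding, so hourly_non_hourly2 is a 0/1 indicator for the pay level coded 2; the function name and the 40-hour example are hypothetical.

```python
def predict_with_pay_type(hours_worked: float, pay_level: int) -> float:
    """Estimated weekly earnings from the two-variable model.

    pay_level is the hourly_non_hourly code (1 = baseline, 2 = other level).
    """
    b0 = 23.06        # intercept
    b_hours = 18.21   # dollars per additional hour worked
    b_level2 = 498.5  # constant shift for the level coded 2
    indicator = 1 if pay_level == 2 else 0
    return b0 + b_hours * hours_worked + b_level2 * indicator


baseline = predict_with_pay_type(40, pay_level=1)
level_2 = predict_with_pay_type(40, pay_level=2)
print(round(baseline, 2), round(level_2, 2), round(level_2 - baseline, 2))
```

At any fixed number of hours the two pay levels differ by the same $498.50, which is what an additive factor term implies: it moves the line up or down without changing its slope.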
e2. Model Improvement Execution - 2nd Recommendation
Second recommendation: look at other variables that may also have an effect on the model, such as the full-time/part-time position, education, sex, citizenship, age, geographic region, occupation group, and industry.
After exploring the data, I found two other variables that also have a significant effect on weekly earnings: 1) detailed_occupation_group and 2) education, especially a Bachelor’s degree and above (i.e. education codes 43 and up). Added to hours_worked, each of them raises \(R^2\) from about 15.5% to about 32%.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 568.2 | 35.58 | 15.97 | 3.716e-56 |
| hours_worked | 18.6 | 0.6629 | 28.06 | 4.479e-162 |
| detailed_occupation_group2 | -13.7 | 38.45 | -0.3563 | 0.7217 |
| detailed_occupation_group3 | 74.54 | 46.22 | 1.613 | 0.1068 |
| detailed_occupation_group4 | 94.24 | 49.55 | 1.902 | 0.05723 |
| detailed_occupation_group5 | -154.7 | 67.95 | -2.277 | 0.02283 |
| detailed_occupation_group6 | -362 | 62.3 | -5.811 | 6.561e-09 |
| detailed_occupation_group7 | 245.2 | 68.81 | 3.563 | 0.0003697 |
| detailed_occupation_group8 | -139.9 | 42.43 | -3.298 | 0.0009804 |
| detailed_occupation_group9 | -195.3 | 61.14 | -3.194 | 0.001411 |
| detailed_occupation_group10 | -149.4 | 36.34 | -4.11 | 4.008e-05 |
| detailed_occupation_group11 | -735.2 | 48.78 | -15.07 | 2.487e-50 |
| detailed_occupation_group12 | -358.8 | 51.87 | -6.917 | 5.12e-12 |
| detailed_occupation_group13 | -746.2 | 39.92 | -18.69 | 1.193e-75 |
| detailed_occupation_group14 | -717.1 | 49.68 | -14.43 | 2.204e-46 |
| detailed_occupation_group15 | -750.2 | 50.12 | -14.97 | 1.096e-49 |
| detailed_occupation_group16 | -442.7 | 31.96 | -13.85 | 6.572e-43 |
| detailed_occupation_group17 | -570.5 | 29.1 | -19.61 | 8.15e-83 |
| detailed_occupation_group18 | -719.7 | 108.2 | -6.65 | 3.218e-11 |
| detailed_occupation_group19 | -397.6 | 42.66 | -9.321 | 1.626e-20 |
| detailed_occupation_group20 | -313.6 | 44.11 | -7.111 | 1.303e-12 |
| detailed_occupation_group21 | -551.7 | 37.81 | -14.59 | 2.479e-47 |
| detailed_occupation_group22 | -536.3 | 37.58 | -14.27 | 2.064e-45 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 551.1 | 0.32 | 0.3173 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -140.1 | 226.5 | -0.6187 | 0.5361 |
| hours_worked | 20.67 | 0.6407 | 32.26 | 2.504e-209 |
| education32 | -276.7 | 334 | -0.8285 | 0.4074 |
| education33 | -143 | 250.8 | -0.5703 | 0.5685 |
| education34 | -124.7 | 258.3 | -0.4829 | 0.6292 |
| education35 | -151.8 | 244.8 | -0.62 | 0.5353 |
| education36 | -47.71 | 237 | -0.2013 | 0.8404 |
| education37 | -73.85 | 233.5 | -0.3163 | 0.7518 |
| education38 | -125.5 | 242.8 | -0.517 | 0.6052 |
| education39 | 58.78 | 225.7 | 0.2605 | 0.7945 |
| education40 | 95.83 | 225.8 | 0.4244 | 0.6713 |
| education41 | 158.1 | 227.5 | 0.6947 | 0.4873 |
| education42 | 225.2 | 227 | 0.9921 | 0.3212 |
| education43 | 510.1 | 225.7 | 2.26 | 0.02385 |
| education44 | 696.8 | 226.4 | 3.077 | 0.002098 |
| education45 | 995.3 | 232.1 | 4.289 | 1.829e-05 |
| education46 | 874.2 | 231.4 | 3.778 | 0.0001599 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 551.6 | 0.318 | 0.316 |
3) Simple Regression Model: Earning as Function of Age
Instruction: Build a simple regression model estimating weekly earnings as a function of age.
a. Regression model using Beta Notation
The generalized form of the regression model using beta notation is weekly_earnings = b_0 + b_1(age)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 548.9 | 28.24 | 19.44 | 1.681e-81 |
| age | 9.194 | 0.6306 | 14.58 | 2.787e-47 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 654.6 | 0.03696 | 0.03678 |
b. Estimated form of the regression based on results
Based on the results from part A, the estimated form of the regression model is weekly_earnings = 548.9457 + 9.1941(age)
c. Naive Explanations
Provide at least 3 explanations for why this is considered a naïve model.
- The model uses only age to predict weekly earnings, which is not how earnings are determined in reality.
- Age explains only about 3.7% of the variation in weekly earnings (\(R^2\) = 0.03696).
- The model ignores every other variable that would also influence weekly earnings.
d. Linearity Assumption Test
Based on the plot below, the linear model is clearly not a strong fit: the fitted polynomial curve does not even touch the blue line of the linear model.
e. Possible Concerns
Three possible concerns regarding this linear model beyond those inherent in the naïve design:
Outliers: My first concern is the outliers in the weekly_earnings variable. I would first transform the data, either by filtering out those outliers or by taking the log of weekly earnings.
Older age = fewer data points: My second concern is the sparse data at higher ages. As people get older, they tend to be less technologically competent, to have retired (and thus have no more earnings), or to have passed away. The data points are clearly concentrated between the mid-20s and the mid-60s. This may also contribute to the outlier problem, so I believe the best way to address this concern is to filter out the ages that have very few data points.
Age as a discrete variable: Since weekly_earnings does not increase steadily as age increases, treating age as a single numerical predictor would not be correct. This is my last concern, and one way to address it is to transform age into a factor (categorical) variable.
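The log transformation mentioned under the outlier concern can be sketched in Python. The earnings values below are hypothetical and chosen only to include one large outlier; they are not taken from the survey data.

```python
import math
from statistics import median

# Hypothetical weekly earnings with one large outlier (not survey data).
weekly_earnings = [300.0, 450.0, 600.0, 800.0, 1000.0, 2885.0]

# Natural log compresses large values far more than small ones.
log_earnings = [math.log(x) for x in weekly_earnings]

# How far the largest value sits above the median, before and after.
raw_ratio = max(weekly_earnings) / median(weekly_earnings)
log_ratio = max(log_earnings) / median(log_earnings)

print(f"raw max/median: {raw_ratio:.2f}")
print(f"log max/median: {log_ratio:.2f}")
```

On the raw scale the outlier is roughly four times the median, while on the log scale it is only about 20% above it, so it pulls much less on a least-squares fit. Coefficients of a model fitted to log earnings would then describe approximate percentage changes rather than dollar amounts.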
4) Modification from Q3
Instruction: Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:
a. Regression model using Beta Notation
The generalized form of the regression model using beta notation is weekly_earnings = b_0 + b_1(age) + b_2(hours_worked) + b_3(telecommute) + b_4(hourly_non_hourly) + b_5(full_or_part_time) + b_6(occupation_group) + b_7(industry) + b_8(education)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 378.8 | 312.4 | 1.212 | 0.2254 |
| as.factor(age)16 | -150.1 | 259.1 | -0.5792 | 0.5625 |
| as.factor(age)17 | -114.1 | 245.9 | -0.4639 | 0.6428 |
| as.factor(age)18 | -177.4 | 227.3 | -0.7805 | 0.4352 |
| as.factor(age)19 | -237.3 | 228.1 | -1.04 | 0.2983 |
| as.factor(age)20 | -303.3 | 223.3 | -1.358 | 0.1745 |
| as.factor(age)21 | -270.4 | 221.3 | -1.221 | 0.222 |
| as.factor(age)22 | -400.8 | 220.1 | -1.821 | 0.06865 |
| as.factor(age)23 | -335 | 218.6 | -1.532 | 0.1255 |
| as.factor(age)24 | -353.1 | 218.8 | -1.614 | 0.1066 |
| as.factor(age)25 | -282.4 | 218.3 | -1.293 | 0.196 |
| as.factor(age)26 | -374.2 | 218.4 | -1.714 | 0.08662 |
| as.factor(age)27 | -277.6 | 218.4 | -1.271 | 0.2038 |
| as.factor(age)28 | -206.7 | 218.1 | -0.9476 | 0.3434 |
| as.factor(age)29 | -239.6 | 218 | -1.099 | 0.2718 |
| as.factor(age)30 | -212.2 | 218.3 | -0.972 | 0.3311 |
| as.factor(age)31 | -181.6 | 217.2 | -0.8363 | 0.403 |
| as.factor(age)32 | -159.5 | 218.7 | -0.7295 | 0.4657 |
| as.factor(age)33 | -208.2 | 217.6 | -0.957 | 0.3386 |
| as.factor(age)34 | -209.5 | 217.9 | -0.9615 | 0.3363 |
| as.factor(age)35 | -166 | 218 | -0.7615 | 0.4464 |
| as.factor(age)36 | -162.3 | 218.6 | -0.7425 | 0.4578 |
| as.factor(age)37 | -238.1 | 218.9 | -1.088 | 0.2767 |
| as.factor(age)38 | -165 | 218.6 | -0.7551 | 0.4502 |
| as.factor(age)39 | -33.61 | 219 | -0.1535 | 0.878 |
| as.factor(age)40 | -84.6 | 218.7 | -0.3868 | 0.699 |
| as.factor(age)41 | -31.39 | 218.9 | -0.1434 | 0.886 |
| as.factor(age)42 | -92.4 | 219.5 | -0.421 | 0.6738 |
| as.factor(age)43 | -215.8 | 219.2 | -0.9844 | 0.325 |
| as.factor(age)44 | -76.35 | 219 | -0.3485 | 0.7274 |
| as.factor(age)45 | -66.81 | 217.9 | -0.3067 | 0.7591 |
| as.factor(age)46 | -10.38 | 219.1 | -0.04738 | 0.9622 |
| as.factor(age)47 | -162.1 | 218.9 | -0.7405 | 0.459 |
| as.factor(age)48 | -69.7 | 218.6 | -0.3188 | 0.7499 |
| as.factor(age)49 | -110 | 219 | -0.5021 | 0.6156 |
| as.factor(age)50 | -108.4 | 217.8 | -0.4975 | 0.6188 |
| as.factor(age)51 | -41.9 | 219.4 | -0.191 | 0.8485 |
| as.factor(age)52 | -42.82 | 218.7 | -0.1958 | 0.8448 |
| as.factor(age)53 | -87.1 | 218.5 | -0.3985 | 0.6902 |
| as.factor(age)54 | -30.53 | 218.8 | -0.1395 | 0.889 |
| as.factor(age)55 | 34.81 | 218.5 | 0.1593 | 0.8734 |
| as.factor(age)56 | -112 | 218.5 | -0.5126 | 0.6083 |
| as.factor(age)57 | -75.05 | 218.9 | -0.3429 | 0.7317 |
| as.factor(age)58 | -58.09 | 218.5 | -0.2658 | 0.7904 |
| as.factor(age)59 | 34.81 | 219.6 | 0.1585 | 0.8741 |
| as.factor(age)60 | -49.22 | 219.4 | -0.2243 | 0.8225 |
| as.factor(age)61 | -9.342 | 219 | -0.04266 | 0.966 |
| as.factor(age)62 | -96.8 | 220.5 | -0.439 | 0.6607 |
| as.factor(age)63 | -24.32 | 221.2 | -0.1099 | 0.9125 |
| as.factor(age)64 | 4.347 | 223.1 | 0.01948 | 0.9845 |
| as.factor(age)65 | -156.7 | 223.8 | -0.7 | 0.484 |
| as.factor(age)66 | 62.54 | 229.5 | 0.2726 | 0.7852 |
| as.factor(age)67 | -34.59 | 229.4 | -0.1508 | 0.8801 |
| as.factor(age)68 | -71.58 | 232.4 | -0.308 | 0.7581 |
| as.factor(age)69 | 111.9 | 239.6 | 0.4669 | 0.6406 |
| as.factor(age)70 | -243.2 | 236.5 | -1.029 | 0.3037 |
| as.factor(age)71 | -6.87 | 251.8 | -0.02729 | 0.9782 |
| as.factor(age)72 | -236.4 | 284.5 | -0.8311 | 0.406 |
| as.factor(age)73 | -317.5 | 261.6 | -1.213 | 0.225 |
| as.factor(age)74 | -69.66 | 265.4 | -0.2625 | 0.793 |
| as.factor(age)75 | 48.62 | 284.5 | 0.1709 | 0.8643 |
| as.factor(age)76 | -391.2 | 284.9 | -1.373 | 0.1697 |
| as.factor(age)77 | -143.9 | 309.1 | -0.4656 | 0.6415 |
| as.factor(age)78 | 276 | 308.8 | 0.8939 | 0.3714 |
| as.factor(age)79 | -387.8 | 358.4 | -1.082 | 0.2792 |
| as.factor(age)80 | -299.6 | 270.9 | -1.106 | 0.2687 |
| as.factor(age)85 | -155.1 | 277.1 | -0.5598 | 0.5757 |
| hours_worked | 12.19 | 0.7286 | 16.74 | 2.437e-61 |
| telecommute2 | -83.79 | 15.57 | -5.382 | 7.693e-08 |
| hourly_non_hourly2 | 222.2 | 16.18 | 13.73 | 3.262e-42 |
| full_or_part_time2 | -210.7 | 24.35 | -8.652 | 6.558e-18 |
| occupation_group2 | -70.72 | 23.16 | -3.054 | 0.00227 |
| occupation_group3 | -272.9 | 27.29 | -10 | 2.426e-23 |
| occupation_group4 | -197 | 30.65 | -6.428 | 1.404e-10 |
| occupation_group5 | -330.7 | 25.59 | -12.92 | 1.246e-37 |
| occupation_group6 | -164.9 | 127 | -1.299 | 0.194 |
| occupation_group7 | -94.99 | 48.8 | -1.947 | 0.05163 |
| occupation_group8 | -61.14 | 40.8 | -1.499 | 0.134 |
| occupation_group9 | -271 | 37.72 | -7.185 | 7.628e-13 |
| occupation_group10 | -253.1 | 36.57 | -6.921 | 4.993e-12 |
| industry2 | 454.3 | 113.5 | 4.002 | 6.36e-05 |
| industry3 | 194.9 | 99.03 | 1.968 | 0.04917 |
| industry4 | 229.9 | 92.91 | 2.474 | 0.01339 |
| industry5 | 93.55 | 93.11 | 1.005 | 0.3151 |
| industry6 | 268.3 | 96.17 | 2.79 | 0.005294 |
| industry7 | 222.8 | 102.2 | 2.18 | 0.02932 |
| industry8 | 176 | 94.05 | 1.872 | 0.06129 |
| industry9 | 229.5 | 92.66 | 2.477 | 0.01327 |
| industry10 | 54.8 | 91.66 | 0.5978 | 0.55 |
| industry11 | 57.98 | 93.32 | 0.6214 | 0.5344 |
| industry12 | 9.688 | 96.33 | 0.1006 | 0.9199 |
| industry13 | 186.7 | 93.71 | 1.992 | 0.0464 |
| education32 | -143.1 | 303 | -0.4722 | 0.6368 |
| education33 | -92.71 | 227.2 | -0.408 | 0.6833 |
| education34 | 7.227 | 234.6 | 0.0308 | 0.9754 |
| education35 | -74.15 | 223.7 | -0.3314 | 0.7403 |
| education36 | 115.2 | 215.8 | 0.5337 | 0.5936 |
| education37 | 32.35 | 212.6 | 0.1522 | 0.8791 |
| education38 | 50.05 | 220.4 | 0.2271 | 0.8203 |
| education39 | 125.1 | 204.8 | 0.611 | 0.5412 |
| education40 | 147.8 | 205 | 0.721 | 0.4709 |
| education41 | 166.1 | 206.5 | 0.8043 | 0.4212 |
| education42 | 200.7 | 206.1 | 0.9736 | 0.3303 |
| education43 | 375.1 | 205.2 | 1.828 | 0.0676 |
| education44 | 478.3 | 206.1 | 2.321 | 0.02032 |
| education45 | 734.9 | 211.4 | 3.476 | 0.0005121 |
| education46 | 673.1 | 210.7 | 3.195 | 0.001407 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 5542 | 497 | 0.4553 | 0.4447 |
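The gap between \(R^2\) and adjusted \(R^2\) in the summary above reflects the number of estimated coefficients. A minimal Python sketch of the standard adjustment formula, using the counts read off the coefficient table (the function name is illustrative):

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Slopes in the table: 66 age levels, 4 single terms (hours_worked,
# telecommute2, hourly_non_hourly2, full_or_part_time2), 9 occupation
# levels, 12 industry levels and 15 education levels.
n_predictors = 66 + 4 + 9 + 12 + 15

print(n_predictors)
print(round(adjusted_r2(0.4553, 5542, n_predictors), 4))
```

With 106 slopes the adjustment lowers 0.4553 to about 0.4447, in line with the reported value. In the earlier single-predictor models the two figures are almost identical because only one slope is estimated.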
b. Estimated form of the regression based on results
Based on the results from part A, the estimated form of the regression model is weekly_earnings = 378.8 - 150.1(age16) - 114.1(age17) - … - 155.1(age85) + 12.19(hours_worked) - 83.79(non-telecommute) + 222.2(hourly_non_hourly) - 210.7(part_time) - 70.72(occupation_group2) - … - 253.1(occupation_group10) + 454.3(industry2) + … + 186.7(industry13) - 143.1(education32) - … + 673.1(education46)
Since many IVs are treated as factors, each non-baseline level of a factor (e.g. of age, education, occupation group, and industry) enters the model as its own 0/1 indicator variable.
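How a factor expands into separate variables can be sketched in Python. This assumes treatment coding with the first level as the baseline, and it assumes education codes run from 31 to 46 with 31 as the baseline (the table only shows levels 32 to 46, so the baseline code is an inference). The helper name is hypothetical.

```python
def dummy_encode(value: int, levels: list) -> dict:
    """Return a 0/1 indicator for every level except the first (baseline)."""
    return {f"education{lvl}": int(value == lvl) for lvl in levels[1:]}


# Assumed coding: 31 is the baseline, 32..46 each get their own indicator.
education_levels = list(range(31, 47))

row = dummy_encode(45, education_levels)
print(len(row))            # number of indicator columns
print(row["education45"])  # the only indicator switched on for code 45
print(sum(row.values()))   # at most one indicator is ever 1
```

An observation at the baseline level has all indicators equal to zero, so its education effect is absorbed into the intercept. That is why each education coefficient is read as a difference from the baseline level.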
c. Collinearity
Question: Do you suspect any of your independent variables are collinear? Explain why or why not.
Yes, I suspect that some of my IVs are collinear, particularly telecommute, hourly_non_hourly, full_or_part_time, and hours_worked. The number of hours worked depends on whether employees telecommute, whether they are paid by the hour, and whether they work full time or part time. In addition, occupation_group, industry, and education are likely collinear, because education level largely determines the kinds of positions available in each occupation and industry.
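One common way to quantify such a suspicion is the variance inflation factor (VIF). The Python sketch below covers only the two-predictor case, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between the predictors. The data are hypothetical and only mimic the idea that part-time workers log fewer hours; this check was not part of the original analysis.

```python
from math import sqrt


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x: list, y: list) -> float:
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical sample: hours worked and a part-time indicator (1 = part time).
hours_worked = [40, 45, 38, 20, 15, 25, 42, 18]
part_time = [0, 0, 0, 1, 1, 1, 0, 1]

r = pearson_r(hours_worked, part_time)
vif = vif_two_predictors(hours_worked, part_time)
print(f"r = {r:.3f}, VIF = {vif:.2f}")
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. With more than two predictors, each VIF comes from regressing that predictor on all of the others rather than from a single pairwise correlation.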
d. Ranges of Values
From part B, the linear equation is
weekly_earnings = 378.8 - 150.1(age16) - 114.1(age17) - … - 155.1(age85) + 12.19(hours_worked) - 83.79(non-telecommute) + 222.2(hourly_non_hourly) - 210.7(part_time) - 70.72(occupation_group2) - … - 253.1(occupation_group10) + 454.3(industry2) + … + 186.7(industry13) - 143.1(education32) - … + 673.1(education46)
Judging by the output, the ranges of values over which I am most comfortable using this model to estimate future observations are as follows:
- occupation_group = 3-5 and 9-10
- industry = 2-4, 6, 7, 9, 13
- education = 44-46 (45 being the most significant)
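One plausible way to arrive at such ranges is to keep only the factor levels whose coefficients are significant at the 5% level. The Python sketch below applies that filter to the industry and education p-values copied from the coefficient table in part a. The 0.05 threshold and the helper name are assumptions, since the selection rule is not stated explicitly.

```python
# p-values copied from the part a coefficient table.
industry_p = {
    2: 6.36e-05, 3: 0.04917, 4: 0.01339, 5: 0.3151, 6: 0.005294,
    7: 0.02932, 8: 0.06129, 9: 0.01327, 10: 0.55, 11: 0.5344,
    12: 0.9199, 13: 0.0464,
}
education_p = {
    32: 0.6368, 33: 0.6833, 34: 0.9754, 35: 0.7403, 36: 0.5936,
    37: 0.8791, 38: 0.8203, 39: 0.5412, 40: 0.4709, 41: 0.4212,
    42: 0.3303, 43: 0.0676, 44: 0.02032, 45: 0.0005121, 46: 0.001407,
}


def significant_levels(p_values: dict, alpha: float = 0.05) -> list:
    """Levels whose coefficient p-value is below alpha, in ascending order."""
    return sorted(level for level, p in p_values.items() if p < alpha)


print("industry:", significant_levels(industry_p))
print("education:", significant_levels(education_p))
```

Under this rule the filter returns exactly the industry and education levels listed above. Among the education levels, code 45 has the smallest p-value.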
e. Hypothetical Observation
Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.
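Below is a hypothetical observation built from the ranges in part d, specifically education = 45 and industry = 10. Note that industry 10 is not among the industry levels listed in part d, so only the education value falls inside the stated ranges.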
Predicted Values
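| Observation | Predicted weekly earnings |
|---|---|
| 1 | 1300.7479 |
| 2 | 1037.4839 |
| 3 | 984.7048 |
| 4 | 1128.6815 |
| 5 | 1376.4496 |
| 6 | 1714.5415 |
| 7 | 1587.1710 |
| 8 | 1711.5189 |
| 9 | 1586.5508 |
| 10 | 1671.7889 |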
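The estimated range of weekly earnings asked for in the question can be approximated from a point prediction and the residual standard error of the model (497). The Python sketch below is a rough approximation only: it uses a normal critical value of 1.96 for a 95% range and ignores the uncertainty in the estimated coefficients, so a proper prediction interval would be somewhat wider. It takes the first predicted value from the output above as its example.

```python
predicted = 1300.7479   # first predicted value from the output above
residual_se = 497.0     # residual standard error of the part a model
z_95 = 1.96             # normal critical value for an approximate 95% range

# Approximate range: point prediction plus or minus z * residual SE.
margin = z_95 * residual_se
low, high = predicted - margin, predicted + margin

print(f"approximate 95% range: {low:.2f} to {high:.2f}")
```

The resulting range spans almost $2,000 per week, which shows how imprecise individual predictions remain even though the model explains about 45% of the variation in weekly earnings.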