GLM_HW_Prior
Introduction
This R Markdown Document exists to look at Telework Data and to allow me to learn about ANOVA/linear regression/the General Linear Model.
Statistics is fun. I swear!
Our data comes from the Census Bureau.
Question 1 - One Way (or the highway) ANOVA
Oh, ANOVA. I remember my first beer. This is a One Way ANOVA Test seeking to determine if Teleworking has a Significant Impact on User Income.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, for the most part, in this whole numeric vomit the only value we care about right now is that number under “Pr(>F)”, which reads <2e-16: the p-value is smaller than 2×10^-16, about the smallest value R will bother to print.
In other words, it’s a very small number. Generally speaking, anything smaller than .05 is considered to be Statistically Significant, which is nerd-speak for “We feel like this model bears some resemblance to reality” (more precisely: a gap this big would be very unlikely to show up if Teleworking had nothing to do with Income).
In this case, the ANOVA test finds that an employee’s Teleworking status and their Income are somehow linked: average income differs between the people who telework and the people who don’t.
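For anyone following along, a table shaped like the one above is what R's `aov()` hands back. The chunk that produced it isn't shown here, so this is only a minimal sketch on made-up data: the `sim` data frame, its group means, and the 1/2 coding of `telecommute` are all assumptions, and only the formula mirrors the row label in the output.

```r
# Simulated stand-in data: the real Telework_Master data frame is not reproduced here.
set.seed(42)
n <- 200
sim <- data.frame(
  telecommute = rep(c(1, 2), each = n / 2)  # hypothetical two-group coding
)
# Group 1 is given a higher mean purely for illustration.
sim$weekly_earnings <- ifelse(sim$telecommute == 1, 1200, 850) + rnorm(n, sd = 600)

# One-way ANOVA: does mean weekly earnings differ between the two groups?
fit_aov <- aov(weekly_earnings ~ as.factor(telecommute), data = sim)
anova_table <- summary(fit_aov)[[1]]

# Pull out the F statistic and its p-value for the grouping factor (row 1).
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
print(anova_table)
```

The p-value is stored as an ordinary number; it is only the printing step that collapses anything tiny into `<2e-16`.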
This is supposed to be significant
What makes this model stand out as a “Naive ANOVA” model is that it’s not nearly complex enough - there can be dozens of confounding factors related to the model. Age, education, industry, region and full or part-time status may all have an impact (and potentially a greater impact!) on someone’s income. That’s what we’re trying to measure with Anova.
The Anova model provided above basically asks “are there more dots in the red columns than the blue?” That’s an over-simplification of the model, but the visualization above is also a gross oversimplification of the issue of Telecommuter Average Income Levels. Still, here we see a histogram of the reported earnings figures for Telecommuters and Office ~~Drones~~ Workers, which does suggest that telecommute work is more lucrative than its in-person ilk.
Question 2 - Slightly more Wise ANOVA
Let’s improve our model, shall we? Above we looked at the question of whether “Weekly Earnings” were affected by Telecommuter Status. Now, let’s add in one of those Confounding Variables I was mentioning earlier - age. My thought process is thus - as an employee gets older, they also gain experience/skill. Better skilled labor commands higher salaries. Ergo…the thought is that Age should also have an effect on Weekly Earnings.
Let’s see…
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 361.1 <2e-16 ***
## age 1 9.408e+07 94076788 234.1 <2e-16 ***
## Residuals 5539 2.226e+09 401822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, once again, we see <2e-16 for both variables. That does not disprove the supposition that both Age and Telecommuter Status have an effect on Weekly Earnings. But is this a complete model? Is it even a good model?
I don’t think so. Logically we know, just looking at this data, that Age and Telecommuting are not the complete picture when it comes to one’s earning potential. A 90 year old selling insurance in Dayton over the phone doesn’t necessarily make more than a 28 year old designing Java applets from a beach-bar in Dubai. Skills, industry, education and…honestly luck/networking more than likely need to be included in this model.
Let’s bring in the General Linear Model to find out, with the “lm” command.
First, the original model.
##
## Call:
## lm(formula = weekly_earnings ~ telecommute, data = Telework_Master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1170.3 -447.4 -159.4 277.4 2052.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1533.95 33.06 46.40 <2e-16 ***
## telecommute -350.76 18.84 -18.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 647.1 on 5540 degrees of freedom
## Multiple R-squared: 0.05886, Adjusted R-squared: 0.05869
## F-statistic: 346.5 on 1 and 5540 DF, p-value: < 2.2e-16
So, in the simple model the “Adjusted R-Squared” value reads as .059, which tells me that this model leaves an awful lot to be desired: it only accounts for about 5.9% of the variation in income.
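As a sanity check, the R-squared here can be rebuilt by hand from the Sum Sq column of the Question 1 ANOVA table, since both outputs describe the same fit (note the matching F of 346.5). A worked version, taking n = 5542 observations (5540 residual degrees of freedom plus the two estimated coefficients) and p = 1 predictor:

$$
R^2 = \frac{SS_{\text{telecommute}}}{SS_{\text{telecommute}} + SS_{\text{residual}}}
    = \frac{1.451\times10^{8}}{1.451\times10^{8} + 2.320\times10^{9}} \approx 0.0589
$$

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}
                 = 1 - (1 - 0.05886)\,\frac{5541}{5540} \approx 0.0587
$$

With this many rows and a single predictor, the adjustment barely moves the number.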
##
## Call:
## lm(formula = weekly_earnings ~ telecommute + age, data = Telework_Master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1389.7 -424.9 -145.9 273.1 2245.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1142.4816 41.2741 27.68 <2e-16 ***
## telecommute -354.3831 18.4604 -19.20 <2e-16 ***
## age 9.3444 0.6107 15.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 633.9 on 5539 degrees of freedom
## Multiple R-squared: 0.09703, Adjusted R-squared: 0.09671
## F-statistic: 297.6 on 2 and 5539 DF, p-value: < 2.2e-16
Ouch, R^2 is 9.7%, which isn’t wonderful. Typically you like to see 70% or more, though that’s a rule of thumb rather than a hard cutoff. It is, however, ever so slightly a step in the right direction.
What this image shows is every data point of Age vs. Income, with two dot colors based on whether someone’s a telecommuter. The two lines going through show the general trendline for each group. But notice that thin grey sheath around both of the lines: that represents - if I understand this correctly - the confidence band, the space in which the true trendline plausibly sits. A wider grey sheath means we’re less sure where the line is (usually because there’s less data in that stretch), which is not quite the same thing as the model fitting badly.
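To make the grey sheath a bit more concrete: assuming the plot was drawn with something like ggplot2's `geom_smooth(method = "lm")`, whose default shading is a 95% confidence interval for the fitted line, the same band can be computed in base R with `predict()`. This is a sketch on simulated data; the `sim` data frame and its numbers are made up for illustration.

```r
# Simulated stand-in data (hypothetical); the real survey data is not used here.
set.seed(1)
sim <- data.frame(age = round(runif(300, 18, 75)))
sim$weekly_earnings <- 550 + 9 * sim$age + rnorm(300, sd = 600)

fit <- lm(weekly_earnings ~ age, data = sim)

# 95% confidence interval for the fitted line at a young, an average, and an old age.
new_ages <- data.frame(age = c(18, round(mean(sim$age)), 75))
band <- predict(fit, newdata = new_ages, interval = "confidence", level = 0.95)

# Width of the band at each age: narrowest near the mean age, wider at the edges.
band_width <- band[, "upr"] - band[, "lwr"]
print(cbind(new_ages, band, width = band_width))
```

The band pinches in the middle and flares at the ends because the line is anchored best where most of the data lives.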
Anyway, let’s continue
Question 3 - Linear Models
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = Telework_Master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
The model I’m attempting to…model is as follows:
Earnings = B0 + (B1 x Hours Worked)
After running our model we get some numbers to plug in:
Earnings = 66.04 + 22.59(Hours Worked)
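Plugging hours into that equation is easy to do by hand, but a tiny helper makes it easier to try a few values. The coefficients are copied from the `lm()` output above; the function name `predict_earnings` is just a label made up for this sketch.

```r
# Coefficients copied from the lm() output above.
b0 <- 66.0433   # intercept
b1 <- 22.5887   # change in weekly earnings per additional hour worked

# Predicted weekly earnings for a given number of hours (hypothetical helper).
predict_earnings <- function(hours_worked) {
  b0 + b1 * hours_worked
}

predict_earnings(c(20, 40, 60))
```

A 40-hour week comes out at about 969.59, and each extra hour adds the same 22.59 no matter where you start, which is exactly what "linear" means here.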
Is this model “Naive”?
I feel as if this model isn’t a particularly well rounded one. The issue that jumps out to me at first, when looking at the numbers - and I’ll be the first to tell you that I’m not well versed in interpreting these outputs - is how large the Standard Error (28.58) is in relation to the Intercept (66.04), roughly 43% of it. That tells us the intercept itself is pinned down pretty loosely. How far off our predictions typically land is a separate number, the Residual Standard Error, and at 613 a week that isn’t great either.
Secondly, this model - again - fails to take into account a myriad of factors that could govern one’s earnings. Location, industry, education to name the most obvious ones.
Lastly, the Adjusted R^2 value (.155) means the model only explains about 15.5% of the variation in earnings. (The .05 threshold I keep leaning on applies to p-values, not to R^2, so it isn’t the right yardstick here.)
How do we improve this, then?
We have a not-so-good, Naive model. One option is to just throw it out, but I’m not so sure. I think we’re on to something here; we just need to add a few more ingredients to the soup, so to speak. I’d like to include Education & occupation group in the model.
Working in different capacities is a key indicator of earning potential. Agriculture doesn’t pay as well as Management, for example. Occupation group contains several categories.
**Occupation Group**
| Value | Role |
|---|---|
| 1 | Management |
| 2 | Professional |
| 3 | Service |
| 4 | Sales |
| 5 | Office/Admin Support |
| 6 | Agriculture |
| 7 | Construction |
| 8 | Maintenance |
| 9 | Production |
| 10 | Transportation |
| 11 | Armed Forces |
So, let’s see what we get.
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + as.factor(occupation_group),
## data = Telework_Master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1443.8 -337.4 -117.4 191.7 2678.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 522.131 33.074 15.787 < 2e-16 ***
## hours_worked 19.596 0.659 29.735 < 2e-16 ***
## as.factor(occupation_group)2 -74.081 24.035 -3.082 0.00206 **
## as.factor(occupation_group)3 -668.592 26.423 -25.304 < 2e-16 ***
## as.factor(occupation_group)4 -433.285 29.950 -14.467 < 2e-16 ***
## as.factor(occupation_group)5 -560.917 26.787 -20.940 < 2e-16 ***
## as.factor(occupation_group)6 -720.456 108.860 -6.618 3.98e-11 ***
## as.factor(occupation_group)7 -392.613 41.443 -9.474 < 2e-16 ***
## as.factor(occupation_group)8 -309.836 42.979 -7.209 6.39e-13 ***
## as.factor(occupation_group)9 -545.635 36.301 -15.031 < 2e-16 ***
## as.factor(occupation_group)10 -530.117 36.047 -14.706 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 557.6 on 5531 degrees of freedom
## Multiple R-squared: 0.3023, Adjusted R-squared: 0.301
## F-statistic: 239.6 on 10 and 5531 DF, p-value: < 2.2e-16
That’s quite the numeric vomit, hey? But, it does seem as if it’s something of an improvement - Adjusted R^2 has grown, though not as high as we’d like. The issue stems from the wild variability within job segments when it comes to pay. It looks as if management, agriculture and maintenance all stay fairly close to the line, but the other industries have huge swings in potential earnings. Especially Agriculture, which makes sense because it likely contains both field workers and the corporate owners of the larger mega farms in the great plains.
And let’s visualize it.
So, this bar-chart (of sorts) illustrates that the Management and Professional fields have the highest regular salaries, but they also have some pretty major variation within them. Meanwhile, Agriculture has a fairly low average salary, but there are one or more outliers who are significantly out-earning their competitors within the field. I have a suspicion that those individuals own larger farms or run farm corporations and may be better classed as Professional/Management, but that falls outside the scope of this data.
Question 4 - Multivariate. Age + Earnings
This model asks whether Age and Earnings are related.
Earnings = B0 + B1(Age)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = Telework_Master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
Running the model we see the following values.
Earnings = 548.94 + 9.19(Age)
The model posits a “starting point” of ~550/week, modified upward by ~9.19 for every year of age. Which does, on its surface, make a degree of sense.
We see that this model passes the T-Test with a p-value less than .05 (rather significantly so), so we reject the null hypothesis that age and pay are unrelated.
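For the curious, the t value in the output is nothing more exotic than the estimate divided by its standard error, and with a single predictor the F-statistic at the bottom is that same t value squared. Using the numbers from the age row above:

$$
t = \frac{\hat{\beta}_{\text{age}}}{SE(\hat{\beta}_{\text{age}})} = \frac{9.1941}{0.6306} \approx 14.58,
\qquad
F = t^2 \approx 212.6
$$

Both match the printed summary, which is a handy check that the table is being read correctly.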
That said, again I feel that this model is naive, with adjusted R^2 suggesting that about 96% of the variation in earnings is not explained by age.
Once again, sector and education are likely to be large factors - potentially larger factors - than simply age alone with regard to someone’s earnings potential.
Let’s see if the thing is even a valid model, shall we?
I know, I know. More statistic blobs, but this simple plot does show a model that the linear model could feasibly handle. It’s possible that there may be a slight drop-off effect as folks who can afford to do so leave the workforce as they close on 70 (notice how many fewer dots there are after ~65 in the upper quadrant?) while those who continue to work past 75 don’t seem to be making much - indicating that financially they may not be able to retire - or they may be working as a form of supplemental income in addition to their retirement savings (and to give them something to do!). Again, this data doesn’t go into why these people may work, but I know plenty of elderly folks who tell me they work because they would be bored to death - (literally?) - if they didn’t have something to do.
For the purposes of “fixing” the model I might be interested in limiting the age-range to a cap of 65 (when folks can start drawing on Social Security) - it would remove a lot of data - something I’m wary to do. But it would limit the model to those in their prime working years.
I am, also, concerned about the neat row of 2884.61’s on the very top of the plot, and I have a suspicion that the survey may simply have reported “2884+” as a single band-box answer, which is now collecting them into a single range. If I were attempting to fix the model, I think I might put an earnings cap at 2800, removing those figures out-earning that value as outlier super earners. I am, again, concerned about removing data points, but there are 182 out of 5542 entries at that earning level, and that’s fairly consistent with economic earnings data for the US generally (they’re roughly the top 3.3%).
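A quick bit of arithmetic backs up the "band-box" hunch. The three inputs below come straight from the paragraph above; the only assumption is a 52-week year when annualising.

```r
top_code_weekly <- 2884.61   # the repeated value along the top of the plot
n_top   <- 182               # rows sitting exactly on that value
n_total <- 5542              # rows in the data set

annualised <- top_code_weekly * 52   # weekly cap expressed per year (52 weeks assumed)
share_top  <- n_top / n_total        # fraction of rows on the cap

round(annualised, 2)
round(100 * share_top, 2)
```

The weekly figure works out to 149,999.72 a year, suspiciously close to a round 150,000 ceiling, and the capped rows make up about 3.28% of the data.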
Lastly, I think this model is naive in so far as it’s utilizing a fairly simplistic understanding of work. People either work or don’t with no grey spot in between. What about the Student who does HelpDesk for Apple in their offtime? Or the Elderly person taking Customer Service calls for the VFW from his home?
What about the Stock Broker who runs his own firm in up-state NY?
This model assumes they’re all the same.
I would be very interested to see this same plot but separated with Full and Part Time facet_wrapped into their own plots, so we could see what the model looked like. Presumably, part-timers skew to the lower earning brackets, with higher incidences at the young and old ranges, while the full-timers are more normally distributed and likely skew to a higher earning bracket. Presumably.
Q5 - Adding it all together.
So, let’s take our model from Q4 and see what we can do with it.
##
## Call:
## lm(formula = weekly_earnings ~ age + education + as.factor(geography_region) +
## as.factor(telecommute), data = Telework2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1536.40 -317.39 -79.37 237.87 2014.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 615.6150 35.5955 17.295 < 2e-16 ***
## age 8.0426 0.5332 15.085 < 2e-16 ***
## educationBA 234.2151 24.1851 9.684 < 2e-16 ***
## educationDoctorate 601.1571 55.6337 10.806 < 2e-16 ***
## educationGED -97.0679 22.0546 -4.401 1.10e-05 ***
## educationMAST 356.6938 30.1685 11.823 < 2e-16 ***
## educationNo GED -262.5567 36.8494 -7.125 1.18e-12 ***
## educationProf_School 546.7998 62.6361 8.730 < 2e-16 ***
## as.factor(geography_region)2 -39.5889 22.3925 -1.768 0.0771 .
## as.factor(geography_region)3 -10.5698 20.3770 -0.519 0.6040
## as.factor(geography_region)4 15.7766 21.4376 0.736 0.4618
## as.factor(telecommute)2 -167.1406 15.5381 -10.757 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 492.8 on 5144 degrees of freedom
## Multiple R-squared: 0.222, Adjusted R-squared: 0.2204
## F-statistic: 133.5 on 11 and 5144 DF, p-value: < 2.2e-16
So, we generated a new model. In this model, I posit that
Earnings = B0 + B1 (Age) + B2(education) + B3(Geographic Region) + B4(Telecommuter)
I chose these variables because I’ve maintained this entire article that Education is a solid predictor of higher wages, that where you work is almost as important as what you do, and that as you increase in earning potential you often gain workplace flexibility.
Which is to say that as you age you get more experience, and are therefore a more compelling hire - leading to more munneh.
Which is also to say that as you get more credentials and training you also increase your hiring desirability, further increasing your Demand (and therefore your pay).
Which is also to say that life expenses in the Midwest are markedly lower than they are in New England.
Which is finally to say that unskilled laborers - stockists at Big Lots or Fry Cooks at McDonalds - don’t have any flexibility in their worklife, while Stockbrokers and Brain Surgeons do.
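We got…quite the mouthful. The “education” coefficients are parenthesized together, and essentially function as 1/0 True/False tests. Associate’s Degree doesn’t show up in the results even though it’s a valid category; the likely reason is that R treats the first level of a factor as the baseline and folds it into the Intercept, so every education coefficient reads as “compared to an Associate’s”. (The same goes for region 1 and telecommute 1.) “Professional” school is to say Dentist, Lawyer, M.D. etc. High level Vocational schools.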
Earnings = 615.61 + 8.04(Age) + (-262.55(No GED) + -97.06(GED) + 234.21(Bachelors) + 356.69(Masters) + 546.79(Professional) + 601.15(Doctorate)) + (-39.58(Midwest) + -10.56(South) + 15.76(West)) + -167.14(Telecommute)
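On the vanishing Associate's row: by default R drops the first level of a factor and uses it as the reference category, and `model.matrix()` shows this directly. This is a toy sketch; the `education` vector is made up, and "Associate" is an assumed spelling for the label that sorts ahead of "BA" in the real data.

```r
# Hypothetical education labels; "Associate" is an assumed spelling for the missing level.
education <- factor(c("Associate", "BA", "GED", "MAST", "BA", "Associate"))

# The first level (alphabetical by default) becomes the reference category.
reference_level <- levels(education)[1]
dummy_cols <- colnames(model.matrix(~ education))
reference_level
dummy_cols   # there is no column for the reference level

# Picking a different baseline, e.g. GED, changes which level is absorbed.
education_ged <- relevel(education, ref = "GED")
dummy_cols_ged <- colnames(model.matrix(~ education_ged))
dummy_cols_ged
```

Nothing is lost by the drop: the reference group's average lives in the intercept, and every other coefficient is a difference from it.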
Collinearity?
I feel like Age and Education may be weakly collinear. Most 15 year olds simply can’t have Doctorates in any serious number, but at the same time there’s no guarantee that a 60 year old should have a Masters - hence the weak collinearity.
As a guess, I would imagine that the collinearity is stronger in the ages between 15 and…say 40. Most people finish their GED before 20. They finish their BA before 25, they finish their post-graduate before 40. However at every step you likely see large population drop-offs, of folks who have their GED but don’t go for a BA, who get a BA but don’t go for a Masters etc etc etc.
I would also posit that it’s unlikely someone in their 50’s is going to pursue a Bachelors if they don’t already have one (certainly it happens, but not in the level that we’d need to model).
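One way to put a number on that hunch, rather than just eyeballing it, is to regress Age on Education and turn the resulting R-squared into a variance inflation factor. The sketch below uses simulated data with an age gap between education levels baked in on purpose, so the figures it prints say nothing about the real survey; a fuller check would regress age on all the other predictors, not education alone.

```r
# Simulated stand-in data (hypothetical): age loosely tied to education level.
set.seed(7)
n <- 500
education <- factor(sample(c("No GED", "GED", "BA", "MAST"), n, replace = TRUE))
base_age <- c("No GED" = 35, "GED" = 38, "BA" = 40, "MAST" = 44)
age <- unname(base_age[as.character(education)]) + rnorm(n, sd = 12)

# How much of the variation in age can education alone explain?
r2_age_on_edu <- summary(lm(age ~ education))$r.squared

# Variance inflation factor: 1 means no collinearity at all;
# values above roughly 5 to 10 are a common rule-of-thumb warning sign.
vif_age <- 1 / (1 - r2_age_on_edu)
c(r_squared = r2_age_on_edu, vif = vif_age)
```

A VIF close to 1 would line up with the "weakly collinear" guess above.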
So, let’s look at a hypothetical telecommuter
## 2.5 % 97.5 %
## (Intercept) 545.832697 685.397270
## age 6.997432 9.087842
## educationBA 186.802092 281.628120
## educationDoctorate 492.091392 710.222727
## educationGED -140.304314 -53.831540
## educationMAST 297.550624 415.836904
## educationNo GED -334.797099 -190.316220
## educationProf_School 424.006481 669.593142
## as.factor(geography_region)2 -83.487752 4.309929
## as.factor(geography_region)3 -50.517297 29.377782
## as.factor(geography_region)4 -26.250216 57.803404
## as.factor(telecommute)2 -197.601848 -136.679252
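This table shows the 95% confidence interval for each coefficient - the range of plausible sizes for each effect, not the values the variables themselves can take (many of those are binary anyway…an observation is from the Midwest or it’s not…they have a Masters or they don’t…etc.). Notice that all three region intervals straddle zero, which lines up with region not being significant in the summary above.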
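Where do those bounds come from? Each one is the estimate plus or minus a t critical value times the standard error. A small sketch for the age row, with the three inputs copied from the Q5 summary output:

```r
# Estimate and standard error for age, copied from the Q5 summary output.
estimate  <- 8.0426
std_error <- 0.5332
resid_df  <- 5144   # residual degrees of freedom from the same output

# 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = resid_df)
ci_age <- c(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)
round(ci_age, 3)
```

The result should agree with the age row of the table (6.997 to 9.088) up to the rounding in the printed estimate.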
But let’s give this thing the Sniff Test!
Here we see the model plotted. It is, admittedly, kind of difficult to make sense of visually; it is without a doubt a busy one. However, the thing we’re mostly concerned with is the Trend Lines. They mostly fit the model, though curiously those who attend Professional School tend to earn less as they age, potentially indicating that they retire earlier or, in the case of Doctors, that their ability to practice degrades as their manual dexterity does.
Does this thing hold water? Let’s try and build an employee!
Let’s plug in a 35 year old man with a Bachelors Degree who lives in the South and who telecommutes
Earnings = 615.61 + (8.04*35) + ((-262.55*0) + (-97.06*0) + (234.21*1) + (356.69*0) + (546.79*0) + (601.15*0)) + ((-39.58*0) + (-10.56*1) + (15.76*0)) + -167.14*1
show(Earnings)
## [1] 953.52
According to the model, this hypothetical college educated, middle aged, telecommuting person makes $953.52 a week. However, the Residual Standard Error of 492.8 suggests that this person’s income could plausibly sit anywhere within about one standard error either side of that:
460.72 <–| 953.52 |–> 1446.32 /week
That’s a fairly wide Range for these possible earnings. However, when we consider that some fields - like Management and Professional - have a generally fairly high salary, and other fields like Armed Forces and Agriculture tend to have very low salaries, and this model doesn’t account for job-field, this variability is to be expected. Were I to go forward and develop a more complete model I would certainly want to consider Job Field to perhaps help narrow some of this variability down.
Additionally, not all Bachelor’s Degrees are created equal. Much to my chagrin - a BA in History is simply worth less than a BS in Computer Science (sorry, Napoleon), and not all parts of the South are the same either. The DC Corridor or Virginia are quite a bit different than rural Arkansas. These factors also help to explain the wild variability contained within the potential salaries that our Hypothetical Southerner may earn.
That said, the values do fall well within a logical “sanity check” value of the annual salary of a college educated American in (presumably) 2000.
Multiply by 52 weeks and you end up with an annual pre-tax take home of $23,957 <–| $49,583 |–> $75,209, which is well within the norm of Average US Salaries.
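The whole "build an employee" exercise can be redone in a few lines, which makes it easier to swap in a different hypothetical person. The coefficients are the rounded ones used in the hand calculation above; the vector names are made up for this sketch. Two assumptions to flag: the telecommute term is the coefficient on level 2 of that factor, and the sketch switches it on to mirror the hand calculation, but whether level 2 actually means "telecommutes" depends on the survey coding, which isn't shown here. The annual figures also assume 52 paid weeks.

```r
# Coefficients as rounded in the hand calculation above (names are illustrative).
coefs <- c(intercept = 615.61, age = 8.04, ba = 234.21,
           south = -10.56, telecommute2 = -167.14)

# 35-year-old with a Bachelor's, living in the South, telecommute(2) dummy switched on.
person <- c(intercept = 1, age = 35, ba = 1, south = 1, telecommute2 = 1)

weekly <- sum(coefs * person)
rse <- 492.8   # residual standard error from the Q5 summary

# Rough band of one residual standard error either side, weekly and annualised.
weekly_range <- c(low = weekly - rse, mid = weekly, high = weekly + rse)
annual_range <- weekly_range * 52   # 52 paid weeks assumed

round(weekly_range, 2)
round(annual_range)
```

One residual standard error either side is only a rough band (it would cover about two thirds of people if the errors were normal); `predict()` with `interval = "prediction"` on the fitted model would give a proper prediction interval.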