This report analyzes census data on teleworking, obtained from a Census Bureau survey.
This section examines the relationship between teleworking and income. In the telecommute variable, 1 represents “Yes”, meaning the respondent teleworks, while 2 represents “No”, meaning the respondent does not telework.
The ANOVA test below shows a significant relationship between income and telecommuting. With a p-value below .001 (R reports it as less than 2e-16), weekly earnings differ significantly depending on whether you telework or not. The Tukey comparison puts the gap at about $351 per week, with non-teleworkers earning less (95% confidence interval: $314 to $388).
##               Df    Sum Sq   Mean Sq F value Pr(>F)
## telecommute    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals   5540 2.320e+09    418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ telecommute, data = telework_OD)
##
## $telecommute
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0
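As a rough check on the ANOVA table above, the F value and its p-value can be recomputed from the printed mean squares. This is a minimal sketch in base R that uses only numbers copied from the output; it does not re-run the test on telework_OD, and the variable names are my own.

```r
# Mean squares and degrees of freedom copied from the ANOVA table above
ms_telecommute <- 145093488   # Mean Sq for telecommute
ms_residual    <- 418730      # Mean Sq for Residuals
df_telecommute <- 1
df_residual    <- 5540

# F is the ratio of the between-group mean square to the residual mean square
f_value <- ms_telecommute / ms_residual

# The upper tail of the F distribution gives the p-value
p_value <- pf(f_value, df_telecommute, df_residual, lower.tail = FALSE)

round(f_value, 1)   # 346.5, matching the table
p_value < 2e-16     # TRUE, consistent with the "<2e-16" in the table
```

With only two groups, this F test is equivalent to a two-sample t-test with pooled variance, which is why the Tukey comparison reports a single difference.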
The boxplot below supports this conclusion: the median income of people who do not telework is roughly level with the first quartile (the bottom of the box) for people who do. Even allowing for the outliers among people not teleworking, the plot gives strong evidence that people who telework have higher incomes.
Despite this clear evidence, there are several reasons why this method fails to capture the true impact of teleworking on income. For one, factors other than teleworking may influence an individual's income, such as geographic region (where incomes may be higher as a result of laws or cost of living) or the number of members in a household. The model above includes none of these. In addition, the model doesn’t represent how much teleworking is used: some individuals telework for a small portion of their work, while others telework most of the time.
The model is also likely to be naive because certain roles or job types are more likely to use teleworking than others. For example, you’re more likely to see teleworking in finance than in construction because of the type of work being done. A construction worker needs to be present on the work site and presumably could not do the job well by teleworking.
Certain jobs are more easily completed at home than others, so some roles can be home-based while others cannot. Below, I add occupation to the model, along with its interaction with teleworking, to test whether occupation has a significant impact on weekly earnings alongside teleworking.
##                                         Df    Sum Sq   Mean Sq F value   Pr(>F)
## telecommute                              1 1.451e+08 145093488  429.04  < 2e-16 ***
## detailed_occupation_group               21 4.432e+08  21106627   62.41  < 2e-16 ***
## telecommute:detailed_occupation_group   21 1.719e+07    818367    2.42 0.000294 ***
## Residuals                             5498 1.859e+09    338185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ telecommute
## Model 2: weekly_earnings ~ telecommute * detailed_occupation_group
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   5540 2319766548
## 2   5498 1859341670 42 460424878 32.416 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA test, the interaction between occupation and teleworking has a significant effect on weekly earnings (p = 0.000294). Each factor is also significant on its own (p < 2e-16). The model comparison confirms that adding occupation significantly improves the overall model (F = 32.416, p < 2.2e-16).
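To make the size of the improvement concrete, the model comparison F value can be recomputed from the two residual sums of squares, along with the share of variance each model explains. This is a sketch using only numbers from the output above; the R-squared values are my own calculation and do not appear in the original output.

```r
# Residual sums of squares and residual df from the model comparison table
rss_1 <- 2319766548; df_1 <- 5540   # weekly_earnings ~ telecommute
rss_2 <- 1859341670; df_2 <- 5498   # weekly_earnings ~ telecommute * detailed_occupation_group

# Partial F test: drop in RSS per added parameter, divided by the
# residual variance of the larger model
f_value <- ((rss_1 - rss_2) / (df_1 - df_2)) / (rss_2 / df_2)

# Total sum of squares = telecommute Sum Sq + residual Sum Sq of Model 1
tss  <- 145093488 + rss_1
r2_1 <- 1 - rss_1 / tss
r2_2 <- 1 - rss_2 / tss

round(f_value, 3)                             # 32.416, matching the table
round(c(model_1 = r2_1, model_2 = r2_2), 3)   # about 0.059 and 0.246
```

By this calculation, teleworking alone accounts for roughly 6% of the variation in weekly earnings, while teleworking and occupation together account for roughly 25%.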
Despite the addition, I would still consider this a naive model, as other factors likely affect weekly earnings. For example, full-time or part-time status is likely to affect both whether someone teleworks and their income.
The visualization below shows that earnings clearly differ across occupations. Occupations like food prep, cleaning services, and personal care cannot be done by teleworking, while occupations like law, business, and management can. These occupational differences are likely part of the reason earnings differ between people who telework and people who do not.
The following is a regression analysis of weekly earnings on hours worked. The purpose is to estimate the regression equation and visualize the impact of hours worked on weekly earnings.
Based on the regression performed below, weekly earnings has the following equation in relation to hours worked:
weekly earnings = 66.04 + 22.59 (hours worked)
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework_OD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
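The fitted line can be evaluated directly at a few values of hours worked. This is a minimal sketch using the coefficients printed above; the function name predict_earnings is my own and not part of the original analysis.

```r
# Coefficients copied from the lm() output above
intercept <- 66.0433
slope     <- 22.5887

# Predicted weekly earnings for a given number of hours worked
predict_earnings <- function(hours) intercept + slope * hours

hours <- c(0, 20, 40)
data.frame(hours = hours, predicted = round(predict_earnings(hours), 2))
# predicted: 66.04, 517.82, 969.59
```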
This model would be considered naive for several reasons. For one, it assumes that someone working zero hours would earn about $66 per week. In theory, this could happen if the household receives welfare or Social Security payments, or if the individual counts investment income, although managing investments also requires some work. In general, though, weekly earnings require some form of work.
The model is also naive in assuming that every additional hour worked adds the same $22.59. As I’ve learned from my own working experience, pay rates differ between jobs and schedules. When I worked part-time at a golf course, my pay rate sat right around minimum wage. When I worked full-time as an intern for a company, my pay rate went up significantly. So the model wrongly assumes that each hour of work is valued the same.
Another reason this model would be considered naive is its treatment of the hours themselves, particularly overtime and holiday pay. Typically, hours over 40 per week or worked during holidays are paid at the base rate plus an additional half of the base rate per hour (time and a half). This type of pay changes how weekly earnings grow with hours worked.
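To illustrate the overtime point, here is a minimal sketch of time-and-a-half pay. The base rate of $20 per hour and the function name weekly_pay are hypothetical, chosen only for illustration and not taken from the data; holiday pay is ignored.

```r
# Hypothetical base rate, for illustration only (not taken from the data)
base_rate <- 20

# Weekly pay with time and a half for hours beyond 40
weekly_pay <- function(hours, rate = base_rate) {
  regular  <- pmin(hours, 40)       # hours paid at the base rate
  overtime <- pmax(hours - 40, 0)   # hours paid at 1.5 times the base rate
  rate * regular + 1.5 * rate * overtime
}

weekly_pay(c(30, 40, 45, 50))      # 600 800 950 1100
diff(weekly_pay(c(39, 40, 41)))    # 20 30: the 41st hour pays more than the 40th
```

Under this pay scheme the earnings line bends upward at 40 hours, which a single straight line cannot capture.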
In this new model, I add full-time/part-time status and age, alongside hours worked and occupation group, to show their impact on weekly earnings. I expect full-time/part-time status to show results similar to hours worked, and age to show wages increasing as people get older.
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + full_or_part_time +
## age + detailed_occupation_group, data = telework_OD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1328.52 -333.64 -97.27 203.51 2684.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 426.0640 46.9233 9.080 < 2e-16 ***
## hours_worked 14.3835 0.7715 18.644 < 2e-16 ***
## full_or_part_time2 -266.7181 25.2941 -10.545 < 2e-16 ***
## age 7.4330 0.5255 14.144 < 2e-16 ***
## detailed_occupation_group2 -14.0418 37.4129 -0.375 0.707437
## detailed_occupation_group3 90.6285 45.0338 2.012 0.044220 *
## detailed_occupation_group4 103.2604 48.2186 2.142 0.032277 *
## detailed_occupation_group5 -138.8140 66.1487 -2.099 0.035905 *
## detailed_occupation_group6 -371.7495 60.5991 -6.135 9.13e-10 ***
## detailed_occupation_group7 247.8487 66.9185 3.704 0.000215 ***
## detailed_occupation_group8 -131.2259 41.2853 -3.179 0.001488 **
## detailed_occupation_group9 -159.7264 59.4970 -2.685 0.007283 **
## detailed_occupation_group10 -122.4776 35.3752 -3.462 0.000540 ***
## detailed_occupation_group11 -674.5186 47.5637 -14.181 < 2e-16 ***
## detailed_occupation_group12 -332.9938 50.4692 -6.598 4.56e-11 ***
## detailed_occupation_group13 -614.1451 39.5249 -15.538 < 2e-16 ***
## detailed_occupation_group14 -677.9578 48.3717 -14.016 < 2e-16 ***
## detailed_occupation_group15 -673.1004 48.9391 -13.754 < 2e-16 ***
## detailed_occupation_group16 -393.4039 31.2111 -12.605 < 2e-16 ***
## detailed_occupation_group17 -555.3577 28.3138 -19.614 < 2e-16 ***
## detailed_occupation_group18 -601.2618 105.4690 -5.701 1.25e-08 ***
## detailed_occupation_group19 -365.3050 41.5688 -8.788 < 2e-16 ***
## detailed_occupation_group20 -301.0605 42.9177 -7.015 2.58e-12 ***
## detailed_occupation_group21 -538.5115 36.8069 -14.631 < 2e-16 ***
## detailed_occupation_group22 -495.0241 36.6261 -13.516 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 536 on 5517 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.3542
## F-statistic: 127.7 on 24 and 5517 DF, p-value: < 2.2e-16
The following is a regression analysis of weekly earnings on age.
Using the regression model below, weekly earnings has the following equation, where the slope is the change for a one-year increase in age:
weekly earnings = 548.95 + 9.19 (age)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework_OD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
This model is naive because age alone clearly does not determine weekly earnings; it explains under 4% of the variation (R-squared of 0.037). As people get older, they advance in their careers and earn more, so it is more an individual's work ethic and occupation than their age alone that drives earnings.
Although you are likely to earn more as you grow older, for the reasons mentioned previously, the intercept implies weekly earnings of about $549 at age zero. Parents could gift money to their child, but a newborn cannot earn a weekly wage. For this reason, this model is particularly naive.
Finally, the model is naive because it is unlikely that someone's weekly earnings rise by the same $9.19 with every year of age. People at a young age are likely to have little or no weekly earnings. From their 20s until their 60s, earnings will likely increase, until the individual retires and returns to no weekly earnings. In addition, depending on their occupation, individuals tend to see either large pay increases or no pay increase at all.
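The points above can be seen by evaluating the age-only equation at a few ages. This is a small sketch using the printed coefficients; the function name predict_by_age and the chosen ages are my own.

```r
# Coefficients copied from the age-only lm() output
intercept <- 548.9457
slope     <- 9.1941

# Predicted weekly earnings at a given age
predict_by_age <- function(age) intercept + slope * age

ages <- c(0, 16, 40, 65, 80)
round(predict_by_age(ages), 2)   # 548.95 696.05 916.71 1146.56 1284.47

# The year-over-year change is the same constant at every age
diff(predict_by_age(c(20, 21)))
diff(predict_by_age(c(70, 71)))
```

The line assigns earnings to a newborn and keeps rising past retirement age, which is exactly the behavior described above.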
The graph below compares the linear model to an estimated curve of best fit, a second-degree polynomial. Based on the graph alone, the linear model is clearly not a strong fit, as the polynomial curve does not touch the straight line anywhere on the plot. The two may meet at a negative age, but a negative age is not possible.
At older ages, there are increasingly fewer responses. As people get older, more of them die as a result of age and disease, which is why you see so few data points between 65 and 80. Alternatively, the census may not record earnings for people who are not working. In either case, the data points are unevenly distributed across ages, so outliers among older respondents influence the line more heavily than outliers in, say, the 20-30 age range. Weighting the data points more evenly to better represent the population would help reduce the effect of outliers.
Another concern with the model is missing data points. Between roughly ages 80 and 85 there appears to be no data collected, and beyond that only a few data points represent each age. Additionally, there are no data points under the age of 15, which I believe is the legal age at which you can begin working. These ages clearly have no weekly earnings, but their absence affects the shape of the fitted curve.
My final concern is how the data was recorded. Quite a few points sit along the top of the graph, as if the census capped the weekly earnings that could be reported at a maximum value. Such inaccuracies in recording may distort the fitted line.
In this new model, I am going to use a combination of age, hours worked, full or part time, and hourly/non-hourly to see how close I can come to predicting weekly earnings. I find these the most interesting variables and the most likely, in combination, to predict weekly earnings.
Below are the results of the regression model using the independent variables mentioned above, with polynomial terms for age and hours worked.
##
## Call:
## lm(formula = weekly_earnings ~ poly(age, 2, raw = TRUE) + poly(hours_worked,
## 3, raw = TRUE) + full_or_part_time + hourly_non_hourly, data = telework_OD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1644.5 -339.0 -103.4 198.1 2515.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.60639 88.31731 -0.063 0.949
## poly(age, 2, raw = TRUE)1 27.08801 3.29110 8.231 2.30e-16 ***
## poly(age, 2, raw = TRUE)2 -0.23108 0.03694 -6.256 4.25e-10 ***
## poly(hours_worked, 3, raw = TRUE)1 -23.47438 4.55995 -5.148 2.72e-07 ***
## poly(hours_worked, 3, raw = TRUE)2 0.89080 0.12317 7.232 5.40e-13 ***
## poly(hours_worked, 3, raw = TRUE)3 -0.00595 0.00098 -6.071 1.36e-09 ***
## full_or_part_time2 -258.32292 27.39902 -9.428 < 2e-16 ***
## hourly_non_hourly2 432.37127 15.46552 27.957 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 543.9 on 5534 degrees of freedom
## Multiple R-squared: 0.3357, Adjusted R-squared: 0.3349
## F-statistic: 399.6 on 7 and 5534 DF, p-value: < 2.2e-16
The formula, using the coefficients produced, would look like the following:
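weekly earnings = -5.61 + 27.09 (age) - 0.23 (age^2) - 23.47 (hours worked) + 0.89 (hours worked^2) - 0.01 (hours worked^3) - 258.32 (part time) + 432.37 (non-hourly)
For part time, insert 1 if you work part time and 0 otherwise. For non-hourly, insert 1 if you are a non-hourly worker and 0 otherwise.
Note that the coefficient on hours worked^3 is rounded here from -0.00595 in the output. Because hours worked cubed is a large number, this rounding noticeably changes the predictions.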
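One consequence of the negative coefficient on age squared is that the age part of the equation rises, peaks, and then declines. Holding the other variables fixed, the peak falls where the derivative of b1 (age) + b2 (age^2) is zero, at age = -b1 / (2 * b2). The sketch below uses the coefficients at the precision printed in the output; the names b_age, b_age2, and age_effect are my own.

```r
# Age coefficients copied from the regression output (raw polynomial terms)
b_age  <- 27.08801
b_age2 <- -0.23108

# Contribution of age to predicted weekly earnings, other variables held fixed
age_effect <- function(age) b_age * age + b_age2 * age^2

# The parabola opens downward (b_age2 < 0), so its vertex is a maximum
peak_age <- -b_age / (2 * b_age2)
round(peak_age, 1)   # 58.6
```

A peak in the late 50s is in line with the expectation that earnings rise from the 20s until around retirement, which the straight-line age model could not represent.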
There is some collinearity between full/part time and hours worked (a correlation of about -0.57). This is unsurprising, as most people who work full time work around 40 hours or more per week.
##                            age hours_worked full_or_part_time hourly_non_hourly
## age                1.000000000 -0.003209048        -0.0345702         0.1311379
## hours_worked      -0.003209048  1.000000000        -0.5733351         0.2066567
## full_or_part_time -0.034570197 -0.573335104         1.0000000        -0.2027057
## hourly_non_hourly  0.131137879  0.206656746        -0.2027057         1.0000000
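One way to gauge how much this overlap matters is the variance inflation factor (VIF), which can be read off the diagonal of the inverse of the predictors' correlation matrix. The sketch below re-enters the matrix printed above. It is only an approximation of the fitted model, since it treats each variable as a single linear term, while the model uses polynomial terms for age and hours worked.

```r
# Correlation matrix copied from the output above
vars <- c("age", "hours_worked", "full_or_part_time", "hourly_non_hourly")
cor_mat <- matrix(
  c( 1.000000000, -0.003209048, -0.0345702,  0.1311379,
    -0.003209048,  1.000000000, -0.5733351,  0.2066567,
    -0.034570197, -0.573335104,  1.0000000, -0.2027057,
     0.131137879,  0.206656746, -0.2027057,  1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Share of variance that hours worked and full/part time have in common
r_squared <- cor_mat["hours_worked", "full_or_part_time"]^2
round(r_squared, 3)   # 0.329

# VIF for each predictor: diagonal of the inverse correlation matrix
vif <- diag(solve(cor_mat))
round(vif, 2)
```

A VIF of 1 means a predictor shares nothing with the others. Here all four values stay below 2, well under common rule-of-thumb cutoffs such as 5 or 10, so the collinearity looks mild under these assumptions.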
For future estimates, the equation is likely to be most reliable for people between ages 30 and 50 who work around 35-45 hours a week.
Using my model, I will calculate the income for an individual I found in the dataset. This individual is 39 years old, works 38 hours per week, works full-time, and is an hourly worker. Their actual weekly earnings are 375 dollars. However, the model, using the rounded coefficients, predicts about 545.65 dollars per week.
(-5.61) + (27.09 * 39) - (0.23 * 39^2) - (23.47 * 38) + (0.89 * 38^2) - (0.01 * 38^3) - (258.32 * 0) + (432.37 * 0)
## [1] 545.65
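The hand calculation above uses coefficients rounded to two decimals. As a sensitivity check, the sketch below repeats it with the coefficients at the precision printed in the regression output. The vector names are my own, and the printed coefficients are themselves rounded, so the second figure is also approximate.

```r
# Coefficients at the precision printed in the regression output
full <- c(intercept = -5.60639, age = 27.08801, age2 = -0.23108,
          hours = -23.47438, hours2 = 0.89080, hours3 = -0.00595,
          part_time = -258.32292, non_hourly = 432.37127)

# The same coefficients rounded to two decimals, as in the hand calculation
rounded <- round(full, 2)

# Predictor values: 39 years old, 38 hours, full-time (0), hourly (0)
x <- c(1, 39, 39^2, 38, 38^2, 38^3, 0, 0)

pred_rounded <- sum(rounded * x)
pred_full    <- sum(full * x)

round(pred_rounded, 2)   # 545.65, matching the hand calculation
round(pred_full, 2)      # about 767.15
```

Almost all of the gap of roughly $220 comes from rounding the hours worked^3 coefficient from -0.00595 to -0.01, since that coefficient is multiplied by 38^3 = 54872. In practice, calling predict() on the fitted model object would avoid the rounding altogether.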