Telework Data Analysis

Introduction

The following is an analysis of census data collected on teleworking. The data was obtained from the Census Bureau survey.

Teleworking’s Impact on Income

The following shows the impact that teleworking has on income. A value of 1 represents “Yes”, indicating that the respondent teleworks, while a value of 2 represents “No”, indicating that the respondent does not telework.

Statistical Test

The ANOVA test below shows that there is a significant relationship between income and telecommuting. With a p-value less than .001, the result is highly significant: weekly earnings differ depending on whether or not a respondent teleworks. The Tukey comparison puts the gap at about 351 dollars per week, with non-teleworkers earning less (95% confidence interval of roughly 314 to 388 dollars).

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## telecommute    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals   5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ telecommute, data = telework_OD)
## 
## $telecommute
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0
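
As a rough check on the table above, the F value and the share of variation tied to teleworking can be recomputed from the printed sums of squares. This is only a sketch: it uses the rounded figures from the output, so the results are approximate, and the variable names are my own rather than anything from the original analysis.

```r
# Figures copied from the one-way ANOVA table above (rounded as printed).
ss_telecommute <- 1.451e+08   # Sum Sq for telecommute
ss_residual    <- 2.320e+09   # Sum Sq for residuals
ms_telecommute <- 145093488   # Mean Sq for telecommute
ms_residual    <- 418730      # Mean Sq for residuals

# The F value is the ratio of the two mean squares.
f_value <- ms_telecommute / ms_residual

# Eta squared: share of the total variation in weekly earnings
# that is associated with the telecommute factor.
eta_squared <- ss_telecommute / (ss_telecommute + ss_residual)

print(round(f_value, 1))
print(round(eta_squared, 3))
```

Under these figures, teleworking status is associated with roughly 6% of the variation in weekly earnings, so the very small p-value says the difference is reliable, not that teleworking explains most of what people earn.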

Visualising the Data

The boxplot below also supports this conclusion: the median income of people not teleworking is level with the first quartile (the bottom of the box) for people who telework. Even with the outliers among people not teleworking, there is still strong evidence that people who telework have significantly higher incomes.

Case Against the Evidence

Despite the very clear evidence, there are several reasons why this method may fail to capture the true impact of teleworking on income. For one, factors other than teleworking may influence an individual’s income. These range from geographic region, where incomes may be higher as a result of laws or cost of living, to the number of members in a household. Whatever the case, the model above does not include them. In addition, the model doesn’t represent how much teleworking is used: some individuals may telework for a small portion of their work while others telework primarily.

A model based on teleworking alone is also likely to be naive because certain roles or job types are more likely to use teleworking than others. For example, you’re more likely to see teleworking used in finance than in construction because of the type of work being done. A construction worker needs to be present on the work site, so they presumably could not perform well in that profession while teleworking.

Comparing impact of Teleworking and Occupation on Weekly Earnings

Certain jobs are more capable of being completed at home than others. Because of this, some roles can be home bound while others cannot be. Here I add occupation to the teleworking model to test whether occupation, on its own and in interaction with teleworking, has a significant impact on weekly earnings.

##                                         Df    Sum Sq   Mean Sq F value   Pr(>F)    
## telecommute                              1 1.451e+08 145093488  429.04  < 2e-16 ***
## detailed_occupation_group               21 4.432e+08  21106627   62.41  < 2e-16 ***
## telecommute:detailed_occupation_group   21 1.719e+07    818367    2.42 0.000294 ***
## Residuals                             5498 1.859e+09    338185                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ telecommute
## Model 2: weekly_earnings ~ telecommute * detailed_occupation_group
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   5540 2319766548                                  
## 2   5498 1859341670 42 460424878 32.416 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA test, occupation and teleworking together have a significant impact on weekly earnings, and their interaction is also significant (p = 0.000294). Additionally, each individual factor has a significant impact on weekly earnings. The model comparison shows that adding occupation as a factor significantly improved the overall model.
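
To make the size of that improvement concrete, the comparison can be reworked by hand from the residual sums of squares in the table above. This is a sketch using only the printed figures; the variable names are mine, and the total sum of squares is taken as the first model's residual plus the Sum Sq for telecommute from the one-way table.

```r
# Residual sums of squares and degrees of freedom from the comparison table.
rss_1 <- 2319766548; df_1 <- 5540   # weekly_earnings ~ telecommute
rss_2 <- 1859341670; df_2 <- 5498   # weekly_earnings ~ telecommute * detailed_occupation_group

# F statistic for the nested-model comparison.
f_stat <- ((rss_1 - rss_2) / (df_1 - df_2)) / (rss_2 / df_2)

# Total sum of squares = residual of model 1 + Sum Sq explained by telecommute.
ss_total <- rss_1 + 145093488

# Share of variation explained by each model.
r2_1 <- 1 - rss_1 / ss_total
r2_2 <- 1 - rss_2 / ss_total

print(round(f_stat, 3))
print(round(c(model_1 = r2_1, model_2 = r2_2), 3))
```

On these numbers the share of variation explained rises from about 6% to about 25% once occupation is included, which still leaves most of the variation in weekly earnings unexplained.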

Despite the addition, I would still consider this a naive model, as there are likely other factors affecting weekly earnings. For example, full-time or part-time status is likely to affect both whether someone teleworks and their income.

Visualization of Occupation with Teleworking

The visualization below shows that different occupations quite clearly earn different amounts. Occupations like food prep, cleaning services, and personal care cannot be completed via teleworking. On the flip side, occupations like law, business, and management can be done by teleworking. These occupations are likely part of the reason why earnings differ between people who telework and people who do not.

Weekly Earnings by Hours Worked Regression

The following is a regression analysis of weekly earnings with regards to hours worked. The purpose is to determine what the regression equation would be and visualize the impact of hours worked on weekly earnings.

Regression Analysis

Based on the regression performed below, weekly earnings would have the following equation in relation to hours worked:

weekly earnings = 66.04 + 22.59 (hours worked)

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework_OD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16
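
To see what the fitted line implies, the equation can be evaluated at a few values of hours worked. This is a minimal sketch using the coefficients printed above; the function name and the chosen hours are my own for illustration.

```r
# Coefficients as printed in the regression summary above.
intercept <- 66.0433
per_hour  <- 22.5887

# Predicted weekly earnings for a given number of hours worked.
predict_earnings <- function(hours) intercept + per_hour * hours

hours <- c(0, 20, 40)
print(data.frame(hours = hours, predicted = round(predict_earnings(hours), 2)))
```

Every extra hour adds the same amount regardless of where it falls in the week, and zero hours still yields the intercept.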

This Model is Naive

This model would be considered naive for several reasons. For one, it assumes that by working zero hours, you would earn $66 per week. Theoretically, this could be feasible if the household is receiving welfare money or social security, or if the individual is counting investment income. However, even investments require some work to maintain. In general, weekly earnings require some form of work.

The model is also naive for assuming that every additional hour worked is worth the same $22.59. As I’ve learned from my own working experiences, different jobs and schedules pay at different rates. When working part-time at a golf course, my pay rate sat right around minimum wage. However, when working full-time as an intern for a company, my pay rate went up significantly. So the model wrongly assumes that each hour of work is valued the same.

Another reason this model would be considered naive is the way the hours themselves are paid. This is particularly important when considering overtime or holiday pay. Typically, hours over 40 or during the holidays will pay the base rate plus an additional half of the base rate per hour worked. This type of pay would cause a significant change in weekly earnings by hours worked.
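
The overtime point can be sketched with a simple pay rule. The function below is hypothetical and not part of the original analysis: it assumes time and a half for every hour beyond 40 in a week, and the $20 base rate is made up purely for illustration.

```r
# Hypothetical pay rule: time and a half for hours beyond 40 per week.
weekly_pay <- function(hours, base_rate, overtime_after = 40, multiplier = 1.5) {
  regular  <- pmin(hours, overtime_after)        # hours paid at the base rate
  overtime <- pmax(hours - overtime_after, 0)    # hours paid at the overtime rate
  base_rate * regular + base_rate * multiplier * overtime
}

# Illustrative base rate of $20 per hour (an assumption, not from the data).
print(weekly_pay(c(30, 40, 50), base_rate = 20))
```

Under this rule the ten hours from 30 to 40 add $200, while the ten hours from 40 to 50 add $300, so a single straight line through hours worked cannot describe both stretches.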

New Model Proposal

In this new model, I will be adding full-time/part-time status and age, alongside hours worked and occupation group, to show their impact on weekly earnings. I expect full-time/part-time to show results similar to hours worked, while age should show wages increasing as people get older.

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + full_or_part_time + 
##     age + detailed_occupation_group, data = telework_OD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1328.52  -333.64   -97.27   203.51  2684.44 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  426.0640    46.9233   9.080  < 2e-16 ***
## hours_worked                  14.3835     0.7715  18.644  < 2e-16 ***
## full_or_part_time2          -266.7181    25.2941 -10.545  < 2e-16 ***
## age                            7.4330     0.5255  14.144  < 2e-16 ***
## detailed_occupation_group2   -14.0418    37.4129  -0.375 0.707437    
## detailed_occupation_group3    90.6285    45.0338   2.012 0.044220 *  
## detailed_occupation_group4   103.2604    48.2186   2.142 0.032277 *  
## detailed_occupation_group5  -138.8140    66.1487  -2.099 0.035905 *  
## detailed_occupation_group6  -371.7495    60.5991  -6.135 9.13e-10 ***
## detailed_occupation_group7   247.8487    66.9185   3.704 0.000215 ***
## detailed_occupation_group8  -131.2259    41.2853  -3.179 0.001488 ** 
## detailed_occupation_group9  -159.7264    59.4970  -2.685 0.007283 ** 
## detailed_occupation_group10 -122.4776    35.3752  -3.462 0.000540 ***
## detailed_occupation_group11 -674.5186    47.5637 -14.181  < 2e-16 ***
## detailed_occupation_group12 -332.9938    50.4692  -6.598 4.56e-11 ***
## detailed_occupation_group13 -614.1451    39.5249 -15.538  < 2e-16 ***
## detailed_occupation_group14 -677.9578    48.3717 -14.016  < 2e-16 ***
## detailed_occupation_group15 -673.1004    48.9391 -13.754  < 2e-16 ***
## detailed_occupation_group16 -393.4039    31.2111 -12.605  < 2e-16 ***
## detailed_occupation_group17 -555.3577    28.3138 -19.614  < 2e-16 ***
## detailed_occupation_group18 -601.2618   105.4690  -5.701 1.25e-08 ***
## detailed_occupation_group19 -365.3050    41.5688  -8.788  < 2e-16 ***
## detailed_occupation_group20 -301.0605    42.9177  -7.015 2.58e-12 ***
## detailed_occupation_group21 -538.5115    36.8069 -14.631  < 2e-16 ***
## detailed_occupation_group22 -495.0241    36.6261 -13.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 536 on 5517 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.3542 
## F-statistic: 127.7 on 24 and 5517 DF,  p-value: < 2.2e-16
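
As an illustration of how these coefficients combine, the sketch below predicts weekly earnings for two made-up workers in occupation group 1, the baseline group with no occupation adjustment. The profiles and the function name are hypothetical, and I am assuming that full_or_part_time2 marks part-time workers.

```r
# Coefficients copied from the summary above; occupation group 1 is the baseline.
b <- c(intercept = 426.0640, hours_worked = 14.3835,
       part_time = -266.7181, age = 7.4330)

# part_time is 1 for a part-time worker and 0 for a full-time worker (assumed coding).
predict_weekly <- function(hours, part_time, age) {
  unname(b["intercept"] + b["hours_worked"] * hours +
           b["part_time"] * part_time + b["age"] * age)
}

full_time_40 <- predict_weekly(hours = 40, part_time = 0, age = 40)
part_time_20 <- predict_weekly(hours = 20, part_time = 1, age = 40)

print(round(c(full_time_40 = full_time_40, part_time_20 = part_time_20), 2))
```

For any other occupation group, the matching detailed_occupation_group coefficient from the table would be added to the result.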

Weekly Earnings by Age

The following is a regression analysis of weekly earnings with regards to age.

Regression Analysis

Using the regression model created below, weekly earnings would have the following equation in relation to a unit increase in age:

Weekly Earnings = 548.95 + 9.19(age)

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework_OD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

This Model is Naive

This model is considered naive because age alone very clearly does not determine weekly earnings. As people get older, they advance in their careers and earn more each week. Earnings depend more on an individual’s work ethic and occupation than on their age alone.

Although it is likely you will receive more money as you grow older for the reasons mentioned previously, the intercept implies that a newborn earns about $549 per week, which is not possible. Although parents could gift money to their child, a newborn cannot make weekly earnings. For this reason, this model is particularly naive.

Finally, the model is particularly naive because it is unlikely for someone’s weekly pay to rise by the same $9.19 with every year of age. You are likely to see little to no weekly earnings for people at a young age. From their 20s until their 60s, earnings will likely increase until the individual retires and returns to no weekly earnings. In addition, individuals will see either large pay increases, depending on their occupation, or no pay increase at all.
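
These points can be checked by evaluating the fitted line at a few ages. This is a sketch using the coefficients printed in the summary above; the ages chosen and the 52-week conversion to annual pay are my own assumptions.

```r
# Coefficients as printed in the age regression summary.
intercept <- 548.9457
per_year  <- 9.1941

# Predicted weekly earnings at a few illustrative ages.
ages <- c(0, 20, 40, 65)
predicted <- intercept + per_year * ages
print(data.frame(age = ages, predicted = round(predicted, 2)))

# Annual equivalent of one extra year of age, assuming 52 paid weeks.
annual_step <- per_year * 52
print(round(annual_step, 2))
```

The line assigns earnings to a newborn and keeps rising past retirement age, which is the opposite of the pattern described above.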

Assessing Linearity

The graph below compares a linear model to an estimated line of best fit. Based on the graph alone, the linear model is clearly not a strong fit, as the second-degree polynomial does not touch the linear model at any point. The two may meet at a negative age, but a negative age is not possible.

Concerns with the Model

As age increases, there are increasingly fewer responses. As people get older, more of them die as a result of age and disease, which is why you see so few data points between 65 and 80. Alternatively, the census may not record data from people who are not working. In either case, the weighting of these data points has become uneven, and outliers will influence the line more heavily than they would in, say, the 20-30 year stage. Weighting the data points more evenly to better represent the population would help reduce the effect of outliers.

Another concern with the model is missing data points. Between roughly ages 80 and 85, there appears to be no data collected, while beyond that only a few data points represent each age. Additionally, there are no data points under the age of 15, which I believe is the legal age at which you can begin working. Clearly those individuals have no weekly earnings, but their absence affects how the curve would be shaped.

My final concern is the overall recording of the data. There are quite a few points sitting at the top of the graph, as if there were a cap on the weekly earnings the census could record. Such inaccuracies in recording may distort the fitted line.

Adjusting the Regression Model

In this new model, I am going to use a combination of age, hours worked, full or part time, and hourly/non-hourly to see if I can use those to come close to predicting weekly earnings. I find these to be the most interesting and most probable in combining to find weekly earnings.

Creating the Model

Below are the results of the regression model using the independent variables mentioned above.

## 
## Call:
## lm(formula = weekly_earnings ~ poly(age, 2, raw = TRUE) + poly(hours_worked, 
##     3, raw = TRUE) + full_or_part_time + hourly_non_hourly, data = telework_OD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1644.5  -339.0  -103.4   198.1  2515.6 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          -5.60639   88.31731  -0.063    0.949    
## poly(age, 2, raw = TRUE)1            27.08801    3.29110   8.231 2.30e-16 ***
## poly(age, 2, raw = TRUE)2            -0.23108    0.03694  -6.256 4.25e-10 ***
## poly(hours_worked, 3, raw = TRUE)1  -23.47438    4.55995  -5.148 2.72e-07 ***
## poly(hours_worked, 3, raw = TRUE)2    0.89080    0.12317   7.232 5.40e-13 ***
## poly(hours_worked, 3, raw = TRUE)3   -0.00595    0.00098  -6.071 1.36e-09 ***
## full_or_part_time2                 -258.32292   27.39902  -9.428  < 2e-16 ***
## hourly_non_hourly2                  432.37127   15.46552  27.957  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 543.9 on 5534 degrees of freedom
## Multiple R-squared:  0.3357, Adjusted R-squared:  0.3349 
## F-statistic: 399.6 on 7 and 5534 DF,  p-value: < 2.2e-16

The New Model

The formula, using the coefficients produced, would look like the following:

weekly earnings = -5.61 + 27.09 (age) - .23 (age^2) - 23.47 (hours worked) + .89 (hours worked^2) - .01 (hours worked^3) - 258.32 (part time) + 432.37 (non-hourly)

For part time, insert a 1 if you work part time and a 0 otherwise. For non-hourly, insert a 1 if you are a non-hourly worker and a 0 otherwise.

There is some collinearity between full/part time and hours worked (a correlation of about -0.57 in the matrix below). This is unsurprising, as most people who work full time work around 40 hours or more per week.
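
One consequence of the squared age term is that the age contribution rises and then falls. The sketch below locates that turning point from the coefficients printed in the summary, holding every other variable fixed; the function and variable names are mine.

```r
# Age coefficients as printed in the model summary.
b_age  <- 27.08801
b_age2 <- -0.23108

# Contribution of age to predicted weekly earnings, other variables held fixed.
age_effect <- function(age) b_age * age + b_age2 * age^2

# A downward-opening quadratic peaks where its derivative is zero.
peak_age <- -b_age / (2 * b_age2)
print(round(peak_age, 1))
```

Under these coefficients the age contribution peaks in the late 50s and declines afterward, which fits the idea that earnings grow through a career and fall off around retirement.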

##                            age hours_worked full_or_part_time hourly_non_hourly
## age                1.000000000 -0.003209048        -0.0345702         0.1311379
## hours_worked      -0.003209048  1.000000000        -0.5733351         0.2066567
## full_or_part_time -0.034570197 -0.573335104         1.0000000        -0.2027057
## hourly_non_hourly  0.131137879  0.206656746        -0.2027057         1.0000000
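
To gauge how strong that overlap is, the matrix above can be re-entered and searched for its largest off-diagonal correlation. This is a sketch that copies the printed values by hand; squaring the correlation gives the share of variation the two variables have in common.

```r
vars <- c("age", "hours_worked", "full_or_part_time", "hourly_non_hourly")

# Correlation matrix copied from the output above.
cors <- matrix(c(
   1.000000000, -0.003209048, -0.0345702,  0.1311379,
  -0.003209048,  1.000000000, -0.5733351,  0.2066567,
  -0.034570197, -0.573335104,  1.0000000, -0.2027057,
   0.131137879,  0.206656746, -0.2027057,  1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal and find the strongest pairwise correlation.
off_diag <- cors
diag(off_diag) <- 0
strongest <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
shared_variance <- max(abs(off_diag))^2

print(vars[strongest])
print(round(shared_variance, 3))
```

The strongest pair is hours worked and full/part time, sharing roughly a third of their variation, so the two are related but not redundant.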

For future estimations, the range where the equation is likely to be most reliable is between ages 30 and 50, working around 35-45 hours a week.

Testing the New Model

Using my model, I will calculate the income for an individual I found in the dataset. This individual is 39 years old, works 38 hours per week, works full-time, and is an hourly worker. Their current weekly earnings are 375 dollars. However, the model predicts that they should be around 545.65 dollars per week.

(-5.61) + (27.09 * (39)) - (.23 * (39*39)) - (23.47 * (38)) + (.89 * (38*38)) - (.01 * (38*38*38)) -( 258.32 * (0)) + (432.37 * (0))

## [1] 545.65
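
The calculation above uses coefficients rounded to two decimal places. As a sensitivity check, the sketch below repeats it with the coefficients as printed in the model summary. The function and variable names are mine, and the printed coefficients are themselves rounded, so the second result is approximate.

```r
# Evaluate the polynomial model for one person given a coefficient vector.
predict_weekly <- function(b, age, hours, part_time, non_hourly) {
  b[1] + b[2] * age + b[3] * age^2 +
    b[4] * hours + b[5] * hours^2 + b[6] * hours^3 +
    b[7] * part_time + b[8] * non_hourly
}

# Two-decimal coefficients used in the formula above.
rounded <- c(-5.61, 27.09, -0.23, -23.47, 0.89, -0.01, -258.32, 432.37)

# Coefficients as printed in the regression summary.
printed <- c(-5.60639, 27.08801, -0.23108, -23.47438, 0.89080, -0.00595,
             -258.32292, 432.37127)

# 39 years old, 38 hours per week, full-time (0), hourly (0).
pred_rounded <- predict_weekly(rounded, 39, 38, 0, 0)
pred_printed <- predict_weekly(printed, 39, 38, 0, 0)

print(round(c(rounded = pred_rounded, printed = pred_printed), 2))
```

The two-decimal coefficients reproduce 545.65, while the coefficients as printed give roughly 767. The gap comes almost entirely from rounding the cubic hours term from -0.00595 to -0.01, which is multiplied by 38^3 = 54872. Keeping more decimal places on that term, or calling predict() on the fitted model, would avoid the problem.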

Matt

11/5/2019
