Introduction

This evaluation uses data from a Census Bureau survey about internet and technology use.

Data source: http://asayanalytics.com/telework_csv

1) Simple One-Way ANOVA

a. Effect of teleworking on income

Question: Does Teleworking appear to have a significant effect on income? How do you know? Explain the effect (if it exists)

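As shown in the model output below, there is a statistically significant relationship between weekly earnings and whether one telecommutes. The p-value for telecommute2 is 4.572e-75, far below 0.001, so the result is highly significant. Thus, teleworking appears to have a significant effect on income, although with \(R^2\) = 0.05886 it explains only about 6% of the variance in weekly earnings.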

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 1183 15.69 75.43 0
telecommute2 -350.8 18.84 -18.61 4.572e-75
Fitting linear model: weekly_earnings ~ telecommute
Observations = 5542, Residual Std. Error = 647.1, \(R^2\) = 0.05886, Adjusted \(R^2\) = 0.05869

b. Visualization

The box plot below illustrates the difference: according to the model above, the mean weekly earnings of telecommuters are $350.80 higher than those of non-telecommuters (the telecommute2 coefficient is -350.8). Note that the one-way ANOVA model is a naive model that only compares group means. It does not incorporate other variables that might also have an effect, such as the number of hours worked and whether or not the employee is paid hourly.
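As a quick arithmetic check on the coefficient table in part a, the sketch below recovers the two group means and the t value from the reported estimates. It assumes, following the reading in this report, that the intercept is the mean for telecommuters (the reference level) and that telecommute2 denotes non-telecommuters; the variable names are illustrative.

```python
# Values copied from the weekly_earnings ~ telecommute output.
intercept = 1183.0       # mean weekly earnings of the reference level
telecommute2 = -350.8    # shift for level 2 (assumed: non-telecommuters)
std_error = 18.84        # standard error of the telecommute2 estimate

mean_reference = intercept
mean_level2 = intercept + telecommute2

# The t value is the estimate divided by its standard error.
t_value = telecommute2 / std_error

print(f"reference level mean: {mean_reference:.1f}")
print(f"level 2 mean:         {mean_level2:.1f}")
print(f"t value:              {t_value:.2f}")
```

The t value computed this way can differ in the last digit from the reported -18.61 because the table entries are themselves rounded.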

2) Simple Regression Model: Earnings by Hours Worked

Instruction: Build a simple regression model estimating weekly earnings by hours worked.

a. Regression model using Beta Notation

The generalized form of the regression using beta notation is weekly_earnings = b_0 + b_1(hours_worked).

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.04 28.58 2.311 0.02086
hours_worked 22.59 0.7072 31.94 1.169e-205
Fitting linear model: weekly_earnings ~ hours_worked
Observations = 5542, Residual Std. Error = 613, \(R^2\) = 0.1555, Adjusted \(R^2\) = 0.1554

b. Estimated form of the regression based on results

Based on the results from part A, the estimated form of the regression model is weekly_earnings = 66.04 + 22.59(hours_worked)
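To make the estimated equation concrete, here is a minimal sketch that evaluates it for two hypothetical work weeks. The function name and the example inputs (20 and 40 hours) are illustrative and are not taken from the survey data.

```python
def predict_from_hours(hours_worked: float) -> float:
    """Estimated weekly earnings from weekly_earnings = 66.04 + 22.59(hours_worked)."""
    return 66.04 + 22.59 * hours_worked


# Hypothetical inputs: a 20-hour week and a 40-hour week.
for hours in (20, 40):
    print(hours, round(predict_from_hours(hours), 2))
```

Each additional hour worked adds an estimated $22.59 to weekly earnings under this model.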

c. Naive Explanations

Provide at least 3 explanations for why we would consider this a naïve model.

  1. Only about 15.5% of the variance in weekly earnings is explained by the number of hours worked (\(R^2\) = 0.1555). The remaining 85% or so is unexplained.
  2. It is a simple model, involving only two variables (one dependent and one independent). It does not explain the underlying causal relationships that produce weekly earnings.
  3. The model does not take into account other independent variables that may also have an effect on employees' weekly earnings (e.g. hourly or non-hourly pay, and full-time or part-time status).

d. Model Improvement Proposal

Based on the scatter plot shown below, there is no clear linear relationship between weekly earnings and the number of hours worked. To improve this model, I would add other variables such as full-time/part-time status and hourly/non-hourly pay. If employees are not paid hourly, it makes little sense to model earnings as a function of hours worked: their salary is already fixed, so hours worked would not explain it.

Thus, my first suggestion is to account for pay type, either by excluding employees who are not paid hourly or by adding pay type (hourly vs. non-hourly) as another independent variable alongside hours worked. The second recommendation is to look at other variables that may also have an effect on the model, specifically full-time/part-time status.

e1. Model Improvement Execution - 1st Recommendation

First recommendation: account for whether employees are paid hourly when explaining the relationship between weekly earnings and hours worked. Rather than dropping non-hourly employees, the model below keeps all 5,542 observations and adds the hourly_non_hourly variable as a factor.

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.06 26.31 0.8764 0.3809
hours_worked 18.21 0.6645 27.41 3.215e-155
hourly_non_hourly2 498.5 15.65 31.86 1.082e-204
Fitting linear model: weekly_earnings ~ hours_worked + hourly_non_hourly
Observations = 5542, Residual Std. Error = 563.5, \(R^2\) = 0.2863, Adjusted \(R^2\) = 0.2861

As shown in the table above, both independent variables (hours_worked and hourly_non_hourly) appear to have a significant effect on weekly earnings, since their p-values are far below 0.05.
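The fitted two-variable model can be evaluated the same way as the simple one. The sketch below assumes that level 2 of hourly_non_hourly means "not paid hourly" (the output does not label the levels), and the function name and the 40-hour example are illustrative.

```python
def predict_with_pay_type(hours_worked: float, non_hourly: bool) -> float:
    """Estimate from weekly_earnings ~ hours_worked + hourly_non_hourly.

    non_hourly=True is ASSUMED to correspond to level 2 of hourly_non_hourly.
    """
    indicator = 1 if non_hourly else 0
    return 23.06 + 18.21 * hours_worked + 498.5 * indicator


# Hypothetical 40-hour week under each pay type.
print(round(predict_with_pay_type(40, non_hourly=False), 2))
print(round(predict_with_pay_type(40, non_hourly=True), 2))
```

At the same number of hours, the two estimates differ by exactly the hourly_non_hourly2 coefficient, 498.5.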

e2. Model Improvement Execution - 2nd Recommendation

Second recommendation: look at other variables that may also have an effect on the model, such as the full-time/part-time position, education, sex, citizenship, age, geographic region, occupation group, and industry.

After exploring the data, I found two other variables that also have a significant effect on weekly earnings: 1) detailed_occupation_group and 2) education, especially a Bachelor's degree and above (i.e. education levels 43 and up). Adding either variable to hours_worked raised \(R^2\) from about 15.6% to about 32%.

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 568.2 35.58 15.97 3.716e-56
hours_worked 18.6 0.6629 28.06 4.479e-162
detailed_occupation_group2 -13.7 38.45 -0.3563 0.7217
detailed_occupation_group3 74.54 46.22 1.613 0.1068
detailed_occupation_group4 94.24 49.55 1.902 0.05723
detailed_occupation_group5 -154.7 67.95 -2.277 0.02283
detailed_occupation_group6 -362 62.3 -5.811 6.561e-09
detailed_occupation_group7 245.2 68.81 3.563 0.0003697
detailed_occupation_group8 -139.9 42.43 -3.298 0.0009804
detailed_occupation_group9 -195.3 61.14 -3.194 0.001411
detailed_occupation_group10 -149.4 36.34 -4.11 4.008e-05
detailed_occupation_group11 -735.2 48.78 -15.07 2.487e-50
detailed_occupation_group12 -358.8 51.87 -6.917 5.12e-12
detailed_occupation_group13 -746.2 39.92 -18.69 1.193e-75
detailed_occupation_group14 -717.1 49.68 -14.43 2.204e-46
detailed_occupation_group15 -750.2 50.12 -14.97 1.096e-49
detailed_occupation_group16 -442.7 31.96 -13.85 6.572e-43
detailed_occupation_group17 -570.5 29.1 -19.61 8.15e-83
detailed_occupation_group18 -719.7 108.2 -6.65 3.218e-11
detailed_occupation_group19 -397.6 42.66 -9.321 1.626e-20
detailed_occupation_group20 -313.6 44.11 -7.111 1.303e-12
detailed_occupation_group21 -551.7 37.81 -14.59 2.479e-47
detailed_occupation_group22 -536.3 37.58 -14.27 2.064e-45
Fitting linear model: weekly_earnings ~ hours_worked + detailed_occupation_group
Observations = 5542, Residual Std. Error = 551.1, \(R^2\) = 0.32, Adjusted \(R^2\) = 0.3173

  Estimate Std. Error t value Pr(>|t|)
(Intercept) -140.1 226.5 -0.6187 0.5361
hours_worked 20.67 0.6407 32.26 2.504e-209
education32 -276.7 334 -0.8285 0.4074
education33 -143 250.8 -0.5703 0.5685
education34 -124.7 258.3 -0.4829 0.6292
education35 -151.8 244.8 -0.62 0.5353
education36 -47.71 237 -0.2013 0.8404
education37 -73.85 233.5 -0.3163 0.7518
education38 -125.5 242.8 -0.517 0.6052
education39 58.78 225.7 0.2605 0.7945
education40 95.83 225.8 0.4244 0.6713
education41 158.1 227.5 0.6947 0.4873
education42 225.2 227 0.9921 0.3212
education43 510.1 225.7 2.26 0.02385
education44 696.8 226.4 3.077 0.002098
education45 995.3 232.1 4.289 1.829e-05
education46 874.2 231.4 3.778 0.0001599
Fitting linear model: weekly_earnings ~ hours_worked + education
Observations = 5542, Residual Std. Error = 551.6, \(R^2\) = 0.318, Adjusted \(R^2\) = 0.316
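To compare the models in this section side by side, the short sketch below lists each reported \(R^2\) and its gain over the hours-only baseline. The values are copied from the outputs above; the dictionary layout is just one convenient way to organize them.

```python
# R^2 values copied from the model summaries in section 2.
r_squared = {
    "hours_worked": 0.1555,
    "hours_worked + hourly_non_hourly": 0.2863,
    "hours_worked + detailed_occupation_group": 0.32,
    "hours_worked + education": 0.318,
}

baseline = r_squared["hours_worked"]

# Gain in explained variance relative to the hours-only model.
gains = {model: value - baseline for model, value in r_squared.items()}

for model, gain in gains.items():
    print(f"{model}: R^2 = {r_squared[model]:.4f}, gain = {gain:+.4f}")
```

Occupation group and education each add roughly 16 percentage points of explained variance, somewhat more than pay type does.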

3) Simple Regression Model: Earnings as a Function of Age

Instruction: Build a simple regression model estimating weekly earnings as a function of age.

a. Regression model using Beta Notation

The generalized form of the regression model using beta notation is weekly_earnings = b_0 + b_1(age)

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 548.9 28.24 19.44 1.681e-81
age 9.194 0.6306 14.58 2.787e-47
Fitting linear model: weekly_earnings ~ age
Observations = 5542, Residual Std. Error = 654.6, \(R^2\) = 0.03696, Adjusted \(R^2\) = 0.03678

b. Estimated form of the regression based on results

Based on the results from part A, the estimated form of the regression model is weekly_earnings = 548.9457 + 9.1941(age)
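As an illustration of what this equation implies, the sketch below evaluates it at two hypothetical ages. The function name and the ages chosen (30 and 60) are illustrative only.

```python
def predict_from_age(age: float) -> float:
    """Estimated weekly earnings from weekly_earnings = 548.9457 + 9.1941(age)."""
    return 548.9457 + 9.1941 * age


# Hypothetical ages, 30 years apart.
estimate_30 = predict_from_age(30)
estimate_60 = predict_from_age(60)

print(round(estimate_30, 2), round(estimate_60, 2))
print(round(estimate_60 - estimate_30, 2))  # 30 years times the age slope
```

The model forces every additional year of age to add the same $9.19, which is the linearity assumption examined in part d.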

c. Naive Explanations

Provide at least 3 explanations for why this is considered a naïve model.

  1. The model uses only age to predict weekly earnings, which is not how earnings are determined in reality.
  2. Age explains only about 3.7% of the variance in weekly earnings (\(R^2\) = 0.03696).
  3. The model ignores every variable other than age that would also influence weekly earnings.

d. Linearity Assumption Test

Based on the plot below, the linear model is clearly not a strong fit: the fitted polynomial curve does not even touch the blue line of the linear model.

e. Possible Concerns

Three possible concerns regarding this linear model beyond those inherent in the naïve design:

  1. Outliers: My first concern is the outliers in the weekly_earnings variable. I would first transform the data, either by filtering out those outliers or by applying a log transformation.

  2. Older age = fewer data points: My second concern is the sparsity of data points as age increases. Older people tend to be less technologically competent, to have retired and thus have no earnings, or to have passed away. The data points are clearly concentrated between the mid-20s and the mid-60s. This may also contribute to the outlier problem, so I believe the best way to address this concern is to balance the data by filtering out the ages that have very few observations.

  3. Age as a discrete variable: Since weekly_earnings does not increase steadily as age increases, treating age as a numerical variable would not be appropriate. This is my last concern, and the way I address it is to transform age into a factor (categorical) variable.
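The two remedies mentioned under the first concern can be sketched as follows. The earnings values are hypothetical and are NOT taken from the survey, and the 1.5 × IQR fence is one common convention for flagging outliers rather than the rule used in this report.

```python
import math
import statistics

# Hypothetical weekly earnings, NOT taken from the survey data.
earnings = [300, 450, 500, 620, 700, 800, 950, 1100, 2884]

# Quartiles and interquartile range.
q1, _median, q3 = statistics.quantiles(earnings, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop observations outside the 1.5 * IQR fences.
kept = [e for e in earnings if lower <= e <= upper]

# Option 2: keep every observation but compress the scale with a log.
log_earnings = [math.log(e) for e in earnings]

print(kept)
print([round(v, 2) for v in log_earnings])
```

Filtering discards information about high earners, whereas the log transformation keeps every observation and only reduces the influence of the largest values.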

4) Modification from Q3

Instruction: Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:

a. Regression model using Beta Notation

The generalized form of the regression model using beta notation is weekly_earnings = b_0 + b_1(age) + b_2(hours_worked) + b_3(telecommute) + b_4(hourly_non_hourly) + b_5(full_or_part_time) + b_6(occupation_group) + b_7(industry) + b_8(education), where age is entered as a factor.

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 378.8 312.4 1.212 0.2254
as.factor(age)16 -150.1 259.1 -0.5792 0.5625
as.factor(age)17 -114.1 245.9 -0.4639 0.6428
as.factor(age)18 -177.4 227.3 -0.7805 0.4352
as.factor(age)19 -237.3 228.1 -1.04 0.2983
as.factor(age)20 -303.3 223.3 -1.358 0.1745
as.factor(age)21 -270.4 221.3 -1.221 0.222
as.factor(age)22 -400.8 220.1 -1.821 0.06865
as.factor(age)23 -335 218.6 -1.532 0.1255
as.factor(age)24 -353.1 218.8 -1.614 0.1066
as.factor(age)25 -282.4 218.3 -1.293 0.196
as.factor(age)26 -374.2 218.4 -1.714 0.08662
as.factor(age)27 -277.6 218.4 -1.271 0.2038
as.factor(age)28 -206.7 218.1 -0.9476 0.3434
as.factor(age)29 -239.6 218 -1.099 0.2718
as.factor(age)30 -212.2 218.3 -0.972 0.3311
as.factor(age)31 -181.6 217.2 -0.8363 0.403
as.factor(age)32 -159.5 218.7 -0.7295 0.4657
as.factor(age)33 -208.2 217.6 -0.957 0.3386
as.factor(age)34 -209.5 217.9 -0.9615 0.3363
as.factor(age)35 -166 218 -0.7615 0.4464
as.factor(age)36 -162.3 218.6 -0.7425 0.4578
as.factor(age)37 -238.1 218.9 -1.088 0.2767
as.factor(age)38 -165 218.6 -0.7551 0.4502
as.factor(age)39 -33.61 219 -0.1535 0.878
as.factor(age)40 -84.6 218.7 -0.3868 0.699
as.factor(age)41 -31.39 218.9 -0.1434 0.886
as.factor(age)42 -92.4 219.5 -0.421 0.6738
as.factor(age)43 -215.8 219.2 -0.9844 0.325
as.factor(age)44 -76.35 219 -0.3485 0.7274
as.factor(age)45 -66.81 217.9 -0.3067 0.7591
as.factor(age)46 -10.38 219.1 -0.04738 0.9622
as.factor(age)47 -162.1 218.9 -0.7405 0.459
as.factor(age)48 -69.7 218.6 -0.3188 0.7499
as.factor(age)49 -110 219 -0.5021 0.6156
as.factor(age)50 -108.4 217.8 -0.4975 0.6188
as.factor(age)51 -41.9 219.4 -0.191 0.8485
as.factor(age)52 -42.82 218.7 -0.1958 0.8448
as.factor(age)53 -87.1 218.5 -0.3985 0.6902
as.factor(age)54 -30.53 218.8 -0.1395 0.889
as.factor(age)55 34.81 218.5 0.1593 0.8734
as.factor(age)56 -112 218.5 -0.5126 0.6083
as.factor(age)57 -75.05 218.9 -0.3429 0.7317
as.factor(age)58 -58.09 218.5 -0.2658 0.7904
as.factor(age)59 34.81 219.6 0.1585 0.8741
as.factor(age)60 -49.22 219.4 -0.2243 0.8225
as.factor(age)61 -9.342 219 -0.04266 0.966
as.factor(age)62 -96.8 220.5 -0.439 0.6607
as.factor(age)63 -24.32 221.2 -0.1099 0.9125
as.factor(age)64 4.347 223.1 0.01948 0.9845
as.factor(age)65 -156.7 223.8 -0.7 0.484
as.factor(age)66 62.54 229.5 0.2726 0.7852
as.factor(age)67 -34.59 229.4 -0.1508 0.8801
as.factor(age)68 -71.58 232.4 -0.308 0.7581
as.factor(age)69 111.9 239.6 0.4669 0.6406
as.factor(age)70 -243.2 236.5 -1.029 0.3037
as.factor(age)71 -6.87 251.8 -0.02729 0.9782
as.factor(age)72 -236.4 284.5 -0.8311 0.406
as.factor(age)73 -317.5 261.6 -1.213 0.225
as.factor(age)74 -69.66 265.4 -0.2625 0.793
as.factor(age)75 48.62 284.5 0.1709 0.8643
as.factor(age)76 -391.2 284.9 -1.373 0.1697
as.factor(age)77 -143.9 309.1 -0.4656 0.6415
as.factor(age)78 276 308.8 0.8939 0.3714
as.factor(age)79 -387.8 358.4 -1.082 0.2792
as.factor(age)80 -299.6 270.9 -1.106 0.2687
as.factor(age)85 -155.1 277.1 -0.5598 0.5757
hours_worked 12.19 0.7286 16.74 2.437e-61
telecommute2 -83.79 15.57 -5.382 7.693e-08
hourly_non_hourly2 222.2 16.18 13.73 3.262e-42
full_or_part_time2 -210.7 24.35 -8.652 6.558e-18
occupation_group2 -70.72 23.16 -3.054 0.00227
occupation_group3 -272.9 27.29 -10 2.426e-23
occupation_group4 -197 30.65 -6.428 1.404e-10
occupation_group5 -330.7 25.59 -12.92 1.246e-37
occupation_group6 -164.9 127 -1.299 0.194
occupation_group7 -94.99 48.8 -1.947 0.05163
occupation_group8 -61.14 40.8 -1.499 0.134
occupation_group9 -271 37.72 -7.185 7.628e-13
occupation_group10 -253.1 36.57 -6.921 4.993e-12
industry2 454.3 113.5 4.002 6.36e-05
industry3 194.9 99.03 1.968 0.04917
industry4 229.9 92.91 2.474 0.01339
industry5 93.55 93.11 1.005 0.3151
industry6 268.3 96.17 2.79 0.005294
industry7 222.8 102.2 2.18 0.02932
industry8 176 94.05 1.872 0.06129
industry9 229.5 92.66 2.477 0.01327
industry10 54.8 91.66 0.5978 0.55
industry11 57.98 93.32 0.6214 0.5344
industry12 9.688 96.33 0.1006 0.9199
industry13 186.7 93.71 1.992 0.0464
education32 -143.1 303 -0.4722 0.6368
education33 -92.71 227.2 -0.408 0.6833
education34 7.227 234.6 0.0308 0.9754
education35 -74.15 223.7 -0.3314 0.7403
education36 115.2 215.8 0.5337 0.5936
education37 32.35 212.6 0.1522 0.8791
education38 50.05 220.4 0.2271 0.8203
education39 125.1 204.8 0.611 0.5412
education40 147.8 205 0.721 0.4709
education41 166.1 206.5 0.8043 0.4212
education42 200.7 206.1 0.9736 0.3303
education43 375.1 205.2 1.828 0.0676
education44 478.3 206.1 2.321 0.02032
education45 734.9 211.4 3.476 0.0005121
education46 673.1 210.7 3.195 0.001407
Fitting linear model: weekly_earnings ~ as.factor(age) + hours_worked + telecommute + hourly_non_hourly + full_or_part_time + occupation_group + industry + education
Observations = 5542, Residual Std. Error = 497, \(R^2\) = 0.4553, Adjusted \(R^2\) = 0.4447

b. Estimated form of the regression based on results

Based on the results from part A, the estimated form of the regression model is weekly_earnings = 378.8 - 150.1(age)16 - 114.1(age)17 - … - 155.1(age)85 + 12.19(hours_worked) - 83.79(non-telecommute) + 222.2(hourly_non_hourly) - 210.7(part_time) - 70.72(occupation_group2) - … - 253.1(occupation_group10) + 454.3(industry2) + … + 186.7(industry13) - 143.1(education32) - … + 673.1(education46)

Since many IVs are treated as factors, each level of a factor other than its reference level gets its own indicator variable and coefficient (e.g. education, occupation group, and industry).
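The way a factor expands into separate 0/1 variables can be sketched as below. This mimics the default treatment coding that the output suggests (the first level is the reference and has no coefficient); the helper name `dummy_code` is hypothetical, and the assumption that industry has levels 1 through 13 is inferred from the coefficient names industry2 to industry13.

```python
def dummy_code(value, levels):
    """Return 0/1 indicators for every level except the first (reference) one."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {f"level_{lvl}": int(value == lvl) for lvl in levels[1:]}


# Assumed levels 1..13, with level 1 as the reference.
industry_levels = list(range(1, 14))

row_industry_10 = dummy_code(10, industry_levels)
row_reference = dummy_code(1, industry_levels)

print(row_industry_10)
print(row_reference)
```

An observation in the reference level has all indicators equal to 0, so its effect is absorbed into the intercept.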

c. Collinearity

Question: Do you suspect any of your independent variables are collinear? Explain why or why not.

Yes, I suspect that some of my IVs are collinear, particularly telecommute, hourly_non_hourly, full_or_part_time, and hours_worked. The number of hours worked depends on whether employees telecommute, whether they are paid by the hour, and whether they work full time or part time. In addition, occupation_group, industry, and education also seem to be collinear, because education level indicates the kinds of positions available in each occupation and industry.
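One way to check such a suspicion is to measure how strongly two predictors move together. The sketch below computes a Pearson correlation and the corresponding variance inflation factor (VIF) for the two-predictor case, where VIF = 1 / (1 - r²). The data are hypothetical and NOT taken from the survey; they only illustrate how part-time status and hours worked could be related.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, NOT from the survey.
hours_worked = [20, 25, 40, 40, 45, 50]
part_time = [1, 1, 0, 0, 0, 0]   # 1 = part time, 0 = full time

r = pearson_r(hours_worked, part_time)
vif = 1 / (1 - r ** 2)   # valid as written only when there are two predictors

print(round(r, 3), round(vif, 2))
```

A VIF well above 1 means the coefficient of one predictor is estimated less precisely because the other predictor carries much of the same information.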

d. Ranges of Values

From part B, the linear equation is

weekly_earnings = 378.8 - 150.1(age)16 - 114.1(age)17 - … - 155.1(age)85 + 12.19(hours_worked) - 83.79(non-telecommute) + 222.2(hourly_non_hourly) - 210.7(part_time) - 70.72(occupation_group2) - … - 253.1(occupation_group10) + 454.3(industry2) + … + 186.7(industry13) - 143.1(education32) - … + 673.1(education46)
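As a worked example of how this equation is used, the sketch below adds up the relevant coefficients for one hypothetical individual. The profile is an assumption made for illustration: age 45, 40 hours worked, level 2 of hourly_non_hourly, industry 10, education 45, and the reference level of telecommute, full_or_part_time, and occupation_group (which contribute nothing). It is not meant to reproduce any predicted value in this report.

```python
# Coefficients copied from the full model output in part a.
intercept = 378.8
age_45 = -66.81              # as.factor(age)45
per_hour = 12.19             # hours_worked
hourly_non_hourly2 = 222.2
industry10 = 54.8
education45 = 734.9

hours = 40                   # hypothetical work week

# Reference levels (telecommute, full_or_part_time, occupation_group) add 0.
estimate = (
    intercept
    + age_45
    + per_hour * hours
    + hourly_non_hourly2
    + industry10
    + education45
)

print(round(estimate, 2))
```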

Judging by the output, the ranges of values over which I am most comfortable using this model to estimate future observations are as follows:

  • occupation_group = 3-5 and 9-10
  • industry = 2-4, 6, 7, 9, 13
  • education = 44-46 (45 being the most significant)
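The selection above amounts to keeping the factor levels whose coefficients are significant at the 5% level. A minimal sketch for the industry factor, using the p-values copied from the part a output (the 0.05 threshold is the conventional cutoff and is assumed here):

```python
# p-values for industry2..industry13, copied from the model output.
industry_p_values = {
    2: 6.36e-05, 3: 0.04917, 4: 0.01339, 5: 0.3151,
    6: 0.005294, 7: 0.02932, 8: 0.06129, 9: 0.01327,
    10: 0.55, 11: 0.5344, 12: 0.9199, 13: 0.0464,
}

alpha = 0.05  # assumed significance threshold

significant_levels = [
    level for level, p in industry_p_values.items() if p < alpha
]

print(significant_levels)
```

Levels 3 and 13 pass only narrowly (p = 0.04917 and p = 0.0464), so estimates for those industries deserve more caution than those for industry 2.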

e. Hypothetical Observation

Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.

Below is a hypothetical observation with Education = 45 and Industry = 10. Education = 45 lies within the ranges in part D; Industry = 10 does not, since its coefficient is not significant (p = 0.55).

Predicted Values

##         1         2         3         4         5         6         7         8 
## 1300.7479 1037.4839  984.7048 1128.6815 1376.4496 1714.5415 1587.1710 1711.5189 
##         9        10 
## 1586.5508 1671.7889
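The output above gives point estimates only. As a rough stand-in for the estimated range, the sketch below brackets each predicted value by two residual standard errors (497 for this model). This is an approximation made here for illustration: it ignores the uncertainty in the coefficients themselves, and a proper prediction interval (for example from R's `predict()` on the fitted model with `interval = "prediction"`) would be somewhat wider.

```python
residual_std_error = 497.0   # from the full model summary

# Predicted values copied from the output above.
predicted = [
    1300.7479, 1037.4839, 984.7048, 1128.6815, 1376.4496,
    1714.5415, 1587.1710, 1711.5189, 1586.5508, 1671.7889,
]

# Approximate range: prediction plus or minus two residual standard errors.
margin = 2 * residual_std_error
rough_ranges = [(p - margin, p + margin) for p in predicted]

for p, (low, high) in zip(predicted, rough_ranges):
    print(f"{p:10.2f}  [{low:8.2f}, {high:8.2f}]")
```

Even under this optimistic approximation each range spans nearly $2,000 per week, which reflects how much variation in earnings the model leaves unexplained (\(R^2\) = 0.4553).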