Visualization of means for both factors (1 = Yes, 2 = No).
Because the p-value (< 2e-16) is below the 0.05 significance threshold, there is strong evidence against the null hypothesis. We therefore reject the null and conclude that mean weekly earnings differ significantly between employees who telework and those who do not. This is a naïve model for many reasons, but I will explain only two. The first relates to causality: the model cannot explain what causes the difference in mean earnings between the groups, because it omits variables such as education, seniority, and position or job title. The second is that weekly earnings are capped at $2884.61, which pulls the estimated means away from the true means; the capped observations are nonetheless meaningful and cannot be removed as outliers.
##                   Df Sum Sq Mean Sq F value Pr(>F)
## weekly_earnings    1   69.4   69.42   346.5 <2e-16 ***
## Residuals       5540 1109.9    0.20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
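As a check on the table above, the F statistic and p-value can be recomputed from the printed sums of squares. This is a minimal sketch that uses only the numbers shown in the output; the variable names are illustrative. Note that the term row is labelled weekly_earnings, which suggests the formula may have had telecommute on the left-hand side. With a single two-level factor the F statistic is the same in either direction, so the conclusion should not change.

```r
# Values copied from the one-way ANOVA table above.
ss_term  <- 69.4     # Sum Sq for the model term
ss_resid <- 1109.9   # Sum Sq for the residuals
df_term  <- 1
df_resid <- 5540

# Mean squares are sums of squares divided by their degrees of freedom.
ms_term  <- ss_term / df_term
ms_resid <- ss_resid / df_resid
f_value  <- ms_term / ms_resid

# The upper tail of the F distribution gives the p-value.
p_value <- pf(f_value, df_term, df_resid, lower.tail = FALSE)

# Share of total variation accounted for by the term (R-squared).
r_squared <- ss_term / (ss_term + ss_resid)

print(round(c(F = f_value, R2 = r_squared), 4))
print(p_value < 0.05)
```

The recomputed F differs slightly from the printed 346.5 only because the sums of squares in the table are rounded. The R-squared of roughly 6% is a reminder that a significant difference in means does not imply that teleworking explains much of the variation.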
These results tell a more detailed story than those in question one. The visualization shows that, on average, employees who telecommute from the Northeast and South earn more. It is important to remember that this is still a naive model, for the same reasons given above: causality and the weekly earnings cap defined by the Census Bureau. We can, however, refer back to these findings as we sift through the data.
When exploring the relationship between telecommuting and earnings, the most intuitive interaction is with the location of the employee. Interactions 1-8 combine whether the employee works remotely (1 = Yes, 2 = No) with the region (1 = Northeast, 2 = Midwest (formerly North Central), 3 = South, 4 = West).
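The eight telecommute-by-region cells can be enumerated to make the `telecommute:region` codes in the output easier to read. This is a small sketch using the codings stated above; the object names (`cells`, `telecommute_labels`, `region_labels`) are illustrative and not part of the original analysis.

```r
# Code-to-label lookups taken from the codings described in the text.
telecommute_labels <- c("1" = "Yes", "2" = "No")
region_labels <- c("1" = "Northeast", "2" = "Midwest", "3" = "South", "4" = "West")

# expand.grid varies the first factor fastest, which matches the order in
# which R labels the interaction cells (1:1, 2:1, 1:2, ...).
cells <- expand.grid(
  telecommute = names(telecommute_labels),
  region = names(region_labels),
  stringsAsFactors = FALSE
)
cells$code <- paste(cells$telecommute, cells$region, sep = ":")
cells$label <- paste(
  telecommute_labels[cells$telecommute],
  region_labels[cells$region],
  sep = " / "
)

print(cells[, c("code", "label")])
```

For example, code `2:3` reads as "does not telecommute, South".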
##                                                      Df    Sum Sq   Mean Sq F value Pr(>F)
## as.factor(telecommute)                                1 1.451e+08 145093488 347.185 <2e-16 ***
## as.factor(geography_region)                           3 5.370e+06   1790092   4.283  0.005 **
## as.factor(telecommute):as.factor(geography_region)    3 1.661e+06    553695   1.325  0.264
## Residuals                                          5534 2.313e+09    417914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
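Each F value in the two-way table is the term's mean square divided by the residual mean square, and each p-value is the upper tail of the corresponding F distribution. The sketch below recomputes them from the printed numbers; the data frame `anova_tab` and its column names are illustrative.

```r
# Mean squares and degrees of freedom copied from the two-way ANOVA table above.
anova_tab <- data.frame(
  term    = c("telecommute", "geography_region", "telecommute:geography_region"),
  df      = c(1, 3, 3),
  mean_sq = c(145093488, 1790092, 553695)
)
ms_resid <- 417914
df_resid <- 5534

# F = term mean square / residual mean square.
anova_tab$f_value <- anova_tab$mean_sq / ms_resid

# p-value = P(F(df, df_resid) > observed F).
anova_tab$p_value <- pf(anova_tab$f_value, anova_tab$df, df_resid, lower.tail = FALSE)

# Decision at the 0.05 significance threshold.
anova_tab$reject_null <- anova_tab$p_value < 0.05

print(anova_tab)
```

Only the two main effects clear the 0.05 threshold; the interaction term does not.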
Referring back to the one-way ANOVA, we rejected the null hypothesis because the p-value was less than the .05 significance threshold. By the same logic, we fail to reject the null hypothesis for the interaction between telecommuting and location, which has a p-value of .264 in the summary. Carrying this practice through the pairwise comparisons in the Tukey test below, many comparisons have an adjusted p-value greater than .05, which means those pairs of group means are not significantly different from one another. These pairwise comparisons are what I am interested in when analyzing the relationship between teleworking, location, and income. See below for the Tukey honestly significant difference results.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) * as.factor(geography_region), data = census)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.6655 -313.8574 0
##
## $`as.factor(geography_region)`
## diff lwr upr p adj
## 2-1 -95.70419 -168.503279 -22.90510 0.0040899
## 3-1 -40.49404 -106.204840 25.21677 0.3880429
## 4-1 -26.60643 -95.947938 42.73507 0.7574320
## 3-2 55.21015 -6.243971 116.66428 0.0961320
## 4-2 69.09776 3.775907 134.41960 0.0333042
## 4-3 13.88760 -43.428589 71.20379 0.9248496
##
## $`as.factor(telecommute):as.factor(geography_region)`
## diff lwr upr p adj
## 2:1-1:1 -401.605451 -538.70720 -264.50370 0.0000000
## 1:2-1:1 -180.250425 -338.51083 -21.99002 0.0130088
## 2:2-1:1 -462.661950 -594.55685 -330.76705 0.0000000
## 1:3-1:1 -67.620747 -206.22251 70.98102 0.8187460
## 2:3-1:1 -429.373830 -554.48119 -304.26647 0.0000000
## 1:4-1:1 -66.396908 -210.57576 77.78194 0.8592933
## 2:4-1:1 -409.633659 -539.16337 -280.10395 0.0000000
## 1:2-2:1 221.355026 86.65456 356.05550 0.0000178
## 2:2-2:1 -61.056499 -163.49735 41.38435 0.6152141
## 1:3-2:1 333.984703 223.04188 444.92753 0.0000000
## 2:3-2:1 -27.768379 -121.30828 65.77152 0.9861259
## 1:4-2:1 335.208543 217.37221 453.04488 0.0000000
## 2:4-2:1 -8.028209 -107.40530 91.34888 0.9999974
## 2:2-1:2 -282.411526 -411.80856 -153.01449 0.0000000
## 1:3-1:2 112.629677 -23.59725 248.85661 0.1925490
## 2:3-1:2 -249.123405 -371.59454 -126.65227 0.0000000
## 1:4-1:2 113.853517 -28.04387 255.75090 0.2255881
## 2:4-1:2 -229.383235 -356.36855 -102.39792 0.0000013
## 1:3-2:2 395.041203 290.60133 499.48107 0.0000000
## 2:3-2:2 33.288120 -52.43871 119.01495 0.9384753
## 1:4-2:2 396.265042 284.52974 508.00034 0.0000000
## 2:4-2:2 53.028291 -39.03246 145.08904 0.6564734
## 2:3-1:3 -361.753083 -457.47807 -266.02810 0.0000000
## 1:4-1:3 1.223840 -118.35443 120.80211 1.0000000
## 2:4-1:3 -342.012912 -443.44942 -240.57640 0.0000000
## 1:4-2:3 362.976922 259.34119 466.61265 0.0000000
## 2:4-2:3 19.740171 -62.30109 101.78143 0.9961293
## 2:4-1:4 -343.236752 -452.17002 -234.30348 0.0000000
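One way to read the long interaction block is to count which comparisons are significant and whether the two groups being compared share the same telecommute status. The sketch below copies the adjusted p-values from the output above (they are rounded as printed, so 0.0000000 stands for a very small value) and tabulates them; the variable names are illustrative.

```r
# Adjusted p-values copied from the interaction block of the Tukey output.
# Labels are telecommute:region (telecommute 1 = Yes, 2 = No; regions 1-4).
p_adj <- c(
  "2:1-1:1" = 0.0000000, "1:2-1:1" = 0.0130088, "2:2-1:1" = 0.0000000,
  "1:3-1:1" = 0.8187460, "2:3-1:1" = 0.0000000, "1:4-1:1" = 0.8592933,
  "2:4-1:1" = 0.0000000, "1:2-2:1" = 0.0000178, "2:2-2:1" = 0.6152141,
  "1:3-2:1" = 0.0000000, "2:3-2:1" = 0.9861259, "1:4-2:1" = 0.0000000,
  "2:4-2:1" = 0.9999974, "2:2-1:2" = 0.0000000, "1:3-1:2" = 0.1925490,
  "2:3-1:2" = 0.0000000, "1:4-1:2" = 0.2255881, "2:4-1:2" = 0.0000013,
  "1:3-2:2" = 0.0000000, "2:3-2:2" = 0.9384753, "1:4-2:2" = 0.0000000,
  "2:4-2:2" = 0.6564734, "2:3-1:3" = 0.0000000, "1:4-1:3" = 1.0000000,
  "2:4-1:3" = 0.0000000, "1:4-2:3" = 0.0000000, "2:4-2:3" = 0.9961293,
  "2:4-1:4" = 0.0000000
)

# Split each label "a:b-c:d" and keep the telecommute code of each group.
parts <- strsplit(names(p_adj), "[:-]")
tele_first  <- sapply(parts, `[`, 1)
tele_second <- sapply(parts, `[`, 3)
same_status <- tele_first == tele_second

significant <- p_adj < 0.05

# Cross-tabulate: same telecommute status vs. significant at 0.05.
print(table(same_status, significant))
print(names(p_adj)[significant & same_status])
```

With these inputs, all 16 comparisons between a telecommuting group and a non-telecommuting group fall below .05, while only one of the 12 within-status comparisons does (Midwest vs. Northeast among telecommuters). This is consistent with a strong telecommute effect and a non-significant interaction.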
Generalized form of the regression using beta notation (Y = a + bX)
weekly_earnings = 66.0433 + 22.5887(hours_worked)
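To make the fitted line concrete, the equation can be evaluated at a few values of hours worked. This is a minimal sketch using the two coefficients reported above; the helper `predict_earnings` is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the fitted equation above.
intercept <- 66.0433
slope     <- 22.5887   # dollars of weekly earnings per additional hour worked

# Evaluate the fitted line at the given hours.
predict_earnings <- function(hours_worked) {
  intercept + slope * hours_worked
}

hours <- c(0, 20, 40)
predictions <- data.frame(
  hours_worked = hours,
  predicted_weekly_earnings = predict_earnings(hours)
)
print(predictions)
```

The intercept is the predicted weekly earnings at zero hours, and each additional hour adds about $22.59 to the prediction.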
There are three reasons this would be considered a naïve model. First, the model does not capture the full relationship between hours worked and income. It has a Multiple R-squared of 0.1555 and an Adjusted R-squared of 0.1554, so it explains roughly 15.5% of the variance in weekly earnings, even though all coefficients are significant with p-values below .05. Second, the line is not a good fit because hours worked and income in their raw form are only weakly related. In the visualization, some individuals work zero hours and still earn close to $3,000 weekly, which shows the extreme variability in the data set. Lastly, the model does not account for occupational grouping (management, professionals, armed forces, C-suite, etc.), which would help explain earnings in relation to hours worked. A better understanding of these groupings and of how wages are paid (hourly or salary) can help provide a better fit.
To better understand the model, I believe it never makes sense to model hours worked and weekly earnings alone. Trimming the data or excluding outliers to define a more robust data set may give a better fit to the line, but this is not how the real world operates. The model should include outliers, as well as raw means, so that we can understand how it relates to the real world. There may be settings where performing these types of adjustments makes sense, but for a census data set with real-world implications it is important to understand the data exactly as it is. A forced line of best fit that paints an artificial picture of income may have a direct impact on interest rates or taxes. The best way to improve the model is to add independent variables that make sense for the data, providing a more holistic overview. The visualization and model below add a binary variable indicating whether the employee is hourly or non-hourly. This is still a naïve model; however, the Adjusted R-squared increases to 0.286, meaning 28.6% of the variance in weekly earnings is explained, with all coefficients below the .05 significance threshold. With the higher Adjusted R-squared, we can infer that we are on the right track to better understanding the data as it exists in the real world.
1 = Hourly Worker, 2 = Nonhourly Worker
##
## Call:
## lm(formula = weekly_earnings ~ (hours_worked) + (hourly_non_hourly),
## data = census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1819.8 -331.6 -135.5 213.8 2728.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -475.4739 31.2913 -15.20 <2e-16 ***
## hours_worked 18.2135 0.6645 27.41 <2e-16 ***
## hourly_non_hourly 498.5295 15.6472 31.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 563.5 on 5539 degrees of freedom
## Multiple R-squared: 0.2863, Adjusted R-squared: 0.2861
## F-statistic: 1111 on 2 and 5539 DF, p-value: < 2.2e-16
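The Adjusted R-squared and the overall F-statistic in the summary both follow from the Multiple R-squared, the number of predictors, and the sample size. The sketch below recomputes them from the printed values; it assumes the sample size is the residual degrees of freedom plus the three estimated coefficients.

```r
# Values copied from the model summary above.
r2       <- 0.2863
p        <- 2      # predictors: hours_worked, hourly_non_hourly
df_resid <- 5539
n        <- df_resid + p + 1   # assumed number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(round(c(n = n, adj_r2 = adj_r2, f_stat = f_stat), 4))
```

With more than 5,000 observations and only two predictors, the adjustment is tiny, which is why the two R-squared values are nearly identical.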
1 = Full-Time Labor Force, 2 = Part-Time Labor Force
Generalized form of the regression using beta notation (Y = a + b1X1 + b2X2)
weekly_earnings = 1336.2752 + 8.6106(age) - 664.8508(full_or_part_time)
Three reasons that this can be considered a naïve model are unknown lurking variables, the possibility that the relationship is coincidental, and bi-directional causality. Lurking variables such as education, demographics, and industry can influence the relationship between earnings, age, and full- or part-time work, producing a correlation that is not inherent to these variables. My next point concerns coincidence. We expect a relationship between earnings, age, and full- or part-time status, in part because of laws that protect children from participating in the work force. That relationship, however, cannot be established using only these variables in the model. When discussing earnings as a function of age and employment status, it is important to keep in mind the anomalies that may skew the data set or produce a relationship that the model does not fully explain. Finally, the variables may be bi-directionally causal. Bi-directional causality can make the variables appear correlated without the model properly defining the relationship for the entire data set.
##
## Call:
## lm(formula = weekly_earnings ~ (age) + (full_or_part_time), data = census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1179.1 -419.7 -146.9 233.6 2636.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1336.2752 38.0649 35.10 <2e-16 ***
## age 8.6106 0.5889 14.62 <2e-16 ***
## full_or_part_time -664.8508 23.1961 -28.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 610.9 on 5539 degrees of freedom
## Multiple R-squared: 0.1613, Adjusted R-squared: 0.161
## F-statistic: 532.8 on 2 and 5539 DF, p-value: < 2.2e-16
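The t values in the coefficient table are each estimate divided by its standard error, and the same two columns give approximate 95% confidence intervals. This is a minimal sketch built from the printed estimates; the data frame `coefs` and its column names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above.
coefs <- data.frame(
  term      = c("(Intercept)", "age", "full_or_part_time"),
  estimate  = c(1336.2752, 8.6106, -664.8508),
  std_error = c(38.0649, 0.5889, 23.1961)
)
df_resid <- 5539

# t value = estimate / standard error.
coefs$t_value <- coefs$estimate / coefs$std_error

# 95% confidence interval using the t distribution on the residual df.
t_crit <- qt(0.975, df_resid)
coefs$lower_95 <- coefs$estimate - t_crit * coefs$std_error
coefs$upper_95 <- coefs$estimate + t_crit * coefs$std_error

print(coefs)
```

Neither the age interval nor the full_or_part_time interval contains zero, which agrees with the significance stars in the summary.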
Three concerns for this model are as follows: a slight deviation from linearity in the residuals vs. fitted plot, significant skew away from the normal line in the Q-Q plot, and an increasing, unequal spread of points in the scale-location plot. I would address these concerns by checking each regression assumption and determining how the summary statistics support or undermine the relationship, which encourages further exploratory analysis. I would then test other variables alongside age to find a better fit for the regression. Last, I would add a third intuitive variable in an effort to improve regression statistics such as slope coefficients, p-values, or Multiple/Adjusted R-squared, followed by re-testing linearity.
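The three diagnostic plots named above can be produced with base R's `plot()` method for `lm` objects. The sketch below is an illustration on simulated data, not the census data: the data frame `sim`, its coefficients, and the skewed noise are all made up so the example runs on its own. It also shows one possible way to test for curvature in age by comparing nested models.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the census variables (hypothetical values).
sim <- data.frame(
  age = sample(18:65, n, replace = TRUE),
  full_or_part_time = sample(1:2, n, replace = TRUE)
)
# Right-skewed noise, loosely mimicking the skew seen in earnings data.
sim$weekly_earnings <- 1300 + 9 * sim$age - 650 * sim$full_or_part_time +
  rexp(n, rate = 1 / 300)

fit <- lm(weekly_earnings ~ age + full_or_part_time, data = sim)

pdf(NULL)               # null device: draw the plots without creating a file
par(mfrow = c(1, 3))
plot(fit, which = 1)    # residuals vs fitted: checks linearity
plot(fit, which = 2)    # normal Q-Q: checks normality of residuals
plot(fit, which = 3)    # scale-location: checks constant variance
invisible(dev.off())

# One way to test for non-linearity in age: add a quadratic term and
# compare the nested models with an F test.
fit_quad <- update(fit, . ~ . + I(age^2))
print(anova(fit, fit_quad))
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so the plots appear on screen.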
Weekly Earnings ~ Age + Full or Part-Time + Hours Worked + Hourly/Non-Hourly + Telecommute
Generalized form of the regression using beta notation (Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5)
weekly_earnings = 179.5795 + 7.1517(Age) - 274.1562(Full or Part-Time) + 13.3996(Hours Worked) + 413.3548(Hourly/Non-Hourly) - 199.1879(Telecommute)
At a quick glance at the ANOVA table for the model, every variable has a large, highly significant F value. The variables that stand out to me are telecommute and hours worked, entered after age; their F values are smaller than those of the other added variables but still large. The variables I feel most comfortable using to project future observations are full or part-time and hourly or non-hourly, entered after age, as these have the highest F values, reflecting larger differences at the population level. F values do not measure collinearity directly, so I dug deeper with the cor function: all pairwise correlations between predictors are smaller than .7 in absolute value (the largest is -0.573, between full or part-time and hours worked), suggesting that no two predictors are strongly collinear. Another test for collinearity is the VIF (Variance Inflation Factor). The VIF scores for all variables can be seen below and are well under the benchmark of 10 (the largest is 1.51), supporting the conclusion that collinearity is not a concern. The plots below are used to visualize normality and fit for the variables.
##
## Call:
## lm(formula = weekly_earnings ~ (age) + (full_or_part_time) +
## (hours_worked) + (hourly_non_hourly) + (telecommute), data = census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1606.4 -338.1 -101.3 201.8 2555.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 179.5795 68.4115 2.625 0.00869 **
## age 7.1517 0.5277 13.552 < 2e-16 ***
## full_or_part_time -274.1562 25.2597 -10.854 < 2e-16 ***
## hours_worked 13.3996 0.7694 17.415 < 2e-16 ***
## hourly_non_hourly 413.3548 15.6260 26.453 < 2e-16 ***
## telecommute -199.1879 16.2554 -12.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 542 on 5536 degrees of freedom
## Multiple R-squared: 0.3401, Adjusted R-squared: 0.3395
## F-statistic: 570.7 on 5 and 5536 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weekly_earnings
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 91089447 91089447 310.03 < 2.2e-16 ***
## full_or_part_time 1 306594518 306594518 1043.52 < 2.2e-16 ***
## hours_worked 1 136352870 136352870 464.09 < 2.2e-16 ***
## hourly_non_hourly 1 260190837 260190837 885.58 < 2.2e-16 ***
## telecommute 1 44115715 44115715 150.15 < 2.2e-16 ***
## Residuals 5536 1626516649 293807
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
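The sequential ANOVA table also contains enough information to recover the Multiple R-squared and the residual standard error reported in the model summary. The sketch below does this from the printed sums of squares; the variable names are illustrative. Because `anova()` on a single `lm` fit reports sequential sums of squares, each term's share depends on the order in which the variables were entered.

```r
# Sums of squares copied from the Analysis of Variance Table above.
sum_sq <- c(
  age               = 91089447,
  full_or_part_time = 306594518,
  hours_worked      = 136352870,
  hourly_non_hourly = 260190837,
  telecommute       = 44115715
)
ss_resid <- 1626516649
df_resid <- 5536

ss_total <- sum(sum_sq) + ss_resid

# R-squared: explained sum of squares over total sum of squares.
r_squared <- sum(sum_sq) / ss_total

# Residual standard error: square root of the residual mean square.
rse <- sqrt(ss_resid / df_resid)

# Sequential share of total variation attributed to each term.
share <- round(sum_sq / ss_total, 4)

print(round(c(r_squared = r_squared, rse = rse), 4))
print(share)
```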
##                            age full_or_part_time hours_worked hourly_non_hourly telecommute
## age                1.000000000       -0.03457020 -0.003209048         0.1311379  0.01282162
## full_or_part_time -0.034570197        1.00000000 -0.573335104        -0.2027057  0.08838726
## hours_worked      -0.003209048       -0.57333510  1.000000000         0.2066567 -0.10267416
## hourly_non_hourly  0.131137879       -0.20270566  0.206656746         1.0000000 -0.22800380
## telecommute        0.012821617        0.08838726 -0.102674164        -0.2280038  1.00000000
##               age full_or_part_time      hours_worked hourly_non_hourly       telecommute
##          1.021344          1.508096          1.513927          1.126102          1.060613
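The correlation matrix and the VIF scores are two views of the same information: for a linear model with single-column predictors, the VIFs equal the diagonal of the inverse of the predictor correlation matrix. The sketch below applies that identity to the correlations printed above. The original VIF call is not shown, so this is a cross-check under the assumption that both outputs were computed on the same five predictors and the same rows.

```r
vars <- c("age", "full_or_part_time", "hours_worked",
          "hourly_non_hourly", "telecommute")

# Predictor correlation matrix copied from the cor() output above.
R <- matrix(c(
   1.000000000, -0.034570197, -0.003209048,  0.131137879,  0.012821617,
  -0.034570197,  1.000000000, -0.573335104, -0.202705660,  0.088387260,
  -0.003209048, -0.573335104,  1.000000000,  0.206656746, -0.102674164,
   0.131137879, -0.202705660,  0.206656746,  1.000000000, -0.228003800,
   0.012821617,  0.088387260, -0.102674164, -0.228003800,  1.000000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Largest absolute pairwise correlation between distinct predictors.
max_abs_cor <- max(abs(R[upper.tri(R)]))

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif_from_cor <- diag(solve(R))

# VIF values as printed in the output above, for comparison.
reported_vif <- c(1.021344, 1.508096, 1.513927, 1.126102, 1.060613)

print(round(max_abs_cor, 3))
print(round(cbind(vif_from_cor, reported_vif), 4))
```

The two highest VIFs belong to full_or_part_time and hours_worked, the pair with the strongest correlation, but both remain far below the benchmark of 10.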
Below is a visualization of the weekly earnings prediction from the linear regression model for an individual who is 24 years old, works full time at 40 hours a week, and telecommutes. The boxplot depicts the mean weekly earnings projection, as well as the corresponding quartiles for individuals above and below the mean.
##   age full_or_part_time hours_worked telecommute
## 1  24                 1           40           1
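For reference, the point prediction for this individual can be worked out by hand from the five-variable model's coefficients. This is a sketch under an assumption: the data frame above does not list hourly_non_hourly, so the prediction is computed for both codes (1 = hourly, 2 = non-hourly), and the plotted prediction may come from a fit that omits that variable. The helper `predict_weekly` is a hypothetical name.

```r
# Coefficients copied from the five-variable model summary.
coefs <- c(
  intercept         = 179.5795,
  age               = 7.1517,
  full_or_part_time = -274.1562,
  hours_worked      = 13.3996,
  hourly_non_hourly = 413.3548,
  telecommute       = -199.1879
)

# The individual described above: 24 years old, full time, 40 hours, telecommutes.
new_obs <- data.frame(age = 24, full_or_part_time = 1,
                      hours_worked = 40, telecommute = 1)

# hourly_code is an assumed input (1 = hourly, 2 = non-hourly), since it is
# not part of the data frame shown in the output.
predict_weekly <- function(obs, hourly_code) {
  unname(
    coefs["intercept"] +
      coefs["age"] * obs$age +
      coefs["full_or_part_time"] * obs$full_or_part_time +
      coefs["hours_worked"] * obs$hours_worked +
      coefs["hourly_non_hourly"] * hourly_code +
      coefs["telecommute"] * obs$telecommute
  )
}

pred_hourly     <- predict_weekly(new_obs, hourly_code = 1)
pred_non_hourly <- predict_weekly(new_obs, hourly_code = 2)
print(round(c(hourly = pred_hourly, non_hourly = pred_non_hourly), 2))
```

The gap between the two predictions is exactly the hourly_non_hourly coefficient, about $413 per week.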