Introduction

The purpose of this markdown is to analyze Census Bureau data about internet and technology use.

Required Packages

The following packages were required for this analysis:

Package     Description
readr       Imports .csv files
skimr       Generates summary statistics
tidyverse   Loads the tidyverse packages
knitr       Renders RMarkdown documents
rmdformats  Provides RMarkdown themes
car         Companion to Applied Regression (regression diagnostics such as vif and crPlot)

Accessing the Data

Download the Data

To access the data used in this case, download the .csv file from the website referenced in the code below, then load the data into a variable as follows:

# download the web-based telework data (Census Bureau internet and technology
# use data) from asayanalytics.com

#download.file("http://asayanalytics.com/telework_csv","telework.csv")
# Read the data and store it in a data frame called telework. This data will be
# used for the analysis of weekly earnings and telecommuting below.
telework <- 
  read_csv("telework.csv")

Analysis of Variance

Task 1

Answer the question: Does telecommuting appear to have a significant effect on income?

To answer this question, the code groups the data by the telecommute variable and summarizes the average weekly pay for each group: Telecommute or Not Telecommute. The results are shown here:

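A minimal sketch of that grouping step, assuming the column names telecommute and weekly_earnings shown in the output:

# group by telecommute status and compute the mean weekly pay in each group
telework %>%
  group_by(telecommute) %>%
  summarize(average = mean(weekly_earnings, na.rm = TRUE))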
## # A tibble: 2 x 2
##   telecommute     average
##   <chr>             <dbl>
## 1 Not Telecommute    832.
## 2 Telecommute       1183.

The following boxplot shows how the data is distributed within each category. It clearly shows that the interquartile range for Telecommute sits at higher values than for Not Telecommute, suggesting that weekly pay for someone who telecommutes is higher than for someone who does not.

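A sketch of how such a boxplot could be drawn with ggplot2 (the exact aesthetics are assumptions):

# boxplot of weekly earnings for each telecommute category
ggplot(telework, aes(x = telecommute, y = weekly_earnings)) +
  geom_boxplot()
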
Lastly, an analysis of variance is performed to test the difference in weekly earnings between Telecommute and Not Telecommute. The results are as follows:

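A sketch of the corresponding call; the formula matches the Fit line in the output, while the object name telework_aov is assumed:

# one-way ANOVA of weekly earnings by telecommute status, followed by Tukey's HSD
telework_aov <- aov(weekly_earnings ~ telecommute, data = telework)
TukeyHSD(telework_aov)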
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ telecommute, data = telework)
## 
## $telecommute
##                                 diff      lwr      upr p adj
## Telecommute-Not Telecommute 350.7614 313.8213 387.7015     0

This confirms a significant difference in weekly pay between telecommuting and not telecommuting, with telecommuters earning roughly $351 more per week (95% family-wise confidence interval of about $314 to $388).

Task 2

  a. Build a simple regression model estimating weekly earnings by hours worked.

First, the regression equation will have the following form:

Weekly Earnings = B0 + B1*Hours_worked

  b. Next, we will develop a linear model to estimate weekly earnings from hours worked:
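A sketch of the model call behind this output (the object name hours_model is assumed):

# simple linear regression of weekly earnings on hours worked
hours_model <- lm(weekly_earnings ~ hours_worked, data = telework)
summary(hours_model)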
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

From the output, the equation becomes the following:

Weekly Earnings = 66.0433 + 22.5887*Hours_Worked

  c. This model would be considered a naive model for the following reasons:
  1. The R-squared value is very low (0.1555), pointing to a weak relationship between hours worked and weekly earnings.

  2. The residual standard error is very large; three standard deviations of the residuals span roughly +/- $1839 in weekly earnings.

  3. There are other variables that will impact weekly earnings aside from the number of hours worked, such as age, profession, and geographic location.

  d. I do not believe that modeling just these two variables together makes sense, as other variables are likely to also have an impact on weekly earnings. As an example, the following scatter plot shows weekly earnings against hours worked.

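A sketch of that scatter plot (the plotting choices are assumptions):

# weekly earnings plotted against hours worked
ggplot(telework, aes(x = hours_worked, y = weekly_earnings)) +
  geom_point(alpha = 0.3)
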
It would make more sense to include additional variables in the analysis, such as age and education.

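A sketch of the expanded model call; the formula matches the Call line in the output below, while the object name expanded_model is assumed:

# weekly earnings modeled on hours worked, age, and education (treated as a factor)
expanded_model <- lm(weekly_earnings ~ hours_worked + age + as.factor(education),
                     data = telework)
summary(expanded_model)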
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + age + as.factor(education), 
##     data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1819.8  -327.1  -100.1   207.7  2919.7 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -521.2880   224.2680  -2.324 0.020140 *  
## hours_worked             20.8023     0.6298  33.029  < 2e-16 ***
## age                       7.3755     0.5283  13.961  < 2e-16 ***
## as.factor(education)32 -150.0121   328.4317  -0.457 0.647867    
## as.factor(education)33  -67.1497   246.5426  -0.272 0.785351    
## as.factor(education)34  -22.4919   254.0052  -0.089 0.929444    
## as.factor(education)35  -58.5856   240.7185  -0.243 0.807722    
## as.factor(education)36   42.6339   233.0099   0.183 0.854828    
## as.factor(education)37   42.8819   229.6468   0.187 0.851879    
## as.factor(education)38   -9.4211   238.7651  -0.039 0.968527    
## as.factor(education)39  121.3879   221.8683   0.547 0.584321    
## as.factor(education)40  170.1129   221.9959   0.766 0.443537    
## as.factor(education)41  218.2234   223.6653   0.976 0.329271    
## as.factor(education)42  276.0375   223.1423   1.237 0.216122    
## as.factor(education)43  573.3046   221.8825   2.584 0.009797 ** 
## as.factor(education)44  734.7774   222.5837   3.301 0.000969 ***
## as.factor(education)45 1018.4454   228.1172   4.465 8.18e-06 ***
## as.factor(education)46  899.5326   227.4533   3.955 7.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 542.2 on 5524 degrees of freedom
## Multiple R-squared:  0.3412, Adjusted R-squared:  0.3392 
## F-statistic: 168.3 on 17 and 5524 DF,  p-value: < 2.2e-16

This output shows that hours worked, the age of the worker, and education level are all highly significant, with the higher education levels (43 through 46, corresponding to Bachelor's through Doctorate degrees) also carrying high significance. Lastly, the R-squared value has more than doubled to 0.3412 by taking the additional variables into account.

Task 3

  a. Build a simple regression model estimating weekly earnings as a function of age.

First, the generalized regression equation will have the following form:

Weekly Earnings = B0 + B1*Age

  b. Next, develop a linear model to estimate weekly earnings as a function of age:
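A sketch of the model call behind this output (the object name age_model is assumed):

# simple linear regression of weekly earnings on age
age_model <- lm(weekly_earnings ~ age, data = telework)
summary(age_model)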
## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

From the output, the equation becomes the following:

Weekly Earnings = 548.9457 + 9.1941*Age

  c. This model is also naive, for the following reasons:
  1. The R-squared value is very low at 0.03696.

  2. Similar to Task 2, the residual standard error is high; three standard deviations of the residuals span roughly +/- $1963.80 in weekly earnings.

  3. Again, similar to Task 2, age is unlikely to be the only variable that impacts weekly earnings.

  d. Test the linearity of the model.

To test the linearity of the model, we first plot the data on a scatter plot.

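A sketch of that plot, with the linear fit overlaid for reference (the plotting choices are assumptions):

# weekly earnings plotted against age, with the fitted regression line
ggplot(telework, aes(x = age, y = weekly_earnings)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)
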
The scatter plot shows that the data does not follow a particularly well-defined pattern, but a larger portion of the data is concentrated near the lower end of the y-axis. This pulls the regression toward the lower end and suggests that a parabolic equation may describe the data better than a linear one.

One last check is to use the crPlot function from the car package to see how well the residuals fit the data.

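A sketch of that check using the car package (the model object name age_model is assumed):

# component + residual plot for the age term of the simple model
crPlot(age_model, variable = "age")
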
This plot shows that the data follows a parabolic curve (pink) more closely than a linear curve (blue).

  e. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naive design.
  1. The ranges of some of the variables do not make sense. For example, the data includes weekly earnings of zero. This could be addressed by filtering out observations with weekly earnings of $0 and then refitting the linear model (a sketch of this filter follows the list).

  2. From the residual plot, the curve of the residuals lies below the fitted linear curve, so most of the data sits below the linear fit. This suggests the linear model does not describe the data well and the regression does not meet the linearity assumption.

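A sketch of that filtering step, assuming the model would then be refit on the reduced data:

# drop observations reporting $0 in weekly earnings before refitting the model
telework_nonzero <- telework %>%
  filter(weekly_earnings > 0)
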
Task 4

  a. Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption.
To do this, the model from Task 3 is expanded with three additional independent variables, plus a squared hours worked term:

Hours Worked, Sex, and Education Level

The generalized regression equation will have the following form:

Weekly Earnings = B0 + B1(Age) + B2(Hours Worked) + B3(Hours Worked^2) + B4(Sex) + B5(Education Level)

  b. Next, develop a model to estimate weekly earnings as a function of age, hours worked, sex, and education level:
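A sketch of the model call; the formula matches the Call line in the output below, while the object name full_model is assumed:

# weekly earnings modeled on age, hours worked (linear and squared), sex, and education
full_model <- lm(weekly_earnings ~ age + hours_worked + I(hours_worked * hours_worked) +
                   as.factor(sex) + as.factor(education), data = telework)
summary(full_model)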
## 
## Call:
## lm(formula = weekly_earnings ~ age + hours_worked + I(hours_worked * 
##     hours_worked) + as.factor(sex) + as.factor(education), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1790.4  -317.9   -89.2   195.1  2719.5 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -366.54548  224.06707  -1.636 0.101923    
## age                               7.50059    0.52196  14.370  < 2e-16 ***
## hours_worked                     15.20603    1.92909   7.883 3.84e-15 ***
## I(hours_worked * hours_worked)    0.05373    0.02455   2.188 0.028690 *  
## as.factor(sex)2                -170.96284   14.82076 -11.535  < 2e-16 ***
## as.factor(education)32         -105.47356  324.45343  -0.325 0.745132    
## as.factor(education)33          -33.53017  243.56689  -0.138 0.890512    
## as.factor(education)34          -23.71301  250.94288  -0.094 0.924719    
## as.factor(education)35          -14.53608  237.84750  -0.061 0.951270    
## as.factor(education)36           71.72667  230.21070   0.312 0.755379    
## as.factor(education)37           80.22818  226.92978   0.354 0.723699    
## as.factor(education)38           13.44453  235.89492   0.057 0.954552    
## as.factor(education)39          168.51779  219.22679   0.769 0.442109    
## as.factor(education)40          225.98474  219.36731   1.030 0.302977    
## as.factor(education)41          277.47397  221.02670   1.255 0.209392    
## as.factor(education)42          339.73485  220.51780   1.541 0.123466    
## as.factor(education)43          627.37661  219.25544   2.861 0.004234 ** 
## as.factor(education)44          793.09086  219.97238   3.605 0.000314 ***
## as.factor(education)45         1043.66192  225.44840   4.629 3.75e-06 ***
## as.factor(education)46          937.66603  224.77858   4.172 3.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 535.6 on 5522 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3552 
## F-statistic: 161.6 on 19 and 5522 DF,  p-value: < 2.2e-16

Therefore the fitted equation takes the following form:

Weekly Earnings = -366.54 + 7.5*Age + 15.206*Hours_Worked + 0.0537*Hours_Worked^2 - 170.96*(Female) + (Education Level coefficient)

The second-to-last term, -170.96, only applies if the person is female; if the person is male, this term is 0.

The last term in the equation depends on the education level the person has attained, ranging from 12th grade through a doctorate. The model has a multiple R-squared of 0.3574 and significant p-values for most of the variables in the analysis, the exception being the education levels below 43 (a Bachelor's degree).

We can now check the residual assumptions of the new regression equation against the data:

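A sketch of that diagnostic, assuming the model object full_model from the sketch above:

# Residuals vs Fitted diagnostic plot for the expanded model
plot(full_model, which = 1)
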
From the Residuals vs Fitted plot, we can see that the fitted curve now follows the data more closely.

  c. Are any of the independent variables collinear?

To check for this, the Variance Inflation Factor (VIF) is generated for each variable to see whether it is strongly correlated with the other predictors.

The general guidelines are as follows:

VIF < 5 = Good, 5 < VIF < 10 = Possible Problem, VIF > 10 = Problem Very Likely

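A sketch of that check with the car package (the model object name full_model is assumed):

# generalized variance inflation factors for the expanded model
vif(full_model)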
##                                    GVIF Df GVIF^(1/(2*Df))
## age                            1.023502  1        1.011683
## hours_worked                   9.747518  1        3.122102
## I(hours_worked * hours_worked) 9.758047  1        3.123787
## as.factor(sex)                 1.060320  1        1.029719
## as.factor(education)           1.056072 15        1.001820

From the VIF test, age, sex, and education fall within the “Good” guideline. The hours_worked term and its square fall in the “Possible Problem” range (GVIF of roughly 9.75), which is expected, since a variable is naturally correlated with its own square.

  d. Judging by the output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?

The ranges over which I would expect this model to be most useful are as follows:

Starting with education, which is the most obvious from the output, the only levels with significant coefficients are those for Bachelor's degrees and above (levels 43 through 46). The coefficients for education levels below that are too imprecise to be relied on, so the model is best used for individuals with at least a Bachelor's degree.

The model appears to work well for both males and females.

Age is significant, but it should be considered together with hours worked and education level; for example, I would not expect anyone younger than about 20 to hold a Bachelor's degree, although realistically there are outliers.

Lastly, this regression should be applicable to any number of hours worked.

Part 1) Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual.

To answer this question, the following assumptions are made about the hypothetical individual:

Age: 36

Sex: Female

Number of Hours worked: 40

Education level: Masters degree (44)

Therefore the estimate for this person's weekly earnings is as follows:

Equation: Weekly Earnings = -366.54 + 7.5*36 + 15.206*40 + 0.0537*40^2 - 170.96 + 793.09

Weekly Earnings = $1219.75

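The same estimate can be produced with predict(), assuming the model object full_model from above and the coding female = 2 implied by the as.factor(sex)2 coefficient:

# hypothetical individual: age 36, female, 40 hours/week, Master's degree (education = 44)
new_obs <- data.frame(age = 36, hours_worked = 40, sex = 2, education = 44)
predict(full_model, newdata = new_obs)
# adding interval = "prediction" would return a model-based range for this individual
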
Part 2) In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.

Upper Limit Equation: Weekly Earnings = -366.54 + (2*224.067) + (7.5 + (2*0.522))*36 + (15.206 + (2*1.929))*40 + (0.0537 + (2*0.0245))*40^2 - 170.96 + (2*14.82) + 793.09 + (2*219.972)

Upper Limit weekly earnings = $2407.9

Lower Limit Equation: Weekly Earnings = -366.54 - (2*224.067) + (7.5 - (2*0.522))*36 + (15.206 - (2*1.929))*40 + (0.0537 - (2*0.0245))*40^2 - 170.96 - (2*14.82) + 793.09 - (2*219.972)

Lower Limit weekly earnings = $31.56

The upper and lower limits are drastically different. It is unlikely that every coefficient would be off by two standard errors in the same direction; a more realistic outcome involves some combination of over- and under-estimates. What is clear, however, is that the wide gap between the upper and lower limits is consistent with the modest R-squared of the regression equation.