The purpose of this markdown is to perform analysis on Census Bureau data about internet and technology use.
The following packages were required for this analysis:
| Package | Description |
|---|---|
| readr | Allows the importation of .csv files |
| skimr | Grants the ability to generate summary statistics |
| tidyverse | Loads the tidyverse packages |
| knitr | Renders RMarkdown documents |
| rmdformats | Provides RMarkdown themes |
| car | Companion to Applied Regression package |
To access the data used in this case, download the .csv file from the website referenced in the code below. Then load the data into a variable as follows:
# Download the web-based telework data from the case website
# (uncomment the line below on the first run).
#download.file("http://asayanalytics.com/telework_csv","telework.csv")
# Read the data and store it in a tibble called telework. This data on
# telecommuting, hours worked, and weekly earnings will be used in the
# analysis that follows.
telework <-
read_csv("telework.csv")
Answer the question: Does telecommuting appear to have a significant effect on income?
To answer this question, the code groups the data by the telecommute variable and summarizes the average weekly pay for each group: Telecommute or Not Telecommute. The results are shown here:
## # A tibble: 2 x 2
## telecommute average
## <chr> <dbl>
## 1 Not Telecommute 832.
## 2 Telecommute 1183.
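The summary above can be reproduced with dplyr (loaded with the tidyverse); the column names telecommute and weekly_earnings match those used in the models later in this document:

```r
# Group by telecommute status and compute the average weekly earnings
telework %>%
  group_by(telecommute) %>%
  summarize(average = mean(weekly_earnings, na.rm = TRUE))
```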
The following boxplot shows how the data is distributed for each category. It clearly shows that the interquartile range for Telecommute sits at higher values than for Not Telecommute, suggesting that the weekly pay for someone who telecommutes is higher than for those who do not.
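A minimal sketch of how such a boxplot can be drawn with ggplot2 (the exact styling of the original figure is an assumption):

```r
# Boxplot of weekly earnings by telecommute status
ggplot(telework, aes(x = telecommute, y = weekly_earnings)) +
  geom_boxplot() +
  labs(x = "Telecommute status", y = "Weekly earnings ($)")
```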
Lastly, an analysis of variance is performed to look at the difference between Telecommute and Not Telecommute. The results are as follows:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ telecommute, data = telework)
##
## $telecommute
## diff lwr upr p adj
## Telecommute-Not Telecommute 350.7614 313.8213 387.7015 0
This shows that there is a difference between telecommuting and not telecommuting in terms of weekly pay, with telecommute having the higher value.
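For reference, the fit and comparison above come from calls like the following (the object name telework_aov is arbitrary):

```r
# One-way ANOVA of weekly earnings by telecommute status, followed by
# Tukey's multiple comparison of means
telework_aov <- aov(weekly_earnings ~ telecommute, data = telework)
TukeyHSD(telework_aov)
```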
First, the regression equation will have the following form:
Weekly Earnings = B0 + B1*Hours_worked
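The model is fit and summarized roughly as follows (the object name model_hours is arbitrary); the summary output appears below:

```r
# Simple linear regression of weekly earnings on hours worked
model_hours <- lm(weekly_earnings ~ hours_worked, data = telework)
summary(model_hours)
```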
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
From the output, the equation becomes the following:
Weekly Earnings = 66.0433 + 22.5887*Hours_Worked
- The R-squared value is very low, pointing to a weak relationship between hours worked and weekly earnings.
- The residual standard error is very large: three standard errors correspond to roughly +/- $1,839 in weekly earnings.
- Other variables beyond the number of hours worked will impact weekly earnings, such as age, profession, and geographic location.
It would make more sense to include additional variables in the analysis, such as age and education.
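A sketch of the expanded fit, with education treated as a factor (the object name model_multi is arbitrary); the summary output appears below:

```r
# Add age and education (as a categorical variable) to the model
model_multi <- lm(weekly_earnings ~ hours_worked + age + as.factor(education),
                  data = telework)
summary(model_multi)
```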
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + age + as.factor(education),
## data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1819.8 -327.1 -100.1 207.7 2919.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -521.2880 224.2680 -2.324 0.020140 *
## hours_worked 20.8023 0.6298 33.029 < 2e-16 ***
## age 7.3755 0.5283 13.961 < 2e-16 ***
## as.factor(education)32 -150.0121 328.4317 -0.457 0.647867
## as.factor(education)33 -67.1497 246.5426 -0.272 0.785351
## as.factor(education)34 -22.4919 254.0052 -0.089 0.929444
## as.factor(education)35 -58.5856 240.7185 -0.243 0.807722
## as.factor(education)36 42.6339 233.0099 0.183 0.854828
## as.factor(education)37 42.8819 229.6468 0.187 0.851879
## as.factor(education)38 -9.4211 238.7651 -0.039 0.968527
## as.factor(education)39 121.3879 221.8683 0.547 0.584321
## as.factor(education)40 170.1129 221.9959 0.766 0.443537
## as.factor(education)41 218.2234 223.6653 0.976 0.329271
## as.factor(education)42 276.0375 223.1423 1.237 0.216122
## as.factor(education)43 573.3046 221.8825 2.584 0.009797 **
## as.factor(education)44 734.7774 222.5837 3.301 0.000969 ***
## as.factor(education)45 1018.4454 228.1172 4.465 8.18e-06 ***
## as.factor(education)46 899.5326 227.4533 3.955 7.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 542.2 on 5524 degrees of freedom
## Multiple R-squared: 0.3412, Adjusted R-squared: 0.3392
## F-statistic: 168.3 on 17 and 5524 DF, p-value: < 2.2e-16
This output shows that hours worked and the age of the worker are highly significant, and that among the education levels only the higher codes (43 through 46, corresponding to Bachelor's degrees through Doctorate degrees) are significant. Lastly, the R-squared value has more than doubled to 0.3412 by taking the additional variables into account.
First, the generalized regression equation will have the following form:
Weekly Earnings = B0 + B1*Age
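The age-only model is fit in the same way (the object name model_age is arbitrary):

```r
# Simple linear regression of weekly earnings on age
model_age <- lm(weekly_earnings ~ age, data = telework)
summary(model_age)
```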
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
From the output, the equation becomes the following:
Weekly Earnings = 548.9457 + 9.1941*Age
- The R-squared value is very low at 0.03696.
- Similar to task 2, the residual standard error is high: three standard errors correspond to roughly +/- $1,963.80 in weekly earnings.
- Again, similar to task 2, age is unlikely to be the only variable that impacts weekly earnings.
To test the linearity of the model, we first plot the data on a scatter plot.
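A sketch of the scatter plot of weekly earnings against age, with the fitted line overlaid (the styling of the original figure is an assumption):

```r
# Weekly earnings versus age, with the least-squares line for reference
ggplot(telework, aes(x = age, y = weekly_earnings)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Weekly earnings ($)")
```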
The scatter plot shows that the data does not follow a particularly well-defined pattern, but a larger portion of the data is located near the lower end of the y-axis. This pulls the regression toward the lower end, and suggests a parabolic equation may fit better than a linear one.
One last check is to use the crPlot function to understand how the residuals are fit to the data.
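The component + residual plot comes from the car package; a call along these lines, applied to the age-only model, produces the figure discussed below:

```r
# Component + residual (partial residual) plot for age
crPlot(model_age, variable = "age")
```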
This plot shows that the data follows a parabolic curve (pink) more closely than a linear curve (blue).
The ranges of some of the variables do not make sense; for example, weekly earnings values of zero are included. This could be addressed by filtering out the rows with weekly earnings of $0 and then refitting the linear model, as sketched below.
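A minimal sketch of that filtering step (a suggested cleaning step, not one applied to the results shown here):

```r
# Keep only rows with positive weekly earnings before refitting
telework_nonzero <- telework %>%
  filter(weekly_earnings > 0)
```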
From the residuals plot, we can see that the residual curve lies below the linear fitted line; most of the data falls below the linear fit, suggesting the linear model is not a good fit and the regression does not meet the linearity assumption.
To do this, three additional variables are added to supplement the equation based on Age. The variables are the following:

- Hours Worked
- Sex
- Education Level
The generalized regression equation will have the following form:
Weekly Earnings = B0 + B1*Age + B2*Hours_Worked + B3*Hours_Worked^2 + B4*Sex + B5*Education_Level
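A sketch of the corresponding fit (the object name model_full is arbitrary); the summary output appears below:

```r
# Full model: age, hours worked and its square, sex, and education as factors
model_full <- lm(weekly_earnings ~ age + hours_worked +
                   I(hours_worked * hours_worked) +
                   as.factor(sex) + as.factor(education),
                 data = telework)
summary(model_full)
```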
##
## Call:
## lm(formula = weekly_earnings ~ age + hours_worked + I(hours_worked *
## hours_worked) + as.factor(sex) + as.factor(education), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1790.4 -317.9 -89.2 195.1 2719.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -366.54548 224.06707 -1.636 0.101923
## age 7.50059 0.52196 14.370 < 2e-16 ***
## hours_worked 15.20603 1.92909 7.883 3.84e-15 ***
## I(hours_worked * hours_worked) 0.05373 0.02455 2.188 0.028690 *
## as.factor(sex)2 -170.96284 14.82076 -11.535 < 2e-16 ***
## as.factor(education)32 -105.47356 324.45343 -0.325 0.745132
## as.factor(education)33 -33.53017 243.56689 -0.138 0.890512
## as.factor(education)34 -23.71301 250.94288 -0.094 0.924719
## as.factor(education)35 -14.53608 237.84750 -0.061 0.951270
## as.factor(education)36 71.72667 230.21070 0.312 0.755379
## as.factor(education)37 80.22818 226.92978 0.354 0.723699
## as.factor(education)38 13.44453 235.89492 0.057 0.954552
## as.factor(education)39 168.51779 219.22679 0.769 0.442109
## as.factor(education)40 225.98474 219.36731 1.030 0.302977
## as.factor(education)41 277.47397 221.02670 1.255 0.209392
## as.factor(education)42 339.73485 220.51780 1.541 0.123466
## as.factor(education)43 627.37661 219.25544 2.861 0.004234 **
## as.factor(education)44 793.09086 219.97238 3.605 0.000314 ***
## as.factor(education)45 1043.66192 225.44840 4.629 3.75e-06 ***
## as.factor(education)46 937.66603 224.77858 4.172 3.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 535.6 on 5522 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3552
## F-statistic: 161.6 on 19 and 5522 DF, p-value: < 2.2e-16
Therefore, the equation takes the following form:
Weekly Earnings = -366.54 + 7.5 * age + 15.206 * hours_worked + .0537 * hours_worked^2 - 170.96 (if female) + Education Level coefficient
The second-to-last coefficient, -170.96, only applies if the person is female. If the person is male, this term equals 0.
The last term in the equation depends on the education level the person has attained, ranging from a 12th-grade level to a doctorate. The model has an R-squared of 0.3574, and the p-values are significant for most of the variables in the analysis, the exception being the education indicators for lower levels of education (codes below 43, a Bachelor's degree).
We can now look at how well the residual assumptions of the new regression equation hold for the data:
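The Residuals vs Fitted diagnostic can be drawn directly from the model object:

```r
# First of the standard lm diagnostic plots: Residuals vs Fitted
plot(model_full, which = 1)
```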
From the Residuals vs Fitted Plot, we can now see that the fitted curve better fits the data.
To check for multicollinearity among the predictors, the Variance Inflation Factor is generated for each variable to see whether it correlates with the others.
The general guidelines are as follows:
VIF < 5 = Good, 5 < VIF < 10 = Possible Problem, VIF > 10 = Problem Very Likely
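The values shown below come from the car package's vif() function applied to the full model; GVIF values are reported because education and sex enter as factors:

```r
# Generalized Variance Inflation Factors for the full model
vif(model_full)
```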
## GVIF Df GVIF^(1/(2*Df))
## age 1.023502 1 1.011683
## hours_worked 9.747518 1 3.122102
## I(hours_worked * hours_worked) 9.758047 1 3.123787
## as.factor(sex) 1.060320 1 1.029719
## as.factor(education) 1.056072 15 1.001820
From the VIF test, age, sex, and education fall within the “Good” guideline. The GVIF values for hours_worked and its square are just under 10, which mainly reflects the expected correlation between a variable and its own square rather than collinearity between distinct predictors.
The ranges that I would expect the data to be most useful are as follows:
Starting with education, which is the most obvious from the data, the only education levels that were significant were those for people with Bachelor's degrees and up; the data for anything below that level is too noisy to be representative.
The model appears to work well for both males and females.
The age of the person is significant alongside the number of hours worked and the education level. I would not expect anyone younger than 20 to hold a Bachelor's degree, although realistically there are outliers.
Lastly, this regression should be applicable to any number of hours worked.
Part 1) Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual.
To answer this question, the following assumptions are made:
Age: 36
Sex: Female
Number of Hours worked: 40
Education level: Masters degree (44)
Therefore, the estimate for this person's weekly earnings is as follows:
Equation: Weekly Earnings = -366.54 + 7.5 * 36 + 15.206 * 40 + .0537 * 40^2 - 170.96 + 793.09
Weekly Earnings = $1219.75
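The same estimate can be obtained from the model with predict(), which avoids rounding the coefficients by hand. Here new_person is a hypothetical observation constructed for this illustration; sex = 2 codes female and education = 44 codes a Masters degree, per the discussion above:

```r
# Hypothetical individual matching the assumptions above
new_person <- data.frame(age = 36, sex = 2, hours_worked = 40, education = 44)
predict(model_full, newdata = new_person)
```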
Part 2) In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.
Upper Limit Equation: Weekly Earnings = -366.54 + (2* 224.067) +(7.5+ (2* .522)) * 36 + (15.206 +(2* 1.929)) * 40 + (.0537+ (2* .0245)) * 40^2 - 170.96+ (2* 14.82) + 793.09 + (2* 219.972)
Upper Limit weekly earnings = $2407.9
Lower Limit Equation: Weekly Earnings = -366.54 - (2* 224.067) +(7.5- (2* .522)) * 36 + (15.206 -(2* 1.929)) * 40 + (.0537- (2* .0245)) * 40^2 - 170.96- (2* 14.82) + 793.09 - (2* 219.972)
Lower Limit weekly earnings = $31.56
The upper and lower limits are drastically different. While it is unlikely that every coefficient would be off by two standard errors in the same direction, some combination of those errors will determine the actual weekly earnings. What is clear, however, is that the wide gap between the upper and lower limits is consistent with the low R-squared value of the regression equation.
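For comparison, a standard 95% prediction interval from predict() accounts for coefficient uncertainty and residual variation jointly, rather than shifting every coefficient by two standard errors at once. A sketch using the hypothetical observation from Part 1:

```r
# 95% prediction interval for the hypothetical individual's weekly earnings
predict(model_full, newdata = new_person, interval = "prediction", level = 0.95)
```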