Occupational Importance: A Final Examish

We are to embark on a study of occupational importance deploying data on the subject from some time back in Canada. Some might call it prestige. The outcome of interest is the importance attached to the occupation in a broad sample of Canadian individuals. The data are contained in Occupational.Importance in the Markdown below. To make the data active in the RStudio at any point, simply press play to the code chunk below.

## 'data.frame':    102 obs. of  7 variables:
##  $ title     : chr  "GOV.ADMINISTRATORS" "GENERAL.MANAGERS" "ACCOUNTANTS" "PURCHASING.OFFICERS" ...
##  $ education : num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income    : num  12351 25879 9271 8865 8403 ...
##  $ women     : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ importance: num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census    : num  1113 1130 1171 1175 2111 ...
##  $ type      : chr  "prof" "prof" "prof" "prof" ...

The outcome:
Importance: The average importance rating for an occupation from a broad national survey.

The predictors:
* education: The average number of years of education for occupational incumbents.
* income: The average income of occupational incumbents, in dollars.
* women: The percentage of women in the occupation.
* census: The code of the occupation used in the survey (an ID variable).
* type: Professional and managerial(prof), white collar(wc), blue collar(bc), or other
* type: the factor version of type

What explains occupational importance as considered by society at large?

We are particularly interested in education, income, and the gender distribution.

Some Preliminaries [that do not reference the data at all]

In the following three questions, assume a sample of 101 people and that the probability that each person is a Pikachu impersonator is 0.15. The following problems involve a Binomial(n=101, p=0.15).

With 80% central probability, we should witness between _11__ and 20 Pikachu impersonators.

qbinom(c(0.1,0.9), 101, 0.15)

## [1] 11 20

The probability of more than 22 Pikachu impersonators is 0.0247125.

1-pbinom(22, 101, 0.15)

## [1] 0.02471246

The probability of less than 14 Pikachu impersonators is __pbinom(13, 101, 0.15)_.

pbinom(13, 101, 0.15)

## [1] 0.3324065

Suppose that deaths by horse kick arrive upon Prussian cavalry units at a rate of 0.7 per year. von Bortkiewicz’s classic example from my lecture slides; a Poisson(\(\lambda=0.7\))

What is the probability that a unit experiences no deaths by horse kick in a random year? **The probability of no horse kicks in a random year is 0.4965853.

ppois(0,0.7)

## [1] 0.4965853

What is the probability that a unit experiences at least one death by horse kick in a random year? One or more deaths by horse kick are the complement of the previous, e.g. 0.5034147

1-ppois(0,0.7)

## [1] 0.5034147

85% of the time, cavalry units experience no more than 2 death(s) by horse kick in a random year.

qpois(0.85,0.7)

## [1] 2

Suppose that French fish are caught at a rate of 12 per day.

On the median day, how many fish are caught? 12
What is the probability that 16 or more are caught? 0.1555843

Assume that daily beer sales at a British University pub follow a normal distribution with a mean of 435 liters and a standard deviation of 40 liters.

What is the probability of 500 or more liters sold in a day?

pnorm(500, 435, 40, lower.tail = FALSE)

## [1] 0.05208128

The middle 50% of days should see _____ to _____ liters sold.

qnorm(c(0.25,0.75), 435, 40)

## [1] 408.0204 461.9796

Find a 95% confidence interval for the average number of liters sold per day? \(435\)….

The true answer depends on the sample size. We normally use the t distribution but a t with infinite degrees of freedom is normal. The key thing to note is two things. First, the \(t\) distribution of the mean is defined on \(n-1\) degrees of freedom. And the standard error of the mean is \(\frac{s}{\sqrt{n}}\). Both parts depend on n. So let’s sample from a normal and calculate this; any example will do. The analytical solution is that it will be normal with \(\mu=435\) and \(\sigma=\frac{\sigma}{\sqrt{n}}\). The great irony of the whole thing is that, in the limit, any interval about the mean converges to a constant because \(\lim_{n\rightarrow \infty} \frac{s}{\sqrt{n}} = 0\) for any finite standard deviation \(s\).

t.test(rnorm(100, 435, 40))

## 
##  One Sample t-test
## 
## data:  rnorm(100, 435, 40)
## t = 113.84, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  425.0026 440.0815
## sample estimates:
## mean of x 
##   432.542

t.test(rnorm(1000, 435, 40))

## 
##  One Sample t-test
## 
## data:  rnorm(1000, 435, 40)
## t = 335.87, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  431.0425 436.1089
## sample estimates:
## mean of x 
##  433.5757

t.test(rnorm(10000, 435, 40))

## 
##  One Sample t-test
## 
## data:  rnorm(10000, 435, 40)
## t = 1085.1, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  433.9898 435.5606
## sample estimates:
## mean of x 
##  434.7752

With 95% probability, an F statistic defined on 2 and 77 degrees of freedom takes values less than 3.1154.

qf(0.95, 2, 77)

## [1] 3.115366

Occupational Importance

For context on the data, the variable names in Occupational.Importance are shown below.

names(Occupational.Importance)

## [1] "title"      "education"  "income"     "women"      "importance"
## [6] "census"     "type"

Getting a feel for the outcome. Graphically summarize importance.

library(ggplot2)
ggplot(data = Occupational.Importance) +
  aes(x = importance) +
  geom_density(adjust = 1, fill = "#0c4c8a") +
  theme_minimal()

A density plot shows the data to have a bit of right skew but symmetry is not out of the question. To preview a brief focus on types, I will show the boxplots.

ggplot(data = Occupational.Importance) +
  aes(x = type, y = importance, fill = type) +
  geom_boxplot() +
  theme_minimal()

OCI <- Occupational.Importance

Is importance symmetric?

library(car)

## Loading required package: carData

densityPlot(OCI$importance)

qqPlot(OCI$importance)

## [1] 24 21

# The qqplot shows some skewness.  Is it normal by the Shapiro test?
shapiro.test(OCI$importance)

## 
##  Shapiro-Wilk normality test
## 
## data:  OCI$importance
## W = 0.97198, p-value = 0.02875

# Nope.

It is fairly symmetric. The mean and median are somewhat close to one another though the mean is bigger than the median; it is not much bigger. You could argue for either.

Provide an appropriate numerical summary for importance.

library(tidyverse)
library(skimr)
OCI %>% skim(importance)

## Skim summary statistics
##  n obs: 102 
##  n variables: 7 
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##    variable missing complete   n  mean   sd   p0   p25  p50   p75 p100
##  importance       0      102 102 46.83 17.2 14.8 35.23 43.6 59.28 87.2
##      hist
##  ▂▅▇▆▅▃▃▁

What is/are the highest prestige occupation(s)?

OCI$title[OCI$importance==max(OCI$importance)]

## [1] "PHYSICIANS"

What is/are the least prestigious occupation(s)?

OCI$title[OCI$importance==min(OCI$importance)]

## [1] "NEWSBOYS"

t.test(OCI$importance)

## 
##  One Sample t-test
## 
## data:  OCI$importance
## t = 27.492, df = 101, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  43.45405 50.21261
## sample estimates:
## mean of x 
##  46.83333

With 95% confidence, average importance ranges from _43.4540517__ to 50.212615 .
What is the mean and standard deviation OF IMPORTANCE for each type of occupation?

OCI %>% group_by(type) %>% skim(importance) %>% filter(stat%in%c("mean","sd"))

## # A tibble: 8 x 6
##   variable   type    stat  level value formatted
##   <chr>      <chr>   <chr> <chr> <dbl> <chr>    
## 1 importance numeric mean  .all  35.5  35.53    
## 2 importance numeric sd    .all  10.0  10.02    
## 3 importance numeric mean  .all  34.7  34.73    
## 4 importance numeric sd    .all  17.7  17.68    
## 5 importance numeric mean  .all  67.8  67.85    
## 6 importance numeric sd    .all   8.68 8.68     
## 7 importance numeric mean  .all  42.2  42.24    
## 8 importance numeric sd    .all   9.52 9.52

Provide the mean and standard deviation of income, education, and women.

OCI %>% skim(income, education, women) %>% filter(stat%in%c("mean","sd"))

## # A tibble: 6 x 6
##   variable  type    stat  level   value formatted
##   <chr>     <chr>   <chr> <chr>   <dbl> <chr>    
## 1 income    numeric mean  .all  6798.   6797.9   
## 2 income    numeric sd    .all  4246.   4245.92  
## 3 education numeric mean  .all    10.7  10.74    
## 4 education numeric sd    .all     2.73 2.73     
## 5 women     numeric mean  .all    29.0  28.98    
## 6 women     numeric sd    .all    31.7  31.72

With 90% confidence, is there a difference [as opposed to no difference] in average education of incumbents between wc [white collar] and prof [professional] occupations?

WC <- OCI[OCI$type=="wc",]
Prof <- OCI[OCI$type=="prof",]

Are they different? TRUE

what is the 95% confidence interval for the difference?

t.test(WC$education,Prof$education)$conf.int

## [1] -3.696390 -2.428519
## attr(,"conf.level")
## [1] 0.95

With 95% confidence, is there a difference [as opposed to no difference] in average income of incumbents between wc [white collar] and prof [professional] occupations?

Are they different? TRUE

what is the 95% confidence interval for the difference?

t.test(WC$income,Prof$income)

## 
##  Welch Two Sample t-test
## 
## data:  WC$income and Prof$income
## t = -5.2202, df = 39.673, p-value = 5.977e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7639.884 -3374.411
## sample estimates:
## mean of x mean of y 
##  5052.304 10559.452

We will briefly examine the relationship between income and education.

Plot income as a function of education.

plot(x=OCI$education, y=OCI$income)
abline(lm(income~education, data=OCI), col="red")

Perform a regression of income as a function of education. Is there a linear relationship with 95% confidence?

summary(lm(income~education, data=OCI))

## 
## Call:
## lm(formula = income ~ education, data = OCI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5493.2 -2433.8   -41.9  1491.5 17713.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2853.6     1407.0  -2.028   0.0452 *  
## education      898.8      127.0   7.075 2.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3269 
## F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10

Yes. The probability of no relationship is basically zero and 0.334 as a proportion of the variance is explained.

How much is each additional year of education worth in income, with 95% confidence?

confint(lm(income~education, data=OCI))

##                  2.5 %     97.5 %
## (Intercept) -5645.1114  -62.05979
## education     646.7782 1150.84748

Each additional year of education is worth between 646.78 and 1150.85 dollars in income with 95% confidence.

Plot the residual from this regression as a function of education.

plot(x=OCI$education, y=residuals(lm(income~education, data=OCI)))

If you can, fit and compare a non-line to the line. Which fits better?

Line.educ <- lm(income~education, data=OCI)
Poly.educ <- lm(income~poly(education, 2, raw=TRUE), data=OCI)
anova(Line.educ,Poly.educ)

## Analysis of Variance Table
## 
## Model 1: income ~ education
## Model 2: income ~ poly(education, 2, raw = TRUE)
##   Res.Df        RSS Df Sum of Sq     F   Pr(>F)   
## 1    100 1213392025                               
## 2     99 1123363857  1  90028168 7.934 0.005857 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The curve fits better. Why?

library(effects)

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

plot(allEffects(Poly.educ, partial.residuals=TRUE))

It appears to have a rate of return that is reasonably flat until secondary school and the rate increases more dramatically thereafter. This is consistent with a screening/signaling theory of education.

We will now turn to the multiple regression portion of the inquiry.

Multiple Regression

The following set of questions utilize the multiple regression below. I have embedded the object as OIM and the syntax appears below.

OIM <- lm(importance~education+income+women+type, data=Occupational.Importance) 
Occupational.Importance$resids <- Occupational.Importance$importance -  predict(OIM, newdata=Occupational.Importance)
summary(OIM)

## 
## Call:
## lm(formula = importance ~ education + income + women + type, 
##     data = Occupational.Importance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9325  -4.7107   0.0764   5.4033  19.1735 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.6465650  5.4281200   0.303   0.7623    
## education    3.2965572  0.6601724   4.993 2.69e-06 ***
## income       0.0011606  0.0002737   4.241 5.17e-05 ***
## women        0.0045517  0.0309262   0.147   0.8833    
## typeother   -1.7242964  4.0468736  -0.426   0.6710    
## typeprof     7.4007262  4.0337748   1.835   0.0697 .  
## typewc      -1.8411949  2.7586817  -0.667   0.5061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.551 on 95 degrees of freedom
## Multiple R-squared:  0.8188, Adjusted R-squared:  0.8074 
## F-statistic: 71.56 on 6 and 95 DF,  p-value: < 2.2e-16

OCI <- Occupational.Importance

Provide an equation for the above regression point estimates including metrics for the relevant slopes.

Importance =  1.65 [blue collar] + 3.297 per year of education + 0.001 per CA dollar of income + 0.005 per percent women + 7.401[if Professional] - 1.84[if White Collar] + \epsilon

The education slope measures change in importance per additional year of education. The income slope measures change in importance per dollar of income.
The women slope measures changes in importance per additional percent of women in the job. The three slopes for type measure the difference between importance for Professional, Other, and White Collar types [vis a vis Blue Collar].

What factors are associated with importance with 95% confidence? Education and income and !type!

A better answer actually thinks about how much is explained by the type, taken as a whole. The anova table of the regression model gives us some information.

anova(OIM)

## Analysis of Variance Table
## 
## Response: importance
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## education  1 21608.4 21608.4 378.9917 < 2.2e-16 ***
## income     1  2248.1  2248.1  39.4303 1.015e-08 ***
## women      1     5.3     5.3   0.0926   0.76154    
## type       3   617.1   205.7   3.6077   0.01616 *  
## Residuals 95  5416.5    57.0                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the type matters in the sense that it accounts variation. You have some indication that this is true from the pivot table or other summary of importance by type.

What factors are unassociated with importance with 95% confidence? women

From the t-tests in the table this is true. Though the analysis of variance makes this question a bit more complicated.

What is the metric of the residual standard error? What does it measure and what that we have already measured might we compare it to?

The metric of the residual standard error is percentage points in importance. It measures the degrees of freedom averaged standard deviation of the residual – the distance between the fitted and actual values. We can compare it to the standard deviation of the data to see how much error is reduced in explaining importance with these factors. Note, the original standard error of importance was 17.2044856.

How much variation is explained?

R-squared. 0.8188191 The proportion of variance.

The F statistic answers the question, does the model explain more than random variation. What is the p-value for this statistic from the above table? 0
Let’s turn to education. As the number of years of education that an incumbent in the job increases by one year, importance/prestige increases by 3.29 points. With 95% confidence, this effect ranges from 1.986 to 4.607.
What occupation has the most prestige/importance compared to the expected/predicted/fitted value?

MEDICAL.TECHNICIANS

What occupation has the least actual prestige/importance compared to the expected/predicted/fitted value?

NEWSBOYS

What is interquartile range of the difference between the actual and fitted/predicted values?

quantile(OCI$resid, probs=c(0.25,0.75), na.rm=TRUE)

##       25%       75% 
## -4.710664  5.403303

IQR(OCI$resids, na.rm=TRUE)

## [1] 10.11397

The 1st quartile is -4.71 and the 3rd quartile is 5.40. So the IQR is 5.40–4.71=10.11.

Isolate the residuals from the regression. Are they normal? Provide at least two pieces of evidence and interpret them.

library(car)
densityPlot(OIM$residuals)

It is plausible.

qqPlot(OIM$residuals)

## [1] 31 53

shapiro.test(OIM$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  OIM$residuals
## W = 0.98929, p-value = 0.5932

We cannot rule our normal; the p-value of the Shapiro test is over 0.59.

Assume the residuals are normal. What is the probability that we can predict the average importance/prestige of an occupation to within plus or minus 10 rating points.

pnorm(10, 0, 7.551) - pnorm(-10, 0, 7.551)

## [1] 0.8146058

The probability of a prediction within 10 rating points is 0.8146058.

Predict the average prestige/importance for a general manager with 95% confidence.

GM <- Occupational.Importance[Occupational.Importance$title=="GENERAL.MANAGERS",]
predict(OIM, newdata=GM, interval="confidence")

##        fit      lwr      upr
## 2 79.51734 70.34393 88.69076

The 95% confidence interval gives average importance for a general manager to be between 70.34 and 88.69.

Predict the range of possible prestige/importance for a general manager with 95% confidence.

predict(OIM, newdata=GM, interval="prediction")

##        fit      lwr      upr
## 2 79.51734 61.94286 97.09183

The 95% prediction interval gives the values of importance for a general manager to be between 61.94 and 97.09 with the aforementioned probability.

Simplify the regression model deploying a stepwise technique and provide a summary of the final regression model.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

stepAIC(OIM)

## Start:  AIC=419.17
## importance ~ education + income + women + type
## 
##             Df Sum of Sq    RSS    AIC
## - women      1      1.24 5417.7 417.19
## <none>                   5416.5 419.17
## - type       3    617.09 6033.6 424.17
## - income     1   1025.40 6441.9 434.85
## - education  1   1421.67 6838.2 440.94
## 
## Step:  AIC=417.19
## importance ~ education + income + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   5417.7 417.19
## - type       3    621.13 6038.9 422.26
## - income     1   1378.12 6795.8 438.31
## - education  1   1443.36 6861.1 439.28

## 
## Call:
## lm(formula = importance ~ education + income + type, data = Occupational.Importance)
## 
## Coefficients:
## (Intercept)    education       income    typeother     typeprof  
##    1.769393     3.305973     0.001139    -1.732226     7.487737  
##      typewc  
##   -1.719057

Step.Res <- stepAIC(OIM)

## Start:  AIC=419.17
## importance ~ education + income + women + type
## 
##             Df Sum of Sq    RSS    AIC
## - women      1      1.24 5417.7 417.19
## <none>                   5416.5 419.17
## - type       3    617.09 6033.6 424.17
## - income     1   1025.40 6441.9 434.85
## - education  1   1421.67 6838.2 440.94
## 
## Step:  AIC=417.19
## importance ~ education + income + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   5417.7 417.19
## - type       3    621.13 6038.9 422.26
## - income     1   1378.12 6795.8 438.31
## - education  1   1443.36 6861.1 439.28

The final model contains type, education, and income.

Fully describe the evidence in the table of your final chosen regression model. What factors matter and what do not and what does each part describe.

ModA <- lm(importance~education+income+type, data=OCI)
summary(ModA)

## 
## Call:
## lm(formula = importance ~ education + income + type, data = OCI)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0864  -4.8662   0.1436   5.3524  19.2652 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.7693934  5.3361820   0.332   0.7409    
## education    3.3059733  0.6537085   5.057 2.04e-06 ***
## income       0.0011392  0.0002305   4.942 3.28e-06 ***
## typeother   -1.7322264  4.0258430  -0.430   0.6680    
## typeprof     7.4877370  3.9698331   1.886   0.0623 .  
## typewc      -1.7190573  2.6174653  -0.657   0.5129    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.512 on 96 degrees of freedom
## Multiple R-squared:  0.8188, Adjusted R-squared:  0.8093 
## F-statistic: 86.75 on 5 and 96 DF,  p-value: < 2.2e-16

anova(ModA)

## Analysis of Variance Table
## 
## Response: importance
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## education  1 21608.4 21608.4 382.8938 < 2.2e-16 ***
## income     1  2248.1  2248.1  39.8362 8.515e-09 ***
## type       3   621.1   207.0   3.6688   0.01494 *  
## Residuals 96  5417.7    56.4                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Everything matters. Education seems most important with income behind that and type still important but not on the order of the others.

Given the evidence provided here, is the percentage of women an important determinant of occupational importance/prestige in Canada?

It does not appear so. But we should investigate a bit. Looking at correlations among the variables, the relationships are the smallest. In the following scatterplots, the third row shows women on the y axis. The first column, education shows nothing obvious. The second column is negatively correlated and far more strongly than the previous. Not much going on. What if it depends on the type?

Occupational.Importance[,c("education","income","women")] %>% cor()

##            education     income       women
## education 1.00000000  0.5775802  0.06185286
## income    0.57758023  1.0000000 -0.44105927
## women     0.06185286 -0.4410593  1.00000000

Occupational.Importance[,c("education","income","women")] %>% plot()

library(ggplot2)
ggplot(data = Occupational.Importance) +
  aes(x = education, y = women) +
  geom_point(color = "#0c4c8a") +
  geom_smooth(span = 2) +
  theme_minimal() +
  facet_wrap(vars(type))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Above shows education.

ggplot(data = Occupational.Importance) +
  aes(x = income, y = women) +
  geom_point(color = "#0c4c8a") +
  geom_smooth(span = 2) +
  theme_minimal() +
  facet_wrap(vars(type))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Not much there.

Plot the effect of income and plot income against the residuals. Does a line seem sufficient?

StepR <- lm(formula = importance ~ education + income + type, data = Occupational.Importance)
plot(effect("income",StepR, partial.residuals=TRUE))

StepR2 <- lm(formula = importance ~ education + poly(income, 2, raw=TRUE) + type, data = Occupational.Importance)
anova(StepR,StepR2)

## Analysis of Variance Table
## 
## Model 1: importance ~ education + income + type
## Model 2: importance ~ education + poly(income, 2, raw = TRUE) + type
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     96 5417.7                                  
## 2     95 4771.1  1     646.6 12.875 0.0005288 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A line is not quite sufficient.

Summarize your findings to answer the question of what explains and how well does it explain occupational prestige/importance?

summary(StepR2)

## 
## Call:
## lm(formula = importance ~ education + poly(income, 2, raw = TRUE) + 
##     type, data = Occupational.Importance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6728  -3.9463   0.3106   4.2692  19.8789 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -1.310e+00  5.107e+00  -0.256 0.798161    
## education                     2.714e+00  6.384e-01   4.251 4.97e-05 ***
## poly(income, 2, raw = TRUE)1  3.112e-03  5.913e-04   5.263 8.76e-07 ***
## poly(income, 2, raw = TRUE)2 -7.851e-08  2.188e-08  -3.588 0.000529 ***
## typeother                     1.883e+00  3.929e+00   0.479 0.632800    
## typeprof                      9.058e+00  3.770e+00   2.402 0.018228 *  
## typewc                        2.045e-01  2.527e+00   0.081 0.935678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.087 on 95 degrees of freedom
## Multiple R-squared:  0.8404, Adjusted R-squared:  0.8303 
## F-statistic: 83.38 on 6 and 95 DF,  p-value: < 2.2e-16

Incorporating a curve gives me 84.86 percent of variation explained. All other things equal, professional is highest importance, then white and blue collar below; importance increases with education and increases at a decreasing rate with income.

shapiro.test(StepR2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  StepR2$residuals
## W = 0.9893, p-value = 0.594

What does the model say for a hypothetical professional with income 18000 and education 13; the 95% confidence interval for the average is:

Hypo.D <- data.frame(income=18000, education=13, type="prof")
predict(StepR2, newdata=Hypo.D, interval="confidence")

##        fit      lwr      upr
## 1 73.61446 68.99409 78.23483

I want to conclude with what you can do with esquisse. Just visualizing this can tell almost all of the story in one picture. Or two, depending on how you wish. The first one is a scatterplot with a third dimension determied by size and color determined by the type of profession. It confirms the basic contours of our investigation before. Importance is increasing with both education and income; income is itself, in part, a function of education [that we cannot really show here]. There is also an apparent general grouping by type though it is embedded in income and education.

Occupational.Importance$SR2FV <- round(predict(StepR2, newdata=Occupational.Importance), digits=2) 
Occupational.Importance$SR2Resid <- Occupational.Importance$importance - Occupational.Importance$SR2FV
ggplot(data = Occupational.Importance) +
  aes(x = education, y = income, color = type, size = importance) +
  geom_point() +
  theme_minimal()

It is also apparent that this seems to hold even in the presence of dividing up the data to only examining what is going on within types of profession, but that profession types are partially defined by income and education and have limited variation within them excepting professionals.

ggplot(data = Occupational.Importance) +
  aes(x = education, y = income, size = importance, color=type) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(type))

With the previous two plots in mind, I will close out the investigation with two plots that use plotly. The first shows education and income with color and size by importance with a hover on the residual from the final model that incorporate more than a line for the relationship between income and imporance.

library(plotly)
Occupational.Importance <- Occupational.Importance %>% mutate(labsM = paste("Importance: ",importance,"<br> Fit:", SR2FV, "<br>", title)) 
ggplot(Occupational.Importance, aes(x=education, y=income, text=labsM,  size=SR2FV, color=type)) + geom_point() + labs(title="Occupational Importance:", caption = "Sized by Final Model Fitted Value", size="Fitted") + theme_minimal() -> p
ggplotly(p, tooltip = "text")

and

p <- plot_ly(Occupational.Importance, x=~education, y=~income, text = ~labsM, size=~SR2Resid, color=~type) %>% layout(title="Occupational Importance: Sized by Final Model Residual")
p

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

Another tack at the same issue but with everything in a graphic.

ggplot(Occupational.Importance) + aes(x=education, y=income, label=type, group=type,  size=importance, color=type, name=title) + geom_density2d(size=0.05) + geom_text() + labs(color="Women Pct") + scale_color_viridis_d() -> Plotly.Me
ggplotly(Plotly.Me)

Occupational Importance

Your Name Here

12/01/2019