Dataset from R

There are over 1000 dataset available in R. We can find some of the dataset in the following link: https://vincentarelbundock.github.io/Rdatasets/datasets.html

For this week’s discussion, the dataset “Prestige” will be used to build a regression.

Prestige of Canadian Occupations

library(car)
## Loading required package: carData
?Prestige

Description

The Prestige data frame has 102 rows and 6 columns. The observations are occupations.

Usage

Prestige

Format

This data frame contains the following columns:

Column Description
education Average education of occupational incumbents, years, in 1971.
income Average income of incumbents, dollars, in 1971.
women Percentage of incumbents who are women.
prestige Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census Canadian Census occupational code.
type Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.

Source

Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].

Personal communication from B. Blishen, W. Carroll, and C. Moore, Departments of Sociology, York University and University of Victoria.

References

Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

Datatable

library(DT)
datatable(Prestige, options = list(scrollX = TRUE))

Packages

library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:car':
## 
##     densityPlot
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend

Correlation

suppressWarnings({
chart.Correlation(Prestige[1:5], histogram = TRUE, method = "pearson") })

Linear Model 1

m1 <- lm(income ~ education, data = Prestige) 
summary(m1)
## 
## Call:
## lm(formula = income ~ education, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5493.2 -2433.8   -41.9  1491.5 17713.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2853.6     1407.0  -2.028   0.0452 *  
## education      898.8      127.0   7.075 2.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3269 
## F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10

Model Diagnostics

Linearity and Constant Variability: The points are distributed around 0 in no apparent pattern. Both linearity and constant variability appear to be met.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")

Normality of Residuals: This assumption is not met as the histogram show a positive skewed distribution

ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 1000) + xlab("Residuals")

ggplot(data = m1, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")

Linear Model 2

m2 <- lm(income ~ prestige, data = Prestige) 
summary(m2)
## 
## Call:
## lm(formula = income ~ prestige, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6693.3 -1434.9   136.5  1251.8 15152.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1465.03     860.47  -1.703   0.0918 .  
## prestige      176.43      17.26  10.224   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared:  0.5111, Adjusted R-squared:  0.5062 
## F-statistic: 104.5 on 1 and 100 DF,  p-value: < 2.2e-16

Model Diagnostics

Linearity and Constant Variability: The points are distributed around 0 in no apparent pattern. Both linearity and constant variability appear to be met.

ggplot(data = m2, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")

Normality of Residuals: This assumption is not met as the histogram show a positive skewed distribution

ggplot(data = m2, aes(x = .resid)) + geom_histogram(binwidth = 1000) + xlab("Residuals")

ggplot(data = m2, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")

Model 3

m3 <- lm(income ~ ., data = Prestige) 
summary(m3)
## 
## Call:
## lm(formula = income ~ ., data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7752.4  -954.6  -331.2   742.6 14301.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.32053 3037.27048   0.002  0.99808    
## education    131.18372  288.74961   0.454  0.65068    
## women        -53.23480    9.83107  -5.415 4.96e-07 ***
## prestige     139.20912   36.40239   3.824  0.00024 ***
## census         0.04209    0.23568   0.179  0.85865    
## typeprof     509.15150 1798.87914   0.283  0.77779    
## typewc       347.99010 1173.89384   0.296  0.76757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2633 on 91 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.6363, Adjusted R-squared:  0.6123 
## F-statistic: 26.54 on 6 and 91 DF,  p-value: < 2.2e-16

Model Diagnostics

Based on the following graphs, not all assumptions are met. Thus, transformation may be needed to find the best fit model.

ggplot(data = m3, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m3, aes(x = .resid)) + geom_histogram(binwidth = 1500) + xlab("Residuals")

ggplot(data = m3, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")