Dataset from R

There are over 1000 dataset available in R. We can find some of the dataset in the following link: https://vincentarelbundock.github.io/Rdatasets/datasets.html

For this week’s discussion, the dataset “Prestige” will be used to build a regression.

Prestige of Canadian Occupations

library(car)

## Loading required package: carData

?Prestige

Description

The Prestige data frame has 102 rows and 6 columns. The observations are occupations.

Usage

Prestige

Format

This data frame contains the following columns:

Column	Description
education	Average education of occupational incumbents, years, in 1971.
income	Average income of incumbents, dollars, in 1971.
women	Percentage of incumbents who are women.
prestige	Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census	Canadian Census occupational code.
type	Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.

Source

Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].

Personal communication from B. Blishen, W. Carroll, and C. Moore, Departments of Sociology, York University and University of Victoria.

References

Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

Datatable

library(DT)
datatable(Prestige, options = list(scrollX = TRUE))

Packages

library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:car':
## 
##     densityPlot

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

library(PerformanceAnalytics)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

Correlation

suppressWarnings({
chart.Correlation(Prestige[1:5], histogram = TRUE, method = "pearson") })

Linear Model 1

m1 <- lm(income ~ education, data = Prestige) 
summary(m1)

## 
## Call:
## lm(formula = income ~ education, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5493.2 -2433.8   -41.9  1491.5 17713.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2853.6     1407.0  -2.028   0.0452 *  
## education      898.8      127.0   7.075 2.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3269 
## F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10

Model Diagnostics

Linearity and Constant Variability: The points are distributed around 0 in no apparent pattern. Both linearity and constant variability appear to be met.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")

Normality of Residuals: This assumption is not met as the histogram show a positive skewed distribution

ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 1000) + xlab("Residuals")

ggplot(data = m1, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")

Linear Model 2

m2 <- lm(income ~ prestige, data = Prestige) 
summary(m2)

## 
## Call:
## lm(formula = income ~ prestige, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6693.3 -1434.9   136.5  1251.8 15152.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1465.03     860.47  -1.703   0.0918 .  
## prestige      176.43      17.26  10.224   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared:  0.5111, Adjusted R-squared:  0.5062 
## F-statistic: 104.5 on 1 and 100 DF,  p-value: < 2.2e-16

Model Diagnostics

Linearity and Constant Variability: The points are distributed around 0 in no apparent pattern. Both linearity and constant variability appear to be met.

ggplot(data = m2, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")

Normality of Residuals: This assumption is not met as the histogram show a positive skewed distribution

ggplot(data = m2, aes(x = .resid)) + geom_histogram(binwidth = 1000) + xlab("Residuals")

ggplot(data = m2, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")

Model 3

m3 <- lm(income ~ ., data = Prestige) 
summary(m3)

## 
## Call:
## lm(formula = income ~ ., data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7752.4  -954.6  -331.2   742.6 14301.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.32053 3037.27048   0.002  0.99808    
## education    131.18372  288.74961   0.454  0.65068    
## women        -53.23480    9.83107  -5.415 4.96e-07 ***
## prestige     139.20912   36.40239   3.824  0.00024 ***
## census         0.04209    0.23568   0.179  0.85865    
## typeprof     509.15150 1798.87914   0.283  0.77779    
## typewc       347.99010 1173.89384   0.296  0.76757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2633 on 91 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.6363, Adjusted R-squared:  0.6123 
## F-statistic: 26.54 on 6 and 91 DF,  p-value: < 2.2e-16

Model Diagnostics

Based on the following graphs, not all assumptions are met. Thus, transformation may be needed to find the best fit model.

ggplot(data = m3, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m3, aes(x = .resid)) + geom_histogram(binwidth = 1500) + xlab("Residuals")

ggplot(data = m3, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")

DATA 605 Discussion 11

Susanna Wong

2023-10-31

Dataset from R

Prestige of Canadian Occupations

Description

Usage

Format

Source

References

Datatable

Packages

Correlation

Linear Model 1

Model Diagnostics

Linear Model 2

Model Diagnostics

Model 3

Model Diagnostics