There are over 1000 dataset available in R. We can find some of the dataset in the following link: https://vincentarelbundock.github.io/Rdatasets/datasets.html
For this week’s discussion, the dataset “Prestige” will be used to build a regression.
library(car)
## Loading required package: carData
?Prestige
The Prestige data frame has 102 rows and 6 columns. The observations are occupations.
Prestige
This data frame contains the following columns:
Column | Description |
---|---|
education | Average education of occupational incumbents, years, in 1971. |
income | Average income of incumbents, dollars, in 1971. |
women | Percentage of incumbents who are women. |
prestige | Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s. |
census | Canadian Census occupational code. |
type | Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar. |
Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].
Personal communication from B. Blishen, W. Carroll, and C. Moore, Departments of Sociology, York University and University of Victoria.
Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.
Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
library(DT)
datatable(Prestige, options = list(scrollX = TRUE))
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
## The following object is masked from 'package:car':
##
## densityPlot
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
suppressWarnings({
chart.Correlation(Prestige[1:5], histogram = TRUE, method = "pearson") })
m1 <- lm(income ~ education, data = Prestige)
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5493.2 -2433.8 -41.9 1491.5 17713.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.6 1407.0 -2.028 0.0452 *
## education 898.8 127.0 7.075 2.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
## F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
Linearity and Constant Variability: The points are distributed around 0 in no apparent pattern. Both linearity and constant variability appear to be met.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
ylab("Residuals")
Normality of Residuals: This assumption is not met as the histogram show
a positive skewed distribution
ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 1000) + xlab("Residuals")
ggplot(data = m1, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")
m2 <- lm(income ~ prestige, data = Prestige)
summary(m2)
##
## Call:
## lm(formula = income ~ prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6693.3 -1434.9 136.5 1251.8 15152.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1465.03 860.47 -1.703 0.0918 .
## prestige 176.43 17.26 10.224 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared: 0.5111, Adjusted R-squared: 0.5062
## F-statistic: 104.5 on 1 and 100 DF, p-value: < 2.2e-16
Linearity and Constant Variability: The points are distributed around 0 in no apparent pattern. Both linearity and constant variability appear to be met.
ggplot(data = m2, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
ylab("Residuals")
Normality of Residuals: This assumption is not met as the histogram show
a positive skewed distribution
ggplot(data = m2, aes(x = .resid)) + geom_histogram(binwidth = 1000) + xlab("Residuals")
ggplot(data = m2, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")
m3 <- lm(income ~ ., data = Prestige)
summary(m3)
##
## Call:
## lm(formula = income ~ ., data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7752.4 -954.6 -331.2 742.6 14301.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.32053 3037.27048 0.002 0.99808
## education 131.18372 288.74961 0.454 0.65068
## women -53.23480 9.83107 -5.415 4.96e-07 ***
## prestige 139.20912 36.40239 3.824 0.00024 ***
## census 0.04209 0.23568 0.179 0.85865
## typeprof 509.15150 1798.87914 0.283 0.77779
## typewc 347.99010 1173.89384 0.296 0.76757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2633 on 91 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.6363, Adjusted R-squared: 0.6123
## F-statistic: 26.54 on 6 and 91 DF, p-value: < 2.2e-16
Based on the following graphs, not all assumptions are met. Thus, transformation may be needed to find the best fit model.
ggplot(data = m3, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
ylab("Residuals")
ggplot(data = m3, aes(x = .resid)) + geom_histogram(binwidth = 1500) + xlab("Residuals")
ggplot(data = m3, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")