Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The dataset used in this exercise is called “Prestige” and is included in the car library. It has 102 rows and 6 columns. Each row is an observation for one occupation. The columns are variables such as years of education, income, percentage of women in the occupation, and prestige of the occupation.
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
## [1] "Number of Columns of the data: 6"
## [1] "Number of Rows of the data: 102"
## [1] "Columns of the data: education" "Columns of the data: income"
## [3] "Columns of the data: women" "Columns of the data: prestige"
## [5] "Columns of the data: census" "Columns of the data: type"
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
## Observations: 102
## Variables: 6
## $ education <dbl> 13.11, 12.26, 12.77, 11.42, 14.62, 15.64, 15.09, 15.44, 1...
## $ income <int> 12351, 25879, 9271, 8865, 8403, 11030, 8258, 14163, 11377...
## $ women <dbl> 11.16, 4.02, 15.70, 9.11, 11.68, 5.13, 25.65, 2.69, 1.03,...
## $ prestige <dbl> 68.8, 69.1, 63.4, 56.8, 73.5, 77.6, 72.6, 78.1, 73.1, 68....
## $ census <int> 1113, 1130, 1171, 1175, 2111, 2113, 2133, 2141, 2143, 215...
## $ type <fct> prof, prof, prof, prof, prof, prof, prof, prof, prof, pro...
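The calls behind the listings above are not shown, so here is a minimal sketch of base R calls that produce this kind of output. It assumes nothing beyond base R: `prestige_head` is a stand-in built from the six rows printed above, and with the car library loaded the same calls should work on the full `Prestige` data frame. The last listing looks like output from `glimpse()` in dplyr, which is left out here to keep the sketch to base R.

```r
# Stand-in data: the six rows shown in the head() listing above.
# With library(car) loaded, `Prestige` can be used in place of `prestige_head`.
prestige_head <- data.frame(
  education = c(13.11, 12.26, 12.77, 11.42, 14.62, 15.64),
  income    = c(12351L, 25879L, 9271L, 8865L, 8403L, 11030L),
  women     = c(11.16, 4.02, 15.70, 9.11, 11.68, 5.13),
  prestige  = c(68.8, 69.1, 63.4, 56.8, 73.5, 77.6),
  census    = c(1113L, 1130L, 1171L, 1175L, 2111L, 2113L),
  type      = factor(rep("prof", 6), levels = c("bc", "prof", "wc")),
  row.names = c("gov.administrators", "general.managers", "accountants",
                "purchasing.officers", "chemists", "physicists")
)

summary(prestige_head)                                   # per-column summary statistics
print(paste0("Number of Columns of the data: ", ncol(prestige_head)))
print(paste0("Number of Rows of the data: ", nrow(prestige_head)))
print(paste0("Columns of the data: ", names(prestige_head)))
head(prestige_head)                                      # first rows
str(prestige_head)                                       # column types
```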
Checking for missing values in the dataset (percentage of missing values per column):
## education income women prestige census type
## 0.000000 0.000000 0.000000 0.000000 0.000000 3.921569
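The percentages above can be reproduced with a one-line helper. The sketch below is an illustration rather than the code used for this report: `percent_missing` is a hypothetical helper name, and only the `type` column is rebuilt, using the level counts from the summary output (44 bc, 31 prof, 23 wc, 4 NA).

```r
# Hypothetical helper: percentage of missing values in each column
percent_missing <- function(df) {
  colMeans(is.na(df)) * 100
}

# Rebuild only the `type` column from the counts in the summary output
type_only <- data.frame(
  type = factor(c(rep("bc", 44), rep("prof", 31), rep("wc", 23), rep(NA, 4)))
)

percent_missing(type_only)   # 4 missing values out of 102 rows
```

Since 4 / 102 * 100 is about 3.92, the result agrees with the value reported for `type`.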
## education income women prestige census type
## 1 6.38 2847 90.67 28.2 8563 bc
## 2 6.60 5959 0.52 36.2 8782 bc
## 3 6.67 4696 0.00 27.3 8715 bc
## 4 6.69 4443 31.36 33.3 8267 bc
## 5 6.74 3485 39.48 28.8 8278 bc
## 6 6.84 3643 3.60 44.1 7112 <NA>
## 7 6.92 5299 0.56 38.9 8781 bc
## 8 7.11 3472 33.57 17.3 6191 bc
## 9 7.33 3000 69.31 20.8 6162 bc
## 10 7.42 1890 72.24 23.2 8221 bc
## education income women prestige census type
## 93 15.08 8034 46.80 66.1 2733 prof
## 94 15.09 8258 25.65 72.6 2133 prof
## 95 15.21 10432 24.71 69.3 3151 prof
## 96 15.22 9593 34.89 58.3 2391 prof
## 97 15.44 14163 2.69 78.1 2141 prof
## 98 15.64 11030 5.13 77.6 2113 prof
## 99 15.77 19263 5.13 82.3 2343 prof
## 100 15.94 14558 4.32 66.7 3115 prof
## 101 15.96 25308 10.56 87.2 3111 prof
## 102 15.97 12480 19.59 84.6 2711 prof
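The plotting code in this report uses data frames named `Prestige_Education` and `Prestige_Income`, but their definitions are not shown. One way to build such sorted copies is sketched below. The helper `sort_by_column` is a hypothetical name, and `sample_rows` holds six education and income pairs copied from the head of the data, so this shows the idea rather than the exact code used here.

```r
# Hypothetical helper: sort a data frame by one column and renumber the rows
sort_by_column <- function(df, column) {
  sorted <- df[order(df[[column]]), , drop = FALSE]
  rownames(sorted) <- NULL   # rows are numbered 1..n, as in the listings
  sorted
}

# Six (education, income) pairs copied from the head() output of the data
sample_rows <- data.frame(
  education = c(13.11, 12.26, 12.77, 11.42, 14.62, 15.64),
  income    = c(12351, 25879, 9271, 8865, 8403, 11030)
)

by_education <- sort_by_column(sample_rows, "education")
by_income    <- sort_by_column(sample_rows, "income")

head(by_education, 3)   # lowest education first
tail(by_income, 3)      # highest income last
```

On the full data, `sort_by_column(Prestige, "education")` would play the role of `Prestige_Education`. Note that renumbering the rows drops the occupation names.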
# Histogram of years of education (after_stat() needs ggplot2 >= 3.3.0)
ggplot(data = Prestige_Education, aes(x = education)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 0.25) +
  scale_fill_gradient("Count", low = "green", high = "brown") +
  labs(title = "Histogram - Years of Education", x = "Education", y = "Count")
## education income women prestige census type
## 1 9.46 611 96.53 25.9 6147 <NA>
## 2 9.62 918 7.00 14.8 5143 <NA>
## 3 8.60 1656 27.75 21.5 7182 bc
## 4 7.42 1890 72.24 23.2 8221 bc
## 5 9.93 2370 3.69 23.3 5145 bc
## 6 10.64 2448 91.76 42.3 4133 wc
## 7 10.05 2594 67.82 26.5 5137 wc
## 8 6.38 2847 90.67 28.2 8563 bc
## 9 11.04 2901 92.86 38.7 4171 wc
## 10 7.33 3000 69.31 20.8 6162 bc
## education income women prestige census type
## 93 14.52 11377 1.03 73.1 2143 prof
## 94 13.11 12351 11.16 68.8 1113 prof
## 95 15.97 12480 19.59 84.6 2711 prof
## 96 12.27 14032 0.58 66.1 9111 prof
## 97 15.44 14163 2.69 78.1 2141 prof
## 98 15.94 14558 4.32 66.7 3115 prof
## 99 14.71 17498 6.91 68.4 3117 prof
## 100 15.77 19263 5.13 82.3 2343 prof
## 101 15.96 25308 10.56 87.2 3111 prof
## 102 12.26 25879 4.02 69.1 1130 prof
# Histogram of income (bins = 30 is the geom_histogram default, made explicit)
ggplot(data = Prestige_Income, aes(x = income)) +
  geom_histogram(aes(fill = after_stat(count)), bins = 30) +
  scale_fill_gradient("Count", low = "blue", high = "purple") +
  labs(title = "Histogram - Income", x = "Income", y = "Count")
## [1] 0.5775802
There is a moderate positive linear relationship between education and income (a correlation of about 0.58).
# Plot the spread
plot(x = Prestige$education, y = Prestige$income, main = "Prestige Data", xlab = "Education", ylab = "Income")
# Density plots of both variables, side by side
par(mfrow = c(1, 2))
plot(density(Prestige$education), main = "Density Plot: Education", ylab = "Density")
plot(density(Prestige$income), main = "Density Plot: Income", ylab = "Density")
The density plot for education looks approximately normal, whereas the plot for income is skewed to the right.
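The skew can also be checked roughly from the summary statistics: when a variable is skewed to the right, its mean usually sits above its median. The sketch below uses the means and medians from the summary output near the top of this report. It is only a heuristic, not a formal skewness test, and `relative_gap` is a name made up for this illustration.

```r
# Mean and median copied from the summary() output of the data
center <- data.frame(
  variable = c("education", "income"),
  mean     = c(10.738, 6798),
  median   = c(10.540, 5930)
)

# Relative gap between mean and median; a clearly positive gap hints at right skew
center$relative_gap <- (center$mean - center$median) / center$median
print(center)
```

The gap for income is several times larger than the gap for education, which agrees with what the density plots show.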
In this section we calculate the correlation between education and income and fit a linear regression model to see whether there is a linear relationship between the two variables.
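# Compute and print the Pearson correlation between education and income
findCorrelation <- function() {
  x <- Prestige$education
  y <- Prestige$income
  corr <- round(cor(x, y), 4)
  print(paste0("Correlation = ", corr))
  return(corr)
}
# Stored as `corr` rather than `c` to avoid masking base::c()
corr <- findCorrelation()
## [1] "Correlation = 0.5776"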
# Fit income on education, print the model summary, and return the model
findStatsFunction <- function() {
  m <- lm(Prestige$income ~ Prestige$education, data = Prestige)
  print(summary(m))
  return(m)
}
m <- findStatsFunction()
##
## Call:
## lm(formula = Prestige$income ~ Prestige$education, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5493.2 -2433.8 -41.9 1491.5 17713.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.6 1407.0 -2.028 0.0452 *
## Prestige$education 898.8 127.0 7.075 2.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
## F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
income = -2853.5856 + 898.8128 * education
| Linear Regression Equation | Correlation Coefficient | \(R^2\) |
|---|---|---|
| Income = -2853.5856 + (898.8128 * Education) | 0.5776 | 0.3336 |
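To make the equation concrete, the sketch below plugs a value into it and checks that \(R^2\) matches the squared correlation, which holds for a regression with a single predictor. The coefficients and the correlation are copied from the output above. The helper name `predict_income` and the input of 12 years of education are chosen only for illustration.

```r
# Coefficients and correlation copied from the model summary and the table above
intercept <- -2853.5856
slope     <- 898.8128
r         <- 0.5776

# Hypothetical helper: predicted income for a given number of years of education
predict_income <- function(education) {
  intercept + slope * education
}

predict_income(12)                        # 12 years is an arbitrary example input
predict_income(13) - predict_income(12)   # one extra year changes the prediction by the slope

# With one predictor, R^2 equals the squared correlation
round(r^2, 4)
```

For 12 years of education the equation gives an income of about 7932, each additional year of education adds about 899, and the squared correlation rounds to 0.3336, the same as the reported \(R^2\).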
The correlation between education and income is 0.5776, and the \(R^2\) of 0.3336 means that education explains about a third of the variance in income. In the residual plot, the spread of the residuals stays roughly constant as we move to the right. Apart from a few outliers (the largest residual is 17713.1), the residuals are scattered evenly above and below zero. Overall, the plot suggests that education is a reasonable linear predictor of income. Therefore, we conclude that the linear model is appropriate in this case.
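The code for the residual plot is not shown, so here is a minimal sketch of how the two usual diagnostics (residuals against fitted values, and a normal Q-Q plot) could be drawn in base R. It is an assumption about the approach, not the exact code used for this report. `plot_residual_diagnostics` is a hypothetical helper, and `sample_rows` holds only six education and income pairs copied from the head of the data, so its numbers will differ from the full model.

```r
# Six (education, income) pairs copied from the head() output of the data;
# with the car library loaded, use `Prestige` in place of `sample_rows`.
sample_rows <- data.frame(
  education = c(13.11, 12.26, 12.77, 11.42, 14.62, 15.64),
  income    = c(12351, 25879, 9271, 8865, 8403, 11030)
)

fit <- lm(income ~ education, data = sample_rows)

# Hypothetical helper: draw the two standard residual diagnostics side by side
plot_residual_diagnostics <- function(model) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))                      # restore the plotting layout afterwards
  plot(fitted(model), resid(model),
       main = "Residuals vs Fitted", xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)                     # reference line at zero
  qqnorm(resid(model), main = "Normal Q-Q Plot of Residuals")
  qqline(resid(model))
  invisible(model)
}

# Numeric companion to the plots: residuals of a least-squares fit average to zero
residual_check <- c(mean   = mean(resid(fit)),
                    median = median(resid(fit)),
                    min    = min(resid(fit)),
                    max    = max(resid(fit)))
print(round(residual_check, 1))
```

Calling `plot_residual_diagnostics(fit)`, or `plot_residual_diagnostics(m)` for the full model, draws both panels. A funnel shape in the first panel or strong curvature in the second would argue against the linear model.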