Dataset used: Prestige Data in Package(car)
library(car)
library(ggplot2)
library(dplyr)
library(GGally)
ggpairs(prestige[1:5])
plot(prestige$type, prestige$income)
ggplot(data = prestige, aes(x = prestige, y = income)) +
geom_smooth()
ggplot(data = prestige, aes(x = women, y = income)) +
geom_smooth()
ggplot(data = prestige, aes(x = prestige, y = income, col = type)) +
geom_smooth()
set.seed(1)
index <- sample(1:nrow(prestige), nrow(prestige)*0.8)
train <- prestige[index,]
test <- prestige[-index,]
lm.fit <- lm(income ~., data = prestige)
summary(lm.fit)
##
## Call:
## lm(formula = income ~ ., data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7752.4 -954.6 -331.2 742.6 14301.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.32053 3037.27048 0.002 0.99808
## education 131.18372 288.74961 0.454 0.65068
## women -53.23480 9.83107 -5.415 4.96e-07 ***
## prestige 139.20912 36.40239 3.824 0.00024 ***
## census 0.04209 0.23568 0.179 0.85865
## typeprof 509.15150 1798.87914 0.283 0.77779
## typewc 347.99010 1173.89384 0.296 0.76757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2633 on 91 degrees of freedom
## Multiple R-squared: 0.6363, Adjusted R-squared: 0.6123
## F-statistic: 26.54 on 6 and 91 DF, p-value: < 2.2e-16
We see that education is highliy coorelated and prestige and cencues both so lets make a model droppping education and census is not coorelated much so lets drop taht variable as well.
lm.fit2 <- lm(income ~ women + prestige, data = prestige)
summary(lm.fit2)
##
## Call:
## lm(formula = income ~ women + prestige, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7678.4 -1050.9 -310.1 839.6 14114.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 653.17 838.09 0.779 0.438
## women -50.50 8.42 -5.997 3.61e-08 ***
## prestige 163.74 15.46 10.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2587 on 95 degrees of freedom
## Multiple R-squared: 0.6334, Adjusted R-squared: 0.6257
## F-statistic: 82.08 on 2 and 95 DF, p-value: < 2.2e-16
plot(lm.fit2, pch=16, which=1)
Looking at the residual plots we can also try taking log of income and predict it using women and prestige column
lm.fit3 <- lm(log(income) ~ women + prestige, data = prestige)
summary(lm.fit3)
##
## Call:
## lm(formula = log(income) ~ women + prestige, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99182 -0.09790 0.00839 0.12425 0.79493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.9255582 0.0780000 101.61 <2e-16 ***
## women -0.0079685 0.0007837 -10.17 <2e-16 ***
## prestige 0.0213130 0.0014386 14.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 95 degrees of freedom
## Multiple R-squared: 0.7914, Adjusted R-squared: 0.787
## F-statistic: 180.2 on 2 and 95 DF, p-value: < 2.2e-16
plot(lm.fit3)
Our adjusted R-square is 0.787 and F-statistic 180.2 which means we now have a model with precision, lets predict the income on our test dataset
test$y <- predict(lm.fit3, newdata = test)
plot(test$income, exp(test$y))
error <- sqrt(mean(sum((exp(test$y) - test$income)^2)))