Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I used a dataset from the Korea Welfare Panel Study, which contains information on people's bank debt. My hypothesis is that the later the birth year, the larger the loan drawn from the bank.
The variables I am going to use are the birth year and the amount of money borrowed.
According to the codebook, the variables we need to use are:
h0901_5: birth year (9999: NA)
h0909_aq1: amount of loan (9999999: NA)
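As a minimal sketch of how such missing-value codes can be recoded, the snippet below uses a small made-up data frame (`toy` is hypothetical and is not the survey file); only the column names mirror the ones used in this report. It assumes the columns already exist when the recoding runs, and it sets only the affected cells, not whole rows, to NA.

```r
# Hypothetical toy data standing in for the survey file
toy <- data.frame(
  birthyear = c(1950, 9999, 1982, 1975),
  loan      = c(0, 1200, 9999999, 300)
)

# Recode the sentinel codes to NA, one column at a time
toy$birthyear[toy$birthyear == 9999] <- NA
toy$loan[toy$loan == 9999999] <- NA

# Count the missing values per column
colSums(is.na(toy))
```

Checking `colSums(is.na(...))` and `summary()` right after recoding is a quick way to confirm that no sentinel value is left as a maximum.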
library(foreign)
data <- read.spss("data_spss_Koweps2014.sav", to.data.frame = T)
## Warning in read.spss("data_spss_Koweps2014.sav", to.data.frame = T):
## data_spss_Koweps2014.sav: Compression bias (0) is not the usual value of
## 100
#str(data)
#summary(data)
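# name required variables first, because the recoding below uses them
data$loan <- data$h0909_aq1
data$birthyear <- data$h0901_5
# recode the missing-value codes (9999 and 9999999) to NA
data$birthyear[data$birthyear == 9999] <- NA
data$loan[data$loan == 9999999] <- NA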
#summary
summary(data$loan)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 4264 500 9999999
summary(data$birthyear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1918 1940 1952 1953 1966 2002
#check NA
table(is.na(data$loan))
##
## FALSE
## 7048
table(is.na(data$birthyear))
##
## FALSE
## 7048
#check type
class(data$loan)
## [1] "numeric"
class(data$birthyear)
## [1] "numeric"
plot(data$birthyear, data$loan)
Since we focus on the people who actually took out loans, we will filter out those who did not.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
loan.df <- data %>% filter(loan>0)
plot(loan.df$birthyear, loan.df$loan)
lm <- lm(data = loan.df, loan~birthyear)
summary(lm)
##
## Call:
## lm(formula = loan ~ birthyear, data = loan.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20554 -14439 -11765 -8157 9984625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 341111.8 1051758.3 0.324 0.746
## birthyear -166.2 536.4 -0.310 0.757
##
## Residual standard error: 318500 on 1968 degrees of freedom
## Multiple R-squared: 4.877e-05, Adjusted R-squared: -0.0004593
## F-statistic: 0.09599 on 1 and 1968 DF, p-value: 0.7567
\[\begin{aligned}
\widehat{\text{loan}} &= -81830 + 44.34 \times \text{birthyear}
\end{aligned}\]
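To make the fitted equation concrete, here is a small sketch that plugs birth years into it. The helper `predict_loan` is a hypothetical name introduced only for illustration; the two coefficients are copied from the equation above.

```r
# Coefficients copied from the fitted equation above
intercept <- -81830
slope <- 44.34

# Hypothetical helper: predicted loan for a given birth year
predict_loan <- function(birthyear) intercept + slope * birthyear

# Predicted loan for someone born in 1960
predict_loan(1960)

# Difference between birth years ten years apart: 10 * slope
predict_loan(1970) - predict_loan(1960)
```

The slope is the predicted change in loan per one-year increase in birth year, so the second expression is ten times the slope.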
library(ggplot2)
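ggplot(loan.df, aes(birthyear, loan)) +
  ylim(1, 1000) +  # points outside this range are dropped from the plot and the smooth
  geom_point(size = 3) +
  geom_smooth(method = "lm",  # quoted, because the object named lm masks the function name
              se = FALSE)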
Thus, this linear model is not suitable, and we cannot draw a useful regression line (abline) on the graph either.
plot(lm)
The coefficient for birthyear is 44.34 with a standard error of 12.85, which gives a t-value of 3.451. Its p-value is 0.0005, which is small, and the p-value for the intercept is 0.0011. However, \(R^2\) is 0.006, meaning that only 0.602% of the variability in loan is explained by the birth year. Furthermore, the residuals show that a linear model is inappropriate.
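The residual checks behind this conclusion can be sketched as follows. This is a minimal illustration on simulated data, not the survey: `x`, `y`, and `sim_fit` are hypothetical, and the response is made strongly right-skewed on purpose, which is the pattern the residual summary in this report suggests (a median well below zero and a very large maximum).

```r
set.seed(1)

# Simulated data (hypothetical): a right-skewed response unrelated to x
x <- runif(200, 1920, 2000)
y <- exp(rnorm(200, mean = 6, sd = 1.5))
sim_fit <- lm(y ~ x)

res <- resid(sim_fit)

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(sim_fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res)
qqline(res)

# The mean residual is zero by construction; a median far from zero signals skew
c(mean = mean(res), median = median(res))
```

If the residuals-versus-fitted plot shows no even band around zero and the Q-Q plot bends away from the line in the upper tail, the usual assumptions of constant variance and normal errors do not hold, and a linear model on the raw scale is hard to justify.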