Discussion week11

Week 11, Regression 1

Fundamentals of Computational Mathematics

CUNY MSDS DATA 605, Fall 2018

Rose Koh

11/7/2018

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I used a dataset from the Korea Welfare Panel Study, which contains information on people’s bank debt. My assumption is that the later a person’s birth year, the larger the amount borrowed from banks.

The variables I am going to use are the birth year and the amount of money borrowed.

According to the index, the variables we need to use are h0901_5 (birth year) and h0909_aq1 (loan amount).

Load data

library(foreign)
data <- read.spss("data_spss_Koweps2014.sav", to.data.frame = T)
## Warning in read.spss("data_spss_Koweps2014.sav", to.data.frame = T):
## data_spss_Koweps2014.sav: Compression bias (0) is not the usual value of
## 100
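#str(data)
#summary(data)
# name required variables first, so the recoding below can refer to them
data$loan <- data$h0909_aq1
data$birthyear <- data$h0901_5
# recode the missing-value codes to NA; which() skips any existing NAs
data$birthyear[which(data$birthyear == 9999)] <- NA
data$loan[which(data$loan == 9999999)] <- NA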

#summary
summary(data$loan)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    4264     500 9999999
summary(data$birthyear)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1918    1940    1952    1953    1966    2002
#check NA
table(is.na(data$loan))
## 
## FALSE 
##  7048
table(is.na(data$birthyear))
## 
## FALSE 
##  7048
#check type
class(data$loan)
## [1] "numeric"
class(data$birthyear)
## [1] "numeric"
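The loan summary above still reports a maximum of 9999999, so it is worth checking that the missing-value recoding actually takes effect. The following is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical; only the column codes and the two missing-value codes come from the loading step). It assumes the working columns must exist before they are recoded.

```r
# Hypothetical toy data using the same column codes and missing-value codes
# as the loading step; the values themselves are made up.
toy <- data.frame(h0901_5   = c(1950, 1975, 9999, 1988),
                  h0909_aq1 = c(0, 1200, 300, 9999999))

# 1. Create the working columns first.
toy$birthyear <- toy$h0901_5
toy$loan      <- toy$h0909_aq1

# 2. Recode only the affected cells; which() skips any existing NAs.
toy$birthyear[which(toy$birthyear == 9999)] <- NA
toy$loan[which(toy$loan == 9999999)]        <- NA

# 3. Confirm that the codes are gone and count the NAs per column.
print(summary(toy$loan))
print(colSums(is.na(toy[, c("birthyear", "loan")])))
```

If the recoding worked, the maximum in the summary is a real loan amount and each column reports one NA.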

Visualization

plot(data$birthyear, data$loan)

Since we focus on the people who actually took loans, we will filter out those who did not.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
loan.df <- data %>% filter(loan>0)
plot(loan.df$birthyear, loan.df$loan)

Linear Model

lm <- lm(data = loan.df, loan~birthyear)
summary(lm)
## 
## Call:
## lm(formula = loan ~ birthyear, data = loan.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -20554  -14439  -11765   -8157 9984625 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  341111.8  1051758.3   0.324    0.746
## birthyear      -166.2      536.4  -0.310    0.757
## 
## Residual standard error: 318500 on 1968 degrees of freedom
## Multiple R-squared:  4.877e-05,  Adjusted R-squared:  -0.0004593 
## F-statistic: 0.09599 on 1 and 1968 DF,  p-value: 0.7567
\[\begin{aligned} \widehat{Loan} &= 341111.8 - 166.2 \times birthyear \end{aligned}\]
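As a check on how to read the coefficient table, each t value is the estimate divided by its standard error, and the two-sided p-value follows from the t distribution with the residual degrees of freedom. The sketch below redoes that arithmetic with the rounded numbers printed in the summary. Nothing is refit, and the vector names are only for illustration.

```r
# Rounded values copied from the coefficient table in the summary output.
estimate  <- c(intercept = 341111.8, birthyear = -166.2)
std_error <- c(intercept = 1051758.3, birthyear = 536.4)
df_resid  <- 1968  # residual degrees of freedom reported by summary()

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Compare with the "t value" and "Pr(>|t|)" columns of the summary.
print(round(cbind(t_value, p_value), 3))
```

Both p-values are far above 0.05, so neither the intercept nor the slope can be distinguished from zero.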

library(ggplot2)
ggplot(loan.df, aes(birthyear, loan)) + 
        ylim(1,1000) +
        geom_point(size = 3) +
        # pass the method as a string: the name `lm` was reassigned to the fitted model above
        geom_smooth(method = "lm",
                    se = FALSE)

The slope is not statistically distinguishable from zero (p = 0.757), so this linear model is not suitable, and a fitted line adds nothing useful to the graph.

plot(lm)

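The estimated coefficient for birthyear is -166.2 with a standard error of 536.4, so the standard error is more than three times the size of the coefficient (t-value -0.310). The p-value of 0.757 is large, so we cannot reject the hypothesis that birth year has no linear effect on the loan amount, and the intercept is not significant either (p-value 0.746). The \(R^2\) is 4.877e-05, meaning that birth year explains only about 0.005% of the variability in loan. Furthermore, the residuals are far from symmetric around zero: the median is -11765 and even the third quartile is negative (-8157), while the maximum is 9984625. That maximum corresponds to a recorded loan of 9999999, the value the loading step treats as a missing-data code, so at least one such value appears to have remained in the fitted data and dominates the fit. The residuals therefore show that a linear model is inappropriate here.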
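The residual summary in the regression output (median -11765, maximum 9984625) is the pattern one would expect when a single extreme value pulls the fit. The sketch below reproduces that pattern on simulated data and uses Cook's distance to flag the dominating observation. The `toy` data, the seed, and the exponential loan amounts are all assumptions made for illustration, not properties of the survey data.

```r
set.seed(605)  # arbitrary seed, only so the toy example is reproducible

# Hypothetical data: 200 right-skewed "loans" unrelated to birth year,
# plus one extreme value standing in for a point like the 9999999 maximum.
toy <- data.frame(birthyear = sample(1930:1990, 200, replace = TRUE),
                  loan      = rexp(200, rate = 1 / 3000))
toy$loan[1] <- 9999999

fit_all <- lm(loan ~ birthyear, data = toy)

# Five-number summary of the residuals, as in the "Residuals" block of summary().
print(quantile(resid(fit_all)))

# Cook's distance measures how much each observation influences the fit.
cd    <- cooks.distance(fit_all)
worst <- which.max(cd)

# Refit without the flagged observation and compare the coefficients.
fit_trim <- lm(loan ~ birthyear, data = toy[-worst, ])
print(rbind(all = coef(fit_all), trimmed = coef(fit_trim)))
```

On the real data, the same steps could be run on `loan.df` after the missing-value codes have been recoded, before deciding whether a transformation of `loan` or a different model is needed.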