Data Analysis MOOC - Project 1

Below is the R code and commentary for data analysis project 1.

Some theoretical concepts before actual work:

Linear regression assumes the following:

  1. Normal distribution for the explanatory (x) and response (y) variables
  2. Linear relationship between (x) and (y)
  3. Homoscedasticity (i.e. the variance of the residuals is not related to X)

Step 1: explore relationships with the help of scatterplots

Part 1: What you have to assess and why

Missing Data

Ideally data has no missing values, but in reality all sorts of things happen: equipment malfunctions, participants refuse to answer certain questions, researchers fall asleep. The best way to deal with missing data is to run extra participants and report the number of participants eliminated due to experimental error. Often this is impossible and one ends up with missing data. Missing data is a problem only if it is distributed nonrandomly; for example, in a developmental study the younger children may be more likely to be dropped because they fail to understand the instructions. Researchers so want their missing data to be random that they avoid inspecting the data to find out. This is not a good idea.

There are three ways of dealing with missing data: listwise deletion, pairwise deletion, and mean substitution (sometimes called mean imputation). Statistical packages also offer separate modules with more sophisticated solutions.

  1. Listwise deletion eliminates that participant from all analyses. This tends to be a safe way of handling data so long as the remaining sample size is sufficient and so long as the respondents with missing data do not differ in any way from those with complete data.
  2. Pairwise deletion deletes the respondent's data only from those computations in which the variable with missing data is involved. For example, think of a correlation matrix: the correlation between variable A and variable B may be based on 182 participants while the correlation between variable B and variable C is based on 200 participants, if 18 participants had missing data for variable A. This is generally considered a less good solution than listwise deletion, but it is widely used in multiple regression and factor analysis because it results in larger sample sizes than listwise deletion. It is generally considered inappropriate for structural equation modeling (see Byrne, 2001, p. 290).
  3. Mean substitution (imputation) replaces missing values on a variable with the mean of that variable. This is generally considered a conservative approach, because it reduces the probability of getting significant results, but it creates substantial biases in structural equation modeling (it consistently underestimates the parameters; Brown, 1994).

One can also drop variables rather than participants. This is a lovely solution if one variable accounts for a lot of missing values. Structural equation modeling (AMOS) gets very cranky with missing data. There is one direct technique available (maximum likelihood with estimates of means and intercepts), but it is more difficult to get a good model. Pairwise deletion is inappropriate with AMOS, but one can use listwise deletion or mean substitution in SPSS to prepare a data file without missing data.
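As a minimal R sketch (the data frame df and its columns are hypothetical, purely for illustration), the three approaches look like this:

df <- data.frame(A = c(1, 2, NA, 4), B = c(NA, 2, 3, 4), C = c(5, 6, 7, 8))  # toy data with NAs

df_listwise <- df[complete.cases(df), ]   # listwise deletion: keep only complete rows

cor(df, use = "pairwise.complete.obs")    # pairwise deletion: each correlation uses all available pairs

df_mean <- df                             # mean substitution: replace NAs with the column mean
df_mean$A[is.na(df_mean$A)] <- mean(df$A, na.rm = TRUE)
df_mean$B[is.na(df_mean$B)] <- mean(df$B, na.rm = TRUE)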

Multicollinearity

Multicollinearity refers to the situation in which two IVs are very highly correlated with each other. This is a bad thing because it gives spuriously high Rs, which won't replicate. It can be assessed by inspecting tables of intercorrelations and the standard errors of the partial regression coefficients. There is one strange case in which one IV is correlated with several other IVs but the correlations are not significant and so it is not obvious. This requires a special analysis (see Part 2). Cohen et al. have a whole chapter concerning outliers and multicollinearity. We will not be reading it. Learn from this manual.

Normality

The normality assumption of regression is that the residuals be normally distributed and of constant variance (homoscedastic) over sets of values of the independent variables. If the residuals are nonnormal or heteroscedastic, then the standard errors of the regression coefficients (which are the denominator of the t formula and therefore used to determine significance) are biased. When the participants are divided into groups (e.g., ANOVA) it is an assumption that the sampling distribution is normally distributed. It is a safe assumption when the sample size is greater than 30. It will also be true when the population distribution is normal. So for samples (or subsamples) of less than 30 we need to assess whether the population distribution is normal. We do that by assessing normality in the sample (because we don't have access to the population). In multivariate statistics, the assumption of normality applies to the multivariate case, that is, to all combinations of variables. Although a shortcut, it is generally acceptable to assess multivariate normality by assessing normality of the distribution of residuals. This can be done by examining a histogram of the residuals, a normal probability plot, and a scatter plot of predicted scores against residuals. Appendix A describes the various kinds of residuals; we will use standardized residuals. With a large sample size, violations of normality will affect neither the significance tests nor the confidence intervals. Nonetheless, it is important to check the distribution of residuals because the presence of nonnormal residuals suggests other problems, such as misspecification of the multiple regression model (i.e., the wrong independent variables). If multivariate normality is found, then the univariate distributions will also be normal. If the multivariate distribution is not normal, one needs to examine the individual distributions. One can use histograms or normal probability plots. NOTE: there are tests of normality, but they are unduly influenced by sample size.

Outliers

All analyses based on correlations (e.g., multiple regression, factor analysis, structural equation modeling) are very vulnerable to outliers. It is necessary to check for both univariate and multivariate outliers. Usually it is easier to check for univariate outliers first. To check for univariate outliers, use histograms, stem-and-leaf displays, box plots, and normal probability plots. They do not all give the same answer, so using more than one is appropriate.
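For instance (a sketch on an arbitrary simulated vector x, not on the loans data loaded later), the univariate checks can be produced with:

x <- c(rnorm(100), 8)     # hypothetical data: 100 normal values plus one extreme point
hist(x)                   # histogram
stem(x)                   # stem-and-leaf display
boxplot(x)                # box plot: extreme points appear beyond the whiskers
qqnorm(x); qqline(x)      # normal probability plot with reference line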

Linearity

Multiple regression assumes that there is a linear (straight-line) relationship between the independent variables and the dependent variable. You, the researcher, need to know whether the relationship between Y and each independent variable (IV) is linear or not. If the true relationship between the DV and an IV (or a linear composite of IVs) is curvilinear, you need to specify that (Chapter 6 in Cohen et al.); if you do not, the r or R-squared will be underestimated (the obtained value will be smaller than the true population value). A scatterplot of residuals will reveal departures from multivariate linearity; if that relationship is not linear, then check scatterplots of the individual variables.

Homoscedasticity

Homoscedasticity means that the variances of the Y values are the same for all X values. Violating this assumption (heteroscedasticity) does not bias the regression coefficient, but it does bias the standard error, so significance levels are incorrect. This means that you can safely use the regression coefficient descriptively, but inference (is it significant?) is biased. We also know the direction of the bias:

A positive relation between an IV and the variance of the errors of prediction (a funnel shape opening toward high values) introduces a positive bias (it increases Type I error, concluding significance when there is no relation). Conversely, a negative relation between the IV and the variance of the errors (a funnel opening toward low values) introduces a negative bias. Regression is fairly "robust" with respect to heteroscedasticity; Cohen et al.'s rule of thumb is that the ratio of the largest to the smallest conditional variance must be greater than 10 to be worrisome. A further assumption is that errors of prediction are independent of one another: the residual for the first subject is not related to the residual for the second subject, and so on. This is typically violated with time-series or distance measures.

Errors of prediction and the independent variables are independent (Cohen et al., p. 119).

This will be violated when:

  1. An independent variable has a large amount of measurement error. This can be tested by assessing the reliability of each variable. Unreliability will decrease the partial r and the beta in most cases (the exception is when the partialled-out IV is unreliable; in that case one cannot tell whether the partial r and the beta will increase or decrease).
  2. An important IV (one that correlates with the DV and with the other IVs) is NOT included in the regression.
  3. There is reciprocal causation, that is, the IV affects the DV and the DV affects the IV. This is called simultaneity.

The result of violating this assumption is that the regression coefficients will be biased; if simultaneity is the cause, the standard error of the regression coefficient is also biased. A further assumption is that the expected value (mean) of the errors of prediction in the population regression function equals zero. Violation of this assumption is not terribly important because systematic error (e.g., a mean higher or lower than zero) simply changes the intercept; our estimates of the regression coefficients and standard errors (which give us the statistics we need) will not be biased.

Step 2: explore relationships with the help of correlations. The correlation coefficient measures the strength of the relationship between two variables.
The relationship can be positive or negative, and the correlation value varies from -1 to +1. Types of correlation coefficients:

  a. Pearson's correlation coefficient (X and Y are continuous)
  b. Point-biserial correlation - one variable is continuous and the other is dichotomous (0,1)
  c. Phi coefficient - both variables are dichotomous (0,1)
  d. Spearman's rank correlation - for ranked data

Pearson's correlation coefficient = (degree to which X and Y vary together) / (degree to which X and Y vary separately) = Cov(X, Y) / (SD(X) * SD(Y))
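A short R sketch of these coefficients (the vectors are simulated and purely illustrative; point-biserial and phi are simply Pearson's r applied to 0/1 codes):

x <- rnorm(50)                      # continuous
y <- 2 * x + rnorm(50)              # continuous, related to x
g <- rbinom(50, 1, 0.5)             # dichotomous (0/1)
h <- rbinom(50, 1, 0.5)             # dichotomous (0/1)

cor(x, y)                           # a. Pearson's r
cov(x, y) / (sd(x) * sd(y))         # same value, from the definition Cov(X,Y) / (SD(X) * SD(Y))
cor(x, g)                           # b. point-biserial correlation
cor(g, h)                           # c. phi coefficient
cor(x, y, method = "spearman")      # d. Spearman's rank correlation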

Step 3: Fit a linear regression equation between X and Y

Step 4: Explore residuals: a. Compute the residuals (Y - Y'), where Y' is the value predicted from the regression equation b. Standardize the residuals c. Make a residual plot d. Check normality

Step 5: Quantify model fit using the adjusted R-squared value.

The R-squared value is the square of the correlation coefficient (explained above).

The R-squared value measures how well the model containing the explanatory variables (x) explains the variability in the response (Y).

The R-squared value lies between 0 and 1. As a rough rule of thumb, a model with an R-squared value above 0.64 is good (i.e. a correlation coefficient of -0.8 or +0.8).

In the case of multiple explanatory variables (X), not all variables may end up in the linear regression model, i.e. not all explanatory variables may contribute to explaining the variability in Y.

Coefficient of determination (R-squared) = sum of squares due to regression / total sum of squares

Total sum of squares = sum of squares due to regression + sum of squares of residuals; in ANOVA notation, SSTO (total sum of squares) = SST (sum of squares of treatments, i.e. the regression sum of squares) + SSE (sum of squares of errors).

The use of an adjusted R-squared (R2) is an attempt to take account of the phenomenon of the R-squared automatically and spuriously increasing when extra explanatory variables are added to the model. It is a modification due to Theil of R2 that adjusts for the number of explanatory terms in a model relative to the number of data points. The adjusted R2 can be negative, and its value will always be less than or equal to that of R2. Unlike R2, the adjusted R2 increases when a new explanator is included only if the new explanator improves the R2 more than would be expected in the absence of any explanatory value being added by the new explanator. If a set of explanatory variables with a predetermined hierarchy of importance are introduced into a regression one at a time, with the adjusted R2 computed each time, the level at which adjusted R2 reaches a maximum, and decreases afterward, would be the regression with the ideal combination of having the best fit without excess/unnecessary terms. The adjusted R2 is defined as

= R2 - (1 - R2) * p / (n - p - 1)

where p is the total number of regressors in the linear model (not counting the constant term), and n is the sample size.

Adjusted R2 can also be written as

= 1 - (sum of squares of residuals / degrees of freedom of residuals) / (total sum of squares / degrees of freedom of total)

Going by above, R2 can also be written as

= 1 - (sum of squares of residuals / n) / (total sum of squares / n)

Here the degrees of freedom are simply n, the number of data points (so they cancel out).

where dft is the degrees of freedom, n - 1, of the estimate of the population variance of the dependent variable, and dfe is the degrees of freedom, n - p - 1, of the estimate of the underlying population error variance.
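To make the formulas concrete, here is a small sketch on simulated data (all names hypothetical) that computes R2 and adjusted R2 from the sums of squares and checks them against R's own summary():

set.seed(1)
n <- 100; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

sse <- sum(residuals(fit)^2)                           # sum of squares of residuals
ssto <- sum((y - mean(y))^2)                           # total sum of squares
r2 <- 1 - sse / ssto                                   # R-squared
adj_r2 <- 1 - (sse / (n - p - 1)) / (ssto / (n - 1))   # adjusted R-squared
c(r2, adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)  # should match the values above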

Adjusted R2 does not have the same interpretation as R2: while R2 is a measure of fit, adjusted R2 is instead a comparative measure of the suitability of alternative nested sets of explanators. As such, care must be taken in interpreting and reporting this statistic. Adjusted R2 is particularly useful in the feature selection stage of model building.

Assumptions about model errors: 1. zero conditional mean of errors; 2. independence of errors; 3. homoscedasticity of errors (constant variance of errors); 4. normal distribution of errors. Normally distributed errors are not required for the regression coefficients to be unbiased, consistent, and efficient (at least in the sense of being best linear unbiased estimates), but this assumption is required for trustworthy significance tests and confidence intervals in small samples.

It is worth noting in passing that while the regression model requires only the normality of errors, the Pearson product moment correlation model requires that the two variables follow a bivariate normal distribution (Pedhazur, 1997). I.e., in the correlation model, both the marginal and conditional distribution of each variable is assumed to be normal.

Other potential problems: 1. Multicollinearity - the presence of correlations between the predictors is termed collinearity (for a relationship between two predictor variables) or multicollinearity (for relationships among more than two predictors).

2. Outliers - Cook's distance is the most widely used diagnostic to identify outliers. Alternatively, "robust" regression methods may help to reduce the influence of outlying observations.
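As a sketch of the second point (using the MASS package that ships with R; the toy data are purely illustrative), a robust fit can be compared with ordinary least squares, which is pulled toward the outlier:

library(MASS)                       # provides rlm() for robust (M-estimation) regression

x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
y[20] <- 30                         # contaminate one observation

coef(lm(y ~ x))                     # OLS coefficients: distorted by the outlier
coef(rlm(y ~ x))                    # robust coefficients: much less affected
cooks.distance(lm(y ~ x))[20]       # Cook's distance is large for the contaminated point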

library(gclus)
## Loading required package: cluster

dt <- read.csv("/media/APOLLO M250/4_New_Coursera/coursera-master/dataanalysis-002/loansData.csv")
# dt<-read.csv(file.choose())
a <- strsplit(as.character(dt$FICO.Range), "-")  #Splitting FICO RANGE STRING
j <- do.call("rbind", a)  # CONVERTS list created above to a dataframe
colnames(j) <- c("F1", "F2")  # Renaming columns of dataframe j
p <- cbind(dt, j)  # Making one data frame with all data
p[!complete.cases(p), ]  # This will give list of rows where missing data is present
##        Amount.Requested Amount.Funded.By.Investors Interest.Rate
## 101596             5000                       4525         7.43%
## 101515             3500                        225        10.28%
##        Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership
## 101596   36 months        other                   1%    NY           NONE
## 101515   36 months        other                  10%    NY           RENT
##        Monthly.Income FICO.Range Open.CREDIT.Lines
## 101596             NA    800-804                NA
## 101515          15000    685-689                NA
##        Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months
## 101596                       NA                             NA
## 101515                       NA                             NA
##        Employment.Length  F1  F2
## 101596          < 1 year 800 804
## 101515          < 1 year 685 689
# There are two rows in the data frame with NAs
p <- p[complete.cases(p), ]  # Data frame with complete set of values

Below converts the interest rate to numerical values.

m <- sub("%", "", p$Interest.Rate)  # replaces % in Interest.Rate; result stored in 'm'
m <- as.numeric(m)  #converts string to numeric
q <- cbind(p, m)  # makes one data frame with interest rate added to original
head(q)
##       Amount.Requested Amount.Funded.By.Investors Interest.Rate
## 81174            20000                      20000         8.90%
## 99592            19200                      19200        12.12%
## 80059            35000                      35000        21.98%
## 15825            10000                       9975         9.99%
## 33182            12000                      12000        11.71%
## 62403             6000                       6000        15.31%
##       Loan.Length       Loan.Purpose Debt.To.Income.Ratio State
## 81174   36 months debt_consolidation               14.90%    SC
## 99592   36 months debt_consolidation               28.36%    TX
## 80059   60 months debt_consolidation               23.81%    CA
## 15825   36 months debt_consolidation               14.30%    KS
## 33182   36 months        credit_card               18.78%    NJ
## 62403   36 months              other               20.05%    CT
##       Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines
## 81174       MORTGAGE           6542    735-739                14
## 99592       MORTGAGE           4583    715-719                12
## 80059       MORTGAGE          11500    690-694                14
## 15825       MORTGAGE           3833    695-699                10
## 33182           RENT           3195    695-699                11
## 62403            OWN           4892    670-674                17
##       Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months
## 81174                    14272                              2
## 99592                    11140                              1
## 80059                    21977                              1
## 15825                     9346                              0
## 33182                    14469                              0
## 62403                    10391                              2
##       Employment.Length  F1  F2     m
## 81174          < 1 year 735 739  8.90
## 99592           2 years 715 719 12.12
## 80059           2 years 690 694 21.98
## 15825           5 years 695 699  9.99
## 33182           9 years 695 699 11.71
## 62403           3 years 670 674 15.31
names(q)[names(q) == "m"] <- "Interest"  # Colname is updated with Interest

Below categorizes the debt-to-income ratio.

q$n <- sub("%", "", q$Debt.To.Income.Ratio)  # replaces % in Debt.To.Income.Ratio
names(q)[names(q) == "n"] <- "DI_Ratio"
q$DI_Ratio <- as.numeric(q$DI_Ratio)
q$Category[q$DI_Ratio < 10] <- "A"  # categorizing DI ratio as A
q$Category[q$DI_Ratio >= 10 & q$DI_Ratio < 15] <- "B"  # categorizing DI ratio as B
q$Category[q$DI_Ratio >= 15 & q$DI_Ratio < 20] <- "C"  # categorizing DI ratio as C
q$Category[q$DI_Ratio >= 20] <- "D"  # categorizing DI ratio as D

Below converts the loan length to numerical values.

q$LL <- sub(" months", "", q$Loan.Length)
names(q)[names(q) == "LL"] <- "LLen"
q$LLen <- as.numeric(q$LLen)

lapply(q, class)  # To understand type of each column
## $Amount.Requested
## [1] "integer"
## 
## $Amount.Funded.By.Investors
## [1] "numeric"
## 
## $Interest.Rate
## [1] "factor"
## 
## $Loan.Length
## [1] "factor"
## 
## $Loan.Purpose
## [1] "factor"
## 
## $Debt.To.Income.Ratio
## [1] "factor"
## 
## $State
## [1] "factor"
## 
## $Home.Ownership
## [1] "factor"
## 
## $Monthly.Income
## [1] "numeric"
## 
## $FICO.Range
## [1] "factor"
## 
## $Open.CREDIT.Lines
## [1] "integer"
## 
## $Revolving.CREDIT.Balance
## [1] "integer"
## 
## $Inquiries.in.the.Last.6.Months
## [1] "integer"
## 
## $Employment.Length
## [1] "factor"
## 
## $F1
## [1] "factor"
## 
## $F2
## [1] "factor"
## 
## $Interest
## [1] "numeric"
## 
## $DI_Ratio
## [1] "numeric"
## 
## $Category
## [1] "character"
## 
## $LLen
## [1] "numeric"

Graphs to understand data

attach(q)
boxplot(Interest ~ F1, data = q)

plot of chunk unnamed-chunk-4

plot(q$Amount.Requested, q$Interest)

plot of chunk unnamed-chunk-4

plot(q$F2, q$Interest)

plot of chunk unnamed-chunk-4

plot(q$Home.Ownership, q$Interest)

plot of chunk unnamed-chunk-4

plot(q$Inquiries.in.the.Last.6.Months, q$Interest)

plot of chunk unnamed-chunk-4

plot(q$Loan.Length, q$Interest)  # Higher the loan length more the interest

plot of chunk unnamed-chunk-4

plot(q$Open.CREDIT.Lines, q$Interest)

plot of chunk unnamed-chunk-4

plot(q$Employment.Length, q$Interest)

plot of chunk unnamed-chunk-4

a <- subset(q, q$Revolving.CREDIT.Balance < 50000)
plot(a$Revolving.CREDIT.Balance, a$Interest)  # lower the credit balance, lower the interest

plot of chunk unnamed-chunk-4

plot(q$Loan.Purpose, q$Interest)

plot of chunk unnamed-chunk-4

Graphs to understand impact of Debt_to_income_ratio

A <- subset(q, q$Category == "A")
B <- subset(q, q$Category == "B")
C <- subset(q, q$Category == "C")
D <- subset(q, q$Category == "D")
plot(D$Interest)  # plot of DI ratio >=20

plot of chunk unnamed-chunk-5

plot(C$Interest)  # plot of DI ratio <20 & >=15

plot of chunk unnamed-chunk-5

plot(B$Interest)  # plot of DI ratio <15 & >=10

plot of chunk unnamed-chunk-5

Correlation analysis to find out the relation between interest and other variables

cdf <- q[, c(1, 9, 11, 12, 13, 16, 17, 18, 20)]  # a new dataframe with variables of interest
cdf$F2 <- as.numeric(cdf$F2)  # converting FICO range upper value to numeric for cor(); note: as.numeric() on a factor returns the level codes, not the underlying FICO numbers (use as.numeric(as.character(...)) to recover those)
cdf$LLen <- as.numeric(cdf$LLen)

Lowess stands for locally weighted scatterplot smoothing. It is a nonparametric method for drawing a smooth curve through a scatterplot. There are at least two ways to produce such a plot in R…

plot(cdf$Interest ~ cdf$F2)
lines(lowess(cdf$Interest ~ cdf$F2), col = "red")

plot of chunk unnamed-chunk-7


scatter.smooth(cdf$Interest ~ cdf$F2)

plot of chunk unnamed-chunk-7

cor(cdf)  # generates the correlation matrix
##                                Amount.Requested Monthly.Income
## Amount.Requested                        1.00000        0.39118
## Monthly.Income                          0.39118        1.00000
## Open.CREDIT.Lines                       0.19594        0.17140
## Revolving.CREDIT.Balance                0.29337        0.35968
## Inquiries.in.the.Last.6.Months         -0.02956        0.03395
## F2                                      0.08334        0.12277
## Interest                                0.33183        0.01292
## DI_Ratio                                0.08129       -0.16234
## LLen                                    0.41230        0.07454
##                                Open.CREDIT.Lines Revolving.CREDIT.Balance
## Amount.Requested                         0.19594                 0.293365
## Monthly.Income                           0.17140                 0.359684
## Open.CREDIT.Lines                        1.00000                 0.290085
## Revolving.CREDIT.Balance                 0.29009                 1.000000
## Inquiries.in.the.Last.6.Months           0.11074                 0.012186
## F2                                      -0.08950                 0.002735
## Interest                                 0.09031                 0.061109
## DI_Ratio                                 0.37085                 0.189221
## LLen                                     0.04089                 0.055436
##                                Inquiries.in.the.Last.6.Months        F2
## Amount.Requested                                     -0.02956  0.083338
## Monthly.Income                                        0.03395  0.122766
## Open.CREDIT.Lines                                     0.11074 -0.089497
## Revolving.CREDIT.Balance                              0.01219  0.002735
## Inquiries.in.the.Last.6.Months                        1.00000 -0.092142
## F2                                                   -0.09214  1.000000
## Interest                                              0.16465 -0.709283
## DI_Ratio                                              0.01198 -0.216921
## LLen                                                  0.02384  0.012736
##                                Interest DI_Ratio    LLen
## Amount.Requested                0.33183  0.08129 0.41230
## Monthly.Income                  0.01292 -0.16234 0.07454
## Open.CREDIT.Lines               0.09031  0.37085 0.04089
## Revolving.CREDIT.Balance        0.06111  0.18922 0.05544
## Inquiries.in.the.Last.6.Months  0.16465  0.01198 0.02384
## F2                             -0.70928 -0.21692 0.01274
## Interest                        1.00000  0.17220 0.42351
## DI_Ratio                        0.17220  1.00000 0.02499
## LLen                            0.42351  0.02499 1.00000

The above results suggest notable correlation between the interest rate and a. Amount.Requested (0.3318), b. Inquiries.in.the.Last.6.Months (0.1646), c. DI_Ratio (0.172), d. LLen (0.423), e. FICO value (-0.7092), and f. Open.CREDIT.Lines (0.0903).

The plot below graphically illustrates the correlation structure.

cdr <- abs(cor(cdf))
cd.col <- dmat.color(cdr)
cd.or <- order.single(cdr)
cpairs(cdf, cd.or, panel.colors = cd.col, gap = 0.5, main = "Correlation")

plot of chunk unnamed-chunk-8

Fitting a simple linear regression of interest rate on the FICO variable (F2)

Linear <- lm(cdf$Interest ~ cdf$F2)

par(mfrow = c(2, 2))
plot(Linear)

plot of chunk unnamed-chunk-9

library(car)
residualPlots(Linear)  # Residual PLots

plot of chunk unnamed-chunk-9

##            Test stat Pr(>|t|)
## cdf$F2         11.78        0
## Tukey test     11.78        0
avPlots(Linear)  #Influential variables-Added-variableplots
influenceIndexPlot(Linear, id.n = 3)

plot of chunk unnamed-chunk-9

Cook's distance measures how much an observation influences the overall model or the predicted values. Studentized residuals are the residuals divided by their estimated standard deviation, as a way to standardize them. The Bonferroni test identifies outliers. Hat values identify influential observations (those with a high impact on the predictor variables). NOTE: if an observation is both an outlier and influential (high leverage), then it can change the fit of the linear model, and it is advisable to remove it. To remove a case or cases, use update(), e.g. reg1a <- update(prestige.reg4, subset = rownames(Prestige) != "general.managers") or reg1b <- update(prestige.reg4, subset = !(rownames(Prestige) %in% c("general.managers", "medical.technicians"))).

# inf.index(Linear)  # not a standard function; influenceIndexPlot() above already provides the influence index plots
residual_values <- residuals(Linear)
influencePlot(Linear)

plot of chunk unnamed-chunk-10

##      StudRes       Hat   CookD
## 38     1.580 0.0050445 0.07952
## 1684   2.581 0.0039487 0.11480
## 1928   3.469 0.0004115 0.04966
qqPlot(Linear)  # Outliers QQ plots

plot of chunk unnamed-chunk-10

Look at the tails; points should be close to the line or within the confidence envelope.

Quantile plots compare the Studentized residuals against a t-distribution.

Other normality tests: shapiro.test(), and mshapiro.test() from library(mvnormtest).

outlierTest(Linear)  #Bonferroni test
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferonni p
## 1928    3.469          0.0005317           NA
hist(residual_values)

plot of chunk unnamed-chunk-11

plot(cdf$Interest, residual_values, ylab = "Residuals", xlab = "Interest Rate", 
    main = "Graph of Interest Rate vs Residuals")

plot of chunk unnamed-chunk-11

Testing for multicollinearity. "When there are strong linear relationships among the predictors in a regression analysis, the precision of the estimated regression coefficients in linear models declines compared to what it would have been were the predictors uncorrelated with each other" (Fox: 359). When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed. The term collinearity implies that two variables are near-perfect linear combinations of one another. When more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated. In this section, we use R functions that help to detect multicollinearity.

We can use the vif() function (from the car package) after fitting the regression to check for multicollinearity. VIF stands for variance inflation factor. As a rule of thumb, a variable whose VIF value is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10; it means that the variable could be considered a linear combination of the other independent variables.

vif(Linear)
## Error: model contains fewer than 2 terms
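The error above simply reflects that Linear has a single predictor, and vif() needs at least two. A minimal sketch on a two-predictor model (m2 is illustrative, fitted only to show the call):

m2 <- lm(Interest ~ F2 + LLen, data = cdf)   # any model with two or more predictors
vif(m2)                                      # values near 1: little collinearity; > 10: worrisome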

Testing for heteroscedasticity

The Breusch-Pagan and Cook-Weisberg score test for non-constant error variance; the null hypothesis is constant variance.

See also residualPlots(Linear).

ncvTest(Linear)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.73    Df = 1     p = 0.02964
anova(Linear)
## Analysis of Variance Table
## 
## Response: cdf$Interest
##             Df Sum Sq Mean Sq F value Pr(>F)    
## cdf$F2       1  21928   21928    2527 <2e-16 ***
## Residuals 2496  21659       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Linear)

plot of chunk unnamed-chunk-13 plot of chunk unnamed-chunk-13 plot of chunk unnamed-chunk-13 plot of chunk unnamed-chunk-13

The plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.

In statistics, leverage is a term used in connection with regression analysis and, in particular, in analyses aimed at identifying those observations that are far away from corresponding average predictor values. Leverage points do not necessarily have a large effect on the outcome of fitting regression models.

Leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation

High-leverage points are those that are outliers with respect to the independent variables. Leverage points are those that cause large changes in the parameter estimates when they are deleted. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal of the hat matrix

Leverage points are those which have great influence on the fitted model, that is, those whose x-value is distant from the other x-values. Bad leverage point: if it is also an outlier, that is, the y-value does not follow the pattern set by the other data points. Good leverage point: if it is not an outlier.

Cook's distance (or Cook's D) is a commonly used estimate of the influence of a data point when performing least squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate data points that are particularly worth checking for validity, or to indicate regions of the design space where it would be good to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
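Both leverage and Cook's distance are available directly for the fitted model; a quick sketch (the 2 * mean(leverage) cutoff is a common rule of thumb, not a fixed rule):

lev <- hatvalues(Linear)                 # leverage: diagonal of the hat matrix
cd <- cooks.distance(Linear)             # Cook's distance for each observation

sum(lev > 2 * mean(lev))                 # how many points have unusually high leverage
head(sort(cd, decreasing = TRUE))        # observations with the largest Cook's distances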

Another measure is the Mahalanobis distance, a descriptive statistic that provides a relative measure of a data point's distance from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936. The Mahalanobis distance is used to identify and gauge the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant; in other words, it serves as a multivariate measure of how far a point lies from the centre of the data.
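A sketch of multivariate outlier screening with mahalanobis() on a few numeric predictors in cdf (the chi-squared cutoff is a common convention, not a hard rule):

X <- cdf[, c("Amount.Requested", "Monthly.Income", "Open.CREDIT.Lines")]
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Under multivariate normality, md is roughly chi-squared with df = number of variables
cutoff <- qchisq(0.999, df = ncol(X))
sum(md > cutoff)                         # number of potential multivariate outliers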

Below standardizes the residuals. We should not expect to see any residual beyond -3 to +3; if there is a residual beyond this range, further investigation is needed.

standard_residual <- rstandard(Linear)
hist(standard_residual)

plot of chunk unnamed-chunk-14

As all values are within the range of -3 to +3, further investigation is not required. To determine the coefficients of the linear model:

coefs <- coefficients(Linear)
coefs
## (Intercept)      cdf$F2 
##     19.0719     -0.4235

The null hypothesis of the linear regression is that the slope of the line relating the response variable to the predictor variable is zero. From the summary below, the p-value is far below 0.05, so the null hypothesis is rejected and we conclude that the slope of the regression line is not zero. Further, the adjusted R-squared value is about 0.5, so the model explains roughly 50% of the variation.

summary(Linear)
## 
## Call:
## lm(formula = cdf$Interest ~ cdf$F2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.990 -2.136 -0.456  1.835 10.194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.07189    0.13314   143.2   <2e-16 ***
## cdf$F2      -0.42350    0.00842   -50.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 2.95 on 2496 degrees of freedom
## Multiple R-squared: 0.503,   Adjusted R-squared: 0.503 
## F-statistic: 2.53e+03 on 1 and 2496 DF,  p-value: <2e-16
confint(Linear)
##             2.5 % 97.5 %
## (Intercept) 18.81 19.333
## cdf$F2      -0.44 -0.407
anova(Linear)
## Analysis of Variance Table
## 
## Response: cdf$Interest
##             Df Sum Sq Mean Sq F value Pr(>F)    
## cdf$F2       1  21928   21928    2527 <2e-16 ***
## Residuals 2496  21659       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted_values <- fitted(Linear)

Below will produce a graph of Interest vs FICO values, with fitted line and residual segments

plot(cdf$Interest ~ cdf$F2)
abline(lm(cdf$Interest ~ cdf$F2))
segments(cdf$F2, Fitted_values, cdf$F2, cdf$Interest)

plot of chunk unnamed-chunk-17

Below will produce a scatter plot of fitted values vs residual values


plot(Fitted_values, residual_values)

plot of chunk unnamed-chunk-18

qqnorm(residual_values)

plot of chunk unnamed-chunk-18

Prediction intervals and confidence intervals are not the same thing. Unfortunately the terms are often confused, and I frequently find myself correcting the error in students' papers and in articles I am reviewing or editing.

A prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval. For example, I might give an 80% interval for the forecast of GDP in 2014. The actual GDP in 2014 should lie within the interval with probability 0.8. Prediction intervals can arise in Bayesian or frequentist statistics.

Prediction interval - also includes the uncertainty about future values

Confidence interval - an interval estimate of a population parameter based on a random sample; reflects uncertainty about the regression line (how well the line is determined)

Degree of confidence - the probability that the confidence interval captures the true population parameter

A confidence interval is an interval associated with a parameter and is a frequentist concept. The parameter is assumed to be non-random but unknown, and the confidence interval is computed from data. Because the data are random, the interval is random. A 95% confidence interval will contain the true parameter with probability 0.95. That is, with a large number of repeated samples, 95% of the intervals would contain the true parameter.

The difference between a prediction interval and a confidence interval is the standard error.

The standard error for a confidence interval on the mean takes into account the uncertainty due to sampling. The line you computed from your sample will be different from the line that would have been computed if you had the entire population; the standard error takes this uncertainty into account.

The standard error for a prediction interval on an individual observation takes into account the uncertainty due to sampling as above, but also takes into account the variability of the individuals around the predicted mean. The standard error for the prediction interval will be larger than for the confidence interval, and hence the prediction interval will be wider than the confidence interval.

A confidence interval gives a range for E[y|x], as you say. A prediction interval gives a range for y itself. Naturally, our best guess for y is E[y|x], so the intervals will both be centered around the same predicted value.

The standard errors are going to be different: we estimate the expected value E[y|x] more precisely than we can predict y itself. Predicting y requires including the variance that comes from the true error term.

To illustrate the difference, imagine that we could get perfect estimates of our β coefficients. Then our estimate of E[y|x] would be perfect. But we still wouldn't be sure what y itself was, because there is a true error term that we need to consider. Our confidence "interval" would just be a point, because we estimate E[y|x] exactly right, but our prediction interval would be wider because we take the true error term into account. Hence, a prediction interval will be wider than a confidence interval. Reference: http://stats.stackexchange.com/questions/16493/difference-between-confidence-intervals-and-prediction-intervals

Difference between the fitted() and predict() functions in R

The fitted function returns the y-hat values associated with the data used to fit the model. The predict function returns predictions for a new set of predictor variables. If you don't specify a new set of predictor variables then it will use the original data by default giving the same results as fitted for some models, but if you want to predict for a new set of values then you need predict. The predict function often also has options for which type of prediction to return, the linear predictor, the prediction transformed to the response scale, the most likely category, the contribution of each term in the model, etc. –reference http://stackoverflow.com/questions/12201439/is-there-a-difference-between-the-r-functions-fitted-and-predict
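A quick sketch of the distinction (the refit with a data argument and the value F2 = 30 are illustrative; note that F2 here is on the numeric-coded scale used throughout):

fit <- lm(Interest ~ F2, data = cdf)            # refit with data= so newdata works cleanly

head(fitted(fit))                               # y-hat values for the data used to fit
head(predict(fit))                              # identical when newdata is omitted
predict(fit, newdata = data.frame(F2 = 30))     # prediction for a new F2 value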

To compute confidence interval

Predi_C <- predict(Linear, int = "c")

To compute prediction interval

Predi_P <- predict(Linear, int = "p")
## Warning: Predictions on current data refer to _future_ responses

Plot of Original data with Linear regression line along with confidence and prediction intervals

plot(cdf$Interest ~ cdf$F2)
abline(Linear)
matlines(cdf$F2, Predi_C)
matlines(cdf$F2, Predi_P)

plot of chunk unnamed-chunk-21

To find the interest rate for a new value of F2 (here F2 = 50):

c2df <- cdf
attach(c2df)
## The following object(s) are masked from 'q':
## 
##     Amount.Requested, DI_Ratio, F2,
##     Inquiries.in.the.Last.6.Months, Interest, LLen,
##     Monthly.Income, Open.CREDIT.Lines, Revolving.CREDIT.Balance
Linear <- lm(Interest ~ F2)
plot(Linear)

plot of chunk unnamed-chunk-22 plot of chunk unnamed-chunk-22 plot of chunk unnamed-chunk-22 plot of chunk unnamed-chunk-22

new1data <- data.frame(F2 = 50)  # You can define new data frame here .. I have given only one value
pp <- predict(Linear, int = "p", newdata = new1data)
pc <- predict(Linear, int = "c", newdata = new1data)

The summary shows that the FICO value explains 50% of the variation in interest rate, because the R-squared value is 0.50.

To explain the remaining variation, a multiple regression with the variables identified in the correlation analysis is done.

To look at the relations between the variables in the data, the pairs plot below is used.

pairs(cdf)

plot of chunk unnamed-chunk-23

To check further, a model selection process is used. There are three model selection methods: 1. backward selection, 2. forward selection, and 3. step-wise regression. R has a step() function which searches for a model using the Akaike Information Criterion (AIC). AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model. It is founded on information entropy: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data.

Rg1 <- lm(cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio + 
    cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested)
summary(Rg1)
## 
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + 
##     cdf$DI_Ratio + cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + 
##     cdf$Amount.Requested)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.570 -1.381 -0.168  1.205  9.982 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         1.21e+01   2.31e-01   52.24  < 2e-16
## cdf$Inquiries.in.the.Last.6.Months  3.52e-01   3.39e-02   10.38  < 2e-16
## cdf$DI_Ratio                        1.23e-03   6.05e-03    0.20     0.84
## cdf$LLen                            1.34e-01   4.57e-03   29.41  < 2e-16
## cdf$F2                             -4.36e-01   6.10e-03  -71.61  < 2e-16
## cdf$Open.CREDIT.Lines              -5.03e-02   1.01e-02   -4.98  6.7e-07
## cdf$Amount.Requested                1.47e-04   5.96e-06   24.67  < 2e-16
##                                       
## (Intercept)                        ***
## cdf$Inquiries.in.the.Last.6.Months ***
## cdf$DI_Ratio                          
## cdf$LLen                           ***
## cdf$F2                             ***
## cdf$Open.CREDIT.Lines              ***
## cdf$Amount.Requested               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 2.06 on 2491 degrees of freedom
## Multiple R-squared: 0.757,   Adjusted R-squared: 0.757 
## F-statistic: 1.3e+03 on 6 and 2491 DF,  p-value: <2e-16
confint(Rg1)
##                                         2.5 %     97.5 %
## (Intercept)                        11.6054002 12.5106546
## cdf$Inquiries.in.the.Last.6.Months  0.2857053  0.4187286
## cdf$DI_Ratio                       -0.0106278  0.0130914
## cdf$LLen                            0.1253267  0.1432347
## cdf$F2                             -0.4484503 -0.4245450
## cdf$Open.CREDIT.Lines              -0.0701221 -0.0305200
## cdf$Amount.Requested                0.0001354  0.0001588
step(Rg1, direction = "both")  # bi-directional search
## Start:  AIC=3618
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio + 
##     cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
## 
##                                      Df Sum of Sq   RSS  AIC
## - cdf$DI_Ratio                        1         0 10573 3616
## <none>                                            10573 3618
## - cdf$Open.CREDIT.Lines               1       105 10678 3641
## - cdf$Inquiries.in.the.Last.6.Months  1       458 11030 3722
## - cdf$Amount.Requested                1      2584 13156 4162
## - cdf$LLen                            1      3670 14243 4360
## - cdf$F2                              1     21765 32338 6409
## 
## Step:  AIC=3616
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$LLen + 
##     cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
## 
##                                      Df Sum of Sq   RSS  AIC
## <none>                                            10573 3616
## + cdf$DI_Ratio                        1         0 10573 3618
## - cdf$Open.CREDIT.Lines               1       117 10690 3642
## - cdf$Inquiries.in.the.Last.6.Months  1       458 11031 3720
## - cdf$Amount.Requested                1      2586 13159 4161
## - cdf$LLen                            1      3671 14243 4359
## - cdf$F2                              1     22734 33306 6480
## 
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + 
##     cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested)
## 
## Coefficients:
##                        (Intercept)  cdf$Inquiries.in.the.Last.6.Months  
##                          12.073053                            0.351886  
##                           cdf$LLen                              cdf$F2  
##                           0.134284                           -0.436751  
##              cdf$Open.CREDIT.Lines                cdf$Amount.Requested  
##                          -0.049597                            0.000147
step(Rg1, direction = "back")  # backward search
## Start:  AIC=3618
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio + 
##     cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
## 
##                                      Df Sum of Sq   RSS  AIC
## - cdf$DI_Ratio                        1         0 10573 3616
## <none>                                            10573 3618
## - cdf$Open.CREDIT.Lines               1       105 10678 3641
## - cdf$Inquiries.in.the.Last.6.Months  1       458 11030 3722
## - cdf$Amount.Requested                1      2584 13156 4162
## - cdf$LLen                            1      3670 14243 4360
## - cdf$F2                              1     21765 32338 6409
## 
## Step:  AIC=3616
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$LLen + 
##     cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
## 
##                                      Df Sum of Sq   RSS  AIC
## <none>                                            10573 3616
## - cdf$Open.CREDIT.Lines               1       117 10690 3642
## - cdf$Inquiries.in.the.Last.6.Months  1       458 11031 3720
## - cdf$Amount.Requested                1      2586 13159 4161
## - cdf$LLen                            1      3671 14243 4359
## - cdf$F2                              1     22734 33306 6480
## 
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + 
##     cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested)
## 
## Coefficients:
##                        (Intercept)  cdf$Inquiries.in.the.Last.6.Months  
##                          12.073053                            0.351886  
##                           cdf$LLen                              cdf$F2  
##                           0.134284                           -0.436751  
##              cdf$Open.CREDIT.Lines                cdf$Amount.Requested  
##                          -0.049597                            0.000147
step(Rg1, direction = "forward")  # Forward search
## Start:  AIC=3618
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio + 
##     cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
## 
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + 
##     cdf$DI_Ratio + cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + 
##     cdf$Amount.Requested)
## 
## Coefficients:
##                        (Intercept)  cdf$Inquiries.in.the.Last.6.Months  
##                          12.058027                            0.352217  
##                       cdf$DI_Ratio                            cdf$LLen  
##                           0.001232                            0.134281  
##                             cdf$F2               cdf$Open.CREDIT.Lines  
##                          -0.436498                           -0.050321  
##               cdf$Amount.Requested  
##                           0.000147
summary.aov(Rg1)
##                                      Df Sum Sq Mean Sq F value Pr(>F)    
## cdf$Inquiries.in.the.Last.6.Months    1   1182    1182  278.39 <2e-16 ***
## cdf$DI_Ratio                          1   1263    1263  297.64 <2e-16 ***
## cdf$LLen                              1   7529    7529 1773.96 <2e-16 ***
## cdf$F2                                1  20456   20456 4819.76 <2e-16 ***
## cdf$Open.CREDIT.Lines                 1      0       0    0.11   0.74    
## cdf$Amount.Requested                  1   2584    2584  608.71 <2e-16 ***
## Residuals                          2491  10573       4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(Rg1)
##                                         2.5 %     97.5 %
## (Intercept)                        11.6054002 12.5106546
## cdf$Inquiries.in.the.Last.6.Months  0.2857053  0.4187286
## cdf$DI_Ratio                       -0.0106278  0.0130914
## cdf$LLen                            0.1253267  0.1432347
## cdf$F2                             -0.4484503 -0.4245450
## cdf$Open.CREDIT.Lines              -0.0701221 -0.0305200
## cdf$Amount.Requested                0.0001354  0.0001588
par(mfrow = c(2, 2))
plot(Rg1)

plot of chunk unnamed-chunk-24

The summary shows that the R-squared value is 0.757. The other variables that affect the interest rate are a. Amount.Requested (0.3318), b. Inquiries.in.the.Last.6.Months (0.1646), c. DI_Ratio (0.172), d. LLen (0.423), e. FICO value (-0.7092), and f. Open.CREDIT.Lines (0.0903).

Beta Coefficients

Beta, or standardized, coefficients are the slopes we would get if all the variables were on the same scale, which is done by converting them to z-scores before doing the regression. Betas allow a comparison of the relative importance of the predictors, which neither the unstandardized coefficients nor the p-values do. Scaling, or standardizing, the data vectors can be done using the scale() function. The intercept goes to zero, and the slopes become standardized beta values.

Partial Correlations

Another way to remove the effect of a possible lurking variable from the correlation of two other variables is to calculate a partial correlation. There is no partial correlation function in base R, so to create one you can copy a short script into the R console; a sketch follows.
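One simple version (a sketch, not an established package function): the partial correlation of x and y controlling for z is the correlation between the residuals of x on z and the residuals of y on z.

pcor_xy_z <- function(x, y, z) {
    rx <- residuals(lm(x ~ z))   # part of x not explained by z
    ry <- residuals(lm(y ~ z))   # part of y not explained by z
    cor(rx, ry)
}

# Example: Interest vs Amount.Requested, controlling for the FICO variable F2
pcor_xy_z(cdf$Interest, cdf$Amount.Requested, cdf$F2)

(The Rg2 model below, fitted on scale()d variables, gives the standardized beta coefficients described in the previous section.)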

Rg2 <- lm(scale(cdf$Interest) ~ scale(cdf$Inquiries.in.the.Last.6.Months) + 
    scale(cdf$DI_Ratio) + scale(cdf$LLen) + scale(cdf$F2) + scale(cdf$Open.CREDIT.Lines) + 
    scale(cdf$Amount.Requested))