Below are the R code and comments for data analysis project 1.
Some theoretical concepts before the actual work:
Linear regression rests on the assumptions described below.
Step 1: explore relationships with the help of scatterplots
Part 1: What you have to assess and why
Missing Data
Ideally a data set has no missing values, but in reality all sorts of things happen: equipment malfunctions, participants refuse to answer certain questions, researchers fall asleep. The best way to deal with missing data is to run extra participants and report the number of participants eliminated due to experimental error. Often this is impossible and one ends up with missing data. Missing data is a problem only if it is distributed nonrandomly; for example, in a developmental study the younger children may be more likely to be dropped because they fail to understand the instructions. Researchers want their missing data to be random so badly that they sometimes avoid inspecting the data to find out. This is not a good idea.
There are three common ways of dealing with missing data: listwise deletion, pairwise deletion, and mean substitution (sometimes called mean imputation); more sophisticated solutions are also available in dedicated missing-data modules of the major statistical packages. Listwise deletion eliminates a participant with any missing value from all analyses. This tends to be a safe way of handling data so long as the remaining sample size is sufficient and the respondents with missing data do not differ systematically from those with complete data. Pairwise deletion deletes a respondent's data only from those computations in which the variable with missing data is involved. For example, in a correlation matrix the correlation between variable A and variable B may be based on 192 participants while the correlation between variable B and variable C is based on 200 participants, if 8 participants had missing data for variable A. This is generally considered a less good solution than listwise deletion, but it is widely used in multiple regression and factor analysis because it yields larger sample sizes. It is generally considered inappropriate for structural equation modeling (see Byrne, 2001, p. 290). Mean substitution (imputation) replaces missing values on a variable with the mean of that variable. This is generally considered a conservative approach, because it reduces the probability of getting significant results, but it creates substantial biases in structural equation modeling (it consistently underestimates the parameters; Brown, 1994).
One can also drop variables rather than participants; this is a lovely solution if one variable accounts for a large share of the missing values. Structural equation modeling (e.g., in AMOS) gets very cranky with missing data. There is one direct technique available (maximum likelihood with estimates of means and intercepts), but it makes it harder to obtain a good model. Pairwise deletion is inappropriate with AMOS, but one can use listwise deletion or mean substitution in SPSS to prepare a data file without missing data.
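To make these options concrete, here is a minimal sketch in R on a made-up data frame (the variable names are illustrative and not part of the project data):
# Toy data frame with missing values (illustration only)
md <- data.frame(A = c(1, 2, NA, 4), B = c(10, 12, 13, 15), C = c(5, NA, 7, 8))
md_listwise <- na.omit(md)                       # listwise deletion: drop any row containing an NA
cor(md, use = "pairwise.complete.obs")           # pairwise deletion: each correlation uses all available pairs
md_mean <- md                                    # mean substitution: replace NAs with the variable mean
md_mean$A[is.na(md_mean$A)] <- mean(md$A, na.rm = TRUE)
md_mean$C[is.na(md_mean$C)] <- mean(md$C, na.rm = TRUE)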
Multicollinearity
Multicollinearity refers to the situation in which two or more IVs are very highly correlated with each other. This is a bad thing because it produces spuriously high Rs, which won't replicate. It can be assessed by inspecting tables of intercorrelations and the standard errors of the partial regression coefficients. There is one awkward case in which one IV is correlated with several other IVs but the individual correlations are not significant, so the problem is not obvious; this requires a special analysis (see Part 2). Cohen et al. have a whole chapter on outliers and multicollinearity. We will not be reading it. Learn from this manual.
Normality
The normality assumption of regression is that the residuals are normally distributed and of constant variance (homoscedastic) over sets of values of the independent variables. If the residuals are nonnormal or heteroscedastic, the standard errors of the regression coefficients (which form the denominator of the t formula and are therefore used to determine significance) are biased. When the participants are divided into groups (e.g., ANOVA), the assumption is that the sampling distribution is normal. This is a safe assumption when the sample size is greater than 30, and it also holds when the population distribution is normal. So for samples (or subsamples) of fewer than 30 we need to assess whether the population distribution is normal, and we do that by assessing normality in the sample (because we don't have access to the population). In multivariate statistics the assumption of normality applies to the multivariate case, that is, to all combinations of variables. Although a shortcut, it is generally acceptable to assess multivariate normality by assessing the normality of the distribution of residuals. This can be done by examining a histogram of the residuals, a normal probability plot, and a scatterplot of predicted scores against residuals. Appendix A describes the various kinds of residuals; we will use standardized residuals. With a large sample size, violations of normality will affect neither the significance tests nor the confidence intervals. Nonetheless, it is important to check the distribution of residuals, because nonnormal residuals suggest other problems, such as misspecification of the multiple regression model (i.e., the wrong independent variables). If multivariate normality holds, the univariate distributions will also be normal; if the multivariate distribution is not normal, one needs to examine the individual distributions, using histograms or normal probability plots. NOTE: there are formal tests of normality, but they are unduly influenced by sample size.
Outliers
All analyses based on correlations (e.g., multiple regression, factor analysis, structural equation modeling) are very vulnerable to outliers. It is necessary to check for both univariate and multivariate outliers; it is usually easier to check for univariate outliers first. To check for univariate outliers, use histograms, stem-and-leaf plots, box plots, and normal probability plots. They do not all give the same answer, so using more than one is appropriate.
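A minimal sketch of a univariate outlier screen in R, using simulated data for illustration only:
set.seed(1)
v <- c(rnorm(100), 8)        # simulated scores with one extreme value added
hist(v)                      # isolated bars far from the bulk of the data are suspect
boxplot(v)                   # points beyond the whiskers are candidate outliers
qqnorm(v); qqline(v)         # extreme departures from the line flag unusual values
boxplot.stats(v)$out         # values flagged by the 1.5 * IQR boxplot rule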
Linearity
Multiple regression assumes that there is a linear (straight-line) relationship between the independent variables and the dependent variable. You, the researcher, need to know whether the relationship between Y and each independent variable (IV) is linear or not. If the true relationship between the DV and an IV (or a linear composite of IVs) is curvilinear and you do not specify that in the model (Chapter 6 in Cohen et al.), the r or R-squared will be underestimated (the obtained value will be smaller than the true population value). A scatterplot of residuals will reveal departures from multivariate linearity; if that relationship is not linear, check scatterplots of the individual variables.
Homoscedasticity
Homoscedasticity means that the variances of the Y values are the same for all X values. Violating this assumption (heteroscedasticity) does not bias the regression coefficients, but it does bias their standard errors, so significance levels are incorrect. This means that you can safely use the regression coefficients descriptively, but inferences about them (is the coefficient significant?) are biased. We know in which direction:
A positive relation between an IV and the variance of the errors of prediction (a funnel shape opening toward high values) introduces a positive bias (increases Type I error, i.e., concluding there is a relation when there is none). Conversely, a negative relation between the IV and the variance of the errors (a funnel opening toward low values) introduces a negative bias. Regression is fairly robust with respect to heteroscedasticity; Cohen et al.'s rule of thumb is that the ratio of the largest to the smallest conditional variance must exceed 10 before it becomes worrisome. A separate assumption is that the errors of prediction are independent of one another: the residual for the first subject is not related to the residual for the second subject, and so on. This is typically violated with time-series or distance measures.
Errors of prediction and the independent variables are independent (p. 119 Cohen et al.).
This assumption will be violated when: (a) an independent variable has a large amount of measurement error (this can be tested by assessing the reliability of each variable; unreliability will decrease the partial r and the beta in most cases, the exception being when the partialled-out IV is unreliable, in which case one cannot tell whether the partial r and the beta will increase or decrease); (b) an important IV (one that correlates with the DV and with the other IVs) is NOT included in the regression; or (c) there is reciprocal causation, that is, the IV affects the DV and the DV affects the IV (this is called simultaneity). The result of violating this assumption is that the regression coefficients will be biased; if simultaneity is the cause, the standard error of the regression coefficient is also biased. A further assumption is that the expected value (mean) of the errors of prediction in the population regression function is zero. Violation of this assumption is not terribly important, because systematic error (e.g., a mean higher or lower than zero) will simply change the intercept; our estimates of the regression coefficients and their standard errors (which give us the statistics we need) will not be biased.
Step 2: explore relationships with the help of correlations
The correlation coefficient measures the strength of the relationship between two variables.
The relationship can be positive or negative.
The correlation value varies from -1 to +1.
Types of correlation coefficients:
a. Pearson's correlation coefficient (X and Y are continuous)
b. Point-biserial correlation (one variable is continuous and the other is dichotomous, coded 0/1)
c. Phi coefficient (both variables are dichotomous, coded 0/1)
d. Spearman's rank correlation (for ranked data)
Pearson's correlation coefficient = (degree to which X and Y vary together) / (degree to which X and Y vary independently) = Cov(X, Y) / (SD(X) * SD(Y))
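For illustration, all of these coefficients can be obtained with cor() in R (simulated data; the point-biserial and phi coefficients are simply Pearson's r applied to 0/1 variables):
set.seed(2)
x <- rnorm(50); y <- 0.6 * x + rnorm(50)
g <- rbinom(50, 1, 0.5)                 # a dichotomous (0/1) variable
cor(x, y, method = "pearson")           # Pearson's r for two continuous variables
cor(x, g)                               # point-biserial: Pearson's r with one 0/1 variable
cor(g, rbinom(50, 1, 0.5))              # phi coefficient: Pearson's r with two 0/1 variables
cor(x, y, method = "spearman")          # Spearman's rank correlation for ranked data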
Step 3: Fit a linear regression equation between X and Y
Step 4: Explore residuals: a. compute the residuals (Y - Y'), where Y' is the value predicted by the regression equation; b. standardize the residuals; c. make a residual plot; d. check normality.
Step 5: Quantify model fit using the adjusted R-squared value.
For simple linear regression, the R-squared value is the square of the correlation coefficient (explained above).
The R-squared value measures how well the model containing the explanatory variable(s) X explains the variability in Y.
The R-squared value lies between 0 and 1. As a rough rule of thumb, a model with an R-squared value above 0.64 (i.e., a correlation coefficient of about -0.8 or +0.8) is considered good.
In the case of multiple explanatory variables (Xs), not all of them may belong in the linear regression model; that is, not all explanatory variables may contribute to explaining the variability in Y.
Coefficient of determination (R-squared) = sum of squares due to regression / total sum of squares
Total sum of squares = sum of squares due to regression + sum of squares of residuals; in ANOVA notation, SSTO (total sum of squares) = SST (sum of squares of treatments) + SSE (sum of squares of errors).
The use of adjusted R-squared (adjusted R2) is an attempt to account for the phenomenon of R2 automatically and spuriously increasing when extra explanatory variables are added to the model. It is a modification of R2, due to Theil, that adjusts for the number of explanatory terms in a model relative to the number of data points. The adjusted R2 can be negative, and its value will always be less than or equal to that of R2. Unlike R2, the adjusted R2 increases when a new explanator is included only if the new explanator improves the R2 more than would be expected in the absence of any real explanatory value. If a set of explanatory variables with a predetermined hierarchy of importance is introduced into a regression one at a time, with the adjusted R2 computed each time, the point at which adjusted R2 reaches a maximum and decreases afterward identifies the regression with the ideal combination of best fit without excess or unnecessary terms. The adjusted R2 is defined as
Adjusted R2 = R2 - (1 - R2) * p / (n - p - 1)
where p is the total number of regressors in the linear model (not counting the constant term) and n is the sample size.
Adjusted R2 can also be written as
Adjusted R2 = 1 - (sum of squares of residuals / dfe) / (total sum of squares / dft)
By analogy, R2 can be written as
R2 = 1 - (sum of squares of residuals / n) / (total sum of squares / n)
where the "degrees of freedom" in both numerator and denominator are simply n, the number of data points.
In the adjusted formula, dft is the degrees of freedom, n - 1, of the estimate of the population variance of the dependent variable, and dfe is the degrees of freedom, n - p - 1, of the estimate of the underlying population error variance.
Adjusted R2 does not have the same interpretation as R2: while R2 is a measure of fit, adjusted R2 is a comparative measure of the suitability of alternative nested sets of explanators. As such, care must be taken in interpreting and reporting this statistic. Adjusted R2 is particularly useful in the feature-selection stage of model building.
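To make the formula concrete, the adjusted R2 reported by summary(lm()) can be reproduced by hand; a sketch on simulated data with p = 2 regressors:
set.seed(3)
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
r2 <- summary(fit)$r.squared
p <- 2                                         # number of regressors, excluding the constant term
r2 - (1 - r2) * p / (n - p - 1)                # adjusted R2 from the formula above
summary(fit)$adj.r.squared                     # should match the manual value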
Assumptions about the model errors: 1. zero conditional mean of the errors; 2. independence of the errors; 3. homoscedasticity of the errors (constant variance); 4. normal distribution of the errors. Normally distributed errors are not required for the regression coefficients to be unbiased, consistent, and efficient (at least in the sense of being best linear unbiased estimates), but this assumption is required for trustworthy significance tests and confidence intervals in small samples.
It is worth noting in passing that while the regression model requires only normality of the errors, the Pearson product-moment correlation model requires that the two variables follow a bivariate normal distribution (Pedhazur, 1997). That is, in the correlation model both the marginal and conditional distributions of each variable are assumed to be normal.
Other potential problems: 1. Multicollinearity: the presence of correlations between the predictors is termed collinearity (for a relationship between two predictor variables) or multicollinearity (for relationships among more than two predictors).
2. Outliers: Cook's distance is the most widely used diagnostic for identifying influential observations. Alternatively, robust regression methods may help to reduce the influence of outlying observations.
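A sketch of both checks on simulated data (the loan models are fitted later in this document; cooks.distance() is in base R and rlm() comes from the MASS package):
set.seed(4)
xx <- rnorm(60); yy <- 2 + 0.8 * xx + rnorm(60)
yy[60] <- yy[60] + 10                  # inject one outlying response value
fit_out <- lm(yy ~ xx)
cd <- cooks.distance(fit_out)          # Cook's distance for every observation
which(cd > 4 / length(cd))             # a common rule of thumb for points worth inspecting
library(MASS)                          # rlm() provides robust (M-estimation) regression
rlm(yy ~ xx)                           # coefficients are much less affected by the outlier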
library(gclus)
## Loading required package: cluster
dt <- read.csv("/media/APOLLO M250/4_New_Coursera/coursera-master/dataanalysis-002/loansData.csv")
# dt<-read.csv(file.choose())
a <- strsplit(as.character(dt$FICO.Range), "-") #Splitting FICO RANGE STRING
j <- do.call("rbind", a) # Combines the list above into a two-column character matrix
colnames(j) <- c("F1", "F2") # Names the columns of j
p <- cbind(dt, j) # Making one data frame with all data
p[!complete.cases(p), ] # This will give list of rows where missing data is present
## Amount.Requested Amount.Funded.By.Investors Interest.Rate
## 101596 5000 4525 7.43%
## 101515 3500 225 10.28%
## Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership
## 101596 36 months other 1% NY NONE
## 101515 36 months other 10% NY RENT
## Monthly.Income FICO.Range Open.CREDIT.Lines
## 101596 NA 800-804 NA
## 101515 15000 685-689 NA
## Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months
## 101596 NA NA
## 101515 NA NA
## Employment.Length F1 F2
## 101596 < 1 year 800 804
## 101515 < 1 year 685 689
# There are two rows in the data frame with NAs
p <- p[complete.cases(p), ] # Data frame with complete set of values
Below, the interest rate is converted to numeric values.
m <- sub("%", "", p$Interest.Rate) # Removes the % sign from Interest.Rate; m is a character vector
m <- as.numeric(m) #converts string to numeric
q <- cbind(p, m) # makes one data frame with interest rate added to original
head(q)
## Amount.Requested Amount.Funded.By.Investors Interest.Rate
## 81174 20000 20000 8.90%
## 99592 19200 19200 12.12%
## 80059 35000 35000 21.98%
## 15825 10000 9975 9.99%
## 33182 12000 12000 11.71%
## 62403 6000 6000 15.31%
## Loan.Length Loan.Purpose Debt.To.Income.Ratio State
## 81174 36 months debt_consolidation 14.90% SC
## 99592 36 months debt_consolidation 28.36% TX
## 80059 60 months debt_consolidation 23.81% CA
## 15825 36 months debt_consolidation 14.30% KS
## 33182 36 months credit_card 18.78% NJ
## 62403 36 months other 20.05% CT
## Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines
## 81174 MORTGAGE 6542 735-739 14
## 99592 MORTGAGE 4583 715-719 12
## 80059 MORTGAGE 11500 690-694 14
## 15825 MORTGAGE 3833 695-699 10
## 33182 RENT 3195 695-699 11
## 62403 OWN 4892 670-674 17
## Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months
## 81174 14272 2
## 99592 11140 1
## 80059 21977 1
## 15825 9346 0
## 33182 14469 0
## 62403 10391 2
## Employment.Length F1 F2 m
## 81174 < 1 year 735 739 8.90
## 99592 2 years 715 719 12.12
## 80059 2 years 690 694 21.98
## 15825 5 years 695 699 9.99
## 33182 9 years 695 699 11.71
## 62403 3 years 670 674 15.31
names(q)[names(q) == "m"] <- "Interest" # Rename column m to Interest
Below, the debt-to-income ratio is converted to numeric and categorized.
q$n <- sub("%", "", q$Debt.To.Income.Ratio) # Removes the % sign from the debt-to-income ratio
names(q)[names(q) == "n"] <- "DI_Ratio"
q$DI_Ratio <- as.numeric(q$DI_Ratio)
q$Category[q$DI_Ratio < 10] <- "A" # categorizing DI ratio as A
q$Category[q$DI_Ratio >= 10 & q$DI_Ratio < 15] <- "B" # categorizing DI ratio as B
q$Category[q$DI_Ratio >= 15 & q$DI_Ratio < 20] <- "C" # categorizing DI ratio as C
q$Category[q$DI_Ratio >= 20] <- "D" # categorizing DI ratio as D
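The same categorization could be written more compactly with cut(); this is only an alternative sketch, and the Category2 column it creates is not used elsewhere in the analysis:
# Equivalent to the four assignments above: A = [0, 10), B = [10, 15), C = [15, 20), D = [20, Inf)
q$Category2 <- cut(q$DI_Ratio, breaks = c(-Inf, 10, 15, 20, Inf),
                   labels = c("A", "B", "C", "D"), right = FALSE)
table(q$Category, q$Category2)   # cross-check against the manual version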
Below, the loan length is converted to numeric values.
q$LL <- sub(" months", "", q$Loan.Length)
names(q)[names(q) == "LL"] <- "LLen"
q$LLen <- as.numeric(q$LLen)
lapply(q, class) # To understand type of each column
## $Amount.Requested
## [1] "integer"
##
## $Amount.Funded.By.Investors
## [1] "numeric"
##
## $Interest.Rate
## [1] "factor"
##
## $Loan.Length
## [1] "factor"
##
## $Loan.Purpose
## [1] "factor"
##
## $Debt.To.Income.Ratio
## [1] "factor"
##
## $State
## [1] "factor"
##
## $Home.Ownership
## [1] "factor"
##
## $Monthly.Income
## [1] "numeric"
##
## $FICO.Range
## [1] "factor"
##
## $Open.CREDIT.Lines
## [1] "integer"
##
## $Revolving.CREDIT.Balance
## [1] "integer"
##
## $Inquiries.in.the.Last.6.Months
## [1] "integer"
##
## $Employment.Length
## [1] "factor"
##
## $F1
## [1] "factor"
##
## $F2
## [1] "factor"
##
## $Interest
## [1] "numeric"
##
## $DI_Ratio
## [1] "numeric"
##
## $Category
## [1] "character"
##
## $LLen
## [1] "numeric"
Graphs to understand data
attach(q)
boxplot(Interest ~ F1, data = q)
plot(q$Amount.Requested, q$Interest)
plot(q$F2, q$Interest)
plot(q$Home.Ownership, q$Interest)
plot(q$Inquiries.in.the.Last.6.Months, q$Interest)
plot(q$Loan.Length, q$Interest) # The longer the loan, the higher the interest rate
plot(q$Open.CREDIT.Lines, q$Interest)
plot(q$Employment.Length, q$Interest)
a <- subset(q, q$Revolving.CREDIT.Balance < 50000)
plot(a$Revolving.CREDIT.Balance, a$Interest) # The lower the revolving credit balance, the lower the interest rate
plot(q$Loan.Purpose, q$Interest)
Graphs to understand the impact of the debt-to-income ratio
A <- subset(q, q$Category == "A")
B <- subset(q, q$Category == "B")
C <- subset(q, q$Category == "C")
D <- subset(q, q$Category == "D")
plot(D$Interest) # plot of DI ratio >=20
plot(C$Interest) # plot of DI ratio <20 & >=15
plot(B$Interest) # plot of DI ratio <15 & >=10
Correlation analysis to find out the relation between interest and other variables
cdf <- q[, c(1, 9, 11, 12, 13, 16, 17, 18, 20)] # a new dataframe with variables of interest
cdf$F2 <- as.numeric(cdf$F2) # Converts the FICO upper-bound factor to numeric for cor(); note this yields the factor level codes (ordered FICO bands), not the raw FICO values
cdf$LLen <- as.numeric(cdf$LLen)
Lowess stands for locally weighted scatterplot smoothing. It is a nonparametric method for drawing a smooth curve through a scatterplot. There are at least two ways to produce such a plot in R…
plot(cdf$Interest ~ cdf$F2)
lines(lowess(cdf$Interest ~ cdf$F2), col = "red")
scatter.smooth(cdf$Interest ~ cdf$F2)
cor(cdf) # Generates the correlation matrix
## Amount.Requested Monthly.Income
## Amount.Requested 1.00000 0.39118
## Monthly.Income 0.39118 1.00000
## Open.CREDIT.Lines 0.19594 0.17140
## Revolving.CREDIT.Balance 0.29337 0.35968
## Inquiries.in.the.Last.6.Months -0.02956 0.03395
## F2 0.08334 0.12277
## Interest 0.33183 0.01292
## DI_Ratio 0.08129 -0.16234
## LLen 0.41230 0.07454
## Open.CREDIT.Lines Revolving.CREDIT.Balance
## Amount.Requested 0.19594 0.293365
## Monthly.Income 0.17140 0.359684
## Open.CREDIT.Lines 1.00000 0.290085
## Revolving.CREDIT.Balance 0.29009 1.000000
## Inquiries.in.the.Last.6.Months 0.11074 0.012186
## F2 -0.08950 0.002735
## Interest 0.09031 0.061109
## DI_Ratio 0.37085 0.189221
## LLen 0.04089 0.055436
## Inquiries.in.the.Last.6.Months F2
## Amount.Requested -0.02956 0.083338
## Monthly.Income 0.03395 0.122766
## Open.CREDIT.Lines 0.11074 -0.089497
## Revolving.CREDIT.Balance 0.01219 0.002735
## Inquiries.in.the.Last.6.Months 1.00000 -0.092142
## F2 -0.09214 1.000000
## Interest 0.16465 -0.709283
## DI_Ratio 0.01198 -0.216921
## LLen 0.02384 0.012736
## Interest DI_Ratio LLen
## Amount.Requested 0.33183 0.08129 0.41230
## Monthly.Income 0.01292 -0.16234 0.07454
## Open.CREDIT.Lines 0.09031 0.37085 0.04089
## Revolving.CREDIT.Balance 0.06111 0.18922 0.05544
## Inquiries.in.the.Last.6.Months 0.16465 0.01198 0.02384
## F2 -0.70928 -0.21692 0.01274
## Interest 1.00000 0.17220 0.42351
## DI_Ratio 0.17220 1.00000 0.02499
## LLen 0.42351 0.02499 1.00000
The above results show that interest rate is correlated with: a. Amount.Requested (0.3318), b. Inquiries.in.the.Last.6.Months (0.1646), c. DI_Ratio (0.172), d. LLen (0.4235), e. the FICO upper bound F2 (-0.7092, by far the strongest), and f. Open.CREDIT.Lines (0.0903, weak).
The plot below illustrates these correlations graphically.
cdr <- abs(cor(cdf)) # Absolute correlations, used to colour the panels
cd.col <- dmat.color(cdr) # Panel colours derived from the correlation magnitudes
cd.or <- order.single(cdr) # Reorders the variables so highly correlated ones sit together
cpairs(cdf, cd.or, panel.colors = cd.col, gap = 0.5, main = "Correlation") # Coloured scatterplot matrix
Fitting a simple linear regression of interest rate on the FICO upper bound (F2):
Linear <- lm(cdf$Interest ~ cdf$F2)
par(mfrow = c(2, 2))
plot(Linear)
library(car)
residualPlots(Linear) # Residual PLots
## Test stat Pr(>|t|)
## cdf$F2 11.78 0
## Tukey test 11.78 0
avPlots(Linear) #Influential variables-Added-variableplots
influenceIndexPlot(Linear, id.n = 3)
Cook's distance measures how much an observation influences the overall model or the predicted values. Studentized residuals are the residuals divided by their estimated standard deviation, as a way of standardizing them. The Bonferroni outlier test identifies outliers. Hat values identify high-leverage observations (those with a large impact on the predictor space). NOTE: if an observation is both an outlier and influential (high leverage), it can change the fit of the linear model, and it is advisable to remove it. To remove a case or cases, use update(); the examples below are illustrative, referring to the Prestige data from the car package, where prestige.reg4 is a previously fitted model:
reg1a <- update(prestige.reg4, subset = rownames(Prestige) != "general.managers")
reg1b <- update(prestige.reg4, subset = !(rownames(Prestige) %in% c("general.managers", "medical.technicians")))
# Note: inf.index() is not a function in the car package (calling it gives "could not find function");
# the influence index plots are produced by influenceIndexPlot(), used above.
residual_values <- residuals(Linear)
influencePlot(Linear)
## StudRes Hat CookD
## 38 1.580 0.0050445 0.07952
## 1684 2.581 0.0039487 0.11480
## 1928 3.469 0.0004115 0.04966
qqPlot(Linear) # Outliers QQ plots
outlierTest(Linear) #Bonferroni test
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 1928 3.469 0.0005317 NA
hist(residual_values)
plot(cdf$Interest, residual_values, ylab = "Residuals", xlab = "Interest Rate",
main = "Graph of Interest Rate vs Residuals")
Testing for multicollinearity. When there are strong linear relationships among the predictors in a regression analysis, "the precision of the estimated regression coefficients in linear models declines compared to what it would have been were the predictors uncorrelated with each other" (Fox: 359). When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed. The term collinearity implies that two variables are near-perfect linear combinations of one another; when more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.
The primary concern is that as the degree of multicollinearity increases, the regression estimates of the coefficients become unstable and the standard errors of the coefficients can become wildly inflated. In this section we use R functions that help to detect multicollinearity.
We can use the vif() function from the car package after fitting the regression to check for multicollinearity; VIF stands for variance inflation factor. As a rule of thumb, a variable whose VIF is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check the degree of collinearity; a tolerance lower than 0.1 is comparable to a VIF of 10 and means that the variable could be considered a linear combination of the other independent variables.
vif(Linear)
## Error: model contains fewer than 2 terms
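vif() fails here because the model has only one predictor; with two or more predictors it runs as expected. As a sketch, here is an illustrative two-predictor model built from the same data (the full multiple-regression model appears later):
m2 <- lm(Interest ~ F2 + LLen, data = cdf)   # illustrative model with two predictors
vif(m2)                                      # variance inflation factor for each predictor
1 / vif(m2)                                  # tolerance, the reciprocal of VIF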
Testing for heteroskedasticity
ncvTest(Linear)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.73 Df = 1 p = 0.02964
plot(Linear)
The plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results; distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
In statistics, leverage is a term used in connection with regression analysis and, in particular, with analyses aimed at identifying observations that are far from the corresponding average predictor values. Leverage points are observations made at extreme or outlying values of the independent variables, such that the lack of neighbouring observations means the fitted regression model will pass close to that particular observation. In other words, high-leverage points are outliers with respect to the independent variables; the leverage is typically defined as the diagonal of the hat matrix.
A high-leverage point does not necessarily have a large effect on the fitted model. Influential points are those that cause large changes in the parameter estimates when they are deleted; although an influential point will typically have high leverage, a high-leverage point is not necessarily influential. A leverage point is called a bad leverage point if it is also an outlier in the response, that is, its y-value does not follow the pattern set by the other data points, and a good leverage point if it is not.
Cook's distance (Cook's D) is a commonly used estimate of the influence of a data point when performing least-squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate data points that are particularly worth checking for validity, or to indicate regions of the design space where it would be good to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
Another measure is the Mahalanobis distance: a descriptive statistic that provides a relative measure of a data point's distance from a common central point. It is a unitless, scale-invariant measure introduced by P. C. Mahalanobis in 1936 and is used to identify and gauge the similarity of an unknown sample to a known one. It differs from Euclidean distance in that it takes the correlations of the data set into account, and it can also be viewed as a multivariate effect size.
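Base R provides mahalanobis() for computing these distances; a sketch applied to a few numeric columns of cdf (the column choice and the 0.999 cutoff are illustrative assumptions, not part of the original analysis):
X <- cdf[, c("Amount.Requested", "Monthly.Income", "Open.CREDIT.Lines", "F2")]  # illustrative subset
md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared Mahalanobis distances
cutoff <- qchisq(0.999, df = ncol(X))                       # approximate chi-square cutoff
which(md2 > cutoff)                                         # candidate multivariate outliers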
Below, the residuals are standardized. We should not expect to see any standardized residual outside the range -3 to +3; if there is one beyond this range, further investigation is needed.
standard_residual <- rstandard(Linear)
hist(standard_residual)
As all values are within the range -3 to +3, further investigation is not required. To determine the coefficients of the linear model:
coefs <- coefficients(Linear)
coefs
## (Intercept) cdf$F2
## 19.0719 -0.4235
The null hypothesis of the linear regression is that the slope of the line relating the response variable to the predictor variable is zero. From the summary below, the p-value is far below 0.05, so the null hypothesis is rejected and we conclude that the slope of the regression line is not zero. Further, the adjusted R2 value is about 0.50, so the model explains roughly 50% of the variation.
summary(Linear)
##
## Call:
## lm(formula = cdf$Interest ~ cdf$F2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.990 -2.136 -0.456 1.835 10.194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.07189 0.13314 143.2 <2e-16 ***
## cdf$F2 -0.42350 0.00842 -50.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.95 on 2496 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.503
## F-statistic: 2.53e+03 on 1 and 2496 DF, p-value: <2e-16
confint(Linear)
## 2.5 % 97.5 %
## (Intercept) 18.81 19.333
## cdf$F2 -0.44 -0.407
anova(Linear)
## Analysis of Variance Table
##
## Response: cdf$Interest
## Df Sum Sq Mean Sq F value Pr(>F)
## cdf$F2 1 21928 21928 2527 <2e-16 ***
## Residuals 2496 21659 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted_values <- fitted(Linear)
The code below produces a graph of interest rate vs. FICO value (F2), with the fitted line and residual segments.
plot(cdf$Interest ~ cdf$F2)
abline(lm(cdf$Interest ~ cdf$F2))
segments(cdf$F2, Fitted_values, cdf$F2, cdf$Interest)
The code below produces a scatter plot of fitted values vs. residuals and a normal Q-Q plot of the residuals.
plot(Fitted_values, residual_values)
qqnorm(residual_values)
Prediction intervals and confidence intervals are not the same thing. Unfortunately the terms are often confused, and I am frequently correcting the error in students' papers and in articles I am reviewing or editing.
A prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval. For example, I might give an 80% interval for the forecast of GDP in 2014. The actual GDP in 2014 should lie within the interval with probability 0.8. Prediction intervals can arise in Bayesian or frequentist statistics.
Prediction interval: also includes the uncertainty about future individual values.
Confidence interval: an interval estimate of a population parameter based on a random sample; it reflects the uncertainty about the regression line (how well the line is determined).
Degree of confidence: the probability that the confidence interval captures the true population parameter.
A confidence interval is an interval associated with a parameter and is a frequentist concept. The parameter is assumed to be non-random but unknown, and the confidence interval is computed from data. Because the data are random, the interval is random. A 95% confidence interval will contain the true parameter with probability 0.95. That is, with a large number of repeated samples, 95% of the intervals would contain the true parameter.
The difference between a prediction interval and a confidence interval is the standard error.
The standard error for a confidence interval on the mean takes into account the uncertainty due to sampling: the line you computed from your sample will differ from the line that would have been computed if you had the entire population, and the standard error accounts for this uncertainty.
The standard error for a prediction interval on an individual observation takes into account the uncertainty due to sampling, as above, but also the variability of individuals around the predicted mean. The standard error for the prediction interval is therefore larger, and hence the prediction interval is wider than the confidence interval.
A confidence interval gives a range for E[y|x], as noted above; a prediction interval gives a range for y itself. Naturally, our best guess for y is E[y|x], so both intervals are centered around the same value, the fitted value.
The standard errors differ: we can estimate E[y|x] more precisely than we can predict y itself, because predicting y also requires accounting for the variance that comes from the true error term.
To illustrate the difference, imagine that we could obtain perfect estimates of the regression coefficients. Then our estimate of E[y|x] would be perfect, but we still would not know y itself, because of the true error term. The confidence "interval" would collapse to a point, since we estimate E[y|x] exactly, but the prediction interval would still have width because it takes the true error term into account. Hence a prediction interval will be wider than a confidence interval. Reference: http://stats.stackexchange.com/questions/16493/difference-between-confidence-intervals-and-prediction-intervals
Difference between the fitted() and predict() functions in R
The fitted function returns the y-hat values associated with the data used to fit the model. The predict function returns predictions for a new set of predictor variables. If you don't specify a new set of predictor variables then it will use the original data by default giving the same results as fitted for some models, but if you want to predict for a new set of values then you need predict. The predict function often also has options for which type of prediction to return, the linear predictor, the prediction transformed to the response scale, the most likely category, the contribution of each term in the model, etc. –reference http://stackoverflow.com/questions/12201439/is-there-a-difference-between-the-r-functions-fitted-and-predict
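For the simple model above this can be verified directly (a quick check, with no new data supplied to predict()):
all.equal(fitted(Linear), predict(Linear))   # identical when predict() is given no newdata
head(cbind(fitted = fitted(Linear), predicted = predict(Linear)))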
To compute confidence interval
Predi_C <- predict(Linear, int = "c")
To compute prediction interval
Predi_P <- predict(Linear, int = "p")
## Warning: Predictions on current data refer to _future_ responses
Plot of the original data with the fitted regression line, along with the confidence and prediction intervals:
plot(cdf$Interest ~ cdf$F2)
abline(Linear)
matlines(cdf$F2, Predi_C)
matlines(cdf$F2, Predi_P)
To find the interest rate for a new value of F2 (here the new value is 50, on the same scale as the F2 variable used in the model):
c2df <- cdf
attach(c2df)
## The following object(s) are masked from 'q':
##
## Amount.Requested, DI_Ratio, F2,
## Inquiries.in.the.Last.6.Months, Interest, LLen,
## Monthly.Income, Open.CREDIT.Lines, Revolving.CREDIT.Balance
Linear <- lm(Interest ~ F2)
plot(Linear)
new1data <- data.frame(F2 = 50) # A data frame of new predictor values; only one value is given here
pp <- predict(Linear, int = "p", newdata = new1data)
pc <- predict(Linear, int = "c", newdata = new1data)
The summary shows that the FICO value explains about 50% of the variation in interest rate, because the R-squared value is 0.50.
To explain the remaining variation, a multiple regression is fitted with the variables identified in the correlation analysis.
To look at the relations between the variables in the data, the diagram below is used.
pairs(cdf)
To check further, a model selection process is used. There are three model selection methods: 1. backward selection, 2. forward selection, 3. stepwise regression. R has the step() function, which searches for a model using the Akaike Information Criterion (AIC). AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model; it is founded on information entropy and offers a relative estimate of the information lost when a given model is used to represent the process that generates the data.
Rg1 <- lm(cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio +
cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested)
summary(Rg1)
##
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months +
## cdf$DI_Ratio + cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines +
## cdf$Amount.Requested)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.570 -1.381 -0.168 1.205 9.982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.21e+01 2.31e-01 52.24 < 2e-16
## cdf$Inquiries.in.the.Last.6.Months 3.52e-01 3.39e-02 10.38 < 2e-16
## cdf$DI_Ratio 1.23e-03 6.05e-03 0.20 0.84
## cdf$LLen 1.34e-01 4.57e-03 29.41 < 2e-16
## cdf$F2 -4.36e-01 6.10e-03 -71.61 < 2e-16
## cdf$Open.CREDIT.Lines -5.03e-02 1.01e-02 -4.98 6.7e-07
## cdf$Amount.Requested 1.47e-04 5.96e-06 24.67 < 2e-16
##
## (Intercept) ***
## cdf$Inquiries.in.the.Last.6.Months ***
## cdf$DI_Ratio
## cdf$LLen ***
## cdf$F2 ***
## cdf$Open.CREDIT.Lines ***
## cdf$Amount.Requested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.06 on 2491 degrees of freedom
## Multiple R-squared: 0.757, Adjusted R-squared: 0.757
## F-statistic: 1.3e+03 on 6 and 2491 DF, p-value: <2e-16
confint(Rg1)
## 2.5 % 97.5 %
## (Intercept) 11.6054002 12.5106546
## cdf$Inquiries.in.the.Last.6.Months 0.2857053 0.4187286
## cdf$DI_Ratio -0.0106278 0.0130914
## cdf$LLen 0.1253267 0.1432347
## cdf$F2 -0.4484503 -0.4245450
## cdf$Open.CREDIT.Lines -0.0701221 -0.0305200
## cdf$Amount.Requested 0.0001354 0.0001588
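One point worth noting before running step(): it ranks models with extractAIC(), which drops additive constants from the log-likelihood, so for lm models the AIC values printed below differ from AIC() by a constant. A quick way to see both values:
extractAIC(Rg1)   # (equivalent degrees of freedom, AIC) as used internally by step()
AIC(Rg1)          # full AIC; differs from extractAIC() by an additive constant for lm models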
step(Rg1, direction = "both") # bi-directional search
## Start: AIC=3618
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio +
## cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
##
## Df Sum of Sq RSS AIC
## - cdf$DI_Ratio 1 0 10573 3616
## <none> 10573 3618
## - cdf$Open.CREDIT.Lines 1 105 10678 3641
## - cdf$Inquiries.in.the.Last.6.Months 1 458 11030 3722
## - cdf$Amount.Requested 1 2584 13156 4162
## - cdf$LLen 1 3670 14243 4360
## - cdf$F2 1 21765 32338 6409
##
## Step: AIC=3616
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$LLen +
## cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
##
## Df Sum of Sq RSS AIC
## <none> 10573 3616
## + cdf$DI_Ratio 1 0 10573 3618
## - cdf$Open.CREDIT.Lines 1 117 10690 3642
## - cdf$Inquiries.in.the.Last.6.Months 1 458 11031 3720
## - cdf$Amount.Requested 1 2586 13159 4161
## - cdf$LLen 1 3671 14243 4359
## - cdf$F2 1 22734 33306 6480
##
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months +
## cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested)
##
## Coefficients:
## (Intercept) cdf$Inquiries.in.the.Last.6.Months
## 12.073053 0.351886
## cdf$LLen cdf$F2
## 0.134284 -0.436751
## cdf$Open.CREDIT.Lines cdf$Amount.Requested
## -0.049597 0.000147
step(Rg1, direction = "back") # backward search
## Start: AIC=3618
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio +
## cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
##
## Df Sum of Sq RSS AIC
## - cdf$DI_Ratio 1 0 10573 3616
## <none> 10573 3618
## - cdf$Open.CREDIT.Lines 1 105 10678 3641
## - cdf$Inquiries.in.the.Last.6.Months 1 458 11030 3722
## - cdf$Amount.Requested 1 2584 13156 4162
## - cdf$LLen 1 3670 14243 4360
## - cdf$F2 1 21765 32338 6409
##
## Step: AIC=3616
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$LLen +
## cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
##
## Df Sum of Sq RSS AIC
## <none> 10573 3616
## - cdf$Open.CREDIT.Lines 1 117 10690 3642
## - cdf$Inquiries.in.the.Last.6.Months 1 458 11031 3720
## - cdf$Amount.Requested 1 2586 13159 4161
## - cdf$LLen 1 3671 14243 4359
## - cdf$F2 1 22734 33306 6480
##
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months +
## cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested)
##
## Coefficients:
## (Intercept) cdf$Inquiries.in.the.Last.6.Months
## 12.073053 0.351886
## cdf$LLen cdf$F2
## 0.134284 -0.436751
## cdf$Open.CREDIT.Lines cdf$Amount.Requested
## -0.049597 0.000147
step(Rg1, direction = "forward") # Forward search
## Start: AIC=3618
## cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months + cdf$DI_Ratio +
## cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines + cdf$Amount.Requested
##
## Call:
## lm(formula = cdf$Interest ~ cdf$Inquiries.in.the.Last.6.Months +
## cdf$DI_Ratio + cdf$LLen + cdf$F2 + cdf$Open.CREDIT.Lines +
## cdf$Amount.Requested)
##
## Coefficients:
## (Intercept) cdf$Inquiries.in.the.Last.6.Months
## 12.058027 0.352217
## cdf$DI_Ratio cdf$LLen
## 0.001232 0.134281
## cdf$F2 cdf$Open.CREDIT.Lines
## -0.436498 -0.050321
## cdf$Amount.Requested
## 0.000147
summary.aov(Rg1)
## Df Sum Sq Mean Sq F value Pr(>F)
## cdf$Inquiries.in.the.Last.6.Months 1 1182 1182 278.39 <2e-16 ***
## cdf$DI_Ratio 1 1263 1263 297.64 <2e-16 ***
## cdf$LLen 1 7529 7529 1773.96 <2e-16 ***
## cdf$F2 1 20456 20456 4819.76 <2e-16 ***
## cdf$Open.CREDIT.Lines 1 0 0 0.11 0.74
## cdf$Amount.Requested 1 2584 2584 608.71 <2e-16 ***
## Residuals 2491 10573 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(Rg1)
The summary shows that the R-squared value of the multiple regression is 0.757. The other variables that affect interest rate (with their correlations with interest rate) are: a. Amount.Requested (0.3318), b. Inquiries.in.the.Last.6.Months (0.1646), c. DI_Ratio (0.172), d. LLen (0.4235), e. the FICO value F2 (-0.7092), and f. Open.CREDIT.Lines (0.0903).
Beta Coefficients
Beta (standardized) coefficients are the slopes we would get if all the variables were on the same scale, which is done by converting them to z-scores before doing the regression. Betas allow a comparison of the relative importance of the predictors, which neither the unstandardized coefficients nor the p-values do. Scaling, or standardizing, the data vectors can be done with the scale() function. The intercept then goes to zero, and the slopes become standardized beta values.
Partial Correlations
Another way to remove the effect of a possible lurking variable from the correlation of two other variables is to calculate a partial correlation. Base R has no built-in partial correlation function (packages such as ppcor provide one), so one can define a small helper function, as sketched below.
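The helper below is a minimal first-order partial-correlation sketch written for illustration (the partial.cor name is arbitrary; the pcor() function in the ppcor package offers a fuller implementation):
# Partial correlation of x and y, controlling for z (first-order formula)
partial.cor <- function(x, y, z) {
    rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
    (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
}
# Example: Interest and LLen, controlling for the FICO value F2
partial.cor(cdf$Interest, cdf$LLen, cdf$F2)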
# Standardized (beta-coefficient) regression: all variables converted to z-scores with scale()
Rg2 <- lm(scale(cdf$Interest) ~ scale(cdf$Inquiries.in.the.Last.6.Months) +
scale(cdf$DI_Ratio) + scale(cdf$LLen) + scale(cdf$F2) + scale(cdf$Open.CREDIT.Lines) +
scale(cdf$Amount.Requested))