library(RCurl)
## Loading required package: bitops
library(knitr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(usdm)
## Warning: package 'usdm' was built under R version 3.4.4
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.4.4
## Loading required package: raster
## Warning: package 'raster' was built under R version 3.4.4

Data: IQ and Physical Characteristics

References:

        PennState Eberly College of Science
        https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
        https://www.r-bloggers.com/scatterplot-matrices/
        http://statisticsbyjim.com/regression/interpret-adjusted-r-squared-predicted-r-squared-regression/
        https://rpubs.com/FelipeRego/MultipleLinearRegressionInRFirstSteps
        http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis
        https://uc-r.github.io/linear_regression
        http://analyticspro.org/2016/03/07/r-tutorial-how-to-use-diagnostic-plots-for-regression-models/
        https://courses.washington.edu/b515/l5.pdf
        http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients

NOTES:

When a P value is less than or equal to the significance level, you reject the null hypothesis. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor’s value are related to changes in the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response.

Hypothesis: Are a person’s brain size and body size predictive of his or her intelligence?

PIQ – represents the performance IQ scores (Response variable)

Brain – Brain size from MRI scan (Predictor variable)

Height – inches (Predictor variable)

Weight – pounds (Predictor variable)

Gender – 0 = Female; 1 = Male (Predictor variable)

# Read the data from GitHub and preview a random sample of 10 rows
sc.data <- read.csv(text = getURL("https://raw.githubusercontent.com/AjayArora35/compmath/master/AArora_Posting12.csv"), header = T, stringsAsFactors = F)
kable(sc.data[sample(nrow(sc.data), 10), ], align='l', caption = "Sample of 10 rows", row.names=FALSE)
Sample of 10 rows
PIQ Brain Height Weight Gender
132 85.78 62.5 127 1
102 94.51 73.5 178 1
134 79.06 62.0 122 1
102 83.18 63.0 114 1
124 86.67 66.5 159 0
131 99.13 64.5 138 0
131 93.55 72.0 171 0
110 106.25 77.0 187 0
124 81.69 64.5 118 1
110 99.79 75.5 192 1
summary(sc.data)
##       PIQ             Brain            Height          Weight     
##  Min.   : 72.00   Min.   : 79.06   Min.   :62.00   Min.   :106.0  
##  1st Qu.: 89.25   1st Qu.: 85.48   1st Qu.:66.00   1st Qu.:135.2  
##  Median :115.00   Median : 90.54   Median :68.00   Median :146.5  
##  Mean   :111.34   Mean   : 90.68   Mean   :68.42   Mean   :151.1  
##  3rd Qu.:128.00   3rd Qu.: 94.95   3rd Qu.:70.38   3rd Qu.:172.0  
##  Max.   :150.00   Max.   :107.95   Max.   :77.00   Max.   :192.0  
##      Gender   
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.5  
##  Mean   :0.5  
##  3rd Qu.:1.0  
##  Max.   :1.0
  1. Scatter Plot Matrix
plot(sc.data)

Reading the Plots:

A Scatter Plot Matrix contains a scatter plot of each pair of variables arranged in an orderly array. For each plot in the matrix, the variable on the y-axis appears at the left end of the plot’s row and the variable on the x-axis at the bottom of the plot’s column. The variable names run along the diagonal from top left to bottom right.

Each variable is then plotted against every other variable. For example, the second square in the first column is an individual scatterplot of PIQ and Brain, with PIQ on the x-axis and Brain on the y-axis. The same pair appears again, with the axes swapped, in the second column of the top row. In essence, the boxes in the upper right half of the whole scatterplot matrix are mirror images of the plots in the lower left half.

Purpose of Scatter Plot:

One of the main purposes of a scatter plot matrix is data checking: are there any data errors? In addition, the scatter plots depict the “marginal relationships” between each pair of variables without regard to the other variables.
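
As an optional variation (my own sketch, not part of the original analysis), base R’s pairs() can draw just the lower triangle of the matrix with a smoothing line added to each panel, which makes the marginal relationships easier to scan:

# Lower-triangle scatter plot matrix of the four quantitative variables,
# with a loess smoother overlaid in each panel (base graphics only)
pairs(sc.data[, 1:4], lower.panel = panel.smooth, upper.panel = NULL)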

Linear Regression Model (Without Dichotomous Term):

y = b0 + b1x1 + b2x2 + b3x3

where

y = PIQ

x1 = Brain

x2 = Height

x3 = Weight

  2. Linear Model
sc.lm <- lm(PIQ ~ Brain + Height + Weight, data = sc.data)
summary(sc.lm)
## 
## Call:
## lm(formula = PIQ ~ Brain + Height + Weight, data = sc.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## Brain        2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215

Regression Equation (y): PIQ = 111.4 + 2.060 Brain - 2.732 Height + 0.0006 Weight

The R-squared is 29.49%. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 - 100% scale.

0% represents a model that does not explain any of the variation in the response variable around its mean. The mean of the dependent variable predicts the dependent variable as well as the regression model.

100% represents a model that explains all of the variation in the response variable around its mean. Usually, the larger the R2, the better the regression model fits your observations.

So, in this instance, 29.49% is well below 50%, which suggests that brain size, height, and weight may not be strong predictors.

The adjusted R-squared is 23.27%. That it is noticeably lower than the R-squared adds further evidence that at least one of the existing predictors does not improve the model.
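
To make these two statistics concrete, both can be reproduced by hand from the fitted model (a quick base R check; r2 and adj_r2 are names introduced here):

# R-squared: 1 - RSS/TSS, i.e. the share of variance in PIQ explained by the model
rss <- sum(residuals(sc.lm)^2)
tss <- sum((sc.data$PIQ - mean(sc.data$PIQ))^2)
r2  <- 1 - rss / tss                              # ~0.2949
# Adjusted R-squared penalizes for the number of predictors p
n <- nrow(sc.data); p <- 3
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # ~0.2327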

The P-values for the t-tests appearing in the table of estimates suggest that the slope parameters for Brain (P = 0.001) and Height (P = 0.033) are significantly different from 0, while the slope parameter for Weight (P = 0.998) is not.

The P-value for the analysis of variance F-test (P = 0.007) suggests that the model containing Brain, Height, and Weight is more useful in predicting intelligence than not taking the three predictors into account. However, this does not tell us that the model with the three predictors is the best model.
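
As an optional follow-up (not in the original output), confidence intervals for the slopes can be pulled with base R’s confint(); an interval that excludes 0 corresponds to a slope significantly different from 0:

confint(sc.lm, level = 0.95)  # Weight's interval should straddle 0; Brain's should not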

  3. Multicollinearity

In regression, “multicollinearity” refers to predictors that are correlated with other predictors. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your response variable, but also to each other. In other words, it results when you have factors that are a bit redundant.

Multicollinearity increases the standard errors of the coefficients. Increased standard errors in turn mean that the coefficients of some independent variables may be found not to be significantly different from 0. In other words, by inflating the standard errors, multicollinearity can make variables appear statistically insignificant when they should be significant. Without multicollinearity (and thus with lower standard errors), those coefficients might be significant.

# Correlation matrix of the four quantitative variables, displayed as numbers
newdatacor = cor(sc.data[1:4])
corrplot(newdatacor, method = "number")

  4. Variance Inflation Factors
vif(sc.data)
##   Variables      VIF
## 1       PIQ 1.418855
## 2     Brain 2.268452
## 3    Height 2.624187
## 4    Weight 2.027519
## 5    Gender 1.053944

The correlation plot shows some positive correlation between Height and Weight, but the degree of correlation is not very intense. This is corroborated by the variance inflation factors, where none of the values are above 10. Hence, multicollinearity is not suspected.
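
One caveat: vif(sc.data) was run on the whole data frame, so it also reports a VIF for the response PIQ. A variation restricted to the predictors, which is arguably the more conventional usage, would be (a sketch; output not shown):

vif(sc.data[, c("Brain", "Height", "Weight", "Gender")])  # usdm::vif on predictors only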

  5. Residual Analysis
plot(sc.lm)

The Residuals versus Fitted graph plots the residuals against the fitted values. The dotted line at y = 0 indicates the fit line: any point on the fit line has zero residual, points above it have positive residuals, and points below have negative residuals.

The Q-Q plot is close to a straight line. The closer the points fall to the diagonal line, the more confidently we can interpret the residuals as normally distributed.

In addition, the residuals versus predicted values scatter plot does not have a funnel shape. A horizontal band of equally (randomly) spread points is what we want to see. This plot also helps to identify any outliers.

The Scale-Location plot shows the spread of the points across the range of predicted values. One of the assumptions of regression is homoscedasticity, i.e., the variance should be reasonably equal across the predictor range. A horizontal red line is ideal and would indicate that the residuals have uniform variance across the range; as the residuals spread wider from each other, the red line slopes upward.
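
A formal complement to the Scale-Location plot is the Breusch-Pagan test of constant variance (a sketch assuming the lmtest package, which is not loaded above, is installed):

library(lmtest)
bptest(sc.lm)  # a small p-value would indicate heteroscedasticity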

Conclusion (Without Dichotomous Term):

hist(sc.lm$residuals)

Although the histogram does not show an ideal bell-shaped curve, it does not suggest any violation of the assumption of normally distributed errors.
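
A numeric companion to the histogram is the Shapiro-Wilk normality test from base R (a sketch; a p-value above 0.05 is consistent with normally distributed residuals):

shapiro.test(residuals(sc.lm))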

NULL Hypothesis: No association between any of the independent variables and the outcome (b1 = b2 = … = bk = 0) (R^2 = 0)

Alternate Hypothesis: Association between at least one of the independent variables and the outcome (at least one slope different from 0) (R^2 <> 0)

From the output we see that the F-statistic for the global test is 4.741 on 3 and 34 DF (df = 3, 34), with a corresponding p-value of 0.007215. Thus, we reject the null and conclude that there is an association between at least one of the independent variables and the outcome. Furthermore, by rejecting the null, we also conclude that the Multiple R2 is different from 0; specifically, R2 = 0.2949. The interpretation of Multiple R2 in a multiple linear regression setting is quite similar to a simple linear regression setting. Using the current example, we can interpret the Multiple R2 to mean that 29.49% of the variability in PIQ is explained by Brain, Height, and Weight.

  6. Quadratic Term

Definitions:

Influence: The influence of an observation can be thought of in terms of how much the predicted scores would change if the observation were excluded. Cook’s distance is a good measure of an observation’s influence.

Leverage: The leverage of an observation is based on how much the observation’s values on the predictor variables differ from the means of the predictor variables. The greater the leverage of an observation, the greater its potential influence.

The Residuals versus Leverage Plot

In this plot, the dotted red lines mark Cook’s distance, and the areas of interest are the regions outside the dotted lines in the top right and bottom right corners. If a point falls in one of those regions, we say the observation has high leverage, and excluding it could substantially change the model.

Outliers do not always have high leverage, nor vice versa. To remediate high leverage, we have a few choices:

  1. Justify the inclusion and keep the model as is

  2. Include a quadratic term

  3. Exclude the observation

Since our leverage plot does not contain any high-leverage outliers, applying one of the remediation techniques may actually hurt the model. We will nevertheless try one of the choices, namely adding a quadratic term.
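
Before fitting the quadratic, the Cook’s distances behind the Residuals versus Leverage plot can also be checked directly (a base R sketch; the 4/n cutoff is one common rule of thumb, and cd is a name introduced here):

cd <- cooks.distance(sc.lm)
which(cd > 4 / nrow(sc.data))  # flag observations with unusually high influence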

# Note: inside an R formula, ^ means crossing, so (Brain^2) reduces to Brain (see conclusion below)
sc.lm2 <- lm(PIQ ~ (Brain^2) + Brain + Height + Weight, data = sc.data)
summary(sc.lm2)
## 
## Call:
## lm(formula = PIQ ~ (Brain^2) + Brain + Height + Weight, data = sc.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## Brain        2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215
plot(sc.lm2)

Conclusion With Quadratic:

The summary is unchanged by the quadratic addition: all the metrics are the same as in the non-quadratic model, and the residuals are unaffected as well. Part of the reason is R’s formula syntax: (Brain^2) is not wrapped in I(), so the formula parser reduces it to Brain and no squared term is actually added. Beyond that, since the original data did not have high-leverage points to remediate, a quadratic term was not expected to change much in any case.
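
For reference, a sketch of how the squared term would actually be entered (sc.lm2b is a hypothetical name; output not shown, since this fit was not part of the original run):

sc.lm2b <- lm(PIQ ~ I(Brain^2) + Brain + Height + Weight, data = sc.data)
summary(sc.lm2b)  # I() protects ^ from formula expansion, so Brain^2 is truly included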

  7. Linear Model with Dichotomous Term
sc.lm3 <- lm(PIQ ~ Brain + Height + Weight + Gender, data = sc.data)
summary(sc.lm3)
## 
## Call:
## lm(formula = PIQ ~ Brain + Height + Weight + Gender, data = sc.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.370 -11.661  -4.235  13.971  51.397 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 110.404937  64.465977   1.713  0.09617 . 
## Brain         2.072971   0.582803   3.557  0.00116 **
## Height       -2.742923   1.251544  -2.192  0.03557 * 
## Weight        0.001776   0.200287   0.009  0.99298   
## Gender        0.748493   6.689704   0.112  0.91159   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.09 on 33 degrees of freedom
## Multiple R-squared:  0.2952, Adjusted R-squared:  0.2098 
## F-statistic: 3.456 on 4 and 33 DF,  p-value: 0.01827

Comparison Between the Linear Models With and Without the Dichotomous Term

Many of the coefficients changed slightly. For example, the residual standard error with the dichotomous term included is 20.09 on 33 degrees of freedom, versus 19.79 on 34 degrees of freedom without it. In addition, the Multiple R-squared is 29.52% and the adjusted R-squared is 20.98%, compared with 29.49% and 23.27%. Similarly, the other coefficients changed only slightly. There are no sharp departures in the coefficients from adding the dichotomous term.
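
Because the model without Gender is nested inside the model with it, a partial F-test gives a direct comparison (base R’s anova(); a sketch, output not shown):

anova(sc.lm, sc.lm3)  # tests whether adding Gender significantly improves the fit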

Conclusion With Dichotomous Term:

NULL Hypothesis: No association between any of the independent variables and the outcome (b1 = b2 = … = bk = 0) (R^2 = 0)

Alternate Hypothesis: Association between at least one of the independent variables and the outcome (at least one slope different from 0) (R^2 <> 0)

From the output we see that the F-statistic for the global test is 3.456 on 4 and 33 DF (df = 4, 33), with a corresponding p-value of 0.01827. Thus, we reject the null and conclude that there is an association between at least one of the independent variables and the outcome. Furthermore, by rejecting the null, we also conclude that the Multiple R2 is different from 0; specifically, R2 = 0.2952. Using the current example, we can interpret the Multiple R2 to mean that 29.52% of the variability in PIQ is explained by Brain, Height, Weight, and Gender.

  8. A First-Order Model with One Binary Predictor and One Quantitative Predictor
sc.lm4 <- lm(PIQ ~ Brain + Gender, data = sc.data)
summary(sc.lm4)
## 
## Call:
## lm(formula = PIQ ~ Brain + Gender, data = sc.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.92 -17.66  -1.93  16.94  41.69 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   5.1820    46.2381   0.112   0.9114  
## Brain         1.1724     0.4987   2.351   0.0245 *
## Gender       -0.2880     7.1412  -0.040   0.9681  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.51 on 35 degrees of freedom
## Multiple R-squared:  0.1428, Adjusted R-squared:  0.0938 
## F-statistic: 2.915 on 2 and 35 DF,  p-value: 0.06746
plot(fitted(sc.lm4), resid(sc.lm4))  # residuals versus fitted values
abline(0,0)                          # horizontal reference line at zero

The residuals appear to be evenly distributed about the mean.

hist(resid(sc.lm4))

A symmetric, bell-shaped distribution is not readily apparent in the histogram of residuals.

qqnorm(resid(sc.lm4))
qqline(resid(sc.lm4))

The residuals diverge from the straight line at both ends, away from the center.

Conclusion (Quantitative Term versus Binary (Gender) Predictor)

NULL Hypothesis: The contribution of the Gender slope coefficient is 0

Alternative Hypothesis: The contribution of the Gender slope coefficient <> 0

The first model’s metrics were a Multiple R-squared of 29.49%, an adjusted R-squared of 23.27%, an overall F-statistic of 4.741, and a p-value of 0.0072. This model yields a Multiple R-squared of 14.28%, an adjusted R-squared of 9.38%, an F-statistic of 2.915, and a p-value of 0.067, which is > 0.05. It is clear that Gender does not contribute positively to the model and is not statistically significant.
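
As a final side-by-side check (my own sketch, not in the original analysis), an information-criterion comparison of the fitted models can be made with base R; lower AIC indicates a better trade-off between fit and complexity:

AIC(sc.lm, sc.lm3, sc.lm4)  # compare the three distinct models fitted above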