child_IQ <- c(87,91,94,98,103,108,111,123)
Mother_IQ <- c(94,96,89,102,98,94,116,117)

print(child_IQ)
## [1]  87  91  94  98 103 108 111 123
print(Mother_IQ)
## [1]  94  96  89 102  98  94 116 117
model_iq <- lm(Mother_IQ ~ child_IQ)
summary(model_iq)
## 
## Call:
## lm(formula = Mother_IQ ~ child_IQ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.965  -4.226   2.223   3.594   8.971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  30.6439    22.7449   1.347   0.2265  
## child_IQ      0.6882     0.2220   3.101   0.0211 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.965 on 6 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.5517 
## F-statistic: 9.613 on 1 and 6 DF,  p-value: 0.0211
#SSE - sum((actual - predicted)^2)
SSE <- sum(model_iq$residuals^2)

#SSR - sum((predicted - ybar)^2)
ybar <- mean(Mother_IQ)
SSR <- sum((model_iq$fitted.values-ybar)^2)

#SST - sum((y - ybar)^2), which equals SSE + SSR
SST <- SSE+SSR

SSE = sum((y - yhat)^2) - sum of squared errors (residuals); assesses how well the regression line fits the data.
SSR = sum((yhat - ybar)^2) - sum of squares due to regression; the variation explained by the model.
SST = sum((y - ybar)^2) - total sum of squares; assesses how well the mean fits the data.
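As a quick sanity check, SST computed directly from its definition should equal SSE + SSR, up to floating-point error; a minimal sketch:

#verify the identity SST = SSE + SSR by computing SST from its definition
SST_direct <- sum((Mother_IQ - ybar)^2)
all.equal(SST_direct, SSE + SSR)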

Residuals

model_iq$residuals
##          1          2          3          4          5          6 
##   3.486356   2.733723  -6.330753   3.916614  -3.524178 -10.964970 
##          7          8 
##   8.970555   1.712654

Residuals, or errors, are the differences between the actual and predicted values of the y variable.
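The same residuals can be reproduced by hand from this definition; a small sketch:

#actual minus predicted
Mother_IQ - model_iq$fitted.values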

#Residual Standard Error: sqrt(SSE/(n-q))
#n = no. of observations; q = no. of estimated coefficients (intercept & slope)

sqrt(SSE/(8-2))
## [1] 6.965398

The residual standard error is an estimate of the standard deviation of the errors; it measures how far, on average, the observed values of the dependent variable fall from the regression line.
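For comparison, base R's sigma() extracts the residual standard error directly from the fitted model; it should match the 6.965 computed above:

sigma(model_iq)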

Coefficient of Determination (r²):

#Coefficient of determination
#multiple R-squared
#SSR/SST or 1-SSE/SST

r.sq <- 1-(SSE/SST)
r.sq
## [1] 0.6157087
SSR/SST
## [1] 0.6157087

r² = SSR/SST. Interpretation of r-squared (here 0.6157): about 61% of the variability in mother IQ is explained by the linear relationship with child IQ.

It tells how well the x variable explains the y variable.
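As a check, the same value is stored in the model summary and should match r.sq above:

summary(model_iq)$r.squared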

#Coefficient of Correlation (r):
#r = (sign of b1) * sqrt(r²); here b1 is positive

sqrt(r.sq)
## [1] 0.7846711

Note: r², the multiple R-squared or coefficient of determination, tells only the percentage of variability explained; it says nothing about the direction of the relationship, positive or negative.

r, the coefficient of correlation, conveys the strength of the relationship along with its direction, whether negative or positive.
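In simple linear regression, r is just the sample correlation between x and y, so cor() should reproduce the value above (b1 is positive here):

cor(child_IQ, Mother_IQ)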

Adjusted R-Squared

#adj.R-square
#1 - ((SSE/(n-q)) / (SST/(n-1)))

1-((SSE/6)/(SST/7))
## [1] 0.5516602

It has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.
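Again, the same value is available from the model summary and should match the manual calculation above:

summary(model_iq)$adj.r.squared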

Significance of model

summary(model_iq)
## 
## Call:
## lm(formula = Mother_IQ ~ child_IQ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.965  -4.226   2.223   3.594   8.971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  30.6439    22.7449   1.347   0.2265  
## child_IQ      0.6882     0.2220   3.101   0.0211 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.965 on 6 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.5517 
## F-statistic: 9.613 on 1 and 6 DF,  p-value: 0.0211

Hypothesis test of significance - t-test

H0: b1 = 0; Ha: b1 != 0

t-test statistic: t = (b1 - 0) / b1.std.err

Standard error

#b1.std.err = sqrt(SSE/(n-q)) / sqrt(sum((x - xbar)^2))

n <- length(Mother_IQ)
q <- length(model_iq$coefficients)

n
## [1] 8
q
## [1] 2
sq.sse <- sqrt(SSE/(n-q))

#sqrt(sum((x-xbar)^2))

x <- child_IQ
xbar <- mean(x)

sq.xxbr <- sqrt(sum((x-xbar)^2))

std.err <- sq.sse/sq.xxbr

std.err
## [1] 0.2219501
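The same standard error can be read off the coefficient table stored in the model summary; it should match std.err above:

summary(model_iq)$coefficients["child_IQ", "Std. Error"]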

t-value

#b0/b0.std.err
#b1/b1.std.err

b0 <- 30.6438634
b1 <- 0.6881584

b1/std.err
## [1] 3.100509
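As a check, the t value in the summary's coefficient table should agree:

summary(model_iq)$coefficients["child_IQ", "t value"]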


Check the significance of the model, i.e., whether the intercept and slope are significantly different from zero. From the summary above, the slope is significant (p = 0.0211) while the intercept is not (p = 0.2265).

H0: b1 = 0; Ha: b1 != 0

t-score = 3.101, p-value = 0.0211, alpha = 0.05

t-critical value approach

Draw the t-critical value from the t-distribution table with n - q = 8 - 2 = 6 degrees of freedom (the residual degrees of freedom reported in the summary), alpha = 0.05, two-tailed t-test.

t-critical value: 2.447; t-score: 3.101

The t-score is greater than the t-critical value, hence reject H0.
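Rather than a printed table, the critical value can be drawn from the t-distribution in R; a quick sketch:

#two-tailed test, alpha = 0.05, df = n - q = 6
qt(1 - 0.05/2, df = 6)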

p-value approach

The p-value, 0.0211, is less than the alpha value of 0.05, hence we reject the null hypothesis. Child IQ contributes significantly to the model.

Decision: reject H0, since the p-value is less than alpha. Conclusion: child IQ is a significant predictor of mother IQ.
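The p-value itself can be reproduced from the t-score and the t-distribution; a quick sketch:

#two-tailed p-value with df = 6; should be close to 0.0211
2 * pt(3.101, df = 6, lower.tail = FALSE)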

Y-intercept

The y-intercept of 30.6439 says that when child IQ is zero, the predicted mother IQ is 30.6439.

Equivalently, on the fitted line a predicted mother IQ of 30.6439 corresponds to a child IQ of zero. Note that a child IQ of zero lies far outside the observed range (87 to 123), so this value is an extrapolation rather than a practically meaningful IQ.

Slope

The slope value of 0.6882 says that a 1 unit increase in the x variable (child IQ) is associated with an increase of 0.6882 in the y variable (mother IQ).

Equivalently, along the fitted line, an increase of 0.6882 in predicted mother IQ corresponds to a 1 unit increase in child IQ.

If child IQ is 100, the predicted value is derived as (100 * b1) + b0 = (100 * 0.6882) + 30.6439 = 99.46.
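The same prediction can be obtained with predict(); a minimal sketch:

#predicted mother IQ for a child IQ of 100; should be about 99.46
predict(model_iq, newdata = data.frame(child_IQ = 100))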

Evaluation of errors

#MAE - mean absolute error
#abs() takes the absolute value, turning negative residuals positive
mean(abs(model_iq$residuals))
## [1] 5.204975
#MSE - mean square error
mean(model_iq$residuals^2)
## [1] 36.38758
#MAPE - Mean absolute percentage error
#(abs(actual-predicted)/actual)
#(abs(actual-predicted)/actual)*100
#mean((abs(actual-predicted)/actual)*100)
mean(abs((Mother_IQ-model_iq$fitted.values)/Mother_IQ)*100)
## [1] 5.245943
#RMSE - root mean square error - sqrt(MSE)
sqrt(mean(model_iq$residuals^2))
## [1] 6.032212
#library(DMwR)
#regr.eval(actual, predicted)

library(DMwR)
## Warning: package 'DMwR' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: grid
regr.eval(Mother_IQ, model_iq$fitted.values)
##         mae         mse        rmse        mape 
##  5.20497525 36.38758091  6.03221194  0.05245943

Assumptions of the model

  1. Linearity
  2. Errors should be normally distributed
  3. Homogeneity of variance - the errors should have constant variance
  4. Influential observations - outliers

  1. Linearity - the residual line should be close to zero.

plot(model_iq,1)

The above graph shows the residuals are not close to zero, suggesting a violation of the linearity assumption.

  2. Errors should be normally distributed.
plot(model_iq,2)

The above graph shows the errors are normally distributed.

  3. Homogeneity of variance - the errors should have constant variance
plot(model_iq,3)

The error variance is not constant.

  4. Influential observations - values that affect the model's performance, in either a positive or negative way, when included in the model.
plot(model_iq,4)

plot(model_iq)
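To display all four diagnostic plots in a single window, the plotting region can be split first; a base-R sketch:

par(mfrow = c(2, 2))  #2 x 2 grid for the four diagnostic plots
plot(model_iq)
par(mfrow = c(1, 1))  #reset the layout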