As the regression output shows, GMI, SGI, SGAI, LVGI, and TATA are statistically significant at the 5% level.
Std. Error: the standard deviation of a coefficient estimate. Lower values indicate a more precise estimate.
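To make the role of the standard error concrete, the following sketch recomputes the t value and p-value of the y coefficient from the GMI regression printed later in this document (estimate 30.024, standard error 5.522, 14399 residual degrees of freedom). The variable names are illustrative only.

```r
# Values copied from the GMI ~ y summary output in this document
estimate  <- 30.024
std_error <- 5.522
df_resid  <- 14399

# t value = estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(signif(p_value, 3))
```

A smaller standard error for the same estimate gives a larger t value and therefore a smaller p-value.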
Residual Standard Error: a measure of the quality of a linear regression fit. The model does not fit well at this point, as the low R-squared and adjusted R-squared also indicate.
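F-statistic: a good indicator of whether there is a relationship between the predictors and the response variable. The further the F-statistic is from 1, the stronger the evidence of a relationship. How much larger than 1 it needs to be depends on both the number of data points and the number of predictors. In our example the F-statistic is 80.4 on 8 and 14392 degrees of freedom, which is well above 1 given the size of our data (p-value < 2.2e-16). This only tells us that at least one predictor is related to the response; the R-squared of about 0.043 shows that the model still explains little of the variance.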
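The dependence of the F-statistic on the number of data points and predictors can be made explicit with the standard identity F = (R²/p) / ((1 - R²)/(n - p - 1)). A minimal sketch using the figures reported by summary(fit) in this document (the variable names are illustrative):

```r
# Values copied from the summary(fit) output in this document
r_squared <- 0.04278   # Multiple R-squared
p         <- 8         # number of predictors
df_resid  <- 14392     # residual degrees of freedom, n - p - 1

# Overall F-statistic recovered from R-squared and the degrees of freedom
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)
print(round(f_stat, 1))

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)
print(p_value)
```

With more than 14000 residual degrees of freedom, even a very small R-squared produces a large F-statistic.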
# Load the data set and assign readable column names
data <- read.csv("/Users/N-mickey/Desktop/Anly 699/Data/data.csv", header = TRUE, encoding = "UTF-8")
names(data) <- c('StockCode','DSRI','GMI','AQI','SGI','DEPI','SGAI','LVGI','TATA','M_score','y')
print(summary(lm( DSRI ~y, data)))
##
## Call:
## lm(formula = DSRI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0 -3.1 -3.0 -2.7 19042.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8926 4.7487 0.188 0.851
## y 3.0912 4.9854 0.620 0.535
##
## Residual standard error: 173.5 on 14399 degrees of freedom
## Multiple R-squared: 2.67e-05, Adjusted R-squared: -4.275e-05
## F-statistic: 0.3845 on 1 and 14399 DF, p-value: 0.5352
print(summary(lm( GMI ~y, data)))
##
## Call:
## lm(formula = GMI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11514.3 -2.9 -2.6 -2.1 18084.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.395 5.260 -5.018 5.29e-07 ***
## y 30.024 5.522 5.437 5.51e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 192.2 on 14399 degrees of freedom
## Multiple R-squared: 0.002049, Adjusted R-squared: 0.001979
## F-statistic: 29.56 on 1 and 14399 DF, p-value: 5.513e-08
print(summary(lm( AQI ~y, data)))
##
## Call:
## lm(formula = AQI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08 -0.13 -0.09 -0.05 1180.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99471 0.26921 3.695 0.000221 ***
## y 0.09372 0.28263 0.332 0.740206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.836 on 14399 degrees of freedom
## Multiple R-squared: 7.636e-06, Adjusted R-squared: -6.181e-05
## F-statistic: 0.11 on 1 and 14399 DF, p-value: 0.7402
print(summary(lm( SGI ~y, data)))
##
## Call:
## lm(formula = SGI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39 -0.35 -0.24 -0.07 428.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0858 0.1185 9.162 <2e-16 ***
## y 0.2683 0.1244 2.157 0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.33 on 14399 degrees of freedom
## Multiple R-squared: 0.0003229, Adjusted R-squared: 0.0002535
## F-statistic: 4.652 on 1 and 14399 DF, p-value: 0.03104
print(summary(lm( DEPI ~y, data)))
##
## Call:
## lm(formula = DEPI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12 -11 -11 -10 122963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08 28.08 0.038 0.969
## y 10.47 29.48 0.355 0.722
##
## Residual standard error: 1026 on 14399 degrees of freedom
## Multiple R-squared: 8.766e-06, Adjusted R-squared: -6.068e-05
## F-statistic: 0.1262 on 1 and 14399 DF, p-value: 0.7224
print(summary(lm( SGAI ~y, data)))
##
## Call:
## lm(formula = SGAI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.808 -0.184 -0.067 0.027 124.344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.47234 0.05923 24.858 < 2e-16 ***
## y -0.40582 0.06218 -6.526 6.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.164 on 14399 degrees of freedom
## Multiple R-squared: 0.002949, Adjusted R-squared: 0.00288
## F-statistic: 42.59 on 1 and 14399 DF, p-value: 6.968e-11
print(summary(lm( LVGI ~y, data)))
##
## Call:
## lm(formula = LVGI ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1710 -0.1598 -0.0810 0.0429 29.8482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.17098 0.01792 65.339 < 2e-16 ***
## y -0.09000 0.01881 -4.784 1.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6548 on 14399 degrees of freedom
## Multiple R-squared: 0.001587, Adjusted R-squared: 0.001517
## F-statistic: 22.88 on 1 and 14399 DF, p-value: 1.739e-06
print(summary(lm( TATA ~y, data)))
##
## Call:
## lm(formula = TATA ~ y, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8865 -0.2594 -0.1010 0.1179 10.9969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22657 0.01435 15.79 <2e-16 ***
## y 0.35404 0.01507 23.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5244 on 14399 degrees of freedom
## Multiple R-squared: 0.03693, Adjusted R-squared: 0.03686
## F-statistic: 552.1 on 1 and 14399 DF, p-value: < 2.2e-16
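The eight single-variable regressions above repeat the same call with a different response each time. As a sketch only, the same table of y coefficients could be collected in one loop. The data frame toy below is a simulated stand-in (hypothetical, not the original data), and only three of the columns are mimicked.

```r
set.seed(1)
# Hypothetical stand-in for `data`: a binary y and three numeric columns
toy <- data.frame(
  y    = rbinom(200, 1, 0.9),
  DSRI = rnorm(200),
  GMI  = rnorm(200),
  TATA = rnorm(200)
)

predictors <- c("DSRI", "GMI", "TATA")

# Fit one regression of each column on y and keep the coefficient row for y
results <- t(sapply(predictors, function(v) {
  fit_v <- lm(reformulate("y", response = v), data = toy)
  coef(summary(fit_v))["y", ]
}))

# One row per variable: Estimate, Std. Error, t value, Pr(>|t|)
print(round(results, 4))
```

Collecting the rows in one matrix makes it easier to compare p-values across variables than reading eight separate summaries.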
fit <- lm(y ~ DSRI+GMI+AQI+SGI+ DEPI+SGAI+TATA+LVGI, data=data)
summary(fit)
##
## Call:
## lm(formula = y ~ DSRI + GMI + AQI + SGI + DEPI + SGAI + TATA +
## LVGI, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10601 0.06302 0.09561 0.11494 0.86807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.744e-01 5.331e-03 164.006 < 2e-16 ***
## DSRI 9.286e-06 1.364e-05 0.681 0.4959
## GMI 6.408e-05 1.230e-05 5.211 1.90e-07 ***
## AQI 1.110e-04 2.405e-04 0.462 0.6443
## SGI 1.185e-03 5.519e-04 2.147 0.0318 *
## DEPI 2.274e-07 2.306e-06 0.099 0.9214
## SGAI -6.288e-03 1.094e-03 -5.749 9.18e-09 ***
## TATA 1.029e-01 4.430e-03 23.229 < 2e-16 ***
## LVGI -1.678e-02 3.650e-03 -4.598 4.30e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2838 on 14392 degrees of freedom
## Multiple R-squared: 0.04278, Adjusted R-squared: 0.04225
## F-statistic: 80.4 on 8 and 14392 DF, p-value: < 2.2e-16
# Reload the data set and assign readable column names
data <- read.csv("/Users/N-mickey/Desktop/Anly 699/Data/data.csv", header = TRUE, encoding = "UTF-8")
names(data) <- c('StockCode','DSRI','GMI','AQI','SGI','DEPI','SGAI','LVGI','TATA','M_score','y')
str(data)
## 'data.frame': 14401 obs. of 11 variables:
## $ StockCode: int 1 1 1 1 2 2 2 2 4 4 ...
## $ DSRI : num 1.186 1.008 0 0 0.672 ...
## $ GMI : num 1.082 0.907 1.361 0.307 1.069 ...
## $ AQI : num 1.474 0.879 1.289 0.883 0.971 ...
## $ SGI : num 1.13 1 1.01 1.19 1.23 ...
## $ DEPI : num 0.576 1.035 0.737 0.983 0.722 ...
## $ SGAI : num 0 0 0 0 1.09 ...
## $ LVGI : num 1.05 1.18 1.06 1.17 1.04 ...
## $ TATA : num 0.0316 0.0688 0.0478 0.0424 0.2419 ...
## $ M_score : num -1.71 -2.13 -2.73 -3.33 -1.48 -2.3 -1.68 -1.35 1.32 6.29 ...
## $ y : int 1 1 0 0 1 0 1 1 1 1 ...
summary(data)
## StockCode DSRI GMI AQI
## Min. : 1 Min. : -1.541 Min. :-11540.658 Min. : 0.0029
## 1st Qu.: 2438 1st Qu.: 0.844 1st Qu.: 0.529 1st Qu.: 0.9533
## Median :300388 Median : 1.000 Median : 0.990 Median : 0.9989
## Mean :313186 Mean : 3.697 Mean : 0.846 Mean : 1.0797
## 3rd Qu.:600688 3rd Qu.: 1.178 3rd Qu.: 1.252 3rd Qu.: 1.0261
## Max. :900957 Max. :19046.402 Max. : 18088.332 Max. :1181.1942
## SGI DEPI SGAI LVGI
## Min. : -0.3092 Min. : 0.00 Min. : -2.3354 Min. : 0.0000
## 1st Qu.: 1.0000 1st Qu.: 0.85 1st Qu.: 0.9071 1st Qu.: 0.9357
## Median : 1.0912 Median : 0.99 Median : 1.0000 Median : 1.0041
## Mean : 1.3292 Mean : 10.58 Mean : 1.1041 Mean : 1.0893
## 3rd Qu.: 1.2585 3rd Qu.: 1.05 3rd Qu.: 1.1076 3rd Qu.: 1.1293
## Max. :430.0361 Max. :122974.72 Max. :125.8168 Max. :30.9292
## TATA M_score y
## Min. :-0.6599 Min. :-6095.490 Min. :0.0000
## 1st Qu.: 0.2772 1st Qu.: -1.150 1st Qu.:1.0000
## Median : 0.4467 Median : -0.140 Median :1.0000
## Mean : 0.5478 Mean : 3.864 Mean :0.9073
## 3rd Qu.: 0.6763 3rd Qu.: 1.190 3rd Qu.:1.0000
## Max. :11.5775 Max. :17529.540 Max. :1.0000
table(data$y)
##
## 0 1
## 1335 13066
# y is already coded as 0/1 (see table(data$y) above), so these two calls leave it unchanged
data$y<-replace(data$y,data$y==0,0)
data$y<-replace(data$y,data$y==1,1)
LogModel <- glm(y~.-StockCode, family=binomial(link='logit'),data=data)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
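None of the parameters is statistically significant: every p-value in the summary is above 0.5. Together with the two glm.fit warnings, the very large standard errors, and a residual deviance close to zero, this suggests (quasi-)complete separation. That would be expected if y is derived from M_score, which is included as a predictor; the first rows shown by str(data) are consistent with this, but it is not confirmed here.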
Null deviance indicates how well the response is predicted by a model with nothing but an intercept. Residual deviance indicates how well the response is predicted once the independent variables are added. In both cases, lower values are better: deviance is a measure of goodness of fit of a generalized linear model, or rather of badness of fit, since higher numbers indicate a worse fit. A large drop from null deviance to residual deviance means the predictors improve on the intercept-only model.
The analogue of adjusted R-squared in logistic regression is AIC (Akaike information criterion). AIC is a measure of fit that penalizes a model for its number of coefficients. Therefore, when comparing models fitted to the same data, we prefer the one with the lowest AIC value.
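For a binomial model fitted to individual 0/1 responses, the residual deviance equals minus twice the log-likelihood, so AIC reduces to residual deviance plus twice the number of coefficients. The sketch below checks this against the figures reported by summary(LogModel) in this document and then against a small simulated model; the names x, z, and m are hypothetical.

```r
# Values copied from the summary(LogModel) output in this document
residual_deviance <- 4.9761e-02
k <- 10  # intercept plus nine predictors

# AIC = residual deviance + 2 * number of coefficients (0/1 responses only)
aic_from_output <- residual_deviance + 2 * k
print(round(aic_from_output, 2))

# The same identity checked on simulated data (illustrative only)
set.seed(42)
x <- rnorm(500)
z <- rbinom(500, 1, plogis(0.5 + x))
m <- glm(z ~ x, family = binomial(link = "logit"))
print(all.equal(AIC(m), deviance(m) + 2 * length(coef(m))))
```

Because the penalty term grows with every added coefficient, a predictor lowers AIC only if it reduces the deviance by more than 2.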
Finally, the number of Fisher Scoring iterations is returned. Fisher's scoring algorithm is a variant of Newton's method for solving maximum likelihood problems numerically.
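For this model, Fisher's scoring algorithm ran for 25 iterations. This is the default iteration limit of glm (maxit = 25 in glm.control), so together with the "algorithm did not converge" warning it means the fit stopped at the limit rather than converging.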
summary(LogModel)
##
## Call:
## glm(formula = y ~ . - StockCode, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.06674 0.00000 0.00000 0.00000 0.06867
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2128.22 8434.02 0.252 0.801
## DSRI 79.73 1583.65 0.050 0.960
## GMI 45.68 908.96 0.050 0.960
## AQI 44.21 700.98 0.063 0.950
## SGI 79.29 1539.49 0.052 0.959
## DEPI 11.46 198.11 0.058 0.954
## SGAI -15.82 294.48 -0.054 0.957
## LVGI -19.78 559.59 -0.035 0.972
## TATA 411.10 8061.16 0.051 0.959
## M_score 1073.11 1831.29 0.586 0.558
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.8925e+03 on 14400 degrees of freedom
## Residual deviance: 4.9761e-02 on 14391 degrees of freedom
## AIC: 20.05
##
## Number of Fisher Scoring iterations: 25
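The warnings, the very large standard errors, and the near-zero residual deviance in this output are typical symptoms of perfect separation. The following sketch uses simulated data with hypothetical names (toy, score, x) to show how such symptoms arise when the response is a deterministic function of one predictor, and how the fit behaves once that predictor is dropped. Whether this is what happens with M_score in the model above is an assumption, not something the output proves.

```r
set.seed(7)
n <- 300
toy <- data.frame(score = rnorm(n), x = rnorm(n))

# y is a deterministic function of score, so score separates the classes perfectly
toy$y <- as.integer(toy$score > -1)

# Fit with the separating variable, collecting warnings instead of printing them
sep_warnings <- character(0)
sep_fit <- withCallingHandlers(
  glm(y ~ score + x, family = binomial(link = "logit"), data = toy),
  warning = function(w) {
    sep_warnings <<- c(sep_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(sep_warnings)
print(deviance(sep_fit))
print(coef(summary(sep_fit))[, "Std. Error"])

# Dropping the variable that defines y gives an ordinary, converging fit
ok_fit <- glm(y ~ x, family = binomial(link = "logit"), data = toy)
print(ok_fit$converged)
print(deviance(ok_fit))
```

If the goal is to learn how the individual ratios relate to y, a formula that excludes the defining variable, such as y ~ . - StockCode - M_score, would be the analogous change for the model above.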
LGModelPred <- round(predict(LogModel, type="response"))
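The rounded predictions can be compared with the observed classes in a confusion matrix. A minimal sketch on simulated data follows; the names toy_fit, actual, and predicted are hypothetical and not part of the original analysis.

```r
set.seed(3)
x <- rnorm(400)
actual <- rbinom(400, 1, plogis(1 + 1.5 * x))
toy_fit <- glm(actual ~ x, family = binomial(link = "logit"))

# Round fitted probabilities to the nearest class label (threshold at 0.5)
predicted <- round(predict(toy_fit, type = "response"))

# Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted
confusion <- table(
  Predicted = factor(predicted, levels = 0:1),
  Actual    = factor(actual, levels = 0:1)
)
print(confusion)

# Share of observations whose predicted class matches the observed class
accuracy <- mean(predicted == actual)
print(accuracy)
```

With classes as unbalanced as in table(data$y), accuracy should be read against the share of the majority class, which a model predicting 1 for every row would already achieve.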