Linear Regression

Model Summary

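As the summary of the multiple regression shows, GMI, SGAI, TATA, and LVGI are statistically significant at the 0.001 level, and SGI is significant at the 0.05 level. DSRI, AQI, and DEPI are not significant.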

Std. Error: the standard deviation of a coefficient estimate. Lower values indicate a more precise estimate.
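As a rough check on how the standard error feeds into the rest of the coefficient table, the t value is the estimate divided by its standard error, and the p-value follows from a t distribution with the residual degrees of freedom. The sketch below redoes this for the GMI row of the multiple regression output, using the rounded figures as printed, so small differences in the last digit are expected. The variable names are illustrative.

```r
# Figures copied from the GMI row of summary(fit); they are rounded as printed.
estimate  <- 6.408e-05
std_error <- 1.230e-05
df_resid  <- 14392  # residual degrees of freedom reported by summary(fit)

# t value: how many standard errors the estimate lies from zero.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(t_value = t_value, p_value = p_value))
```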

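Residual Standard Error: a measure of the quality of a linear regression fit, expressed in the units of the response. Here the model does not fit well: the multiple R-squared is 0.04278 and the adjusted R-squared is 0.04225, so the predictors explain only about 4% of the variance in y.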
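To make the link between the two R-squared figures concrete, adjusted R-squared can be recomputed from the multiple R-squared, the number of observations, and the number of predictors. This is a sketch using the rounded values from the multiple regression output; the variable names are illustrative.

```r
# Values taken from summary(fit) and str(data).
r_squared <- 0.04278  # multiple R-squared, as printed
n_obs     <- 14401    # number of rows in the data
n_pred    <- 8        # predictors in the model, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)

print(round(adj_r_squared, 5))
```

With more than fourteen thousand observations and only eight predictors the penalty is tiny, which is why the two figures are nearly identical.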

The F-statistic indicates whether there is a relationship between the predictors and the response variable. The further the F-statistic is above 1, the stronger the evidence. How much larger than 1 it needs to be depends on both the number of data points and the number of predictors. In our example the F-statistic is 80.4 on 8 and 14392 degrees of freedom (p-value < 2.2e-16), which is far above 1 given the size of our data. With this many observations, though, a significant F-statistic shows only that at least one coefficient is nonzero, not that the model fits well.
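The dependence on sample size and number of predictors can be seen by rebuilding the F-statistic from R-squared and comparing it with the 5% critical value of the F distribution. The sketch below uses the rounded figures from the multiple regression output; the variable names are illustrative.

```r
# Values taken from summary(fit).
r_squared <- 0.04278
n_obs     <- 14401
n_pred    <- 8

df_model <- n_pred
df_resid <- n_obs - n_pred - 1

# F compares the variance explained per predictor with the residual variance.
f_stat <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Critical value at the 5% level and the p-value of the observed statistic.
f_crit  <- qf(0.95, df1 = df_model, df2 = df_resid)
p_value <- pf(f_stat, df1 = df_model, df2 = df_resid, lower.tail = FALSE)

print(c(f_stat = f_stat, f_crit = f_crit, p_value = p_value))
```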

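# Load the data and give the columns readable names
data <- read.csv("/Users/N-mickey/Desktop/Anly 699/Data/data.csv", header = TRUE, encoding = "UTF-8")
names(data) <- c('StockCode','DSRI','GMI','AQI','SGI','DEPI','SGAI','LVGI','TATA','M_score','y')
# Regress each indicator on the binary flag y, one at a time
print(summary(lm(DSRI ~ y, data = data)))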

## 
## Call:
## lm(formula = DSRI ~ y, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##    -4.0    -3.1    -3.0    -2.7 19042.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.8926     4.7487   0.188    0.851
## y             3.0912     4.9854   0.620    0.535
## 
## Residual standard error: 173.5 on 14399 degrees of freedom
## Multiple R-squared:  2.67e-05,   Adjusted R-squared:  -4.275e-05 
## F-statistic: 0.3845 on 1 and 14399 DF,  p-value: 0.5352

print(summary(lm(GMI ~ y, data = data)))

## 
## Call:
## lm(formula = GMI ~ y, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11514.3     -2.9     -2.6     -2.1  18084.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -26.395      5.260  -5.018 5.29e-07 ***
## y             30.024      5.522   5.437 5.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 192.2 on 14399 degrees of freedom
## Multiple R-squared:  0.002049,   Adjusted R-squared:  0.001979 
## F-statistic: 29.56 on 1 and 14399 DF,  p-value: 5.513e-08

print(summary(lm(AQI ~ y, data = data)))

## 
## Call:
## lm(formula = AQI ~ y, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##   -1.08   -0.13   -0.09   -0.05 1180.11 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99471    0.26921   3.695 0.000221 ***
## y            0.09372    0.28263   0.332 0.740206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.836 on 14399 degrees of freedom
## Multiple R-squared:  7.636e-06,  Adjusted R-squared:  -6.181e-05 
## F-statistic:  0.11 on 1 and 14399 DF,  p-value: 0.7402

print(summary(lm(SGI ~ y, data = data)))

## 
## Call:
## lm(formula = SGI ~ y, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.39  -0.35  -0.24  -0.07 428.68 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0858     0.1185   9.162   <2e-16 ***
## y             0.2683     0.1244   2.157    0.031 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.33 on 14399 degrees of freedom
## Multiple R-squared:  0.0003229,  Adjusted R-squared:  0.0002535 
## F-statistic: 4.652 on 1 and 14399 DF,  p-value: 0.03104

print(summary(lm(DEPI ~ y, data = data)))

## 
## Call:
## lm(formula = DEPI ~ y, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##    -12    -11    -11    -10 122963 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     1.08      28.08   0.038    0.969
## y              10.47      29.48   0.355    0.722
## 
## Residual standard error: 1026 on 14399 degrees of freedom
## Multiple R-squared:  8.766e-06,  Adjusted R-squared:  -6.068e-05 
## F-statistic: 0.1262 on 1 and 14399 DF,  p-value: 0.7224

print(summary(lm(SGAI ~ y, data = data)))

## 
## Call:
## lm(formula = SGAI ~ y, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -3.808  -0.184  -0.067   0.027 124.344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.47234    0.05923  24.858  < 2e-16 ***
## y           -0.40582    0.06218  -6.526 6.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.164 on 14399 degrees of freedom
## Multiple R-squared:  0.002949,   Adjusted R-squared:  0.00288 
## F-statistic: 42.59 on 1 and 14399 DF,  p-value: 6.968e-11

print(summary(lm(LVGI ~ y, data = data)))

## 
## Call:
## lm(formula = LVGI ~ y, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1710 -0.1598 -0.0810  0.0429 29.8482 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.17098    0.01792  65.339  < 2e-16 ***
## y           -0.09000    0.01881  -4.784 1.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6548 on 14399 degrees of freedom
## Multiple R-squared:  0.001587,   Adjusted R-squared:  0.001517 
## F-statistic: 22.88 on 1 and 14399 DF,  p-value: 1.739e-06

print(summary(lm(TATA ~ y, data = data)))

## 
## Call:
## lm(formula = TATA ~ y, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8865 -0.2594 -0.1010  0.1179 10.9969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22657    0.01435   15.79   <2e-16 ***
## y            0.35404    0.01507   23.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5244 on 14399 degrees of freedom
## Multiple R-squared:  0.03693,    Adjusted R-squared:  0.03686 
## F-statistic: 552.1 on 1 and 14399 DF,  p-value: < 2.2e-16

# Multiple regression of the 0/1 flag y on all eight indicators
fit <- lm(y ~ DSRI + GMI + AQI + SGI + DEPI + SGAI + TATA + LVGI, data = data)
summary(fit)

## 
## Call:
## lm(formula = y ~ DSRI + GMI + AQI + SGI + DEPI + SGAI + TATA + 
##     LVGI, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10601  0.06302  0.09561  0.11494  0.86807 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.744e-01  5.331e-03 164.006  < 2e-16 ***
## DSRI         9.286e-06  1.364e-05   0.681   0.4959    
## GMI          6.408e-05  1.230e-05   5.211 1.90e-07 ***
## AQI          1.110e-04  2.405e-04   0.462   0.6443    
## SGI          1.185e-03  5.519e-04   2.147   0.0318 *  
## DEPI         2.274e-07  2.306e-06   0.099   0.9214    
## SGAI        -6.288e-03  1.094e-03  -5.749 9.18e-09 ***
## TATA         1.029e-01  4.430e-03  23.229  < 2e-16 ***
## LVGI        -1.678e-02  3.650e-03  -4.598 4.30e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2838 on 14392 degrees of freedom
## Multiple R-squared:  0.04278,    Adjusted R-squared:  0.04225 
## F-statistic:  80.4 on 8 and 14392 DF,  p-value: < 2.2e-16

Logistic Regression

Step 1: Explore

data <- read.csv("/Users/N-mickey/Desktop/Anly 699/Data/data.csv", header = TRUE, encoding = "UTF-8")
names(data) <- c('StockCode','DSRI','GMI','AQI','SGI','DEPI','SGAI','LVGI','TATA','M_score','y')
str(data)

## 'data.frame':    14401 obs. of  11 variables:
##  $ StockCode: int  1 1 1 1 2 2 2 2 4 4 ...
##  $ DSRI     : num  1.186 1.008 0 0 0.672 ...
##  $ GMI      : num  1.082 0.907 1.361 0.307 1.069 ...
##  $ AQI      : num  1.474 0.879 1.289 0.883 0.971 ...
##  $ SGI      : num  1.13 1 1.01 1.19 1.23 ...
##  $ DEPI     : num  0.576 1.035 0.737 0.983 0.722 ...
##  $ SGAI     : num  0 0 0 0 1.09 ...
##  $ LVGI     : num  1.05 1.18 1.06 1.17 1.04 ...
##  $ TATA     : num  0.0316 0.0688 0.0478 0.0424 0.2419 ...
##  $ M_score  : num  -1.71 -2.13 -2.73 -3.33 -1.48 -2.3 -1.68 -1.35 1.32 6.29 ...
##  $ y        : int  1 1 0 0 1 0 1 1 1 1 ...

Step 2: No missing data
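A check of this kind is usually a one-liner. The sketch below runs it on a small made-up data frame (`toy`, with hypothetical values and one deliberately missing entry) so that it is self-contained; on the real data the same calls would be applied to `data`.

```r
# Hypothetical data frame for illustration only; one DSRI value is missing on purpose.
toy <- data.frame(
  DSRI = c(1.19, 1.01, NA, 0.67),
  GMI  = c(1.08, 0.91, 1.36, 1.07),
  y    = c(1L, 1L, 0L, 1L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame.
any_missing <- anyNA(toy)
print(any_missing)
```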

Step 3: Preparation

summary(data)

##    StockCode           DSRI                GMI                  AQI           
##  Min.   :     1   Min.   :   -1.541   Min.   :-11540.658   Min.   :   0.0029  
##  1st Qu.:  2438   1st Qu.:    0.844   1st Qu.:     0.529   1st Qu.:   0.9533  
##  Median :300388   Median :    1.000   Median :     0.990   Median :   0.9989  
##  Mean   :313186   Mean   :    3.697   Mean   :     0.846   Mean   :   1.0797  
##  3rd Qu.:600688   3rd Qu.:    1.178   3rd Qu.:     1.252   3rd Qu.:   1.0261  
##  Max.   :900957   Max.   :19046.402   Max.   : 18088.332   Max.   :1181.1942  
##       SGI                DEPI                SGAI               LVGI        
##  Min.   : -0.3092   Min.   :     0.00   Min.   : -2.3354   Min.   : 0.0000  
##  1st Qu.:  1.0000   1st Qu.:     0.85   1st Qu.:  0.9071   1st Qu.: 0.9357  
##  Median :  1.0912   Median :     0.99   Median :  1.0000   Median : 1.0041  
##  Mean   :  1.3292   Mean   :    10.58   Mean   :  1.1041   Mean   : 1.0893  
##  3rd Qu.:  1.2585   3rd Qu.:     1.05   3rd Qu.:  1.1076   3rd Qu.: 1.1293  
##  Max.   :430.0361   Max.   :122974.72   Max.   :125.8168   Max.   :30.9292  
##       TATA            M_score                y         
##  Min.   :-0.6599   Min.   :-6095.490   Min.   :0.0000  
##  1st Qu.: 0.2772   1st Qu.:   -1.150   1st Qu.:1.0000  
##  Median : 0.4467   Median :   -0.140   Median :1.0000  
##  Mean   : 0.5478   Mean   :    3.864   Mean   :0.9073  
##  3rd Qu.: 0.6763   3rd Qu.:    1.190   3rd Qu.:1.0000  
##  Max.   :11.5775   Max.   :17529.540   Max.   :1.0000

table(data$y)

## 
##     0     1 
##  1335 13066
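The counts show a strong class imbalance. As a small sketch (variable names are illustrative), the shares can be computed directly from the two counts printed above:

```r
# Class counts copied from the table(data$y) output.
class_counts <- c("0" = 1335, "1" = 13066)

# Share of each class in the data.
class_share <- prop.table(class_counts)
print(round(class_share, 4))

# Accuracy of a trivial rule that always predicts the majority class.
baseline_accuracy <- max(class_share)
print(round(baseline_accuracy, 4))
```

A rule that always predicts 1 would already be right for roughly 90.7% of the rows, which is worth keeping in mind when judging the accuracy of any classifier on this data.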

# y is already coded 0/1 (see the table above), so these two calls leave it unchanged
data$y <- replace(data$y, data$y == 0, 0)
data$y <- replace(data$y, data$y == 1, 1)

Step 4: Model

LogModel <- glm(y~.-StockCode, family=binomial(link='logit'),data=data)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
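Warnings like these typically appear when some combination of predictors separates the two classes perfectly, so the likelihood has no finite maximum. The sketch below reproduces the situation on made-up data in which the outcome is fully determined by a cutoff on a single predictor. The names `score`, `flag`, and `sep_fit` are hypothetical and not part of the analysis above.

```r
# Made-up, deterministic data: the outcome is fully determined by a cutoff on score.
score <- seq(-3, 3, by = 0.1)
flag  <- as.integer(score > 0)

# Collect warnings instead of printing them, so they can be inspected afterwards.
sep_warnings <- character(0)
sep_fit <- withCallingHandlers(
  glm(flag ~ score, family = binomial(link = "logit")),
  warning = function(w) {
    sep_warnings <<- c(sep_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(sep_warnings)            # warnings raised while fitting
print(sep_fit$converged)       # whether the algorithm reported convergence
print(deviance(sep_fit))       # residual deviance of the separated fit
print(coef(summary(sep_fit)))  # estimates with their standard errors
```

In a separated fit the coefficient estimates grow without bound as the iterations proceed, so the standard errors become very large and the z tests lose their meaning.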

Step 5: Model Summary

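None of the parameters are statistically significant: every p-value is above 0.5. The two warnings from glm() (the algorithm did not converge, and fitted probabilities numerically 0 or 1 occurred), together with a residual deviance near zero, point to complete or quasi-complete separation, so the coefficient estimates and standard errors below are unreliable. A likely cause is that M_score is included as a predictor: in the rows printed by str() above, y is 1 exactly when M_score is above roughly -2.2, so M_score appears to determine y almost exactly.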

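Deviance is a measure of goodness of fit of a generalized linear model, or rather of badness of fit: higher numbers indicate a worse fit. The null deviance shows how well the response is predicted by a model with nothing but an intercept. The residual deviance shows how well the response is predicted once the independent variables are added. For both, lower is better, and a large drop from null to residual deviance means the predictors improve the fit. Here the deviance falls from 8892.5 on 14400 degrees of freedom to 0.0498 on 14391, which is suspiciously close to a perfect fit and is consistent with the glm() warnings above.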
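For a 0/1 response the deviance equals minus twice the log-likelihood, and the intercept-only model predicts the overall share of 1s for every row. Under that reading, the null deviance can be reproduced from the class counts reported by table(data$y) alone. A minimal sketch, with illustrative variable names:

```r
# Class counts copied from the table(data$y) output.
n_zero <- 1335
n_one  <- 13066

# The intercept-only model predicts the same probability for every row.
p_one <- n_one / (n_zero + n_one)

# Bernoulli log-likelihood of that constant prediction.
log_lik_null <- n_one * log(p_one) + n_zero * log(1 - p_one)

# Deviance for binary data is -2 times the log-likelihood.
null_deviance <- -2 * log_lik_null
print(null_deviance)
```

The result should agree, up to rounding, with the null deviance of 8.8925e+03 printed by summary(LogModel).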

The analogue of adjusted R-squared in logistic regression is AIC, a measure of fit that penalizes a model for its number of coefficients. Among candidate models fitted to the same data, we prefer the one with the lowest AIC.
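For a binary response, AIC works out to the residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the figures in the summary output: a residual deviance of 4.9761e-02 and ten coefficients, counting the intercept. The variable names are illustrative.

```r
# Figures copied from summary(LogModel).
residual_deviance <- 4.9761e-02
n_coef            <- 10  # nine predictors plus the intercept

# AIC = -2 * log-likelihood + 2 * (number of coefficients);
# for 0/1 data, -2 * log-likelihood is the residual deviance.
aic <- residual_deviance + 2 * n_coef

print(round(aic, 2))
```

Because the residual deviance is essentially zero here, the AIC is almost entirely the penalty term, so it says little about the quality of this particular fit.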

Finally, the number of Fisher Scoring iterations is returned. Fisher's scoring algorithm is a variant of Newton's method for solving maximum likelihood problems numerically.

For this model, Fisher's scoring algorithm ran for 25 iterations, which is glm()'s default maximum, and stopped without converging, as the warning above reports.
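The figure of 25 is the default iteration cap in glm.control(), so it is worth checking whether a fit stopped because it converged or because it ran out of iterations. A brief sketch of where these settings live; `ctrl` and `higher_cap` are illustrative names:

```r
# Default fitting controls used by glm().
ctrl <- glm.control()
print(ctrl$maxit)    # maximum number of Fisher scoring iterations
print(ctrl$epsilon)  # convergence tolerance on the relative change in deviance

# A different cap can be passed to glm() through its `control` argument,
# for example glm(..., control = glm.control(maxit = 100)).
higher_cap <- glm.control(maxit = 100)
print(higher_cap$maxit)
```

After fitting, the `converged` element of the model object (for example `LogModel$converged`) records whether the algorithm converged. Raising the cap does not by itself repair a fit whose likelihood has no finite maximum.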

summary(LogModel)

## 
## Call:
## glm(formula = y ~ . - StockCode, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.06674   0.00000   0.00000   0.00000   0.06867  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2128.22    8434.02   0.252    0.801
## DSRI           79.73    1583.65   0.050    0.960
## GMI            45.68     908.96   0.050    0.960
## AQI            44.21     700.98   0.063    0.950
## SGI            79.29    1539.49   0.052    0.959
## DEPI           11.46     198.11   0.058    0.954
## SGAI          -15.82     294.48  -0.054    0.957
## LVGI          -19.78     559.59  -0.035    0.972
## TATA          411.10    8061.16   0.051    0.959
## M_score      1073.11    1831.29   0.586    0.558
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.8925e+03  on 14400  degrees of freedom
## Residual deviance: 4.9761e-02  on 14391  degrees of freedom
## AIC: 20.05
## 
## Number of Fisher Scoring iterations: 25

# Predicted class: fitted probability rounded to 0 or 1
LGModelPred <- round(predict(LogModel, type = "response"))
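Rounded predictions like these are usually compared with the observed classes in a confusion matrix. The sketch below shows the idea on a small made-up data set (`x_toy` and `y_toy` are hypothetical) so that it runs on its own; with the real model the same table() call would take `LGModelPred` and `data$y`.

```r
# Made-up data, chosen so the classes overlap and glm() fits without separation.
x_toy <- 1:10
y_toy <- c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)

toy_fit  <- glm(y_toy ~ x_toy, family = binomial(link = "logit"))
toy_pred <- round(predict(toy_fit, type = "response"))

# Rows are predicted classes, columns are observed classes.
conf_mat <- table(Predicted = toy_pred, Actual = y_toy)
print(conf_mat)

# Share of rows where the predicted class matches the observed class.
accuracy <- mean(toy_pred == y_toy)
print(accuracy)
```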

Anly 699-Data Visualization

Linfang Li

9/22/2020