Introduction :-

In this report, I perform a regression analysis on a dataset of students.


Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

The given dataset has 20 students, and each student has 4 attributes. The first few rows of the dataset are as follows.

##   SCORE HOURS ANXIETY A_POINTS
## 1    62    40      40       24
## 2    58    31      65       20
## 3    52    35      34       22
## 4    55    26      91       22
## 5    75    51      46       28
## 6    82    48      52       28

The detailed structure is as follows.

## 'data.frame':    20 obs. of  4 variables:
##  $ SCORE   : int  62 58 52 55 75 82 38 55 48 68 ...
##  $ HOURS   : int  40 31 35 26 51 48 25 37 30 44 ...
##  $ ANXIETY : int  40 65 34 91 46 52 48 61 34 74 ...
##  $ A_POINTS: int  24 20 22 22 28 28 18 20 18 26 ...

The type of each attribute in the input is as follows.

##     SCORE     HOURS   ANXIETY  A_POINTS 
## "integer" "integer" "integer" "integer"

All attribute types are suitable for this analysis, so we can proceed further.


The number of null values in each column is as follows.

##    SCORE    HOURS  ANXIETY A_POINTS 
##        0        0        0        0

As there are no null values, we can proceed further.


The overall summary of all the attributes is as follows.

##      SCORE        HOURS          ANXIETY         A_POINTS   
##  Min.   :38   Min.   :25.00   Min.   :13.00   Min.   :18.0  
##  1st Qu.:55   1st Qu.:31.75   1st Qu.:37.75   1st Qu.:20.0  
##  Median :62   Median :39.50   Median :53.00   Median :24.0  
##  Mean   :61   Mean   :39.15   Mean   :49.30   Mean   :23.2  
##  3rd Qu.:68   3rd Qu.:45.25   3rd Qu.:61.00   3rd Qu.:26.0  
##  Max.   :82   Max.   :61.00   Max.   :91.00   Max.   :28.0

The distribution of all continuous variables is as follows.
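
A minimal sketch of how these distributions could be plotted, assuming base R graphics and the data frame name df used in the model calls later in the report:

# Sketch: histogram of each column to inspect its distribution
par(mfrow = c(2, 2))
for (col in names(df)) {
  hist(df[[col]], main = col, xlab = col)
}
par(mfrow = c(1, 1))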


The correlation between the continuous variables is as follows.

##               SCORE      HOURS    ANXIETY   A_POINTS
## SCORE     1.0000000  0.8210109 -0.1182956  0.8716357
## HOURS     0.8210109  1.0000000 -0.3395037  0.7317732
## ANXIETY  -0.1182956 -0.3395037  1.0000000 -0.2441778
## A_POINTS  0.8716357  0.7317732 -0.2441778  1.0000000


Conclusions from the EDA :-

In our data set,

  • The distributions of all four variables look reasonable; there are no unusual patterns and no outliers in the input data.

  • The variables SCORE, HOURS, and A_POINTS are highly correlated with each other, so we can apply dimensionality reduction techniques to reduce the number of variables.

  • Overall, there are no other conspicuous patterns in the input data.


Models to predict SCORE:-

I now build several models to predict a student's SCORE from the other three attributes (HOURS, ANXIETY, and A_POINTS).

Fitting Linear Model:-

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
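
A minimal sketch of how this model could be fitted, assuming the formula object regression_form and data frame df named in the output below (the object name linear_model is illustrative):

# Sketch: linear model of SCORE on the other three attributes
regression_form <- SCORE ~ HOURS + ANXIETY + A_POINTS
linear_model <- lm(regression_form, data = df)
summary(linear_model)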

The summary of the fitted Linear Regression Model :-

## 
## Call:
## lm(formula = regression_form, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4933 -2.5081 -0.3897  3.1395  6.6537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.82254    8.80575  -1.343 0.198143    
## HOURS         0.55114    0.17086   3.226 0.005284 ** 
## ANXIETY       0.10352    0.05762   1.796 0.091327 .  
## A_POINTS      1.98888    0.46918   4.239 0.000625 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.468 on 16 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.834 
## F-statistic: 32.81 on 3 and 16 DF,  p-value: 4.563e-07

We can observe that,

  • The residuals' median is close to zero (which is good).

  • The minimum and maximum residuals are similar in magnitude with opposite signs, so the data points are roughly evenly spread on both sides of the fitted line.

  • The intercept is -11.82254 and is not statistically significant.

  • The HOURS coefficient is 0.55114 and is statistically significant.

  • The ANXIETY coefficient is 0.10352 and is only marginally significant.

  • The A_POINTS coefficient is 1.98888 and is statistically significant.

  • 86.02 % of the variance in SCORE is explained by the explanatory variables (which is good).

  • The model p-value is 4.563e-07, so the linear model is statistically significant.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 3.9959115.
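
A minimal sketch of how the training RMSE reported above could be computed, assuming the fitted model is stored in linear_model (an illustrative name):

# Sketch: root mean squared error on the training data
pred <- predict(linear_model, newdata = df)
sqrt(mean((df$SCORE - pred)^2))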

Next, I try to reduce the set of explanatory variables, keeping only the most significant ones.

Fitting Updated Linear Model:-

The summary of the updated linear model is as follows.

## 
## Call:
## lm(formula = SCORE ~ HOURS + ANXIETY + A_POINTS, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4933 -2.5081 -0.3897  3.1395  6.6537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.82254    8.80575  -1.343 0.198143    
## HOURS         0.55114    0.17086   3.226 0.005284 ** 
## ANXIETY       0.10352    0.05762   1.796 0.091327 .  
## A_POINTS      1.98888    0.46918   4.239 0.000625 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.468 on 16 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.834 
## F-statistic: 32.81 on 3 and 16 DF,  p-value: 4.563e-07

We can observe that,

  • The residuals' median is close to zero (which is good).

  • The minimum and maximum residuals are similar in magnitude with opposite signs, so the data points are roughly evenly spread on both sides of the fitted line.

  • The intercept is -11.82254 and is not statistically significant.

  • The HOURS coefficient is 0.55114 and is statistically significant.

  • The ANXIETY coefficient is 0.10352 and is only marginally significant.

  • The A_POINTS coefficient is 1.98888 and is statistically significant.

  • 86.02 % of the variance in SCORE is explained by the explanatory variables (unchanged, so no improvement).

  • The model p-value is 4.563e-07, so the linear model is statistically significant.

In fact, even after applying the step() function there is no improvement; the model keeps all three explanatory variables.
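
For reference, a minimal sketch of the stepwise selection mentioned above, assuming the full model is stored in linear_model:

# Sketch: stepwise selection by AIC; here it retains all three predictors
step_model <- step(linear_model, direction = "both")
summary(step_model)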

Support Vector Machines (Regressors) :-

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression. With kernels, they can solve both linear and non-linear problems.


Fitting SVM - regressor :-

With Default Parameters :-

As a first step, I fit a Support Vector Machine regressor with the default hyperparameter values for cost ( C ) and gamma (\(\gamma\)).
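
A minimal sketch of this fit, assuming the svm() function from the e1071 package (consistent with the parameters shown in the output below):

# Sketch: SVM regressor with default cost and gamma
library(e1071)
svm_model <- svm(regression_form, data = df)
summary(svm_model)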

The summary of the fitted default SVM regressor :-

## 
## Call:
## svm(formula = regression_form, data = df)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.3333333 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  19

We can observe that,

  • The number of support vectors (data points forming the margin) is 19.

  • The algorithm uses the default eps-regression type.

  • The kernel selected is the radial basis function (RBF).

  • The default cost value ( C ) is 1.

  • The default gamma value ( \(\gamma\) ) is 0.3333333, i.e. one over the number of predictors.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 4.069501.

Parameter Tuning (grid search) :-

As we can see, the RMSE of the SVM model with default parameters is not very good. We can tune the parameters C and gamma (\(\gamma\)), changing the smoothness of the fitted curve (the tuned hyperplane) so that the model fits the data points more accurately than before.
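
A minimal sketch of such a grid search, assuming e1071's tune() with bootstrap resampling; the exact parameter grid below is an assumption:

# Sketch: grid search over gamma and cost using bootstrap resampling
library(e1071)
set.seed(1)
svm_tune <- tune(svm, regression_form, data = df,
                 ranges = list(gamma = seq(0.1, 1, by = 0.1), cost = 1:10),
                 tunecontrol = tune.control(sampling = "bootstrap"))
summary(svm_tune)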

  • The tuning summary using the bootstrap sampling method is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##  gamma cost
##    0.3    4
## 
## - best performance: 54.91425

We can observe that the recommended best parameters from the bootstrap sampling method are:

  • \(\gamma\) : 0.3
  • C : 4

Fitting the SVM regressor with the tuned parameters :-
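
A minimal sketch of refitting with the recommended parameters; the object names gam and cos follow the call shown in the output below:

# Sketch: refit the SVM regressor with the tuned gamma and cost
gam <- svm_tune$best.parameters$gamma
cos <- svm_tune$best.parameters$cost
svm_model_tuned <- svm(regression_form, data = df, gamma = gam, cost = cos)
summary(svm_model_tuned)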

## 
## Call:
## svm(formula = regression_form, data = df, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  4 
##       gamma:  0.3 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  19

We can observe that,

  • The number of support vectors (data points forming the margin) is 19.

  • The algorithm uses the default eps-regression type.

  • The kernel selected is the radial basis function (RBF).

  • The cost value ( C ) is 4 (set via tuning).

  • The gamma value ( \(\gamma\) ) is 0.3 (set via tuning).

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 2.653195.

Binary Decision Trees :-

A binary decision tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which represents the prediction target (here, a predicted SCORE).


Fitting a CTREE Binary Decision Tree :-
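
A minimal sketch of this fit, assuming the ctree() function from the party package (which prints conditional inference trees in the form shown below):

# Sketch: conditional inference tree for SCORE
library(party)
tree_model <- ctree(regression_form, data = df)
print(tree_model)
plot(tree_model)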

The summary of the fitted model is as follows

## 
##   Conditional inference tree with 2 terminal nodes
## 
## Response:  SCORE 
## Inputs:  HOURS, ANXIETY, A_POINTS 
## Number of observations:  20 
## 
## 1) A_POINTS <= 22; criterion = 1, statistic = 14.435
##   2)*  weights = 9 
## 1) A_POINTS > 22
##   3)*  weights = 11

The same split can be illustrated in the corresponding tree plot.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 7.2626066.

Conclusion on Regression Models :-

The summary of all the fitted models and their performance on the training data is as follows.

REGRESSION MODELS SUMMARY

S No   Model Name             RMSE Value
1      Linear Model           3.9959115
2      SVM Regressor          4.069501
3      Tuned SVM Regressor    2.653195
4      Binary Decision Tree   7.2626066

As the tuned SVM regressor has the lowest RMSE, I conclude that it is the best of these models for predicting SCORE.


Dimensionality Reduction :-

In our dataset, since SCORE, HOURS, and A_POINTS are highly correlated with each other, we can apply a dimensionality reduction technique such as PCA and describe the data with a smaller number of variables.


Principal Component Analysis (PCA) :-

When variables are correlated, fewer variables can explain almost the same amount of variation. PCA is used to extract the important information from multivariate data and express it as a set of a few new variables called principal components.


Fitting PCA :-
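
A minimal sketch of this step, assuming prcomp() on the three predictors with centering and scaling (consistent with the eigenvalues summing to 3 further below):

# Sketch: PCA on the three predictors, centred and scaled
pca <- prcomp(df[, c("HOURS", "ANXIETY", "A_POINTS")],
              center = TRUE, scale. = TRUE)
print(pca)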

The output of the fitted PCA is as follows.

## Standard deviations (1, .., p=3):
## [1] 1.3848856 0.9061761 0.5108197
## 
## Rotation (n x k) = (3 x 3):
##                 PC1        PC2        PC3
## HOURS    -0.6563723  0.2120875 -0.7240126
## ANXIETY  -0.4110570 -0.9052520  0.1074758
## A_POINTS -0.6326196  0.3681545  0.6813623

We can observe that,

\(PC_1 = -0.6563723 \cdot X_{\text{HOURS}} - 0.4110570 \cdot X_{\text{ANXIETY}} - 0.6326196 \cdot X_{\text{A\_POINTS}}\)

\(PC_2 = 0.2120875 \cdot X_{\text{HOURS}} - 0.9052520 \cdot X_{\text{ANXIETY}} + 0.3681545 \cdot X_{\text{A\_POINTS}}\)

\(PC_3 = -0.7240126 \cdot X_{\text{HOURS}} + 0.1074758 \cdot X_{\text{ANXIETY}} + 0.6813623 \cdot X_{\text{A\_POINTS}}\)

Summary of the PCA :-

The summary of the fitted PCA is as follows.

## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     1.3849 0.9062 0.51082
## Proportion of Variance 0.6393 0.2737 0.08698
## Cumulative Proportion  0.6393 0.9130 1.00000

We can observe that,

  • 63.93 % of the variance in the input dataset is explained by \(PC_1\) alone.
  • 91.3 % of the variance in the input dataset is explained by \(PC_1\) and \(PC_2\) together.

PCA Eigenvalues :-

Apart from the proportion of variance explained, we can express the same information in terms of eigenvalues. The eigenvalues of the fitted principal components are as follows.
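
A minimal sketch of how such a table could be obtained, assuming the factoextra package (whose get_eigenvalue() uses the column names shown below); the TOTAL row would be appended separately:

# Sketch: eigenvalues of the fitted principal components
library(factoextra)
get_eigenvalue(pca)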

##       eigenvalue variance.percent cumulative.variance.percent
## PC_1   1.9179081        63.930270                    63.93027
## PC_2   0.8211551        27.371838                    91.30211
## PC_3   0.2609367         8.697892                   100.00000
## TOTAL  3.0000000       100.000000                   255.23238

We can observe that,

  • The eigenvalue of each principal component is proportional to the variance it explains.

  • As with the variance explained, the eigenvalue of the first principal component is by far the largest.

  • Since the total of the eigenvalues equals the number of input variables (3) and the total variance explained is 100 %, all the principal components together explain all of the variance in the input dataset.

Plot of Eigenvalues :-
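
A minimal sketch of producing such a scree plot, assuming the factoextra package:

# Sketch: scree plot of the variance explained by each principal component
library(factoextra)
fviz_eig(pca, addlabels = TRUE)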

This plot clearly shows that,

  • \(PC_1\) and \(PC_2\) explain most of the variance.

  • We can ignore \(PC_3\).

\(Cos^{2}\) values :-

\(Cos^{2}\) shows the importance of any principal component for a given observation.
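
A minimal sketch of producing such a plot, assuming the factoextra package:

# Sketch: cos2 of the input variables on the first two principal components
library(factoextra)
fviz_cos2(pca, choice = "var", axes = 1:2)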

From the above plot we can conclude that, for \(PC_1\) and \(PC_2\), all three input variables contribute almost equally.

Bi-Plot :-

As we observed from the \(cos^2\) plot, all the input variables contribute almost equally. The same can be seen in this bi-plot: all the arrows lie close to the circumference of the circle, which means all input variables are important.

Updated Dataset :-
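
A minimal sketch of building this reduced dataset, keeping the first two principal component scores and the response; the name df_latest matches the model call further below:

# Sketch: dataset with the first two PC scores plus the response
df_latest <- data.frame(pca$x[, 1:2], SCORE = df$SCORE)
df_latest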

The dimensionally reduced dataset for the students data is as follows.

##            PC1          PC2 SCORE
## 1  -0.42153665 -0.333451152    62
## 2   1.56236878  0.193549856    58
## 3   0.20443843 -0.967259600    52
## 4   2.09446682  1.551067844    55
## 5  -1.87624414  0.670308752    75
## 6  -1.52882220  0.887421614    82
## 7   2.02156364 -0.990217256    38
## 8   1.04142492  0.142296206    55
## 9   1.35557708 -1.543558063    48
## 10 -0.36652937  1.617387488    68
## 11  0.46144684  0.149773461    62
## 12  0.03495089  0.671850540    62
## 13 -2.63957870 -0.283119859    72
## 14 -0.64678767 -1.742842222    58
## 15  0.31060395 -0.005827844    65
## 16  1.48253857 -0.164922772    42
## 17  0.12902070  0.696349985    68
## 18 -1.34433886  0.012000860    68
## 19  0.27040713  0.274017497    58
## 20 -2.14497016 -0.834825336    72

Fitting Linear Model:-

Fitting Linear Model with the reduced dimensions ( PC1 & PC2 ).
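
A minimal sketch of this fit, assuming the reduced dataset df_latest built above (the object name pc_model is illustrative):

# Sketch: linear model of SCORE on the two principal components
pc_model <- lm(SCORE ~ ., data = df_latest)
summary(pc_model)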

The summary of the fitted Linear Regression Model :-

## 
## Call:
## lm(formula = SCORE ~ ., data = df_latest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4931 -2.6588 -0.2073  3.3768  6.4494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.0000     0.9711  62.814  < 2e-16 ***
## PC1          -6.5109     0.7194  -9.050 6.55e-08 ***
## PC2           5.1797     1.0995   4.711 0.000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.343 on 17 degrees of freedom
## Multiple R-squared:  0.8596, Adjusted R-squared:  0.8431 
## F-statistic: 52.05 on 2 and 17 DF,  p-value: 5.653e-08

We can observe that,

  • The residuals' median is close to zero (which is good).

  • The minimum and maximum residuals are similar in magnitude with opposite signs, so the data points are roughly evenly spread on both sides of the fitted line.

  • The intercept is 61 and is statistically significant.

  • The PC1 coefficient is -6.5109 and is statistically significant.

  • The PC2 coefficient is 5.1797 and is statistically significant.

  • 85.96 % of the variance in SCORE is explained by the explanatory variables (which is good).

  • The model p-value is 5.653e-08, so the linear model is statistically significant.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 4.0040426.

CONCLUSION:-

As the RMSE without PCA and with PCA are almost the same, I conclude that dimensionality reduction did not improve this linear model. It might still help for other models.

————————————————————- THANK YOU ————————————————————-