Introduction :-

In this report, I perform a regression analysis on a dataset of students.


Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

The given dataset has 20 students, and each student has 4 attributes. The first few rows of the dataset are as follows.

##   SCORE HOURS ANXIETY A_POINTS
## 1    62    40      40       24
## 2    58    31      65       20
## 3    52    35      34       22
## 4    55    26      91       22
## 5    75    51      46       28
## 6    82    48      52       28

The detailed structure is as follows.

## 'data.frame':    20 obs. of  4 variables:
##  $ SCORE   : int  62 58 52 55 75 82 38 55 48 68 ...
##  $ HOURS   : int  40 31 35 26 51 48 25 37 30 44 ...
##  $ ANXIETY : int  40 65 34 91 46 52 48 61 34 74 ...
##  $ A_POINTS: int  24 20 22 22 28 28 18 20 18 26 ...

The type of each attribute in the input is as follows.

##     SCORE     HOURS   ANXIETY  A_POINTS 
## "integer" "integer" "integer" "integer"

All attribute types are suitable for this analysis, so we can proceed further.


The number of null values in each column is as follows.

##    SCORE    HOURS  ANXIETY A_POINTS 
##        0        0        0        0

As there are no null values, we can proceed further.


The overall summary of all the attributes is as follows.

##      SCORE        HOURS          ANXIETY         A_POINTS   
##  Min.   :38   Min.   :25.00   Min.   :13.00   Min.   :18.0  
##  1st Qu.:55   1st Qu.:31.75   1st Qu.:37.75   1st Qu.:20.0  
##  Median :62   Median :39.50   Median :53.00   Median :24.0  
##  Mean   :61   Mean   :39.15   Mean   :49.30   Mean   :23.2  
##  3rd Qu.:68   3rd Qu.:45.25   3rd Qu.:61.00   3rd Qu.:26.0  
##  Max.   :82   Max.   :61.00   Max.   :91.00   Max.   :28.0

The distribution of all continuous variables is as follows.
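
A minimal sketch of how these distributions could be plotted, assuming base R graphics and the data frame name df used in the model calls later in the report:

# Sketch: histogram of each column to inspect its distribution
par(mfrow = c(2, 2))
for (col in names(df)) {
  hist(df[[col]], main = col, xlab = col)
}
par(mfrow = c(1, 1))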


The correlation between the continuous variables is as follows.

##               SCORE      HOURS    ANXIETY   A_POINTS
## SCORE     1.0000000  0.8210109 -0.1182956  0.8716357
## HOURS     0.8210109  1.0000000 -0.3395037  0.7317732
## ANXIETY  -0.1182956 -0.3395037  1.0000000 -0.2441778
## A_POINTS  0.8716357  0.7317732 -0.2441778  1.0000000


Conclusions from the EDA :-

In our data set,

  • The distributions of all four variables look reasonable; there are no unusual patterns and no outliers in the input data.

  • The variables SCORE, HOURS, and A_POINTS are highly correlated with each other, so we can apply dimensionality reduction techniques to reduce the number of variables.

  • Overall, there are no other conspicuous patterns in the input data.


Models to predict SCORE:-

I now build several models to predict a student's SCORE from the other three attributes (HOURS, ANXIETY, and A_POINTS).

Fitting Linear Model:-

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
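
A minimal sketch of how this model could be fitted, assuming the formula object regression_form and data frame df named in the output below (the object name linear_model is illustrative):

# Sketch: linear model of SCORE on the other three attributes
regression_form <- SCORE ~ HOURS + ANXIETY + A_POINTS
linear_model <- lm(regression_form, data = df)
summary(linear_model)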

The summary of the fitted Linear Regression Model :-

## 
## Call:
## lm(formula = regression_form, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4933 -2.5081 -0.3897  3.1395  6.6537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.82254    8.80575  -1.343 0.198143    
## HOURS         0.55114    0.17086   3.226 0.005284 ** 
## ANXIETY       0.10352    0.05762   1.796 0.091327 .  
## A_POINTS      1.98888    0.46918   4.239 0.000625 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.468 on 16 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.834 
## F-statistic: 32.81 on 3 and 16 DF,  p-value: 4.563e-07

We can observe that,

  • The residuals' median is close to zero (which is good).

  • The minimum and maximum residuals are similar in magnitude with opposite signs, so the data points are roughly evenly spread on both sides of the fitted line.

  • The intercept is -11.82254 and is not statistically significant.

  • The HOURS coefficient is 0.55114 and is statistically significant.

  • The ANXIETY coefficient is 0.10352 and is only marginally significant.

  • The A_POINTS coefficient is 1.98888 and is statistically significant.

  • 86.02 % of the variance in SCORE is explained by the explanatory variables (which is good).

  • The model p-value is 4.563e-07, so the linear model is statistically significant.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 3.9959115.
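
A minimal sketch of how the training RMSE reported above could be computed, assuming the fitted model is stored in linear_model (an illustrative name):

# Sketch: root mean squared error on the training data
pred <- predict(linear_model, newdata = df)
sqrt(mean((df$SCORE - pred)^2))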

Next, I try to reduce the set of explanatory variables, keeping only the most significant ones.

Fitting Updated Linear Model:-

The summary of the updated linear model is as follows.

## 
## Call:
## lm(formula = SCORE ~ HOURS + ANXIETY + A_POINTS, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4933 -2.5081 -0.3897  3.1395  6.6537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.82254    8.80575  -1.343 0.198143    
## HOURS         0.55114    0.17086   3.226 0.005284 ** 
## ANXIETY       0.10352    0.05762   1.796 0.091327 .  
## A_POINTS      1.98888    0.46918   4.239 0.000625 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.468 on 16 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.834 
## F-statistic: 32.81 on 3 and 16 DF,  p-value: 4.563e-07

We can observe that,

  • The residuals' median is close to zero (which is good).

  • The minimum and maximum residuals are similar in magnitude with opposite signs, so the data points are roughly evenly spread on both sides of the fitted line.

  • The intercept is -11.82254 and is not statistically significant.

  • The HOURS coefficient is 0.55114 and is statistically significant.

  • The ANXIETY coefficient is 0.10352 and is only marginally significant.

  • The A_POINTS coefficient is 1.98888 and is statistically significant.

  • 86.02 % of the variance in SCORE is explained by the explanatory variables (unchanged, so no improvement).

  • The model p-value is 4.563e-07, so the linear model is statistically significant.

In fact, even after applying the step() function there is no improvement; the model keeps all three explanatory variables.
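
For reference, a minimal sketch of the stepwise selection mentioned above, assuming the full model is stored in linear_model:

# Sketch: stepwise selection by AIC; here it retains all three predictors
step_model <- step(linear_model, direction = "both")
summary(step_model)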

Support Vector Machines (Regressors) :-

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression. With kernels, they can solve both linear and non-linear problems.


Fitting SVM - regressor :-

With Default Parameters :-

As a first step, I fit a Support Vector Machine regressor with the default hyperparameter values for cost ( C ) and gamma (\(\gamma\)).
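
A minimal sketch of this fit, assuming the svm() function from the e1071 package (consistent with the parameters shown in the output below):

# Sketch: SVM regressor with default cost and gamma
library(e1071)
svm_model <- svm(regression_form, data = df)
summary(svm_model)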

The summary of the fitted default SVM regressor :-

## 
## Call:
## svm(formula = regression_form, data = df)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.3333333 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  19

We can observe that,

  • The number of support vectors (data points forming the margin) is 19.

  • The algorithm uses the default eps-regression type.

  • The kernel selected is the radial basis function (RBF).

  • The default cost value ( C ) is 1.

  • The default gamma value ( \(\gamma\) ) is 0.3333333, i.e. one over the number of predictors.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 4.069501.

Parameter Tuning (grid search) :-

As we can see, the RMSE of the SVM model with default parameters is not very good. We can tune the parameters C and gamma (\(\gamma\)), changing the smoothness of the fitted curve (the tuned hyperplane) so that the model fits the data points more accurately than before.
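
A minimal sketch of such a grid search, assuming e1071's tune() with bootstrap resampling; the exact parameter grid below is an assumption:

# Sketch: grid search over gamma and cost using bootstrap resampling
library(e1071)
set.seed(1)
svm_tune <- tune(svm, regression_form, data = df,
                 ranges = list(gamma = seq(0.1, 1, by = 0.1), cost = 1:10),
                 tunecontrol = tune.control(sampling = "bootstrap"))
summary(svm_tune)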

  • The tuning summary using the bootstrap sampling method is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##  gamma cost
##    0.3    4
## 
## - best performance: 54.91425

We can observe that the recommended best parameters from the bootstrap sampling method are:

  • \(\gamma\) : 0.3
  • C : 4

Fitting the SVM regressor with the tuned parameters :-
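
A minimal sketch of refitting with the recommended parameters; the object names gam and cos follow the call shown in the output below:

# Sketch: refit the SVM regressor with the tuned gamma and cost
gam <- svm_tune$best.parameters$gamma
cos <- svm_tune$best.parameters$cost
svm_model_tuned <- svm(regression_form, data = df, gamma = gam, cost = cos)
summary(svm_model_tuned)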

## 
## Call:
## svm(formula = regression_form, data = df, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  4 
##       gamma:  0.3 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  19

We can observe that,

  • The number of support vectors (data points forming the margin) is 19.

  • The algorithm uses the default eps-regression type.

  • The kernel selected is the radial basis function (RBF).

  • The cost value ( C ) is 4 (set via tuning).

  • The gamma value ( \(\gamma\) ) is 0.3 (set via tuning).

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 2.653195.

Binary Decision Trees :-

A binary decision tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which represents the prediction target (here, a predicted SCORE).


Fitting a CTREE Binary Decision Tree :-
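
A minimal sketch of this fit, assuming the ctree() function from the party package (which prints conditional inference trees in the form shown below):

# Sketch: conditional inference tree for SCORE
library(party)
tree_model <- ctree(regression_form, data = df)
print(tree_model)
plot(tree_model)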

The summary of the fitted model is as follows

## 
##   Conditional inference tree with 2 terminal nodes
## 
## Response:  SCORE 
## Inputs:  HOURS, ANXIETY, A_POINTS 
## Number of observations:  20 
## 
## 1) A_POINTS <= 22; criterion = 1, statistic = 14.435
##   2)*  weights = 9 
## 1) A_POINTS > 22
##   3)*  weights = 11

The same split can be illustrated in the corresponding tree plot.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 7.2626066.

Conclusion on Regression Models :-

The summary of all the fitted models and their performance on the training data is as follows.

REGRESSION MODELS SUMMARY

S No   Model Name             RMSE Value
1      Linear Model           3.9959115
2      SVM Regressor          4.069501
3      Tuned SVM Regressor    2.653195
4      Binary Decision Tree   7.2626066

As the tuned SVM regressor has the lowest RMSE, I conclude that it is the best of these models for predicting SCORE.


Dimensionality Reduction :-

In our dataset, since SCORE, HOURS, and A_POINTS are highly correlated with each other, we can apply a dimensionality reduction technique such as PCA and describe the data with a smaller number of variables.


Principal Component Analysis (PCA) :-

When variables are correlated, fewer variables can explain almost the same amount of variation. PCA is used to extract the important information from multivariate data and express it as a set of a few new variables called principal components.


Fitting PCA :-
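
A minimal sketch of this step, assuming prcomp() on the three predictors with centering and scaling (consistent with the eigenvalues summing to 3 further below):

# Sketch: PCA on the three predictors, centred and scaled
pca <- prcomp(df[, c("HOURS", "ANXIETY", "A_POINTS")],
              center = TRUE, scale. = TRUE)
print(pca)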

The output of the fitted PCA is as follows.

## Standard deviations (1, .., p=3):
## [1] 1.3848856 0.9061761 0.5108197
## 
## Rotation (n x k) = (3 x 3):
##                 PC1        PC2        PC3
## HOURS    -0.6563723  0.2120875 -0.7240126
## ANXIETY  -0.4110570 -0.9052520  0.1074758
## A_POINTS -0.6326196  0.3681545  0.6813623

We can observe that,

\(PC_1 = -0.6563723 \cdot X_{\text{HOURS}} - 0.4110570 \cdot X_{\text{ANXIETY}} - 0.6326196 \cdot X_{\text{A\_POINTS}}\)

\(PC_2 = 0.2120875 \cdot X_{\text{HOURS}} - 0.9052520 \cdot X_{\text{ANXIETY}} + 0.3681545 \cdot X_{\text{A\_POINTS}}\)

\(PC_3 = -0.7240126 \cdot X_{\text{HOURS}} + 0.1074758 \cdot X_{\text{ANXIETY}} + 0.6813623 \cdot X_{\text{A\_POINTS}}\)

Summary of the PCA :-

The summary of the fitted PCA is as follows.

## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     1.3849 0.9062 0.51082
## Proportion of Variance 0.6393 0.2737 0.08698
## Cumulative Proportion  0.6393 0.9130 1.00000

We can observe that,

  • 63.93 % of the variance in the input dataset is explained by \(PC_1\) alone.
  • 91.3 % of the variance in the input dataset is explained by \(PC_1\) and \(PC_2\) together.

PCA Eigenvalues :-

Apart from the proportion of variance explained, we can express the same information in terms of eigenvalues. The eigenvalues of the fitted principal components are as follows.
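
A minimal sketch of how such a table could be obtained, assuming the factoextra package (whose get_eigenvalue() uses the column names shown below); the TOTAL row would be appended separately:

# Sketch: eigenvalues of the fitted principal components
library(factoextra)
get_eigenvalue(pca)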

##       eigenvalue variance.percent cumulative.variance.percent
## PC_1   1.9179081        63.930270                    63.93027
## PC_2   0.8211551        27.371838                    91.30211
## PC_3   0.2609367         8.697892                   100.00000
## TOTAL  3.0000000       100.000000                   255.23238

We can observe that,

  • The eigenvalue of each principal component is proportional to the variance it explains.

  • As with the variance explained, the eigenvalue of the first principal component is by far the largest.

  • Since the total of the eigenvalues equals the number of input variables (3) and the total variance explained is 100 %, all the principal components together explain all of the variance in the input dataset.

Plot of Eigenvalues :-
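
A minimal sketch of producing such a scree plot, assuming the factoextra package:

# Sketch: scree plot of the variance explained by each principal component
library(factoextra)
fviz_eig(pca, addlabels = TRUE)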

This plot clearly shows that,

  • \(PC_1\) and \(PC_2\) explain most of the variance.

  • We can ignore \(PC_3\).

\(Cos^{2}\) values :-

\(Cos^{2}\) shows the importance of any principal component for a given observation.
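
A minimal sketch of producing such a plot, assuming the factoextra package:

# Sketch: cos2 of the input variables on the first two principal components
library(factoextra)
fviz_cos2(pca, choice = "var", axes = 1:2)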

From the above plot we can conclude that, for \(PC_1\) and \(PC_2\), all three input variables contribute almost equally.

Bi-Plot :-

As we observed from the \(cos^2\) plot, all the input variables contribute almost equally. The same can be seen in this bi-plot: all the arrows lie close to the circumference of the circle, which means all input variables are important.

Updated Dataset :-
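
A minimal sketch of building this reduced dataset, keeping the first two principal component scores and the response; the name df_latest matches the model call further below:

# Sketch: dataset with the first two PC scores plus the response
df_latest <- data.frame(pca$x[, 1:2], SCORE = df$SCORE)
df_latest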

The dimensionally reduced dataset for the students data is as follows.

##            PC1          PC2 SCORE
## 1  -0.42153665 -0.333451152    62
## 2   1.56236878  0.193549856    58
## 3   0.20443843 -0.967259600    52
## 4   2.09446682  1.551067844    55
## 5  -1.87624414  0.670308752    75
## 6  -1.52882220  0.887421614    82
## 7   2.02156364 -0.990217256    38
## 8   1.04142492  0.142296206    55
## 9   1.35557708 -1.543558063    48
## 10 -0.36652937  1.617387488    68
## 11  0.46144684  0.149773461    62
## 12  0.03495089  0.671850540    62
## 13 -2.63957870 -0.283119859    72
## 14 -0.64678767 -1.742842222    58
## 15  0.31060395 -0.005827844    65
## 16  1.48253857 -0.164922772    42
## 17  0.12902070  0.696349985    68
## 18 -1.34433886  0.012000860    68
## 19  0.27040713  0.274017497    58
## 20 -2.14497016 -0.834825336    72

Fitting Linear Model:-

Fitting Linear Model with the reduced dimensions ( PC1 & PC2 ).
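
A minimal sketch of this fit, assuming the reduced dataset df_latest built above (the object name pc_model is illustrative):

# Sketch: linear model of SCORE on the two principal components
pc_model <- lm(SCORE ~ ., data = df_latest)
summary(pc_model)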

The summary of the fitted Linear Regression Model :-

## 
## Call:
## lm(formula = SCORE ~ ., data = df_latest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4931 -2.6588 -0.2073  3.3768  6.4494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.0000     0.9711  62.814  < 2e-16 ***
## PC1          -6.5109     0.7194  -9.050 6.55e-08 ***
## PC2           5.1797     1.0995   4.711 0.000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.343 on 17 degrees of freedom
## Multiple R-squared:  0.8596, Adjusted R-squared:  0.8431 
## F-statistic: 52.05 on 2 and 17 DF,  p-value: 5.653e-08

We can observe that,

  • The residuals' median is close to zero (which is good).

  • The minimum and maximum residuals are similar in magnitude with opposite signs, so the data points are roughly evenly spread on both sides of the fitted line.

  • The intercept is 61 and is statistically significant.

  • The PC1 coefficient is -6.5109 and is statistically significant.

  • The PC2 coefficient is 5.1797 and is statistically significant.

  • 85.96 % of the variance in SCORE is explained by the explanatory variables (which is good).

  • The model p-value is 5.653e-08, so the linear model is statistically significant.

  • ROOT MEAN SQUARE ERROR OF THIS MODEL IS 4.0040426.

CONCLUSION:-

As the RMSE without PCA and with PCA are almost the same, I conclude that dimensionality reduction did not improve this linear model. It might still help for other models.

————————————————————- THANK YOU ————————————————————-