Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
- Structure of the given dataset :-
The given dataset has 20 students, and each student has 4 attributes. The first few rows of the dataset are as follows.
## SCORE HOURS ANXIETY A_POINTS
## 1 62 40 40 24
## 2 58 31 65 20
## 3 52 35 34 22
## 4 55 26 91 22
## 5 75 51 46 28
## 6 82 48 52 28
Explanation of all the variables :-
- SCORE : Test performance.
- HOURS : Hours spent learning.
- ANXIETY : Test anxiety score; a higher value means higher test anxiety.
- A_POINTS : Points in the entrance examination; more points mean better examination results.
The detailed structure is as follows.
## 'data.frame': 20 obs. of 4 variables:
## $ SCORE : int 62 58 52 55 75 82 38 55 48 68 ...
## $ HOURS : int 40 31 35 26 51 48 25 37 30 44 ...
## $ ANXIETY : int 40 65 34 91 46 52 48 61 34 74 ...
## $ A_POINTS: int 24 20 22 22 28 28 18 20 18 26 ...
The type of each attribute in the input is as follows.
## SCORE HOURS ANXIETY A_POINTS
## "integer" "integer" "integer" "integer"
All attribute types are suitable for this analysis, so we can proceed further.
- Dealing with NULL values :-
The number of null values in each column is as follows.
## SCORE HOURS ANXIETY A_POINTS
## 0 0 0 0
As there are no null values, we can proceed further.
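For reference, a minimal sketch of this check, assuming the dataset is loaded as the data frame df (the name used in the model calls below):

```r
# Count missing (NA) values per column; all zeros means no nulls
colSums(is.na(df))
```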
- Summary :-
The overall summary of all the attributes is as follows.
## SCORE HOURS ANXIETY A_POINTS
## Min. :38 Min. :25.00 Min. :13.00 Min. :18.0
## 1st Qu.:55 1st Qu.:31.75 1st Qu.:37.75 1st Qu.:20.0
## Median :62 Median :39.50 Median :53.00 Median :24.0
## Mean :61 Mean :39.15 Mean :49.30 Mean :23.2
## 3rd Qu.:68 3rd Qu.:45.25 3rd Qu.:61.00 3rd Qu.:26.0
## Max. :82 Max. :61.00 Max. :91.00 Max. :28.0
The distribution of all the continuous variables is shown in the histogram plots.
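One way to produce these plots and the correlation matrix below, as a sketch (base-R histograms are an assumption; the original may have used another plotting method):

```r
# Histograms of each variable, plus the Pearson correlation matrix
par(mfrow = c(2, 2))
for (v in names(df)) hist(df[[v]], main = v, xlab = v)
cor(df)
```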
The correlation between the continuous variables is as follows.
## SCORE HOURS ANXIETY A_POINTS
## SCORE 1.0000000 0.8210109 -0.1182956 0.8716357
## HOURS 0.8210109 1.0000000 -0.3395037 0.7317732
## ANXIETY -0.1182956 -0.3395037 1.0000000 -0.2441778
## A_POINTS 0.8716357 0.7317732 -0.2441778 1.0000000
Conclusion and Description of the EDA :-
In our dataset,
The distributions of all four variables look reasonable; there is nothing strange in them, and there are no outliers in the input data.
The variables SCORE, HOURS & A_POINTS are highly correlated with each other, so we can apply dimensionality reduction techniques to reduce the number of variables.
Overall, there are no conspicuous irregularities in the input data.
Models to predict SCORE:-
Now I am planning to build various models to predict a student's SCORE given the following three attributes.
- HOURS
- ANXIETY
- A_POINTS
Fitting Linear Model:-
Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
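A minimal sketch of the fit (the formula object regression_form matches the Call shown in the summary below):

```r
# Full linear model: SCORE regressed on all three predictors
regression_form <- SCORE ~ HOURS + ANXIETY + A_POINTS
model_lm <- lm(regression_form, data = df)
summary(model_lm)
```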
The summary of the fitted Linear Regression Model :-
##
## Call:
## lm(formula = regression_form, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4933 -2.5081 -0.3897 3.1395 6.6537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.82254 8.80575 -1.343 0.198143
## HOURS 0.55114 0.17086 3.226 0.005284 **
## ANXIETY 0.10352 0.05762 1.796 0.091327 .
## A_POINTS 1.98888 0.46918 4.239 0.000625 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.468 on 16 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.834
## F-statistic: 32.81 on 3 and 16 DF, p-value: 4.563e-07
We can observe that,
- The residuals' median is almost zero (which is good).
- The residuals' minimum and maximum are close in magnitude with opposite signs, so the data points are distributed roughly evenly on both sides of the fitted line.
- The intercept is -11.82254 and is not statistically significant.
- The HOURS coefficient is 0.55114 and is statistically significant.
- The ANXIETY coefficient is 0.10352 and is only marginally significant (at the 10% level).
- The A_POINTS coefficient is 1.98888 and is statistically significant.
- 86.02 % of the variance in SCORE is explained by the three predictors (which is good).
- The overall p-value is 4.563e-07, so the linear model is statistically significant.
The root mean square error (RMSE) of this model is 3.9959115.
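One common way to compute this training RMSE, as a sketch:

```r
# Training RMSE: square root of the mean squared residual
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(df$SCORE, predict(model_lm, df))
```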
Next, I try to reduce the number of predictors by keeping only the most significant ones.
Fitting Updated Linear Model:-
The summary of the updated linear model is as follows.
##
## Call:
## lm(formula = SCORE ~ HOURS + ANXIETY + A_POINTS, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4933 -2.5081 -0.3897 3.1395 6.6537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.82254 8.80575 -1.343 0.198143
## HOURS 0.55114 0.17086 3.226 0.005284 **
## ANXIETY 0.10352 0.05762 1.796 0.091327 .
## A_POINTS 1.98888 0.46918 4.239 0.000625 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.468 on 16 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.834
## F-statistic: 32.81 on 3 and 16 DF, p-value: 4.563e-07
We can observe that the selected model retains all three predictors, so the coefficients, their significance levels, the R-squared (86.02 % of the variance in SCORE explained) and the overall p-value (4.563e-07) are identical to the full model above.
Even after applying the step() function there is no improvement: all three predictors are retained.
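A sketch of this stepwise selection, assuming base-R step() was used:

```r
# Stepwise model selection by AIC, starting from the full model;
# here it keeps all three predictors, so nothing changes
model_step <- step(model_lm, direction = "both")
summary(model_step)
```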
Support Vector Machines (Regression) :-
Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With kernel functions, they can solve both linear and non-linear problems.
Fitting SVM - regressor :-
With Default Parameters :-
As a first step, I am trying to fit a support vector machine regressor with the default values of the cost parameter ( C ) and gamma (\(\gamma\)).
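A minimal sketch of this fit using the e1071 package, whose svm() call matches the output below:

```r
library(e1071)
# eps-regression SVM; defaults: cost = 1, gamma = 1/(number of features)
model_svm <- svm(regression_form, data = df)
summary(model_svm)
```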
The summary of the fitted default SVM regressor :-
##
## Call:
## svm(formula = regression_form, data = df)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.3333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 19
We can observe that,
- The number of data points forming the margin (support vectors) is 19.
- The algorithm uses the default eps-regression type.
- The selected kernel is the radial basis function (RBF).
- The default cost value ( C ) is 1.
- The default gamma value ( \(\gamma\) ) is 0.3333333.
The root mean square error (RMSE) of this model is 4.069501.
Parameter Tuning ( grid search ) :-
As we can see, the RMSE of the SVM model with default parameters is not very good. We can tune the parameters C and gamma (\(\gamma\)), which changes the smoothness of the fitted curve (the tuned hyperplane), so that the model fits the data points more accurately than before.
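A sketch of such a grid search with e1071's tune(); the exact grid ranges below are assumptions, while the bootstrap resampling matches the summary that follows:

```r
set.seed(1)  # resampling makes the tuning result run-dependent
tuned <- tune(svm, regression_form, data = df,
              ranges = list(gamma = seq(0.1, 1, by = 0.1), cost = 1:8),
              tunecontrol = tune.control(sampling = "bootstrap"))
tuned$best.parameters
```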
- The tuning summary, using the bootstrap sampling method, is as follows.
##
## Parameter tuning of 'svm':
##
## - sampling method: bootstrapping
##
## - best parameters:
## gamma cost
## 0.3 4
##
## - best performance: 54.91425
We can observe that the recommended best parameters from the bootstrap sampling method are:
- \(\gamma\) : 0.3
- C : 4
Fitting the SVM regressor with the new parameters :-
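A sketch of the refit with the tuned values; gam and cos name the tuned gamma and cost, matching the Call below:

```r
gam <- tuned$best.parameters$gamma  # 0.3
cos <- tuned$best.parameters$cost   # 4
model_svm_tuned <- svm(regression_form, data = df, gamma = gam, cost = cos)
summary(model_svm_tuned)
```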
##
## Call:
## svm(formula = regression_form, data = df, gamma = gam, cost = cos)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 4
## gamma: 0.3
## epsilon: 0.1
##
##
## Number of Support Vectors: 19
We can observe that,
- The number of data points forming the margin (support vectors) is 19.
- The algorithm uses the default eps-regression type.
- The selected kernel is the radial basis function (RBF).
- The cost value ( C ) is 4 (which we set).
- The gamma value ( \(\gamma\) ) is 0.3 (which we set).
The root mean square error (RMSE) of this model is 2.653195.
Binary Decision Trees :-
A binary decision tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which represents the prediction target.
Fitting a CTREE Binary Decision Tree :-
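A minimal sketch using the party package, whose conditional inference tree output matches the summary below:

```r
library(party)
# Conditional inference tree for SCORE on the three predictors
model_ct <- ctree(regression_form, data = df)
print(model_ct)  # textual summary shown below
plot(model_ct)   # the tree plot referenced afterwards
```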
The summary of the fitted model is as follows
##
## Conditional inference tree with 2 terminal nodes
##
## Response: SCORE
## Inputs: HOURS, ANXIETY, A_POINTS
## Number of observations: 20
##
## 1) A_POINTS <= 22; criterion = 1, statistic = 14.435
## 2)* weights = 9
## 1) A_POINTS > 22
## 3)* weights = 11
The same structure can be seen in the tree plot.
The root mean square error (RMSE) of this model is 7.2626066.
Conclusion on Regression Models :-
The summary of all the fitted models and their performance on the training data is as follows.
REGRESSION MODELS SUMMARY
| S No | Model Name | RMSE Value |
|---|---|---|
| 1. | Linear Model | 3.9959115 |
| 2. | SVM Regressor | 4.069501 |
| 3. | Tuned SVM Regressor | 2.653195 |
| 4. | Binary Decision Tree | 7.2626066 |
As the tuned SVM regressor has the lowest RMSE, I conclude that it is the best of these models for predicting SCORE.
Dimensionality Reduction :-
In our dataset,
As the SCORE, HOURS & A_POINTS variables are highly correlated with each other, we can apply a dimensionality reduction technique such as PCA and describe all 4 attributes with a smaller number of variables.
Principal Component Analysis ( PCA ) :-
When variables are correlated, fewer variables can explain almost the same amount of variation. PCA is used to extract the important information from multivariate data and express this information as a set of a few new variables called principal components.
Fitting PCA :-
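A minimal sketch of the fit with base R's prcomp() on the three predictors; centering and scaling are inferred from the eigenvalues summing to 3 in the eigenvalue table further below:

```r
# PCA on the predictors, centered and scaled to unit variance
pca <- prcomp(df[, c("HOURS", "ANXIETY", "A_POINTS")],
              center = TRUE, scale. = TRUE)
print(pca)  # standard deviations and rotation (loadings)
```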
The output of the fitted PCA is as follows.
## Standard deviations (1, .., p=3):
## [1] 1.3848856 0.9061761 0.5108197
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## HOURS -0.6563723 0.2120875 -0.7240126
## ANXIETY -0.4110570 -0.9052520 0.1074758
## A_POINTS -0.6326196 0.3681545 0.6813623
We can observe that,
The highly correlated attributes (HOURS, ANXIETY & A_POINTS) can be described by the uncorrelated principal components (PC1, PC2 & PC3).
The new principal components can be expressed by the following equations.
\(PC_1 = -0.6563723*X_{HOURS} - 0.4110570*X_{ANXIETY} - 0.6326196*X_{A\_POINTS}\)
\(PC_2 = 0.2120875*X_{HOURS} - 0.9052520*X_{ANXIETY} + 0.3681545*X_{A\_POINTS}\)
\(PC_3 = -0.7240126*X_{HOURS} + 0.1074758*X_{ANXIETY} + 0.6813623*X_{A\_POINTS}\)
Summary of the PCA :-
The summary of the fitted PCA is as follows.
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.3849 0.9062 0.51082
## Proportion of Variance 0.6393 0.2737 0.08698
## Cumulative Proportion 0.6393 0.9130 1.00000
We can observe that,
- 63.93 % of the variance in the input dataset is explained by \(PC_1\) alone.
- 91.3 % of the variance in the input dataset is explained by \(PC_1\) & \(PC_2\) together.
Eigenvalues of the PCA :-
Apart from the proportion of variance explained, we can express the same information in terms of eigenvalues. The eigenvalues of the fitted principal components are as follows.
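A sketch of how such a table can be produced, assuming the factoextra package was used (its get_eigenvalue() output carries exactly these column names):

```r
library(factoextra)
# Eigenvalues with (cumulative) percentage of explained variance
get_eigenvalue(pca)
```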
## eigenvalue variance.percent cumulative.variance.percent
## PC_1 1.9179081 63.930270 63.93027
## PC_2 0.8211551 27.371838 91.30211
## PC_3 0.2609367 8.697892 100.00000
## TOTAL 3.0000000 100.000000
We can observe that,
The eigenvalue of each principal component is proportional to the variance it explains.
As with the variance explained, the eigenvalue is highest for the first principal component.
As the eigenvalues sum to the total number of input variables (3) and the total variance is 100 %, all the principal components together explain all the variance in the input dataset.
Plot of Eigenvalues :-
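As a sketch, the scree plot can be produced with factoextra on the same pca object:

```r
# Scree plot: percentage of variance explained per component
fviz_eig(pca, addlabels = TRUE)
```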
This plot clearly shows that \(PC_1\) and \(PC_2\) explain most of the variance, so we can ignore \(PC_3\).
\(Cos^{2}\) values :-
\(Cos^{2}\) (the squared cosine) shows how well a given variable or observation is represented by a principal component.
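A sketch of the plot, again assuming factoextra:

```r
# cos2 of the variables on the first two principal components
fviz_cos2(pca, choice = "var", axes = 1:2)
```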
From this plot we can conclude that, for \(PC_1\) & \(PC_2\), all three input variables contribute almost equally.
Bi-Plot :-
As we observed from the \(cos^2\) plot, all the input variables contribute almost equally. The same can be seen in this bi-plot: all the arrows reach close to the circumference of the circle, which means all input variables are important.
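A sketch of the bi-plot, with factoextra assumed as above:

```r
# Bi-plot of observations with the variable arrows on PC1-PC2
fviz_pca_biplot(pca, repel = TRUE)
```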
Updated Dataset :-
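As a sketch, the reduced data frame can be assembled from the PCA scores; the name df_latest matches the lm Call in the next section (score signs may differ by PCA convention):

```r
# First two principal-component scores plus the response
df_latest <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                        SCORE = df$SCORE)
head(df_latest)
```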
The dimensionally reduced dataset is as follows.
## PC1 PC2 SCORE
## 1 -0.42153665 -0.333451152 62
## 2 1.56236878 0.193549856 58
## 3 0.20443843 -0.967259600 52
## 4 2.09446682 1.551067844 55
## 5 -1.87624414 0.670308752 75
## 6 -1.52882220 0.887421614 82
## 7 2.02156364 -0.990217256 38
## 8 1.04142492 0.142296206 55
## 9 1.35557708 -1.543558063 48
## 10 -0.36652937 1.617387488 68
## 11 0.46144684 0.149773461 62
## 12 0.03495089 0.671850540 62
## 13 -2.63957870 -0.283119859 72
## 14 -0.64678767 -1.742842222 58
## 15 0.31060395 -0.005827844 65
## 16 1.48253857 -0.164922772 42
## 17 0.12902070 0.696349985 68
## 18 -1.34433886 0.012000860 68
## 19 0.27040713 0.274017497 58
## 20 -2.14497016 -0.834825336 72
Fitting Linear Model:-
Fitting a linear model with the reduced dimensions ( PC1 & PC2 ).
The summary of the fitted Linear Regression Model :-
##
## Call:
## lm(formula = SCORE ~ ., data = df_latest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4931 -2.6588 -0.2073 3.3768 6.4494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.0000 0.9711 62.814 < 2e-16 ***
## PC1 -6.5109 0.7194 -9.050 6.55e-08 ***
## PC2 5.1797 1.0995 4.711 0.000201 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.343 on 17 degrees of freedom
## Multiple R-squared: 0.8596, Adjusted R-squared: 0.8431
## F-statistic: 52.05 on 2 and 17 DF, p-value: 5.653e-08
We can observe that,
- The residuals' median is almost zero (which is good).
- The residuals' minimum and maximum are close in magnitude with opposite signs, so the data points are distributed roughly evenly on both sides of the fitted line.
- The intercept is 61 and is statistically significant.
- The PC1 coefficient is -6.5109 and is statistically significant.
- The PC2 coefficient is 5.1797 and is statistically significant.
- 85.96 % of the variance in SCORE is explained by the two components (which is good).
- The overall p-value is 5.653e-08, so the linear model is statistically significant.
The root mean square error (RMSE) of this model is 4.0040426.
CONCLUSION:-
- The RMSE of the linear model without dimensionality reduction is 3.9959115.
- The RMSE of the linear model with dimensionally reduced data is 4.0040426.
As the RMSE values without and with PCA are almost the same, I conclude that dimensionality reduction did not improve this linear model. It might still help for other models.
————————————————————- THANK YOU ————————————————————-