The goal of this study was to find the most critical factor(s) for predicting a numeric response.
There are 21 predictors and only 8 observations. The main challenge is dealing with the redundant predictors.
After EDA, we first reduced the number of predictors from 21 to 6 by comparing their correlations, then applied the best subset method to reduce the final set to only 3.
The final model uses 3 predictors out of 21 and explains over 99% of the variation in speed (the response).
There were many details to consider in this project, from data cleaning, exploration, and transformation to the selection of statistical methods. Here I present the main ideas for better understanding.
This study set out to find the most critical factor(s) for predicting the swimming speed of a fish species.[^1]
In this data set, many physical characteristics related to swimming speed were collected, such as height, weight, and fin length. There are 21 physical characteristics, so we have 21 predictors, all numeric. The predictors are labeled "A" to "U" here to keep the focus on the statistical concepts.
The response is the average swimming speed: each fish's speed was measured 3 times, and the mean speed was used as the response.
There are 8 fish in total.
The number of observations is very limited compared to the number of predictors.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 8 | 175.9375000 | 6.0500148 | 176.7500000 | 175.9375000 | 6.6717000 | 166.0000000 | 184.0000000 | 18.0000000 | -0.2464374 | -1.4381568 | 2.1390032 |
| B | 2 | 8 | 78.7125000 | 8.6448065 | 79.7500000 | 78.7125000 | 11.2677600 | 67.0000000 | 91.3000000 | 24.3000000 | -0.0069473 | -1.6861814 | 3.0564006 |
| C | 3 | 8 | 47.7750000 | 2.7122737 | 48.1000000 | 47.7750000 | 2.4462900 | 43.0000000 | 51.4000000 | 8.4000000 | -0.3644488 | -1.2609265 | 0.9589336 |
| D | 4 | 8 | 39.5750000 | 2.2814469 | 39.9500000 | 39.5750000 | 2.9652000 | 35.6000000 | 42.0000000 | 6.4000000 | -0.4110892 | -1.4291907 | 0.8066133 |
| E | 5 | 8 | 37.3500000 | 2.1705825 | 36.6500000 | 37.3500000 | 0.8895600 | 34.3000000 | 40.8000000 | 6.5000000 | 0.4332842 | -1.2721971 | 0.7674168 |
| F | 6 | 8 | 26.5125000 | 1.1519394 | 26.8000000 | 26.5125000 | 1.4826000 | 25.0000000 | 28.2000000 | 3.2000000 | -0.0593151 | -1.7290641 | 0.4072721 |
| G | 7 | 8 | 19.3125000 | 1.0629306 | 19.3000000 | 19.3125000 | 1.2602100 | 17.8000000 | 20.7000000 | 2.9000000 | -0.0064111 | -1.6797491 | 0.3758027 |
| H | 8 | 8 | 7.2000000 | 0.3422614 | 7.2500000 | 7.2000000 | 0.2965200 | 6.6000000 | 7.6000000 | 1.0000000 | -0.5611885 | -1.2637939 | 0.1210077 |
| I | 9 | 8 | 6.3375000 | 0.7190023 | 6.3000000 | 6.3375000 | 0.8895600 | 5.4000000 | 7.4000000 | 2.0000000 | 0.0778416 | -1.6384434 | 0.2542057 |
| J | 10 | 8 | 4.8875000 | 0.4223658 | 4.9500000 | 4.8875000 | 0.5930400 | 4.3000000 | 5.5000000 | 1.2000000 | -0.0135311 | -1.6797654 | 0.1493289 |
| K | 11 | 8 | 5.8083217 | 0.6258158 | 5.7337832 | 5.8083217 | 0.6028295 | 4.9540441 | 6.7707317 | 1.8166876 | 0.2165198 | -1.4860478 | 0.2212593 |
| L | 12 | 8 | 4.2403995 | 0.3564790 | 4.3552977 | 4.2403995 | 0.3626059 | 3.7000000 | 4.6626298 | 0.9626298 | -0.3542296 | -1.7007501 | 0.1260344 |
| M | 13 | 8 | 5.2169277 | 0.5198262 | 5.1954695 | 5.2169277 | 0.5876878 | 4.5767996 | 5.9407198 | 1.3639202 | 0.2246487 | -1.5230771 | 0.1837863 |
| N | 14 | 8 | 3.9671252 | 0.3999422 | 4.1242348 | 3.9671252 | 0.2502143 | 3.2489181 | 4.3395483 | 1.0906302 | -0.7668571 | -1.2176764 | 0.1414009 |
| O | 15 | 8 | 4.5920265 | 0.4093516 | 4.6063455 | 4.5920265 | 0.5061098 | 4.0198751 | 5.1401340 | 1.1202590 | 0.0178798 | -1.5764287 | 0.1447277 |
| P | 16 | 8 | 7.0562500 | 0.2731267 | 7.0250000 | 7.0562500 | 0.2223900 | 6.5500000 | 7.4500000 | 0.9000000 | -0.3399914 | -0.8819384 | 0.0965649 |
| Q | 17 | 8 | 6.7187500 | 0.3999442 | 6.7500000 | 6.7187500 | 0.3706500 | 6.0000000 | 7.2500000 | 1.2500000 | -0.4038021 | -1.1882326 | 0.1414016 |
| R | 18 | 8 | 0.3485170 | 0.0236993 | 0.3510002 | 0.3485170 | 0.0226263 | 0.3141361 | 0.3846154 | 0.0704793 | -0.0649592 | -1.4459260 | 0.0083790 |
| S | 19 | 8 | 0.3661487 | 0.0211938 | 0.3633688 | 0.3661487 | 0.0252586 | 0.3398058 | 0.3956044 | 0.0557986 | 0.1819943 | -1.6968674 | 0.0074931 |
| T | 20 | 8 | 0.9515874 | 0.0251768 | 0.9577129 | 0.9515874 | 0.0224593 | 0.9142857 | 0.9787234 | 0.0644377 | -0.4559407 | -1.6051204 | 0.0089013 |
| U | 21 | 8 | 0.3375000 | 0.1663688 | 0.3000000 | 0.3375000 | 0.1482600 | 0.1500000 | 0.6000000 | 0.4500000 | 0.4351730 | -1.5711414 | 0.0588202 |
| speed | 22 | 8 | 3.0882506 | 0.1416354 | 3.0877764 | 3.0882506 | 0.1312616 | 2.8246197 | 3.2766599 | 0.4520401 | -0.4380793 | -1.0167055 | 0.0500757 |
Some notes on the less common values in the table:
"trimmed" is the mean after dropping the top and bottom trim fraction of values;
"mad" is the median absolute deviation from the median;
"skew" is a measure of symmetry;
"kurtosis" indicates whether the tails of the distribution contain extreme values.
We could see that some of the variables are roughly symmetric, given their small absolute skew values, such as predictors B, F, G, I, O, R, S, and T. On the other hand, variable "N" is skewed to the left (skew ≈ −0.77). (Further EDA showed that a log transformation does not help reduce the skewness.)
The kurtosis values show that most of the variables are not strongly influenced by extreme values.
We could do further EDA; for example, looking at the correlations.
The correlation chart shows the correlation between the variables by color and circle size: the darker the color (or the bigger the circle), the stronger the correlation. Blue indicates a positive correlation, and red a negative correlation.
The result shows that many of the predictors are severely correlated. For example, variables O and R have a strong positive correlation, and variables T and U have a strong negative correlation.
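A chart like this can be drawn with the corrplot package. A minimal sketch, assuming the 21 predictors occupy the first 21 columns of fish_data:

```r
# Correlation chart: circle size and colour encode the strength of each
# pairwise correlation; blue = positive, red = negative
library(corrplot)
corr_matrix <- cor(fish_data[, 1:21])  # assumes predictors are columns 1-21
corrplot(corr_matrix, method = "circle")
```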
To build a good model, we first need to avoid these severe correlations.
We keep only the predictors whose pairwise absolute correlations are lower than 0.7. Only variables A, H, N, P, R, and U were selected for further analysis.
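One way to implement this filter is findCorrelation() from the caret package, which flags columns involved in pairwise correlations above a cutoff. A sketch; the original selection may have been done differently:

```r
# Drop predictors whose absolute pairwise correlation exceeds 0.7
library(caret)
high_corr <- findCorrelation(corr_matrix, cutoff = 0.7)
data_refined <- fish_data[, -high_corr]  # intended to keep A, H, N, P, R, U (and speed)
```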
We could see the correlations, scatter plots, and distributions of the selected variables in the figure above.
All the pairs have absolute correlations below 0.7.
From the scatter plots, we could see there are no extremely high or low values among the selected variables.
The histograms show the distributions. Some of them are not perfectly normal; we could try transformations if necessary.
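A combined view like this can be produced with pairs.panels() from the psych package; a sketch:

```r
# Scatter plots (lower panel), histograms (diagonal), and pairwise
# correlations (upper panel) for the six selected predictors
library(psych)
pairs.panels(data_refined[, c("A", "H", "N", "P", "R", "U")])
```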
Now we have 6 predictors that are not heavily correlated. However, they might still contain some redundant information for prediction.
There are many useful dimension reduction techniques, such as PCA, PCR, ridge regression, lasso regression, and best subset selection. Most of them perform well with a large number of observations.
Here we choose the best subset method because it identifies the exact variables that help build the predictive model.
The best subset method compares all possible combinations of predictors and gives the best-fitting model of each size. First, we need to decide how many predictors the model should use; this choice can be quite subjective, so it is up to us rather than the computer. Then the computer finds out which predictors those are.
As mentioned above, choosing the number of predictors is quite subjective. There are several standards we could rely on when making the decision. The four most popular are adjusted R-squared, Cp, AIC, and BIC.
In the figure above, the best number of predictors is marked with a red point. For example, according to the adjusted R-squared standard, the best number of predictors is 4, while according to the Cp standard, it is 3.
Looking at all the information, we could see that the adjusted R-squared and BIC do not change much when reducing the predictors from 4 to 3, even though 4 gives the best values (highest adjusted R-squared, lowest BIC). Thus, we choose to put 3 predictors into our model.
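The regsubsets() function from the leaps package is a standard implementation of best subset selection; a minimal sketch, assuming data_refined contains the six selected predictors plus speed:

```r
# Best subset selection over the six remaining predictors
library(leaps)
best_fit <- regsubsets(speed ~ ., data = data_refined)
fit_summary <- summary(best_fit)

# Best model size under each standard
# (per the figure above: 4 by adjusted R-squared and BIC, 3 by Cp)
which.max(fit_summary$adjr2)
which.min(fit_summary$cp)
which.min(fit_summary$bic)
```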
Then we need to find out which 3 predictors to use, and this is the result:
```
## (Intercept)           P           R           U
##   0.1864546   0.4012233   0.4127862  -0.2168852
```
Variables P, R, and U are our final choice.
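The coefficients above can be read directly from the best subset fit (assuming the best_fit object from the sketch above):

```r
# Coefficients of the best three-predictor model
coef(best_fit, 3)
```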
Then we built the multiple linear model, with P, R, and U as predictors and speed as the response.
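In R, the model is fit with lm(); the call is echoed in the Call section of the output below:

```r
# Multiple linear regression with the three selected predictors
model <- lm(speed ~ P + R + U, data = data_refined)
summary(model)
```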
```
## 
## Call:
## lm(formula = speed ~ P + R + U, data = data_refined)
## 
## Residuals:
##          1          2          3          4          5          6          7 
##  8.512e-05  1.150e-04 -2.316e-04 -1.607e-04  2.240e-04 -1.060e-04 -2.074e-05 
##          8 
##  9.490e-05 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1864546  0.0027895   66.84 3.00e-07 ***
## P            0.4012233  0.0003982 1007.48 5.82e-12 ***
## R            0.4127862  0.0042195   97.83 6.55e-08 ***
## U           -0.2168852  0.0006313 -343.56 4.31e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002066 on 4 degrees of freedom
## Multiple R-squared:  1, Adjusted R-squared:  1
## F-statistic: 1.097e+06 on 3 and 4 DF,  p-value: 2.772e-12
```
We could see that the adjusted R-squared is 1: the three predictors explain almost 100% of the variation. (Such a high value does not happen often.)
All three predictors are significant.
```
##        P        R        U 
## 1.940157 1.639804 1.808846
```
We then look at the VIF values of the model. They are all below 2, which is a good result. High VIF values would indicate collinearity problems among the predictors, in which case we would need to revisit the model.
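These values can be computed with vif() from the car package (assuming model is the fitted lm object from above):

```r
# Variance inflation factors for the three predictors
library(car)
vif(model)
```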
Then we compute the 95% confidence intervals.
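In R, confint() returns 95% intervals by default:

```r
# 95% confidence intervals for the model coefficients
confint(model)
```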
```
##                  2.5 %     97.5 %
## (Intercept)  0.1787098  0.1941994
## P            0.4001176  0.4023290
## R            0.4010711  0.4245013
## U           -0.2186379 -0.2151325
```
From the residual diagnostics, we see that the residuals fit well overall and are approximately normally distributed. Observations 2 and 3 are possible outliers, but considering the limited number of observations, it is hard to tell whether they are true outliers. We decided to keep them.
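The residual diagnostics referred to here are the standard plots for an lm object; a sketch:

```r
# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs
# leverage, in one 2x2 panel
par(mfrow = c(2, 2))
plot(model)
```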
We chose 3 out of the 21 predictors to predict the response: P, R, and U.
All three predictors are significant in the model. Every 1-unit increase in predictor P contributes an increase of about 0.40 in the response; every 1-unit increase in predictor R contributes an increase of about 0.41; and every 1-unit increase in predictor U contributes a decrease of about 0.22.
The final model explains over 99% of the variance.
[^1]: We have obtained written permission to use the data publicly. The variable names were simplified for a better understanding of the statistical concepts.