Abstract
Logistic regression analysis of the Stat2Data package dataset Pulse: resting pulse is converted to a binary variable in which a resting pulse greater than 68 is treated as 'false' (a high resting pulse), and the subject's weight - converted to kilogrammes and centred so that it encodes an offset from the sample mean - is used as a predictor alongside the binary attribute recording whether the subject is a smoker. These two variables, and their interaction, are considered in fitting a logistic regression model with the positive outcome being that of a low resting pulse, followed by testing of the predictions made by the fitted model. The Stat2Data::Pulse dataset is sourced and inspection indicates 0 rows with missing values, thus no post-processing is required to filter out NA values nor, undesirably, to introduce estimates for missing values. A 3D scatter plot presents the two predictors - the centred, kilogramme-converted weight and the smoking attribute - against the low resting pulse indicator, where a low resting pulse is the positive outcome value and the smoking attribute is the positive predictor value.
The source data is then randomly partitioned, using the outcome Pulse, into two datasets - the train and test sets - with a row count ratio of 4:1.
We fit a binomial generalised linear model to the training data, using the formula Pulse ~ Smoke * Weight, in order to consider the effects of smoking and weight upon the outcome, as well as their interaction.
The model encoding is thus:
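A minimal sketch of that encoding is given below; the Stat2Data::Pulse column names (Rest, Wgt, Smoke), the pounds-to-kilogramme conversion, and the use of caret::createDataPartition for the 4:1 split are assumptions rather than the exact code behind the reported results.

```r
library(Stat2Data)
library(caret)

data(Pulse)

pulse <- within(Pulse, {
  Weight <- Wgt * 0.453592                 # assumed: Wgt in pounds, converted to kilogrammes
  Weight <- Weight - mean(Weight)          # centred: an offset from the sample mean weight
  Smoke  <- factor(Smoke, levels = c(0, 1),
                   labels = c("Non-smoker", "Smoker"))   # assumed coding: 1 = smoker
  Pulse  <- factor(ifelse(Rest > 68, "High", "Low"))     # 'Low' (<= 68) is the positive outcome
})

set.seed(1)                                # arbitrary seed for reproducibility
idx   <- createDataPartition(pulse$Pulse, p = 0.8, list = FALSE)   # 4:1 split on the outcome
train <- pulse[idx, ]
test  <- pulse[-idx, ]

# Binomial GLM of the low-resting-pulse indicator on smoking, centred weight and their interaction
model <- glm(Pulse ~ Smoke * Weight, data = train, family = binomial)
summary(model)
```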
Points to bear in mind:
The saturated model has zero degrees of freedom since all parameters are used to classify the outcome - the number of parameters is essentially the number of ‘categories’.
In a saturated model \(y=f(x)\) but, for a fitted model, \(y=f(x)+\varepsilon\), where \(\varepsilon\) are the residuals; eliminating them, as in the saturated model, leads to overfitting and generally poor prediction accuracy, resulting in inflation of predictor variance in out-of-sample data.
It ‘doesn’t matter’ how to formulaically characterise the saturated model - it is sufficient, even if only theoretically, that such a model exists as a residual-free target for modelling the in-sample data.
The fitted model should optimally have an outcome determination efficacy ‘close’ to that of the saturated model for a limited number of predictors and retain sufficient ‘noise’ so as to limit predictor variance inflation in out-of-sample data.
In general, there are two different approaches to assessing goodness-of-fit in logistic regression models:
Residual analysis wherein one investigates the model on the level of individuals and looks for those observations which are not adequately described by the model or which are highly influential on the model fit, or:
A statistical test in which a metric seeks to combine the information on the amount of lack-of-fit in a single number; that metric is then judged to determine if a lack-of-fit is significant or due to random chance. These tests do not evaluate specific alternatives, rather test unspecific hypotheses of the form ‘the model fits’ versus the alternative ‘the model does not fit’.
The ‘goodness of fit’ premise of ‘the model fits’ should therefore be read as ‘the model fits the in-sample data’ and, for the deviance test, has a very specific interpretation:
The deviance test statistic is evaluated as \(D(y)=-2\ell(\beta;y)+2\ell(\theta;y)\) where \(\beta\) represents the fitted model, \(\theta\) the saturated model and \(\ell\) is the log-likelihood function.
Comparing the residual deviance against the appropriate chi-squared distribution constitutes testing the fitted model against the saturated model.
Using log-likelihoods relies upon asymptotic behaviour, which is inapplicable when the degrees of freedom increase at the same rate as the sample size; this would invalidate the chi-squared approximation underlying the test.
The log-likelihood for the saturated model is, more often than not, zero.
The deviance test statistic, not the residuals, is assumed to be distributed via a \(\chi^2_{Categories-Parameters}\) distribution.
For small sample sizes the log-likelihood estimation is dubious, so one would use the Hosmer-Lemeshow test.
The deviance test should not generally be considered to be sufficient in providing evidence of goodness of fit.
The one-sided \(\chi^2_{C-P}\) test of the probability of occurrence of such a metric, at some significance level, will determine one of two cases:
Should the probability of occurrence of such a metric be less than the significance level, then such an outcome is deemed unlikely - that is, the fitted model does not exhibit the goodness of fit exhibited by the saturated model.
Conversely, should the probability equal or exceed the significance level, the fitted model is deemed to exhibit a fit that is statistically similar to that of the saturated model.
Rejecting the null hypothesis implies that the fitted model lacks sufficient parameters to yield a good fit to the in-sample data - there are sources of unexplained residual variation.
The model reports a residual deviance of 253.514 with 182 degrees of freedom for the fitted model and a null deviance of 256.473 with 185 degrees of freedom. The null deviance is derived from the null model, in which only the intercept is used to fit the data. A saturated model is one in which there is a parameter for every distinct observation, yielding a 'perfect' fit to the in-sample data.
The residual/null deviances are evaluated as twice the difference between the log-likelihood of outcome for the saturated model and that of the fitted/null model. McFadden’s \(pseudoR^2\) evaluates the ratio of the log-likelihood of the fitted model to that of the null model and subtracts it from 1; it therefore has a range of [0,1). With regard to the different evaluation algorithms, the ratio of reported residual to null deviance differs slightly from the ratio used by McFadden. For our fitted model, McFadden’s \(pseudoR^2\) has a value of 0.012.
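A minimal sketch of the McFadden evaluation, assuming the fitted glm object is named model as elsewhere in the text:

```r
# McFadden's pseudo-R^2: 1 minus the ratio of fitted to null model log-likelihoods
null_model <- glm(Pulse ~ 1, data = train, family = binomial)
1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))   # ~0.012
```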
A \(pseudoR^2\) value in logistic modelling is interpreted much as \(R^2\) is in linear modelling; in this case, such a low value relative to unity implies that the model's predictors explain very little of the variation observed in the outcome.
Goodness of fit tests in logistic regression have an implicit premise that the test statistic employed is approximately \(\chi^2_{C-P}\) distributed, where the degrees of freedom is the difference between the number of 'categories', \(C\), and 'parameters', \(P\) - the row count of the dataset is not used, since duplicate values can lead to poor approximations to the corresponding \(\chi^2_{df}\) distribution:
Categories: In our training data we have, for the row count, nrow(dplyr::select(train,-Pulse)) = 186. However, there may be duplication of per-row variable values, so rather than a row count, we determine the unique count of variable permutations in the dataset via nrow(unique(dplyr::select(train,-Pulse))) = 76. There are therefore 110 duplicate rows and \(C=76\). The category is essentially what would be used in a contingency table where duplicates would not appear but rather be summed in cell frequency counts.
Parameters: Our model uses the formula Pulse ~ Smoke * Weight which, upon parsing by R, is expanded into the distinct terms (Smoke, Weight, Smoke:Weight); additionally, our generalised linear model fit includes an intercept coefficient term so, in modelling fit, \(P=4\).
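As a check of these counts, a minimal sketch (again assuming the glm object model):

```r
# Category and parameter counts used for the chi-squared degrees of freedom (C - P = 72)
C <- nrow(unique(dplyr::select(train, -Pulse)))   # 76 distinct predictor combinations
P <- length(coef(model))                          # 4: intercept, Smoke, Weight, Smoke:Weight
C - P
```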
Therefore, for our goodness of fit test statistics, we use the comparison of a test statistic to the \(\chi^2_{72}\) theoretical distribution.
We have the implicit premise that any test statistic employed can be considered to be drawn from a \(\chi^2_{df}\) distribution. A secondary implicit premise is that we will use a one-sided test in establishing fit to the theoretical distribution; this implies that the test statistic employed will have a non-negative value.
The primary premise of the null hypothesis for a goodness of fit test is that the fitted model is statistically equivalent, at a pre-defined level of significance, to the saturated model in terms of outcome detection - H0: The fitted model fits the in-sample data as well as does the saturated model.
The primary premise does not imply any subjective judgement of 'better' or 'worse'; it is simply a statement of equivalence in predictive capability. Rejection of the null hypothesis therefore only implies that the model's predictive capability differs from that of the saturated model as a predictor of outcome - it is a test of goodness of fit, not of subjective, or objective, quality.
The R reported degrees of freedom for the residual deviance of 182 does not bear any resemblance to the category/parameter count used in selecting a theoretical \(\chi^2_{df}\) distribution against which a hypothesis test regarding residuals may be undertaken. The reported degrees of freedom is that used in establishing a prediction estimate for each input row (186) less the number of parameters (4). With that prediction one can then evaluate the residual as a function of the difference between the estimated and actual fit. The vector of residuals is yielded via residuals(model,type="deviance") and the non-negative deviance is evaluated via sum(residuals(model,type="deviance")^2) = 253.514. This is the test statistic that we will apply against the \(\chi^2_{72}\) theoretical distribution - it is the quantile, \(q\), with which we wish to establish a probability of occurrence of such a value in a Probability Density Function (PDF).
To undertake the test, we establish the area under the theoretical distribution PDF yielded by using the test statistic quantile \(q\) as the cut-off - that is, \(Pr(\chi^2_{72}|_{q=253.514})\). This is evaluated as a very small value - of the order of \(10^{-22}\), so, clearly, if \(\alpha = 0.05\), then \(Pr(\chi^2_{72}|_{q=253.514}) < \alpha\).
The test is therefore significant and we may reject the null hypothesis, concluding with a 95% level of confidence that the fitted model's outcome prediction capability is significantly different from that proffered by the saturated model.
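A sketch of this test in R, assuming the fitted glm object model:

```r
# Deviance goodness-of-fit test: upper-tail chi-squared probability on C - P = 72 df
D <- sum(residuals(model, type = "deviance")^2)    # 253.514
pchisq(D, df = 72, lower.tail = FALSE)             # ~1e-22, < alpha = 0.05
```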
The null hypothesis and degrees of freedom employed are as for the deviance goodness of fit test. For the Pearson test, the vector of residuals is yielded via residuals(model,type="pearson") and the non-negative Pearson statistic is evaluated via sum(residuals(model,type="pearson")^2) = 186.471. This is the test statistic that we will apply against the \(\chi^2_{72}\) theoretical distribution.
To undertake the test, we establish the area under the theoretical distribution PDF yielded by using the test statistic quantile as the cut-off - that is, \(Pr(\chi^2_{72}|_{q=186.471})\). This is evaluated as a very small value - of the order of \(10^{-12}\), so, clearly, if \(\alpha = 0.05\), then \(Pr(\chi^2_{72}|_{q=186.471}) < \alpha\).
The test is therefore significant and we may reject the null hypothesis, concluding with a 95% level of confidence that the fitted model's outcome prediction capability is significantly different from that proffered by the saturated model.
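The corresponding sketch for the Pearson statistic:

```r
# Pearson goodness-of-fit test against the same chi-squared distribution
X2 <- sum(residuals(model, type = "pearson")^2)    # 186.471
pchisq(X2, df = 72, lower.tail = FALSE)            # ~1e-12, < alpha = 0.05
```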
For a Hosmer-Lemeshow test, one has to ensure that the input vector is numeric, containing only ones and zeroes. Consequently, we use purrr::map(train, ~ ifelse(.=="Low",1,0))$Pulse to convert our factor data for test usage. The test, using decile grouping, reports a \(\chi^2\) test statistic value of 7.094 on 8 degrees of freedom and a p-value of 0.5265.
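The text does not name the package providing the test; a minimal sketch, assuming ResourceSelection::hoslem.test and the glm object model, might be:

```r
library(ResourceSelection)

# Hosmer-Lemeshow test with decile (g = 10) grouping of the fitted probabilities
obs <- ifelse(train$Pulse == "Low", 1, 0)   # numeric 0/1 outcome, 1 = low resting pulse
hoslem.test(obs, fitted(model), g = 10)     # reported: chi-sq = 7.094, df = 8, p = 0.5265
```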
This would appear to be a non-significant test; however, the null hypothesis employed by Hosmer-Lemeshow is not that used by the deviance and Pearson goodness of fit tests. The implicit premises are the same, but the primary premise is actually the converse of that used in our previous tests - that is, 'the fitted model differs in outcome prediction from the saturated model'. Clearly, failing to reject this null hypothesis has a different interpretation, and one that is in line with the conclusions of the deviance and Pearson goodness of fit tests.
The consensus opinion, therefore, regarding goodness of fit of our fitted model, is that it performs outcome evaluation in a manner that is not consistent with the saturated model. In this case the inference is that the fitted model predictions are significantly different from those proffered by the saturated model. Recall, also, that the \(pseudoR^2\) value of 0.012 implies that the fitted model is not adept at explaining the variance observed in the outcomes. The goodness of fit tests and \(pseudoR^2\) value imply that we need additional parameters in order to explain the unexplained variance of the fitted model.
Our fitted model essentially emulates the linear (in coefficients) relationship:
\[r_{j} = \alpha + \beta_{s}s_{j} + \beta_{w}w_{j} + \beta_{sw}s_{j}w_{j}\] where \(r_{j}\) represents the log-odds of the positive resting pulse outcome for subject \(j\) (a low resting pulse, under our cut-off threshold of 68), \(s_{j}\) represents the binary smoking condition (\(s_{j}=1\) represents a smoker), and \(w_{j}\) represents the offset from the sample mean weight, in kilogrammes, of the test subjects. Additionally, the intercept coefficient is labelled \(\alpha\), the smoking variable coefficient \(\beta_{s}\), the weight variable coefficient \(\beta_{w}\), and the smoking-weight interaction coefficient \(\beta_{sw}\).
Fitted Model Coefficient Data and Confidence Interval Bounds

| | Estimate | SE | z-Value | Pr(>\|z\|) | CI @ 2.5% | CI @ 97.5% |
|---|---|---|---|---|---|---|
| (Intercept) | 0.2260 | 0.1600 | 1.4121 | 0.1579 | -0.0862 | 0.5424 |
| Smoker | -0.4225 | 0.4680 | -0.9027 | 0.3667 | -1.3737 | 0.4897 |
| Weight | 0.2412 | 0.1713 | 1.4079 | 0.1592 | -0.0892 | 0.5863 |
| Smoker:Weight | 0.0005 | 0.3975 | 0.0012 | 0.9990 | -0.7722 | 0.8203 |
Rather than use the reported standard errors (SE) to evaluate confidence intervals for the coefficients - as \(\pm 1.96 \cdot SE\) - we use the R stats::confint function which, for logistic regression, employs profiling of the likelihood and yields 'better' estimates of the confidence intervals. The function output is tabulated in the two right-most columns of the table - the (two-sided) 2.5% lower and 97.5% upper bounds of the 95% confidence interval. The balance of the columns are the data from the fitted model coefficients.
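A sketch of both interval constructions, assuming the fitted glm object model as elsewhere in the text:

```r
# Wald-style intervals from the reported standard errors (not used for the table)
est <- coef(summary(model))[, "Estimate"]
se  <- coef(summary(model))[, "Std. Error"]
cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)

# Profile-likelihood intervals, as tabulated in the two right-most columns
confint(model, level = 0.95)
```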
As is always the case in logistic modelling output, the data, including the upper and lower bounds of the confidence interval, refer to log-odds values. One may 'convert' these to the more easily interpreted odds by taking their exponential, or even to probabilities by noting that the odds of an event with probability \(p\) are defined as the ratio \(p/(1-p)\); however, interpretation of anything other than a specific case of an estimated value quickly becomes convoluted.
The p-values reported are based upon a null hypothesis whose primary premise is that \(H_{0,j}:\beta_{j} = 0 \quad \forall j\). The proposition of the null hypothesis is that the \(j^{th}\) coefficient does not add any significant value to model prediction outcome estimation - that is, its presence is essentially superfluous. Implicit premises are that the tabulated \(z\)-score arises from a standardised normal distribution, \(\mathcal{N}(0,1)\) and, in order to perform a one-sided test, we use the absolute value of the tabulated \(z\)-score - \(|z|\).
The \(z\)-value is evaluated as the associated coefficient divided by its standard error (SE) - tabulated to the left of the \(z\)-value and still reported as log-odds values. The tabulated \(z\)-value is employed as the test statistic in using the specified null hypothesis; the test is undertaken by evaluating the probability of the quantile \(z\)-value for the \(\mathcal{N}(0,1)\) PDF. For a significance level \(\alpha=0.05\), we thus determine if \(Pr(\mathcal{N}(0,1)|_{q=|z|}) < 0.05\). The value of \(Pr(\mathcal{N}(0,1)|_{q=|z|})\) is tabulated in the column entitled Pr(>|z|).
As to the tabulated p-values above, clearly all exceed \(\alpha=0.05\) and are thus deemed not significant. Consequently, for each coefficient we fail to reject the null hypothesis at the 95% confidence level. Thus, all coefficients may be deemed to add no value to an estimate of a predicted outcome - they may as well not exist! This is a powerful argument; it contradicts the goodness of fit tests, which determined, on balance, that the model's estimation ability is significantly different from that of the saturated model, but it does tend to corroborate the failure of the model to adequately explain the observed variance in outcomes, as indicated by the \(pseudoR^2\) evaluation.
The coefficients and their associated data reported by the fitted model may be introduced to our linear model - as follows:
\[r_{j} = 0.2260 -0.4225 \cdot s_{j} +0.2412 \cdot w_{j} +0.0005 \cdot s_{j}w_{j}\] Using this relationship, we can enumerate a number of conditions regarding the subsequent estimate evaluation of \(r_{j}\) - bearing in mind, these values represent a log-odds relationship at this stage:
\[\begin{equation} \begin{cases} r_{j}=0.2260 \quad \forall \ s_{j}=0,w_{j}=0 \quad \text{- non-smokers of average weight} \\ r_{j}=-0.1965 \quad \forall \ s_{j}=1,w_{j}=0 \quad \text{ - smokers of average weight}\\ r_{j}=0.2260 +0.2412 \cdot w_{j} \quad \forall \ s_{j}=0 \quad \text{- non-average weight, non-smokers}\\ r_{j}=-0.1965 +0.2417 \cdot w_{j} \quad \forall \ s_{j}=1 \quad \text{- non-average weight smokers}\\ \end{cases} \end{equation}\]
We can make additional inferences for distinct subjects based upon a unit change (1 kg) in their weight, rather than simply using the 'non-average weight' condition, in order to address questions such as the estimated change in response for a unit decrease in a distinct subject's weight or, similarly, the change in response for a distinct subject should they become a non-smoker; these are expressed below as changes, \(\Delta r_{j}\), in the estimated log-odds:
\[\begin{equation} \begin{cases} \Delta r_{j}=0.4225 -0.0005 \cdot w_{j} \quad \forall \ w_{j} \quad \text{- a subject becoming a non-smoker}\\ \Delta r_{j}=-0.2412 \quad \forall \ s_{j}=0 \quad \text{- a non-smoker subject decreasing their weight by 1 kg}\\ \Delta r_{j}=-0.2417 \quad \forall \ s_{j}=1 \quad \text{- a smoking subject decreasing their weight by 1 kg}\\ \end{cases} \end{equation}\]
For the enumerated relationships where the outcome is constant - that is, without reference to a predictor variable - we may meaningfully convert the log-odds estimate to a probability, giving a direct probability estimate of the outcome:
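For illustration, the constant log-odds values above convert to probabilities via the inverse-logit (plogis in R):

```r
# p = exp(x) / (1 + exp(x)); plogis() is R's inverse-logit
plogis(0.2260)    # non-smoker of average weight: ~0.556
plogis(-0.1965)   # smoker of average weight:     ~0.451
```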
Note that these probabilities may not seem far off from those obtained by tossing a coin, in fact, they all lie within -5% to 7% of random outcome selection. However, confidence intervals range from 29% below, and 20% above those of random outcome selection. That variance affirms a degree of model validity in terms of being relatively competent at distinguishing an outcome.
For this purpose we use our, as yet ‘untouched’, test dataset partition of the source data. The fitted model is run against the data and the predicted outcomes are compared with the actual outcomes encoded in the test data:
```r
library(caret)

# Train via caret against the training partition; note that the method argument is not
# specified, so `model = "glm"` and `family = "binomial"` are passed through `...` and
# caret falls back to its default method (a random forest), consistent with the mtry
# tuning reported below
trainModel <- train(Pulse ~ Smoke * Weight, data = train,
                    model = "glm", family = "binomial")

# Predict outcomes for the held-out test partition
prediction <- predict(trainModel, newdata = test)

# Cross-tabulate predicted versus observed outcomes, with 'Low' as the positive class
cm <- confusionMatrix(prediction, test$Pulse,
                      dnn = c("Predicted Resting Pulse", "Observed Resting Pulse"),
                      positive = "Low")
```
Confusion Matrix of Resting Pulse - Observed versus Predicted outcomes

| Predicted | Observed High | Observed Low |
|---|---|---|
| High | 9 | 4 |
| Low | 12 | 21 |
The caret training process, as invoked, employs random forest methodology and selects the optimal model by bootstrapping the samples - in this case the optimal model is that for which mtry=2, the number of variables considered when splitting a tree node (note that three are available, as we are also considering an interaction term). The optimal model reports an in-process (bootstrap resampling) accuracy of 54.35%, whereas that for the out-of-process test data (via the confusion matrix output) is 65.22%, with a confidence interval of (49.75%, 78.65%). This is an unusual result, as one would ordinarily expect the out-of-process error (1 - Accuracy) to exceed the in-process error; it is yet another indication of the potentially dubious value of the model. Having undertaken the prediction against the test data, we then use a confusion matrix to quantitatively summarise the efficacy of prediction:
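The statistics tabulated below can be drawn directly from the caret::confusionMatrix object; a minimal sketch:

```r
cm$table     # the 2 x 2 cross-tabulation of predicted versus observed outcomes
cm$overall   # Accuracy, Kappa, AccuracyLower/Upper, AccuracyNull (NIR), AccuracyPValue, McnemarPValue
cm$byClass   # Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, ...
```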
Confusion Matrix Analysis of Accuracy and other parameters

| | Value |
|---|---|
| Accuracy (ACC) | 0.6522 |
| Kappa | 0.2770 |
| ACC 2.5% CI Lower Bound | 0.4975 |
| ACC 97.5% CI Upper Bound | 0.7865 |
| No Information Rate (NIR) | 0.5435 |
| P(ACC>NIR) | 0.0906 |
| McNemar P-Value | 0.0801 |
| Sensitivity | 0.8400 |
| Specificity | 0.4286 |
| +ve Predictive Value | 0.6364 |
| -ve Predictive Value | 0.6923 |
One aspect of the confusion matrix upon which one may immediately make a judgement is the number of Type I and Type II Errors exhibited by the predictive modelling comparison against observed values. Type I Errors are False Positives (in our model, 'positive' relates to identification of a low resting pulse) - outcomes predicted as positive where the actual observation was recorded as negative; these occupy the lower-left quadrant of our matrix and number 12 (26.09% of the test set). Type II Errors are False Negatives - the model predicts a negative outcome whereas the observed outcome is positive; these occupy the upper-right quadrant and number 4 (8.70% of the test set). Whilst misclassification rates of this order may be deemed tolerable in exploratory analyses, such a high False Positive rate - reflected in the specificity of only 0.4286 - may be deemed unacceptable, especially in conducting confirmatory as opposed to exploratory statistical analyses.
The No Information Rate (NIR) denotes the accuracy achievable by always predicting the most frequently observed outcome - equivalent to the null model in which only the intercept coefficient is employed. The confusion matrix reports a p-value of 0.0906 for the one-sided test of whether the model's accuracy exceeds the NIR; at the 5% significance level this is not significant, so we cannot conclude that the fitted model is any more accurate than the null model. This is contrary to our goodness of fit findings against the training data - based upon an acceptance of how the prediction outcomes were obtained and thus reported via the confusion matrix.
Cohen’s Kappa coefficient is a statistic which measures the degree of agreement between two categorical classifications. It is generally thought to be a more robust measure than a simple percent agreement calculation, as it takes into account the possibility of the agreement occurring by chance. Whilst there is no conventional agreement upon range characterisation, the somewhat arbitrary ranking of Landis and Koch proffers that values < 0 indicate no agreement; 0–0.20 slight; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 substantial; and 0.81–1 almost perfect agreement. For our fitted model test predictions, the value of Kappa (0.2770) may be interpreted as indicating 'fair' agreement between the observed and predicted outcomes.
The McNemar test is a \(\chi^2\) test for the symmetry of rows and columns in a two-dimensional contingency table - that is, a null hypothesis that the marginal frequencies of rows and columns are equal. R reports the p-value of this test as 0.0801, which is not significant, so we cannot reject the null hypothesis of symmetry.
The sensitivity, specificity and associated measures derived from the allocation of outcomes are used in choosing the most accurate representation of the outcome predictions; generally, an algorithm cycles through a threshold in determining when an evaluated outcome probability estimate should be mapped to a binary outcome for the response. At each iteration, the sum of sensitivity and specificity is evaluated and, following the cycle, the value of the threshold which maximises that sum may be selected as the algorithmic 'best fit' for the model. Ultimately, the consequence of that choice, and an overview of the overall accuracy of the model, is proffered by a Receiver Operating Characteristic (ROC) curve. In the following figure, the plot on the left identifies the threshold at which the sum of specificity and sensitivity is maximised and, on the right, this point is plotted on the ROC curve. Nevertheless, even if one were to concurrently trace multiple sensitivity/specificity values, it is visually difficult to determine what may, overall, represent a good choice of threshold in terms of model outcome assessment, especially in light of the other test statistics yielded. In order to quantify the selection of sensitivity and specificity via the ROC curve, one simply evaluates the area under the ROC curve (AUC or AUROC); the trace with the greatest AUC is generally deemed 'the best' - clearly, if that area is unity, then the fit is 'perfect'. For reference, the plot shows the line \(y=x\), which is the expected trace for random selection; a model that yields an AUC of 0.5 (or less) does no better than random selection of an outcome.
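The text does not state which ROC implementation was used; a minimal sketch, assuming the pROC package, might look as follows:

```r
library(pROC)

# Predicted probabilities of the positive ('Low') outcome on the test partition
probs <- predict(trainModel, newdata = test, type = "prob")[, "Low"]

roc_obj <- roc(response = test$Pulse, predictor = probs,
               levels = c("High", "Low"))          # controls, cases

coords(roc_obj, "best", best.method = "youden")    # threshold maximising sensitivity + specificity
auc(roc_obj)                                       # area under the ROC curve

plot(roc_obj, legacy.axes = TRUE)                  # ROC trace with 1 - specificity on the x axis
abline(0, 1, lty = 2)                              # the y = x 'random selection' reference line
```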
You will note that the tabulated confusion matrix output does not correspond to the maximal sensitivity and specificity identified by the ROC curve analysis; when producing a ROC curve, within computational limits, every threshold is visited to produce the maximal ordinate value so, in theory, given the model's predicted outcome probabilities against which to apply a threshold, and the observed outcomes, the ROC analysis will yield the optimal solution. The confusion matrix methodology simply reports upon what it 'sees' in the predicted versus observed values - that is, prior to this point, the probability threshold determining each prediction outcome has already been applied, enabling the confusion matrix to compare like with like as opposed to a probability of prediction with an observed outcome. That thresholding occurs within the predict function which, for a binary classifier, converts each predicted probability into a 'true' or 'false' outcome using a fixed cut-off (0.5 by default), rather than the threshold selected by the ROC analysis.
The logistic regression of Pulse ~ Smoke * Weight is shown to be a valid model that produces outcome predictions significantly different from those of the saturated model and from random outcome selection, with prediction bounds of (-29%, 20%) variance from random outcome selection. However, its efficacy is dubious in terms of explaining residual variance, in its propensity to misclassify outcomes (notably false positives), and in the lack of significance of any of the model's predictor coefficients in influencing prediction estimation.
Therefore, whilst we may conclude that the model is valid, it is not necessarily efficacious in its application - it has limited predictive power with a wide scope of inherent variability in performing predictions. Further exploratory analysis of the source data is indicated, with a view to constructing an alternative model which may show better predictive quality, or to re-formulating the requirement of the model - for example, whether one should attempt to determine resting pulse as a binary outcome at all, and/or whether other covariates should be included in the study.
Hypothesis testing requires a clear, unambiguous statement of premises that, jointly, are sufficient to form the basis of a deductive argument that may lead to a sound conclusion based upon inference. The ‘null’ hypothesis of statistical deductive inference testing is a statement of such an argument although it appears, in its symbolic representation, for example \(H_{0}:\mu=0\), to have only one premise. However, inference testing requires a framework within which to couch such a premise about a test statistic and, necessarily, that framework must also be a premise of the hypothesis. For example, in testing the mean of a sample, one may defer to a comparison against the standardised normal distribution \(\mathcal{N}(0,1)\) (or, similarly, a \(\chi^2_{df}\) distribution for logistic testing). Consequently, an implicit premise of the null hypothesis is that the statistic undergoing testing conforms to the requirements of the framework being employed. Therefore, any null hypothesis is always, necessarily, comprised of a minimum of two premises and all premises must be deemed valid in order to form the basis of a sound argument.
In statistical inference testing, the implied premise that it is appropriate to use the test statistic against the employed framework is often untested, or simply assumed to be valid - yet it is a fundamental requirement for overall hypothesis validity. Additionally, when all premises are deemed valid, the hypothesis only forms the basis of a sound argument - it cannot present itself as a 'fact'. Consequently one cannot accept the output as a fact, but can only state that the hypothesis cannot be rejected, expressing a level of confidence in making such a statement, since it is not an absolute certainty. Conversely, demonstrating the invalidity of any premise can only lead to the conclusion that the hypothesis is rejected as being invalid in at least one of its premises. The rejection of a hypothesis, except in the instance of a highly simplistic, single binary premise describing a closed system, can never imply the acceptance of an 'alternative' hypothesis.
The p-value is the tool of frequent choice in statistical deductive hypothesis testing. The p-value quantifies a probability estimate of an event using a framework employed as an implicit premise of the hypothesis - which premise states that the test statistic arises from a distribution that can be theoretically modelled as a Probability Density Function (PDF) - be that \(\mathcal{N}(0,1)\) or \(\chi^2_{df}\). For validity testing, comparison of the test statistic ‘subject’ requires an ‘object’ against which to compare; the object of comparison is usually taken as the probability of occurrence of an event with a pre-defined level of significance, denoted by \(\alpha\) and usually taken to be 0.05 (5%) - the converse of the level of confidence of \(1-\alpha = 0.95\) (95%).
The PDF is commonly an integrable continuous function and integration over a range yields the probability of event occurrence for any given independent variable - the area between the PDF trace and the \(x\) axis. This leads to a slight dilemma about which 'tail' of a PDF, if two exist, to choose, under which to measure said area as a probability. For a confidence interval one uses two tails to obtain a lower and an upper bound - for a 95% confidence level this corresponds to a lower bound at the 2.5% level and an upper bound at the 97.5% level; the area beyond these left and right cut-offs of the domain corresponds to 5% of the total area under the PDF. When such a two-tailed test is required at a significance level of \(\alpha\), we generally denote the bounds as \((\alpha/2,\ 1-\alpha/2)\).
The expedient of a one-sided test may be used, regardless of an underlying two-tailed PDF, by testing the absolute value of the test statistic, in which case only the upper (right-hand) area under the PDF curve beyond the \(\alpha\) probability cut-off is used. Note that many test statistics are explicitly designed to yield non-negative values, so 'taking the absolute value' for a one-tailed test is superfluous - but the implication in the required use of a one-tailed test still exists. To further simplify, a linear mapping of the independent variable is often undertaken to centre the variable distribution around zero and scale its value in multiples of unit variance. We then compute the probability for the test statistic quantile - for example, for the standard normal distribution and the modulus of the test statistic \(Z\)-score, \(Pr(\mathcal{N}(0,1)|_{q=|Z|})=\int_{|Z|}^{\infty} \mathcal{N}(0,1)\, dx\) - and thence make the comparison \(Pr(PDF|_{q=|Z|}) < \alpha\). This condition is labelled 'significant' at the \(\alpha\) significance level if true, and implies that the probability of such a \(Z\)-score occurring in the system is less than the significance level probability, \(\alpha\) - stated with a \((1-\alpha)\times 100\%\) confidence level. This significance may be used to provide strong evidence that an appropriately couched null hypothesis is invalid in at least one of its premises, which is sufficient for rejection of the hypothesis.
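As a small numerical illustration of these comparisons in R (the particular quantiles are taken from earlier sections):

```r
# Upper-tail probability of |z| under N(0,1), e.g. for the Weight coefficient z-value
pnorm(1.4079, lower.tail = FALSE)

# The two-tailed 5% cut-off for N(0,1): approximately 1.96
qnorm(1 - 0.05 / 2)

# The chi-squared analogue used for the deviance goodness-of-fit test
pchisq(253.514, df = 72, lower.tail = FALSE)
```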