1 Regression analyses

1.1 Linear regression

1.2 Logistic regression

1.2.1 A few general guidelines for statistical models

  • The propensity score is merely the predicted value from a multivariable model where the response variable is the exposure or the treatment actually used. The estimated propensity score is then used in a second step as an adjustment variable in the model for the response of interest (a minimal sketch of this two-step procedure appears after this list).

  • Dichotomizing the outcome at a different point may require a many-fold increase in sample size to make up for the lost information.

  • The model must use the data efficiently. If, for example, one were interested in predicting the probability that a patient with a specific set of characteristics would live five years from diagnosis, an inefficient model would be a binary logistic model. A more efficient method, and one that would also allow for losses to follow-up before five years, would be a semiparametric (rank-based) or parametric survival model. Such a model uses individual times of events in estimating coefficients, but it can easily be used to estimate the probability of surviving five years. As another example, if one were interested in predicting patients’ quality of life on a scale of excellent, very good, good, fair, and poor, a polytomous (multinomial) categorical response model would not be efficient as it would not make use of the ordering of responses.

  • Choose a model that fits overall structures likely to be present in the data. In modeling survival time in chronic disease, one might feel that the importance of most of the risk factors is constant over time. In that case, a proportional hazards model such as the Cox or Weibull model would be a good initial choice. If, on the other hand, one were studying acutely ill patients whose risk factors wane in importance as the patients survive longer, a model such as the log-normal (see log-normal) or log-logistic regression model would be more appropriate.

  • Choose a model that is robust to problems in the data that are difficult to check. For example, the Cox proportional hazards model and ordinal logistic models are not affected by monotonic transformations of the response variable.

  • Choose a model whose mathematical form is appropriate for the response being modeled. This often has to do with minimizing the need for interaction terms that are included only to address a basic lack of fit. For example, many researchers have used ordinary linear regression models for binary responses, because of their simplicity. But such models allow predicted probabilities to be outside the interval [0,1], and strange interactions among the predictor variables are needed to make predictions remain in the legal range.

  • Choose a model that is readily extendible. The Cox model, by its use of stratification, easily allows a few of the predictors, especially if they are categorical, to violate the assumption of equal regression coefficients over time (proportional hazards assumption) (See proportional hazard assumption). The continuation ratio ordinal logistic model can also be generalized easily to allow for varying coefficients of some of the predictors as one proceeds across categories of the response. See Figure 1.1

  • Ye developed a general method for estimating the “generalized degrees of freedom” (GDF) for any “data mining” or model selection procedure based on least squares. The GDF is an extremely useful index of the amount of “data dredging” or overfitting that has been done in a modeling process. It is also useful for estimating the residual variance with less bias. In one example, Ye developed a regression tree using recursive partitioning involving 10 candidate predictor variables on 100 observations. The resulting tree had 19 nodes and GDF of 76. The usual way of estimating the residual variance involves dividing the pooled within-node sum of squares by 100 − 19, but Ye showed that dividing by 100 − 76 instead yielded a much less biased (and much higher) estimate of \(σ^2\). In another example, Ye considered stepwise variable selection using 20 candidate predictors and 22 observations. When there is no true association between any of the predictors and the response, Ye found that GDF = 14.1 for a strategy that selected the best five-variable model.

  • Because it equals the concordance probability in the binary \(Y\) case, the AUC is still often useful as a predictive discrimination measure.
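A minimal sketch of the two-step propensity procedure from the first bullet, in Python with statsmodels (all data are simulated and every variable name is hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age = rng.normal(60, 10, n)
severity = rng.normal(0, 1, n)

# Step 1: model the treatment actually received as a function of covariates;
# the fitted probabilities are the estimated propensity scores.
treated = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (age - 60) + 0.8 * severity))))
X_ps = sm.add_constant(np.column_stack([age, severity]))
ps = sm.Logit(treated, X_ps).fit(disp=0).predict()

# Step 2: use the estimated propensity score as an adjustment variable
# in the model for the response of interest.
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 * treated + 0.6 * severity))))
outcome_fit = sm.Logit(y, sm.add_constant(np.column_stack([treated, ps]))).fit(disp=0)
print(outcome_fit.params)  # treatment effect, adjusted for the propensity score
```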

1.2.2 Interaction terms


Figure 1.1: Parameters in a simple model with interaction

\(C(Y|X_1+1, X_2) − C(Y|X_1, X_2) = β_0 + β_1(X_1+1) + β_2X_2 + β_3(X_1+1)X_2 − [β_0 + β_1X_1 + β_2X_2 + β_3X_1X_2] = β_1 + β_3X_2\)

  • In short, with an interaction term we use the formula \(β_1 + β_3X_2\) to obtain the effect of \(X_1\) while taking \(X_2\) into account; likewise, we use \(β_2 + β_3X_1\) to obtain the effect of \(X_2\) while taking \(X_1\) into account.
\(C(Y|\text{age},\text{sex}) = β_0 + β_1\,\text{age} + β_2[\text{sex}=f] + β_3\,\text{age}\,[\text{sex}=f]\)
  • \(β_3\) is the difference in slopes (female − male)

There are many useful hypotheses that can be tested for this model. First let’s consider two hypotheses that are seldom appropriate although they are routinely tested:
1. \(H_0 : β_1 = 0\): This tests whether age is associated with Y for males.
2. \(H_0 : β_2 = 0\): This tests whether sex is associated with Y for zero-year-olds.

  • Now consider more useful hypotheses. For each hypothesis we should write what is being tested, translate this to tests in terms of parameters, write the alternative hypothesis, and describe what the test has maximum power to detect. The latter component of a hypothesis test needs to be emphasized, as almost every statistical test is focused on one specific pattern to detect. For example, a test of association against an alternative hypothesis that a slope is nonzero will have maximum power when the true association is linear. If the true regression model is exponential in X, a linear regression test will have some power to detect “non-flatness” but it will not be as powerful as the test from a well-specified exponential regression effect. If the true effect is U-shaped, a test of association based on a linear model will have almost no power to detect association. If one tests for association against a quadratic (parabolic) alternative, the test will have some power to detect a logarithmic shape but it will have very little power to detect a cyclical trend having multiple “humps.” In a quadratic regression model, a test of linearity against a quadratic alternative hypothesis will have reasonable power to detect a quadratic nonlinear effect but very limited power to detect a multiphase cyclical trend. Therefore in the tests in Table 2.2 keep in mind that power is maximal when linearity of the age relationship holds for both sexes. In fact it may be useful to write alternative hypotheses as, for example, “Ha : age is associated with C(Y ), powered to detect a linear relationship.”

  • Note that if there is an interaction effect, we know that there is both an age and a sex effect. However, there can also be age or sex effects when the lines are parallel. That’s why the tests of total association have 2 d.f.
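A minimal sketch of fitting this age × sex model in Python with statsmodels and recovering the sex-specific age slopes (simulated data; every name is hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"age": rng.uniform(20, 80, n),
                   "sex": rng.choice(["m", "f"], n)})
df["sexf"] = (df["sex"] == "f").astype(int)   # the [sex = f] indicator
# simulate a response whose age slope differs by sex
df["y"] = 1 + 0.05 * df.age + 0.5 * df.sexf + 0.03 * df.age * df.sexf \
          + rng.normal(0, 1, n)

fit = smf.ols("y ~ age * sexf", data=df).fit()  # expands to age + sexf + age:sexf
b = fit.params
slope_male = b["age"]                    # β1: age slope when sexf = 0
slope_female = b["age"] + b["age:sexf"]  # β1 + β3; β3 is the slope difference
print(f"male slope {slope_male:.3f}, female slope {slope_female:.3f}")
```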

1.2.3 Problems caused by dichotomization

  1. Estimated values will have reduced precision, and associated tests will have reduced power.
  2. If cutpoints are determined in a way that is not blinded to the response variable, calculation of P-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. For example, if cutpoints are chosen by trial and error in a way that utilizes the response, even informally, ordinary P-values will be too small and confidence intervals will not have the claimed coverage probabilities. The correct Monte-Carlo simulations must take into account both multiplicities and uncertainty in the choice of cutpoints. For example, if a cutpoint is chosen that minimizes the P-value and the resulting P-value is 0.05, the true type I error can easily be above 0.5 (see the simulation sketch at the end of this subsection).
  3. Likewise, categorization that is not blinded to the response variable results in biased effect estimates.
  4. “Optimal” cutpoints do not replicate over studies. Hollander et al.: “. . . the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the ‘optimal’ cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints . . . Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology.” Giannoni et al. demonstrated that many claimed “optimal cutpoints” are just the observed median values in the sample, which happen to optimize statistical power for detecting a separation in outcomes and have nothing to do with true outcome thresholds. Disagreements in cutpoints (which are bound to happen whenever one searches for things that do not exist) cause severe interpretation problems. One study may provide an odds ratio for comparing body mass index (BMI) > 30 with BMI ≤ 30, another for comparing BMI > 28 with BMI ≤ 28. Neither of these odds ratios has a good definition and the two estimates are not comparable.
  5. When cutpoints are chosen using Y, categorization represents one of those few times in statistics where both type I and type II errors are elevated.
  • A scientific quantity is a quantity which can be defined outside of the specifics of the current experiment. The kind of high:low estimates that result from categorizing a continuous variable are not scientific quantities; their interpretation depends on the entire sample distribution of continuous measurements within the chosen intervals.

  • To summarize problems with categorization it is useful to examine its effective assumptions. Suppose one assumes there is a single cutpoint c for predictor X. Assumptions implicit in seeking or using this cutpoint include (1) the relationship between X and the response Y is discontinuous at X = c and only X = c; (2) c is correctly found as the cutpoint; (3) X vs. Y is flat to the left of c; (4) X vs. Y is flat to the right of c; (5) the “optimal” cutpoint does not depend on the values of other predictors. Failure to have these assumptions satisfied will result in great error in estimating c (because it doesn’t exist), low predictive accuracy, serious lack of model fit, residual confounding, and overestimation of effects of remaining variables
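A Monte Carlo sketch of point 2 above, in Python (all data simulated, names hypothetical): when the cutpoint for a truly irrelevant predictor is chosen to minimize the P-value, a nominal 0.05 test rejects far more often than 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def min_p_over_cutpoints(x, y):
    # smallest two-sample t-test P-value over candidate cutpoints of x
    cuts = np.quantile(x, np.linspace(0.1, 0.9, 17))
    return min(stats.ttest_ind(y[x <= c], y[x > c]).pvalue for c in cuts)

n_sim, n = 1000, 100
rejections = 0
for _ in range(n_sim):
    x = rng.normal(size=n)
    y = rng.normal(size=n)          # no true association with x
    if min_p_over_cutpoints(x, y) < 0.05:
        rejections += 1

# With an honest, prespecified test this would be near 0.05;
# cutpoint-hunting inflates it severalfold.
print(f"empirical type I error: {rejections / n_sim:.2f}")
```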

1.2.4 Simple Nonlinear Terms

The simplest way to describe a nonlinear effect of \(X_1\) is to include a term for \(X_2 = X_1^{2}\) in the model:

\(C(Y|X_1) = β_0 + β_1X_1 + β_2X_1^{2}\)

  • If the model is truly linear in \(X_1\), \(β_2\) will be zero. This model formulation allows one to test \(H_0\): model is linear in \(X_1\) against \(H_a\): model is quadratic (parabolic) in \(X_1\) by testing \(H_0 : β_2 = 0\).
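A minimal sketch of this test in Python (simulated, truly quadratic data; names hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 200)})
df["y"] = 1 + 0.5 * df.x + 0.8 * df.x**2 + rng.normal(0, 1, 200)

fit = smf.ols("y ~ x + I(x**2)", data=df).fit()
# P-value for H0: β2 = 0, i.e., the test of linearity against a quadratic alternative
print(fit.pvalues["I(x ** 2)"])
```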

Nonlinear effects will frequently not be of a parabolic nature. If a transformation of the predictor is known to induce linearity, that transformation (e.g., log(X)) may be substituted for the predictor.

  • Polynomials do not adequately fit logarithmic functions or “threshold” effects.

1.2.5 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

  • Spline functions are piecewise polynomials used in curve fitting. That is, they are polynomials within intervals of \(X\) that are connected across different intervals of \(X\).


1.2.5.1 Cubic spline function

\(f(X) = β_0 + β_1X + β_2X^2 + β_3X^3 + β_4(X − a)^3_+ + β_5(X − b)^3_+ + β_6(X − c)^3_+ = Xβ\), where \(a\), \(b\), and \(c\) are the knots and \((u)_+ = u\) if \(u > 0\) and \(0\) otherwise.

If the cubic spline function has k knots, the function will require estimating k + 3 regression coefficients besides the intercept.
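A minimal numpy sketch of this truncated power basis (the knot locations are arbitrary illustrations):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    # Truncated power basis for a cubic spline: X, X^2, X^3,
    # plus (X - knot)^3_+ for each knot, giving k + 3 columns.
    cols = [x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]  # (u)_+ = max(u, 0)
    return np.column_stack(cols)

x = np.linspace(0, 10, 101)
B = cubic_spline_basis(x, knots=[2.5, 5.0, 7.5])
print(B.shape)  # (101, 6): k + 3 = 6 columns for k = 3 knots
```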

1.2.5.2 Restricted Cubic Splines

  • Also called natural splines.
  • They force the function to be straighter in the tails, i.e., linear before the first knot and after the last knot.
  • Compared with the ordinary cubic spline, only k − 1 coefficients are estimated here (it was k + 3 for the cubic spline).
1.2.5.2.1 How many knots, and where do we place them?

Placing knots at fixed quantiles (percentiles) of a predictor’s marginal distribution is a good approach in most datasets. See Figure 1.2


Figure 1.2: Default quantiles for knots

  • In general, 3 knots are used for small sample sizes and 7 knots for large ones.
  • The commonly used choices are 3, 4, and 5; in fact, using 4 knots provides an adequate fit in most datasets.
  • When the sample size is large (e.g., n ≥ 100 with a continuous uncensored response variable), k = 5 is a good choice. Small samples (< 30, say) may require the use of k = 3. Akaike’s information criterion can be used for a data-based choice of k.
  • The value of k maximizing the model likelihood ratio \(χ^2 − 2k\) would be the best “for the money” using AIC.
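A sketch of this AIC-style choice of k, using patsy’s cr() natural-spline basis (available inside statsmodels formulas) as a stand-in for a restricted cubic spline, with df = k − 1 per the coefficient count above; the data are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.uniform(0, 10, 300)})
df["y"] = np.sin(df.x) + rng.normal(0, 0.5, 300)  # smooth nonlinear truth

ll_null = smf.ols("y ~ 1", data=df).fit().llf
for k in (3, 4, 5, 6, 7):
    fit = smf.ols(f"y ~ cr(x, df={k - 1})", data=df).fit()
    lr_chi2 = 2 * (fit.llf - ll_null)    # model likelihood ratio chi-square
    print(k, round(lr_chi2 - 2 * k, 1))  # pick the k that maximizes this score
```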

1.2.6 Non-parametric Regression

  1. moving average
  2. moving least squares linear regression smoother (loess)
  3. super smoother

In the non-parametric case, a knot is placed at every data point. A parametric spline yields predictions and produces a prediction model, whereas a non-parametric method does not produce a prediction equation.

The standard non-parametric smoothers work when one is interested in assessing one continuous predictor at a time and when the property of the response that should be linearly related to the predictor is a standard measure of central tendency. For example, when C(Y ) is E(Y ) or Pr[Y = 1], standard smoothers are useful, but when C(Y ) is a measure of variability or a rate (instantaneous risk), or when Y is only incompletely measured for some subjects (e.g., Y is censored for some subjects), simple smoothers will not work.
1. The oldest and simplest nonparametric smoother is the moving average.
2. A moving least squares linear regression smoother (\(\color{red}{\text{loess}}\)) is far superior to a moving flat line smoother (the moving average).

Actually, loess uses weighted least squares estimates, which is why it is called a locally weighted least squares method. The weights are chosen so that points near X = x are given the most weight in the calculation of the slope and intercept. Surprisingly, a good default choice for the interval about x is an interval containing 2/3 of the data points.
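A minimal sketch using the lowess implementation in statsmodels (simulated data):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

# frac=2/3 is the classic default span: each local weighted least squares
# line is fit to the 2/3 of the points nearest the target x.
smoothed = lowess(y, x, frac=2 / 3)  # returns an array of (x, fitted) pairs
print(smoothed[:3])
```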

1.2.7 Recursive Partitioning: Tree-Based Models

Tree models are especially useful in messy situations or settings in which overfitting is not so problematic, such as confounder adjustment using propensity scores or in missing value imputation.
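As an illustration of such a low-stakes use, a minimal scikit-learn sketch of a recursive-partitioning propensity model (simulated data; the depth and all names are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3))                              # covariates
treated = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

# A shallow tree as a quick propensity model; some overfitting is more
# tolerable here than it would be in a final outcome model.
tree = DecisionTreeClassifier(max_depth=3).fit(X, treated)
propensity = tree.predict_proba(X)[:, 1]
print(propensity[:5])
```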

1.2.8 Multiple Degree of Freedom Tests of Association

\(C(Y|X) = β_0 + β_1X + β_2X' + β_3X'' + β_4X'''\), where \(X'\), \(X''\), and \(X'''\) are the constructed nonlinear spline terms.
If all four β coefficients of the restricted cubic spline (rcs) come out non-significant here but the test of linear association is significant, especially if it is borderline, then it is not sensible to drop the non-linear terms from the regression equation. But if we believe the non-linearity is mild, non-linear association can be tested with 1 d.f.; if we use 1 d.f., however, the 1 d.f. test statistic does not follow a \(χ^2\) distribution.
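A sketch of these multiple-degree-of-freedom tests in Python, again using patsy’s cr() natural-spline basis as a stand-in for a restricted cubic spline (simulated data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.uniform(0, 10, 300)})
df["y"] = np.log1p(df.x) + rng.normal(0, 0.3, 300)  # nonlinear truth

full = smf.ols("y ~ cr(x, df=4)", data=df).fit()    # spline model, 4 d.f.
linear = smf.ols("y ~ x", data=df).fit()            # straight line, 1 d.f.

# Multiple-d.f. test of total association: are all spline coefficients zero?
print(full.f_pvalue)
# Test of nonlinearity: does the spline improve on the straight line?
print(full.compare_lr_test(linear))  # (LR statistic, P-value, d.f. difference)
```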

In logistic regression, unlike linear regression, there is no error term.

The \(R^2\) value, mostly in predictive models, shows how much of the dependent variable the model can explain.

1.3 Log-binomial regression


Figure 1.3: Log binomial regression


Figure 1.4: Log binomial regression

1.4 Penalized regression