Datacamp project: What determines hair loss? A Bayesian logistic regression analysis

Author

Chung Him Au-Yeung

Published

June 12, 2024

Background

As we age, hair loss becomes a common health concern. The fullness of one's hair not only affects appearance but is also closely related to overall health.

A survey brings together a variety of factors that may contribute to hair loss, including genetic factors, hormonal changes, medical conditions, medications, nutritional deficiencies, psychological stress, and more. Through data exploration and analysis, the potential associations between these factors and hair loss can be examined, providing a useful reference for individual health management, medical intervention, and related industries.

The data

The data contain information on the people in this survey. Each row represents one person.

  • “Id” - A unique identifier for each person.
  • “Genetics” - Whether the person has a family history of baldness.
  • “Hormonal Changes” - Indicates whether the individual has experienced hormonal changes (Yes/No).
  • “Medical Conditions” - Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
  • “Medications & Treatments” - History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
  • “Nutritional Deficiencies” - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
  • “Stress” - Indicates the stress level of the individual (Low/Moderate/High).
  • “Age” - Represents the age of the individual.
  • “Poor Hair Care Habits” - Indicates whether the individual practices poor hair care habits (Yes/No).
  • “Environmental Factors” - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
  • “Smoking” - Indicates whether the individual smokes (Yes/No).
  • “Weight Loss” - Indicates whether the individual has experienced significant weight loss (Yes/No).
  • “Hair Loss” - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.
suppressPackageStartupMessages({
  library(bayesrules)
  library(rstanarm)
  library(bayesplot)
  library(tidyverse)
  library(tidybayes)
  library(broom.mixed)
  library(readr)
  library(data.table)
  library(janitor)
})
#import the data
data = read.csv(file = "hair_loss_data.csv")
str(data)
'data.frame':   999 obs. of  13 variables:
 $ Id                      : int  133992 148393 155074 118261 111915 139661 169255 112032 140785 187999 ...
 $ Genetics                : chr  "Yes" "No" "No" "Yes" ...
 $ Hormonal.Changes        : chr  "No" "No" "No" "Yes" ...
 $ Medical.Conditions      : chr  "No Data" "Eczema" "Dermatosis" "Ringworm" ...
 $ Medications...Treatments: chr  "No Data" "Antibiotics" "Antifungal Cream" "Antibiotics" ...
 $ Nutritional.Deficiencies: chr  "Magnesium deficiency" "Magnesium deficiency" "Protein deficiency" "Biotin Deficiency " ...
 $ Stress                  : chr  "Moderate" "High" "Moderate" "Moderate" ...
 $ Age                     : int  19 43 26 46 30 37 40 35 19 49 ...
 $ Poor.Hair.Care.Habits   : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Environmental.Factors   : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Smoking                 : chr  "No" "No" "No" "No" ...
 $ Weight.Loss             : chr  "No" "No" "Yes" "No" ...
 $ Hair.Loss               : int  0 0 0 0 1 1 1 0 1 0 ...
#check missing values
missing_values = sapply(data, function(x) sum(is.na(x)))
missing_values
                      Id                 Genetics         Hormonal.Changes 
                       0                        0                        0 
      Medical.Conditions Medications...Treatments Nutritional.Deficiencies 
                       0                        0                        0 
                  Stress                      Age    Poor.Hair.Care.Habits 
                       0                        0                        0 
   Environmental.Factors                  Smoking              Weight.Loss 
                       0                        0                        0 
               Hair.Loss 
                       0 


Data cleaning

Most of these columns are stored as characters or integers, which is not ideal for analysis. Continuous variables are therefore converted to numeric, and categorical/binary variables to factors. There are no NA values in the data set, but some columns contain entries recorded as "No Data"; for simplicity, these rows are dropped via list-wise deletion.

#clean data

data$Genetics = factor(data$Genetics, levels = c("No", "Yes"))
data$Hormonal.Changes = factor(data$Hormonal.Changes, levels = c("No", "Yes"))
data$Medical.Conditions = factor(data$Medical.Conditions)
data$Medications...Treatments = factor(data$Medications...Treatments)
data$Nutritional.Deficiencies = factor(data$Nutritional.Deficiencies)
data$Stress = factor(data$Stress)
data$Age = as.numeric(data$Age)
data$Poor.Hair.Care.Habits = factor(data$Poor.Hair.Care.Habits)
data$Environmental.Factors = factor(data$Environmental.Factors)
data$Smoking = factor(data$Smoking, levels = c("No", "Yes"))
data$Weight.Loss = factor(data$Weight.Loss, levels = c("No", "Yes"))
data$Hair.Loss = factor(ifelse(data$Hair.Loss ==1, "Yes", "No"), levels = c("No", "Yes"))

data = data %>% 
  filter(if_all(everything(), ~ . != "No Data"))

#id duplication
data %>%
  group_by(Id) %>%
  filter(n() > 1)
# A tibble: 4 × 13
# Groups:   Id [2]
      Id Genetics Hormonal.Changes Medical.Conditions Medications...Treatments
   <int> <fct>    <fct>            <fct>              <fct>                   
1 157627 Yes      No               Dermatosis         Rogaine                 
2 110171 Yes      No               Thyroid Problems   Antifungal Cream        
3 110171 No       No               Psoriasis          Immunomodulators        
4 157627 No       No               Thyroid Problems   Accutane                
# ℹ 8 more variables: Nutritional.Deficiencies <fct>, Stress <fct>, Age <dbl>,
#   Poor.Hair.Care.Habits <fct>, Environmental.Factors <fct>, Smoking <fct>,
#   Weight.Loss <fct>, Hair.Loss <fct>
#remove them 
data = data %>% 
  mutate(Id = as.character(Id)) %>% 
  filter(Id != "157627"& Id != "110171")

Specifying priors of the coefficient

Starting with tuning the centered intercept: since I am not an expert in this field, I looked up the average chance of balding in the population. Published figures differ by age and sex, so the average chance of balding covers a wide range, roughly 16% to 50%, given that the maximum age in our data is 50 (the range would be wider if we had older individuals). The corresponding upper and lower bounds on the log-odds scale are:

\[ \text{log-odds}_{\text{upper}} = \ln\left(\frac{0.5}{1-0.5}\right) = \ln(1) \approx 0 \]

\[ \text{log-odds}_{\text{lower}} = \ln\left(\frac{0.16}{1-0.16}\right) = \ln(0.190) \approx −1.660 \]

\[ \text{mean} = \frac{-1.660 + 0.000}{2} \approx -0.830 \]

Thus, the prior mean of the intercept on the log-odds scale is set to -0.83. Pairing this mean with a standard deviation of 0.4 places roughly 95% of the prior mass over the plausible range of about -1.66 to 0 (-0.8 ± 2 × 0.4).
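
As a quick sanity check, the same numbers can be computed in R with qlogis(); the mean and spread below line up with the normal(-0.8, 0.4) prior placed on the intercept in the models that follow.

#translate the assumed 16%-50% balding range into a normal prior on the log-odds scale
p_range = c(0.16, 0.50)
logit_range = qlogis(p_range)          # roughly c(-1.66, 0)
prior_mean = mean(logit_range)         # roughly -0.83
prior_sd = diff(logit_range) / 4       # roughly 0.41, so mean +/- 2 SD spans the range
c(prior_mean, prior_sd)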

However, we have less understanding of the prior distributions of the other coefficients, or of their relationships with hair loss when controlling for other predictors, so weakly informative priors are used for the remaining coefficients. For simplicity, I do not use all of the variables at once but start with a few to build a simple model. Since age, genetics, and hormonal changes are well-known contributors to hair loss, the first model includes these variables.

Simulate the posterior

#simulate the model

  model_1 <- stan_glm(
    Hair.Loss ~ Age + Genetics+Hormonal.Changes,
    data = data, family = binomial,
    prior_intercept = normal(-0.8, 0.4),
    prior = normal(0, 2.5, autoscale = TRUE),
    chains = 4, iter = 5000 * 2, seed = 87453804,
  )
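
Because the coefficient priors are specified with autoscale = TRUE, the scales that rstanarm actually applies depend on the predictors. As a quick check (output not shown here), prior_summary() reports the priors after any adjustment:

#inspect the priors actually used, including autoscaling of the coefficient priors
prior_summary(model_1)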

Check the diagnostic plots for the stability of the simulation.

mcmc_trace(model_1)

mcmc_dens_overlay(model_1)

mcmc_acf(model_1)
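
Alongside the visual checks, numeric diagnostics can confirm convergence. A small sketch using the bayesplot helpers (output not shown here):

#numeric convergence diagnostics
rhat(model_1)        #values close to 1 indicate the chains agree
neff_ratio(model_1)  #effective sample size relative to the total number of draws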

The trace, density, and autocorrelation plots suggest the chains have converged. A posterior summary is shown below. The 80% credible intervals for Age and Hormonal Changes straddle 0 on the log-odds scale (equivalent to including 1 on the odds-ratio scale), while the interval for Genetics excludes 0, making it the only predictor with clear evidence of an association at the 80% credible level.

tidy(model_1, effects = "fixed",
     conf.int = TRUE, conf.level = 0.8)
# A tibble: 4 × 5
  term                estimate std.error conf.low conf.high
  <chr>                  <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)          0.0422    0.327    -0.373    0.468  
2 Age                 -0.00510   0.00852  -0.0162   0.00565
3 GeneticsYes          0.203     0.142     0.0188   0.383  
4 Hormonal.ChangesYes  0.0619    0.143    -0.123    0.247  
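
For interpretation on the odds-ratio scale, the estimates and interval limits can simply be exponentiated; for example, the Genetics estimate of about 0.20 corresponds to an odds ratio of roughly exp(0.2) ≈ 1.2. A small sketch:

#model 1 posterior summary on the odds-ratio scale
tidy(model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.8) %>% 
  mutate(across(c(estimate, conf.low, conf.high), exp))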

In the second model, I add poor hair care habits, smoking, weight loss, and environmental factors.

#simulate the model

  model_2 <- stan_glm(
    Hair.Loss ~ Age + Genetics+Hormonal.Changes+Poor.Hair.Care.Habits + Smoking + Weight.Loss+Environmental.Factors,
    data = data, family = binomial,
    prior_intercept = normal(-0.8, 0.4),
    prior = normal(0, 2.5, autoscale = TRUE),
    chains = 4, iter = 5000 * 2, seed = 87453804,
  )
mcmc_trace(model_2)

mcmc_dens_overlay(model_2)

mcmc_acf(model_2)

Model 2 does not perform much better at identifying meaningful predictors of the outcome: the 80% credible intervals for most of the variables still straddle 0 on the log-odds scale, and the interval for Genetics now just barely includes 0. Only Smoking and Weight Loss have 80% intervals that exclude 0.

tidy(model_2, effects = "fixed",
     conf.int = TRUE, conf.level = 0.8)
# A tibble: 8 × 5
  term                     estimate std.error conf.low conf.high
  <chr>                       <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)               0.183     0.371   -0.283     0.653  
2 Age                      -0.00567   0.00869 -0.0168    0.00552
3 GeneticsYes               0.180     0.145   -0.00713   0.364  
4 Hormonal.ChangesYes       0.0701    0.146   -0.117     0.257  
5 Poor.Hair.Care.HabitsYes -0.171     0.146   -0.355     0.0114 
6 SmokingYes               -0.234     0.141   -0.418    -0.0514 
7 Weight.LossYes            0.207     0.139    0.0239    0.388  
8 Environmental.FactorsYes -0.0155    0.140   -0.198     0.168  

Model 3 includes the remaining variables: medical conditions, medications & treatments, and stress.

#set reference level
data[["Medical.Conditions"]] =relevel(factor(data[["Medical.Conditions"]]), ref = "Eczema")

data[["Medications...Treatments"]] =relevel(factor(data[["Medications...Treatments"]]), ref = "Antibiotics")

data[["Stress"]] =relevel(factor(data[["Stress"]]), ref = "Low")


#simulate the model
  model_3 <- stan_glm(
    Hair.Loss ~ Age + Genetics+Hormonal.Changes+Poor.Hair.Care.Habits + Smoking + Weight.Loss+Environmental.Factors+ Medical.Conditions+Medications...Treatments+Stress,
    data = data, family = binomial,
    prior_intercept = normal(-0.8, 0.4),
    prior = normal(0, 2.5, autoscale = TRUE),
    chains = 4, iter = 5000 * 2, seed = 87453804,
  )

Notice that the Genetics posterior now straddles 0 at the 80% credible level. This suggests that Genetics is not a meaningful predictor of hair loss once medical conditions, medications and treatments, and stress are controlled for; in other words, once those variables are known, genetics does not add much to our understanding of hair loss (a quick posterior probability check of this follows the summary table below). Weight loss, smoking, and poor hair care habits retain 80% intervals that exclude 0, as do some medical conditions, most notably alopecia areata.

tidy(model_3, effects = "fixed",
     conf.int = TRUE, conf.level = 0.8)
# A tibble: 28 × 5
   term                                    estimate std.error conf.low conf.high
   <chr>                                      <dbl>     <dbl>    <dbl>     <dbl>
 1 "(Intercept)"                            0.192     0.495    -0.446    0.824  
 2 "Age"                                   -0.00632   0.00905  -0.0176   0.00528
 3 "GeneticsYes"                            0.153     0.148    -0.0366   0.341  
 4 "Hormonal.ChangesYes"                    0.0531    0.146    -0.138    0.243  
 5 "Poor.Hair.Care.HabitsYes"              -0.202     0.146    -0.390   -0.0132 
 6 "SmokingYes"                            -0.280     0.148    -0.471   -0.0914 
 7 "Weight.LossYes"                         0.224     0.149     0.0329   0.415  
 8 "Environmental.FactorsYes"               0.00636   0.145    -0.179    0.193  
 9 "Medical.ConditionsAlopecia Areata "     0.500     0.338     0.0759   0.934  
10 "Medical.ConditionsAndrogenetic Alopec…  0.409     0.341    -0.0284   0.847  
# ℹ 18 more rows
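
As a quick follow-up on the Genetics claim, the posterior draws can be used directly to estimate the probability that the Genetics coefficient is positive in model 3 (a sketch; GeneticsYes follows rstanarm's naming of the factor contrast):

#posterior probability that the Genetics effect is positive in model 3
model_3_draws = as.data.frame(model_3)
mean(model_3_draws$GeneticsYes > 0)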

As an extra step, model 4 includes only the variables that showed some evidence of an association in the previous models: genetics, weight loss, and medical conditions.

#simulate the model

  model_4 <- stan_glm(
    Hair.Loss ~ Genetics+Weight.Loss+Medical.Conditions,
    data = data, family = binomial,
    prior_intercept = normal(-0.8, 0.4),
    prior = normal(0, 2.5, autoscale = TRUE),
    chains = 4, iter = 5000 * 2, seed = 87453804,
  )
tidy(model_4, effects = "fixed",
     conf.int = TRUE, conf.level = 0.8)
# A tibble: 12 × 5
   term                                    estimate std.error conf.low conf.high
   <chr>                                      <dbl>     <dbl>    <dbl>     <dbl>
 1 "(Intercept)"                           -0.341       0.275  -0.692     0.0124
 2 "GeneticsYes"                            0.219       0.143   0.0316    0.400 
 3 "Weight.LossYes"                         0.224       0.143   0.0422    0.408 
 4 "Medical.ConditionsAlopecia Areata "     0.446       0.325   0.0327    0.871 
 5 "Medical.ConditionsAndrogenetic Alopec…  0.311       0.332  -0.116     0.741 
 6 "Medical.ConditionsDermatitis"          -0.00527     0.333  -0.439     0.423 
 7 "Medical.ConditionsDermatosis"           0.0126      0.340  -0.426     0.451 
 8 "Medical.ConditionsPsoriasis"            0.119       0.330  -0.306     0.544 
 9 "Medical.ConditionsRingworm"            -0.0437      0.360  -0.512     0.422 
10 "Medical.ConditionsScalp Infection"      0.0357      0.348  -0.404     0.483 
11 "Medical.ConditionsSeborrheic Dermatit…  0.353       0.343  -0.0783    0.796 
12 "Medical.ConditionsThyroid Problems"    -0.162       0.332  -0.585     0.265 

How wrong is the model?

Let's use model 1 as an example. From each posterior-simulated data set, we record the proportion of outcomes equal to 1, i.e. the proportion of people with hair loss (the nreps argument is ignored for this type of check, as the warning below indicates). A histogram of these simulated hair-loss rates is consistent with the original data: most of the posterior-simulated data sets show a hair-loss rate of roughly 50%, close to the observed rate (a manual version of this check is sketched after the code below).

proportion_hairloss_1 = function(x){mean(x==1)}

pp_check(model_1, nreps = 100,
         plotfun = "stat", stat="proportion_hairloss_1")
Warning: 'nreps' is ignored for this PPC
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
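
The same check can be reproduced by hand from the posterior predictive draws: each row returned by posterior_predict() is one simulated data set of 0/1 outcomes, so its row mean is one simulated hair-loss rate. A sketch under the same model:

#manual version of the posterior predictive check
yrep = posterior_predict(model_1)          #draws x observations matrix of 0/1 outcomes
sim_props = rowMeans(yrep)                 #one simulated hair loss rate per data set
obs_prop = mean(data$Hair.Loss == "Yes")   #observed hair loss rate
hist(sim_props); abline(v = obs_prop)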

How accurate is the model?

A confusion matrix summarises the overall accuracy of the model, which does not perform very well: using the full (in-sample) data, it correctly classifies only 426 of the 805 cases, an overall classification accuracy of roughly 53% (these rates are recomputed by hand after the summaries below).

hair_loss_pred_1 = posterior_predict(model_1,newdata = data)

hair_loss_classfication =data %>% 
  mutate(hairloss_prob = colMeans(hair_loss_pred_1),
         hairloss_class = as.numeric(hairloss_prob>=0.5)) %>% 
  tabyl(Hair.Loss, hairloss_class) %>% 
  adorn_totals(c("row","col"))
  
hair_loss_classfication
 Hair.Loss   0   1 Total
        No 206 192   398
       Yes 187 220   407
     Total 393 412   805
classification_summary(model = model_1, data=data, cutoff=0.5)$accuracy_rates
                          
sensitivity      0.5307125
specificity      0.5326633
overall_accuracy 0.5316770
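
The same rates can be read directly off the confusion matrix above; small differences from classification_summary() are expected because that function generates its own posterior predictions:

#accuracy rates computed from the confusion matrix above
sensitivity = 220 / 407                #true "Yes" cases classified as 1
specificity = 206 / 398                #true "No" cases classified as 0
overall_accuracy = (206 + 220) / 805
c(sensitivity, specificity, overall_accuracy)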

Cross validation

Let's compare the cross-validated estimates of classification accuracy for all four models. Model 1 has the highest overall accuracy, though only marginally ahead of model 4, and all of the models hover around 50% (the estimates are gathered into a single table after the output below).

set.seed(883548)
cv_accuray_1 = classification_summary_cv(model=model_1,data=data,cutoff = 0.5,k=5)
cv_accuray_2 = classification_summary_cv(model=model_2,data=data,cutoff = 0.5,k=5)
cv_accuray_3 = classification_summary_cv(model=model_3,data=data,cutoff = 0.5,k=5)
cv_accuray_4 = classification_summary_cv(model=model_4,data=data,cutoff = 0.5,k=5)
cv_accuray_1$cv
  sensitivity specificity overall_accuracy
1   0.5080583   0.5142898        0.5043478
cv_accuray_2$cv
  sensitivity specificity overall_accuracy
1   0.4776228   0.5084696        0.4881988
cv_accuray_3$cv
  sensitivity specificity overall_accuracy
1   0.4880108   0.5067766        0.4956522
cv_accuray_4$cv
  sensitivity specificity overall_accuracy
1   0.4642343   0.5473011        0.5031056
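
To make the comparison easier to scan, the four cross-validated estimates can be collected into one table (a small sketch using the objects created above):

#collect the cross-validated accuracy estimates into a single comparison table
bind_rows(cv_accuray_1$cv, cv_accuray_2$cv, cv_accuray_3$cv, cv_accuray_4$cv,
          .id = "model")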

Summary

The logistic regression models built to predict hair loss from variables such as age, medical conditions, and treatments showed poor overall accuracy, close to random guessing, indicating that they are not effectively capturing patterns in the data. This report is primarily a trial-and-error exercise in practicing Bayesian regression, so the low prediction accuracy is not a critical concern in this context. Instead, the results may indicate that logistic regression is not the most suitable model choice for this data set.