Introduction

In this study, the following questions would be answered using the 2017 data from The Behavioral Risk Factor Surveillance System (BRFSS):

1. Which state in the U.S. had the highest probability of finding a person suffering from asthma in 2017? Which state had the lowest possibility?

2. At what age was asthma most commonly diagnosed for the selected respondents? Which medical care(s) did they need and how often did they need it (them)?

3. Could the likelihood of having asthma be predicted using logistic regression model using the associated demographic and health risk factors found in literature? If yes, how accurate were these predictions?

The descriptive analytics goals of this study were to explore the geographic distribution of asthma in the U.S. by state, and to visualize the behaviors of those who reported having asthma. The predictive analytics goal of this study was to test how accurately a binary logistic regression model could predict the likelihood a respondent having a specific set of characteristics would have asthma. This was a classification predictive problem.

Importance of this study

As mentioned in previous study, asthma is a very common chronic respiratory disease where a person’s airways to the lung are inflamed, narrowed, swollen, and often produce extra thickened mucus which make breathing difficult (Mayo Clinic, 2018). Millions of Americans are affected by asthma every year. Asthma may not pose a major problem to some people, but for others, it can gravely interfere with daily activities. An asthma-attack can be life-threatening. Understanding the medical needs of asthma patients, in which states were asthma patients most and least likely to be found, is important for asthma prevention measures, and reliefs of asthma symptoms. Furthermore, having an algorithm to accurately predict who would be more in danger of suffering from asthma, will be extremely beneficial to medical providers to devise prevention methods, and better prepare and allocate resources for diagnosis and necessary treatments.

The Behavioral Risk Factor Surveillance System (BRFSS) is a state-based cross-sectional telephone survey initiated by the Centers for Disease Control and Prevention (CDC). The state health departments have been conducting the survey monthly over land-lines and cellular telephones using a standardized questionnaire and other assistance from the CDC. The survey data sets contain relevant data that are useful for asthma analysis under the scope of this project.

Dataset

The original data set was obtained from the CDC website, under the Behavioral Risk Factor Surveillance System, following this link.

The data set has 450,016 observations for 358 variables including records of respondents’ answers and calculated variables that categorized original data points into groups represented by an integer each.

The variable selection and data cleaning process were similar to previous study thanks to the CDC’s effort to standardize the data input and document the variable descriptions. The cleaned dataset used in this study can be viewed here.

Appendix A demonstrated data “read-in” and the packages to be used. Appendix B listed the variables chosen for analysis with their respective descriptions from the CDC Codebook.

Descriptive Analytics - Continued

Previous study investigated the demographic trends and risk factors associated with asthma, and recommended exploring asthma distribution by state and a closer look at the respondents chosen for the asthma call-back survey. The graphs and analysis below followed the future research suggestions from previous study.

Geographic Distribution of Asthma Patients

Appendix 1 displayed two maps: The green-shaded one was count distribution of respondents reported having asthma, and the blue-shaded one was a probability heat map showing asthma prevalence by state.

The two maps were slightly different. States with the darkest green shades (the largest number of respondents reported having asthma) were Arizona (AZ), Florida (FL), and Kansas (KS), and states with the lightest green shades (the least number of respondents with asthma) were Nevada (NZ) and Alaska (AK). Meanwhile, the blue-shaded map indicated that asthma was the most prevalent in West Virginia (WV) and the least prevalent in Minnesota (MN).

Observations for Sampled Asthma Patients

The graph in Appendix 2 showed that asthma was more commonly diagnosed for children or teenage (20 and below). This is expected since asthma tends to be more prevalent among children, and showed symptoms since childhood.

Among the respondents currently having asthma, 749 of them were asked if they had an asthma attack 12 months before the interview. Less than half of them had to visit the ER or required urgent care for an asthma attack, which agreed to what DF12 showed: only 308 respondents among the 749 had an attack 12 months before the interview. (Appendix 3)

The probability that a respondent who had an attack 12 months before the interview also required urgent care was \(1- 122/308 = 0.6038961\), which was higher than expected. The majority of respondents asked also did not go to more than 5 checkups a year for asthma. (Appendix 4)

For more than half of the respondents who was asked if they had asthma attacks in the past 12 months, asthma symptoms showed twice a week or less. However, only 219 among 441 of those who did not have an attack 12 months before the interview showed no symptoms, which was less than half. All respondents who reported having an attack 12 months before the interview had asthma symptoms at least once. Most respondents did not report using prevention medicines for asthma, or used inhalers to stop an attack. (Appendix 5 and 6)

Predictive Analytics

Logistic Regression - Full Model Fit Statistic

Binary Logistic Regression Model was run on the full data, using all demographic and health risk factors: “AgeGroup”, “Sex”, “Race”, “Income Group”, “EducationLevel”, “CurrentSmoker”, “Exercise”, “BMICategory”, and “DepressiveDisorders” as predictors, predicting the log-odd of the response variable “AsthmaStatus”. All predictors were categorical variables, presented as factors with levels in the logistic regression model.

The response variable “AsthmaStatus” was originally a categorical variable of three levels. To perform binary logistic regression model, it was converted to a binary variable, with 1 representing the “Current” status, or respondents “currently” having asthma at the time of the survey, and 0 representing both “Former” and “Never” statuses, or respondents “not currently” having asthma at the time of the survey.

Logistic regression was performed using the glm() function in package {stat}, with “family=binomial(link=”logit“)”. Appendix 7 displayed the algorithm and the coefficient table of each level of predictors. All predictors were significant in predicting the log-odd of the response variable.

A coefficient plot created in Appendix 8 was the visualization of the significance of the coefficients of all predictors. Appendix 9 transformed the log-odd to odds and probabilities as another way of presenting the same results.

  1. Likelihood of having Asthma Decreases with Increases in Age

All coefficients in the table below were negative. With “Age Group 18 to 24” being the reference level, the negative coefficients suggested the higher the age group a respondent belonged in, the less likely he/she would have reported having asthma. This result agreed to descriptive analytics results.

  1. Likelihood of having Asthma is Higher for Female

As Sex moved from 0 to 1, log-likelihood of having asthma decreased. This means female respondents were more likely to then suffered from asthma compared to male respondents.

  1. Likelihood of having Asthma Differs among Races

Compared to Asian, which was selected as the reference level, all other races had higher asthma prevalence, with the Multiracial group having the highest difference in log-odd.

  1. Likelihood of having Asthma Decreases with Increases in Income

Income Group from 10,000 USD to 14,999 USD set as the reference level, a respondent earning less than that was more likely to have asthma (as shown by the positive coefficient), while for other Income Group, as income increased, the likelihood decreased (illustrated by the increasingly more negative coefficients).

  1. Likelihood of having Asthma Differs among Education Level

From descriptive analytics, the probability of respondents having asthma given they had graduated from college/technical school was the lowest. It was expected that the coefficient for this level would be lowest. However, compared to the set reference level (Education Level - Did not Attend High School), Education Level - Graduated from High School had the most negative coefficient, indicating that respondents who only graduated high school had an even lower log-odd of having asthma than those achieved higher level of education. The log-likelihood of having asthma were similar between those attended and graduated from college/technical school.

  1. Likelihood of having Asthma Decreases with Decreases in BMI

With Normal Weight being the set reference level, all other BMI Category coefficients were positive, with Obese being the most negative - Obese respondents were the most likely to have asthma.

  1. Likelihood of having Asthma Decreases for Non-Smokers

The negative coefficient as CurrentSmoker variable “moved” from 0 to 1 indicated that smokers were more likely to have asthma.

  1. Likelihood of having Asthma Decreases among Physically Active People

Similarly, the negative coefficient for Activeness supported the descriptive analytics results - Respondents who had not engaged in physical activities 30 days prior to the interview were more likely to have asthma.

  1. Likelihood of having Asthma Increases among Depressed People

A respondent with some sort of depressive disorder were much more likely to suffer from asthma, as suggested from the positive coefficient.

Logistic Regression Predictive Accuracy

Data Partitioning and Logistic Regression Model Setup

Since the response variable - Asthma Status was binary, again, logistic regression was performed, using the glm() function, with “family=binomial(link=”logit“).” The data was partitioned into a training set - a sample of 60% of observations, and a test set - the rest of the data. Appendix 10 presented the partitioning set up and the algorithm.

Confusion Matrix and ROC Curve

To evaluated the performance of the logistic regression model in correctly classifying the test subset’s respondents in “Having Asthma” and “Not Having Asthma” groups, a Receiver Operating Characteristics (ROC) Curve was drawn to visualize the trade off between Sensitivity and Specificity of this model. The curve was constructed using the prediction{ROCR} and performance{ROCR} functions (Appendix 11). The area under the ROC curve was close to 0.67, which was higher than 0.5, indicating that the model performed better than chance.

A confusion matrix was constructed to better evaluate how well the logistic regression model developed from the train subset predicted the test subset, by Accuracy Rate, Error Rate, Sensitivity (True Positive), Specificity (True Negative), and False Positive. (Appendix 12)

Since only a small proportion of the dataset gave a positive response (less than 10% of the sampled observations reported “currently having asthma” in the 2017 survey), the classification threshold was set to a low level of 16%. This means if the predicted probability of having asthma was more than 16%, the respondent was classified as having asthma, and vice versa, a predicted probability of 16% or below was classified as not having asthma.

At 16% threshold, the model predicted with 84% accuracy rate, 16% error rate, 29% True-Positive (Sensitivity) Rate, and 90% True-Negative (Specificity) Rate. Thus, 30% of the times, if a respondent had a calculated asthma probability of more than 16%, the model predicted accurately that he/she would have asthma; 90% of the times, if a respondent had a calculated asthma probability of 16% or below, the model predicted accurately that he/she would not have asthma.

Conclusions, Challenges, and Future Research

Asthma was most commonly diagnosed among children and young adults (age 20 or younger). Less than half of asthmatics had not suffer from an attack 12 months prior to the survey, but more than half had symptoms twice a week or less. Among those who had an attack, less than half needed emergency care and ER support. Inhaler usage was also low among all respondents.

All demographic and risk factors predictors were significant in predicting the likelihood of asthma. The interpretation generally did not differ considerably from descriptive analytics results, except for Education Level.

Machine learning logistic regression algorithm with a 60% train subset classify the asthmatics correctly 29% of the times, and the non-asthmatics correctly 90% of the times, using a classification threshold of 16%. The predictive accuracy of 84% and error rate of 16% were acceptable, especially when the area under the ROC curve suggested the model performed better than chance.

A ShinyApp was designed to display selected descriptive and predictive analytics results from both studies.

Challenges and Future Research

  • Data was collected from land-line phone survey. The sampled population using land-line telephones were not representative of the entire U.S. population, urging future researchers to find better survey sampling and data gathering methods.

  • Only very few asthmatic respondents were contacted for the asthma survey, generating a very small dataset with large number of variables the resulted in low degree of freedom, affecting predictive analytics results. If possible, a larger and more complete dataset for asthma call-back survey data would be useful for future analysis.

  • Missing data occurred randomly across data columns and rows were all removed for the glm() function to run properly. These data had the potential to drastically changed the modeling results, prompting future research to look into other predictive models such as trees, multinomial logistic regression, Least Discriminant Analysis (LDA), etc.

  • The majority of the codes in this document may not be reproducible for future dataset. More universal functions could be developed to make analysis more repeatable.

  • Due to time constraint, this study did not look into the change in asthma prevalence, distribution and significant of predictors over time. This could be a promising area for future research, where time-series analysis, regression with lagging variables, etc. could be performed for visualization and better predictive modeling.

Appendix

Appendix A - Data Preparation

##Preparation
library(tidyverse)
library(scales)

##Visual
library(viridis)
library(cowplot)
library(gridExtra)
library(grid)
library(usmap)

##Predictive modeling
library(boot)
library(coefplot)
library(ROCR)
if(!file.exists("Data")){dir.create("Data")}
asthma <- read_csv("./Data/Asthma2017.csv", guess_max = 69000)

#remove missing values for AsthmaStatus only
health2017_narm <- asthma %>%
  filter(!is.na(AsthmaStatus))

#Data for all respondents who reported formerlly or currently having asthma 
asthma2017 <- asthma %>%
  filter(AsthmaStatus == "Current" | AsthmaStatus == "Former")

#Data for all respondents who reported currently having asthma
curr_ast_17 <- asthma2017 %>%
  filter(AsthmaStatus == "Current")

curr_ast_17
## # A tibble: 42,254 x 21
##    AsthmaStatus AgeGroup   Sex Race  IncomeGroup EducationLevel BMICategory
##    <chr>        <chr>    <dbl> <chr> <chr>       <chr>          <chr>      
##  1 Current      65 and ~     0 White $10,000 to~ Did not Atten~ Normal     
##  2 Current      55 to 64     1 Mult~ $10,000 to~ Did not Atten~ Obese      
##  3 Current      65 and ~     0 White $25,000 to~ Did not Atten~ Obese      
##  4 Current      65 and ~     1 Black <NA>        Attended Coll~ Obese      
##  5 Current      55 to 64     1 White less than ~ Did not Atten~ Obese      
##  6 Current      65 and ~     0 <NA>  $10,000 to~ Graduated Hig~ Overweight 
##  7 Current      65 and ~     0 White $25,000 to~ Graduated fro~ Underweight
##  8 Current      55 to 64     0 White $75,000 or~ Graduated fro~ Obese      
##  9 Current      65 and ~     0 White <NA>        Did not Atten~ Overweight 
## 10 Current      45 to 54     0 White $10,000 to~ Graduated Hig~ Obese      
## # ... with 42,244 more rows, and 14 more variables: CurrentSmoker <dbl>,
## #   Exercise <dbl>, DepressiveDisorder <dbl>, State <dbl>,
## #   AsthmaAge <chr>, Last12MonthsAttack <dbl>, ERVisit <chr>,
## #   UrgentCare <chr>, RoutineCheckUp <chr>, LimitActivities <dbl>,
## #   AsthmaSymptoms <chr>, AsthmaSleepDiffi <chr>, DaysWithMeds <chr>,
## #   InhalerForAttack <chr>
#Data for all respondents who reported formerlly having asthma
form_ast_17 <- asthma2017 %>%
  filter(AsthmaStatus == "Former")

Appendix B - Variable Description

Below is a summary of the variables, similar to that from previous study.

  • AsthmaStatus: If a respondent currently, formerly or never had asthma

  • Race: Respondents grouped by nine categories of race-ethnicity, with Hispanics grouped as a separate category

  • AgeGroup: Reported age in five-year age categories calculated variable

  • Sex: (binary) Female vs. Male

  • BMICategory: Respondents’ weight status by dividing reported weights by the square of their respective heights

  • EducationLevel: Respondents’ highest education level completed, sorted into four groups

  • IncomeGroup: Respondents’ reported income sorted into nine groups

  • CurrentSmoker: (binary) Whether respondents smoked a cigarette within 30 days before the interview

  • Exercise: (binary) Whether respondents engaged in physical activities within 30 days before the interview

  • DepressiveDisorder: (binary) Whether respondents had been suffering from, or was diagnosed with one or more depressive disorders

  • State: State FIPS code of where the respondents came from

  • Last12MonthsAttack: (binary) Whether respondents had an asthma attack or an asthma episode during the 12 month period before the interview.

  • AsthmaSymptoms: Number of times respondents experienced any symptom of asthma during the 30 day period before the interview

  • AsthmaSleepDiffi: Number of days respondents had difficulties sleeping because asthma during the 30 day period before the interview

  • DaysWithMeds: Number of days respondents took a prescription of asthma medication to prevent an asthma attack from occurring during the 30 day period before the interview

  • InhalerForAttack: Number of times respondents used a prescription of asthma inhaler during an asthma attack to stop it during the 30 day period before the interview

  • AsthmaAge: Age at first asthma diagnosis

  • ERVisit: Number of times respondents visited the emergency room or an urgent care center because of asthma during the 12 month period before the interview

  • UrgentCare: Number of times respondents specifically required urgent care for asthma during the 12 month period before the interview

  • RoutineCheckUp: Number of routine check-ups for asthma during the 12 month period before the interview

  • LimitActivities: Number of times respondents were unable to work or carry out activities because of asthma during the 12 month period before the interview

Appendix 1 - Asthma Geographical Distribution

CurrAsthmaState <- curr_ast_17 %>%
  group_by(State) %>%
  summarise(CurrAsthmaCount = n())

SampleState <- health2017_narm %>%
  group_by(State) %>%
  summarize(SampleCount = n())

ByState <- as_tibble(cbind(CurrAsthmaState, "TotalSampled" = SampleState$SampleCount))

StateName = as.list(cdlTools::fips(ByState$State, to = "Name"))

StateName = replace(StateName, StateName == "Deleware", "Delaware")

ByState <- ByState %>%
  mutate(PercentByState = ByState$CurrAsthmaCount/ByState$TotalSampled*100,
         StateName = cdlTools::fips(State, to = "Name"),
         StateWithPercent = paste(cdlTools::fips(State, to = "Name"), as.character(PercentByState), "%"))
  
ByState$fips = ByState$State

plot_usmap (data = ByState, values = "CurrAsthmaCount", labels = TRUE) +
  scale_fill_continuous(name = "Currently having Asthma", low = "white", high = "dark green", label = comma) +
  labs(title = "Visual Presentation of Asthma Count by State") +
  theme(legend.position = "right") 

plot_usmap (data = ByState, values = "PercentByState", labels = TRUE, line = "white") +
  scale_fill_continuous(name = "Currently having Asthma", low = "white", high = "steel blue", label = comma) +
  labs(title = "Visual Presentation of Asthma Prevalence by State", 
             subtitle = "The map did not show Puerto Rico and Guam. From the color scale, 
             West Virginia (WV) had the highest percentage of respondents reporting 
             suffering from asthma in 2017. Meanwhile, Minnesota (MN) had the lowest percentage.") +
  theme(legend.position = "right")

Appendix 2 - Reported Ages that Asthma was Diagnosed

Asthma_Diagnosed_Age <- asthma2017 %>%
  filter(!is.na(AsthmaAge))%>%
  ggplot(aes(x = AsthmaAge)) +
  geom_bar(fill = "navy") +
  coord_flip() +
  guides(fill = FALSE)+
  ggtitle("Asthma Diagnosed Age")

Asthma_Diagnosed_Age

Appendix 3 - Reported Asthma Care by Respondents with Asthma

Asthma_ERVisit <- asthma2017 %>%
  filter(!is.na(ERVisit))%>%
  ggplot(aes(x = ERVisit)) +
  geom_bar(  fill = "tomato4") +
  coord_flip() +
  guides(fill = FALSE)+
  ggtitle("Frequency of ER Visit")

Asthma_Urgent_Care <- asthma2017 %>%
  filter(!is.na(UrgentCare))%>%
  ggplot(aes(x = UrgentCare)) +
  geom_bar(  fill = "thistle4") +
  coord_flip() +
  guides(fill = FALSE)+
  ggtitle("Frequency of Urgent Care")

cowplot::plot_grid(Asthma_ERVisit, Asthma_Urgent_Care)

DF11.1_Summary_Asthma_Attack_Past_Year <- curr_ast_17 %>%
  filter(!is.na(Last12MonthsAttack)) %>%
  group_by(Last12MonthsAttack) %>%
  summarize(Count = n())

DF11.2_Summary_Asthma_Care <- curr_ast_17 %>%
  filter(!is.na(Last12MonthsAttack)) %>%
  group_by(Last12MonthsAttack, ERVisit, UrgentCare) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count))

DF11.1_Summary_Asthma_Attack_Past_Year 
## # A tibble: 2 x 2
##   Last12MonthsAttack Count
##                <dbl> <int>
## 1                  0   441
## 2                  1   308
DF11.2_Summary_Asthma_Care 
## # A tibble: 15 x 4
## # Groups:   Last12MonthsAttack, ERVisit [6]
##    Last12MonthsAttack ERVisit            UrgentCare        Count
##                 <dbl> <chr>              <chr>             <int>
##  1                  0 <NA>               <NA>                441
##  2                  1 None               None                122
##  3                  1 Less than 5 Times  Less than 5 Times    60
##  4                  1 None               Less than 5 Times    59
##  5                  1 Less than 5 Times  None                 42
##  6                  1 5 to 20 times      Less than 5 Times     8
##  7                  1 5 to 20 times      5 to 20 times         4
##  8                  1 <NA>               None                  3
##  9                  1 Less than 5 Times  <NA>                  2
## 10                  1 Less than 5 Times  5 to 20 times         2
## 11                  1 None               <NA>                  2
## 12                  1 <NA>               <NA>                  1
## 13                  1 <NA>               Less than 5 Times     1
## 14                  1 More than 80 times Less than 5 Times     1
## 15                  1 None               5 to 20 times         1

Appendix 4 - Reported Checkup Frequency by Respondents with Asthma

Asthma_CheckUp <- asthma2017 %>%
  filter(!is.na(RoutineCheckUp)) %>%
  ggplot(aes(x = RoutineCheckUp)) +
  geom_bar(  fill = "tan3") +
  coord_flip() +
  guides(fill = FALSE)+
  ggtitle("Frequency of Routine Checkups")

Asthma_CheckUp

Appendix 5 - Reported Symptoms of Respondents with Asthma

DF12.1_Asthma_Symptoms <- asthma2017 %>%
  filter(!is.na(AsthmaSymptoms))%>%
  group_by(AsthmaSymptoms) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

DF12.1_Asthma_Symptoms
## # A tibble: 6 x 2
##   AsthmaSymptoms                 count
##   <chr>                          <int>
## 1 Not at any time                  281
## 2 Once or twice a week             138
## 3 Less than once a Week            125
## 4 More than twice, not every day    74
## 5 Every day, not all the time       63
## 6 Every day, all the time           38
DF12.2_Asthma_Symptoms_vs_Attacks <- asthma2017 %>%
  filter(!is.na(Last12MonthsAttack), !is.na(AsthmaSymptoms))%>%
  group_by(Last12MonthsAttack, AsthmaSymptoms) %>%
  summarize(count = n()) %>%
  arrange(Last12MonthsAttack, desc(count))

DF12.2_Asthma_Symptoms_vs_Attacks 
## # A tibble: 12 x 3
## # Groups:   Last12MonthsAttack [2]
##    Last12MonthsAttack AsthmaSymptoms                 count
##                 <dbl> <chr>                          <int>
##  1                  0 Not at any time                  219
##  2                  0 Less than once a Week             69
##  3                  0 Once or twice a week              61
##  4                  0 More than twice, not every day    34
##  5                  0 Every day, not all the time       27
##  6                  0 Every day, all the time           12
##  7                  1 Once or twice a week              76
##  8                  1 Not at any time                   60
##  9                  1 Less than once a Week             56
## 10                  1 More than twice, not every day    39
## 11                  1 Every day, not all the time       36
## 12                  1 Every day, all the time           26

Appendix 6 - Medication Usage of Respondents with Asthma

Asthma_Days_With_Meds <- asthma2017 %>%
  filter(!is.na(DaysWithMeds))%>%
  ggplot(aes(x = DaysWithMeds)) +
  geom_bar(  fill = "slateblue3") +
  coord_flip() +
  guides(fill = FALSE)+
  ggtitle("Number of Days with Medicine for Asthma")

Asthma_Inhaler_Used <- asthma2017 %>%
  filter(!is.na(InhalerForAttack))%>%
  ggplot(aes(x = InhalerForAttack)) +
  geom_bar(  fill = "turquoise") +
  coord_flip() +
  guides(fill = FALSE)+
  ggtitle("Frequency of Inhaler Used")

plot_grid(Asthma_Days_With_Meds, Asthma_Inhaler_Used)

Appendix 7 - Logistic Regression Coefficients (Full Dataset)

asthma_lm <- asthma %>% 
  dplyr::select(AsthmaStatus, AgeGroup, Sex, Race, IncomeGroup, EducationLevel, BMICategory, CurrentSmoker, Exercise, DepressiveDisorder) %>% 
  filter(!is.na(AsthmaStatus), !is.nan(AsthmaStatus)) %>%
  mutate(AsthmaStatus = replace(AsthmaStatus, AsthmaStatus == "Current", 1),
         AsthmaStatus = replace(AsthmaStatus, AsthmaStatus == "Former", 0),
         AsthmaStatus = replace(AsthmaStatus, AsthmaStatus == "Never", 0)) %>%
  as_factor()

asthma_lm$AsthmaStatus <- as.numeric(asthma_lm$AsthmaStatus)

asthma_lm$Race <- relevel(as_factor(asthma_lm$Race), ref="Asian")
asthma_lm$EducationLevel <- relevel(as_factor(asthma_lm$EducationLevel), ref="Did not Attend High School")

asthma_narm <- asthma_lm[complete.cases(asthma_lm),]
asthma_narm <- data.frame(asthma_narm)

asthma.lr <- glm(AsthmaStatus ~., family=binomial(link="logit"), data = asthma_narm)

summary.full <- summary(asthma.lr)
  1. Age Group
knitr::kable(summary.full$coefficients[2:6,])
Estimate Std. Error z value Pr(>|z|)
AgeGroup25 to 34 -0.1950794 0.0327872 -5.949867 0
AgeGroup35 to 44 -0.2243081 0.0323279 -6.938537 0
AgeGroup45 to 54 -0.1780118 0.0309175 -5.757636 0
AgeGroup55 to 64 -0.2007309 0.0298474 -6.725239 0
AgeGroup65 and above -0.2958622 0.0291774 -10.140114 0
  1. Sex
summary.full$coefficients[7,]
##     Estimate   Std. Error      z value     Pr(>|z|) 
##  -0.52007641   0.01313706 -39.58848171   0.00000000
  1. Race
knitr::kable(summary.full$coefficients[8:14,])
Estimate Std. Error z value Pr(>|z|)
RaceWhite 0.2922191 0.0536361 5.448178 0.0000001
RaceBlack 0.3713658 0.0569985 6.515365 0.0000000
RaceMultiracial, non-Hispanic 0.8233547 0.0628520 13.099902 0.0000000
RaceHispanic 0.1271339 0.0578223 2.198701 0.0278992
RaceNative Hawaiian/other Pacific Islander 0.2623479 0.1155389 2.270645 0.0231685
RaceOthers 0.4434298 0.1017032 4.360037 0.0000130
RaceAmerican Indian/Alaskan Native 0.5381696 0.0655466 8.210484 0.0000000
  1. Income Group
knitr::kable(summary.full$coefficients[15:21,])
Estimate Std. Error z value Pr(>|z|)
IncomeGroup$15,000 to $19,999 -0.1777974 0.0294723 -6.032697 0.0000000
IncomeGroup$20,000 to $24,999 -0.2929805 0.0289376 -10.124554 0.0000000
IncomeGroup$25,000 to $34,999 -0.4128075 0.0288733 -14.297218 0.0000000
IncomeGroup$35,000 to $49,999 -0.4935480 0.0280708 -17.582274 0.0000000
IncomeGroup$50,000 to $74,999 -0.5414332 0.0281641 -19.224253 0.0000000
IncomeGroup$75,000 or above -0.5893166 0.0269603 -21.858671 0.0000000
IncomeGroupless than $10,000 0.0999899 0.0308015 3.246273 0.0011693
  1. Education Level
knitr::kable(summary.full$coefficients[22:24,])
Estimate Std. Error z value Pr(>|z|)
EducationLevelGraduated from College/Technical School -0.1235646 0.0261702 -4.721578 2.3e-06
EducationLevelGraduated High School -0.2376996 0.0249263 -9.536087 0.0e+00
EducationLevelAttended College/Technical School -0.1241886 0.0251241 -4.943002 8.0e-07
  1. BMI Category
knitr::kable(summary.full$coefficients[25:27,])
Estimate Std. Error z value Pr(>|z|)
BMICategoryObese 0.5488615 0.0159395 34.434147 0.0000000
BMICategoryOverweight 0.1358312 0.0166493 8.158364 0.0000000
BMICategoryUnderweight 0.1384883 0.0493552 2.805951 0.0050168
  1. Smoke Status
summary.full$coefficients[28,]
##      Estimate    Std. Error       z value      Pr(>|z|) 
## -9.949962e-02  1.675651e-02 -5.937968e+00  2.885767e-09
  1. Exercise
summary.full$coefficients[29,]
##      Estimate    Std. Error       z value      Pr(>|z|) 
## -1.414922e-01  1.363303e-02 -1.037863e+01  3.101855e-25
  1. Depressive Disorders
summary.full$coefficients[30,]
##   Estimate Std. Error    z value   Pr(>|z|) 
##  0.7213164  0.0134119 53.7818211  0.0000000

Appendix 8 - Logistic Regression Coefficients Plot

coefplot(asthma.lr, color = 'royalblue4')

c("2LL"=-2*logLik(asthma.lr), "Deviance"=deviance(asthma.lr), "AIC"=AIC(asthma.lr))
##      2LL Deviance      AIC 
## 193729.3 193729.3 193789.3

Appendix 9 - Transformed Log-Odd, Odds and Probability

log.odds = coef(asthma.lr) 

odds <- exp(coef(asthma.lr)) 

prob = odds/(1+odds) 

stat = cbind("Log-Odds"=log.odds, "Odds"=odds, "Probabilities"=prob)

stat
##                                                          Log-Odds
## (Intercept)                                           -1.85077288
## AgeGroup25 to 34                                      -0.19507940
## AgeGroup35 to 44                                      -0.22430807
## AgeGroup45 to 54                                      -0.17801177
## AgeGroup55 to 64                                      -0.20073091
## AgeGroup65 and above                                  -0.29586219
## Sex                                                   -0.52007641
## RaceWhite                                              0.29221913
## RaceBlack                                              0.37136576
## RaceMultiracial, non-Hispanic                          0.82335466
## RaceHispanic                                           0.12713389
## RaceNative Hawaiian/other Pacific Islander             0.26234787
## RaceOthers                                             0.44342983
## RaceAmerican Indian/Alaskan Native                     0.53816961
## IncomeGroup$15,000 to $19,999                         -0.17779741
## IncomeGroup$20,000 to $24,999                         -0.29298049
## IncomeGroup$25,000 to $34,999                         -0.41280746
## IncomeGroup$35,000 to $49,999                         -0.49354800
## IncomeGroup$50,000 to $74,999                         -0.54143325
## IncomeGroup$75,000 or above                           -0.58931661
## IncomeGroupless than $10,000                           0.09998993
## EducationLevelGraduated from College/Technical School -0.12356456
## EducationLevelGraduated High School                   -0.23769956
## EducationLevelAttended College/Technical School       -0.12418858
## BMICategoryObese                                       0.54886148
## BMICategoryOverweight                                  0.13583122
## BMICategoryUnderweight                                 0.13848834
## CurrentSmoker                                         -0.09949962
## Exercise                                              -0.14149221
## DepressiveDisorder                                     0.72131642
##                                                            Odds
## (Intercept)                                           0.1571157
## AgeGroup25 to 34                                      0.8227693
## AgeGroup35 to 44                                      0.7990689
## AgeGroup45 to 54                                      0.8369326
## AgeGroup55 to 64                                      0.8181326
## AgeGroup65 and above                                  0.7438899
## Sex                                                   0.5944751
## RaceWhite                                             1.3393965
## RaceBlack                                             1.4497132
## RaceMultiracial, non-Hispanic                         2.2781294
## RaceHispanic                                          1.1355690
## RaceNative Hawaiian/other Pacific Islander            1.2999787
## RaceOthers                                            1.5580419
## RaceAmerican Indian/Alaskan Native                    1.7128688
## IncomeGroup$15,000 to $19,999                         0.8371120
## IncomeGroup$20,000 to $24,999                         0.7460367
## IncomeGroup$25,000 to $34,999                         0.6617897
## IncomeGroup$35,000 to $49,999                         0.6104566
## IncomeGroup$50,000 to $74,999                         0.5819136
## IncomeGroup$75,000 or above                           0.5547062
## IncomeGroupless than $10,000                          1.1051598
## EducationLevelGraduated from College/Technical School 0.8837646
## EducationLevelGraduated High School                   0.7884395
## EducationLevelAttended College/Technical School       0.8832133
## BMICategoryObese                                      1.7312808
## BMICategoryOverweight                                 1.1454885
## BMICategoryUnderweight                                1.1485363
## CurrentSmoker                                         0.9052903
## Exercise                                              0.8680619
## DepressiveDisorder                                    2.0571395
##                                                       Probabilities
## (Intercept)                                               0.1357822
## AgeGroup25 to 34                                          0.4513842
## AgeGroup35 to 44                                          0.4441569
## AgeGroup45 to 54                                          0.4556142
## AgeGroup55 to 64                                          0.4499851
## AgeGroup65 and above                                      0.4265693
## Sex                                                       0.3728344
## RaceWhite                                                 0.5725393
## RaceBlack                                                 0.5917890
## RaceMultiracial, non-Hispanic                             0.6949480
## RaceHispanic                                              0.5317407
## RaceNative Hawaiian/other Pacific Islander                0.5652134
## RaceOthers                                                0.6090760
## RaceAmerican Indian/Alaskan Native                        0.6313865
## IncomeGroup$15,000 to $19,999                             0.4556674
## IncomeGroup$20,000 to $24,999                             0.4272744
## IncomeGroup$25,000 to $34,999                             0.3982391
## IncomeGroup$35,000 to $49,999                             0.3790581
## IncomeGroup$50,000 to $74,999                             0.3678542
## IncomeGroup$75,000 or above                               0.3567917
## IncomeGroupless than $10,000                              0.5249767
## EducationLevelGraduated from College/Technical School     0.4691481
## EducationLevelGraduated High School                       0.4408533
## EducationLevelAttended College/Technical School           0.4689927
## BMICategoryObese                                          0.6338714
## BMICategoryOverweight                                     0.5339057
## BMICategoryUnderweight                                    0.5345669
## CurrentSmoker                                             0.4751456
## Exercise                                                  0.4646858
## DepressiveDisorder                                        0.6728968

Appendix 10 - Logistic Regression Machine Learning (For Trained subset)

set.seed(12345)
training <- sample(1:nrow(asthma_narm), 0.6*nrow(asthma_narm))
asthma.training <- asthma_narm[training,]
asthma.test <- seq(1:nrow(asthma_narm))[-training]
asthma.test.results = asthma_narm[-training,1]


asthma.lr.ML <- glm(AsthmaStatus ~., family=binomial(link="logit"), data = asthma.training)

summary(asthma.lr.ML)
## 
## Call:
## glm(formula = AsthmaStatus ~ ., family = binomial(link = "logit"), 
##     data = asthma.training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3438  -0.4753  -0.3738  -0.3044   2.6971  
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                           -1.86045    0.08658
## AgeGroup25 to 34                                      -0.20212    0.04203
## AgeGroup35 to 44                                      -0.23964    0.04154
## AgeGroup45 to 54                                      -0.16586    0.03956
## AgeGroup55 to 64                                      -0.22879    0.03830
## AgeGroup65 and above                                  -0.30323    0.03738
## Sex                                                   -0.51477    0.01695
## RaceWhite                                              0.27639    0.06854
## RaceBlack                                              0.34375    0.07292
## RaceMultiracial, non-Hispanic                          0.83550    0.08059
## RaceHispanic                                           0.10889    0.07395
## RaceNative Hawaiian/other Pacific Islander             0.30366    0.14614
## RaceOthers                                             0.33038    0.13479
## RaceAmerican Indian/Alaskan Native                     0.52465    0.08436
## IncomeGroup$15,000 to $19,999                         -0.17337    0.03822
## IncomeGroup$20,000 to $24,999                         -0.29175    0.03757
## IncomeGroup$25,000 to $34,999                         -0.40846    0.03751
## IncomeGroup$35,000 to $49,999                         -0.46115    0.03624
## IncomeGroup$50,000 to $74,999                         -0.52582    0.03653
## IncomeGroup$75,000 or above                           -0.59144    0.03504
## IncomeGroupless than $10,000                           0.12251    0.03993
## EducationLevelGraduated from College/Technical School -0.12354    0.03381
## EducationLevelGraduated High School                   -0.22527    0.03218
## EducationLevelAttended College/Technical School       -0.12322    0.03245
## BMICategoryObese                                       0.56637    0.02058
## BMICategoryOverweight                                  0.11901    0.02158
## BMICategoryUnderweight                                 0.19162    0.06268
## CurrentSmoker                                         -0.08020    0.02173
## Exercise                                              -0.13717    0.01759
## DepressiveDisorder                                     0.71921    0.01733
##                                                       z value Pr(>|z|)    
## (Intercept)                                           -21.489  < 2e-16 ***
## AgeGroup25 to 34                                       -4.809 1.52e-06 ***
## AgeGroup35 to 44                                       -5.768 8.00e-09 ***
## AgeGroup45 to 54                                       -4.193 2.76e-05 ***
## AgeGroup55 to 64                                       -5.974 2.31e-09 ***
## AgeGroup65 and above                                   -8.113 4.95e-16 ***
## Sex                                                   -30.364  < 2e-16 ***
## RaceWhite                                               4.032 5.52e-05 ***
## RaceBlack                                               4.714 2.43e-06 ***
## RaceMultiracial, non-Hispanic                          10.367  < 2e-16 ***
## RaceHispanic                                            1.472 0.140895    
## RaceNative Hawaiian/other Pacific Islander              2.078 0.037725 *  
## RaceOthers                                              2.451 0.014242 *  
## RaceAmerican Indian/Alaskan Native                      6.219 4.99e-10 ***
## IncomeGroup$15,000 to $19,999                          -4.536 5.72e-06 ***
## IncomeGroup$20,000 to $24,999                          -7.766 8.07e-15 ***
## IncomeGroup$25,000 to $34,999                         -10.890  < 2e-16 ***
## IncomeGroup$35,000 to $49,999                         -12.724  < 2e-16 ***
## IncomeGroup$50,000 to $74,999                         -14.394  < 2e-16 ***
## IncomeGroup$75,000 or above                           -16.879  < 2e-16 ***
## IncomeGroupless than $10,000                            3.068 0.002151 ** 
## EducationLevelGraduated from College/Technical School  -3.654 0.000258 ***
## EducationLevelGraduated High School                    -7.000 2.57e-12 ***
## EducationLevelAttended College/Technical School        -3.797 0.000146 ***
## BMICategoryObese                                       27.524  < 2e-16 ***
## BMICategoryOverweight                                   5.514 3.51e-08 ***
## BMICategoryUnderweight                                  3.057 0.002236 ** 
## CurrentSmoker                                          -3.690 0.000224 ***
## Exercise                                               -7.796 6.38e-15 ***
## DepressiveDisorder                                     41.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 122980  on 196860  degrees of freedom
## Residual deviance: 116132  on 196831  degrees of freedom
## AIC: 116192
## 
## Number of Fisher Scoring iterations: 5

Appendix 11 - ROC Curve

asthma.test.probabilities <- predict(asthma.lr.ML, asthma_narm, type = "response")[asthma.test]

pred <- prediction(asthma.test.probabilities, asthma.test.results) 

perf <- performance(pred,"tpr","fpr")
plot(perf, colorize=T)

auc <- performance(pred,"auc")

c(auc@y.name[[1]], auc@y.values[[1]])
## [1] "Area under the ROC curve" "0.66954885547759"

Appendix - Confusion Matrix

  • Confusion Matrix
asthma.pred.test <- ifelse(asthma.test.probabilities > 0.16, 1, 0)  

conf.mat <- table("Predicted" = asthma.pred.test, "Actual" = asthma.test.results) 
colnames(conf.mat) = c("No", "Yes")
rownames(conf.mat) = c("No", "Yes")
conf.mat
##          Actual
## Predicted     No    Yes
##       No  106628   8827
##       Yes  12214   3572
  • Evaluation
TruN <- conf.mat[1,1] 
TruP <- conf.mat[2,2] 
FalN <- conf.mat[1,2] 
FalP <- conf.mat[2,1] 
TotN <- conf.mat[1,1] + conf.mat[2,1] 
TotP <- conf.mat[1,2] + conf.mat[2,2] 
Tot <- TotN+TotP 

Accuracy.Rate <- (TruN + TruP) / Tot

Error.Rate <- (FalN + FalP) / Tot

Sensitivity <- TruP / TotP 

Specificity <- TruN / TotN 

FalseP.Rate <- 1 - Specificity # False positive rate

logit.rates.50 <- c(Accuracy.Rate, Error.Rate, Sensitivity, Specificity, FalseP.Rate)

names(logit.rates.50) <- c("Accuracy Rate", "Error Rate", "True Positives", "True Negatives", "False Positives")

options(scipen="4")
print(logit.rates.50, digits=2)
##   Accuracy Rate      Error Rate  True Positives  True Negatives 
##            0.84            0.16            0.29            0.90 
## False Positives 
##            0.10

References

  1. Mayo Clinic. (2018). Asthma. Mayo Clinic (2018). Retrieved from https://www.mayoclinic.org/diseases-conditions/asthma/symptoms-causes/syc-20369653 in 12/1/2018