In this study, the following questions would be answered using the 2017 data from The Behavioral Risk Factor Surveillance System (BRFSS):
1. Which state in the U.S. had the highest probability of finding a person suffering from asthma in 2017? Which state had the lowest possibility?
2. At what age was asthma most commonly diagnosed for the selected respondents? Which medical care(s) did they need and how often did they need it (them)?
3. Could the likelihood of having asthma be predicted using logistic regression model using the associated demographic and health risk factors found in literature? If yes, how accurate were these predictions?
The descriptive analytics goals of this study were to explore the geographic distribution of asthma in the U.S. by state, and to visualize the behaviors of those who reported having asthma. The predictive analytics goal of this study was to test how accurately a binary logistic regression model could predict the likelihood a respondent having a specific set of characteristics would have asthma. This was a classification predictive problem.
As mentioned in previous study, asthma is a very common chronic respiratory disease where a person’s airways to the lung are inflamed, narrowed, swollen, and often produce extra thickened mucus which make breathing difficult (Mayo Clinic, 2018). Millions of Americans are affected by asthma every year. Asthma may not pose a major problem to some people, but for others, it can gravely interfere with daily activities. An asthma-attack can be life-threatening. Understanding the medical needs of asthma patients, in which states were asthma patients most and least likely to be found, is important for asthma prevention measures, and reliefs of asthma symptoms. Furthermore, having an algorithm to accurately predict who would be more in danger of suffering from asthma, will be extremely beneficial to medical providers to devise prevention methods, and better prepare and allocate resources for diagnosis and necessary treatments.
The Behavioral Risk Factor Surveillance System (BRFSS) is a state-based cross-sectional telephone survey initiated by the Centers for Disease Control and Prevention (CDC). The state health departments have been conducting the survey monthly over land-lines and cellular telephones using a standardized questionnaire and other assistance from the CDC. The survey data sets contain relevant data that are useful for asthma analysis under the scope of this project.
The original data set was obtained from the CDC website, under the Behavioral Risk Factor Surveillance System, following this link.
The data set has 450,016 observations for 358 variables including records of respondents’ answers and calculated variables that categorized original data points into groups represented by an integer each.
The variable selection and data cleaning process were similar to previous study thanks to the CDC’s effort to standardize the data input and document the variable descriptions. The cleaned dataset used in this study can be viewed here.
Appendix A demonstrated data “read-in” and the packages to be used. Appendix B listed the variables chosen for analysis with their respective descriptions from the CDC Codebook.
Previous study investigated the demographic trends and risk factors associated with asthma, and recommended exploring asthma distribution by state and a closer look at the respondents chosen for the asthma call-back survey. The graphs and analysis below followed the future research suggestions from previous study.
Appendix 1 displayed two maps: The green-shaded one was count distribution of respondents reported having asthma, and the blue-shaded one was a probability heat map showing asthma prevalence by state.
The two maps were slightly different. States with the darkest green shades (the largest number of respondents reported having asthma) were Arizona (AZ), Florida (FL), and Kansas (KS), and states with the lightest green shades (the least number of respondents with asthma) were Nevada (NZ) and Alaska (AK). Meanwhile, the blue-shaded map indicated that asthma was the most prevalent in West Virginia (WV) and the least prevalent in Minnesota (MN).
The graph in Appendix 2 showed that asthma was more commonly diagnosed for children or teenage (20 and below). This is expected since asthma tends to be more prevalent among children, and showed symptoms since childhood.
Among the respondents currently having asthma, 749 of them were asked if they had an asthma attack 12 months before the interview. Less than half of them had to visit the ER or required urgent care for an asthma attack, which agreed to what DF12 showed: only 308 respondents among the 749 had an attack 12 months before the interview. (Appendix 3)
The probability that a respondent who had an attack 12 months before the interview also required urgent care was \(1- 122/308 = 0.6038961\), which was higher than expected. The majority of respondents asked also did not go to more than 5 checkups a year for asthma. (Appendix 4)
For more than half of the respondents who was asked if they had asthma attacks in the past 12 months, asthma symptoms showed twice a week or less. However, only 219 among 441 of those who did not have an attack 12 months before the interview showed no symptoms, which was less than half. All respondents who reported having an attack 12 months before the interview had asthma symptoms at least once. Most respondents did not report using prevention medicines for asthma, or used inhalers to stop an attack. (Appendix 5 and 6)
Binary Logistic Regression Model was run on the full data, using all demographic and health risk factors: “AgeGroup”, “Sex”, “Race”, “Income Group”, “EducationLevel”, “CurrentSmoker”, “Exercise”, “BMICategory”, and “DepressiveDisorders” as predictors, predicting the log-odd of the response variable “AsthmaStatus”. All predictors were categorical variables, presented as factors with levels in the logistic regression model.
The response variable “AsthmaStatus” was originally a categorical variable of three levels. To perform binary logistic regression model, it was converted to a binary variable, with 1 representing the “Current” status, or respondents “currently” having asthma at the time of the survey, and 0 representing both “Former” and “Never” statuses, or respondents “not currently” having asthma at the time of the survey.
Logistic regression was performed using the glm() function in package {stat}, with “family=binomial(link=”logit“)”. Appendix 7 displayed the algorithm and the coefficient table of each level of predictors. All predictors were significant in predicting the log-odd of the response variable.
A coefficient plot created in Appendix 8 was the visualization of the significance of the coefficients of all predictors. Appendix 9 transformed the log-odd to odds and probabilities as another way of presenting the same results.
All coefficients in the table below were negative. With “Age Group 18 to 24” being the reference level, the negative coefficients suggested the higher the age group a respondent belonged in, the less likely he/she would have reported having asthma. This result agreed to descriptive analytics results.
As Sex moved from 0 to 1, log-likelihood of having asthma decreased. This means female respondents were more likely to then suffered from asthma compared to male respondents.
Compared to Asian, which was selected as the reference level, all other races had higher asthma prevalence, with the Multiracial group having the highest difference in log-odd.
Income Group from 10,000 USD to 14,999 USD set as the reference level, a respondent earning less than that was more likely to have asthma (as shown by the positive coefficient), while for other Income Group, as income increased, the likelihood decreased (illustrated by the increasingly more negative coefficients).
From descriptive analytics, the probability of respondents having asthma given they had graduated from college/technical school was the lowest. It was expected that the coefficient for this level would be lowest. However, compared to the set reference level (Education Level - Did not Attend High School), Education Level - Graduated from High School had the most negative coefficient, indicating that respondents who only graduated high school had an even lower log-odd of having asthma than those achieved higher level of education. The log-likelihood of having asthma were similar between those attended and graduated from college/technical school.
With Normal Weight being the set reference level, all other BMI Category coefficients were positive, with Obese being the most negative - Obese respondents were the most likely to have asthma.
The negative coefficient as CurrentSmoker variable “moved” from 0 to 1 indicated that smokers were more likely to have asthma.
Similarly, the negative coefficient for Activeness supported the descriptive analytics results - Respondents who had not engaged in physical activities 30 days prior to the interview were more likely to have asthma.
A respondent with some sort of depressive disorder were much more likely to suffer from asthma, as suggested from the positive coefficient.
Data Partitioning and Logistic Regression Model Setup
Since the response variable - Asthma Status was binary, again, logistic regression was performed, using the glm() function, with “family=binomial(link=”logit“).” The data was partitioned into a training set - a sample of 60% of observations, and a test set - the rest of the data. Appendix 10 presented the partitioning set up and the algorithm.
Confusion Matrix and ROC Curve
To evaluated the performance of the logistic regression model in correctly classifying the test subset’s respondents in “Having Asthma” and “Not Having Asthma” groups, a Receiver Operating Characteristics (ROC) Curve was drawn to visualize the trade off between Sensitivity and Specificity of this model. The curve was constructed using the prediction{ROCR} and performance{ROCR} functions (Appendix 11). The area under the ROC curve was close to 0.67, which was higher than 0.5, indicating that the model performed better than chance.
A confusion matrix was constructed to better evaluate how well the logistic regression model developed from the train subset predicted the test subset, by Accuracy Rate, Error Rate, Sensitivity (True Positive), Specificity (True Negative), and False Positive. (Appendix 12)
Since only a small proportion of the dataset gave a positive response (less than 10% of the sampled observations reported “currently having asthma” in the 2017 survey), the classification threshold was set to a low level of 16%. This means if the predicted probability of having asthma was more than 16%, the respondent was classified as having asthma, and vice versa, a predicted probability of 16% or below was classified as not having asthma.
At 16% threshold, the model predicted with 84% accuracy rate, 16% error rate, 29% True-Positive (Sensitivity) Rate, and 90% True-Negative (Specificity) Rate. Thus, 30% of the times, if a respondent had a calculated asthma probability of more than 16%, the model predicted accurately that he/she would have asthma; 90% of the times, if a respondent had a calculated asthma probability of 16% or below, the model predicted accurately that he/she would not have asthma.
Asthma was most commonly diagnosed among children and young adults (age 20 or younger). Less than half of asthmatics had not suffer from an attack 12 months prior to the survey, but more than half had symptoms twice a week or less. Among those who had an attack, less than half needed emergency care and ER support. Inhaler usage was also low among all respondents.
All demographic and risk factors predictors were significant in predicting the likelihood of asthma. The interpretation generally did not differ considerably from descriptive analytics results, except for Education Level.
Machine learning logistic regression algorithm with a 60% train subset classify the asthmatics correctly 29% of the times, and the non-asthmatics correctly 90% of the times, using a classification threshold of 16%. The predictive accuracy of 84% and error rate of 16% were acceptable, especially when the area under the ROC curve suggested the model performed better than chance.
A ShinyApp was designed to display selected descriptive and predictive analytics results from both studies.
Data was collected from land-line phone survey. The sampled population using land-line telephones were not representative of the entire U.S. population, urging future researchers to find better survey sampling and data gathering methods.
Only very few asthmatic respondents were contacted for the asthma survey, generating a very small dataset with large number of variables the resulted in low degree of freedom, affecting predictive analytics results. If possible, a larger and more complete dataset for asthma call-back survey data would be useful for future analysis.
Missing data occurred randomly across data columns and rows were all removed for the glm() function to run properly. These data had the potential to drastically changed the modeling results, prompting future research to look into other predictive models such as trees, multinomial logistic regression, Least Discriminant Analysis (LDA), etc.
The majority of the codes in this document may not be reproducible for future dataset. More universal functions could be developed to make analysis more repeatable.
Due to time constraint, this study did not look into the change in asthma prevalence, distribution and significant of predictors over time. This could be a promising area for future research, where time-series analysis, regression with lagging variables, etc. could be performed for visualization and better predictive modeling.
##Preparation
library(tidyverse)
library(scales)
##Visual
library(viridis)
library(cowplot)
library(gridExtra)
library(grid)
library(usmap)
##Predictive modeling
library(boot)
library(coefplot)
library(ROCR)
if(!file.exists("Data")){dir.create("Data")}
asthma <- read_csv("./Data/Asthma2017.csv", guess_max = 69000)
#remove missing values for AsthmaStatus only
health2017_narm <- asthma %>%
filter(!is.na(AsthmaStatus))
#Data for all respondents who reported formerlly or currently having asthma
asthma2017 <- asthma %>%
filter(AsthmaStatus == "Current" | AsthmaStatus == "Former")
#Data for all respondents who reported currently having asthma
curr_ast_17 <- asthma2017 %>%
filter(AsthmaStatus == "Current")
curr_ast_17
## # A tibble: 42,254 x 21
## AsthmaStatus AgeGroup Sex Race IncomeGroup EducationLevel BMICategory
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Current 65 and ~ 0 White $10,000 to~ Did not Atten~ Normal
## 2 Current 55 to 64 1 Mult~ $10,000 to~ Did not Atten~ Obese
## 3 Current 65 and ~ 0 White $25,000 to~ Did not Atten~ Obese
## 4 Current 65 and ~ 1 Black <NA> Attended Coll~ Obese
## 5 Current 55 to 64 1 White less than ~ Did not Atten~ Obese
## 6 Current 65 and ~ 0 <NA> $10,000 to~ Graduated Hig~ Overweight
## 7 Current 65 and ~ 0 White $25,000 to~ Graduated fro~ Underweight
## 8 Current 55 to 64 0 White $75,000 or~ Graduated fro~ Obese
## 9 Current 65 and ~ 0 White <NA> Did not Atten~ Overweight
## 10 Current 45 to 54 0 White $10,000 to~ Graduated Hig~ Obese
## # ... with 42,244 more rows, and 14 more variables: CurrentSmoker <dbl>,
## # Exercise <dbl>, DepressiveDisorder <dbl>, State <dbl>,
## # AsthmaAge <chr>, Last12MonthsAttack <dbl>, ERVisit <chr>,
## # UrgentCare <chr>, RoutineCheckUp <chr>, LimitActivities <dbl>,
## # AsthmaSymptoms <chr>, AsthmaSleepDiffi <chr>, DaysWithMeds <chr>,
## # InhalerForAttack <chr>
#Data for all respondents who reported formerlly having asthma
form_ast_17 <- asthma2017 %>%
filter(AsthmaStatus == "Former")
Below is a summary of the variables, similar to that from previous study.
AsthmaStatus: If a respondent currently, formerly or never had asthma
Race: Respondents grouped by nine categories of race-ethnicity, with Hispanics grouped as a separate category
AgeGroup: Reported age in five-year age categories calculated variable
Sex: (binary) Female vs. Male
BMICategory: Respondents’ weight status by dividing reported weights by the square of their respective heights
EducationLevel: Respondents’ highest education level completed, sorted into four groups
IncomeGroup: Respondents’ reported income sorted into nine groups
CurrentSmoker: (binary) Whether respondents smoked a cigarette within 30 days before the interview
Exercise: (binary) Whether respondents engaged in physical activities within 30 days before the interview
DepressiveDisorder: (binary) Whether respondents had been suffering from, or was diagnosed with one or more depressive disorders
State: State FIPS code of where the respondents came from
Last12MonthsAttack: (binary) Whether respondents had an asthma attack or an asthma episode during the 12 month period before the interview.
AsthmaSymptoms: Number of times respondents experienced any symptom of asthma during the 30 day period before the interview
AsthmaSleepDiffi: Number of days respondents had difficulties sleeping because asthma during the 30 day period before the interview
DaysWithMeds: Number of days respondents took a prescription of asthma medication to prevent an asthma attack from occurring during the 30 day period before the interview
InhalerForAttack: Number of times respondents used a prescription of asthma inhaler during an asthma attack to stop it during the 30 day period before the interview
AsthmaAge: Age at first asthma diagnosis
ERVisit: Number of times respondents visited the emergency room or an urgent care center because of asthma during the 12 month period before the interview
UrgentCare: Number of times respondents specifically required urgent care for asthma during the 12 month period before the interview
RoutineCheckUp: Number of routine check-ups for asthma during the 12 month period before the interview
LimitActivities: Number of times respondents were unable to work or carry out activities because of asthma during the 12 month period before the interview
CurrAsthmaState <- curr_ast_17 %>%
group_by(State) %>%
summarise(CurrAsthmaCount = n())
SampleState <- health2017_narm %>%
group_by(State) %>%
summarize(SampleCount = n())
ByState <- as_tibble(cbind(CurrAsthmaState, "TotalSampled" = SampleState$SampleCount))
StateName = as.list(cdlTools::fips(ByState$State, to = "Name"))
StateName = replace(StateName, StateName == "Deleware", "Delaware")
ByState <- ByState %>%
mutate(PercentByState = ByState$CurrAsthmaCount/ByState$TotalSampled*100,
StateName = cdlTools::fips(State, to = "Name"),
StateWithPercent = paste(cdlTools::fips(State, to = "Name"), as.character(PercentByState), "%"))
ByState$fips = ByState$State
plot_usmap (data = ByState, values = "CurrAsthmaCount", labels = TRUE) +
scale_fill_continuous(name = "Currently having Asthma", low = "white", high = "dark green", label = comma) +
labs(title = "Visual Presentation of Asthma Count by State") +
theme(legend.position = "right")
plot_usmap (data = ByState, values = "PercentByState", labels = TRUE, line = "white") +
scale_fill_continuous(name = "Currently having Asthma", low = "white", high = "steel blue", label = comma) +
labs(title = "Visual Presentation of Asthma Prevalence by State",
subtitle = "The map did not show Puerto Rico and Guam. From the color scale,
West Virginia (WV) had the highest percentage of respondents reporting
suffering from asthma in 2017. Meanwhile, Minnesota (MN) had the lowest percentage.") +
theme(legend.position = "right")
Asthma_Diagnosed_Age <- asthma2017 %>%
filter(!is.na(AsthmaAge))%>%
ggplot(aes(x = AsthmaAge)) +
geom_bar(fill = "navy") +
coord_flip() +
guides(fill = FALSE)+
ggtitle("Asthma Diagnosed Age")
Asthma_Diagnosed_Age
Asthma_ERVisit <- asthma2017 %>%
filter(!is.na(ERVisit))%>%
ggplot(aes(x = ERVisit)) +
geom_bar( fill = "tomato4") +
coord_flip() +
guides(fill = FALSE)+
ggtitle("Frequency of ER Visit")
Asthma_Urgent_Care <- asthma2017 %>%
filter(!is.na(UrgentCare))%>%
ggplot(aes(x = UrgentCare)) +
geom_bar( fill = "thistle4") +
coord_flip() +
guides(fill = FALSE)+
ggtitle("Frequency of Urgent Care")
cowplot::plot_grid(Asthma_ERVisit, Asthma_Urgent_Care)
DF11.1_Summary_Asthma_Attack_Past_Year <- curr_ast_17 %>%
filter(!is.na(Last12MonthsAttack)) %>%
group_by(Last12MonthsAttack) %>%
summarize(Count = n())
DF11.2_Summary_Asthma_Care <- curr_ast_17 %>%
filter(!is.na(Last12MonthsAttack)) %>%
group_by(Last12MonthsAttack, ERVisit, UrgentCare) %>%
summarize(Count = n()) %>%
arrange(desc(Count))
DF11.1_Summary_Asthma_Attack_Past_Year
## # A tibble: 2 x 2
## Last12MonthsAttack Count
## <dbl> <int>
## 1 0 441
## 2 1 308
DF11.2_Summary_Asthma_Care
## # A tibble: 15 x 4
## # Groups: Last12MonthsAttack, ERVisit [6]
## Last12MonthsAttack ERVisit UrgentCare Count
## <dbl> <chr> <chr> <int>
## 1 0 <NA> <NA> 441
## 2 1 None None 122
## 3 1 Less than 5 Times Less than 5 Times 60
## 4 1 None Less than 5 Times 59
## 5 1 Less than 5 Times None 42
## 6 1 5 to 20 times Less than 5 Times 8
## 7 1 5 to 20 times 5 to 20 times 4
## 8 1 <NA> None 3
## 9 1 Less than 5 Times <NA> 2
## 10 1 Less than 5 Times 5 to 20 times 2
## 11 1 None <NA> 2
## 12 1 <NA> <NA> 1
## 13 1 <NA> Less than 5 Times 1
## 14 1 More than 80 times Less than 5 Times 1
## 15 1 None 5 to 20 times 1
Asthma_CheckUp <- asthma2017 %>%
filter(!is.na(RoutineCheckUp)) %>%
ggplot(aes(x = RoutineCheckUp)) +
geom_bar( fill = "tan3") +
coord_flip() +
guides(fill = FALSE)+
ggtitle("Frequency of Routine Checkups")
Asthma_CheckUp
DF12.1_Asthma_Symptoms <- asthma2017 %>%
filter(!is.na(AsthmaSymptoms))%>%
group_by(AsthmaSymptoms) %>%
summarize(count = n()) %>%
arrange(desc(count))
DF12.1_Asthma_Symptoms
## # A tibble: 6 x 2
## AsthmaSymptoms count
## <chr> <int>
## 1 Not at any time 281
## 2 Once or twice a week 138
## 3 Less than once a Week 125
## 4 More than twice, not every day 74
## 5 Every day, not all the time 63
## 6 Every day, all the time 38
DF12.2_Asthma_Symptoms_vs_Attacks <- asthma2017 %>%
filter(!is.na(Last12MonthsAttack), !is.na(AsthmaSymptoms))%>%
group_by(Last12MonthsAttack, AsthmaSymptoms) %>%
summarize(count = n()) %>%
arrange(Last12MonthsAttack, desc(count))
DF12.2_Asthma_Symptoms_vs_Attacks
## # A tibble: 12 x 3
## # Groups: Last12MonthsAttack [2]
## Last12MonthsAttack AsthmaSymptoms count
## <dbl> <chr> <int>
## 1 0 Not at any time 219
## 2 0 Less than once a Week 69
## 3 0 Once or twice a week 61
## 4 0 More than twice, not every day 34
## 5 0 Every day, not all the time 27
## 6 0 Every day, all the time 12
## 7 1 Once or twice a week 76
## 8 1 Not at any time 60
## 9 1 Less than once a Week 56
## 10 1 More than twice, not every day 39
## 11 1 Every day, not all the time 36
## 12 1 Every day, all the time 26
Asthma_Days_With_Meds <- asthma2017 %>%
filter(!is.na(DaysWithMeds))%>%
ggplot(aes(x = DaysWithMeds)) +
geom_bar( fill = "slateblue3") +
coord_flip() +
guides(fill = FALSE)+
ggtitle("Number of Days with Medicine for Asthma")
Asthma_Inhaler_Used <- asthma2017 %>%
filter(!is.na(InhalerForAttack))%>%
ggplot(aes(x = InhalerForAttack)) +
geom_bar( fill = "turquoise") +
coord_flip() +
guides(fill = FALSE)+
ggtitle("Frequency of Inhaler Used")
plot_grid(Asthma_Days_With_Meds, Asthma_Inhaler_Used)
asthma_lm <- asthma %>%
dplyr::select(AsthmaStatus, AgeGroup, Sex, Race, IncomeGroup, EducationLevel, BMICategory, CurrentSmoker, Exercise, DepressiveDisorder) %>%
filter(!is.na(AsthmaStatus), !is.nan(AsthmaStatus)) %>%
mutate(AsthmaStatus = replace(AsthmaStatus, AsthmaStatus == "Current", 1),
AsthmaStatus = replace(AsthmaStatus, AsthmaStatus == "Former", 0),
AsthmaStatus = replace(AsthmaStatus, AsthmaStatus == "Never", 0)) %>%
as_factor()
asthma_lm$AsthmaStatus <- as.numeric(asthma_lm$AsthmaStatus)
asthma_lm$Race <- relevel(as_factor(asthma_lm$Race), ref="Asian")
asthma_lm$EducationLevel <- relevel(as_factor(asthma_lm$EducationLevel), ref="Did not Attend High School")
asthma_narm <- asthma_lm[complete.cases(asthma_lm),]
asthma_narm <- data.frame(asthma_narm)
asthma.lr <- glm(AsthmaStatus ~., family=binomial(link="logit"), data = asthma_narm)
summary.full <- summary(asthma.lr)
knitr::kable(summary.full$coefficients[2:6,])
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| AgeGroup25 to 34 | -0.1950794 | 0.0327872 | -5.949867 | 0 |
| AgeGroup35 to 44 | -0.2243081 | 0.0323279 | -6.938537 | 0 |
| AgeGroup45 to 54 | -0.1780118 | 0.0309175 | -5.757636 | 0 |
| AgeGroup55 to 64 | -0.2007309 | 0.0298474 | -6.725239 | 0 |
| AgeGroup65 and above | -0.2958622 | 0.0291774 | -10.140114 | 0 |
summary.full$coefficients[7,]
## Estimate Std. Error z value Pr(>|z|)
## -0.52007641 0.01313706 -39.58848171 0.00000000
knitr::kable(summary.full$coefficients[8:14,])
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| RaceWhite | 0.2922191 | 0.0536361 | 5.448178 | 0.0000001 |
| RaceBlack | 0.3713658 | 0.0569985 | 6.515365 | 0.0000000 |
| RaceMultiracial, non-Hispanic | 0.8233547 | 0.0628520 | 13.099902 | 0.0000000 |
| RaceHispanic | 0.1271339 | 0.0578223 | 2.198701 | 0.0278992 |
| RaceNative Hawaiian/other Pacific Islander | 0.2623479 | 0.1155389 | 2.270645 | 0.0231685 |
| RaceOthers | 0.4434298 | 0.1017032 | 4.360037 | 0.0000130 |
| RaceAmerican Indian/Alaskan Native | 0.5381696 | 0.0655466 | 8.210484 | 0.0000000 |
knitr::kable(summary.full$coefficients[15:21,])
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| IncomeGroup$15,000 to $19,999 | -0.1777974 | 0.0294723 | -6.032697 | 0.0000000 |
| IncomeGroup$20,000 to $24,999 | -0.2929805 | 0.0289376 | -10.124554 | 0.0000000 |
| IncomeGroup$25,000 to $34,999 | -0.4128075 | 0.0288733 | -14.297218 | 0.0000000 |
| IncomeGroup$35,000 to $49,999 | -0.4935480 | 0.0280708 | -17.582274 | 0.0000000 |
| IncomeGroup$50,000 to $74,999 | -0.5414332 | 0.0281641 | -19.224253 | 0.0000000 |
| IncomeGroup$75,000 or above | -0.5893166 | 0.0269603 | -21.858671 | 0.0000000 |
| IncomeGroupless than $10,000 | 0.0999899 | 0.0308015 | 3.246273 | 0.0011693 |
knitr::kable(summary.full$coefficients[22:24,])
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| EducationLevelGraduated from College/Technical School | -0.1235646 | 0.0261702 | -4.721578 | 2.3e-06 |
| EducationLevelGraduated High School | -0.2376996 | 0.0249263 | -9.536087 | 0.0e+00 |
| EducationLevelAttended College/Technical School | -0.1241886 | 0.0251241 | -4.943002 | 8.0e-07 |
knitr::kable(summary.full$coefficients[25:27,])
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| BMICategoryObese | 0.5488615 | 0.0159395 | 34.434147 | 0.0000000 |
| BMICategoryOverweight | 0.1358312 | 0.0166493 | 8.158364 | 0.0000000 |
| BMICategoryUnderweight | 0.1384883 | 0.0493552 | 2.805951 | 0.0050168 |
summary.full$coefficients[28,]
## Estimate Std. Error z value Pr(>|z|)
## -9.949962e-02 1.675651e-02 -5.937968e+00 2.885767e-09
summary.full$coefficients[29,]
## Estimate Std. Error z value Pr(>|z|)
## -1.414922e-01 1.363303e-02 -1.037863e+01 3.101855e-25
summary.full$coefficients[30,]
## Estimate Std. Error z value Pr(>|z|)
## 0.7213164 0.0134119 53.7818211 0.0000000
coefplot(asthma.lr, color = 'royalblue4')
c("2LL"=-2*logLik(asthma.lr), "Deviance"=deviance(asthma.lr), "AIC"=AIC(asthma.lr))
## 2LL Deviance AIC
## 193729.3 193729.3 193789.3
log.odds = coef(asthma.lr)
odds <- exp(coef(asthma.lr))
prob = odds/(1+odds)
stat = cbind("Log-Odds"=log.odds, "Odds"=odds, "Probabilities"=prob)
stat
## Log-Odds
## (Intercept) -1.85077288
## AgeGroup25 to 34 -0.19507940
## AgeGroup35 to 44 -0.22430807
## AgeGroup45 to 54 -0.17801177
## AgeGroup55 to 64 -0.20073091
## AgeGroup65 and above -0.29586219
## Sex -0.52007641
## RaceWhite 0.29221913
## RaceBlack 0.37136576
## RaceMultiracial, non-Hispanic 0.82335466
## RaceHispanic 0.12713389
## RaceNative Hawaiian/other Pacific Islander 0.26234787
## RaceOthers 0.44342983
## RaceAmerican Indian/Alaskan Native 0.53816961
## IncomeGroup$15,000 to $19,999 -0.17779741
## IncomeGroup$20,000 to $24,999 -0.29298049
## IncomeGroup$25,000 to $34,999 -0.41280746
## IncomeGroup$35,000 to $49,999 -0.49354800
## IncomeGroup$50,000 to $74,999 -0.54143325
## IncomeGroup$75,000 or above -0.58931661
## IncomeGroupless than $10,000 0.09998993
## EducationLevelGraduated from College/Technical School -0.12356456
## EducationLevelGraduated High School -0.23769956
## EducationLevelAttended College/Technical School -0.12418858
## BMICategoryObese 0.54886148
## BMICategoryOverweight 0.13583122
## BMICategoryUnderweight 0.13848834
## CurrentSmoker -0.09949962
## Exercise -0.14149221
## DepressiveDisorder 0.72131642
## Odds
## (Intercept) 0.1571157
## AgeGroup25 to 34 0.8227693
## AgeGroup35 to 44 0.7990689
## AgeGroup45 to 54 0.8369326
## AgeGroup55 to 64 0.8181326
## AgeGroup65 and above 0.7438899
## Sex 0.5944751
## RaceWhite 1.3393965
## RaceBlack 1.4497132
## RaceMultiracial, non-Hispanic 2.2781294
## RaceHispanic 1.1355690
## RaceNative Hawaiian/other Pacific Islander 1.2999787
## RaceOthers 1.5580419
## RaceAmerican Indian/Alaskan Native 1.7128688
## IncomeGroup$15,000 to $19,999 0.8371120
## IncomeGroup$20,000 to $24,999 0.7460367
## IncomeGroup$25,000 to $34,999 0.6617897
## IncomeGroup$35,000 to $49,999 0.6104566
## IncomeGroup$50,000 to $74,999 0.5819136
## IncomeGroup$75,000 or above 0.5547062
## IncomeGroupless than $10,000 1.1051598
## EducationLevelGraduated from College/Technical School 0.8837646
## EducationLevelGraduated High School 0.7884395
## EducationLevelAttended College/Technical School 0.8832133
## BMICategoryObese 1.7312808
## BMICategoryOverweight 1.1454885
## BMICategoryUnderweight 1.1485363
## CurrentSmoker 0.9052903
## Exercise 0.8680619
## DepressiveDisorder 2.0571395
## Probabilities
## (Intercept) 0.1357822
## AgeGroup25 to 34 0.4513842
## AgeGroup35 to 44 0.4441569
## AgeGroup45 to 54 0.4556142
## AgeGroup55 to 64 0.4499851
## AgeGroup65 and above 0.4265693
## Sex 0.3728344
## RaceWhite 0.5725393
## RaceBlack 0.5917890
## RaceMultiracial, non-Hispanic 0.6949480
## RaceHispanic 0.5317407
## RaceNative Hawaiian/other Pacific Islander 0.5652134
## RaceOthers 0.6090760
## RaceAmerican Indian/Alaskan Native 0.6313865
## IncomeGroup$15,000 to $19,999 0.4556674
## IncomeGroup$20,000 to $24,999 0.4272744
## IncomeGroup$25,000 to $34,999 0.3982391
## IncomeGroup$35,000 to $49,999 0.3790581
## IncomeGroup$50,000 to $74,999 0.3678542
## IncomeGroup$75,000 or above 0.3567917
## IncomeGroupless than $10,000 0.5249767
## EducationLevelGraduated from College/Technical School 0.4691481
## EducationLevelGraduated High School 0.4408533
## EducationLevelAttended College/Technical School 0.4689927
## BMICategoryObese 0.6338714
## BMICategoryOverweight 0.5339057
## BMICategoryUnderweight 0.5345669
## CurrentSmoker 0.4751456
## Exercise 0.4646858
## DepressiveDisorder 0.6728968
set.seed(12345)
training <- sample(1:nrow(asthma_narm), 0.6*nrow(asthma_narm))
asthma.training <- asthma_narm[training,]
asthma.test <- seq(1:nrow(asthma_narm))[-training]
asthma.test.results = asthma_narm[-training,1]
asthma.lr.ML <- glm(AsthmaStatus ~., family=binomial(link="logit"), data = asthma.training)
summary(asthma.lr.ML)
##
## Call:
## glm(formula = AsthmaStatus ~ ., family = binomial(link = "logit"),
## data = asthma.training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3438 -0.4753 -0.3738 -0.3044 2.6971
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -1.86045 0.08658
## AgeGroup25 to 34 -0.20212 0.04203
## AgeGroup35 to 44 -0.23964 0.04154
## AgeGroup45 to 54 -0.16586 0.03956
## AgeGroup55 to 64 -0.22879 0.03830
## AgeGroup65 and above -0.30323 0.03738
## Sex -0.51477 0.01695
## RaceWhite 0.27639 0.06854
## RaceBlack 0.34375 0.07292
## RaceMultiracial, non-Hispanic 0.83550 0.08059
## RaceHispanic 0.10889 0.07395
## RaceNative Hawaiian/other Pacific Islander 0.30366 0.14614
## RaceOthers 0.33038 0.13479
## RaceAmerican Indian/Alaskan Native 0.52465 0.08436
## IncomeGroup$15,000 to $19,999 -0.17337 0.03822
## IncomeGroup$20,000 to $24,999 -0.29175 0.03757
## IncomeGroup$25,000 to $34,999 -0.40846 0.03751
## IncomeGroup$35,000 to $49,999 -0.46115 0.03624
## IncomeGroup$50,000 to $74,999 -0.52582 0.03653
## IncomeGroup$75,000 or above -0.59144 0.03504
## IncomeGroupless than $10,000 0.12251 0.03993
## EducationLevelGraduated from College/Technical School -0.12354 0.03381
## EducationLevelGraduated High School -0.22527 0.03218
## EducationLevelAttended College/Technical School -0.12322 0.03245
## BMICategoryObese 0.56637 0.02058
## BMICategoryOverweight 0.11901 0.02158
## BMICategoryUnderweight 0.19162 0.06268
## CurrentSmoker -0.08020 0.02173
## Exercise -0.13717 0.01759
## DepressiveDisorder 0.71921 0.01733
## z value Pr(>|z|)
## (Intercept) -21.489 < 2e-16 ***
## AgeGroup25 to 34 -4.809 1.52e-06 ***
## AgeGroup35 to 44 -5.768 8.00e-09 ***
## AgeGroup45 to 54 -4.193 2.76e-05 ***
## AgeGroup55 to 64 -5.974 2.31e-09 ***
## AgeGroup65 and above -8.113 4.95e-16 ***
## Sex -30.364 < 2e-16 ***
## RaceWhite 4.032 5.52e-05 ***
## RaceBlack 4.714 2.43e-06 ***
## RaceMultiracial, non-Hispanic 10.367 < 2e-16 ***
## RaceHispanic 1.472 0.140895
## RaceNative Hawaiian/other Pacific Islander 2.078 0.037725 *
## RaceOthers 2.451 0.014242 *
## RaceAmerican Indian/Alaskan Native 6.219 4.99e-10 ***
## IncomeGroup$15,000 to $19,999 -4.536 5.72e-06 ***
## IncomeGroup$20,000 to $24,999 -7.766 8.07e-15 ***
## IncomeGroup$25,000 to $34,999 -10.890 < 2e-16 ***
## IncomeGroup$35,000 to $49,999 -12.724 < 2e-16 ***
## IncomeGroup$50,000 to $74,999 -14.394 < 2e-16 ***
## IncomeGroup$75,000 or above -16.879 < 2e-16 ***
## IncomeGroupless than $10,000 3.068 0.002151 **
## EducationLevelGraduated from College/Technical School -3.654 0.000258 ***
## EducationLevelGraduated High School -7.000 2.57e-12 ***
## EducationLevelAttended College/Technical School -3.797 0.000146 ***
## BMICategoryObese 27.524 < 2e-16 ***
## BMICategoryOverweight 5.514 3.51e-08 ***
## BMICategoryUnderweight 3.057 0.002236 **
## CurrentSmoker -3.690 0.000224 ***
## Exercise -7.796 6.38e-15 ***
## DepressiveDisorder 41.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 122980 on 196860 degrees of freedom
## Residual deviance: 116132 on 196831 degrees of freedom
## AIC: 116192
##
## Number of Fisher Scoring iterations: 5
asthma.test.probabilities <- predict(asthma.lr.ML, asthma_narm, type = "response")[asthma.test]
pred <- prediction(asthma.test.probabilities, asthma.test.results)
perf <- performance(pred,"tpr","fpr")
plot(perf, colorize=T)
auc <- performance(pred,"auc")
c(auc@y.name[[1]], auc@y.values[[1]])
## [1] "Area under the ROC curve" "0.66954885547759"
asthma.pred.test <- ifelse(asthma.test.probabilities > 0.16, 1, 0)
conf.mat <- table("Predicted" = asthma.pred.test, "Actual" = asthma.test.results)
colnames(conf.mat) = c("No", "Yes")
rownames(conf.mat) = c("No", "Yes")
conf.mat
## Actual
## Predicted No Yes
## No 106628 8827
## Yes 12214 3572
TruN <- conf.mat[1,1]
TruP <- conf.mat[2,2]
FalN <- conf.mat[1,2]
FalP <- conf.mat[2,1]
TotN <- conf.mat[1,1] + conf.mat[2,1]
TotP <- conf.mat[1,2] + conf.mat[2,2]
Tot <- TotN+TotP
Accuracy.Rate <- (TruN + TruP) / Tot
Error.Rate <- (FalN + FalP) / Tot
Sensitivity <- TruP / TotP
Specificity <- TruN / TotN
FalseP.Rate <- 1 - Specificity # False positive rate
logit.rates.50 <- c(Accuracy.Rate, Error.Rate, Sensitivity, Specificity, FalseP.Rate)
names(logit.rates.50) <- c("Accuracy Rate", "Error Rate", "True Positives", "True Negatives", "False Positives")
options(scipen="4")
print(logit.rates.50, digits=2)
## Accuracy Rate Error Rate True Positives True Negatives
## 0.84 0.16 0.29 0.90
## False Positives
## 0.10