Final Project

OVERVIEW

Asthma self-management helps improve patients’ health. It gives patients and caregivers the skills to understand the disease and its treatment: taking medications appropriately, recognizing early signs and symptoms of asthma episodes, seeking medical care as appropriate, and identifying and avoiding environmental asthma allergens and irritants. In this project, we study the characteristics that influence asthma self-management.

## [1] 13922   899

The data set comes from the CDC (https://www.cdc.gov/brfss/acbs/2016_documentation.html) and is based on a survey. The downloaded file is “2016 ACBS Adult Data SAS [ZIP – 3.10 MB]”. Once unzipped, it contains 13,922 observations and 899 features, from which we selected the variables for our study.

EXPLORATORY DATA ANALYSIS

Meaning of variables used in the dataset

Response Variables

ASTHNOW Have you ever been told by a doctor or other health professional that you have asthma?

TCH_SIGN Has a doctor or other health professional ever taught you… a. How to recognize early signs or symptoms of an asthma episode?

TCH_RESP Has a doctor or other health professional ever taught you… b. What to do during an asthma episode or attack?

TCH_MON A peak flow meter is a hand held device that measures how quickly you can blow air out of your lungs. Has a doctor or other health professional ever taught you… c. How to use a peak flow meter to adjust your daily medications?

MGT_PLAN An asthma action plan, or asthma management plan, is a form with instructions about when to change the amount or type of medicine, when to call the doctor for advice, and when to go to the emergency room. Has a doctor or other health professional EVER given you an asthma action plan?

MOD_ENV (7.13) INTERVIEWER READ: Now, back to questions specifically about you. Has a health professional ever advised you to change things in your home, school, or work to improve your asthma?

MGT_CLAS Have you ever taken a course or class on how to manage your asthma?

INHALERH (8.3) Did a doctor or other health professional show you how to use the inhaler?

INHALERW (8.4) Did a doctor or other health professional watch you use the inhaler?

Response types: (1) YES (2) NO (7) DON’T KNOW (9) REFUSED
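The response codes above lend themselves to a small filtering step. The following is a minimal sketch in Python (standard library only; the report's own output comes from R) that keeps the rows whose answers are all YES or NO. The function name `keep_yes_no` and the toy rows are illustrative assumptions, not the project's actual code.

```python
# Response codes as documented above.
RESPONSE_LABELS = {1: "YES", 2: "NO", 7: "DON'T KNOW", 9: "REFUSED"}
INFORMATIVE = {1, 2}  # codes treated as usable answers (assumption)


def keep_yes_no(rows, response_columns):
    """Keep only rows whose answer to every response question is YES or NO."""
    return [
        row for row in rows
        if all(row[col] in INFORMATIVE for col in response_columns)
    ]


# Toy rows, made up for illustration; the column names follow the survey.
rows = [
    {"TCH_SIGN": 1, "TCH_RESP": 2},
    {"TCH_SIGN": 7, "TCH_RESP": 1},  # DON'T KNOW, so the row is dropped
    {"TCH_SIGN": 2, "TCH_RESP": 9},  # REFUSED, so the row is dropped
]
clean = keep_yes_no(rows, ["TCH_SIGN", "TCH_RESP"])
print(len(clean), "of", len(rows), "rows kept")
```

Whether dropping such rows is appropriate depends on how many answers are lost; treating them as missing values would be an alternative.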

Possible Predictors

MISS_DAY = “NUMBER OF MISSED DAYS”

MOD_ENV = “EVER ADVISED CHANGE THINGS IN YOUR HOME”

AGEDX = “AGE AT ASTHMA DIAGNOSIS”

AGEG_F6_M = “MODIFIED SIX AGE GROUPS USED IN ASTHMA ADULT POST-STRATIFICATION”

AIRCLEANER = “AIR CLEANER USED”

ASMDCOST = “COST BARRIER: PRIMARY CARE DOCTOR”

ASRXCOST = “COST BARRIER: MEDICATION”

ASSPCOST = “COST BARRIER: SPECIALIST”

CATTMPTS_F = “DISPOSITION CODES FOR CALL ATTEMPTS 1 THROUGH 20 …”

EMP_STAT = “CURRENT EMPLOYMENT STATUS”

EPIS_12M = “ASTHMA EPISODE OR ATTACK”

EPIS_TP = “NUMBER OF EPISODES / ATTACKS”

ER_TIMES = “NUMBER OF EMERGENCY ROOM VISITS”

ER_VISIT = “EMERGENCY ROOM VISIT”

EVER_ASTH = “EVER HAVE ASTHMA INCONSISTENT WITH BRFSS”

HOSPPLAN = “HOSPITAL FOLLOW-UP”

HOSPTIME = “NUMBER OF HOSPITAL VISITS”

HOSP_VST = “HOSPITAL VISIT”

QSTLANG_F = “LANGUAGE IDENTIFIER”

SCR_MED3 = “HAVE ALL THE MEDICATIONS”

UNEMP_R = “REASON NOT NOW EMPLOYED”

URG_TIME = “NUMBER OF URGENT VISITS”

WORKENV5 = “ASTHMA AGGRAVATED BY CURRENT JOB”

WORKENV6 = “ASTHMA CAUSED BY CURRENT JOB”

WORKENV7 = “ASTHMA AGGRAVATED BY PREVIOUS JOB”

WORKENV8 = “ASTHMA CAUSED BY PREVIOUS JOB”

WORKQUIT1 = “EVER CHANGE OR QUIT A JOB”

WORKSEN3 = “DOCTOR DIAGNOSED WORK ASTHMA”

WORKSEN4 = “SELF-IDENTIFIED WORK ASTHMA”

WORKTALK = “DOCTOR DISCUSSED WORK ASTHMA”

INS1 = “INSURANCE”

INS2 = “INSURANCE OR COVERAGE GAP”

LASTSYMP = “LAST HAD ANY SYMPTOMS OF ASTHMA”

LAST_MD = “LAST TALKED TO A DOCTOR”

LAST_MED = “LAST TOOK ASTHMA MEDICATION”

COMPASTH = “TYPICAL ATTACK”

Constructing the Data Frame by Selecting variables

We select all the variables that we can use from the dataset and begin cleaning it.

Summary of the data set

Here we convert the selected variables to categorical factors.

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV  SEX     
##  1:8639   1:9936   1:5694   1:3814   1: 1190   1:9382   1:4401   1:4786  
##  2:4972   2:3669   2:8014   2:9759   2:12675   2:2925   2:9407   2:9136  
##  7: 301   7: 299   7: 195   7: 336   7:   53   5: 386   7: 110           
##  9:  10   9:  18   9:  19   9:  13   9:    4   6: 762   9:   4           
##                                                7: 466                    
##                                                9:   1                    
##                                                                          
##  AGEG.F7  X_RACEGR3 EDUCAL   X_INCOMG X_RFBMI5 SMOKE100   COPD      
##  1: 763   1:10806   1:  24   1:1910   1:3640   1:6546   1   : 2686  
##  2:1186   2:  766   2: 350   2:2264   2:9647   2:7312   2   :11027  
##  3:1419   3:  513   3: 722   3:1272   9: 635   7:  59   7   :  149  
##  4:2149   4:  459   4:3307   4:1659            9:   5   9   :   16  
##  5:3476   5: 1209   5:4177   5:5265                     NA's:   44  
##  6:3199   9:  169   6:5321   9:1552                                 
##  7:1730             9:  21                                          
##   EMPHY       DEPRESS      BRONCH         DUR.30D     INCINDT   LAST.MD  
##  1   : 1133   1   :5185   1   : 3670   12     :4735   1:  316   4 :7977  
##  2   :12638   2   :8628   2   :10027   6      :4351   2: 1002   5 :1820  
##  7   :   93   7   :  36   7   :  166   10     :1618   3:12549   6 : 819  
##  9   :   14   9   :  29   9   :   15   1      :1288   7:   44   7 :3025  
##  NA's:   44   NA's:  44   NA's:   44   2      : 899   9:   11   77: 131  
##                                        11     : 749             88: 135  
##                                        (Other): 282             99:  15  
##     LAST.MED      LAST.SYMP    EPIS.12M    COMPASTH    INS1      INS2     
##  1      :4924   1      :3567   1:5210   11     :4361   1:13121   1:  683  
##  7      :2978   7      :2616   2:4237   6      :4351   2:  767   2:12415  
##  3      :1555   3      :2515   6:4351   3      :3205   7:   26   5:  767  
##  4      :1238   4      :1618   7: 117   1      :1169   9:    8   7:   46  
##  5      :1072   2      :1613   9:   7   2      : 781             9:   11  
##  2      :1059   5      :1113            7      :  41                      
##  (Other):1096   (Other): 880            (Other):  14                      
##  ER.VISIT HOSP.VST ASMDCOST     ASRXCOST    ASSPCOST     WORKTALK    
##  1:1347   1: 380   1   :  770   1   :1559   1   :  506   1   : 2592  
##  2:6700   2:6702   2   :10370   2   :9592   2   :10647   2   :10851  
##  5:2737   4: 979   5   : 2736   5   :2736   5   : 2736   6   :  281  
##  6:3105   5:2737   7   :   31   7   :  18   7   :   15   7   :  144  
##  7:  32   6:3105   9   :   10   9   :  12   9   :   13   8   :   31  
##  9:   1   7:  19   NA's:    5   NA's:   5   NA's:    5   9   :   11  
##                                                          NA's:   12

Here we collapse variables with too many classes and merge factor levels with few cases.

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV  SEX     
##  1:8639   1:9936   1:5694   1:3814   1: 1190   1:9382   1:4401   1:4786  
##  2:4972   2:3669   2:8014   2:9759   2:12675   2:2925   2:9407   2:9136  
##  7: 301   7: 299   7: 195   7: 336   7:   53   5: 386   7: 110           
##  9:  10   9:  18   9:  19   9:  13   9:    4   6: 762   9:   4           
##                                                7: 466                    
##                                                9:   1                    
##                                                                          
##  AGEG.F7  X_RACEGR3 EDUCAL   X_INCOMG X_RFBMI5 SMOKE100   COPD      
##  1: 763   1:10806   1:  24   1:1910   1:3640   1:6546   1   : 2686  
##  2:1186   2:  766   2: 350   2:2264   2:9647   2:7312   2   :11027  
##  3:1419   3:  513   3: 722   3:1272   9: 635   7:  59   7   :  149  
##  4:2149   4:  459   4:3307   4:1659            9:   5   9   :   16  
##  5:3476   5: 1209   5:4177   5:5265                     NA's:   44  
##  6:3199   9:  169   6:5342   9:1552                                 
##  7:1730                                                             
##   EMPHY       DEPRESS      BRONCH      DUR.30D   INCINDT   LAST.MD  LAST.MED
##  1   : 1133   1   :5185   1   : 3670   1 :1288   1:  316   4:7977   4:4924  
##  2   :12638   2   :8628   2   :10027   10:1618   2: 1002   5:1820   5:2131  
##  7   :   93   7   :  36   7   :  166   11: 749   3:12549   6: 819   6:2136  
##  9   :   14   9   :  29   9   :   15   12:4735   7:   55   7:3025   7:4216  
##  NA's:   44   NA's:  44   NA's:   44   2 : 899             9: 281   9: 515  
##                                        6 :4351                              
##                                        7 : 282                              
##    LAST.SYMP    EPIS.12M COMPASTH  INS1      INS2      ER.VISIT HOSP.VST
##  1      :3567   1:5210   1 :1169   1:13121   1:  683   1:1347   1: 380  
##  7      :2616   2:4237   11:4361   2:  767   2:12415   2:6700   2:6702  
##  3      :2515   6:4351   2 : 781   7:   26   5:  767   5:2737   4: 979  
##  4      :1618   7: 124   3 :3217   9:    8   7:   46   6:3105   5:2737  
##  2      :1613            6 :4351             9:   11   7:  33   6:3105  
##  5      :1113            7 :  43                                7:  19  
##  (Other): 880                                                           
##  ASMDCOST     ASRXCOST    ASSPCOST     WORKTALK    
##  1   :  770   1   :1559   1   :  506   1   : 2592  
##  2   :10370   2   :9592   2   :10647   2   :10851  
##  5   : 2736   5   :2736   5   : 2736   6   :  281  
##  7   :   31   7   :  18   7   :   15   7   :  144  
##  9   :   10   9   :  12   9   :   13   8   :   31  
##  NA's:    5   NA's:   5   NA's:    5   9   :   11  
##                                        NA's:   12
## [1] 11494    33
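The effect of collapsing can be checked against the two summaries above. The sketch below is in Python (standard library only) and re-tabulates level counts after merging. The mapping is inferred from the printed counts (for example 131 + 135 + 15 = 281 for LAST.MD) and is therefore an assumption about what the original R code did; `collapse_counts` is a hypothetical helper name.

```python
# Level merges inferred from the before/after summaries (assumption).
COLLAPSE = {
    "LAST.MD": {77: 9, 88: 9, 99: 9},
    "INCINDT": {9: 7},
    "EPIS.12M": {9: 7},
}


def collapse_counts(counts, mapping):
    """Re-tabulate level counts after merging levels according to `mapping`."""
    merged = {}
    for level, n in counts.items():
        new_level = mapping.get(level, level)  # unmapped levels stay as they are
        merged[new_level] = merged.get(new_level, 0) + n
    return merged


# Counts copied from the first summary table.
last_md = {4: 7977, 5: 1820, 6: 819, 7: 3025, 77: 131, 88: 135, 99: 15}
incindt = {1: 316, 2: 1002, 3: 12549, 7: 44, 9: 11}

print(collapse_counts(last_md, COLLAPSE["LAST.MD"]))
print(collapse_counts(incindt, COLLAPSE["INCINDT"]))
```

The merged counts agree with the second summary (281 for level 9 of LAST.MD and 55 for level 7 of INCINDT).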

Structure of the data

## 'data.frame':    11494 obs. of  33 variables:
##  $ TCH.SIGN : num  1 2 1 2 2 1 1 2 1 2 ...
##   ..- attr(*, "label")= chr "EVER TAUGHT RECOGNIZE EARLY SIGN OR SYMPTOMS"
##   ..- attr(*, "format.sas")= chr "TCH_SIGN"
##  $ TCH.RESP : num  1 1 1 2 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EVER TAUGHT WHAT TO DO DURING ASTHMA EPISODE OR ATTACK"
##   ..- attr(*, "format.sas")= chr "TCH_RESP"
##  $ TCH.MON  : num  2 2 2 2 2 1 1 2 2 2 ...
##   ..- attr(*, "label")= chr "EVER TAUGHT HOW TO USE A PEAK FLOW"
##   ..- attr(*, "format.sas")= chr "TCH_MON"
##  $ MGT.PLAN : num  2 2 2 2 2 2 1 2 2 2 ...
##   ..- attr(*, "label")= chr "EVER GIVEN AN ASTHMA ACTION PLAN"
##   ..- attr(*, "format.sas")= chr "MGT_PLAN"
##  $ MGT.CLAS : num  2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "EVER TAKEN A COURSE TO MANAGE ASTHMA"
##   ..- attr(*, "format.sas")= chr "MGT_CLAS"
##  $ INHALERW : num  2 2 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "INHALER USE WATCHED"
##   ..- attr(*, "format.sas")= chr "INHALERW"
##  $ MOD.ENV  : num  2 2 2 2 1 2 2 2 1 2 ...
##   ..- attr(*, "label")= chr "EVER ADVISED CHANGE THINGS IN YOUR HOME"
##   ..- attr(*, "format.sas")= chr "MOD_ENV"
##  $ SEX      : num  1 2 2 2 2 2 1 2 2 2 ...
##   ..- attr(*, "label")= chr "RESPONDENTS SEX"
##   ..- attr(*, "format.sas")= chr "SEX"
##  $ AGEG.F7  : num  4 5 5 3 6 5 4 6 6 7 ...
##   ..- attr(*, "label")= chr "AGE COLLAPSED TO 7 GROUPS FOR ASTHMA CALL-BACK"
##   ..- attr(*, "format.sas")= chr "AGEG_F7Z"
##  $ X_RACEGR3: num  3 1 1 5 1 5 1 1 1 1 ...
##   ..- attr(*, "label")= chr "COMPUTED FIVE LEVEL RACE/ETHNICITY CATEGORY."
##   ..- attr(*, "format.sas")= chr "_3RACEGR"
##  $ EDUCAL   : num  6 4 4 5 6 6 6 6 6 5 ...
##   ..- attr(*, "label")= chr "EDUCATION LEVEL"
##   ..- attr(*, "format.sas")= chr "EDUCA"
##  $ X_INCOMG : num  5 1 1 5 5 5 5 5 3 9 ...
##   ..- attr(*, "label")= chr "COMPUTED INCOME CATEGORIES"
##   ..- attr(*, "format.sas")= chr "_INCOMG"
##  $ X_RFBMI5 : num  2 2 2 2 2 2 1 2 2 1 ...
##   ..- attr(*, "label")= chr "OVERWEIGHT OR OBESE CALCULATED VARIABLE"
##   ..- attr(*, "format.sas")= chr "_5RFBMI"
##  $ SMOKE100 : num  2 1 1 2 1 2 1 1 2 2 ...
##   ..- attr(*, "label")= chr "SMOKED AT LEAST 100 CIGARETTES"
##   ..- attr(*, "format.sas")= chr "SMOK100_"
##  $ COPD     : num  2 1 2 2 2 2 2 2 2 1 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAVE CHRONIC OBSTRUCTIVE PULMONARY DISEASE"
##   ..- attr(*, "format.sas")= chr "COPD"
##  $ EMPHY    : num  2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAVE EMPHYSEMA"
##   ..- attr(*, "format.sas")= chr "EMPHY"
##  $ DEPRESS  : num  2 1 2 2 2 2 2 2 1 1 ...
##   ..- attr(*, "label")= chr "EVER TOLD DEPRESSED"
##   ..- attr(*, "format.sas")= chr "DEPRESS"
##  $ BRONCH   : num  2 1 2 2 1 2 2 2 1 2 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAVE CHRONIC BRONCHITIS"
##   ..- attr(*, "format.sas")= chr "BRONCH"
##  $ DUR.30D  : num  10 2 12 6 12 10 12 6 1 6 ...
##   ..- attr(*, "label")= chr "CONSTANT SYMPTOMS"
##   ..- attr(*, "format.sas")= chr "DUR_30D"
##  $ INCINDT  : num  3 2 3 3 3 2 3 3 3 3 ...
##   ..- attr(*, "label")= chr "TIME SINCE DIAGNOSIS"
##   ..- attr(*, "format.sas")= chr "INCIDNT"
##  $ LAST.MD  : num  5 4 4 7 4 4 4 5 4 5 ...
##   ..- attr(*, "label")= chr "LAST TALKED TO A DOCTOR"
##   ..- attr(*, "format.sas")= chr "LAST_MD"
##  $ LAST.MED : num  4 1 3 7 3 1 1 6 1 5 ...
##   ..- attr(*, "label")= chr "LAST TOOK ASTHMA MEDICATION"
##   ..- attr(*, "format.sas")= chr "LAST_MED"
##  $ LAST.SYMP: num  4 1 3 7 3 4 3 5 1 5 ...
##   ..- attr(*, "label")= chr "LAST HAD ANY SYMPTOMS OF ASTHMA"
##   ..- attr(*, "format.sas")= chr "LASTSYMP"
##  $ EPIS.12M : num  1 1 1 6 1 2 1 6 2 6 ...
##   ..- attr(*, "label")= chr "ASTHMA EPISODE OR ATTACK"
##   ..- attr(*, "format.sas")= chr "EPIS_12M"
##  $ COMPASTH : num  1 3 1 6 3 11 3 6 11 6 ...
##   ..- attr(*, "label")= chr "TYPICAL ATTACK"
##   ..- attr(*, "format.sas")= chr "COMPASTH"
##  $ INS1     : num  1 1 1 2 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "INSURANCE"
##   ..- attr(*, "format.sas")= chr "INS1Z"
##  $ INS2     : num  2 2 2 5 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "INSURANCE OR COVERAGE GAP"
##   ..- attr(*, "format.sas")= chr "INS2Z"
##  $ ER.VISIT : num  6 2 2 5 2 2 2 5 2 6 ...
##   ..- attr(*, "label")= chr "EMERGENCY ROOM VISIT"
##   ..- attr(*, "format.sas")= chr "ER_VISIT"
##  $ HOSP.VST : num  6 2 2 5 2 2 2 5 2 6 ...
##   ..- attr(*, "label")= chr "HOSPITAL VISIT"
##   ..- attr(*, "format.sas")= chr "HOSP_VST"
##  $ ASMDCOST : num  2 2 2 5 2 2 2 5 2 2 ...
##   ..- attr(*, "label")= chr "COST BARRIER: PRIMARY CARE DOCTOR"
##   ..- attr(*, "format.sas")= chr "ASMDCOST"
##  $ ASRXCOST : num  2 2 2 5 2 2 2 5 1 2 ...
##   ..- attr(*, "label")= chr "COST BARRIER: MEDICATION"
##   ..- attr(*, "format.sas")= chr "ASRXCOST"
##  $ ASSPCOST : num  2 2 2 5 2 2 2 5 2 2 ...
##   ..- attr(*, "label")= chr "COST BARRIER: SPECIALIST"
##   ..- attr(*, "format.sas")= chr "ASSPCOST"
##  $ WORKTALK : num  2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "DOCTOR DISCUSSED WORK ASTHMA"
##   ..- attr(*, "format.sas")= chr "WORKTALK"

Summary of the Data

##     TCH.SIGN        TCH.RESP        TCH.MON         MGT.PLAN    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :2.000   Median :2.000  
##  Mean   :1.337   Mean   :1.238   Mean   :1.555   Mean   :1.694  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##                                                                 
##     MGT.CLAS        INHALERW       MOD.ENV           SEX           AGEG.F7     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:4.000  
##  Median :2.000   Median :1.00   Median :2.000   Median :2.000   Median :5.000  
##  Mean   :1.906   Mean   :1.24   Mean   :1.662   Mean   :1.678   Mean   :4.594  
##  3rd Qu.:2.000   3rd Qu.:1.00   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:6.000  
##  Max.   :2.000   Max.   :2.00   Max.   :2.000   Max.   :2.000   Max.   :7.000  
##                                                                                
##    X_RACEGR3         EDUCAL        X_INCOMG        X_RFBMI5        SMOKE100    
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:4.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :5.00   Median :4.000   Median :2.000   Median :2.000  
##  Mean   :1.653   Mean   :4.98   Mean   :4.057   Mean   :2.062   Mean   :1.549  
##  3rd Qu.:1.000   3rd Qu.:6.00   3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.00   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##                                                                                
##       COPD          EMPHY          DEPRESS          BRONCH     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.00   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :1.85   Mean   :1.945   Mean   :1.633   Mean   :1.769  
##  3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :9.00   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##  NA's   :30     NA's   :30      NA's   :30      NA's   :30     
##     DUR.30D          INCINDT         LAST.MD          LAST.MED     
##  Min.   : 1.000   Min.   :1.000   Min.   : 4.000   Min.   : 1.000  
##  1st Qu.: 6.000   1st Qu.:3.000   1st Qu.: 4.000   1st Qu.: 1.000  
##  Median :10.000   Median :3.000   Median : 4.000   Median : 3.000  
##  Mean   : 9.201   Mean   :2.892   Mean   : 5.729   Mean   : 3.733  
##  3rd Qu.:12.000   3rd Qu.:3.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :99.000   Max.   :9.000   Max.   :99.000   Max.   :99.000  
##                                                                    
##    LAST.SYMP         EPIS.12M        COMPASTH           INS1      
##  Min.   : 1.000   Min.   :1.000   Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 1.000   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.:1.000  
##  Median : 3.000   Median :2.000   Median : 6.000   Median :1.000  
##  Mean   : 4.742   Mean   :2.703   Mean   : 6.126   Mean   :1.065  
##  3rd Qu.: 5.000   3rd Qu.:6.000   3rd Qu.:11.000   3rd Qu.:1.000  
##  Max.   :99.000   Max.   :9.000   Max.   :11.000   Max.   :9.000  
##                                                                   
##       INS2          ER.VISIT        HOSP.VST        ASMDCOST    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :2.127   Mean   :3.256   Mean   :3.468   Mean   :2.416  
##  3rd Qu.:2.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.000   Max.   :7.000   Max.   :9.000  
##                                                  NA's   :4      
##     ASRXCOST        ASSPCOST        WORKTALK    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.000   Median :2.000  
##  Mean   :2.353   Mean   :2.433   Mean   :1.928  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000  
##  NA's   :4       NA's   :4       NA's   :8

Distribution of the Variables in the Data

Histograms

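Histograms show how each numeric field is distributed in the dataset.

The correlations between predictors

Some predictors are highly correlated, so we remove some of them. The list below shows the 31 variables that remain; ASMDCOST and ASSPCOST, which appear in the earlier summaries, are no longer present.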

##  [1] "TCH.SIGN"  "TCH.RESP"  "TCH.MON"   "MGT.PLAN"  "MGT.CLAS"  "INHALERW" 
##  [7] "MOD.ENV"   "SEX"       "AGEG.F7"   "X_RACEGR3" "EDUCAL"    "X_INCOMG" 
## [13] "X_RFBMI5"  "SMOKE100"  "COPD"      "EMPHY"     "DEPRESS"   "BRONCH"   
## [19] "DUR.30D"   "INCINDT"   "LAST.MD"   "LAST.MED"  "LAST.SYMP" "EPIS.12M" 
## [25] "COMPASTH"  "INS1"      "INS2"      "ER.VISIT"  "HOSP.VST"  "ASRXCOST" 
## [31] "WORKTALK"
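One common way to prune correlated predictors is a greedy filter on pairwise Pearson correlations. The sketch below is in Python (standard library only). The 0.9 threshold, the function names, and the toy values are assumptions for illustration; the report does not state which cutoff or procedure was used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not highly correlated with any kept column."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up toy values; only the column names come from the report.
toy = {
    "ASRXCOST": [1, 2, 2, 5, 2, 1],
    "ASMDCOST": [1, 2, 2, 5, 2, 2],
    "SEX": [1, 2, 2, 1, 2, 1],
}
print(drop_correlated(toy))
```

Because the filter is greedy, the surviving member of a correlated pair depends on column order, so the order should be chosen deliberately.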

CONSTRUCT THE RESPONSE VARIABLE

We first extract the variables related to asthma education.

Selection of variables

##   TCH.SIGN TCH.RESP TCH.MON MGT.PLAN MGT.CLAS INHALERW MOD.ENV
## 1        1        1       2        2        2        2       2
## 2        2        1       2        2        2        2       2
## 3        1        1       2        2        2        1       2
## 4        2        2       2        2        2        1       2
## 5        2        1       2        2        2        1       1
## 6        1        1       1        2        2        1       2

Exploration of the clustering

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV 
##  1:7621   1:8760   1:5118   1:3522   1: 1078   1:8738   1:3889  
##  2:3873   2:2734   2:6376   2:7972   2:10416   2:2756   2:7605

Elbow method to find the number of clusters

We run k-means with the number of clusters ranging from 1 to 16 and plot the results to determine the number of clusters at the elbow.

Elbow method Scree Plot

The elbow suggests 3 clusters. Note that the clustering output below reports 7 centers.
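The elbow procedure can be sketched as follows in Python (standard library only). This is a plain implementation of Lloyd's algorithm with random initial centers, not the R routine used for the report; the function names, the seed, and the toy answers are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centers, total within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    wss = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, wss


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1 .. max_k."""
    return [kmeans(points, k)[1] for k in range(1, max_k + 1)]


# Toy answers coded 1 = YES, 2 = NO (made up for illustration).
toy = [(1, 1, 2), (1, 1, 2), (2, 2, 2), (2, 2, 1), (1, 2, 1), (1, 1, 1)]
for k, wss in enumerate(elbow_curve(toy, 4), start=1):
    print(k, round(wss, 3))
```

The elbow is the value of k after which the within-cluster sum of squares stops dropping sharply. With random starts the curve can be noisy, so several restarts per k are advisable on real data.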

Now we run the clustering and extract the centers of the resulting model.

##   TCH.SIGN TCH.RESP  TCH.MON MGT.PLAN MGT.CLAS INHALERW  MOD.ENV
## 1 1.228339 1.057762 2.000000 1.935018 1.944043 1.251805 1.000000
## 2 1.042545 1.022664 1.229423 1.000000 2.000000 1.073161 1.522465
## 3 1.035419 1.021251 1.129870 1.216057 1.000000 1.066116 1.493506
## 4 1.051266 1.013589 1.000000 2.000000 2.000000 1.140828 1.559605
## 5 1.310152 1.040536 2.000000 1.948680 1.962068 1.289327 2.000000
## 6 1.904126 1.694175 1.000000 1.916262 1.967233 1.348301 1.845874
## 7 1.962474 2.000000 2.000000 1.966173 1.978858 1.498943 1.835624

We add the point classification to the original data

##   TCH.SIGN TCH.RESP TCH.MON MGT.PLAN MGT.CLAS INHALERW MOD.ENV target
## 1        1        1       2        2        2        2       2      5
## 2        2        1       2        2        2        2       2      5
## 3        1        1       2        2        2        1       2      5
## 4        2        2       2        2        2        1       2      7
## 5        2        1       2        2        2        1       1      1
## 6        1        1       1        2        2        1       2      4
##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV  target  
##  1:7621   1:8760   1:5118   1:3522   1: 1078   1:8738   1:3889   1:1108  
##  2:3873   2:2734   2:6376   2:7972   2:10416   2:2756   2:7605   2:2515  
##                                                                  3: 847  
##                                                                  4:1619  
##                                                                  5:2689  
##                                                                  6: 824  
##                                                                  7:1892
View of the clustering result
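The link between the reported centers and the target column can be illustrated by assigning each row to its nearest center. The sketch below is in Python (standard library only); the centers and the first six rows are copied from the output above, and `nearest_cluster` is a hypothetical helper name.

```python
# Cluster centers copied from the output above (rows are clusters 1..7).
# Columns: TCH.SIGN, TCH.RESP, TCH.MON, MGT.PLAN, MGT.CLAS, INHALERW, MOD.ENV
CENTERS = [
    [1.228339, 1.057762, 2.000000, 1.935018, 1.944043, 1.251805, 1.000000],
    [1.042545, 1.022664, 1.229423, 1.000000, 2.000000, 1.073161, 1.522465],
    [1.035419, 1.021251, 1.129870, 1.216057, 1.000000, 1.066116, 1.493506],
    [1.051266, 1.013589, 1.000000, 2.000000, 2.000000, 1.140828, 1.559605],
    [1.310152, 1.040536, 2.000000, 1.948680, 1.962068, 1.289327, 2.000000],
    [1.904126, 1.694175, 1.000000, 1.916262, 1.967233, 1.348301, 1.845874],
    [1.962474, 2.000000, 2.000000, 1.966173, 1.978858, 1.498943, 1.835624],
]


def nearest_cluster(point, centers=CENTERS):
    """Return the 1-based index of the center closest to `point`."""
    distances = [
        sum((x - c) ** 2 for x, c in zip(point, center)) for center in centers
    ]
    return distances.index(min(distances)) + 1


# First six rows of the response variables, as printed earlier.
first_rows = [
    (1, 1, 2, 2, 2, 2, 2),
    (2, 1, 2, 2, 2, 2, 2),
    (1, 1, 2, 2, 2, 1, 2),
    (2, 2, 2, 2, 2, 1, 2),
    (2, 1, 2, 2, 2, 1, 1),
    (1, 1, 1, 2, 2, 1, 2),
]
print([nearest_cluster(row) for row in first_rows])
```

For these six rows the nearest centers are 5, 5, 5, 7, 1 and 4, which matches the target column shown above.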

Interpretation of the Self-Management Response clustering

TCH.SIGN

## # A tibble: 14 x 5
##    TCH.SIGN target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1        855   7621    0.112  
##  2 1        2       2408   7621    0.316  
##  3 1        3        817   7621    0.107  
##  4 1        4       1536   7621    0.202  
##  5 1        5       1855   7621    0.243  
##  6 1        6         79   7621    0.0104 
##  7 1        7         71   7621    0.00932
##  8 2        1        253   3873    0.0653 
##  9 2        2        107   3873    0.0276 
## 10 2        3         30   3873    0.00775
## 11 2        4         83   3873    0.0214 
## 12 2        5        834   3873    0.215  
## 13 2        6        745   3873    0.192  
## 14 2        7       1821   3873    0.470
## # A tibble: 2 x 6
## # Groups:   target [2]
##   TCH.SIGN target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        2       2408   7621      0.316      2408
## 2 2        7       1821   3873      0.470      1821

In the target response, cluster 2 collects the largest share of positive answers (31.6%) and cluster 7 the largest share of negative answers (47.0%); “don’t know” and “refused” answers no longer appear in the cleaned data. The question is:
TCH_SIGN Has a doctor or other health professional ever taught you… a. How to recognize early signs or symptoms of an asthma episode?
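The proportion tables in this section can be reproduced from the raw counts. The sketch below is in Python (standard library only) and uses the TCH.SIGN counts copied from the table above. It computes, for each answer, the total and the cluster with the largest share; `modal_cluster` is a hypothetical helper name.

```python
# (TCH.SIGN answer, target cluster) -> count, copied from the table above.
counts = {
    (1, 1): 855, (1, 2): 2408, (1, 3): 817, (1, 4): 1536,
    (1, 5): 1855, (1, 6): 79, (1, 7): 71,
    (2, 1): 253, (2, 2): 107, (2, 3): 30, (2, 4): 83,
    (2, 5): 834, (2, 6): 745, (2, 7): 1821,
}


def modal_cluster(counts):
    """Return per-answer totals and the (cluster, share) with the largest share."""
    totals = {}
    for (answer, _cluster), n in counts.items():
        totals[answer] = totals.get(answer, 0) + n
    best = {}
    for (answer, cluster), n in counts.items():
        share = n / totals[answer]
        if answer not in best or share > best[answer][1]:
            best[answer] = (cluster, share)
    return totals, best


totals, best = modal_cluster(counts)
for answer, (cluster, share) in sorted(best.items()):
    print(answer, cluster, totals[answer], round(share, 3))
```

Each share is a count divided by the total for that answer (the `etotal` column), so the shares within one answer sum to 1.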

TCH.RESP

## # A tibble: 13 x 5
##    TCH.RESP target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1       1044   8760    0.119  
##  2 1        2       2458   8760    0.281  
##  3 1        3        829   8760    0.0946 
##  4 1        4       1597   8760    0.182  
##  5 1        5       2580   8760    0.295  
##  6 1        6        252   8760    0.0288 
##  7 2        1         64   2734    0.0234 
##  8 2        2         57   2734    0.0208 
##  9 2        3         18   2734    0.00658
## 10 2        4         22   2734    0.00805
## 11 2        5        109   2734    0.0399 
## 12 2        6        572   2734    0.209  
## 13 2        7       1892   2734    0.692
## # A tibble: 2 x 6
## # Groups:   target [2]
##   TCH.RESP target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        5       2580   8760      0.295      2580
## 2 2        7       1892   2734      0.692      1892

In the target response, cluster 5 collects the largest share of positive answers (29.5%) and cluster 7 the largest share of negative answers (69.2%) to the question: TCH_RESP Has a doctor or other health professional ever taught you… b. What to do during an asthma episode or attack?

TCH.MON

## # A tibble: 9 x 5
##   TCH.MON target count etotal proportion
##   <fct>   <fct>  <int>  <int>      <dbl>
## 1 1       2       1938   5118     0.379 
## 2 1       3        737   5118     0.144 
## 3 1       4       1619   5118     0.316 
## 4 1       6        824   5118     0.161 
## 5 2       1       1108   6376     0.174 
## 6 2       2        577   6376     0.0905
## 7 2       3        110   6376     0.0173
## 8 2       5       2689   6376     0.422 
## 9 2       7       1892   6376     0.297
## # A tibble: 2 x 6
## # Groups:   target [2]
##   TCH.MON target count etotal proportion group.max
##   <fct>   <fct>  <int>  <int>      <dbl>     <int>
## 1 1       2       1938   5118      0.379      1938
## 2 2       5       2689   6376      0.422      2689

In the target response, cluster 2 collects the largest share of positive answers (37.9%) and cluster 5 the largest share of negative answers (42.2%) to the question: TCH_MON A peak flow meter is a hand held device that measures how quickly you can blow air out of your lungs. Has a doctor or other health professional ever taught you… c. How to use a peak flow meter to adjust your daily medications?

MGT.PLAN

## # A tibble: 12 x 5
##    MGT.PLAN target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1         72   3522     0.0204
##  2 1        2       2515   3522     0.714 
##  3 1        3        664   3522     0.189 
##  4 1        5        138   3522     0.0392
##  5 1        6         69   3522     0.0196
##  6 1        7         64   3522     0.0182
##  7 2        1       1036   7972     0.130 
##  8 2        3        183   7972     0.0230
##  9 2        4       1619   7972     0.203 
## 10 2        5       2551   7972     0.320 
## 11 2        6        755   7972     0.0947
## 12 2        7       1828   7972     0.229
## # A tibble: 2 x 6
## # Groups:   target [2]
##   MGT.PLAN target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        2       2515   3522      0.714      2515
## 2 2        5       2551   7972      0.320      2551

In the target response, cluster 2 collects the largest share of positive answers (71.4%) and cluster 5 the largest share of negative answers (32.0%) to the question: MGT_PLAN An asthma action plan, or asthma management plan, is a form with instructions about when to change the amount or type of medicine, when to call the doctor for advice, and when to go to the emergency room. Has a doctor or other health professional EVER given you an asthma action plan?

MGT.CLAS

## # A tibble: 11 x 5
##    MGT.CLAS target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1         62   1078     0.0575
##  2 1        3        847   1078     0.786 
##  3 1        5        102   1078     0.0946
##  4 1        6         27   1078     0.0250
##  5 1        7         40   1078     0.0371
##  6 2        1       1046  10416     0.100 
##  7 2        2       2515  10416     0.241 
##  8 2        4       1619  10416     0.155 
##  9 2        5       2587  10416     0.248 
## 10 2        6        797  10416     0.0765
## 11 2        7       1852  10416     0.178
## # A tibble: 2 x 6
## # Groups:   target [2]
##   MGT.CLAS target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        3        847   1078      0.786       847
## 2 2        5       2587  10416      0.248      2587

In the target response, cluster 3 collects the largest share of positive answers (78.6%) and cluster 5 the largest share of negative answers (24.8%) to the question: MGT_CLAS Have you ever taken a course or class on how to manage your asthma?

INHALERW

## # A tibble: 14 x 5
##    INHALERW target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1        829   8738     0.0949
##  2 1        2       2331   8738     0.267 
##  3 1        3        791   8738     0.0905
##  4 1        4       1391   8738     0.159 
##  5 1        5       1911   8738     0.219 
##  6 1        6        537   8738     0.0615
##  7 1        7        948   8738     0.108 
##  8 2        1        279   2756     0.101 
##  9 2        2        184   2756     0.0668
## 10 2        3         56   2756     0.0203
## 11 2        4        228   2756     0.0827
## 12 2        5        778   2756     0.282 
## 13 2        6        287   2756     0.104 
## 14 2        7        944   2756     0.343
## # A tibble: 2 x 6
## # Groups:   target [2]
##   INHALERW target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        2       2331   8738      0.267      2331
## 2 2        7        944   2756      0.343       944

In the target response, cluster 2 collects the largest share of positive answers (26.7%) and cluster 7 the largest share of negative answers (34.3%) to the question: INHALERW (8.4) Did a doctor or other health professional watch you use the inhaler?

MOD.ENV

## # A tibble: 12 x 5
##    MOD.ENV target count etotal proportion
##    <fct>   <fct>  <int>  <int>      <dbl>
##  1 1       1       1108   3889     0.285 
##  2 1       2       1201   3889     0.309 
##  3 1       3        429   3889     0.110 
##  4 1       4        713   3889     0.183 
##  5 1       6        127   3889     0.0327
##  6 1       7        311   3889     0.0800
##  7 2       2       1314   7605     0.173 
##  8 2       3        418   7605     0.0550
##  9 2       4        906   7605     0.119 
## 10 2       5       2689   7605     0.354 
## 11 2       6        697   7605     0.0917
## 12 2       7       1581   7605     0.208
## # A tibble: 2 x 6
## # Groups:   target [2]
##   MOD.ENV target count etotal proportion group.max
##   <fct>   <fct>  <int>  <int>      <dbl>     <int>
## 1 1       2       1201   3889      0.309      1201
## 2 2       5       2689   7605      0.354      2689

Summary of the response variables

## # A tibble: 2 x 8
##   RESPONSE TCH.SIGN TCH.RES TCH.MON MGT.PLAN MGT.CLAS INHALERW MOD.ENV
##   <chr>    <fct>    <fct>   <fct>   <fct>    <fct>    <fct>    <fct>  
## 1 1=YES    2        5       2       2        3        2        2      
## 2 2=NO     7        7       5       5        5        7        5

For the response variable TARGET, a value of 2 indicates excellent management skill, whereas values of 5 and 7 indicate poor management skill.
We can build a logistic regression model on this dataset.

Here we remove the variables used to calculate the target variable and reformat the data frame.

## 'data.frame':    11494 obs. of  25 variables:
##  $ TARGET   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SEX      : num  1 2 2 2 2 2 1 2 2 2 ...
##   ..- attr(*, "label")= chr "RESPONDENTS SEX"
##   ..- attr(*, "format.sas")= chr "SEX"
##  $ AGEG.F7  : num  4 5 5 3 6 5 4 6 6 7 ...
##   ..- attr(*, "label")= chr "AGE COLLAPSED TO 7 GROUPS FOR ASTHMA CALL-BACK"
##   ..- attr(*, "format.sas")= chr "AGEG_F7Z"
##  $ X_RACEGR3: num  3 1 1 5 1 5 1 1 1 1 ...
##   ..- attr(*, "label")= chr "COMPUTED FIVE LEVEL RACE/ETHNICITY CATEGORY."
##   ..- attr(*, "format.sas")= chr "_3RACEGR"
##  $ EDUCAL   : num  6 4 4 5 6 6 6 6 6 5 ...
##   ..- attr(*, "label")= chr "EDUCATION LEVEL"
##   ..- attr(*, "format.sas")= chr "EDUCA"
##  $ X_INCOMG : num  5 1 1 5 5 5 5 5 3 9 ...
##   ..- attr(*, "label")= chr "COMPUTED INCOME CATEGORIES"
##   ..- attr(*, "format.sas")= chr "_INCOMG"
##  $ X_RFBMI5 : num  2 2 2 2 2 2 1 2 2 1 ...
##   ..- attr(*, "label")= chr "OVERWEIGHT OR OBESE CALCULATED VARIABLE"
##   ..- attr(*, "format.sas")= chr "_5RFBMI"
##  $ SMOKE100 : num  2 1 1 2 1 2 1 1 2 2 ...
##   ..- attr(*, "label")= chr "SMOKED AT LEAST 100 CIGARETTES"
##   ..- attr(*, "format.sas")= chr "SMOK100_"
##  $ COPD     : num  2 1 2 2 2 2 2 2 2 1 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAVE CHRONIC OBSTRUCTIVE PULMONARY DISEASE"
##   ..- attr(*, "format.sas")= chr "COPD"
##  $ EMPHY    : num  2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAVE EMPHYSEMA"
##   ..- attr(*, "format.sas")= chr "EMPHY"
##  $ DEPRESS  : num  2 1 2 2 2 2 2 2 1 1 ...
##   ..- attr(*, "label")= chr "EVER TOLD DEPRESSED"
##   ..- attr(*, "format.sas")= chr "DEPRESS"
##  $ BRONCH   : num  2 1 2 2 1 2 2 2 1 2 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAVE CHRONIC BRONCHITIS"
##   ..- attr(*, "format.sas")= chr "BRONCH"
##  $ DUR.30D  : num  10 2 12 6 12 10 12 6 1 6 ...
##   ..- attr(*, "label")= chr "CONSTANT SYMPTOMS"
##   ..- attr(*, "format.sas")= chr "DUR_30D"
##  $ INCINDT  : num  3 2 3 3 3 2 3 3 3 3 ...
##   ..- attr(*, "label")= chr "TIME SINCE DIAGNOSIS"
##   ..- attr(*, "format.sas")= chr "INCIDNT"
##  $ LAST.MD  : num  5 4 4 7 4 4 4 5 4 5 ...
##   ..- attr(*, "label")= chr "LAST TALKED TO A DOCTOR"
##   ..- attr(*, "format.sas")= chr "LAST_MD"
##  $ LAST.MED : num  4 1 3 7 3 1 1 6 1 5 ...
##   ..- attr(*, "label")= chr "LAST TOOK ASTHMA MEDICATION"
##   ..- attr(*, "format.sas")= chr "LAST_MED"
##  $ LAST.SYMP: num  4 1 3 7 3 4 3 5 1 5 ...
##   ..- attr(*, "label")= chr "LAST HAD ANY SYMPTOMS OF ASTHMA"
##   ..- attr(*, "format.sas")= chr "LASTSYMP"
##  $ EPIS.12M : num  1 1 1 6 1 2 1 6 2 6 ...
##   ..- attr(*, "label")= chr "ASTHMA EPISODE OR ATTACK"
##   ..- attr(*, "format.sas")= chr "EPIS_12M"
##  $ COMPASTH : num  1 3 1 6 3 11 3 6 11 6 ...
##   ..- attr(*, "label")= chr "TYPICAL ATTACK"
##   ..- attr(*, "format.sas")= chr "COMPASTH"
##  $ INS1     : num  1 1 1 2 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "INSURANCE"
##   ..- attr(*, "format.sas")= chr "INS1Z"
##  $ INS2     : num  2 2 2 5 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "INSURANCE OR COVERAGE GAP"
##   ..- attr(*, "format.sas")= chr "INS2Z"
##  $ ER.VISIT : num  6 2 2 5 2 2 2 5 2 6 ...
##   ..- attr(*, "label")= chr "EMERGENCY ROOM VISIT"
##   ..- attr(*, "format.sas")= chr "ER_VISIT"
##  $ HOSP.VST : num  6 2 2 5 2 2 2 5 2 6 ...
##   ..- attr(*, "label")= chr "HOSPITAL VISIT"
##   ..- attr(*, "format.sas")= chr "HOSP_VST"
##  $ ASRXCOST : num  2 2 2 5 2 2 2 5 1 2 ...
##   ..- attr(*, "label")= chr "COST BARRIER: MEDICATION"
##   ..- attr(*, "format.sas")= chr "ASRXCOST"
##  $ WORKTALK : num  2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "DOCTOR DISCUSSED WORK ASTHMA"
##   ..- attr(*, "format.sas")= chr "WORKTALK"

PREPARE THE DATA FOR MODELING

We drop the rows with missing values, because there are only 12 of them among the 13,922 rows. We also convert all predictors to categorical variables (factors).
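
The cleaning step described above can be sketched in base R as follows. This is only an illustration on a toy data frame: the object names `asthma` and `asthma_complete` and the values are assumptions, not the report's actual code.

```r
# Toy data frame standing in for the survey data (hypothetical values).
asthma <- data.frame(
  TARGET = c(0, 1, 0, NA, 1),
  SEX    = c(1, 2, 2, 1, NA),
  EDUCAL = c(6, 4, 4, 5, 6)
)

# Keep only the rows without any missing value.
asthma_complete <- na.omit(asthma)

# Convert every column, including the response, to a factor.
asthma_complete[] <- lapply(asthma_complete, factor)

str(asthma_complete)
```

Calling `summary()` on such a data frame then prints level counts per variable, in the same layout as the table below.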

##  TARGET   SEX      AGEG.F7  X_RACEGR3 EDUCAL   X_INCOMG X_RFBMI5 SMOKE100
##  0:8114   1:3686   1: 622   1:8935    1:  14   1:1583   1:2917   1:5401  
##  1:3350   2:7778   2: 996   2: 636    2: 255   2:1872   2:8027   2:6017  
##                    3:1217   3: 420    3: 573   3:1042   9: 520   7:  43  
##                    4:1869   4: 384    4:2702   4:1394            9:   3  
##                    5:2924   5: 964    5:3514   5:4412                    
##                    6:2555   9: 125    6:4392   9:1161                    
##                    7:1281             9:  14                             
##  COPD     EMPHY     DEPRESS  BRONCH      DUR.30D     INCINDT   LAST.MD  
##  1:2311   1:  973   1:4428   1:3198   12     :4181   1:  259   4 :7077  
##  2:9038   2:10425   2:7000   2:8159   6      :3094   2:  864   5 :1563  
##  7: 106   7:   61   7:  18   7: 100   10     :1425   3:10307   6 : 673  
##  9:   9   9:    5   9:  18   9:   7   1      :1149   7:   28   7 :2014  
##                                       2      : 804   9:    6   77:  76  
##                                       11     : 613             88:  55  
##                                       (Other): 198             99:   6  
##     LAST.MED      LAST.SYMP    EPIS.12M    COMPASTH    INS1      INS2     
##  1      :4479   1      :3161   1:4724   11     :3646   1:10833   1:  561  
##  7      :2027   3      :2205   2:3565   6      :3094   2:  611   2:10257  
##  3      :1388   7      :1669   6:3094   3      :2915   7:   14   5:  611  
##  4      :1111   4      :1425   7:  77   1      :1060   9:    6   7:   27  
##  2      : 950   2      :1414   9:   4   2      : 703             9:    8  
##  5      : 931   5      : 924            7      :  33                      
##  (Other): 578   (Other): 666            (Other):  13                      
##  ER.VISIT HOSP.VST ASRXCOST WORKTALK
##  1:1229   1: 345   1:1393   1:2305  
##  2:5896   2:5986   2:8279   2:8841  
##  5:1772   4: 808   5:1772   6: 219  
##  6:2546   5:1772   7:  13   7:  76  
##  7:  20   6:2546   9:   7   8:  15  
##  9:   1   7:   7            9:   8  
## 

Visualization of a combination of variables

Proportion of good skill management by education level

Proportion of good skill management by duration of asthma attack

Splitting the data into train and test sets

We split the data into training and test sets in an 80/20 ratio.
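
A minimal sketch of an 80/20 split in base R is shown here. The data, the seed and the object names are hypothetical, and the report itself may have used a different (for example stratified) splitting function.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Toy data set with a binary response (hypothetical values).
dat <- data.frame(TARGET = factor(rbinom(1000, 1, 0.3)), x = rnorm(1000))

# Draw 80% of the row indices at random for training.
train_idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
training <- dat[train_idx, ]
testing  <- dat[-train_idx, ]

c(train = nrow(training), test = nrow(testing))
```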

BUILD MODELS

Model using all predictors with glm

## 
## Call:  glm(formula = TARGET ~ ., family = binomial, data = training1)
## 
## Coefficients:
## (Intercept)         SEX2     AGEG.F72     AGEG.F73     AGEG.F74     AGEG.F75  
##    0.815274     0.376929     0.046073     0.110514    -0.009079    -0.224369  
##    AGEG.F76     AGEG.F77   X_RACEGR32   X_RACEGR33   X_RACEGR34   X_RACEGR35  
##   -0.355010    -0.545024     0.446352    -0.215877     0.210427     0.251571  
##  X_RACEGR39      EDUCAL2      EDUCAL3      EDUCAL4      EDUCAL5      EDUCAL6  
##    0.307141    -1.394910    -0.877344    -0.670978    -0.538958    -0.543019  
##     EDUCAL9    X_INCOMG2    X_INCOMG3    X_INCOMG4    X_INCOMG5    X_INCOMG9  
##   -0.073281     0.024733     0.122684    -0.120826    -0.006362    -0.092898  
##   X_RFBMI52    X_RFBMI59    SMOKE1002    SMOKE1007    SMOKE1009        COPD2  
##   -0.077778    -0.057911     0.025179     0.463083     1.509425    -0.223620  
##       COPD7        COPD9       EMPHY2       EMPHY7       EMPHY9     DEPRESS2  
##   -0.432349     0.231118    -0.047004    -0.496181    -1.281916     0.131436  
##    DEPRESS7     DEPRESS9      BRONCH2      BRONCH7      BRONCH9    DUR.30D10  
##   -0.436594     0.399594    -0.096194    -0.245964    -1.653375     0.361595  
##   DUR.30D11    DUR.30D12     DUR.30D2     DUR.30D6     DUR.30D7    DUR.30D77  
##   -0.044773    -0.028588    -0.248560    -1.261388    -1.089232    -0.277066  
##    DUR.30D9    DUR.30D99     INCINDT2     INCINDT3     INCINDT7     INCINDT9  
##  -12.485576    -0.143906     0.070017     0.661739    -0.138097     0.811971  
##    LAST.MD5     LAST.MD6     LAST.MD7    LAST.MD77    LAST.MD88    LAST.MD99  
##    0.288263     0.015621    -0.306763    -0.885122     0.129965    -0.405754  
##   LAST.MED2    LAST.MED3    LAST.MED4    LAST.MED5    LAST.MED6    LAST.MED7  
##   -0.175013    -0.370797    -0.531523    -0.449710    -0.425216    -0.267810  
##  LAST.MED77   LAST.MED99   LAST.SYMP2   LAST.SYMP3   LAST.SYMP4   LAST.SYMP5  
##   -0.689832   -14.116510     0.179669     0.237474           NA     1.234484  
##  LAST.SYMP6   LAST.SYMP7  LAST.SYMP77  LAST.SYMP88  LAST.SYMP99    EPIS.12M2  
##    1.249424     1.085248    -0.175410           NA     1.712970    -0.399376  
##   EPIS.12M6    EPIS.12M7    EPIS.12M9   COMPASTH11    COMPASTH2    COMPASTH3  
##          NA    -0.864829     0.500115           NA    -0.302154    -0.228848  
##   COMPASTH4    COMPASTH6    COMPASTH7    COMPASTH9        INS12        INS17  
##   -1.194822           NA    -1.058498    13.639362    -0.079325     0.697682  
##       INS19        INS22        INS25        INS27        INS29    ER.VISIT2  
##    0.439947    -0.041448           NA    -0.884905     1.593803    -0.146146  
##   ER.VISIT5    ER.VISIT6    ER.VISIT7    ER.VISIT9    HOSP.VST2    HOSP.VST4  
##   -0.811271    -1.117798    -0.541108   -12.841299    -0.370558    -0.330771  
##   HOSP.VST5    HOSP.VST6    HOSP.VST7    ASRXCOST2    ASRXCOST5    ASRXCOST7  
##          NA           NA    -0.246388     0.001970           NA     0.099281  
##   ASRXCOST9    WORKTALK2    WORKTALK6    WORKTALK7    WORKTALK8    WORKTALK9  
##    0.022983    -0.645312    -0.875433    -0.736890     0.493321    -1.026428  
## 
## Degrees of Freedom: 9171 Total (i.e. Null);  9067 Residual
## Null Deviance:       11080 
## Residual Deviance: 10290     AIC: 10500
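
Several coefficients above are NA. In glm, this happens when a dummy column is an exact linear combination of other columns, so it cannot be estimated. The summary table suggests a possible cause: for example, level 6 of DUR.30D, EPIS.12M and COMPASTH each has 3094 observations, which may be the same respondents. The sketch below reproduces this situation on simulated data; all names and values are hypothetical.

```r
set.seed(1)  # arbitrary seed
n <- 200

# Two factors whose level "6" marks exactly the same rows (hypothetical skip pattern).
skip <- rbinom(n, 1, 0.3) == 1
a <- factor(ifelse(skip, 6, sample(1:2, n, replace = TRUE)))
b <- factor(ifelse(skip, 6, sample(1:3, n, replace = TRUE)))
y <- rbinom(n, 1, 0.4)

fit <- glm(y ~ a + b, family = binomial)

# Names of the coefficients that glm could not estimate because of the redundancy.
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased
```

Dropping one of the redundant predictors, or merging the shared level, removes the NA without changing the fitted values.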

Confusion Matrix with the test set

First glm model using backward elimination with the step function

## 
## Call:  glm(formula = TARGET ~ SEX + AGEG.F7 + X_RACEGR3 + EDUCAL + COPD + 
##     DEPRESS + INCINDT + LAST.MD + LAST.MED + LAST.SYMP + COMPASTH + 
##     HOSP.VST + WORKTALK, family = binomial, data = training1)
## 
## Coefficients:
## (Intercept)         SEX2     AGEG.F72     AGEG.F73     AGEG.F74     AGEG.F75  
##     0.67503      0.38978      0.03151      0.09901     -0.02593     -0.24730  
##    AGEG.F76     AGEG.F77   X_RACEGR32   X_RACEGR33   X_RACEGR34   X_RACEGR35  
##    -0.37922     -0.57735      0.44897     -0.20453      0.21221      0.26320  
##  X_RACEGR39      EDUCAL2      EDUCAL3      EDUCAL4      EDUCAL5      EDUCAL6  
##     0.28926     -1.48356     -0.96518     -0.76571     -0.64278     -0.64865  
##     EDUCAL9        COPD2        COPD7        COPD9     DEPRESS2     DEPRESS7  
##    -0.15141     -0.26254     -0.48888     -0.65500      0.12697     -0.48893  
##    DEPRESS9     INCINDT2     INCINDT3     INCINDT7     INCINDT9     LAST.MD5  
##     0.10701      0.05560      0.65730     -0.17102      0.81770      0.23577  
##    LAST.MD6     LAST.MD7    LAST.MD77    LAST.MD88    LAST.MD99    LAST.MED2  
##    -0.05437     -0.35912     -0.94164      0.07989     -0.44613     -0.17772  
##   LAST.MED3    LAST.MED4    LAST.MED5    LAST.MED6    LAST.MED7   LAST.MED77  
##    -0.36298     -0.53144     -0.44380     -0.42466     -0.26556     -0.74123  
##  LAST.MED99   LAST.SYMP2   LAST.SYMP3   LAST.SYMP4   LAST.SYMP5   LAST.SYMP6  
##   -11.31099      0.19962      0.25733      0.43099      0.01780      0.04559  
##  LAST.SYMP7  LAST.SYMP77  LAST.SYMP88  LAST.SYMP99   COMPASTH11    COMPASTH2  
##    -0.13301     -0.17353     -1.18106      1.84112     -0.43808     -0.27741  
##   COMPASTH3    COMPASTH4    COMPASTH6    COMPASTH7    COMPASTH9    HOSP.VST2  
##    -0.22703     -1.21507           NA     -1.04322     12.62674     -0.45987  
##   HOSP.VST4    HOSP.VST5    HOSP.VST6    HOSP.VST7    WORKTALK2    WORKTALK6  
##    -0.43338     -0.71792     -1.02477     -0.25267     -0.65732     -0.90508  
##   WORKTALK7    WORKTALK8    WORKTALK9  
##    -0.75831      0.25672     -1.02508  
## 
## Degrees of Freedom: 9171 Total (i.e. Null);  9104 Residual
## Null Deviance:       11080 
## Residual Deviance: 10330     AIC: 10470

Call: glm(formula = TARGET ~ SEX + AGEG.F7 + X_RACEGR3 + EDUCAL + BRONCH + DUR.30D + INCINDT + LAST.MD + LAST.MED + LAST.SYMP + COMPASTH + HOSPTIME + ASRXCOST + WORKTALK, family = binomial, data = training1)

Confusion Matrix with the test set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1541  584
##          1   82   85
##                                          
##                Accuracy : 0.7094         
##                  95% CI : (0.6904, 0.728)
##     No Information Rate : 0.7081         
##     P-Value [Acc > NIR] : 0.4555         
##                                          
##                   Kappa : 0.0982         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.12706        
##             Specificity : 0.94948        
##          Pos Pred Value : 0.50898        
##          Neg Pred Value : 0.72518        
##              Prevalence : 0.29188        
##          Detection Rate : 0.03709        
##    Detection Prevalence : 0.07286        
##       Balanced Accuracy : 0.53827        
##                                          
##        'Positive' Class : 1              
## 
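
To make the statistics above easier to read, the following sketch recomputes some of them by hand from the four counts of the confusion matrix, with class 1 as the positive class. The object names are illustrative only.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; filled column by column).
cm <- matrix(c(1541, 82, 584, 85), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

TP <- cm["1", "1"]  # predicted 1, truly 1
FN <- cm["0", "1"]  # predicted 0, truly 1
FP <- cm["1", "0"]  # predicted 1, truly 0
TN <- cm["0", "0"]  # predicted 0, truly 0

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / sum(cm)
balanced    <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, balanced = balanced), 4)
```

The low sensitivity shows that the model identifies only about 13% of the respondents with good management skill, even though its overall accuracy is close to 71%.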

Second glm model

## 
## Call:  glm(formula = TARGET ~ SEX + AGEG.F7 + X_RACEGR3 + EDUCAL + X_INCOMG + 
##     BRONCH + DUR.30D + INCINDT + LAST.MD + LAST.MED + LAST.SYMP + 
##     COMPASTH + WORKTALK, family = binomial, data = training1)
## 
## Coefficients:
## (Intercept)         SEX2     AGEG.F72     AGEG.F73     AGEG.F74     AGEG.F75  
##    0.080651     0.361343     0.021414     0.084859    -0.037847    -0.240055  
##    AGEG.F76     AGEG.F77   X_RACEGR32   X_RACEGR33   X_RACEGR34   X_RACEGR35  
##   -0.357383    -0.514889     0.447556    -0.220765     0.210588     0.255396  
##  X_RACEGR39      EDUCAL2      EDUCAL3      EDUCAL4      EDUCAL5      EDUCAL6  
##    0.314758    -1.186373    -0.649726    -0.457224    -0.331784    -0.336729  
##     EDUCAL9    X_INCOMG2    X_INCOMG3    X_INCOMG4    X_INCOMG5    X_INCOMG9  
##    0.082770     0.029843     0.125526    -0.132588    -0.007516    -0.085219  
##     BRONCH2      BRONCH7      BRONCH9    DUR.30D10    DUR.30D11    DUR.30D12  
##   -0.146313    -0.267072    -1.345563     0.298953    -0.104421    -0.091564  
##    DUR.30D2     DUR.30D6     DUR.30D7    DUR.30D77     DUR.30D9    DUR.30D99  
##   -0.301763    -1.263734    -1.130751    -0.339126   -11.569494    -0.065062  
##    INCINDT2     INCINDT3     INCINDT7     INCINDT9     LAST.MD5     LAST.MD6  
##    0.036932     0.622928    -0.184242     0.791634    -0.320354    -0.561904  
##    LAST.MD7    LAST.MD77    LAST.MD88    LAST.MD99    LAST.MED2    LAST.MED3  
##   -0.837523    -1.019408    -0.498965    -0.396666    -0.217678    -0.422067  
##   LAST.MED4    LAST.MED5    LAST.MED6    LAST.MED7   LAST.MED77   LAST.MED99  
##   -0.590182    -0.481048    -0.449286    -0.261555    -0.766662   -11.440581  
##  LAST.SYMP2   LAST.SYMP3   LAST.SYMP4   LAST.SYMP5   LAST.SYMP6   LAST.SYMP7  
##    0.175750     0.233560           NA     1.214932     1.265219     1.136694  
## LAST.SYMP77  LAST.SYMP88  LAST.SYMP99   COMPASTH11    COMPASTH2    COMPASTH3  
##   -0.133388           NA     1.899011    -0.440892    -0.259834    -0.218830  
##   COMPASTH4    COMPASTH6    COMPASTH7    COMPASTH9    WORKTALK2    WORKTALK6  
##   -1.195201           NA    -0.989726    12.669093    -0.642399    -0.866024  
##   WORKTALK7    WORKTALK8    WORKTALK9  
##   -0.747603     0.252681    -1.013315  
## 
## Degrees of Freedom: 9171 Total (i.e. Null);  9100 Residual
## Null Deviance:       11080 
## Residual Deviance: 10350     AIC: 10490

Confusion Matrix with the test set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1545  581
##          1   78   88
##                                           
##                Accuracy : 0.7125          
##                  95% CI : (0.6935, 0.7309)
##     No Information Rate : 0.7081          
##     P-Value [Acc > NIR] : 0.3322          
##                                           
##                   Kappa : 0.1072          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.13154         
##             Specificity : 0.95194         
##          Pos Pred Value : 0.53012         
##          Neg Pred Value : 0.72672         
##              Prevalence : 0.29188         
##          Detection Rate : 0.03839         
##    Detection Prevalence : 0.07243         
##       Balanced Accuracy : 0.54174         
##                                           
##        'Positive' Class : 1               
## 

Lasso and Ridge model

Since our dataset has many variables, we can use penalized logistic regression to look for a better-performing model. Ridge regression and lasso regression take two different approaches. Ridge regression keeps all variables in the model and shrinks the coefficients of variables with a minor contribution toward zero. Lasso regression keeps only the most important variables and sets the coefficients of the remaining variables exactly to zero.
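
Both penalties can be written as a single objective. In the parameterization used by the glmnet package, with $\ell$ the binomial log-likelihood, $n$ the number of observations, $\lambda \ge 0$ the penalty strength and $\alpha \in [0, 1]$ a mixing parameter, the fitted coefficients solve

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\,\ell(\beta_0, \beta) \;+\; \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \;+\; \alpha\,\lVert \beta \rVert_1 \right]
$$

Setting $\alpha = 0$ gives ridge regression (squared penalty only), and $\alpha = 1$ gives the lasso (absolute-value penalty only).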

Split the data into training and test sets, and dummy-code the categorical predictors
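
Penalized regression functions such as glmnet expect a numeric matrix, so each factor has to be expanded into indicator columns. A minimal sketch with `model.matrix()` from base R follows; the toy data and object names are assumptions.

```r
# Toy data with two categorical predictors (hypothetical values).
dat <- data.frame(
  TARGET = factor(c(0, 1, 0, 1, 0, 1)),
  SEX    = factor(c(1, 2, 2, 1, 2, 1)),
  EDUCAL = factor(c(4, 5, 6, 4, 5, 6))
)

# model.matrix() expands each factor into 0/1 indicator columns,
# using the first level of each factor as the reference category.
x <- model.matrix(TARGET ~ ., data = dat)
y <- dat$TARGET

colnames(x)
```

The matrix keeps an "(Intercept)" column. If it is passed on unchanged, the model adds its own intercept as well, which would explain a second "(Intercept)" row with no estimate in a coefficient listing; `x[, -1]` drops the column.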

Ridge Regression

We fit the ridge regression and plot its coefficients against the log of lambda.

Variation of Ridge Model Coefficient by Log Lambda

The coefficients are clearly different from zero for negative values of log lambda and start to stabilize around log lambda = -4.

Lambda that Minimises MSE

The plot shows that the log of the optimal value of lambda (i.e. the one that minimises the cross-validated error) is approximately -3. The exact value, stored in the variable lambda_min, is printed below. In general, though, the objective of regularisation is to balance accuracy and simplicity, which here means finding a simpler model that still gives good accuracy. To this end, the cv.glmnet function also reports lambda.1se, the largest value of lambda (i.e. the simplest model) whose error lies within one standard error of the minimum.

## [1] 0.05934789
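
The cross-validation step can be sketched as follows with the glmnet package. The simulated data, the seed and the object names are assumptions, so this is an illustration rather than the report's actual code.

```r
library(glmnet)

set.seed(42)  # arbitrary seed
n <- 300
x <- matrix(rnorm(n * 10), n, 10)            # hypothetical numeric predictors
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))   # hypothetical binary response

# alpha = 0 gives ridge regression; alpha = 1 would give the lasso.
cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)

lambda_min <- cv_ridge$lambda.min   # lambda with the lowest cross-validated error
lambda_1se <- cv_ridge$lambda.1se   # largest lambda within one SE of that minimum

# Class predictions at each choice of lambda (0.5 probability threshold).
pred_min <- predict(cv_ridge, newx = x, s = "lambda.min", type = "class")
pred_1se <- predict(cv_ridge, newx = x, s = "lambda.1se", type = "class")

table(Prediction = pred_min, Reference = y)
```

In practice the predictions would be made on the held-out test matrix rather than on `x` itself.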

Confusion matrix with lambda min

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1620  668
##          1    3    1
##                                           
##                Accuracy : 0.7072          
##                  95% CI : (0.6881, 0.7258)
##     No Information Rate : 0.7081          
##     P-Value [Acc > NIR] : 0.547           
##                                           
##                   Kappa : -5e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0014948       
##             Specificity : 0.9981516       
##          Pos Pred Value : 0.2500000       
##          Neg Pred Value : 0.7080420       
##              Prevalence : 0.2918848       
##          Detection Rate : 0.0004363       
##    Detection Prevalence : 0.0017452       
##       Balanced Accuracy : 0.4998232       
##                                           
##        'Positive' Class : 1               
## 

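This ridge model is not overfitting. Rather, it predicts class 0 for almost every observation (sensitivity of about 0.0015), so its accuracy is no better than the no-information rate.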

Confusion matrix with best lambda

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1623  669
##          1    0    0
##                                          
##                Accuracy : 0.7081         
##                  95% CI : (0.689, 0.7267)
##     No Information Rate : 0.7081         
##     P-Value [Acc > NIR] : 0.5104         
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.0000         
##             Specificity : 1.0000         
##          Pos Pred Value :    NaN         
##          Neg Pred Value : 0.7081         
##              Prevalence : 0.2919         
##          Detection Rate : 0.0000         
##    Detection Prevalence : 0.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 1              
## 

The second ridge model predicts class 0 for every observation (sensitivity of 0), so it is equivalent to the no-information classifier.

Getting the coefficients

## 115 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -0.311165635
## (Intercept)  .          
## SEX2         0.268905875
## AGEG.F72     0.121885314
## AGEG.F73     0.191736371
## AGEG.F74     0.123772029
## AGEG.F75    -0.032188058
## AGEG.F76    -0.130739675
## AGEG.F77    -0.276264679
## X_RACEGR32   0.345370100
## X_RACEGR33  -0.168141719
## X_RACEGR34   0.163555652
## X_RACEGR35   0.188481891
## X_RACEGR39   0.195951087
## EDUCAL2     -0.563246481
## EDUCAL3     -0.182972819
## EDUCAL4     -0.042867391
## EDUCAL5      0.062680154
## EDUCAL6      0.055564892
## EDUCAL9      0.371413396
## X_INCOMG2    0.025243211
## X_INCOMG3    0.096292186
## X_INCOMG4   -0.092595652
## X_INCOMG5    0.001257866
## X_INCOMG9   -0.077444585
## X_RFBMI52   -0.061302201
## X_RFBMI59   -0.016489589
## SMOKE1002    0.028885927
## SMOKE1007    0.291630495
## SMOKE1009    1.088366129
## COPD2       -0.160502579
## COPD7       -0.317161991
## COPD9       -0.008265946
## EMPHY2      -0.044676385
## EMPHY7      -0.357739837
## EMPHY9      -0.571614550
## DEPRESS2     0.077386564
## DEPRESS7    -0.282525745
## DEPRESS9     0.177231137
## BRONCH2     -0.091278530
## BRONCH7     -0.185188286
## BRONCH9     -0.892552189
## DUR.30D10    0.103059663
## DUR.30D11   -0.007410897
## DUR.30D12    0.018440319
## DUR.30D2    -0.189528989
## DUR.30D6    -0.043051137
## DUR.30D7    -0.714063545
## DUR.30D77   -0.199413085
## DUR.30D9    -1.369560353
## DUR.30D99   -0.149140530
## INCINDT2    -0.177554947
## INCINDT3     0.333653991
## INCINDT7    -0.305724039
## INCINDT9     0.327707463
## LAST.MD5     0.046352354
## LAST.MD6    -0.140558615
## LAST.MD7    -0.324589440
## LAST.MD77   -0.622666977
## LAST.MD88   -0.092433891
## LAST.MD99   -0.416059561
## LAST.MED2   -0.069855890
## LAST.MED3   -0.190342125
## LAST.MED4   -0.282619266
## LAST.MED5   -0.213971081
## LAST.MED6   -0.204577487
## LAST.MED7   -0.142499572
## LAST.MED77  -0.462091222
## LAST.MED99  -1.519208340
## LAST.SYMP2   0.074695782
## LAST.SYMP3   0.104344631
## LAST.SYMP4   0.103427204
## LAST.SYMP5   0.047617877
## LAST.SYMP6   0.047414364
## LAST.SYMP7  -0.092154999
## LAST.SYMP77 -0.239520085
## LAST.SYMP88 -0.731665053
## LAST.SYMP99  1.029382576
## EPIS.12M2   -0.131132983
## EPIS.12M6   -0.045041216
## EPIS.12M7   -0.455741521
## EPIS.12M9    0.241892279
## COMPASTH11  -0.143376396
## COMPASTH2   -0.072476113
## COMPASTH3   -0.060151767
## COMPASTH4   -0.769612572
## COMPASTH6   -0.044309830
## COMPASTH7   -0.628683252
## COMPASTH9    2.088878498
## INS12       -0.041714894
## INS17        0.165315864
## INS19        0.226422818
## INS22       -0.036741753
## INS25       -0.041725122
## INS27       -0.425696420
## INS29        0.827497153
## ER.VISIT2   -0.067959603
## ER.VISIT5   -0.077432802
## ER.VISIT6   -0.233869539
## ER.VISIT7   -0.376234693
## ER.VISIT9   -1.554564161
## HOSP.VST2   -0.022813098
## HOSP.VST4    0.066196704
## HOSP.VST5   -0.076585501
## HOSP.VST6   -0.233271485
## HOSP.VST7    0.070767181
## ASRXCOST2   -0.006830214
## ASRXCOST5   -0.076131222
## ASRXCOST7    0.043116310
## ASRXCOST9   -0.029149814
## WORKTALK2   -0.476959994
## WORKTALK6   -0.543345939
## WORKTALK7   -0.464366092
## WORKTALK8    0.321695786
## WORKTALK9   -0.669199573

Lasso Regression

Find the best lambda using cross validation

Lambda that minimises MSE in Lasso

The plot shows that the log of the optimal value of lambda (i.e. the one that minimises the cross-validated error) is approximately -10. As for the ridge model, we evaluate the lasso both at this value (lambda min) and at the largest lambda whose error lies within one standard error of the minimum, which gives a sparser model.

Confusion Matrix with lambda min

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1555  593
##          1   68   76
##                                           
##                Accuracy : 0.7116          
##                  95% CI : (0.6926, 0.7301)
##     No Information Rate : 0.7081          
##     P-Value [Acc > NIR] : 0.3663          
##                                           
##                   Kappa : 0.0932          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.11360         
##             Specificity : 0.95810         
##          Pos Pred Value : 0.52778         
##          Neg Pred Value : 0.72393         
##              Prevalence : 0.29188         
##          Detection Rate : 0.03316         
##    Detection Prevalence : 0.06283         
##       Balanced Accuracy : 0.53585         
##                                           
##        'Positive' Class : 1               
## 

Getting the coefficients

## 115 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -0.2579854899
## (Intercept)  .           
## SEX2         0.3542166859
## AGEG.F72     0.0632343529
## AGEG.F73     0.1399556825
## AGEG.F74     0.0390792790
## AGEG.F75    -0.1367723728
## AGEG.F76    -0.2586333383
## AGEG.F77    -0.4402801733
## X_RACEGR32   0.4020122135
## X_RACEGR33  -0.1657005319
## X_RACEGR34   0.1596831570
## X_RACEGR35   0.2233125366
## X_RACEGR39   0.1951977048
## EDUCAL2     -0.7606948688
## EDUCAL3     -0.2686823287
## EDUCAL4     -0.1016518549
## EDUCAL5      .           
## EDUCAL6      .           
## EDUCAL9      0.1981910680
## X_INCOMG2    0.0039926583
## X_INCOMG3    0.0911052087
## X_INCOMG4   -0.0979138979
## X_INCOMG5    .           
## X_INCOMG9   -0.0712901224
## X_RFBMI52   -0.0532072941
## X_RFBMI59    .           
## SMOKE1002    0.0068398773
## SMOKE1007    0.2528453161
## SMOKE1009    0.9267670939
## COPD2       -0.2077955878
## COPD7       -0.3397625834
## COPD9        .           
## EMPHY2      -0.0115236452
## EMPHY7      -0.3225277669
## EMPHY9      -0.3000741589
## DEPRESS2     0.1033939415
## DEPRESS7    -0.1313032009
## DEPRESS9     .           
## BRONCH2     -0.0823806414
## BRONCH7     -0.1481377298
## BRONCH9     -0.7407051201
## DUR.30D10    0.1671942061
## DUR.30D11    .           
## DUR.30D12    .           
## DUR.30D2    -0.2071573169
## DUR.30D6     .           
## DUR.30D7    -0.7771113181
## DUR.30D77   -0.1641557803
## DUR.30D9    -0.9766940404
## DUR.30D99    .           
## INCINDT2     .           
## INCINDT3     0.5674152990
## INCINDT7     .           
## INCINDT9     0.3871056134
## LAST.MD5     .           
## LAST.MD6    -0.2343082060
## LAST.MD7    -0.5630790101
## LAST.MD77   -0.7754395287
## LAST.MD88   -0.0395389804
## LAST.MD99   -0.1722368343
## LAST.MED2   -0.0967159868
## LAST.MED3   -0.2750431383
## LAST.MED4   -0.4149006454
## LAST.MED5   -0.3401234455
## LAST.MED6   -0.3126209215
## LAST.MED7   -0.1919073171
## LAST.MED77  -0.5040969504
## LAST.MED99  -1.6110364164
## LAST.SYMP2   0.1017268810
## LAST.SYMP3   0.1563643114
## LAST.SYMP4   0.1166262675
## LAST.SYMP5   .           
## LAST.SYMP6   .           
## LAST.SYMP7  -0.1257986113
## LAST.SYMP77 -0.1792473910
## LAST.SYMP88 -0.9576660890
## LAST.SYMP99  1.1412666330
## EPIS.12M2    .           
## EPIS.12M6    .           
## EPIS.12M7   -0.3437286481
## EPIS.12M9    .           
## COMPASTH11  -0.2837501453
## COMPASTH2   -0.1067554496
## COMPASTH3   -0.0820766622
## COMPASTH4   -0.6426440521
## COMPASTH6    .           
## COMPASTH7   -0.6743718403
## COMPASTH9    1.7028612685
## INS12       -0.0007142587
## INS17        .           
## INS19        .           
## INS22        .           
## INS25        .           
## INS27       -0.2845170762
## INS29        1.0038782839
## ER.VISIT2   -0.1189195122
## ER.VISIT5   -0.0030299001
## ER.VISIT6   -0.4258432089
## ER.VISIT7   -0.2794223153
## ER.VISIT9   -0.9050853282
## HOSP.VST2   -0.0716798701
## HOSP.VST4    .           
## HOSP.VST5   -0.1627026609
## HOSP.VST6   -0.0835855569
## HOSP.VST7    .           
## ASRXCOST2    .           
## ASRXCOST5   -0.0233760057
## ASRXCOST7    .           
## ASRXCOST9    .           
## WORKTALK2   -0.6171783972
## WORKTALK6   -0.7728890333
## WORKTALK7   -0.6132713063
## WORKTALK8    0.0712529778
## WORKTALK9   -0.6265422376

Confusion Matrix with best lambda

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1614  658
##          1    9   11
##                                           
##                Accuracy : 0.709           
##                  95% CI : (0.6899, 0.7275)
##     No Information Rate : 0.7081          
##     P-Value [Acc > NIR] : 0.4738          
##                                           
##                   Kappa : 0.0152          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.016442        
##             Specificity : 0.994455        
##          Pos Pred Value : 0.550000        
##          Neg Pred Value : 0.710387        
##              Prevalence : 0.291885        
##          Detection Rate : 0.004799        
##    Detection Prevalence : 0.008726        
##       Balanced Accuracy : 0.505449        
##                                           
##        'Positive' Class : 1               
## 

Calculating the AICc of Ridge and Lasso Models

    fit <- glmnet(x, y, family = "multinomial")
    tLL <- fit$nulldev - deviance(fit)
    k <- fit$df
    n <- fit$nobs
    AICc <- -tLL + 2 * k + 2 * k * (k + 1) / (n - k - 1)
    AICc

## [1] -495.6112
## [1] -593.2907

Partial Least Squares

Confusion Matrix of the partial least squares model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    T
##          F 1556  593
##          T   67   76
##                                          
##                Accuracy : 0.712          
##                  95% CI : (0.693, 0.7305)
##     No Information Rate : 0.7081         
##     P-Value [Acc > NIR] : 0.3491         
##                                          
##                   Kappa : 0.0941         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.11360        
##             Specificity : 0.95872        
##          Pos Pred Value : 0.53147        
##          Neg Pred Value : 0.72406        
##              Prevalence : 0.29188        
##          Detection Rate : 0.03316        
##    Detection Prevalence : 0.06239        
##       Balanced Accuracy : 0.53616        
##                                          
##        'Positive' Class : T              
## 

Here we train the partial least squares model again, tuning the number of components (ncomp) with 10-fold cross-validation repeated 3 times.

## Partial Least Squares 
## 
## 9172 samples
##   24 predictor
##    2 classes: 'F', 'T' 
## 
## Pre-processing: centered (113), scaled (113) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 8254, 8255, 8255, 8255, 8255, 8255, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  ROC        Sens       Spec      
##    1     0.6219778  1.0000000  0.00000000
##    2     0.6429671  0.9875208  0.03605809
##    3     0.6518623  0.9685715  0.08355250
##    4     0.6540849  0.9612278  0.10630722
##    5     0.6553881  0.9588149  0.11177801
##    6     0.6560877  0.9580958  0.11588202
##    7     0.6561267  0.9576333  0.11625238
##    8     0.6559958  0.9562980  0.11799460
##    9     0.6560079  0.9573250  0.11899147
##   10     0.6557761  0.9572736  0.12011088
##   11     0.6558837  0.9568113  0.11849350
##   12     0.6559341  0.9569138  0.11787207
##   13     0.6561282  0.9567598  0.11724926
##   14     0.6561443  0.9564517  0.11637907
##   15     0.6561146  0.9567085  0.11724972
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 14.
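
The selection rule reported in the output (largest ROC) can be reproduced from the table itself. The sketch below is base R only and assumes nothing beyond the ROC column printed above; the variable names are hypothetical.

```r
# ROC values copied from the resampling table, for ncomp = 1..15
roc <- c(0.6219778, 0.6429671, 0.6518623, 0.6540849, 0.6553881,
         0.6560877, 0.6561267, 0.6559958, 0.6560079, 0.6557761,
         0.6558837, 0.6559341, 0.6561282, 0.6561443, 0.6561146)

best <- which.max(roc)        # position of the largest ROC, which is also ncomp
best                          # 14

gain_over_6 <- roc[best] - roc[6]  # improvement over the 6-component model
gain_over_6
```

The gain beyond six components is in the fifth decimal place, so a smaller model would perform almost identically on this criterion.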

Confusion Matrix with the best number of components (ncomp = 14)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    T
##          F 1548  598
##          T   75   71
##                                          
##                Accuracy : 0.7064         
##                  95% CI : (0.6873, 0.725)
##     No Information Rate : 0.7081         
##     P-Value [Acc > NIR] : 0.5831         
##                                          
##                   Kappa : 0.0778         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.10613        
##             Specificity : 0.95379        
##          Pos Pred Value : 0.48630        
##          Neg Pred Value : 0.72134        
##              Prevalence : 0.29188        
##          Detection Rate : 0.03098        
##    Detection Prevalence : 0.06370        
##       Balanced Accuracy : 0.52996        
##                                          
##        'Positive' Class : T              
## 

SELECT MODELS

We compare the models using the accuracy, precision, sensitivity, specificity, and F1 score computed from each confusion matrix.

##             glm.mod11 glm.mod12  ridge.mod1 ridge.mod2 lasso.mod1 lasso.mod2
## Accuracy    0.7094241 0.7124782 0.707242583  0.7081152  0.7116056 0.70898778
## Precision   0.5089820 0.5301205 0.250000000         NA  0.5277778 0.55000000
## Sensitivity 0.1270553 0.1315396 0.001494768  0.0000000  0.1136024 0.01644245
## Specificity 0.9494763 0.9519409 0.998151571  1.0000000  0.9581023 0.99445471
## F1          0.2033493 0.2107784 0.002971768         NA  0.1869619 0.03193033
##              pls.mod1  pls.mod2
## Accuracy    0.7120419 0.7063700
## Precision   0.5314685 0.4863014
## Sensitivity 0.1136024 0.1061286
## Specificity 0.9587184 0.9537893
## F1          0.1871921 0.1742331
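
Each row of the table follows directly from the four cells of a confusion matrix. As a minimal base R sketch, the counts below are taken from the first PLS confusion matrix reported earlier (positive class "T"); the variable names are illustrative.

```r
# Counts from the first PLS confusion matrix (positive class is "T")
tn <- 1556  # predicted F, reference F
fn <- 593   # predicted F, reference T
fp <- 67    # predicted T, reference F
tp <- 76    # predicted T, reference T

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)   # positive predictive value
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(Accuracy = accuracy, Precision = precision,
        Sensitivity = sensitivity, Specificity = specificity, F1 = f1), 7)
```

These values reproduce the pls.mod1 column. When a model never predicts the positive class, tp + fp is zero and precision (and therefore F1) is undefined, which is why ridge.mod2 shows NA in those rows.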

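With specificity equal to 1, sensitivity equal to 0, and precision undefined (NA), the ridge.mod2 model assigns every observation to the negative class, so it is not a useful classifier. Among the remaining models, glm.mod12 has the highest accuracy, sensitivity, and F1 score, and lasso.mod1 is close behind with slightly higher specificity. All accuracies are within about one percentage point of each other, so these metrics alone do not separate the models clearly.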

Using the pROC package.

We can plot the ROC curve and extract the AUC value.
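
To make clear what the AUC value measures, the sketch below computes it in base R with the rank-based (Mann-Whitney) formula. This is not the pROC API; the function name `auc_rank` and the toy labels and scores are hypothetical and are not taken from the survey data.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # average ranks handle tied scores
  n1 <- sum(labels == 1)        # number of positives
  n0 <- sum(labels == 0)        # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data for illustration only (not from the survey)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
toy_auc <- auc_rank(labels, scores)
toy_auc  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, so it compares models without fixing a classification threshold.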

Best Model with AUC

The Lasso model has the best Area Under the Curve.

We run the lasso.mod1 model on the entire dataset.

The best model is the Lasso model.

The statistics of the best model are given below.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7826 2982
##          1  288  368
##                                          
##                Accuracy : 0.7148         
##                  95% CI : (0.7064, 0.723)
##     No Information Rate : 0.7078         
##     P-Value [Acc > NIR] : 0.05102        
##                                          
##                   Kappa : 0.0973         
##                                          
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##             Sensitivity : 0.10985        
##             Specificity : 0.96451        
##          Pos Pred Value : 0.56098        
##          Neg Pred Value : 0.72409        
##              Prevalence : 0.29222        
##          Detection Rate : 0.03210        
##    Detection Prevalence : 0.05722        
##       Balanced Accuracy : 0.53718        
##                                          
##        'Positive' Class : 1              
## 

AUC of the best model

Coefficients of the best model

A dot in place of a coefficient means that the lasso penalty shrank that coefficient to exactly zero, so the model ignores that level of the variable.

## 115 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -0.389162006
## (Intercept)  .          
## SEX2         0.315935944
## AGEG.F72     0.126953447
## AGEG.F73     0.129609754
## AGEG.F74     0.119746169
## AGEG.F75    -0.086149259
## AGEG.F76    -0.178043508
## AGEG.F77    -0.428324418
## X_RACEGR32   0.435568317
## X_RACEGR33  -0.126520492
## X_RACEGR34   0.186805154
## X_RACEGR35   0.239934397
## X_RACEGR39   0.266893963
## EDUCAL2     -0.729064961
## EDUCAL3     -0.304725168
## EDUCAL4     -0.082105342
## EDUCAL5      .          
## EDUCAL6      .          
## EDUCAL9      .          
## X_INCOMG2    0.018558932
## X_INCOMG3    0.071815314
## X_INCOMG4   -0.079792188
## X_INCOMG5    .          
## X_INCOMG9   -0.089406986
## X_RFBMI52   -0.026835994
## X_RFBMI59    .          
## SMOKE1002    0.022988651
## SMOKE1007    0.464330974
## SMOKE1009    .          
## COPD2       -0.209017985
## COPD7       -0.364113566
## COPD9        .          
## EMPHY2       .          
## EMPHY7      -0.097567863
## EMPHY9       .          
## DEPRESS2     0.090219026
## DEPRESS7    -0.145431809
## DEPRESS9     .          
## BRONCH2     -0.051581723
## BRONCH7     -0.220352957
## BRONCH9     -0.106815319
## DUR.30D10    0.144815062
## DUR.30D11   -0.020720913
## DUR.30D12    .          
## DUR.30D2    -0.190426366
## DUR.30D6    -0.058860873
## DUR.30D7    -0.447253445
## DUR.30D77   -0.079374515
## DUR.30D9    -0.797903917
## DUR.30D99    .          
## INCINDT2     .          
## INCINDT3     0.595361610
## INCINDT7    -0.074308959
## INCINDT9     0.375926037
## LAST.MD5     .          
## LAST.MD6    -0.186099044
## LAST.MD7    -0.530617909
## LAST.MD77   -0.723394466
## LAST.MD88   -0.004102567
## LAST.MD99   -0.231797646
## LAST.MED2   -0.158707860
## LAST.MED3   -0.321637348
## LAST.MED4   -0.393498345
## LAST.MED5   -0.310953274
## LAST.MED6   -0.218075294
## LAST.MED7   -0.187054889
## LAST.MED77  -0.379774902
## LAST.MED99  -1.204354649
## LAST.SYMP2   0.068448925
## LAST.SYMP3   0.185526131
## LAST.SYMP4   0.112891990
## LAST.SYMP5   .          
## LAST.SYMP6   0.028925594
## LAST.SYMP7  -0.082313265
## LAST.SYMP77 -0.049581438
## LAST.SYMP88 -0.350383539
## LAST.SYMP99  0.758631013
## EPIS.12M2    .          
## EPIS.12M6    .          
## EPIS.12M7   -0.307380341
## EPIS.12M9    0.622861308
## COMPASTH11  -0.322144142
## COMPASTH2   -0.154330752
## COMPASTH3   -0.105111809
## COMPASTH4   -0.869953108
## COMPASTH6    .          
## COMPASTH7   -0.729187813
## COMPASTH9    1.464891346
## INS12       -0.008647654
## INS17        0.038057258
## INS19        .          
## INS22        .          
## INS25        .          
## INS27       -0.092594429
## INS29        .          
## ER.VISIT2   -0.100099174
## ER.VISIT5   -0.002332758
## ER.VISIT6   -0.405670789
## ER.VISIT7   -0.037196869
## ER.VISIT9   -0.761033033
## HOSP.VST2   -0.032543838
## HOSP.VST4    .          
## HOSP.VST5   -0.161117241
## HOSP.VST6   -0.057415525
## HOSP.VST7    .          
## ASRXCOST2    .          
## ASRXCOST5   -0.016515391
## ASRXCOST7   -0.047960016
## ASRXCOST9    .          
## WORKTALK2   -0.625983237
## WORKTALK6   -0.592483711
## WORKTALK7   -0.623644542
## WORKTALK8    .          
## WORKTALK9   -0.140851542

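We look at the odds ratio of each variable

Each odds ratio is the exponential of the corresponding lasso coefficient. A value greater than 1 means higher odds of the outcome compared with the baseline level, a value less than 1 means lower odds, and a value of exactly 1 corresponds to a coefficient that the lasso set to zero. For example, for the SEX variable, women (SEX2) have about 1.37 times the odds of having good asthma self-management skills compared with men (SEX1, the baseline). Other variables can be interpreted the same way.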

## 115 x 1 Matrix of class "dgeMatrix"
##                    s0
## (Intercept) 0.6776245
## (Intercept) 1.0000000
## SEX2        1.3715424
## AGEG.F72    1.1353642
## AGEG.F73    1.1383840
## AGEG.F74    1.1272107
## AGEG.F75    0.9174573
## AGEG.F76    0.8369060
## AGEG.F77    0.6516000
## X_RACEGR32  1.5458413
## X_RACEGR33  0.8811561
## X_RACEGR34  1.2053924
## X_RACEGR35  1.2711658
## X_RACEGR39  1.3059020
## EDUCAL2     0.4823598
## EDUCAL3     0.7373260
## EDUCAL4     0.9211749
## EDUCAL5     1.0000000
## EDUCAL6     1.0000000
## EDUCAL9     1.0000000
## X_INCOMG2   1.0187322
## X_INCOMG3   1.0744569
## X_INCOMG4   0.9233082
## X_INCOMG5   1.0000000
## X_INCOMG9   0.9144733
## X_RFBMI52   0.9735209
## X_RFBMI59   1.0000000
## SMOKE1002   1.0232549
## SMOKE1007   1.5909494
## SMOKE1009   1.0000000
## COPD2       0.8113806
## COPD7       0.6948123
## COPD9       1.0000000
## EMPHY2      1.0000000
## EMPHY7      0.9070408
## EMPHY9      1.0000000
## DEPRESS2    1.0944140
## DEPRESS7    0.8646488
## DEPRESS9    1.0000000
## BRONCH2     0.9497260
## BRONCH7     0.8022356
## BRONCH9     0.8986916
## DUR.30D10   1.1558258
## DUR.30D11   0.9794923
## DUR.30D12   1.0000000
## DUR.30D2    0.8266066
## DUR.30D6    0.9428379
## DUR.30D7    0.6393818
## DUR.30D77   0.9236939
## DUR.30D9    0.4502718
## DUR.30D99   1.0000000
## INCINDT2    1.0000000
## INCINDT3    1.8136867
## INCINDT7    0.9283848
## INCINDT9    1.4563394
## LAST.MD5    1.0000000
## LAST.MD6    0.8301914
## LAST.MD7    0.5882414
## LAST.MD77   0.4851028
## LAST.MD88   0.9959058
## LAST.MD99   0.7931066
## LAST.MED2   0.8532456
## LAST.MED3   0.7249611
## LAST.MED4   0.6746924
## LAST.MED5   0.7327481
## LAST.MED6   0.8040649
## LAST.MED7   0.8293982
## LAST.MED77  0.6840154
## LAST.MED99  0.2998855
## LAST.SYMP2  1.0708459
## LAST.SYMP3  1.2038517
## LAST.SYMP4  1.1195110
## LAST.SYMP5  1.0000000
## LAST.SYMP6  1.0293480
## LAST.SYMP7  0.9209834
## LAST.SYMP77 0.9516277
## LAST.SYMP88 0.7044179
## LAST.SYMP99 2.1353510
## EPIS.12M2   1.0000000
## EPIS.12M6   1.0000000
## EPIS.12M7   0.7353709
## EPIS.12M9   1.8642546
## COMPASTH11  0.7245937
## COMPASTH2   0.8569885
## COMPASTH3   0.9002239
## COMPASTH4   0.4189712
## COMPASTH6   1.0000000
## COMPASTH7   0.4823005
## COMPASTH9   4.3270731
## INS12       0.9913896
## INS17       1.0387907
## INS19       1.0000000
## INS22       1.0000000
## INS25       1.0000000
## INS27       0.9115631
## INS29       1.0000000
## ER.VISIT2   0.9047477
## ER.VISIT5   0.9976700
## ER.VISIT6   0.6665296
## ER.VISIT7   0.9634864
## ER.VISIT9   0.4671836
## HOSP.VST2   0.9679800
## HOSP.VST4   1.0000000
## HOSP.VST5   0.8511923
## HOSP.VST6   0.9442016
## HOSP.VST7   1.0000000
## ASRXCOST2   1.0000000
## ASRXCOST5   0.9836202
## ASRXCOST7   0.9531719
## ASRXCOST9   1.0000000
## WORKTALK2   0.5347354
## WORKTALK6   0.5529522
## WORKTALK7   0.5359874
## WORKTALK8   1.0000000
## WORKTALK9   0.8686183
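
The odds ratios are the exponentials of the lasso coefficients. As a short base R sketch, three coefficients copied from the coefficient table are converted below; the vector name `coefs` is illustrative.

```r
# Lasso coefficients (log-odds scale) copied from the coefficient table;
# EDUCAL5 was shrunk to zero, shown as a dot in the sparse matrix
coefs <- c(SEX2 = 0.315935944, EDUCAL2 = -0.729064961, EDUCAL5 = 0)

# Odds ratio = exp(coefficient); a zeroed coefficient gives exactly 1
odds_ratios <- exp(coefs)
round(odds_ratios, 7)
```

The results match the SEX2, EDUCAL2, and EDUCAL5 rows of the odds ratio table.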
