DATA 621 Final Project

OVERVIEW

Self-management of asthma helps improve patient health. Asthma self-management gives patients and caregivers the skills to understand the disease and its treatment. It teaches them to take medications appropriately, recognize early signs and symptoms of asthma episodes, seek medical care as appropriate, and identify and avoid environmental asthma allergens and irritants. In this project, we study the characteristics that influence asthma self-management.

## [1] 13572  1057

The data set comes from the CDC (URL: https://www.cdc.gov/brfss/acbs/2016_documentation.html). It is a survey study. The downloaded file is “2016 ACBS Adult Data SAS [ZIP – 3.10 MB]”. The unzipped file has 899 variables and 13,922 cases. We selected the variables to use in our study.

EXPLORATORY DATA ANALYSIS

Meaning of variables used in the dataset

Response Variables

ASTHNOW Have you ever been told by a doctor or other health professional that you have asthma?

TCH_SIGN Has a doctor or other health professional ever taught you… a. How to recognize early signs or symptoms of an asthma episode?

TCH_RESP Has a doctor or other health professional ever taught you… b. What to do during an asthma episode or attack?

TCH_MON A peak flow meter is a hand held device that measures how quickly you can blow air out of your lungs. Has a doctor or other health professional ever taught you… c. How to use a peak flow meter to adjust your daily medications?

MGT_PLAN An asthma action plan, or asthma management plan, is a form with instructions about when to change the amount or type of medicine, when to call the doctor for advice, and when to go to the emergency room. Has a doctor or other health professional EVER given you an asthma action plan?

MOD_ENV (7.13) INTERVIEWER READ: Now, back to questions specifically about you. Has a health professional ever advised you to change things in your home, school, or work to improve your asthma?

MGT_CLAS Have you ever taken a course or class on how to manage your asthma?

INHALERH (8.3) Did a doctor or other health professional show you how to use the inhaler?

INHALERW (8.4) Did a doctor or other health professional watch you use the inhaler?

Response types: (1) YES, (2) NO, (7) DON’T KNOW, (9) REFUSED
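The response codes listed above can be attached to a variable as factor labels. The following is a minimal sketch in base R, assuming a small made-up vector of codes; the name `tch_sign_raw` and the label strings are illustrative and are not taken from the project code.

```r
# Hypothetical raw codes for one response variable
# (1 = yes, 2 = no, 7 = don't know, 9 = refused).
tch_sign_raw <- c(1, 2, 1, 7, 9, 2, 1)

# Map the numeric codes to labelled factor levels.
tch_sign <- factor(
  tch_sign_raw,
  levels = c(1, 2, 7, 9),
  labels = c("YES", "NO", "DON'T KNOW", "REFUSED")
)

# Count how often each answer occurs.
tch_sign_counts <- table(tch_sign)
```

Keeping the labels on the factor makes later summaries easier to read than the bare numeric codes.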

Possible Predictors

MISS_DAY = “NUMBER OF MISSED DAYS”

MOD_ENV = “EVER ADVISED CHANGE THINGS IN YOUR HOME”

AGEDX = “AGE AT ASTHMA DIAGNOSIS”

AGEG_F6_M = “MODIFIED SIX AGE GROUPS USED IN ASTHMA ADULT POST-STRATIFICATION”

AIRCLEANER = “AIR CLEANER USED”

ASMDCOST = “COST BARRIER: PRIMARY CARE DOCTOR”

ASRXCOST = “COST BARRIER: MEDICATION”

ASSPCOST = “COST BARRIER: SPECIALIST”

CATTMPTS_F = “DISPOSITION CODES FOR CALL ATTEMPTS 1 THROUGH 20 …”

EMP_STAT = “CURRENT EMPLOYMENT STATUS”

EPIS_12M = “ASTHMA EPISODE OR ATTACK”

EPIS_TP = “NUMBER OF EPISODES / ATTACKS”

ER_TIMES = “NUMBER OF EMERGENCY ROOM VISITS”

ER_VISIT = “EMERGENCY ROOM VISIT”

EVER_ASTH = “EVER HAVE ASTHMA INCONSISTENT WITH BRFSS”

HOSPPLAN = “HOSPITAL FOLLOW-UP”

HOSPTIME = “NUMBER OF HOSPITAL VISITS”

HOSP_VST = “HOSPITAL VISIT”

QSTLANG_F = “LANGUAGE IDENTIFIER”

SCR_MED3 = “HAVE ALL THE MEDICATIONS”

UNEMP_R = “REASON NOT NOW EMPLOYED”

URG_TIME = “NUMBER OF URGENT VISITS”

WORKENV5 = “ASTHMA AGGRAVATED BY CURRENT JOB”

WORKENV6 = “ASTHMA CAUSED BY CURRENT JOB”

WORKENV7 = “ASTHMA AGGRAVATED BY PREVIOUS JOB”

WORKENV8 = “ASTHMA CAUSED BY PREVIOUS JOB”

WORKQUIT1 = “EVER CHANGE OR QUIT A JOB”

WORKSEN3 = “DOCTOR DIAGNOSED WORK ASTHMA”

WORKSEN4 = “SELF-IDENTIFIED WORK ASTHMA”

WORKTALK = “DOCTOR DISCUSSED WORK ASTHMA”

INS1 = “INSURANCE”

INS2 = “INSURANCE OR COVERAGE GAP”

LASTSYMP = “LAST HAD ANY SYMPTOMS OF ASTHMA”

LAST_MD = “LAST TALKED TO A DOCTOR”

LAST_MED = “LAST TOOK ASTHMA MEDICATION”

COMPASTH = “TYPICAL ATTACK”

ACT_DAYS30 = “ACTIVITY LIMITATION”

Constructing the Data Frame by Selecting variables

We select all the variables that we can use from the dataset. We also start to clean the dataset.

Summary of the data set after categorizing the variables

Here we categorize the variables.

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV  SEX     
##  1:8390   1:9755   1:5615   1:3657   1: 1179   1:9168   1:4218   1:4896  
##  2:4877   2:3525   2:7800   2:9604   2:12357   2:2778   2:9265   2:8676  
##  7: 294   7: 284   7: 150   7: 299   7:   32   5: 427   7:  81           
##  9:  11   9:   8   9:   7   9:  12   9:    4   6: 747   9:   8           
##                                                7: 451                    
##                                                9:   1                    
##                                                                          
##  AGEG.F7  X_IMPRACE EDUCAL   X_INCOMG X_RFBMI5 SMOKE100   COPD      
##  1: 795   1:10741   1:  17   1:1778   1:3424   1:6302   1   : 2665  
##  2:1222   2:  797   2: 245   2:2108   2:9509   2:7231   2   :10738  
##  3:1293   3:  176   3: 666   3:1231   9: 639   7:  37   7   :  117  
##  4:2077   4:  290   4:3148   4:1604            9:   2   9   :   10  
##  5:3299   5:  997   5:4183   5:5213                     NA's:   42  
##  6:3148   6:  571   6:5297   9:1638                                 
##  7:1738             9:  16                                          
##   EMPHY       DEPRESS      BRONCH        DUR.30D     INCINDT   LAST.MD  
##  1   : 1090   1   :5194   1   :3483   12     :4774   1:  287   4 :7760  
##  2   :12344   2   :8269   2   :9902   6      :4118   2: 1001   5 :1880  
##  7   :   85   7   :  41   7   : 134   10     :1548   3:12253   6 : 801  
##  9   :   11   9   :  26   9   :  11   1      :1237   7:   29   7 :2852  
##  NA's:   42   NA's:  42   NA's:  42   2      : 909   9:    2   77: 140  
##                                       11     : 736             88: 133  
##                                       (Other): 250             99:   6  
##     LAST.MED      LAST.SYMP       COMPASTH    INS2      ER.VISIT HOSP.VST
##  1      :4789   1      :3513   11     :4252   1:  628   1:1327   1: 404  
##  7      :2822   7      :2456   6      :4118   2:12137   2:6517   2:6524  
##  3      :1480   3      :2440   3      :3196   5:  743   5:2609   4: 920  
##  4      :1178   2      :1665   1      :1139   7:   51   6:3097   5:2609  
##  2      :1174   4      :1548   2      : 795   9:   13   7:  22   6:3097  
##  5      :1018   5      :1014   7      :  51                      7:  16  
##  (Other):1111   (Other): 936   (Other):  21                      9:   2  
##  ASRXCOST    WORKTALK     ACT.DAY30
##  1   :1509   1   : 2409   1:5667   
##  2   :9421   2   :10681   2:3009   
##  5   :2607   6   :  264   3:1431   
##  7   :  16   7   :  176   4: 801   
##  9   :  12   8   :   20   5:2609   
##  NA's:   7   9   :    9   7:  45   
##              NA's:   13   9:  10

Here we collapse variables with too many classes and factor levels with few cases.
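The summaries in this report suggest that the rare codes 7 and 9 were merged into a single level 3 for the response variables (for TCH.SIGN, 294 + 11 = 305). One way to do this in base R is sketched below on a made-up factor; the object names are hypothetical, and the report's actual recoding code is not shown.

```r
# Hypothetical factor with the original codes:
# 1 = yes, 2 = no, 7 = don't know, 9 = refused.
resp <- factor(c("1", "2", "7", "1", "9", "2", "1", "7"))

# Merge the two rare levels "7" and "9" into a single level "3".
# Assigning the same name to several levels collapses them.
collapsed <- resp
levels(collapsed)[levels(collapsed) %in% c("7", "9")] <- "3"

# Counts per level after collapsing.
collapsed_counts <- table(collapsed)
```

Collapsing sparse levels this way avoids factor levels with only a handful of cases.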

Eliminating NA’s

# Recode the remaining NA values as "7" (the "don't know" code) in each of these factors.
asthma.mgt.adult2 <- asthma.mgt.adult2 %>%
  mutate(
    COPD     = replace(COPD, is.na(COPD), "7"),
    EMPHY    = replace(EMPHY, is.na(EMPHY), "7"),
    DEPRESS  = replace(DEPRESS, is.na(DEPRESS), "7"),
    BRONCH   = replace(BRONCH, is.na(BRONCH), "7"),
    ASRXCOST = replace(ASRXCOST, is.na(ASRXCOST), "7"),
    WORKTALK = replace(WORKTALK, is.na(WORKTALK), "7")
  )

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV  SEX     
##  1:8390   1:9755   1:5615   1:3657   1: 1179   1:9168   1:4218   1:4896  
##  2:4877   2:3525   2:7800   2:9604   2:12357   2:2778   2:9265   2:8676  
##  3: 305   3: 292   3: 157   3: 311   3:   36   4:1174   3:  89           
##                                                3: 452                    
##                                                                          
##                                                                          
##                                                                          
##  AGEG.F7  X_IMPRACE EDUCAL   X_INCOMG X_RFBMI5 SMOKE100 COPD      EMPHY    
##  1: 795   1:10741   1:  17   1:1778   1:3424   1:6302   1: 2665   1: 1090  
##  2:1222   2:  797   2: 245   2:2108   2:9509   2:7231   2:10738   2:12344  
##  3:1293   3:  176   3: 666   3:1231   9: 639   7:  39   7:  169   7:  127  
##  4:2077   4:  290   4:3148   4:1604                               9:   11  
##  5:3299   5:  997   5:4183   5:5213                                        
##  6:3148   6:  571   6:5313   9:1638                                        
##  7:1738                                                                    
##  DEPRESS  BRONCH   DUR.30D   INCINDT   LAST.MD  LAST.MED   LAST.SYMP   
##  1:5194   1:3483   1 :1237   1:  287   4:7760   4:4789   1      :3513  
##  2:8269   2:9902   10:1548   2: 1001   5:1880   5:2192   7      :2456  
##  7:  83   7: 176   11: 736   3:12253   6: 801   6:2031   3      :2440  
##  9:  26   9:  11   12:4774   7:   31   7:2852   7:4000   2      :1665  
##                    2 : 909             9: 279   9: 560   4      :1548  
##                    6 :4118                               5      :1014  
##                    7 : 250                               (Other): 936  
##  COMPASTH  INS2      ER.VISIT HOSP.VST ASRXCOST WORKTALK  ACT.DAY30
##  1 :1139   1:  628   1:1327   1: 404   1:1509   1: 2409   1:5667   
##  11:4252   2:12137   2:6517   2:6524   2:9421   2:10681   2:3009   
##  2 : 795   5:  743   5:2609   4: 920   5:2607   6:  264   3:1431   
##  3 :3210   7:   51   6:3097   5:2609   7:  23   7:  189   4: 801   
##  6 :4118   9:   13   7:  22   6:3097   9:  12   8:   20   5:2609   
##  7 :  58                      7:  16            9:    9   7:  55   
##                               9:   2

## [1] 13572    30

Structure of the data

Summary of the data where variables are numeric


Distribution of the Variables in the Data

Proportions

Histograms show how the numeric fields are distributed in the dataset.

The correlations between predictors

Some predictors are highly correlated, so we remove some of them. This leaves 33 variables in our data set, with 7 response variables.
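The report does not show how the correlated predictors were identified. One common approach is sketched below in base R on toy numeric data: compute the absolute pairwise correlations and drop any column that is highly correlated with an earlier one. The 0.9 cut-off, the toy columns `x1` to `x3`, and the object names are assumptions for illustration.

```r
set.seed(621)

# Toy numeric predictors; x2 is almost a copy of x1, x3 is independent.
n  <- 200
x1 <- rnorm(n)
predictors <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),
  x3 = rnorm(n)
)

# Absolute pairwise correlations, keeping only the upper triangle so that
# each pair is inspected once.
cor_abs <- abs(cor(predictors))
cor_abs[lower.tri(cor_abs, diag = TRUE)] <- 0

# Drop every column that is highly correlated with an earlier column.
# The 0.9 cut-off is an assumption for illustration.
cutoff  <- 0.9
to_drop <- names(which(apply(cor_abs, 2, max) > cutoff))
reduced <- predictors[, !(names(predictors) %in% to_drop), drop = FALSE]
```

For categorical survey codes, a correlation computed on the numeric codes is only a rough screen, so the dropped set should be reviewed by hand.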

CONSTRUCT THE RESPONSE VARIABLE

We first extract the variables related to education.

Selection of variables

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV 
##  1:8390   1:9755   1:5615   1:3657   1: 1179   1:9168   1:4218  
##  2:4877   2:3525   2:7800   2:9604   2:12357   2:2778   2:9265  
##  3: 305   3: 292   3: 157   3: 311   3:   36   4:1174   3:  89  
##                                                3: 452

Exploration of the clustering

Elbow method to find the number of clusters

We run k-means with 1 to 16 clusters and produce a scree plot to determine the number of clusters at the elbow.

Elbow method Scree Plot

The number of clusters is 4.
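The elbow search described above can be sketched as follows. This is a minimal example on synthetic data with four well-separated groups, not the survey data; the object names, the seed, and the `nstart` and `iter.max` settings are assumptions.

```r
set.seed(621)

# Toy data: four well-separated groups in two dimensions.
centers_true <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
toy <- do.call(rbind, lapply(seq_len(nrow(centers_true)), function(i) {
  cbind(rnorm(50, mean = centers_true[i, 1]),
        rnorm(50, mean = centers_true[i, 2]))
}))

# Total within-cluster sum of squares for k = 1, ..., 16.
k_values <- 1:16
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Scree plot: the elbow is where adding clusters stops paying off.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The within-cluster sum of squares always shrinks as k grows, so the choice rests on where the curve flattens rather than on its minimum.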

Now we do the clustering and extract the centers of the resulting model.

##   TCH.SIGN TCH.RESP  TCH.MON MGT.PLAN MGT.CLAS INHALERW  MOD.ENV
## 1 1.664207 1.597786 1.821033 1.935424 1.967405 3.722017 1.849323
## 2 1.169630 1.043248 2.010332 1.825324 1.938491 1.151129 1.669630
## 3 1.099168 1.033932 1.000000 1.460814 1.827715 1.110552 1.565674
## 4 2.009950 1.871269 1.800373 1.984142 1.985386 1.511194 1.836754

We add the cluster assignment of each point to the original data.

##   TCH.SIGN TCH.RESP TCH.MON MGT.PLAN MGT.CLAS INHALERW MOD.ENV target
## 1        1        1       2        2        2        1       1      2
## 2        1        1       1        1        1        1       2      3
## 3        1        1       1        1        2        1       2      3
## 4        2        1       2        2        2        2       2      4
## 5        2        1       2        2        2        1       2      2
## 6        2        1       2        2        2        1       2      2

##  TCH.SIGN TCH.RESP TCH.MON  MGT.PLAN MGT.CLAS  INHALERW MOD.ENV  target  
##  1:8390   1:9755   1:5615   1:3657   1: 1179   1:9168   1:4218   1:1626  
##  2:4877   2:3525   2:7800   2:9604   2:12357   2:2778   2:9265   2:4162  
##  3: 305   3: 292   3: 157   3: 311   3:   36   4:1174   3:  89   3:4568  
##                                                3: 452            4:3216
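The clustering step and the added `target` column shown above can be sketched like this. The column names follow the report, but the values are random stand-ins and the k-means settings (`nstart`, `iter.max`, the seed) are assumptions, so the centers will not match the output above.

```r
set.seed(621)

# Toy stand-in for the seven response variables (codes 1 to 3).
resp_vars <- c("TCH.SIGN", "TCH.RESP", "TCH.MON", "MGT.PLAN",
               "MGT.CLAS", "INHALERW", "MOD.ENV")
toy_resp <- as.data.frame(
  matrix(sample(1:3, 300 * length(resp_vars), replace = TRUE),
         ncol = length(resp_vars), dimnames = list(NULL, resp_vars))
)

# k-means with four clusters.
fit <- kmeans(toy_resp, centers = 4, nstart = 10, iter.max = 50)

# One row per cluster, one column per response variable.
cluster_centers <- fit$centers

# Attach the cluster label to every row as a factor named `target`.
toy_resp$target <- factor(fit$cluster)
```

Cluster numbers are arbitrary labels, so fixing the seed matters if later code refers to a cluster by number.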

View of the clustering result

Interpretation of the Self-Management Response clustering

TCH.SIGN

## # A tibble: 11 x 5
##    TCH.SIGN target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1        618   8390    0.0737 
##  2 1        2       3456   8390    0.412  
##  3 1        3       4118   8390    0.491  
##  4 1        4        198   8390    0.0236 
##  5 2        1        936   4877    0.192  
##  6 2        2        706   4877    0.145  
##  7 2        3        447   4877    0.0917 
##  8 2        4       2788   4877    0.572  
##  9 3        1         72    305    0.236  
## 10 3        3          3    305    0.00984
## 11 3        4        230    305    0.754

## # A tibble: 3 x 6
## # Groups:   target [2]
##   TCH.SIGN target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        3       4118   8390      0.491      4118
## 2 2        4       2788   4877      0.572      2788
## 3 3        4        230    305      0.754       230

In the target response, cluster 3 holds most of the positive answers, and cluster 4 holds most of the negative answers and most of the don’t know / refused answers (level 3) for the question: TCH_SIGN Has a doctor or other health professional ever taught you… a. How to recognize early signs or symptoms of an asthma episode?

## # A tibble: 3 x 6
## # Groups:   TCH.SIGN [3]
##   target TCH.SIGN count etotal proportion group.max
##   <fct>  <fct>    <int>  <int>      <dbl>     <int>
## 1 3      1         4118   4568     0.901       4118
## 2 4      2         2788   3216     0.867       2788
## 3 4      3          230   3216     0.0715       230
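Tables of this kind can be built by cross-tabulating one response variable against the cluster label. The sketch below uses base R and a ten-row toy data frame, so the counts are made up; the report itself appears to use dplyr tibbles, and the object names here are hypothetical.

```r
# Toy data: one response variable and the cluster label (`target`).
toy <- data.frame(
  TCH.SIGN = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 1)),
  target   = factor(c(3, 3, 3, 2, 4, 4, 1, 4, 4, 3))
)

# Counts of each (answer, cluster) pair.
counts <- table(toy$TCH.SIGN, toy$target, dnn = c("TCH.SIGN", "target"))

# Share of each answer level that falls in each cluster (rows sum to 1).
prop_by_answer <- prop.table(counts, margin = 1)

# For every answer level, the cluster that holds most of its respondents.
dominant_cluster <- colnames(counts)[apply(counts, 1, which.max)]
names(dominant_cluster) <- rownames(counts)
```

Using `margin = 2` in `prop.table` instead gives the share of each cluster taken up by each answer, which corresponds to the second kind of table shown for each variable.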

TCH.RESP

## # A tibble: 11 x 5
##    TCH.RESP target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1        741   9755     0.0760
##  2 1        2       3982   9755     0.408 
##  3 1        3       4419   9755     0.453 
##  4 1        4        613   9755     0.0628
##  5 2        1        798   3525     0.226 
##  6 2        2        180   3525     0.0511
##  7 2        3        143   3525     0.0406
##  8 2        4       2404   3525     0.682 
##  9 3        1         87    292     0.298 
## 10 3        3          6    292     0.0205
## 11 3        4        199    292     0.682

## # A tibble: 3 x 6
## # Groups:   target [2]
##   TCH.RESP target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        3       4419   9755      0.453      4419
## 2 2        4       2404   3525      0.682      2404
## 3 3        4        199    292      0.682       199

In the target response, cluster 3 holds most of the positive answers, and cluster 4 holds most of the negative answers and most of the don’t know / refused answers (level 3) for the question: TCH_RESP Has a doctor or other health professional ever taught you… b. What to do during an asthma episode or attack?

## # A tibble: 3 x 6
## # Groups:   TCH.RESP [3]
##   target TCH.RESP count etotal proportion group.max
##   <fct>  <fct>    <int>  <int>      <dbl>     <int>
## 1 3      1         4419   4568     0.967       4419
## 2 4      2         2404   3216     0.748       2404
## 3 4      3          199   3216     0.0619       199

TCH.MON

## # A tibble: 10 x 5
##    TCH.MON target count etotal proportion
##    <fct>   <fct>  <int>  <int>      <dbl>
##  1 1       1        319   5615    0.0568 
##  2 1       2         46   5615    0.00819
##  3 1       3       4568   5615    0.814  
##  4 1       4        682   5615    0.121  
##  5 2       1       1279   7800    0.164  
##  6 2       2       4027   7800    0.516  
##  7 2       4       2494   7800    0.320  
##  8 3       1         28    157    0.178  
##  9 3       2         89    157    0.567  
## 10 3       4         40    157    0.255

## # A tibble: 3 x 6
## # Groups:   target [2]
##   TCH.MON target count etotal proportion group.max
##   <fct>   <fct>  <int>  <int>      <dbl>     <int>
## 1 1       3       4568   5615      0.814      4568
## 2 2       2       4027   7800      0.516      4027
## 3 3       2         89    157      0.567        89

In the target response, cluster 3 holds most of the positive answers, and cluster 2 holds most of the negative answers and most of the don’t know / refused answers (level 3) for the question: TCH_MON A peak flow meter is a hand held device that measures how quickly you can blow air out of your lungs. Has a doctor or other health professional ever taught you… c. How to use a peak flow meter to adjust your daily medications?

## # A tibble: 3 x 6
## # Groups:   TCH.MON [3]
##   target TCH.MON count etotal proportion group.max
##   <fct>  <fct>   <int>  <int>      <dbl>     <int>
## 1 2      2        4027   4162     0.968       4027
## 2 2      3          89   4162     0.0214        89
## 3 3      1        4568   4568     1           4568

MGT.PLAN

## # A tibble: 12 x 5
##    MGT.PLAN target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1        161   3657     0.0440
##  2 1        2        854   3657     0.234 
##  3 1        3       2491   3657     0.681 
##  4 1        4        151   3657     0.0413
##  5 2        1       1409   9604     0.147 
##  6 2        2       3181   9604     0.331 
##  7 2        3       2049   9604     0.213 
##  8 2        4       2965   9604     0.309 
##  9 3        1         56    311     0.180 
## 10 3        2        127    311     0.408 
## 11 3        3         28    311     0.0900
## 12 3        4        100    311     0.322

## # A tibble: 3 x 6
## # Groups:   target [2]
##   MGT.PLAN target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        3       2491   3657      0.681      2491
## 2 2        2       3181   9604      0.331      3181
## 3 3        2        127    311      0.408       127

In the target response, cluster 3 holds most of the positive answers, and cluster 2 holds the largest share of the negative answers and of the don’t know / refused answers (level 3) for the question: MGT_PLAN An asthma action plan, or asthma management plan, is a form with instructions about when to change the amount or type of medicine, when to call the doctor for advice, and when to go to the emergency room. Has a doctor or other health professional EVER given you an asthma action plan?

## # A tibble: 3 x 6
## # Groups:   MGT.PLAN [3]
##   target MGT.PLAN count etotal proportion group.max
##   <fct>  <fct>    <int>  <int>      <dbl>     <int>
## 1 2      2         3181   4162     0.764       3181
## 2 2      3          127   4162     0.0305       127
## 3 3      1         2491   4568     0.545       2491

MGT.CLAS

## # A tibble: 12 x 5
##    MGT.CLAS target count etotal proportion
##    <fct>    <fct>  <int>  <int>      <dbl>
##  1 1        1         61   1179     0.0517
##  2 1        2        266   1179     0.226 
##  3 1        3        798   1179     0.677 
##  4 1        4         54   1179     0.0458
##  5 2        1       1557  12357     0.126 
##  6 2        2       3886  12357     0.314 
##  7 2        3       3759  12357     0.304 
##  8 2        4       3155  12357     0.255 
##  9 3        1          8     36     0.222 
## 10 3        2         10     36     0.278 
## 11 3        3         11     36     0.306 
## 12 3        4          7     36     0.194

## # A tibble: 3 x 6
## # Groups:   target [2]
##   MGT.CLAS target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        3        798   1179      0.677       798
## 2 2        2       3886  12357      0.314      3886
## 3 3        3         11     36      0.306        11

In the target response, cluster 3 holds most of the positive answers, cluster 2 holds the largest share of the negative answers, and cluster 3 holds the largest share of the don’t know / refused answers (level 3) for the question: MGT_CLAS Have you ever taken a course or class on how to manage your asthma?

## # A tibble: 3 x 6
## # Groups:   MGT.CLAS [3]
##   target MGT.CLAS count etotal proportion group.max
##   <fct>  <fct>    <int>  <int>      <dbl>     <int>
## 1 2      2         3886   4162    0.934        3886
## 2 3      1          798   4568    0.175         798
## 3 3      3           11   4568    0.00241        11

INHALERW

## # A tibble: 8 x 5
##   INHALERW target count etotal proportion
##   <fct>    <fct>  <int>  <int>      <dbl>
## 1 1        2       3533   9168      0.385
## 2 1        3       4063   9168      0.443
## 3 1        4       1572   9168      0.171
## 4 2        2        629   2778      0.226
## 5 2        3        505   2778      0.182
## 6 2        4       1644   2778      0.592
## 7 4        1       1174   1174      1    
## 8 3        1        452    452      1

## # A tibble: 4 x 6
## # Groups:   target [3]
##   INHALERW target count etotal proportion group.max
##   <fct>    <fct>  <int>  <int>      <dbl>     <int>
## 1 1        3       4063   9168      0.443      4063
## 2 2        4       1644   2778      0.592      1644
## 3 4        1       1174   1174      1          1174
## 4 3        1        452    452      1           452

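In the target response, cluster 3 holds the largest share of the positive answers, cluster 4 holds most of the negative answers, and cluster 1 holds all of the don’t know / refused answers (level 3) and all of the answers with original codes 5 and 6 (level 4) for the question: INHALERW (8.4) Did a doctor or other health professional watch you use the inhaler?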

## # A tibble: 4 x 6
## # Groups:   INHALERW [4]
##   target INHALERW count etotal proportion group.max
##   <fct>  <fct>    <int>  <int>      <dbl>     <int>
## 1 1      4         1174   1626      0.722      1174
## 2 1      3          452   1626      0.278       452
## 3 3      1         4063   4568      0.889      4063
## 4 4      2         1644   3216      0.511      1644

MOD.ENV

## # A tibble: 12 x 5
##    MOD.ENV target count etotal proportion
##    <fct>   <fct>  <int>  <int>      <dbl>
##  1 1       1        268   4218     0.0635
##  2 1       2       1396   4218     0.331 
##  3 1       3       2005   4218     0.475 
##  4 1       4        549   4218     0.130 
##  5 2       1       1335   9265     0.144 
##  6 2       2       2745   9265     0.296 
##  7 2       3       2542   9265     0.274 
##  8 2       4       2643   9265     0.285 
##  9 3       1         23     89     0.258 
## 10 3       2         21     89     0.236 
## 11 3       3         21     89     0.236 
## 12 3       4         24     89     0.270

## # A tibble: 3 x 6
## # Groups:   target [3]
##   MOD.ENV target count etotal proportion group.max
##   <fct>   <fct>  <int>  <int>      <dbl>     <int>
## 1 1       3       2005   4218      0.475      2005
## 2 2       2       2745   9265      0.296      2745
## 3 3       4         24     89      0.270        24

In the target response, cluster 3 holds the largest share of the positive answers, cluster 2 holds the largest share of the negative answers, and cluster 4 holds the largest share of the don’t know / refused answers (level 3) for the question: MOD_ENV (7.13) INTERVIEWER READ: Now, back to questions specifically about you. Has a health professional ever advised you to change things in your home, school, or work to improve your asthma?

length(asth.res1$TCH.SIGN)

## [1] 3

## # A tibble: 3 x 6
## # Groups:   MOD.ENV [3]
##   target MOD.ENV count etotal proportion group.max
##   <fct>  <fct>   <int>  <int>      <dbl>     <int>
## 1 2      2        2745   4162    0.660        2745
## 2 3      1        2005   4568    0.439        2005
## 3 4      3          24   3216    0.00746        24

Summary of the response variables

asth.edul1 <- merge(asth.res1, asth.res2 ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.res3 ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.res4 ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.res5 ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.res6 ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.res7 ,by.x = "target", by.y = "target", all = TRUE) %>%
  select(., target, TCH.SIGN, TCH.RESP, TCH.MON, MGT.PLAN, MGT.CLAS, INHALERW, MOD.ENV)

## Warning in merge.data.frame(., asth.res4, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x', 'count.y',
## 'etotal.y', 'proportion.y', 'group.max.y' are duplicated in the result

## Warning in merge.data.frame(., asth.res5, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x', 'count.y',
## 'etotal.y', 'proportion.y', 'group.max.y' are duplicated in the result

## Warning in merge.data.frame(., asth.res6, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x',
## 'count.y', 'etotal.y', 'proportion.y', 'group.max.y', 'count.x', 'etotal.x',
## 'proportion.x', 'group.max.x', 'count.y', 'etotal.y', 'proportion.y',
## 'group.max.y' are duplicated in the result

## Warning in merge.data.frame(., asth.res7, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x',
## 'count.y', 'etotal.y', 'proportion.y', 'group.max.y', 'count.x', 'etotal.x',
## 'proportion.x', 'group.max.x', 'count.y', 'etotal.y', 'proportion.y',
## 'group.max.y' are duplicated in the result

asth.edul1

##    target TCH.SIGN TCH.RESP TCH.MON MGT.PLAN MGT.CLAS INHALERW MOD.ENV
## 1       1     <NA>     <NA>    <NA>     <NA>     <NA>        4    <NA>
## 2       1     <NA>     <NA>    <NA>     <NA>     <NA>        3    <NA>
## 3       2     <NA>     <NA>       2        2        2     <NA>       2
## 4       2     <NA>     <NA>       2        3        2     <NA>       2
## 5       2     <NA>     <NA>       3        2        2     <NA>       2
## 6       2     <NA>     <NA>       3        3        2     <NA>       2
## 7       3        1        1       1        1        1        1       1
## 8       3        1        1       1        1        3        1       1
## 9       4        3        3    <NA>     <NA>     <NA>        2       3
## 10      4        2        2    <NA>     <NA>     <NA>        2       3
## 11      4        2        3    <NA>     <NA>     <NA>        2       3
## 12      4        3        2    <NA>     <NA>     <NA>        2       3

#write.csv(asth.edul1, "asthma_edu_level.csv")

asth.edul2 <- merge(asth.sign, asth.resp ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.mon ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.plan ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.clas ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.inhal ,by.x = "target", by.y = "target", all = TRUE) %>%
  merge(., asth.env ,by.x = "target", by.y = "target", all = TRUE) %>%
  select(., target, TCH.SIGN, TCH.RESP, TCH.MON, MGT.PLAN, MGT.CLAS, INHALERW, MOD.ENV)

## Warning in merge.data.frame(., asth.plan, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x', 'count.y',
## 'etotal.y', 'proportion.y', 'group.max.y' are duplicated in the result

## Warning in merge.data.frame(., asth.clas, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x', 'count.y',
## 'etotal.y', 'proportion.y', 'group.max.y' are duplicated in the result

## Warning in merge.data.frame(., asth.inhal, by.x = "target", by.y =
## "target", : column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x',
## 'count.y', 'etotal.y', 'proportion.y', 'group.max.y', 'count.x', 'etotal.x',
## 'proportion.x', 'group.max.x', 'count.y', 'etotal.y', 'proportion.y',
## 'group.max.y' are duplicated in the result

## Warning in merge.data.frame(., asth.env, by.x = "target", by.y = "target", :
## column names 'count.x', 'etotal.x', 'proportion.x', 'group.max.x',
## 'count.y', 'etotal.y', 'proportion.y', 'group.max.y', 'count.x', 'etotal.x',
## 'proportion.x', 'group.max.x', 'count.y', 'etotal.y', 'proportion.y',
## 'group.max.y' are duplicated in the result

asth.edul2

##    target TCH.SIGN TCH.RESP TCH.MON MGT.PLAN MGT.CLAS INHALERW MOD.ENV
## 1       1     <NA>     <NA>    <NA>     <NA>     <NA>        4    <NA>
## 2       1     <NA>     <NA>    <NA>     <NA>     <NA>        3    <NA>
## 3       2     <NA>     <NA>       2        2        2     <NA>       2
## 4       2     <NA>     <NA>       2        3        2     <NA>       2
## 5       2     <NA>     <NA>       3        2        2     <NA>       2
## 6       2     <NA>     <NA>       3        3        2     <NA>       2
## 7       3        1        1       1        1        1        1       1
## 8       3        1        1       1        1        3        1       1
## 9       4        3        3    <NA>     <NA>     <NA>        2       3
## 10      4        2        2    <NA>     <NA>     <NA>        2       3
## 11      4        2        3    <NA>     <NA>     <NA>        2       3
## 12      4        3        2    <NA>     <NA>     <NA>        2       3

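For the response variable TARGET, excellent management skills correspond to cluster 3, whose members mostly gave positive answers, whereas poor management skills correspond to clusters 1 and 4, whose centers lie closest to the negative answer on most questions.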

We can build a logistic regression on the dataset.
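A logistic regression of this kind can be sketched with `glm`. The example below uses random toy data and assumes that the binary outcome is 1 for cluster 3 and 0 otherwise; this assumption is consistent with the TARGET counts reported later (9,004 zeros out of 13,572 rows, with 4,568 rows in cluster 3). `SEX` and `INS2` are two of the report's predictors, but their values here are made up and the model formula is illustrative.

```r
set.seed(621)

# Toy data: cluster labels and two made-up categorical predictors.
n <- 500
toy <- data.frame(
  target = factor(sample(1:4, n, replace = TRUE)),
  SEX    = factor(sample(1:2, n, replace = TRUE)),
  INS2   = factor(sample(1:2, n, replace = TRUE))
)

# Binary outcome: 1 for the cluster taken to mean good self-management
# (cluster 3 here, an assumption), 0 for every other cluster.
toy$TARGET <- as.integer(toy$target == "3")

# Logistic regression of the binary outcome on the predictors.
fit <- glm(TARGET ~ SEX + INS2, data = toy, family = binomial)

# Coefficients are log-odds; exponentiate to read them as odds ratios.
odds_ratios <- exp(coef(fit))
```

With categorical predictors, each odds ratio compares one level against that factor's reference level.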

Note: check which value stands for “yes” in the response variables, and change the if condition on resp.asthma2$taget if the values above are different. Remove “Break” in the chunk below.

Here we remove the variables used to calculate the target variable and reformat the data frame.

## 'data.frame':    13572 obs. of  24 variables:
##  $ TARGET   : num  0 1 1 0 0 0 0 0 1 0 ...
##  $ SEX      : Factor w/ 2 levels "1","2": 1 2 2 2 2 1 1 2 1 2 ...
##  $ AGEG.F7  : Factor w/ 7 levels "1","2","3","4",..: 4 4 6 5 6 6 6 7 3 5 ...
##  $ X_IMPRACE: Factor w/ 6 levels "1","2","3","4",..: 1 2 1 5 1 1 1 1 5 1 ...
##  $ EDUCAL   : Factor w/ 6 levels "1","2","3","4",..: 6 4 5 5 6 4 6 6 4 5 ...
##  $ X_INCOMG : Factor w/ 6 levels "1","2","3","4",..: 5 1 2 5 5 5 5 6 1 5 ...
##  $ X_RFBMI5 : Factor w/ 3 levels "1","2","9": 2 2 2 2 1 1 2 1 1 2 ...
##  $ SMOKE100 : Factor w/ 3 levels "1","2","7": 2 2 2 2 2 1 1 2 1 2 ...
##  $ COPD     : Factor w/ 3 levels "1","2","7": 2 2 2 2 2 1 2 2 2 2 ...
##  $ EMPHY    : Factor w/ 4 levels "1","2","7","9": 2 2 2 2 2 1 2 2 2 2 ...
##  $ DEPRESS  : Factor w/ 4 levels "1","2","7","9": 2 1 2 1 2 2 2 2 1 2 ...
##  $ BRONCH   : Factor w/ 4 levels "1","2","7","9": 2 2 2 2 2 2 2 2 2 2 ...
##  $ DUR.30D  : Factor w/ 7 levels "1","10","11",..: 4 4 3 6 3 4 6 6 2 3 ...
##  $ INCINDT  : Factor w/ 4 levels "1","2","3","7": 1 3 3 3 3 3 3 3 3 2 ...
##  $ LAST.MD  : Factor w/ 5 levels "4","5","6","7",..: 1 1 1 4 4 3 1 4 2 1 ...
##  $ LAST.MED : Factor w/ 5 levels "4","5","6","7",..: 1 1 3 4 3 2 1 4 4 1 ...
##  $ LAST.SYMP: Factor w/ 8 levels "1","2","3","4",..: 1 1 3 7 3 2 5 7 4 3 ...
##  $ COMPASTH : Factor w/ 6 levels "1","11","2","3",..: 4 4 6 5 2 2 5 5 2 4 ...
##  $ INS2     : Factor w/ 5 levels "1","2","5","7",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ER.VISIT : Factor w/ 5 levels "1","2","5","6",..: 1 2 2 3 4 4 2 3 4 2 ...
##  $ HOSP.VST : Factor w/ 7 levels "1","2","4","5",..: 2 2 2 4 5 5 3 4 5 2 ...
##  $ ASRXCOST : Factor w/ 5 levels "1","2","5","7",..: 2 2 2 3 2 2 2 3 2 2 ...
##  $ WORKTALK : Factor w/ 6 levels "1","2","6","7",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ACT.DAY30: Factor w/ 6 levels "1","2","3","4",..: 1 4 2 5 1 1 1 5 1 1 ...

PREPARE THE DATA FOR MODELING

We remove the rows with missing values.

Here we drop the rows with missing data because there are only 12 of them out of 13,922 rows. We also convert all predictors to categorical variables.
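One way this could be done in base R, shown on a small made-up data frame. The name `df` and its values are illustrative only.

```r
# Toy data with a few missing entries (values are made up).
df <- data.frame(
  TARGET = c(0, 1, NA, 0, 1),
  SEX    = c(1, 2, 2, NA, 1),
  COPD   = c(2, 2, 1, 2, 7)
)

n_before <- nrow(df)

# Keep only the rows that have no missing value in any column.
df_complete <- df[complete.cases(df), ]
n_dropped <- n_before - nrow(df_complete)

# Convert every column to a factor so the codes are treated as categories.
df_complete[] <- lapply(df_complete, factor)

n_dropped
summary(df_complete)
```

Reporting `n_dropped` alongside the original row count makes it easy to confirm that listwise deletion removes only a small share of the data.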

##  TARGET   SEX      AGEG.F7  X_IMPRACE EDUCAL   X_INCOMG X_RFBMI5 SMOKE100
##  0:9004   1:4896   1: 795   1:10741   1:  17   1:1778   1:3424   1:6302  
##  1:4568   2:8676   2:1222   2:  797   2: 245   2:2108   2:9509   2:7231  
##                    3:1293   3:  176   3: 666   3:1231   9: 639   7:  39  
##                    4:2077   4:  290   4:3148   4:1604                    
##                    5:3299   5:  997   5:4183   5:5213                    
##                    6:3148   6:  571   6:5313   9:1638                    
##                    7:1738                                                
##  COPD      EMPHY     DEPRESS  BRONCH   DUR.30D   INCINDT   LAST.MD  LAST.MED
##  1: 2665   1: 1090   1:5194   1:3483   1 :1237   1:  287   4:7760   4:4789  
##  2:10738   2:12344   2:8269   2:9902   10:1548   2: 1001   5:1880   5:2192  
##  7:  169   7:  127   7:  83   7: 176   11: 736   3:12253   6: 801   6:2031  
##            9:   11   9:  26   9:  11   12:4774   7:   31   7:2852   7:4000  
##                                        2 : 909             9: 279   9: 560  
##                                        6 :4118                              
##                                        7 : 250                              
##    LAST.SYMP    COMPASTH  INS2      ER.VISIT HOSP.VST ASRXCOST WORKTALK 
##  1      :3513   1 :1139   1:  628   1:1327   1: 404   1:1509   1: 2409  
##  7      :2456   11:4252   2:12137   2:6517   2:6524   2:9421   2:10681  
##  3      :2440   2 : 795   5:  743   5:2609   4: 920   5:2607   6:  264  
##  2      :1665   3 :3210   7:   51   6:3097   5:2609   7:  23   7:  189  
##  4      :1548   6 :4118   9:   13   7:  22   6:3097   9:  12   8:   20  
##  5      :1014   7 :  58                      7:  16            9:    9  
##  (Other): 936                                9:   2                     
##  ACT.DAY30
##  1:5667   
##  2:3009   
##  3:1431   
##  4: 801   
##  5:2609   
##  7:  55   
##

Visualization of some combined variables

Proportion of Good Skill Management in Terms of Education Level

Proportion of Good Skill Management in Terms of Duration of Asthma Attack

Splitting the data into train and test sets
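A sketch of a random split on toy data. The 80/20 ratio is an assumption inferred from the sample sizes reported later (10,858 training rows out of 13,572). The name `training1` appears in the model calls, while `testing1`, the toy data, and the seed are hypothetical.

```r
set.seed(621)  # arbitrary seed so the split is reproducible

# Toy data standing in for the cleaned survey data.
n <- 100
toy <- data.frame(
  TARGET = factor(rbinom(n, 1, 0.34)),
  SEX    = factor(sample(1:2, n, replace = TRUE))
)

# Draw 80% of the row indices at random for training.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

training1 <- toy[train_idx, ]   # used to fit the models
testing1  <- toy[-train_idx, ]  # held out for the confusion matrices

c(train = nrow(training1), test = nrow(testing1))
```

A stratified split that preserves the class proportions of TARGET in both sets would be a reasonable alternative.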

BUILDS MODELS

Model using full predictors with glm

## 
## Call:  glm(formula = TARGET ~ ., family = binomial, data = training1)
## 
## Coefficients:
## (Intercept)         SEX2     AGEG.F72     AGEG.F73     AGEG.F74     AGEG.F75  
##   -1.174536     0.277355     0.185815    -0.057024    -0.092841    -0.269772  
##    AGEG.F76     AGEG.F77   X_IMPRACE2   X_IMPRACE3   X_IMPRACE4   X_IMPRACE5  
##   -0.437822    -0.673535     0.290042    -0.304171    -0.125454     0.066594  
##  X_IMPRACE6      EDUCAL2      EDUCAL3      EDUCAL4      EDUCAL5      EDUCAL6  
##   -0.052169     0.385098     0.387940     0.554764     0.603154     0.583484  
##   X_INCOMG2    X_INCOMG3    X_INCOMG4    X_INCOMG5    X_INCOMG9    X_RFBMI52  
##    0.171072    -0.015702    -0.012750     0.061015    -0.045599     0.010538  
##   X_RFBMI59    SMOKE1002    SMOKE1007        COPD2        COPD7       EMPHY2  
##   -0.016012     0.117088    -0.354173    -0.136657    -0.225422    -0.014686  
##      EMPHY7       EMPHY9     DEPRESS2     DEPRESS7     DEPRESS9      BRONCH2  
##   -0.534735    -0.142864     0.064885     0.133971    -0.487342    -0.064650  
##     BRONCH7      BRONCH9    DUR.30D10    DUR.30D11    DUR.30D12     DUR.30D2  
##   -0.388348     0.953002     0.292401     0.384874     0.142016     0.033771  
##    DUR.30D6     DUR.30D7     INCINDT2     INCINDT3     INCINDT7     LAST.MD5  
##   -0.357111    -0.438918     0.093255     1.127746    -0.326672    -0.036131  
##    LAST.MD6     LAST.MD7     LAST.MD9    LAST.MED5    LAST.MED6    LAST.MED7  
##   -0.162538    -0.449114    -0.669319    -0.269302    -0.352852    -0.437017  
##   LAST.MED9   LAST.SYMP2   LAST.SYMP3   LAST.SYMP4   LAST.SYMP5   LAST.SYMP6  
##   -2.154621     0.007867     0.141573           NA     0.534054     0.722396  
##  LAST.SYMP7   LAST.SYMP9   COMPASTH11    COMPASTH2    COMPASTH3    COMPASTH6  
##    0.500094    -0.170911    -0.263481    -0.332154    -0.103789           NA  
##   COMPASTH7        INS22        INS25        INS27        INS29    ER.VISIT2  
##   -0.396203     0.225311     0.053888    -0.067315   -12.177179    -0.150526  
##   ER.VISIT5    ER.VISIT6    ER.VISIT7    HOSP.VST2    HOSP.VST4    HOSP.VST5  
##  -12.056841    -0.609194    -1.108039    -0.221764    -0.317311           NA  
##   HOSP.VST6    HOSP.VST7    HOSP.VST9    ASRXCOST2    ASRXCOST5    ASRXCOST7  
##          NA    -1.516388   -13.699485    -0.059624    11.330773     0.147173  
##   ASRXCOST9    WORKTALK2    WORKTALK6    WORKTALK7    WORKTALK8    WORKTALK9  
##   -1.171160    -0.575374    -0.615817    -0.385011     0.445304     0.018963  
##  ACT.DAY302   ACT.DAY303   ACT.DAY304   ACT.DAY305   ACT.DAY307  
##    0.147657     0.186494     0.130929           NA     0.003462  
## 
## Degrees of Freedom: 10857 Total (i.e. Null);  10768 Residual
## Null Deviance:       13850 
## Residual Deviance: 12650     AIC: 12830

Confusion Matrix with the testing set

First glm model using backward elimination with the step function

## 
## Call:  glm(formula = TARGET ~ SEX + AGEG.F7 + X_IMPRACE + X_INCOMG + 
##     SMOKE100 + COPD + DUR.30D + INCINDT + LAST.MD + LAST.MED + 
##     LAST.SYMP + COMPASTH + INS2 + ER.VISIT + HOSP.VST + WORKTALK + 
##     ACT.DAY30, family = binomial, data = training1)
## 
## Coefficients:
## (Intercept)         SEX2     AGEG.F72     AGEG.F73     AGEG.F74     AGEG.F75  
##   -0.678798     0.277213     0.190033    -0.057372    -0.087781    -0.261394  
##    AGEG.F76     AGEG.F77   X_IMPRACE2   X_IMPRACE3   X_IMPRACE4   X_IMPRACE5  
##   -0.427736    -0.655346     0.282615    -0.300361    -0.126453     0.046405  
##  X_IMPRACE6    X_INCOMG2    X_INCOMG3    X_INCOMG4    X_INCOMG5    X_INCOMG9  
##   -0.050401     0.196905     0.017796     0.030185     0.101560    -0.031801  
##   SMOKE1002    SMOKE1007        COPD2        COPD7    DUR.30D10    DUR.30D11  
##    0.125338    -0.315724    -0.143446    -0.426710     0.282624     0.379578  
##   DUR.30D12     DUR.30D2     DUR.30D6     DUR.30D7     INCINDT2     INCINDT3  
##    0.136008     0.031677    -0.376067    -0.456387     0.098652     1.139783  
##    INCINDT7     LAST.MD5     LAST.MD6     LAST.MD7     LAST.MD9    LAST.MED5  
##   -0.384838    -0.043707    -0.176186    -0.456484    -0.698527    -0.272267  
##   LAST.MED6    LAST.MED7    LAST.MED9   LAST.SYMP2   LAST.SYMP3   LAST.SYMP4  
##   -0.353263    -0.435766    -2.161181     0.003324     0.137790           NA  
##  LAST.SYMP5   LAST.SYMP6   LAST.SYMP7   LAST.SYMP9   COMPASTH11    COMPASTH2  
##    0.535938     0.722550     0.500386    -0.174032    -0.272680    -0.330235  
##   COMPASTH3    COMPASTH6    COMPASTH7        INS22        INS25        INS27  
##   -0.105775           NA    -0.418558     0.219372     0.050479    -0.081843  
##       INS29    ER.VISIT2    ER.VISIT5    ER.VISIT6    ER.VISIT7    HOSP.VST2  
##  -12.152164    -0.154096    -0.676921    -0.623558    -1.138380    -0.235152  
##   HOSP.VST4    HOSP.VST5    HOSP.VST6    HOSP.VST7    HOSP.VST9    WORKTALK2  
##   -0.333963           NA           NA    -1.535202   -13.765679    -0.580233  
##   WORKTALK6    WORKTALK7    WORKTALK8    WORKTALK9   ACT.DAY302   ACT.DAY303  
##   -0.676153    -0.496932     0.114223    -0.018377     0.146346     0.190329  
##  ACT.DAY304   ACT.DAY305   ACT.DAY307  
##    0.124820           NA     0.019868  
## 
## Degrees of Freedom: 10857 Total (i.e. Null);  10788 Residual
## Null Deviance:       13850 
## Residual Deviance: 12670     AIC: 12810
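A small sketch of AIC-based backward elimination with `step`, on simulated data. The variables `x1`, `x2`, and `noise` are hypothetical, and only `x1` truly drives the response.

```r
set.seed(2)
n <- 300

# Simulated predictors; only x1 influences the outcome.
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)
y     <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))
toy   <- data.frame(y, x1, x2, noise)

# Start from the model with all predictors.
full <- glm(y ~ ., family = binomial, data = toy)

# Drop terms one at a time as long as doing so lowers the AIC.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Backward elimination never returns a model with a higher AIC than its starting point, which is consistent with the small AIC decrease between the two models reported above.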

Confusion Matrix with the testing set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1625  686
##          1  161  242
##                                           
##                Accuracy : 0.6879          
##                  95% CI : (0.6701, 0.7053)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.0005201       
##                                           
##                   Kappa : 0.1975          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.26078         
##             Specificity : 0.90985         
##          Pos Pred Value : 0.60050         
##          Neg Pred Value : 0.70316         
##              Prevalence : 0.34193         
##          Detection Rate : 0.08917         
##    Detection Prevalence : 0.14849         
##       Balanced Accuracy : 0.58532         
##                                           
##        'Positive' Class : 1               
##
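The summary statistics above follow directly from the four cell counts. The sketch below recomputes them by hand, with class 1 as the positive class. The object names are illustrative.

```r
# Cell counts from the confusion matrix above (rows: prediction, columns: reference).
cm <- matrix(c(1625, 161, 686, 242), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["1", "1"]  # predicted 1, truly 1
tn <- cm["0", "0"]  # predicted 0, truly 0
fp <- cm["1", "0"]  # predicted 1, truly 0
fn <- cm["0", "1"]  # predicted 0, truly 1

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced_accuracy), 4)
```

The low sensitivity shows that the model misses most of the positive cases even though the overall accuracy looks acceptable.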

Second glm model

## 
## Call:  glm(formula = TARGET ~ SEX + AGEG.F7 + EDUCAL + X_INCOMG + BRONCH + 
##     DUR.30D + INCINDT + LAST.MD + LAST.MED + LAST.SYMP + COMPASTH + 
##     WORKTALK, family = binomial, data = training1)
## 
## Coefficients:
## (Intercept)         SEX2     AGEG.F72     AGEG.F73     AGEG.F74     AGEG.F75  
##   -1.118715     0.284237     0.167241    -0.071978    -0.092099    -0.256311  
##    AGEG.F76     AGEG.F77      EDUCAL2      EDUCAL3      EDUCAL4      EDUCAL5  
##   -0.413130    -0.637474     0.424996     0.401404     0.555033     0.603441  
##     EDUCAL6    X_INCOMG2    X_INCOMG3    X_INCOMG4    X_INCOMG5    X_INCOMG9  
##    0.584129     0.150742    -0.056336    -0.051021     0.018527    -0.081780  
##     BRONCH2      BRONCH7      BRONCH9    DUR.30D10    DUR.30D11    DUR.30D12  
##   -0.099060    -0.574021    -0.010334     0.204898     0.292044     0.095202  
##    DUR.30D2     DUR.30D6     DUR.30D7     INCINDT2     INCINDT3     INCINDT7  
##   -0.001767    -0.556779    -0.508096     0.070431     1.097929    -0.384934  
##    LAST.MD5     LAST.MD6     LAST.MD7     LAST.MD9    LAST.MED5    LAST.MED6  
##   -0.299784    -0.423770    -0.705918    -0.836967    -0.304092    -0.401499  
##   LAST.MED7    LAST.MED9   LAST.SYMP2   LAST.SYMP3   LAST.SYMP4   LAST.SYMP5  
##   -0.491506    -2.222637     0.018452     0.156085           NA     0.530238  
##  LAST.SYMP6   LAST.SYMP7   LAST.SYMP9   COMPASTH11    COMPASTH2    COMPASTH3  
##    0.739814     0.500976    -0.211515    -0.315412    -0.280003    -0.104776  
##   COMPASTH6    COMPASTH7    WORKTALK2    WORKTALK6    WORKTALK7    WORKTALK8  
##          NA    -0.400090    -0.577549    -0.562460    -0.457966    -0.004085  
##   WORKTALK9  
##   -0.013825  
## 
## Degrees of Freedom: 10857 Total (i.e. Null);  10805 Residual
## Null Deviance:       13850 
## Residual Deviance: 12730     AIC: 12830

Confusion Matrix with the testing set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1626  701
##          1  160  227
##                                           
##                Accuracy : 0.6828          
##                  95% CI : (0.6649, 0.7002)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.003415        
##                                           
##                   Kappa : 0.1803          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.24461         
##             Specificity : 0.91041         
##          Pos Pred Value : 0.58656         
##          Neg Pred Value : 0.69875         
##              Prevalence : 0.34193         
##          Detection Rate : 0.08364         
##    Detection Prevalence : 0.14259         
##       Balanced Accuracy : 0.57751         
##                                           
##        'Positive' Class : 1               
##

Lasso and Ridge model

Since our dataset has many variables, we can use penalized logistic regression to find an optimally performing model. Ridge regression and lasso regression take two different approaches. Ridge regression keeps all variables in the model and shrinks the coefficients of variables with a minor contribution toward zero. Lasso regression keeps only the most influential variables and sets the coefficients of the remaining variables to exactly zero.
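For reference, the penalized objective that the glmnet package minimises for a binary response can be written as follows. The symbols are introduced here for illustration: \(\beta_0\) is the intercept, \(\beta\) the coefficient vector, \(\lambda\) the penalty strength, and \(\alpha\) the mixing parameter.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}
\left[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
+ \lambda\left[\, \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
      + \alpha\,\lVert\beta\rVert_1 \right]
```

Setting \(\alpha = 0\) gives ridge regression, \(\alpha = 1\) gives the lasso, and values in between give the elastic net.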

Split the data into training and testing sets and dummy-code the categorical predictors
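glmnet expects a numeric matrix, so factors have to be expanded into dummy columns first. A minimal sketch with `model.matrix` on made-up data follows. The names `toy`, `x`, and `y` are illustrative, and dropping the intercept column is a choice of this sketch, not necessarily what the report did.

```r
# Toy data with categorical predictors (values are made up).
toy <- data.frame(
  TARGET = factor(c(0, 1, 1, 0, 1, 0)),
  SEX    = factor(c(1, 2, 2, 1, 2, 1)),
  EDUCAL = factor(c(4, 5, 6, 4, 6, 5))
)

# Expand factors into 0/1 dummy columns; the first level of each factor
# is the reference. Column 1 is the intercept, which glmnet adds itself,
# so it is dropped here.
x <- model.matrix(TARGET ~ ., data = toy)[, -1]
y <- toy$TARGET

colnames(x)
```

If the intercept column is left in the matrix, glmnet reports it as an extra `(Intercept)` row with an empty coefficient, which is the pattern one would expect to see in a coefficient listing with two intercept rows.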

Ridge Regression

We fit the ridge regression and observe how its coefficients vary with the log of lambda.

Variation of Ridge Model Coefficient by Log Lambda

The coefficients are noticeably different from zero for negative log lambda and start to stabilize around -4.

Lambda that Minimises MSE

The plot shows that the log of the optimal value of lambda (i.e. the one that minimises the cross-validated mean squared error) is approximately -3. The exact value is stored in the variable lambda_min, printed below. In general, though, the objective of regularisation is to balance accuracy and simplicity. In the present context, this means a model with the smallest number of coefficients that still gives good accuracy. To this end, the cv.glmnet function also reports lambda.1se, the largest value of lambda whose cross-validated error lies within one standard error of the minimum, which gives the simplest such model.

## [1] 0.0232599
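A sketch of how the two lambda values can be obtained with `cv.glmnet` for a ridge fit (`alpha = 0`), on simulated data. The data, seed, and object names are hypothetical. For `family = "binomial"`, `cv.glmnet` uses the binomial deviance as its default error measure, so `type.measure = "mse"` would have to be set explicitly to cross-validate on squared error.

```r
library(glmnet)

set.seed(3)
n <- 200
p <- 10

# Simulated predictors and a binary response driven by the first two columns.
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2]))

# Cross-validated ridge logistic regression (alpha = 0).
cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)

lambda_min <- cv_ridge$lambda.min  # lambda with the lowest CV error
lambda_1se <- cv_ridge$lambda.1se  # largest lambda within one SE of that minimum

log(c(min = lambda_min, one_se = lambda_1se))

# Class predictions at the chosen lambda.
pred <- predict(cv_ridge, newx = x, s = "lambda.min", type = "class")
table(pred, y)
```

Because `lambda.1se` is never smaller than `lambda.min`, it always corresponds to the more heavily shrunk of the two models.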

Confusion matrix with lambda min

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1770  905
##          1   16   23
##                                           
##                Accuracy : 0.6606          
##                  95% CI : (0.6425, 0.6785)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.397           
##                                           
##                   Kappa : 0.0206          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.024784        
##             Specificity : 0.991041        
##          Pos Pred Value : 0.589744        
##          Neg Pred Value : 0.661682        
##              Prevalence : 0.341931        
##          Detection Rate : 0.008475        
##    Detection Prevalence : 0.014370        
##       Balanced Accuracy : 0.507913        
##                                           
##        'Positive' Class : 1               
##

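This ridge model assigns almost every test case to class 0 (sensitivity 0.025), so its accuracy is not significantly better than the no-information rate. This pattern points to an over-regularized (underfitted) model rather than to overfitting.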

Confusion matrix with best lambda

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1786  928
##          1    0    0
##                                           
##                Accuracy : 0.6581          
##                  95% CI : (0.6399, 0.6759)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.5089          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.6581          
##              Prevalence : 0.3419          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
##

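This second ridge model predicts class 0 for every test case (sensitivity 0, Kappa 0), so it is equivalent to always guessing the majority class. Again, this indicates excessive shrinkage rather than overfitting.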
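A short check of what a classifier that always predicts class 0 achieves on this test set, using the counts from the matrix above. The variable names are illustrative.

```r
# Test-set class counts taken from the confusion matrix above.
n_neg <- 1786  # reference class 0
n_pos <- 928   # reference class 1
n     <- n_neg + n_pos

# No-information rate: share of the largest class.
nir <- max(n_neg, n_pos) / n

# Accuracy and sensitivity when every case is predicted as 0.
acc_all_zero  <- n_neg / n
sens_all_zero <- 0 / n_pos

# Cohen's kappa: observed agreement versus agreement expected by chance.
p_obs    <- acc_all_zero
p_chance <- (n / n) * (n_neg / n) + (0 / n) * (n_pos / n)
kappa    <- (p_obs - p_chance) / (1 - p_chance)

round(c(nir = nir, accuracy = acc_all_zero,
        sensitivity = sens_all_zero, kappa = kappa), 4)
```

Accuracy equal to the no-information rate together with a kappa of zero means the model carries no information beyond the class frequencies.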

Getting the coefficients

## 96 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -0.6785089445
## (Intercept)  .           
## SEX2         0.2377312946
## AGEG.F72     0.2924064689
## AGEG.F73     0.0882851251
## AGEG.F74     0.0739161765
## AGEG.F75    -0.0784363273
## AGEG.F76    -0.2266438614
## AGEG.F77    -0.4304243552
## X_IMPRACE2   0.2783708011
## X_IMPRACE3  -0.2617968077
## X_IMPRACE4  -0.1250822625
## X_IMPRACE5   0.0770993145
## X_IMPRACE6  -0.0291325153
## EDUCAL2     -0.1849724455
## EDUCAL3     -0.1539710139
## EDUCAL4      0.0048576314
## EDUCAL5      0.0472495259
## EDUCAL6      0.0240563693
## X_INCOMG2    0.1424495042
## X_INCOMG3   -0.0318310008
## X_INCOMG4   -0.0230713655
## X_INCOMG5    0.0438650791
## X_INCOMG9   -0.0566232475
## X_RFBMI52    0.0008484487
## X_RFBMI59   -0.0133589051
## SMOKE1002    0.1186408841
## SMOKE1007   -0.3227703841
## COPD2       -0.1146803040
## COPD7       -0.2100418150
## EMPHY2      -0.0119953775
## EMPHY7      -0.4306694169
## EMPHY9       0.0158954894
## DEPRESS2     0.0496872761
## DEPRESS7     0.0435764855
## DEPRESS9    -0.3683416941
## BRONCH2     -0.0644667318
## BRONCH7     -0.3570145913
## BRONCH9      0.5477470454
## DUR.30D10    0.0783812228
## DUR.30D11    0.2622753285
## DUR.30D12    0.0778292172
## DUR.30D2    -0.0099787768
## DUR.30D6    -0.0063046250
## DUR.30D7    -0.4413631006
## INCINDT2    -0.2309764521
## INCINDT3     0.7463276137
## INCINDT7    -0.5225892789
## LAST.MD5     0.0434348649
## LAST.MD6    -0.0675088335
## LAST.MD7    -0.3285714325
## LAST.MD9    -0.5710486154
## LAST.MED5   -0.1749557026
## LAST.MED6   -0.2340047015
## LAST.MED7   -0.2972935433
## LAST.MED9   -1.5011993571
## LAST.SYMP2  -0.0149304285
## LAST.SYMP3   0.1037468900
## LAST.SYMP4   0.0787196299
## LAST.SYMP5   0.0738268250
## LAST.SYMP6   0.2212507509
## LAST.SYMP7  -0.0216193311
## LAST.SYMP9  -0.2944037512
## COMPASTH11  -0.2050397620
## COMPASTH2   -0.2085562100
## COMPASTH3   -0.0304637973
## COMPASTH6   -0.0127191965
## COMPASTH7   -0.3240287061
## INS22        0.1619687023
## INS25       -0.0027254990
## INS27       -0.0889205227
## INS29       -1.9283450931
## ER.VISIT2   -0.0995234384
## ER.VISIT5   -0.1307948414
## ER.VISIT6   -0.2151046502
## ER.VISIT7   -0.9032624092
## HOSP.VST2   -0.0127124439
## HOSP.VST4   -0.0329132274
## HOSP.VST5   -0.1254808417
## HOSP.VST6   -0.2092959558
## HOSP.VST7   -1.0928568540
## HOSP.VST9   -2.6073184150
## ASRXCOST2   -0.0455939879
## ASRXCOST5   -0.1153836580
## ASRXCOST7    0.1327327507
## ASRXCOST9   -0.8039086057
## WORKTALK2   -0.5040869742
## WORKTALK6   -0.4678981762
## WORKTALK7   -0.3132559117
## WORKTALK8    0.3544284083
## WORKTALK9    0.0071385665
## ACT.DAY302   0.1277000083
## ACT.DAY303   0.1598666842
## ACT.DAY304   0.1249975782
## ACT.DAY305  -0.1174198946
## ACT.DAY307  -0.0230500270

Lasso Regression

Find the best lambda using cross validation

Lambda that minimises MSE in Lasso

The plot shows that the log of the optimal value of lambda (i.e. the one that minimises the cross-validated mean squared error) is approximately -10. The exact value is stored in the variable lambda_min. In general, though, the objective of regularisation is to balance accuracy and simplicity. In the present context, this means a model with the smallest number of coefficients that still gives good accuracy. To this end, the cv.glmnet function also reports lambda.1se, the largest value of lambda whose cross-validated error lies within one standard error of the minimum, which gives the simplest such model.
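A sketch of the same cross-validation for the lasso (`alpha = 1`) on simulated data, showing how many coefficients remain non-zero at each of the two lambda values. The data and names are hypothetical.

```r
library(glmnet)

set.seed(4)
n <- 300
p <- 20

# Simulated predictors; only the first two columns affect the response.
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))

# Cross-validated lasso logistic regression (alpha = 1).
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at both lambda values, converted to ordinary matrices.
b_min <- as.matrix(coef(cv_lasso, s = "lambda.min"))
b_1se <- as.matrix(coef(cv_lasso, s = "lambda.1se"))

# Count non-zero coefficients, excluding the intercept in row 1.
n_min <- sum(b_min[-1, 1] != 0)
n_1se <- sum(b_1se[-1, 1] != 0)

c(nonzero_at_min = n_min, nonzero_at_1se = n_1se)
```

Coefficients printed as `.` in a sparse glmnet coefficient listing are exactly zero, which is how the lasso performs variable selection.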

Confusion Matrix with lambda min

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1640  711
##          1  146  217
##                                           
##                Accuracy : 0.6842          
##                  95% CI : (0.6664, 0.7017)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.002059        
##                                           
##                   Kappa : 0.1781          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23384         
##             Specificity : 0.91825         
##          Pos Pred Value : 0.59780         
##          Neg Pred Value : 0.69758         
##              Prevalence : 0.34193         
##          Detection Rate : 0.07996         
##    Detection Prevalence : 0.13375         
##       Balanced Accuracy : 0.57604         
##                                           
##        'Positive' Class : 1               
##

Getting the coefficients

## 96 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -0.8606406405
## (Intercept)  .           
## SEX2         0.2594402162
## AGEG.F72     0.2340892110
## AGEG.F73     0.0034616959
## AGEG.F74     .           
## AGEG.F75    -0.1503925580
## AGEG.F76    -0.3141421454
## AGEG.F77    -0.5410988285
## X_IMPRACE2   0.2789342148
## X_IMPRACE3  -0.2088864711
## X_IMPRACE4  -0.0796924307
## X_IMPRACE5   0.0499448037
## X_IMPRACE6   .           
## EDUCAL2     -0.1306405673
## EDUCAL3     -0.1452989939
## EDUCAL4      .           
## EDUCAL5      0.0213667953
## EDUCAL6      .           
## X_INCOMG2    0.1528571389
## X_INCOMG3    .           
## X_INCOMG4    .           
## X_INCOMG5    0.0480297485
## X_INCOMG9   -0.0359921061
## X_RFBMI52    .           
## X_RFBMI59    .           
## SMOKE1002    0.1112354961
## SMOKE1007   -0.1956145109
## COPD2       -0.1086642369
## COPD7       -0.1401300934
## EMPHY2       .           
## EMPHY7      -0.3890223792
## EMPHY9       .           
## DEPRESS2     0.0411592140
## DEPRESS7     .           
## DEPRESS9    -0.0220003046
## BRONCH2     -0.0516858120
## BRONCH7     -0.3190504035
## BRONCH9      .           
## DUR.30D10    0.1244464432
## DUR.30D11    0.2430586667
## DUR.30D12    0.0480299274
## DUR.30D2    -0.0087203262
## DUR.30D6     .           
## DUR.30D7    -0.4637068965
## INCINDT2     .           
## INCINDT3     1.0114193590
## INCINDT7    -0.1267449349
## LAST.MD5     .           
## LAST.MD6    -0.0971328786
## LAST.MD7    -0.4076446404
## LAST.MD9    -0.6022595780
## LAST.MED5   -0.2122311723
## LAST.MED6   -0.2848120938
## LAST.MED7   -0.3795487381
## LAST.MED9   -1.9850509111
## LAST.SYMP2   .           
## LAST.SYMP3   0.1162080311
## LAST.SYMP4   0.0185077766
## LAST.SYMP5   0.0135376553
## LAST.SYMP6   0.1735428119
## LAST.SYMP7   .           
## LAST.SYMP9  -0.2513469618
## COMPASTH11  -0.1923410551
## COMPASTH2   -0.1789780883
## COMPASTH3   -0.0003082605
## COMPASTH6    .           
## COMPASTH7   -0.2199743590
## INS22        0.1598095306
## INS25        .           
## INS27        .           
## INS29       -1.8091008137
## ER.VISIT2   -0.1002383815
## ER.VISIT5   -0.2548500002
## ER.VISIT6   -0.3308156833
## ER.VISIT7   -0.8113733714
## HOSP.VST2    .           
## HOSP.VST4    .           
## HOSP.VST5   -0.0948642873
## HOSP.VST6   -0.0252333067
## HOSP.VST7   -0.7813748722
## HOSP.VST9   -1.9292755596
## ASRXCOST2   -0.0016379281
## ASRXCOST5    .           
## ASRXCOST7    .           
## ASRXCOST9   -0.0787455438
## WORKTALK2   -0.5490490040
## WORKTALK6   -0.5184983953
## WORKTALK7   -0.3338646176
## WORKTALK8    .           
## WORKTALK9    .           
## ACT.DAY302   0.1080484254
## ACT.DAY303   0.1359887589
## ACT.DAY304   0.0759042895
## ACT.DAY305   .           
## ACT.DAY307   .

Confusion Matrix with best lambda

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1716  822
##          1   70  106
##                                          
##                Accuracy : 0.6713         
##                  95% CI : (0.6533, 0.689)
##     No Information Rate : 0.6581         
##     P-Value [Acc > NIR] : 0.07509        
##                                          
##                   Kappa : 0.0932         
##                                          
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##             Sensitivity : 0.11422        
##             Specificity : 0.96081        
##          Pos Pred Value : 0.60227        
##          Neg Pred Value : 0.67612        
##              Prevalence : 0.34193        
##          Detection Rate : 0.03906        
##    Detection Prevalence : 0.06485        
##       Balanced Accuracy : 0.53752        
##                                          
##        'Positive' Class : 1              
##

Calculating the AICc of Ridge and Lasso Models

    fit <- glmnet(x, y, family = "multinomial")

    tLL <- fit$nulldev - deviance(fit)
    k <- fit$df
    n <- fit$nobs
    AICc <- -tLL + 2*k + 2*k*(k+1)/(n-k-1)
    AICc

## [1] -975.4119

## [1] -1043.829
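In equation form, the quantity computed by the snippet above is the following, where \(D_{\text{null}}\) and \(D_{\text{fit}}\) denote the null and fitted deviances, \(k\) the number of non-zero coefficients (`fit$df`), and \(n\) the number of observations (`fit$nobs`). The symbol names are chosen here for illustration.

```latex
t_{LL} = D_{\text{null}} - D_{\text{fit}},
\qquad
\mathrm{AICc} = -\,t_{LL} + 2k + \frac{2k\,(k+1)}{n-k-1}
```

The last term is the small-sample correction. It vanishes as \(n\) grows relative to \(k\), so with roughly ten thousand training rows it changes the value only slightly.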

Partial Least Squares

Confusion Matrix for the PLS model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    T
##          F 1635  712
##          T  151  216
##                                           
##                Accuracy : 0.682           
##                  95% CI : (0.6641, 0.6995)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.004356        
##                                           
##                   Kappa : 0.1734          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23276         
##             Specificity : 0.91545         
##          Pos Pred Value : 0.58856         
##          Neg Pred Value : 0.69663         
##              Prevalence : 0.34193         
##          Detection Rate : 0.07959         
##    Detection Prevalence : 0.13522         
##       Balanced Accuracy : 0.57411         
##                                           
##        'Positive' Class : T               
##

Here we train the partial least squares model while tuning the number of components.

## Partial Least Squares 
## 
## 10858 samples
##    23 predictor
##     2 classes: 'F', 'T' 
## 
## Pre-processing: centered (94), scaled (94) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 9772, 9772, 9772, 9772, 9772, 9773, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  ROC        Sens       Spec     
##    1     0.6393174  1.0000000  0.0000000
##    2     0.6692877  0.9255111  0.1661172
##    3     0.6738019  0.9249578  0.1762821
##    4     0.6758076  0.9159058  0.2019231
##    5     0.6772703  0.9143360  0.2068681
##    6     0.6784624  0.9093482  0.2118132
##    7     0.6791885  0.9091633  0.2141941
##    8     0.6791625  0.9091169  0.2168498
##    9     0.6790442  0.9082397  0.2171245
##   10     0.6791496  0.9077783  0.2169414
##   11     0.6794495  0.9082865  0.2183150
##   12     0.6795904  0.9087943  0.2172161
##   13     0.6795758  0.9085632  0.2170330
##   14     0.6795364  0.9084245  0.2177656
##   15     0.6795312  0.9085169  0.2173993
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 12.
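The selection rule stated in the output can be reproduced from the resampling table. The sketch below copies the ROC column and picks the number of components with the largest value. The tolerance-based alternative at the end is an illustration added here, with an arbitrary threshold of 0.001, and is not part of the report.

```r
# ROC values by number of components, copied from the table above.
roc <- c(0.6393174, 0.6692877, 0.6738019, 0.6758076, 0.6772703,
         0.6784624, 0.6791885, 0.6791625, 0.6790442, 0.6791496,
         0.6794495, 0.6795904, 0.6795758, 0.6795364, 0.6795312)
ncomp <- seq_along(roc)

# Rule used in the output: take the ncomp with the largest ROC.
best_ncomp <- ncomp[which.max(roc)]

# Illustrative alternative: the smallest ncomp whose ROC is within
# 0.001 of the best value (threshold chosen arbitrarily).
simple_ncomp <- min(ncomp[roc >= max(roc) - 0.001])

c(best = best_ncomp, simpler = simple_ncomp)
```

The ROC curve is nearly flat beyond seven components, so a simpler model would give up very little discrimination.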

Confusion Matrix for the tuned PLS model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    T
##          F 1636  713
##          T  150  215
##                                           
##                Accuracy : 0.682           
##                  95% CI : (0.6641, 0.6995)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.004356        
##                                           
##                   Kappa : 0.1729          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23168         
##             Specificity : 0.91601         
##          Pos Pred Value : 0.58904         
##          Neg Pred Value : 0.69647         
##              Prevalence : 0.34193         
##          Detection Rate : 0.07922         
##    Detection Prevalence : 0.13449         
##       Balanced Accuracy : 0.57385         
##                                           
##        'Positive' Class : T               
##

Here we train a gradient boosting machine model while tuning its parameters.

Confusion Matrix with gbmfit1

glmnet Model

## glmnet 
## 
## 10858 samples
##    23 predictor
##     2 classes: 'F', 'T' 
## 
## Pre-processing: centered (94), scaled (94) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 9773, 9772, 9772, 9772, 9772, 9772, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        ROC        Sens       Spec       
##   0.1    3.206429e-05  0.6802415  0.8943211  0.241318681
##   0.1    7.407266e-05  0.6802415  0.8943211  0.241318681
##   0.1    1.711175e-04  0.6802415  0.8943211  0.241318681
##   0.1    3.953035e-04  0.6802415  0.8943211  0.241318681
##   0.1    9.132025e-04  0.6802846  0.8947644  0.240329670
##   0.1    2.109616e-03  0.6804602  0.8966207  0.236648352
##   0.1    4.873487e-03  0.6806421  0.9023842  0.227747253
##   0.1    1.125839e-02  0.6804576  0.9143263  0.205549451
##   0.1    2.600834e-02  0.6787664  0.9354952  0.160329670
##   0.1    6.008263e-02  0.6743981  0.9670277  0.084120879
##   0.2    3.206429e-05  0.6802413  0.8941548  0.241428571
##   0.2    7.407266e-05  0.6802413  0.8941548  0.241428571
##   0.2    1.711175e-04  0.6802413  0.8941548  0.241428571
##   0.2    3.953035e-04  0.6802557  0.8941825  0.241428571
##   0.2    9.132025e-04  0.6804270  0.8952354  0.239450549
##   0.2    2.109616e-03  0.6806806  0.8988929  0.233956044
##   0.2    4.873487e-03  0.6809683  0.9074549  0.220549451
##   0.2    1.125839e-02  0.6800156  0.9229714  0.186978022
##   0.2    2.600834e-02  0.6768876  0.9508733  0.126043956
##   0.2    6.008263e-02  0.6686358  0.9924356  0.014175824
##   0.3    3.206429e-05  0.6802420  0.8940440  0.241758242
##   0.3    7.407266e-05  0.6802420  0.8940440  0.241758242
##   0.3    1.711175e-04  0.6802420  0.8940440  0.241758242
##   0.3    3.953035e-04  0.6802907  0.8942933  0.241153846
##   0.3    9.132025e-04  0.6805116  0.8957895  0.238241758
##   0.3    2.109616e-03  0.6808756  0.9008049  0.230604396
##   0.3    4.873487e-03  0.6810121  0.9116110  0.211923077
##   0.3    1.125839e-02  0.6790696  0.9316712  0.169285714
##   0.3    2.600834e-02  0.6738435  0.9614306  0.096208791
##   0.3    6.008263e-02  0.6621744  1.0000000  0.000000000
##   0.4    3.206429e-05  0.6802456  0.8942103  0.241703297
##   0.4    7.407266e-05  0.6802456  0.8942103  0.241703297
##   0.4    1.711175e-04  0.6802456  0.8942103  0.241703297
##   0.4    3.953035e-04  0.6803585  0.8946535  0.240934066
##   0.4    9.132025e-04  0.6806077  0.8969256  0.236923077
##   0.4    2.109616e-03  0.6810925  0.9028553  0.228076923
##   0.4    4.873487e-03  0.6808637  0.9160442  0.203736264
##   0.4    1.125839e-02  0.6781757  0.9390136  0.153296703
##   0.4    2.600834e-02  0.6710244  0.9717103  0.066648352
##   0.4    6.008263e-02  0.6573615  1.0000000  0.000000000
##   0.5    3.206429e-05  0.6802603  0.8942103  0.241813187
##   0.5    7.407266e-05  0.6802603  0.8942103  0.241813187
##   0.5    1.711175e-04  0.6802612  0.8942103  0.241813187
##   0.5    3.953035e-04  0.6804279  0.8950691  0.240549451
##   0.5    9.132025e-04  0.6807016  0.8978677  0.235549451
##   0.5    2.109616e-03  0.6812028  0.9053213  0.224615385
##   0.5    4.873487e-03  0.6804699  0.9194523  0.195604396
##   0.5    1.125839e-02  0.6772468  0.9454422  0.139505495
##   0.5    2.600834e-02  0.6681134  0.9828764  0.036483516
##   0.5    6.008263e-02  0.6538381  1.0000000  0.000000000
##   0.6    3.206429e-05  0.6802591  0.8941272  0.241813187
##   0.6    7.407266e-05  0.6802591  0.8941272  0.241813187
##   0.6    1.711175e-04  0.6802724  0.8942103  0.241813187
##   0.6    3.953035e-04  0.6804657  0.8952353  0.239945055
##   0.6    9.132025e-04  0.6807795  0.8986711  0.234560440
##   0.6    2.109616e-03  0.6812633  0.9072608  0.221263736
##   0.6    4.873487e-03  0.6799424  0.9235530  0.187307692
##   0.6    1.125839e-02  0.6759076  0.9506517  0.125109890
##   0.6    2.600834e-02  0.6651068  0.9979220  0.002747253
##   0.6    6.008263e-02  0.6476886  1.0000000  0.000000000
##   0.7    3.206429e-05  0.6802623  0.8940718  0.241978022
##   0.7    7.407266e-05  0.6802623  0.8940718  0.241978022
##   0.7    1.711175e-04  0.6802898  0.8941272  0.241703297
##   0.7    3.953035e-04  0.6805081  0.8956233  0.239395604
##   0.7    9.132025e-04  0.6808912  0.8995027  0.233241758
##   0.7    2.109616e-03  0.6812659  0.9090341  0.218021978
##   0.7    4.873487e-03  0.6794641  0.9270440  0.179285714
##   0.7    1.125839e-02  0.6742793  0.9551963  0.111263736
##   0.7    2.600834e-02  0.6624273  1.0000000  0.000000000
##   0.7    6.008263e-02  0.6454073  1.0000000  0.000000000
##   0.8    3.206429e-05  0.6802720  0.8941272  0.241978022
##   0.8    7.407266e-05  0.6802720  0.8941272  0.241978022
##   0.8    1.711175e-04  0.6803014  0.8942657  0.241538462
##   0.8    3.953035e-04  0.6805375  0.8959004  0.238571429
##   0.8    9.132025e-04  0.6809874  0.9003893  0.231923077
##   0.8    2.109616e-03  0.6812187  0.9109459  0.214560440
##   0.8    4.873487e-03  0.6790118  0.9306183  0.171098901
##   0.8    1.125839e-02  0.6727659  0.9592694  0.099120879
##   0.8    2.600834e-02  0.6601809  1.0000000  0.000000000
##   0.8    6.008263e-02  0.6353261  1.0000000  0.000000000
##   0.9    3.206429e-05  0.6802894  0.8941826  0.242032967
##   0.9    7.407266e-05  0.6802894  0.8941826  0.242032967
##   0.9    1.711175e-04  0.6803392  0.8944319  0.241208791
##   0.9    3.953035e-04  0.6805658  0.8961775  0.238186813
##   0.9    9.132025e-04  0.6810919  0.9011929  0.230989011
##   0.9    2.109616e-03  0.6811689  0.9131346  0.210384615
##   0.9    4.873487e-03  0.6785547  0.9340816  0.163956044
##   0.9    1.125839e-02  0.6714683  0.9633978  0.087857143
##   0.9    2.600834e-02  0.6581901  1.0000000  0.000000000
##   0.9    6.008263e-02  0.6192685  1.0000000  0.000000000
##   1.0    3.206429e-05  0.6802743  0.8942103  0.241923077
##   1.0    7.407266e-05  0.6802743  0.8942103  0.241923077
##   1.0    1.711175e-04  0.6803633  0.8945981  0.241043956
##   1.0    3.953035e-04  0.6806191  0.8966485  0.237362637
##   1.0    9.132025e-04  0.6811696  0.9023566  0.230000000
##   1.0    2.109616e-03  0.6810133  0.9147696  0.206263736
##   1.0    4.873487e-03  0.6781338  0.9368800  0.157637363
##   1.0    1.125839e-02  0.6701328  0.9676096  0.074450549
##   1.0    2.600834e-02  0.6560262  1.0000000  0.000000000
##   1.0    6.008263e-02  0.5872780  1.0000000  0.000000000
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.7 and lambda = 0.002109616.

##    alpha      lambda
## 66   0.7 0.002109616
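The selection rule stated in the output ("ROC was used to select the optimal model using the largest value") can be illustrated with a minimal base R sketch. The data frame below holds only a handful of rows copied from the resampling table, not the full grid, and the name `tune_results` is an assumption made for this illustration.

```r
# A few rows copied from the resampling table above (not the full grid).
# `tune_results` is a hypothetical name used only for this sketch.
tune_results <- data.frame(
  alpha  = c(0.5, 0.6, 0.7, 0.8, 1.0),
  lambda = c(2.109616e-03, 2.109616e-03, 2.109616e-03, 2.109616e-03, 9.132025e-04),
  ROC    = c(0.6812028, 0.6812633, 0.6812659, 0.6812187, 0.6811696)
)

# Pick the row with the largest resampled ROC value.
best <- tune_results[which.max(tune_results$ROC), c("alpha", "lambda")]
print(best)

# Gap between the best and the second-best ROC among these rows.
roc_sorted <- sort(tune_results$ROC, decreasing = TRUE)
roc_gap <- roc_sorted[1] - roc_sorted[2]
print(roc_gap)
```

The gap between the top two ROC values is only a few millionths, so the preference for alpha = 0.7 over alpha = 0.6 is probably well within resampling noise.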

Confusion Matrix with elasticnet

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    T
##          F 1640  710
##          T  146  218
##                                           
##                Accuracy : 0.6846          
##                  95% CI : (0.6667, 0.7021)
##     No Information Rate : 0.6581          
##     P-Value [Acc > NIR] : 0.001807        
##                                           
##                   Kappa : 0.1793          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23491         
##             Specificity : 0.91825         
##          Pos Pred Value : 0.59890         
##          Neg Pred Value : 0.69787         
##              Prevalence : 0.34193         
##          Detection Rate : 0.08032         
##    Detection Prevalence : 0.13412         
##       Balanced Accuracy : 0.57658         
##                                           
##        'Positive' Class : T               
##

Confusion matrix of gbmFit2

SELECT MODELS

We compare the models using the accuracy, precision, sensitivity, specificity, and F1 score computed from each confusion matrix.

##             glm.mod11 glm.mod12 ridge.mod1 ridge.mod2 lasso.mod1 lasso.mod2
## Accuracy    0.6879145 0.6827561 0.66064849  0.6580693  0.6842299  0.6713338
## Precision   0.6004963 0.5865633 0.58974359         NA  0.5977961  0.6022727
## Sensitivity 0.2607759 0.2446121 0.02478448  0.0000000  0.2338362  0.1142241
## Specificity 0.9098544 0.9104143 0.99104143  1.0000000  0.9182531  0.9608063
## F1          0.3636364 0.3452471 0.04756980         NA  0.3361735  0.1920290
##              pls.mod1  pls.mod2    en.mod
## Accuracy    0.6820192 0.6820192 0.6845984
## Precision   0.5885559 0.5890411 0.5989011
## Sensitivity 0.2327586 0.2316810 0.2349138
## Specificity 0.9154535 0.9160134 0.9182531
## F1          0.3335907 0.3325599 0.3374613

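With a sensitivity of 0, a specificity of 1, and an undefined precision (NA), ridge.mod2 assigns every case to the negative class, so it is uninformative rather than overfitted. Among the remaining models, glm.mod11 has the highest accuracy (0.688), sensitivity (0.261), and F1 score (0.364), and lasso.mod1 is close behind on each metric (accuracy 0.684, precision 0.598, sensitivity 0.234, specificity 0.918).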
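As a check on how the rows of the comparison table relate to a confusion matrix, here is a minimal base R sketch that recomputes the en.mod column from the elastic net counts shown earlier (positive class T). The helper name `confusion_metrics` is hypothetical. The counts used for ridge.mod2 are an assumption: they are inferred from its sensitivity of 0 and specificity of 1 on the same test set (928 positives, 1786 negatives), not copied from a printed matrix.

```r
# Hypothetical helper: metrics from the four cells of a 2x2 confusion matrix.
confusion_metrics <- function(tp, fp, fn, tn) {
  precision   <- tp / (tp + fp)   # 0/0 gives NaN when no positives are predicted
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(
    Accuracy    = (tp + tn) / (tp + fp + fn + tn),
    Precision   = precision,
    Sensitivity = sensitivity,
    Specificity = specificity,
    F1          = 2 * precision * sensitivity / (precision + sensitivity)
  )
}

# Elastic net counts from the confusion matrix above:
# Prediction T / Reference T = 218, Prediction T / Reference F = 146,
# Prediction F / Reference T = 710, Prediction F / Reference F = 1640.
en_metrics <- confusion_metrics(tp = 218, fp = 146, fn = 710, tn = 1640)
print(round(en_metrics, 7))

# Assumed counts for ridge.mod2: no case is predicted positive.
ridge2_metrics <- confusion_metrics(tp = 0, fp = 0, fn = 928, tn = 1786)
print(round(ridge2_metrics, 7))
```

Under these assumptions, the undefined precision and F1 for ridge.mod2 come from a division by zero: the model never predicts the positive class.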

Using pROC package.

We can plot the ROC curve and extract the AUC value.
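The report uses the pROC package for this step. To show what the AUC measures without extra dependencies, the sketch below computes it in base R from the rank (Mann-Whitney) formula and builds the points of the ROC curve. The function names `auc_rank` and `roc_points` and the toy labels and scores are assumptions made for illustration; they are not the survey data or the model's predictions.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) formula.
# labels: 0/1 vector; scores: predicted probabilities for class 1.
auc_rank <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical helper: ROC curve points, one step per observation.
# Tied scores are not grouped, which is enough for a sketch.
roc_points <- function(labels, scores) {
  lab <- labels[order(scores, decreasing = TRUE)]
  data.frame(
    fpr = c(0, cumsum(lab == 0) / sum(lab == 0)),
    tpr = c(0, cumsum(lab == 1) / sum(lab == 1))
  )
}

# Toy data, not taken from the report.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_value <- auc_rank(labels, scores)
curve     <- roc_points(labels, scores)
print(auc_value)
print(curve)

# To draw the curve interactively:
# plot(curve$fpr, curve$tpr, type = "s", xlab = "False positive rate",
#      ylab = "True positive rate")
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, which is the scale on which the models are compared here.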

Best Model with AUC

The Lasso model has the best Area Under the Curve.

We refit the lasso.mod1 model on the entire dataset

The best model is the Lasso model

The statistics of the best model are given below.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8208 3520
##          1  796 1048
##                                           
##                Accuracy : 0.682           
##                  95% CI : (0.6741, 0.6898)
##     No Information Rate : 0.6634          
##     P-Value [Acc > NIR] : 2.227e-06       
##                                           
##                   Kappa : 0.1653          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.22942         
##             Specificity : 0.91159         
##          Pos Pred Value : 0.56833         
##          Neg Pred Value : 0.69986         
##              Prevalence : 0.33658         
##          Detection Rate : 0.07722         
##    Detection Prevalence : 0.13587         
##       Balanced Accuracy : 0.57051         
##                                           
##        'Positive' Class : 1               
##

AUC of the best model

Coefficients of the best model

A dot in place of a coefficient means that the lasso penalty shrank that coefficient to exactly zero, so the corresponding level of the variable is dropped from the model.

## 96 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -0.779627250
## (Intercept)  .          
## SEX2         0.277009383
## AGEG.F72     0.330360460
## AGEG.F73     0.070974710
## AGEG.F74     .          
## AGEG.F75    -0.136878425
## AGEG.F76    -0.256333773
## AGEG.F77    -0.493612960
## X_IMPRACE2   0.292997741
## X_IMPRACE3  -0.286803639
## X_IMPRACE4   .          
## X_IMPRACE5   .          
## X_IMPRACE6  -0.004331095
## EDUCAL2     -0.054088606
## EDUCAL3     -0.067669698
## EDUCAL4      .          
## EDUCAL5      0.035873647
## EDUCAL6      .          
## X_INCOMG2    0.146624310
## X_INCOMG3    .          
## X_INCOMG4    .          
## X_INCOMG5    0.064057869
## X_INCOMG9   -0.036774320
## X_RFBMI52    .          
## X_RFBMI59   -0.028364176
## SMOKE1002    0.125876474
## SMOKE1007    .          
## COPD2       -0.124264966
## COPD7        .          
## EMPHY2       0.013162274
## EMPHY7      -0.402759362
## EMPHY9       .          
## DEPRESS2     0.018738710
## DEPRESS7     .          
## DEPRESS9     .          
## BRONCH2     -0.037758413
## BRONCH7     -0.328147049
## BRONCH9      .          
## DUR.30D10    0.079177130
## DUR.30D11    0.184955227
## DUR.30D12    0.003960264
## DUR.30D2     .          
## DUR.30D6     .          
## DUR.30D7    -0.420857016
## INCINDT2     .          
## INCINDT3     0.938995906
## INCINDT7     .          
## LAST.MD5     .          
## LAST.MD6    -0.119977273
## LAST.MD7    -0.401266743
## LAST.MD9    -0.585236929
## LAST.MED5   -0.219709485
## LAST.MED6   -0.296826230
## LAST.MED7   -0.415750708
## LAST.MED9   -2.080335812
## LAST.SYMP2   .          
## LAST.SYMP3   0.122806811
## LAST.SYMP4   0.086519091
## LAST.SYMP5   .          
## LAST.SYMP6   0.133828101
## LAST.SYMP7   .          
## LAST.SYMP9  -0.314854748
## COMPASTH11  -0.199542636
## COMPASTH2   -0.209621740
## COMPASTH3   -0.035235080
## COMPASTH6    .          
## COMPASTH7   -0.191927449
## INS22        0.098402396
## INS25       -0.017884077
## INS27       -0.093557160
## INS29       -0.988939087
## ER.VISIT2   -0.059881236
## ER.VISIT5   -0.245948577
## ER.VISIT6   -0.325013005
## ER.VISIT7   -0.731852448
## HOSP.VST2    .          
## HOSP.VST4    .          
## HOSP.VST5   -0.093194618
## HOSP.VST6   -0.024683506
## HOSP.VST7   -0.782042795
## HOSP.VST9   -1.839595107
## ASRXCOST2    .          
## ASRXCOST5    .          
## ASRXCOST7    .          
## ASRXCOST9   -0.177166670
## WORKTALK2   -0.548777713
## WORKTALK6   -0.564731481
## WORKTALK7   -0.230557453
## WORKTALK8    .          
## WORKTALK9   -0.289124337
## ACT.DAY302   0.098410549
## ACT.DAY303   0.149522931
## ACT.DAY304   0.142025917
## ACT.DAY305   .          
## ACT.DAY307   .
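To read such a table programmatically, the retained terms are simply the non-zero entries. The sketch below is a minimal base R illustration on a few values copied from the output above, with 0 standing in for each "." entry. The vector name `lasso_coefs` is an assumption; in practice the values would come from the fitted model's sparse coefficient matrix.

```r
# A few coefficients copied from the table above; 0 replaces a "." entry.
# `lasso_coefs` is a hypothetical name used only for this sketch.
lasso_coefs <- c(
  SEX2       = 0.277009383,
  AGEG.F74   = 0,
  X_IMPRACE4 = 0,
  INCINDT3   = 0.938995906,
  LAST.MED9  = -2.080335812
)

# Levels kept by the lasso (non-zero) and levels dropped (exactly zero).
kept    <- names(lasso_coefs)[lasso_coefs != 0]
dropped <- names(lasso_coefs)[lasso_coefs == 0]
print(kept)
print(dropped)
```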

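We look at the odds ratio of each variable

A value greater than 1 means that the level increases the odds of good self-management relative to the baseline level, a value below 1 means that it decreases them, and a value of exactly 1 corresponds to a coefficient that the lasso set to zero. For example, for the SEX variable, women (SEX2) are more likely to have good asthma management skills than men (SEX1, the baseline), with an odds ratio of about 1.32. The other variables can be interpreted in the same way.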

## 96 x 1 Matrix of class "dgeMatrix"
##                    s0
## (Intercept) 0.4585769
## (Intercept) 1.0000000
## SEX2        1.3191787
## AGEG.F72    1.3914696
## AGEG.F73    1.0735541
## AGEG.F74    1.0000000
## AGEG.F75    0.8720762
## AGEG.F76    0.7738836
## AGEG.F77    0.6104170
## X_IMPRACE2  1.3404398
## X_IMPRACE3  0.7506591
## X_IMPRACE4  1.0000000
## X_IMPRACE5  1.0000000
## X_IMPRACE6  0.9956783
## EDUCAL2     0.9473482
## EDUCAL3     0.9345691
## EDUCAL4     1.0000000
## EDUCAL5     1.0365249
## EDUCAL6     1.0000000
## X_INCOMG2   1.1579189
## X_INCOMG3   1.0000000
## X_INCOMG4   1.0000000
## X_INCOMG5   1.0661541
## X_INCOMG9   0.9638936
## X_RFBMI52   1.0000000
## X_RFBMI59   0.9720343
## SMOKE1002   1.1341421
## SMOKE1007   1.0000000
## COPD2       0.8831458
## COPD7       1.0000000
## EMPHY2      1.0132493
## EMPHY7      0.6684729
## EMPHY9      1.0000000
## DEPRESS2    1.0189154
## DEPRESS7    1.0000000
## DEPRESS9    1.0000000
## BRONCH2     0.9629455
## BRONCH7     0.7202571
## BRONCH9     1.0000000
## DUR.30D10   1.0823960
## DUR.30D11   1.2031646
## DUR.30D12   1.0039681
## DUR.30D2    1.0000000
## DUR.30D6    1.0000000
## DUR.30D7    0.6564840
## INCINDT2    1.0000000
## INCINDT3    2.5574122
## INCINDT7    1.0000000
## LAST.MD5    1.0000000
## LAST.MD6    0.8869406
## LAST.MD7    0.6694715
## LAST.MD9    0.5569739
## LAST.MED5   0.8027520
## LAST.MED6   0.7431731
## LAST.MED7   0.6598447
## LAST.MED9   0.1248883
## LAST.SYMP2  1.0000000
## LAST.SYMP3  1.1306660
## LAST.SYMP4  1.0903722
## LAST.SYMP5  1.0000000
## LAST.SYMP6  1.1431963
## LAST.SYMP7  1.0000000
## LAST.SYMP9  0.7298949
## COMPASTH11  0.8191053
## COMPASTH2   0.8108909
## COMPASTH3   0.9653784
## COMPASTH6   1.0000000
## COMPASTH7   0.8253667
## INS22       1.1034067
## INS25       0.9822749
## INS27       0.9106860
## INS29       0.3719711
## ER.VISIT2   0.9418764
## ER.VISIT5   0.7819624
## ER.VISIT6   0.7225180
## ER.VISIT7   0.4810171
## HOSP.VST2   1.0000000
## HOSP.VST4   1.0000000
## HOSP.VST5   0.9110162
## HOSP.VST6   0.9756186
## HOSP.VST7   0.4574705
## HOSP.VST9   0.1588817
## ASRXCOST2   1.0000000
## ASRXCOST5   1.0000000
## ASRXCOST7   1.0000000
## ASRXCOST9   0.8376402
## WORKTALK2   0.5776554
## WORKTALK6   0.5685128
## WORKTALK7   0.7940908
## WORKTALK8   1.0000000
## WORKTALK9   0.7489191
## ACT.DAY302  1.1034157
## ACT.DAY303  1.1612801
## ACT.DAY304  1.1526065
## ACT.DAY305  1.0000000
## ACT.DAY307  1.0000000
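The odds ratios above are the exponentials of the lasso coefficients. A minimal base R sketch, using a few log-odds values copied from the coefficient table (0 stands for a "." entry), reproduces the corresponding rows. The vector name `log_odds` is an assumption made for this illustration.

```r
# Log-odds coefficients copied from the lasso output; 0 replaces a "." entry.
# `log_odds` is a hypothetical name used only for this sketch.
log_odds <- c(
  "(Intercept)" = -0.779627250,
  SEX2          = 0.277009383,
  AGEG.F74      = 0,
  INCINDT3      = 0.938995906,
  LAST.MED9     = -2.080335812
)

# Odds ratio = exp(coefficient); a zero coefficient gives an odds ratio of 1.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 7))

# Probability implied by the intercept alone, that is, for a respondent
# at the baseline level of every predictor.
baseline_prob <- plogis(log_odds[["(Intercept)"]])
print(baseline_prob)
```

On this reading, the intercept's value of 0.4586 is the baseline odds rather than a ratio, which corresponds to a probability of about 0.31 for a respondent at every baseline level.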

DATA 621 Final Project

Alain Kuiete Tchoupou

11/18/2020
