Setup and Data Loading

Let’s load and view the data’s first few lines as well as summary statistics about it

# load the data
df <- read.csv("./output/bronch.csv")

# remove irrelevant columns
#df[ ,c('X','pathology', 'pathology_original', 'id_type', 'minor_18','geopoint', 'n_visits', 'dayofyear', 'date_numerical')] <- list(NULL)
df$X <- NULL

# recode relevant columns in the relevant encoding: float, categorical
df$bronch <- factor(df$bronch)
kable(summary(df))
status time_delay city city_id age zip lat lon date id_personal pathology_eng visits month dayofmonth dayofweek hour x_age_scaled x_visits x_date x_month x_dayofweek x_hour x_lon x_lat x_zip pathology_id year station_lat pollution distance station_city station_lon_air station_area bronch
Min. :4.000 Min. : 0.00 Terrassa :1913 Min. : 1.00 Min. : 1.00 Min. :8004 Min. :41.21 Min. :1.624 2018-05-01 15:02:00: 16 Min. : 0.00 bronchitis :7872 Min. : 1 Min. : 1.000 Min. : 1.00 Min. :0.000 Min. : 0.00 Min. :-1.25041 Min. :-2.015002 Min. :-2.86046 Min. :-1.3824150 Min. :-1.47097 Min. :-2.52934 Min. :-4.74391 Min. :-3.097613 Min. :-1.20005 Min. : 1.00 Min. :2015 Min. :41.20 Min. : 2.914 Min. : 0.001931 TERRASSA :2310 Min. :1.618 rural : 70 0:7872
1st Qu.:4.000 1st Qu.: 45.00 Sant Cugat :1902 1st Qu.: 26.00 1st Qu.: 8.00 1st Qu.:8184 1st Qu.:41.40 1st Qu.:2.048 2017-01-08 19:00:00: 13 1st Qu.: 46.00 fever :2052 1st Qu.:1127 1st Qu.: 3.000 1st Qu.: 6.00 1st Qu.:1.000 1st Qu.:10.00 1st Qu.:-0.98151 1st Qu.:-0.666875 1st Qu.:-0.60114 1st Qu.:-0.8573621 1st Qu.:-0.98581 1st Qu.:-0.68611 1st Qu.:-0.59755 1st Qu.:-1.033615 1st Qu.:-0.62159 1st Qu.: 33.00 1st Qu.:2016 1st Qu.:41.40 1st Qu.:27.950 1st Qu.: 1.050538 SABADELL :1863 1st Qu.:2.042 suburban: 4529 1:7872
Median :4.000 Median : 60.00 Sabadell :1383 Median : 50.00 Median :36.00 Median :8222 Median :41.50 Median :2.090 2017-12-27 08:56:00: 13 Median : 55.00 tonsillitis : 643 Median :1584 Median : 6.000 Median :16.00 Median :3.000 Median :13.00 Median : 0.09405 Median :-0.119722 Median : 0.30771 Median :-0.0697829 Median :-0.01547 Median :-0.13314 Median :-0.18707 Median : 0.018138 Median :-0.49947 Median : 33.00 Median :2017 Median :41.51 Median :36.589 Median : 1.882925 BARCELONA :1711 Median :2.101 urban :11145 NA
Mean :4.046 Mean : 70.24 Barcelona :1360 Mean : 70.56 Mean :35.88 Mean :8374 Mean :41.50 Mean :2.110 2016-03-14 13:32:00: 12 Mean : 55.01 gastroenteritis: 444 Mean :1687 Mean : 6.266 Mean :15.35 Mean :3.011 Mean :13.54 Mean : 0.08927 Mean : 0.003351 Mean : 0.03999 Mean :-0.0000326 Mean :-0.01008 Mean :-0.03385 Mean : 0.01476 Mean : 0.005824 Mean :-0.01001 Mean : 76.14 Mean :2017 Mean :41.49 Mean :34.399 Mean : 2.579297 SANT FELIU DE LLOBREGAT:1662 Mean :2.111 NA NA
3rd Qu.:4.000 3rd Qu.: 90.00 Sant Feliu de Llobregat: 644 3rd Qu.:114.00 3rd Qu.:56.00 3rd Qu.:8520 3rd Qu.:41.57 3rd Qu.:2.147 2017-11-13 10:10:00: 12 3rd Qu.: 68.00 flu : 378 3rd Qu.:2463 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:5.000 3rd Qu.:18.00 3rd Qu.: 0.86232 3rd Qu.: 0.932678 3rd Qu.: 0.78773 3rd Qu.: 0.9803229 3rd Qu.: 0.95486 3rd Qu.: 0.78847 3rd Qu.: 0.37533 3rd Qu.: 0.732254 3rd Qu.: 0.45821 3rd Qu.: 92.00 3rd Qu.:2018 3rd Qu.:41.56 3rd Qu.:42.260 3rd Qu.: 3.064469 SANT CUGAT DEL VALLÈS :1658 3rd Qu.:2.153 NA NA
Max. :5.000 Max. :150.00 Cerdanyola : 521 Max. :188.00 Max. :99.00 Max. :8980 Max. :41.98 Max. :2.696 2017-11-25 11:16:00: 12 Max. :102.00 vomiting : 371 Max. :2695 Max. :12.000 Max. :31.00 Max. :6.000 Max. :23.00 Max. : 2.51408 Max. : 1.210445 Max. : 1.71578 Max. : 1.5053757 Max. : 1.44003 Max. : 1.71008 Max. : 5.75296 Max. : 5.117218 Max. : 1.93651 Max. :221.00 Max. :2018 Max. :42.00 Max. :52.057 Max. :17.628569 GRANOLLERS :1471 Max. :2.817 NA NA
NA NA (Other) :8021 NA NA NA NA NA (Other) :15666 NA (Other) :3984 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA (Other) :5069 NA NA NA
#summary(df)

Data Checks

Most of the data exploration has been done before at the Python notebook and mentioned in part 1 of this paper. Here I will perform specific data exploration and verfiications reelvant to the modeling dne here: checking for variance among relevant variables, outliers, and class balance.

ggplot(data = df, aes(x = age)) +
  geom_histogram(binwidth = 5, alpha=0.6) +
  scale_x_continuous(breaks = seq(0,100,by=5)) +
  labs(title = "Ages of Patients", y = "Count", x = "Age of Patients") +
  theme_minimal() +
  theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))

ggplot(data = df, aes(x = age, fill=bronch, color=bronch)) +
  #geom_histogram(binwidth = 5, alpha=0.5) +
  geom_density(alpha=0.5) +
  scale_x_continuous(breaks = seq(0,100,by=5)) +
  labs(title = "Ages of Patients", y = "Count", x = "Age of Patients") +
  theme_minimal() +
  theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))

The distribution of age per class (between Bronchitis patients and not) is sufficiently similar. Naturally, there are more babies and older people who have this condition, while more young people are without this condition (and most likely healthier). The distribution of age in both classes is bi-modal rather than normal. This makes sense, since this is a This may pose some problems in the modeling because most models and procedures work ideally with a normally distributed dataset.

Class balance by city_id

## two-way contingency table of categorical outcome and predictors we want
## to make sure there are not 0 cells

xtabs(~bronch + city_id, data = df)
##       city_id
## bronch   1   2   3   4   5   6   7   8   9  10  11  12  13  14  16  17  18
##      0   7  84  29  98  28 263   1 103 121  48  21   1 100 197   3 123 112
##      1   6  65  20  94  31 258   5 109 131  58  11   0  81 188   4 105  85
##       city_id
## bronch  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
##      0  80  35  52 153  47 108  27 357  89   7 220   6  50  40   3 142   3
##      1  52  29  55 115  81 144  29 275  94  16 156   0  68  19   0 154   2
##       city_id
## bronch  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  53
##      0  85  27   7 169  84  72 190 215  53  76  49  12  29  42 139   3  34
##      1  91  42  12 177  86  55 207 211  75 114  60   1   6  73 116   6  50
##       city_id
## bronch  54  55  56  57  58  60  62  63  64  65  66  67  68  69  70  71  72
##      0   0  23  14  64   4   1   1  88 107  17   6  35  10  45  23  25  10
##      1   5  13   6 102   0   0   2  75 123   0   0  52  20  45  37  17   8
##       city_id
## bronch  73  74  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##      0  19  20  62  27  30   5  17  30   9   3  19   4  18   8   5  20  56
##      1   8  16  86  22  35   0  25  45  20   0  55   7  40   6  39   8  48
##       city_id
## bronch  92  93  94  96  97  98  99 100 101 102 103 104 105 106 107 108 109
##      0   6  14  69  10  11  16 241   7  27  31 164  48   7  22  96  19  10
##      1  12  21  42   1  10  29 210   3  13  30 166  25  13  33  87  38   6
##       city_id
## bronch 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
##      0  32  58   2   7 328 196 194  22   5   6  18  49  48  48  47   5  18
##      1  37  35   4  20 316 164 222  15   7   0  17  16  67  44  36  17  10
##       city_id
## bronch 128 129 131 134 135 136 142 144 145 146 147 149 152 153 155 159 160
##      0   3   2 232  37   2   1   2  71   1  12   1   4   5   7  82 108  90
##      1   8   4 217  26   0   0   4 157   0   4   7   0   0   0 131  80  61
##       city_id
## bronch 162 166 167 172 174 175 181 182 183 184 185 186 187 188
##      0   5   6 119  88  24   2  60  24   5  58  18  19  11  23
##      1   0   4 128 109  36  12  43  33  10  42  35  20  24  24

Since I will be later including factor(city_id), we need to make sure these are somewhat balanced. There are some cities with 0 patient of one type (normally). To prevent complete overfitting and biased predictions for these cities, I will remove them from my dataset. The zero-patient stratas are only missing bronchitis patients. Hence, I will find all cities with missing bronchitis patients and remove them.

setDT(df) # setting as data.table for more efficient data processing
cities_with_nonbronch <- unique(df[bronch==0, city_id])
cities_with_bronch <- unique(df[bronch==1, city_id])
imblanaced_cities <- setdiff(cities_with_nonbronch, cities_with_bronch)
size_beofre_balancing <- nrow(df)

# balance the dataframe by removing imblanaced cities:
df <- df[! city_id %in% imblanaced_cities,]
size_after_balancing <- nrow(df)
print(paste("Balancing removed", size_beofre_balancing - size_after_balancing, "rows, now remaining with ", nrow(df)))
## [1] "Balancing removed 77 rows, now remaining with  15667"

Class balance by Pollution level

ggplot(data = df, aes(x = pollution, fill=factor(bronch), color=factor(bronch))) +
  geom_density(alpha=0.5) +
  scale_x_continuous(breaks = seq(0,100,by=5)) +
  labs(title = "Pollution Levels for Calls with bronchitis or Withouot", y = "Density", x = "Pollution") +
  theme_minimal() +
  theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))

Surprisingly, there seems to be no difference between the pollution levels for bronchitis patients versus not! This might mean that our efforts to establish any meaningful effect by pollution would be futile. I will make a few simple regression models, but will then transition to modeling the chance for a bronchitis call with predictive purposes instead, taking advantage of other variables.

Modeling

In this section, I will use logistic regression to model the chance of a call to be about bronchitis by the effect of the pollution level (at the closest measuring station). This will not establish causality perfectly because there are still many omitted variables and data limitations such as class imbalances, too few pollution station measurements (in both time and space), missing cities, potential for negative reverse causality (where patients with bronchitis may move away from a polluted area) and there is no good way to emulate a randomized experiment. This analysis could at least suggest if pollution seems to have any effect or not, although the results should be taken with a grain of salt. The null hypothesis would be that pollution would have no effect on the chance for a call to be about bronchitis. The alternative hypothesis would be that pollution would have some non-zero effect on the chance for a call to be about bronchitis.

Since it is hard to establish causality, I will try at the end of this analysis to find a sufficiently accurate model for predictive purposes. By trying various models and optimizing for their performance, we might at least find a sufficiently accurate predictive model, even if it is less sparse and theoretically founded.

I will develop models from the simplest possible model to more complex ones, slowly adding more features or making the form more flexible. I use logistic regression since the outcome variable is binary, and I’m looking predict probabilities bounded between [0,1]. If I would use a normal regression, the probabilities predicted could be outside of the boundaries of [0,1], but a logistic regression compresses the probabilities to be within that boundary with a sigmoid like function.

Separate data into test and training data

Since I am going to test the performance of various models and be able to use it for predictive purposes, it is important to make sure I am not overfitting the model to the training data. I will do this by training and checking the models on the training data while keeping 20% of the data aside as a test dataset.

set.seed(1)
# split dataframe into training (80%) and test (20%)
split_idx <- sample(c(TRUE, FALSE), nrow(df), replace = T, prob = c(0.8,0.2))
train <- df[split_idx, ]
test <- df[!split_idx, ]
df$bronch <- as.numeric(df$bronch)

Visualize a simple Linear model

# Visualize
ggplot(aes(x=pollution, y=bronch), data=df) +
  geom_point(alpha = .15) +
  geom_smooth(method = "glm") + #, method.args = list(family = "binomial")) +
  ggtitle("Logistic regression model fit") +
  xlab("Pollution") +
  ylab("Probability of bronchitis")

Unfortunately, by this plotting of values of pollution per class of bronchitis or not, there is almost no difference between the distribution of pollution levels. A linear model is slightly skewed up, but not sufficiently to be able to predict bronchitis or not.
(*Note: the logistic nonlinear model could not be visualized on the plot above for technical reasons, so I instead used a linear model for visualization purposes.)

Model 1: Only Pollution

The syntax of the glm function is similar to that of lm, except that we must pass the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

model1_pol <- glm(bronch ~ pollution , family = "binomial", data = train)
summary(model1_pol)
## 
## Call:
## glm(formula = bronch ~ pollution, family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.243  -1.187   1.113   1.168   1.292  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.290544   0.070804  -4.104 4.07e-05 ***
## pollution    0.008534   0.001991   4.287 1.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17250  on 12442  degrees of freedom
## Residual deviance: 17231  on 12441  degrees of freedom
## AIC: 17235
## 
## Number of Fisher Scoring iterations: 3

A simple logistic regression on pollution shows that while the estimate on pollution is miniscule, wehn it is alone it is statistically significant. As it is a logistic regression, there is no simple measure of accuracy such as an R-Squared to report.

Model 1 performance evaluation

First, before evaluating performance on the test set, I will examine wether the model have succeeded sufficiently explaining the variance within-sample (on the same training sdata it was trained on). If it has, then I will continue to test the model on the test dataset.

# building a function to test performance within-sample
test_predict_within <- function(model) {
  ytrain_fac <- factor(train$bronch)
  ytrain_pred_fac <- as.vector(round(fitted(model)))
  ytrain_pred_fac <- factor(ytrain_pred_fac, levels=c("0","1"))
  print(confusionMatrix(ytrain_fac, ytrain_pred_fac))
}
test_predict_within(model1_pol)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2355 3857
##          1 2197 4034
##                                           
##                Accuracy : 0.5135          
##                  95% CI : (0.5046, 0.5223)
##     No Information Rate : 0.6342          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0265          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5174          
##             Specificity : 0.5112          
##          Pos Pred Value : 0.3791          
##          Neg Pred Value : 0.6474          
##              Prevalence : 0.3658          
##          Detection Rate : 0.1893          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5143          
##                                           
##        'Positive' Class : 0               
## 

Even when testing model1_pol (only pollution) within-sample (on the same data it was trained on), the model performs poorly: accuracy is only %51.35, when the baseline by random guessing should be %50 (since the dataset is evenly distributed). Hence, pollution might have helped in explaining an additional 1.35% of the data over the 50% random guess expectation, which is very poor. It was not biased much towards one class. Pollution therefore does not seem to have a meaningful effect on the chance for a call to be bronchitis or not.

Model 1: Adding distance from air measurement station

The data only had 65 air measurement stations in teh region, so that distances from the nearest pollution measurement system varied greatly. Hence, visits with higher distances from the matched nearest station are less accurate. Therefore, I will try including the distance either alone or with an interaction term.

model1_pol_dis <- glm(bronch ~ pollution + distance , family = "binomial", data = train)
model1_pol_dis_int <- glm(bronch ~ pollution + distance + I(distance*pollution) , family = "binomial", data = train)
stargazer(model1_pol_dis, model1_pol_dis_int, type = "text")
## 
## ====================================================
##                             Dependent variable:     
##                         ----------------------------
##                                    bronch           
##                              (1)            (2)     
## ----------------------------------------------------
## pollution                  0.008***       0.007**   
##                            (0.002)        (0.003)   
##                                                     
## distance                   0.019**        -0.002    
##                            (0.008)        (0.034)   
##                                                     
## I(distance * pollution)                    0.001    
##                                           (0.001)   
##                                                     
## Constant                  -0.326***      -0.279***  
##                            (0.072)        (0.102)   
##                                                     
## ----------------------------------------------------
## Observations                12,443        12,443    
## Log Likelihood            -8,612.544    -8,612.332  
## Akaike Inf. Crit.         17,231.090    17,232.660  
## ====================================================
## Note:                    *p<0.1; **p<0.05; ***p<0.01

Distance alone showed to have a statistically meaningful coefficient, but not the interaction term. However, it does not make theoretic sense: distance from an air measurement station seemed to increase the likelihood for bronchitis, which is what we would expect if pollution was greater. That could be the case if stations are sparser in areas like cities where pollution is greater, but it’s not the case. Therefore, this is likely a spurious correlation and a confounding variable to add in this form for now.

Model 2: Personal Variables: Age and Number of Visits

Here I will try to add in personal variables: age, and later on - number of visits. Age is likely to have a nonlinear effect, since I’ve seen that babies and older people tend to get Bronchitis more. Therefore, I will include second order and test third order of age for prediction. I will also test an interaction term between pollution and age, since it might be that pollution affects more the delicate lungs of babies and older people. Since the interaction term might have nonlinear effects, I will also test adding higher orders of the interaction term.

Unfortunately, the standard F-test does not extend from Linear models to nonlinear logistic regression, therefore I will not be able to test these models using the F-test (Principles of Econometrics, 4th Edition; and uPenn STAT 501 Course). Hence, I will use other methods: I will asses the significance of added variables by the p-values of their coefficients, and compare the performance of the models against the Null hypothesis model without them (Model 1 in this case).

m2_age2 <- glm(bronch ~ pollution + age + I(age^2), family = "binomial", data = train)
m2_age3 <- glm(bronch ~ pollution + age + I(age^2) + I(age^3), family = "binomial", data = train)
m2_ageint2 <- glm(bronch ~ pollution + age + I(age^2) + I(pollution*age) + I((pollution*age)^2), family = "binomial", data = train)

stargazer(model1_pol, m2_age2, m2_age3, m2_ageint2, type="text")
## 
## ==================================================================
##                                   Dependent variable:             
##                       --------------------------------------------
##                                          bronch                   
##                          (1)        (2)         (3)        (4)    
## ------------------------------------------------------------------
## pollution              0.009***   0.007***   0.007***    0.008**  
##                        (0.002)    (0.002)     (0.002)    (0.004)  
##                                                                   
## age                              -0.027***   -0.108***  -0.029*** 
##                                   (0.002)     (0.006)    (0.007)  
##                                                                   
## I(age2)                          0.0004***   0.003***   0.0004*** 
##                                  (0.00003)   (0.0002)    (0.0001) 
##                                                                   
## I(age3)                                     -0.00002***           
##                                              (0.00000)            
##                                                                   
## I(pollution * age)                                        0.0001  
##                                                          (0.0002) 
##                                                                   
## I((pollution * age)2)                                    -0.00000 
##                                                         (0.00000) 
##                                                                   
## Constant              -0.291***   -0.134*    0.274***     -0.162  
##                        (0.071)    (0.078)     (0.083)    (0.137)  
##                                                                   
## ------------------------------------------------------------------
## Observations            12,443     12,443     12,443      12,443  
## Log Likelihood        -8,615.608 -8,447.739 -8,334.024  -8,446.930
## Akaike Inf. Crit.     17,235.220 16,903.480 16,678.050  16,905.860
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01

The personal charachteristics added model shows that adding age to the equation was determined as statistically significant for the model, including the second and third order degrees, since they all are highly statistically significant, with a p-value of less than 1%. The Age^3 model performed better in practice, achieving lower AIC (16,667 against 16,892 or 16,895 of the others). Notice that the coefficient estimate on Pollution has decreased when adding age, and more so when adding age^3. Thus, we know that the estimate of Pollution was not “true” or stable, and that age was able to explain some of the same variance. That said, all of the absolute magnitudes of the coefficients are extremely small, suggesting that the model still may not be flexible enough to distinguish well between the classes of bronchitis or not.

Model 2 - Age only performance evaluation

test_predict_within(m2_age3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3676 2536
##          1 2470 3761
##                                          
##                Accuracy : 0.5977         
##                  95% CI : (0.589, 0.6063)
##     No Information Rate : 0.5061         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.1954         
##  Mcnemar's Test P-Value : 0.3583         
##                                          
##             Sensitivity : 0.5981         
##             Specificity : 0.5973         
##          Pos Pred Value : 0.5918         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.4939         
##          Detection Rate : 0.2954         
##    Detection Prevalence : 0.4992         
##       Balanced Accuracy : 0.5977         
##                                          
##        'Positive' Class : 0              
## 

Adding Age significantly improved our performance - from 51.3% with pollution only, to 59.5% accuracy with age. This is still not an objectively good perfomance, so I will continue to add features in hope to increase the perforamnce and explain more of the variance.

Model 2: Adding Number of Visits or Removing Pollution

After determining which is the best form to add in age with its nonlinearities, I can add the other personal variable - number of visits. I will test it against our recent better perfoming model, with the third higher order of Age.
I also test the removal of pollution from the Age^3 model.

m2_age3_nopol_novis <- glm(bronch ~  age + I(age^2) + I(age^3), family = "binomial", data = train)
m2_age3_visits_nopol <- glm(bronch ~  age + visits + I(age^2) + I(age^3), family = "binomial", data = train)
m2_age_visits_pol <- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3), family = "binomial", data = train)

stargazer(m2_age3_nopol_novis, m2_age3_visits_nopol, m2_age3, m2_age_visits_pol, type="text")
## 
## =================================================================
##                                 Dependent variable:              
##                   -----------------------------------------------
##                                       bronch                     
##                       (1)         (2)         (3)         (4)    
## -----------------------------------------------------------------
## pollution                                  0.007***    0.007***  
##                                             (0.002)     (0.002)  
##                                                                  
## age                -0.108***   -0.108***   -0.108***   -0.108*** 
##                     (0.006)     (0.006)     (0.006)     (0.006)  
##                                                                  
## visits                        -0.0001***              -0.0001*** 
##                                (0.00002)               (0.00002) 
##                                                                  
## I(age2)            0.003***    0.003***    0.003***    0.003***  
##                    (0.0002)    (0.0002)    (0.0002)    (0.0002)  
##                                                                  
## I(age3)           -0.00002*** -0.00002*** -0.00002*** -0.00002***
##                    (0.00000)   (0.00000)   (0.00000)   (0.00000) 
##                                                                  
## Constant           0.516***    0.643***    0.274***    0.398***  
##                     (0.046)     (0.059)     (0.083)     (0.090)  
##                                                                  
## -----------------------------------------------------------------
## Observations        12,443      12,443      12,443      12,443   
## Log Likelihood    -8,340.059  -8,334.035  -8,334.024  -8,327.561 
## Akaike Inf. Crit. 16,688.120  16,678.070  16,678.050  16,667.120 
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

Adding visits to the pollution included model was statistically significant (with a p-value of 0.00002), but a very small coefficient, so it is not highly economically significant. Let’s check its performance in prediction, still within-sample. Adding visits had the same significant coefficient wether pollution was included or not. Adding pollution resulted in the same significant coefficient wether visits was included or not. All these models resulted in similar AICs, but both pollution and visits yielded the same marginal improvement in AIC and log likelihood, such that the better performning model of those was the one including both visits and pollution. They seem to expalin differrent parts of the variance.

Model 2 - Age and Visits performance evaluation

test_predict_within(m2_age_visits_pol)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3666 2546
##          1 2488 3743
##                                           
##                Accuracy : 0.5954          
##                  95% CI : (0.5867, 0.6041)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.1909          
##  Mcnemar's Test P-Value : 0.4218          
##                                           
##             Sensitivity : 0.5957          
##             Specificity : 0.5952          
##          Pos Pred Value : 0.5901          
##          Neg Pred Value : 0.6007          
##              Prevalence : 0.4946          
##          Detection Rate : 0.2946          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5954          
##                                           
##        'Positive' Class : 0               
## 

The Visits variable didn’t seem to increase accuracy meaningfully, only by 0.2%. In this order of magnitude, it might be overfitting or due to random chance, and the gain is not large enough to justify that, so I am not going to include that in my model.

Model 3: Geographical Variables only

Here I will try to predict the chance for a bronchitis call using only Geographical Variables, and later add them onto the pollution and or age variables examined before. The data has multiple different geographical variables: Area (“urban”/“suburban”/“rural”), Latitude and Longitude, or city/area codes. I will try using each of them and compare the perforamnce of the model. According to LeSage “Spatial Statistics” book, while not transofrming my methods completely to spatial regressions, I transformed my Lat and Lon variables to be flexible enough to capture the needed variance by:
1. noramlizing and standartizing these variables around 0 and so that 1 is equal 1 standard deviation of the original value.
2. Using second order degrees 3. Using interaction terms

m3_geo_area <- glm(bronch ~ factor(station_area), family = "binomial", data = train) 
m3_geo_latlon<- glm(bronch ~ x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon), family = "binomial", data = train)
m3_geo_cityid<- glm(bronch ~ factor(city), family = "binomial", data = train)

stargazer(m3_geo_area, m3_geo_latlon, m3_geo_cityid, type="text")
## 
## ========================================================================
##                                               Dependent variable:       
##                                         --------------------------------
##                                                      bronch             
##                                            (1)        (2)        (3)    
## ------------------------------------------------------------------------
## factor(station_area)suburban              0.241                         
##                                          (0.293)                        
##                                                                         
## factor(station_area)urban                 0.261                         
##                                          (0.292)                        
##                                                                         
## x_lat                                               0.051**             
##                                                     (0.021)             
##                                                                         
## x_lon                                               0.087***            
##                                                     (0.026)             
##                                                                         
## I(x_lat2)                                           0.055***            
##                                                     (0.020)             
##                                                                         
## I(x_lon2)                                           0.028**             
##                                                     (0.013)             
##                                                                         
## I(x_lat * x_lon)                                   -0.089***            
##                                                     (0.030)             
##                                                                         
## factor(city)Aiguafreda                                          -0.154  
##                                                                (0.988)  
##                                                                         
## factor(city)Badia del Vall�s                                    0.028   
##                                                                (0.822)  
##                                                                         
## factor(city)Barbera                                             -0.493  
##                                                                (0.585)  
##                                                                         
## factor(city)Barcelona                                           0.006   
##                                                                (0.560)  
##                                                                         
## factor(city)Begas                                               0.593   
##                                                                (0.688)  
##                                                                         
## factor(city)Bellaterra                                          -0.714  
##                                                                (0.664)  
##                                                                         
## factor(city)Bigues i Riells                                     -0.606  
##                                                                (0.737)  
##                                                                         
## factor(city)Caldes de Montbui                                   0.108   
##                                                                (0.607)  
##                                                                         
## factor(city)Campins                                             12.412  
##                                                               (145.231) 
##                                                                         
## factor(city)Canovelles                                          -0.614  
##                                                                (0.667)  
##                                                                         
## factor(city)Cardedeu                                            0.155   
##                                                                (0.583)  
##                                                                         
## factor(city)Castellar del Vall�s                                -0.182  
##                                                                (0.581)  
##                                                                         
## factor(city)Castellbisbal                                       -0.026  
##                                                                (0.628)  
##                                                                         
## factor(city)Castelldefels                                       -0.635  
##                                                                (0.598)  
##                                                                         
## factor(city)Cerdanyola                                          -0.098  
##                                                                (0.565)  
##                                                                         
## factor(city)Cervell�                                           -2.100*  
##                                                                (1.205)  
##                                                                         
## factor(city)Collbat�                                            -0.154  
##                                                                (0.729)  
##                                                                         
## factor(city)Corbera de Llobregat                                0.907   
##                                                                (0.678)  
##                                                                         
## factor(city)Cornell� de Llobregat                               -0.292  
##                                                                (0.566)  
##                                                                         
## factor(city)C�noves i Samal�s                                   -0.943  
##                                                                (0.775)  
##                                                                         
## factor(city)El Papiol                                           -1.764  
##                                                                (1.229)  
##                                                                         
## factor(city)El Prat                                             -0.847  
##                                                                (0.678)  
##                                                                         
## factor(city)Esparreguera                                        -0.110  
##                                                                (0.631)  
##                                                                         
## factor(city)Esplugas de Llobregat                               -0.169  
##                                                                (0.570)  
##                                                                         
## factor(city)Gallifa                                             12.412  
##                                                               (145.231) 
##                                                                         
## factor(city)Gav�                                                -0.752  
##                                                                (0.616)  
##                                                                         
## factor(city)Granera                                             12.412  
##                                                               (324.744) 
##                                                                         
## factor(city)Granollers                                          -0.024  
##                                                                (0.567)  
##                                                                         
## factor(city)Hospitalet de Llobregat                             -0.059  
##                                                                (0.566)  
##                                                                         
## factor(city)l'Ametlla del Vall�s                                0.224   
##                                                                (0.596)  
##                                                                         
## factor(city)La Garriga                                          0.238   
##                                                                (0.614)  
##                                                                         
## factor(city)La Llagosta                                         0.421   
##                                                                (0.695)  
##                                                                         
## factor(city)La Palma de Cervell�                                0.539   
##                                                                (0.781)  
##                                                                         
## factor(city)La Roca del Vall�s                                  -0.247  
##                                                                (0.609)  
##                                                                         
## factor(city)Les Franqueses del Vall�s                           -0.036  
##                                                                (0.597)  
##                                                                         
## factor(city)Llinars del Vall�s                                  -0.228  
##                                                                (0.677)  
##                                                                         
## factor(city)Lli�� d'Amunt                                       -0.272  
##                                                                (0.607)  
##                                                                         
## factor(city)Lli�� de Vall                                       0.251   
##                                                                (0.627)  
##                                                                         
## factor(city)Martorell                                           0.120   
##                                                                (0.634)  
##                                                                         
## factor(city)Martorelles                                         -0.154  
##                                                                (0.802)  
##                                                                         
## factor(city)Matadepera                                          -0.061  
##                                                                (0.577)  
##                                                                         
## factor(city)Molins de Rei                                       -0.329  
##                                                                (0.580)  
##                                                                         
## factor(city)Mollet del Vall�s                                   -0.294  
##                                                                (0.573)  
##                                                                         
## factor(city)Moncada y Reixach                                   -0.021  
##                                                                (0.575)  
##                                                                         
## factor(city)Montmel�                                            -0.908  
##                                                                (0.702)  
##                                                                         
## factor(city)Montorn�s del Vall�s                                -0.442  
##                                                                (0.675)  
##                                                                         
## factor(city)Olesa de Montserrat                                 0.323   
##                                                                (0.632)  
##                                                                         
## factor(city)Palau-solit� i Plegamans                            -0.021  
##                                                                (0.603)  
##                                                                         
## factor(city)Pallej�                                             -0.742  
##                                                                (0.788)  
##                                                                         
## factor(city)Pallej� (Fontpineda)                                0.539   
##                                                                (1.029)  
##                                                                         
## factor(city)Parets del Vall�s                                   0.159   
##                                                                (0.587)  
##                                                                         
## factor(city)Poliny�                                             -0.901  
##                                                                (0.688)  
##                                                                         
## factor(city)Ripollet                                            -0.426  
##                                                                (0.581)  
##                                                                         
## factor(city)Rubi                                                -0.229  
##                                                                (0.567)  
##                                                                         
## factor(city)Sabadell                                            -0.219  
##                                                                (0.560)  
##                                                                         
## factor(city)Sant Andreu de la Barca                             0.035   
##                                                                (0.621)  
##                                                                         
## factor(city)Sant Antoni de Vilamajor                            -0.260  
##                                                                (0.644)  
##                                                                         
## factor(city)Sant Boi de Llobregat                               -0.714  
##                                                                (0.605)  
##                                                                         
## factor(city)Sant Celoni                                         -0.049  
##                                                                (0.616)  
##                                                                         
## factor(city)Sant Climent de Llobregat                           -0.154  
##                                                                (1.144)  
##                                                                         
## factor(city)Sant Cugat                                          -0.348  
##                                                                (0.559)  
##                                                                         
## factor(city)Sant Esteve de Sesrovires                           0.608   
##                                                                (0.720)  
##                                                                         
## factor(city)Sant Feliu de Codines                               0.277   
##                                                                (0.661)  
##                                                                         
## factor(city)Sant Feliu de Llobregat                             -0.194  
##                                                                (0.563)  
##                                                                         
## factor(city)Sant Fost de Campsentelles                          0.266   
##                                                                (0.618)  
##                                                                         
## factor(city)Sant Joan Despi                                     -0.395  
##                                                                (0.569)  
##                                                                         
## factor(city)Sant Just Desvern                                   0.063   
##                                                                (0.567)  
##                                                                         
## factor(city)Sant Pere de Vilamajor                              0.619   
##                                                                (0.744)  
##                                                                         
## factor(city)Sant Quirze                                         0.047   
##                                                                (0.571)  
##                                                                         
## factor(city)Sant Vicen� dels Horts                              -0.279  
##                                                                (0.660)  
##                                                                         
## factor(city)Santa Coloma de Cervell�                            0.134   
##                                                                (0.945)  
##                                                                         
## factor(city)Santa Eul�lia de Ron�ana                            0.834   
##                                                                (0.629)  
##                                                                         
## factor(city)Santa Maria de Martorelles                          0.069   
##                                                                (0.872)  
##                                                                         
## factor(city)Santa Maria de Palautordera                         0.474   
##                                                                (0.637)  
##                                                                         
## factor(city)Santa Perpetua de Mogoda                            -0.057  
##                                                                (0.585)  
##                                                                         
## factor(city)Sentmenat                                           0.348   
##                                                                (0.616)  
##                                                                         
## factor(city)Tagamanent                                          -0.308  
##                                                                (0.787)  
##                                                                         
## factor(city)Terrassa                                            -0.105  
##                                                                (0.559)  
##                                                                         
## factor(city)Ullastrell                                         -2.457** 
##                                                                (1.187)  
##                                                                         
## factor(city)Vacarises                                          -1.764** 
##                                                                (0.781)  
##                                                                         
## factor(city)Valldoreix                                          -0.430  
##                                                                (0.624)  
##                                                                         
## factor(city)Vallgorguina                                       1.670**  
##                                                                (0.736)  
##                                                                         
## factor(city)Vallirana                                           -0.219  
##                                                                (0.662)  
##                                                                         
## factor(city)Vallromanes                                         -1.099  
##                                                                (0.713)  
##                                                                         
## factor(city)Viladecans                                         -1.184*  
##                                                                (0.632)  
##                                                                         
## factor(city)Viladecavalls                                       0.418   
##                                                                (0.595)  
##                                                                         
## factor(city)Vilanova del Vall�s                                 -0.486  
##                                                                (0.601)  
##                                                                         
## Constant                                  -0.251    -0.045*     0.154   
##                                          (0.291)    (0.026)    (0.556)  
##                                                                         
## ------------------------------------------------------------------------
## Observations                              12,443     12,443     12,443  
## Log Likelihood                          -8,624.294 -8,605.205 -8,502.446
## Akaike Inf. Crit.                       17,254.590 17,222.410 17,180.890
## ========================================================================
## Note:                                        *p<0.1; **p<0.05; ***p<0.01

These geographical variables do not seem to perform better alone than the age variable, and seem to perform similarly to the pollution only data, while they are more detailed. The charachterization of area as urban/suburban/rural produced non-statistically-significant coefficients. However, it performed not much worse than the others which did. The latitdue and longitude model suggested significant coefficient for each of the variants, and a better performance. The factor(cities) model, where each city is regarded as a seperate indicator (“dummy”) variable, showed that most of the cities were not statistically significant than the baseline, but only a few were. This model gives us better performance in AIC and better interpretability of results - with this information we now know which specific cities are less or more likely to have bronchitis cases. Let’s examine the performance of those models in prediction.

print("Area Model:")
## [1] "Area Model:"
test_predict_within(m3_geo_area)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1822 4390
##          1 1797 4434
##                                           
##                Accuracy : 0.5028          
##                  95% CI : (0.4939, 0.5116)
##     No Information Rate : 0.7092          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0049          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5035          
##             Specificity : 0.5025          
##          Pos Pred Value : 0.2933          
##          Neg Pred Value : 0.7116          
##              Prevalence : 0.2908          
##          Detection Rate : 0.1464          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5030          
##                                           
##        'Positive' Class : 0               
## 
print("Lat/Lon Model:")
## [1] "Lat/Lon Model:"
test_predict_within(m3_geo_latlon)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3154 3058
##          1 2855 3376
##                                          
##                Accuracy : 0.5248         
##                  95% CI : (0.516, 0.5336)
##     No Information Rate : 0.5171         
##     P-Value [Acc > NIR] : 0.043312       
##                                          
##                   Kappa : 0.0495         
##  Mcnemar's Test P-Value : 0.008616       
##                                          
##             Sensitivity : 0.5249         
##             Specificity : 0.5247         
##          Pos Pred Value : 0.5077         
##          Neg Pred Value : 0.5418         
##              Prevalence : 0.4829         
##          Detection Rate : 0.2535         
##    Detection Prevalence : 0.4992         
##       Balanced Accuracy : 0.5248         
##                                          
##        'Positive' Class : 0              
## 
print("Cities Model:")
## [1] "Cities Model:"
test_predict_within(m3_geo_cityid)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3341 2871
##          1 2777 3454
##                                           
##                Accuracy : 0.5461          
##                  95% CI : (0.5373, 0.5549)
##     No Information Rate : 0.5083          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.0922          
##  Mcnemar's Test P-Value : 0.2159          
##                                           
##             Sensitivity : 0.5461          
##             Specificity : 0.5461          
##          Pos Pred Value : 0.5378          
##          Neg Pred Value : 0.5543          
##              Prevalence : 0.4917          
##          Detection Rate : 0.2685          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5461          
##                                           
##        'Positive' Class : 0               
## 

All of them seem to have only little predicitive power. The area model does not have any, the accuracy is around 50%. The lat/lon model seems to have a slight improvement of 2.5%, and the cities model has an improvement of 4%. However, the cities model is highly restrictive because of using ~150 dummy variables, dividing our data to too many small stratas (as we saw at first). This might pose an issue later if we combine the cities indicator variables together with many other variables so we should remember that.

Model 4: combining personal and geographical information

Let’s see the performance of models with both personal and geographical variables tested previously.

m4_person_geolatlon<- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3) + x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon), family = "binomial", data = train)
m4_person_geocityid<- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3) + factor(city), family = "binomial", data = train)

stargazer(m2_age_visits_pol, m4_person_geolatlon, m4_person_geocityid, type="text")
## 
## ===========================================================================
##                                                 Dependent variable:        
##                                         -----------------------------------
##                                                       bronch               
##                                             (1)         (2)         (3)    
## ---------------------------------------------------------------------------
## pollution                                0.007***      0.003      -0.0003  
##                                           (0.002)     (0.003)     (0.005)  
##                                                                            
## visits                                  -0.0001***  -0.0001***  -0.0001*** 
##                                          (0.00002)   (0.00002)   (0.00002) 
##                                                                            
## age                                      -0.108***   -0.108***   -0.109*** 
##                                           (0.006)     (0.006)     (0.006)  
##                                                                            
## I(age2)                                  0.003***    0.003***    0.003***  
##                                          (0.0002)    (0.0002)    (0.0002)  
##                                                                            
## I(age3)                                 -0.00002*** -0.00002*** -0.00002***
##                                          (0.00000)   (0.00000)   (0.00000) 
##                                                                            
## x_lat                                                 0.058**              
##                                                       (0.024)              
##                                                                            
## x_lon                                                 0.055*               
##                                                       (0.030)              
##                                                                            
## I(x_lat2)                                            0.054***              
##                                                       (0.021)              
##                                                                            
## I(x_lon2)                                              0.014               
##                                                       (0.014)              
##                                                                            
## I(x_lat * x_lon)                                      -0.049               
##                                                       (0.036)              
##                                                                            
## factor(city)Aiguafreda                                            -0.551   
##                                                                   (1.025)  
##                                                                            
## factor(city)Badia del Vall�s                                      -0.599   
##                                                                   (0.840)  
##                                                                            
## factor(city)Barbera                                               -0.639   
##                                                                   (0.600)  
##                                                                            
## factor(city)Barcelona                                             -0.242   
##                                                                   (0.576)  
##                                                                            
## factor(city)Begas                                                  0.368   
##                                                                   (0.711)  
##                                                                            
## factor(city)Bellaterra                                            -0.907   
##                                                                   (0.680)  
##                                                                            
## factor(city)Bigues i Riells                                       -0.796   
##                                                                   (0.755)  
##                                                                            
## factor(city)Caldes de Montbui                                     -0.286   
##                                                                   (0.623)  
##                                                                            
## factor(city)Campins                                               12.605   
##                                                                  (145.231) 
##                                                                            
## factor(city)Canovelles                                            -0.650   
##                                                                   (0.682)  
##                                                                            
## factor(city)Cardedeu                                               0.052   
##                                                                   (0.597)  
##                                                                            
## factor(city)Castellar del Vall�s                                  -0.135   
##                                                                   (0.596)  
##                                                                            
## factor(city)Castellbisbal                                         -0.161   
##                                                                   (0.644)  
##                                                                            
## factor(city)Castelldefels                                         -0.812   
##                                                                   (0.623)  
##                                                                            
## factor(city)Cerdanyola                                            -0.312   
##                                                                   (0.579)  
##                                                                            
## factor(city)Cervell�                                             -2.684**  
##                                                                   (1.219)  
##                                                                            
## factor(city)Collbat�                                              -0.406   
##                                                                   (0.744)  
##                                                                            
## factor(city)Corbera de Llobregat                                   0.646   
##                                                                   (0.694)  
##                                                                            
## factor(city)Cornell� de Llobregat                                 -0.418   
##                                                                   (0.582)  
##                                                                            
## factor(city)C�noves i Samal�s                                     -0.950   
##                                                                   (0.810)  
##                                                                            
## factor(city)El Papiol                                             -1.911   
##                                                                   (1.245)  
##                                                                            
## factor(city)El Prat                                               -1.080   
##                                                                   (0.695)  
##                                                                            
## factor(city)Esparreguera                                          -0.363   
##                                                                   (0.646)  
##                                                                            
## factor(city)Esplugas de Llobregat                                 -0.377   
##                                                                   (0.585)  
##                                                                            
## factor(city)Gallifa                                               11.939   
##                                                                  (145.231) 
##                                                                            
## factor(city)Gav�                                                  -0.826   
##                                                                   (0.639)  
##                                                                            
## factor(city)Granera                                               11.670   
##                                                                  (324.744) 
##                                                                            
## factor(city)Granollers                                            -0.189   
##                                                                   (0.582)  
##                                                                            
## factor(city)Hospitalet de Llobregat                               -0.152   
##                                                                   (0.581)  
##                                                                            
## factor(city)l'Ametlla del Vall�s                                   0.148   
##                                                                   (0.611)  
##                                                                            
## factor(city)La Garriga                                             0.157   
##                                                                   (0.629)  
##                                                                            
## factor(city)La Llagosta                                            0.356   
##                                                                   (0.709)  
##                                                                            
## factor(city)La Palma de Cervell�                                   0.032   
##                                                                   (0.810)  
##                                                                            
## factor(city)La Roca del Vall�s                                    -0.406   
##                                                                   (0.624)  
##                                                                            
## factor(city)Les Franqueses del Vall�s                             -0.071   
##                                                                   (0.612)  
##                                                                            
## factor(city)Llinars del Vall�s                                    -0.197   
##                                                                   (0.690)  
##                                                                            
## factor(city)Lli�� d'Amunt                                         -0.283   
##                                                                   (0.622)  
##                                                                            
## factor(city)Lli�� de Vall                                          0.021   
##                                                                   (0.642)  
##                                                                            
## factor(city)Martorell                                             -0.097   
##                                                                   (0.649)  
##                                                                            
## factor(city)Martorelles                                           -0.330   
##                                                                   (0.819)  
##                                                                            
## factor(city)Matadepera                                            -0.247   
##                                                                   (0.592)  
##                                                                            
## factor(city)Molins de Rei                                         -0.400   
##                                                                   (0.597)  
##                                                                            
## factor(city)Mollet del Vall�s                                     -0.500   
##                                                                   (0.589)  
##                                                                            
## factor(city)Moncada y Reixach                                     -0.193   
##                                                                   (0.589)  
##                                                                            
## factor(city)Montmel�                                              -1.197*  
##                                                                   (0.720)  
##                                                                            
## factor(city)Montorn�s del Vall�s                                  -0.448   
##                                                                   (0.690)  
##                                                                            
## factor(city)Olesa de Montserrat                                    0.074   
##                                                                   (0.648)  
##                                                                            
## factor(city)Palau-solit� i Plegamans                              -0.278   
##                                                                   (0.619)  
##                                                                            
## factor(city)Pallej�                                               -1.181   
##                                                                   (0.808)  
##                                                                            
## factor(city)Pallej� (Fontpineda)                                  -0.133   
##                                                                   (1.040)  
##                                                                            
## factor(city)Parets del Vall�s                                      0.033   
##                                                                   (0.603)  
##                                                                            
## factor(city)Poliny�                                               -1.045   
##                                                                   (0.704)  
##                                                                            
## factor(city)Ripollet                                              -0.649   
##                                                                   (0.595)  
##                                                                            
## factor(city)Rubi                                                  -0.407   
##                                                                   (0.582)  
##                                                                            
## factor(city)Sabadell                                              -0.368   
##                                                                   (0.574)  
##                                                                            
## factor(city)Sant Andreu de la Barca                               -0.144   
##                                                                   (0.637)  
##                                                                            
## factor(city)Sant Antoni de Vilamajor                              -0.409   
##                                                                   (0.659)  
##                                                                            
## factor(city)Sant Boi de Llobregat                                 -0.902   
##                                                                   (0.621)  
##                                                                            
## factor(city)Sant Celoni                                           -0.388   
##                                                                   (0.633)  
##                                                                            
## factor(city)Sant Climent de Llobregat                             -0.244   
##                                                                   (1.163)  
##                                                                            
## factor(city)Sant Cugat                                            -0.461   
##                                                                   (0.574)  
##                                                                            
## factor(city)Sant Esteve de Sesrovires                              0.656   
##                                                                   (0.734)  
##                                                                            
## factor(city)Sant Feliu de Codines                                  0.195   
##                                                                   (0.677)  
##                                                                            
## factor(city)Sant Feliu de Llobregat                               -0.337   
##                                                                   (0.584)  
##                                                                            
## factor(city)Sant Fost de Campsentelles                             0.070   
##                                                                   (0.635)  
##                                                                            
## factor(city)Sant Joan Despi                                       -0.487   
##                                                                   (0.589)  
##                                                                            
## factor(city)Sant Just Desvern                                     -0.164   
##                                                                   (0.587)  
##                                                                            
## factor(city)Sant Pere de Vilamajor                                 0.123   
##                                                                   (0.760)  
##                                                                            
## factor(city)Sant Quirze                                           -0.111   
##                                                                   (0.586)  
##                                                                            
## factor(city)Sant Vicen� dels Horts                                -0.686   
##                                                                   (0.674)  
##                                                                            
## factor(city)Santa Coloma de Cervell�                              -0.067   
##                                                                   (0.971)  
##                                                                            
## factor(city)Santa Eul�lia de Ron�ana                               0.814   
##                                                                   (0.643)  
##                                                                            
## factor(city)Santa Maria de Martorelles                             0.363   
##                                                                   (0.885)  
##                                                                            
## factor(city)Santa Maria de Palautordera                            0.423   
##                                                                   (0.653)  
##                                                                            
## factor(city)Santa Perpetua de Mogoda                              -0.282   
##                                                                   (0.599)  
##                                                                            
## factor(city)Sentmenat                                              0.217   
##                                                                   (0.632)  
##                                                                            
## factor(city)Tagamanent                                            -0.637   
##                                                                   (0.820)  
##                                                                            
## factor(city)Terrassa                                              -0.270   
##                                                                   (0.574)  
##                                                                            
## factor(city)Ullastrell                                           -2.466**  
##                                                                   (1.196)  
##                                                                            
## factor(city)Vacarises                                            -1.782**  
##                                                                   (0.796)  
##                                                                            
## factor(city)Valldoreix                                            -0.604   
##                                                                   (0.641)  
##                                                                            
## factor(city)Vallgorguina                                          1.312*   
##                                                                   (0.749)  
##                                                                            
## factor(city)Vallirana                                             -0.374   
##                                                                   (0.681)  
##                                                                            
## factor(city)Vallromanes                                          -1.674**  
##                                                                   (0.728)  
##                                                                            
## factor(city)Viladecans                                           -1.476**  
##                                                                   (0.655)  
##                                                                            
## factor(city)Viladecavalls                                          0.270   
##                                                                   (0.611)  
##                                                                            
## factor(city)Vilanova del Vall�s                                   -0.616   
##                                                                   (0.616)  
##                                                                            
## Constant                                 0.398***    0.468***      0.961   
##                                           (0.090)     (0.116)     (0.601)  
##                                                                            
## ---------------------------------------------------------------------------
## Observations                              12,443      12,443      12,443   
## Log Likelihood                          -8,327.561  -8,314.407  -8,217.180 
## Akaike Inf. Crit.                       16,667.120  16,650.810  16,620.360 
## ===========================================================================
## Note:                                           *p<0.1; **p<0.05; ***p<0.01

As expected from before - the addition of the latitude and longitude variables improved the AIC over the only personal charachteristics model with pollution but only slightly. More importantly, it reduced the coefficient (estimated impact) of pollution because of the correlation between the geographical variables and pollution, and the cities model made the coefficient of pollution be negative (suggesting that pollution reduced the likelihood of bronchitis). This does not seem reasonable, unless operating through a third variable (for example, if pollution is correalted with lower income otherwise than with cities).

Model 4 (Geo) - testing predictive ability

Let’s test the predictive ability

print("Personal + Lat/Lon Model:")
## [1] "Personal + Lat/Lon Model:"
test_predict_within(m4_person_geolatlon)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3705 2507
##          1 2480 3751
##                                           
##                Accuracy : 0.5992          
##                  95% CI : (0.5905, 0.6078)
##     No Information Rate : 0.5029          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.1984          
##  Mcnemar's Test P-Value : 0.7127          
##                                           
##             Sensitivity : 0.5990          
##             Specificity : 0.5994          
##          Pos Pred Value : 0.5964          
##          Neg Pred Value : 0.6020          
##              Prevalence : 0.4971          
##          Detection Rate : 0.2978          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5992          
##                                           
##        'Positive' Class : 0               
## 
print("Personal + City Model:")
## [1] "Personal + City Model:"
test_predict_within(m4_person_geocityid)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3772 2440
##          1 2478 3753
##                                           
##                Accuracy : 0.6048          
##                  95% CI : (0.5961, 0.6134)
##     No Information Rate : 0.5023          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2095          
##  Mcnemar's Test P-Value : 0.5978          
##                                           
##             Sensitivity : 0.6035          
##             Specificity : 0.6060          
##          Pos Pred Value : 0.6072          
##          Neg Pred Value : 0.6023          
##              Prevalence : 0.5023          
##          Detection Rate : 0.3031          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.6048          
##                                           
##        'Positive' Class : 0               
## 

As expected, they both did only slightly better than the personal variables only model, with the cities based model doing 1% better in accuracy.

Test performance of best models out-of-sample (test data)

Now I test the performance on the test data.

test_predict_outsample <- function(model) {
  ytest_fac <- factor(test$bronch)
  ytest_pred_fac <- as.vector(round(predict(model, newdata = test)))
  ytest_pred_fac <- factor(ytest_pred_fac, levels=c("0","1"))
  print(confusionMatrix(ytest_fac, ytest_pred_fac))
}
print("All personal variables")
## [1] "All personal variables"
test_predict_outsample(m2_age3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1048  204
##          1 1099  330
##                                           
##                Accuracy : 0.514           
##                  95% CI : (0.4949, 0.5331)
##     No Information Rate : 0.8008          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0651          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4881          
##             Specificity : 0.6180          
##          Pos Pred Value : 0.8371          
##          Neg Pred Value : 0.2309          
##              Prevalence : 0.8008          
##          Detection Rate : 0.3909          
##    Detection Prevalence : 0.4670          
##       Balanced Accuracy : 0.5531          
##                                           
##        'Positive' Class : 0               
## 
print("All variables with time, Lat/Lon")
## [1] "All variables with time, Lat/Lon"
test_predict_outsample(m5_time_all)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  985  221
##          1 1051  385
##                                           
##                Accuracy : 0.5185          
##                  95% CI : (0.4993, 0.5378)
##     No Information Rate : 0.7706          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0804          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4838          
##             Specificity : 0.6353          
##          Pos Pred Value : 0.8167          
##          Neg Pred Value : 0.2681          
##              Prevalence : 0.7706          
##          Detection Rate : 0.3728          
##    Detection Prevalence : 0.4565          
##       Balanced Accuracy : 0.5596          
##                                           
##        'Positive' Class : 0               
## 
print("All variables with time, Cities")
## [1] "All variables with time, Cities"
test_predict_outsample(m5_time_all_city)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 936 228
##          1 979 434
##                                          
##                Accuracy : 0.5316         
##                  95% CI : (0.5121, 0.551)
##     No Information Rate : 0.7431         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1053         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.4888         
##             Specificity : 0.6556         
##          Pos Pred Value : 0.8041         
##          Neg Pred Value : 0.3071         
##              Prevalence : 0.7431         
##          Detection Rate : 0.3632         
##    Detection Prevalence : 0.4517         
##       Balanced Accuracy : 0.5722         
##                                          
##        'Positive' Class : 0              
## 

The test data performance shows accuracy of maximum 53.2%, meaning that our models so far were not able to successfully explain or predict the chance for a bronchitis call using the best logistic regression models attempted. They perform worse on the test data, meaning that they overfitted to the training data.

Statistical Tests and Corrections

Synthesis of Models and Conclusions