Setup and Data Loading

Let’s load and view the data’s first few lines as well as summary statistics about it

# load the data
df <- read.csv("./output/bronch.csv")

# remove irrelevant columns
#df[ ,c('X','pathology', 'pathology_original', 'id_type', 'minor_18','geopoint', 'n_visits', 'dayofyear', 'date_numerical')] <- list(NULL)
df$X <- NULL

# recode relevant columns in the relevant encoding: float, categorical
df$bronch <- factor(df$bronch)
kable(summary(df))

status	time_delay	city	city_id	age	zip	lat	lon	date	id_personal	pathology_eng	visits	month	dayofmonth	dayofweek	hour	x_age_scaled	x_visits	x_date	x_month	x_dayofweek	x_hour	x_lon	x_lat	x_zip	pathology_id	year	station_lat	pollution	distance	station_city	station_lon_air	station_area	bronch
Min. :4.000	Min. : 0.00	Terrassa :1913	Min. : 1.00	Min. : 1.00	Min. :8004	Min. :41.21	Min. :1.624	2018-05-01 15:02:00: 16	Min. : 0.00	bronchitis :7872	Min. : 1	Min. : 1.000	Min. : 1.00	Min. :0.000	Min. : 0.00	Min. :-1.25041	Min. :-2.015002	Min. :-2.86046	Min. :-1.3824150	Min. :-1.47097	Min. :-2.52934	Min. :-4.74391	Min. :-3.097613	Min. :-1.20005	Min. : 1.00	Min. :2015	Min. :41.20	Min. : 2.914	Min. : 0.001931	TERRASSA :2310	Min. :1.618	rural : 70	0:7872
1st Qu.:4.000	1st Qu.: 45.00	Sant Cugat :1902	1st Qu.: 26.00	1st Qu.: 8.00	1st Qu.:8184	1st Qu.:41.40	1st Qu.:2.048	2017-01-08 19:00:00: 13	1st Qu.: 46.00	fever :2052	1st Qu.:1127	1st Qu.: 3.000	1st Qu.: 6.00	1st Qu.:1.000	1st Qu.:10.00	1st Qu.:-0.98151	1st Qu.:-0.666875	1st Qu.:-0.60114	1st Qu.:-0.8573621	1st Qu.:-0.98581	1st Qu.:-0.68611	1st Qu.:-0.59755	1st Qu.:-1.033615	1st Qu.:-0.62159	1st Qu.: 33.00	1st Qu.:2016	1st Qu.:41.40	1st Qu.:27.950	1st Qu.: 1.050538	SABADELL :1863	1st Qu.:2.042	suburban: 4529	1:7872
Median :4.000	Median : 60.00	Sabadell :1383	Median : 50.00	Median :36.00	Median :8222	Median :41.50	Median :2.090	2017-12-27 08:56:00: 13	Median : 55.00	tonsillitis : 643	Median :1584	Median : 6.000	Median :16.00	Median :3.000	Median :13.00	Median : 0.09405	Median :-0.119722	Median : 0.30771	Median :-0.0697829	Median :-0.01547	Median :-0.13314	Median :-0.18707	Median : 0.018138	Median :-0.49947	Median : 33.00	Median :2017	Median :41.51	Median :36.589	Median : 1.882925	BARCELONA :1711	Median :2.101	urban :11145	NA
Mean :4.046	Mean : 70.24	Barcelona :1360	Mean : 70.56	Mean :35.88	Mean :8374	Mean :41.50	Mean :2.110	2016-03-14 13:32:00: 12	Mean : 55.01	gastroenteritis: 444	Mean :1687	Mean : 6.266	Mean :15.35	Mean :3.011	Mean :13.54	Mean : 0.08927	Mean : 0.003351	Mean : 0.03999	Mean :-0.0000326	Mean :-0.01008	Mean :-0.03385	Mean : 0.01476	Mean : 0.005824	Mean :-0.01001	Mean : 76.14	Mean :2017	Mean :41.49	Mean :34.399	Mean : 2.579297	SANT FELIU DE LLOBREGAT:1662	Mean :2.111	NA	NA
3rd Qu.:4.000	3rd Qu.: 90.00	Sant Feliu de Llobregat: 644	3rd Qu.:114.00	3rd Qu.:56.00	3rd Qu.:8520	3rd Qu.:41.57	3rd Qu.:2.147	2017-11-13 10:10:00: 12	3rd Qu.: 68.00	flu : 378	3rd Qu.:2463	3rd Qu.:10.000	3rd Qu.:23.00	3rd Qu.:5.000	3rd Qu.:18.00	3rd Qu.: 0.86232	3rd Qu.: 0.932678	3rd Qu.: 0.78773	3rd Qu.: 0.9803229	3rd Qu.: 0.95486	3rd Qu.: 0.78847	3rd Qu.: 0.37533	3rd Qu.: 0.732254	3rd Qu.: 0.45821	3rd Qu.: 92.00	3rd Qu.:2018	3rd Qu.:41.56	3rd Qu.:42.260	3rd Qu.: 3.064469	SANT CUGAT DEL VALLÃˆS :1658	3rd Qu.:2.153	NA	NA
Max. :5.000	Max. :150.00	Cerdanyola : 521	Max. :188.00	Max. :99.00	Max. :8980	Max. :41.98	Max. :2.696	2017-11-25 11:16:00: 12	Max. :102.00	vomiting : 371	Max. :2695	Max. :12.000	Max. :31.00	Max. :6.000	Max. :23.00	Max. : 2.51408	Max. : 1.210445	Max. : 1.71578	Max. : 1.5053757	Max. : 1.44003	Max. : 1.71008	Max. : 5.75296	Max. : 5.117218	Max. : 1.93651	Max. :221.00	Max. :2018	Max. :42.00	Max. :52.057	Max. :17.628569	GRANOLLERS :1471	Max. :2.817	NA	NA
NA	NA	(Other) :8021	NA	NA	NA	NA	NA	(Other) :15666	NA	(Other) :3984	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	(Other) :5069	NA	NA	NA

#summary(df)

Data Checks

Most of the data exploration has been done before at the Python notebook and mentioned in part 1 of this paper. Here I will perform specific data exploration and verfiications reelvant to the modeling dne here: checking for variance among relevant variables, outliers, and class balance.

ggplot(data = df, aes(x = age)) +
  geom_histogram(binwidth = 5, alpha=0.6) +
  scale_x_continuous(breaks = seq(0,100,by=5)) +
  labs(title = "Ages of Patients", y = "Count", x = "Age of Patients") +
  theme_minimal() +
  theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))

ggplot(data = df, aes(x = age, fill=bronch, color=bronch)) +
  #geom_histogram(binwidth = 5, alpha=0.5) +
  geom_density(alpha=0.5) +
  scale_x_continuous(breaks = seq(0,100,by=5)) +
  labs(title = "Ages of Patients", y = "Count", x = "Age of Patients") +
  theme_minimal() +
  theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))

The distribution of age per class (between Bronchitis patients and not) is sufficiently similar. Naturally, there are more babies and older people who have this condition, while more young people are without this condition (and most likely healthier). The distribution of age in both classes is bi-modal rather than normal. This makes sense, since this is a This may pose some problems in the modeling because most models and procedures work ideally with a normally distributed dataset.

Class balance by city_id

## two-way contingency table of categorical outcome and predictors we want
## to make sure there are not 0 cells

xtabs(~bronch + city_id, data = df)

##       city_id
## bronch   1   2   3   4   5   6   7   8   9  10  11  12  13  14  16  17  18
##      0   7  84  29  98  28 263   1 103 121  48  21   1 100 197   3 123 112
##      1   6  65  20  94  31 258   5 109 131  58  11   0  81 188   4 105  85
##       city_id
## bronch  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
##      0  80  35  52 153  47 108  27 357  89   7 220   6  50  40   3 142   3
##      1  52  29  55 115  81 144  29 275  94  16 156   0  68  19   0 154   2
##       city_id
## bronch  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  53
##      0  85  27   7 169  84  72 190 215  53  76  49  12  29  42 139   3  34
##      1  91  42  12 177  86  55 207 211  75 114  60   1   6  73 116   6  50
##       city_id
## bronch  54  55  56  57  58  60  62  63  64  65  66  67  68  69  70  71  72
##      0   0  23  14  64   4   1   1  88 107  17   6  35  10  45  23  25  10
##      1   5  13   6 102   0   0   2  75 123   0   0  52  20  45  37  17   8
##       city_id
## bronch  73  74  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##      0  19  20  62  27  30   5  17  30   9   3  19   4  18   8   5  20  56
##      1   8  16  86  22  35   0  25  45  20   0  55   7  40   6  39   8  48
##       city_id
## bronch  92  93  94  96  97  98  99 100 101 102 103 104 105 106 107 108 109
##      0   6  14  69  10  11  16 241   7  27  31 164  48   7  22  96  19  10
##      1  12  21  42   1  10  29 210   3  13  30 166  25  13  33  87  38   6
##       city_id
## bronch 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
##      0  32  58   2   7 328 196 194  22   5   6  18  49  48  48  47   5  18
##      1  37  35   4  20 316 164 222  15   7   0  17  16  67  44  36  17  10
##       city_id
## bronch 128 129 131 134 135 136 142 144 145 146 147 149 152 153 155 159 160
##      0   3   2 232  37   2   1   2  71   1  12   1   4   5   7  82 108  90
##      1   8   4 217  26   0   0   4 157   0   4   7   0   0   0 131  80  61
##       city_id
## bronch 162 166 167 172 174 175 181 182 183 184 185 186 187 188
##      0   5   6 119  88  24   2  60  24   5  58  18  19  11  23
##      1   0   4 128 109  36  12  43  33  10  42  35  20  24  24

Since I will be later including factor(city_id), we need to make sure these are somewhat balanced. There are some cities with 0 patient of one type (normally). To prevent complete overfitting and biased predictions for these cities, I will remove them from my dataset. The zero-patient stratas are only missing bronchitis patients. Hence, I will find all cities with missing bronchitis patients and remove them.

setDT(df) # setting as data.table for more efficient data processing
cities_with_nonbronch <- unique(df[bronch==0, city_id])
cities_with_bronch <- unique(df[bronch==1, city_id])
imblanaced_cities <- setdiff(cities_with_nonbronch, cities_with_bronch)
size_beofre_balancing <- nrow(df)

# balance the dataframe by removing imblanaced cities:
df <- df[! city_id %in% imblanaced_cities,]
size_after_balancing <- nrow(df)
print(paste("Balancing removed", size_beofre_balancing - size_after_balancing, "rows, now remaining with ", nrow(df)))

## [1] "Balancing removed 77 rows, now remaining with  15667"

Class balance by Pollution level

ggplot(data = df, aes(x = pollution, fill=factor(bronch), color=factor(bronch))) +
  geom_density(alpha=0.5) +
  scale_x_continuous(breaks = seq(0,100,by=5)) +
  labs(title = "Pollution Levels for Calls with bronchitis or Withouot", y = "Density", x = "Pollution") +
  theme_minimal() +
  theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))

Surprisingly, there seems to be no difference between the pollution levels for bronchitis patients versus not! This might mean that our efforts to establish any meaningful effect by pollution would be futile. I will make a few simple regression models, but will then transition to modeling the chance for a bronchitis call with predictive purposes instead, taking advantage of other variables.

Modeling

In this section, I will use logistic regression to model the chance of a call to be about bronchitis by the effect of the pollution level (at the closest measuring station). This will not establish causality perfectly because there are still many omitted variables and data limitations such as class imbalances, too few pollution station measurements (in both time and space), missing cities, potential for negative reverse causality (where patients with bronchitis may move away from a polluted area) and there is no good way to emulate a randomized experiment. This analysis could at least suggest if pollution seems to have any effect or not, although the results should be taken with a grain of salt. The null hypothesis would be that pollution would have no effect on the chance for a call to be about bronchitis. The alternative hypothesis would be that pollution would have some non-zero effect on the chance for a call to be about bronchitis.

Since it is hard to establish causality, I will try at the end of this analysis to find a sufficiently accurate model for predictive purposes. By trying various models and optimizing for their performance, we might at least find a sufficiently accurate predictive model, even if it is less sparse and theoretically founded.

I will develop models from the simplest possible model to more complex ones, slowly adding more features or making the form more flexible. I use logistic regression since the outcome variable is binary, and I’m looking predict probabilities bounded between [0,1]. If I would use a normal regression, the probabilities predicted could be outside of the boundaries of [0,1], but a logistic regression compresses the probabilities to be within that boundary with a sigmoid like function.

Separate data into test and training data

Since I am going to test the performance of various models and be able to use it for predictive purposes, it is important to make sure I am not overfitting the model to the training data. I will do this by training and checking the models on the training data while keeping 20% of the data aside as a test dataset.

set.seed(1)
# split dataframe into training (80%) and test (20%)
split_idx <- sample(c(TRUE, FALSE), nrow(df), replace = T, prob = c(0.8,0.2))
train <- df[split_idx, ]
test <- df[!split_idx, ]
df$bronch <- as.numeric(df$bronch)

Visualize a simple Linear model

# Visualize
ggplot(aes(x=pollution, y=bronch), data=df) +
  geom_point(alpha = .15) +
  geom_smooth(method = "glm") + #, method.args = list(family = "binomial")) +
  ggtitle("Logistic regression model fit") +
  xlab("Pollution") +
  ylab("Probability of bronchitis")

Unfortunately, by this plotting of values of pollution per class of bronchitis or not, there is almost no difference between the distribution of pollution levels. A linear model is slightly skewed up, but not sufficiently to be able to predict bronchitis or not.
(*Note: the logistic nonlinear model could not be visualized on the plot above for technical reasons, so I instead used a linear model for visualization purposes.)

Model 1: Only Pollution

The syntax of the glm function is similar to that of lm, except that we must pass the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

model1_pol <- glm(bronch ~ pollution , family = "binomial", data = train)
summary(model1_pol)

## 
## Call:
## glm(formula = bronch ~ pollution, family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.243  -1.187   1.113   1.168   1.292  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.290544   0.070804  -4.104 4.07e-05 ***
## pollution    0.008534   0.001991   4.287 1.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17250  on 12442  degrees of freedom
## Residual deviance: 17231  on 12441  degrees of freedom
## AIC: 17235
## 
## Number of Fisher Scoring iterations: 3

A simple logistic regression on pollution shows that while the estimate on pollution is miniscule, wehn it is alone it is statistically significant. As it is a logistic regression, there is no simple measure of accuracy such as an R-Squared to report.

Model 1 performance evaluation

First, before evaluating performance on the test set, I will examine wether the model have succeeded sufficiently explaining the variance within-sample (on the same training sdata it was trained on). If it has, then I will continue to test the model on the test dataset.

# building a function to test performance within-sample
test_predict_within <- function(model) {
  ytrain_fac <- factor(train$bronch)
  ytrain_pred_fac <- as.vector(round(fitted(model)))
  ytrain_pred_fac <- factor(ytrain_pred_fac, levels=c("0","1"))
  print(confusionMatrix(ytrain_fac, ytrain_pred_fac))
}
test_predict_within(model1_pol)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2355 3857
##          1 2197 4034
##                                           
##                Accuracy : 0.5135          
##                  95% CI : (0.5046, 0.5223)
##     No Information Rate : 0.6342          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0265          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5174          
##             Specificity : 0.5112          
##          Pos Pred Value : 0.3791          
##          Neg Pred Value : 0.6474          
##              Prevalence : 0.3658          
##          Detection Rate : 0.1893          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5143          
##                                           
##        'Positive' Class : 0               
##

Even when testing model1_pol (only pollution) within-sample (on the same data it was trained on), the model performs poorly: accuracy is only %51.35, when the baseline by random guessing should be %50 (since the dataset is evenly distributed). Hence, pollution might have helped in explaining an additional 1.35% of the data over the 50% random guess expectation, which is very poor. It was not biased much towards one class. Pollution therefore does not seem to have a meaningful effect on the chance for a call to be bronchitis or not.

Model 1: Adding distance from air measurement station

The data only had 65 air measurement stations in teh region, so that distances from the nearest pollution measurement system varied greatly. Hence, visits with higher distances from the matched nearest station are less accurate. Therefore, I will try including the distance either alone or with an interaction term.

model1_pol_dis <- glm(bronch ~ pollution + distance , family = "binomial", data = train)
model1_pol_dis_int <- glm(bronch ~ pollution + distance + I(distance*pollution) , family = "binomial", data = train)
stargazer(model1_pol_dis, model1_pol_dis_int, type = "text")

## 
## ====================================================
##                             Dependent variable:     
##                         ----------------------------
##                                    bronch           
##                              (1)            (2)     
## ----------------------------------------------------
## pollution                  0.008***       0.007**   
##                            (0.002)        (0.003)   
##                                                     
## distance                   0.019**        -0.002    
##                            (0.008)        (0.034)   
##                                                     
## I(distance * pollution)                    0.001    
##                                           (0.001)   
##                                                     
## Constant                  -0.326***      -0.279***  
##                            (0.072)        (0.102)   
##                                                     
## ----------------------------------------------------
## Observations                12,443        12,443    
## Log Likelihood            -8,612.544    -8,612.332  
## Akaike Inf. Crit.         17,231.090    17,232.660  
## ====================================================
## Note:                    *p<0.1; **p<0.05; ***p<0.01

Distance alone showed to have a statistically meaningful coefficient, but not the interaction term. However, it does not make theoretic sense: distance from an air measurement station seemed to increase the likelihood for bronchitis, which is what we would expect if pollution was greater. That could be the case if stations are sparser in areas like cities where pollution is greater, but it’s not the case. Therefore, this is likely a spurious correlation and a confounding variable to add in this form for now.

Model 2: Personal Variables: Age and Number of Visits

Here I will try to add in personal variables: age, and later on - number of visits. Age is likely to have a nonlinear effect, since I’ve seen that babies and older people tend to get Bronchitis more. Therefore, I will include second order and test third order of age for prediction. I will also test an interaction term between pollution and age, since it might be that pollution affects more the delicate lungs of babies and older people. Since the interaction term might have nonlinear effects, I will also test adding higher orders of the interaction term.

Unfortunately, the standard F-test does not extend from Linear models to nonlinear logistic regression, therefore I will not be able to test these models using the F-test (Principles of Econometrics, 4th Edition; and uPenn STAT 501 Course). Hence, I will use other methods: I will asses the significance of added variables by the p-values of their coefficients, and compare the performance of the models against the Null hypothesis model without them (Model 1 in this case).

m2_age2 <- glm(bronch ~ pollution + age + I(age^2), family = "binomial", data = train)
m2_age3 <- glm(bronch ~ pollution + age + I(age^2) + I(age^3), family = "binomial", data = train)
m2_ageint2 <- glm(bronch ~ pollution + age + I(age^2) + I(pollution*age) + I((pollution*age)^2), family = "binomial", data = train)

stargazer(model1_pol, m2_age2, m2_age3, m2_ageint2, type="text")

## 
## ==================================================================
##                                   Dependent variable:             
##                       --------------------------------------------
##                                          bronch                   
##                          (1)        (2)         (3)        (4)    
## ------------------------------------------------------------------
## pollution              0.009***   0.007***   0.007***    0.008**  
##                        (0.002)    (0.002)     (0.002)    (0.004)  
##                                                                   
## age                              -0.027***   -0.108***  -0.029*** 
##                                   (0.002)     (0.006)    (0.007)  
##                                                                   
## I(age2)                          0.0004***   0.003***   0.0004*** 
##                                  (0.00003)   (0.0002)    (0.0001) 
##                                                                   
## I(age3)                                     -0.00002***           
##                                              (0.00000)            
##                                                                   
## I(pollution * age)                                        0.0001  
##                                                          (0.0002) 
##                                                                   
## I((pollution * age)2)                                    -0.00000 
##                                                         (0.00000) 
##                                                                   
## Constant              -0.291***   -0.134*    0.274***     -0.162  
##                        (0.071)    (0.078)     (0.083)    (0.137)  
##                                                                   
## ------------------------------------------------------------------
## Observations            12,443     12,443     12,443      12,443  
## Log Likelihood        -8,615.608 -8,447.739 -8,334.024  -8,446.930
## Akaike Inf. Crit.     17,235.220 16,903.480 16,678.050  16,905.860
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01

The personal charachteristics added model shows that adding age to the equation was determined as statistically significant for the model, including the second and third order degrees, since they all are highly statistically significant, with a p-value of less than 1%. The Age^3 model performed better in practice, achieving lower AIC (16,667 against 16,892 or 16,895 of the others). Notice that the coefficient estimate on Pollution has decreased when adding age, and more so when adding age^3. Thus, we know that the estimate of Pollution was not “true” or stable, and that age was able to explain some of the same variance. That said, all of the absolute magnitudes of the coefficients are extremely small, suggesting that the model still may not be flexible enough to distinguish well between the classes of bronchitis or not.

Model 2 - Age only performance evaluation

test_predict_within(m2_age3)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3676 2536
##          1 2470 3761
##                                          
##                Accuracy : 0.5977         
##                  95% CI : (0.589, 0.6063)
##     No Information Rate : 0.5061         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.1954         
##  Mcnemar's Test P-Value : 0.3583         
##                                          
##             Sensitivity : 0.5981         
##             Specificity : 0.5973         
##          Pos Pred Value : 0.5918         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.4939         
##          Detection Rate : 0.2954         
##    Detection Prevalence : 0.4992         
##       Balanced Accuracy : 0.5977         
##                                          
##        'Positive' Class : 0              
##

Adding Age significantly improved our performance - from 51.3% with pollution only, to 59.5% accuracy with age. This is still not an objectively good perfomance, so I will continue to add features in hope to increase the perforamnce and explain more of the variance.

Model 2: Adding Number of Visits or Removing Pollution

After determining which is the best form to add in age with its nonlinearities, I can add the other personal variable - number of visits. I will test it against our recent better perfoming model, with the third higher order of Age.
I also test the removal of pollution from the Age^3 model.

m2_age3_nopol_novis <- glm(bronch ~  age + I(age^2) + I(age^3), family = "binomial", data = train)
m2_age3_visits_nopol <- glm(bronch ~  age + visits + I(age^2) + I(age^3), family = "binomial", data = train)
m2_age_visits_pol <- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3), family = "binomial", data = train)

stargazer(m2_age3_nopol_novis, m2_age3_visits_nopol, m2_age3, m2_age_visits_pol, type="text")

## 
## =================================================================
##                                 Dependent variable:              
##                   -----------------------------------------------
##                                       bronch                     
##                       (1)         (2)         (3)         (4)    
## -----------------------------------------------------------------
## pollution                                  0.007***    0.007***  
##                                             (0.002)     (0.002)  
##                                                                  
## age                -0.108***   -0.108***   -0.108***   -0.108*** 
##                     (0.006)     (0.006)     (0.006)     (0.006)  
##                                                                  
## visits                        -0.0001***              -0.0001*** 
##                                (0.00002)               (0.00002) 
##                                                                  
## I(age2)            0.003***    0.003***    0.003***    0.003***  
##                    (0.0002)    (0.0002)    (0.0002)    (0.0002)  
##                                                                  
## I(age3)           -0.00002*** -0.00002*** -0.00002*** -0.00002***
##                    (0.00000)   (0.00000)   (0.00000)   (0.00000) 
##                                                                  
## Constant           0.516***    0.643***    0.274***    0.398***  
##                     (0.046)     (0.059)     (0.083)     (0.090)  
##                                                                  
## -----------------------------------------------------------------
## Observations        12,443      12,443      12,443      12,443   
## Log Likelihood    -8,340.059  -8,334.035  -8,334.024  -8,327.561 
## Akaike Inf. Crit. 16,688.120  16,678.070  16,678.050  16,667.120 
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

Adding visits to the pollution included model was statistically significant (with a p-value of 0.00002), but a very small coefficient, so it is not highly economically significant. Let’s check its performance in prediction, still within-sample. Adding visits had the same significant coefficient wether pollution was included or not. Adding pollution resulted in the same significant coefficient wether visits was included or not. All these models resulted in similar AICs, but both pollution and visits yielded the same marginal improvement in AIC and log likelihood, such that the better performning model of those was the one including both visits and pollution. They seem to expalin differrent parts of the variance.

Model 2 - Age and Visits performance evaluation

test_predict_within(m2_age_visits_pol)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3666 2546
##          1 2488 3743
##                                           
##                Accuracy : 0.5954          
##                  95% CI : (0.5867, 0.6041)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.1909          
##  Mcnemar's Test P-Value : 0.4218          
##                                           
##             Sensitivity : 0.5957          
##             Specificity : 0.5952          
##          Pos Pred Value : 0.5901          
##          Neg Pred Value : 0.6007          
##              Prevalence : 0.4946          
##          Detection Rate : 0.2946          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5954          
##                                           
##        'Positive' Class : 0               
##

The Visits variable didn’t seem to increase accuracy meaningfully, only by 0.2%. In this order of magnitude, it might be overfitting or due to random chance, and the gain is not large enough to justify that, so I am not going to include that in my model.

Model 3: Geographical Variables only

Here I will try to predict the chance for a bronchitis call using only Geographical Variables, and later add them onto the pollution and or age variables examined before. The data has multiple different geographical variables: Area (“urban”/“suburban”/“rural”), Latitude and Longitude, or city/area codes. I will try using each of them and compare the perforamnce of the model. According to LeSage “Spatial Statistics” book, while not transofrming my methods completely to spatial regressions, I transformed my Lat and Lon variables to be flexible enough to capture the needed variance by:
1. noramlizing and standartizing these variables around 0 and so that 1 is equal 1 standard deviation of the original value.
2. Using second order degrees 3. Using interaction terms

m3_geo_area <- glm(bronch ~ factor(station_area), family = "binomial", data = train) 
m3_geo_latlon<- glm(bronch ~ x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon), family = "binomial", data = train)
m3_geo_cityid<- glm(bronch ~ factor(city), family = "binomial", data = train)

stargazer(m3_geo_area, m3_geo_latlon, m3_geo_cityid, type="text")

## 
## ========================================================================
##                                               Dependent variable:       
##                                         --------------------------------
##                                                      bronch             
##                                            (1)        (2)        (3)    
## ------------------------------------------------------------------------
## factor(station_area)suburban              0.241                         
##                                          (0.293)                        
##                                                                         
## factor(station_area)urban                 0.261                         
##                                          (0.292)                        
##                                                                         
## x_lat                                               0.051**             
##                                                     (0.021)             
##                                                                         
## x_lon                                               0.087***            
##                                                     (0.026)             
##                                                                         
## I(x_lat2)                                           0.055***            
##                                                     (0.020)             
##                                                                         
## I(x_lon2)                                           0.028**             
##                                                     (0.013)             
##                                                                         
## I(x_lat * x_lon)                                   -0.089***            
##                                                     (0.030)             
##                                                                         
## factor(city)Aiguafreda                                          -0.154  
##                                                                (0.988)  
##                                                                         
## factor(city)Badia del Vall�s                                    0.028   
##                                                                (0.822)  
##                                                                         
## factor(city)Barbera                                             -0.493  
##                                                                (0.585)  
##                                                                         
## factor(city)Barcelona                                           0.006   
##                                                                (0.560)  
##                                                                         
## factor(city)Begas                                               0.593   
##                                                                (0.688)  
##                                                                         
## factor(city)Bellaterra                                          -0.714  
##                                                                (0.664)  
##                                                                         
## factor(city)Bigues i Riells                                     -0.606  
##                                                                (0.737)  
##                                                                         
## factor(city)Caldes de Montbui                                   0.108   
##                                                                (0.607)  
##                                                                         
## factor(city)Campins                                             12.412  
##                                                               (145.231) 
##                                                                         
## factor(city)Canovelles                                          -0.614  
##                                                                (0.667)  
##                                                                         
## factor(city)Cardedeu                                            0.155   
##                                                                (0.583)  
##                                                                         
## factor(city)Castellar del Vall�s                                -0.182  
##                                                                (0.581)  
##                                                                         
## factor(city)Castellbisbal                                       -0.026  
##                                                                (0.628)  
##                                                                         
## factor(city)Castelldefels                                       -0.635  
##                                                                (0.598)  
##                                                                         
## factor(city)Cerdanyola                                          -0.098  
##                                                                (0.565)  
##                                                                         
## factor(city)Cervell�                                           -2.100*  
##                                                                (1.205)  
##                                                                         
## factor(city)Collbat�                                            -0.154  
##                                                                (0.729)  
##                                                                         
## factor(city)Corbera de Llobregat                                0.907   
##                                                                (0.678)  
##                                                                         
## factor(city)Cornell� de Llobregat                               -0.292  
##                                                                (0.566)  
##                                                                         
## factor(city)C�noves i Samal�s                                   -0.943  
##                                                                (0.775)  
##                                                                         
## factor(city)El Papiol                                           -1.764  
##                                                                (1.229)  
##                                                                         
## factor(city)El Prat                                             -0.847  
##                                                                (0.678)  
##                                                                         
## factor(city)Esparreguera                                        -0.110  
##                                                                (0.631)  
##                                                                         
## factor(city)Esplugas de Llobregat                               -0.169  
##                                                                (0.570)  
##                                                                         
## factor(city)Gallifa                                             12.412  
##                                                               (145.231) 
##                                                                         
## factor(city)Gav�                                                -0.752  
##                                                                (0.616)  
##                                                                         
## factor(city)Granera                                             12.412  
##                                                               (324.744) 
##                                                                         
## factor(city)Granollers                                          -0.024  
##                                                                (0.567)  
##                                                                         
## factor(city)Hospitalet de Llobregat                             -0.059  
##                                                                (0.566)  
##                                                                         
## factor(city)l'Ametlla del Vall�s                                0.224   
##                                                                (0.596)  
##                                                                         
## factor(city)La Garriga                                          0.238   
##                                                                (0.614)  
##                                                                         
## factor(city)La Llagosta                                         0.421   
##                                                                (0.695)  
##                                                                         
## factor(city)La Palma de Cervell�                                0.539   
##                                                                (0.781)  
##                                                                         
## factor(city)La Roca del Vall�s                                  -0.247  
##                                                                (0.609)  
##                                                                         
## factor(city)Les Franqueses del Vall�s                           -0.036  
##                                                                (0.597)  
##                                                                         
## factor(city)Llinars del Vall�s                                  -0.228  
##                                                                (0.677)  
##                                                                         
## factor(city)Lli�� d'Amunt                                       -0.272  
##                                                                (0.607)  
##                                                                         
## factor(city)Lli�� de Vall                                       0.251   
##                                                                (0.627)  
##                                                                         
## factor(city)Martorell                                           0.120   
##                                                                (0.634)  
##                                                                         
## factor(city)Martorelles                                         -0.154  
##                                                                (0.802)  
##                                                                         
## factor(city)Matadepera                                          -0.061  
##                                                                (0.577)  
##                                                                         
## factor(city)Molins de Rei                                       -0.329  
##                                                                (0.580)  
##                                                                         
## factor(city)Mollet del Vall�s                                   -0.294  
##                                                                (0.573)  
##                                                                         
## factor(city)Moncada y Reixach                                   -0.021  
##                                                                (0.575)  
##                                                                         
## factor(city)Montmel�                                            -0.908  
##                                                                (0.702)  
##                                                                         
## factor(city)Montorn�s del Vall�s                                -0.442  
##                                                                (0.675)  
##                                                                         
## factor(city)Olesa de Montserrat                                 0.323   
##                                                                (0.632)  
##                                                                         
## factor(city)Palau-solit� i Plegamans                            -0.021  
##                                                                (0.603)  
##                                                                         
## factor(city)Pallej�                                             -0.742  
##                                                                (0.788)  
##                                                                         
## factor(city)Pallej� (Fontpineda)                                0.539   
##                                                                (1.029)  
##                                                                         
## factor(city)Parets del Vall�s                                   0.159   
##                                                                (0.587)  
##                                                                         
## factor(city)Poliny�                                             -0.901  
##                                                                (0.688)  
##                                                                         
## factor(city)Ripollet                                            -0.426  
##                                                                (0.581)  
##                                                                         
## factor(city)Rubi                                                -0.229  
##                                                                (0.567)  
##                                                                         
## factor(city)Sabadell                                            -0.219  
##                                                                (0.560)  
##                                                                         
## factor(city)Sant Andreu de la Barca                             0.035   
##                                                                (0.621)  
##                                                                         
## factor(city)Sant Antoni de Vilamajor                            -0.260  
##                                                                (0.644)  
##                                                                         
## factor(city)Sant Boi de Llobregat                               -0.714  
##                                                                (0.605)  
##                                                                         
## factor(city)Sant Celoni                                         -0.049  
##                                                                (0.616)  
##                                                                         
## factor(city)Sant Climent de Llobregat                           -0.154  
##                                                                (1.144)  
##                                                                         
## factor(city)Sant Cugat                                          -0.348  
##                                                                (0.559)  
##                                                                         
## factor(city)Sant Esteve de Sesrovires                           0.608   
##                                                                (0.720)  
##                                                                         
## factor(city)Sant Feliu de Codines                               0.277   
##                                                                (0.661)  
##                                                                         
## factor(city)Sant Feliu de Llobregat                             -0.194  
##                                                                (0.563)  
##                                                                         
## factor(city)Sant Fost de Campsentelles                          0.266   
##                                                                (0.618)  
##                                                                         
## factor(city)Sant Joan Despi                                     -0.395  
##                                                                (0.569)  
##                                                                         
## factor(city)Sant Just Desvern                                   0.063   
##                                                                (0.567)  
##                                                                         
## factor(city)Sant Pere de Vilamajor                              0.619   
##                                                                (0.744)  
##                                                                         
## factor(city)Sant Quirze                                         0.047   
##                                                                (0.571)  
##                                                                         
## factor(city)Sant Vicen� dels Horts                              -0.279  
##                                                                (0.660)  
##                                                                         
## factor(city)Santa Coloma de Cervell�                            0.134   
##                                                                (0.945)  
##                                                                         
## factor(city)Santa Eul�lia de Ron�ana                            0.834   
##                                                                (0.629)  
##                                                                         
## factor(city)Santa Maria de Martorelles                          0.069   
##                                                                (0.872)  
##                                                                         
## factor(city)Santa Maria de Palautordera                         0.474   
##                                                                (0.637)  
##                                                                         
## factor(city)Santa Perpetua de Mogoda                            -0.057  
##                                                                (0.585)  
##                                                                         
## factor(city)Sentmenat                                           0.348   
##                                                                (0.616)  
##                                                                         
## factor(city)Tagamanent                                          -0.308  
##                                                                (0.787)  
##                                                                         
## factor(city)Terrassa                                            -0.105  
##                                                                (0.559)  
##                                                                         
## factor(city)Ullastrell                                         -2.457** 
##                                                                (1.187)  
##                                                                         
## factor(city)Vacarises                                          -1.764** 
##                                                                (0.781)  
##                                                                         
## factor(city)Valldoreix                                          -0.430  
##                                                                (0.624)  
##                                                                         
## factor(city)Vallgorguina                                       1.670**  
##                                                                (0.736)  
##                                                                         
## factor(city)Vallirana                                           -0.219  
##                                                                (0.662)  
##                                                                         
## factor(city)Vallromanes                                         -1.099  
##                                                                (0.713)  
##                                                                         
## factor(city)Viladecans                                         -1.184*  
##                                                                (0.632)  
##                                                                         
## factor(city)Viladecavalls                                       0.418   
##                                                                (0.595)  
##                                                                         
## factor(city)Vilanova del Vall�s                                 -0.486  
##                                                                (0.601)  
##                                                                         
## Constant                                  -0.251    -0.045*     0.154   
##                                          (0.291)    (0.026)    (0.556)  
##                                                                         
## ------------------------------------------------------------------------
## Observations                              12,443     12,443     12,443  
## Log Likelihood                          -8,624.294 -8,605.205 -8,502.446
## Akaike Inf. Crit.                       17,254.590 17,222.410 17,180.890
## ========================================================================
## Note:                                        *p<0.1; **p<0.05; ***p<0.01

These geographical variables do not seem to perform better alone than the age variable, and seem to perform similarly to the pollution only data, while they are more detailed. The charachterization of area as urban/suburban/rural produced non-statistically-significant coefficients. However, it performed not much worse than the others which did. The latitdue and longitude model suggested significant coefficient for each of the variants, and a better performance. The factor(cities) model, where each city is regarded as a seperate indicator (“dummy”) variable, showed that most of the cities were not statistically significant than the baseline, but only a few were. This model gives us better performance in AIC and better interpretability of results - with this information we now know which specific cities are less or more likely to have bronchitis cases. Let’s examine the performance of those models in prediction.

print("Area Model:")

## [1] "Area Model:"

test_predict_within(m3_geo_area)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1822 4390
##          1 1797 4434
##                                           
##                Accuracy : 0.5028          
##                  95% CI : (0.4939, 0.5116)
##     No Information Rate : 0.7092          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0049          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5035          
##             Specificity : 0.5025          
##          Pos Pred Value : 0.2933          
##          Neg Pred Value : 0.7116          
##              Prevalence : 0.2908          
##          Detection Rate : 0.1464          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5030          
##                                           
##        'Positive' Class : 0               
##

print("Lat/Lon Model:")

## [1] "Lat/Lon Model:"

test_predict_within(m3_geo_latlon)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3154 3058
##          1 2855 3376
##                                          
##                Accuracy : 0.5248         
##                  95% CI : (0.516, 0.5336)
##     No Information Rate : 0.5171         
##     P-Value [Acc > NIR] : 0.043312       
##                                          
##                   Kappa : 0.0495         
##  Mcnemar's Test P-Value : 0.008616       
##                                          
##             Sensitivity : 0.5249         
##             Specificity : 0.5247         
##          Pos Pred Value : 0.5077         
##          Neg Pred Value : 0.5418         
##              Prevalence : 0.4829         
##          Detection Rate : 0.2535         
##    Detection Prevalence : 0.4992         
##       Balanced Accuracy : 0.5248         
##                                          
##        'Positive' Class : 0              
##

print("Cities Model:")

## [1] "Cities Model:"

test_predict_within(m3_geo_cityid)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3341 2871
##          1 2777 3454
##                                           
##                Accuracy : 0.5461          
##                  95% CI : (0.5373, 0.5549)
##     No Information Rate : 0.5083          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.0922          
##  Mcnemar's Test P-Value : 0.2159          
##                                           
##             Sensitivity : 0.5461          
##             Specificity : 0.5461          
##          Pos Pred Value : 0.5378          
##          Neg Pred Value : 0.5543          
##              Prevalence : 0.4917          
##          Detection Rate : 0.2685          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5461          
##                                           
##        'Positive' Class : 0               
##

All of them seem to have only little predicitive power. The area model does not have any, the accuracy is around 50%. The lat/lon model seems to have a slight improvement of 2.5%, and the cities model has an improvement of 4%. However, the cities model is highly restrictive because of using ~150 dummy variables, dividing our data to too many small stratas (as we saw at first). This might pose an issue later if we combine the cities indicator variables together with many other variables so we should remember that.

Model 4: combining personal and geographical information

Let’s see the performance of models with both personal and geographical variables tested previously.

m4_person_geolatlon<- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3) + x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon), family = "binomial", data = train)
m4_person_geocityid<- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3) + factor(city), family = "binomial", data = train)

stargazer(m2_age_visits_pol, m4_person_geolatlon, m4_person_geocityid, type="text")

## 
## ===========================================================================
##                                                 Dependent variable:        
##                                         -----------------------------------
##                                                       bronch               
##                                             (1)         (2)         (3)    
## ---------------------------------------------------------------------------
## pollution                                0.007***      0.003      -0.0003  
##                                           (0.002)     (0.003)     (0.005)  
##                                                                            
## visits                                  -0.0001***  -0.0001***  -0.0001*** 
##                                          (0.00002)   (0.00002)   (0.00002) 
##                                                                            
## age                                      -0.108***   -0.108***   -0.109*** 
##                                           (0.006)     (0.006)     (0.006)  
##                                                                            
## I(age2)                                  0.003***    0.003***    0.003***  
##                                          (0.0002)    (0.0002)    (0.0002)  
##                                                                            
## I(age3)                                 -0.00002*** -0.00002*** -0.00002***
##                                          (0.00000)   (0.00000)   (0.00000) 
##                                                                            
## x_lat                                                 0.058**              
##                                                       (0.024)              
##                                                                            
## x_lon                                                 0.055*               
##                                                       (0.030)              
##                                                                            
## I(x_lat2)                                            0.054***              
##                                                       (0.021)              
##                                                                            
## I(x_lon2)                                              0.014               
##                                                       (0.014)              
##                                                                            
## I(x_lat * x_lon)                                      -0.049               
##                                                       (0.036)              
##                                                                            
## factor(city)Aiguafreda                                            -0.551   
##                                                                   (1.025)  
##                                                                            
## factor(city)Badia del Vall�s                                      -0.599   
##                                                                   (0.840)  
##                                                                            
## factor(city)Barbera                                               -0.639   
##                                                                   (0.600)  
##                                                                            
## factor(city)Barcelona                                             -0.242   
##                                                                   (0.576)  
##                                                                            
## factor(city)Begas                                                  0.368   
##                                                                   (0.711)  
##                                                                            
## factor(city)Bellaterra                                            -0.907   
##                                                                   (0.680)  
##                                                                            
## factor(city)Bigues i Riells                                       -0.796   
##                                                                   (0.755)  
##                                                                            
## factor(city)Caldes de Montbui                                     -0.286   
##                                                                   (0.623)  
##                                                                            
## factor(city)Campins                                               12.605   
##                                                                  (145.231) 
##                                                                            
## factor(city)Canovelles                                            -0.650   
##                                                                   (0.682)  
##                                                                            
## factor(city)Cardedeu                                               0.052   
##                                                                   (0.597)  
##                                                                            
## factor(city)Castellar del Vall�s                                  -0.135   
##                                                                   (0.596)  
##                                                                            
## factor(city)Castellbisbal                                         -0.161   
##                                                                   (0.644)  
##                                                                            
## factor(city)Castelldefels                                         -0.812   
##                                                                   (0.623)  
##                                                                            
## factor(city)Cerdanyola                                            -0.312   
##                                                                   (0.579)  
##                                                                            
## factor(city)Cervell�                                             -2.684**  
##                                                                   (1.219)  
##                                                                            
## factor(city)Collbat�                                              -0.406   
##                                                                   (0.744)  
##                                                                            
## factor(city)Corbera de Llobregat                                   0.646   
##                                                                   (0.694)  
##                                                                            
## factor(city)Cornell� de Llobregat                                 -0.418   
##                                                                   (0.582)  
##                                                                            
## factor(city)C�noves i Samal�s                                     -0.950   
##                                                                   (0.810)  
##                                                                            
## factor(city)El Papiol                                             -1.911   
##                                                                   (1.245)  
##                                                                            
## factor(city)El Prat                                               -1.080   
##                                                                   (0.695)  
##                                                                            
## factor(city)Esparreguera                                          -0.363   
##                                                                   (0.646)  
##                                                                            
## factor(city)Esplugas de Llobregat                                 -0.377   
##                                                                   (0.585)  
##                                                                            
## factor(city)Gallifa                                               11.939   
##                                                                  (145.231) 
##                                                                            
## factor(city)Gav�                                                  -0.826   
##                                                                   (0.639)  
##                                                                            
## factor(city)Granera                                               11.670   
##                                                                  (324.744) 
##                                                                            
## factor(city)Granollers                                            -0.189   
##                                                                   (0.582)  
##                                                                            
## factor(city)Hospitalet de Llobregat                               -0.152   
##                                                                   (0.581)  
##                                                                            
## factor(city)l'Ametlla del Vall�s                                   0.148   
##                                                                   (0.611)  
##                                                                            
## factor(city)La Garriga                                             0.157   
##                                                                   (0.629)  
##                                                                            
## factor(city)La Llagosta                                            0.356   
##                                                                   (0.709)  
##                                                                            
## factor(city)La Palma de Cervell�                                   0.032   
##                                                                   (0.810)  
##                                                                            
## factor(city)La Roca del Vall�s                                    -0.406   
##                                                                   (0.624)  
##                                                                            
## factor(city)Les Franqueses del Vall�s                             -0.071   
##                                                                   (0.612)  
##                                                                            
## factor(city)Llinars del Vall�s                                    -0.197   
##                                                                   (0.690)  
##                                                                            
## factor(city)Lli�� d'Amunt                                         -0.283   
##                                                                   (0.622)  
##                                                                            
## factor(city)Lli�� de Vall                                          0.021   
##                                                                   (0.642)  
##                                                                            
## factor(city)Martorell                                             -0.097   
##                                                                   (0.649)  
##                                                                            
## factor(city)Martorelles                                           -0.330   
##                                                                   (0.819)  
##                                                                            
## factor(city)Matadepera                                            -0.247   
##                                                                   (0.592)  
##                                                                            
## factor(city)Molins de Rei                                         -0.400   
##                                                                   (0.597)  
##                                                                            
## factor(city)Mollet del Vall�s                                     -0.500   
##                                                                   (0.589)  
##                                                                            
## factor(city)Moncada y Reixach                                     -0.193   
##                                                                   (0.589)  
##                                                                            
## factor(city)Montmel�                                              -1.197*  
##                                                                   (0.720)  
##                                                                            
## factor(city)Montorn�s del Vall�s                                  -0.448   
##                                                                   (0.690)  
##                                                                            
## factor(city)Olesa de Montserrat                                    0.074   
##                                                                   (0.648)  
##                                                                            
## factor(city)Palau-solit� i Plegamans                              -0.278   
##                                                                   (0.619)  
##                                                                            
## factor(city)Pallej�                                               -1.181   
##                                                                   (0.808)  
##                                                                            
## factor(city)Pallej� (Fontpineda)                                  -0.133   
##                                                                   (1.040)  
##                                                                            
## factor(city)Parets del Vall�s                                      0.033   
##                                                                   (0.603)  
##                                                                            
## factor(city)Poliny�                                               -1.045   
##                                                                   (0.704)  
##                                                                            
## factor(city)Ripollet                                              -0.649   
##                                                                   (0.595)  
##                                                                            
## factor(city)Rubi                                                  -0.407   
##                                                                   (0.582)  
##                                                                            
## factor(city)Sabadell                                              -0.368   
##                                                                   (0.574)  
##                                                                            
## factor(city)Sant Andreu de la Barca                               -0.144   
##                                                                   (0.637)  
##                                                                            
## factor(city)Sant Antoni de Vilamajor                              -0.409   
##                                                                   (0.659)  
##                                                                            
## factor(city)Sant Boi de Llobregat                                 -0.902   
##                                                                   (0.621)  
##                                                                            
## factor(city)Sant Celoni                                           -0.388   
##                                                                   (0.633)  
##                                                                            
## factor(city)Sant Climent de Llobregat                             -0.244   
##                                                                   (1.163)  
##                                                                            
## factor(city)Sant Cugat                                            -0.461   
##                                                                   (0.574)  
##                                                                            
## factor(city)Sant Esteve de Sesrovires                              0.656   
##                                                                   (0.734)  
##                                                                            
## factor(city)Sant Feliu de Codines                                  0.195   
##                                                                   (0.677)  
##                                                                            
## factor(city)Sant Feliu de Llobregat                               -0.337   
##                                                                   (0.584)  
##                                                                            
## factor(city)Sant Fost de Campsentelles                             0.070   
##                                                                   (0.635)  
##                                                                            
## factor(city)Sant Joan Despi                                       -0.487   
##                                                                   (0.589)  
##                                                                            
## factor(city)Sant Just Desvern                                     -0.164   
##                                                                   (0.587)  
##                                                                            
## factor(city)Sant Pere de Vilamajor                                 0.123   
##                                                                   (0.760)  
##                                                                            
## factor(city)Sant Quirze                                           -0.111   
##                                                                   (0.586)  
##                                                                            
## factor(city)Sant Vicen� dels Horts                                -0.686   
##                                                                   (0.674)  
##                                                                            
## factor(city)Santa Coloma de Cervell�                              -0.067   
##                                                                   (0.971)  
##                                                                            
## factor(city)Santa Eul�lia de Ron�ana                               0.814   
##                                                                   (0.643)  
##                                                                            
## factor(city)Santa Maria de Martorelles                             0.363   
##                                                                   (0.885)  
##                                                                            
## factor(city)Santa Maria de Palautordera                            0.423   
##                                                                   (0.653)  
##                                                                            
## factor(city)Santa Perpetua de Mogoda                              -0.282   
##                                                                   (0.599)  
##                                                                            
## factor(city)Sentmenat                                              0.217   
##                                                                   (0.632)  
##                                                                            
## factor(city)Tagamanent                                            -0.637   
##                                                                   (0.820)  
##                                                                            
## factor(city)Terrassa                                              -0.270   
##                                                                   (0.574)  
##                                                                            
## factor(city)Ullastrell                                           -2.466**  
##                                                                   (1.196)  
##                                                                            
## factor(city)Vacarises                                            -1.782**  
##                                                                   (0.796)  
##                                                                            
## factor(city)Valldoreix                                            -0.604   
##                                                                   (0.641)  
##                                                                            
## factor(city)Vallgorguina                                          1.312*   
##                                                                   (0.749)  
##                                                                            
## factor(city)Vallirana                                             -0.374   
##                                                                   (0.681)  
##                                                                            
## factor(city)Vallromanes                                          -1.674**  
##                                                                   (0.728)  
##                                                                            
## factor(city)Viladecans                                           -1.476**  
##                                                                   (0.655)  
##                                                                            
## factor(city)Viladecavalls                                          0.270   
##                                                                   (0.611)  
##                                                                            
## factor(city)Vilanova del Vall�s                                   -0.616   
##                                                                   (0.616)  
##                                                                            
## Constant                                 0.398***    0.468***      0.961   
##                                           (0.090)     (0.116)     (0.601)  
##                                                                            
## ---------------------------------------------------------------------------
## Observations                              12,443      12,443      12,443   
## Log Likelihood                          -8,327.561  -8,314.407  -8,217.180 
## Akaike Inf. Crit.                       16,667.120  16,650.810  16,620.360 
## ===========================================================================
## Note:                                           *p<0.1; **p<0.05; ***p<0.01

As expected from before - the addition of the latitude and longitude variables improved the AIC over the only personal charachteristics model with pollution but only slightly. More importantly, it reduced the coefficient (estimated impact) of pollution because of the correlation between the geographical variables and pollution, and the cities model made the coefficient of pollution be negative (suggesting that pollution reduced the likelihood of bronchitis). This does not seem reasonable, unless operating through a third variable (for example, if pollution is correalted with lower income otherwise than with cities).

Model 4 (Geo) - testing predictive ability

Let’s test the predictive ability

print("Personal + Lat/Lon Model:")

## [1] "Personal + Lat/Lon Model:"

test_predict_within(m4_person_geolatlon)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3705 2507
##          1 2480 3751
##                                           
##                Accuracy : 0.5992          
##                  95% CI : (0.5905, 0.6078)
##     No Information Rate : 0.5029          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.1984          
##  Mcnemar's Test P-Value : 0.7127          
##                                           
##             Sensitivity : 0.5990          
##             Specificity : 0.5994          
##          Pos Pred Value : 0.5964          
##          Neg Pred Value : 0.6020          
##              Prevalence : 0.4971          
##          Detection Rate : 0.2978          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5992          
##                                           
##        'Positive' Class : 0               
##

print("Personal + City Model:")

## [1] "Personal + City Model:"

test_predict_within(m4_person_geocityid)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3772 2440
##          1 2478 3753
##                                           
##                Accuracy : 0.6048          
##                  95% CI : (0.5961, 0.6134)
##     No Information Rate : 0.5023          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2095          
##  Mcnemar's Test P-Value : 0.5978          
##                                           
##             Sensitivity : 0.6035          
##             Specificity : 0.6060          
##          Pos Pred Value : 0.6072          
##          Neg Pred Value : 0.6023          
##              Prevalence : 0.5023          
##          Detection Rate : 0.3031          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.6048          
##                                           
##        'Positive' Class : 0               
##

As expected, they both did only slightly better than the personal variables only model, with the cities based model doing 1% better in accuracy.

Model 5 - Adding time related variables

The last group of variables we may add is time-related variables. Month could explain seasonal changes (or monthly changes, therefore I will include it as a factor variable for maximum flexibility), year could explain if there is a yearly trend, and day of week could explain if there are pattenrs for which diseases people call a doctor for at each day (for example, weekend or not).

# base: m4_person_geolatlon
m5_time_only<- glm(bronch ~ factor(month) + dayofweek + year
                            , family = "binomial", data = train)
m5_time_all_dayofweek<- glm(bronch ~ pollution + distance + visits + age + I(age^2) + I(age^3) + 
                            x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon) +
                            factor(month) + dayofweek + year
                            , family = "binomial", data = train)
m5_time_all<- glm(bronch ~ pollution + distance + visits + age + I(age^2) + I(age^3) + 
                            x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon) +
                            factor(month) + year
                            , family = "binomial", data = train)
stargazer(m4_person_geolatlon, m5_time_only, m5_time_all_dayofweek, m5_time_all, type = "text")

## 
## =================================================================
##                                 Dependent variable:              
##                   -----------------------------------------------
##                                       bronch                     
##                       (1)         (2)         (3)         (4)    
## -----------------------------------------------------------------
## pollution            0.003                   0.002       0.002   
##                     (0.003)                 (0.003)     (0.003)  
##                                                                  
## distance                                     0.001       0.001   
##                                             (0.011)     (0.011)  
##                                                                  
## visits            -0.0001***              -0.0001***  -0.0001*** 
##                    (0.00002)               (0.00002)   (0.00002) 
##                                                                  
## age                -0.108***               -0.109***   -0.108*** 
##                     (0.006)                 (0.006)     (0.006)  
##                                                                  
## I(age2)            0.003***                0.003***    0.003***  
##                    (0.0002)                (0.0002)    (0.0002)  
##                                                                  
## I(age3)           -0.00002***             -0.00002*** -0.00002***
##                    (0.00000)               (0.00000)   (0.00000) 
##                                                                  
## x_lat               0.058**                0.081***    0.079***  
##                     (0.024)                 (0.028)     (0.028)  
##                                                                  
## x_lon               0.055*                   0.049       0.048   
##                     (0.030)                 (0.030)     (0.030)  
##                                                                  
## I(x_lat2)          0.054***                 0.047**     0.047**  
##                     (0.021)                 (0.023)     (0.023)  
##                                                                  
## I(x_lon2)            0.014                   0.012       0.012   
##                     (0.014)                 (0.014)     (0.014)  
##                                                                  
## I(x_lat * x_lon)    -0.049                  -0.041      -0.040   
##                     (0.036)                 (0.037)     (0.037)  
##                                                                  
## factor(month)2                   0.080       0.107       0.105   
##                                 (0.074)     (0.076)     (0.076)  
##                                                                  
## factor(month)3                   0.006       0.032       0.030   
##                                 (0.076)     (0.078)     (0.078)  
##                                                                  
## factor(month)4                 -0.173**    -0.206**    -0.208**  
##                                 (0.082)     (0.085)     (0.085)  
##                                                                  
## factor(month)5                 -0.317***   -0.311***   -0.307*** 
##                                 (0.086)     (0.088)     (0.088)  
##                                                                  
## factor(month)6                 -0.249***   -0.223**    -0.224**  
##                                 (0.091)     (0.093)     (0.093)  
##                                                                  
## factor(month)7                  -0.145      -0.105      -0.109   
##                                 (0.088)     (0.091)     (0.091)  
##                                                                  
## factor(month)8                  -0.148      -0.145      -0.147   
##                                 (0.092)     (0.095)     (0.095)  
##                                                                  
## factor(month)9                   0.016       0.028       0.022   
##                                 (0.090)     (0.092)     (0.092)  
##                                                                  
## factor(month)10                  0.056      -0.026      -0.026   
##                                 (0.089)     (0.091)     (0.091)  
##                                                                  
## factor(month)11                  0.096       0.086       0.080   
##                                 (0.082)     (0.084)     (0.084)  
##                                                                  
## factor(month)12                0.282***    0.312***    0.305***  
##                                 (0.070)     (0.072)     (0.072)  
##                                                                  
## dayofweek                       -0.006      -0.014               
##                                 (0.009)     (0.009)              
##                                                                  
## year                           0.075***    0.071***    0.070***  
##                                 (0.021)     (0.023)     (0.023)  
##                                                                  
## Constant           0.468***   -151.017*** -143.339*** -141.476***
##                     (0.116)    (43.212)    (46.786)    (46.766)  
##                                                                  
## -----------------------------------------------------------------
## Observations        12,443      12,443      12,443      12,443   
## Log Likelihood    -8,314.407  -8,581.158  -8,269.872  -8,271.116 
## Akaike Inf. Crit. 16,650.810  17,190.320  16,589.740  16,590.230 
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

It seems like the time based model alone did similar to the previous models basd only on one type of variables in terms of AIC. The baseline month is (1), January, and we can see that spring months had significnatly lower chance for a bronchitis call, and that december had a higher chance. A positive coefficient on year suggests that the proportion of bronchitis cases from all cases increased per year. Day of week seemed to not be significant, so I will drop it.

Let’s test the prediction ability: #### Model 5 - predictive performance of model with time variables Let’s test the predictive ability of the time added model. I will also test the same model with “city” as geographical variable rather than the Lat/Lon.

m5_time_all_city<- glm(bronch ~ pollution + distance + visits + age + I(age^2) + I(age^3) + 
                            factor(city) + factor(month) + year
                            , family = "binomial", data = train)
print("All variables with time, Lat/Lon")

## [1] "All variables with time, Lat/Lon"

test_predict_within(m5_time_all)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3713 2499
##          1 2452 3779
##                                           
##                Accuracy : 0.6021          
##                  95% CI : (0.5934, 0.6107)
##     No Information Rate : 0.5045          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2042          
##  Mcnemar's Test P-Value : 0.5133          
##                                           
##             Sensitivity : 0.6023          
##             Specificity : 0.6019          
##          Pos Pred Value : 0.5977          
##          Neg Pred Value : 0.6065          
##              Prevalence : 0.4955          
##          Detection Rate : 0.2984          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.6021          
##                                           
##        'Positive' Class : 0               
##

print("All variables with time, Cities")

## [1] "All variables with time, Cities"

test_predict_within(m5_time_all_city)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3781 2431
##          1 2441 3790
##                                          
##                Accuracy : 0.6085         
##                  95% CI : (0.5998, 0.617)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.2169         
##  Mcnemar's Test P-Value : 0.8974         
##                                          
##             Sensitivity : 0.6077         
##             Specificity : 0.6092         
##          Pos Pred Value : 0.6087         
##          Neg Pred Value : 0.6082         
##              Prevalence : 0.5000         
##          Detection Rate : 0.3039         
##    Detection Prevalence : 0.4992         
##       Balanced Accuracy : 0.6085         
##                                          
##        'Positive' Class : 0              
##

The time added model improved the accuracy only slightly, by about 1% than the same model without time. Teh cities or lat/lon achieved pretty much the same performance results.

Test performance of best models out-of-sample (test data)

Now I test the performance on the test data.

test_predict_outsample <- function(model) {
  ytest_fac <- factor(test$bronch)
  ytest_pred_fac <- as.vector(round(predict(model, newdata = test)))
  ytest_pred_fac <- factor(ytest_pred_fac, levels=c("0","1"))
  print(confusionMatrix(ytest_fac, ytest_pred_fac))
}
print("All personal variables")

## [1] "All personal variables"

test_predict_outsample(m2_age3)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1048  204
##          1 1099  330
##                                           
##                Accuracy : 0.514           
##                  95% CI : (0.4949, 0.5331)
##     No Information Rate : 0.8008          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0651          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4881          
##             Specificity : 0.6180          
##          Pos Pred Value : 0.8371          
##          Neg Pred Value : 0.2309          
##              Prevalence : 0.8008          
##          Detection Rate : 0.3909          
##    Detection Prevalence : 0.4670          
##       Balanced Accuracy : 0.5531          
##                                           
##        'Positive' Class : 0               
##

print("All variables with time, Lat/Lon")

## [1] "All variables with time, Lat/Lon"

test_predict_outsample(m5_time_all)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  985  221
##          1 1051  385
##                                           
##                Accuracy : 0.5185          
##                  95% CI : (0.4993, 0.5378)
##     No Information Rate : 0.7706          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0804          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4838          
##             Specificity : 0.6353          
##          Pos Pred Value : 0.8167          
##          Neg Pred Value : 0.2681          
##              Prevalence : 0.7706          
##          Detection Rate : 0.3728          
##    Detection Prevalence : 0.4565          
##       Balanced Accuracy : 0.5596          
##                                           
##        'Positive' Class : 0               
##

print("All variables with time, Cities")

## [1] "All variables with time, Cities"

test_predict_outsample(m5_time_all_city)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 936 228
##          1 979 434
##                                          
##                Accuracy : 0.5316         
##                  95% CI : (0.5121, 0.551)
##     No Information Rate : 0.7431         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1053         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.4888         
##             Specificity : 0.6556         
##          Pos Pred Value : 0.8041         
##          Neg Pred Value : 0.3071         
##              Prevalence : 0.7431         
##          Detection Rate : 0.3632         
##    Detection Prevalence : 0.4517         
##       Balanced Accuracy : 0.5722         
##                                          
##        'Positive' Class : 0              
##

The test data performance shows accuracy of maximum 53.2%, meaning that our models so far were not able to successfully explain or predict the chance for a bronchitis call using the best logistic regression models attempted. They perform worse on the test data, meaning that they overfitted to the training data.

Statistical Tests and Corrections

Synthesis of Models and Conclusions

bronchitis Modeling with Air Pollution Data in Barcelona

Modeling Medical Home Visits Data in Barcelona Combined with Air Pollution Data

Tomer Eldor