Let’s load and view the data’s first few lines as well as summary statistics about it
# load the data
df <- read.csv("./output/bronch.csv")
# remove irrelevant columns
#df[ ,c('X','pathology', 'pathology_original', 'id_type', 'minor_18','geopoint', 'n_visits', 'dayofyear', 'date_numerical')] <- list(NULL)
df$X <- NULL
# recode relevant columns in the relevant encoding: float, categorical
df$bronch <- factor(df$bronch)
kable(summary(df))
| status | time_delay | city | city_id | age | zip | lat | lon | date | id_personal | pathology_eng | visits | month | dayofmonth | dayofweek | hour | x_age_scaled | x_visits | x_date | x_month | x_dayofweek | x_hour | x_lon | x_lat | x_zip | pathology_id | year | station_lat | pollution | distance | station_city | station_lon_air | station_area | bronch | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :4.000 | Min. : 0.00 | Terrassa :1913 | Min. : 1.00 | Min. : 1.00 | Min. :8004 | Min. :41.21 | Min. :1.624 | 2018-05-01 15:02:00: 16 | Min. : 0.00 | bronchitis :7872 | Min. : 1 | Min. : 1.000 | Min. : 1.00 | Min. :0.000 | Min. : 0.00 | Min. :-1.25041 | Min. :-2.015002 | Min. :-2.86046 | Min. :-1.3824150 | Min. :-1.47097 | Min. :-2.52934 | Min. :-4.74391 | Min. :-3.097613 | Min. :-1.20005 | Min. : 1.00 | Min. :2015 | Min. :41.20 | Min. : 2.914 | Min. : 0.001931 | TERRASSA :2310 | Min. :1.618 | rural : 70 | 0:7872 | |
| 1st Qu.:4.000 | 1st Qu.: 45.00 | Sant Cugat :1902 | 1st Qu.: 26.00 | 1st Qu.: 8.00 | 1st Qu.:8184 | 1st Qu.:41.40 | 1st Qu.:2.048 | 2017-01-08 19:00:00: 13 | 1st Qu.: 46.00 | fever :2052 | 1st Qu.:1127 | 1st Qu.: 3.000 | 1st Qu.: 6.00 | 1st Qu.:1.000 | 1st Qu.:10.00 | 1st Qu.:-0.98151 | 1st Qu.:-0.666875 | 1st Qu.:-0.60114 | 1st Qu.:-0.8573621 | 1st Qu.:-0.98581 | 1st Qu.:-0.68611 | 1st Qu.:-0.59755 | 1st Qu.:-1.033615 | 1st Qu.:-0.62159 | 1st Qu.: 33.00 | 1st Qu.:2016 | 1st Qu.:41.40 | 1st Qu.:27.950 | 1st Qu.: 1.050538 | SABADELL :1863 | 1st Qu.:2.042 | suburban: 4529 | 1:7872 | |
| Median :4.000 | Median : 60.00 | Sabadell :1383 | Median : 50.00 | Median :36.00 | Median :8222 | Median :41.50 | Median :2.090 | 2017-12-27 08:56:00: 13 | Median : 55.00 | tonsillitis : 643 | Median :1584 | Median : 6.000 | Median :16.00 | Median :3.000 | Median :13.00 | Median : 0.09405 | Median :-0.119722 | Median : 0.30771 | Median :-0.0697829 | Median :-0.01547 | Median :-0.13314 | Median :-0.18707 | Median : 0.018138 | Median :-0.49947 | Median : 33.00 | Median :2017 | Median :41.51 | Median :36.589 | Median : 1.882925 | BARCELONA :1711 | Median :2.101 | urban :11145 | NA | |
| Mean :4.046 | Mean : 70.24 | Barcelona :1360 | Mean : 70.56 | Mean :35.88 | Mean :8374 | Mean :41.50 | Mean :2.110 | 2016-03-14 13:32:00: 12 | Mean : 55.01 | gastroenteritis: 444 | Mean :1687 | Mean : 6.266 | Mean :15.35 | Mean :3.011 | Mean :13.54 | Mean : 0.08927 | Mean : 0.003351 | Mean : 0.03999 | Mean :-0.0000326 | Mean :-0.01008 | Mean :-0.03385 | Mean : 0.01476 | Mean : 0.005824 | Mean :-0.01001 | Mean : 76.14 | Mean :2017 | Mean :41.49 | Mean :34.399 | Mean : 2.579297 | SANT FELIU DE LLOBREGAT:1662 | Mean :2.111 | NA | NA | |
| 3rd Qu.:4.000 | 3rd Qu.: 90.00 | Sant Feliu de Llobregat: 644 | 3rd Qu.:114.00 | 3rd Qu.:56.00 | 3rd Qu.:8520 | 3rd Qu.:41.57 | 3rd Qu.:2.147 | 2017-11-13 10:10:00: 12 | 3rd Qu.: 68.00 | flu : 378 | 3rd Qu.:2463 | 3rd Qu.:10.000 | 3rd Qu.:23.00 | 3rd Qu.:5.000 | 3rd Qu.:18.00 | 3rd Qu.: 0.86232 | 3rd Qu.: 0.932678 | 3rd Qu.: 0.78773 | 3rd Qu.: 0.9803229 | 3rd Qu.: 0.95486 | 3rd Qu.: 0.78847 | 3rd Qu.: 0.37533 | 3rd Qu.: 0.732254 | 3rd Qu.: 0.45821 | 3rd Qu.: 92.00 | 3rd Qu.:2018 | 3rd Qu.:41.56 | 3rd Qu.:42.260 | 3rd Qu.: 3.064469 | SANT CUGAT DEL VALLÈS :1658 | 3rd Qu.:2.153 | NA | NA | |
| Max. :5.000 | Max. :150.00 | Cerdanyola : 521 | Max. :188.00 | Max. :99.00 | Max. :8980 | Max. :41.98 | Max. :2.696 | 2017-11-25 11:16:00: 12 | Max. :102.00 | vomiting : 371 | Max. :2695 | Max. :12.000 | Max. :31.00 | Max. :6.000 | Max. :23.00 | Max. : 2.51408 | Max. : 1.210445 | Max. : 1.71578 | Max. : 1.5053757 | Max. : 1.44003 | Max. : 1.71008 | Max. : 5.75296 | Max. : 5.117218 | Max. : 1.93651 | Max. :221.00 | Max. :2018 | Max. :42.00 | Max. :52.057 | Max. :17.628569 | GRANOLLERS :1471 | Max. :2.817 | NA | NA | |
| NA | NA | (Other) :8021 | NA | NA | NA | NA | NA | (Other) :15666 | NA | (Other) :3984 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | (Other) :5069 | NA | NA | NA |
#summary(df)
Most of the data exploration has been done before at the Python notebook and mentioned in part 1 of this paper. Here I will perform specific data exploration and verfiications reelvant to the modeling dne here: checking for variance among relevant variables, outliers, and class balance.
ggplot(data = df, aes(x = age)) +
geom_histogram(binwidth = 5, alpha=0.6) +
scale_x_continuous(breaks = seq(0,100,by=5)) +
labs(title = "Ages of Patients", y = "Count", x = "Age of Patients") +
theme_minimal() +
theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))
ggplot(data = df, aes(x = age, fill=bronch, color=bronch)) +
#geom_histogram(binwidth = 5, alpha=0.5) +
geom_density(alpha=0.5) +
scale_x_continuous(breaks = seq(0,100,by=5)) +
labs(title = "Ages of Patients", y = "Count", x = "Age of Patients") +
theme_minimal() +
theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))
The distribution of age per class (between Bronchitis patients and not) is sufficiently similar. Naturally, there are more babies and older people who have this condition, while more young people are without this condition (and most likely healthier). The distribution of age in both classes is bi-modal rather than normal. This makes sense, since this is a This may pose some problems in the modeling because most models and procedures work ideally with a normally distributed dataset.
## two-way contingency table of categorical outcome and predictors we want
## to make sure there are not 0 cells
xtabs(~bronch + city_id, data = df)
## city_id
## bronch 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18
## 0 7 84 29 98 28 263 1 103 121 48 21 1 100 197 3 123 112
## 1 6 65 20 94 31 258 5 109 131 58 11 0 81 188 4 105 85
## city_id
## bronch 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 0 80 35 52 153 47 108 27 357 89 7 220 6 50 40 3 142 3
## 1 52 29 55 115 81 144 29 275 94 16 156 0 68 19 0 154 2
## city_id
## bronch 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 53
## 0 85 27 7 169 84 72 190 215 53 76 49 12 29 42 139 3 34
## 1 91 42 12 177 86 55 207 211 75 114 60 1 6 73 116 6 50
## city_id
## bronch 54 55 56 57 58 60 62 63 64 65 66 67 68 69 70 71 72
## 0 0 23 14 64 4 1 1 88 107 17 6 35 10 45 23 25 10
## 1 5 13 6 102 0 0 2 75 123 0 0 52 20 45 37 17 8
## city_id
## bronch 73 74 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 0 19 20 62 27 30 5 17 30 9 3 19 4 18 8 5 20 56
## 1 8 16 86 22 35 0 25 45 20 0 55 7 40 6 39 8 48
## city_id
## bronch 92 93 94 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## 0 6 14 69 10 11 16 241 7 27 31 164 48 7 22 96 19 10
## 1 12 21 42 1 10 29 210 3 13 30 166 25 13 33 87 38 6
## city_id
## bronch 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 0 32 58 2 7 328 196 194 22 5 6 18 49 48 48 47 5 18
## 1 37 35 4 20 316 164 222 15 7 0 17 16 67 44 36 17 10
## city_id
## bronch 128 129 131 134 135 136 142 144 145 146 147 149 152 153 155 159 160
## 0 3 2 232 37 2 1 2 71 1 12 1 4 5 7 82 108 90
## 1 8 4 217 26 0 0 4 157 0 4 7 0 0 0 131 80 61
## city_id
## bronch 162 166 167 172 174 175 181 182 183 184 185 186 187 188
## 0 5 6 119 88 24 2 60 24 5 58 18 19 11 23
## 1 0 4 128 109 36 12 43 33 10 42 35 20 24 24
Since I will be later including factor(city_id), we need to make sure these are somewhat balanced. There are some cities with 0 patient of one type (normally). To prevent complete overfitting and biased predictions for these cities, I will remove them from my dataset. The zero-patient stratas are only missing bronchitis patients. Hence, I will find all cities with missing bronchitis patients and remove them.
setDT(df) # setting as data.table for more efficient data processing
cities_with_nonbronch <- unique(df[bronch==0, city_id])
cities_with_bronch <- unique(df[bronch==1, city_id])
imblanaced_cities <- setdiff(cities_with_nonbronch, cities_with_bronch)
size_beofre_balancing <- nrow(df)
# balance the dataframe by removing imblanaced cities:
df <- df[! city_id %in% imblanaced_cities,]
size_after_balancing <- nrow(df)
print(paste("Balancing removed", size_beofre_balancing - size_after_balancing, "rows, now remaining with ", nrow(df)))
## [1] "Balancing removed 77 rows, now remaining with 15667"
ggplot(data = df, aes(x = pollution, fill=factor(bronch), color=factor(bronch))) +
geom_density(alpha=0.5) +
scale_x_continuous(breaks = seq(0,100,by=5)) +
labs(title = "Pollution Levels for Calls with bronchitis or Withouot", y = "Density", x = "Pollution") +
theme_minimal() +
theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold", hjust = 0.5))
Surprisingly, there seems to be no difference between the pollution levels for bronchitis patients versus not! This might mean that our efforts to establish any meaningful effect by pollution would be futile. I will make a few simple regression models, but will then transition to modeling the chance for a bronchitis call with predictive purposes instead, taking advantage of other variables.
In this section, I will use logistic regression to model the chance of a call to be about bronchitis by the effect of the pollution level (at the closest measuring station). This will not establish causality perfectly because there are still many omitted variables and data limitations such as class imbalances, too few pollution station measurements (in both time and space), missing cities, potential for negative reverse causality (where patients with bronchitis may move away from a polluted area) and there is no good way to emulate a randomized experiment. This analysis could at least suggest if pollution seems to have any effect or not, although the results should be taken with a grain of salt. The null hypothesis would be that pollution would have no effect on the chance for a call to be about bronchitis. The alternative hypothesis would be that pollution would have some non-zero effect on the chance for a call to be about bronchitis.
Since it is hard to establish causality, I will try at the end of this analysis to find a sufficiently accurate model for predictive purposes. By trying various models and optimizing for their performance, we might at least find a sufficiently accurate predictive model, even if it is less sparse and theoretically founded.
I will develop models from the simplest possible model to more complex ones, slowly adding more features or making the form more flexible. I use logistic regression since the outcome variable is binary, and I’m looking predict probabilities bounded between [0,1]. If I would use a normal regression, the probabilities predicted could be outside of the boundaries of [0,1], but a logistic regression compresses the probabilities to be within that boundary with a sigmoid like function.
Since I am going to test the performance of various models and be able to use it for predictive purposes, it is important to make sure I am not overfitting the model to the training data. I will do this by training and checking the models on the training data while keeping 20% of the data aside as a test dataset.
set.seed(1)
# split dataframe into training (80%) and test (20%)
split_idx <- sample(c(TRUE, FALSE), nrow(df), replace = T, prob = c(0.8,0.2))
train <- df[split_idx, ]
test <- df[!split_idx, ]
df$bronch <- as.numeric(df$bronch)
# Visualize
ggplot(aes(x=pollution, y=bronch), data=df) +
geom_point(alpha = .15) +
geom_smooth(method = "glm") + #, method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Pollution") +
ylab("Probability of bronchitis")
Unfortunately, by this plotting of values of pollution per class of bronchitis or not, there is almost no difference between the distribution of pollution levels. A linear model is slightly skewed up, but not sufficiently to be able to predict bronchitis or not.
(*Note: the logistic nonlinear model could not be visualized on the plot above for technical reasons, so I instead used a linear model for visualization purposes.)
The syntax of the glm function is similar to that of lm, except that we must pass the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
model1_pol <- glm(bronch ~ pollution , family = "binomial", data = train)
summary(model1_pol)
##
## Call:
## glm(formula = bronch ~ pollution, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.243 -1.187 1.113 1.168 1.292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.290544 0.070804 -4.104 4.07e-05 ***
## pollution 0.008534 0.001991 4.287 1.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17250 on 12442 degrees of freedom
## Residual deviance: 17231 on 12441 degrees of freedom
## AIC: 17235
##
## Number of Fisher Scoring iterations: 3
A simple logistic regression on pollution shows that while the estimate on pollution is miniscule, wehn it is alone it is statistically significant. As it is a logistic regression, there is no simple measure of accuracy such as an R-Squared to report.
First, before evaluating performance on the test set, I will examine wether the model have succeeded sufficiently explaining the variance within-sample (on the same training sdata it was trained on). If it has, then I will continue to test the model on the test dataset.
# building a function to test performance within-sample
test_predict_within <- function(model) {
ytrain_fac <- factor(train$bronch)
ytrain_pred_fac <- as.vector(round(fitted(model)))
ytrain_pred_fac <- factor(ytrain_pred_fac, levels=c("0","1"))
print(confusionMatrix(ytrain_fac, ytrain_pred_fac))
}
test_predict_within(model1_pol)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2355 3857
## 1 2197 4034
##
## Accuracy : 0.5135
## 95% CI : (0.5046, 0.5223)
## No Information Rate : 0.6342
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0265
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5174
## Specificity : 0.5112
## Pos Pred Value : 0.3791
## Neg Pred Value : 0.6474
## Prevalence : 0.3658
## Detection Rate : 0.1893
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5143
##
## 'Positive' Class : 0
##
Even when testing model1_pol (only pollution) within-sample (on the same data it was trained on), the model performs poorly: accuracy is only %51.35, when the baseline by random guessing should be %50 (since the dataset is evenly distributed). Hence, pollution might have helped in explaining an additional 1.35% of the data over the 50% random guess expectation, which is very poor. It was not biased much towards one class. Pollution therefore does not seem to have a meaningful effect on the chance for a call to be bronchitis or not.
The data only had 65 air measurement stations in teh region, so that distances from the nearest pollution measurement system varied greatly. Hence, visits with higher distances from the matched nearest station are less accurate. Therefore, I will try including the distance either alone or with an interaction term.
model1_pol_dis <- glm(bronch ~ pollution + distance , family = "binomial", data = train)
model1_pol_dis_int <- glm(bronch ~ pollution + distance + I(distance*pollution) , family = "binomial", data = train)
stargazer(model1_pol_dis, model1_pol_dis_int, type = "text")
##
## ====================================================
## Dependent variable:
## ----------------------------
## bronch
## (1) (2)
## ----------------------------------------------------
## pollution 0.008*** 0.007**
## (0.002) (0.003)
##
## distance 0.019** -0.002
## (0.008) (0.034)
##
## I(distance * pollution) 0.001
## (0.001)
##
## Constant -0.326*** -0.279***
## (0.072) (0.102)
##
## ----------------------------------------------------
## Observations 12,443 12,443
## Log Likelihood -8,612.544 -8,612.332
## Akaike Inf. Crit. 17,231.090 17,232.660
## ====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Distance alone showed to have a statistically meaningful coefficient, but not the interaction term. However, it does not make theoretic sense: distance from an air measurement station seemed to increase the likelihood for bronchitis, which is what we would expect if pollution was greater. That could be the case if stations are sparser in areas like cities where pollution is greater, but it’s not the case. Therefore, this is likely a spurious correlation and a confounding variable to add in this form for now.
Here I will try to add in personal variables: age, and later on - number of visits. Age is likely to have a nonlinear effect, since I’ve seen that babies and older people tend to get Bronchitis more. Therefore, I will include second order and test third order of age for prediction. I will also test an interaction term between pollution and age, since it might be that pollution affects more the delicate lungs of babies and older people. Since the interaction term might have nonlinear effects, I will also test adding higher orders of the interaction term.
Unfortunately, the standard F-test does not extend from Linear models to nonlinear logistic regression, therefore I will not be able to test these models using the F-test (Principles of Econometrics, 4th Edition; and uPenn STAT 501 Course). Hence, I will use other methods: I will asses the significance of added variables by the p-values of their coefficients, and compare the performance of the models against the Null hypothesis model without them (Model 1 in this case).
m2_age2 <- glm(bronch ~ pollution + age + I(age^2), family = "binomial", data = train)
m2_age3 <- glm(bronch ~ pollution + age + I(age^2) + I(age^3), family = "binomial", data = train)
m2_ageint2 <- glm(bronch ~ pollution + age + I(age^2) + I(pollution*age) + I((pollution*age)^2), family = "binomial", data = train)
stargazer(model1_pol, m2_age2, m2_age3, m2_ageint2, type="text")
##
## ==================================================================
## Dependent variable:
## --------------------------------------------
## bronch
## (1) (2) (3) (4)
## ------------------------------------------------------------------
## pollution 0.009*** 0.007*** 0.007*** 0.008**
## (0.002) (0.002) (0.002) (0.004)
##
## age -0.027*** -0.108*** -0.029***
## (0.002) (0.006) (0.007)
##
## I(age2) 0.0004*** 0.003*** 0.0004***
## (0.00003) (0.0002) (0.0001)
##
## I(age3) -0.00002***
## (0.00000)
##
## I(pollution * age) 0.0001
## (0.0002)
##
## I((pollution * age)2) -0.00000
## (0.00000)
##
## Constant -0.291*** -0.134* 0.274*** -0.162
## (0.071) (0.078) (0.083) (0.137)
##
## ------------------------------------------------------------------
## Observations 12,443 12,443 12,443 12,443
## Log Likelihood -8,615.608 -8,447.739 -8,334.024 -8,446.930
## Akaike Inf. Crit. 17,235.220 16,903.480 16,678.050 16,905.860
## ==================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The personal charachteristics added model shows that adding age to the equation was determined as statistically significant for the model, including the second and third order degrees, since they all are highly statistically significant, with a p-value of less than 1%. The Age^3 model performed better in practice, achieving lower AIC (16,667 against 16,892 or 16,895 of the others). Notice that the coefficient estimate on Pollution has decreased when adding age, and more so when adding age^3. Thus, we know that the estimate of Pollution was not “true” or stable, and that age was able to explain some of the same variance. That said, all of the absolute magnitudes of the coefficients are extremely small, suggesting that the model still may not be flexible enough to distinguish well between the classes of bronchitis or not.
test_predict_within(m2_age3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3676 2536
## 1 2470 3761
##
## Accuracy : 0.5977
## 95% CI : (0.589, 0.6063)
## No Information Rate : 0.5061
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.1954
## Mcnemar's Test P-Value : 0.3583
##
## Sensitivity : 0.5981
## Specificity : 0.5973
## Pos Pred Value : 0.5918
## Neg Pred Value : 0.6036
## Prevalence : 0.4939
## Detection Rate : 0.2954
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5977
##
## 'Positive' Class : 0
##
Adding Age significantly improved our performance - from 51.3% with pollution only, to 59.5% accuracy with age. This is still not an objectively good perfomance, so I will continue to add features in hope to increase the perforamnce and explain more of the variance.
After determining which is the best form to add in age with its nonlinearities, I can add the other personal variable - number of visits. I will test it against our recent better perfoming model, with the third higher order of Age.
I also test the removal of pollution from the Age^3 model.
m2_age3_nopol_novis <- glm(bronch ~ age + I(age^2) + I(age^3), family = "binomial", data = train)
m2_age3_visits_nopol <- glm(bronch ~ age + visits + I(age^2) + I(age^3), family = "binomial", data = train)
m2_age_visits_pol <- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3), family = "binomial", data = train)
stargazer(m2_age3_nopol_novis, m2_age3_visits_nopol, m2_age3, m2_age_visits_pol, type="text")
##
## =================================================================
## Dependent variable:
## -----------------------------------------------
## bronch
## (1) (2) (3) (4)
## -----------------------------------------------------------------
## pollution 0.007*** 0.007***
## (0.002) (0.002)
##
## age -0.108*** -0.108*** -0.108*** -0.108***
## (0.006) (0.006) (0.006) (0.006)
##
## visits -0.0001*** -0.0001***
## (0.00002) (0.00002)
##
## I(age2) 0.003*** 0.003*** 0.003*** 0.003***
## (0.0002) (0.0002) (0.0002) (0.0002)
##
## I(age3) -0.00002*** -0.00002*** -0.00002*** -0.00002***
## (0.00000) (0.00000) (0.00000) (0.00000)
##
## Constant 0.516*** 0.643*** 0.274*** 0.398***
## (0.046) (0.059) (0.083) (0.090)
##
## -----------------------------------------------------------------
## Observations 12,443 12,443 12,443 12,443
## Log Likelihood -8,340.059 -8,334.035 -8,334.024 -8,327.561
## Akaike Inf. Crit. 16,688.120 16,678.070 16,678.050 16,667.120
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Adding visits to the pollution included model was statistically significant (with a p-value of 0.00002), but a very small coefficient, so it is not highly economically significant. Let’s check its performance in prediction, still within-sample. Adding visits had the same significant coefficient wether pollution was included or not. Adding pollution resulted in the same significant coefficient wether visits was included or not. All these models resulted in similar AICs, but both pollution and visits yielded the same marginal improvement in AIC and log likelihood, such that the better performning model of those was the one including both visits and pollution. They seem to expalin differrent parts of the variance.
test_predict_within(m2_age_visits_pol)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3666 2546
## 1 2488 3743
##
## Accuracy : 0.5954
## 95% CI : (0.5867, 0.6041)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.1909
## Mcnemar's Test P-Value : 0.4218
##
## Sensitivity : 0.5957
## Specificity : 0.5952
## Pos Pred Value : 0.5901
## Neg Pred Value : 0.6007
## Prevalence : 0.4946
## Detection Rate : 0.2946
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5954
##
## 'Positive' Class : 0
##
The Visits variable didn’t seem to increase accuracy meaningfully, only by 0.2%. In this order of magnitude, it might be overfitting or due to random chance, and the gain is not large enough to justify that, so I am not going to include that in my model.
Here I will try to predict the chance for a bronchitis call using only Geographical Variables, and later add them onto the pollution and or age variables examined before. The data has multiple different geographical variables: Area (“urban”/“suburban”/“rural”), Latitude and Longitude, or city/area codes. I will try using each of them and compare the perforamnce of the model. According to LeSage “Spatial Statistics” book, while not transofrming my methods completely to spatial regressions, I transformed my Lat and Lon variables to be flexible enough to capture the needed variance by:
1. noramlizing and standartizing these variables around 0 and so that 1 is equal 1 standard deviation of the original value.
2. Using second order degrees 3. Using interaction terms
m3_geo_area <- glm(bronch ~ factor(station_area), family = "binomial", data = train)
m3_geo_latlon<- glm(bronch ~ x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon), family = "binomial", data = train)
m3_geo_cityid<- glm(bronch ~ factor(city), family = "binomial", data = train)
stargazer(m3_geo_area, m3_geo_latlon, m3_geo_cityid, type="text")
##
## ========================================================================
## Dependent variable:
## --------------------------------
## bronch
## (1) (2) (3)
## ------------------------------------------------------------------------
## factor(station_area)suburban 0.241
## (0.293)
##
## factor(station_area)urban 0.261
## (0.292)
##
## x_lat 0.051**
## (0.021)
##
## x_lon 0.087***
## (0.026)
##
## I(x_lat2) 0.055***
## (0.020)
##
## I(x_lon2) 0.028**
## (0.013)
##
## I(x_lat * x_lon) -0.089***
## (0.030)
##
## factor(city)Aiguafreda -0.154
## (0.988)
##
## factor(city)Badia del Vall�s 0.028
## (0.822)
##
## factor(city)Barbera -0.493
## (0.585)
##
## factor(city)Barcelona 0.006
## (0.560)
##
## factor(city)Begas 0.593
## (0.688)
##
## factor(city)Bellaterra -0.714
## (0.664)
##
## factor(city)Bigues i Riells -0.606
## (0.737)
##
## factor(city)Caldes de Montbui 0.108
## (0.607)
##
## factor(city)Campins 12.412
## (145.231)
##
## factor(city)Canovelles -0.614
## (0.667)
##
## factor(city)Cardedeu 0.155
## (0.583)
##
## factor(city)Castellar del Vall�s -0.182
## (0.581)
##
## factor(city)Castellbisbal -0.026
## (0.628)
##
## factor(city)Castelldefels -0.635
## (0.598)
##
## factor(city)Cerdanyola -0.098
## (0.565)
##
## factor(city)Cervell� -2.100*
## (1.205)
##
## factor(city)Collbat� -0.154
## (0.729)
##
## factor(city)Corbera de Llobregat 0.907
## (0.678)
##
## factor(city)Cornell� de Llobregat -0.292
## (0.566)
##
## factor(city)C�noves i Samal�s -0.943
## (0.775)
##
## factor(city)El Papiol -1.764
## (1.229)
##
## factor(city)El Prat -0.847
## (0.678)
##
## factor(city)Esparreguera -0.110
## (0.631)
##
## factor(city)Esplugas de Llobregat -0.169
## (0.570)
##
## factor(city)Gallifa 12.412
## (145.231)
##
## factor(city)Gav� -0.752
## (0.616)
##
## factor(city)Granera 12.412
## (324.744)
##
## factor(city)Granollers -0.024
## (0.567)
##
## factor(city)Hospitalet de Llobregat -0.059
## (0.566)
##
## factor(city)l'Ametlla del Vall�s 0.224
## (0.596)
##
## factor(city)La Garriga 0.238
## (0.614)
##
## factor(city)La Llagosta 0.421
## (0.695)
##
## factor(city)La Palma de Cervell� 0.539
## (0.781)
##
## factor(city)La Roca del Vall�s -0.247
## (0.609)
##
## factor(city)Les Franqueses del Vall�s -0.036
## (0.597)
##
## factor(city)Llinars del Vall�s -0.228
## (0.677)
##
## factor(city)Lli�� d'Amunt -0.272
## (0.607)
##
## factor(city)Lli�� de Vall 0.251
## (0.627)
##
## factor(city)Martorell 0.120
## (0.634)
##
## factor(city)Martorelles -0.154
## (0.802)
##
## factor(city)Matadepera -0.061
## (0.577)
##
## factor(city)Molins de Rei -0.329
## (0.580)
##
## factor(city)Mollet del Vall�s -0.294
## (0.573)
##
## factor(city)Moncada y Reixach -0.021
## (0.575)
##
## factor(city)Montmel� -0.908
## (0.702)
##
## factor(city)Montorn�s del Vall�s -0.442
## (0.675)
##
## factor(city)Olesa de Montserrat 0.323
## (0.632)
##
## factor(city)Palau-solit� i Plegamans -0.021
## (0.603)
##
## factor(city)Pallej� -0.742
## (0.788)
##
## factor(city)Pallej� (Fontpineda) 0.539
## (1.029)
##
## factor(city)Parets del Vall�s 0.159
## (0.587)
##
## factor(city)Poliny� -0.901
## (0.688)
##
## factor(city)Ripollet -0.426
## (0.581)
##
## factor(city)Rubi -0.229
## (0.567)
##
## factor(city)Sabadell -0.219
## (0.560)
##
## factor(city)Sant Andreu de la Barca 0.035
## (0.621)
##
## factor(city)Sant Antoni de Vilamajor -0.260
## (0.644)
##
## factor(city)Sant Boi de Llobregat -0.714
## (0.605)
##
## factor(city)Sant Celoni -0.049
## (0.616)
##
## factor(city)Sant Climent de Llobregat -0.154
## (1.144)
##
## factor(city)Sant Cugat -0.348
## (0.559)
##
## factor(city)Sant Esteve de Sesrovires 0.608
## (0.720)
##
## factor(city)Sant Feliu de Codines 0.277
## (0.661)
##
## factor(city)Sant Feliu de Llobregat -0.194
## (0.563)
##
## factor(city)Sant Fost de Campsentelles 0.266
## (0.618)
##
## factor(city)Sant Joan Despi -0.395
## (0.569)
##
## factor(city)Sant Just Desvern 0.063
## (0.567)
##
## factor(city)Sant Pere de Vilamajor 0.619
## (0.744)
##
## factor(city)Sant Quirze 0.047
## (0.571)
##
## factor(city)Sant Vicen� dels Horts -0.279
## (0.660)
##
## factor(city)Santa Coloma de Cervell� 0.134
## (0.945)
##
## factor(city)Santa Eul�lia de Ron�ana 0.834
## (0.629)
##
## factor(city)Santa Maria de Martorelles 0.069
## (0.872)
##
## factor(city)Santa Maria de Palautordera 0.474
## (0.637)
##
## factor(city)Santa Perpetua de Mogoda -0.057
## (0.585)
##
## factor(city)Sentmenat 0.348
## (0.616)
##
## factor(city)Tagamanent -0.308
## (0.787)
##
## factor(city)Terrassa -0.105
## (0.559)
##
## factor(city)Ullastrell -2.457**
## (1.187)
##
## factor(city)Vacarises -1.764**
## (0.781)
##
## factor(city)Valldoreix -0.430
## (0.624)
##
## factor(city)Vallgorguina 1.670**
## (0.736)
##
## factor(city)Vallirana -0.219
## (0.662)
##
## factor(city)Vallromanes -1.099
## (0.713)
##
## factor(city)Viladecans -1.184*
## (0.632)
##
## factor(city)Viladecavalls 0.418
## (0.595)
##
## factor(city)Vilanova del Vall�s -0.486
## (0.601)
##
## Constant -0.251 -0.045* 0.154
## (0.291) (0.026) (0.556)
##
## ------------------------------------------------------------------------
## Observations 12,443 12,443 12,443
## Log Likelihood -8,624.294 -8,605.205 -8,502.446
## Akaike Inf. Crit. 17,254.590 17,222.410 17,180.890
## ========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
These geographical variables do not seem to perform better alone than the age variable, and seem to perform similarly to the pollution only data, while they are more detailed. The charachterization of area as urban/suburban/rural produced non-statistically-significant coefficients. However, it performed not much worse than the others which did. The latitdue and longitude model suggested significant coefficient for each of the variants, and a better performance. The factor(cities) model, where each city is regarded as a seperate indicator (“dummy”) variable, showed that most of the cities were not statistically significant than the baseline, but only a few were. This model gives us better performance in AIC and better interpretability of results - with this information we now know which specific cities are less or more likely to have bronchitis cases. Let’s examine the performance of those models in prediction.
print("Area Model:")
## [1] "Area Model:"
test_predict_within(m3_geo_area)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1822 4390
## 1 1797 4434
##
## Accuracy : 0.5028
## 95% CI : (0.4939, 0.5116)
## No Information Rate : 0.7092
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0049
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5035
## Specificity : 0.5025
## Pos Pred Value : 0.2933
## Neg Pred Value : 0.7116
## Prevalence : 0.2908
## Detection Rate : 0.1464
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5030
##
## 'Positive' Class : 0
##
print("Lat/Lon Model:")
## [1] "Lat/Lon Model:"
test_predict_within(m3_geo_latlon)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3154 3058
## 1 2855 3376
##
## Accuracy : 0.5248
## 95% CI : (0.516, 0.5336)
## No Information Rate : 0.5171
## P-Value [Acc > NIR] : 0.043312
##
## Kappa : 0.0495
## Mcnemar's Test P-Value : 0.008616
##
## Sensitivity : 0.5249
## Specificity : 0.5247
## Pos Pred Value : 0.5077
## Neg Pred Value : 0.5418
## Prevalence : 0.4829
## Detection Rate : 0.2535
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5248
##
## 'Positive' Class : 0
##
print("Cities Model:")
## [1] "Cities Model:"
test_predict_within(m3_geo_cityid)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3341 2871
## 1 2777 3454
##
## Accuracy : 0.5461
## 95% CI : (0.5373, 0.5549)
## No Information Rate : 0.5083
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.0922
## Mcnemar's Test P-Value : 0.2159
##
## Sensitivity : 0.5461
## Specificity : 0.5461
## Pos Pred Value : 0.5378
## Neg Pred Value : 0.5543
## Prevalence : 0.4917
## Detection Rate : 0.2685
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5461
##
## 'Positive' Class : 0
##
All of them seem to have only little predicitive power. The area model does not have any, the accuracy is around 50%. The lat/lon model seems to have a slight improvement of 2.5%, and the cities model has an improvement of 4%. However, the cities model is highly restrictive because of using ~150 dummy variables, dividing our data to too many small stratas (as we saw at first). This might pose an issue later if we combine the cities indicator variables together with many other variables so we should remember that.
Let’s see the performance of models with both personal and geographical variables tested previously.
m4_person_geolatlon<- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3) + x_lat + x_lon + I(x_lat^2) + I(x_lon^2) + I(x_lat*x_lon), family = "binomial", data = train)
m4_person_geocityid<- glm(bronch ~ pollution + visits + age + I(age^2) + I(age^3) + factor(city), family = "binomial", data = train)
stargazer(m2_age_visits_pol, m4_person_geolatlon, m4_person_geocityid, type="text")
##
## ===========================================================================
## Dependent variable:
## -----------------------------------
## bronch
## (1) (2) (3)
## ---------------------------------------------------------------------------
## pollution 0.007*** 0.003 -0.0003
## (0.002) (0.003) (0.005)
##
## visits -0.0001*** -0.0001*** -0.0001***
## (0.00002) (0.00002) (0.00002)
##
## age -0.108*** -0.108*** -0.109***
## (0.006) (0.006) (0.006)
##
## I(age2) 0.003*** 0.003*** 0.003***
## (0.0002) (0.0002) (0.0002)
##
## I(age3) -0.00002*** -0.00002*** -0.00002***
## (0.00000) (0.00000) (0.00000)
##
## x_lat 0.058**
## (0.024)
##
## x_lon 0.055*
## (0.030)
##
## I(x_lat2) 0.054***
## (0.021)
##
## I(x_lon2) 0.014
## (0.014)
##
## I(x_lat * x_lon) -0.049
## (0.036)
##
## factor(city)Aiguafreda -0.551
## (1.025)
##
## factor(city)Badia del Vall�s -0.599
## (0.840)
##
## factor(city)Barbera -0.639
## (0.600)
##
## factor(city)Barcelona -0.242
## (0.576)
##
## factor(city)Begas 0.368
## (0.711)
##
## factor(city)Bellaterra -0.907
## (0.680)
##
## factor(city)Bigues i Riells -0.796
## (0.755)
##
## factor(city)Caldes de Montbui -0.286
## (0.623)
##
## factor(city)Campins 12.605
## (145.231)
##
## factor(city)Canovelles -0.650
## (0.682)
##
## factor(city)Cardedeu 0.052
## (0.597)
##
## factor(city)Castellar del Vall�s -0.135
## (0.596)
##
## factor(city)Castellbisbal -0.161
## (0.644)
##
## factor(city)Castelldefels -0.812
## (0.623)
##
## factor(city)Cerdanyola -0.312
## (0.579)
##
## factor(city)Cervell� -2.684**
## (1.219)
##
## factor(city)Collbat� -0.406
## (0.744)
##
## factor(city)Corbera de Llobregat 0.646
## (0.694)
##
## factor(city)Cornell� de Llobregat -0.418
## (0.582)
##
## factor(city)C�noves i Samal�s -0.950
## (0.810)
##
## factor(city)El Papiol -1.911
## (1.245)
##
## factor(city)El Prat -1.080
## (0.695)
##
## factor(city)Esparreguera -0.363
## (0.646)
##
## factor(city)Esplugas de Llobregat -0.377
## (0.585)
##
## factor(city)Gallifa 11.939
## (145.231)
##
## factor(city)Gav� -0.826
## (0.639)
##
## factor(city)Granera 11.670
## (324.744)
##
## factor(city)Granollers -0.189
## (0.582)
##
## factor(city)Hospitalet de Llobregat -0.152
## (0.581)
##
## factor(city)l'Ametlla del Vall�s 0.148
## (0.611)
##
## factor(city)La Garriga 0.157
## (0.629)
##
## factor(city)La Llagosta 0.356
## (0.709)
##
## factor(city)La Palma de Cervell� 0.032
## (0.810)
##
## factor(city)La Roca del Vall�s -0.406
## (0.624)
##
## factor(city)Les Franqueses del Vall�s -0.071
## (0.612)
##
## factor(city)Llinars del Vall�s -0.197
## (0.690)
##
## factor(city)Lli�� d'Amunt -0.283
## (0.622)
##
## factor(city)Lli�� de Vall 0.021
## (0.642)
##
## factor(city)Martorell -0.097
## (0.649)
##
## factor(city)Martorelles -0.330
## (0.819)
##
## factor(city)Matadepera -0.247
## (0.592)
##
## factor(city)Molins de Rei -0.400
## (0.597)
##
## factor(city)Mollet del Vall�s -0.500
## (0.589)
##
## factor(city)Moncada y Reixach -0.193
## (0.589)
##
## factor(city)Montmel� -1.197*
## (0.720)
##
## factor(city)Montorn�s del Vall�s -0.448
## (0.690)
##
## factor(city)Olesa de Montserrat 0.074
## (0.648)
##
## factor(city)Palau-solit� i Plegamans -0.278
## (0.619)
##
## factor(city)Pallej� -1.181
## (0.808)
##
## factor(city)Pallej� (Fontpineda) -0.133
## (1.040)
##
## factor(city)Parets del Vall�s 0.033
## (0.603)
##
## factor(city)Poliny� -1.045
## (0.704)
##
## factor(city)Ripollet -0.649
## (0.595)
##
## factor(city)Rubi -0.407
## (0.582)
##
## factor(city)Sabadell -0.368
## (0.574)
##
## factor(city)Sant Andreu de la Barca -0.144
## (0.637)
##
## factor(city)Sant Antoni de Vilamajor -0.409
## (0.659)
##
## factor(city)Sant Boi de Llobregat -0.902
## (0.621)
##
## factor(city)Sant Celoni -0.388
## (0.633)
##
## factor(city)Sant Climent de Llobregat -0.244
## (1.163)
##
## factor(city)Sant Cugat -0.461
## (0.574)
##
## factor(city)Sant Esteve de Sesrovires 0.656
## (0.734)
##
## factor(city)Sant Feliu de Codines 0.195
## (0.677)
##
## factor(city)Sant Feliu de Llobregat -0.337
## (0.584)
##
## factor(city)Sant Fost de Campsentelles 0.070
## (0.635)
##
## factor(city)Sant Joan Despi -0.487
## (0.589)
##
## factor(city)Sant Just Desvern -0.164
## (0.587)
##
## factor(city)Sant Pere de Vilamajor 0.123
## (0.760)
##
## factor(city)Sant Quirze -0.111
## (0.586)
##
## factor(city)Sant Vicen� dels Horts -0.686
## (0.674)
##
## factor(city)Santa Coloma de Cervell� -0.067
## (0.971)
##
## factor(city)Santa Eul�lia de Ron�ana 0.814
## (0.643)
##
## factor(city)Santa Maria de Martorelles 0.363
## (0.885)
##
## factor(city)Santa Maria de Palautordera 0.423
## (0.653)
##
## factor(city)Santa Perpetua de Mogoda -0.282
## (0.599)
##
## factor(city)Sentmenat 0.217
## (0.632)
##
## factor(city)Tagamanent -0.637
## (0.820)
##
## factor(city)Terrassa -0.270
## (0.574)
##
## factor(city)Ullastrell -2.466**
## (1.196)
##
## factor(city)Vacarises -1.782**
## (0.796)
##
## factor(city)Valldoreix -0.604
## (0.641)
##
## factor(city)Vallgorguina 1.312*
## (0.749)
##
## factor(city)Vallirana -0.374
## (0.681)
##
## factor(city)Vallromanes -1.674**
## (0.728)
##
## factor(city)Viladecans -1.476**
## (0.655)
##
## factor(city)Viladecavalls 0.270
## (0.611)
##
## factor(city)Vilanova del Vall�s -0.616
## (0.616)
##
## Constant 0.398*** 0.468*** 0.961
## (0.090) (0.116) (0.601)
##
## ---------------------------------------------------------------------------
## Observations 12,443 12,443 12,443
## Log Likelihood -8,327.561 -8,314.407 -8,217.180
## Akaike Inf. Crit. 16,667.120 16,650.810 16,620.360
## ===========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As expected from before - the addition of the latitude and longitude variables improved the AIC over the only personal charachteristics model with pollution but only slightly. More importantly, it reduced the coefficient (estimated impact) of pollution because of the correlation between the geographical variables and pollution, and the cities model made the coefficient of pollution be negative (suggesting that pollution reduced the likelihood of bronchitis). This does not seem reasonable, unless operating through a third variable (for example, if pollution is correalted with lower income otherwise than with cities).
Let’s test the predictive ability
print("Personal + Lat/Lon Model:")
## [1] "Personal + Lat/Lon Model:"
test_predict_within(m4_person_geolatlon)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3705 2507
## 1 2480 3751
##
## Accuracy : 0.5992
## 95% CI : (0.5905, 0.6078)
## No Information Rate : 0.5029
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.1984
## Mcnemar's Test P-Value : 0.7127
##
## Sensitivity : 0.5990
## Specificity : 0.5994
## Pos Pred Value : 0.5964
## Neg Pred Value : 0.6020
## Prevalence : 0.4971
## Detection Rate : 0.2978
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5992
##
## 'Positive' Class : 0
##
print("Personal + City Model:")
## [1] "Personal + City Model:"
test_predict_within(m4_person_geocityid)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3772 2440
## 1 2478 3753
##
## Accuracy : 0.6048
## 95% CI : (0.5961, 0.6134)
## No Information Rate : 0.5023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.2095
## Mcnemar's Test P-Value : 0.5978
##
## Sensitivity : 0.6035
## Specificity : 0.6060
## Pos Pred Value : 0.6072
## Neg Pred Value : 0.6023
## Prevalence : 0.5023
## Detection Rate : 0.3031
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.6048
##
## 'Positive' Class : 0
##
As expected, they both did only slightly better than the personal variables only model, with the cities based model doing 1% better in accuracy.
Now I test the performance on the test data.
test_predict_outsample <- function(model) {
ytest_fac <- factor(test$bronch)
ytest_pred_fac <- as.vector(round(predict(model, newdata = test)))
ytest_pred_fac <- factor(ytest_pred_fac, levels=c("0","1"))
print(confusionMatrix(ytest_fac, ytest_pred_fac))
}
print("All personal variables")
## [1] "All personal variables"
test_predict_outsample(m2_age3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1048 204
## 1 1099 330
##
## Accuracy : 0.514
## 95% CI : (0.4949, 0.5331)
## No Information Rate : 0.8008
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0651
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4881
## Specificity : 0.6180
## Pos Pred Value : 0.8371
## Neg Pred Value : 0.2309
## Prevalence : 0.8008
## Detection Rate : 0.3909
## Detection Prevalence : 0.4670
## Balanced Accuracy : 0.5531
##
## 'Positive' Class : 0
##
print("All variables with time, Lat/Lon")
## [1] "All variables with time, Lat/Lon"
test_predict_outsample(m5_time_all)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 985 221
## 1 1051 385
##
## Accuracy : 0.5185
## 95% CI : (0.4993, 0.5378)
## No Information Rate : 0.7706
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0804
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4838
## Specificity : 0.6353
## Pos Pred Value : 0.8167
## Neg Pred Value : 0.2681
## Prevalence : 0.7706
## Detection Rate : 0.3728
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.5596
##
## 'Positive' Class : 0
##
print("All variables with time, Cities")
## [1] "All variables with time, Cities"
test_predict_outsample(m5_time_all_city)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 936 228
## 1 979 434
##
## Accuracy : 0.5316
## 95% CI : (0.5121, 0.551)
## No Information Rate : 0.7431
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1053
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4888
## Specificity : 0.6556
## Pos Pred Value : 0.8041
## Neg Pred Value : 0.3071
## Prevalence : 0.7431
## Detection Rate : 0.3632
## Detection Prevalence : 0.4517
## Balanced Accuracy : 0.5722
##
## 'Positive' Class : 0
##
The test data performance shows accuracy of maximum 53.2%, meaning that our models so far were not able to successfully explain or predict the chance for a bronchitis call using the best logistic regression models attempted. They perform worse on the test data, meaning that they overfitted to the training data.