Intro

The purpose of this document is to detail the analysis I did of the Ottawa 311 data set which was discussed in the recent ‘Open data Ottawa’ meetup (http://www.meetup.com/Open-Data-Ottawa/).

A few apologetic notes:
- this analysis was done quickly, for fun. Were it part of a more serious effort, the documentation would be better, and proper software engineering procedures would be adhered to.
- the statistical analyses reported are only the first steps of what could be a more fulsome analysis.

In all, this document is intended as a starting point, not the conclusion.

Data

The primary data set is available at http://data.ottawa.ca/en/dataset/2014-311-monthly-service-request-submissions

Files from Jan through Aug 2014 were downloaded, converted to .csv, and massaged a little by hand to ensure that the first row of the .csv file was the header row (i.e. no blank rows), and that all the date columns looked like “2014-Jan-19 13:40:00”

Other data used:
Ward information
- area, population, type (urban, suburban, rural)

Demographic info by ward
- %male, %female, %<15 yrs old, %>65 yrs old

Weather by day
- mean temp, snowfall, rain

These are included here as r variables to keep things simple.

Also, I categorized the call_type variable into the bins
- traffic, health, mtce, garbage, admin, animals, parking in order to simplify this initial analysis.

library(lubridate)
library(dplyr)
library(ggplot2)
library(reshape2)

# Of course, you will have to change the following to match your directory structure
data1 <- read.csv("C:/Users/prabinovitch/Downloads/JanSR-2014.csv", stringsAsFactors=FALSE)
data2 <- read.csv("C:/Users/prabinovitch/Downloads/FebSR-2014.csv", stringsAsFactors=FALSE)
data3 <- read.csv("C:/Users/prabinovitch/Downloads/MarSR-2014-2.csv", stringsAsFactors=FALSE)
data4 <- read.csv("C:/Users/prabinovitch/Downloads/AprSR-2.csv", stringsAsFactors=FALSE)
data5 <- read.csv("C:/Users/prabinovitch/Downloads/MaySR-2.csv", stringsAsFactors=FALSE)
data6 <- read.csv("C:/Users/prabinovitch/Downloads/junSR2014.csv", stringsAsFactors=FALSE)
data7 <- read.csv("C:/Users/prabinovitch/Downloads/julSR2014.csv", stringsAsFactors=FALSE)
data8 <- read.csv("C:/Users/prabinovitch/Downloads/AugSR-2014.csv", stringsAsFactors=FALSE)

# load all the data into one dataframe
data<-data1
data<-rbind(data,data2)
data<-rbind(data,data3)
data<-rbind(data,data4)
data<-rbind(data,data5)
data<-rbind(data,data6)
data<-rbind(data,data7)
data<-rbind(data,data8)

# pull out date and ward information
data<-mutate(data,dt=ymd_hms(creation_date))
data<-mutate(data,wards=sapply(strsplit(data$ward,' '),`[`,2))
data<-mutate(data,wardi=strtoi(sapply(strsplit(data$ward,' '),`[`,2)))
data$dow<-wday(data$dt,label=TRUE)
data$hour<-hour(data$dt)
data$day<-day(data$dt)
data$month<-month(data$dt)
data$wkend<-FALSE
data$wkend[data$dow=='Sat']<-TRUE
data$wkend[data$dow=='Sun']<-TRUE

# drop rows with no ward
data<-data[which(!is.na(data$wardi)),]

# categorize the callCategory into a few bins
data$callCategory<-''
data$callCategory[grepl('Traffic',data$call_type)]<-'traffic'
data$callCategory[grepl('Accessibility',data$call_type)]<-'health'
data$callCategory[grepl('Health',data$call_type)]<-'health'
data$callCategory[grepl('Needles',data$call_type)]<-'health'
data$callCategory[grepl('Maintenance',data$call_type)]<-'mtce'
data$callCategory[grepl('Drainage',data$call_type)]<-'mtce'
data$callCategory[grepl('Roads',data$call_type)]<-'mtce'
data$callCategory[grepl('Retail',data$call_type)]<-'garbage'
data$callCategory[grepl('RPAM',data$call_type)]<-'garbage'
data$callCategory[grepl('Waste',data$call_type)]<-'garbage'
data$callCategory[grepl('Birth',data$call_type)]<-'admin'
data$callCategory[grepl('Budget',data$call_type)]<-'admin'
data$callCategory[grepl('Bylaw',data$call_type)]<-'admin'
data$callCategory[grepl('Code',data$call_type)]<-'admin'
data$callCategory[grepl('Graffiti',data$call_type)]<-'admin'
data$callCategory[grepl('Wildlife',data$call_type)]<-'animals'
data$callCategory[grepl('Livestock',data$call_type)]<-'animals'
data$callCategory[grepl('Parking',data$call_type)]<-'parking'

# originally this was loaded from a file via
#    wards <- read.csv("C:/Users/prabinovitch/Downloads/wards.csv")
# but to keep this as self contained as possible...

wards<-data.frame(ward=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
wards$pop<-c(50100,41500,51230,37200,27700,26800,45700,56900,41200,49000,35900,49200,47000,48200,41200,52400,37900,47500,42500,27800,27170,44200,46400)
wards$area_kms_sq<-c(25.5,31.1,26.1,24.2,763.0,16.3,64.2,46.2,47.5,67.2,41.2,7.9,19.8,6.5,15.1,23.3,9.9,20.3,383.0,459.0,744.0,37.6,16.5)
wards$type<-c('suburban','suburban','suburban','suburban','rural','suburban','urban','urban','urban','urban','urban','urban','urban','urban','urban','urban','urban','urban','rural','rural','rural','suburban','suburban')
wards$kidsfrac<-c(0.24398481,0.25369432,0.29450493,0.29861695,0.26441266,0.30085787,0.20285896,0.20219257,0.20477344,0.27300261,0.22093969,0.13817421,0.20620807,0.09409352,0.19385872,0.21383510,0.17953472,0.22400291,0.28698218,0.26661215,0.25303512,0.32160261,0.27439263)
wards$adultsfrac<-c(0.6407459,0.6327346,0.6381514,0.5914929,0.6252709,0.5867125,0.5923928,0.6098449,0.6214336,0.6339426,0.6134535,0.7153105,0.6077385,0.7837330,0.6553769,0.6389295,0.6680766,0.6043856,0.6391341,0.6156205,0.6055977,0.6122620,0.6252451)
wards$seniorsfrac<-1-wards$adultsfrac-wards$kidsfrac
wards$malefrac<-c(0.4857892,0.4827046,0.4892953,0.4888309,0.5045533,0.4849745,0.4630704,0.4804607,0.4876873,0.4794191,0.4760971,0.4892944,0.4647336,0.5047714,0.4793672,0.4885905,0.4756451,0.4809308,0.4915274,0.5003068,0.5051988,0.4887074,0.4860903)
wards$femalfrac<-1-wards$malefrac

names(wards)[1]<-'wardi'
wards<-mutate(wards,density=pop/area_kms_sq)

datf<-data %>%
  group_by(wardi,month,day,callCategory) %>%
  summarise(
    count = n(),
    dow=first(dow),
    wkend=first(wkend)
  )%>%
  arrange(month,day,wardi)  

# originally this was loaded from a file via
# datw <- read.csv("C:/Users/prabinovitch/Downloads/OttawaWeather.csv", stringsAsFactors=FALSE)
# but to keep this as self contained as possible...
datw<-data.frame(Year=rep(2014,243))
datw$Month<-c(rep(1,31),rep(2,28),rep(3,31),rep(4,30),rep(5,31),rep(6,30),rep(7,31),rep(8,31))
datw$Day<-c(1:31,1:28,1:31,1:30,1:31,1:30,1:31,1:31)
datw$MeanTempC<-c(-22.9,-24.5,-25.1,-14.1,-5.2,-4.5,-16.3,-13.4,-11.6,-8.9,0.9,1.4,2.3,-0.6,-2.8,-6.0,-3.1,-6.3,-6.1,-16.0,-23.6,-22.2,-21.8,-17.6,-12.6,-20.4,-14.5,-16.6,-13.8,-10.1,-4.3,-9.1,-6.5,-10.9,-10.4,-13.7,-15.1,-10.8,-11.0,-9.5,-12.9,-17.8,-17.2,-12.6,-8.7,-9.3,-13.9,-15.1,-10.7,-2.8,-3.2,1.6,0.3,-3.3,-10.1,-13.0,-14.0,-12.8,-16.7,-10.2,-15.6,-17.7,-15.5,-13.1,-15.4,-9.9,-4.3,-8.4,-1.6,2.2,-7.2,-13.4,-2.9,-3.3,-13.9,-14.0,-8.4,-3.8,0.4,-2.2,-2.7,-11.0,-11.8,-8.1,-9.6,-9.3,0.4,1.0,1.1,2.6,0.2,2.2,0.7,0.8,2.3,2.7,4.0,3.7,2.1,7.8,6.8,8.1,6.2,14.5,0.1,-3.8,1.1,5.4,3.9,7.2,14.0,8.6,5.8,5.8,6.4,5.2,7.6,10.5,9.7,6.2,12.1,11.3,12.0,7.7,9.5,8.6,8.6,11.4,14.4,17.1,15.9,15.6,12.9,18.4,23.6,14.1,9.7,9.1,14.0,15.4,15.1,17.9,14.9,16.7,16.3,20.8,16.2,12.9,13.6,17.3,16.4,16.2,23.0,18.3,14.8,14.5,16.9,19.9,20.1,22.5,20.2,17.3,16.8,20.8,14.0,16.8,15.8,20.6,19.8,18.2,16.6,18.0,18.9,18.8,20.4,21.4,21.2,20.8,22.1,24.3,25.6,26.3,24.3,19.0,18.8,19.9,22.7,21.1,22.3,17.9,17.1,18.1,21.0,21.4,18.2,19.0,17.8,16.5,19.2,19.0,21.7,22.8,23.9,20.7,17.6,19.7,19.0,20.7,16.4,16.1,16.6,17.2,20.7,21.3,20.8,21.6,20.5,18.5,18.7,19.8,20.9,21.2,21.8,19.2,17.8,13.8,13.9,15.1,16.6,15.0,15.7,19.4,20.7,18.9,20.7,21.0,21.0,21.8,20.2,16.2,16.5,20.6,22.5)
datw$TotalRainMm<-c(0,0,0,0,6,8.4,0,0,0,0,7.6,0,0.4,1,0,0,0,0,0,0,0,0,0,0,0,0,0.4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2.2,2,0,0,0,0,0,0,0,4.8,0,0,0,0,0,0,7,0,0,6.8,16.6,0,2.8,0,1,17.4,5,12.6,0,0,0.6,0,0,2.4,13.6,0,0,0,4.4,0,0,2.4,9.2,7.2,0.2,0.6,18.6,0,0,0,0.4,27.4,0,0,0,1.4,11.2,3.4,16,0.2,0,0,0,0,0,0.2,0,0,0.2,0,0,0,0,0,0,0,10.2,0,0.8,0,0,0,0.6,0,22.0,16.4,9.8,0,0,0,13.8,1.4,0,0,0,0,0,55.6,0.4,0,0,0,0,0,0,1.6,0.4,0,0,1.2,2.4,26.4,0,0,0,0,14,0,5,0,0,0,0,0.6,0,0,2.4,0,0,1,5.6,2.6,0,23.2,22.6,0,0,0,2,0,1,0.6,0,0,0,0,26.4,28.6,1.4,0.2,9.6,1.4,0,0,14,18.8,0,0,0,0,0,0,0,0,2,7.8)
datw$TotalSnowCm<-c(0,0.4,0,1.2,7.8,1,0,0,0,0.8,0,0.2,0,0,0,0,1,1.8,3.6,0.4,0,0,0,0,7,3.6,7.6,0,0.2,0,0,16.2,2,0,0,4.2,0,0,0,2.2,0,0,0,0.2,6.6,0,0,0,4.4,0,0,0.2,0,0,0.4,0,0,3.6,0,9.4,0.4,0,1.6,0.6,0,0,0,1,2,0,8.4,0,0,0.2,0,0,0,2,0.8,0,8.4,0,0,0,0,0,2.6,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.6,rep(0,138))


names(datw)<-c("Year","month","day","MeanTempC","TotalRainMm","TotalSnowCm")
datfw<-left_join(datf,datw)

datall<-inner_join(datfw,wards)

# this will pull out all the parking data on a monthly basis
datallpm<-datall %>%
  filter(callCategory=='parking') %>%
  group_by(wardi,month) %>%
  summarise(
    count=sum(count),
    pop=mean(pop),
    area=mean(area_kms_sq),
    type=first(type),
    malefrac=first(malefrac),
    femalefrac=first(femalfrac),
    kidsfrac=first(kidsfrac),
    seniorsfrac=first(seniorsfrac),
    adultsfrac=first(adultsfrac),
    density=first(density)
  )

# this will pull out all the parking data on a daily basis
datallpd<-datall %>%
  filter(callCategory=='parking') %>%
  group_by(wardi,month,day) %>%
  summarise(
    count=sum(count),
    pop=mean(pop),
    area=mean(area_kms_sq),
    type=first(type),
    malefrac=first(malefrac),
    femalefrac=first(femalfrac),
    kidsfrac=first(kidsfrac),
    seniorsfrac=first(seniorsfrac),
    adultsfrac=first(adultsfrac),
    density=first(density),
    temp=first(MeanTempC),
    rain=first(TotalRainMm),
    snow=first(TotalSnowCm),
    dow=first(dow),
    wkend=first(wkend)
  )

Calls By Day

data %>%
  group_by(month,day) %>%
  summarise(
    count=n(),
    wkend=first(wkend)
    )%>%
  ggplot(aes(month*30+day,count,colour=wkend)) +
  geom_point(size=3.2)

plot of chunk unnamed-chunk-2

Note these features:
- The worst days are early in the year
- Other than the worst days, there appears to be some cyclicality with more calls during the summer months.

Non Weekend Days with Low Volume

There are some weekdays with suspiciously low call volumes:

data %>%
  group_by(month,day) %>%
  summarise(
    count=n(),
    wkend=first(wkend)
  )%>%
  filter(wkend==FALSE,count<300)

## Source: local data frame [8 x 4]
## Groups: month
## 
##   month day count wkend
## 1     1   1   128 FALSE
## 2     2  17   148 FALSE
## 3     4  18   215 FALSE
## 4     4  21   196 FALSE
## 5     5  19   205 FALSE
## 6     5  29   153 FALSE
## 7     7   1   226 FALSE
## 8     8   4   148 FALSE

So they are holidays - except May 29. I wonder what happened then!

Calls By Day - Worst Days

datfw %>%
  group_by(month,day) %>%
  summarise(  
    count=sum(count),
    temp=first(MeanTempC),
    snow=first(TotalSnowCm)
    ) %>%
  arrange(desc(count))

## Source: local data frame [241 x 5]
## Groups: month
## 
##    month day count  temp snow
## 1      2  21  1431   1.6  0.2
## 2      1   6  1295  -4.5  1.0
## 3      1  13  1163   2.3  0.0
## 4      1  14  1106  -0.6  0.0
## 5      1   8  1085 -13.4  0.0
## 6      1   9  1004 -11.6  0.0
## 7      1   7   972 -16.3  0.0
## 8      6   9   918  22.5  0.0
## 9      4  28   888  10.5  0.0
## 10     5  12   866  15.6  0.0
## ..   ... ...   ...   ...  ...

So the worst days were February 21, and a period in early January.

Feb 21, 2014 is an easy one…

http://www.ottawasun.com/2014/02/21/massive-sinkhole-halts-ottawa-lrt-tunnelling

And unsurprisingly….

data %>%
  filter(month==2,day==21) %>%
  group_by(call_type) %>%
  summarise(
    count=n()
  ) %>%
  arrange(desc(count))

## Source: local data frame [20 x 2]
## 
##                    call_type count
## 1          Roads Maintenance  1161
## 2     Solid Waste Collection    73
## 3            Parking Control    59
## 4             Bylaw Services    53
## 5         Traffic Operations    34
## 6       Retail-Green Bin 80L     8
## 7     Parking Ticket Inquiry     7
## 8            Retail-Blue Box     5
## 9      RPAM (Buildings Only)     4
## 10 Retail-All Container(80L)     4
## 11          Retail-Black Box     4
## 12                  Drainage     3
## 13      Park Maintenance SAP     3
## 14         Parking Equipment     3
## 15                    Health     2
## 16 Retail-All Container(46L)     2
## 17   Retail-Blue & Black Box     2
## 18     Retail-Green 80L & KC     2
## 19                  Graffiti     1
## 20     Parking Permits (CSC)     1

almost all calls to 311 that day were for road maintenance.

For the early January days, this is typical: http://www.ottawasun.com/2014/01/06/grim-weather-day-in-ottawa

Smoothed Calls By Day and Category

We next look at smoothed call volume by type and date.

data %>%
  group_by(callCategory,month,day) %>%
  summarise(
    count=n()
  )%>%
  ggplot(aes(month*30+day,count,colour=callCategory)) +
  geom_smooth(size=2)+
  ylim(0,200)

plot of chunk unnamed-chunk-6

It seems the admin category peaks in the summer. This would perhaps be interesting to look at further.

Calls By Month

data %>%
  group_by(month,callCategory) %>%  
  summarise(
    count = n()
  ) %>%
  ggplot(aes(x=month,y=count,fill=callCategory))+
  geom_bar(stat='identity')

plot of chunk unnamed-chunk-7

Calls by month does not show anything particularly interesting at first glance.

Calls by Ward

But looking at the calls by ward does.

data %>%
  group_by(wardi,callCategory) %>%  
  summarise(
    count = n()
  ) %>%
  ggplot(aes(x=wardi,y=count,fill=callCategory))+
  geom_bar(stat='identity')

plot of chunk unnamed-chunk-8

Note that the parking and admin categories are much larger for wards 12 and 14 than other wards. What is different about these wards?

Wards 12 & 14

If we define density=pop/area, the answer appears

wards %>%
  select(wardi, density) %>%
  arrange(desc(density)) %>%
  top_n(6)

##   wardi density
## 1    14    7415
## 2    12    6228
## 3    17    3828
## 4    23    2812
## 5    15    2728
## 6    13    2374

Note ward 12 is Vanier, ward 14 is Somerset. And they both have significantly higher population densities than the other wards.

Parking Model for Monthly Data

Here we first form a linear model for the number of parking related calls, based on the month, population, area of the ward, type of ward (urban, suburban or rural), fraction of the population that is female, fraction of the population that is 15 yrs old or younger, fraction of the population that is 65 yrs old or older,and population density of the ward.

summary(lm(count~factor(month)+pop+area+factor(type)+femalefrac+kidsfrac+seniorsfrac+density,data=datallpm))

## 
## Call:
## lm(formula = count ~ factor(month) + pop + area + factor(type) + 
##     femalefrac + kidsfrac + seniorsfrac + density, data = datallpm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84.11 -14.65  -0.24  17.58 138.94 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.77e+02   2.12e+02    3.20  0.00164 ** 
## factor(month)2       -1.67e+01   9.30e+00   -1.79  0.07526 .  
## factor(month)3       -1.80e+01   9.30e+00   -1.93  0.05469 .  
## factor(month)4       -3.47e+01   9.30e+00   -3.73  0.00026 ***
## factor(month)5       -2.70e+01   9.30e+00   -2.90  0.00420 ** 
## factor(month)6       -1.27e+01   9.30e+00   -1.36  0.17565    
## factor(month)7       -1.66e+01   9.30e+00   -1.78  0.07678 .  
## factor(month)8       -2.53e+01   9.30e+00   -2.72  0.00721 ** 
## pop                   6.47e-04   3.86e-04    1.68  0.09553 .  
## area                 -2.39e-02   3.82e-02   -0.63  0.53190    
## factor(type)suburban -2.11e+01   2.07e+01   -1.02  0.30970    
## factor(type)urban     3.26e+01   2.18e+01    1.49  0.13764    
## femalefrac           -5.22e+02   4.23e+02   -1.23  0.21957    
## kidsfrac             -1.01e+03   1.65e+02   -6.13  6.0e-09 ***
## seniorsfrac          -9.85e+02   1.49e+02   -6.60  5.0e-10 ***
## density               3.00e-02   3.71e-03    8.07  1.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.5 on 168 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.922 
## F-statistic:  145 on 15 and 168 DF,  p-value: <2e-16

We see that the month (in most cases), population, adult fraction (= 1- kidsfraction-seniorsfraction) and population density are significant.

We can simplify and get a model that is almost as good.

summary(lm(count~pop+kidsfrac+seniorsfrac+density,data=datallpm))

## 
## Call:
## lm(formula = count ~ pop + kidsfrac + seniorsfrac + density, 
##     data = datallpm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78.79 -21.13  -0.75  22.22 155.75 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.73e+02   4.66e+01   10.16  < 2e-16 ***
## pop          1.13e-03   3.28e-04    3.44  0.00072 ***
## kidsfrac    -1.42e+03   1.17e+02  -12.07  < 2e-16 ***
## seniorsfrac -9.28e+02   1.14e+02   -8.12  7.4e-14 ***
## density      2.58e-02   2.82e-03    9.16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.4 on 179 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.907 
## F-statistic:  449 on 4 and 179 DF,  p-value: <2e-16

Here all the coefficients make sense:
- increased population and population density tend to increase the number of calls
- kidsfrac and seniorsfrac tend to decrease the number of calls. This is equivalent to adultsfrac increasing the number of calls. And of course, this makes sense because who drives the cars? Adults.

Parking Model for Daily Data

Here we fit model to the daily data. Rather than the month as an independent variable, we can use the daily weather data (although we keep the month in for now to account for non-weather but monthly related effects).

summary(lm(count~factor(month)+pop+area+factor(type)+femalefrac+kidsfrac+seniorsfrac+density+temp+rain+snow+wkend,data=datallpd))

## 
## Call:
## lm(formula = count ~ factor(month) + pop + area + factor(type) + 
##     femalefrac + kidsfrac + seniorsfrac + density + temp + rain + 
##     snow + wkend, data = datallpd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.570  -1.563  -0.249   1.354  24.371 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.58e+01   3.94e+00    6.55  6.2e-11 ***
## factor(month)2       -1.44e-01   1.78e-01   -0.81  0.41629    
## factor(month)3       -6.75e-01   1.77e-01   -3.81  0.00014 ***
## factor(month)4       -1.15e+00   2.40e-01   -4.80  1.7e-06 ***
## factor(month)5       -8.13e-01   3.02e-01   -2.70  0.00704 ** 
## factor(month)6       -5.65e-01   3.40e-01   -1.66  0.09605 .  
## factor(month)7       -9.06e-01   3.45e-01   -2.63  0.00857 ** 
## factor(month)8       -1.10e+00   3.39e-01   -3.25  0.00115 ** 
## pop                   1.46e-05   7.24e-06    2.02  0.04322 *  
## area                  1.51e-04   9.35e-04    0.16  0.87143    
## factor(type)suburban -4.62e-01   4.62e-01   -1.00  0.31727    
## factor(type)urban     1.11e+00   4.75e-01    2.33  0.02007 *  
## femalefrac           -2.23e+01   7.79e+00   -2.87  0.00417 ** 
## kidsfrac             -3.38e+01   3.02e+00  -11.21  < 2e-16 ***
## seniorsfrac          -3.09e+01   2.75e+00  -11.25  < 2e-16 ***
## density               9.70e-04   6.73e-05   14.41  < 2e-16 ***
## temp                  1.12e-02   9.46e-03    1.18  0.23713    
## rain                 -4.64e-03   6.86e-03   -0.68  0.49835    
## snow                  9.01e-02   2.60e-02    3.47  0.00053 ***
## wkendTRUE            -1.94e+00   1.02e-01  -19.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 4547 degrees of freedom
## Multiple R-squared:  0.61,   Adjusted R-squared:  0.608 
## F-statistic:  374 on 19 and 4547 DF,  p-value: <2e-16

We see that month, population, ward type, femalefrac, adultfrac, population density, snowfall, and whether or not the day was a weekend are significant.

Again, simplification leads to a model that is easier to interpret, and as good.

summary(lm(count~pop+factor(type)+femalefrac+kidsfrac+seniorsfrac+density+temp+snow+wkend,data=datallpd))

## 
## Call:
## lm(formula = count ~ pop + factor(type) + femalefrac + kidsfrac + 
##     seniorsfrac + density + temp + snow + wkend, data = datallpd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.411  -1.571  -0.202   1.316  24.627 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.58e+01   3.39e+00    7.61  3.3e-14 ***
## pop                   1.42e-05   7.11e-06    2.00   0.0460 *  
## factor(type)suburban -5.35e-01   2.35e-01   -2.28   0.0225 *  
## factor(type)urban     1.03e+00   2.62e-01    3.92  9.2e-05 ***
## femalefrac           -2.32e+01   7.19e+00   -3.24   0.0012 ** 
## kidsfrac             -3.37e+01   3.03e+00  -11.11  < 2e-16 ***
## seniorsfrac          -3.07e+01   2.69e+00  -11.40  < 2e-16 ***
## density               9.71e-04   6.62e-05   14.67  < 2e-16 ***
## temp                 -7.29e-03   3.33e-03   -2.19   0.0286 *  
## snow                  1.05e-01   2.54e-02    4.14  3.6e-05 ***
## wkendTRUE            -1.93e+00   1.01e-01  -19.06  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.97 on 4556 degrees of freedom
## Multiple R-squared:  0.606,  Adjusted R-squared:  0.605 
## F-statistic:  701 on 10 and 4556 DF,  p-value: <2e-16

Interpretation of the coefficients is similar, the new ones are:
- ward type: urban leads to more calls than rural, and suburban less than rural (this last one is puzzling)
- femalefrac: more females leads to fewer calls
- temperature: a negative temperature gets multiplied by a negative coefficient to lead to more calls when the temperature is below freezing
-snowfall leads to more calls
-a weekend day leads to fewer calls

Conclusion / Next Steps

It is interesting that the monthly model gives better results (as measured by r-squared) than the daily model. This can be understood by two factors:
- the monthly data is in effect a smoothed version of the daily data, and so the model has an easier time fitting
- for the monthly data, we see an average of about 115 calls per month related to parking. At this level of aggregation, a normal distribution (used by the regression model) is not a bad approximation. But when looking at daily data we see an average of about 5 calls per day - and clearly a discrete variable with mean 5 is not accurately approximated by a normal distribution. Hence a next step would be to use a GLM with Poisson link. For more on the statistical aspects of modeling, I can think of no better reference than http://www.stat.columbia.edu/~gelman/arm/

It would also be interesting to delve into the component calls that make up the parking category to see if they can be better understood.

And the same approach could be applied to the admin category.

All in all, the approach could be extended to understand the drivers of large call volumes to the Ottawa 311 system.

Ottawa 311 Doc

Peter Rabinovitch

October 5, 2014