The purpose of this document is to detail the analysis I did of the Ottawa 311 data set which was discussed in the recent ‘Open data Ottawa’ meetup (http://www.meetup.com/Open-Data-Ottawa/).
A few apologetic notes:
- this analysis was done quickly, for fun. Were it part of a more serious effort, the documentation would be better, and proper software engineering procedures would be adhered to.
- the statistical analyses reported are only the first steps of what could be a more fulsome analysis.
In all, this document is intended as a starting point, not the conclusion.
The primary data set is available at http://data.ottawa.ca/en/dataset/2014-311-monthly-service-request-submissions
Files from Jan through Aug 2014 were downloaded, converted to .csv, and massaged a little by hand to ensure that the first row of the .csv file was the header row (i.e. no blank rows), and that all the date columns looked like “2014-Jan-19 13:40:00”
Other data used:
Ward information
- area, population, type (urban, suburban, rural)
Demographic info by ward
- %male, %female, %<15 yrs old, %>65 yrs old
Weather by day
- mean temp, snowfall, rain
These are included here as r variables to keep things simple.
Also, I categorized the call_type variable into the bins
- traffic, health, mtce, garbage, admin, animals, parking in order to simplify this initial analysis.
library(lubridate)
library(dplyr)
library(ggplot2)
library(reshape2)
# Of course, you will have to change the following to match your directory structure
data1 <- read.csv("C:/Users/prabinovitch/Downloads/JanSR-2014.csv", stringsAsFactors=FALSE)
data2 <- read.csv("C:/Users/prabinovitch/Downloads/FebSR-2014.csv", stringsAsFactors=FALSE)
data3 <- read.csv("C:/Users/prabinovitch/Downloads/MarSR-2014-2.csv", stringsAsFactors=FALSE)
data4 <- read.csv("C:/Users/prabinovitch/Downloads/AprSR-2.csv", stringsAsFactors=FALSE)
data5 <- read.csv("C:/Users/prabinovitch/Downloads/MaySR-2.csv", stringsAsFactors=FALSE)
data6 <- read.csv("C:/Users/prabinovitch/Downloads/junSR2014.csv", stringsAsFactors=FALSE)
data7 <- read.csv("C:/Users/prabinovitch/Downloads/julSR2014.csv", stringsAsFactors=FALSE)
data8 <- read.csv("C:/Users/prabinovitch/Downloads/AugSR-2014.csv", stringsAsFactors=FALSE)
# load all the data into one dataframe
data<-data1
data<-rbind(data,data2)
data<-rbind(data,data3)
data<-rbind(data,data4)
data<-rbind(data,data5)
data<-rbind(data,data6)
data<-rbind(data,data7)
data<-rbind(data,data8)
# pull out date and ward information
data<-mutate(data,dt=ymd_hms(creation_date))
data<-mutate(data,wards=sapply(strsplit(data$ward,' '),`[`,2))
data<-mutate(data,wardi=strtoi(sapply(strsplit(data$ward,' '),`[`,2)))
data$dow<-wday(data$dt,label=TRUE)
data$hour<-hour(data$dt)
data$day<-day(data$dt)
data$month<-month(data$dt)
data$wkend<-FALSE
data$wkend[data$dow=='Sat']<-TRUE
data$wkend[data$dow=='Sun']<-TRUE
# drop rows with no ward
data<-data[which(!is.na(data$wardi)),]
# categorize the callCategory into a few bins
data$callCategory<-''
data$callCategory[grepl('Traffic',data$call_type)]<-'traffic'
data$callCategory[grepl('Accessibility',data$call_type)]<-'health'
data$callCategory[grepl('Health',data$call_type)]<-'health'
data$callCategory[grepl('Needles',data$call_type)]<-'health'
data$callCategory[grepl('Maintenance',data$call_type)]<-'mtce'
data$callCategory[grepl('Drainage',data$call_type)]<-'mtce'
data$callCategory[grepl('Roads',data$call_type)]<-'mtce'
data$callCategory[grepl('Retail',data$call_type)]<-'garbage'
data$callCategory[grepl('RPAM',data$call_type)]<-'garbage'
data$callCategory[grepl('Waste',data$call_type)]<-'garbage'
data$callCategory[grepl('Birth',data$call_type)]<-'admin'
data$callCategory[grepl('Budget',data$call_type)]<-'admin'
data$callCategory[grepl('Bylaw',data$call_type)]<-'admin'
data$callCategory[grepl('Code',data$call_type)]<-'admin'
data$callCategory[grepl('Graffiti',data$call_type)]<-'admin'
data$callCategory[grepl('Wildlife',data$call_type)]<-'animals'
data$callCategory[grepl('Livestock',data$call_type)]<-'animals'
data$callCategory[grepl('Parking',data$call_type)]<-'parking'
# originally this was loaded from a file via
# wards <- read.csv("C:/Users/prabinovitch/Downloads/wards.csv")
# but to keep this as self contained as possible...
wards<-data.frame(ward=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
wards$pop<-c(50100,41500,51230,37200,27700,26800,45700,56900,41200,49000,35900,49200,47000,48200,41200,52400,37900,47500,42500,27800,27170,44200,46400)
wards$area_kms_sq<-c(25.5,31.1,26.1,24.2,763.0,16.3,64.2,46.2,47.5,67.2,41.2,7.9,19.8,6.5,15.1,23.3,9.9,20.3,383.0,459.0,744.0,37.6,16.5)
wards$type<-c('suburban','suburban','suburban','suburban','rural','suburban','urban','urban','urban','urban','urban','urban','urban','urban','urban','urban','urban','urban','rural','rural','rural','suburban','suburban')
wards$kidsfrac<-c(0.24398481,0.25369432,0.29450493,0.29861695,0.26441266,0.30085787,0.20285896,0.20219257,0.20477344,0.27300261,0.22093969,0.13817421,0.20620807,0.09409352,0.19385872,0.21383510,0.17953472,0.22400291,0.28698218,0.26661215,0.25303512,0.32160261,0.27439263)
wards$adultsfrac<-c(0.6407459,0.6327346,0.6381514,0.5914929,0.6252709,0.5867125,0.5923928,0.6098449,0.6214336,0.6339426,0.6134535,0.7153105,0.6077385,0.7837330,0.6553769,0.6389295,0.6680766,0.6043856,0.6391341,0.6156205,0.6055977,0.6122620,0.6252451)
wards$seniorsfrac<-1-wards$adultsfrac-wards$kidsfrac
wards$malefrac<-c(0.4857892,0.4827046,0.4892953,0.4888309,0.5045533,0.4849745,0.4630704,0.4804607,0.4876873,0.4794191,0.4760971,0.4892944,0.4647336,0.5047714,0.4793672,0.4885905,0.4756451,0.4809308,0.4915274,0.5003068,0.5051988,0.4887074,0.4860903)
wards$femalfrac<-1-wards$malefrac
names(wards)[1]<-'wardi'
wards<-mutate(wards,density=pop/area_kms_sq)
datf<-data %>%
group_by(wardi,month,day,callCategory) %>%
summarise(
count = n(),
dow=first(dow),
wkend=first(wkend)
)%>%
arrange(month,day,wardi)
# originally this was loaded from a file via
# datw <- read.csv("C:/Users/prabinovitch/Downloads/OttawaWeather.csv", stringsAsFactors=FALSE)
# but to keep this as self contained as possible...
datw<-data.frame(Year=rep(2014,243))
datw$Month<-c(rep(1,31),rep(2,28),rep(3,31),rep(4,30),rep(5,31),rep(6,30),rep(7,31),rep(8,31))
datw$Day<-c(1:31,1:28,1:31,1:30,1:31,1:30,1:31,1:31)
datw$MeanTempC<-c(-22.9,-24.5,-25.1,-14.1,-5.2,-4.5,-16.3,-13.4,-11.6,-8.9,0.9,1.4,2.3,-0.6,-2.8,-6.0,-3.1,-6.3,-6.1,-16.0,-23.6,-22.2,-21.8,-17.6,-12.6,-20.4,-14.5,-16.6,-13.8,-10.1,-4.3,-9.1,-6.5,-10.9,-10.4,-13.7,-15.1,-10.8,-11.0,-9.5,-12.9,-17.8,-17.2,-12.6,-8.7,-9.3,-13.9,-15.1,-10.7,-2.8,-3.2,1.6,0.3,-3.3,-10.1,-13.0,-14.0,-12.8,-16.7,-10.2,-15.6,-17.7,-15.5,-13.1,-15.4,-9.9,-4.3,-8.4,-1.6,2.2,-7.2,-13.4,-2.9,-3.3,-13.9,-14.0,-8.4,-3.8,0.4,-2.2,-2.7,-11.0,-11.8,-8.1,-9.6,-9.3,0.4,1.0,1.1,2.6,0.2,2.2,0.7,0.8,2.3,2.7,4.0,3.7,2.1,7.8,6.8,8.1,6.2,14.5,0.1,-3.8,1.1,5.4,3.9,7.2,14.0,8.6,5.8,5.8,6.4,5.2,7.6,10.5,9.7,6.2,12.1,11.3,12.0,7.7,9.5,8.6,8.6,11.4,14.4,17.1,15.9,15.6,12.9,18.4,23.6,14.1,9.7,9.1,14.0,15.4,15.1,17.9,14.9,16.7,16.3,20.8,16.2,12.9,13.6,17.3,16.4,16.2,23.0,18.3,14.8,14.5,16.9,19.9,20.1,22.5,20.2,17.3,16.8,20.8,14.0,16.8,15.8,20.6,19.8,18.2,16.6,18.0,18.9,18.8,20.4,21.4,21.2,20.8,22.1,24.3,25.6,26.3,24.3,19.0,18.8,19.9,22.7,21.1,22.3,17.9,17.1,18.1,21.0,21.4,18.2,19.0,17.8,16.5,19.2,19.0,21.7,22.8,23.9,20.7,17.6,19.7,19.0,20.7,16.4,16.1,16.6,17.2,20.7,21.3,20.8,21.6,20.5,18.5,18.7,19.8,20.9,21.2,21.8,19.2,17.8,13.8,13.9,15.1,16.6,15.0,15.7,19.4,20.7,18.9,20.7,21.0,21.0,21.8,20.2,16.2,16.5,20.6,22.5)
datw$TotalRainMm<-c(0,0,0,0,6,8.4,0,0,0,0,7.6,0,0.4,1,0,0,0,0,0,0,0,0,0,0,0,0,0.4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2.2,2,0,0,0,0,0,0,0,4.8,0,0,0,0,0,0,7,0,0,6.8,16.6,0,2.8,0,1,17.4,5,12.6,0,0,0.6,0,0,2.4,13.6,0,0,0,4.4,0,0,2.4,9.2,7.2,0.2,0.6,18.6,0,0,0,0.4,27.4,0,0,0,1.4,11.2,3.4,16,0.2,0,0,0,0,0,0.2,0,0,0.2,0,0,0,0,0,0,0,10.2,0,0.8,0,0,0,0.6,0,22.0,16.4,9.8,0,0,0,13.8,1.4,0,0,0,0,0,55.6,0.4,0,0,0,0,0,0,1.6,0.4,0,0,1.2,2.4,26.4,0,0,0,0,14,0,5,0,0,0,0,0.6,0,0,2.4,0,0,1,5.6,2.6,0,23.2,22.6,0,0,0,2,0,1,0.6,0,0,0,0,26.4,28.6,1.4,0.2,9.6,1.4,0,0,14,18.8,0,0,0,0,0,0,0,0,2,7.8)
datw$TotalSnowCm<-c(0,0.4,0,1.2,7.8,1,0,0,0,0.8,0,0.2,0,0,0,0,1,1.8,3.6,0.4,0,0,0,0,7,3.6,7.6,0,0.2,0,0,16.2,2,0,0,4.2,0,0,0,2.2,0,0,0,0.2,6.6,0,0,0,4.4,0,0,0.2,0,0,0.4,0,0,3.6,0,9.4,0.4,0,1.6,0.6,0,0,0,1,2,0,8.4,0,0,0.2,0,0,0,2,0.8,0,8.4,0,0,0,0,0,2.6,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.6,rep(0,138))
names(datw)<-c("Year","month","day","MeanTempC","TotalRainMm","TotalSnowCm")
datfw<-left_join(datf,datw)
datall<-inner_join(datfw,wards)
# this will pull out all the parking data on a monthly basis
datallpm<-datall %>%
filter(callCategory=='parking') %>%
group_by(wardi,month) %>%
summarise(
count=sum(count),
pop=mean(pop),
area=mean(area_kms_sq),
type=first(type),
malefrac=first(malefrac),
femalefrac=first(femalfrac),
kidsfrac=first(kidsfrac),
seniorsfrac=first(seniorsfrac),
adultsfrac=first(adultsfrac),
density=first(density)
)
# this will pull out all the parking data on a daily basis
datallpd<-datall %>%
filter(callCategory=='parking') %>%
group_by(wardi,month,day) %>%
summarise(
count=sum(count),
pop=mean(pop),
area=mean(area_kms_sq),
type=first(type),
malefrac=first(malefrac),
femalefrac=first(femalfrac),
kidsfrac=first(kidsfrac),
seniorsfrac=first(seniorsfrac),
adultsfrac=first(adultsfrac),
density=first(density),
temp=first(MeanTempC),
rain=first(TotalRainMm),
snow=first(TotalSnowCm),
dow=first(dow),
wkend=first(wkend)
)
data %>%
group_by(month,day) %>%
summarise(
count=n(),
wkend=first(wkend)
)%>%
ggplot(aes(month*30+day,count,colour=wkend)) +
geom_point(size=3.2)
Note these features:
- The worst days are early in the year
- Other than the worst days, there appears to be some cyclicality with more calls during the summer months.
There are some weekdays with suspiciously low call volumes:
data %>%
group_by(month,day) %>%
summarise(
count=n(),
wkend=first(wkend)
)%>%
filter(wkend==FALSE,count<300)
## Source: local data frame [8 x 4]
## Groups: month
##
## month day count wkend
## 1 1 1 128 FALSE
## 2 2 17 148 FALSE
## 3 4 18 215 FALSE
## 4 4 21 196 FALSE
## 5 5 19 205 FALSE
## 6 5 29 153 FALSE
## 7 7 1 226 FALSE
## 8 8 4 148 FALSE
So they are holidays - except May 29. I wonder what happened then!
datfw %>%
group_by(month,day) %>%
summarise(
count=sum(count),
temp=first(MeanTempC),
snow=first(TotalSnowCm)
) %>%
arrange(desc(count))
## Source: local data frame [241 x 5]
## Groups: month
##
## month day count temp snow
## 1 2 21 1431 1.6 0.2
## 2 1 6 1295 -4.5 1.0
## 3 1 13 1163 2.3 0.0
## 4 1 14 1106 -0.6 0.0
## 5 1 8 1085 -13.4 0.0
## 6 1 9 1004 -11.6 0.0
## 7 1 7 972 -16.3 0.0
## 8 6 9 918 22.5 0.0
## 9 4 28 888 10.5 0.0
## 10 5 12 866 15.6 0.0
## .. ... ... ... ... ...
So the worst days were February 21, and a period in early January.
Feb 21, 2014 is an easy one…
http://www.ottawasun.com/2014/02/21/massive-sinkhole-halts-ottawa-lrt-tunnelling
And unsurprisingly….
data %>%
filter(month==2,day==21) %>%
group_by(call_type) %>%
summarise(
count=n()
) %>%
arrange(desc(count))
## Source: local data frame [20 x 2]
##
## call_type count
## 1 Roads Maintenance 1161
## 2 Solid Waste Collection 73
## 3 Parking Control 59
## 4 Bylaw Services 53
## 5 Traffic Operations 34
## 6 Retail-Green Bin 80L 8
## 7 Parking Ticket Inquiry 7
## 8 Retail-Blue Box 5
## 9 RPAM (Buildings Only) 4
## 10 Retail-All Container(80L) 4
## 11 Retail-Black Box 4
## 12 Drainage 3
## 13 Park Maintenance SAP 3
## 14 Parking Equipment 3
## 15 Health 2
## 16 Retail-All Container(46L) 2
## 17 Retail-Blue & Black Box 2
## 18 Retail-Green 80L & KC 2
## 19 Graffiti 1
## 20 Parking Permits (CSC) 1
almost all calls to 311 that day were for road maintenance.
For the early January days, this is typical: http://www.ottawasun.com/2014/01/06/grim-weather-day-in-ottawa
We next look at smoothed call volume by type and date.
data %>%
group_by(callCategory,month,day) %>%
summarise(
count=n()
)%>%
ggplot(aes(month*30+day,count,colour=callCategory)) +
geom_smooth(size=2)+
ylim(0,200)
It seems the admin category peaks in the summer. This would perhaps be interesting to look at further.
data %>%
group_by(month,callCategory) %>%
summarise(
count = n()
) %>%
ggplot(aes(x=month,y=count,fill=callCategory))+
geom_bar(stat='identity')
Calls by month does not show anything particularly interesting at first glance.
But looking at the calls by ward does.
data %>%
group_by(wardi,callCategory) %>%
summarise(
count = n()
) %>%
ggplot(aes(x=wardi,y=count,fill=callCategory))+
geom_bar(stat='identity')
Note that the parking and admin categories are much larger for wards 12 and 14 than other wards. What is different about these wards?
If we define density=pop/area, the answer appears
wards %>%
select(wardi, density) %>%
arrange(desc(density)) %>%
top_n(6)
## wardi density
## 1 14 7415
## 2 12 6228
## 3 17 3828
## 4 23 2812
## 5 15 2728
## 6 13 2374
Note ward 12 is Vanier, ward 14 is Somerset. And they both have significantly higher population densities than the other wards.
Here we first form a linear model for the number of parking related calls, based on the month, population, area of the ward, type of ward (urban, suburban or rural), fraction of the population that is female, fraction of the population that is 15 yrs old or younger, fraction of the population that is 65 yrs old or older,and population density of the ward.
summary(lm(count~factor(month)+pop+area+factor(type)+femalefrac+kidsfrac+seniorsfrac+density,data=datallpm))
##
## Call:
## lm(formula = count ~ factor(month) + pop + area + factor(type) +
## femalefrac + kidsfrac + seniorsfrac + density, data = datallpm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.11 -14.65 -0.24 17.58 138.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.77e+02 2.12e+02 3.20 0.00164 **
## factor(month)2 -1.67e+01 9.30e+00 -1.79 0.07526 .
## factor(month)3 -1.80e+01 9.30e+00 -1.93 0.05469 .
## factor(month)4 -3.47e+01 9.30e+00 -3.73 0.00026 ***
## factor(month)5 -2.70e+01 9.30e+00 -2.90 0.00420 **
## factor(month)6 -1.27e+01 9.30e+00 -1.36 0.17565
## factor(month)7 -1.66e+01 9.30e+00 -1.78 0.07678 .
## factor(month)8 -2.53e+01 9.30e+00 -2.72 0.00721 **
## pop 6.47e-04 3.86e-04 1.68 0.09553 .
## area -2.39e-02 3.82e-02 -0.63 0.53190
## factor(type)suburban -2.11e+01 2.07e+01 -1.02 0.30970
## factor(type)urban 3.26e+01 2.18e+01 1.49 0.13764
## femalefrac -5.22e+02 4.23e+02 -1.23 0.21957
## kidsfrac -1.01e+03 1.65e+02 -6.13 6.0e-09 ***
## seniorsfrac -9.85e+02 1.49e+02 -6.60 5.0e-10 ***
## density 3.00e-02 3.71e-03 8.07 1.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.5 on 168 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.922
## F-statistic: 145 on 15 and 168 DF, p-value: <2e-16
We see that the month (in most cases), population, adult fraction (= 1- kidsfraction-seniorsfraction) and population density are significant.
We can simplify and get a model that is almost as good.
summary(lm(count~pop+kidsfrac+seniorsfrac+density,data=datallpm))
##
## Call:
## lm(formula = count ~ pop + kidsfrac + seniorsfrac + density,
## data = datallpm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.79 -21.13 -0.75 22.22 155.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.73e+02 4.66e+01 10.16 < 2e-16 ***
## pop 1.13e-03 3.28e-04 3.44 0.00072 ***
## kidsfrac -1.42e+03 1.17e+02 -12.07 < 2e-16 ***
## seniorsfrac -9.28e+02 1.14e+02 -8.12 7.4e-14 ***
## density 2.58e-02 2.82e-03 9.16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.4 on 179 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.907
## F-statistic: 449 on 4 and 179 DF, p-value: <2e-16
Here all the coefficients make sense:
- increased population and population density tend to increase the number of calls
- kidsfrac and seniorsfrac tend to decrease the number of calls. This is equivalent to adultsfrac increasing the number of calls. And of course, this makes sense because who drives the cars? Adults.
Here we fit model to the daily data. Rather than the month as an independent variable, we can use the daily weather data (although we keep the month in for now to account for non-weather but monthly related effects).
summary(lm(count~factor(month)+pop+area+factor(type)+femalefrac+kidsfrac+seniorsfrac+density+temp+rain+snow+wkend,data=datallpd))
##
## Call:
## lm(formula = count ~ factor(month) + pop + area + factor(type) +
## femalefrac + kidsfrac + seniorsfrac + density + temp + rain +
## snow + wkend, data = datallpd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.570 -1.563 -0.249 1.354 24.371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.58e+01 3.94e+00 6.55 6.2e-11 ***
## factor(month)2 -1.44e-01 1.78e-01 -0.81 0.41629
## factor(month)3 -6.75e-01 1.77e-01 -3.81 0.00014 ***
## factor(month)4 -1.15e+00 2.40e-01 -4.80 1.7e-06 ***
## factor(month)5 -8.13e-01 3.02e-01 -2.70 0.00704 **
## factor(month)6 -5.65e-01 3.40e-01 -1.66 0.09605 .
## factor(month)7 -9.06e-01 3.45e-01 -2.63 0.00857 **
## factor(month)8 -1.10e+00 3.39e-01 -3.25 0.00115 **
## pop 1.46e-05 7.24e-06 2.02 0.04322 *
## area 1.51e-04 9.35e-04 0.16 0.87143
## factor(type)suburban -4.62e-01 4.62e-01 -1.00 0.31727
## factor(type)urban 1.11e+00 4.75e-01 2.33 0.02007 *
## femalefrac -2.23e+01 7.79e+00 -2.87 0.00417 **
## kidsfrac -3.38e+01 3.02e+00 -11.21 < 2e-16 ***
## seniorsfrac -3.09e+01 2.75e+00 -11.25 < 2e-16 ***
## density 9.70e-04 6.73e-05 14.41 < 2e-16 ***
## temp 1.12e-02 9.46e-03 1.18 0.23713
## rain -4.64e-03 6.86e-03 -0.68 0.49835
## snow 9.01e-02 2.60e-02 3.47 0.00053 ***
## wkendTRUE -1.94e+00 1.02e-01 -19.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 4547 degrees of freedom
## Multiple R-squared: 0.61, Adjusted R-squared: 0.608
## F-statistic: 374 on 19 and 4547 DF, p-value: <2e-16
We see that month, population, ward type, femalefrac, adultfrac, population density, snowfall, and whether or not the day was a weekend are significant.
Again, simplification leads to a model that is easier to interpret, and as good.
summary(lm(count~pop+factor(type)+femalefrac+kidsfrac+seniorsfrac+density+temp+snow+wkend,data=datallpd))
##
## Call:
## lm(formula = count ~ pop + factor(type) + femalefrac + kidsfrac +
## seniorsfrac + density + temp + snow + wkend, data = datallpd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.411 -1.571 -0.202 1.316 24.627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.58e+01 3.39e+00 7.61 3.3e-14 ***
## pop 1.42e-05 7.11e-06 2.00 0.0460 *
## factor(type)suburban -5.35e-01 2.35e-01 -2.28 0.0225 *
## factor(type)urban 1.03e+00 2.62e-01 3.92 9.2e-05 ***
## femalefrac -2.32e+01 7.19e+00 -3.24 0.0012 **
## kidsfrac -3.37e+01 3.03e+00 -11.11 < 2e-16 ***
## seniorsfrac -3.07e+01 2.69e+00 -11.40 < 2e-16 ***
## density 9.71e-04 6.62e-05 14.67 < 2e-16 ***
## temp -7.29e-03 3.33e-03 -2.19 0.0286 *
## snow 1.05e-01 2.54e-02 4.14 3.6e-05 ***
## wkendTRUE -1.93e+00 1.01e-01 -19.06 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.97 on 4556 degrees of freedom
## Multiple R-squared: 0.606, Adjusted R-squared: 0.605
## F-statistic: 701 on 10 and 4556 DF, p-value: <2e-16
Interpretation of the coefficients is similar, the new ones are:
- ward type: urban leads to more calls than rural, and suburban less than rural (this last one is puzzling)
- femalefrac: more females leads to fewer calls
- temperature: a negative temperature gets multiplied by a negative coefficient to lead to more calls when the temperature is below freezing
-snowfall leads to more calls
-a weekend day leads to fewer calls
It is interesting that the monthly model gives better results (as measured by r-squared) than the daily model. This can be understood by two factors:
- the monthly data is in effect a smoothed version of the daily data, and so the model has an easier time fitting
- for the monthly data, we see an average of about 115 calls per month related to parking. At this level of aggregation, a normal distribution (used by the regression model) is not a bad approximation. But when looking at daily data we see an average of about 5 calls per day - and clearly a discrete variable with mean 5 is not accurately approximated by a normal distribution. Hence a next step would be to use a GLM with Poisson link. For more on the statistical aspects of modeling, I can think of no better reference than http://www.stat.columbia.edu/~gelman/arm/
It would also be interesting to delve into the component calls that make up the parking category to see if they can be better understood.
And the same approach could be applied to the admin category.
All in all, the approach could be extended to understand the drivers of large call volumes to the Ottawa 311 system.