I started my career in Intelligent Transportation Systems and worked for Caltrans (the California Department of Transportation) in graduate school, so I thought I might like to look at traffic counts. My home state has 48-hour traffic counts available, but they have to be downloaded, cleaned, and loaded individually for each site, which seemed like a lot of work.
I then went to the federal DOT, but it only had state summaries that did not have enough data, so I looked at the State of California’s website. I hope the statewide annual average daily traffic counts will provide a data set that is good enough for this investigation.
California Average Annual Daily Traffic Counts for 2012
The data was obtained at http://traffic-counts.dot.ca.gov/docs/2012AADT.xlsx
A description of the data is at http://traffic-counts.dot.ca.gov/docs/traffic-counts-diagram.pdf and http://traffic-counts.dot.ca.gov/2013all/
The data came in Excel format. The sheet repeats its header row at regular intervals, so in Excel I inserted a ‘#’ in front of every header row except the first and saved the file as CSV. I then read the data into R, treating lines that start with ‘#’ as comments, which drops the extra headers.
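The same cleanup could probably be done entirely in R, without hand-editing the sheet. The following is only a sketch under assumptions: `raw_lines` is a made-up miniature of the exported file, and it assumes every repeated header is an exact copy of the first line.

```r
# Hypothetical miniature of the exported sheet: the header row repeats mid-file.
raw_lines <- c(
  "Dist,Route,County,Postmile",
  "12,1,ORA,0.129",
  "12,1,ORA,0.780",
  "Dist,Route,County,Postmile",
  "12,1,ORA,8.430"
)

# Keep the first header; drop every later line that is identical to it.
header    <- raw_lines[1]
is_repeat <- raw_lines == header & seq_along(raw_lines) > 1

# read.csv() can parse a character vector of lines through its `text` argument.
clean <- read.csv(text = raw_lines[!is_repeat], stringsAsFactors = FALSE)
```

For the real file, `raw_lines` would come from `readLines()` on the saved CSV; whether the repeated headers really are byte-identical would need to be checked first.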
Looking at the data, it had 3 unnamed columns (X, X.1, and X.2). I did some investigation to find out what the column names were and what the abbreviations in the fields meant.
I obtained the county name/abbreviation by copying data from http://sv08data.dot.ca.gov/contractcost/map.html and pasting into Excel and saving as csv.
I also obtained the definitions for the letters in the Postmile Prefix column (X.1) from http://traffic-counts.dot.ca.gov/docs/traffic-data-faq.pdf, and the definitions for the Route Suffix (X) and Alignment (X.2) columns from http://www.dot.ca.gov/cwwp2/documentation/prefix-suffix-alignment-charts.htm. I put each set of definitions in its own table:
| Route Suffix | Description |
|---|---|
| S | Supplemental Route |
| U | Unrelinquished Route |
| Prefix | Meaning |
|---|---|
| C | Commercial lane |
| D | Duplicate post mile at meandering county line |
| G | Reposting of duplicate post mile at end of route |
| H | Realignment of D mileage |
| L | Overlap post mile |
| M | Realignment of R mileage |
| N | Realignment of M mileage |
| R | First realignment |
| S | Spur |
| T | Temporary connection |
| U | Unrelinquished |
| Alignment | Description |
|---|---|
| L | Left independent alignment |
| R | Right independent alignment |
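Lookup tables like these are typically applied with `match()`, which returns, for each code in the data, the row of the lookup table that holds it. A minimal sketch, assuming a made-up three-row county table (`county_table`, `codes`, and the `"XX"` code are all hypothetical):

```r
# Hypothetical lookup table: abbreviation -> full county name.
county_table <- data.frame(
  ABBREV = c("LA", "ORA", "SD"),
  COUNTY = c("Los Angeles", "Orange", "San Diego"),
  stringsAsFactors = FALSE
)

# Codes as they might appear in the data; "XX" has no entry in the table.
codes <- c("ORA", "LA", "ORA", "XX")

# match() gives the row index of each code, or NA when the code is unknown.
idx          <- match(codes, county_table$ABBREV)
county_names <- county_table$COUNTY[idx]
```

Unknown codes come back as `NA` rather than raising an error, so counting the `NA`s afterwards is a cheap check that the lookup table is complete.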
First load the libraries I know I will need later…
library(ggplot2)
library(tidyr)
library(reshape2)
library(GGally)
library(gridExtra)
library(Hmisc)
library(dplyr)
# And load the project directory
pdir <- "~/DataAnalystNanoDegree/DataAnalysisWithR/Project4"
# read in the dataset, ignoring comments
dfp3 <- read.csv(paste0(pdir, "/2012AADT.csv"), comment.char = "#")
# Get county names
county_table_CA <- read.csv(paste0(pdir, "/CA_Counties.csv"))
# Get postmile prefix abbreviations
postmile_prefix_table <- read.csv(paste0(pdir, "/PostmilePrefix.csv"))
# Get route suffix abbreviations
route_suffix_table <- read.csv(paste0(pdir, "/RouteSuffix.csv"))
# Get alignment abbreviations
alignment <- read.csv(paste0(pdir, "/Alignment.csv"))
# Add names for unnamed columns
colnames(dfp3)[3]<- "route_suffix"
colnames(dfp3)[5]<- "postmile_prefix"
colnames(dfp3)[7]<- "alignment"
# Make all names lowercase
colnames(dfp3)<- tolower(colnames(dfp3))
# Change .s in names to _
colnames(dfp3) <- gsub("\\.+", "_", colnames(dfp3), perl = TRUE)
# Change route numbers and district numbers to factors
dfp3$route <-as.factor(dfp3$route)
dfp3$dist <- as.factor(dfp3$dist)
#Add a column containing actual county names
countyMap <- match(as.character(dfp3$county),
as.character(county_table_CA$ABBREV.))
dfp3$county_name <-county_table_CA$COUNTY[countyMap]
head(dfp3)
## dist route route_suffix county postmile_prefix postmile alignment
## 1 12 1 ORA R 0.129
## 2 12 1 ORA R 0.780
## 3 12 1 ORA 8.430
## 4 12 1 ORA 9.418
## 5 12 1 ORA 9.600
## 6 12 1 ORA 11.500
## description back_peak_hour back_peak_month
## 1 DANA POINT, JCT. RTE. 5 NA NA
## 2 DANA POINT, DOHENY PARK ROAD 3750 40000
## 3 LAGUNA BEACH, MOUNTAIN ROAD 2850 38500
## 4 LAGUNA BEACH, JCT. RTE. 133 NORTH 3000 40500
## 5 LAGUNA BEACH, CLIFF DR/ASTER ST 3350 39500
## 6 LAGUNA BEACH, NORTH CITY LIMITS 3150 37500
## back_aadt ahead_peak_hour ahead_peak_aadt ahead_aadt county_name
## 1 NA 3750 40000 37000 Orange
## 2 37000 3900 42000 38500 Orange
## 3 36000 2850 38500 36000 Orange
## 4 38000 3400 40500 38000 Orange
## 5 37000 3350 39500 37000 Orange
## 6 35000 3150 38500 35000 Orange
str(dfp3)
## 'data.frame': 6777 obs. of 15 variables:
## $ dist : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ route : Factor w/ 243 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ route_suffix : Factor w/ 3 levels "","S","U": 1 1 1 1 1 1 1 1 1 1 ...
## $ county : Factor w/ 59 levels "","ALA","ALP",..: 31 31 31 31 31 31 31 31 31 31 ...
## $ postmile_prefix: Factor w/ 8 levels "","C","D","L",..: 6 6 1 1 1 1 6 6 1 1 ...
## $ postmile : num 0.129 0.78 8.43 9.418 9.6 ...
## $ alignment : Factor w/ 3 levels "","L","R": 1 1 1 1 1 1 1 1 1 1 ...
## $ description : Factor w/ 5762 levels ""," JCT. RTE. 101",..: 1129 1128 2623 2622 2619 2625 3508 3509 3510 3507 ...
## $ back_peak_hour : int NA 3750 2850 3000 3350 3150 4000 4250 4350 5400 ...
## $ back_peak_month: int NA 40000 38500 40500 39500 37500 49500 53000 53000 52000 ...
## $ back_aadt : int NA 37000 36000 38000 37000 35000 45000 48000 48700 48500 ...
## $ ahead_peak_hour: int 3750 3900 2850 3400 3350 3150 4800 4250 5300 5400 ...
## $ ahead_peak_aadt: int 40000 42000 38500 40500 39500 38500 59000 53000 52000 52000 ...
## $ ahead_aadt : int 37000 38500 36000 38000 37000 35000 54000 48000 48700 48500 ...
## $ county_name : Factor w/ 58 levels "Alameda","Alpine",..: 30 30 30 30 30 30 30 30 30 30 ...
summary(dfp3)
## dist route route_suffix county postmile_prefix
## 4 :1084 101 : 507 :6747 LA : 762 :4704
## 7 : 913 5 : 420 S: 20 SD : 453 R :1844
## 3 : 771 1 : 278 U: 10 SBD : 339 M : 71
## 6 : 721 99 : 254 KER : 267 T : 70
## 11 : 609 80 : 157 ORA : 261 L : 63
## (Other):2670 (Other):5152 RIV : 242 S : 16
## NA's : 9 NA's : 9 (Other):4453 (Other): 9
## postmile alignment description
## Min. : 0.000 :6658 JCT. RTE. 5 : 33
## 1st Qu.: 5.752 L: 54 JCT. RTE. 101 : 16
## Median : 15.367 R: 65 JCT. RTE. 99 : 14
## Mean : 22.308 NEVADA STATE LINE : 13
## 3rd Qu.: 30.408 JCT. RTE. 15 : 12
## Max. :186.238 LOS ANGELES/SAN BERNARDINO COUNTY LINE: 12
## NA's :15 (Other) :6677
## back_peak_hour back_peak_month back_aadt ahead_peak_hour
## Min. : 10 Min. : 100 Min. : 80 Min. : 10
## 1st Qu.: 940 1st Qu.: 9400 1st Qu.: 8200 1st Qu.: 940
## Median : 2600 Median : 29000 Median : 25600 Median : 2600
## Mean : 5326 Mean : 68262 Mean : 64726 Mean : 5333
## 3rd Qu.: 8800 3rd Qu.:109000 3rd Qu.:104000 3rd Qu.: 8800
## Max. :31000 Max. :406000 Max. :377500 Max. :31000
## NA's :546 NA's :546 NA's :546 NA's :546
## ahead_peak_aadt ahead_aadt county_name
## Min. : 100 Min. : 80 Los Angeles : 762
## 1st Qu.: 9500 1st Qu.: 8400 San Diego : 453
## Median : 29000 Median : 26000 San Bernardino: 339
## Mean : 68318 Mean : 64773 Kern : 267
## 3rd Qu.:109000 3rd Qu.:104000 Orange : 261
## Max. :406000 Max. :377500 (Other) :4686
## NA's :546 NA's :546 NA's : 9
This dataset has district (dist) and route, which are numeric factors; county, which is a character factor; and a numeric postmile field (milepost). There are three columns containing alphabetic keys, and a character description (all factors). Then there are six integer traffic count fields: three are Back of the location (South or West), and three are Ahead of the location (North or East). There are back and ahead peak_hour counts, back and ahead peak-month counts, and back and ahead AADT (Annual Average Daily Traffic), which is the total volume for the year divided by 365 days. Note that the ahead peak-month column is named ahead_peak_aadt in the data, while the back one is named back_peak_month.
### Create a Tidy Dataset
A tidy dataset has one observation per row; here, that means one traffic count value per row.
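As a small illustration of the wide-to-long reshaping, here is a sketch on a hypothetical two-row data frame. The column names mirror the real ones, but `toy` itself is made up, and the sketch uses base R's `reshape()` rather than reshape2's `melt()` only so that it runs without extra packages.

```r
# Hypothetical wide data: one row per location, one column per count type.
toy <- data.frame(
  county     = c("ORA", "LA"),
  back_aadt  = c(37000, 45000),
  ahead_aadt = c(38500, 54000),
  stringsAsFactors = FALSE
)

# Long form: one row per (location, count type) pair.
long <- reshape(
  toy,
  direction = "long",
  idvar     = "county",
  varying   = c("back_aadt", "ahead_aadt"),
  v.names   = "value",
  timevar   = "variable",
  times     = c("back_aadt", "ahead_aadt")
)
```

Each of the two wide rows becomes two long rows, with the former column name stored in `variable` and the count in `value`.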
tidyData <- melt(dfp3, id.vars = 1:8, measure.vars = 9:14,
variable.name = "variable", value.name = "value")
table(dfp3$dist)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 325 411 771 1084 460 721 913 581 98 533 609 262
table(dfp3$route)
##
## 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19
## 278 34 27 65 420 6 10 70 21 152 73 14 39 130 37 17 38 6
## 20 22 23 24 25 26 27 28 29 32 33 34 35 36 37 38 39 40
## 77 19 22 15 16 26 10 12 56 42 92 13 30 60 13 15 34 23
## 41 43 44 45 46 47 49 50 51 52 53 54 55 56 57 58 59 60
## 56 32 24 21 25 6 122 66 16 11 4 11 20 10 28 59 16 64
## 61 62 63 65 66 67 68 70 71 72 73 74 75 76 77 78 79 80
## 8 25 32 37 12 20 16 56 18 11 12 29 24 24 3 69 21 157
## 82 83 84 85 86 87 88 89 90 91 92 94 95 96 97 98 99 101
## 48 11 47 24 43 10 37 58 19 68 25 37 11 24 11 26 254 507
## 103 104 105 107 108 109 110 111 112 113 114 115 116 118 119 120 121 123
## 4 16 17 9 35 2 44 42 4 30 2 15 27 35 11 42 23 14
## 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 142
## 4 12 20 11 38 9 6 5 36 9 18 16 2 22 32 14 32 11
## 144 145 146 147 150 151 152 153 154 155 156 158 160 161 162 163 164 165
## 2 30 4 6 17 12 38 3 7 16 11 5 19 3 45 21 8 19
## 166 167 168 169 170 172 173 174 175 177 178 180 182 183 184 185 186 187
## 17 2 29 7 10 3 8 14 10 5 38 34 3 5 12 14 3 10
## 188 189 190 191 192 193 195 197 198 199 200 201 202 203 204 205 207 210
## 3 6 23 7 11 15 2 2 50 4 3 15 6 7 10 7 2 72
## 211 213 215 216 217 218 219 220 221 222 223 225 227 229 232 233 236 237
## 5 7 47 13 4 5 2 6 4 3 12 9 9 2 4 7 5 16
## 238 241 242 243 244 245 246 247 253 254 255 259 260 261 262 263 265 266
## 11 11 6 8 4 9 7 8 2 7 9 5 3 6 3 3 2 3
## 267 269 270 271 273 275 280 281 282 283 284 299 330 371 380 395 405 505
## 9 7 2 7 34 4 58 3 11 2 2 64 2 4 4 67 74 14
## 580 605 680 710 780 805 880 905 980
## 70 27 64 28 9 32 49 9 4
table(dfp3$route_suffix)
##
## S U
## 6747 20 10
table(dfp3$county_name)
##
## Alameda Alpine Amador Butte
## 220 19 47 96
## Calaveras Colusa Contra Costa Del Norte
## 50 36 109 36
## El Dorado Fresno Glenn Humboldt
## 80 183 42 146
## Imperial Inyo Kern Kings
## 156 45 267 49
## Lake Lassen Los Angeles Madera
## 45 41 762 56
## Marin Mariposa Mendocino Merced
## 60 31 98 103
## Modoc Mono Monterey Napa
## 22 53 111 59
## Nevada Orange Placer Plumas
## 68 261 103 44
## Riverside Sacramento San Benito San Bernardino
## 242 151 27 339
## San Diego San Francisco San Joaquin San Luis Obispo
## 453 52 139 126
## San Mateo Santa Barbara Santa Clara Santa Cruz
## 157 128 202 70
## Shasta Sierra Siskiyou Solano
## 138 22 84 93
## Sonoma Stanislaus Sutter Tehama
## 132 91 40 54
## Trinity Tulare Tuolumne Ventura
## 28 163 54 152
## Yolo Yuba
## 89 44
# Look at counts by county
ggplot(data = dfp3, aes(x = county)) + geom_bar() +
theme(axis.text.x = element_text(angle=45)) +
ggtitle(label = "Number of California Traffic Counts by County")
# ggplot(data = dfp3, aes(x = county, y = ..count..)) + geom_bar() +
# theme(axis.text.x = element_text(angle=45)) +
# ggtitle(label = "Number of California Traffic Counts by County")
From the first plot, it is clear that Los Angeles County has the most locations at which Caltrans takes traffic counts. Here are the top 10 counties by number of traffic counts:
sort(table(dfp3$county_name), decreasing = TRUE)[1:10]
##
## Los Angeles San Diego San Bernardino Kern Orange
## 762 453 339 267 261
## Riverside Alameda Santa Clara Fresno Tulare
## 242 220 202 183 163
Now look at it by Route Number
# a function to convert a numeric factor into a number
nbr <- function(x) {
as.numeric(as.character(x))
}
#table(nbr(dfp3$route))
p1<-ggplot(data = dfp3, aes(x = nbr(route))) +
geom_histogram(binwidth = 10, colour="#FF9999") +
theme(axis.text.x = element_text(angle=45)) +
ggtitle(label = "Histogram of California Traffic Counts by Route 10 Bin")
p2<-ggplot(data = dfp3, aes(x = route)) + geom_bar() +
theme(axis.text.x = element_text(angle=45)) +
ggtitle(label = "Number of California Traffic Counts by Route as Factor")
sort(table(dfp3$route), decreasing = TRUE)[1:10]
##
## 101 5 1 99 80 10 15 49 33 20
## 507 420 278 254 157 152 130 122 92 77
grid.arrange(p1,p2, ncol = 1)
## Warning: position_stack requires constant width: output may be incorrect
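The `nbr()` helper above goes through `as.character()` for a reason: calling `as.numeric()` directly on a factor returns the internal level codes, not the route numbers. A short sketch with a made-up factor of four routes:

```r
# Hypothetical factor of route numbers; levels sort as strings: "1", "101", "5", "99".
route <- factor(c("101", "5", "1", "99"))

# Direct conversion yields the position of each value among the levels.
codes <- as.numeric(route)

# Converting to character first recovers the actual route numbers.
values <- as.numeric(as.character(route))
```

Here `codes` holds level positions (for example, route 101 maps to 2), while `values` holds 101, 5, 1, and 99.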
Now look at it by District. The counties in each district are listed at http://en.wikipedia.org/wiki/California_Department_of_Transportation#Districts
| District | Counties | Headquarters |
|---|---|---|
| 1 | Del Norte, Humboldt, Lake, Mendocino | Eureka |
| 2 | Lassen, Modoc, Plumas, Shasta, Siskiyou, Tehama, Trinity; portions of Butte and Sierra | Redding |
| 3 | Butte, Colusa, El Dorado, Glenn, Nevada, Placer, Sacramento, Sierra, Sutter, Yolo, Yuba | Marysville |
| 4 | Alameda, Contra Costa, Marin, Napa, San Francisco, San Mateo, Santa Clara, Solano, Sonoma | Oakland |
| 5 | Monterey, San Benito, San Luis Obispo, Santa Barbara, Santa Cruz | San Luis Obispo |
| 6 | Madera, Fresno, Tulare, Kings, Kern | Fresno |
| 7 | Los Angeles, Ventura | Los Angeles |
| 8 | Riverside, San Bernardino | San Bernardino |
| 9 | Inyo, Mono | Bishop |
| 10 | Alpine, Amador, Calaveras, Mariposa, Merced, San Joaquin, Stanislaus, Tuolumne | Stockton |
| 11 | Imperial, San Diego | San Diego |
| 12 | Orange | Irvine |
p1 <- ggplot(data = subset(dfp3, route_suffix != ""), aes(x = route_suffix)) +
geom_bar(aes(fill = county_name), colour = "black")
p2 <- ggplot(data = subset(dfp3, alignment != ""), aes(x = alignment)) +
geom_bar(aes(fill = county_name), colour = "black")
p3 <- ggplot(data = subset(dfp3, postmile_prefix != ""),
aes(x = postmile_prefix)) +
geom_bar(aes(fill = county_name), colour = "black") +
theme(legend.text = element_text(size = 8),
legend.key.size = unit(.3, "cm")) +
guides(colour=guide_legend(ncol=3), fill = guide_legend(ncol = 2))
p1
p2
p3
ggplot(data = dfp3, aes(x = dist)) +
geom_bar(colour="#FF9999") +
theme(axis.text.x = element_text(angle=45)) +
ggtitle(label = "Histogram of California Traffic Counts by District")
You can look at the California highways at http://en.wikipedia.org/wiki/List_of_state_highways_in_California.
The route with the largest number of traffic counts is US Highway 101, which runs the length of California from south to north. There are 507 count locations on Highway 101. It might be an interesting roadway to investigate.
The second most counted roadway is I-5. It is an interstate, not a state highway, and it also runs the length of the state from south to north.
The third most counted highway is Highway 1, which hugs the coastline.
The fourth is Highway 99. Those four have many more counts than the other highways.
ggplot(data = dfp3[dfp3$route %nin% c(1,5,99,101),], aes(x = route)) +
geom_bar() +
theme(axis.text.x = element_text(angle=45)) + ggtitle(label =
"Number of California Traffic Counts by Route as Factor\nMinus top Four")
## Warning: position_stack requires constant width: output may be incorrect
## Let’s look at traffic counts along Highway 101
Unfortunately, postmiles restart at each county line, so you cannot follow traffic along a route for its whole length without accounting for county boundaries.
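One way to work around the restart is to add a per-county offset to each postmile. The sketch below is an approximation under a loud assumption: it treats the largest observed postmile in each county as that county's route length, which the source data does not guarantee. The data frame `toy`, the `county_order` vector, and the `route_mile` column are all hypothetical.

```r
# Hypothetical count locations on one route; postmiles restart in each county.
toy <- data.frame(
  county   = c("LA", "LA", "VEN", "VEN", "SB"),
  postmile = c(0.5, 38.0, 1.2, 43.6, 4.0),
  stringsAsFactors = FALSE
)

# Counties in the order the route passes through them (south to north).
county_order <- c("LA", "VEN", "SB")

# ASSUMPTION: the largest observed postmile approximates the county's route length.
county_len <- tapply(toy$postmile, factor(toy$county, levels = county_order), max)

# Offset of each county = total length of all counties before it.
offset <- c(0, head(cumsum(as.numeric(county_len)), -1))
names(offset) <- county_order

# Continuous distance along the route.
toy$route_mile <- toy$postmile + unname(offset[as.character(toy$county)])
```

Realigned segments (the postmile prefixes such as R or M) would still need separate handling, so this is a rough ordering aid rather than a true distance.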
ggplot(data = dfp3[dfp3$route == 101,], aes(x = postmile,
y = back_peak_hour,
colour=county)) +
geom_point() +
ggtitle("Highway 101 Back Peak Hour Traffic by Postmile/County")
## Warning: Removed 25 rows containing missing values (geom_point).
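# Trying something weird: add one line layer per county
p <- ggplot(data = dfp3)
for (i in unique(dfp3$county)) {
# Filter with ==; subset() silently ignores `county = i` and keeps every row
p <- p + geom_line(data = subset(dfp3, county == i),
aes(x = postmile, y = back_peak_hour, color = county))
}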
Highway 101 from South to North goes through these counties in order (from http://en.wikipedia.org/wiki/U.S._Route_101_in_California):
| Abbreviation | County |
|---|---|
| LA | Los Angeles |
| VEN | Ventura |
| SB | Santa Barbara |
| SLO | San Luis Obispo |
| MON | Monterey |
| SBT | San Benito |
| SCL | Santa Clara |
| SM | San Mateo |
| SF | San Francisco |
| MRN | Marin |
| SON | Sonoma |
| MEN | Mendocino |
| HUM | Humboldt |
| DN | Del Norte |
Look at the traffic on Highway 101 throughout its length
hw101 <- subset(dfp3, route == "101")
# Look up full county names for a vector of abbreviations.
# match() handles a whole vector; comparing with == recycles and warns.
cn <- function(x) {
as.character(county_table_CA$COUNTY[match(x, county_table_CA$ABBREV.)])
}
hw101$county <- ordered(hw101$county,
levels = c("LA", "VEN","SB", "SLO",
"MON", "SBT", "SCL", "SM",
"SF", "MRN", "SON",
"MEN", "HUM", "DN"))
countyMap101 <- match(as.character(hw101$county),
as.character(county_table_CA$ABBREV.))
hw101$county_name <- ordered(hw101$county_name,
levels = cn(c("LA", "VEN","SB", "SLO",
"MON", "SBT", "SCL", "SM",
"SF", "MRN", "SON",
"MEN", "HUM", "DN")))
hw101$county_postmile <- paste(as.character(hw101$county),
as.character(hw101$postmile_prefix),
as.character(hw101$postmile),
sep = '_')
ggplot(aes(x = county_postmile, y = ahead_aadt),
data = hw101) +
geom_point(aes(color = county)) + geom_line() +
ggtitle("Ahead AADT Traffic Counts On US 101") +
theme(axis.text.x = element_text(angle=45, size = 8))
## Warning: Removed 16 rows containing missing values (geom_point).
ggplot(aes(x = postmile, y = ahead_aadt),
data = hw101) +
geom_point(aes(color = county)) + geom_line() +
ggtitle("Ahead AADT Traffic Counts On US 101") +
theme(axis.text.x = element_text(angle=45, size = 8)) +
facet_wrap(~county_name)
## Warning: Removed 16 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).
#Let's look at US 101 in LA County and Humboldt County
ggplot(aes(x = postmile, y = ahead_aadt),
data = subset(hw101, hw101$county == "LA")) +
geom_point(aes(color = county)) + geom_line() +
geom_line(aes(y = ahead_peak_aadt), colour = "green") +
geom_point(aes(y = ahead_peak_aadt), colour = "green") +
ggtitle("Ahead AADT Traffic Counts On US 101 in LA County") +
theme(axis.text.x = element_text(angle=45, size = 8))
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
#scale_x_discrete(label = function(x){
# return(county_table_CA$COUNTY[county_table_CA$ABBREV. == x])}) +
Maybe use cut to group traffic
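A minimal sketch of quartile binning with `cut()`, using `quantile()` to compute the break points instead of typing them in by hand. The `counts` vector is made up; `include.lowest = TRUE` is needed here because the smallest value is itself the lowest break.

```r
# Hypothetical traffic counts, including one missing value.
counts <- c(100, 9400, 29000, 109000, 406000, NA, 5000, 60000)

# Quartile break points computed from the data (minimum, Q1, median, Q3, maximum).
breaks <- quantile(counts, probs = seq(0, 1, 0.25), na.rm = TRUE)

# Assign each count to an ordered quartile bin; NA stays NA.
bins <- cut(counts, breaks = breaks,
            labels = c("Q1", "Q2", "Q3", "Q4"),
            include.lowest = TRUE, ordered_result = TRUE)
```

If the data contain many tied values, two quantiles can coincide and `cut()` will reject the duplicated breaks, so this approach suits reasonably spread-out counts.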
dfp3$traffic_back_peak_month <-
ordered(cut(dfp3$back_peak_month,c(0,9400,29000,109000, 406000),
labels=c("Q1", "Q2", "Q3", "Q4")))
ggpairs(data = dfp3, columns = 9:14) +
theme(axis.text.x = element_text(angle=45, size = 8),
axis.text.y = element_text(angle=45, size = 8))
It looks like all of the traffic counts are correlated with each other, which makes sense. The traffic Ahead of an intersection minus the traffic Back of it is the net of the cars that got on and the cars that got off there. If there is a big difference, a lot of cars either get on or get off the highway at that intersection.
The ones that look the most correlated are:
- Ahead Peak (Monthly) AADT and Ahead AADT, with a correlation of .999
- Back Peak (Monthly) AADT and Back AADT, also with a correlation of .999
The ones that are the least correlated are:
- Ahead Peak Hourly and Back AADT, which have a correlation of .967
But the other crosses (Hourly vs. AADT, Back vs. Ahead) are also below .97
We can’t really compare intersections, but for a given intersection we can compare the different relationships.
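The correlations quoted above were read off the ggpairs panel. They could also be computed directly with `cor()`; the sketch below uses a made-up five-row frame (`toy`) rather than the real data, so its numbers are not the ones reported here. `use = "pairwise.complete.obs"` is one reasonable way to cope with the missing Back values at route starts.

```r
# Hypothetical counts; the NA mimics a location with no Back leg.
toy <- data.frame(
  back_aadt       = c(37000, 36000, NA, 45000, 48000),
  ahead_aadt      = c(38500, 36000, 35000, 54000, 48000),
  ahead_peak_hour = c(3900, 2850, 3150, 4800, 4250)
)

# Pairwise Pearson correlations, skipping NA pairs instead of failing.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Net change across each location: Back minus Ahead.
net_change <- toy$back_aadt - toy$ahead_aadt
```

On the real data the same call would be `cor(dfp3[, 9:14], use = "pairwise.complete.obs")`, assuming the six count columns are still in positions 9 to 14.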
Let’s look at the Back AADT and the Ahead AADT
p1 <- ggplot(data = dfp3, aes(x = county, y = back_aadt)) +
geom_point(alpha=.5, colour = "orange") +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
vjust=.5,face="plain")) +
ggtitle(label = "CA Back AADT by County")
p2 <- ggplot(data = dfp3, aes(x = county, y = ahead_aadt)) +
geom_point(alpha=.5, colour = "green") +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
vjust=.5,face="plain")) +
ggtitle(label = "CA Ahead AADT by County")
grid.arrange(p1,p2, ncol = 1)
## Warning: Removed 546 rows containing missing values (geom_point).
## Warning: Removed 546 rows containing missing values (geom_point).
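The two plots above repeat the same long `theme()` call. One option is to build the theme once and add it to each plot. This is only a sketch: `county_axis_theme` and `toy` are hypothetical names, and the `face = "plain"` and `angle = 0` arguments are left out on the assumption that they match the defaults.

```r
library(ggplot2)

# Shared axis styling, defined once (values copied from the plots above).
county_axis_theme <- theme(
  axis.text.x  = element_text(colour = "grey20", size = 8, angle = 45,
                              hjust = .5, vjust = .5),
  axis.text.y  = element_text(colour = "grey20", size = 12, hjust = 1, vjust = 0),
  axis.title.x = element_text(colour = "grey20", size = 12, hjust = .5, vjust = 0),
  axis.title.y = element_text(colour = "grey20", size = 12, angle = 90,
                              hjust = .5, vjust = .5)
)

# Hypothetical stand-in for dfp3 so the sketch runs on its own.
toy <- data.frame(county    = c("LA", "LA", "ORA", "SD"),
                  back_aadt = c(45000, 48000, 37000, 52000))

# The stored theme is added like any other plot component.
p <- ggplot(toy, aes(x = county, y = back_aadt)) +
  geom_point(alpha = .5, colour = "orange") +
  county_axis_theme +
  ggtitle("CA Back AADT by County")
```

A change to the axis styling then only has to be made in one place.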
Let’s look at the Back Peak Month and the Ahead Peak Month
p1 <- ggplot(data = dfp3, aes(x = county, y = back_peak_month)) +
geom_jitter(alpha=.5, aes(colour = county)) +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
vjust=.5,face="plain"),
legend.position = 'none') +
ggtitle(label = "CA Back Peak Month AADT by County")
p2 <- ggplot(data = dfp3, aes(x = county, y = ahead_peak_aadt)) +
geom_jitter(alpha=.5, aes(colour = county)) +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
vjust=.5,face="plain"),
legend.text = element_text(size = 8),
legend.key.size = unit(.3, "cm")) +
guides(colour=guide_legend(ncol=2)) +
ggtitle(label = "CA Ahead Peak Month AADT by County")
grid.arrange(p1,p2, ncol = 1)
## Warning: Removed 546 rows containing missing values (geom_point).
## Warning: Removed 546 rows containing missing values (geom_point).
What about the difference between them?
#colors=rainbow( ncol(frame) ,s = 0.5, v = 1 )
ggplot(data = dfp3, aes(x = county,
y = back_peak_month - ahead_peak_aadt)) +
geom_jitter(alpha=.5, aes(colour = county)) +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=10,angle=0,hjust=1,
vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
vjust=.5,face="plain"),
legend.text = element_text(size = 8),
legend.key.size = unit(.3, "cm")) +
guides(colour=guide_legend(ncol=2)) +
#scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
ggtitle(label = "CA Difference in Back and Ahead Peak Month\nAADT by County")
## Warning: Removed 1076 rows containing missing values (geom_point).
# Peak Hour (Back - Forward)
#colors=rainbow( ncol(frame) ,s = 0.5, v = 1 )
ggplot(data = dfp3, aes(x = county, y = back_peak_hour - ahead_peak_hour)) +
geom_jitter(alpha=.5, aes(colour = county)) +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=10,angle=0,hjust=1,
vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
vjust=.5,face="plain"),
legend.text = element_text(size = 8),
legend.key.size = unit(.3, "cm")) +
guides(colour=guide_legend(ncol=2)) +
#scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
ggtitle(label = "CA Difference in Back and Ahead Peak Hour\nby County")
## Warning: Removed 1076 rows containing missing values (geom_point).
# AADT (Back - Forward)
#colors=rainbow( ncol(frame) ,s = 0.5, v = 1 )
ggplot(data = dfp3, aes(x = county, y = back_aadt - ahead_aadt)) +
geom_jitter(alpha=.5, aes(colour = county)) +
theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,
hjust=.5,vjust=.5,face="plain"),
axis.text.y = element_text(colour="grey20",size=10,angle=0,
hjust=1,vjust=0,face="plain"),
axis.title.x = element_text(colour="grey20",size=12,angle=0,
hjust=.5,vjust=0,face="plain"),
axis.title.y = element_text(colour="grey20",size=12,angle=90,
hjust=.5,vjust=.5,face="plain"),
legend.text = element_text(size = 8),
legend.key.size = unit(.3, "cm")) +
guides(colour=guide_legend(ncol=2)) +
#scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
ggtitle(label = "CA Difference in Back and Ahead AADT\nBy County")
## Warning: Removed 1076 rows containing missing values (geom_point).
Let’s look at the Back Peak Hour first
#ggplot(data = dfp3, aes(x = ))
Let’s look at the different variables (columns) in this dataset:
- Dist (District) - this is a categorical variable
str(dfp3$dist)
## Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
ggplot(data = dfp3, aes(x = dist)) + geom_bar() +
theme(axis.text.x = element_text(angle=45)) +
ggtitle(label = "Number of California Traffic Counts by District")
str(dfp3$county_name)
## Factor w/ 58 levels "Alameda","Alpine",..: 30 30 30 30 30 30 30 30 30 30 ...
ggplot(data = dfp3, aes(x = county_name)) + geom_bar() +
theme(axis.text.x = element_text(angle=45)) +
ggtitle(label = "Number of California Traffic Counts by County")
There are some counts that don’t have districts (dist is NA in 9 rows). Examining them with `dfp3[is.na(dfp3$dist),]` shows that they are all breaks in the route.
A description is at http://traffic-counts.dot.ca.gov/docs/traffic-data-faq.pdf
Legs: Counts are taken ahead of or in back of a given location on the state highway system. These are called Legs.
Postmile Prefix Codes: Assigned when a length of highway is changed due to construction or realignment. The alpha code is used to differentiate the different values.
Back annual average daily traffic (AADT) usually represents traffic South or West of the count location and is the total volume for the year divided by 365 days.
Ahead annual average daily traffic (AADT) usually represents traffic North or East of the count location and is the total volume for the year divided by 365 days.
AADT’s represent both directions of travel, and summing them together will result in erroneous data.
Peak Hour usually represents an estimate of the heaviest traffic flow which usually occurs between 7 to 9 AM and 5 to 7 PM. Peak Hour values indicate the volume in both directions. In urban and suburban areas, the peak hour normally occurs every weekday. On roads with large seasonal fluctuations in traffic, the peak hour is the hour near the maximum for the year but excluding a few (30 to 50 hours) that are exceedingly high and are not typical of the frequency of the high hours occurring during the season.
Peak Month ADT is the average daily traffic for the month of heaviest traffic flow, usually July or August. This data is obtained because on many routes, high traffic volumes which occur during a certain season of the year are more representative of traffic conditions than the annual ADT.
Annual average daily traffic is the total traffic volume for the year divided by 365 days. The traffic count year is from October 1st through September 30th. Very few locations in California are actually counted continuously. Traffic counting is generally performed by electronic counting instruments moved from location to location throughout the State in a program of continuous traffic count sampling. The resulting counts are adjusted to an estimate of annual average daily traffic by compensating for seasonal influence, weekly variation and other variables which may be present. Annual ADT is necessary for presenting a statewide picture of traffic flow, evaluating traffic trends, computing accident rates, planning and designing highways and other purposes.
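As a quick arithmetic illustration of the definition: the AADT and peak-hour figures below are taken from the second row of the data shown earlier (Doheny Park Road, Back leg), while `annual_volume` is a hypothetical total back-calculated from that AADT, not a published number.

```r
# Hypothetical annual total (both directions) for one count location.
annual_volume <- 13505000

# AADT = total volume for the year / 365 days.
aadt <- annual_volume / 365

# Share of the average day's traffic that falls in the peak hour.
peak_hour       <- 3750
peak_hour_share <- peak_hour / aadt
```

With these inputs the AADT works out to 37,000 vehicles per day and the peak hour carries roughly a tenth of that.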
Peak Month ADT: The peak month ADT is the average daily traffic for the month of heaviest traffic flow. This data is obtained because on many routes, high traffic volumes which occur during a certain season of the year are more representative of traffic conditions than the annual ADT.
Peak Hour: This publication includes an estimate of the “peak hour” traffic at all points on the state highway system. This value is useful to traffic engineers in estimating the amount of congestion experienced, and shows how near to capacity the highway is operating. Unless otherwise indicated, peak hour values indicate the volume in both directions. A few hours each year are higher than the “peak hour,” but not many. In urban and suburban areas, the peak hour normally occurs every weekday, during what is considered “rush hour” traffic. On roads with large seasonal fluctuations in traffic, the peak hour is the hour near the maximum for the year but excluding a few (30 to 50 hours) that are exceedingly high and are not typical of the frequency of the high hours occurring during the season.
Counties in California: http://en.wikipedia.org/wiki/List_of_counties_in_California and http://www.dot.ca.gov/hq/tsip/hseb/products/county_name.pdf