1 Introduction:

I started my career in Intelligent Transportation Systems and worked for Caltrans (the California Department of Transportation) in graduate school, so I thought I might like to look at traffic counts. My home state has 48-hour traffic counts available, but they have to be downloaded, cleaned, and loaded individually for each site, which seemed like a lot of work.

I then went to the Federal DOT, but they only had state summaries that did not have enough data, so I looked at the state of California’s website. I hope the state-wide average annual traffic counts will provide a data set that is good enough for this investigation.

1.1 Obtaining the Data

California Average Annual Daily Traffic Counts for 2012
The data was obtained at http://traffic-counts.dot.ca.gov/docs/2012AADT.xlsx

A description of the data is at http://traffic-counts.dot.ca.gov/docs/traffic-counts-diagram.pdf and http://traffic-counts.dot.ca.gov/2013all/

The data came in Excel format, so I opened it in Excel. The header row was repeated at regular intervals, so I used Excel to insert a ‘#’ sign in front of every header except the first and saved the file as csv. Then I read the data into R, ignoring lines starting with ‘#’, to eliminate the extra headers.
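As an alternative to marking the repeated headers by hand, the same cleanup could probably be done in R. The sketch below uses a small made-up set of lines (not the real file) and assumes that every repeated header is an exact copy of the first line:

```r
# Toy stand-in for the exported CSV: the header row repeats part-way through.
# These values are invented for illustration.
raw_lines <- c(
  "Dist,Route,County,Postmile",
  "12,1,ORA,0.129",
  "12,1,ORA,0.780",
  "Dist,Route,County,Postmile",
  "7,101,LA,1.500"
)

header <- raw_lines[1]
# Keep the first header, drop every later line that repeats it exactly
keep <- c(TRUE, raw_lines[-1] != header)
clean <- read.csv(text = paste(raw_lines[keep], collapse = "\n"))
```

With a real file, raw_lines would come from readLines() on the csv path instead of being typed in.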

Looking at the data, it had 3 unnamed columns (X, X.1, and X.2). I did some investigation to find out what the column names were and what the abbreviations in the fields meant.
I obtained the county name/abbreviation by copying data from http://sv08data.dot.ca.gov/contractcost/map.html and pasting into Excel and saving as csv.

I also obtained the definitions for the letters in the Postmile Prefix column (X.1) from http://traffic-counts.dot.ca.gov/docs/traffic-data-faq.pdf and put them in a separate table. The definitions for the Route Suffix (X) and Alignment (X.2) columns came from http://www.dot.ca.gov/cwwp2/documentation/prefix-suffix-alignment-charts.htm

1.1.1 Route Suffix Field Chart (column 3)

Route Suffix Description
S Supplemental Route
U Unrelinquished Route

1.1.2 Post Mile Prefix Field Chart (column 5)

Prefix Meaning
C Commercial lane
D Duplicate post mile at meandering county line
G Reposting of duplicate post mile at end of route
H Realignment of D mileage
L Overlap post mile
M Realignment of R mileage
N Realignment of M mileage
R First realignment
S Spur
T Temporary connection
U Unrelinquished

1.1.3 Alignment Field Chart (column 7)

Alignment Description
L Left independent alignment
R Right independent alignment

1.2 Load the Libraries

First load the libraries I know I will need later…

library(ggplot2)
library(tidyr)
library(reshape2)
library(GGally)
library(gridExtra)
library(Hmisc)
library(dplyr)
# And load the project directory
pdir <- "~/DataAnalystNanoDegree/DataAnalysisWithR/Project4"

1.3 Load the Data

# read in the dataset, ignoring comments
dfp3 <- read.csv(paste0(pdir, "/2012AADT.csv"), comment.char = "#")
# Get county names
county_table_CA <- read.csv(paste0(pdir, "/CA_Counties.csv"))
# Get postmile prefix abbreviations
postmile_prefix_table <- read.csv(paste0(pdir, "/PostmilePrefix.csv"))
# Get route suffix abbreviations
route_suffix_table <- read.csv(paste0(pdir, "/RouteSuffix.csv"))
# Get alignment abbreviations
alignment <- read.csv(paste0(pdir, "/Alignment.csv"))
# Add names for unnamed columns
colnames(dfp3)[3]<- "route_suffix"
colnames(dfp3)[5]<- "postmile_prefix"
colnames(dfp3)[7]<- "alignment"
# Make all names lowercase
colnames(dfp3)<- tolower(colnames(dfp3))
# Change .s in names to _
colnames(dfp3) <- gsub("\\.+", "_", colnames(dfp3), perl = TRUE)
# Change route numbers and district numbers to factors
dfp3$route <-as.factor(dfp3$route)
dfp3$dist <- as.factor(dfp3$dist)
#Add a column containing actual county names
countyMap <- match(as.character(dfp3$county),
                   as.character(county_table_CA$ABBREV.))
dfp3$county_name <-county_table_CA$COUNTY[countyMap]
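The county-name lookup above relies on match(). As a minimal sketch of how that works, and of how unmatched abbreviations could be spotted, here is a toy lookup table (the table and the codes vector are made up for illustration):

```r
# Hypothetical lookup table and codes, not the real CA_Counties.csv
lookup <- data.frame(
  ABBREV = c("LA", "ORA", "SD"),
  COUNTY = c("Los Angeles", "Orange", "San Diego"),
  stringsAsFactors = FALSE
)
codes <- c("ORA", "LA", "", "SD", "ORA")

# match() returns, for each code, the row of the lookup table it corresponds to
idx <- match(codes, lookup$ABBREV)
county_name <- lookup$COUNTY[idx]

# Codes with no entry in the lookup table come back as NA
unmatched <- unique(codes[is.na(idx)])
```

Checking unmatched after the lookup is a quick way to see which rows will end up with a missing county name.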

1.4 Look at the data

head(dfp3)
##   dist route route_suffix county postmile_prefix postmile alignment
## 1   12     1                 ORA               R    0.129          
## 2   12     1                 ORA               R    0.780          
## 3   12     1                 ORA                    8.430          
## 4   12     1                 ORA                    9.418          
## 5   12     1                 ORA                    9.600          
## 6   12     1                 ORA                   11.500          
##                         description back_peak_hour back_peak_month
## 1           DANA POINT, JCT. RTE. 5             NA              NA
## 2      DANA POINT, DOHENY PARK ROAD           3750           40000
## 3       LAGUNA BEACH, MOUNTAIN ROAD           2850           38500
## 4 LAGUNA BEACH, JCT. RTE. 133 NORTH           3000           40500
## 5   LAGUNA BEACH, CLIFF DR/ASTER ST           3350           39500
## 6   LAGUNA BEACH, NORTH CITY LIMITS           3150           37500
##   back_aadt ahead_peak_hour ahead_peak_aadt ahead_aadt county_name
## 1        NA            3750           40000      37000      Orange
## 2     37000            3900           42000      38500      Orange
## 3     36000            2850           38500      36000      Orange
## 4     38000            3400           40500      38000      Orange
## 5     37000            3350           39500      37000      Orange
## 6     35000            3150           38500      35000      Orange
str(dfp3)
## 'data.frame':    6777 obs. of  15 variables:
##  $ dist           : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ route          : Factor w/ 243 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ route_suffix   : Factor w/ 3 levels "","S","U": 1 1 1 1 1 1 1 1 1 1 ...
##  $ county         : Factor w/ 59 levels "","ALA","ALP",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ postmile_prefix: Factor w/ 8 levels "","C","D","L",..: 6 6 1 1 1 1 6 6 1 1 ...
##  $ postmile       : num  0.129 0.78 8.43 9.418 9.6 ...
##  $ alignment      : Factor w/ 3 levels "","L","R": 1 1 1 1 1 1 1 1 1 1 ...
##  $ description    : Factor w/ 5762 levels ""," JCT. RTE. 101",..: 1129 1128 2623 2622 2619 2625 3508 3509 3510 3507 ...
##  $ back_peak_hour : int  NA 3750 2850 3000 3350 3150 4000 4250 4350 5400 ...
##  $ back_peak_month: int  NA 40000 38500 40500 39500 37500 49500 53000 53000 52000 ...
##  $ back_aadt      : int  NA 37000 36000 38000 37000 35000 45000 48000 48700 48500 ...
##  $ ahead_peak_hour: int  3750 3900 2850 3400 3350 3150 4800 4250 5300 5400 ...
##  $ ahead_peak_aadt: int  40000 42000 38500 40500 39500 38500 59000 53000 52000 52000 ...
##  $ ahead_aadt     : int  37000 38500 36000 38000 37000 35000 54000 48000 48700 48500 ...
##  $ county_name    : Factor w/ 58 levels "Alameda","Alpine",..: 30 30 30 30 30 30 30 30 30 30 ...
summary(dfp3)
##       dist          route      route_suffix     county     postmile_prefix
##  4      :1084   101    : 507    :6747       LA     : 762          :4704   
##  7      : 913   5      : 420   S:  20       SD     : 453   R      :1844   
##  3      : 771   1      : 278   U:  10       SBD    : 339   M      :  71   
##  6      : 721   99     : 254                KER    : 267   T      :  70   
##  11     : 609   80     : 157                ORA    : 261   L      :  63   
##  (Other):2670   (Other):5152                RIV    : 242   S      :  16   
##  NA's   :   9   NA's   :   9                (Other):4453   (Other):   9   
##     postmile       alignment                                 description  
##  Min.   :  0.000    :6658    JCT. RTE. 5                           :  33  
##  1st Qu.:  5.752   L:  54    JCT. RTE. 101                         :  16  
##  Median : 15.367   R:  65    JCT. RTE. 99                          :  14  
##  Mean   : 22.308             NEVADA STATE LINE                     :  13  
##  3rd Qu.: 30.408             JCT. RTE. 15                          :  12  
##  Max.   :186.238             LOS ANGELES/SAN BERNARDINO COUNTY LINE:  12  
##  NA's   :15                  (Other)                               :6677  
##  back_peak_hour  back_peak_month    back_aadt      ahead_peak_hour
##  Min.   :   10   Min.   :   100   Min.   :    80   Min.   :   10  
##  1st Qu.:  940   1st Qu.:  9400   1st Qu.:  8200   1st Qu.:  940  
##  Median : 2600   Median : 29000   Median : 25600   Median : 2600  
##  Mean   : 5326   Mean   : 68262   Mean   : 64726   Mean   : 5333  
##  3rd Qu.: 8800   3rd Qu.:109000   3rd Qu.:104000   3rd Qu.: 8800  
##  Max.   :31000   Max.   :406000   Max.   :377500   Max.   :31000  
##  NA's   :546     NA's   :546      NA's   :546      NA's   :546    
##  ahead_peak_aadt    ahead_aadt             county_name  
##  Min.   :   100   Min.   :    80   Los Angeles   : 762  
##  1st Qu.:  9500   1st Qu.:  8400   San Diego     : 453  
##  Median : 29000   Median : 26000   San Bernardino: 339  
##  Mean   : 68318   Mean   : 64773   Kern          : 267  
##  3rd Qu.:109000   3rd Qu.:104000   Orange        : 261  
##  Max.   :406000   Max.   :377500   (Other)       :4686  
##  NA's   :546      NA's   :546      NA's          :   9

This dataset has district (dist) and route, which are numeric factors; county, which is a character factor; and a numeric postmile field (mile post). There are three columns containing alphabetic keys and a character description (all factors). Then there are 6 integer traffic count fields: three are Back of the location (South or West), and three are Ahead of the location (North or East). There are Back and Ahead peak hour counts, Back and Ahead peak month counts (named back_peak_month and ahead_peak_aadt in the data), and Back and Ahead AADT (Annual Average Daily Traffic), which is the total volume for the year divided by 365 days.

Create a Tidy Dataset

In the melted (long) form created here, each row holds a single traffic-count value.

tidyData <- melt(dfp3, id.vars = 1:8, measure.vars = 9:14, 
                 variable.name = "variable", value.name = "value")
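To make the reshaping concrete, here is a rough base-R sketch of what the wide-to-long step does, using a toy two-site frame (the site names and numbers are invented; the real code above uses melt from reshape2):

```r
# Toy wide data: one row per site, one column per count type
wide <- data.frame(
  site       = c("A", "B"),
  back_aadt  = c(100, 200),
  ahead_aadt = c(110, 190),
  stringsAsFactors = FALSE
)

# Long form: one row per (site, count type) pair, with a single value column
long <- data.frame(
  site     = rep(wide$site, times = 2),
  variable = rep(c("back_aadt", "ahead_aadt"), each = nrow(wide)),
  value    = c(wide$back_aadt, wide$ahead_aadt),
  stringsAsFactors = FALSE
)
```

Two sites with two count columns become four rows, each carrying one value.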

1.4.1 Let’s look at some introductory statistics and plots

table(dfp3$dist)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  325  411  771 1084  460  721  913  581   98  533  609  262
table(dfp3$route)
## 
##   1   2   3   4   5   6   7   8   9  10  12  13  14  15  16  17  18  19 
## 278  34  27  65 420   6  10  70  21 152  73  14  39 130  37  17  38   6 
##  20  22  23  24  25  26  27  28  29  32  33  34  35  36  37  38  39  40 
##  77  19  22  15  16  26  10  12  56  42  92  13  30  60  13  15  34  23 
##  41  43  44  45  46  47  49  50  51  52  53  54  55  56  57  58  59  60 
##  56  32  24  21  25   6 122  66  16  11   4  11  20  10  28  59  16  64 
##  61  62  63  65  66  67  68  70  71  72  73  74  75  76  77  78  79  80 
##   8  25  32  37  12  20  16  56  18  11  12  29  24  24   3  69  21 157 
##  82  83  84  85  86  87  88  89  90  91  92  94  95  96  97  98  99 101 
##  48  11  47  24  43  10  37  58  19  68  25  37  11  24  11  26 254 507 
## 103 104 105 107 108 109 110 111 112 113 114 115 116 118 119 120 121 123 
##   4  16  17   9  35   2  44  42   4  30   2  15  27  35  11  42  23  14 
## 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 142 
##   4  12  20  11  38   9   6   5  36   9  18  16   2  22  32  14  32  11 
## 144 145 146 147 150 151 152 153 154 155 156 158 160 161 162 163 164 165 
##   2  30   4   6  17  12  38   3   7  16  11   5  19   3  45  21   8  19 
## 166 167 168 169 170 172 173 174 175 177 178 180 182 183 184 185 186 187 
##  17   2  29   7  10   3   8  14  10   5  38  34   3   5  12  14   3  10 
## 188 189 190 191 192 193 195 197 198 199 200 201 202 203 204 205 207 210 
##   3   6  23   7  11  15   2   2  50   4   3  15   6   7  10   7   2  72 
## 211 213 215 216 217 218 219 220 221 222 223 225 227 229 232 233 236 237 
##   5   7  47  13   4   5   2   6   4   3  12   9   9   2   4   7   5  16 
## 238 241 242 243 244 245 246 247 253 254 255 259 260 261 262 263 265 266 
##  11  11   6   8   4   9   7   8   2   7   9   5   3   6   3   3   2   3 
## 267 269 270 271 273 275 280 281 282 283 284 299 330 371 380 395 405 505 
##   9   7   2   7  34   4  58   3  11   2   2  64   2   4   4  67  74  14 
## 580 605 680 710 780 805 880 905 980 
##  70  27  64  28   9  32  49   9   4
table(dfp3$route_suffix)
## 
##         S    U 
## 6747   20   10
table(dfp3$county_name)
## 
##         Alameda          Alpine          Amador           Butte 
##             220              19              47              96 
##       Calaveras          Colusa    Contra Costa       Del Norte 
##              50              36             109              36 
##       El Dorado          Fresno           Glenn        Humboldt 
##              80             183              42             146 
##        Imperial            Inyo            Kern           Kings 
##             156              45             267              49 
##            Lake          Lassen     Los Angeles          Madera 
##              45              41             762              56 
##           Marin        Mariposa       Mendocino          Merced 
##              60              31              98             103 
##           Modoc            Mono        Monterey            Napa 
##              22              53             111              59 
##          Nevada          Orange          Placer          Plumas 
##              68             261             103              44 
##       Riverside      Sacramento      San Benito  San Bernardino 
##             242             151              27             339 
##       San Diego   San Francisco     San Joaquin San Luis Obispo 
##             453              52             139             126 
##       San Mateo   Santa Barbara     Santa Clara      Santa Cruz 
##             157             128             202              70 
##          Shasta          Sierra        Siskiyou          Solano 
##             138              22              84              93 
##          Sonoma      Stanislaus          Sutter          Tehama 
##             132              91              40              54 
##         Trinity          Tulare        Tuolumne         Ventura 
##              28             163              54             152 
##            Yolo            Yuba 
##              89              44
# Look at counts by county
ggplot(data = dfp3, aes(x = county)) + geom_bar() +
  theme(axis.text.x = element_text(angle=45)) +
  ggtitle(label = "Number of California Traffic Counts by County")

# ggplot(data = dfp3, aes(x = county, y = ..count..)) + geom_bar() +
#   theme(axis.text.x = element_text(angle=45)) +
#   ggtitle(label = "Number of California Traffic Counts by County")

From the first plot, it is clear that LA County has the most locations at which Caltrans does traffic counts. To see the top 10 counties by number of traffic counts:

sort(table(dfp3$county_name), decreasing = TRUE)[1:10]
## 
##    Los Angeles      San Diego San Bernardino           Kern         Orange 
##            762            453            339            267            261 
##      Riverside        Alameda    Santa Clara         Fresno         Tulare 
##            242            220            202            183            163

Now look at it by Route Number

# a function to convert a numeric factor into a number
nbr <- function(x) {
  as.numeric(as.character(x))
}
#table(nbr(dfp3$route))
p1<-ggplot(data = dfp3, aes(x = nbr(route))) + 
  geom_histogram(binwidth = 10, colour="#FF9999") +
  theme(axis.text.x = element_text(angle=45)) +
  ggtitle(label = "Histogram of California Traffic Counts by Route 10 Bin")
p2<-ggplot(data = dfp3, aes(x = route)) + geom_bar() +
  theme(axis.text.x = element_text(angle=45)) +
  ggtitle(label = "Number of California Traffic Counts by Route as Factor")
sort(table(dfp3$route), decreasing = TRUE)[1:10]
## 
## 101   5   1  99  80  10  15  49  33  20 
## 507 420 278 254 157 152 130 122  92  77
grid.arrange(p1,p2, ncol = 1)
## Warning: position_stack requires constant width: output may be incorrect
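The nbr() helper used above converts through as.character() first. A small sketch of why that matters, using a made-up factor of route numbers: calling as.numeric() directly on a factor returns the internal level codes, not the labels.

```r
# A factor whose labels look like numbers (toy values)
f <- factor(c("101", "5", "99"))

# Direct conversion gives the integer level codes
codes <- as.numeric(f)

# Converting the labels first gives the actual route numbers
values <- as.numeric(as.character(f))
```

Here values holds 101, 5 and 99, while codes holds small integers that only index the factor's levels.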

Now look at it by District http://en.wikipedia.org/wiki/California_Department_of_Transportation#Districts

District | Counties | Headquarters
1 | Del Norte, Humboldt, Lake, Mendocino | Eureka
2 | Lassen, Modoc, Plumas, Shasta, Siskiyou, Tehama, Trinity; portions of Butte and Sierra | Redding
3 | Butte, Colusa, El Dorado, Glenn, Nevada, Placer, Sacramento, Sierra, Sutter, Yolo, Yuba | Marysville
4 | Alameda, Contra Costa, Marin, Napa, San Francisco, San Mateo, Santa Clara, Solano, Sonoma | Oakland
5 | Monterey, San Benito, San Luis Obispo, Santa Barbara, Santa Cruz | San Luis Obispo
6 | Madera, Fresno, Tulare, Kings, Kern | Fresno
7 | Los Angeles, Ventura | Los Angeles
8 | Riverside, San Bernardino | San Bernardino
9 | Inyo, Mono | Bishop
10 | Alpine, Amador, Calaveras, Mariposa, Merced, San Joaquin, Stanislaus, Tuolumne | Stockton
11 | Imperial, San Diego | San Diego
12 | Orange | Irvine

p1 <- ggplot(data = subset(dfp3, route_suffix != ""), aes(x = route_suffix)) +
  geom_bar(aes(fill = county_name), colour = "black")
p2 <- ggplot(data = subset(dfp3, alignment != ""), aes(x = alignment)) +
geom_bar(aes(fill = county_name), colour = "black")
p3 <- ggplot(data = subset(dfp3, postmile_prefix != ""), 
             aes(x = postmile_prefix)) +
  geom_bar(aes(fill = county_name), colour = "black") +
  theme(legend.text = element_text(size = 8), 
        legend.key.size = unit(.3, "cm")) +
  guides(colour=guide_legend(ncol=3), fill = guide_legend(ncol = 2))
p1

p2

p3

ggplot(data = dfp3, aes(x = dist)) + 
  geom_bar(colour="#FF9999") +
  theme(axis.text.x = element_text(angle=45)) +
  ggtitle(label = "Histogram of California Traffic Counts by District")

You can look at the California highways at http://en.wikipedia.org/wiki/List_of_state_highways_in_California.

It seems that the route with the largest number of traffic counts is U.S. Route 101, which runs the length of California from south to north. There are 507 points on Highway 101. It might be an interesting roadway to investigate.

The second most counted roadway is I-5. It is an interstate, not a state highway and it also covers the whole state from South to North.

The third most counted highway is highway 1, which hugs the coastline.
The fourth is Highway 99. Those 4 have many more counts than the other highways.

ggplot(data = dfp3[dfp3$route %nin% c(1,5,99,101),], aes(x = route)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle=45)) + ggtitle(label = 
      "Number of California Traffic Counts by Route as Factor\nMinus top Four")
## Warning: position_stack requires constant width: output may be incorrect

Let’s look at traffic counts along Highway 101

Unfortunately, the mileposts restart at each county line, so you cannot follow traffic along a route for its whole distance without taking county boundaries into account.
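One possible way around the restarting postmiles is to add a per-county offset. The sketch below uses made-up numbers and rests on a loud assumption: that the largest postmile observed in a county approximates that county's length along the route. Realigned postmiles (prefixes such as R or M) would likely break this, so treat it as an illustration only.

```r
# Toy data: two counties in route order; postmiles restart at the county line
seg <- data.frame(
  county   = c("LA", "LA", "LA", "VEN", "VEN"),
  postmile = c(0.5, 10.0, 38.0, 2.0, 20.0),
  stringsAsFactors = FALSE
)
county_order <- c("LA", "VEN")  # south to north

# Assumption: max observed postmile stands in for the county's route length
county_len <- tapply(seg$postmile, seg$county, max)[county_order]

# Each county starts where the previous counties' lengths add up to
offset <- c(0, cumsum(county_len)[-length(county_len)])
names(offset) <- county_order

# Approximate distance from the start of the route
seg$route_mile <- unname(seg$postmile + offset[seg$county])
```

Plotting counts against route_mile rather than postmile would then give one continuous axis for the whole route.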

ggplot(data = dfp3[dfp3$route == 101,], aes(x = postmile, 
                                                y = back_peak_hour, 
                                                colour=county)) + 
  geom_point() +
  ggtitle("Highway 101 Back Peak Hour Traffic by Postmile/County")
## Warning: Removed 25 rows containing missing values (geom_point).

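Trying something weird

# Build up one line layer per county (p is assembled here but not printed)
p <- ggplot(data = dfp3)
for (i in unique(dfp3$county)) {
  # Use == so that subset() keeps only this county's rows
  p <- p + geom_line(data = subset(dfp3, county == i), 
                     aes(x = postmile, y = back_peak_hour, color = county))
}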

Highway 101 from South to North goes through these counties in order (from http://en.wikipedia.org/wiki/U.S._Route_101_in_California):

- LA - Los Angeles
- VEN - Ventura
- SB - Santa Barbara
- SLO - San Luis Obispo
- MON - Monterey
- SBT - San Benito
- SCL - Santa Clara
- SM - San Mateo
- SF - San Francisco
- MRN - Marin
- SON - Sonoma
- MEN - Mendocino
- HUM - Humboldt
- DN - Del Norte

Look at the traffic on Highway 101 throughout its length

hw101 <- subset(dfp3, route == "101")
# Look up full county names for a vector of county abbreviations.
# match() handles vector input; `==` would recycle x and warn.
cn <- function(x) {
  idx <- match(x, as.character(county_table_CA$ABBREV.))
  as.character(county_table_CA$COUNTY[idx])
}
# Counties along US 101, south to north
counties101 <- c("LA", "VEN", "SB", "SLO",
                 "MON", "SBT", "SCL", "SM",
                 "SF", "MRN", "SON",
                 "MEN", "HUM", "DN")
hw101$county <- ordered(hw101$county, levels = counties101)
hw101$county_name <- ordered(hw101$county_name, levels = cn(counties101))
hw101$county_postmile <- paste(as.character(hw101$county), 
                               as.character(hw101$postmile_prefix),
                               as.character(hw101$postmile),
                               sep = '_')
                               
ggplot(aes(x = county_postmile, y = ahead_aadt),
       data = hw101) + 
  geom_point(aes(color = county)) + geom_line() +
  ggtitle("Ahead AADT Traffic Counts On US 101") + 
  theme(axis.text.x = element_text(angle=45, size = 8))
## Warning: Removed 16 rows containing missing values (geom_point).

ggplot(aes(x = postmile, y = ahead_aadt),
       data = hw101) + 
  geom_point(aes(color = county)) + geom_line() +
  ggtitle("Ahead AADT Traffic Counts On US 101") + 
  theme(axis.text.x = element_text(angle=45, size = 8)) +
  facet_wrap(~county_name)
## Warning: Removed 16 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).

#Let's look at US 101 in LA County and Humboldt County
ggplot(aes(x = postmile, y = ahead_aadt),
       data = subset(hw101, hw101$county == "LA")) + 
  geom_point(aes(color = county)) + geom_line() +
  geom_line(aes(y = ahead_peak_aadt), colour = "green") +
  geom_point(aes(y = ahead_peak_aadt), colour = "green") +
  ggtitle("Ahead AADT Traffic Counts On US 101 in LA County") +  
  theme(axis.text.x = element_text(angle=45, size = 8))
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).

#scale_x_discrete(label = function(x){
#    return(county_table_CA$COUNTY[county_table_CA$ABBREV. == x])}) +

Maybe use cut to group traffic

dfp3$traffic_back_peak_month <-
  ordered(cut(dfp3$back_peak_month,c(0,9400,29000,109000, 406000), 
              labels=c("Q1", "Q2", "Q3", "Q4")))
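The break points above are the quartiles read off the summary() output. As a sketch of an alternative, the breaks could be computed with quantile() instead of typed in by hand. The vector x below is made up, and include.lowest = TRUE is needed so the minimum value lands in the first group:

```r
# Toy traffic counts with one missing value (invented numbers)
x <- c(100, 9400, 29000, 109000, 406000, NA, 5000, 50000)

# Quartile break points computed from the data itself
breaks <- quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE)

# Ordered factor with one level per quartile; NA inputs stay NA
grp <- cut(x, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"),
           include.lowest = TRUE, ordered_result = TRUE)
```

This only works when the computed break points are distinct; heavily tied data would need a different approach.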

1.5 Let’s do a ggpairs

ggpairs(data = dfp3, columns = 9:14) + 
  theme(axis.text.x = element_text(angle=45, size = 8), 
        axis.text.y = element_text(angle=45, size = 8))
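The correlations that ggpairs prints in its upper panels can also be computed directly with cor(). A minimal sketch on a toy frame (made-up numbers, with the same kind of missing values the real columns have), assuming pairwise deletion of NAs is acceptable:

```r
# Toy counts; NA marks a location with no count on that side
counts <- data.frame(
  back_aadt      = c(100, 200, NA, 400, 500),
  ahead_aadt     = c(110, 190, 310, NA, 520),
  back_peak_hour = c(12, 25, 33, 41, 48)
)

# Pearson correlations, using every row where both columns are present
cor_mat <- cor(counts, use = "pairwise.complete.obs")
```

For the real data the equivalent call would be on columns 9 to 14 of dfp3.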

It looks like all of the traffic counts are correlated with each other, which makes sense. The traffic Ahead of an intersection minus the traffic Back of it is the net of the cars that got on and the cars that got off at that intersection. If there is a big difference, a lot of cars either get on or get off the highway there.

The ones that look the most correlated are:
- Ahead Peak (Monthly) AADT and Ahead AADT, with a correlation of .999
- Back Peak (Monthly) AADT and Back AADT, also with a correlation of .999

The ones that are the least correlated are:
- Ahead Peak Hourly and Back AADT, with a correlation of .967
- Ahead AADT and Back Peak Hourly, with a correlation of .968

The other crosses (Hourly vs. AADT, Back vs. Ahead) are also below .97.

Let’s examine the 6 different traffic counts.

AADT is annual average daily traffic, which averages the daily traffic over a given year. I chose 2012 because I was having trouble loading the 2013 data.
Peak Hour represents the highest hourly traffic flow, which is usually during rush hour, 7-9 AM and 5-7 PM. Peak Month represents the average daily traffic for the month of highest traffic flow, since seasonal variations can affect traffic as well.
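To illustrate how AADT and the Peak Month figure relate, here is a small sketch on invented daily volumes for a 365-day year. The numbers and the summer bump are assumptions for illustration, not Caltrans data or Caltrans' exact procedure:

```r
# Hypothetical daily volumes at one location for a 365-day year
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
month <- rep(1:12, times = days_in_month)

# Pretend traffic is heavier in July and August
daily_volume <- ifelse(month %in% c(7, 8), 12000, 10000)

# AADT: total volume for the year divided by the number of days
aadt <- sum(daily_volume) / length(daily_volume)

# Peak month: average daily traffic in the busiest month
monthly_adt <- tapply(daily_volume, month, mean)
peak_month_adt <- max(monthly_adt)
```

The peak month value is always at least as large as AADT, which matches the pattern in the summary() output above.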

We can’t really compare intersections, but for a given intersection we can compare the different relationships.

Let’s look at the Back AADT and the Ahead AADT

p1 <- ggplot(data = dfp3, aes(x = county, y = back_aadt)) + 
  geom_point(alpha=.5, colour = "orange") +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
                                   vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
                                   vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
                                    vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
                                    vjust=.5,face="plain")) +
  ggtitle(label = "CA Back AADT by County")
p2 <- ggplot(data = dfp3, aes(x = county, y = ahead_aadt)) + 
  geom_point(alpha=.5, colour = "green") +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
                                   vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
                                   vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
                                    vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
                                    vjust=.5,face="plain")) +
  ggtitle(label = "CA Ahead AADT by County")
grid.arrange(p1,p2, ncol = 1)
## Warning: Removed 546 rows containing missing values (geom_point).
## Warning: Removed 546 rows containing missing values (geom_point).

Let’s look at the Back Peak Month and the Ahead Peak Month

p1 <- ggplot(data = dfp3, aes(x = county, y = back_peak_month)) + 
  geom_jitter(alpha=.5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
                                   vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
                                   vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
                                    vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
                                    vjust=.5,face="plain"), 
        legend.position = 'none') +
  ggtitle(label = "CA Back Peak Month AADT by County")
p2 <- ggplot(data = dfp3, aes(x = county, y = ahead_peak_aadt)) + 
  geom_jitter(alpha=.5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
                                   vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=12,angle=0,hjust=1,
                                   vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
                                    vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
                                    vjust=.5,face="plain"),
        legend.text = element_text(size = 8), 
        legend.key.size = unit(.3, "cm")) + 
  guides(colour=guide_legend(ncol=2)) +
  ggtitle(label = "CA Ahead Peak Month AADT by County")
grid.arrange(p1,p2, ncol = 1)
## Warning: Removed 546 rows containing missing values (geom_point).
## Warning: Removed 546 rows containing missing values (geom_point).

What about the difference between them?

#colors=rainbow( ncol(frame) ,s = 0.5, v = 1 )
ggplot(data = dfp3, aes(x = county, 
                          y = back_peak_month - ahead_peak_aadt)) + 
  geom_jitter(alpha=.5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
                                   vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=10,angle=0,hjust=1,
                                   vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
                                    vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
                                    vjust=.5,face="plain"),
        legend.text = element_text(size = 8), 
        legend.key.size = unit(.3, "cm")) + 
  guides(colour=guide_legend(ncol=2)) + 
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) + 
  ggtitle(label = "CA Difference in Back and Ahead Peak Month\nAADT by County")
## Warning: Removed 1076 rows containing missing values (geom_point).
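The warning reports more removed rows than the 546 missing in either column alone because a difference is NA whenever either side is NA. A small sketch on a toy frame (invented values) shows this, along with one way the differences might be summarised per county:

```r
# Toy data: NA on either side makes the difference NA
toy <- data.frame(
  county = c("LA", "LA", "ORA", "ORA", "ORA"),
  back   = c(100, NA, 50, 70, 90),
  ahead  = c(120, 80, 40, NA, 100),
  stringsAsFactors = FALSE
)
toy$diff <- toy$back - toy$ahead

# Rows lost to missing values on at least one side
n_missing <- sum(is.na(toy$diff))

# Median Back minus Ahead difference per county, ignoring missing rows
med_by_county <- tapply(toy$diff, toy$county, median, na.rm = TRUE)
```

In the toy frame each column has one NA, but two differences are missing.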

Peak Hour (Back - Ahead)

#colors=rainbow( ncol(frame) ,s = 0.5, v = 1 )
ggplot(data = dfp3, aes(x = county, y = back_peak_hour - ahead_peak_hour)) + 
  geom_jitter(alpha=.5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,hjust=.5,
                                   vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=10,angle=0,hjust=1,
                                   vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,hjust=.5,
                                    vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,hjust=.5,
                                    vjust=.5,face="plain"),
        legend.text = element_text(size = 8), 
        legend.key.size = unit(.3, "cm")) + 
  guides(colour=guide_legend(ncol=2)) + 
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) + 
  ggtitle(label = "CA Difference in Back and Ahead Peak Hour\nby County")
## Warning: Removed 1076 rows containing missing values (geom_point).

# AADT (Back - Ahead)

#colors=rainbow( ncol(frame) ,s = 0.5, v = 1 )
ggplot(data = dfp3, aes(x = county, y = back_aadt - ahead_aadt)) + 
  geom_jitter(alpha=.5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour="grey20",size=8,angle=45,
                                   hjust=.5,vjust=.5,face="plain"),
        axis.text.y = element_text(colour="grey20",size=10,angle=0,
                                   hjust=1,vjust=0,face="plain"),  
        axis.title.x = element_text(colour="grey20",size=12,angle=0,
                                    hjust=.5,vjust=0,face="plain"),
        axis.title.y = element_text(colour="grey20",size=12,angle=90,
                                    hjust=.5,vjust=.5,face="plain"),
        legend.text = element_text(size = 8), 
        legend.key.size = unit(.3, "cm")) + 
  guides(colour=guide_legend(ncol=2)) + 
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) + 
  ggtitle(label = "CA Difference in Back and Ahead AADT\nby County")
## Warning: Removed 1076 rows containing missing values (geom_point).
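
Each of the three plots above warns that 1076 rows were removed. My guess (an assumption, not something I verified in the output) is that these are rows where the back or ahead value is NA, so the difference is NA too. Here is a minimal sketch of how that could be checked before plotting. The toy data frame and its values are made up for illustration; only the column names back_aadt and ahead_aadt come from dfp3.

```r
# Toy stand-in for dfp3 (hypothetical values, not Caltrans data).
toy <- data.frame(
  county     = c("LA", "LA", "ORA", "ALP"),
  back_aadt  = c(150000, NA, 98000, 1200),
  ahead_aadt = c(148000, 151000, NA, 1250)
)

# The difference is NA whenever either leg is missing.
diff_aadt <- toy$back_aadt - toy$ahead_aadt

# Number of rows that ggplot2 would drop with a warning.
n_missing <- sum(is.na(diff_aadt))
print(n_missing)

# Keep only the rows that can actually be plotted.
plot_ready <- toy[!is.na(diff_aadt), ]
print(plot_ready)
```

Running the same check on dfp3 would show whether the count matches the 1076 in the warnings.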

1.6 Which areas have the highest traffic counts?

Let’s look at the Back Peak Hour first.

#ggplot(data = dfp3, aes(x = ))
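
One possible way to approach this question, sketched with base R rather than a plot: take the largest Back Peak Hour value in each county and sort the counties from highest to lowest. The toy data frame below is a made-up stand-in for dfp3; only the column names county and back_peak_hour are taken from the code above. Using the maximum is my assumption here, and a mean or median per county would be just as reasonable.

```r
# Toy stand-in for dfp3 (hypothetical values, not Caltrans data).
toy <- data.frame(
  county         = c("LA", "LA", "ORA", "ORA", "ALP", "ALP"),
  back_peak_hour = c(21000, 18500, 19000, NA, 300, 450)
)

# Largest back peak hour per county; the formula interface drops NA rows.
by_county <- aggregate(back_peak_hour ~ county, data = toy, FUN = max)

# Sort counties from highest to lowest peak hour volume.
by_county <- by_county[order(-by_county$back_peak_hour), ]
print(head(by_county, 3))
```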

2 Univariate Analysis

Let’s look at the different variables (columns) in this dataset:
- Dist (District) - this is a categorical variable

str(dfp3$dist)
##  Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
ggplot(data = dfp3, aes(x = dist)) + geom_bar() +
  theme(axis.text.x = element_text(angle=45)) +
  ggtitle(label = "Number of California Traffic Counts by District")

- County Name (county_name) - this is also a categorical variable

str(dfp3$county_name)
##  Factor w/ 58 levels "Alameda","Alpine",..: 30 30 30 30 30 30 30 30 30 30 ...
ggplot(data = dfp3, aes(x = county_name)) + geom_bar() +
  theme(axis.text.x = element_text(angle=45)) +
  ggtitle(label = "Number of California Traffic Counts by County")

There are some counts that don’t have districts. Let’s examine them:

dfp3[is.na(dfp3$dist),]

They are all breaks in route.

2.1 What is the structure of your dataset?

2.2 What is/are the main feature(s) of interest in your dataset?

2.3 What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

2.4 Did you create any new variables from existing variables in the dataset?

2.5 Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

3 Bivariate Plots Section

4 Bivariate Analysis

4.1 Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

4.2 Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

4.3 What was the strongest relationship you found?

5 Multivariate Plots Section

6 Multivariate Analysis

6.1 Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

6.2 Were there any interesting or surprising interactions between features?

6.3 OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

7 Final Plots and Summary

You will select three plots from your analysis to polish and share in this section. The three plots should show different trends and should be polished with appropriate labels, units, and titles.

Are the final three plots varied and do they meet some of the following criteria:
- Draw comparisons.
- Identify trends.
- Engage a wide audience.
- Explain a complicated finding.
- Clarify a gap between perception and reality.
- Enable the reader to digest large amounts of information.

Are the plots polished?
- Are axes labeled?
- Are units labeled on each axis?
- Are plots titled?
- Are all labels, titles, and units readable?

Are the plots explained?

7.1 Plot One

7.2 Description One

7.3 Plot Two

7.4 Description Two

7.5 Plot Three

7.6 Description Three

8 Reflection

Does the section provide a written reflection of the analysis? Consider the following in your reflections:
- Where did I run into difficulties in the analysis?
- Where did I find successes?
- How could the analysis be enriched in future work (e.g. additional data and analyses)?

9 Document Your Data

9.1 Description of the Data Set

Definition of any variables, units, levels of categorical variables, and the data generating process, such as how the data was collected, if possible.

A description is at http://traffic-counts.dot.ca.gov/docs/traffic-data-faq.pdf

Legs: Counts are taken ahead of or in back of a given location on the state highway system. These are called Legs.
Postmile Prefix Codes: Assigned when a length of highway is changed due to construction or realignment. The alpha code is used to differentiate the different values.

9.1.1 Definitions of AADTs

Back annual average daily traffic (AADT) usually represents traffic South or West of the count location and is the total volume for the year divided by 365 days.
Ahead annual average daily traffic (AADT) usually represents traffic North or East of the count location and is the total volume for the year divided by 365 days.
AADTs represent both directions of travel, and summing them together will result in erroneous data.
Peak Hour usually represents an estimate of the heaviest traffic flow, which usually occurs from 7 to 9 AM and from 5 to 7 PM. Peak Hour values indicate the volume in both directions. In urban and suburban areas, the peak hour normally occurs every weekday. On roads with large seasonal fluctuations in traffic, the peak hour is the hour near the maximum for the year but excluding a few (30 to 50 hours) that are exceedingly high and are not typical of the frequency of the high hours occurring during the season.
Peak Month ADT is the average daily traffic for the month of heaviest traffic flow, usually July or August. This data is obtained because on many routes, high traffic volumes which occur during a certain season of the year are more representative of traffic conditions than the annual ADT.

9.1.2 Annual Average Daily Traffic (Annual ADT)

Annual average daily traffic is the total traffic volume for the year divided by 365* days. The traffic count year is from October 1st through September 30th. Very few locations in California are actually counted continuously. Traffic counting is generally performed by electronic counting instruments moved from location to location throughout the State in a program of continuous traffic count sampling. The resulting counts are adjusted to an estimate of annual average daily traffic by compensating for seasonal influence, weekly variation and other variables which may be present. Annual ADT is necessary for presenting a statewide picture of traffic flow, evaluating traffic trends, computing accident rates, planning and designing highways and other purposes.
Peak Month ADT: The peak month ADT is the average daily traffic for the month of heaviest traffic flow. This data is obtained because on many routes, high traffic volumes which occur during a certain season of the year are more representative of traffic conditions than the annual ADT.
Peak Hour: This publication includes an estimate of the “peak hour” traffic at all points on the state highway system. This value is useful to traffic engineers in estimating the amount of congestion experienced, and shows how near to capacity the highway is operating. Unless otherwise indicated, peak hour values indicate the volume in both directions. A few hours each year are higher than the “peak hour,” but not many. In urban and suburban areas, the peak hour normally occurs every weekday, during what is considered “rush hour” traffic. On roads with large seasonal fluctuations in traffic, the peak hour is the hour near the maximum for the year but excluding a few (30 to 50 hours) that are exceedingly high and are not typical of the frequency of the high hours occurring during the season.
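
To make the three definitions above concrete, here is a rough sketch that computes them from a full year of hourly counts. This is only an illustration under several assumptions of mine: the hourly volumes are synthetic, the count year shown is a hypothetical 365-day one, and skipping exactly the 30 highest hours is my reading of the “30 to 50 hours” wording. Caltrans itself estimates these values from sampled counts with adjustment factors, which this sketch does not attempt.

```r
# Synthetic hourly two-way volumes for one hypothetical count year
# (October 1 through September 30). Not Caltrans data.
set.seed(42)
days <- seq(as.Date("2012-10-01"), as.Date("2013-09-30"), by = "day")
hourly <- data.frame(
  day  = rep(days, each = 24),
  hour = rep(0:23, times = length(days))
)
month_num <- as.integer(format(hourly$day, "%m"))
seasonal  <- ifelse(month_num %in% c(7, 8), 1.3, 1.0)        # busier summer
rush      <- ifelse(hourly$hour %in% c(7, 8, 17, 18), 3, 1)  # rush hours
hourly$volume <- rpois(nrow(hourly), lambda = 200 * seasonal * rush)

# Annual ADT: total volume for the year divided by the number of days.
aadt <- sum(hourly$volume) / length(days)

# Peak Month ADT: average daily traffic in the heaviest month.
daily_total    <- tapply(hourly$volume, format(hourly$day, "%Y-%m-%d"), sum)
month_of_day   <- substr(names(daily_total), 1, 7)
monthly_adt    <- tapply(daily_total, month_of_day, mean)
peak_month     <- names(which.max(monthly_adt))
peak_month_adt <- max(monthly_adt)

# Peak Hour: skip the few highest hours of the year and take the next one.
n_excluded <- 30  # assumption; the text says 30 to 50 hours
peak_hour  <- sort(hourly$volume, decreasing = TRUE)[n_excluded + 1]

print(c(aadt = aadt, peak_month_adt = peak_month_adt, peak_hour = peak_hour))
print(peak_month)
```

Because the synthetic volumes are higher in July and August, the peak month lands in the summer, which mirrors the “usually July or August” remark in the definitions.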

10 References

Counties in California: http://en.wikipedia.org/wiki/List_of_counties_in_California and http://www.dot.ca.gov/hq/tsip/hseb/products/county_name.pdf