Project 3: Annual Average Daily Traffic, CA, 2012 by Susan S.


## Warning: joining factor and character vector, coercing into character
## vector

Univariate Plots Section

Dimensions of dataset:  6768, 17  
Column names:  

```
##  [1] "dist"            "route"           "route_suffix"   
##  [4] "county"          "postmile_prefix" "postmile"       
##  [7] "alignment"       "description"     "back_peak_hour" 
## [10] "back_peak_month" "back_aadt"       "ahead_peak_hour"
## [13] "ahead_peak_aadt" "ahead_aadt"      "county_name"    
## [16] "district"        "county_pop"
```
Structure of dataset:  

```
## 'data.frame':    6768 obs. of  17 variables:
##  $ dist           : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ route          : Factor w/ 243 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ route_suffix   : Factor w/ 3 levels "","S","U": 1 1 1 1 1 1 1 1 1 1 ...
##  $ county         : chr  "ORA" "ORA" "ORA" "ORA" ...
##  $ postmile_prefix: Factor w/ 8 levels "","C","D","L",..: 6 6 1 1 1 1 6 6 1 1 ...
##  $ postmile       : num  0.129 0.78 8.43 9.418 9.6 ...
##  $ alignment      : Factor w/ 3 levels "","L","R": 1 1 1 1 1 1 1 1 1 1 ...
##  $ description    : Factor w/ 5762 levels ""," JCT. RTE. 101",..: 1129 1128 2623 2622 2619 2625 3508 3509 3510 3507 ...
##  $ back_peak_hour : int  NA 3750 2850 3000 3350 3150 4000 4250 4350 5400 ...
##  $ back_peak_month: int  NA 40000 38500 40500 39500 37500 49500 53000 53000 52000 ...
##  $ back_aadt      : int  NA 37000 36000 38000 37000 35000 45000 48000 48700 48500 ...
##  $ ahead_peak_hour: int  3750 3900 2850 3400 3350 3150 4800 4250 5300 5400 ...
##  $ ahead_peak_aadt: int  40000 42000 38500 40500 39500 38500 59000 53000 52000 52000 ...
##  $ ahead_aadt     : int  37000 38500 36000 38000 37000 35000 54000 48000 48700 48500 ...
##  $ county_name    : chr  "Orange" "Orange" "Orange" "Orange" ...
##  $ district       : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ county_pop     : int  3073540 3073540 3073540 3073540 3073540 3073540 3073540 3073540 3073540 3073540 ...
```
Summary of dataset: 

```
##       dist          route      route_suffix    county         
##  4      :1084   101    : 507    :6738       Length:6768       
##  7      : 913   5      : 420   S:  20       Class :character  
##  3      : 771   1      : 278   U:  10       Mode  :character  
##  6      : 721   99     : 254                                  
##  11     : 609   80     : 157                                  
##  8      : 581   10     : 152                                  
##  (Other):2089   (Other):5000                                  
##  postmile_prefix    postmile       alignment
##         :4695    Min.   :  0.000    :6649   
##  R      :1844    1st Qu.:  5.752   L:  54   
##  M      :  71    Median : 15.367   R:  65   
##  T      :  70    Mean   : 22.308            
##  L      :  63    3rd Qu.: 30.408            
##  S      :  16    Max.   :186.238            
##  (Other):   9    NA's   :6                  
##                                  description   back_peak_hour 
##  JCT. RTE. 5                           :  33   Min.   :   10  
##  JCT. RTE. 101                         :  16   1st Qu.:  940  
##  JCT. RTE. 99                          :  14   Median : 2600  
##  NEVADA STATE LINE                     :  13   Mean   : 5326  
##  JCT. RTE. 15                          :  12   3rd Qu.: 8800  
##  LOS ANGELES/SAN BERNARDINO COUNTY LINE:  12   Max.   :31000  
##  (Other)                               :6668   NA's   :537    
##  back_peak_month    back_aadt      ahead_peak_hour ahead_peak_aadt 
##  Min.   :   100   Min.   :    80   Min.   :   10   Min.   :   100  
##  1st Qu.:  9400   1st Qu.:  8200   1st Qu.:  940   1st Qu.:  9500  
##  Median : 29000   Median : 25600   Median : 2600   Median : 29000  
##  Mean   : 68262   Mean   : 64726   Mean   : 5333   Mean   : 68318  
##  3rd Qu.:109000   3rd Qu.:104000   3rd Qu.: 8800   3rd Qu.:109000  
##  Max.   :406000   Max.   :377500   Max.   :31000   Max.   :406000  
##  NA's   :537      NA's   :537      NA's   :537     NA's   :537     
##    ahead_aadt     county_name           district        county_pop     
##  Min.   :    80   Length:6768        Min.   : 1.000   Min.   :   1114  
##  1st Qu.:  8400   Class :character   1st Qu.: 4.000   1st Qu.: 182640  
##  Median : 26000   Mode  :character   Median : 6.000   Median : 822403  
##  Mean   : 64773                      Mean   : 6.134   Mean   :2000090  
##  3rd Qu.:104000                      3rd Qu.: 8.000   3rd Qu.:2249045  
##  Max.   :377500                      Max.   :12.000   Max.   :9946947  
##  NA's   :537
```

* The maximum length of any route in a single county is approximately 186 miles.  
* The hourly peak traffic ranges from 10 vehicles to 31,000 vehicles.    
* The daily peak traffic ranges from 100 vehicles to 406,000 vehicles.  
* The average daily traffic ranges from 80 vehicles to 377,500 vehicles.

I am not loading the CA traffic districts as a dataset, but I am including the table here to help in understanding the plots by district. Source for the traffic districts: http://en.wikipedia.org/wiki/California_Department_of_Transportation#Districts

| District | Counties | Headquarters |
|---|---|---|
| 1 | Del Norte, Humboldt, Lake, Mendocino | Eureka |
| 2 | Lassen, Modoc, Plumas, Shasta, Siskiyou, Tehama, Trinity; portions of Butte and Sierra | Redding |
| 3 | Butte, Colusa, El Dorado, Glenn, Nevada, Placer, Sacramento, Sierra, Sutter, Yolo, Yuba | Marysville |
| 4 | Alameda, Contra Costa, Marin, Napa, San Francisco, San Mateo, Santa Clara, Solano, Sonoma | Oakland |
| 5 | Monterey, San Benito, San Luis Obispo, Santa Barbara, Santa Cruz | San Luis Obispo |
| 6 | Madera, Fresno, Tulare, Kings, Kern | Fresno |
| 7 | Los Angeles, Ventura | Los Angeles |
| 8 | Riverside, San Bernardino | San Bernardino |
| 9 | Inyo, Mono | Bishop |
| 10 | Alpine, Amador, Calaveras, Mariposa, Merced, San Joaquin, Stanislaus, Tuolumne | Stockton |
| 11 | Imperial, San Diego | San Diego |
| 12 | Orange | Irvine |

Frequency of District traffic counts:

## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  325  411  771 1084  460  721  913  581   98  533  609  262

## 
##    4    7    3    6   11    8   10    5    2    1   12    9 
## 1084  913  771  721  609  581  533  460  411  325  262   98

## 
## 101   5   1  99  80  10  15  49  33  20 405  12 210   8 580  78  91 395 
## 507 420 278 254 157 152 130 122  92  77  74  73  72  70  70  69  68  67 
##  50   4 
##  66  65

Frequency of route_suffix, county, postmile_prefix, alignment, traffic:

## Source: local data frame [3 x 3]
## 
##   route_suffix    n          description
## 1              6738         Normal Route
## 2            S   20   Supplemental Route
## 3            U   10 Unrelinquished Route

## 
##         Alameda          Alpine          Amador           Butte 
##             220              19              47              96 
##       Calaveras          Colusa    Contra Costa       Del Norte 
##              50              36             109              36 
##       El Dorado          Fresno           Glenn        Humboldt 
##              80             183              42             146 
##        Imperial            Inyo            Kern           Kings 
##             156              45             267              49 
##            Lake          Lassen     Los Angeles          Madera 
##              45              41             762              56 
##           Marin        Mariposa       Mendocino          Merced 
##              60              31              98             103 
##           Modoc            Mono        Monterey            Napa 
##              22              53             111              59 
##          Nevada          Orange          Placer          Plumas 
##              68             261             103              44 
##       Riverside      Sacramento      San Benito  San Bernardino 
##             242             151              27             339 
##       San Diego   San Francisco     San Joaquin San Luis Obispo 
##             453              52             139             126 
##       San Mateo   Santa Barbara     Santa Clara      Santa Cruz 
##             157             128             202              70 
##          Shasta          Sierra        Siskiyou          Solano 
##             138              22              84              93 
##          Sonoma      Stanislaus          Sutter          Tehama 
##             132              91              40              54 
##         Trinity          Tulare        Tuolumne         Ventura 
##              28             163              54             152 
##            Yolo            Yuba 
##              89              44

## Warning in tapply(X = X, INDEX = x, FUN = FUN, ...): NAs introduced by
## coercion

Postmile prefix - Definitions of Postmile prefix joined to count of each prefix. Note that not all prefixes are in use.

## 
##         C    D    L    M    R    S    T 
## 4695    6    3   63   71 1844   16   70

##    Prefix                                           Meaning
## 1                                                          
## 2       C                                  Commercial Lanes
## 3       D      Duplicate postmile at meandering county line
## 4       G  Reposting of duplicate postmile oat end of route
## 5       H                              Overlap of D mileage
## 6       L                                  Overlap postmile
## 7       M                          Realignment of R mileage
## 8       N                          Realignment of M mileage
## 9       R                                 First realignment
## 10      S                                              Spur
## 11      T                              Temporary connection
## 12      U                                    Unrelinquished

## Warning: joining factors with different levels, coercing to character
## vector

## Source: local data frame [8 x 3]
## 
##   postmile_prefix    n                                       Meaning
## 1                 4695                                              
## 2               C    6                              Commercial Lanes
## 3               D    3  Duplicate postmile at meandering county line
## 4               L   63                              Overlap postmile
## 5               M   71                      Realignment of R mileage
## 6               R 1844                             First realignment
## 7               S   16                                          Spur
## 8               T   70                          Temporary connection
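
The join of counts to definitions can be sketched in base R. This is only a minimal sketch, not the code used for this report (the "Source: local data frame" header suggests the report used a dplyr join instead). The counts and meanings are copied from the output above; the names `prefix_counts` and `prefix_defs` are hypothetical. Keeping both key columns as character vectors is one way to avoid the factor coercion warning.

```r
# Counts per postmile prefix, copied from the frequency table above.
prefix_counts <- data.frame(
  postmile_prefix = c("", "C", "D", "L", "M", "R", "S", "T"),
  n = c(4695, 6, 3, 63, 71, 1844, 16, 70),
  stringsAsFactors = FALSE
)

# Prefix definitions, including prefixes that never occur in the data.
prefix_defs <- data.frame(
  postmile_prefix = c("", "C", "D", "G", "H", "L", "M", "N", "R", "S", "T", "U"),
  meaning = c("", "Commercial Lanes",
              "Duplicate postmile at meandering county line",
              "Reposting of duplicate postmile at end of route",
              "Overlap of D mileage", "Overlap postmile",
              "Realignment of R mileage", "Realignment of M mileage",
              "First realignment", "Spur", "Temporary connection",
              "Unrelinquished"),
  stringsAsFactors = FALSE
)

# Left join: keep every prefix that occurs in the data, drop unused definitions.
joined <- merge(prefix_counts, prefix_defs, by = "postmile_prefix", all.x = TRUE)
print(joined)
```

Because `all.x = TRUE` keeps only the prefixes present in the counts, the unused prefixes (G, H, N, U) drop out, matching the eight-row table above.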

Alignment - Definitions of alignment prefix joined to count of each prefix

## Source: local data frame [3 x 3]
## 
##   alignment    n                 description
## 1           6649                            
## 2         L   54  Left independent alignment
## 3         R   65 Right independent alignment

It is difficult to see the hourly and daily traffic counts in the same graph because the scales are so different. If I use free scales for the facets it might look better.

Hmm, I seem to remember seeing this type of distribution in one of the lessons (Lesson 3), where it was transformed into something closer to a normal distribution by taking the log. I will try both the log and the sqrt.
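
To show what these transformations do to a right-skewed variable, here is a small sketch on simulated data. The `traffic` vector is hypothetical (drawn from a log-normal distribution), not the real counts, and `skewness` is a helper defined here rather than a library function.

```r
# Simulated right-skewed counts standing in for a traffic column (hypothetical).
set.seed(42)
traffic <- rlnorm(1000, meanlog = 10, sdlog = 1.5)

# Moment-based sample skewness: 0 for a symmetric distribution,
# positive when the right tail is longer.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skews <- c(raw   = skewness(traffic),
           sqrt  = skewness(sqrt(traffic)),
           log10 = skewness(log10(traffic)))
print(round(skews, 2))
```

On data like this, the square root reduces the right tail and the log removes most of it, which is why the log-transformed histogram looks closer to normal.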

It looks bimodal, which must reflect different types of highways. Both transformations look more normal: log has no tail on the right, sqrt has no tail on the left.
To figure out why it is bimodal, I will add population to tidyData and cut it into two levels: county population under and over 1 million people.

When I separate the data into counties with under a million people and counties with over a million people, the bimodality separates out. Counties under a million have more normally distributed traffic counts. Counties with over a million people have a larger share of count locations with high traffic counts.
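
The two population levels can be made with `cut()`. This is a minimal sketch using a handful of county populations from this report's population table; the variable names and the level labels are hypothetical.

```r
# A few county populations from the report's 2012 population table.
county_pop <- c("Los Angeles" = 9946947, "Orange" = 3073540,
                "Contra Costa" = 1069158, "Fresno" = 948453, "Alpine" = 1114)

# Two levels: under one million, and one million or more.
pop_level <- cut(county_pop,
                 breaks = c(0, 1e6, Inf),
                 labels = c("under 1M", "1M and over"),
                 right = FALSE)   # intervals are [0, 1e6) and [1e6, Inf)

print(data.frame(county_pop, pop_level))
```

The resulting factor can then be used as a facet or fill variable to draw one histogram per population group.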

OK, technically this should go into the bivariate section, but it started off trying to investigate the traffic counts on their own.

I wonder what these plots look like across the categorical variables of district, route, and county?

Now let's look at the population.

##             county pop20120701
## 1      Los Angeles     9946947
## 2        San Diego     3153376
## 3           Orange     3073540
## 4        Riverside     2249045
## 5   San Bernardino     2063867
## 6      Santa Clara     1827313
## 7          Alameda     1539145
## 8       Sacramento     1435118
## 9     Contra Costa     1069158
## 10          Fresno      948453
## 11            Kern      855974
## 12         Ventura      833361
## 13   San Francisco      822403
## 14       San Mateo      736019
## 15     San Joaquin      697758
## 16      Stanislaus      523126
## 17          Sonoma      488300
## 18          Tulare      452301
## 19   Santa Barbara      426063
## 20        Monterey      421465
## 21          Solano      419064
## 22          Placer      360098
## 23 San Luis Obispo      270637
## 24      Santa Cruz      268607
## 25          Merced      261002
## 26           Marin      253892
## 27           Butte      220980
## 28            Yolo      204974
## 29       El Dorado      182640
## 30          Shasta      178402
## 31        Imperial      178382
## 32          Madera      151242
## 33           Kings      150643
## 34            Napa      138019
## 35        Humboldt      134601
## 36          Nevada       97920
## 37          Sutter       96557
## 38       Mendocino       88550
## 39            Yuba       72915
## 40            Lake       64204
## 41          Tehama       63937
## 42      San Benito       56643
## 43        Tuolumne       53949
## 44        Siskiyou       45218
## 45       Calaveras       44923
## 46          Amador       36403
## 47          Lassen       33719
## 48           Glenn       28560
## 49       Del Norte       28533
## 50          Colusa       21442
## 51          Plumas       19906
## 52            Inyo       18578
## 53        Mariposa       17959
## 54            Mono       14393
## 55         Trinity       13496
## 56           Modoc        9516
## 57          Sierra        3133
## 58          Alpine        1114

## 'data.frame':    58 obs. of  2 variables:
##  $ county     : chr  "Alameda" "Alpine" "Amador" "Butte" ...
##  $ pop20120701: int  1539145 1114 36403 220980 44923 21442 1069158 28533 182640 948453 ...

Population Frequency - Los Angeles outlier - then remove it

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

When I get rid of Los Angeles, the population distribution is more even. The counties with smaller populations might have fewer traffic counts each, but there are a lot of them.

Plot county population, not

q1 <- qplot(data=pop12, x = pop20120701) +
  ggtitle("With Outlier")
q2 <- qplot(data=subset(pop12, pop20120701 < 5000000), x = pop20120701) +
  ggtitle("Remove Outlier")
grid.arrange(q1, q2, ncol=2)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Los Angeles has a very high population, so it is an outlier and makes it more difficult to see the population distribution in the lower counties. Removing counties with populations > 5 Million (Los Angeles) helps make the rest clearer.

Univariate Analysis

What is the structure of your dataset?

There are 6768 traffic count locations in this dataset, with 8 original features (dist, route, route_suffix, county, postmile_prefix, postmile, alignment, description) and 6 traffic count types (back_peak_hour, back_peak_month, back_aadt, ahead_peak_hour, ahead_peak_aadt, ahead_aadt), plus county_name, district (the district number as an integer), and county_pop (population).

There are 12 districts, 243 routes, 3 route suffixes (most are blank), 58 counties, 8 postmile prefixes (R is the most common besides blank), 3 alignments (most are blank), 5762 descriptions (which are the intersections with town or county/state line), and 537 missing traffic counts for each traffic type (at a boundary, the ahead and back counts are usually on separate lines).

Other observations:
The most frequent description is “JCT. RTE. 5”. The county with the most traffic count locations is Los Angeles. The district with the most traffic counts is 4, which includes San Francisco and Oakland. The route with the most traffic counts is 101. The ahead_peak_aadt was equal to or higher than the back_peak_month for min/mean/median/max (these are equivalent measures ahead of and behind the count location, just not consistently named in the source document).

The route with the largest number of traffic counts is U.S. Route 101 (Highway 101), which runs through California from South to North. There are 507 points on Highway 101. It might be an interesting roadway to investigate.

The second most counted roadway is I-5. It is an interstate, not a state highway and it also covers the whole state from South to North.

The third most counted highway is highway 1, which hugs the coastline. The fourth is Highway 99. Those 4 have many more counts than the other highways.

Los Angeles has a huge population, which raises the mean much higher than the median county population.

All of the traffic count types have a high concentration of counts in the lowest two of the 30 histogram bins (the lowest 2/30 of the traffic count range). The hourly traffic counts have a slightly different distribution than the daily traffic counts: more of the daily counts fall into the lowest bin (the lowest 1/30 of the range) than the hourly counts.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are the traffic counts, counties and routes.
I also added population from a separate table to do traffic correlation based on county population.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The population count also has the population difference between 2012 and 2013 which could be useful to see if people came to or left areas of high traffic.

Did you create any new variables from existing variables in the dataset?

I created a county_name variable holding the full name of the county, looked up from the county column in the dataset, which is actually an abbreviation.
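
One simple way to do such a lookup is a named character vector. This is a sketch, not necessarily how the report built the column (the join warning at the top of the report suggests a table join was used). The abbreviations below are the Highway 101 counties listed later in this report; `county_lookup` is a hypothetical name.

```r
# Abbreviation -> full name; only the Highway 101 counties are listed here.
county_lookup <- c(LA = "Los Angeles", VEN = "Ventura", SB = "Santa Barbara",
                   SLO = "San Luis Obispo", MON = "Monterey", SBT = "San Benito",
                   SCL = "Santa Clara", SM = "San Mateo", SF = "San Francisco",
                   MRN = "Marin", SON = "Sonoma", MEN = "Mendocino",
                   HUM = "Humboldt", DN = "Del Norte")

# "ORA" is deliberately missing from the lookup to show what happens.
county <- c("LA", "LA", "SF", "DN", "ORA")
county_name <- unname(county_lookup[county])
print(county_name)
```

Abbreviations that are not in the lookup come back as `NA`, which makes incomplete lookup tables easy to spot.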

In a subset of data for Highway 101 I created a cumulative mileage column. In the next sections I created a few summary datasets such as dfp3.bpm_by_county and dfp3.baadt_by_county to plot the mean traffic by county and do modeling to predict average traffic based on population.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I created a tidy data set of one variable/value pair per row so that I could plot the different traffic types in a facet.
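
The reshaping from one column per traffic type to one variable/value pair per row can be sketched in base R with `stack()`. This is only an illustration on three rows whose values are copied from the structure output above; the report's actual tidyData may have been built with a different function, and the names `wide`, `long`, `id`, and `traffic_type` are hypothetical.

```r
# Three rows of four traffic columns, values copied from the structure output.
wide <- data.frame(
  id = 1:3,
  back_peak_hour  = c(NA, 3750, 2850),
  back_aadt       = c(NA, 37000, 36000),
  ahead_peak_hour = c(3750, 3900, 2850),
  ahead_aadt      = c(37000, 38500, 36000)
)

# stack() turns the selected columns into value/variable pairs, one per row.
long <- stack(wide[, -1])
names(long) <- c("value", "traffic_type")

# stack() works column by column, so the ids repeat once per stacked column.
long$id <- rep(wide$id, times = ncol(wide) - 1)

print(long)
```

With the data in this shape, `traffic_type` can be used directly as the facet variable.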

I summarized the traffic (back peak month) by county_name since I had the population of the counties, so I could compare the average traffic to the county population. I tried to find a correlation, so I plotted each axis as a log function until I found one that looked correlated - the log of the county population is correlated with the log of the average traffic (back peak month).
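
The summarize-then-correlate step can be sketched as follows. The populations are from this report's population table, but the back_peak_month values are made up for illustration, so the resulting correlation says nothing about the real data. The names `toy`, `by_county`, and `r_log` are hypothetical.

```r
# Hypothetical counts for four counties (traffic values are made up).
toy <- data.frame(
  county_name = c("Alpine", "Alpine", "Kings", "Kings",
                  "Fresno", "Fresno", "Orange", "Orange"),
  county_pop = c(1114, 1114, 150643, 150643,
                 948453, 948453, 3073540, 3073540),
  back_peak_month = c(900, 1500, 12000, NA, 40000, 60000, 150000, 250000)
)

# Mean traffic per county; the formula interface drops NA rows by default.
by_county <- aggregate(back_peak_month ~ county_name + county_pop,
                       data = toy, FUN = mean)

# Pearson correlation with both variables on the log10 scale.
r_log <- cor(log10(by_county$county_pop), log10(by_county$back_peak_month))

print(by_county)
print(r_log)
```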

To look at the county population distribution, I removed Los Angeles because it was such an outlier (population of over 9 million when next largest county only had 3 million). It made the histogram spread out more.

I looked at the traffic distributions and saw that a log transformation would make them more normal, so I plotted the log10 of the traffic counts. It looked bimodal. I separated it by county population - made two groups using cut: one for counties with population under a million and the rest over a million.
This removed the bimodality. Counties with under a million residents had a more normal-looking log traffic frequency distribution.

Bivariate Plots Section

Below is a correlation matrix of the numeric columns in my main data frame.

##                 ahead_peak_hour ahead_peak_aadt ahead_aadt back_peak_hour
## ahead_peak_hour       1.0000000       0.9861205  0.9851281      0.9784667
## ahead_peak_aadt       0.9861205       1.0000000  0.9992771      0.9689689
## ahead_aadt            0.9851281       0.9992771  1.0000000      0.9680152
## back_peak_hour        0.9784667       0.9689689  0.9680152      1.0000000
## back_peak_month       0.9679986       0.9831139  0.9826660      0.9862641
## back_aadt             0.9672250       0.9827562  0.9833376      0.9852505
## county_pop            0.5019581       0.5268439  0.5305682      0.5020283
## postmile             -0.1127205      -0.1189987 -0.1234395     -0.1535695
##                 back_peak_month  back_aadt  county_pop    postmile
## ahead_peak_hour       0.9679986  0.9672250  0.50195813 -0.11272049
## ahead_peak_aadt       0.9831139  0.9827562  0.52684388 -0.11899871
## ahead_aadt            0.9826660  0.9833376  0.53056818 -0.12343946
## back_peak_hour        0.9862641  0.9852505  0.50202832 -0.15356953
## back_peak_month       1.0000000  0.9993272  0.52685191 -0.15792589
## back_aadt             0.9993272  1.0000000  0.53044740 -0.16222574
## county_pop            0.5268519  0.5304474  1.00000000 -0.03568948
## postmile             -0.1579259 -0.1622257 -0.03568948  1.00000000

The traffic counts are all highly correlated with each other (every pairwise correlation is above 0.96). Traffic is moderately correlated with county population (between 0.50 and 0.53). It is expected that more people lead to more cars and to more traffic, even if other factors, such as the availability of public transportation, also affect traffic. One surprising finding is that the postmile (distance from the S or W end of the route in the county) is slightly negatively correlated with traffic (about -0.11 to -0.16) and even more slightly negatively correlated with county population (about -0.04). There is slightly more traffic at the start of the route (S or W end).
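
Because the traffic columns contain missing values, a correlation matrix like the one above needs an explicit choice for handling `NA`. The sketch below shows one option, `use = "complete.obs"`; I am assuming something like this was used for the report, but the exact call is not shown. The six rows are copied from the structure output in the univariate section.

```r
# First six records of four traffic columns, copied from the structure output.
counts <- data.frame(
  back_peak_hour  = c(NA, 3750, 2850, 3000, 3350, 3150),
  back_aadt       = c(NA, 37000, 36000, 38000, 37000, 35000),
  ahead_peak_hour = c(3750, 3900, 2850, 3400, 3350, 3150),
  ahead_aadt      = c(37000, 38500, 36000, 38000, 37000, 35000)
)

# The default (use = "everything") gives NA for any pair involving a column
# that has a missing value.
cor_default <- cor(counts)

# use = "complete.obs" drops every row containing an NA before computing.
cor_complete <- cor(counts, use = "complete.obs")
print(round(cor_complete, 3))
```

With only five complete rows the numbers here are not meaningful on their own; the point is the handling of the missing values.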

Look at traffic counts by district:

You can see here that not all districts have bimodal traffic distributions. Districts 1, 2, and 3 are all less built up and have mostly decreasing distributions. District 4 has a few peaks and is taller with a short tail. It contains the San Francisco/Silicon Valley area and has spread-out high traffic. District 7 contains Los Angeles and it has a large second peak where there is a lot of traffic. The last one with a big second peak is District 12, which covers Orange County and is headquartered in Irvine. This district is unique in that it does not have many areas with low traffic. It has the highest traffic rates and a peak of roadway sites with traffic between 200 and 300 thousand cars a day. As a side note, I once had a tour of the Irvine traffic control center and it was pretty impressive!

## Warning: Removed 537 rows containing missing values (geom_point).

## Warning: Removed 537 rows containing missing values (geom_point).

## Warning: Removed 537 rows containing missing values (geom_point).

## Warning: Removed 537 rows containing missing values (geom_point).

These were some of my first plots. They are not very easy to read although they are colorful! I hope I am not penalized for leaving them in. Orange County has the highest traffic, and then LA.

## Warning: Removed 537 rows containing non-finite values (stat_boxplot).

## Warning: Removed 537 rows containing non-finite values (stat_boxplot).

## Warning: Removed 537 rows containing non-finite values (stat_boxplot).

## Warning: Removed 537 rows containing non-finite values (stat_boxplot).

The statistics of the ratio of peak month AADT to peak hour AADT:

# Peak month daily vs peak hourly
summary(dfp3$back_peak_month/dfp3$back_peak_hour)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.095  10.000  11.460  11.430  12.910  27.000     537

summary(dfp3$ahead_peak_aadt/dfp3$ahead_peak_hour)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.167  10.000  11.440  11.430  12.890  25.470     537

# Average daily vs peak hourly
summary(dfp3$back_aadt/dfp3$back_peak_hour)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.063   8.800  10.430  10.390  12.190  17.220     537

summary(dfp3$ahead_aadt/dfp3$ahead_peak_hour)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   8.789  10.430  10.390  12.190  17.820     537

Look at population by county.

Los Angeles has such a high population that it skews the results. Plotting the distribution on a log10 scale shows all of the counties more clearly.

Traffic Counts by population - This won’t work very well in a bivariate plot because population is the same for all points in a county.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Los Angeles County has a much larger population than all the other counties. On average, the peak month average daily traffic was about 11.4 times the peak hourly traffic, while the annual average daily traffic was only about 10.4 times the peak hourly traffic.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Only three counties (Los Angeles, San Bernardino, and Kern) had roads with route suffixes appended, while fourteen counties had alignment changes and all fifty-eight had at least one postmile prefix (most frequently ‘R’).

I noticed in the plots that the postmile prefix ‘M’ had the highest mean AADT. ‘M’ is the realignment of ‘R’ mileage, where ‘R’ was the first realignment. So it seems like when a second realignment is needed for a roadway, it tends to have higher traffic than the first realignment (‘R’), which also tended to have higher traffic than non-realigned roadways.

The peak month daily traffic was an average of 11 times greater than the peak hourly traffic.

What was the strongest relationship you found?

All of the traffic counts were correlated with each other, but that was to be expected.

Multivariate Plots Section

Looking at Traffic counts on the routes that go through Alameda county.

## Warning: Removed 18 rows containing missing values (geom_path).

## Warning: Removed 19 rows containing missing values (geom_point).

I-80 has less than 10 miles of roadway in the county, but it has some of the highest traffic. I-880 also has a lot of traffic at its beginning, but the traffic then drops. This plot is a little busy and may not be optimal.

Looking at Highway 101

Instead of focusing on a county, I will focus on a route. Highway 101 from South to North goes through these counties in order (from http://en.wikipedia.org/wiki/U.S._Route_101_in_California):

LA  - Los Angeles
VEN - Ventura
SB  - Santa Barbara
SLO - San Luis Obispo
MON - Monterey
SBT - San Benito
SCL - Santa Clara
SM  - San Mateo
SF  - San Francisco
MRN - Marin
SON - Sonoma
MEN - Mendocino
HUM - Humboldt
DN  - Del Norte

Look at the traffic on Highway 101 throughout its length

Plotting traffic along Highway 101 with help from http://stackoverflow.com/questions/13616515/recommend-a-scale-colour-for-13-or-more-categories and http://stackoverflow.com/questions/17844494/change-grid-line-behavior-in-ggplot2

Adding in peak traffic (the triangles)…

The triangles are peak month AADT. It looks like all of the peak traffic locations still have room for even more traffic at peak seasons! Adding the peak traffic makes the plot look a little busy, so I will not use it for one of my final plots.

I created a separate dataset called hw101 that is a subset of the main dataset, but also has cumulative mileage. What are the column names in hw101 in the knitted dataset?

##  [1] "dist"            "route"           "route_suffix"   
##  [4] "county"          "postmile_prefix" "postmile"       
##  [7] "alignment"       "description"     "back_peak_hour" 
## [10] "back_peak_month" "back_aadt"       "ahead_peak_hour"
## [13] "ahead_peak_aadt" "ahead_aadt"      "county_name"    
## [16] "district"        "county_pop"      "county_postmile"
## [19] "cm101"           "cm"

## 'data.frame':    507 obs. of  20 variables:
##  $ dist           : Factor w/ 12 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ route          : Factor w/ 243 levels "1","2","3","4",..: 90 90 90 90 90 90 90 90 90 90 ...
##  $ route_suffix   : Factor w/ 3 levels "","S","U": 1 1 1 1 1 1 1 1 1 1 ...
##  $ county         : Ord.factor w/ 14 levels "LA"<"VEN"<"SB"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ postmile_prefix: Factor w/ 8 levels "","C","D","L",..: 7 7 7 7 1 1 1 1 1 1 ...
##  $ postmile       : num  0 0.624 0.907 1.329 0 ...
##  $ alignment      : Factor w/ 3 levels "","L","R": 1 1 1 1 1 1 1 1 1 1 ...
##  $ description    : Factor w/ 5762 levels ""," JCT. RTE. 101",..: 2902 2864 2861 2885 2885 2994 2814 2973 2889 2869 ...
##  $ back_peak_hour : int  NA 9000 7900 7700 NA 13800 13300 12700 13600 15800 ...
##  $ back_peak_month: int  NA 135000 121000 120000 NA 211000 207000 201000 218000 267000 ...
##  $ back_aadt      : int  NA 132000 120000 118000 NA 207000 205000 198000 215000 264000 ...
##  $ ahead_peak_hour: int  9000 7900 7700 NA 13800 13300 12700 13600 15800 15400 ...
##  $ ahead_peak_aadt: int  135000 121000 120000 NA 211000 207000 201000 218000 267000 261000 ...
##  $ ahead_aadt     : int  132000 120000 118000 NA 207000 205000 198000 215000 264000 257000 ...
##  $ county_name    : Ord.factor w/ 14 levels "Los Angeles"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ district       : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ county_pop     : int  9946947 9946947 9946947 9946947 9946947 9946947 9946947 9946947 9946947 9946947 ...
##  $ county_postmile: chr  "LA_S_0" "LA_S_0.624" "LA_S_0.907" "LA_S_1.329" ...
##  $ cm101          : num [1:507(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cm             : num [1:507(1d)] 0 0.624 0.907 1.329 0 ...
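
The idea behind the cumulative mileage columns can be sketched as follows: postmiles restart at each county line, so each county gets an offset equal to the total length of the counties before it. This is a simplified sketch with hypothetical postmiles, not the code that produced `cm101` and `cm`; in particular it ignores postmile prefixes, where postmiles can restart within a county.

```r
# Hypothetical postmiles for three counties in south-to-north route order.
seg <- data.frame(
  county   = c("LA", "LA", "VEN", "VEN", "SB", "SB"),
  postmile = c(0, 10, 0, 20, 0, 5),
  stringsAsFactors = FALSE
)
route_order <- c("LA", "VEN", "SB")

# Length of the route in each county, taken as the largest postmile seen there.
county_len <- tapply(seg$postmile, factor(seg$county, levels = route_order), max)

# Offset of each county = total length of the counties that come before it.
offset <- c(0, cumsum(county_len)[-length(county_len)])
names(offset) <- route_order

# Cumulative mileage = county offset + postmile within the county.
seg$cm <- unname(offset[seg$county]) + seg$postmile
print(seg)
```

Plotting traffic against `cm` instead of `postmile` puts the whole route on a single continuous x axis.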

## Warning: Removed 1 rows containing missing values (geom_path).

## Warning: Removed 16 rows containing missing values (geom_point).

## Warning: Removed 15 rows containing missing values (geom_path).

I tried to plot county population and ahead Annual Average Daily Traffic (ahead AADT) along Highway 101 on the same plot to show how traffic could be related to population, but I could not easily figure out how to get two scales on the same chart (which I used to do all the time in gnuplot with y1 and y2 axes). It seems that the R way is not to mix scales but instead to facet. If I had more time I would try to plot them together with a facet. I limited the y scale, so some higher populations were left out. If I were using this graph I would have to footnote that.

The next plot added colors to the county-faceted ahead AADT plots.

These are a set of faceted plots like one of my final plots, but it does not have a free x scale. Each plot is as wide and scaled as the county with the most mileage along highway 101. The legend is outside the plot.

The next set of faceted plots brings the legend inside and colors it to stand out. It doesn’t provide any new information.

Another way to represent the traffic along highway 101 is by a boxplot by county. It shows the traffic distribution in each county. I also color-coded it by district. Adjacent counties can be in the same district. It is obvious that districts 7 and 4 have higher traffic along highway 101 than districts 5 and 1.

Look at traffic counts by postmile prefix per county along Highway 101: The most frequent is blank (no prefix), and the second most frequent is ‘R’, the first re-alignment. Mendocino has some ‘T’, a temporary connection. Los Angeles has some ‘S’, a spur. Del Norte also has an ‘M’, which is a realignment of ‘R’ (second realignment).

Another way to look at the postmile prefix is to graph them by amount of traffic and group by county. The spurs usually have lower traffic. Using a more divergent color scheme (I created it manually) helps.

# Ahead AADT on 101 by county
ggplot(hw101, aes(x = county_name, y = ahead_aadt)) +
  geom_point(aes(colour = postmile_prefix), alpha = .5) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ggtitle("Highway 101 Ahead AADT Per County and Postmile Prefix") +
  xlab("County") + ylab("Ahead Annual Average Daily Traffic (AADT)") +
  #scale_fill_discrete(name = "Postmile Prefix") +
  scale_color_manual(values=c("green", "orange", "red", "black", "gold"),
                     name = "Postmile Prefix")

## Warning: Removed 16 rows containing missing values (geom_point).

Next I looked at the traffic counts with ggpairs. They should be correlated, and they are.

The ggpairs plot did not come out well. It may not be polished, but it shows the correlations.

Looking at the differences in back and ahead peak month (daily) and peak hour traffic, the differences are obviously bigger over an entire day. If I had more time I would investigate by looking at the descriptions for the locations that had the biggest differences to see whether there was a large roadway at the intersection that was adding or siphoning off traffic.
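A minimal sketch of that follow-up, assuming a data frame with the same column names as the dataset. The rows and the names `peaks` and `biggest` are made up for illustration.

```r
# Made-up rows; the column names mirror the dataset, the values do not.
peaks <- data.frame(
  description     = c("JCT. RTE. A", "B STREET", "JCT. RTE. C"),
  back_peak_hour  = c(5400, 3000, 4250),
  ahead_peak_hour = c(3900, 3150, 4300),
  stringsAsFactors = FALSE
)

# Signed difference, then sort by its absolute size (largest first).
peaks$diff <- peaks$back_peak_hour - peaks$ahead_peak_hour
biggest <- peaks[order(-abs(peaks$diff)), ]

# The descriptions at the top are the locations worth reading first.
head(biggest, 2)
```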

Limiting the difference to Highway 101 along its cumulative mileage makes the labelling easier and reduces the clutter. We know there are big differences in San Francisco and you can see the points here.

```r
library(grid)  # provides unit() for legend.key.size

# Difference in Back and Ahead Peak Hour by County
ggplot(data = subset(dfp3, !is.na(ahead_peak_hour) & !is.na(back_peak_hour)),
       aes(x = county, y = back_peak_hour - ahead_peak_hour)) +
  geom_jitter(alpha = .5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour = "grey20", size = 8, angle = 45,
                                   hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(colour = "grey20", size = 10, angle = 0,
                                   hjust = 1, vjust = 0, face = "plain"),
        axis.title.x = element_text(colour = "grey20", size = 12, angle = 0,
                                    hjust = .5, vjust = 0, face = "plain"),
        axis.title.y = element_text(colour = "grey20", size = 12, angle = 90,
                                    hjust = .5, vjust = .5, face = "plain"),
        legend.text = element_text(size = 8),
        legend.key.size = unit(.3, "cm")) +
  guides(colour = guide_legend(ncol = 2)) +
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
  ggtitle(label = "CA Difference in Back and Ahead Peak Hour by County")

# Difference in Back and Ahead Peak Hour along Highway 101
ggplot(data = subset(hw101, !is.na(ahead_peak_hour) & !is.na(back_peak_hour)),
       aes(x = cm, y = back_peak_hour - ahead_peak_hour)) +
  geom_point(alpha = .5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour = "grey20", size = 8, angle = 45,
                                   hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(colour = "grey20", size = 10, angle = 0,
                                   hjust = 1, vjust = 0, face = "plain"),
        axis.title.x = element_text(colour = "grey20", size = 12, angle = 0,
                                    hjust = .5, vjust = 0, face = "plain"),
        axis.title.y = element_text(colour = "grey20", size = 12, angle = 90,
                                    hjust = .5, vjust = .5, face = "plain"),
        legend.text = element_text(size = 8),
        legend.key.size = unit(.3, "cm")) +
  guides(colour = guide_legend(ncol = 1)) +
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
  ggtitle(label = "Highway 101 Difference in Back and Ahead Peak Hour by County")
```

Traffic and population by county. Here I made a new summary dataframe with the mean, median, min, max, and count of back peak month traffic for each county, and then added in the county population.
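The summary step might look roughly like the following base R sketch. The report mentions dplyr, so this is only an illustration, not the code that produced the results. The rows and the names `counts`, `ok`, and `pop` are made up, while `bpm_mean`, `n`, and `pop20120701` mirror names that appear in the output below.

```r
# Hypothetical miniature of the traffic data (values made up for illustration).
counts <- data.frame(
  county_name     = c("Orange", "Orange", "Orange", "Alpine", "Alpine"),
  back_peak_month = c(40000, 38500, NA, 1200, 1500),
  stringsAsFactors = FALSE
)

# Drop missing counts first, then summarise each county separately.
ok <- counts[!is.na(counts$back_peak_month), ]
bpm_by_county <- do.call(rbind, lapply(split(ok, ok$county_name), function(d) {
  data.frame(
    county_name = d$county_name[1],
    bpm_mean    = mean(d$back_peak_month),
    bpm_median  = median(d$back_peak_month),
    bpm_min     = min(d$back_peak_month),
    bpm_max     = max(d$back_peak_month),
    n           = nrow(d),
    stringsAsFactors = FALSE
  )
}))

# Hypothetical population table, joined on the county name.
pop <- data.frame(
  county_name = c("Alpine", "Orange"),
  pop20120701 = c(1100, 3090000),
  stringsAsFactors = FALSE
)
bpm_by_county <- merge(bpm_by_county, pop, by = "county_name")
bpm_by_county
```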

I looked at the mean county traffic vs. the number of traffic counts in the county. I could see a relationship, but it wasn’t exactly linear. Then I plotted mean county traffic vs. population. It looked vaguely logarithmic, so I tried taking the log of each variable in turn until plotting the logs of both made it look linear. I added an lm line to the plot.

Next I try to correlate the log of population with the log of back peak month traffic, since it looks linear.

# It looks kind of linear when you take the log of population and traffic
p.bpm.pop.cor <- with(dfp3.bpm_by_county,
                      cor.test(log10(pop20120701), log10(bpm_mean)))
k.bpm.pop.cor <- with(dfp3.bpm_by_county,
     cor.test(log10(pop20120701),log10(bpm_mean), method = "kendall"))
s.bpm.pop.cor <- with(dfp3.bpm_by_county,
     cor.test(log10(pop20120701),log10(bpm_mean), method = "spearman"))
# Pearson
p.bpm.pop.cor

## 
##  Pearson's product-moment correlation
## 
## data:  log10(pop20120701) and log10(bpm_mean)
## t = 18.1356, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8750290 0.9547327
## sample estimates:
##       cor 
## 0.9243962

# Kendall
k.bpm.pop.cor

## 
##  Kendall's rank correlation tau
## 
## data:  log10(pop20120701) and log10(bpm_mean)
## z = 8.6734, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.7822142

# Spearman
s.bpm.pop.cor

## 
##  Spearman's rank correlation rho
## 
## data:  log10(pop20120701) and log10(bpm_mean)
## S = 2252, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9307269

The Pearson and Spearman tests both show a correlation of about 0.93, while Kendall’s tau is lower at 0.78. The p-value is very low (< 2.2e-16) for all three tests, so we reject the null hypothesis that the true correlation is equal to 0.
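The lower Kendall value does not by itself indicate a weaker association, because tau is measured on a different scale. As a rough check, if the log-transformed variables were bivariate normal (an assumption that is not tested in this report), standard relations tie Pearson's r to the other two coefficients: tau = (2/pi) asin(r) and rho = (6/pi) asin(r/2). The names `tau_expected` and `rho_expected` are mine.

```r
# Pearson correlation reported above for log10(pop20120701) vs. log10(bpm_mean).
r <- 0.9243962

# Values implied by r under bivariate normality (an assumption, not a result).
tau_expected <- (2 / pi) * asin(r)       # Kendall's tau
rho_expected <- (6 / pi) * asin(r / 2)   # Spearman's rho

round(c(tau = tau_expected, rho = rho_expected), 2)
```

This gives about 0.75 for tau and 0.92 for rho, close to the observed 0.78 and 0.93, so the three tests tell a consistent story.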

Now I am going to look at Back Annual Average Daily Traffic, which I think should be more representative of the population of a county than peak traffic.

## Warning: joining factor and character vector, coercing into character
## vector

## 
##  Pearson's product-moment correlation
## 
## data:  n and baadt_mean
## t = 6.6872, df = 56, p-value = 1.136e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4928630 0.7888568
## sample estimates:
##       cor 
## 0.6663318

## 
##  Pearson's product-moment correlation
## 
## data:  log10(pop20120701) and log10(baadt_mean)
## t = 18.5323, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8796410 0.9564716
## sample estimates:
##      cor 
## 0.927257

## 
##  Kendall's rank correlation tau
## 
## data:  log10(pop20120701) and log10(baadt_mean)
## z = 8.66, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.7810042

## 
##  Spearman's rank correlation rho
## 
## data:  log10(pop20120701) and log10(baadt_mean)
## S = 2232, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9313421

It looks like the Pearson and Spearman correlations are slightly higher using the annual average rather than peak daily traffic, but the Kendall correlation is slightly lower.

Now we can try to do a linear model.

## 
## Call:
## lm(formula = baadt_mean ~ pop20120701, data = dfp3.baadt_by_county)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71003 -16400  -6743   3077  79446 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.366e+04  4.038e+03   5.860 2.57e-07 ***
## pop20120701 2.020e-02  2.575e-03   7.845 1.39e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27960 on 56 degrees of freedom
## Multiple R-squared:  0.5236, Adjusted R-squared:  0.5151 
## F-statistic: 61.54 on 1 and 56 DF,  p-value: 1.392e-10

## 
## Call:
## lm(formula = log10(baadt_mean) ~ log10(pop20120701), data = dfp3.baadt_by_county)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37008 -0.15914 -0.03735  0.12866  0.49741 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.11676    0.17377   6.427 3.05e-08 ***
## log10(pop20120701)  0.60937    0.03288  18.532  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1963 on 56 degrees of freedom
## Multiple R-squared:  0.8598, Adjusted R-squared:  0.8573 
## F-statistic: 343.4 on 1 and 56 DF,  p-value: < 2.2e-16

Both models show a highly significant relationship between traffic and population. The standard error should be well below the coefficient estimate. The ratio of the two is the t value, which ranges from about 6 to 8 for the linear model and from about 6 to 19 for the log-log model. R-squared is higher for the log-log model (0.86 vs. 0.52), which suggests it fits better, although the two values are measured on different response scales.
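To make the log-log model easier to read, its fitted line can be converted back to the original units. This sketch copies the two coefficients from the summary above; `predict_baadt` is a hypothetical helper name, and the one-million population is just an example input.

```r
# Coefficients copied from the log-log model summary above.
intercept <- 1.11676
slope     <- 0.60937

# Back-transform: log10(traffic) = intercept + slope * log10(pop)
predict_baadt <- function(pop) 10^(intercept + slope * log10(pop))

# Predicted mean back AADT for a hypothetical county of one million people.
predict_baadt(1e6)

# Doubling the population multiplies predicted traffic by 2^slope.
predict_baadt(2e6) / predict_baadt(1e6)
```

The first value comes out at roughly 59,000 vehicles per day, and the ratio is about 1.5. In other words, under this model mean traffic grows more slowly than population.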

Next I tried to plot traffic per population (ahead_peak_hour/county_pop). I noticed that Sierra county had some high outliers - this county does not have a large population but it does have I-80 which gets a lot of traffic.

library(scales)  # provides log10_trans()

ggplot(data = dfp3, aes(x = county, y = ahead_peak_hour/county_pop)) + 
  geom_boxplot() + 
  scale_y_continuous(trans=log10_trans()) +
  ggtitle("CA traffic per Population in July 2012 by County") +
  theme(axis.text.x = element_text(angle=90, vjust = 0.5)) +
  xlab("County") + ylab("Log10 Axis Traffic (ahead peak hour) per Population")

Now order the counties by the number of traffic counts per county (descending).

When it is not plotted on a log scale, there is a huge outlier for Sierra County. But it looks like traffic does not go up linearly with population: because LA has such a large population, it has a very small traffic count/population ratio.
Alpine has a large ratio - either it has a lot of visitors, or 1/4 of the population drives on each state route.

I plot it again in log scale.

## Warning: Removed 537 rows containing non-finite values (stat_boxplot).

## Warning: Removed 537 rows containing non-finite values (stat_boxplot).

Next I used my cumulative mileage plot for Highway 101 as a starting point and read the location descriptions to find causes for fluctuations in traffic, focusing on S Santa Barbara and Santa Clara counties. It looks like a AAA TripTik!

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Looking at the descriptions helped me understand the traffic flows better, but not in a statistical way. If I had done some text analysis I could have made more use of the descriptions. It might also have helped to know how many lanes the highways had at each point.

My early investigations were mostly by county, since that was a convenient feature to use. Later I was able to add the county populations, which helped to clarify why certain counties were showing certain properties - it was based at least somewhat on their population. These two features are not independent, but since population is numeric, it helps to quantify the relationship.

I also saw how the back and ahead traffic counts in an intersection could vary.

Were there any interesting or surprising interactions between features?

I saw that the traffic counts were bimodal. When I faceted the log of the traffic counts by high and low county populations (greater or less than 1 million), I came up with separate distributions that looked more normal, although the larger counties had traffic counts that were heavily weighted toward the higher end.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a model to predict average traffic for a county based on the county population. Since I only had the population for the county as a whole, it couldn’t be too accurate - the traffic fluctuates within a county. The model actually relates the log of the traffic to the log of the population. It could perhaps be made stronger if I had more data about the type of roadway (number of lanes, etc.) and more demographic information.

Final Plots and Summary

Plot One

Description One

This is a faceted plot of the three back traffic types along Highway 101 by county. The mileposts run from South to North. Each count type has a different shape and color. I kept the y (traffic count) axis the same for all counties so they could be compared, but let the x (postmile) axis free so some counties are more spaced out.

This plot type, unlike the unfaceted plot later, lets the 3 traffic types show clearly in each county, although the scale doesn’t allow you to see much variation in hourly traffic.
Back peak hour traffic is always lower than the other two, which measure entire days. The traffic hits its peak in Los Angeles county, where counts are closely spaced together, and then decreases through the rest of Los Angeles and Ventura counties. There is a local peak in Santa Barbara, and counts are farther apart in the middle of Santa Barbara. They start rising slightly travelling North in Monterey, and increase more in Santa Clara and San Mateo, before peaking in San Mateo county and starting to decrease. There is a big drop in San Francisco county, as Highway 101 goes through the city - further investigation is warranted.

Looking at the raw traffic data, the drop occurs at the junction with I-80 (Bay Bridge), before S. Van Ness Avenue - a city street does not have the same capacity as a highway and much of the traffic continues to I-80.

Traffic increases when approaching the Golden Gate Bridge and then peaks in Marin County in San Rafael, peaks (but lower) in Sonoma county in Santa Rosa, and then decreases to lower levels in the northern counties.

Plot Two

Description Two

This plot shows the attempt to find a relationship between average traffic on county roadways and the population of the county. The AADT is the Annual Average Daily Traffic - the total volume for the year divided by 365 days. My assumption is that a county with higher population would have higher average traffic counts. This might not always be the case because some cities have higher public transportation usage.

I also assumed that the annual daily average would have a higher correlation to population than the daily or seasonal peak, since it isn’t a peak that could be influenced by tourists or commuters.

Plot Three

Description Three

This was the plot I worked on the longest. It shows the Ahead Annual Average Daily Traffic (Ahead AADT) for California Highway 101 from South to North. I had to create the cumulative total mileage for Highway 101 by adding the maximum postmile reading from each county to that of all counties south of it along the highway (counties in order determined from Wikipedia entry). There are some counties which only have a short distance and thus the mileage almost overlaps with the neighboring borders.
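The cumulative mileage calculation described above can be sketched as follows. The county codes, lengths, and postmiles are made up, and `county_len`, `offset`, and `counts` are hypothetical names; only `cm` matches the column used in the plots. The sketch also ignores complications such as realigned postmile prefixes.

```r
# Hypothetical maximum postmile per county, listed from south to north.
county_len <- data.frame(
  county       = c("LA", "VEN", "SB"),
  max_postmile = c(38.2, 43.6, 90.9),   # made-up lengths
  stringsAsFactors = FALSE
)

# Offset for a county = total length of all counties south of it.
county_len$offset <- c(0, head(cumsum(county_len$max_postmile), -1))

# Hypothetical count locations with county-relative postmiles.
counts <- data.frame(
  county   = c("LA", "VEN", "VEN", "SB"),
  postmile = c(10, 0.5, 40, 12.3),
  stringsAsFactors = FALSE
)

# Cumulative mileage = county offset + postmile within the county.
counts$cm <- counts$postmile +
  county_len$offset[match(counts$county, county_len$county)]
counts
```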

I used a palette that provides a contrast between neighboring counties, instead of the standard palette. It was difficult to find a palette like that which had enough colors, so I had to add two manually.

This plot shows how the traffic varies with location. I also did a subplot for part of Santa Barbara (in multiplot section) where I included the location description, which demonstrates what is going on at the location.

This plot shows that Los Angeles county has some of the highest average daily traffic. It does a better job of showing how the traffic compares along the entire route than the faceted plot does. The traffic increases around the San Francisco Bay Area. The northern part of the state has the lowest traffic.

Reflection

My biggest struggle in this analysis was knowing when to quit. Even now, I keep finding more things I would like to investigate and different ways I can improve what I’ve already done, but I’ve been working on this for over 3 weeks and would like to finish. When I first picked the dataset, I wasn’t sure it was good enough. I didn’t know what I could model. I had some suggestions to add population data and that really helped make it more interesting.

For a long time I struggled with how to present the data. I wanted to show traffic patterns along a highway, and my first approach was to facet it by county. I didn’t realize I could free the scales so that different counties could have different lengths. I had to gradually learn the features of ggplot2 and dplyr to be able to do what I wanted. So in the beginning I wrote some unpolished code, and as I went along and tried to do more things I learned more and changed my way of doing things. But I did not always go back and update what I had already done, so you can see different levels of skill in the work.

If I had highway lane data, that might help make better traffic predictions. Also, I wanted to plot population along Highway 101 to see how it correlated to traffic. So I just went back and tried to do it, but realized it is not needed and I need to go to sleep, so I left what I had there. I had to scale y to leave out Los Angeles as it was skewing the plot too much. I wanted to plot population and traffic on different axes but ggplot2 doesn’t seem to allow that. I suppose I could scale the population by dividing by some amount. I am leaving that for future work.

My biggest struggle will be to organize the plots in a logical order and comment them all. With a single laptop screen and a broken printer, I have no easy way to go through it.