## Warning: joining factor and character vector, coercing into character
## vector
Dimensions of dataset: 6768, 17

Column names:

```
## [1] "dist" "route" "route_suffix"
## [4] "county" "postmile_prefix" "postmile"
## [7] "alignment" "description" "back_peak_hour"
## [10] "back_peak_month" "back_aadt" "ahead_peak_hour"
## [13] "ahead_peak_aadt" "ahead_aadt" "county_name"
## [16] "district" "county_pop"
```

Structure of dataset:

```
## 'data.frame': 6768 obs. of 17 variables:
## $ dist : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ route : Factor w/ 243 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ route_suffix : Factor w/ 3 levels "","S","U": 1 1 1 1 1 1 1 1 1 1 ...
## $ county : chr "ORA" "ORA" "ORA" "ORA" ...
## $ postmile_prefix: Factor w/ 8 levels "","C","D","L",..: 6 6 1 1 1 1 6 6 1 1 ...
## $ postmile : num 0.129 0.78 8.43 9.418 9.6 ...
## $ alignment : Factor w/ 3 levels "","L","R": 1 1 1 1 1 1 1 1 1 1 ...
## $ description : Factor w/ 5762 levels ""," JCT. RTE. 101",..: 1129 1128 2623 2622 2619 2625 3508 3509 3510 3507 ...
## $ back_peak_hour : int NA 3750 2850 3000 3350 3150 4000 4250 4350 5400 ...
## $ back_peak_month: int NA 40000 38500 40500 39500 37500 49500 53000 53000 52000 ...
## $ back_aadt : int NA 37000 36000 38000 37000 35000 45000 48000 48700 48500 ...
## $ ahead_peak_hour: int 3750 3900 2850 3400 3350 3150 4800 4250 5300 5400 ...
## $ ahead_peak_aadt: int 40000 42000 38500 40500 39500 38500 59000 53000 52000 52000 ...
## $ ahead_aadt : int 37000 38500 36000 38000 37000 35000 54000 48000 48700 48500 ...
## $ county_name : chr "Orange" "Orange" "Orange" "Orange" ...
## $ district : int 12 12 12 12 12 12 12 12 12 12 ...
## $ county_pop : int 3073540 3073540 3073540 3073540 3073540 3073540 3073540 3073540 3073540 3073540 ...
```

Summary of dataset:

```
## dist route route_suffix county
## 4 :1084 101 : 507 :6738 Length:6768
## 7 : 913 5 : 420 S: 20 Class :character
## 3 : 771 1 : 278 U: 10 Mode :character
## 6 : 721 99 : 254
## 11 : 609 80 : 157
## 8 : 581 10 : 152
## (Other):2089 (Other):5000
## postmile_prefix postmile alignment
## :4695 Min. : 0.000 :6649
## R :1844 1st Qu.: 5.752 L: 54
## M : 71 Median : 15.367 R: 65
## T : 70 Mean : 22.308
## L : 63 3rd Qu.: 30.408
## S : 16 Max. :186.238
## (Other): 9 NA's :6
## description back_peak_hour
## JCT. RTE. 5 : 33 Min. : 10
## JCT. RTE. 101 : 16 1st Qu.: 940
## JCT. RTE. 99 : 14 Median : 2600
## NEVADA STATE LINE : 13 Mean : 5326
## JCT. RTE. 15 : 12 3rd Qu.: 8800
## LOS ANGELES/SAN BERNARDINO COUNTY LINE: 12 Max. :31000
## (Other) :6668 NA's :537
## back_peak_month back_aadt ahead_peak_hour ahead_peak_aadt
## Min. : 100 Min. : 80 Min. : 10 Min. : 100
## 1st Qu.: 9400 1st Qu.: 8200 1st Qu.: 940 1st Qu.: 9500
## Median : 29000 Median : 25600 Median : 2600 Median : 29000
## Mean : 68262 Mean : 64726 Mean : 5333 Mean : 68318
## 3rd Qu.:109000 3rd Qu.:104000 3rd Qu.: 8800 3rd Qu.:109000
## Max. :406000 Max. :377500 Max. :31000 Max. :406000
## NA's :537 NA's :537 NA's :537 NA's :537
## ahead_aadt county_name district county_pop
## Min. : 80 Length:6768 Min. : 1.000 Min. : 1114
## 1st Qu.: 8400 Class :character 1st Qu.: 4.000 1st Qu.: 182640
## Median : 26000 Mode :character Median : 6.000 Median : 822403
## Mean : 64773 Mean : 6.134 Mean :2000090
## 3rd Qu.:104000 3rd Qu.: 8.000 3rd Qu.:2249045
## Max. :377500 Max. :12.000 Max. :9946947
## NA's :537
```

* The maximum length of any route in a single county is approximately 186 miles.
* The hourly peak traffic ranges from 10 vehicles to 31,000 vehicles.
* The daily peak traffic ranges from 100 vehicles to 406,000 vehicles.
* The average daily traffic ranges from 80 vehicles to 377,500 vehicles.
I am not loading the CA traffic districts as a dataset, but am including the table here to help understand the plots by district.

Traffic Districts (from http://en.wikipedia.org/wiki/California_Department_of_Transportation#Districts)

| District | Counties | Headquarters |
|---|---|---|
| 1 | Del Norte, Humboldt, Lake, Mendocino | Eureka |
| 2 | Lassen, Modoc, Plumas, Shasta, Siskiyou, Tehama, Trinity; portions of Butte and Sierra | Redding |
| 3 | Butte, Colusa, El Dorado, Glenn, Nevada, Placer, Sacramento, Sierra, Sutter, Yolo, Yuba | Marysville |
| 4 | Alameda, Contra Costa, Marin, Napa, San Francisco, San Mateo, Santa Clara, Solano, Sonoma | Oakland |
| 5 | Monterey, San Benito, San Luis Obispo, Santa Barbara, Santa Cruz | San Luis Obispo |
| 6 | Madera, Fresno, Tulare, Kings, Kern | Fresno |
| 7 | Los Angeles, Ventura | Los Angeles |
| 8 | Riverside, San Bernardino | San Bernardino |
| 9 | Inyo, Mono | Bishop |
| 10 | Alpine, Amador, Calaveras, Mariposa, Merced, San Joaquin, Stanislaus, Tuolumne | Stockton |
| 11 | Imperial, San Diego | San Diego |
| 12 | Orange | Irvine |
Frequency of District traffic counts:
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 325 411 771 1084 460 721 913 581 98 533 609 262
##
## 4 7 3 6 11 8 10 5 2 1 12 9
## 1084 913 771 721 609 581 533 460 411 325 262 98
##
## 101 5 1 99 80 10 15 49 33 20 405 12 210 8 580 78 91 395
## 507 420 278 254 157 152 130 122 92 77 74 73 72 70 70 69 68 67
## 50 4
## 66 65
Frequency of route_suffix, county, postmile_prefix, alignment, traffic:
## Source: local data frame [3 x 3]
##
## route_suffix n description
## 1 6738 Normal Route
## 2 S 20 Supplemental Route
## 3 U 10 Unrelinquished Route
##
## Alameda Alpine Amador Butte
## 220 19 47 96
## Calaveras Colusa Contra Costa Del Norte
## 50 36 109 36
## El Dorado Fresno Glenn Humboldt
## 80 183 42 146
## Imperial Inyo Kern Kings
## 156 45 267 49
## Lake Lassen Los Angeles Madera
## 45 41 762 56
## Marin Mariposa Mendocino Merced
## 60 31 98 103
## Modoc Mono Monterey Napa
## 22 53 111 59
## Nevada Orange Placer Plumas
## 68 261 103 44
## Riverside Sacramento San Benito San Bernardino
## 242 151 27 339
## San Diego San Francisco San Joaquin San Luis Obispo
## 453 52 139 126
## San Mateo Santa Barbara Santa Clara Santa Cruz
## 157 128 202 70
## Shasta Sierra Siskiyou Solano
## 138 22 84 93
## Sonoma Stanislaus Sutter Tehama
## 132 91 40 54
## Trinity Tulare Tuolumne Ventura
## 28 163 54 152
## Yolo Yuba
## 89 44
## Warning in tapply(X = X, INDEX = x, FUN = FUN, ...): NAs introduced by
## coercion
Postmile prefix - Definitions of Postmile prefix joined to count of each prefix. Note that not all prefixes are in use.
##
## C D L M R S T
## 4695 6 3 63 71 1844 16 70
## Prefix Meaning
## 1
## 2 C Commercial Lanes
## 3 D Duplicate postmile at meandering county line
## 4 G Reposting of duplicate postmile oat end of route
## 5 H Overlap of D mileage
## 6 L Overlap postmile
## 7 M Realignment of R mileage
## 8 N Realignment of M mileage
## 9 R First realignment
## 10 S Spur
## 11 T Temporary connection
## 12 U Unrelinquished
## Warning: joining factors with different levels, coercing to character
## vector
## Source: local data frame [8 x 3]
##
## postmile_prefix n Meaning
## 1 4695
## 2 C 6 Commercial Lanes
## 3 D 3 Duplicate postmile at meandering county line
## 4 L 63 Overlap postmile
## 5 M 71 Realignment of R mileage
## 6 R 1844 First realignment
## 7 S 16 Spur
## 8 T 70 Temporary connection
Alignment - Definitions of alignment prefix joined to count of each prefix
## Source: local data frame [3 x 3]
##
## alignment n description
## 1 6649
## 2 L 54 Left independent alignment
## 3 R 65 Right independent alignment
It is difficult to see the hourly and daily traffic counts in the same graph because the scales are so different. If I use free scales for the facets it might look better.
Hmm, I seem to remember seeing this type of distribution in one of the lessons and it was transformed into a normal distribution somehow… Lesson 3. By taking log. I will try to take the log or sqrt.
It looks bimodal - there must be different types of highways. Both transformations look more normal - log has no tail on the right, sqrt has no tail on the left.
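As a rough illustration of why the two transformations behave differently, the sketch below applies `sqrt` and `log10` to simulated right-skewed counts and compares their skewness. The data are synthetic (drawn with `rlnorm`), not the Caltrans counts, and `skewness` is a small helper defined here only for illustration.

```r
# Synthetic right-skewed "traffic counts" (an assumption, not the real data)
set.seed(42)
counts <- round(rlnorm(5000, meanlog = 10, sdlog = 1.5))
counts <- counts[counts > 0]

# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw  <- skewness(counts)
skew_sqrt <- skewness(sqrt(counts))
skew_log  <- skewness(log10(counts))

# Positive skew = long right tail; values near 0 = roughly symmetric
round(c(raw = skew_raw, sqrt = skew_sqrt, log10 = skew_log), 2)
```

For data like these, the square root shortens the right tail but leaves some of it, while the log removes most of it.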
Try to figure out why it is bimodal - add population to tidyData and cut it into two levels: county population under and over 1 Million people.
When I separate the counties into those with under a million people and those with over a million, the bimodality separates out. Counties under a million have more normally distributed traffic counts. Counties with over a million people have a disproportionate number of locations with high traffic counts.
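The split into two population levels can be sketched with `cut`. The four populations below are taken from the county population table in this report; the data frame name `pop` and the label strings are only illustrative and may differ from what was used for the plots.

```r
# County populations taken from the 2012 population table in this report
pop <- data.frame(
  county_name = c("Los Angeles", "Contra Costa", "Fresno", "Alpine"),
  county_pop  = c(9946947, 1069158, 948453, 1114)
)

# Two population levels split at one million residents.
# The label strings are illustrative, not necessarily those used originally.
pop$pop_level <- cut(pop$county_pop,
                     breaks = c(0, 1e6, Inf),
                     labels = c("under 1M", "over 1M"))

pop
table(pop$pop_level)
```

The resulting factor can then be used as a facet or fill variable when plotting the log traffic counts.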
OK, technically this should go into the bivariate section, but it started off trying to investigate the traffic counts on their own.
I wonder what these plots look like across the categorical variables of district, route, and county?
Now let’s look at the population.
## county pop20120701
## 1 Los Angeles 9946947
## 2 San Diego 3153376
## 3 Orange 3073540
## 4 Riverside 2249045
## 5 San Bernardino 2063867
## 6 Santa Clara 1827313
## 7 Alameda 1539145
## 8 Sacramento 1435118
## 9 Contra Costa 1069158
## 10 Fresno 948453
## 11 Kern 855974
## 12 Ventura 833361
## 13 San Francisco 822403
## 14 San Mateo 736019
## 15 San Joaquin 697758
## 16 Stanislaus 523126
## 17 Sonoma 488300
## 18 Tulare 452301
## 19 Santa Barbara 426063
## 20 Monterey 421465
## 21 Solano 419064
## 22 Placer 360098
## 23 San Luis Obispo 270637
## 24 Santa Cruz 268607
## 25 Merced 261002
## 26 Marin 253892
## 27 Butte 220980
## 28 Yolo 204974
## 29 El Dorado 182640
## 30 Shasta 178402
## 31 Imperial 178382
## 32 Madera 151242
## 33 Kings 150643
## 34 Napa 138019
## 35 Humboldt 134601
## 36 Nevada 97920
## 37 Sutter 96557
## 38 Mendocino 88550
## 39 Yuba 72915
## 40 Lake 64204
## 41 Tehama 63937
## 42 San Benito 56643
## 43 Tuolumne 53949
## 44 Siskiyou 45218
## 45 Calaveras 44923
## 46 Amador 36403
## 47 Lassen 33719
## 48 Glenn 28560
## 49 Del Norte 28533
## 50 Colusa 21442
## 51 Plumas 19906
## 52 Inyo 18578
## 53 Mariposa 17959
## 54 Mono 14393
## 55 Trinity 13496
## 56 Modoc 9516
## 57 Sierra 3133
## 58 Alpine 1114
## 'data.frame': 58 obs. of 2 variables:
## $ county : chr "Alameda" "Alpine" "Amador" "Butte" ...
## $ pop20120701: int 1539145 1114 36403 220980 44923 21442 1069158 28533 182640 948453 ...
Population Frequency - Los Angeles outlier - then remove it
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
When I get rid of Los Angeles, the population distribution is more even. The counties with smaller population might have fewer traffic counts, but there are a lot of them.
Plot county population, not
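library(ggplot2)    # qplot(), ggtitle()
library(gridExtra)  # grid.arrange()
# Histogram of county population, with and without the Los Angeles outlier
q1 <- qplot(data=pop12, x = pop20120701) +
  ggtitle("With Outlier")
q2 <- qplot(data=subset(pop12, pop20120701 < 5000000), x = pop20120701) +
  ggtitle("Remove Outlier")
grid.arrange(q1, q2, ncol=2)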
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Los Angeles has a very high population, so it is an outlier and makes it more difficult to see the population distribution in the lower counties. Removing counties with populations > 5 Million (Los Angeles) helps make the rest clearer.
There are 6768 traffic count locations in this dataset, with 17 columns:

* 8 original features: dist, route, route_suffix, county, postmile_prefix, postmile, alignment, description
* 6 traffic count types: back_peak_hour, back_peak_month, back_aadt, ahead_peak_hour, ahead_peak_aadt, ahead_aadt
* 3 added columns: county_name, district and county_pop (population)
There are 12 districts, 243 routes, 3 route suffixes (most are blank), 58 counties, 8 postmile prefixes (R is the most common besides blank), 3 alignments (most are blank), 5762 descriptions (which are the intersections with town or county/state line), and 537 missing traffic counts for each traffic type (at a boundary, the ahead and back counts are usually on separate lines).
Other observations:
The most popular description is “JCT. RTE. 5”. The county with the most traffic count locations is Los Angeles. The district with the most traffic counts is 4, which includes San Francisco and Oakland. The route with the most traffic counts is 101. The ahead_peak_aadt was equal to or higher than the back_peak_month for min/mean/median/max. (these are equivalent measures at front and back of intersection, just not consistently named in source document)
The route with the largest number of traffic counts is Highway 101 (U.S. Route 101), which covers California from South to North. There are 507 points on Highway 101. It might be an interesting roadway to investigate.
The second most counted roadway is I-5. It is an interstate, not a state highway, and it also covers the whole state from South to North.
The third most counted highway is Highway 1, which hugs the coastline. The fourth is Highway 99. Those 4 have many more counts than the other highways.
Los Angeles has a huge population, which raises the mean much higher than the median county population.
All of the traffic count types have a high concentration of counts in the lowest two of the 30 histogram bins (the lower 2/30 of the traffic count range). The hourly traffic counts have a slightly different distribution than the daily traffic counts - more of the daily counts fall into the lowest bin (the lower 1/30) than of the hourly counts.
The main features of interest are the traffic counts, counties and routes.
I also added population from a separate table to do traffic correlation based on county population.
The population count also has the population difference between 2012 and 2013 which could be useful to see if people came to or left areas of high traffic.
I created a county_name variable holding the full name of the county, derived from the county column in the dataset, which is actually an abbreviation.
In a subset of data for Highway 101 I created a cumulative mileage column. In the next sections I created a few summary datasets such as dfp3.bpm_by_county and dfp3.baadt_by_county to plot the mean traffic by county and do modeling to predict average traffic based on population.
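Because postmiles restart at every county line, a cumulative mileage column needs a per-county offset. The sketch below shows one possible way to build it on a toy route. The postmiles are made up, the column name `cum_miles` is hypothetical, and approximating a county's length by its largest postmile is an assumption that ignores realignment prefixes; it is not necessarily how the `cm` column in `hw101` was computed.

```r
# Toy route through three counties, ordered south to north (made-up postmiles)
route <- data.frame(
  county   = factor(c("LA", "LA", "VEN", "VEN", "SB"),
                    levels = c("LA", "VEN", "SB")),
  postmile = c(0.0, 10.5, 2.0, 40.0, 5.0)
)

# Approximate each county's length along the route by its largest postmile
county_len <- tapply(route$postmile, route$county, max)

# Offset of each county = total length of the counties before it
offset <- c(0, cumsum(county_len)[-length(county_len)])
names(offset) <- names(county_len)

# Cumulative mileage = county offset + postmile within the county
route$cum_miles <- unname(offset[as.character(route$county)]) + route$postmile
route
```

Ordering the county factor from south to north is what makes the offsets accumulate in route order.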
I created a tidy data set of one variable/value pair per row so that I could plot the different traffic types in a facet.
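A minimal sketch of that reshaping, using base R's `stack` on two rows copied from the structure listing above. The names `wide`, `long`, `count` and `traffic_type` are illustrative; the original tidy data set may well have been built with a reshaping package instead.

```r
# Two rows copied from the structure listing above (rows 2 and 3 of the data)
wide <- data.frame(
  county          = c("ORA", "ORA"),
  postmile        = c(0.78, 8.43),
  back_peak_hour  = c(3750, 2850),
  back_peak_month = c(40000, 38500),
  back_aadt       = c(37000, 36000)
)

count_cols <- c("back_peak_hour", "back_peak_month", "back_aadt")

# stack() turns the count columns into one value column plus a type column;
# the identifying columns are repeated once per count column
long <- cbind(
  wide[rep(seq_len(nrow(wide)), times = length(count_cols)),
       c("county", "postmile")],
  stack(wide[count_cols])
)
names(long)[names(long) == "values"] <- "count"
names(long)[names(long) == "ind"]    <- "traffic_type"
rownames(long) <- NULL
long
```

With one count per row, `traffic_type` can be passed straight to a facet so each traffic type gets its own panel.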
I summarized the traffic (back peak month) by county_name since I had the population of the counties, so I could compare the average traffic to the county population. I tried to find a correlation, so I plotted each axis as a log function until I found one that looked correlated - the log of the county population is correlated with the log of the average traffic (back peak month).
To look at the county population distribution, I removed Los Angeles because it was such an outlier (population of over 9 million when next largest county only had 3 million). It made the histogram spread out more.
I looked at the traffic distributions and saw that a log transformation would make them more normal, so I plotted the log10 of the traffic counts. It looked bimodal. I separated it by county population - made two groups using cut: one for counties with population under a million and the rest over a million.
This removed the bimodality. Counties with under a million residents had a more normal-looking log traffic frequency distribution.
Below is a correlation matrix of the numeric columns in my main data frame.
## ahead_peak_hour ahead_peak_aadt ahead_aadt back_peak_hour
## ahead_peak_hour 1.0000000 0.9861205 0.9851281 0.9784667
## ahead_peak_aadt 0.9861205 1.0000000 0.9992771 0.9689689
## ahead_aadt 0.9851281 0.9992771 1.0000000 0.9680152
## back_peak_hour 0.9784667 0.9689689 0.9680152 1.0000000
## back_peak_month 0.9679986 0.9831139 0.9826660 0.9862641
## back_aadt 0.9672250 0.9827562 0.9833376 0.9852505
## county_pop 0.5019581 0.5268439 0.5305682 0.5020283
## postmile -0.1127205 -0.1189987 -0.1234395 -0.1535695
## back_peak_month back_aadt county_pop postmile
## ahead_peak_hour 0.9679986 0.9672250 0.50195813 -0.11272049
## ahead_peak_aadt 0.9831139 0.9827562 0.52684388 -0.11899871
## ahead_aadt 0.9826660 0.9833376 0.53056818 -0.12343946
## back_peak_hour 0.9862641 0.9852505 0.50202832 -0.15356953
## back_peak_month 1.0000000 0.9993272 0.52685191 -0.15792589
## back_aadt 0.9993272 1.0000000 0.53044740 -0.16222574
## county_pop 0.5268519 0.5304474 1.00000000 -0.03568948
## postmile -0.1579259 -0.1622257 -0.03568948 1.00000000
The traffic counts are all well correlated. The traffic is somewhat correlated with county population (around .5) - it is expected that more people lead to more cars and to more traffic even if there are other factors such as availability of public transportation that also affect traffic. One surprising finding is that the postmile (distance from S or W end of route in county) is slightly negatively correlated with traffic and even more slightly negatively correlated with county population. There is slightly more traffic at the start of the route (S or W end).
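A correlation matrix like the one above can be produced along these lines. The small data frame is made up to stand in for the main data, and the `use = "pairwise.complete.obs"` handling of the missing counts is an assumption; the matrix above may have been computed with a different missing-value setting.

```r
# Small made-up data frame standing in for the main data (illustrative values)
df <- data.frame(
  county     = c("A", "A", "B", "B", "C"),
  back_aadt  = c(NA, 37000, 36000, 900, 1200),
  ahead_aadt = c(37000, 38500, 36000, 950, NA),
  county_pop = c(3073540, 3073540, 3073540, 18578, 18578)
)

# Keep only numeric columns, then correlate using rows where both values exist
num_cols <- df[sapply(df, is.numeric)]
cor_mat  <- cor(num_cols, use = "pairwise.complete.obs")
round(cor_mat, 3)
```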
Look at traffic counts by district:
You can see here that not all districts have bimodal traffic distributions. Districts 1, 2, and 3 are all less built up and have mostly decreasing distributions. District 4 has a few peaks and is taller with a short tail. It contains the San Francisco/Silicon Valley area and its high traffic is spread out. District 7 contains Los Angeles and has a large second peak where there is a lot of traffic. The last one with a big second peak is District 12, which consists of Orange County and is headquartered in Irvine. This district is unique in that it does not have many areas with low traffic. It has the highest traffic rates and a peak of roadway sites with traffic between 200 and 300 thousand cars a day. As a side note, I once had a tour of the Irvine traffic control center and it was pretty impressive!
## Warning: Removed 537 rows containing missing values (geom_point).
## Warning: Removed 537 rows containing missing values (geom_point).
## Warning: Removed 537 rows containing missing values (geom_point).
## Warning: Removed 537 rows containing missing values (geom_point).
These were some of my first plots. They are not very easy to read although they are colorful! I hope I am not penalized for leaving them in. Orange County has the highest traffic, and then LA.
## Warning: Removed 537 rows containing non-finite values (stat_boxplot).
## Warning: Removed 537 rows containing non-finite values (stat_boxplot).
## Warning: Removed 537 rows containing non-finite values (stat_boxplot).
## Warning: Removed 537 rows containing non-finite values (stat_boxplot).
The statistics of the ratio of peak month AADT to peak hour AADT:
# Peak month daily vs peak hourly
summary(dfp3$back_peak_month/dfp3$back_peak_hour)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.095 10.000 11.460 11.430 12.910 27.000 537
summary(dfp3$ahead_peak_aadt/dfp3$ahead_peak_hour)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.167 10.000 11.440 11.430 12.890 25.470 537
# Average daily vs peak hourly
summary(dfp3$back_aadt/dfp3$back_peak_hour)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.063 8.800 10.430 10.390 12.190 17.220 537
summary(dfp3$ahead_aadt/dfp3$ahead_peak_hour)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 8.789 10.430 10.390 12.190 17.820 537
Look at population by county.
Los Angeles has such a high population that it skews the results. A log10 scale shows all of the counties more clearly.
Traffic Counts by population - This won’t work very well in a bivariate plot because population is the same for all points in a county.
Los Angeles County has a much larger population than all the other counties. The peak month average daily traffic was, on average, about 11 times the peak hourly traffic. The annual average daily traffic was only about 10 times the peak hourly traffic.
Only three counties (Los Angeles, San Bernardino, and Kern) had roads with route suffixes appended, while fourteen counties had alignment changes and all fifty-eight had at least one postmile prefix (most frequently ‘R’).
I noticed in the plots that the postmile prefix ‘M’ had the highest mean AADT. ‘M’ is the realignment of ‘R’ mileage, where ‘R’ was the first realignment. So it seems like when a second realignment is needed for a roadway, it tends to have higher traffic than the first realignment (‘R’), which also tended to have higher traffic than non-realigned roadways.
The peak month daily traffic was an average of 11 times greater than the peak hourly traffic.
All of the traffic counts were correlated with each other, but that was to be expected.
Looking at Traffic counts on the routes that go through Alameda county.
## Warning: Removed 18 rows containing missing values (geom_path).
## Warning: Removed 19 rows containing missing values (geom_point).
I-80 only has less than 10 miles of roadway in the county but it has some of the highest traffic. I-880 also has a lot of traffic at its beginning, but the traffic drops. This plot is a little busy and may not be optimal.
Looking at Highway 101: instead of focusing on a county, I will focus on a route. Highway 101 from South to North goes through these counties in order (from http://en.wikipedia.org/wiki/U.S._Route_101_in_California):

* LA - Los Angeles
* VEN - Ventura
* SB - Santa Barbara
* SLO - San Luis Obispo
* MON - Monterey
* SBT - San Benito
* SCL - Santa Clara
* SM - San Mateo
* SF - San Francisco
* MRN - Marin
* SON - Sonoma
* MEN - Mendocino
* HUM - Humboldt
* DN - Del Norte

Look at the traffic on Highway 101 throughout its length
Plotting traffic along Highway 101 with help from http://stackoverflow.com/questions/13616515/recommend-a-scale-colour-for-13-or-more-categories and http://stackoverflow.com/questions/17844494/change-grid-line-behavior-in-ggplot2 Adding in peak traffic (the triangles)…
The triangles are peak month aadt - it looks like all of the peak traffic locations still have room for even more traffic at peak seasons! Adding the peak traffic makes the plot look a little busy.
I will not use it for one of my final plots.
I created a separate dataset called hw101 that is a subset of the main dataset, but also has cumulative mileage. What are the column names in hw101 in the knitted dataset?
## [1] "dist" "route" "route_suffix"
## [4] "county" "postmile_prefix" "postmile"
## [7] "alignment" "description" "back_peak_hour"
## [10] "back_peak_month" "back_aadt" "ahead_peak_hour"
## [13] "ahead_peak_aadt" "ahead_aadt" "county_name"
## [16] "district" "county_pop" "county_postmile"
## [19] "cm101" "cm"
## 'data.frame': 507 obs. of 20 variables:
## $ dist : Factor w/ 12 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ route : Factor w/ 243 levels "1","2","3","4",..: 90 90 90 90 90 90 90 90 90 90 ...
## $ route_suffix : Factor w/ 3 levels "","S","U": 1 1 1 1 1 1 1 1 1 1 ...
## $ county : Ord.factor w/ 14 levels "LA"<"VEN"<"SB"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ postmile_prefix: Factor w/ 8 levels "","C","D","L",..: 7 7 7 7 1 1 1 1 1 1 ...
## $ postmile : num 0 0.624 0.907 1.329 0 ...
## $ alignment : Factor w/ 3 levels "","L","R": 1 1 1 1 1 1 1 1 1 1 ...
## $ description : Factor w/ 5762 levels ""," JCT. RTE. 101",..: 2902 2864 2861 2885 2885 2994 2814 2973 2889 2869 ...
## $ back_peak_hour : int NA 9000 7900 7700 NA 13800 13300 12700 13600 15800 ...
## $ back_peak_month: int NA 135000 121000 120000 NA 211000 207000 201000 218000 267000 ...
## $ back_aadt : int NA 132000 120000 118000 NA 207000 205000 198000 215000 264000 ...
## $ ahead_peak_hour: int 9000 7900 7700 NA 13800 13300 12700 13600 15800 15400 ...
## $ ahead_peak_aadt: int 135000 121000 120000 NA 211000 207000 201000 218000 267000 261000 ...
## $ ahead_aadt : int 132000 120000 118000 NA 207000 205000 198000 215000 264000 257000 ...
## $ county_name : Ord.factor w/ 14 levels "Los Angeles"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ district : int 7 7 7 7 7 7 7 7 7 7 ...
## $ county_pop : int 9946947 9946947 9946947 9946947 9946947 9946947 9946947 9946947 9946947 9946947 ...
## $ county_postmile: chr "LA_S_0" "LA_S_0.624" "LA_S_0.907" "LA_S_1.329" ...
## $ cm101 : num [1:507(1d)] 0 0 0 0 0 0 0 0 0 0 ...
## $ cm : num [1:507(1d)] 0 0.624 0.907 1.329 0 ...
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 16 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing missing values (geom_path).
I tried to plot county population and Ahead Annual Average Daily Traffic (ahead AADT) along Highway 101 on the same plot to show how traffic could be related to population, but I could not easily figure out how to get two scales on the same chart (which I used to do all the time in gnuplot with y1 and y2 axes). It seems that the R way is not to mix scales but instead to facet. If I had more time I would try to plot them together with a facet. I limited the y scale, so some higher populations were left out. If I were using this graph I would have to footnote that.
The next plot added colors to the county-faceted ahead AADT plots.
These are a set of faceted plots like one of my final plots, but it does not have a free x scale. Each plot is as wide and scaled as the county with the most mileage along highway 101. The legend is outside the plot.
The next set of faceted plots brings the legend inside and colors it to stand out. It doesn’t provide any new information.
Another way to represent the traffic along highway 101 is by a boxplot by county. It shows the traffic distribution in each county. I also color-coded it by district. Adjacent counties can be in the same district. It is obvious that districts 7 and 4 have higher traffic along highway 101 than districts 5 and 1.
Look at traffic counts by postmile prefix per county along Highway 101: The most frequent is blank (no prefix), and the second most frequent is ‘R’, the first re-alignment. Mendocino has some ‘T’, a temporary connection. Los Angeles has some ‘S’, a spur. Del Norte also has an ‘M’, which is a realignment of ‘R’ (second realignment).
Another way to look at the postmile prefix is to graph them by amount of traffic and group by county. The spurs usually have lower traffic. Using a more divergent color scheme (I created it manually) helps.
# Ahead AADT on 101 by county
ggplot(hw101, aes(x = county_name, y = ahead_aadt)) +
geom_point(aes(colour = postmile_prefix), alpha = .5) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
ggtitle("Highway 101 Ahead AADT Per County and Postmile Prefix") +
xlab("County") + ylab("Ahead Annual Average Daily Traffic (AADT)") +
#scale_fill_discrete(name = "Postmile Prefix") +
scale_color_manual(values=c("green", "orange", "red", "black", "gold"),
name = "Postmile Prefix")
## Warning: Removed 16 rows containing missing values (geom_point).
Looking at the traffic counts for ggpairs. They should be correlated and they are.
The GGpairs did not come out well. It may not be polished, but it shows the correlations
Looking at the differences in back and ahead peak month (daily) and peak hour traffic, the differences are obviously bigger over an entire day. If I had more time I would investigate by looking at the descriptions for the locations that had the biggest differences to see whether there was a large roadway at the intersection that was adding or siphoning off traffic.
Limiting the difference to Highway 101 along its cumulative mileage makes the labelling easier and reduces the clutter. We know there are big differences in San Francisco and you can see the points here.
```
library(ggplot2)
library(grid)  # provides unit()

# Difference between back and ahead peak-hour counts, by county (all routes)
ggplot(data = subset(dfp3, !is.na(ahead_peak_hour) & !is.na(back_peak_hour)),
       aes(x = county, y = back_peak_hour - ahead_peak_hour)) +
  geom_jitter(alpha = .5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour = "grey20", size = 8, angle = 45,
                                   hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(colour = "grey20", size = 10, angle = 0,
                                   hjust = 1, vjust = 0, face = "plain"),
        axis.title.x = element_text(colour = "grey20", size = 12, angle = 0,
                                    hjust = .5, vjust = 0, face = "plain"),
        axis.title.y = element_text(colour = "grey20", size = 12, angle = 90,
                                    hjust = .5, vjust = .5, face = "plain"),
        legend.text = element_text(size = 8),
        legend.key.size = unit(.3, "cm")) +
  guides(colour = guide_legend(ncol = 2)) +
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
  ggtitle(label = "CA Difference in Back and Ahead Peak HourCounty")

# Same difference along Highway 101, plotted against cumulative mileage (cm)
ggplot(data = subset(hw101, !is.na(ahead_peak_hour) & !is.na(back_peak_hour)),
       aes(x = cm, y = back_peak_hour - ahead_peak_hour)) +
  geom_point(alpha = .5, aes(colour = county)) +
  theme(axis.text.x = element_text(colour = "grey20", size = 8, angle = 45,
                                   hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(colour = "grey20", size = 10, angle = 0,
                                   hjust = 1, vjust = 0, face = "plain"),
        axis.title.x = element_text(colour = "grey20", size = 12, angle = 0,
                                    hjust = .5, vjust = 0, face = "plain"),
        axis.title.y = element_text(colour = "grey20", size = 12, angle = 90,
                                    hjust = .5, vjust = .5, face = "plain"),
        legend.text = element_text(size = 8),
        legend.key.size = unit(.3, "cm")) +
  guides(colour = guide_legend(ncol = 1)) +
  #scale_colour_brewer(palette=c("Set1", "Set2", "Set3", "Set4", "Set5")) +
  ggtitle(label = "Highway 101 Difference in Back and Ahead Peak HourCounty")
```
Traffic, population, by county. Here I made a new summary dataframe with back peak month mean/median/min/max/count. I added in the population.
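A summary data frame of this kind can be sketched in base R with `aggregate` and `merge`. The traffic values below are made up, the two populations come from the county population table in this report, and the `bpm_*` column names other than `bpm_mean` are guesses at a naming scheme; the original summary may have been built differently.

```r
# Made-up peak-month counts for two counties (illustrative values only)
counts <- data.frame(
  county_name     = c("Orange", "Orange", "Orange", "Alpine", "Alpine"),
  back_peak_month = c(40000, 38500, NA, 1200, 800)
)

# One row per county: mean, median, min, max and number of non-missing counts
# (the formula interface drops rows with NA counts by default)
by_county <- aggregate(
  back_peak_month ~ county_name, data = counts,
  FUN = function(x) c(mean = mean(x), median = median(x),
                      min = min(x), max = max(x), n = length(x))
)
# aggregate() returns a matrix column; flatten it into ordinary columns
by_county <- do.call(data.frame, by_county)
names(by_county) <- c("county_name", "bpm_mean", "bpm_median",
                      "bpm_min", "bpm_max", "n")

# Add population from a separate lookup table (values from the report's table)
pop <- data.frame(county_name = c("Alpine", "Orange"),
                  pop20120701 = c(1114, 3073540))
by_county <- merge(by_county, pop, by = "county_name")
by_county
```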
I looked at the mean county traffic vs. the number of traffic counts in the county. I could see a relationship, but it wasn’t exactly linear. Then I plotted mean county traffic vs. population. It looked vaguely logarithmic so I plotted the logs of each variable until plotting logs of both made it look linear. I added an lm line to the plot.
Next I try to correlate the log of population with the log of back peak month traffic, since it looks linear.
# It looks kind of linear when you take the log of population and traffic
p.bpm.pop.cor <- with(dfp3.bpm_by_county,
cor.test(log10(pop20120701), log10(bpm_mean)))
k.bpm.pop.cor <- with(dfp3.bpm_by_county,
cor.test(log10(pop20120701),log10(bpm_mean), method = "kendall"))
s.bpm.pop.cor <- with(dfp3.bpm_by_county,
cor.test(log10(pop20120701),log10(bpm_mean), method = "spearman"))
# Pearson
p.bpm.pop.cor
##
## Pearson's product-moment correlation
##
## data: log10(pop20120701) and log10(bpm_mean)
## t = 18.1356, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8750290 0.9547327
## sample estimates:
## cor
## 0.9243962
# Kendall
k.bpm.pop.cor
##
## Kendall's rank correlation tau
##
## data: log10(pop20120701) and log10(bpm_mean)
## z = 8.6734, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.7822142
# Spearman
s.bpm.pop.cor
##
## Spearman's rank correlation rho
##
## data: log10(pop20120701) and log10(bpm_mean)
## S = 2252, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9307269
Both the Pearson and Spearman tests show a correlation of about .93, while Kendall’s test shows a lower correlation of .78. In all three tests the p-value is very low (< 2.2e-16), so we reject the null hypothesis and conclude that the true correlation is not equal to 0.
Now I am going to look at Back Annual Average Daily Traffic, which I think should be more representative of the population of a county than peak traffic.
## Warning: joining factor and character vector, coercing into character
## vector
##
## Pearson's product-moment correlation
##
## data: n and baadt_mean
## t = 6.6872, df = 56, p-value = 1.136e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4928630 0.7888568
## sample estimates:
## cor
## 0.6663318
##
## Pearson's product-moment correlation
##
## data: log10(pop20120701) and log10(baadt_mean)
## t = 18.5323, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8796410 0.9564716
## sample estimates:
## cor
## 0.927257
##
## Kendall's rank correlation tau
##
## data: log10(pop20120701) and log10(baadt_mean)
## z = 8.66, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.7810042
##
## Spearman's rank correlation rho
##
## data: log10(pop20120701) and log10(baadt_mean)
## S = 2232, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9313421
It looks like the Pearson and Spearman correlations are slightly higher using the annual average rather than the peak daily traffic, but the Kendall correlation is slightly lower.
Now we can try to do a linear model.
##
## Call:
## lm(formula = baadt_mean ~ pop20120701, data = dfp3.baadt_by_county)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71003 -16400 -6743 3077 79446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.366e+04 4.038e+03 5.860 2.57e-07 ***
## pop20120701 2.020e-02 2.575e-03 7.845 1.39e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27960 on 56 degrees of freedom
## Multiple R-squared: 0.5236, Adjusted R-squared: 0.5151
## F-statistic: 61.54 on 1 and 56 DF, p-value: 1.392e-10
##
## Call:
## lm(formula = log10(baadt_mean) ~ log10(pop20120701), data = dfp3.baadt_by_county)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37008 -0.15914 -0.03735 0.12866 0.49741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.11676 0.17377 6.427 3.05e-08 ***
## log10(pop20120701) 0.60937 0.03288 18.532 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1963 on 56 degrees of freedom
## Multiple R-squared: 0.8598, Adjusted R-squared: 0.8573
## F-statistic: 343.4 on 1 and 56 DF, p-value: < 2.2e-16
Both models show a highly significant relationship between traffic and population. As a rule of thumb, the standard error should be at least an order of magnitude lower than the coefficient estimate, and this is approximately true for both models. R-squared is higher for the log-log model (0.86 versus 0.52), which suggests it fits better, although the two values are computed on different response scales and so are not strictly comparable.
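The comparison between a linear and a log-log fit can be sketched as follows. This is a minimal illustration on synthetic data: the data frame `county` and its columns `pop` and `traffic` are made up, not the `dfp3.baadt_by_county` data, so the fitted values will differ from the summaries above.

```r
# Synthetic stand-in data: NOT the county data used in this analysis.
set.seed(1)
county <- data.frame(pop = 10^runif(58, min = 3, max = 7))
county$traffic <- 10^(1.1 + 0.6 * log10(county$pop) + rnorm(58, sd = 0.2))

# Fit the untransformed and the log-log model.
fit_lin <- lm(traffic ~ pop, data = county)
fit_log <- lm(log10(traffic) ~ log10(pop), data = county)

# R-squared for each fit (note: computed on different response scales).
r2 <- c(linear = summary(fit_lin)$r.squared,
        loglog = summary(fit_log)$r.squared)
print(round(r2, 3))

# The log-log slope is the exponent b in traffic ~ a * pop^b.
slope <- unname(coef(fit_log)[2])
print(round(slope, 3))
```

Read this way, the slope of 0.609 reported for the log-log model above means that average traffic grows roughly as population to the power 0.61, so a tenfold increase in population corresponds to about a fourfold increase in average traffic.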
Next I tried to plot traffic per population (ahead_peak_hour/county_pop). I noticed that Sierra County had some high outliers - this county does not have a large population, but it does have I-80, which gets a lot of traffic.
ggplot(data = dfp3, aes(x = county, y = ahead_peak_hour / county_pop)) +
  geom_boxplot() +
  scale_y_continuous(trans = log10_trans()) +
  ggtitle("CA traffic per Population in July 2012 by County") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  xlab("County") + ylab("Log10 Axis Traffic (ahead peak hour) per Population")
Now order the counties by the number of traffic counts per county (descending).
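One way to get this ordering, assuming "number of traffic counts" means the number of records per county, is to rebuild the county factor with its levels sorted by frequency. The county codes below are a made-up toy vector, not the real data.

```r
# Hypothetical county codes, one entry per traffic count record.
county <- c("LA", "ORA", "LA", "SIE", "LA", "ORA")

# Count records per county and sort the counts in descending order.
counts <- sort(table(county), decreasing = TRUE)

# Rebuild the factor with levels in that order; discrete plot axes
# are drawn in factor-level order.
county_f <- factor(county, levels = names(counts))
print(levels(county_f))
```

Passing `county_f` instead of the raw county column as the x aesthetic would then place the counties with the most records first.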
When it is not plotted on a log scale, there is a huge outlier for Sierra County. But it looks like traffic does not go up linearly with population - because LA has such a large population, it has a very small traffic count/population ratio.
Alpine has a large ratio - either it has a lot of visitors, or 1/4 of the population drives on each state route.
I plot it again on a log scale.
## Warning: Removed 537 rows containing non-finite values (stat_boxplot).
## Warning: Removed 537 rows containing non-finite values (stat_boxplot).
Next I follow up on my plot of cumulative mileage along Highway 101, reading the location descriptions to find causes for the fluctuations in traffic. I look at S Santa Barbara and Santa Clara counties. The result looks like an AAA Triptik!
Looking at the descriptions helped me understand the traffic flows better, but not in a statistical way. With some text analysis I could have made more use of the descriptions. It might also have helped to know how many lanes the highway had at each point.
My early work looked at many things by county, since that was a convenient feature to use. Later I was able to add the county populations, which helped to clarify why certain counties showed certain properties - it was based at least somewhat on their population. These two features are not independent, but since population is numeric, it helps to quantify the relationship.
I also saw how the back and ahead traffic counts at an intersection could vary.
I saw that the traffic counts were bimodal. When I faceted the log of the traffic counts by high and low county populations (greater or less than 1 million), I got separate distributions that looked more normal, although the larger counties had traffic counts that were heavily weighted toward the higher end.
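The split by population can be sketched as below. The data frame `df`, its column `aadt`, and the two made-up county populations are synthetic stand-ins chosen only to produce two visible groups. The 1 million threshold is the one mentioned above.

```r
# Synthetic stand-in data: NOT the dataset used in this analysis.
set.seed(7)
df <- data.frame(
  county_pop = rep(c(50000, 3000000), each = 100),
  aadt       = c(10^rnorm(100, mean = 3.5, sd = 0.4),   # smaller county
                 10^rnorm(100, mean = 5.0, sd = 0.3))   # larger county
)

# Label each row by whether its county is above or below 1 million people.
df$pop_group <- ifelse(df$county_pop > 1e6, "over 1M", "under 1M")

# Summarise log10 traffic within each group; a histogram of log10(aadt)
# on each subset would show the two modes separately.
group_means <- tapply(log10(df$aadt), df$pop_group, mean)
print(round(group_means, 2))
```

Pooling the two groups gives a two-humped distribution on the log scale, while each group on its own is closer to a single bell shape, which is the pattern described above.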
I created a model to predict average traffic for a county based on the county population. Since I only had a single population figure for the entire county, it could not be very accurate, because traffic fluctuates within a county. The model regresses the log of the traffic on the log of the population. It could perhaps be made stronger if I had more data about the type of roadway (number of lanes, etc.) and more demographic information.
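To show how such a model turns a population into a traffic estimate, here is a minimal sketch that back-transforms from the log10 scale using the coefficients printed in the log-log `lm()` summary earlier. The function name `predict_baadt` and the example populations are hypothetical.

```r
# Coefficients copied from the log-log lm() summary reported earlier.
intercept <- 1.11676
slope     <- 0.60937

# Predict mean back AADT for a county of a given population by
# back-transforming from the log10 scale (hypothetical helper).
predict_baadt <- function(pop) {
  10^(intercept + slope * log10(pop))
}

# Example populations (hypothetical round numbers).
pops <- c(1e4, 1e5, 1e6)
print(round(predict_baadt(pops)))
```

This gives only a central estimate. With a residual standard error of 0.196 on the log10 scale, an individual county can easily be off by a factor of about 1.6 in either direction.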
This is a faceted plot of the three back traffic types along Highway 101 by county. The mileposts run from South to North. Each count type has a different shape and color. I kept the y (traffic count) axis the same for all counties so they could be compared, but left the x (postmile) axis free, so some counties are more spaced out.
This plot type, unlike the unfaceted plot later, lets the 3 traffic types show clearly in each county, although the scale doesn’t allow you to see much variation in hourly traffic.
Back peak hour traffic is always lower than the other two, which measure entire days. Traffic peaks in Los Angeles County, where counts are closely spaced, and then decreases through the rest of Los Angeles and Ventura counties. There is a local peak in Santa Barbara, and counts are farther apart in the middle of Santa Barbara County. Counts start rising slightly travelling north in Monterey and increase more in Santa Clara and San Mateo, peaking in San Mateo County before starting to decrease. There is a big drop in San Francisco County, where Highway 101 goes through the city - further investigation is warranted.
Looking at the raw traffic data, the drop occurs at the junction with I-80 (Bay Bridge), before S. Van Ness Avenue - a city street does not have the same capacity as a highway and much of the traffic continues to I-80.
Traffic increases when approaching the Golden Gate Bridge, peaks in San Rafael in Marin County, reaches a lower peak in Santa Rosa in Sonoma County, and then decreases to lower levels in the northern counties.
This plot shows the attempt to find a relationship between average traffic on county roadways and the population of the county. The AADT is the Annual Average Daily Traffic - the total volume for the year divided by 365 days. My assumption is that a county with higher population would have higher average traffic counts. This might not always be the case because some cities have higher public transportation usage.
I also assumed that the daily average would correlate more strongly with population than the daily or seasonal peak, since it is not a peak that could be influenced by tourists or commuters.
This was the plot I worked on the longest. It shows the Ahead Annual Average Daily Traffic (Ahead AADT) for California Highway 101 from South to North. I had to create the cumulative total mileage for Highway 101 by adding, to each postmile, the maximum postmile readings of all counties south of it along the highway (with the county order taken from the Wikipedia entry). Some counties contain only a short stretch of the highway, so their mileage almost overlaps the neighboring county borders.
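The cumulative mileage calculation can be sketched with a running sum of each county's maximum postmile. The counties "A", "B", "C", their postmiles, and the `route_order` vector below are made-up placeholders, and this is one possible way to do it, not necessarily how the plot was built.

```r
# Hypothetical counties in south-to-north order with made-up postmiles.
pm <- data.frame(
  county   = c("A", "A", "B", "B", "C"),
  postmile = c(2.0, 10.0, 1.5, 20.0, 5.0),
  stringsAsFactors = FALSE
)
route_order <- c("A", "B", "C")

# Maximum postmile per county, arranged in route order.
county_len <- tapply(pm$postmile, pm$county, max)[route_order]

# Offset for each county = total length of all counties south of it.
offset <- c(0, cumsum(county_len)[-length(county_len)])
names(offset) <- route_order

# Cumulative mileage = within-county postmile + that county's offset.
pm$cum_mile <- pm$postmile + offset[pm$county]
print(pm)
```

Using the maximum postmile as a county's length assumes that the last count location is close to the county line. Where it is not, the offsets will be slightly short.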
I used a palette that provides a contrast between neighboring counties, instead of the standard palette. It was difficult to find a palette like that which had enough colors, so I had to add two manually.
This plot shows how the traffic varies with location. I also did a subplot for part of Santa Barbara (in multiplot section) where I included the location description, which demonstrates what is going on at the location.
This plot shows that Los Angeles County has some of the highest average daily traffic. It does a better job of showing how the traffic compares along the entire route than the faceted plot does. The traffic increases around the San Francisco Bay Area. The northern part of the state has the lowest traffic.
My biggest struggle in this analysis was knowing when to quit. Even now, I keep finding more things I would like to investigate and different ways I can improve what I’ve already done, but I’ve been working on this for over 3 weeks and would like to finish. When I first picked the dataset, I wasn’t sure it was good enough. I didn’t know what I could model. I had some suggestions to add population data and that really helped make it more interesting.
For a long time I struggled with how to present the data. I wanted to show traffic patterns along a highway, and my first approach was to facet it by county. I didn’t realize I could free the scales so that different counties could have different lengths. I had to gradually learn the features of ggplot2 and dplyr to be able to do what I wanted. So in the beginning I wrote some unpolished code, and as I went along and tried to do more things, I learned more and changed my way of doing things. But I did not always go back and update what I had already done, so you can see different levels of skill in the work.
If I had highway lane data, that might help make better traffic predictions. I also wanted to plot population along Highway 101 to see how it correlated with traffic. I went back and tried to do it, but realized it is not needed and that I need to go to sleep, so I left what I had there. I had to scale y to leave out Los Angeles, as it was skewing the plot too much. I wanted to plot population and traffic on different axes, but ggplot2 doesn’t seem to allow that. I suppose I could scale the population by dividing by some amount. I am leaving that for future work.
My biggest struggle will be to organize the plots in a logical order and comment them all. With a single laptop screen and a broken printer, I have no easy way to go through it.