This lab is designed to give you practice acquiring data from an external site and reading it into R. In addition, it should give you more practice using R and Markdown, and provide experience using the R language when producing a report.

## Part 1: Data

* Find a data set you’re interested in from Seattle’s Open Data Portal at: https://data.seattle.gov/
* Download that dataset as a CSV file and save it to your computer.
* Write a code chunk to import the dataset into R using read.csv().
# Set the working directory to the lab folder, then read the downloaded CSV
setwd("~/Documents/Data-Driven Analytics/Class Exercises/Homeworks/Lab 2")
data <- read.csv("Burke_Gilman_Trail_Bike_and_Ped_Counter.csv")
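One aside on the import: the summary() output further down shows that Date came in as a factor, which was the read.csv() default before R 4.0. A variant like the following sketch would keep it as plain character text, which makes the timestamps easier to work with later (the as.character() calls in the sketches below cover either case).

```r
# Same import, but keep text columns (notably Date) as character strings
# rather than factors -- the default behavior in R < 4.0.
data <- read.csv("Burke_Gilman_Trail_Bike_and_Ped_Counter.csv",
                 stringsAsFactors = FALSE)
```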
Tell me something about the data you downloaded. Why do you think it’s interesting? How large is it, and what variables does it contain? What kind of information is available? What kind of questions might it let you answer? Use and display a few of the commands we’ve learned (table, head, dim, names), but try to make sure the output displays in a readable way. Remember to create separate chunks of code like below, and discuss what you see in the output.
Name of Dataset: Burke Gilman Trail north of NE 70th St Bike and Ped Counter
This dataset comes from a bike and pedestrian counter on the Burke-Gilman Trail north of NE 70th St. It has about 46,000 rows and 6 columns. The variables are Date/Time, BGT North of NE 70th Total (total bikes and pedestrians counted in that hour), Ped South, Ped North, Bike South, and Bike North. Apart from the timestamp, the information is purely quantitative: every other column is a count.
I think this dataset is interesting because it comes from one of the few pedestrian/bike counters in the city, and it sits on a highly used path. It would be interesting to see how pedestrian and bike use changes over time.
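As a first look at that question, something like the sketch below could total the hourly counts by year. This is my own exploratory aside, not required by the lab, and it assumes the Date strings parse with the m/d/y H:M format that head() shows below.

```r
# Parse the timestamp (as.character() in case Date was read as a factor),
# then total the hourly counts within each calendar year.
year <- format(as.POSIXct(as.character(data$Date),
                          format = "%m/%d/%y %H:%M"), "%Y")
tapply(data$BGT.North.of.NE.70th.Total, year, sum, na.rm = TRUE)
```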
head(data)
##            Date BGT.North.of.NE.70th.Total Ped.South Ped.North Bike.North Bike.South
## 1 3/31/19 23:00                          2         0         0          1          1
## 2 3/31/19 22:00                          5         0         0          3          2
## 3 3/31/19 21:00                          4         0         0          2          2
## 4 3/31/19 20:00                         12         2         3          4          3
## 5 3/31/19 19:00                         60        10        14         18         18
## 6 3/31/19 18:00                        142        19        25         45         53
summary(data)
## Date BGT.North.of.NE.70th.Total Ped.South
## 1/1/14 0:00 : 1 Min. : 0.00 Min. : 0.0
## 1/1/14 1:00 : 1 1st Qu.: 3.00 1st Qu.: 0.0
## 1/1/14 10:00: 1 Median : 32.00 Median : 4.0
## 1/1/14 11:00: 1 Mean : 75.37 Mean : 22.1
## 1/1/14 12:00: 1 3rd Qu.: 92.00 3rd Qu.: 14.0
## 1/1/14 13:00: 1 Max. :10493.00 Max. :4054.0
## (Other) :45978 NA's :2335 NA's :2335
## Ped.North Bike.North Bike.South
## Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 4.000 Median : 8.00 Median : 9.00
## Mean : 9.975 Mean : 21.79 Mean : 21.51
## 3rd Qu.: 12.000 3rd Qu.: 29.00 3rd Qu.: 32.00
## Max. :4095.000 Max. :794.00 Max. :8191.00
## NA's :2335 NA's :2335 NA's :2335
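Two things in this summary seem worth flagging for the cleaning discussion at the end: every count column is missing 2,335 hours, and a few maxima (10,493 total users in a single hour, 8,191 southbound bikes) look like sensor glitches rather than real traffic. A quick check along these lines could confirm that, assuming the column names shown above:

```r
# How many hourly observations are missing entirely?
sum(is.na(data$BGT.North.of.NE.70th.Total))

# Look at the largest hourly totals; values in the thousands are
# implausible for one trail counter and suggest sensor errors.
head(sort(data$BGT.North.of.NE.70th.Total, decreasing = TRUE))
```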
dim(data)
## [1] 45984 6
names(data)
## [1] "Date" "BGT.North.of.NE.70th.Total"
## [3] "Ped.South" "Ped.North"
## [5] "Bike.North" "Bike.South"
In future weeks we’ll learn about modifying data: how to change the orientation of a figure, remove some entries, and many other things. Look at your data and think about what would make the raw data more useful. Are dates entered in the wrong form? Do you need locations aggregated to a higher level? Are words entered inconsistently (i.e., Seattle, seattle, SEATTLE)? Start to think forward about what you want to learn to make the data as useful as possible for you. If you are experienced with R, go ahead and modify one of the columns.
I think it would be useful to condense the rows into morning, afternoon, and evening date/time chunks. Right now, the counts are reported by the hour, which creates far more rows than most of my questions need.
I think the columns are good as they are; there are only 5 other variables besides date and time, and all of them are useful. If I were to combine columns, I would combine Ped North with Ped South, and Bike North with Bike South. A sketch of what both changes could look like follows below.
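Since the lab invites experienced R users to modify a column, here is a minimal sketch of both ideas above. It assumes the Date strings parse with the m/d/y H:M format shown by head(), and the 6 am/noon/6 pm cut points for the time-of-day chunks are my own arbitrary choice, not part of the dataset.

```r
# Parse the hour of day out of the Date string (as.character() guards
# against Date having been read in as a factor).
hour <- as.integer(format(as.POSIXct(as.character(data$Date),
                                     format = "%m/%d/%y %H:%M"), "%H"))

# Bin each hourly row into a rough time-of-day chunk; the 6/12/18 cut
# points are an arbitrary choice.
data$Chunk <- cut(hour, breaks = c(-1, 5, 11, 17, 23),
                  labels = c("Night", "Morning", "Afternoon", "Evening"))

# Combine the directional pedestrian and bike columns, then total the
# counts within each time-of-day chunk.
data$Ped  <- data$Ped.North + data$Ped.South
data$Bike <- data$Bike.North + data$Bike.South
aggregate(cbind(Ped, Bike) ~ Chunk, data = data, FUN = sum)
```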