Let’s get LeaRning!

Alicia & Ryan

Agenda

  • Install R
  • Install RStudio
  • Install helpful packages (RSocrata, ggplot2, tidyverse, tidycensus)
  • Get data from Socrata
  • Summary Statistics and Plots
  • Contextualize data using Census American Community Survey Estimates

Motivations

  • R is a free, open-source language with a very active community
  • R is commonly used by analysts, academics and researchers
  • Ability to publish analyses with charts, text and code as shareable documents (HTML, Word, PDF) and Shiny applications
  • Reproducibility: lets others in your organization, and your clients, easily repeat what you did
  • Tons of useful packages are constantly being created and improved to make common tasks easy, even selecting a color theme based on our favorite Wes Anderson movie

First Steps

Check our Environment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   29.38   36.60   34.89   42.77   67.00

Install Packages

  • Install helpful packages (RSocrata, ggplot2, tidyverse, tidycensus)
  • Add Socrata credentials to .Renviron, along with any API keys required for Census and BLS
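
As a minimal sketch of the two steps above: the package list comes from the slide, while the environment variable name `SOCRATA_EMAIL` and the helper `get_credential()` are hypothetical names chosen for illustration (use whatever names you put in your own .Renviron). The install call is guarded so it only runs in an interactive session.

```r
# Packages named on the slide
pkgs <- c("RSocrata", "ggplot2", "tidyverse", "tidycensus")

# Work out which ones are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Only install when a person is at the console (assumption: we do not
# want a non-interactive script to install packages silently)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Hypothetical helper: read a value that was stored in .Renviron,
# failing loudly if it is missing or empty
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || value == "") {
    stop("Missing environment variable: ", name)
  }
  value
}
```

After adding a line such as `SOCRATA_EMAIL=you@example.com` to .Renviron and restarting R, `get_credential("SOCRATA_EMAIL")` should return that value without the credential ever appearing in your script.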

Let’s get some data!

  • RSocrata is a package developed by the City of Chicago to work with datasets via SODA endpoints and the Imports API (deprecated)
  • We can also use the readr package to download public datasets from the CSV endpoint, which gives us both performance and tidy data benefits
  • Or we can use the httr and jsonlite packages to download private and public datasets
  • Coming soon - Socrata package where we use new Publishing, Metadata and Discovery APIs

RSocrata

  • Load up a Socrata dataset
  • We can use the full URL
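
A minimal sketch of loading a dataset with RSocrata, assuming the package is installed. `read.socrata()` is the package's reader; the wrapper `load_socrata()` and the `SOCRATA_APP_TOKEN` variable name are hypothetical, and the dataset URL is left for you to supply.

```r
# Hypothetical wrapper around RSocrata::read.socrata()
# dataset_url: the full URL of the dataset
# app_token:   optional app token, read here from an assumed
#              environment variable name
load_socrata <- function(dataset_url,
                         app_token = Sys.getenv("SOCRATA_APP_TOKEN")) {
  # Pass NULL when no token is configured so the request is anonymous
  token <- if (nzchar(app_token)) app_token else NULL
  RSocrata::read.socrata(dataset_url, app_token = token)
}
```

Call it as `df <- load_socrata(your_dataset_url)`; the result is a regular data frame.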

Readr

  • Preserves field names (rather than replacing spaces with periods)
  • Faster
  • Returns a tibble

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   sensor_id = col_integer(),
##   station_name = col_character(),
##   `Update Time` = col_character(),
##   raw = col_integer(),
##   `5_minutes` = col_integer(),
##   `15_minutes` = col_integer(),
##   `Geo Location` = col_character()
## )
## See spec(...) for full column specifications.
##    sensor_id    station_name       Update Time             raw      
##  Min.   : 155   Length:61          Length:61          Min.   :   6  
##  1st Qu.:1755   Class :character   Class :character   1st Qu.: 516  
##  Median :4535   Mode  :character   Mode  :character   Median :1182  
##  Mean   :4274                                         Mean   :1063  
##  3rd Qu.:6525                                         3rd Qu.:1451  
##  Max.   :7955                                         Max.   :2015  
##    5_minutes   15_minutes   30_minutes            1_hour         
##  Min.   :0   Min.   :0    Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.:0   1st Qu.:0    1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :0   Median :0    Median :0.0000000   Median :0.0000000  
##  Mean   :0   Mean   :0    Mean   :0.0006557   Mean   :0.0006557  
##  3rd Qu.:0   3rd Qu.:0    3rd Qu.:0.0000000   3rd Qu.:0.0000000  
##  Max.   :0   Max.   :0    Max.   :0.0400000   Max.   :0.0400000  
##     2_hours             3_hours             6_hours         
##  Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :0.0000000   Median :0.0000000   Median :0.0000000  
##  Mean   :0.0006557   Mean   :0.0006557   Mean   :0.0006557  
##  3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
##  Max.   :0.0400000   Max.   :0.0400000   Max.   :0.0400000  
##     12_hours             1_day               7_days         30_days     
##  Min.   :0.0000000   Min.   :0.0000000   Min.   :0.120   Min.   :0.470  
##  1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:1.160   1st Qu.:1.930  
##  Median :0.0000000   Median :0.0000000   Median :1.340   Median :2.320  
##  Mean   :0.0006557   Mean   :0.0006557   Mean   :1.282   Mean   :2.282  
##  3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:1.420   3rd Qu.:2.560  
##  Max.   :0.0400000   Max.   :0.0400000   Max.   :1.890   Max.   :4.530  
##      today             this_month      past_year         latitude    
##  Min.   :0.0000000   Min.   :0.120   Min.   : 26.06   Min.   :32.64  
##  1st Qu.:0.0000000   1st Qu.:1.180   1st Qu.: 45.47   1st Qu.:32.74  
##  Median :0.0000000   Median :1.380   Median : 50.24   Median :32.80  
##  Mean   :0.0006557   Mean   :1.315   Mean   : 62.01   Mean   :32.81  
##  3rd Qu.:0.0000000   3rd Qu.:1.500   3rd Qu.: 57.44   3rd Qu.:32.87  
##  Max.   :0.0400000   Max.   :1.890   Max.   :332.10   Max.   :33.01  
##    longitude      Geo Location      
##  Min.   :-96.95   Length:61         
##  1st Qu.:-96.86   Class :character  
##  Median :-96.81   Mode  :character  
##  Mean   :-96.81                     
##  3rd Qu.:-96.76                     
##  Max.   :-96.66
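
To illustrate the readr points above without needing a network connection, here is a small sketch that writes a two-column CSV to a temporary file and reads it both ways. The sample values are made up for the demonstration; only the column names echo the output shown above.

```r
library(readr)

# Write a tiny CSV with a space in one header (made-up sample values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("sensor_id,Update Time",
             "155,2018-01-01 00:00",
             "1755,2018-01-01 00:05"), tmp)

# Base R: check.names = TRUE (the default) turns the space into a period
base_df <- read.csv(tmp)

# readr: keeps the header as written and returns a tibble
# (suppressMessages() hides the column specification message)
tidy_df <- suppressMessages(read_csv(tmp))

names(base_df)  # "sensor_id" "Update.Time"
names(tidy_df)  # "sensor_id" "Update Time"
```

For a Socrata dataset, the same `read_csv()` call can be pointed at the dataset's CSV endpoint URL instead of a local file.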

Summary Statistics

  • Help us see the range of values and evaluate dataset completeness by counting the NAs in every column
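
A minimal sketch of that check, using a small made-up data frame (the column names echo the dataset above, the values are invented for illustration). `summary()` gives the range per column and `colSums(is.na(...))` counts the NAs.

```r
# Made-up sample data with some missing values
rain <- data.frame(
  station_name = c("A", "B", "C", "D"),
  past_year    = c(45.5, NA, 50.2, 57.4),
  raw          = c(516L, 1182L, NA, NA)
)

# Range of values per column; numeric columns also report NA counts
summary(rain)

# Explicit NA count for every column
na_counts <- colSums(is.na(rain))
na_counts
```

Here `na_counts` reports 0 for `station_name`, 1 for `past_year` and 2 for `raw`, which makes incomplete columns easy to spot.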

Plots

  • Show gaps in the data, which helps us evaluate the completeness of the dataset
  • Show outliers, which helps us validate the quality of the dataset
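
Both points can be sketched with ggplot2 on a made-up data frame (station labels and values are invented for illustration): the station with a missing value leaves an empty slot in the bar chart, and the unusually large value stands out immediately.

```r
library(ggplot2)

# Made-up sample: one missing value (S3) and one suspiciously large one (S6)
rain <- data.frame(
  station   = paste0("S", 1:6),
  past_year = c(45.5, 50.2, NA, 57.4, 48.9, 332.1)
)

# One bar per station; the NA row is dropped with a warning when drawn,
# which leaves a visible gap at S3
p <- ggplot(rain, aes(x = station, y = past_year)) +
  geom_col() +
  labs(x = "Station", y = "Rainfall, past year")

# Count the gaps numerically as a cross-check on the picture
n_gaps <- sum(is.na(rain$past_year))
```

Print `p` to draw the chart. Swapping `geom_col()` for `geom_boxplot()` with a single x group is another common way to flag outliers.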

Get Other Datasets (Census, BLS, Google Maps)
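
For the Census part, a minimal sketch using tidycensus, assuming the package is installed and a Census API key is available in the `CENSUS_API_KEY` environment variable. The wrapper name `get_county_population()` is hypothetical; `B01003_001` is the ACS total population variable.

```r
# Hypothetical wrapper: county-level total population from the
# ACS 5-year estimates for one state and year
get_county_population <- function(state, year) {
  tidycensus::get_acs(
    geography = "county",
    variables = c(total_population = "B01003_001"),
    state     = state,
    year      = year,
    survey    = "acs5"
  )
}
```

The result has one row per county with an estimate and a margin of error, which can then be joined to a Socrata dataset to put its numbers in context, for example as a per-capita rate.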