Introduction to R

Retrieving, Cleaning, Analyzing, and Visualizing Data

Jason Freels

16 December 2016

Retrieving, Cleaning and Analyzing Data

Overview

Process We'll Follow

Required Packages

pkgs <- c('dplyr','data.table','readr','DT','shinythemes')

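for(pkg in pkgs) {

  # Install the package only if it is not already installed
  if(!pkg %in% rownames(installed.packages()))  install.packages(pkg)

  # Attach the package; character.only lets library() accept a string
  library(pkg, character.only = TRUE)
}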

Notes

# Option 1: attach the packages, then call functions by bare name
library(data.table)
library(dplyr)

my.data    <- fread('somefile.csv')
my.summary <- summarise(my.data, mean(column1))

# Option 2: name the package explicitly with '::'
library(data.table)
library(dplyr)

my.data    <- data.table::fread('somefile.csv')
my.summary <- dplyr::summarise(my.data, base::mean(column1))
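
One practical reason to write package::function() is name masking: when two definitions share a name, a bare call resolves to whichever one R finds first. A minimal base-R sketch, using a hypothetical replacement for mean() that exists only for this illustration:

```r
# Hypothetical user-defined function that masks base::mean()
# (illustration only; not part of the original slides).
mean <- function(x, ...) "masked!"

values <- c(2, 4, 6)

bare.result      <- mean(values)        # resolves to the masking function
qualified.result <- base::mean(values)  # always the base implementation

print(bare.result)
print(qualified.result)

# Remove the mask so later code sees base::mean() again
rm(mean)
```

The same thing happens between attached packages, which is why the qualified form is the safer habit in longer scripts.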

Step 1 - Retrieve the Data

Download the data as a .zip file from a URL

  1. Store the URL for the 'vehicles.csv.zip' file as a character string
URL <- 'http://fueleconomy.gov/feg/epadata/vehicles.csv.zip'
  2. Use tempfile() to create a temporary file path named temp

    • We'll use temp to hold 'vehicles.csv.zip' so we can unzip it

    • Once we're finished with temp we'll remove it

temp <- base::tempfile()
  3. Use download.file() to download 'vehicles.csv.zip' from URL into temp
utils::download.file(url = URL, 
                     destfile = temp)
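
The life cycle of the temporary file can be tried out without a network connection. This is a minimal sketch, assuming a placeholder text payload in place of the downloaded archive; the name demo.temp is hypothetical and chosen so it does not overwrite temp:

```r
# Create a unique path in the session's temporary directory.
# tempfile() only returns a name; no file exists until something writes to it.
demo.temp <- base::tempfile(fileext = '.zip')
exists.before <- base::file.exists(demo.temp)

# Stand-in for download.file(): write placeholder text to the path
base::writeLines('placeholder contents', demo.temp)
exists.after.write <- base::file.exists(demo.temp)

# Remove the file once it is no longer needed
base::unlink(demo.temp)
exists.after.unlink <- base::file.exists(demo.temp)

print(c(exists.before, exists.after.write, exists.after.unlink))
```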

Unzip the file and extract the .csv file containing the data

  1. Use unzip() to extract the vehicles.csv from temp

    • unzip() will save vehicles.csv under the same name

    • vehicles.csv will be saved in getwd() - the current working directory

utils::unzip(zipfile = temp, 
             files = "vehicles.csv")

Read the data from the .csv file into R for further processing

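# Three alternative readers; only one is needed, and the last assignment wins
epa.data <- data.table::fread('vehicles.csv')   # returns a data.table
epa.data <- readr::read_csv('vehicles.csv')     # returns a tibble
epa.data <- utils::read.csv('vehicles.csv')     # returns a data.frame

# Delete the temporary .zip file now that vehicles.csv has been extracted
base::unlink(temp)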
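
The three readers return different object classes, which affects the syntax available in later steps. A rough sketch on a two-row toy file (the file contents and the name toy.path are made up for illustration); the data.table and readr calls run only if those packages are installed:

```r
# Write a tiny made-up CSV to a temporary path
toy.path <- base::tempfile(fileext = '.csv')
base::writeLines(c('make,year', 'A,2015', 'B,2016'), toy.path)

# Base reader: returns a plain data.frame
df.base <- utils::read.csv(toy.path)
print(class(df.base))

# data.table reader: returns a data.table (which is also a data.frame)
if (requireNamespace('data.table', quietly = TRUE)) {
  print(class(data.table::fread(toy.path)))
}

# readr reader: returns a tibble (which is also a data.frame)
if (requireNamespace('readr', quietly = TRUE)) {
  print(class(suppressMessages(readr::read_csv(toy.path))))
}

# Clean up the toy file
base::unlink(toy.path)
```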

Step 2 - Clean the Data Before Analysis

Visually review the data

utils::View(utils::head(epa.data))
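
View() opens a viewer window, so it is mainly useful in an interactive session. A small sketch of console-based checks, assuming a made-up three-row data frame (toy.review) that borrows two column names from the slides:

```r
# Made-up rows; only the column names come from the slides
toy.review <- data.frame(mpgData = c('Y', 'N', 'Y'),
                         atvType = c('', 'Hybrid', ''),
                         stringsAsFactors = FALSE)

dim.toy <- base::dim(toy.review)   # number of rows and columns
print(dim.toy)
utils::str(toy.review)             # column names, types, and first values
print(utils::head(toy.review, 2))  # first two rows
```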

Remove Extraneous Columns & Rows

# Keep only the rows where mpgData is 'Y'
epa.data <- dplyr::filter(epa.data, mpgData == 'Y')

# Drop hybrid and plug-in hybrid vehicles
epa.data <- base::subset(epa.data, !atvType %in% c('Hybrid','Plug-in Hybrid'))

# Three ways to remove columns: base R, data.table, and dplyr
epa.data[, c('createdOn','modifiedOn')] <- NULL

epa.data <- data.table::as.data.table(epa.data)
epa.data[, c('phevCity','phevHwy', 'phevComb') := NULL]

epa.data <- dplyr::select(epa.data, -(barrels08:charge240))
# Interactive alternative: clean_columns() is defined in resources/gadgets.R
base::source('resources/gadgets.R')
epa.data <- clean_columns(epa.data)$data
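
The row and column removals above can be tried on a small example before running them on the full data set. A minimal base-R sketch, assuming a made-up four-row data frame (toy.clean) whose column names follow the slides and whose values are purely illustrative:

```r
# Made-up rows; column names follow the slides, values are illustrative
toy.clean <- data.frame(mpgData   = c('Y', 'Y', 'N', 'Y'),
                        atvType   = c('', 'Hybrid', '', 'Diesel'),
                        createdOn = c('d1', 'd2', 'd3', 'd4'),
                        phevCity  = c(0, 0, 0, 0),
                        stringsAsFactors = FALSE)

# Keep only the rows where mpgData is 'Y'
toy.clean <- toy.clean[toy.clean$mpgData == 'Y', ]

# Drop hybrid and plug-in hybrid vehicles
toy.clean <- toy.clean[!toy.clean$atvType %in% c('Hybrid', 'Plug-in Hybrid'), ]

# Drop a column by name
toy.clean$createdOn <- NULL

print(toy.clean)
```

Checking the result on a handful of rows makes it easier to confirm that each filter does what is intended.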

Step 3 - Analyze the Data