Getting and Cleaning Data

setwd("C://Users//Olusola//Desktop//New foldercourse//Systems//Data Science//Course 3 -GCData//Wk1//Assignment_WK1")

Question 1

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

How many housing units in this survey were worth more than $1,000,000?

# Download files

dataURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
dataFile <- "ss06hid.csv"
if (!file.exists(dataFile)) {
  download.file(dataURL, dataFile, mode = "wb")
}


codebookURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf"
codebook <- "PUMSDataDict06.pdf"
if (!file.exists(codebook)) {
  download.file(codebookURL, codebook, mode = "wb")
}

# Load package and read file
library(data.table)
housing <- data.table::fread(dataFile) 

# str(housing)
# summary(housing)

# Variable VAL is the Property value
# .N is the number of rows in data.table

#housing[VAL == 24] Running this code will display 53 rows for all variables in housing
housing[, .N]           # Total number of rows in the dataFile

## [1] 6496

housing[VAL == 24, .N]  # Total number of rows for VAL factor/categorical variable 24

## [1] 53

# Conversely

setkey(housing, VAL)
housing[, .N, key(housing)]

##     VAL    N
##  1:  NA 2076
##  2:   1   75
##  3:   2   42
##  4:   3   33
##  5:   4   30
##  6:   5   26
##  7:   6   29
##  8:   7   23
##  9:   8   70
## 10:   9   99
## 11:  10  119
## 12:  11  152
## 13:  12  199
## 14:  13  233
## 15:  14  495
## 16:  15  483
## 17:  16  486
## 18:  17  357
## 19:  18  502
## 20:  19  232
## 21:  20  312
## 22:  21  164
## 23:  22  159
## 24:  23   47
## 25:  24   53
##     VAL    N

Answer

Question 2

Use the data you loaded from Question 1. Consider the variable FES in the code book. Which of the "tidy data" principles does this variable violate?

Answer

Tidy data one variable per column

Question 3

Download the Excel spreadsheet on Natural Gas Aquisition Program here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:

dat

What is the value of:

sum(dat$Zip*dat$Ext,na.rm=T)

(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)

# Download files 

dataURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"
dataFile <- "DATA.gov_NGAP.xlsx"
if (!file.exists(dataFile)) {
  download.file(dataURL, dataFile, mode = "wb")
        
}

# Subset data

rows <- 18:23
cols <- 7:15

# Load package and read file

# library(xlsx)
# dat <- read.xlsx(dataFile, sheetIndex = 1, rowIndex = rows, colIndex = cols)

# OR

dat <- xlsx::read.xlsx(dataFile, sheetIndex = 1, rowIndex = rows, colIndex = cols)

str(dat)

## 'data.frame':    5 obs. of  9 variables:
##  $ Zip      : num  74136 30329 74136 80203 80120
##  $ CuCurrent: num  0 1 1 0 1
##  $ PaCurrent: num  1 0 0 1 0
##  $ PoCurrent: num  0 0 0 0 0
##  $ Contact  : Factor w/ 5 levels "303-864-1919",..: 4 3 5 1 2
##  $ Ext      : num  0 NA 0 0 456
##  $ Fax      : Factor w/ 2 levels "918-491-6659",..: 1 NA 2 NA NA
##  $ email    : logi  NA NA NA NA NA
##  $ Status   : num  1 1 1 1 1

sum(dat$Zip * dat$Ext, na.rm=T)

## [1] 36534720

Answer

36534720

Question 4

Read the XML data on Baltimore restaurants from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

How many restaurants have zipcode 21231?

Use http instead of https, which caused the message Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'.

# Load file

library(XML)
doc <- xmlTreeParse("http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", useInternal = TRUE)
rootNode <- xmlRoot(doc)

# Get the values of all elements with tag "zipcode"
zipcodes <- xpathSApply(rootNode, "//zipcode", xmlValue)

# Store the result of zipcodes into data.table
zipcodetable <- data.table::data.table(zipcode = zipcodes)
zipcodetable[zipcode == 21231]

##      zipcode
##   1:   21231
##   2:   21231
##   3:   21231
##   4:   21231
##   5:   21231
##  ---        
## 123:   21231
## 124:   21231
## 125:   21231
## 126:   21231
## 127:   21231

# Compute the total value of zipcode with value 21231
zipcodetable[zipcode == 21231, .N]

## [1] 127

Answer

127

Question 5

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

using the fread() command load the data into an R object

DT

Which of the following is the fastest way to calculate the average value of the variable

pwgtp15

broken down by sex using the data.table package?

# Load data

DT <- data.table::fread("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv")

# Group variable pwgtp15 by variable SEX

system.time(DT[,mean(pwgtp15),by=SEX])

##    user  system elapsed 
##    0.00    0.00    0.02

Getting and Cleaning Data

Week 1 assignment of the course Data Science taught by Johns Hopkins University

Olusola Afuwape

June 29, 2020

Question 1

Answer

Question 2

Answer

Question 3

Answer

Question 4

Answer

Question 5