Data Cleaning Assignment 1

Setting up the defaults:

knitr::opts_chunk$set(echo = TRUE, results = "asis")

setInternet2(TRUE)  # Windows-only; R 3.3.0 and later no longer need this call

Question 1

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, which describes the variable names, is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE:

which(agricultureLogical)

What are the first 3 values that result?

Answer:

library(jpeg)
library(data.table)

## Warning: package 'data.table' was built under R version 3.2.4

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.2.4

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.4

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

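# Download step, left commented out; the file is read from a local copy below.
#fileurl1 = 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv'
#dst1 = 'ss06hid.csv'
#download.file(fileurl1, dst1, method = 'curl')
data1 = read.csv("ss06hid.csv")
# Code book: ACR == 3 is a house on ten or more acres; AGS == 6 is sales of $10,000 or more.
agricultureLogical = data1$ACR == 3 & data1$AGS == 6
head(which(agricultureLogical), 3)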

[1] 125 238 262
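One detail worth keeping in mind: comparing against a missing value yields NA rather than FALSE, and which() silently drops those NAs. A minimal sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration, not survey data):

```r
# Made-up stand-in for the survey data; ACR and AGS hold category codes.
toy <- data.frame(
  ACR = c(3, 1, 3, NA, 3),
  AGS = c(6, 6, NA, 6, 6)
)

# TRUE only where both conditions hold; NA where either side is missing
# and the other side is not FALSE.
agricultureLogical <- toy$ACR == 3 & toy$AGS == 6
agricultureLogical
# [1]  TRUE FALSE    NA    NA  TRUE

# which() returns the positions of TRUE values and ignores NA.
which(agricultureLogical)
# [1] 1 5
```

This is why head(which(agricultureLogical), 3) is safe to use even when some households have no recorded lot size or sales category.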

Question 2

Using the jpeg package read in the following picture of your instructor into R

https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg

Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data? (some Linux systems may produce an answer 638 different for the 30th quantile)

Answer:

fileurl2 = 'https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg'
dst2 = 'getdata-jeff.jpg'
download.file(fileurl2, dst2, mode = 'wb', method = 'curl')

## Warning: running command 'curl "https://d396qusza40orc.cloudfront.net/
## getdata%2Fjeff.jpg" -o "getdata-jeff.jpg"' had status 127

## Warning in download.file(fileurl2, dst2, mode = "wb", method = "curl"):
## download had nonzero exit status

data2 = readJPEG(dst2, native = TRUE)
quantile(data2, probs = c(0.3, 0.8))

      30%       80% 
-15259150 -10575416 
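As a sanity check on how quantile() reads its probs argument, here is a small sketch on a made-up vector (the name `pixels` and its values are illustrative only). As far as I understand, readJPEG() with native = TRUE returns packed integer colour values rather than per-channel intensities, which would explain the large negative numbers above.

```r
# Made-up stand-in for the image data: the integers 1 to 11.
pixels <- 1:11

# probs are fractions in [0, 1], so the 30th and 80th percentiles are 0.3 and 0.8.
q <- quantile(pixels, probs = c(0.3, 0.8))
q
# 30% 80%
#   4   9
```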

Question 3

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so United States is last). What is the 13th country in the resulting data frame?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table

http://data.worldbank.org/data-catalog/ed-stats

fileurl3a = 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
dst3a = 'getdata-data-GDP.csv'
fileurl3b = 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv'
dst3b = 'getdata-data-EDSTATS_Country.csv'
download.file(fileurl3a, dst3a, method = 'curl')

## Warning: running command 'curl "https://d396qusza40orc.cloudfront.net/
## getdata%2Fdata%2FGDP.csv" -o "getdata-data-GDP.csv"' had status 127

## Warning in download.file(fileurl3a, dst3a, method = "curl"): download had
## nonzero exit status

download.file(fileurl3b, dst3b, method = 'curl')

## Warning: running command 'curl "https://d396qusza40orc.cloudfront.net/
## getdata%2Fdata%2FEDSTATS_Country.csv" -o "getdata-data-
## EDSTATS_Country.csv"' had status 127

## Warning in download.file(fileurl3b, dst3b, method = "curl"): download had
## nonzero exit status
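Status 127 is what R reports when a system command could not be run, so these warnings most likely mean that no curl executable was found on the PATH. The later steps still produce output, presumably because the files were already on disk. A hedged sketch of a guard that skips the download when a local copy exists and otherwise uses download.file()'s default method (`fetch_once` is a hypothetical helper, not part of the original code):

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files (such as JPEGs) intact on Windows.
    download.file(url, dest, mode = "wb")
  }
  # Return the destination path invisibly so calls can be chained.
  invisible(dest)
}

# Possible usage with the names defined earlier in this answer:
# fetch_once(fileurl3a, dst3a)
# fetch_once(fileurl3b, dst3b)
```

Leaving out method = 'curl' lets R pick a download method that is available on the current system.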

gdp = fread(dst3a, skip=4, nrows = 190, select = c(1, 2, 4, 5), col.names=c("CountryCode", "Rank", "Economy", "Total"))

## Warning in fread(dst3a, skip = 4, nrows = 190, select = c(1, 2, 4, 5),
## col.names = c("CountryCode", : Bumped column 5 to type character on
## data row 59, field contains 'a'. Coercing previously read values in this
## column from logical, integer or numeric back to character which may not
## be lossless; e.g., if '00' and '000' occurred before they will now be just
## '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too
## (if they occurred in this column before the bump). If this matters please
## rerun and set 'colClasses' to 'character' for this column. Please note
## that column type detection uses the first 5 rows, the middle 5 rows and the
## last 5 rows, so hopefully this message should be very rare. If reporting to
## datatable-help, please rerun and include the output from verbose=TRUE.

edu = fread(dst3b)
merge = merge(gdp, edu, by = 'CountryCode')
nrow(merge)

[1] 188

arrange(merge, desc(Rank))[13, Economy]

[1] "767"
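The value printed here is a number, whereas the question asks for a country name, so it may be worth checking that the col.names passed to fread() line up with the selected columns. Independently of that, the match-then-sort step can be sketched in base R on made-up data (all codes, ranks and names below are invented for illustration):

```r
gdp_toy <- data.frame(
  CountryCode = c("AAA", "BBB", "CCC", "DDD"),
  Rank        = c(1, 2, 3, 4),
  Economy     = c("Alpha", "Beta", "Gamma", "Delta"),
  stringsAsFactors = FALSE
)
edu_toy <- data.frame(
  CountryCode = c("AAA", "BBB", "DDD", "EEE"),
  IncomeGroup = c("High", "Low", "Low", "High"),
  stringsAsFactors = FALSE
)

# Inner join: only codes present in both tables survive.
matched <- merge(gdp_toy, edu_toy, by = "CountryCode")
nrow(matched)
# [1] 3

# Sort by descending rank, so the top-ranked economy comes last.
sorted <- matched[order(-matched$Rank), ]
sorted$Economy
# [1] "Delta" "Beta"  "Alpha"
```

The row count after the join is the number of matching IDs, and indexing the sorted Economy column gives the country at a given position.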

Question 4

What is the average GDP ranking for the “High income: OECD” and “High income: nonOECD” group?

Answer:

tapply(merge$Rank, merge$`Income Group`, mean)

High income: nonOECD    High income: OECD           Low income 
                  NA             35.17857            133.72973 
 Lower middle income  Upper middle income 
           109.69811                   NA 
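mean() returns NA for any group that contains a missing rank, which is one possible reason for the NA entries in this output. A minimal sketch on made-up values (`rank_toy` and `group_toy` are illustrative names), showing that extra arguments given to tapply() are passed through to mean():

```r
rank_toy  <- c(1, 5, NA, 10, 20)
group_toy <- c("A", "A", "B", "B", "B")

# Group B contains an NA, so its mean is NA.
tapply(rank_toy, group_toy, mean)
#  A  B
#  3 NA

# na.rm = TRUE is forwarded to mean() and drops the missing value first.
with_na_removed <- tapply(rank_toy, group_toy, mean, na.rm = TRUE)
with_na_removed
#  A  B
#  3 15
```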

Question 5

Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP?

Answer:

merge$RankGroups <- cut2(merge$Rank, g=5)
table(merge$RankGroups, merge$`Income Group`)

            High income: nonOECD High income: OECD Low income
  [  4, 42)                    4                17          0
  [ 42, 79)                    5                 9          1
  [ 79,116)                    8                 1          9
  [116,154)                    5                 1         16
  [154,190]                    1                 0         11

            Lower middle income Upper middle income
  [  4, 42)                   6                  11
  [ 42, 79)                  12                  10
  [ 79,116)                  12                   7
  [116,154)                   7                   8
  [154,190]                  16                   9
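For readers without Hmisc, the same idea can be approximated in base R by cutting at the quintile breakpoints. This is a stand-in for cut2(), not an equivalent: cut() closes intervals on the right by default, so group boundaries may differ slightly. All data below are made up.

```r
rank_toy   <- 1:10
income_toy <- rep(c("Lower", "Upper"), times = 5)

# Breakpoints at 0%, 20%, ..., 100% give five quantile groups.
breaks <- quantile(rank_toy, probs = seq(0, 1, by = 0.2))
groups <- cut(rank_toy, breaks = breaks, include.lowest = TRUE)

# Rows are rank groups, columns are income categories.
tab <- table(groups, income_toy)
tab
```

In a table like this, the first row holds the best-ranked group, so a single cell answers questions of the form "how many countries are in income category X and among the top-ranked fifth".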

Jeremias Lalis

May 12, 2016
