Coursera Data Science

Quiz Week 01

There is a small quiz on https://class.coursera.org/dsscapstone-005.

## Loading packages
if(!require('BBmisc')){
  install.packages('BBmisc',dep=TRUE)
}

## Loading required package: BBmisc

suppressPackageStartupMessages(library('BBmisc'))
pkgs <- c('devtools','jsonlite','plyr','stringr','doParallel','ff','ffbase','lubridate')
suppressAll(lib(pkgs)); rm(pkgs)

## Setting parallel computing
#'@ registerDoParallel(cores=16)

#'@ download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/yelp_dataset_challenge_academic_dataset.zip', destfile=paste0(getwd(),'/yelp.zip'))
#'@ unzip(paste0(getwd(),'/yelp.zip'))
#'@ file.remove(paste0(getwd(),'/yelp.zip'))

## https://www.kaggle.com/c/yelp-recsys-2013/forums/t/4465/reading-json-files-with-r-how-to
## https://class.coursera.org/dsscapstone-005/forum/thread?thread_id=12
#'@ fnames <- c('business','checkin','review','tip','user')
#'@ jfile <- paste0(getwd(),'/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_',fnames,'.json')
#'@ dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000))
#'@ names(dat) <- fnames

## Reading the JSON files takes a few minutes, so the parsed data is saved as .RData and loaded from there on later runs.
#'@ save.image(paste0(getwd(),'/Capstone_Quiz.RData'))
load(paste0(getwd(),'/Capstone_Quiz.RData'))

## Rpudplus 0.5.1
## http://www.r-tutor.com
## Copyright (C) 2010-2015 Chi Yau. All Rights Reserved.
## 
## This copy of RPUDPLUS is NOT yet activated.
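
The parse-once, cache-afterwards pattern used above can be sketched on a toy file. This is a minimal illustration, assuming newline-delimited JSON like the Yelp files. The two sample records and the use of saveRDS()/readRDS() (rather than save.image()/load()) are assumptions made for the example, not the original workflow.

```r
library(jsonlite)

## Hypothetical two-record NDJSON file standing in for a Yelp data file
tmp <- tempfile(fileext = ".json")
writeLines(c('{"stars":5,"text":"great"}',
             '{"stars":3,"text":"ok"}'), tmp)

## stream_in() parses one JSON record per line into a data frame
reviews <- stream_in(file(tmp), pagesize = 10000, verbose = FALSE)

## Cache the parsed object so later runs can skip the slow parse
cache <- tempfile(fileext = ".rds")
saveRDS(reviews, cache)
reviews_cached <- readRDS(cache)

nrow(reviews_cached)
```

saveRDS() stores a single object and returns it under whatever name you choose on reading, which avoids overwriting other objects in the workspace the way load() can.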

Question 1

After untarring the dataset, how many files are there (including the documentation pdfs)?

dir('yelp_dataset_challenge_academic_dataset') %>% length

## [1] 7

Question 2

The data files are in what format?

dir('yelp_dataset_challenge_academic_dataset') %>% gsub('\\w{1,}\\.','',.) %>% unique

## [1] "pdf"  "json"
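
As an alternative to the regular expression, the extension can be taken with tools::file_ext(). A small sketch, where the file names are made up for illustration and are not the actual directory listing:

```r
## Hypothetical file names; the real listing comes from dir()
files <- c("Dataset_Agreement.pdf",
           "yelp_academic_dataset_business.json",
           "yelp_academic_dataset_review.json")

## file_ext() returns the text after the last dot of each name
exts <- unique(tools::file_ext(files))
exts
```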

Question 3

How many lines of text are there in the reviews file (in orders of magnitude)?

dat[['review']]$text %>% length

## [1] 1569264
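
The count can also be taken from the raw file without parsing the JSON, and the order of magnitude follows from log10. A minimal sketch, assuming one record per line; count_lines() is a hypothetical helper and the temporary file stands in for the reviews file:

```r
## Hypothetical helper: count lines in chunks to keep memory use low
count_lines <- function(path, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    got <- length(readLines(con, n = chunk, warn = FALSE))
    if (got == 0L) break
    n <- n + got
  }
  n
}

## Toy file with 25 lines
tmp <- tempfile()
writeLines(rep("{}", 25), tmp)
n_lines <- count_lines(tmp, chunk = 10L)

## Order of magnitude of the count reported above
magnitude <- floor(log10(1569264))
```

Here magnitude is 6, i.e. the review count is on the order of one million.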

Question 4

Consider line 100 of the reviews file. “I’ve been going to the Grab n Eat for almost XXX years”

dat[['review']][100,]$text

## [1] "I have been coming to Gab n Eat for almost 20 years and They have never let me down. I get a typical breakfast if eggs, ham, toast, and home fries. Delicious as usual. The ambience however is usually lacking. The walls are dark, with writing and signatures of semi famous people all over the place. Pictures of local people hang on the walls(i secretly want mine up there) along with posters galore. While its fun to look at the first 10 times, it gets a little boring after awhile. So today when I arrived I expected the same old experience. Wow was I wrong! As soon as I looked at the door I knew something was different. The place seemed lighter and brighter. To my pleasant surprise, they painted and got new counter tops!! They're not quite done yet but the place has a new Happy vibe to it. The awesome breakfast, the new decor and the 5 guys sitting at the counter making me laugh are why I will be back( maybe for lunch)."

Question 5

What percentage of the reviews are five star reviews (rounded to the nearest percentage point)?

mean(dat[['review']]$stars == 5) * 100

## [1] 36.92986

Question 6

How many lines are there in the businesses file?

dat[['business']] %>% nrow

## [1] 61184

Question 7

Conditional on having a response for the attribute “Wi-Fi”, what percentage of businesses are reported as having free Wi-Fi (rounded to the nearest percentage point)?

mean(dat[['business']]$attributes$`Wi-Fi` == 'free', na.rm = TRUE) * 100

## [1] 40.91519
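
"Conditional on having a response" means the missing values are dropped from the denominator. A toy vector (made up for illustration) shows how much the choice of denominator matters:

```r
## Hypothetical Wi-Fi responses; NA means the attribute was not reported
wifi <- c("free", "no", NA, "free", "paid", NA, "no", "free")

## Share of ALL businesses with free Wi-Fi (NA counted as not free)
pct_all <- mean(wifi %in% "free") * 100

## Share among businesses that reported the attribute at all
pct_responded <- mean(wifi == "free", na.rm = TRUE) * 100

c(all = pct_all, responded = pct_responded)
```

The second figure is the one the question asks for.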

Question 8

How many lines are in the tip file?

dat[['tip']] %>% nrow

## [1] 495107

Question 9

In the tips file on the 1,000th line, fill in the blank: “Consistently terrible ______”

dat[['tip']][1000,]

##                     user_id
## 1000 ILquUgLlW7UiMNLakuW4Yg
##                                                           text
## 1000 Consistently terrible service. What's with the attitudes?
##                 business_id likes       date type
## 1000 hhNlQWaKqGbMPoeZHoL-IQ     0 2012-04-07  tip

Question 10

What is the name of the user with over 10,000 compliment votes of type “funny”?

dat[['user']][dat[['user']]$votes$funny>10000 & dat[['user']]$compliments$funny>10000,'name']

## [1] "Brian"
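
A comparison against a column that contains NA yields NA, and indexing with NA returns NA entries. Wrapping the condition in which() is one way to keep only the rows where it is TRUE. A sketch on a made-up user table with the same nested compliments layout:

```r
## Hypothetical users; compliments is a nested data frame as in the parsed JSON
users <- data.frame(name = c("Ann", "Bob", "Cy"), stringsAsFactors = FALSE)
users$compliments <- data.frame(funny = c(NA, 12000, 5))

cond <- users$compliments$funny > 10000   # NA, TRUE, FALSE

naive <- users$name[cond]          # keeps an NA entry for the missing value
safe  <- users$name[which(cond)]   # which() drops NA and FALSE positions
safe
```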

Question 11

Create a 2 by 2 cross tabulation table of whether a user has more than 1 fan against whether the user has more than 1 compliment vote of type “funny”. Treat missing values as 0 (fans or votes of that type). Pass the 2 by 2 table to fisher.test in R. What is the P-value for the test of independence?

## http://www.statmethods.net/stats/frequencies.html
## http://www.utstat.toronto.edu/~brunner/oldclass/312f12/lectures/312f12FisherWithR.pdf
fans <- dat[['user']]$fans
funny <- dat[['user']]$compliments$funny %>% str_replace_na(0) %>% as.numeric

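## summarized to <=1 and >1
## Note: each vector is collapsed to two totals here, so the table below is built
## from only two observations rather than one row per user.
fans <- c(sum(fans[fans<=1]),sum(fans[fans>1]))
funny <- c(sum(funny[funny<=1]),sum(funny[funny>1]))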

## https://stat.ethz.ch/R-manual/R-patched/library/stats/html/fisher.test.html
## https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
## table() gives the same result as xtabs() in this case
pb <- table(fans=cut(fans,breaks=c(0,1,Inf)),funny=cut(funny,breaks=c(0,1,Inf)))
#'@ pb <- xtabs(~fans+funny)
#'@ pb %>% chisq.test(.,correct=F) %>% .$expected
pb %>% fisher.test

## 
##  Fisher's Exact Test for Count Data
## 
## data:  .
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    0 Inf
## sample estimates:
## odds ratio 
##          0
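
The question asks for one pair of indicators per user (more than 1 fan, more than 1 funny compliment), cross-tabulated over all users. A minimal sketch of that tabulation on synthetic data follows. The simulated counts are assumptions for illustration only, and the resulting p-value says nothing about the Yelp data.

```r
set.seed(1)
n <- 200

## Synthetic per-user counts; NA in funny mimics users without that compliment type
fans  <- rpois(n, 1)
funny <- ifelse(runif(n) < 0.3, NA, rpois(n, 1) + fans)

## Treat missing values as 0, as the question requires
funny[is.na(funny)] <- 0

## One observation per user; fixed factor levels guarantee a 2 by 2 table
tab <- table(fans  = factor(fans > 1,  levels = c(FALSE, TRUE)),
             funny = factor(funny > 1, levels = c(FALSE, TRUE)))

res <- fisher.test(tab)
res$p.value
```

Applied to the real data, the same pattern would use dat[['user']]$fans and dat[['user']]$compliments$funny in place of the simulated vectors.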

Coursera Data Science - Capstone

Ryo®, Eng Lian Hu TonyStark®

2015-11-27 07:25:34
