This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
#Tutorial 3
#Question 1:
#1. What questions can a data scientist ask about Covid-19 data?
# Think of some good questions, then categorize them as
# descriptive, exploratory, inferential, or predictive.
# Descriptive:
# a) Did the number of active Covid-19 cases increase today?
# b) How many Malaysians are partially vaccinated, and how many are fully vaccinated?
# Exploratory:
# a) Are partially vaccinated people at higher risk of getting Covid-19
# than fully vaccinated people?
# b) Do recovered Covid-19 patients have a higher probability of being infected
# by the virus again?
# Inferential:
# a) Is it feasible for someone to receive a mix of different vaccines?
# Will the vaccines still work?
# b) What are the side effects of mixing vaccines?
# Predictive:
# a) How long does vaccine-induced immunity last?
# b) Will Covid-19 end in the next 5 years?
#2. Web scraping with R
# Scrape product details from Amazon
# Show that you have successfully done some web scraping.
#Step 1: Loading the packages we need
library(selectr)
library(xml2)
library(rvest)
library(stringr)
library(jsonlite)
#Step 2: Reading the HTML content from Amazon
webpage <- read_html("https://www.amazon.in/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJCMGL7/ref=psdc_1805560031_t3_B08VB2CMR3?th=1")
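Reading a live page can fail, for example when the site refuses the request or the URL changes. The sketch below wraps `read_html()` in `tryCatch()` so that a failure returns `NULL` instead of stopping the script. The helper name `safe_read_html` and the file name are hypothetical, chosen only for illustration; a missing local file stands in for a failed request so the example runs offline.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns NULL instead of
# stopping the script when the page cannot be read.
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Could not read page: ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist stands in for a failed request here.
missing_page <- safe_read_html("no_such_page_for_this_demo.html")
is.null(missing_page)
```

Checking for `NULL` before calling `html_nodes()` keeps the later steps from failing with a less obvious error.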
#Step 3: Scrape product details from Amazon
#scrape title of the product
title_html <- html_nodes(webpage, "h1#title")
title <- html_text(title_html)
#remove carriage returns and newlines
title <- str_replace_all(title, "[\r\n]" , "")
head(title)
## [1] "Samsung Galaxy M12 (Black,4GB RAM, 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate"
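Removing only `\r` and `\n` leaves any indentation spaces in place. As a possible alternative, `str_squish()` from stringr trims both ends and collapses runs of inner whitespace. The HTML fragment below is a made-up stand-in for the product page, not real Amazon markup.

```r
library(rvest)
library(stringr)

# Hypothetical fragment standing in for the product page.
sample_page <- read_html('<html><body>
  <h1 id="title">
    Example Phone
    (Black, 4GB RAM)
  </h1>
</body></html>')

raw_title <- html_text(html_nodes(sample_page, "h1#title"))

# Deleting only line breaks keeps the indentation spaces.
newline_only <- str_replace_all(raw_title, "[\r\n]", "")

# str_squish() trims the ends and collapses inner whitespace runs.
clean_title <- str_squish(raw_title)
print(clean_title)
```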
#scrape the price of the product
price_html <- html_nodes(webpage, "span#priceblock_ourprice")
price <- html_text(price_html)
#remove carriage returns and newlines
price <- str_replace_all(price, "[\r\n]" , "")
head(price)
## [1] "<U+20B9>11,499.00"
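The scraped price is a character string, so it cannot be compared or summed directly. One way to get a number, assuming the price always follows the format shown above (currency sign, thousands separators, decimal point), is to strip everything except digits and the decimal point. The variable `price_value` is introduced here only for illustration.

```r
# Price string in the format shown above; \u20b9 is the rupee sign.
price <- paste0("\u20b9", "11,499.00")

# Drop everything except digits and the decimal point, then convert.
price_value <- as.numeric(gsub("[^0-9.]", "", price))
print(price_value)
```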
#scrape product descriptions
desc_html <- html_nodes(webpage,'div#feature-bullets')
#trim surrounding whitespace, then remove carriage returns and newlines
desc <- str_trim(html_text(desc_html))
desc <- str_replace_all(desc, "[\r\n]" , "")
head(desc)
## [1] "About this item48MP+5MP+2MP+2MP Quad camera setup- True 48MP (F 2.0) main camera + 5MP (F2.2) Ultra wide camera+ 2MP (F2.4) depth camera + 2MP (2.4) Macro Camera| 8MP (F2.2) front came6000mAH lithium-ion battery, 1 year manufacturer warranty for device and 6 months manufacturer warranty for in-box accessories including batteries from the date of purchaseAndroid 11, v11.0 operating system,One UI 3.1, with 8nm Power Efficient Exynos850 (Octa Core 2.0GH16.55 centimeters (6.5-inch) HD+ TFT LCD - infinity v-cut display,90Hz screen refresh rate, HD+ resolution with 720 x 1600 pixels resolution, 269 PPI with 16M colorMemory, Storage & SIM: 4GB RAM | 64GB internal memory expandable up to 1TB| Dual SIM (nano+nano) dual-standby (4G+4<U+009B>See more product details"
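In the output above the bullet points run together (for example "front came6000mAH"), because the text of the whole `div` is taken at once. A possible refinement is to select each list item separately and join the items with an explicit separator. This sketch assumes each bullet is an `li` element inside `div#feature-bullets`; the fragment below is made up, and the real page structure may differ.

```r
library(rvest)
library(stringr)

# Hypothetical fragment; the real page markup is an assumption.
sample_page <- read_html('<html><body>
  <div id="feature-bullets">
    <ul>
      <li><span> 48MP quad camera </span></li>
      <li><span> 6000mAh battery </span></li>
    </ul>
  </div>
</body></html>')

# One node per bullet, so each item keeps its own boundaries.
bullet_nodes <- html_nodes(sample_page, "div#feature-bullets li")
bullets <- str_squish(html_text(bullet_nodes))

# Join the cleaned bullets with a visible separator.
desc <- paste(bullets, collapse = " | ")
print(desc)
```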
#scrape product rating
rate_html <- html_nodes(webpage, "span#acrPopover")
rate <- html_text(rate_html)
#remove carriage returns and newlines, trim whitespace, and keep the first match
rate <- str_replace_all(rate, "[\r\n]" , "")
rate <- str_trim(rate)[1]
#print rating of the product
head(rate)
## [1] "4.1 out of 5 stars"
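To compare ratings across products, the numeric part can be pulled out of the text. A minimal sketch, assuming the rating always begins with a number such as "4.1"; `rating_value` is a name introduced here for illustration.

```r
library(stringr)

# Rating text in the format shown above.
rate <- "4.1 out of 5 stars"

# Take the first number (with optional decimal part) and convert it.
rating_value <- as.numeric(str_extract(rate, "[0-9]+(\\.[0-9]+)?"))
print(rating_value)
```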
#Scrape size of the product
size_html <- html_nodes(webpage, "span#inline-twister-expanded-dimension-text-size_name")
#extract the text and trim surrounding whitespace
size <- str_trim(html_text(size_html))
#Print product size
head(size)
## [1] "4GB RAM & 64GB Storage"
#Scrape product color
color_html <- html_nodes(webpage, "span#inline-twister-expanded-dimension-text-color_name")
#extract the text and trim surrounding whitespace
color <- html_text(color_html)
color <- str_trim(color)
#print product color
head(color)
## [1] "Black"
#Step 4: We have successfully extracted data from all the fields, which can be
#used to compare the product information with that from another site.
#Combine the character vectors into a data frame
product_data <- data.frame(Title = title, Price = price, Description = desc,
                           Rating = rate, Size = size, Color = color)
#Structure of the data frame
str(product_data)
## 'data.frame': 1 obs. of 6 variables:
## $ Title : chr "Samsung Galaxy M12 (Black,4GB RAM, 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate"
## $ Price : chr "<U+20B9>11,499.00"
## $ Description: chr "About this item48MP+5MP+2MP+2MP Quad camera setup- True 48MP (F 2.0) main camera + 5MP (F2.2) Ultra wide camera"| __truncated__
## $ Rating : chr "4.1 out of 5 stars"
## $ Size : chr "4GB RAM & 64GB Storage"
## $ Color : chr "Black"
View(product_data)
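If a selector matches nothing, `html_text()` returns an empty character vector, and `data.frame()` then stops because its arguments have different lengths. The sketch below guards against that with a hypothetical helper, `first_or_na`, which is not part of any package; the values are made up to simulate a missing price.

```r
# Hypothetical helper: first element, or NA when nothing was scraped.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

# Made-up values: the price selector is assumed to have matched nothing.
title <- "Example Phone"
price <- character(0)

# Every column now has exactly one value, so the data frame builds.
product_data <- data.frame(Title = first_or_na(title),
                           Price = first_or_na(price))
str(product_data)
```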
#Step 5: Store the data in JSON format.
# jsonlite was already loaded in Step 1, so it does not need to be loaded again.
#convert the data frame into JSON format
json_data <- toJSON(product_data)
#print output
cat(json_data)
## [{"Title":"Samsung Galaxy M12 (Black,4GB RAM, 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate","Price":"<U+20B9>11,499.00","Description":"About this item48MP+5MP+2MP+2MP Quad camera setup- True 48MP (F 2.0) main camera + 5MP (F2.2) Ultra wide camera+ 2MP (F2.4) depth camera + 2MP (2.4) Macro Camera| 8MP (F2.2) front came6000mAH lithium-ion battery, 1 year manufacturer warranty for device and 6 months manufacturer warranty for in-box accessories including batteries from the date of purchaseAndroid 11, v11.0 operating system,One UI 3.1, with 8nm Power Efficient Exynos850 (Octa Core 2.0GH16.55 centimeters (6.5-inch) HD+ TFT LCD - infinity v-cut display,90Hz screen refresh rate, HD+ resolution with 720 x 1600 pixels resolution, 269 PPI with 16M colorMemory, Storage & SIM: 4GB RAM | 64GB internal memory expandable up to 1TB| Dual SIM (nano+nano) dual-standby (4G+4<U+009B>See more product details","Rating":"4.1 out of 5 stars","Size":"4GB RAM & 64GB Storage","Color":"Black"}]
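Printing the JSON shows the result but does not store it. A minimal sketch of saving the JSON to a file and reading it back with `fromJSON()` follows; the data frame is a made-up stand-in for `product_data`, and a temporary file is used so nothing is left behind.

```r
library(jsonlite)

# Made-up stand-in for the scraped product data.
example_data <- data.frame(Title = "Example Phone",
                           Rating = "4.1 out of 5 stars")

# Write pretty-printed JSON to a temporary file.
json_path <- tempfile(fileext = ".json")
writeLines(as.character(toJSON(example_data, pretty = TRUE)), json_path)

# Read the file back; fromJSON() rebuilds a data frame from the array.
restored <- fromJSON(json_path)
str(restored)
```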