link: https://rpubs.com/FaizAslam_s2125991/Tutorial_3

Question 1

  1. What questions can a data scientist ask about Covid-19 data? Think of some good questions and then categorize them as descriptive, exploratory, inferential and predictive.

Descriptive

          How many new cases are reported per day?
          How many people were vaccinated per day?
          How many deaths occur per day?
          What are the common symptoms and the early symptoms?

Exploratory

          What are the age groups of the people who died?
          Which places did patients visit before testing positive?
          What is each patient's medical history, so that further complications can be avoided?

Inferential

          How is an individual affected by each vaccine?
          How long does the vaccine remain effective in an individual?
          What is the effective gap, in days, between the first and second doses of the vaccine?
          Does one need a booster shot?

Predictive

          What will the number of new cases be tomorrow, in a week, in a month and in 6 months?
          How does the movement of people affect the number of daily cases?
          How does the Covid wave move from one area to another, based on the movement of traffic?

Question 2

Web scraping a movie list on IMDb with R

Importing libraries

library(rvest)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
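
The extraction steps in this tutorial all read from a `page` object, which the write-up never creates. A minimal sketch of one way to obtain it: on the live site it would presumably come from `read_html()` applied to the list URL (not given here, so `url` is a hypothetical variable). The offline `page` built from a literal HTML string is an invented, simplified stand-in that only reuses the class names the selectors target.

```r
library(rvest)

# On the live site (assumption: `url` holds the IMDb list address, which is
# not given in this tutorial):
# page <- read_html(url)

# Offline stand-in: invented, simplified markup that reuses the class names
# targeted by the selectors in this tutorial.
page <- read_html('
<html><body>
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <a href="#">Example Movie</a>
      <span class="lister-item-year text-muted unbold">(2020)</span>
    </h3>
    <p class="text-muted"><span class="genre">
Drama, Thriller            </span></p>
    <div class="ratings-bar"></div>
    <p class="text-muted">
    An example summary.</p>
    <p>Director: <a href="#">Jane Doe</a></p>
  </div>
</body></html>
')

# Quick check that the document parsed and a selector finds something
page %>% html_nodes(".lister-item-header a") %>% html_text()
```

With a real URL, only the commented `read_html(url)` line is needed; the stand-in is there so the selectors can be tried without network access.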

Extracting the movie name

# `page` must already hold the parsed list page (e.g. the result of read_html())
name = page %>%
    html_nodes(".lister-item-header a") %>%
    html_text()

Extracting the movie release year

year = page %>%
    html_nodes(".text-muted.unbold") %>%
    html_text()
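
The scraped year is text rather than a number. A small sketch of turning it into an integer, assuming the values look like `"(2020)"` or `"(I) (2019)"` (these sample strings are invented for illustration, not taken from the scraped data):

```r
library(stringr)

# Invented sample values, assumed to resemble the scraped year text
year_raw <- c("(2020)", "(I) (2019)")

# Pull out the first run of four digits and convert it to an integer
year_num <- as.integer(str_extract(year_raw, "\\d{4}"))
year_num
```

Entries without a four-digit year would come out as `NA`, which keeps the vector the same length as the input.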

Extracting the movie summary

The extracted summary text still contains line breaks left over from the HTML, so it needs to be cleaned before use.

summary = page %>%
    html_nodes(".text-muted+ .text-muted , .ratings-bar+ .text-muted") %>%
    html_text()
summary = str_replace_all(summary, "[\r\n]", "")
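
Removing `\r` and `\n` leaves any leading or trailing spaces in place. As a possible alternative, `str_squish()` from stringr trims both ends and collapses repeated whitespace in one call. The sample string below is invented to resemble what `html_text()` might return:

```r
library(stringr)

# Invented sample, assumed to resemble a scraped summary
summary_raw <- "\n    A detective   investigates a case.\n"

# Approach used above: drops line breaks but keeps the leading spaces
str_replace_all(summary_raw, "[\r\n]", "")

# str_squish(): trims both ends and collapses internal runs of whitespace
summary_clean <- str_squish(summary_raw)
summary_clean
```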

Extracting the movie genre

genre = page %>%
    html_nodes(".genre") %>%
    html_text()
genre = str_replace_all(genre, "[\r\n]", "")
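
Each genre entry can hold several comma-separated genres. A sketch of splitting them for later analysis, using invented sample strings and hypothetical variable names (`genre_raw`, `genre_list`, `primary_genre`):

```r
library(stringr)

# Invented samples, assumed to resemble the scraped genre text
genre_raw <- c("\nDrama, Thriller            ", "\nComedy            ")

# Same cleaning as above, plus trimming the padding spaces
genre_clean <- str_trim(str_replace_all(genre_raw, "[\r\n]", ""))

# One character vector of genres per movie
genre_list <- str_split(genre_clean, ",\\s*")

# First listed genre of each movie
primary_genre <- vapply(genre_list, function(g) g[1], character(1))
primary_genre
```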

Extracting the movie’s director

director = page %>%
    html_nodes(".text-muted~ .text-muted+ p a:nth-child(1)") %>%
    html_text()

All the data is then placed into a data frame for future analysis.

df <- data.frame(name, genre, year, director, summary, stringsAsFactors = FALSE)
head(df, 10)
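
`data.frame()` needs all columns to have the same length, so the call above fails if any movie on the page lacks a genre, summary or director. One way around this, sketched here on an invented two-movie fixture, is to select each movie's container first and then use `html_node()` (singular) inside it, which returns one result per container and `NA` where nothing matches. The container class `.lister-item-content` and the names `page_demo`, `items` and `df_demo` are assumptions for illustration.

```r
library(rvest)

# Invented fixture: the second movie has no genre element
page_demo <- read_html('
<html><body>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="#">Movie A</a></h3>
    <span class="genre">Drama</span>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="#">Movie B</a></h3>
  </div>
</body></html>
')

# One node per movie
items <- html_nodes(page_demo, ".lister-item-content")

# html_node() returns exactly one result per item (missing -> NA),
# so every column has the same length
df_demo <- data.frame(
  name  = html_text(html_node(items, ".lister-item-header a"), trim = TRUE),
  genre = html_text(html_node(items, ".genre"), trim = TRUE),
  stringsAsFactors = FALSE
)
df_demo
```

The trade-off is one extra selection step per field in exchange for rows that stay aligned with the movies they belong to.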