Question 1

What are the questions that a data scientist can ask on Covid-19 data? Think of some good questions and then categorize your questions as descriptive, exploratory, inferential and predictive.

Descriptive

          How many new cases are reported per day?
          How many people were vaccinated per day?
          How many deaths occur per day?
          What are the common symptoms and early symptoms?

Exploratory

          What are the age groups of the people who died?
          Which places did patients visit before testing positive?
          What is each patient's medical history, and how can it help avoid further complications?

Inferential

          How is an individual affected by each vaccine?
          How long does the vaccine work in an individual?
          What is the most effective gap, in days, between the first and second doses of the vaccine?
          Does one need a booster shot?

Predictive

          What will the number of new cases be tomorrow, in a week, in a month and in 6 months?
          How does the movement of people affect the number of daily cases?
          How will the Covid-19 wave move from one area to another, based on the movement of traffic?

Question 2

Web scraping a movie list from IMDb with R

Importing libraries

library(rvest)
library(stringr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Pointing R towards the link

This lets R read the page at the following link as HTML

link = "https://www.imdb.com/search/title/?title_type=feature&genres=adventure&explore=genres&view=advanced"
page = read_html(link)

Extracting the movie name

name = page %>%
    html_nodes(".lister-item-header a") %>%
    html_text()

Extracting each movie's link to get more information from its individual page

The listing page gives a link to each movie's main page, which does not show the full cast. For example, the link for the movie Dune, https://www.imdb.com/title/tt1160419/?ref_=adv_li_tt, does not list all the cast members, so “?ref_=adv_li_tt” needs to be replaced with “fullcredits?ref_=tt_ov_st_sm” to point to the full credits page instead.

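movie_links = page %>%
    html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    paste("https://www.imdb.com/", ., sep = "")
# fixed = TRUE matches the pattern literally, because "?" is a regex
# metacharacter and would otherwise not be read as a plain question mark
movie_links = gsub("?ref_=adv_li_tt", "fullcredits?ref_=tt_ov_st_sm",
    movie_links, fixed = TRUE)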
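
As a small offline check of the replacement step, the sketch below applies it to the Dune link quoted above. It passes fixed = TRUE so that the "?" in the pattern is matched literally rather than interpreted as a regular-expression operator. The variable names listing_link and credits_link are illustrative only and are not part of the original script.

```r
# Link for Dune as it appears on the listing page (taken from the text above)
listing_link <- "https://www.imdb.com/title/tt1160419/?ref_=adv_li_tt"

# fixed = TRUE treats the pattern as a plain string, so "?" is not a regex operator
credits_link <- sub("?ref_=adv_li_tt", "fullcredits?ref_=tt_ov_st_sm",
    listing_link, fixed = TRUE)

print(credits_link)
```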

Extracting the movie release year

year = page %>%
    html_nodes(".text-muted.unbold") %>%
    html_text()
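
The scraped year arrives as text rather than as a number. Below is a minimal sketch of converting it, assuming the strings look like "(2021)" or "(I) (2019)". The sample values are made up for illustration and are not scraped output, and year_raw and year_num are hypothetical names.

```r
# Made-up sample values standing in for the scraped year strings
year_raw <- c("(2021)", "(I) (2019)", "")

# Locate the first run of four digits in each string (-1 means no match)
m <- regexpr("[0-9]{4}", year_raw)

# Start from NA and fill in only the entries where a year was found
year_num <- rep(NA_integer_, length(year_raw))
year_num[m > 0] <- as.integer(regmatches(year_raw, m))

print(year_num)
```

Entries without a four-digit year stay NA instead of silently shifting the other values.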

Extracting the movie summary

The extracted summary text still contains line-break characters left over from the HTML, so it needs to be cleaned before use

summary = page %>%
    html_nodes(".text-muted+ .text-muted , .ratings-bar+ .text-muted") %>%
    html_text()
summary = str_replace_all(summary, "[\r\n]", "")

Extracting the movie genre

genre = page %>%
    html_nodes(".genre") %>%
    html_text()
genre = str_replace_all(genre, "[\r\n]", "")
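
Removing "\r" and "\n" leaves any surrounding spaces in place. Below is a minimal sketch of trimming them and splitting a genre string into individual genres, using base R only. It assumes the raw text resembles the made-up sample below, and genre_raw, genre_clean and genre_list are hypothetical names.

```r
# Made-up sample standing in for one scraped genre string
genre_raw <- "\nAction, Adventure, Drama            "

# Drop line breaks, then trim leading and trailing whitespace
genre_clean <- trimws(gsub("[\r\n]", "", genre_raw))

# Split on commas (and any spaces after them) to get one genre per element
genre_list <- strsplit(genre_clean, ",\\s*")[[1]]

print(genre_list)
```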

Extracting the movie’s director

director = page %>%
    html_nodes(".text-muted~ .text-muted+ p a:nth-child(1)") %>%
    html_text()

All the data is then placed into a data frame for later analysis

df <- data.frame(name, genre, year, director, summary, stringsAsFactors = FALSE)
head(df, 10)
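
data.frame() needs every column to have the same length, which can fail if one movie on the page lacks a field such as a genre or director. One possible variation, not the approach used above, is to select each movie's container first and then look up each field inside it, so that missing fields become NA. The sketch below runs offline on simplified, made-up markup. The ".lister-item" container class and the names html, items and df_demo are assumptions for illustration, not confirmed details of the real page.

```r
library(rvest)

# Hypothetical, simplified markup standing in for the real listing page
html <- minimal_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><a href="/title/tt0000001/">Movie A</a></h3>
    <span class="genre">Adventure</span>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a href="/title/tt0000002/">Movie B</a></h3>
  </div>
')

# One node per movie; html_element() then returns exactly one result per
# movie, using NA where the field is missing
items <- html_elements(html, ".lister-item")
df_demo <- data.frame(
    name = items %>% html_element(".lister-item-header a") %>% html_text(),
    genre = items %>% html_element(".genre") %>% html_text(),
    stringsAsFactors = FALSE
)

print(df_demo)
```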

Since we have the link for each movie, we can also retrieve its full cast

The movie Dune was used for this example

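# Full credits page for Dune (tt1160419)
movie_link = "https://www.imdb.com/title/tt1160419/fullcredits?ref_=tt_ov_st_sm"
# Read a full credits page and build the cast as one comma-separated string
get_cast = function(movie_link) {
    movie_page = read_html(movie_link)
    # the cast name is the link in the cell right after the photo cell
    movie_cast = movie_page %>%
        html_nodes(".primary_photo+ td a") %>%
        html_text()
    # strip line breaks, then join all names into a single string
    movie_cast = str_replace_all(movie_cast, "[\r\n]", "") %>%
        paste(collapse = ",")
    # print() shows the result and also returns it invisibly
    print(movie_cast)
}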

get_cast(movie_link)

## [1] " Timothée Chalamet, Rebecca Ferguson, Oscar Isaac, Jason Momoa, Stellan Skarsgård, Stephen McKinley Henderson, Josh Brolin, Javier Bardem, Sharon Duncan-Brewster, Chang Chen, Dave Bautista, David Dastmalchian, Zendaya, Charlotte Rampling, Babs Olusanmokun, Benjamin Clémentine, Souad Faress, Golda Rosheuvel, Roger Yuan, Seun Shote, Neil Bell, Oliver Ryan, Stephen Collins, Charlie Rawes, Richard Carter, Ben Dilloway, Elmi Rashid Elmi, Tachia Newall, Gloria Obianyo, Fehinti Balogun, Dora Kápolnai-Schvab, Joelle, Jimmy Walker, Paul Bullion, Milena Sidorova, János Timkó, Jean Gilpin, Marianne Faithfull, Ellen Dubin, Károly Baksai, Björn Freiberg, Balázs Megyeri, Michael Nardone, Duncan Pow, Ferenc Iván Szabó, Laszlo Szilagyi, Peter Sztojanov Jr., István Áldott"
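
To get the cast for every movie rather than a single one, the same function can be applied to each link in turn. Below is a minimal sketch, assuming a short pause between requests and NA for pages that fail to load are acceptable. collect_cast and fake_fetch are hypothetical helpers introduced here, fake_fetch only stands in for get_cast so the example runs without network access, and the second link uses a made-up title ID.

```r
# Apply a scraping function to every link, pausing between requests and
# recording NA when a page fails to load
collect_cast <- function(links, fetch, delay = 1) {
    vapply(links, function(link) {
        Sys.sleep(delay)
        tryCatch(fetch(link), error = function(e) NA_character_)
    }, character(1), USE.NAMES = FALSE)
}

# Offline stand-in for get_cast(): fails for the made-up ID, succeeds otherwise
fake_fetch <- function(link) {
    if (grepl("tt0000000", link, fixed = TRUE)) stop("page not found")
    "Actor A,Actor B"
}

links <- c("https://www.imdb.com/title/tt1160419/fullcredits?ref_=tt_ov_st_sm",
    "https://www.imdb.com/title/tt0000000/fullcredits?ref_=tt_ov_st_sm")

cast <- collect_cast(links, fake_fetch, delay = 0)
print(cast)
```

With a network connection, calling collect_cast(movie_links, get_cast) should do the same for the scraped links, and the result could then be added to the data frame as a new column.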

Web Scraping IMDB page

Faiz Aslam s2125991

9/14/2021