# Question 1 was designed to ensure that you did not lose track of why we were: (i) scraping data; (ii) cleaning data; (iii) manipulating data; (iv) using a data warehouse.

Scraping Data:

Purpose: Scraping is essential for gathering data that isn’t readily available in structured formats, like publicly displayed data on websites. It’s often used to collect up-to-date, specific, or large-scale datasets without manual entry.

Example: A business may scrape e-commerce sites for competitor pricing or gather social media mentions to analyze customer sentiment.
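
As a minimal sketch of the competitor-pricing idea, the snippet below parses a small HTML table with rvest. The HTML is inline and made up (product names, prices, and the `prices` id are all assumptions), so it runs without a network connection; a real site would need its own selector.

```r
library(rvest)

# Inline HTML standing in for a downloaded product page (made-up data).
html <- '
<table id="prices">
  <tr><th>product</th><th>price_usd</th></tr>
  <tr><td>Widget</td><td>19.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>'

# minimal_html() wraps the fragment in a full HTML document.
page <- minimal_html(html)

prices <- page %>%
  html_element("#prices") %>%   # select the table by its (assumed) id
  html_table()                  # convert it to a tibble, guessing column types

prices
mean(prices$price_usd)          # average scraped price
```

Because html_table() converts column types by default, the price column arrives as numbers and can be summarized directly.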

Cleaning Data:

Purpose: Raw data often contains errors, inconsistencies, duplicates, or irrelevant information. Data cleaning is the process of making data reliable and usable, ensuring analysis is accurate and actionable.

Example: Cleaning data might involve removing duplicate rows, handling missing values, or standardizing date formats to prepare data for meaningful insights.
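
The three cleaning steps named above can be sketched in base R. The data frame is invented for illustration, and the sketch assumes that slash-separated dates are day/month/year and that the median is an acceptable fill for missing amounts; both are assumptions to revisit for real data.

```r
# Made-up raw records with a duplicate row, missing values, and mixed date formats.
raw <- data.frame(
  id     = c(1, 2, 2, 3),
  amount = c(10.5, NA, NA, 7.25),
  date   = c("2024-01-05", "05/02/2024", "05/02/2024", "2024-03-10"),
  stringsAsFactors = FALSE
)

# 1. Remove exact duplicate rows.
clean <- raw[!duplicated(raw), ]

# 2. Handle missing values: fill with the median of the observed amounts.
clean$amount[is.na(clean$amount)] <- median(clean$amount, na.rm = TRUE)

# 3. Standardize dates: try ISO format first, then day/month/year (assumed).
iso <- as.Date(clean$date, format = "%Y-%m-%d")
dmy <- as.Date(clean$date, format = "%d/%m/%Y")
clean$date <- iso
clean$date[is.na(iso)] <- dmy[is.na(iso)]

clean
```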

Manipulating Data:

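Purpose: Manipulation allows us to reshape, filter, aggregate, or calculate new variables to extract insights relevant to our questions or goals. Using tools like dplyr in R streamlines these transformations.

Example: With flight data, manipulation can involve filtering for specific airlines, summarizing delays by month, or merging datasets to analyze trends across years.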
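
A minimal dplyr sketch of filtering and aggregating is shown here. The `toy_flights` table is invented; its column names mirror nycflights13, but the values are assumptions chosen to keep the example small.

```r
library(dplyr)

# Made-up flight records; names mirror nycflights13, values are invented.
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "AA", "UA"),
  month     = c(1, 1, 1, 2, 2),
  dep_delay = c(10, NA, 5, 30, -2)
)

delay_by_month <- toy_flights %>%
  filter(carrier == "AA") %>%                  # keep one airline
  group_by(month) %>%                          # one group per month
  summarise(
    flights   = n(),                           # rows in each group
    avg_delay = mean(dep_delay, na.rm = TRUE)  # ignore missing delays
  )

delay_by_month
```

Each verb does one job, so the pipeline reads top to bottom as filter, group, then summarize.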

Data Warehouse:

Purpose: A data warehouse serves as a centralized repository for integrated data from multiple sources, supporting analysis across time and scale. Warehousing enables consistency, faster querying, and more efficient, large-scale analyses.

Example: A company might consolidate sales, marketing, and customer support data in a warehouse, allowing for a unified view of performance metrics and customer behavior over time.
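
A real warehouse is a database system, but the integration step can be illustrated in base R by joining sources on a shared key. The three tables and the `customer_id` key below are made up for this sketch.

```r
# Three made-up source tables that share a customer_id key.
sales     <- data.frame(customer_id = c(1, 2, 3), revenue = c(120, 80, 200))
marketing <- data.frame(customer_id = c(1, 3), campaign = c("email", "search"))
support   <- data.frame(customer_id = c(2, 3), tickets = c(1, 4))

# all.x = TRUE keeps every sales row even when another source has no match
# (a left join); unmatched cells become NA.
unified <- merge(sales, marketing, by = "customer_id", all.x = TRUE)
unified <- merge(unified, support, by = "customer_id", all.x = TRUE)

unified
```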



# Question 2 evaluated the learning outcomes of web scraping: reading HTML tables with “rvest”. It was very similar to the table extraction that we did in class (Assignment 3).

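```r
# Example: scrape the full-credits page and keep the cast table.

library(rvest)  # read_html(), html_table(), and the %>% pipe
library(dplyr)  # select()

url <- "https://www.imdb.com/title/tt7235466/fullcredits?ref_=tt_cl_sm"

page <- read_html(url)

# html_table() on a whole page returns a list with one table per <table> element.
tables <- page %>% html_table()

# The table index and column positions depend on the page layout at scrape time.
series_cast_table <- tables[[3]]

# Keep columns 2 and 4 by position.
cleaned_table <- series_cast_table %>% select(2, 4)

# Drop rows where either column is empty. [[ ]] extracts a plain vector,
# which works for both tibbles and data frames. This also removes rows that
# are empty in both columns, so no second filter is needed.
cleaned_table <- subset(cleaned_table, cleaned_table[[1]] != "" & cleaned_table[[2]] != "")

final_rows <- nrow(cleaned_table)
final_columns <- ncol(cleaned_table)

cat("The cleaned cast table has", final_rows, "observations and", final_columns, "columns.\n")

print(head(cleaned_table))
```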

# Question 3 was designed to test extracting data from a given API. I provided you with the package and function that were needed. You must be able to index a list and/or a data frame.

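```r
# Example:

# install.packages("jsonlite")  # run once if the package is not installed
library(jsonlite)

# fromJSON() downloads the JSON response and parses it into nested R objects.
BTC <- fromJSON("https://min-api.cryptocompare.com/data/v2/histoday?fsym=BTC&tsym=USD&limit=100")

str(BTC)  # inspect the structure to find where the price records live

# Index into the list: the outer Data element holds an inner Data data frame.
BTC_data <- BTC$Data$Data

head(BTC_data)

# Highest daily closing price in the returned window.
max_close_price <- max(BTC_data$close, na.rm = TRUE)

max_close_price
```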
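
To practice list and data frame indexing without calling the API, the sketch below parses a small JSON string. The JSON is made up: its field names (`Response`, `Data`, `time`, `close`) only imitate the nesting used above and are assumptions, not the API's documented schema.

```r
library(jsonlite)

# Made-up JSON with a nested Data$Data array of records.
txt <- '{
  "Response": "Success",
  "Data": {
    "Data": [
      {"time": 1, "close": 100},
      {"time": 2, "close": 105},
      {"time": 3, "close": 98}
    ]
  }
}'

parsed <- fromJSON(txt)

# Two equivalent ways to index a nested list.
records_a <- parsed$Data$Data
records_b <- parsed[["Data"]][["Data"]]

# The array of objects is simplified to a data frame, so columns index as usual.
records_a$close                               # one column as a vector
records_a[which.max(records_a$close), ]       # the row with the highest close
records_a[records_a$close > 99, "time"]       # times where close exceeds 99
```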

# Question 4 was designed to measure your ability to clean data. It was similar to what we did with the iris dataset: the kNN procedure.

In R, the class package has a knn() function for classification, and packages like VIM or DMwR provide functions for kNN-based imputation.
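
To make the kNN idea concrete for data cleaning, here is a minimal base-R sketch of kNN imputation on iris. It is hand-rolled for illustration and is not the VIM or DMwR implementation. The helper name `impute_knn`, the seed, k = 5, and the choice to blank out 10 `Sepal.Length` values are all assumptions.

```r
set.seed(42)                                  # assumed seed, for reproducibility only
iris_na <- iris
missing_rows <- sample(nrow(iris_na), 10)
iris_na$Sepal.Length[missing_rows] <- NA      # blank out 10 values on purpose

# impute_knn() is a hypothetical helper: it fills each NA in `target` with the
# mean of `target` over the k nearest complete rows, where "nearest" means
# smallest Euclidean distance on the `predictors` columns.
impute_knn <- function(df, target, predictors, k = 5) {
  donors <- df[!is.na(df[[target]]), ]        # rows that can lend a value
  donor_matrix <- as.matrix(donors[, predictors])
  for (i in which(is.na(df[[target]]))) {
    # Difference between row i and every donor row, column by column.
    diffs <- sweep(donor_matrix, 2, unlist(df[i, predictors]))
    dists <- sqrt(rowSums(diffs^2))
    nearest <- order(dists)[seq_len(k)]
    df[[target]][i] <- mean(donors[[target]][nearest])
  }
  df
}

iris_imputed <- impute_knn(
  iris_na, "Sepal.Length",
  c("Sepal.Width", "Petal.Length", "Petal.Width")
)

sum(is.na(iris_na$Sepal.Length))       # 10 missing before imputation
sum(is.na(iris_imputed$Sepal.Length))  # 0 missing after imputation
```

The predictors are used on their raw scales here; standardizing them first is a common refinement when the columns have very different ranges.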

# Question 5 was designed to measure your ability to manipulate data. It was similar to what we did with the nycflights13 datasets: using data manipulation functions from the “dplyr” package to derive the metrics needed.

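```r
# Example:
# install.packages("dplyr")
# install.packages("pacman")

# p_load() installs packages if needed, then attaches them; dplyr supplies
# glimpse(), select(), and arrange().
pacman::p_load(nycflights13, dplyr)

View(flights) # View() opens the whole dataset in a viewer window

glimpse(flights) # glimpse() provides a quick overview of the dataset
summary(flights)

# Keep only the columns whose names start with "dep".
select(flights, starts_with("dep"))

# Sort by departure delay, longest first, then move the identifying columns to the front.
sortf <- arrange(flights, desc(dep_delay))
select(sortf, carrier, flight, tailnum, everything())
```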