Question 1 was designed to ensure that you did not lose track of why
we were: (i) scraping data; (ii) cleaning data; (iii) manipulating data;
and (iv) warehousing data.
Scraping Data:
Purpose: Scraping is essential for gathering data that isn’t readily
available in structured formats, like publicly displayed data on
websites. It’s often used to collect up-to-date, specific, or
large-scale datasets without manual entry.
Example: A business may scrape e-commerce sites for competitor
pricing or gather social media mentions to analyze customer
sentiment.
Cleaning Data:
Purpose: Raw data often contains errors, inconsistencies,
duplicates, or irrelevant information. Data cleaning is the process of
making data reliable and usable, ensuring analysis is accurate and
actionable.
Example: Cleaning data might involve removing duplicate rows,
handling missing values, or standardizing date formats to prepare data
for meaningful insights.
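The three cleaning steps named above can be sketched in base R. This is only an illustration: the `raw` data frame and its values are made up, and replacing a missing value with the median is one possible choice among several.

```r
# Hypothetical raw data: one duplicated row, one missing value, mixed date formats
raw <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  amount = c(10.5, 20.0, 20.0, NA, 7.25),
  date   = c("2024-01-05", "01/06/2024", "01/06/2024", "2024-01-07", "01/08/2024"),
  stringsAsFactors = FALSE
)

# 1. Remove exact duplicate rows
clean <- raw[!duplicated(raw), ]

# 2. Handle missing values: here, replace NA with the column median
clean$amount[is.na(clean$amount)] <- median(clean$amount, na.rm = TRUE)

# 3. Standardize dates: try year-month-day first, then month/day/year
iso <- as.Date(clean$date, format = "%Y-%m-%d")
us  <- as.Date(clean$date, format = "%m/%d/%Y")
clean$date <- iso
clean$date[is.na(iso)] <- us[is.na(iso)]

clean
```

Whether to drop, fill, or flag missing values depends on the analysis; the median fill here is just the simplest option to show.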
Manipulating Data:
Purpose: Manipulation allows us to reshape, filter, aggregate, or
calculate new variables to extract insights relevant to our questions or
goals. Using tools like dplyr in R streamlines these transformations.
Example: With flight data, manipulation can involve filtering for
specific airlines, summarizing delays by month, or merging datasets to
analyze trends across years.
Data Warehouse:
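Purpose: A data warehouse serves as a centralized repository for
integrated data from multiple sources, supporting analysis across time
and scale. Warehousing enables consistency, faster querying, and more
efficient, large-scale analyses.
Example: A company might consolidate sales, marketing, and customer
support data in a warehouse, allowing for a unified view of performance
metrics and customer behavior over time.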
Question 2 evaluated learning outcomes of web scraping: reading HTML
tables with “rvest”. It was very similar to the table extraction that we
did in class (Assignment 3).
Example:
library(rvest)
library(dplyr) # needed for select()
url <- "https://www.imdb.com/title/tt7235466/fullcredits?ref_=tt_cl_sm"
page <- read_html(url)
tables <- page %>% html_table()
series_cast_table <- tables[[3]]
cleaned_table <- series_cast_table %>% select(2, 4)
cleaned_table <- subset(cleaned_table, cleaned_table[, 1] != "" & cleaned_table[, 2] != "")
cleaned_table <- cleaned_table[!apply(cleaned_table == "", 1, all), ]
final_rows <- nrow(cleaned_table)
final_columns <- ncol(cleaned_table)
cat("The cleaned cast table has", final_rows, "observations and", final_columns, "columns.\n")
print(head(cleaned_table))
Question 3 is designed to measure your ability to extract data from a
given API. I provided you with the package and function that were
needed. Be able to index a list and/or data frame.
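Example:
# install.packages("jsonlite") # run once if jsonlite is not installed
library(jsonlite)
# Request daily BTC/USD price history and parse the JSON response into an R list
BTC <- fromJSON("https://min-api.cryptocompare.com/data/v2/histoday?fsym=BTC&tsym=USD&limit=100")
str(BTC) # inspect the nested structure of the response
BTC_data <- BTC$Data$Data # the daily records sit in the Data element nested inside Data
head(BTC_data)
# Highest daily closing price in the returned window
max_close_price <- max(BTC_data$close, na.rm = TRUE)
max_close_price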
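Since indexing lists and data frames is the skill being tested, here is a small offline sketch of the same access patterns. The `resp` object is a hypothetical stand-in for a parsed API response with made-up values; it only mimics the nesting used above and is not the real API output.

```r
# Hypothetical stand-in for a parsed API response: a list whose Data element
# is itself a list containing a data frame (all values are made up)
resp <- list(
  Response = "Success",
  Data = list(
    TimeFrom = 1,
    Data = data.frame(
      time  = c(1, 2, 3),
      close = c(100, 125, 110),
      high  = c(105, 130, 120)
    )
  )
)

# Lists: $ and [[ ]] return the element itself; [ ] returns a sub-list
resp$Response             # "Success"
resp[["Data"]][["Data"]]  # the nested data frame
class(resp["Response"])   # "list"

prices <- resp$Data$Data

# Data frames: df[rows, cols], or df$col for a single column
prices[2, "close"]                             # one cell
prices[prices$close > 105, ]                   # rows meeting a condition
best_day <- prices[which.max(prices$close), ]  # row with the highest close
best_day
```

Running `str()` on an unfamiliar object first, as in the example above, shows which of these access patterns applies at each level.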
Question 4 is designed to measure your ability to clean data. It was
similar to what we have done with the iris dataset: the kNN
procedure.
# In R, the class package has a knn() function for classification, and packages like VIM (kNN()) or DMwR (knnImputation()) provide functions for kNN-based imputation.
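To make the idea of kNN-based imputation concrete, here is a hand-rolled sketch in base R on iris. It is not how VIM or DMwR implement it, and the helper name `knn_impute` is hypothetical. It assumes the other numeric columns have no missing values, and it fills each gap with the mean of the k nearest complete rows.

```r
# Minimal kNN imputation for one numeric column (illustrative only)
knn_impute <- function(df, target, k = 5) {
  predictors <- setdiff(names(df)[sapply(df, is.numeric)], target)
  X <- scale(as.matrix(df[, predictors, drop = FALSE]))  # put predictors on a common scale
  missing  <- which(is.na(df[[target]]))
  observed <- which(!is.na(df[[target]]))
  for (i in missing) {
    # Euclidean distance from row i to every row with an observed target value
    d <- sqrt(rowSums(sweep(X[observed, , drop = FALSE], 2, X[i, ])^2))
    nearest <- observed[order(d)[seq_len(k)]]
    df[i, target] <- mean(df[nearest, target])  # average of the k nearest neighbours
  }
  df
}

set.seed(42)
iris_na <- iris
holes <- sample(nrow(iris_na), 10)  # knock out 10 values to impute
iris_na$Sepal.Length[holes] <- NA

iris_filled <- knn_impute(iris_na, target = "Sepal.Length", k = 5)
sum(is.na(iris_filled$Sepal.Length))  # 0
```

Because the true values are known here, comparing `iris_filled$Sepal.Length[holes]` with `iris$Sepal.Length[holes]` gives a quick sense of how well the imputation did.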
Question 5 is designed to measure your ability to manipulate data.
It was similar to what we have done with the nycflights13 datasets: using
data manipulation functions from the “dplyr” package to derive the
metrics needed.
Example:
# install.packages("dplyr")
# install.packages("pacman")
pacman::p_load(nycflights13, dplyr) # dplyr supplies glimpse(), select(), and arrange()
View(flights) # View() opens the whole dataset in a spreadsheet-style viewer
glimpse(flights) # glimpse() provides a quick overview of the dataset
summary(flights)
select(flights, starts_with("dep")) # keep only the columns whose names start with "dep"
sortf <- arrange(flights, desc(dep_delay)) # sort by departure delay, longest first
select(sortf, carrier, flight, tailnum, everything()) # move these three columns to the front
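Deriving a metric usually means combining filter(), group_by(), and summarise(). The sketch below shows that pattern on a tiny made-up table that borrows the flights column names. `mini_flights`, `mini_airlines`, and all their values are hypothetical, so the numbers mean nothing beyond the illustration.

```r
library(dplyr)

# Hypothetical miniature of the flights table (values are made up)
mini_flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL", "DL"),
  month     = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(10, NA, 30, -5, 0, 8)
)
# Hypothetical lookup table of carrier names
mini_airlines <- data.frame(
  carrier = c("AA", "UA", "DL"),
  name    = c("American", "United", "Delta")
)

# Average departure delay and share of delayed flights per carrier
carrier_metrics <- mini_flights %>%
  filter(!is.na(dep_delay)) %>%                 # drop flights with no recorded delay
  group_by(carrier) %>%
  summarise(
    n_flights    = n(),
    mean_delay   = mean(dep_delay),
    prop_delayed = mean(dep_delay > 0)
  ) %>%
  left_join(mini_airlines, by = "carrier") %>%  # attach readable carrier names
  arrange(desc(mean_delay))

carrier_metrics
```

The same pipeline should carry over to the real flights and airlines tables by swapping in those data frames, although the results will of course differ.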