In every data science project, data collection is an important step in obtaining the data needed for analysis. One of the available methods is web scraping, the process of extracting useful data from the text-based markup languages (HTML and its relatives) that web pages are built from.
In this assignment, we scrape the Microsoft organization page on GitHub, https://github.com/Microsoft/, using R.
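Before working with the live site, the core idea can be sketched on a tiny inline page. The example below is only an illustration and not part of the assignment: the HTML snippet, its class names, and the `toy_*` variable names are made up, and it assumes rvest 1.0 or later (for `minimal_html()` and `html_elements()`).

```r
library(rvest)

# A made-up page standing in for a real website (illustration only)
toy_page <- minimal_html('
  <div class="repo">
    <a class="name" href="/example/alpha">alpha</a>
    <p class="desc">First example project</p>
  </div>
  <div class="repo">
    <a class="name" href="/example/beta">beta</a>
    <p class="desc">Second example project</p>
  </div>
')

# Select nodes with a CSS selector, then pull out their text or an attribute
toy_links <- html_elements(toy_page, ".repo .name")
toy_names <- html_text2(toy_links)
toy_hrefs <- html_attr(toy_links, "href")
toy_descs <- html_text2(html_elements(toy_page, ".repo .desc"))

# Combine the scraped vectors into one table
toy_df <- data.frame(name = toy_names, link = toy_hrefs, description = toy_descs)
toy_df
```

The same three calls (`html_elements()`, `html_text2()`, `html_attr()`) do all of the extraction in the rest of this report; only the selectors change.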
library(rvest)
library(httr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.8
## ✔ tidyr 1.1.4 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
url <- "https://github.com/Microsoft/"
MSOFT <- read_html(url)
MSOFT
## {html_document}
## <html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="logged-out env-production page-responsive" style="word-wrap: ...
user_description <- MSOFT %>%
  html_elements(css = ".flex-1 .color-fg-muted div") %>%
  html_text()
user_description
## [1] "Open source projects and samples from Microsoft"
user_logo_url <- MSOFT %>%
  html_elements(css = ".mr-md-4") %>%
  html_attr("src") # the image URL is stored in the src attribute
user_logo_url
## [1] "https://avatars.githubusercontent.com/u/6154722?s=200&v=4"
knitr::include_graphics(user_logo_url)
most_used_topics <- MSOFT %>%
  html_elements(css = ".topic-tag-link") %>%
  html_text2()
most_used_topics
## [1] "azure" "microsoft" "python" "machine-learning"
## [5] "typescript"
repo_name <- MSOFT %>%
  html_elements(css = ".text-bold.mr-1") %>%
  html_text2()
repo_df <- as.data.frame(repo_name)
repo_df$repo_name <- gsub('\\s+', '', repo_df$repo_name)
repo_url <- MSOFT %>%
  html_elements(css = ".text-bold.mr-1") %>%
  html_attr("href")
# paste0() joins without a separator, so no space has to be stripped afterwards
repo_df$repo_url <- paste0("https://github.com", repo_url)
repo_description <- character(nrow(repo_df))
for (i in seq_len(nrow(repo_df))) {
  # Keep only the first match; [1] gives NA if a repository has no description
  repo_description[i] <- (read_html(repo_df$repo_url[i]) %>%
    html_elements(css = "#repo-content-pjax-container .f4") %>%
    html_text2())[1]
}
repo_description <- trimws(repo_description)
repo_df$repo_description <- repo_description
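repo_topics <- character(nrow(repo_df))
for (i in seq_len(nrow(repo_df))) {
  # Collapse all topics of one repository into a single comma-separated string
  repo_topics[i] <- read_html(repo_df$repo_url[i]) %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2() %>%
    paste(collapse = ',')
}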
repo_df$repo_topics <- repo_topics
repo_df$repo_topics <- gsub('\\s+', '', repo_df$repo_topics)
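The two loops above download every repository page twice, once for the description and once for the topics. One possible alternative is to parse each page once and pull both fields from the same document. The sketch below reuses the CSS selectors from above; `extract_repo_info()` is a hypothetical helper introduced here for illustration, and the inline HTML is made up to imitate only the parts of a repository page that those selectors touch.

```r
library(rvest)

# Hypothetical helper: extract description and topics from one parsed page
extract_repo_info <- function(page) {
  description <- page %>%
    html_elements(css = "#repo-content-pjax-container .f4") %>%
    html_text2()
  topics <- page %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2()
  list(
    # First match only; NA when the page has no description element
    description = if (length(description) > 0) trimws(description[1]) else NA_character_,
    # Empty string when the page has no topic tags
    topics = paste(topics, collapse = ",")
  )
}

# Made-up stand-in for a repository page (illustration only)
fake_page <- minimal_html('
  <div id="repo-content-pjax-container">
    <p class="f4"> An example description </p>
    <a class="topic-tag-link">r</a>
    <a class="topic-tag-link">web-scraping</a>
  </div>
')

info <- extract_repo_info(fake_page)
info$description
info$topics

# A page with neither element yields NA and an empty string
empty_info <- extract_repo_info(minimal_html("<p>nothing here</p>"))
```

Inside the loop this would be called as `extract_repo_info(read_html(repo_df$repo_url[i]))`, which halves the number of requests; a short `Sys.sleep()` between iterations is a common courtesy to the server.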
repo_df %>%
  kbl(format = 'html',
      escape = FALSE, caption = "Summary table of the first 10 repositories") %>%
  kable_styling()
| repo_name | repo_url | repo_description | repo_topics |
|---|---|---|---|
| azure-tools-for-java | https://github.com/microsoft/azure-tools-for-java | Azure tools for Java, including Azure Toolkits for Eclipse, IntelliJ and related projects. | |
| microcode | https://github.com/microsoft/microcode | Microsoft MicroCode: physical computing for early education. | education,coding,microbit,jacdac |
| plcrashreporter | https://github.com/microsoft/plcrashreporter | Reliable, open-source crash reporting for iOS, macOS and tvOS | macos,ios,crash-reporting |
| hat | https://github.com/microsoft/hat | TOML-annotated C header file format for packaging binary files, from Microsoft Research | metadata,toml,benchmarking,cpp,python-library,cuda,platform-independent,cprogramming,rocm |
| satellite-imagery-labeling-tool | https://github.com/microsoft/satellite-imagery-labeling-tool | This is a lightweight web-interface for creating and sharing vector annotations over satellite/aerial imagery scenes. | |
| vscode-maven | https://github.com/microsoft/vscode-maven | VSCode extension “Maven for Java” | microsoft,maven,vscode-extension,java-support |
| vscode | https://github.com/microsoft/vscode | Visual Studio Code | electron,microsoft,editor,typescript,visual-studio-code |
| onnxruntime | https://github.com/microsoft/onnxruntime | ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator | machine-learning,deep-learning,tensorflow,scikit-learn,pytorch,neural-networks,hardware-acceleration,hacktoberfest,ai-framework,onnx |
| azure-pipelines-tasks | https://github.com/microsoft/azure-pipelines-tasks | Tasks for Azure Pipelines | |
| lisa | https://github.com/microsoft/lisa | LISA is developed and maintained by Microsoft, to empower Linux validation. | testing,linux,azure,automation-framework,e2e-testing,hyperv,cloudtesting,linux-compatibility |
The gsub() and trimws() functions were used to tidy the text in each column by removing unnecessary spaces, tabs, and other whitespace.
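The difference between the two functions is easiest to see on a small example. The strings below are made up to resemble raw scraped text; they are not taken from the actual page.

```r
# Made-up strings with stray whitespace, similar to raw scraped text
raw_name <- "\n        vscode\n      "
raw_description <- "   Visual Studio Code \t"

# gsub() with the pattern \\s+ deletes every run of whitespace, inner spaces included
clean_name <- gsub("\\s+", "", raw_name)

# trimws() removes only leading and trailing whitespace and keeps inner spaces
clean_description <- trimws(raw_description)

clean_name         # "vscode"
clean_description  # "Visual Studio Code"
```

This is why gsub() suits single-token fields such as names, URLs, and topic lists, while trimws() is the safer choice for free-text descriptions, where the spaces between words must survive.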