Implementation of Web Scraping using R
A lot of data is not available as downloadable data sets or through APIs; it exists only as web pages. Web scraping lets you access that data without waiting for the provider to create an API. In this assignment, we will explore the official GitHub page for Microsoft.
Please refer to the URL below: https://github.com/Microsoft/
We will use rvest to perform some commonly used web scraping techniques.
The Code Snippet
Loading the libraries required for this assignment
library(rvest)
library(httr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gt)
1. Read HTML Code
Here we read the HTML code from the page URL using read_html().
url <- "https://github.com/Microsoft/"
Git_url <- read_html(url)
Scrape Data From HTML Code
2. Fetching the User Description
User_Description <- Git_url %>%
html_elements(css=".flex-1 .color-fg-muted div") %>%
html_text()
User_Description
## [1] "Open source projects and samples from Microsoft"
3. Fetching the URL of the GitHub user’s logo and displaying the image in RMarkdown
User_Logo <- Git_url %>%
html_element(css=".mr-md-4") %>%
html_attr("src")
knitr::include_graphics(User_Logo)
4. Most used topics
Here we fetch the topics from the “Most used topics” section of the page.
Most_Used_Topics <- Git_url %>%
html_elements(css=".topic-tag-link") %>%
html_text2()
Most_Used_Topics
## [1] "azure" "microsoft" "python" "machine-learning"
## [5] "typescript"
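The three extraction steps above (text, attribute, and a list of matches) can be tried offline on a small inline page. This is a minimal sketch: the markup and the class names `.logo`, `.bio` and `.tag` are invented for illustration and are not GitHub's real HTML.

```r
library(rvest)

# Hypothetical markup, invented for this example; not GitHub's real HTML.
page <- minimal_html('
  <div class="profile">
    <img class="logo" src="https://example.com/logo.png">
    <div class="bio">  Open source
      projects  </div>
    <a class="tag" href="/topics/azure">azure</a>
    <a class="tag" href="/topics/python">python</a>
  </div>')

# html_elements() returns every match; html_element() returns only the first.
tags <- page %>% html_elements(".tag") %>% html_text2()
logo <- page %>% html_element(".logo") %>% html_attr("src")

# html_text() returns the raw text, including the line break and padding;
# html_text2() formats the text roughly the way a browser would display it.
bio_raw   <- page %>% html_element(".bio") %>% html_text()
bio_clean <- page %>% html_element(".bio") %>% html_text2()

print(tags)
print(logo)
```

Testing selectors on a small snippet like this is often quicker than re-requesting the live page after every change.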
5. A Summary table of the first 10 Repositories
The column names follow the problem description: repo_name, repo_url, repo_description, repo_topics.
repo_name
repo_name <- Git_url %>%
html_elements(css=".text-bold.mr-1") %>%
html_text2()
RepoName_df <- as.data.frame(repo_name)
RepoName_df$repo_name <- gsub('\\s+', '', RepoName_df$repo_name)
repo_url
repo_url <- Git_url %>%
html_elements(css=".text-bold.mr-1") %>%
html_attr("href")
RepoName_df$repo_url <- repo_url
# paste0() joins without a separator, so no whitespace clean-up is needed afterwards
RepoName_df$repo_url <- paste0("https://github.com", RepoName_df$repo_url)
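When a host is joined with a relative href, the separator matters. The following minimal sketch uses two example paths (the first taken from the table in this document) to show how base R's `paste()` and `paste0()` differ.

```r
# Relative hrefs as they might be returned by html_attr("href")
path <- c("/microsoft/vcpkg", "/microsoft/onnxruntime")

# paste() inserts sep = " " by default, which leaves a space inside the URL
with_space <- paste("https://github.com", path)

# paste0() concatenates with no separator
no_space <- paste0("https://github.com", path)

print(with_space)
print(no_space)
```

Stripping whitespace from the `paste()` result gives the same strings as `paste0()`, which is why the single call is enough.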
repo_description
repo_description <- Git_url %>%
html_elements(css=".wb-break-word") %>%
html_text()
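repo_description <- trimws(repo_description, which = "both")
# Descriptions are matched to repositories by position. A repository without a
# description contributes no element, so later descriptions shift up one row
# (see AL-Go in the output) and the last rows run past the end of the vector.
for (i in seq_len(nrow(RepoName_df))) {
  # Indexing past the end of the vector yields NA; treat NA or "" as missing
  if (is.na(repo_description[i]) || repo_description[i] == "") {
    RepoName_df$repo_description[i] <- "No description"
  } else {
    RepoName_df$repo_description[i] <- repo_description[i]
  }
}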
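One way to keep names and descriptions aligned is to select each repository's container first and then call html_element() (singular) inside it, which returns one result per container and NA where nothing matches. The sketch below runs on invented markup; the class names `.repo`, `.name` and `.desc` are assumptions for illustration, and the real selectors on the GitHub page would have to be looked up.

```r
library(rvest)

# Hypothetical markup: three repository cards, the second has no description.
page <- minimal_html('
  <ul>
    <li class="repo"><a class="name" href="/org/a">a</a><p class="desc">First repo</p></li>
    <li class="repo"><a class="name" href="/org/b">b</a></li>
    <li class="repo"><a class="name" href="/org/c">c</a><p class="desc">Third repo</p></li>
  </ul>')

# One node per repository card
cards <- html_elements(page, ".repo")

# html_element() yields exactly one result per card, NA when the child is absent
repos <- data.frame(
  repo_name        = cards %>% html_element(".name") %>% html_text2(),
  repo_description = cards %>% html_element(".desc") %>% html_text2()
)

# Replace missing descriptions with a placeholder
repos$repo_description[is.na(repos$repo_description)] <- "No description"

print(repos)
```

Because every column comes from the same set of cards, a missing description stays on its own row instead of shifting the rows below it.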
repo_topics
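# Each iteration requests one repository page and collects its topic tags
repo_topics <- character(nrow(RepoName_df))
for (i in seq_len(nrow(RepoName_df))) {
  repo_topics[i] <- read_html(RepoName_df$repo_url[i]) %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2() %>%
    paste(collapse = ',')
}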
RepoName_df$repo_topics <- repo_topics
RepoName_df$repo_topics <- gsub('\\s+', '', RepoName_df$repo_topics)
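Since this step sends one request per repository, a single failed request would stop the whole loop. The following is a sketch of a more defensive variant, not the method used above: `fetch_topics()`, its `reader` argument and the `pause` delay are hypothetical names introduced here. The `reader` argument exists only so the function can be exercised offline with a stub.

```r
library(rvest)

# Hypothetical helper: returns the topics of one page as a comma-separated
# string, or NA if the page cannot be read.
#   reader: function that turns a URL into a parsed document (default read_html)
#   pause:  seconds to wait before each request (arbitrary courtesy delay)
fetch_topics <- function(url, reader = read_html, pause = 1) {
  Sys.sleep(pause)
  tryCatch(
    reader(url) %>%
      html_elements(".topic-tag-link") %>%
      html_text2() %>%
      paste(collapse = ","),
    error = function(e) NA_character_
  )
}

# Offline demonstration with stub readers instead of real requests
ok_reader <- function(u) {
  minimal_html('<a class="topic-tag-link">azure</a><a class="topic-tag-link">python</a>')
}
bad_reader <- function(u) stop("simulated network failure")

topics_ok  <- fetch_topics("stub", reader = ok_reader, pause = 0)
topics_bad <- fetch_topics("stub", reader = bad_reader, pause = 0)

print(topics_ok)
print(topics_bad)
```

With real pages, the helper could be applied to every URL with `vapply(RepoName_df$repo_url, fetch_topics, character(1), USE.NAMES = FALSE)`, which always returns one value per repository.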
RepoName_df %>%
kbl() %>%
kable_styling()
| repo_name | repo_url | repo_description | repo_topics |
|---|---|---|---|
| typescript-error-deltas | https://github.com/microsoft/typescript-error-deltas | Test popular repos on new versions of Typescript: display new compiler errors | |
| winget-pkgs | https://github.com/microsoft/winget-pkgs | The Microsoft community Windows Package Manager manifest repository | hacktoberfest |
| Build-OMS-Agent-for-Linux | https://github.com/microsoft/Build-OMS-Agent-for-Linux | Build projects required for omsagent | |
| typescript-server-harness | https://github.com/microsoft/typescript-server-harness | Tools for exercising a TypeScript tsserver process | |
| AL-Go | https://github.com/microsoft/AL-Go | Engineering specs for DirectX features. | |
| DirectX-Specs | https://github.com/microsoft/DirectX-Specs | ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator | |
| onnxruntime | https://github.com/microsoft/onnxruntime | DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. | machine-learning,deep-learning,tensorflow,scikit-learn,pytorch,neural-networks,hardware-acceleration,hacktoberfest,ai-framework,onnx |
| DeepSpeed | https://github.com/microsoft/DeepSpeed | C++ Library Manager for Windows, Linux, and MacOS | machine-learning,compression,deep-learning,gpu,inference,pytorch,zero,data-parallelism,model-parallelism,mixture-of-experts,pipeline-parallelism,billion-parameters,trillion-parameters |
| vcpkg | https://github.com/microsoft/vcpkg | A framework for building native Windows apps with React. | c,windows,package-manager,visual-studio,cmake,packages,cplusplus,cpp,libraries,vcpkg |
| react-native-windows | https://github.com/microsoft/react-native-windows | NA | react,react-native,dotnet,uwp,xbox |