Implementation of Web Scraping using R

A lot of data isn’t available through data sets or APIs and exists only on the internet as web pages. Web scraping lets us access that data without waiting for the provider to create an API. In this assignment, we will explore the official GitHub page for Microsoft.

Please refer to the URL below: https://github.com/Microsoft/

We will use rvest to perform some commonly used web scraping techniques.

The code Snippet

Loading the libraries required in this Assignment

library(rvest)
library(httr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(gt)

1. Read HTML Code

Here we read the HTML code from the webpage URL using read_html().

url <- "https://github.com/Microsoft/"

Git_url <- read_html(url)
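
Since httr is already loaded, the request can also be made explicitly so that a failed download is caught before parsing. The following is a minimal sketch, not part of the original assignment: the helper name fetch_page, the user-agent string, and the 10-second timeout are illustrative assumptions.

```r
library(rvest)
library(httr)

# Hypothetical helper: download a page with httr, fail loudly on HTTP errors,
# then hand the response body to read_html() for parsing.
fetch_page <- function(url) {
  resp <- GET(url, user_agent("rvest-tutorial-example"), timeout(10))
  stop_for_status(resp)  # raises an R error for 4xx/5xx responses
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}
```

It could be called as fetch_page("https://github.com/Microsoft/"), and the result used in place of the object returned by read_html().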

Scrape Data From HTML Code

2. Fetching the User Description

User_Description <- Git_url %>%
  html_elements(css=".flex-1 .color-fg-muted div") %>%
  html_text()
User_Description
## [1] "Open source projects and samples from Microsoft"
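
The selector above is a descendant selector: it matches div elements that sit inside an element with class color-fg-muted, which in turn sits inside an element with class flex-1. A small offline sketch illustrates this on a made-up fragment (the markup below is an assumption for demonstration, not GitHub's actual HTML):

```r
library(rvest)

# Hypothetical fragment that mimics the nesting the selector relies on
page <- minimal_html('
  <div class="flex-1">
    <div class="color-fg-muted"><div>Open source projects and samples</div></div>
  </div>
  <div class="color-fg-muted"><div>Unrelated text outside .flex-1</div></div>
')

# Only the div nested under BOTH classes is selected
desc <- page %>%
  html_elements(css = ".flex-1 .color-fg-muted div") %>%
  html_text()
desc
```

The second block is skipped because it is not nested inside .flex-1, so desc holds a single string.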

3. Fetching the URL of the GitHub user’s logo and displaying the image in RMarkdown

User_Logo <- Git_url %>%
  html_element(css=".mr-md-4") %>%
  html_attr("src")
knitr::include_graphics(User_Logo)
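
html_attr() returns the attribute value exactly as written in the page, which may be a relative path, and NA when the attribute is absent. The sketch below shows both cases on a made-up fragment (the markup and the /images/logo.png path are assumptions, not GitHub's actual HTML) and resolves the relative path with xml2::url_absolute():

```r
library(rvest)

# Hypothetical fragment: one image with a relative src, one without any src
page <- minimal_html('
  <img class="mr-md-4" src="/images/logo.png" alt="logo">
  <img class="other" alt="no source">
')

src <- page %>% html_element(".mr-md-4") %>% html_attr("src")
missing_src <- page %>% html_element(".other") %>% html_attr("src")  # NA

# Turn the relative path into a full URL against the site root
logo_url <- xml2::url_absolute(src, "https://github.com/")
logo_url
```

Checking for NA before calling knitr::include_graphics() avoids passing a missing path to the renderer.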

4. Most used topics

Here we fetch the topics from the “Most used topics” section on the page.

Most_Used_Topics <- Git_url %>%
  html_elements(css=".topic-tag-link") %>%
  html_text2()
Most_Used_Topics
## [1] "azure"            "microsoft"        "python"           "machine-learning"
## [5] "typescript"

5. A Summary table of the first 10 Repositories

The column names used here, as per the problem description: repo_name, repo_url, repo_description, repo_topics

repo_name

repo_name <- Git_url %>%
  html_elements(css=".text-bold.mr-1") %>%
  html_text2()
RepoName_df <- as.data.frame(repo_name)
RepoName_df$repo_name <- gsub('\\s+', '', RepoName_df$repo_name)

repo_url

repo_url <- Git_url %>%
  html_elements(css=".text-bold.mr-1") %>%
  html_attr("href")
# Prefix the relative href with the site root; paste0() adds no separator,
# so no whitespace needs to be stripped afterwards
RepoName_df$repo_url <- paste0("https://github.com", trimws(repo_url))

repo_description

repo_description <- Git_url %>%
  html_elements(css=".wb-break-word") %>%
  html_text()
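repo_description <- trimws(repo_description)

# Match descriptions to repositories by position. This assumes every repository
# on the page has a description element; if one is missing, later rows shift up.
for (i in seq_len(nrow(RepoName_df))) {
  if (is.na(repo_description[i]) || repo_description[i] == "") {
    RepoName_df$repo_description[i] <- "No description"
  } else {
    RepoName_df$repo_description[i] <- repo_description[i]
  }
}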
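
Matching two separately scraped vectors by position breaks as soon as one repository lacks a description. One alternative is to select one container node per repository and call html_element() (singular) inside each container, which returns exactly one result per container and NA where nothing matches. The sketch below uses made-up markup; li.repo, .name, and .desc are placeholder selectors and would need to be replaced with the real ones from the GitHub page.

```r
library(rvest)

# Hypothetical markup: three repository rows, the second has no description
page <- minimal_html('
  <ul>
    <li class="repo"><a class="name" href="/org/alpha">alpha</a><p class="desc">First project</p></li>
    <li class="repo"><a class="name" href="/org/beta">beta</a></li>
    <li class="repo"><a class="name" href="/org/gamma">gamma</a><p class="desc">Third project</p></li>
  </ul>
')

# One node per repository keeps every field aligned with its row
rows <- html_elements(page, "li.repo")

repos <- data.frame(
  repo_name = rows %>% html_element(".name") %>% html_text(),
  repo_url = paste0("https://github.com",
                    rows %>% html_element(".name") %>% html_attr("href")),
  repo_description = rows %>% html_element(".desc") %>% html_text(),
  stringsAsFactors = FALSE
)

# Missing descriptions arrive as NA and can be filled in one step
repos$repo_description[is.na(repos$repo_description)] <- "No description"
repos
```

With this layout the placeholder text lands on the repository that actually lacks a description, and the remaining rows keep their own text.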

repo_topics

repo_topics <- c()
for (i in 1:nrow(RepoName_df)) {
  repo_topics[i] <- read_html(RepoName_df$repo_url[i]) %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2() %>%
    paste(collapse = ',')
}

RepoName_df$repo_topics <- repo_topics
RepoName_df$repo_topics <- gsub('\\s+', '', RepoName_df$repo_topics)
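
The loop above issues one request per repository, so a single failed download stops the whole run. A hedged sketch of a more defensive variant follows; the helper names extract_topics and safe_topics and the one-second pause are assumptions for illustration, not part of the original code.

```r
library(rvest)

# Collapse the topic tags found in one parsed page into a comma-separated string
extract_topics <- function(page) {
  page %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2() %>%
    paste(collapse = ",")
}

# Hypothetical wrapper: pause between requests and return NA instead of
# aborting when a repository page cannot be read
safe_topics <- function(repo_url, pause = 1) {
  Sys.sleep(pause)
  tryCatch(extract_topics(read_html(repo_url)),
           error = function(e) NA_character_)
}

# Offline check on a made-up fragment
demo <- minimal_html('<a class="topic-tag-link">react</a><a class="topic-tag-link">uwp</a>')
extract_topics(demo)
```

Applied to the data frame, this could look like vapply(RepoName_df$repo_url, safe_topics, character(1)). A page without any topic tags yields an empty string, which matches the blank cells in the summary table.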

RepoName_df %>%
  head(10) %>%
  kbl() %>%
  kable_styling()
| repo_name | repo_url | repo_description | repo_topics |
|---|---|---|---|
| typescript-error-deltas | https://github.com/microsoft/typescript-error-deltas | Test popular repos on new versions of Typescript: display new compiler errors | |
| winget-pkgs | https://github.com/microsoft/winget-pkgs | The Microsoft community Windows Package Manager manifest repository | hacktoberfest |
| Build-OMS-Agent-for-Linux | https://github.com/microsoft/Build-OMS-Agent-for-Linux | Build projects required for omsagent | |
| typescript-server-harness | https://github.com/microsoft/typescript-server-harness | Tools for exercising a TypeScript tsserver process | |
| AL-Go | https://github.com/microsoft/AL-Go | Engineering specs for DirectX features. | |
| DirectX-Specs | https://github.com/microsoft/DirectX-Specs | ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator | |
| onnxruntime | https://github.com/microsoft/onnxruntime | DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. | machine-learning,deep-learning,tensorflow,scikit-learn,pytorch,neural-networks,hardware-acceleration,hacktoberfest,ai-framework,onnx |
| DeepSpeed | https://github.com/microsoft/DeepSpeed | C++ Library Manager for Windows, Linux, and MacOS | machine-learning,compression,deep-learning,gpu,inference,pytorch,zero,data-parallelism,model-parallelism,mixture-of-experts,pipeline-parallelism,billion-parameters,trillion-parameters |
| vcpkg | https://github.com/microsoft/vcpkg | A framework for building native Windows apps with React. | c,windows,package-manager,visual-studio,cmake,packages,cplusplus,cpp,libraries,vcpkg |
| react-native-windows | https://github.com/microsoft/react-native-windows | NA | react,react-native,dotnet,uwp,xbox |

Note: descriptions are matched to repositories by position, and the page evidently has no description element for AL-Go. As a result, every description from AL-Go onward belongs to the repository one row below it, and the last row has no description left (NA).