Assignment 3

Introduction

In every data science project, data collection or data mining is an important step in obtaining the data we need for analysis. There are various methods of collecting data; one of them is web scraping, the process of extracting useful data from the text-based markup languages (HTML and related formats) that make up a webpage.

In this assignment, we are going to scrape https://github.com/Microsoft/ using R.
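
As a minimal illustration of the idea, the sketch below parses a small hand-written HTML fragment and extracts the text of selected elements with rvest. The markup and the class name `repo` are invented for this example and are not taken from GitHub.

```r
library(rvest)

# A tiny hand-written page; the markup and class name are made up
page <- minimal_html('
  <ul>
    <li class="repo">vscode</li>
    <li class="repo">onnxruntime</li>
  </ul>
')

# Select nodes with a CSS selector, then extract their text
repo_names <- page %>%
  html_elements(css = ".repo") %>%
  html_text2()
repo_names
```

The same select-then-extract pattern is used throughout the assignment; only the selectors and the extraction function change.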

Load the required packages

library(rvest)
library(httr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ dplyr   1.0.8
## ✔ tidyr   1.1.4     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()

library(dplyr)
library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

Q1. Read the URL

url <- "https://github.com/Microsoft/"
MSOFT <- read_html(url)
MSOFT

## {html_document}
## <html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="logged-out env-production page-responsive" style="word-wrap: ...
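
Reading a page can fail, for example when the address is wrong or the network is unavailable. A minimal sketch of one way to guard against this is shown below; `safe_read_html` is a hypothetical helper name, not part of rvest, and the file name is a placeholder chosen so that the read fails.

```r
library(rvest)

# Hypothetical helper: return the parsed page, or NULL if reading fails
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist triggers the error branch and yields NULL
missing_page <- safe_read_html("no-such-file-for-this-demo.html")
is.null(missing_page)
```

With a real address, the call would be `safe_read_html("https://github.com/Microsoft/")`, followed by a check for `NULL` before any further extraction.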

Q2. Description of the user

user_description <- MSOFT %>%
  html_elements(css=".flex-1 .color-fg-muted div") %>%
  html_text()
user_description

## [1] "Open source projects and samples from Microsoft"

Q3.1. URL of the GitHub user’s logo

user_logo_url <- MSOFT %>%
  html_elements(css=".mr-md-4") %>%
  html_attr("src") # the image URL is stored in the src attribute
user_logo_url

## [1] "https://avatars.githubusercontent.com/u/6154722?s=200&v=4"
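
The difference between attributes can be seen on a small hand-written fragment. In the sketch below, the class name `avatar` and the example.com addresses are placeholders: an image stores its location in `src`, while `href` belongs to the surrounding link.

```r
library(rvest)

# Hand-written fragment; the class name and URLs are placeholders
page <- minimal_html('
  <a href="https://example.com/profile">
    <img class="avatar" src="https://example.com/logo.png" alt="logo">
  </a>
')

img <- html_element(page, css = "img.avatar")
img_src  <- html_attr(img, "src")   # the image location
img_href <- html_attr(img, "href")  # NA, because <img> has no href attribute
```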

Q3.2. Display the image of GitHub user’s logo

knitr::include_graphics(user_logo_url)

Q4. Most used topics

most_used_topics <- MSOFT %>%
  html_elements(css=".topic-tag-link") %>%
  html_text2()
most_used_topics

## [1] "azure"            "microsoft"        "python"           "machine-learning"
## [5] "typescript"
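
The choice between html_text() and html_text2() matters when the markup pads its text with line breaks and indentation. The sketch below uses a hand-written fragment that imitates a topic tag; it is an assumption about what such markup may look like, not a copy of the GitHub page.

```r
library(rvest)

# Hand-written fragment imitating a topic tag with padded text
page <- minimal_html('<a class="topic-tag-link">
    azure
  </a>')

tag <- html_element(page, css = ".topic-tag-link")
raw_text   <- html_text(tag)   # keeps the newlines and indentation from the source
clean_text <- html_text2(tag)  # handles whitespace roughly as a browser would
```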

Q5. A summary table of the first 10 repositories

5.1. Extract Repository Name

repo_name <- MSOFT %>%
  html_elements(css=".text-bold.mr-1") %>%
  html_text2()
repo_df <- as.data.frame(repo_name)
repo_df$repo_name <- gsub('\\s+', '', repo_df$repo_name)

5.2. Extract Repository URL

repo_url <- MSOFT %>%
  html_elements(css=".text-bold.mr-1") %>%
  html_attr("href")
# paste0() joins without a separator, so no whitespace clean-up is needed
repo_df$repo_url <- paste0("https://github.com", repo_url)
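
When building absolute URLs from relative links, note that paste() inserts a space between its arguments by default, whereas paste0() does not. The sketch below compares the two on sample paths and also shows xml2::url_absolute() as an alternative; the paths are sample values.

```r
# Relative links as they might appear in href attributes (sample values)
paths <- c("/microsoft/vscode", "/microsoft/lisa")

with_space <- paste("https://github.com", paths)   # default sep = " " leaves a space
joined     <- paste0("https://github.com", paths)  # no separator

# xml2 can also resolve relative links against a base URL
resolved <- xml2::url_absolute(paths, "https://github.com/")
```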

5.3. Extract Repository Description

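repo_description <- character(nrow(repo_df))
for (i in seq_len(nrow(repo_df))) {
  # html_element() returns only the first match, so each repository
  # contributes exactly one value (NA if it has no description)
  repo_description[i] <- read_html(repo_df$repo_url[i]) %>%
    html_element(css = "#repo-content-pjax-container .f4") %>%
    html_text2()
}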

repo_description <- trimws(repo_description, which = c("both"))
repo_df$repo_description <- repo_description

5.4. Extract Repository Topics

repo_topics <- character(nrow(repo_df))
for (i in seq_len(nrow(repo_df))) {
  repo_topics[i] <- read_html(repo_df$repo_url[i]) %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2() %>%
    paste(collapse = ',')
}

repo_df$repo_topics <- repo_topics
repo_df$repo_topics <- gsub('\\s+', '', repo_df$repo_topics)
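
Sections 5.3 and 5.4 each download every repository page, so each page is fetched twice. One possible alternative, sketched below, is a helper that takes an already-parsed page and returns both fields, so that each page only needs to be read once. The name `extract_repo_info` is hypothetical, the description selector is simplified to `.f4`, and the demonstration page is hand-written rather than fetched from GitHub.

```r
library(rvest)

# Hypothetical helper: extract both fields from an already-parsed page
extract_repo_info <- function(page) {
  description <- page %>%
    html_element(css = ".f4") %>%
    html_text2()
  topics <- page %>%
    html_elements(css = ".topic-tag-link") %>%
    html_text2() %>%
    paste(collapse = ",")
  data.frame(repo_description = description, repo_topics = topics)
}

# Offline demonstration on a hand-written page (markup is invented)
demo_page <- minimal_html('
  <p class="f4">Visual Studio Code</p>
  <a class="topic-tag-link">editor</a>
  <a class="topic-tag-link">typescript</a>
')
demo_info <- extract_repo_info(demo_page)
demo_info
```

With the real pages, the helper could be applied to `read_html()` of each repository URL and the resulting one-row data frames combined with `do.call(rbind, ...)`.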

repo_df %>%
  kbl(format = 'html', escape = FALSE,
      caption = "Summary table of the first 10 repositories") %>%
  kable_styling()

Summary table of the first 10 repositories
repo_name	repo_url	repo_description	repo_topics
azure-tools-for-java	https://github.com/microsoft/azure-tools-for-java	Azure tools for Java, including Azure Toolkits for Eclipse, IntelliJ and related projects.
microcode	https://github.com/microsoft/microcode	Microsoft MicroCode: physical computing for early education.	education,coding,microbit,jacdac
plcrashreporter	https://github.com/microsoft/plcrashreporter	Reliable, open-source crash reporting for iOS, macOS and tvOS	macos,ios,crash-reporting
hat	https://github.com/microsoft/hat	TOML-annotated C header file format for packaging binary files, from Microsoft Research	metadata,toml,benchmarking,cpp,python-library,cuda,platform-independent,cprogramming,rocm
satellite-imagery-labeling-tool	https://github.com/microsoft/satellite-imagery-labeling-tool	This is a lightweight web-interface for creating and sharing vector annotations over satellite/aerial imagery scenes.
vscode-maven	https://github.com/microsoft/vscode-maven	VSCode extension “Maven for Java”	microsoft,maven,vscode-extension,java-support
vscode	https://github.com/microsoft/vscode	Visual Studio Code	electron,microsoft,editor,typescript,visual-studio-code
onnxruntime	https://github.com/microsoft/onnxruntime	ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator	machine-learning,deep-learning,tensorflow,scikit-learn,pytorch,neural-networks,hardware-acceleration,hacktoberfest,ai-framework,onnx
azure-pipelines-tasks	https://github.com/microsoft/azure-pipelines-tasks	Tasks for Azure Pipelines
lisa	https://github.com/microsoft/lisa	LISA is developed and maintained by Microsoft, to empower Linux validation.	testing,linux,azure,automation-framework,e2e-testing,hyperv,cloudtesting,linux-compatibility

Q6. Tidy up the text in each column

The gsub() and trimws() functions were used in Q5 to tidy up the text in each column by removing unnecessary spaces, tabs, and other whitespace.
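
The two functions behave differently, which is why each suits different columns. The short sketch below applies both to made-up sample strings: trimws() removes only leading and trailing whitespace, while gsub() with the pattern `\\s+` removes every run of whitespace, including spaces between words.

```r
# Made-up sample values with stray whitespace
messy <- c("  Visual Studio Code \n", "\n  vscode  ")

trimmed  <- trimws(messy)            # strips leading and trailing whitespace only
stripped <- gsub("\\s+", "", messy)  # strips all whitespace, including inner spaces
```

For this reason trimws() is the safer choice for free-text columns such as descriptions, whereas gsub() fits values that should contain no spaces at all, such as names and URLs.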