Introduction

In every data science project, data collection is an important step in obtaining the data we need for analysis. Data can be collected in many ways; one of them is web scraping, the process of extracting useful data from the text-based markup languages (HTML and its relatives) that make up web pages.

In this assignment, we scrape the Microsoft page on GitHub (https://github.com/Microsoft/) using R.

Load the required packages

library(rvest)
library(httr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ dplyr   1.0.8
## ✔ tidyr   1.1.4     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
library(dplyr)  # redundant: tidyverse already attaches dplyr
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Q1. Read the URL

# Download the page and parse it into an HTML document object
url <- "https://github.com/Microsoft/"
MSOFT <- read_html(url)
MSOFT
## {html_document}
## <html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="logged-out env-production page-responsive" style="word-wrap: ...
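
Calling read_html() directly on a URL is convenient, but it leaves little room for setting a timeout or a user agent, or for checking the HTTP status before parsing. Since httr is already loaded, the request could be made explicitly first. The following is only a minimal sketch: the helper name fetch_page, the user-agent string, and the 10-second default timeout are assumptions made for illustration and are not part of the assignment.

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a page with httr, fail early on HTTP errors,
# then parse the response body with rvest.
# The user-agent string and default timeout are illustrative assumptions.
fetch_page <- function(url, timeout_sec = 10) {
  resp <- GET(
    url,
    user_agent("web-scraping-assignment (R/httr)"),
    timeout(timeout_sec)
  )
  stop_for_status(resp)  # raises an R error for 4xx/5xx responses
  # Read the body as text and hand the HTML string to read_html()
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (requires network access):
# MSOFT <- fetch_page("https://github.com/Microsoft/")
```

Separating the request from the parsing step means a failed download surfaces as an HTTP error rather than as an empty selector result later on.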

Q2. Description of the user

# Select the node(s) matching the CSS selector and extract their text
user_description <- MSOFT %>%
  html_elements(css = ".flex-1 .color-fg-muted div") %>%
  html_text()
user_description
## [1] "Open source projects and samples from Microsoft"
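
CSS selectors like the one above depend on the page's markup, which GitHub can change at any time. A selector can be tried out offline on a small HTML fragment before it is run against the live page. The sketch below does this with rvest's minimal_html(); the markup is a hypothetical stand-in written for this example, not GitHub's actual HTML.

```r
library(rvest)

# Hypothetical markup standing in for a profile header (not GitHub's real HTML)
page <- minimal_html('
  <div class="flex-1">
    <div class="color-fg-muted">
      <div>  Open source projects
        and samples  </div>
    </div>
  </div>
')

# Same descendant selector as in Q2: a div inside .color-fg-muted inside .flex-1
nodes <- html_elements(page, css = ".flex-1 .color-fg-muted div")

raw_text    <- html_text(nodes)   # keeps whitespace exactly as in the source
description <- html_text2(nodes)  # collapses whitespace, roughly as a browser would

# A selector that matches nothing returns an empty result instead of an error,
# so it is worth checking the length before using the text.
missing <- html_elements(page, css = ".does-not-exist")
length(missing) == 0
```

Checking that the result is non-empty guards against silently scraping nothing when the page layout changes.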