class: center, middle, inverse, title-slide

.title[
# Web scraping for Business Professionals - 2025 ACME Conference
]
.author[
### Dr. Zhenning ‘Jimmy’ Xu, California State University Bakersfield, follow me on Twitter:
https://twitter.com/MKTJimmyxu
]
.date[
### 2025/03/19
]

---
background-image: url(https://upload.wikimedia.org/wikipedia/commons/b/be/Sharingan_triple.svg)

???

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Sharingan_triple.svg)

---
class: center, middle

# xaringan

### /ʃaː.'riŋ.ɡan/

---
class: inverse, center, middle

# Get Started

---

# Agenda for today

- What is web scraping
- Business usage of web scraping
  - 1) Digital marketing – SEO
  - 2) Monitoring competitors
  - 3) Price comparison
- Web scraping using R
- Useful libraries
- Which library to use for which job
- Legality
- A simple demonstration using httr (give it a try) – it takes about 1 minute to see how data scraping works at a basic level
- Contact me at *utjimmyx.github.io* for collaborations or unlimited data for your applied projects – *Don’t tell the IRB office at your school, as they will be angry when they find out!*

---
background-image: url(https://github.com/yihui/xaringan/releases/download/v0.0.2/karl-moustache.jpg)
background-position: 50% 50%
class: center, bottom, inverse

# You only live once!

---

# Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite website every time it updates to check for new information.

# Or you could write a web scraper to do it for you!

---

# Why web scraping?

- Copying and pasting data from online sources by hand is daunting
- Extract data that we can see while browsing the web
- Extract data from a website that does not have an API
- Extract a LOT of data, more than an API's rate limits would allow

---

# Web scraping in real life

- Extract web page information for SEO purposes
- Extract product information
- Extract job postings and internships (LinkedIn, Indeed.com, etc.)
- Extract offers and discounts from deal-of-the-day websites (very useful but under-utilized)
- Extract data to build a search engine
- Gather weather data
- And so on – your imagination is the only limit

# *Analytics (our goal for most applied projects)*

- Using data and analytics to inform future actions

---

# Workflow of Web Scraping

- Identify the URL
- Read the page (using the read_html function)
- Specify the HTML elements using a CSS selector
- Extract the HTML elements
- Store the data
- Build a for loop to repeat these steps efficiently across pages
- Packages: httr, xml2, rvest, selenium, etc.

---

# Web scraping – business usage (extracting complex tables from PDF documents)

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/workflow.png"/>]

---

# Web scraping – business usage (extracting complex tables from PDF documents)

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/itsgoing.jpg"/>]

---

# Web scraping – business usage (extracting complex tables from PDF documents)

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/crop.png" width='50%' align="middle"/>]

---

# Web scraping – business usage (extracting complex tables from PDF documents)

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/itsgoing.png" width='50%' align="middle"/>]

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/cropdashboard.png" width='50%' align="middle"/>]

---

# Web scraping – business usage

- Scraping the title tags from LL Bean (ref: https://rpubs.com/utjimmyx/seobasics)

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/seo_demo.png" />]

---

# Web scraping – business usage

- Scraping a top news story from a news website (ref: utjimmyx.github.io)
- *Other interesting examples*

.center[<img src="https://raw.githubusercontent.com/utjimmyx/resources/master/ecommerce.png" />]

---

## Thank you all for your participation!

### Questions

### How to make these slides for your creative work

- https://www.rstudio.com/about/customer-stories/
- xaringan Presentations and R Markdown: https://bookdown.org/yihui/rmarkdown/xaringan.html
- Today's slides can be accessed on my RPubs page: [**rpubs.com**](https://rpubs.com/utjimmyx)

### Fun and useful stuff

- **The syntax file you will be using for today's tutorial is named gtrendsR_business_analyst.txt**; you can download it from https://github.com/utjimmyx/resources
- **My website (recently rebuilt using R for free)**: utjimmyx.github.io
- **My Tableau site**: https://public.tableau.com/app/profile/zhenning.xu
- **My GitHub page**: https://github.com/utjimmyx
- **Co-organizer, Central Valley Data Analytics and R Users Meetup Group**: https://www.meetup.com/valley-data-analytics-using-r-meetup-group/
- **My YouTube channel**: please consider following or subscribing at https://www.youtube.com/@webdatax
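
---

# A minimal sketch of the scraping workflow

The workflow from earlier in the deck (read the page, select elements with CSS selectors, extract them, store the data, loop over pages) might look roughly like this in R. This is only a sketch under several assumptions: it needs rvest 1.0.0 or later, the two HTML snippets are made up and stand in for real URLs, the class names `.product`, `.name`, and `.price` are invented for illustration, and `scrape_products()` is a hypothetical helper, not part of any package.

```r
library(rvest)

# Made-up pages standing in for two real URLs (an assumption for illustration).
# With a live site, each element would be a URL string instead.
pages <- c(
  page1 = '<html><body>
    <div class="product"><span class="name">Canvas tote</span><span class="price">$29.95</span></div>
    <div class="product"><span class="name">Wool socks</span><span class="price">$14.50</span></div>
  </body></html>',
  page2 = '<html><body>
    <div class="product"><span class="name">Rain jacket</span><span class="price">$89.00</span></div>
  </body></html>'
)

# Hypothetical helper: read one page, select elements by CSS, return a data frame
scrape_products <- function(source) {
  page <- read_html(source)                                   # read the page
  product_names <- html_text2(html_elements(page, ".product .name"))
  price_text    <- html_text2(html_elements(page, ".product .price"))
  data.frame(
    name  = product_names,
    price = as.numeric(sub("$", "", price_text, fixed = TRUE)),  # drop the dollar sign
    stringsAsFactors = FALSE
  )
}

# Loop over the pages and store each result
results <- list()
for (page_id in names(pages)) {
  results[[page_id]] <- scrape_products(pages[[page_id]])
}

# Combine everything into one table
products <- do.call(rbind, results)
rownames(products) <- NULL
print(products)
```

To try it on a live site, replace the strings in `pages` with URLs and adjust the CSS selectors to match that site's markup; check the site's terms of use and robots.txt first, and consider adding `Sys.sleep()` between requests.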