class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #8: Webscraping
]
.author[
### Samuel Orso
]
.date[
### 26 October 2023
]

---

# Webscraping with R

```r
library(rvest)
url <- "https://ptds.samorso.ch/lectures/"
# Read the page, keep the first table and show rows 5 to 7
read_html(url) %>%
  html_table() %>%
  .[[1]] %>%
  .[5:7,] %>%
  kableExtra::kable()
```

<table>
 <thead>
  <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Topic </th> <th style="text-align:left;"> Instructor </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> 19 Oct </td> <td style="text-align:left;"> Exercise and Homework 2, R coding style guide </td> <td style="text-align:left;"> Aleksandr </td> </tr>
  <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> 26 Oct </td> <td style="text-align:left;"> Function I, Project Proposal, Webscraping </td> <td style="text-align:left;"> Samuel </td> </tr>
  <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> 2 Nov </td> <td style="text-align:left;"> Exercise and Homework 3 </td> <td style="text-align:left;"> Aleksandr </td> </tr>
</tbody>
</table>

---

# API

* **A**pplication **P**rogramming **I**nterfaces (APIs) are the gold standard for fetching data from the web.
* Data is fetched by sending HTTP requests directly to the provider.
* Requests can be made from `R` with `library(httr)` or with a dedicated API wrapper package.
* Data fetched through an API is generally more reliable than data scraped from web pages.

<table>
 <thead>
  <tr> <th style="text-align:left;"> Provider </th> <th style="text-align:left;"> Registration </th> <th style="text-align:left;"> Wrapper </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Twitter </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> TRUE </td> </tr>
  <tr> <td style="text-align:left;"> Financial Times </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> TRUE </td> </tr>
  <tr> <td style="text-align:left;"> Open Weather Map </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> TRUE </td> </tr>
  <tr> <td style="text-align:left;"> DeepL </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> TRUE </td> </tr>
</tbody>
</table>

---

# API example: Wikipedia pageviews

```r
library(pageviews)
# Yesterday's most viewed English Wikipedia articles, top 10 by views
top_articles("en.wikipedia", start = (Sys.Date()-1)) %>%
  dplyr::select(article, views) %>%
  dplyr::top_n(10)
```

```
## Selecting by views
```

```
##                              article   views
## 1                          Main_Page 4697700
## 2                     Special:Search 1286342
## 3             2023_Cricket_World_Cup  548741
## 4             Leo_(2023_Indian_film)  417184
## 5                  Cricket_World_Cup  413295
## 6        Wikipedia:Featured_pictures  266922
## 7  Killers_of_the_Flower_Moon_(film)  248633
## 8                          Tom_Emmer  246187
## 9                            YouTube  226346
## 10                     Vijayadashami  166568
```

---

# API example: translation with DeepL

```r
library(deeplr)
deeplr::translate2(
  text = "Mais quelle bonne traduction nom d'une pipe!",
  target_lang = "EN",
  auth_key = my_key
)
```

```
## [1] "But what a great translation!"
```

This is what we obtain on Google Translate:

> But what a good translation of the name of a pipe!

---

# API example: ChatGPT

```r
library(chatgpt)
cat(ask_chatgpt("What do you think about the Programming Tools in Data Science class in R?"))
```

```
## 
## *** ChatGPT input:
## 
## What do you think about the Programming Tools in Data Science class in R?
```

```
## As an AI assistant, I don't have personal opinions. However, I can provide some information about the Programming Tools in Data Science class in R.
## 
## The Programming Tools in Data Science class in R is a course focused on teaching programming tools and techniques specifically tailored for data science tasks using R programming language. The course aims to equip students with the necessary skills to efficiently manipulate, analyze, and visualize data in R.
## 
## This class typically covers topics such as data wrangling, data visualization, package management, version control, and reproducibility. Students will also learn how to use RStudio, a popular integrated development environment (IDE) for R, to effectively write and debug code.
## 
## Overall, this class can be valuable for individuals interested in data science as it provides essential programming tools and techniques necessary for data analysis using the R programming language.
```

---

# Webscraping with R

* If no API wrapper is available, e.g. there is no `R` package on CRAN or GitHub, you could try to build your own API wrapper by following, for example, [this tutorial](https://colinfay.me/build-api-wrapper-package-r/) or [that one](https://httr2.r-lib.org/articles/wrapping-apis.html) (not covered in this class).
* Instead, we discuss webscraping, a method that works regardless of whether a website offers an API.

---

# Scraping?

<center>
<div style="width:800px"><iframe allow="fullscreen" frameBorder="0" height="450" src="https://giphy.com/embed/Q8VCAek0MGjRK" width="800"></iframe></div>
</center>

---

# HTTP request/response cycle

<img src="images/http_request_response.png" width="1680" />

---

# HyperText Markup Language

```html
<!DOCTYPE html>
<html>
<body>

<h1 id='first'>Webscraping with R</h1>
<p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
   <em>rvest</em> is included in the <em>tidyverse</em></p>

</body>
</html>
```

.bottom[[Try it!](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default)]

---

# HTML

* An **element** starts with `<tag>` and ends with `</tag>`,
* it has optional **attributes** (e.g. `id='first'`),
* its **content** is everything between the two tags.
* For example, add the attribute `style="background-color:DodgerBlue;"` to `h1` and try it.

---

# HTML elements

tag | meaning
--- | ---
p | Paragraph
h1 | Top-level heading
h2, h3, ... | Lower level headings
ol | Ordered list
ul | Unordered list
li | List item
img | Image
a | Anchor (Hyperlink)
div | Section wrapper (block-level)
span | Text wrapper (in-line)

Find out more tags [here](https://developer.mozilla.org/en-US/docs/Web/HTML) or [here](https://www.w3schools.com/tags/)

---

# Data extraction

Create an HTML page with `minimal_html()` for experimenting

```r
# minimal_html() comes from rvest, which was loaded earlier
html_page <- minimal_html('
<body>
  <h1>Webscraping with R</h1>
  <p> Basic experience with <a href="www.r-project.org">R</a> and
      familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
     <em>rvest</em> is included in the <em>tidyverse</em></p>
</body>')
```

---

# Example: list item (li)

```html
...
<h2>Technologies</h2>
<ol>
*  <li>HTML: <em>Hypertext Markup Language</em></li>
*  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
*  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
...
```

```r
html_page %>% html_nodes("li")
```

```
## {xml_nodeset (3)}
## [1] <li>HTML: <em>Hypertext Markup Language</em>\n</li>
## [2] <li>CSS: <em>Cascading Style Sheets</em>\n</li>
## [3] <li>rvest</li>
```

```r
html_page %>% html_nodes("li") %>% html_text()
```

```
## [1] "HTML: Hypertext Markup Language" "CSS: Cascading Style Sheets"
## [3] "rvest"
```

---

# Example: heading of order 2 (h2)

```html
...
* <h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
* <h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
...
```

```r
html_page %>% html_nodes("h2") %>% html_text()
```

```
## [1] "Technologies" "Packages"
```

---

# Example: emphasized text (em)

```html
<p> Basic experience with <a href="www.r-project.org">R</a> and
*   familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
*  <li>HTML: <em>Hypertext Markup Language</em></li>
*  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
*  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("em") %>% html_text()
```

```
## [1] "Tidyverse"                 "Hypertext Markup Language"
## [3] "Cascading Style Sheets"    "rvest"
## [5] "tidyverse"
```

---

# Cascading Style Sheets (CSS)

* CSS is used to specify the style (appearance, arrangement and variations) of your web pages.

```html
<style>
body {
  background-color: lightblue;
}

h1 {
  color: white;
  text-align: center;
}

.content {
  font-family: monospace;
  font-size: 1.5em;
  color: black;
}

#intro {
  background-color: lightgrey;
  border-style: solid;
  border-width: 5px;
  padding: 5px;
  margin: 5px;
  text-align: center;
}
</style>
...
```

---

# Combining commands with CSS selectors

selector | meaning
--- | ---
, | grouping
space | descendant
> | child
+ | adjacent sibling
~ | general sibling
:first-child | first element
:nth-child(n) | nth element
:last-child | last element
. | class selector
# | id selector

.center[[CSS diner](https://flukeout.github.io/), [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors), [W3 School](https://www.w3schools.com/css/css_selectors.asp)]

---

# CSS Selector: grouping (`,`)

* The grouping selector combines several selectors and selects every HTML element that matches at least one of them.
* For example, `div, p` selects all `<div>` elements and all `<p>` elements.
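
---

# Example: grouping `h1` and `h2` on a toy page

The grouping selector can also be tried on a smaller page. The snippet below is a minimal sketch: the `page` object and its content are made up for illustration and are not part of the lecture material.

```r
library(rvest)

# Hypothetical toy page (assumption, not from the lecture)
page <- minimal_html('
  <h1>Title</h1>
  <p>First paragraph</p>
  <h2>Section</h2>
  <p>Second paragraph</p>
')

# "h1, h2" keeps every element that is an <h1> or an <h2>;
# the <p> elements are left out
headings <- page %>% html_nodes("h1, h2") %>% html_text()
headings
# [1] "Title"   "Section"
```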
--- # Example: grouping `li` and `em` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` --- # Example: grouping `li` and `em` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and * familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> * <li>HTML: <em>Hypertext Markup Language</em></li> * <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> * <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: * <em>rvest</em> is included in the <em>tidyverse</em></p> ``` ```r html_page %>% html_nodes("li, em") %>% html_text() ``` ``` ## [1] "Tidyverse" "HTML: Hypertext Markup Language" ## [3] "Hypertext Markup Language" "CSS: Cascading Style Sheets" ## [5] "Cascading Style Sheets" "rvest" ## [7] "rvest" "tidyverse" ``` --- # CSS Selector: descendant selector (`space`) * The descendant selector matches all elements that are descendants of a specified element. * For example, `div p` selects all `<p>` elements inside `<div>` elements. 
--- # Example: all `em` that are descendants of `li` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` --- # Example: all `em` that are descendants of `li` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> * <li>HTML: <em>Hypertext Markup Language</em></li> * <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` ```r html_page %>% html_nodes("li em") %>% html_text() ``` ``` ## [1] "Hypertext Markup Language" "Cascading Style Sheets" ``` --- # CSS Selector: child selector (`>`) * The child selector selects all elements that are the children of a specified element. * For example, `div > p` selects all `<p>` elements that are children of a `<div>` element. 
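
---

# Example: `div p` versus `div > p`

On the lecture page the descendant and child selectors often return the same nodes. The sketch below uses a made-up page (the `page` object and its content are assumptions for illustration) where the two differ, because one `<p>` is nested one level deeper inside a `<blockquote>`.

```r
library(rvest)

# Hypothetical toy page (assumption, not from the lecture)
page <- minimal_html('
  <div>
    <p>Direct child</p>
    <blockquote><p>Nested deeper</p></blockquote>
  </div>
')

# Descendant selector: any <p> anywhere inside a <div>
descendants <- page %>% html_nodes("div p") %>% html_text()
descendants
# [1] "Direct child"  "Nested deeper"

# Child selector: only <p> elements whose parent is a <div>
children <- page %>% html_nodes("div > p") %>% html_text()
children
# [1] "Direct child"
```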
---

# Example: all `em` that are children of `p`

```html
<p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
   <em>rvest</em> is included in the <em>tidyverse</em></p>
```

---

# Example: all `em` that are children of `p`

```html
<p> Basic experience with <a href="www.r-project.org">R</a> and
*   familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
*  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("p > em") %>% html_text()
```

```
## [1] "Tidyverse" "rvest"     "tidyverse"
```

---

# CSS Selector: adjacent sibling selector (`+`)

* The adjacent sibling selector is used to select an element that comes directly after another specific element.
* Sibling elements must have the same parent element, and "adjacent" means "immediately following".
* For example, `div + p` selects every `<p>` element that immediately follows a `<div>` element.
--- # Example: `em` immediately after `p` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` --- # Example: `em` immediately after `p` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` ```r html_page %>% html_nodes("p + em") %>% html_text() ``` ``` ## character(0) ``` No `em` are immediately after `p`. 
--- # Example: `em` immediately after `em` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` --- # Example: `em` immediately after `em` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: * <em>rvest</em> is included in the <em>tidyverse</em></p> ``` ```r html_page %>% html_nodes("em + em") %>% html_text() ``` ``` ## [1] "tidyverse" ``` --- # CSS Selector: general sibling selector (`~`) * The general sibling selector selects all elements that are next siblings of a specified element. * Sibling elements must have the same parent element, and "general" means "any place". * For example, `div ~ p` selects all `<p>` elements that are preceded by a `<div>` element. 
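
---

# Example: `h2 + p` versus `h2 ~ p`

The difference between the adjacent (`+`) and general (`~`) sibling selectors shows up once several siblings follow the reference element. This is a minimal sketch on a made-up page; the `page` object and its content are assumptions for illustration.

```r
library(rvest)

# Hypothetical toy page (assumption, not from the lecture)
page <- minimal_html('
  <div>
    <h2>Heading</h2>
    <p>First</p>
    <p>Second</p>
    <span>Other</span>
    <p>Third</p>
  </div>
')

# Adjacent sibling: only the <p> that immediately follows the <h2>
adjacent <- page %>% html_nodes("h2 + p") %>% html_text()
adjacent
# [1] "First"

# General sibling: every later <p> that shares the parent of the <h2>
general <- page %>% html_nodes("h2 ~ p") %>% html_text()
general
# [1] "First"  "Second" "Third"
```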
--- # Example: `em` next sibling of `a` ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` --- # Example: `em` next sibling of `a` ```html * <p> Basic experience with <a href="www.r-project.org">R</a> and * familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` ```r html_page %>% html_nodes("a ~ em") %>% html_text() ``` ``` ## [1] "Tidyverse" ``` (Here, we would have obtained the same result with `a + em`) --- # CSS Selector: first child selector (`:first-child`) * `:first-child` selects the specified element that is the first child of another element. * For example, `p:first-child` selects all `<p>` elements that are the first child of any other element. 
--- # Example: all `li` that are first children ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` --- # Example: all `li` that are first children ```html <p> Basic experience with <a href="www.r-project.org">R</a> and familiarity with the <em>Tidyverse</em> is recommended.</p> <h2>Technologies</h2> <ol> * <li>HTML: <em>Hypertext Markup Language</em></li> <li>CSS: <em>Cascading Style Sheets</em></li> </ol> <h2>Packages</h2> <ul> * <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a> </ul> <p><strong>Note</strong>: <em>rvest</em> is included in the <em>tidyverse</em></p> ``` ```r html_page %>% html_nodes("li:first-child") %>% html_text() ``` ``` ## [1] "HTML: Hypertext Markup Language" "rvest" ``` --- # CSS Selector: nth child selector (`:nth-child(n)`) * Remark: `:last-child` is completely symmetric to `:first-child`. * `:nth-child(n)` selects the specified element that is the nth child of another element. * For example, `p:nth-child(2)` selects all `<p>` elements that are the second child of any other element. 
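
---

# Example: `:first-child`, `:nth-child(2)` and `:last-child`

The three positional selectors can be compared side by side on a short list. This is a minimal sketch; the `page` object and its three items are made up for illustration.

```r
library(rvest)

# Hypothetical toy page (assumption, not from the lecture)
page <- minimal_html('
  <ol>
    <li>one</li>
    <li>two</li>
    <li>three</li>
  </ol>
')

first  <- page %>% html_nodes("li:first-child")  %>% html_text()
second <- page %>% html_nodes("li:nth-child(2)") %>% html_text()
last   <- page %>% html_nodes("li:last-child")   %>% html_text()

c(first, second, last)
# [1] "one"   "two"   "three"
```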
---

# Example: all `li` that are second children

```html
<p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
   <em>rvest</em> is included in the <em>tidyverse</em></p>
```

---

# Example: all `li` that are second children

```html
<p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
*  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
   <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("li:nth-child(2)") %>% html_text()
```

```
## [1] "CSS: Cascading Style Sheets"
```

---

# HTML attributes

* All HTML elements can have attributes, which provide additional information about the element.
* Attributes are always specified in the start tag, usually in the format `name="value"`.
* For example, in `<a href="www.r-project.org">R</a>`, `href` is an attribute of `a` that specifies a URL.
* Attributes can be accessed with the `html_attr()` function.
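
---

# Example: missing attributes and `html_attrs()`

Attribute extraction can be sketched on a made-up page (the `page` object and its two links are assumptions for illustration). It shows what `html_attr()` returns when an element lacks the requested attribute, and how `html_attrs()` lists all attributes of one node.

```r
library(rvest)

# Hypothetical toy page (assumption, not from the lecture)
page <- minimal_html('
  <a href="https://www.r-project.org" title="R homepage">R</a>
  <a>No link here</a>
')

links <- page %>% html_nodes("a")

# One value per node; NA when the attribute is absent
hrefs <- links %>% html_attr("href")
hrefs
# [1] "https://www.r-project.org" NA

# All attributes of the first link, as a named character vector
all_attrs <- links[[1]] %>% html_attrs()
names(all_attrs)
# [1] "href"  "title"
```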
---

# Example: `href` attributes

```html
<p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
   <em>rvest</em> is included in the <em>tidyverse</em></p>
```

---

# Example: `href` attributes

```html
* <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
  <li>HTML: <em>Hypertext Markup Language</em></li>
  <li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
*  <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
   <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("a") %>% html_attr("href")
```

```
## [1] "www.r-project.org"                  "https://github.com/tidyverse/rvest"
```

---

# HTML tables

tag | meaning
--- | ---
table | Table section
tr | Table row
td | Table cell
th | Table header

* Tables can be fetched with the `html_table()` function.

---

```r
basic_table <- minimal_html('
<body>
  <table>
    <tr>
      <th>Month</th>
      <th>Savings</th>
    </tr>
    <tr>
      <td>January</td>
      <td>$100</td>
    </tr>
    <tr>
      <td>February</td>
      <td>$80</td>
    </tr>
  </table>
</body>
')
```

```r
basic_table %>% html_table()
```

```
## [[1]]
## # A tibble: 2 × 2
##   Month    Savings
##   <chr>    <chr>
## 1 January  $100
## 2 February $80
```

---

# Example: Wikipedia table

* We would like to fetch the table of qualified teams for the 2023 Rugby World Cup from Wikipedia.
* A first solution: fetch all tables and select the correct one.
```r url <- "https://en.wikipedia.org/wiki/2023_Rugby_World_Cup" url %>% read_html() %>% html_table() %>% .[[5]] %>% kableExtra::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Team </th> <th style="text-align:left;"> Qualificationmethod </th> <th style="text-align:right;"> Previous.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}apps </th> <th style="text-align:left;"> Previous best result </th> <th style="text-align:right;"> World Rank¹ </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> South Africa </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> Champions (1995, 2007, 2019) </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> Namibia </td> <td style="text-align:left;"> Africa 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Pool stage (six times) </td> <td style="text-align:right;"> 21 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> Japan </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (2019) </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> France </td> <td style="text-align:left;"> Hosts </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Runners-up (1987, 1999, 2011) </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> England </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (2003) </td> <td style="text-align:right;"> 8 </td> </tr> 
<tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (seven times) </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Italy </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Pool stage (nine times) </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Scotland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Fourth place (1991) </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Wales </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (1987) </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Georgia </td> <td style="text-align:left;"> Europe 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Pool stage (five times) </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Romania </td> <td style="text-align:left;"> Europe 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 19 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Portugal </td> <td style="text-align:left;"> Final Qualifier </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Pool stage (2007) </td> <td 
style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1991, 1999) </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Fiji </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1987, 2007) </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> New Zealand </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1987, 2011, 2015) </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Samoa </td> <td style="text-align:left;"> Oceania 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1991, 1995) </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Tonga </td> <td style="text-align:left;"> Asia/Pacific 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (2007) </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Uruguay </td> <td style="text-align:left;"> Americas 
1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Pool stage (1999, 2003, 2015, 2019) </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Chile </td> <td style="text-align:left;"> Americas 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Debut </td> <td style="text-align:right;"> 22 </td> </tr> </tbody> </table> --- # Example: Wikipedia table * Inspect the HTML with the developer tools. <img src="images/wikitable_rugby.png" width="2485" /> --- # Example: Wikipedia table * A better solution using CSS selectors: using the class selector (`.`). * Select `class="wikitable"`. ```r url <- "https://en.wikipedia.org/wiki/2023_Rugby_World_Cup" url %>% read_html() %>% html_nodes(".wikitable") %>% html_table() %>% .[[3]] %>% kableExtra::kable() # equivalently html_nodes("table.wikitable") ``` <table> <thead> <tr> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Team </th> <th style="text-align:left;"> Qualificationmethod </th> <th style="text-align:right;"> Previous.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}apps </th> <th style="text-align:left;"> Previous best result </th> <th style="text-align:right;"> World Rank¹ </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> South Africa </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> Champions (1995, 2007, 2019) </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> Namibia </td> <td style="text-align:left;"> Africa 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Pool stage (six times) </td> <td style="text-align:right;"> 21 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> 
<td style="text-align:left;"> Japan </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (2019) </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> France </td> <td style="text-align:left;"> Hosts </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Runners-up (1987, 1999, 2011) </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> England </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (2003) </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (seven times) </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Italy </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Pool stage (nine times) </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Scotland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Fourth place (1991) </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Wales </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (1987) </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td 
style="text-align:left;"> Europe </td> <td style="text-align:left;"> Georgia </td> <td style="text-align:left;"> Europe 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Pool stage (five times) </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Romania </td> <td style="text-align:left;"> Europe 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 19 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Portugal </td> <td style="text-align:left;"> Final Qualifier </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Pool stage (2007) </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1991, 1999) </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Fiji </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1987, 2007) </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> New Zealand </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1987, 2011, 2015) </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Samoa </td> <td style="text-align:left;"> Oceania 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1991, 1995) </td> <td 
style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Tonga </td> <td style="text-align:left;"> Asia/Pacific 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (2007) </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Uruguay </td> <td style="text-align:left;"> Americas 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Pool stage (1999, 2003, 2015, 2019) </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Chile </td> <td style="text-align:left;"> Americas 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Debut </td> <td style="text-align:right;"> 22 </td> </tr> </tbody> </table> --- # Example: Wikipedia table * A better solution using CSS selectors: using the class selector (`.`). * Select `class="wikitable sortable"`. 
```r url <- "https://en.wikipedia.org/wiki/2023_Rugby_World_Cup" url %>% read_html() %>% html_nodes(".wikitable.sortable") %>% html_table() %>% kableExtra::kable() # equivalently html_nodes("table.wikitable.sortable") ``` <table class="kable_wrapper"> <tbody> <tr> <td> <table> <thead> <tr> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Team </th> <th style="text-align:left;"> Qualificationmethod </th> <th style="text-align:right;"> Previous.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}apps </th> <th style="text-align:left;"> Previous best result </th> <th style="text-align:right;"> World Rank¹ </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> South Africa </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> Champions (1995, 2007, 2019) </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> Namibia </td> <td style="text-align:left;"> Africa 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Pool stage (six times) </td> <td style="text-align:right;"> 21 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> Japan </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (2019) </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> France </td> <td style="text-align:left;"> Hosts </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Runners-up (1987, 1999, 2011) </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> England </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td 
style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (2003) </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (seven times) </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Italy </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Pool stage (nine times) </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Scotland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Fourth place (1991) </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Wales </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (1987) </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Georgia </td> <td style="text-align:left;"> Europe 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Pool stage (five times) </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Romania </td> <td style="text-align:left;"> Europe 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 19 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Portugal </td> <td style="text-align:left;"> 
Final Qualifier </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Pool stage (2007) </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1991, 1999) </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Fiji </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1987, 2007) </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> New Zealand </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1987, 2011, 2015) </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Samoa </td> <td style="text-align:left;"> Oceania 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1991, 1995) </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Tonga </td> <td style="text-align:left;"> Asia/Pacific 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (2007) </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> South 
America and North America Rugby </td> <td style="text-align:left;"> Uruguay </td> <td style="text-align:left;"> Americas 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Pool stage (1999, 2003, 2015, 2019) </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Chile </td> <td style="text-align:left;"> Americas 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Debut </td> <td style="text-align:right;"> 22 </td> </tr> </tbody> </table> </td> </tr> </tbody> </table> --- # Example: Wikipedia table * An alternative solution: select `table` immediately after four `p`. ```r url <- "https://en.wikipedia.org/wiki/2023_Rugby_World_Cup" url %>% read_html() %>% html_nodes("p + p + p + p + table") %>% html_table() %>% kableExtra::kable() ``` <table class="kable_wrapper"> <tbody> <tr> <td> <table> <thead> <tr> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Team </th> <th style="text-align:left;"> Qualificationmethod </th> <th style="text-align:right;"> Previous.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}apps </th> <th style="text-align:left;"> Previous best result </th> <th style="text-align:right;"> World Rank¹ </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> South Africa </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> Champions (1995, 2007, 2019) </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> Namibia </td> <td style="text-align:left;"> Africa 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Pool stage (six times) </td> <td style="text-align:right;"> 21 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td 
style="text-align:left;"> Japan </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (2019) </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> France </td> <td style="text-align:left;"> Hosts </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Runners-up (1987, 1999, 2011) </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> England </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (2003) </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Quarter-finals (seven times) </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Italy </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Pool stage (nine times) </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Scotland </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Fourth place (1991) </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Wales </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (1987) </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td 
style="text-align:left;"> Europe </td> <td style="text-align:left;"> Georgia </td> <td style="text-align:left;"> Europe 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Pool stage (five times) </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Romania </td> <td style="text-align:left;"> Europe 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 19 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Portugal </td> <td style="text-align:left;"> Final Qualifier </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Pool stage (2007) </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1991, 1999) </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Fiji </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1987, 2007) </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> New Zealand </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Champions (1987, 2011, 2015) </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Samoa </td> <td style="text-align:left;"> Oceania 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Quarter-finals (1991, 1995) </td> <td 
style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Tonga </td> <td style="text-align:left;"> Asia/Pacific 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Pool stage (eight times) </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Top 3 in 2019 RWC pool </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Third place (2007) </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Uruguay </td> <td style="text-align:left;"> Americas 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Pool stage (1999, 2003, 2015, 2019) </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> South America and North America Rugby </td> <td style="text-align:left;"> Chile </td> <td style="text-align:left;"> Americas 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> Debut </td> <td style="text-align:right;"> 22 </td> </tr> </tbody> </table> </td> </tr> </tbody> </table>

---

# Why could web scraping be bad?

* Scraping increases web traffic.
* People ignore and violate `robots.txt` and the Terms of Service (ToS) of websites.
* You should avoid these troubles by following these simple rules:

1. Read the ToS of the website you want to scrape.
2. Inspect `robots.txt` (see <https://cran.r-project.org/robots.txt> for instance).
3. Use a reasonable frequency of requests (force your program to pause between requests).

---

# Dynamic sites (advanced)

* Sometimes, what you see in your browser is not what is returned by `read_html()`. In many cases, this is due to websites that employ methods for dynamic data requests.
* A solution is to simulate a browser to cope with dynamically rendered webpages.
* _Selenium_ offers a solution. It is a project focused on automating web browsers.
* You have access to Selenium with the `RSelenium` package.
* An alternative is the `chromote` package (developed by Posit) that focuses on the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

---

# World bank data

<iframe src="https://data.worldbank.org/indicator/SP.ADO.TFRT" width="100%" height="400px" data-external="1"></iframe>

---

# World bank data

* Inspecting the "table".

<img src="images/worlddata.png" width="600px" style="display: block; margin: auto;" />

---

# World bank data

* Trying to fetch the data _non-dynamically_ using `class="item"`.

```r
url <- "https://data.worldbank.org/indicator/SP.ADO.TFRT"
url %>%
  read_html() %>%
  html_nodes(".item") %>%
  html_text()
# or html_nodes("div.item")
```

```
## [1] "CountryMost Recent YearMost Recent Value"
```

* Only the header is returned.

---

# World bank data

* A first dynamic solution with the `chromote` package.

```r
library(chromote)
b <- ChromoteSession$new() # open a chromote session
url <- "https://data.worldbank.org/indicator/SP.ADO.TFRT"
b$Page$navigate(url) # navigate to the url
b$Runtime$evaluate("document.querySelector('html').outerHTML")$result$value %>%
  read_html() %>%
  html_nodes(".item") %>%
  html_text() %>%
  head()
b$close() # close the session
```

```
## [1] "CountryMost Recent YearMost Recent Value"
## [2] "Afghanistan202183"
## [3] "Albania202115"
## [4] "Algeria202112"
## [5] "American Samoa202130"
## [6] "Andorra20216"
```

---

# World bank data

Some comments on the `chromote` commands:

* `b <- ChromoteSession$new()` creates a new `ChromoteSession` object assigned to `b`.
* `b$Page$navigate(url)` navigates to the provided URL.
* The `Runtime$evaluate` command tells the browser to run JavaScript code.
* The JavaScript code `document.querySelector('html').outerHTML` selects the `<html>` element from the current web page's Document Object Model (DOM), and then retrieves its entire HTML content, including the element itself and everything inside it.
* Essentially, it captures the entire structure of the HTML document, from the opening `<html>` tag to the closing `</html>` tag, as a string.
* Notice that the browser can be viewed using `b$view()`.
* Check the package [site](https://github.com/rstudio/chromote) for more info.

---

# World bank data

* `chromote` is for `Chrome`, `Chromium` and the like. `Selenium` is more general.
* Unfortunately, the solution using `RSelenium` is currently not running on my installation. Here is what a possible implementation would look like.

```r
library(RSelenium)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(url)
html_page <- remDr$getPageSource()[[1]]
html_page %>%
  read_html() %>%
  html_nodes(".item") %>%
  html_text()
```

---
class: sydney-blue, center, middle

# Questions?
.pull-down[ <a href="https://ptds.samorso.ch/"> .white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website] </a> <a href="https://github.com/ptds2023/"> .white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub] </a> ] --- # Exercises 1. 
Play with [CSS Diner](https://flukeout.github.io/) to get familiar with CSS selectors.
2. Follow this [workflow](https://smac-group.github.io/ds/section-web-scraping.html#section-workflow). It uses the _SelectorGadget_. Propose an alternative solution using CSS selectors. You will probably need to use the developer tools of your browser.
3. Repeat exercise 2 using `RSelenium` or `chromote`.
4. Extract the information from the World bank data example using regular expressions.

---

# To go further

* More details and examples in the book [An Introduction to Statistical Programming Methods with R](https://smac-group.github.io/ds/section-web-scraping.html)
* <https://github.com/yusuzech/r-web-scraping-cheat-sheet/>
* Want to build your own R API wrapper? Have a look at <https://colinfay.me/build-api-wrapper-package-r/> and <https://httr2.r-lib.org/articles/wrapping-apis.html>
* [DataCamp](https://www.datacamp.com/courses/web-scraping-in-r) class on webscraping with R
* [Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining](https://www.wiley.com/en-us/Automated+Data+Collection+with+R%3A+A+Practical+Guide+to+Web+Scraping+and+Text+Mining-p-9781118834817)
* See also the chapters on [webscraping](https://r4ds.hadley.nz/webscraping) and [regular expressions](https://r4ds.hadley.nz/regexps) of R for Data Science.
* W3Schools for [HTML](https://www.w3schools.com/html/default.asp) and [CSS](https://www.w3schools.com/css/default.asp).
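
---

# Appendix: pausing between requests

The rule about using a reasonable frequency of requests can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `polite_map()` is a hypothetical helper name (not part of `rvest` or any package), and the default one-second pause is an arbitrary choice that should be adapted to the website's ToS and `robots.txt`.

```r
# Hypothetical helper (not from any package): apply `fetch` to each URL
# in turn and sleep `pause` seconds between consecutive requests.
polite_map <- function(urls, fetch, pause = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(pause) # wait before every request except the first
    results[[i]] <- fetch(urls[[i]])
  }
  names(results) <- urls
  results
}

# Possible usage with rvest (not run here; the URLs are placeholders):
# pages <- polite_map(c(url_1, url_2), rvest::read_html, pause = 2)
```

Passing the fetching function as an argument keeps the pause logic separate from the scraping logic, so the same helper could wrap `read_html()` or any other request function.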