Knowledge Sources - Configuring Web Crawling

Web crawling for knowledge base (KB) ingestion involves using automated programs, known as crawlers or spiders, to systematically browse and collect information from websites. This data is then processed and indexed for use in knowledge-based systems, enabling efficient retrieval and understanding of relevant content.

 

Creating a Web Crawl Knowledge Source

  1. Go to Administration in the top navigation.
  2. Under the AI Knowledge Sets heading, select Knowledge Sets.
  3. Select the knowledge set that you want to add a web crawl source to.
  4. In the left navigation menu, select Sources.
  5. In the upper right, select the Add button, which opens the Create Knowledge Set Source dialog.
  6. In the dialog, give the source a Name and Description, and select a Type of Web Crawl.
  7. In the section titled URLs, select the Add button or the Clipboard button to add the URLs to crawl.
    1. NOTE: You can add a maximum of 50 URLs.

Configuration Options for Web Crawling

Max Requests

Max Requests specifies the maximum number of page requests the web crawl will make.

Examples

If one URL is entered, the crawl fetches that first page and then follows any links found within it, continuing down the tree of linked pages. Each page fetched counts as a request: if a page links to 3 sub-pages, crawling those sub-pages adds 3 requests.
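As a sketch, the budget behaves like a breadth-first crawl that stops once the request count is spent. The `fetch_links` helper below is a hypothetical stand-in for the real page fetcher, used only to illustrate how requests are counted:

```python
from collections import deque

def crawl(start_url, max_requests, fetch_links):
    """Breadth-first crawl that stops after max_requests page fetches.

    fetch_links(url) is a stand-in for the real fetcher: it returns
    the links found on the page at `url`.
    """
    queue = deque([start_url])
    visited = set()
    requests_made = 0
    while queue and requests_made < max_requests:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        requests_made += 1          # each page fetch counts as one request
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return visited

# Example: a main page linking to 3 sub-pages, but a budget of 2 requests.
site = {
    "https://example.com": ["https://example.com/a",
                            "https://example.com/b",
                            "https://example.com/c"],
    "https://example.com/a": [],
    "https://example.com/b": [],
    "https://example.com/c": [],
}
crawled = crawl("https://example.com", 2, lambda u: site.get(u, []))
print(len(crawled))  # 2 — the main page plus one sub-page
```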

 

Crawl Strategy

Crawl strategy controls which linked pages the crawler will actually follow, relative to the starting URL. You have the following options:

  1. Same Domain: The crawler restricts its activity to URLs within the same domain as the starting URL. For example, if the starting URL is https://www.example.com, the crawler will only visit other URLs that end with example.com, such as https://blog.example.com or https://www.example.com/about. This approach is useful when the goal is to index or analyze all content within a specific domain.
  2. Same Hostname: In this case, the crawler focuses on URLs that share the same hostname as the starting URL. For instance, if the starting URL is https://www.example.com, the crawler will only visit URLs that begin with https://www.example.com, such as https://www.example.com/about or https://www.example.com/contact. This strategy is beneficial when the interest is in a specific subdomain or section of a website.
  3. Same Origin: Here, the crawler limits its scope to URLs that have the same origin as the starting URL. The origin includes the protocol, hostname, and port. For example, if the starting URL is https://www.example.com:8080, the crawler will only visit URLs that match https://www.example.com:8080, such as https://www.example.com:8080/page1. This method is useful when dealing with web applications that have specific origin constraints.
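The three strategies can be illustrated with Python's urllib.parse. This is a simplified sketch: real crawlers determine the registrable domain with a public-suffix list rather than by comparing the last two host labels as done here:

```python
from urllib.parse import urlparse

def same_domain(start, candidate):
    """Same registrable domain, e.g. blog.example.com matches www.example.com.
    (Simplified: compares the last two host labels; real crawlers use a
    public-suffix list.)"""
    d1 = urlparse(start).hostname.split(".")[-2:]
    d2 = urlparse(candidate).hostname.split(".")[-2:]
    return d1 == d2

def same_hostname(start, candidate):
    """Exact hostname match, so other subdomains are excluded."""
    return urlparse(start).hostname == urlparse(candidate).hostname

def same_origin(start, candidate):
    """Scheme + hostname + port must all match."""
    s, c = urlparse(start), urlparse(candidate)
    return (s.scheme, s.hostname, s.port) == (c.scheme, c.hostname, c.port)

start = "https://www.example.com:8080"
print(same_domain(start, "https://blog.example.com/post"))    # True
print(same_hostname(start, "https://blog.example.com/post"))  # False
print(same_origin(start, "https://www.example.com/page1"))    # False (port differs)
```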

 

Keywords Meta Tag

Keywords Meta Tag specifies which HTML meta tag each web page is inspected for to populate the Keywords metadata field when the content record is added to the vector database.
Keywords can then be used as a filter field for the RAG search.

Examples

Given the following Work Management KB article Meta Tags

[Image: meta tags from a Work Management KB article]

You could input article:section as the keywords meta tag, and the value "iPaaS Authentication Methods" would then be placed in the keywords for the source.
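A minimal sketch of that extraction using Python's html.parser, assuming the tag is matched on its name attribute; the markup below is hypothetical, modeled on the example above:

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collects the content of <meta name="..."> tags matching target_name."""
    def __init__(self, target_name):
        super().__init__()
        self.target_name = target_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") == self.target_name:
                self.values.append(attrs.get("content", ""))

# Hypothetical page markup resembling the example above:
html = '''<head>
  <meta name="article:section" content="iPaaS Authentication Methods">
  <meta name="description" content="How to authenticate to the iPaaS.">
</head>'''

parser = MetaTagExtractor("article:section")
parser.feed(html)
print(parser.values)  # ['iPaaS Authentication Methods']
```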

 

Update Since

When updating this source, either due to a scheduled update or manual update request, only content that has changed within the specified period of time will be re-ingested into the knowledge set.

This check uses the Last-Modified HTTP response header returned when the request for the page content is made.
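A sketch of that check, assuming the update window is expressed as a timedelta and the header is parsed with the standard library (the function name is illustrative, not the product's API):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def changed_since(last_modified_header, period):
    """Return True when the page's Last-Modified timestamp falls inside
    the update window and the page should therefore be re-ingested.

    last_modified_header: the raw Last-Modified response header, or None.
    period: a timedelta, e.g. timedelta(days=7) for "updated in the last week".
    """
    if last_modified_header is None:
        return True  # no header — re-ingest to be safe
    modified = parsedate_to_datetime(last_modified_header)
    cutoff = datetime.now(timezone.utc) - period
    return modified >= cutoff

# A page last modified two days ago is re-ingested for a 7-day window,
# but skipped for a 1-day window:
recent = (datetime.now(timezone.utc) - timedelta(days=2)).strftime(
    "%a, %d %b %Y %H:%M:%S GMT")
print(changed_since(recent, timedelta(days=7)))  # True
print(changed_since(recent, timedelta(days=1)))  # False
```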

 

Advanced Options

To fine-tune the web crawling, you can configure advanced options:

  1. URL Filters: Use filters to control which pages are indexed. Filters should use "Glob" syntax, allowing you to specify patterns. Only pages matching these patterns will be crawled.

    • Example: You can include or exclude URLs based on specific file types (e.g., *.html, *.php) or paths (e.g., https://www.example.com/docs/*).
  2. Include/Exclude Selectors: Specify HTML selectors to include or exclude specific parts of the webpage content during crawling. This can be helpful to avoid indexing irrelevant sections or to focus on critical information.

  3. Login Information: If the content requires authentication, input login credentials so the crawler can access restricted areas of the site.

    1. Login URL - The URL of the login page where the user will be signed in

    2. Login Credentials - The username/password used to log the user in

    3. Username Selector - The CSS selector that references the username input field

    4. Password Selector - The CSS selector that references the password input field

    5. Button Selector - The CSS selector that references the login button

  4. Chunk Type: Define how the content should be chunked or segmented for ingestion. This helps in dividing large documents into smaller, manageable sections for better indexing and retrieval.
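The include/exclude URL-filter logic from option 1 can be approximated with Python's fnmatch module. The crawler's exact glob dialect may differ; the patterns below reuse the examples from above:

```python
from fnmatch import fnmatch

include_patterns = ["https://www.example.com/docs/*"]
exclude_patterns = ["*.php"]

def should_crawl(url):
    """A URL is crawled when it matches an include pattern
    and matches no exclude pattern."""
    included = any(fnmatch(url, p) for p in include_patterns)
    excluded = any(fnmatch(url, p) for p in exclude_patterns)
    return included and not excluded

print(should_crawl("https://www.example.com/docs/setup.html"))  # True
print(should_crawl("https://www.example.com/docs/old.php"))     # False (excluded type)
print(should_crawl("https://www.example.com/blog/post.html"))   # False (not included)
```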

 

 
