Skip to main content

Index a full website

This guide will help you index a website into your Nuclia knowledgebox.

Prerequisites

  1. API Key: Obtain a contributor or writer API key as detailed here.

  2. Python 3: Ensure Python 3 is installed on your system:

    python --version
  3. Nuclia SDK: Install the Nuclia package in your environment::

    pip install nuclia

Run the script

Use the following Python script to upload a website to your Nuclia knowledgebox:

import sys

from nuclia import sdk

# Nuclia knowledge box URL and API key
KNOWLEDGE_BOX_URL = "https://<zone>.nuclia.cloud/api/v1/kb/<your-knowledge-box-id>"
API_KEY = "<your-api-key>"

# Authenticate with the Nuclia SDK
sdk.NucliaAuth().kb(url=KNOWLEDGE_BOX_URL, token=API_KEY)


def upload_website(url: str):
"""
Uploads a website

Args:
url (str): The URL of the website that you want to upload.
"""
# We filter the website to only select the body
sdk.NucliaUpload().link(uri=url, css_selector="body")


if __name__ == "__main__":
# Ensure an URL is provided
if len(sys.argv) != 2:
print("Usage: python script.py <url>")
sys.exit(1)

# Get the root folder path from command line arguments
url = sys.argv[1]

# Upload files from the specified folder
upload_website(url)

To execute the script and start uploading a website, run the following command:

python3 script.py url

Replace url with the wbesite you want to upload, for example:

python3 script.py https://docs.nuclia.dev/docs/

For more information on the Nuclia Python SDK, see the Nuclia Python SDK documentation.

note

When indexing a web page, avoid indexing the header, footer, and other irrelevant parts of the page. Use the css_selector parameter in the SDK to select only the relevant content.

tip

Consider using the Nuclia Sync Agent to process a full sitemap (a sitemap.xml file containing all the pages of a given site)