Index a full website
This guide will help you index a website into your Nuclia knowledgebox.
Prerequisites
-
API Key: Obtain a contributor or writer API key as detailed here.
-
Python 3: Ensure Python 3 is installed on your system:
python --version
-
Nuclia SDK: Install the Nuclia package in your environment::
pip install nuclia
Run the script
Use the following Python script to upload a website to your Nuclia knowledgebox:
import sys
from nuclia import sdk
# Nuclia knowledge box URL and API key
KNOWLEDGE_BOX_URL = "https://<zone>.nuclia.cloud/api/v1/kb/<your-knowledge-box-id>"
API_KEY = "<your-api-key>"
# Authenticate with the Nuclia SDK
sdk.NucliaAuth().kb(url=KNOWLEDGE_BOX_URL, token=API_KEY)
def upload_website(url: str):
"""
Uploads a website
Args:
url (str): The URL of the website that you want to upload.
"""
# We filter the website to only select the body
sdk.NucliaUpload().link(uri=url, css_selector="body")
if __name__ == "__main__":
# Ensure an URL is provided
if len(sys.argv) != 2:
print("Usage: python script.py <url>")
sys.exit(1)
# Get the root folder path from command line arguments
url = sys.argv[1]
# Upload files from the specified folder
upload_website(url)
To execute the script and start uploading a website, run the following command:
python3 script.py url
Replace url
with the wbesite you want to upload, for example:
python3 script.py https://docs.nuclia.dev/docs/
For more information on the Nuclia Python SDK, see the Nuclia Python SDK documentation.
When indexing a web page, avoid indexing the header, footer, and other irrelevant parts of the page. Use the css_selector parameter in the SDK to select only the relevant content.
Consider using the Nuclia Sync Agent to process a full sitemap (a sitemap.xml file containing all the pages of a given site)