Deploying NucliaDB on GCP

Welcome to the step-by-step guide on deploying a NucliaDB cluster with Kubernetes on the Google Cloud Platform (GCP). We will walk you through the process of setting up a 2-node NucliaDB cluster, utilizing PostgreSQL (PG) as the key-value (KV) storage and Google Cloud Storage (GCS) for blob data management.

By the end of this tutorial, you will be able to leverage the full potential of Nuclia's processing engine and AI while retaining complete ownership of your data.

Important Considerations

Warning: The following tutorial is intended for demo purposes only. The resulting installation is not production-ready, and it is crucial to exercise caution while following the steps outlined here. Always refer to the official documentation and best practices for production deployments.

  • Security: The configurations in this tutorial may not adhere to the strictest security practices. It is essential to review and enhance security measures before considering a production deployment.

  • Performance: The resource allocations and settings are optimized for a small-scale environment. Adjustments may be required based on workload expectations in a production setting.

Prerequisites

  • Basic understanding of the components involved, including Kubernetes, PostgreSQL, and Google Cloud Storage.
  • gcloud: to create and manage Google Cloud resources.
  • kubectl: to perform operations on the Kubernetes cluster.
  • Helm: to install NucliaDB in a Kubernetes cluster.
  • NUA key: to configure the NucliaDB cluster with your Nuclia account.
  • A GCP project with a billing account and a payment profile configured.
  • jq: for JSON manipulation.

Bootstrap a Kubernetes cluster

First off, configure the gcloud command to point to the project where you want to install NucliaDB.

PROJECT_ID="nucliadb-onprem-tutorial"

# Make sure gcloud is pointing to the right account and is authenticated
gcloud auth login
# Configure the project id
gcloud config set project $PROJECT_ID

After that, make sure the Kubernetes Engine API (container.googleapis.com) is enabled on your project. You can enable it via the command line with:

gcloud services enable container.googleapis.com

or by visiting the Google Cloud Console web UI: click on "Overview" and then on "Enable". This will require you to set up a billing account and a payment profile for your project if you don't already have them configured.

The following command will create a GKE Kubernetes cluster in Autopilot mode in a specific region. Check out the GKE documentation if you are not familiar with Autopilot mode.

CLUSTER_NAME="my-cluster"
CLUSTER_REGION="us-central1"

gcloud beta container clusters create-auto $CLUSTER_NAME \
  --release-channel "regular" \
  --network "projects/$PROJECT_ID/global/networks/default" \
  --subnetwork "projects/$PROJECT_ID/regions/$CLUSTER_REGION/subnetworks/default" \
  --cluster-ipv4-cidr "/17" \
  --binauthz-evaluation-mode=DISABLED \
  --region $CLUSTER_REGION

Then, use gcloud to fetch the cluster credentials so that we can connect to the cluster via kubectl:

gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION

This will create and activate a new kubectl context. At this point, you should be able to see the default namespaces on the fresh Kubernetes cluster with kubectl get ns.
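
As a quick sanity check, you can confirm that the new context is active and list the namespaces (the exact set of default namespaces may vary with your GKE version):

# Verify the active kubectl context points to the new cluster
kubectl config current-context

# List the namespaces created by default on the fresh cluster
kubectl get ns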

PostgreSQL

Now we are going to create a PostgreSQL instance with a private IP, connected to the default VPC network in which the GKE Autopilot cluster is running. Alternatively, you could simply deploy PostgreSQL inside the cluster via the recommended PostgreSQL Helm chart, as sketched below.
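
If you prefer the in-cluster route, a minimal sketch using the Bitnami PostgreSQL Helm chart could look like the following (the repository URL and the auth.* parameters are assumptions based on that chart's documented values; adapt them to the chart version you install):

# Hypothetical in-cluster alternative: install PostgreSQL with the Bitnami chart
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install pg bitnami/postgresql \
  --namespace postgres --create-namespace \
  --set auth.username=nucliadb \
  --set auth.password=nucliadb \
  --set auth.database=nucliadb

In that case, the DSN configured later would point to the in-cluster PostgreSQL service name instead of a private IP.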

PG_INSTANCE=pg
PG_CPUS=2
PG_MEMORY_GB=4
PG_STORAGE_SIZE_GB=10
PG_ROOT_PASSWORD="nucliadb"
PG_DATABASE="nucliadb"
PG_USER_NAME="nucliadb"
PG_USER_PASSWORD="nucliadb"

# Enable SQL admin API on the project
gcloud -q services enable sqladmin.googleapis.com

# Create a private connection to the service networking API
gcloud -q compute addresses create google-managed-services-default --global --purpose=VPC_PEERING --network=default --prefix-length 12

# Enable service networking
gcloud services enable servicenetworking.googleapis.com

# Add peering with Google's managed services
gcloud -q services vpc-peerings connect --service=servicenetworking.googleapis.com --ranges=google-managed-services-default --network=default

# Create a new sql instance with PostgreSQL 15 installed and with only private IP
gcloud -q sql instances create $PG_INSTANCE --project $PROJECT_ID --database-version=POSTGRES_15 --cpu=$PG_CPUS --memory=${PG_MEMORY_GB}GiB --region=$CLUSTER_REGION --root-password=$PG_ROOT_PASSWORD --storage-size=${PG_STORAGE_SIZE_GB}GB --storage-type=SSD --network=default --no-assign-ip

# Add a new user and create a database
gcloud -q sql databases create $PG_DATABASE --instance=$PG_INSTANCE
gcloud -q sql users create $PG_USER_NAME --instance=$PG_INSTANCE --password=$PG_USER_PASSWORD

# Get the ip of the new instance and print the pg dsn
PG_INSTANCE_IP=$(gcloud -q sql instances describe $PG_INSTANCE | grep ipAddress: | awk '{print $3}')
echo "PG instance DSN: postgres://${PG_USER_NAME}:${PG_USER_PASSWORD}@${PG_INSTANCE_IP}:5432/${PG_DATABASE}?sslmode=disable"

The output DSN will later be configured in the Helm chart values file, in the driver_pg_url setting.
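
Optionally, you can verify connectivity from inside the Kubernetes cluster before moving on by running a throwaway PostgreSQL client pod (the postgres:15 image tag is an assumption; any image that ships psql will do):

# Run a temporary pod with psql and connect to the new instance (type \q to exit)
kubectl run pg-client --rm -it --restart=Never --image=postgres:15 -- \
  psql "postgres://${PG_USER_NAME}:${PG_USER_PASSWORD}@${PG_INSTANCE_IP}:5432/${PG_DATABASE}?sslmode=disable"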

Google Cloud Storage

For NucliaDB to be able to use GCS, it requires a service account key with the Storage Admin role, which can be created with the following steps:

SERVICE_ACCOUNT_NAME="nucliadb-service-account"

# Create a service account
gcloud -q iam service-accounts create $SERVICE_ACCOUNT_NAME --display-name $SERVICE_ACCOUNT_NAME --project $PROJECT_ID
# Grant required roles
gcloud -q projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.admin
# Create a new JSON key
gcloud -q iam service-accounts keys create $HOME/key.json --iam-account ${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com --project $PROJECT_ID
# Base64-encode it (on Linux, add -w 0 to avoid line wrapping in the output)
ENCODED_CREDS=$(cat $HOME/key.json | base64)
# Remove json file
rm $HOME/key.json
echo "GCS base64 encoded credentials: ${ENCODED_CREDS}"

The resulting encoded credentials will need to be configured in the values.yaml file of the NucliaDB Helm chart.
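
You can optionally confirm that the encoded string still decodes to a valid key before configuring it (client_email is a standard field of GCP service account key files; on macOS you may need base64 -D instead of -d):

# Decode the credentials and print the service account email as a sanity check
echo $ENCODED_CREDS | base64 -d | jq -r .client_email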

Installing the NucliaDB Helm chart

Create a values.yaml file following this template:

# image settings
imagePullPolicy: IfNotPresent
image: nuclia/nucliadb

replicas: 2

# app settings
env:
  NUCLIA_ZONE: europe-1
  CORS_ORIGINS: '["*"]'
  NUA_API_KEY: <nua-key-here>
  cluster_discovery_mode: kubernetes
  cluster_discovery_kubernetes_namespace: nucliadb
  cluster_discovery_kubernetes_selector: "app.kubernetes.io/name=node"
  # when running on k8s, we want structured logs on stdout
  # so they can be scraped by log exporters
  log_output_type: stdout
  log_format_type: structured
  # Maindb driver settings
  DRIVER: pg
  driver: pg
  driver_pg_url: <pg-dsn-here>
  # File backend settings
  file_backend: gcs
  gcs_base64_creds: <base64-encoded-key-here>
  gcs_bucket: nucliadb_standalone_{kbid}
  gcs_location: europe-west1
  gcs_project: <project-id-here>

envSecrets:
  # - name: NUA_API_KEY
  #   valueFrom:
  #     secretKeyRef:
  #       name: nuclia-api-key
  #       key: api-key

affinity: {}
nodeSelector: {}
tolerations: []
topologySpreadConstraints: []

resources: {}
  # limits:
  #   memory: "2600Mi"
  # requests:
  #   memory: "600Mi"
  #   cpu: 1

storage:
  class: <storage-type-name-here>
  size: 100Gi

service:
  http_port: 8080

Make sure to fill in the right values:

  • env.NUA_API_KEY: follow the steps outlined in the link in the Prerequisites section.
  • env.gcs_project: your GCP project ID.
  • env.driver_pg_url: copy the DSN generated in the PostgreSQL setup step.
  • env.gcs_base64_creds: copy the value generated in the GCS setup step.
  • storage.class: choose a storage class from the list returned by kubectl get storageclasses (see the snippet below).
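
To see which storage classes your GKE Autopilot cluster offers (the available class names depend on the GKE version):

# List the storage classes and note which one is marked as default
kubectl get storageclasses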

We are now ready to install the chart:

NAMESPACE=nucliadb
NUCLIADB_RELEASE=$(curl -s https://api.github.com/repos/nuclia/nucliadb/releases | jq -r '.[0].tag_name')

# Create the namespace
kubectl create ns $NAMESPACE

# Install the chart
helm install nucliadb-standalone \
https://github.com/nuclia/nucliadb/releases/download/$NUCLIADB_RELEASE/nucliadb-chart.tgz \
--namespace $NAMESPACE --values ./values.yaml

note

For better control over deployments, we recommend installing a specific release of NucliaDB. Check out the NucliaDB releases page to see the available releases.
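
For example, you can list the most recent release tags with the same GitHub API call used above and pin NUCLIADB_RELEASE to one of them:

# List the latest release tags and pin the variable to a specific one
curl -s https://api.github.com/repos/nuclia/nucliadb/releases | jq -r '.[].tag_name' | head -n 5
NUCLIADB_RELEASE="<specific-tag-here>"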

You can then check the status of the workloads with:

kubectl get pods -n $NAMESPACE

It may take a few minutes until the NucliaDB pods are ready and running, as the GKE Autopilot cluster needs some time to provision the required resources, after which the NucliaDB nodes need to auto-discover each other and form a cluster.
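
While you wait, you can watch the pods come up and tail their logs (app=nucliadb is the label selector also used by the port-forward command further below):

# Watch the pods until they reach the Running state (Ctrl+C to stop)
kubectl get pods -n $NAMESPACE -w

# Tail the most recent log lines from the NucliaDB pods
kubectl logs -n $NAMESPACE --selector="app=nucliadb" --tail=50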

note

The GKE Autopilot cluster may schedule both NucliaDB pods on the same machine. This is fine for demo purposes, but make sure that each pod is assigned to a different GKE node in a production setup. You can do so by adjusting the requested resources in values.yaml so that only one pod fits on the chosen machine type.

Checking the installation

If the NucliaDB pods are up and running, you should now be able to access the Admin UI by port forwarding the service to your local machine:

kubectl port-forward --namespace $NAMESPACE $(kubectl get pod --namespace $NAMESPACE --selector="app=nucliadb" --output jsonpath='{.items[0].metadata.name}') 8080:8080

And then visiting http://localhost:8080/admin with your browser.
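
If you prefer a quick check from the terminal, you can confirm that the admin endpoint answers while the port-forward is running (the exact status code may vary between NucliaDB versions, but a 2xx or 3xx response means the service is up):

# Check that the admin UI responds on the forwarded port
curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8080/admin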

You can now push some data to Nuclia's processing engine, and after a few minutes you will be able to get generative answers about it. Note that all the extracted metadata will be stored in the PG instance of the cluster, and the binary files will be stored in the GCS buckets of the project.
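
As a starting point, you can create a Knowledge Box through the local API; the endpoint and the X-NUCLIADB-ROLES header below follow NucliaDB's HTTP API, but double-check the API reference of your release for the exact payload:

# Create a Knowledge Box named "demo" through the standalone API (payload fields are illustrative)
curl -X POST http://localhost:8080/api/v1/kbs \
  -H "X-NUCLIADB-ROLES: MANAGER" \
  -H "Content-Type: application/json" \
  -d '{"slug": "demo", "title": "Demo KB"}'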

You can also use the GCP logging system to inspect the application logs. NucliaDB is configured to output structured logs to stdout by default, which are scraped by Google and indexed automatically.
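
For example, you can query the container logs from the command line (the resource labels below are the standard ones Cloud Logging attaches to GKE containers; adjust the filter if you used a different namespace):

# Read the most recent NucliaDB log entries from Cloud Logging
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="nucliadb"' \
  --project $PROJECT_ID --limit 20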