Managing DKube

DKube can be managed from the $HOME/.dkube directory.

Upgrade DKube Version

DKube can be upgraded to a newer version through the upgrade command. The format of the command is:

sudo ./dkubeadm dkube upgrade <version>
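
For example, to upgrade to a hypothetical release number (substitute the actual version supplied with the release):

sudo ./dkubeadm dkube upgrade 3.1.0.0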

Backup and Restore

The DKube database can be backed up and restored from the installation/management node. This function is performed from the $HOME/.dkube directory. Both functions rely on a backup.ini file for configuration, and use the same k8s.ini and dkube.ini files that were edited as part of the installation.

Important

There are some portions of the DKube database that are not backed up. These are explained in the following section.

Backup Exclusions

The following items are not backed up through the DKube backup feature:

  • Any System packages installed from a command line

  • Any files added to folders outside the workdir

Important

In order to save the complete image, including any such changes, it should be pushed to a Docker registry

Editing the backup.ini File

The backup.ini file provides the target for the backup, and the credentials where necessary.

_images/backup_ini.png

Important

The input fields must be enclosed in quotes, as shown in the figure

Local Backup

The “local” backup option will save the snapshot to a folder on the local cluster. Only the following sections should be filled in.

Field                     Value
-----                     -----
PROVIDER                  local
BACKUP_DIR                Target folder on local cluster
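
A minimal sketch of the corresponding backup.ini entries is shown below. The assignment layout and the target path are illustrative only; the template file shipped in $HOME/.dkube should be used as the starting point.

# Local backup (illustrative values)
PROVIDER = "local"
BACKUP_DIR = "/var/dkube-backups"    # hypothetical target folder on the local cluster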

AWS-S3

The AWS-S3 backup option will save the snapshot to an AWS S3 bucket. Only the following sections should be filled in.

Field                     Value
-----                     -----
PROVIDER                  aws-s3
BUCKET                    Storage bucket in cloud
AWS_ACCESS_KEY_ID         AWS access key ID
AWS_SECRET_ACCESS_KEY     AWS secret access key

For details on the AWS credential types, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html
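
A hypothetical sketch of the equivalent backup.ini entries for an AWS S3 target; the bucket name and credentials are placeholders.

# AWS S3 backup (illustrative values)
PROVIDER = "aws-s3"
BUCKET = "my-dkube-backups"                      # placeholder bucket name
AWS_ACCESS_KEY_ID = "AKIAXXXXXXXXXXXXXXXX"       # placeholder access key ID
AWS_SECRET_ACCESS_KEY = "xxxxxxxxxxxxxxxxxxxx"   # placeholder secret access key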

MinIO

The MinIO backup option will save the snapshot to a MinIO server. Only the following sections should be filled in.

Field                     Value
-----                     -----
PROVIDER                  minio
BUCKET                    Storage bucket on the MinIO server
MINIO_ACCESS_KEY_ID       Access key for the MinIO server
MINIO_SECRET_ACCESS_KEY   Secret key for the MinIO server
MINIO_ACCESS_IP           URL of the MinIO server
MINIO_ACCESS_PORT         Port number of the MinIO server
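
A hypothetical sketch for a MinIO target; every value below is a placeholder.

# MinIO backup (illustrative values)
PROVIDER = "minio"
BUCKET = "dkube-backups"                   # placeholder bucket name
MINIO_ACCESS_KEY_ID = "minio-access"       # placeholder access key
MINIO_SECRET_ACCESS_KEY = "minio-secret"   # placeholder secret key
MINIO_ACCESS_IP = "192.0.2.50"             # placeholder MinIO server address
MINIO_ACCESS_PORT = "9000"                 # placeholder port; confirm the port used by your MinIO server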

GCS (Google Cloud Storage)

The GCS backup option will save the snapshot to a Google Cloud Storage bucket. Only the following sections should be filled in.

Field                            Value
-----                            -----
PROVIDER                         gcs
BUCKET                           Storage bucket in cloud
GOOGLE_APPLICATION_CREDENTIALS   JSON file containing the GCP private key
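
A hypothetical sketch for a GCS target; the bucket name and key-file path are placeholders.

# GCS backup (illustrative values)
PROVIDER = "gcs"
BUCKET = "dkube-backups"                                     # placeholder bucket name
GOOGLE_APPLICATION_CREDENTIALS = "/home/user/gcp-key.json"   # placeholder path to the GCP private key JSON file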

Backup

DKube backup takes a snapshot of the database and saves it on the target provider. The backup snapshots can be listed, and restored to a DKube cluster.

Important

There are some items that are not backed up by this process. These are explained in the section Backup Exclusions

Before initiating the backup, DKube must be stopped through the following command so that a reliable backup can be accomplished.

sudo ./dkubeadm dkube shutdown

The backup is initiated by the following command. A backup name must be provided.

sudo ./dkubeadm backup start <name>
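
Putting the steps together, a typical backup sequence looks like the following; the snapshot name is illustrative.

sudo ./dkubeadm dkube shutdown
sudo ./dkubeadm backup start pre-upgrade-backup
sudo ./dkubeadm backup list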

List of Snapshots

To see a list of backup snapshots on the current provider, and to identify the ID for each one, the following command is used.

sudo ./dkubeadm backup list

An output similar to the figure below will be shown.

_images/Backup_List.png

Delete Backup Snapshot

A backup snapshot can be removed from the current storage location. This is accomplished by using the ID listed in the backup list.

sudo ./dkubeadm backup delete <ID>

Restore

The DKube restore function will take a DKube database and install it on the cluster.

Important

Restoring the database will overwrite the current database on that cluster

The restore function uses a backup ID to identify which snapshot to restore. It is initiated by the following command, using the ID from the backup list. The command copies the database to the cluster and starts DKube.

sudo ./dkubeadm restore start <ID>
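
For example, after identifying the snapshot with the list command (the ID shown is illustrative):

sudo ./dkubeadm backup list
sudo ./dkubeadm restore start 2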

Restore Failure

If the restore fails, the user will be given the option of reverting to the DKube database that was on the cluster prior to the restore.

DKube Migration

The DKube database (Runs, Notebooks, Inferences, Models, Pipelines, etc) can be migrated from one cluster to another. This function is performed from the $HOME/.dkube/cli directory.

Important

The entities must all be in the Stopped state to initiate migration

Editing the Migration.ini File

In order to provide the source and destination information, the migration.ini file must be edited. An example ini file is provided below.

_images/migration_ini.png

Important

The fields in the migration.ini files must be enclosed in quotes, as shown in the example

Field       Value
-----       -----
Name        User-generated name to use for migration tracking
JWT         Authorization token, as described below
dkubeurl    IP address of the DKube instance
jobs        List of entities to migrate
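
A hypothetical sketch of a migration.ini file follows. The assignment layout, name, address, and entity list are placeholders; the template delivered with DKube should be used as the starting point.

Name = "cluster-a-to-cluster-b"                              # placeholder migration name
JWT = "<token from the Operator Developer Settings popup>"
dkubeurl = "192.0.2.10"                                      # placeholder IP address of the DKube instance
jobs = "<list of entities to migrate>"                       # placeholder entity list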

Getting the Access Token

The JWT token in the migration.ini file is available from the DKube Operator screen.

_images/Operator_Right_Menu_Developer.png




_images/Operator_Developer_Settings_Popup.png

Executing the Migration

After the migration.ini file has been filled in, the migration is initiated with the following command.

sudo ./dkubectl migration start --config migration.ini

The migration will begin, and the tool exits when the status shows 100% complete.

_images/Migration_Status.png

If the user wants to execute the migration more than once, one of the following steps must be taken:

  • Edit the migration.ini file to use another name

  • Delete the existing name using the following command.

sudo ./dkubectl migration delete --name <migration name> --config migration.ini
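
For example, to clear a completed migration name and run the migration again (the name is a placeholder):

sudo ./dkubectl migration delete --name cluster-a-to-cluster-b --config migration.ini
sudo ./dkubectl migration start --config migration.ini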

Stop and Start DKube

Before shutting down the cluster, DKube should be stopped. This ensures that there is no activity and that DKube is shut down in a reliable manner.

Stopping DKube

The following command will stop all activity in DKube:

sudo ./dkubeadm dkube shutdown

Starting DKube

The following command will restart DKube after the cluster is brought back up:

sudo ./dkubeadm dkube start

Managing the Kubernetes Cluster

The instructions for managing the k8s cluster depend upon the platform type (community or managed):

  • Community k8s: see Managing a Community Kubernetes Cluster

  • Managed k8s: see Managed Kubernetes Cluster

Managing a Community Kubernetes Cluster

If community k8s was included as part of the installation, the cluster can be managed using the dkubeadm tool. This should be performed from the $HOME/.dkube directory.

Adding a Node

A worker node can be added to the cluster by using the dkubeadm tool through its public IP address. The other information necessary to add the node is read from the ini files that were used for installation. DKube will automatically recognize the node and add its resources into the application.

The worker node must be accessible from the installation node.

sudo ./dkubeadm node add <Public IP Address>
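
For example, assuming a new worker node with the illustrative public address 192.0.2.21:

sudo ./dkubeadm node add 192.0.2.21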

Removing a Node

Removing a node from a cluster is accomplished by the dkubeadm command through its public IP address. The worker node will be cleanly removed from the DKube application. The cluster will continue to operate.

Important

This command should be used prior to removing the node from the cluster in order to ensure a clean removal

sudo ./dkubeadm node remove <Public IP Address>

Note

A network error may occur after a node has been removed. This is a temporary status message and can be ignored. Refresh the browser window to remove it.

Note

When DKube has been configured to use Ceph, the Ceph cluster should be notified before a Ceph node is removed. This is described in section Removing a Ceph Node in HA

Removing a Node that is Inaccessible

If a node is inaccessible, the node remove command will not work directly. In this case, remove the node from the k8s.ini file before running the node remove command.

Adjusting GPU Pool Sizes

Note

After a node is removed, the maximum number of GPUs assigned to the pools may exceed the total number of GPUs in the cluster

For example:

  • Initially there are 10 GPUs in the cluster

  • All 10 GPUs are spread out in different pools, but the total of all of the GPUs in all of the pools cannot be more than 10.

  • After a node is removed, there may be fewer GPUs in the cluster (for example, there may only be 8 GPUs in the cluster)

  • The pool GPU assignments now add up to more GPUs than exist on the cluster

The Operator needs to manually re-assign the pool maximums so that they do not add up to more GPUs than there are on the cluster.

Managed Kubernetes Cluster

When running DKube on a managed k8s cluster, the management of the cluster is handled by the platform. DKube will recognize the state of the cluster nodes and configure itself appropriately.

Adding a Node

Adding a node to the cluster is done through the specific managed k8s platform. The only requirement is that the necessary software drivers be installed on the new node. This is accomplished by:

  • Editing the $HOME/.dkube/k8s.ini file to add the new node IP

  • Ensuring that the new node is accessible passwordlessly from the installation node

  • Running the dkubeadm node setup command on the new node

These are all described in more detail in the instructions for installing DKube on a managed platform.

Removing a Node

In general, no special actions are necessary to remove a node from the cluster. DKube will recognize that the node has been removed, and configure itself appropriately.

Note

When DKube has been configured to use Ceph, the Ceph cluster should be notified before a Ceph node is removed. This is described in section Removing a Ceph Node in HA

Removing a Ceph Node in HA

When DKube has been installed with the following options, DKube creates a Ceph cluster that aggregates the local storage on the cluster nodes into shared storage.

Field          Value
-----          -----
HA             true
STORAGE_TYPE   disk, pv, or sc

In this situation, it is advisable to notify the Ceph controller before removing the node. This will cause the data to be moved from that node to other nodes that are part of the Ceph cluster.

The command is:

sudo ./dkubeadm ceph remove <hostname>
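
For example, assuming the node to be removed has the hostname worker-3 (illustrative):

sudo ./dkubeadm ceph remove worker-3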

Shutting Down & Restarting the Cluster

The cluster will remain operational as long as the master node is up and running. Shutting down the master node will shut down the entire cluster.

Important

Before shutting down the cluster, DKube should be stopped to ensure a reliable recovery. This is described in section Stopping DKube

When bringing the cluster back up, the master node should be brought back first, then the worker nodes. After the cluster is brought up, DKube should be restarted as described in section Starting DKube.