Installation & Cluster Management

Overview

This guide describes the process of installing and managing DKube.

DKube installation and cluster management are accomplished through a management tool, dkubeadm, that simplifies the process. The tool:

  • Ensures that the target system is accessible and has the right software prerequisites
  • Installs DKube and (if required) Kubernetes on the cluster
  • Allows uninstalling of DKube & Kubernetes
  • Manages DKube and Kubernetes (if installed as per this guide) on the cluster after installation

After installation, the management capabilities include:

  • Backing up and restoring DKube ( Backup and Restore )
  • Stopping and starting DKube ( Stop and Start DKube )
  • Migrating the DKube database from one cluster to another ( DKube Migration )
  • Shutting down and restarting the cluster ( Shutting Down & Restarting the Cluster )
  • Adding and removing cluster worker nodes ( Adding a Worker Node )

This guide also describes how to uninstall DKube and k8s from the cluster ( Uninstalling DKube & Kubernetes ).

DKube Configuration

The cluster can have a single master node, or multiple nodes with one master and one or more worker nodes.

  • The Master node coordinates the cluster, and can optionally contain GPUs
  • Each Worker node provides more resources, and is a way to expand the capability of the cluster

The Master node must always be running for the cluster to be active. Worker nodes can be added and removed, and the cluster will continue to operate. This is described in the section Managing the Kubernetes Cluster.

If the cluster needs to be shutdown, follow the procedure in section Shutting Down & Restarting the Cluster.

Installation Configuration

The installation scripts can be run:

  • From the master node in the cluster, or
  • From a remote node that is not part of the cluster

The overall flow of installation is as follows:

  • Copy the installation scripts and associated files from the dkubeadm Docker image to the installation node (master node or remote node)
  • Ensure that the installation node has passwordless access to all of the nodes on the cluster using an ssh key pair
  • Edit the installation ini files with the appropriate options
  • Install DKube and its required software components (including the optional installation of community Kubernetes)
  • Access DKube through a browser

The figures below show the 2 possible configurations for installation. The only requirement is that the installation node (either the master node on the cluster or a remote node) must have passwordless access to all of the nodes in the cluster. This is discussed in more detail in section Accessing the Cluster.

Master Node Installation

_images/Installation_Block_Diagram_Local.png

In a local installation, the scripts are run from the master node of the DKube cluster. After installation, the management of the cluster ( Managing the Kubernetes Cluster) is also accomplished from the master node.

Important

Even if the installation is executed from the master node on the cluster, the ssh public key still needs to be added to the appropriate file on all of the nodes on the cluster, including the master node. This is explained in the section on passwordless ssh key security.

Remote Node Installation

_images/Installation_Block_Diagram_Remote.png

DKube and its associated applications can be installed and managed from a remote node that is not part of the DKube cluster. The installation node needs to have passwordless access to all of the nodes on the DKube cluster.

Note

If the installation was done remotely, cluster management can be performed from the remote installation node, or from the DKube cluster master node after the installation is complete. However, once any management tasks have been performed from the master node, all future management must also be done from the master node; you cannot switch back and forth between the two.

DKube and Kubernetes

DKube requires Kubernetes to operate. The DKube installation process can install a community version of k8s prior to DKube installation, or DKube can be installed directly if a supported version of k8s is already installed on the cluster. This guide highlights the optional k8s installation in the instructions. If k8s is already installed, please skip over that section, as shown in the guide.

Prerequisites

Supported Platforms

The following platforms are supported for DKube:

  • Installation platform can be any node running:
    • Ubuntu 18.04
    • CentOS 7
  • Cluster nodes can be one of the following:
    • On-prem (bare metal or VM)
    • Google GCP
    • Amazon AWS

Note

Please note that not all combinations of provider and OS are supported. Additional platforms are being released continually and are described in application notes.

The DKube installation scripts handle most of the work in getting DKube installed, including the installation of the software packages on the cluster. There are some prerequisites for each node, described below.

Important

Some platforms require additional steps as described in this section. Please read this section carefully, and ensure that you have taken these steps before moving ahead with the installation.

Node Requirements

The installation node has the following requirements:

  • A supported operating system
  • Docker

Additional Requirements for GCP on CentOS

When installing Kubernetes and DKube on a GCP platform running CentOS, additional steps are required before installation. After the VM has been created, the following commands must be executed to update and install the proper software.

sudo yum -y update
sudo yum install -y redhat-lsb

Docker Installation on Ubuntu

The following commands can be used to install Docker on Ubuntu:

sudo apt update -y
sudo apt install docker.io -y

Docker Installation on CentOS

The following commands can be used to install Docker on CentOS:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce-18.09.2-3.el7 docker-ce-cli-18.09.2-3.el7 containerd.io
sudo systemctl start docker
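
On either operating system, an optional quick check confirms that Docker is installed and that the daemon is responding before moving on; enabling the service ensures Docker starts again after a reboot:

sudo docker version
sudo systemctl enable docker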

The DKube Cluster nodes have the following requirements:

  • A supported operating system
  • Nodes should all have static IP addresses, even if the VM exists on a cloud
  • Node names must be lower case
  • All nodes must be on the same subnet
  • All nodes must have the same user name and ssh key

Each node on the cluster should have the following minimum resources:

  • 4 CPU cores (recommend 8)
  • 32GB RAM
  • 100GB Storage

Note

A script is included to install Community Kubernetes on the cluster. It will install the required software versions. However, if you prefer to install k8s on your own, the following prerequisites are necessary.

NVIDIA Driver 410.79-418.39
CUDA 10, 10.1
Community Kubernetes 1.12-1.15

Important

Only GPUs of the exact same type can be installed on a node. So, for example, you cannot mix an NVIDIA V100 and P100 on the same node. And even GPUs of the same class must have the same configuration (i.e. memory).

Important

The Nouveau driver should not be installed on any of the nodes in the cluster. If the driver is installed, you can follow the instructions in the section Removing Nouveau Driver.

Cluster and DKube Resiliency

For highly available operation, DKube supports multi-node resiliency (HA). An HA system prevents any single point of failure through redundant operation. For resilient operation, at least 3 nodes are required. There are 2 different types of independent resiliency offered: cluster and DKube. The details of how to configure k8s and DKube for resilient operation are provided in the pertinent sections that explain how to complete the ini files.

Cluster Resiliency

Cluster resiliency provides the ability of Kubernetes to offer a highly available control plane. Since the master node in a k8s system manages the cluster, cluster resiliency is enabled by having 3 master nodes. There can be any number of worker nodes. In a resilient cluster, a load balancer monitors the health of pods running on all of the master nodes. If a pod goes down, requests are automatically sent to pods running on other master nodes. In such a system, any number of worker nodes can go down and the cluster remains usable, but only a single master node can go down for the system to continue operating.

In order to enable cluster resiliency, the HA option must be set to “true” in the k8s.ini file.

Note

Since the master node manages the cluster, for the best resiliency it is advisable to not install any GPUs on the master nodes, and to prevent any DKube-related pods from being scheduled on them. The mechanism to do that is explained in the section that describes the k8s.ini file configuration.

Note

This guide will explain how to configure community k8s for resilient operation. If the user does their own k8s installation, it is up to them to ensure that the cluster is resilient. Depending upon the type of k8s, the details will vary.

DKube Resiliency

DKube resiliency is independent of - and can be enabled with or without - cluster resiliency. If the storage is installed by DKube, resiliency ensures that the storage and databases for the application have redundancy built in. This prevents an issue with a single node from corrupting the DKube operation. Externally configured storage is not part of DKube resiliency. For DKube resiliency to function, there must be at least 3 schedulable nodes. That is, 3 nodes that allow DKube pods to be scheduled on them. The nodes can be master nodes or worker nodes in any combination.

In order to enable DKube resiliency, the HA option must be set to “true” in the dkube.ini file.

Resiliency Examples

There are various ways that resiliency can be enabled at different levels. This section lists some examples:

Nodes   Master Nodes   Worker Nodes   Master Schedulable   Resiliency
3       1              2              Yes                  DKube Only
3       1              2              No                   No Resiliency
3       3              0              Yes                  Cluster & DKube
4       1              3              Yes/No               DKube Only
4       3              1              Yes                  Cluster & DKube
4       3              1              No                   Cluster Only
6       3              3              Yes/No               Cluster & DKube

Getting the Files

The files necessary for installation, including the scripts, the .ini files, and any other associated files are pulled from Docker, using the following commands:

sudo docker login -u <Docker username>
Password: <Docker password>
sudo docker pull ocdr/dkubeadm:<DKube version>
sudo docker run --rm -it -v $HOME/.dkube:/root/.dkube ocdr/dkubeadm:<DKube version> init

Note

The docker credentials and DKube version number (x.y.z) are provided separately.

This will copy the necessary files to the folder $HOME/.dkube:

dkubeadm
  • Tool used to install & uninstall Kubernetes & DKube on the cluster
  • After installation, this tool manages the cluster
k8s.ini
  • Configuration file for Kubernetes cluster installation
dkube.ini
  • Configuration file for DKube installation
ssh-rsa Key Pair
  • ssh key pair for passwordless access to the remote machines
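
A quick listing of the folder confirms that the files described above are present; the output should include at least dkubeadm, k8s.ini, dkube.ini, and the ssh-rsa key pair:

ls $HOME/.dkube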

Accessing the Cluster

In order to run DKube both during and after installation, a minimum level of network access must be provided from any system that needs to reach the cluster. This includes access to the url used to open DKube from a browser.

Protocol   Port Range   Source
TCP        32222        Access IP
TCP        32223        Access IP
TCP        32323        Access IP
TCP        6443         Access IP
TCP        443          Access IP
TCP        22           Access IP
All        0-65535      Private Subnet
ICMP       0-65535      Access IP

The source IP access range is in CIDR format. It consists of an IP address and mask combination. For example:

  • 192.168.100.14/24 would allow IP addresses in the range 192.168.100.x
  • 192.168.100.14/16 would allow IP addresses in the range 192.168.x.x

Note

The source IP address 0.0.0.0/0 can be used to allow access from any browser client. If this is used, then a firewall can be used to enable appropriate access
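
As an illustration only, if a host-level firewall such as ufw is used on an Ubuntu node, rules along the following lines would open the DKube UI and ssh ports to a specific source range. The ports, source range, and firewall tool are examples; adjust them to match the table above and your environment:

sudo ufw allow proto tcp from 192.168.100.0/24 to any port 32222
sudo ufw allow proto tcp from 192.168.100.0/24 to any port 22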

Cluster Access from Installation Node

In order to run the scripts, the installation node needs to be able to access each node in the cluster without requiring a password. This includes the following requirements at a minimum:

  • An SSH key pair, with the public key copied to each node on the cluster
  • A sudoers file that allows appropriate account access

The Docker init creates an ssh key pair and places the key files in the $HOME/.dkube directory.

  • If you want to use your own ssh key pair, you must take additional steps as described below.

Important

Even in the case where the master node is used as the installation node, the ssh key pair must still be added to the master node authorized_keys file in the $HOME/.ssh directory

Using Your Own Key Pair

If you have your own ssh key pair, it is assumed that the private key works with all of the DKube cluster nodes, including the master node. In this case, the following steps are required:

  • Copy the private key to the $HOME/.dkube directory. It needs to be copied with the name ssh-rsa
  • Delete the file ssh-rsa.pub from the $HOME/.dkube directory, since it will not match your new private ssh-key file

Docker-Supplied ssh Key Pair

The initial Docker init:

  • Creates a $HOME/.dkube directory
  • Creates an ssh key pair to allow passwordless access to the DKube cluster nodes

If the ssh key pair created by the Docker init will be used for cluster access, then its public key file contents need to be added to the $HOME/.ssh/authorized_keys file on each node of the DKube cluster, including the master node. This can generally be accomplished by simply adding it with:

sudo ssh-copy-id -i $HOME/.dkube/ssh-rsa.pub <username>@<Master Node IP Address>
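
If ssh-copy-id is not available on the installation node, the public key can be appended manually on each node instead. This is shown only as an equivalent alternative:

cat $HOME/.dkube/ssh-rsa.pub | ssh <username>@<Node IP Address> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"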

Other Access Requirements

The sudoers file on each node must include the target cluster account name with the necessary access. This can be accomplished using the visudo command, and adding the following line:

<username> ALL=(ALL) NOPASSWD:ALL

Depending upon the platform type, the details of using the ssh key pair might differ, and there may be other access credentials that must also be put in place. For example, there may be a requirement for a VPN or some other firewall-related difference. The cluster manager for the installation can provide the required additional security steps depending upon the installation.

For specific platform types, additional or different steps must be taken.

GCP Google GCP System Installation
AWS Amazon AWS System Installation

After these steps have been successfully accomplished, proceed to the final access verification step Final Access Verification.

Final Access Verification

For all platform types, after the security access steps have been taken, the user should ensure that each node in the cluster can be properly accessed by the installation node without a password.

The installation process should not move ahead if these verification steps are not successful.

ssh -i ssh-rsa <username>@<Master Node IP Address>

Note

Depending upon the account privileges, the command might need to be prefixed with “sudo”
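
To additionally confirm that passwordless sudo is in place on a node (see Other Access Requirements), a check along these lines can be used; it prints OK only when sudo does not prompt for a password. This is an illustrative check, not part of the installation scripts:

ssh -i ssh-rsa <username>@<Master Node IP Address> "sudo -n true && echo OK"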

Installation

Note

The rest of the installation procedure is run from the $HOME/.dkube folder on the installation node

A fully automated process will install:

  • Community k8s
  • DKube

If both need to be installed, start with the following section. If the cluster already has k8s installed, go directly to the section Installing DKube.

After DKube has been installed, it can be uninstalled by following the steps in the section Uninstalling DKube & Kubernetes.

Installing Kubernetes

Kubernetes ini File

The k8s.ini file provides the information needed to access the cluster and install Kubernetes.

The k8s.ini file has the following format:

_images/k8s-ini-File.png

Only the following fields should be filled in:

provider Type of platform
distro OS type
master Allow jobs to be scheduled on the master node
master_HA Set to true to enable a resilient cluster
nodes IP addresses of the nodes in the cluster
user The user name for the account

Provider & OS

This identifies the type of platform that DKube will run on, and the operating system.

Important

Please note that not all combinations of provider and OS are supported. Check the section Prerequisites

Schedule Pods on Master Node

On a cluster that includes only a Master node, jobs will automatically be scheduled on that node. If the cluster includes worker nodes, DKube can allow jobs to be scheduled on the Master node or, if desired, only on the Worker nodes. If the system is being set up as a resilient platform, this option will allow jobs to be scheduled on all of the master nodes or none.

Note

Since the Master node does the overall coordination of the cluster, it is recommended that jobs not be schedulable on the Master node when possible.

Resilient Operation

If the cluster has at least 3 nodes, it can be configured to be resilient, guaranteeing that it will continue when one of the nodes becomes inoperable. This is explained in the section Cluster and DKube Resiliency. To enable this option, set the master_HA option to true.

Nodes

The node IP addresses should reflect how the installation node accesses the cluster nodes. Typically, the public IP address is all that is required; if both public and private addresses are needed, the ini file shows how they need to be provided.

Note

The first IP address in the list will be the master node, and the rest will be worker nodes. When master_HA is set to true, the first 3 nodes are the master nodes.

User

This is the DKube cluster user account name. It can be a root or non-root account, but the same account must be available on all cluster nodes, and must have passwordless access through an ssh key and sudoers permissions.

In general, the default User will work for installation on the cloud, and this field does not need to be changed. For an on-prem installation, or a non-standard cloud installation, the user name will be provided by the IT group.
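
Pulling these fields together, a filled-in k8s.ini for a three-node, single-master, non-resilient on-prem Ubuntu cluster might contain entries along the lines of the sketch below. Every value here is hypothetical, and the exact syntax and accepted values for fields such as provider and distro are defined by the file delivered with your release (shown in the figure above):

provider = onprem
distro = ubuntu
master = false
master_HA = false
nodes = 192.168.100.11,192.168.100.12,192.168.100.13
user = dkubeuser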

After the prerequisites and cluster access steps have been successfully completed, the installation can begin. The dkubeadm tool is used for this step. The tool should be run from the $HOME/.dkube folder on the installation node.

Installing Kubernetes on the Cluster

The installation script will check that the cluster is accessible, and that the prerequisites are all in place. It will then initiate the installation. If an error is detected, the script will stop with a message that identifies the problem.

sudo ./dkubeadm cluster install

Error During Installation

If the cluster install procedure detects that there is an existing Kubernetes cluster already in place, or that some of the prerequisites are not correct, the first troubleshooting step is to uninstall the cluster, which will also do a cleanup. The command to uninstall and clean up is:

sudo ./dkubeadm cluster uninstall

After the cleanup is complete, run the install command again. If it still fails, contact your IT manager.

Once the Kubernetes cluster has been successfully installed, you can use the Kubernetes dashboard to see the status of the system. Follow the instructions that are provided as part of the installation log, right after the message that the installation has been successful.

Note

Ignore the warning messages in the output

_images/Kubespray-Successful.png

Selecting the link in the log output opens a browser window that asks for credentials. Pasting the contents of the k8s-token file into the credentials prompt provides access to the dashboard.

Installing DKube

DKube can be installed on any platform that has a supported combination of an operating system and Kubernetes. Supported combinations are being added continually, and this guide only provides a subset of them. Application notes are available for other platforms.

  • If the user installed community k8s with the scripts provided, continue through the rest of this section.
  • If the user installed k8s themselves through their own procedures, additional steps may be required, as described in the sections referenced below. After those steps have been accomplished, return to this section and complete the installation.
Rancher Additional Steps for Installing DKube on Rancher

DKube ini File

The dkube.ini file controls the DKube installation options.

Standard DKube installation

For a standard DKube installation, where the community k8s script was used:

  • Fill in the top section labeled [REQUIRED]
  • Fill in the [STORAGE] section, identifying an NFS if one is being used

Installing DKube with Rancher

For a DKube installation with Rancher already installed:

  • Fill in the top section labeled [REQUIRED]
  • Fill in the [STORAGE] section, identifying an NFS if one is being used
  • For non-NFS storage, fill in the STORAGE_DISK_NODE as described in the [STORAGE] section with the node name as identified in section Additional Steps for Installing DKube on Rancher.

Editing the DKube ini File

_images/dkube-ini-File.png
KUBE_PROVIDER Type of platform
HA Set true to enable DKube resiliency
username User-chosen initial login username
password User-chosen initial login password

KUBE_PROVIDER

This identifies the type of platform that DKube will run on.

Platform Field Value
Community k8s dkube
Rancher k8s dkube

Resilient Operation

DKube can run as a resilient system that guarantees the databases remain usable when one of the nodes becomes inoperable. This requires at least 3 schedulable nodes in the cluster. This is explained in the section Cluster and DKube Resiliency. If you have provided that minimum configuration, you can set this to be true for resilient operation.

Username and Password

This provides the credentials for initial DKube local login. The initial login user has both Operator and Data Scientist access. Only a single user can log in with this method. More users can be added through a backend access configuration using the OAuth screen.

The Username has the following restrictions:

  • Do not use the following names
    • dkube
    • monitoring
    • kubeflow

Storage Options

The storage options are configured in the [STORAGE] section of the dkube.ini file.

Non-NFS Storage
  • STORAGE_TYPE option should be left at its default value, disk
  • STORAGE_DISK_NODE should be set as follows:
Platform Field Value
Community k8s Default state: auto
Rancher k8s Node name as identified in the Rancher Server in section Additional Steps for Installing DKube on Rancher
_images/dkube-ini-rancher.png
DKube Installation with NFS

If an NFS is available on the DKube cluster, and the user wants DKube to make use of it, the following fields must be modified. This applies to all k8s platforms.

_images/nfs_dkube_ini.png
STORAGE_TYPE Type of storage - should be nfs
STORAGE_NFS_SERVER Internal IP address of nfs server
STORAGE_NFS_PATH Absolute path of the exported share

Note

The path must exist on the share, but does not need to be mounted. DKube will perform its own mount
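
Putting the pieces together, a dkube.ini configured for a resilient installation on community k8s with NFS storage might look like the sketch below. The section and field names follow the descriptions above, but the values and the exact layout are illustrative only; use the file delivered with your release (shown in the figures) as the authoritative template:

[REQUIRED]
KUBE_PROVIDER = dkube
HA = true
username = operator1
password = <choose an initial login password>

[STORAGE]
STORAGE_TYPE = nfs
STORAGE_NFS_SERVER = 192.168.100.20
STORAGE_NFS_PATH = /srv/dkube-nfs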

Installing the Necessary Packages on the Cluster

Before the DKube installation is started, it is important that all of the required packages are installed on the system.

If the One Convergence k8s Script was Used

If the installation of k8s was accomplished using the dkubeadm command on any platform or operating system, run the setup command as shown below. This will install all of the packages required by DKube. If an existing package is already installed, it will not be re-installed. Only those packages that are missing will be installed.

sudo ./dkubeadm node setup

Note

This command requires that the k8s.ini be filled in as described in Installing Kubernetes, and that the installation node has ssh access to the cluster as described in Cluster Access from Installation Node. This is required even if the user installed k8s on their own.

Customer-Installed k8s

If the customer has installed k8s on the cluster, it is expected that the necessary packages have been installed prior to installing DKube. In order to check that this has been done, run the following prerequisite command:

sudo ./dkubeadm dkube prereq

Note

This command requires that the k8s.ini be filled in as described in Installing Kubernetes, and that the installation node has ssh access to the cluster as described in Cluster Access from Installation Node. This is required even if the user installed k8s on their own.

If the check shows that some prerequisites are not installed:

  • Install them manually, or
  • Run the node setup command as described above to let them be installed automatically

Installing DKube on the Cluster

After completing the configuration of the dkube.ini file and installing the prerequisites, the DKube installation can begin. The initial setup will start, and the script log will provide a url link to a browser-based dashboard that can be used to track the progress of the installation. The url is highlighted in the figure. The public IP referenced in the url is the first IP address used in the k8s.ini file, which is the Master node.

sudo ./dkubeadm dkube install
_images/dkube-install-log.png
_images/DKube-Install-Dashboard.jpg

The dashboard will show the status of COMPLETED when DKube has been successfully installed.

If the DKube install procedure detects that some of the prerequisites are not correct, the first troubleshooting step is to uninstall and clean up the system. The command to uninstall and clean up is:

sudo ./dkubeadm dkube uninstall

After this successfully completes, run the DKube install command again. If it still fails, contact your IT manager.

Accessing DKube

After the DKube installation dashboard shows that the installation has completed, the DKube UI is shown as part of the dashboard.

The form of the url is:

https://xxx.xxx.xxx.xxx:32222/

The IP address in the url is the first IP address used in the k8s.ini file, which is the Master node.
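
Before opening a browser, reachability of the DKube UI port can be checked from any client machine that has been granted access. This is an illustrative check only; the -k flag skips certificate verification, which is typically needed when the installation uses a self-signed certificate:

curl -k https://<Master Node IP Address>:32222/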

Initial Login

The initial login after installation is accomplished with the username and password entered in the dkube.ini file. Authorization based on a backend mechanism is explained in the User Guide in the section “Getting Started”.

Uninstalling DKube & Kubernetes

DKube and k8s can be uninstalled by following the steps in this section. Uninstallation commands should be run from the $HOME/.dkube directory. The uninstallation process uses the same ini files that were used for installation.

Uninstalling DKube

DKube should always be uninstalled from the cluster first. The following command is used to initiate the DKube removal:

sudo ./dkubeadm dkube uninstall

Uninstalling k8s

If k8s was installed through the script described in this guide, then it can also be uninstalled through a One Convergence script. If k8s was installed by the customer independently of the script, then it must be uninstalled manually through the appropriate method. The following command is used to uninstall k8s:

sudo ./dkubeadm cluster uninstall

Managing DKube

DKube can be managed from the $HOME/.dkube directory.

Backup and Restore

The DKube database can be backed up and restored from the installation/management node. This function is performed from the $HOME/.dkube directory. Both functions rely on a backup.ini file for configuration, and use the same k8s.ini and dkube.ini files that were edited as part of the installation.

Important

There are some portions of the DKube database that are not backed up. These are explained in the following section.

Backup Exclusions

The following items are not backed up through the DKube backup feature:

  • Any System packages installed from a command line
  • Any files added to folders outside the workdir

Important

In order to save the complete image, it should be pushed to a Docker registry

Editing the backup.ini File

The backup.ini file provides the target for the backup, and the credentials where necessary.

_images/backup_ini.png

Important

The input fields must be enclosed in quotes, as shown in the figure

Local Backup

The “local” backup option will save the snapshot to a folder on the local cluster. Only the following sections should be filled in.

PROVIDER local
BACKUP_DIR Target folder on local cluster

DKube

The “DKube” backup option will save the snapshot to the primary DKube folder as defined in the dkube.ini file. This will normally be the folder /var/dkube/ unless specified differently in the file. Only the following should be filled in.

PROVIDER dkube

AWS-S3

The AWS backup option will save the snapshot to an AWS-S3 cloud. Only the following sections should be filled in.

PROVIDER aws-s3
BUCKET Storage bucket in cloud
AWS_ACCESS_KEY_ID https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html
AWS_SECRET_ACCESS_KEY https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html

MinIO

The MinIO backup option will save the snapshot to a MinIO server. Only the following sections should be filled in.

PROVIDER minio
BUCKET Storage bucket in cloud
MINIO_ACCESS_KEY_ID https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html
MINIO_SECRET_ACCESS_KEY https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html
MINIO_ACCESS_IP url of the MinIO server
MINIO_ACCESS_PORT Port number of the MinIO server

GCS (Google Cloud Services)

The GCS backup option will save the snapshot to Google Cloud Services. Only the following sections should be filled in.

PROVIDER gcs
BUCKET Storage bucket in cloud
GOOGLE_APPLICATION_CREDENTIALS JSON file containing the GCP private key
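
As an example, a backup.ini set up for the local provider might contain only the entries sketched below. The field names follow the tables above, the values are hypothetical, the quoting requirement is as noted earlier, and the exact layout follows the file delivered with your release:

PROVIDER = "local"
BACKUP_DIR = "/var/dkube-backups"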

Backup

DKube backup takes a snapshot of the database and saves it on the target provider. The backup snapshots can be listed, and restored to a DKube cluster.

Important

There are some items that are not backed up by this process. These are explained in the section Backup Exclusions

Before initiating the backup, DKube must be stopped through the following command so that a reliable backup can be accomplished.

sudo ./dkubeadm dkube shutdown

The backup is initiated by the following command. A backup name must be provided.

sudo ./dkubeadm backup start <name>

List of Snapshots

To see a list of backup snapshots on the current provider, and to identify the ID for each one, the following command is used.

sudo ./dkubeadm backup list

An output similar to the figure below will be shown.

_images/Backup_List.png

Delete Backup Snapshot

A backup snapshot can be removed from the current storage location. This is accomplished by using the ID listed in the backup list.

sudo ./dkubeadm backup delete <ID>

Restore

The DKube restore function will take a DKube database and install it on the cluster.

Important

Restoring the database will overwrite the current database on that cluster

The restore function uses a backup ID to identify which backup to choose. The restore function is initiated by the following command, using the ID from the backup list. It will copy the database to the cluster and start DKube.

sudo ./dkubeadm restore start <ID>

Restore Failure

If the restore fails, the user will be given the option of reverting to the DKube database that was on the cluster prior to the restore.

DKube Migration

Jobs, Notebooks, and Inferences can be migrated from one cluster to another. This function is performed from the $HOME/.dkube/cli directory.

Important

The entities (Jobs, Notebooks, and Inferences) must all be in the Stopped state to initiate migration.

Editing the Migration.ini File

In order to provide the source and destination information, the migration.ini file must be edited. An example ini file is provided below.

_images/migration_ini.png

Important

The fields in the migration.ini files must be enclosed in quotes, as shown in the example

Name User-generated name to use for migration tracking
JWT Authorization token as described below
dkubeurl IP address of the DKube instance
jobs List of entities to migrate
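
For illustration, the kinds of values these fields take are sketched below. All values are hypothetical, the example file shown above indicates exactly where the source and destination details go, and the quoting requirement is as noted earlier:

Name = "migrate-to-new-cluster"
JWT = "<token copied from the Operator screen, as described below>"
dkubeurl = "192.168.100.11"
jobs = "training-job-1,notebook-1"
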
Getting the Access Token

The JWT token in the migration.ini file is available from the DKube Operator screen.

_images/Operator_Right_Menu_Developer.png




_images/Operator_Developer_Settings_Popup.png

Executing the Migration

After the migration ini file has been filled in, the command to initiate the migration is as follows:

sudo ./dkubectl migration start --config migration.ini

The migration will begin, and when the status shows 100% complete, the app exits.

_images/Migration_Status.png

If the user wants to execute the migration more than once, one of the following steps must be taken:

  • Edit the migration.ini file to use another name
  • Delete the existing name using the following command.
sudo ./dkubectl migration delete --name <migration name> --config migration.ini

Stop and Start DKube

Before shutting down the cluster, DKube should be stopped. This will ensure that there is no activity, and that DKube is shut down in a reliable manner.

Stopping DKube

The following command will stop all activity in DKube:

sudo ./dkubeadm dkube shutdown

Starting DKube

The following command will restart DKube after the cluster is brought back up:

sudo ./dkubeadm dkube start

Managing the Kubernetes Cluster

If community k8s was included as part of the installation, the cluster can be managed using the dkubeadm tool. This should be performed from the $HOME/.dkube directory.

Adding a Worker Node

A worker node can be added to the cluster by using the dkubeadm tool through its public IP address. The other information necessary to add the node is read from the ini files that were used for installation. DKube will automatically recognize the node and add its resources into the application.

The worker node must be accessible from the installation node. The proper security procedures for this will depend upon the platform type. This is described in the section Accessing the Cluster. This should already be set up during the original installation, but the final verification step should be performed prior to running the node add command as shown in section Final Access Verification.

Adding a Node

sudo ./dkubeadm node add <Public IP Address>

Removing a Worker Node

Removing a node from a cluster is accomplished by the dkubeadm command through its public IP address. The worker node will be cleanly removed from the DKube application. The cluster will continue to operate.

Removing a Node

sudo ./dkubeadm node remove <Public IP Address>

Note

A network error may occur after a node has been removed. This is a temporary status message and can be ignored. Refresh the browser window to remove it.

Removing a Node that is Inaccessible

If a node is inaccessible, the node remove command will not work directly. In this case, remove the node from the k8s.ini file before running the node remove command.

Removing Node on AWS System

Note

For an AWS system, after removing the node from the cluster, the instance must also be properly shut down before stopping it

The following command should be used after accessing the instance through ssh:

sudo shutdown now

Adjusting GPU Pool Sizes

Note

After a node is removed, the maximum number of GPUs assigned to the pools may exceed the total number of GPUs in the cluster.

For example:

  • Initially there are 10 GPUs in the cluster
  • All 10 GPUs are spread out in different pools, but the total of all of the GPUs in all of the pools cannot be more than 10.
  • After a node is removed, there may be fewer GPUs in the cluster (for example, there may only be 8 GPUs in the cluster)
  • The pool GPU assignments now add up to more GPUs than exist on the cluster

The Operator needs to manually re-assign the pool maximums so that they do not add up to more GPUs than there are on the cluster.

Shutting Down & Restarting the Cluster

The cluster will remain operational as long as the master node is up and running. Shutting down the master node will shut down the entire cluster.

Important

Before shutting down the cluster, DKube should be stopped to ensure a reliable recovery. This is described in section Stopping DKube

When bringing the cluster back up, the master node should be brought back first, then the worker nodes. After the cluster is brought up, DKube should be restarted as described in section Starting DKube.

Note

It can take 15 minutes for the cluster to be fully operational after it is restarted. Please wait that long before accessing the DKube url from your browser.

General Installation Activities

Platform-Specific Installation

Google GCP System Installation

This section describes the specific additional or different steps necessary to install DKube on a GCP system.

Once these steps have been taken, go to the section Final Access Verification and continue with the installation.

Amazon AWS System Installation

This section describes the specific additional or different steps necessary to install DKube on an AWS system.

  • The VM should have a static IP address
  • Create a security group that conforms to Accessing the Cluster
  • Attach the security group to the VMs
  • Assign roles to the nodes as explained in this section
  • Enable passwordless access to the nodes as described below

Once these steps have been taken, go to the section Final Access Verification and continue with the installation.

IAM Roles

AWS DKube installation requires that specific roles be assigned to the nodes, depending upon their node type (master and worker). The roles must be assigned as follows:

Master Node Only (no Workers) kubernetes-master-node
Master Node (Cluster that includes Workers) kubernetes-master
Worker Node kubernetes-node

The roles are assigned to the instances from the Instance dashboard.

  • Actions
    • Instance Settings
      • Attach/Replace IAM Role

If the role does not exist yet, it must be created from the IAM Role screen.

  • Create new IAM Role
    • Create role
      • <Choose Service>
        • Create Policy
          • JSON
            • <Cut and paste policy from the appropriate section below>
              • Review policy
                • <Name Policy>
                  • Create Policy
kubernetes-master

This is used for the Master node if there are also Worker nodes on the cluster.

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["ec2:*"], "Resource": ["*"] },
    { "Effect": "Allow", "Action": ["elasticloadbalancing:*"], "Resource": ["*"] },
    { "Effect": "Allow", "Action": ["route53:*"], "Resource": ["*"] },
    { "Effect": "Allow", "Action": "s3:*", "Resource": ["arn:aws:s3:::kubernetes-*"] }
  ]
}

kubernetes-node

This is used for the Worker nodes on the cluster.

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "s3:*", "Resource": ["arn:aws:s3:::kubernetes-*"] },
    { "Effect": "Allow", "Action": "ec2:Describe*", "Resource": "*" },
    { "Effect": "Allow", "Action": "ec2:AttachVolume", "Resource": "*" },
    { "Effect": "Allow", "Action": "ec2:DetachVolume", "Resource": "*" },
    { "Effect": "Allow", "Action": ["route53:*"], "Resource": ["*"] },
    { "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

kubernetes-master-node

This is used for the Master node if there are no other nodes on the cluster (i.e. Master node only - no Workers).

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["ec2:*"], "Resource": ["*"] },
    { "Effect": "Allow", "Action": ["elasticloadbalancing:*"], "Resource": ["*"] },
    { "Effect": "Allow", "Action": ["route53:*"], "Resource": ["*"] },
    { "Effect": "Allow", "Action": "s3:*", "Resource": ["arn:aws:s3:::kubernetes-*"] },
    { "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

Adding ssh Key at Instance Creation

In the AWS Console GUI, click on “Key Pairs”, then click on “Import Key Pair”. Paste the contents of “ssh-rsa.pub” into the “Public key contents” section and give it a “Key pair name”. Click “Import”.

Use the same ssh key pair name during instance creation (this is the last step of instance creation).

_images/aws-create-key.jpg
Adding ssh Key for Existing Instance

For an existing Instance, append the contents of ssh-rsa.pub into the file /home/ubuntu/.ssh/authorized_keys in each of the target nodes (master and each worker node).

After these steps have been successfully accomplished, proceed to the final access verification step Final Access Verification.

Additional Steps for Installing DKube on Rancher

To install DKube on a Rancher-managed system, it is assumed that:

  • A Rancher Server has been created
  • The DKube cluster is running CentOS 7 on all of the nodes
  • All the nodes should have a static IP address
  • Docker has been installed on the installation node

Adding a Cluster

The commands necessary to add a cluster are performed from the Rancher Server UI. Preparing the Linux host and starting the server are explained on the Rancher site https://rancher.com/ in the Get Started Section. Once the Rancher Server UI is running, log onto the server.

Add a cluster by selecting the Add Cluster button. Fill in the fields as follows:

  • Choose the Custom cluster
  • Provide a name for the cluster
  • Choose the Kubernetes Version as shown
  • Choose the Network Provider as shown
  • Select Next
  • Choose the cluster options as shown
  • Select Done
_images/Rancher-Add-Cluster.png
_images/Rancher-Custom-Cluster.png
_images/Rancher-Cluster-Fields.png
_images/Rancher-Custom-Options.png

Execute the Rancher Server Run Command

In order to create k8s on the DKube cluster, the Run command must be executed on each node in the DKube cluster. The Run command is generated during the Add Cluster procedure on the Rancher Server, and can be obtained later from the Edit screen as shown below.

_images/Rancher-Cluster-Edit.png
_images/Rancher-Run-Command.png

The execution of the Run command on the DKube node will initiate activity on the Rancher Server. When the activities are complete, the Rancher Server will show an “Active” status.

_images/Rancher-Active.png

Copy the Config file to the DKube Cluster Master Node

The Kubeconfig file from the Rancher Server must be copied to the installation node (either a remote installation node or the master node in the cluster). The Kubeconfig file can be found by selecting the cluster name.

_images/Rancher-Kubeconfig.png

The contents of the Kubeconfig file should be put into a file on the installation node: $HOME/.kube/config.

The contents of the $HOME/.kube folder should be copied to the root area.

sudo cp -r ~/.kube/ /root/
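
If kubectl is installed on the installation node, the copied config can be verified with a quick query; the node names it returns should match those shown on the Rancher Nodes screen in the next step. This is an illustrative check only:

sudo kubectl get nodes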

Identifying the Node Name from the Rancher Server

The DKube installation configuration will need the node name from the Rancher Server. This can be identified from the Nodes screen.

_images/Rancher-Nodename.png

After these steps have been performed, continue the installation of DKube in section DKube ini File.

Removing Nouveau Driver

In order to remove the Nouveau driver from a cluster node, you can use the following command.

rmmod nouveau

The next step is to blacklist it from the node. This can be accomplished by adding “blacklist nouveau” to the blacklist file.

vim /etc/modprobe.d/blacklist.conf
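
Equivalently, the line can be appended without opening an editor (shown only as an alternative to the manual edit above):

echo "blacklist nouveau" | sudo tee -a /etc/modprobe.d/blacklist.conf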

The following command will update the system based on the changes made.

sudo update-initramfs -u

Finally, reboot the node.