Using DKube

This section provides the information that you need to start using DKube immediately. Your access to DKube will depend upon your role.

Operator Role

If you are an Operator, you will have access to both the Operator and Data Science roles. The concepts and details of the roles are described in the sections that follow. By default, DKube is ready to use without any setup by the Operator. The Operator User is on-boarded and authenticated during the installation process, and this User is also enabled as a Data Scientist.

  • Default Pools & Groups have been created, and the Operator is added to the Default Pool
  • The Default Pool contains all of the resources
  • The Operator has been added to the Default Group
  • The Data Scientist can start without needing to do any resource configuration

If Pools and Groups are required in addition to the Defaults, the following steps can be followed:

  • Create Additional Pools (Section Create Pool)
    • Assign Devices to the Pools
  • Create Additional Groups (Section Create Group)
    • Assign a Pool to each new Group
  • Add (On-Board) Users (Section Add (On-Board) User)
    • Assign Users to one of the new Groups
    • New Users can still be assigned to the Default Group if desired

If the Operator is the only User, or if all of the Users, including other Data Scientists, are working on the same project (and are in the same Group), nothing else needs to be done from the Operator workflow to get started.

  • The Operator should select the Data Scientist dashboard as shown in the section Dashboards
  • The following section describes how to get started as a Data Scientist

Data Scientist Role

If you are a Data Scientist, you will only have access to the Data Scientist role, described in DKube Data Scientist Concepts.

  • Several example models with their associated datasets and test data have been provided on GitHub. The locations are described in section Example Model and Dataset Locations
  • The models and datasets can be downloaded through the Workspaces and Datasets screens, and data science can begin.
  • If you want to learn more about the concepts behind DKube, they are described in the following sections.

First Time Users

If you want to jump directly to a guided example, go to the Data Scientist Tutorial. This steps you through the Data Scientist workflow using a simple example.

If you want to start with your own program and dataset, follow the steps in the sections below.

DKube Operator Concepts

_images/Operator_Block_Diagram.png
User Operator or Data Scientist
Group

Aggregation of Users

  • Users in Group share models, datasets, notebooks, training jobs & inferences
Device

Devices connected to the Node

  • Only GPUs are supported at this time
  • Devices can be used on the same server, or on another server in the cluster
Node

Execution entity

  • This can be a physical host for an on-prem system, or a VM on the cloud
Pool

Maximum number of Devices assignable to the Pool

  • A Pool of Devices is assigned to a Group
  • Pools can be made up of Devices anywhere on the cluster
  • Pools limit how many Devices on the cluster are available to the Users in the Group
  • The Users in the Group share the maximum number of Devices in the Pool
  • The specific Devices are assigned to the User when a Job is using them

Default Pool and Group

DKube includes a Group and Pool with special properties, called the Default Group and Default Pool. They are both available when DKube is installed, and cannot be deleted. The Default Group and Pool allow Users to start their work as Data Scientists without needing to do a lot of setup.

  • The Default Pool contains all of the Devices that have not been allocated to another Pool by the Operator. As the Devices are discovered and automatically on-boarded, they are placed in the Default Pool.
    • As additional Pools are created, and Devices are allocated to the new Pools, the number of Devices in the Default Pool is reduced.
    • As Devices are removed from the other (non-Default) Pools, those Devices are allocated back into the Default Pool.
    • The total number of Devices in all of the Pools always equals the total number of Devices across the cluster, since the Default Pool always contains any Device not allocated to another Pool.
  • The Default Group automatically gets the allocation of the Default Pool, and it contains all of the on-boarded Users who have not been assigned to a different Group.
    • As new Users are on-boarded, they are assigned to the Default Group unless a different assignment is made during the on-boarding process
    • Users can be moved from the Default Group to another Group using the same steps as from any other Group
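The Device accounting described above can be sketched in a few lines (an illustration only, not DKube code; the `Cluster` class and pool names are hypothetical):

```python
# Illustrative sketch of Default Pool accounting: the Default Pool
# always holds every Device not allocated to another Pool.

class Cluster:
    def __init__(self, total_devices):
        self.total = total_devices
        self.pools = {}          # pool name -> Device count (non-Default Pools)

    @property
    def default_pool(self):
        # Devices not allocated to any other Pool remain in the Default Pool
        return self.total - sum(self.pools.values())

    def allocate(self, pool, n):
        if n > self.default_pool:
            raise ValueError("not enough unallocated Devices")
        self.pools[pool] = self.pools.get(pool, 0) + n

    def release(self, pool, n):
        self.pools[pool] -= n    # released Devices return to the Default Pool


cluster = Cluster(total_devices=8)
cluster.allocate("team-a", 3)      # Default Pool now holds 5 Devices
cluster.release("team-a", 1)       # one Device returns to the Default Pool
print(cluster.default_pool)        # -> 6
```

Whatever the allocations, the invariant stated above holds: the Default Pool count plus the counts of all other Pools always equals the cluster total.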

Operation of Pools

Pools are collections of Devices (GPUs at this time) assigned to Groups. The Devices in the Pool are shared by the Users in the Group.

  • All of the GPUs in the cluster are available to Pools (either the Default or another Pool created by the Operator), limited by the maximum number of Devices assignable to that Pool. Initially, as they are discovered and on-boarded, the GPUs are available to the Default Pool (which is assigned to the Default Group).
  • The Pool limits the number of GPUs of a specific type that can be accessed by the Users in the Group/Pool combination
  • The Users in a Group thus share a maximum number of GPUs, pulled from whatever GPUs are available on the cluster when the Job runs
  • As GPUs are used by jobs, they reduce the number of GPUs available to other Users in the Group. Once the Job is complete (or stopped), the GPUs are available for other Jobs.

Clustered Pools

Pools behave differently depending upon whether the GPUs are spread across the cluster, or on a single node. If all of the GPUs in a Pool are on a single node, no special treatment is required to operate as described above.

If the GPUs in a Pool are distributed across more than a single node, the Advanced option must be selected when submitting a Job. The Job must be submitted with the number of worker nodes, which gives DKube guidance about how the GPUs should be distributed.

This is described in more detail in the section on the Training Job Container.

DKube Data Scientist Concepts

_images/Data_Scientist_Block_Diagram.png

Basic DKube Workflow

Workspaces Directory containing program code for Notebooks and Jobs
Datasets Directory containing training data for Notebooks and Jobs
Notebooks Experiment with different models, datasets, and parameters
Training Jobs Training runs for a model, with program code, datasets, resources, and hyperparameters
Models Trained models, ready for deployment or transfer learning
Inferences Model deployed after training, for testing or production

Pipeline Workflow

Pipeline Kubeflow Pipelines - Portable, visual approach to automated deep learning
Experiments Aggregation of runs
Runs Single cycle through a pipeline

Note

The concepts of Pipelines are explained in section Kubeflow Pipelines

Shared Data

Users in a Group share models, datasets, notebooks, jobs, & inferences. These are shown on the screens for each type of data. At the top right-hand side of the screen, a dropdown menu allows the User to select what is visible: just the User’s data, just the shared data, or both.

_images/Data_Scientist_Ownership.jpg

Tags

Notebooks, Training Jobs, Workspaces, and Datasets can have Tags associated with them, provided by the user when the instance is downloaded or created. The figure below shows an example for a Training Job.

_images/Data_Scientist_Tags.jpg

Tags are alphanumeric fields that become part of the instance. They have no impact on the instance within DKube, but can be used to group entities together, or consumed by a post-processing filter created by the Data Scientist, to store information about the instance such as release version, framework, etc.

The Tag field can contain up to 256 characters.
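As an illustration of the post-processing idea, a hypothetical filter (not a DKube API) that groups instances by their Tag field might look like this:

```python
# Hypothetical post-processing filter: select instances whose Tag field
# contains a given tag (e.g. a release version or framework name).

def filter_by_tag(instances, tag):
    """Return the instances whose Tag field contains the given tag."""
    return [inst for inst in instances if tag in inst.get("tag", "")]


jobs = [
    {"name": "train-v1", "tag": "release-1.0 tensorflow"},
    {"name": "train-v2", "tag": "release-2.0 tensorflow"},
    {"name": "eval-v1",  "tag": "release-1.0 pytorch"},
]

print([j["name"] for j in filter_by_tag(jobs, "release-1.0")])
# -> ['train-v1', 'eval-v1']
```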

Job Scheduling

When a Training Job is submitted (see Jobs), DKube will determine whether there are enough available GPUs in the Pool associated with the shared Group. If there are enough GPUs, the Job will be scheduled immediately.

If there are not currently enough GPUs in the Pool, the job will be queued waiting for the GPUs to become available. As the currently running jobs are completed, their GPUs are released back into the Pool, and as soon as there are sufficient GPUs to run the queued job it will start.

It is possible to instruct the scheduler to initiate a Job immediately, without regard to how many GPUs are available. This directive is provided by the user in the GPUs section when submitting the job.
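The queueing behavior described in this section can be sketched as follows (a simplified illustration, not DKube's actual scheduler; the class and job names are hypothetical):

```python
# Simplified sketch of Pool-based Job queueing: a Job runs immediately
# if enough GPUs are free, otherwise it waits in a FIFO queue until
# running Jobs release their GPUs back into the Pool.
from collections import deque


class PoolScheduler:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.queue = deque()     # Jobs waiting for GPUs
        self.running = {}        # Job name -> GPUs held

    def submit(self, name, gpus_needed, immediate=False):
        # 'immediate' mirrors the directive to start a Job right away,
        # without regard to how many GPUs are currently available
        if immediate or gpus_needed <= self.free:
            self.free = max(0, self.free - gpus_needed)
            self.running[name] = gpus_needed
        else:
            self.queue.append((name, gpus_needed))

    def complete(self, name):
        # Release GPUs back into the Pool, then start queued Jobs that fit
        self.free += self.running.pop(name)
        while self.queue and self.queue[0][1] <= self.free:
            queued_name, gpus = self.queue.popleft()
            self.free -= gpus
            self.running[queued_name] = gpus


sched = PoolScheduler(total_gpus=4)
sched.submit("job-a", 3)         # runs immediately (3 <= 4 free)
sched.submit("job-b", 2)         # queued: only 1 GPU free
sched.complete("job-a")          # releases 3 GPUs; job-b starts
print(sorted(sched.running))     # -> ['job-b']
```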

Status Field of Notebooks, Training Jobs, and Inference

The status field provides an indication of how the Notebook, Training Job, or Inference is progressing. The meaning of each status is provided here.

Status            Description                              Applies To
Queued            Initial state                            All
Waiting for GPUs  Released from queue; waiting for GPUs    All
Starting          Resources available; job is starting     All
Running           Job is active                            All
Training          Training Job is running                  Training Job
Complete          Job is completed; resources released     All
Error             Job failure; clone and rerun             All
Stopping          Job is in the process of stopping        All
Stopped           Job stopped; resources released          All

Hyperparameter Optimization

DKube implements Katib-based hyperparameter optimization. This enables automated tuning of hyperparameters for a model, based upon target objectives.

This is described in more detail at Katib Introduction.

Katib Within DKube

The section Hyperparameter Optimization provides the details on how to use this feature from DKube.

Kubeflow Pipelines

Support for Kubeflow Pipelines has been integrated into DKube. Pipelines facilitate portable, automated, structured machine learning workflows based on Docker containers.

The Kubeflow Pipelines platform consists of:

  • A user interface (UI) for managing and tracking experiments, jobs, and runs
  • An engine for scheduling multi-step machine learning workflows
  • An SDK for defining and manipulating pipelines and components
  • Notebooks for interacting with the system using the SDK

The following are the goals of Kubeflow Pipelines:

  • End-to-end orchestration: enabling and simplifying the orchestration of machine learning pipelines
  • Easy experimentation: making it easy for you to try numerous ideas and techniques and manage your various trials/experiments
  • Easy re-use: enabling you to re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time

An overall description of Kubeflow Pipelines is provided below. The reference documentation is available at Pipelines Reference.

Pipeline Definition

A pipeline is a description of a machine learning workflow, including all of the components in the workflow and how they combine in the form of a graph. The pipeline includes the definition of the inputs (parameters) required to run the pipeline and the inputs and outputs of each component.

After developing your pipeline, you can upload and share it on the Kubeflow Pipelines UI.

The following provides a summary of the Pipelines terminology.

Pipeline Graphical description of the workflow
Component Self-contained set of code that performs one step in the workflow
Graph Pictorial representation of the run-time execution
Experiment Workspace used to try different configurations of your pipeline
Run Single execution of a pipeline
Recurring Run Repeatable run of a pipeline
Run Trigger Flag that tells the system when a recurring run spawns a new run
Step Execution of one step in the pipeline
Output Artifact Output emitted by a pipeline component

Pipeline Component

A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline. For example, a component can be responsible for data preprocessing, data transformation, model training, etc.

The component contains:

Client Code The code that talks to endpoints to submit jobs
Runtime Code The code that does the actual job and usually runs in the cluster

A component specification is in YAML format, and describes the component for the Kubeflow Pipelines system. A component definition has the following parts:

Metadata Name, description, etc.
Interface Input/output specifications (type, default values, etc)
Implementation A specification of how to run the component given a set of argument values for the component’s inputs. The implementation section also describes how to get the output values from the component once the component has finished running.

The Component specification is available at Kubeflow Component Spec.
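A minimal component specification following the Metadata / Interface / Implementation structure might look like the sketch below (the component name, image, and paths are hypothetical):

```yaml
name: Preprocess data              # Metadata
description: Cleans the raw dataset.
inputs:                            # Interface
  - {name: raw_data, type: String}
outputs:
  - {name: clean_data, type: String}
implementation:                    # Implementation
  container:
    image: example.com/preprocess:latest
    command: [python, preprocess.py]
    args:
      - --input
      - {inputValue: raw_data}
      - --output
      - {outputPath: clean_data}
```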

You must package your component as a Docker image. Components represent a specific program or entry point inside a container.

Each component in a pipeline executes independently. The components do not run in the same process and cannot directly share in-memory data. You must serialize (to strings or files) all the data pieces that you pass between the components so that the data can travel over the distributed network. You must then deserialize the data for use in the downstream component.
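The serialize-then-deserialize hand-off can be illustrated with a stdlib-only sketch (illustrative, not Kubeflow code); in a real pipeline, each function would run in its own container:

```python
# Components cannot share in-memory data, so the upstream step writes
# its result to a file (serialization) and the downstream step reads
# it back (deserialization).
import json
import os
import tempfile


def produce(output_path):
    """Upstream component: serialize its result to a file."""
    result = {"accuracy": 0.93, "classes": ["cat", "dog"]}
    with open(output_path, "w") as f:
        json.dump(result, f)


def consume(input_path):
    """Downstream component: deserialize the file before using it."""
    with open(input_path) as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "metrics.json")
produce(path)
print(consume(path)["classes"])       # -> ['cat', 'dog']
```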

Pipeline Example

The following screenshot shows an example of a pipeline graph, taken from one of the programs that is included as part of DKube.

_images/Pipeline_Catsdogs_Graph.jpg

The python source code that corresponds to the graph is shown here.

_images/pipeline_catsdogs_source.jpg

In order to create an Experiment, a Run must be initiated. The following is an example of the details needed for a Run.

_images/pipeline_catsdogs_run_details.jpg

After the Run is complete, the details of the Run and its outputs can be viewed. Information about the Run, including the full graph and the details of the Run, is available by selecting the Run name. Selecting a Pipeline stage provides more information on the Run details screen.

_images/pipeline_catsdogs_run_outputs.jpg

Kubeflow Within DKube

The section Kubeflow Pipelines provides the details on how this capability is implemented in DKube.

Migrating User Data Between Platforms

User data can be easily migrated between platforms through the dkubectl command. This includes:

  • Notebooks
  • Jobs
  • Workspaces
  • Datasets
  • Models
  • Inferences

After the migration, the target platform will look like the source platform, and the work can continue. This can be accomplished from within DKube, as described in section Migrating User Data Between Platforms.