Troubleshooting

Error Retrieving or Displaying Information

Symptom

The UI fails to retrieve information from DKube, or does not display or update the information.

Cause/Action

This problem can occur when there is a temporary miscommunication between the UI and the DKube application, or when the network is not accessible. Refreshing the browser window will typically fix this issue.

Problems Accessing DKube URL

Symptom

The DKube URL is not accessible or is not behaving properly.

Cause/Action

Only the Chrome and Firefox browsers are supported. Ensure that you are using one of these, and that it is up to date.


Notebook or Job Loading Slowly

Symptom

After submitting a Notebook or Job, it stays in the Queued, Waiting for GPUs, or Starting status for a long time.

Cause/Action

The first time that a Job or Notebook is submitted with a given version of TensorFlow after installation, DKube pulls the image for that version, which takes additional time. The time to pull depends on the network speed, but on average it takes about 2 minutes.

Subsequent Jobs or Notebooks that use the same version of TensorFlow will start faster, since the image does not need to be pulled again.

If a Notebook or Job stays in one of those states for a long time, and it is not the first use of a TensorFlow version, a node may have been removed from the cluster, leaving more GPUs allocated than actually exist. Contact the cluster administrator if this occurs.


Error Status for Notebook or Training Job

Symptom

After submission, a Notebook or Training Job goes into the Error state.

Cause/Action

The job has identified an internal problem that prevents proper operation. The first step is to clone and re-submit the job.

If the error persists, a node may have been removed from the cluster, leaving more GPUs allocated than actually exist. Contact the cluster administrator if this occurs.


GPUs are Not Available on Master Node

Symptom

GPUs are installed on the Master Node, but are not showing up in DKube.

Cause/Action

The ability to schedule GPUs on the Master Node is an installation option. Check with the cluster administrator to determine if DKube was installed with this capability.


Special Notebook “DKube” Is in the Aborted or Stopped State

Symptom

The special “DKube” Notebook, which is used for testing inference, migration, etc., is not running when it is first accessed.

Cause/Action

The “DKube” Notebook is created in the stopped state. You must start it in order to use it. Once it has been started, it will remain in the running state until it is stopped.


Network Fetch Error Message

Symptom

A network error shows up on the screen.

Cause/Action

A temporary error was identified. It will not cause any problems with the operation of DKube. Refresh the browser window.


Jupyter Notebook Takes a Long Time to Load

Symptom

The Jupyter Notebook does not load immediately after its icon is selected on the Notebook screen.

Cause/Action

This is expected behavior. It can take as long as a minute for Jupyter to load.


404 Error Message from Notebook or Training Job Screen

Symptom

A 404 error message appears when:

  • Selecting the TensorBoard icon on an instance from the Training Job screen
  • Selecting the Notebook icon on an instance from the Notebook screen

Cause/Action

This can happen if either icon is selected too soon after the Notebook or Job has started. It can take several minutes before the icons become active. Wait several minutes and try again.
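The wait-and-retry advice can also be scripted. The following Python sketch is illustrative only: it assumes that the page is reachable from where the script runs without additional authentication, and the function names, the number of attempts, and the delay are hypothetical choices rather than anything defined by DKube.

```python
import time
import urllib.error
import urllib.request


def default_fetch(url):
    """Return the HTTP status code for url; HTTP errors such as 404 are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def wait_until_ready(url, fetch=default_fetch, attempts=10, delay_s=30.0, sleep=time.sleep):
    """Poll url until it stops returning 404. Return True if it became ready in time."""
    for attempt in range(attempts):
        if fetch(url) != 404:
            return True
        # Pause before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

With the defaults above, the sketch checks every 30 seconds for up to about 5 minutes. The URL to pass in is the one that the browser shows when the 404 message appears.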


Insufficient GPU Message Even Though the Pool is Large Enough

Symptom

An error message appears when a Job is submitted, stating that there are not enough GPUs available. The number of GPUs requested is lower than the number of GPUs in the Pool, so the Job would be expected to be queued rather than rejected.

Cause/Action

If this happens in a clustered system, it means that no single node has enough GPUs to satisfy the submission. Use the Advanced option in the Container section, and specify the number of GPUs and Workers. This behavior is described in section Clustered Pools.
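The single-node limit can be illustrated with a short Python sketch. It is a simplified model and not DKube's scheduler: it assumes that each worker must fit on one node and that every worker receives the same number of GPUs. The function names and the example node sizes are hypothetical.

```python
def single_node_fit(requested, node_gpus):
    """True if at least one node has enough GPUs for the whole request."""
    return requested <= max(node_gpus, default=0)


def worker_splits(requested, node_gpus):
    """List (workers, gpus_per_worker) pairs that cover the request.

    Assumption: each worker runs on a single node, and all workers
    get the same number of GPUs.
    """
    splits = []
    for workers in range(1, requested + 1):
        if requested % workers:
            continue  # only consider even splits of the request
        per_worker = requested // workers
        # Count how many workers of this size the nodes can host in total.
        capacity = sum(gpus // per_worker for gpus in node_gpus)
        if capacity >= workers:
            splits.append((workers, per_worker))
    return splits


# Example: a Pool of 4 GPUs spread over two nodes with 2 GPUs each.
nodes = [2, 2]
print(single_node_fit(4, nodes))  # False: no single node has 4 GPUs
print(worker_splits(4, nodes))    # [(2, 2), (4, 1)]
```

In this example, a request for 4 GPUs cannot be placed on one node even though the Pool holds 4 GPUs, but it can be satisfied by 2 Workers with 2 GPUs each.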


GitHub Users Do Not Appear from On-Board Popup

Symptom

In the User On-Board Popup on the Operator screen, the user list for the organization does not appear.

Cause/Action

This can occur if the GitHub authorization credentials have recently been activated or changed. Refresh the User list as explained in section Add (On-Board) User.


Configuration ini File Is Not Found When Running dkubectl Command Within Notebook

Symptom

When running a dkubectl command within a Notebook, the message “Error : Can’t read config file” is displayed.

Cause/Action

The dkubectl commands in the guide assume that the command is run from the home folder. If it is run from another folder, the config file will not be found. Change to the home folder and run the command again.
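When a command is launched from a Notebook cell or script, the working folder can be set explicitly instead of relying on the current one. The Python sketch below assumes that the home folder is the one returned by Path.home(); the helper name run_from_home is hypothetical, and the dkubectl arguments are not shown because they depend on the command taken from the guide.

```python
import subprocess
from pathlib import Path


def run_from_home(command):
    """Run command (a list of arguments) with the home folder as the working folder."""
    return subprocess.run(
        command,
        cwd=Path.home(),      # run as if started from the home folder
        capture_output=True,  # collect stdout and stderr for inspection
        text=True,
        check=False,          # let the caller inspect returncode
    )


# Usage (placeholder arguments, to be replaced with those from the guide):
# result = run_from_home(["dkubectl", "<arguments from the guide>"])
# print(result.stdout or result.stderr)
```

From a terminal, the equivalent is to change to the home folder before running the command.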