Tools

What tools are there for debugging applications on DC/OS?

Tools for Debugging Application Deployments on DC/OS

DC/OS comes with several tools relevant for application debugging:

DC/OS GUIs

DC/OS provides many GUIs for various components; the following are particularly useful when debugging application deployment issues:

DC/OS GUI

The DC/OS GUI is a great place to start debugging as it provides quick access to:

  • Cluster Resource Allocation to provide an overview of available cluster resources
  • Task Logs to provide insight into tasks failures
  • Task Debug Information to provide information about the most recent task offers and/or why a task did not start

Pic of DC/OS GUI

Mesos GUI

The DC/OS GUI shows the majority of the information you need for debugging. However, sometimes going a step further and accessing the Mesos GUI can be helpful – especially when checking failed tasks or registered frameworks. The Mesos GUI can be accessed via https://<cluster-address>/mesos.

Pic of Mesos GUI

ZooKeeper GUI

As much of the cluster and framework state is stored in ZooKeeper, it can sometimes be helpful to check these states using the ZooKeeper/Exhibitor GUI. Frameworks such as Marathon, Kafka, and Cassandra store information in ZooKeeper, so this resource can be particularly useful when debugging those frameworks. For example, a failure while uninstalling one of these frameworks can leave entries behind, so if you experience difficulties when reinstalling a framework you uninstalled earlier, checking this GUI can be very helpful. You can access it via https://<cluster-address>/exhibitor.
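
If you prefer the command line, you can also query Exhibitor's REST API through the same proxy path. The following is only a sketch: the explorer path and the /dcos-service-kafka znode are illustrative, and the auth header assumes you are logged in via the DC/OS CLI.

$ curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
    'https://<cluster-address>/exhibitor/exhibitor/v1/explorer/node?key=/dcos-service-kafka'

A non-empty child list after an uninstall typically indicates leftover state that should be cleaned up before reinstalling the framework.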

Pic of ZooKeeper/Exhibitor GUI

Logs

Logs are useful for seeing events and conditions that occurred before a problem emerged. They often include error messages that supply helpful information about the cause of the error. As logging is an important topic in its own right, we also recommend having a look at the DC/OS logging documentation for more information.

DC/OS has a number of different sources for logs. In general, the most helpful logs for application debugging are the task/application logs, the scheduler (Marathon) logs, the Mesos agent and master logs, and the system logs, each covered below.

In DC/OS, there are multiple options for accessing any of these logs: the DC/OS GUI, the DC/OS CLI, or HTTP endpoints. Moreover, DC/OS rotates logs by default to prevent them from using up all available disk space.

Tip Need a scalable way to manage and search your logs? It could be worth building an ELK stack for log aggregation and filtering.

As with other systems, sometimes it can help to temporarily increase the level of detail written to a log to obtain more detailed troubleshooting information. For most components, this can be done by accessing an endpoint. For example, if you want to increase the log level of a Mesos agent for 5 minutes starting from when the agent receives the API call, you could follow this simple two-step process:

Connect to the master node:
$ dcos node ssh --master-proxy --leader

Raise the log level on Mesos agent 10.0.2.219 (the quotes keep the shell from interpreting the &):
$ curl -X POST '10.0.2.219:5051/logging/toggle?level=3&duration=5mins'
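
The Mesos master exposes the same /logging/toggle endpoint on port 5050, so a similar call (shown here with an illustrative 10-minute duration) can be made while you are on the leading master:

$ curl -X POST 'localhost:5050/logging/toggle?level=3&duration=10mins'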

Task/Application Logs

Task/application logs are often helpful in understanding the state of the problematic application. By default, application logs are written (together with execution logs) to the STDERR and STDOUT files in the task's working directory. When looking at the task in the DC/OS GUI, you can view the logs as shown below.

Pic of task log

You can also do the same from the DC/OS CLI:

$ dcos task log --follow <service-name>
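
The CLI also accepts an optional file argument if you only want one of the sandbox files; for example, to follow just the error stream:

$ dcos task log --follow <service-name> stderr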

Scheduler/Marathon Logs

Marathon is the default scheduler on DC/OS for starting applications. Scheduler logs, and Marathon logs in particular, are a great source of information to help you understand why or how something was (or was not) scheduled on which node. Recall that the scheduler matches tasks to available resources. Because the scheduler also receives task status updates, its log also contains detailed information about task failures.

You can retrieve and view a scheduler log about a specific service through the list of services found in the DC/OS GUI, or via the following command:

$ dcos service log --follow <scheduler-service-name>

Note that because Marathon is the "init" system of DC/OS, it runs as a systemd unit (as do the other DC/OS system components). For this reason, you need the CLI command above to access its logs.
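
If you do want to read the Marathon log directly on a master instead, a minimal sketch (assuming the unit is named dcos-marathon on your DC/OS version) looks like this:

$ dcos node ssh --master-proxy --leader
$ journalctl -u dcos-marathon -f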

Mesos Agent Logs

Mesos agent logs are helpful for understanding how an application was started by the agent and how it may have failed. You can launch the Mesos GUI by navigating to https://<cluster-address>/mesos and then examine the agent logs as shown below.

Pic of Mesos agent UI

Alternatively, you can view the agent logs from the DC/OS CLI with dcos node log --mesos-id=<node-id>. To locate the corresponding node ID, enter:

$ dcos node

where you will see something similar to the following output:

HOSTNAME       IP          ID                                        TYPE
10.0.1.51      10.0.1.51   ffc913d8-4012-4953-b693-1acc33b400ce-S3   agent
10.0.2.50      10.0.2.50   ffc913d8-4012-4953-b693-1acc33b400ce-S1   agent
10.0.2.68      10.0.2.68   ffc913d8-4012-4953-b693-1acc33b400ce-S2   agent
10.0.3.192     10.0.3.192  ffc913d8-4012-4953-b693-1acc33b400ce-S4   agent
10.0.3.81      10.0.3.81   ffc913d8-4012-4953-b693-1acc33b400ce-S0   agent
master.mesos.  10.0.4.215  ffc913d8-4012-4953-b693-1acc33b400ce      master (leader)

Then, in this case, you can enter:

$ dcos node log --mesos-id=ffc913d8-4012-4953-b693-1acc33b400ce-S0 --follow

and get the following log output:

2018-04-09 19:04:22: I0410 02:38:22.711650  3709 http.cpp:1185] HTTP GET for /slave(1)/state from 10.0.3.81:56595 with User-Agent='navstar@10.0.3.81 (pid 3168)'

2018-04-09 19:04:24: I0410 02:38:24.752534  3708 logfmt.cpp:178] dstip=10.0.3.81 type=audit timestamp=2018-04-10 02:38:24.752481024+00:00 reason="Valid authorization token" uid="dcos_net_agent" object="/slave(1)/state" agent="navstar@10.0.3.81 (pid 3168)" authorizer="mesos-agent" action="GET" result=allow srcip=10.0.3.81 dstport=5051 srcport=56595

Mesos Master Logs

The Mesos Master is responsible for matching available resources to the scheduler. It also forwards task status updates from the Mesos Agents to the corresponding scheduler. This makes the Mesos Master logs a great resource for understanding the overall state of the cluster.

Be aware that there are typically multiple Mesos masters for a single cluster, so you should identify the current leading master to get the most recent logs. In some cases it can even make sense to retrieve logs from another master as well, for example when a master node has failed and you want to understand why.

You can retrieve the master logs from the Mesos GUI via https://<cluster-address>/mesos, from the CLI via dcos node log --leader, or, for a specific master node, by SSHing into that master and running journalctl -u dcos-mesos-master.
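
For reference, the two CLI variants might look like this (the dcos-mesos-master unit name should be verified on your cluster):

$ dcos node log --leader --follow

$ dcos node ssh --master-proxy --leader
$ journalctl -u dcos-mesos-master -f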

System Logs

We have now covered the most important log sources in the DC/OS environment, but there are many more logs available. Every DC/OS component writes a log. As briefly mentioned above, each DC/OS component runs as a systemd unit. You can retrieve the logs directly on a particular node by SSHing into it and then typing journalctl -u <systemd-unit-name>. Two of the more common system units to consider during debugging (besides Mesos and Marathon) are docker.service and dcos-exhibitor.service.

As an example, consider the system unit for the docker daemon on the Mesos agent ffc913d8-4012-4953-b693-1acc33b400ce-S0 (recall the dcos node command retrieves the Mesos ID).

First, we can SSH into that agent using the corresponding SSH key:

$ dcos node ssh --master-proxy --mesos-id=ffc913d8-4012-4953-b693-1acc33b400ce-S0

Then we can use journalctl to look at the Docker logs:

$ journalctl -u docker

which outputs something like this:

-- Logs begin at Mon 2018-04-09 23:50:05 UTC, end at Tue 2018-04-10 02:52:41 UTC. --

Apr 09 23:51:50 ip-10-0-3-81.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...

Apr 09 23:51:51 ip-10-0-3-81.us-west-2.compute.internal dockerd[1262]: time="2018-04-09T23:51:51.293577691Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"

Metrics

Metrics are useful because they help identify potential issues before they become actual bugs. For example, imagine a situation wherein a container uses up all allocated memory. If you could detect this while the container is still running but not yet killed, you are much more likely to be able to intervene in time.

In DC/OS there are three main endpoints for metrics:

  • DC/OS metrics
  • endpoint exposing combined metrics from tasks/containers, nodes, and applications
  • Mesos metrics
    • endpoint exposing Mesos-specific metrics
  • Marathon metrics
  • endpoint exposing Marathon-specific metrics (see the example queries after this list)
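
As a rough illustration, you could query two of these endpoints from a terminal as follows. The /mesos/metrics/snapshot and /marathon/metrics paths are standard, but the auth header and the use of jq are assumptions about your setup:

$ export TOKEN=$(dcos config show core.dcos_acs_token)
$ curl -s -H "Authorization: token=$TOKEN" https://<cluster-address>/mesos/metrics/snapshot | jq .
$ curl -s -H "Authorization: token=$TOKEN" https://<cluster-address>/marathon/metrics | jq .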

One of the more helpful ways to leverage metrics for debugging is to set up a dashboard that includes the most important metrics for the services you want to monitor. For example, you could use Prometheus and Grafana to build such a metrics dashboard.

Ideally, with the dashboard configured and functioning, you can identify potential problems before they become actual bugs. Moreover, when issues do arise, this sort of dashboard can be extremely helpful in determining the cause of the bug (e.g., maybe the cluster has no free resources). The documentation linked from each endpoint listed above provides recommendations for the metrics you should monitor for that endpoint.

Interactive

Sometimes the task logs provide insufficient help. In these cases, using your favorite Linux tools (e.g. curl, cat, ping, etc…) to get an interactive point of view could be a worthwhile next step.

For example, if you are using the [Universal Container Runtime (UCR)](https://docs.mesosphere.com/latest/deploying-services/containerizers/ucr/), you can use dcos task exec as follows:

$ dcos task exec -it <task-id> bash

and be presented with an interactive bash shell inside that container.

Tip If you alter the state of the container when using dcos task exec in the manner above, you must update the stored app-definition and restart the container from that updated app-definition. If you fail to do so, then your changes will be lost the next time the container restarts.

Alternatively, when using the Docker containerizer, you can SSH into the node in question and run docker exec to investigate the running container.
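
A minimal sketch of that flow, with illustrative placeholders for the node and container IDs:

$ dcos node ssh --master-proxy --mesos-id=<node-id>
$ docker ps                          # find the container backing your task
$ docker exec -it <container-id> /bin/sh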

HTTP Endpoints

DC/OS has a large number of additional endpoints that could be useful for debugging:

state-summary

  • <cluster>/mesos/master/state-summary

The state-summary endpoint returns a JSON-encoded summary of the agents, tasks, and frameworks inside the cluster. This is especially helpful when looking at the allocation of resources across the cluster, as it shows whether resources are already reserved for a particular role (there are more details on this in one of the debugging scenarios provided below).
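
For example, a hedged one-liner to check reserved resources per agent (the jq filter and the auth header are assumptions about your setup; reserved_resources is part of the Mesos state-summary schema in recent versions):

$ curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
    https://<cluster-address>/mesos/master/state-summary \
    | jq '.slaves[] | {hostname, reserved_resources}'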

Tip See the complete list of Mesos endpoints.

queue

  • <cluster>/marathon/v2/queue

Marathon’s queue endpoint returns a list of all tasks in the queue to be scheduled by Marathon. This endpoint is valuable when troubleshooting scaling or deployment problems.
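
For example, to see which apps are currently waiting and how many instances each still needs (queue entries carry an app object and a count; exact fields vary by Marathon version, and the auth header is an assumption about your setup):

$ curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
    https://<cluster-address>/marathon/v2/queue \
    | jq '.queue[] | {app: .app.id, count: .count}'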

Community

The DC/OS community is a great place to ask additional questions either via Slack or the mailing list. Also keep in mind that both Mesos and Marathon have their own communities in addition to the DC/OS community.

Other Tools

There are other debugging tools as well – internal to DC/OS as well as external tools such as Sysdig or Instana. These tools can be especially helpful in determining non-DC/OS-specific issues (e.g., Linux kernel or networking problems).