Tools for Debugging Application Deployments on DC/OS
DC/OS comes with several tools relevant for application debugging:
DC/OS provides GUIs for various components. The following are particularly useful when debugging application deployment issues:
The DC/OS GUI is a great place to start debugging as it provides quick access to:
- Cluster Resource Allocation to provide an overview of available cluster resources
- Task Logs to provide insight into task failures
- Task Debug Information to provide information about the most recent task offers and/or why a task did not start
The DC/OS GUI shows the majority of the information you need for debugging. However, sometimes going a step further and accessing the Mesos GUI can be helpful, especially when checking failed tasks or registered frameworks. The Mesos GUI can be accessed via https://<cluster_name>/mesos.
As much of the cluster and framework state is stored in ZooKeeper, it can sometimes be helpful to check this state using the ZooKeeper/Exhibitor GUI. Frameworks such as Marathon, Kafka, and Cassandra store information in ZooKeeper, so this resource can be particularly useful when debugging such frameworks. For example, a failure while uninstalling one of these frameworks can leave entries behind. If you experience difficulties when reinstalling a framework you uninstalled earlier, checking this GUI could be very helpful. You can access it via
Logs are useful tools for seeing the events and conditions that occurred before a problem emerged. Logs often include error messages that supply helpful information about the cause of the error. As logging is an important topic in its own right, we also recommend reading the DC/OS logging documentation for more information.
DC/OS has a number of different sources for logs. In general, these are the most helpful logs for application debugging:
- Task/Application Logs
- Service Scheduler Logs (e.g., Marathon)
- Mesos Agent Logs
- Mesos Master Logs
- System Logs
In DC/OS, there are multiple options for accessing any of these logs: the DC/OS GUI, the DC/OS CLI, or HTTP endpoints. Moreover, DC/OS rotates logs by default to prevent them from using up all available disk space.
Tip Need a scalable way to manage and search your logs? It could be worth building an ELK stack for log aggregation and filtering.
As with other systems, sometimes it can help to increase the level of detail written to a log temporarily to obtain more detailed troubleshooting information for debugging. For most components, this can be done by accessing an endpoint. For example, if you want to increase the log level of a Mesos Agent for 5 minutes after the server receives the API call, you could follow something like this simple two-step process:
Connect to Master Node
$ dcos node ssh --master-proxy --leader
Raise Log Level on Mesos Agent 10.0.2.219
$ curl -X POST "10.0.2.219:5051/logging/toggle?level=3&duration=5mins"
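The URL in the second step contains an `&`, which the shell treats as a control operator unless the URL is quoted. The following is a minimal sketch of a helper that builds the toggle URL so it can be passed to `curl` as a single quoted argument. The function name `toggle_url` is hypothetical, and the `http://` scheme and port 5051 are assumptions based on the example above; a cluster with TLS enabled on the agents may need `https://`.

```shell
# Hypothetical helper (not part of DC/OS): build the Mesos /logging/toggle URL
# for an agent. Assumes plain HTTP on the default agent port 5051.
toggle_url() {
  agent_ip="$1"   # e.g. 10.0.2.219
  level="$2"      # verbosity level, e.g. 3
  duration="$3"   # how long the raised level stays active, e.g. 5mins
  printf 'http://%s:5051/logging/toggle?level=%s&duration=%s\n' \
    "$agent_ip" "$level" "$duration"
}

# Usage, from a node that can reach the agent:
#   curl -X POST "$(toggle_url 10.0.2.219 3 5mins)"
```

Keeping the command substitution inside double quotes ensures the `&` reaches `curl` as part of the URL instead of sending the command to the background.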
Task/application logs are often helpful in understanding the state of the problematic application. By default, application logs are written (together with execution logs) to the
STDERR and STDOUT files in the task work directory. When looking at the task in the DC/OS GUI, you can view the logs directly.
You can also do the same from the DC/OS CLI:
$ dcos task log --follow <service-name>
Marathon is DC/OS’s default scheduler when starting an application. Scheduler logs, and Marathon logs in particular, are a great source of information to help you understand why or how something was scheduled (or not) on which node. Recall that the scheduler matches tasks to available resources. Because the scheduler also receives task status updates, its log also contains detailed information about task failures.
You can retrieve and view the scheduler log for a specific service through the list of services in the DC/OS GUI, or via the following command:
$ dcos service log --follow <scheduler-service-name>
Note that Marathon is the “init” system of DC/OS and therefore runs as a systemd unit, as do the other DC/OS system components. For this reason, you need the CLI command to access its logs.
Mesos Agent Logs
Mesos agent logs are helpful for understanding how an application was started by the agent and how it may have failed. You can view them in the Mesos GUI by navigating to
https://<cluster_name>/mesos and examining the agent logs.
Alternatively, you can view the agent logs from the DC/OS CLI using
dcos node log --mesos-id=<node-id>. First, list the nodes to find the ID of the corresponding agent:
$ dcos node
where you will see something similar to the following output:
HOSTNAME        IP          ID                                         TYPE
10.0.1.51       10.0.1.51   ffc913d8-4012-4953-b693-1acc33b400ce-S3    agent
10.0.2.50       10.0.2.50   ffc913d8-4012-4953-b693-1acc33b400ce-S1    agent
10.0.2.68       10.0.2.68   ffc913d8-4012-4953-b693-1acc33b400ce-S2    agent
10.0.3.192      10.0.3.192  ffc913d8-4012-4953-b693-1acc33b400ce-S4    agent
10.0.3.81       10.0.3.81   ffc913d8-4012-4953-b693-1acc33b400ce-S0    agent
master.mesos.   10.0.4.215  ffc913d8-4012-4953-b693-1acc33b400ce       master (leader)
Then, in this case, you can enter:
$ dcos node log --mesos-id=ffc913d8-4012-4953-b693-1acc33b400ce-S0 --follow
and get the following log output:
2018-04-09 19:04:22: I0410 02:38:22.711650 3709 http.cpp:1185] HTTP GET for /slave(1)/state from 10.0.3.81:56595 with User-Agentemail@example.com (pid 3168)'
2018-04-09 19:04:24: I0410 02:38:24.752534 3708 logfmt.cpp:178] dstip=10.0.3.81 type=audit timestamp=2018-04-10 02:38:24.752481024+00:00 reason="Valid authorization token" uid="dcos_net_agent" object="/slave(1)/state" agent="firstname.lastname@example.org (pid 3168)" authorizer="mesos-agent" action="GET" result=allow srcip=10.0.3.81 dstport=5051 srcport=56595
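When you know an agent's IP address but not its Mesos ID, the lookup in the `dcos node` output can be scripted. The following is a minimal sketch, assuming the whitespace-separated column layout shown in the example output above (IP in the second column, ID in the third); other CLI versions may print different columns. The function name `mesos_id_for_ip` is hypothetical.

```shell
# Hypothetical helper: read `dcos node` output on stdin and print the Mesos ID
# (third column) of the node whose IP (second column) matches the argument.
mesos_id_for_ip() {
  awk -v ip="$1" '$2 == ip { print $3 }'
}

# Usage (assumes a configured DC/OS CLI):
#   dcos node | mesos_id_for_ip 10.0.3.81
#   dcos node log --mesos-id="$(dcos node | mesos_id_for_ip 10.0.3.81)" --follow
```

The function prints nothing if no node matches, so it is worth checking for an empty result before passing it on to `dcos node log`.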
Mesos Master Logs
The Mesos Master is responsible for matching available resources to the scheduler. It also forwards task status updates from the Mesos Agents to the corresponding scheduler. This makes the Mesos Master logs a great resource for understanding the overall state of the cluster.
Be aware that there are typically multiple Mesos masters in a single cluster, so you should identify the current leading Mesos master to get the most recent logs. In some cases it also makes sense to retrieve logs from another Mesos master, e.g., when a master node has failed and you want to understand why.
You can retrieve the master logs from the Mesos GUI, via
dcos node log --leader, or for a specific master node using
ssh master and
journalctl -u dcos-mesos-master.
We have now covered the most important log sources in the DC/OS environment, but many more logs are available. Every DC/OS component writes a log and, as briefly mentioned above, runs as a systemd unit. You can retrieve these logs directly on a particular node by SSHing into the node and then typing
journalctl -u <systemd-unit-name>. Two of the more common system units to consider during debugging (besides Mesos and Marathon) are the
docker.service and the
As an example, consider the systemd unit for the Docker daemon on the Mesos agent
ffc913d8-4012-4953-b693-1acc33b400ce-S0 (recall the
dcos node command retrieves the Mesos ID).
First, we can SSH into that agent using the corresponding SSH key:
$ dcos node ssh --master-proxy --mesos-id=ffc913d8-4012-4953-b693-1acc33b400ce-S0
Then we can use
journalctl to look at the Docker logs:
$ journalctl -u docker
which outputs something like this:
-- Logs begin at Mon 2018-04-09 23:50:05 UTC, end at Tue 2018-04-10 02:52:41 UTC. --
Apr 09 23:51:50 ip-10-0-3-81.us-west-2.compute.internal systemd: Starting Docker Application Container Engine...
Apr 09 23:51:51 ip-10-0-3-81.us-west-2.compute.internal dockerd: time="2018-04-09T23:51:51.293577691Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
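A unit's journal can be long, so it often helps to narrow it down before reading. The following is a minimal sketch that keeps only the dockerd entries logged at error or fatal level, relying on the `level=...` field visible in the sample output above. The function name `docker_errors` is hypothetical, and the one-hour window in the usage line is only an illustration.

```shell
# Hypothetical helper: read journal text on stdin and keep only dockerd
# entries whose level field is "error" or "fatal".
docker_errors() {
  grep -E 'level=(error|fatal)'
}

# Usage on the agent (assumes permission to read the journal):
#   journalctl -u docker --no-pager --since "1 hour ago" | docker_errors
```

Note that `grep` exits with a non-zero status when nothing matches, which in this case simply means no error-level entries were found in the selected window.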
Metrics are useful because they help identify potential issues before they become actual bugs. For example, imagine a situation wherein a container uses up all allocated memory. If you could detect this while the container is still running but not yet killed, you are much more likely to be able to intervene in time.
In DC/OS there are three main endpoints for metrics:
- DC/OS metrics
  - endpoint exposing combined metrics from tasks/containers, nodes, and applications
- Mesos metrics
  - endpoint exposing Mesos-specific metrics
- Marathon metrics
  - endpoint exposing Marathon-specific metrics
One of the more helpful ways to use metrics for debugging is to set up a dashboard that includes the most important metrics for the services you want to monitor. For example, you could use Prometheus and Grafana to build a metrics dashboard.
Ideally, with the dashboard configured and functioning, you can identify potential problems before they become actual bugs. Moreover, when issues do arise, this sort of dashboard can be extremely helpful in determining the cause of the bug (e.g., the cluster has no free resources). Each link from the endpoint items listed above provides recommendations for the metrics you should monitor for that endpoint.
Sometimes the task logs provide insufficient help. In these cases, using your favorite Linux tools (e.g.
ping, etc…) to get an interactive point of view could be a worthwhile next step.
For example, if you are using the [Universal Container Runtime (UCR)](https://docs.mesosphere.com/latest/deploying-services/containerizers/ucr/), you can use
dcos task exec as follows:
$ dcos task exec -it <mycontainerid> bash
and be presented with an interactive bash shell inside that container.
Tip If you alter the state of the container when using
dcos task exec in the manner above, you must update the stored
app-definition and restart the container from that updated
app-definition. If you fail to do so, then your changes will be lost the next time the container restarts.
Alternatively, when using a docker containerizer, you can SSH into the node in question and run
docker exec to investigate the running container.
DC/OS has a large number of additional endpoints that could be useful for debugging:
The state-summary endpoint returns a JSON-encoded summary of the agents, tasks, and frameworks inside the cluster. This is especially helpful when considering the allocation of resources across the cluster, as it shows you whether there are resources already reserved for a particular role (there are more details on this in one of the debugging scenarios provided below).
TIP See the complete list of Mesos endpoints.
The queue endpoint returns a list of all tasks in the queue to be scheduled by Marathon. This endpoint is valuable when troubleshooting scaling or deployment problems.
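Both endpoints can be queried from a workstation with `curl`, reusing the authentication token of the DC/OS CLI. The following is a minimal sketch: the function names `endpoint_url` and `dcos_get` are hypothetical, and the paths `/mesos/state-summary` and `/service/marathon/v2/queue` are assumptions about how the cluster routes requests to Mesos and Marathon, so verify them against your DC/OS version.

```shell
# Hypothetical helper: join a cluster base URL and an endpoint path,
# tolerating a trailing slash on the base and a leading slash on the path.
endpoint_url() {
  base="${1%/}"
  printf '%s/%s\n' "$base" "${2#/}"
}

# Hypothetical helper: GET an endpoint using the URL and token stored by
# the DC/OS CLI (assumes the CLI is attached to a cluster and logged in).
dcos_get() {
  curl -sS \
    -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
    "$(endpoint_url "$(dcos config show core.dcos_url)" "$1")"
}

# Usage (paths are assumptions, see above):
#   dcos_get /mesos/state-summary
#   dcos_get /service/marathon/v2/queue
```

Defining the calls as functions means nothing is sent to the cluster until you invoke `dcos_get` explicitly.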
The DC/OS community is a great place to ask additional questions either via Slack or the mailing list. Also keep in mind that both Mesos and Marathon have their own communities in addition to the DC/OS community.
There are other debugging tools as well, both internal to DC/OS and external, such as Sysdig or Instana. These tools can be especially helpful in diagnosing issues that are not specific to DC/OS (e.g., Linux kernel or networking problems).