DC/OS comes with several tools relevant for application debugging:
DC/OS web interfaces
DC/OS provides web interfaces for various components. These are particularly useful when debugging application deployment issues:
DC/OS web interface
The DC/OS web interface is a great place to start debugging as it provides quick access to:
- Cluster Resource Allocation to provide an overview of available cluster resources
- Task Logs to provide insight into task failures
- Task Debug Information to provide information about the most recent task offers and/or why a task did not start
Figure 1. Task debug interface
Mesos web interface
The DC/OS web interface shows the majority of the information you need for debugging. However, sometimes going a step further and accessing the Mesos web interface can be helpful, especially when checking failed tasks or registered frameworks. The Mesos web interface can be accessed via
Figure 2. Mesos web interface
ZooKeeper web interface
Because much of the cluster and framework state is stored in ZooKeeper, it can be helpful to check that state using the ZooKeeper/Exhibitor web interface. Frameworks such as Marathon, Kafka, and Cassandra store information in ZooKeeper, so this resource is particularly useful when debugging such frameworks. For example, a failure while uninstalling one of these frameworks can leave entries behind. If you experience difficulties when reinstalling a framework you uninstalled earlier, check this web interface for leftover entries. You can access it via
Figure 3. ZooKeeper/Exhibitor web interface
Logs are useful for seeing events and the conditions that preceded them. Logs often include error messages that can help identify the cause of an error. As logging is an important topic in its own right, we recommend the DC/OS logging documentation for more information.
DC/OS has a number of different sources for logs. In general, these are the most helpful logs for application debugging:
- Task/Application Logs
- Service Scheduler Logs (e.g., Marathon)
- Mesos Agent Logs
- Mesos Master Logs
- System Logs
In DC/OS, there are multiple options for accessing any of these logs: the DC/OS web interface, the DC/OS CLI, or HTTP endpoints. Moreover, DC/OS rotates logs by default to prevent them from using all available disk space.
Note: Need a scalable way to manage and search your logs? It could be worth building an ELK stack for log aggregation and filtering.
Sometimes it can help to increase the level of detail written to a log temporarily to obtain more detailed troubleshooting information for debugging. For most components, this can be done by accessing an endpoint. For example, if you want to increase the log level of a Mesos Agent for 5 minutes after the server receives the API call, you could follow something like this simple two-step process:
Connect to Master Node
$ dcos node ssh --master-proxy --leader
Raise Log Level on Mesos Agent 10.0.2.219
$ curl -X POST "10.0.2.219:5051/logging/toggle?level=3&duration=5mins"
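The two steps above can be wrapped in a small helper. This is a minimal sketch, not an official tool: the function names `mesos_toggle_url` and `raise_agent_log_level` are hypothetical, while the `/logging/toggle` endpoint, port 5051, and the `level` and `duration` parameters are taken from the example above. Quoting the URL matters because an unquoted `&` would be interpreted by the shell as "run in the background", cutting off the `duration` parameter.

```shell
# Build the toggle URL for an agent IP, verbosity level, and duration.
mesos_toggle_url() {
  local agent_ip="$1" level="$2" duration="$3"
  printf '%s:5051/logging/toggle?level=%s&duration=%s\n' "$agent_ip" "$level" "$duration"
}

# Send the request; level and duration default to the values used above.
# The URL is passed as one quoted argument so "&" reaches curl intact.
raise_agent_log_level() {
  curl -X POST "$(mesos_toggle_url "$1" "${2:-3}" "${3:-5mins}")"
}
```

After connecting to the master node, you could then run `raise_agent_log_level 10.0.2.219` to get the same effect as the command above.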
Task/application logs are often helpful in understanding the state of the problematic application. By default, application logs are written (together with execution logs) to the STDOUT files in the task work directory. When looking at the task in the DC/OS web interface, you can view the logs as shown below.
Figure 4. Task log
You can also do the same from the DC/OS CLI:
$ dcos task log --follow <service-name>
Marathon is DC/OS’s default scheduler when starting an application. Scheduler logs, and Marathon logs in particular, are a great source of information to help you understand why or how something was scheduled (or not) on which node. Recall that the scheduler matches tasks to available resources. Because the scheduler also receives task status updates, the log also contains detailed information about task failures.
You can retrieve and view a scheduler log about a specific service through the list of services found in the DC/OS web interface, or via the following command:
$ dcos service log --follow <scheduler-service-name>
Note that because Marathon is the “init” system of DC/OS, it runs as a systemd unit (as do the other DC/OS system components). For this reason, you need the CLI command to access its logs.
Mesos Agent Logs
Mesos agent logs are helpful for understanding how an application was started by the agent and how it may have failed. You can launch the Mesos web interface by navigating to
https://<cluster_name>/mesos, and then examine the agent logs as shown below.
Figure 5. Mesos agent interface
Alternatively, you can view the agent logs from the DC/OS CLI with
dcos node log --mesos-id=<node-id>. To find the ID of the corresponding node, first run:
$ dcos node
where you will see something similar to the following output:
HOSTNAME       IP          ID                                        TYPE
10.0.1.51      10.0.1.51   ffc913d8-4012-4953-b693-1acc33b400ce-S3   agent
10.0.2.50      10.0.2.50   ffc913d8-4012-4953-b693-1acc33b400ce-S1   agent
10.0.2.68      10.0.2.68   ffc913d8-4012-4953-b693-1acc33b400ce-S2   agent
10.0.3.192     10.0.3.192  ffc913d8-4012-4953-b693-1acc33b400ce-S4   agent
10.0.3.81      10.0.3.81   ffc913d8-4012-4953-b693-1acc33b400ce-S0   agent
master.mesos.  10.0.4.215  ffc913d8-4012-4953-b693-1acc33b400ce      master (leader)
Then, in this case, you can enter:
$ dcos node log --mesos-id=ffc913d8-4012-4953-b693-1acc33b400ce-S0 --follow
and get the following log output:
2018-04-09 19:04:22: I0410 02:38:22.711650 3709 http.cpp:1185] HTTP GET for /slave(1)/state from 10.0.3.81:56595 with User-Agentfirstname.lastname@example.org (pid 3168)'
2018-04-09 19:04:24: I0410 02:38:24.752534 3708 logfmt.cpp:178] dstip=10.0.3.81 type=audit timestamp=2018-04-10 02:38:24.752481024+00:00 reason="Valid authorization token" uid="dcos_net_agent" object="/slave(1)/state" agent="email@example.com (pid 3168)" authorizer="mesos-agent" action="GET" result=allow srcip=10.0.3.81 dstport=5051 srcport=56595
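Looking up the Mesos ID by hand can be scripted. The sketch below assumes the `dcos node` output uses the four-column layout shown above (HOSTNAME, IP, ID, TYPE); other CLI versions may print different columns. The function name `agent_id_for_ip` is hypothetical.

```shell
# Read `dcos node` table output on stdin and print the Mesos ID of the
# agent whose IP column matches the first argument.
# Assumes the column order HOSTNAME IP ID TYPE.
agent_id_for_ip() {
  awk -v ip="$1" '$2 == ip && $4 == "agent" { print $3 }'
}
```

For example, `dcos node | agent_id_for_ip 10.0.3.81` would print the ID to pass to `dcos node log --mesos-id=<node-id> --follow`.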
Mesos Master Logs
The Mesos Master is responsible for matching available resources to the scheduler. It also forwards task status updates from the Mesos Agents to the corresponding scheduler. This makes the Mesos Master logs a great resource for understanding the overall state of the cluster.
Be aware that there are typically multiple Mesos Masters for a single cluster, so you should identify the current leading Mesos Master to get the most recent logs. In some cases it also makes sense to retrieve logs from another Mesos Master, for example when a master node has failed and you want to understand why.
You can retrieve the master logs from the Mesos web interface, via
dcos node log --leader, or, for a specific master node, by using
ssh master and
journalctl -u dcos-mesos-master.
We have now covered the most important log sources in the DC/OS environment, but there are many more logs available. Every DC/OS component writes a log. As mentioned above, each DC/OS component runs as a systemd unit. You can retrieve the logs directly on the particular node by SSHing into the node, and then typing
journalctl -u <systemd-unit-name>. Two of the more common system units to consider during debugging (besides Mesos and Marathon) are the
docker.service and the
As an example, consider the system unit for the Docker daemon on the Mesos agent
ffc913d8-4012-4953-b693-1acc33b400ce-S0 (recall the
dcos node command retrieves the Mesos ID).
First, we can SSH into that agent using the corresponding SSH key:
$ dcos node ssh --master-proxy --mesos-id=ffc913d8-4012-4953-b693-1acc33b400ce-S0
Then we can use
journalctl to look at the Docker logs:
$ journalctl -u docker
which outputs something like this:
-- Logs begin at Mon 2018-04-09 23:50:05 UTC, end at Tue 2018-04-10 02:52:41 UTC. --
Apr 09 23:51:50 ip-10-0-3-81.us-west-2.compute.internal systemd: Starting Docker Application Container Engine...
Apr 09 23:51:51 ip-10-0-3-81.us-west-2.compute.internal dockerd: time="2018-04-09T23:51:51.293577691Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
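Journal output like this can get long, so it often helps to narrow it down. The sketch below filters for Docker daemon messages logged at warning level or higher, assuming the `level=...` field format shown in the output above. The function name `docker_problems` is hypothetical.

```shell
# Keep only dockerd log lines whose level is warning, error, or fatal.
# Reads journal output on stdin.
docker_problems() {
  grep -E 'level=(warning|error|fatal)'
}
```

On the agent, you could combine it with journalctl's time filter, for example `journalctl -u docker --since "1 hour ago" --no-pager | docker_problems`.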
Metrics are useful because they help identify potential issues before they become actual bugs. For example, imagine a situation in which a container uses up all of its allocated memory. If you can detect this while the container is still running and has not yet been killed, you are much more likely to be able to intervene in time.
In DC/OS there are three main endpoints for metrics:
- DC/OS metrics
  - endpoint exposing combined metrics from tasks/containers, nodes, and applications
- Mesos metrics
  - endpoint exposing Mesos-specific metrics
- Marathon metrics
  - endpoint exposing Marathon-specific metrics
One way to leverage metrics to help with debugging is to set up a dashboard. This dashboard would include the most important metrics related to the services you want to monitor. For example, you could use Prometheus and Grafana to build a metrics dashboard.
Ideally, with the dashboard configured and functioning, you can identify potential problems before they become actual bugs. Moreover, when issues do arise, this sort of dashboard can be extremely helpful in determining the cause of the bug (e.g., a cluster has no free resources). Each link from the endpoint item listed above provides recommendations for the metrics you should monitor for that endpoint.
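The memory example above comes down to a simple threshold check. Here is a minimal sketch, assuming you have already obtained a container's used and limit memory values (in bytes) from one of the metrics endpoints. The function name `mem_near_limit` and the 90% default threshold are hypothetical, not part of DC/OS.

```shell
# Succeed (exit 0) when used memory is at or above threshold percent of
# the limit. Arguments: used bytes, limit bytes, threshold percent
# (default 90). A limit of zero never triggers the alert.
mem_near_limit() {
  local used="$1" limit="$2" threshold="${3:-90}"
  awk -v u="$used" -v l="$limit" -v t="$threshold" \
    'BEGIN { exit !(l > 0 && u * 100 >= t * l) }'
}
```

A monitoring script could then use it as `if mem_near_limit "$used" "$limit" 90; then echo "container is close to its memory limit"; fi`.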
Sometimes the task logs provide insufficient help. In these cases, using your favorite Linux tools (e.g.,
ping) to get an interactive view can be a worthwhile next step.
For example, if you are using the Universal Container Runtime (UCR) (https://docs.mesosphere.com/latest/deploying-services/containerizers/ucr/), you can use
dcos task exec as follows:
$ dcos task exec -it <mycontainerid> bash
and be presented with an interactive bash shell inside that container.
Note: If you alter the state of the container when using
dcos task exec in the manner above, you must update the stored
app-definition and restart the container from that updated
app-definition. If you fail to do so, then your changes will be lost the next time the container restarts.
Alternatively, when using the Docker containerizer, you can SSH into the node in question and run
docker exec to investigate the running container.
DC/OS has a large number of additional endpoints that could be useful for debugging:
The state-summary endpoint returns a JSON-encoded summary of the agents, tasks, and frameworks inside the cluster. This is especially helpful when considering allocation of resources across the cluster, as it shows you whether there are resources already reserved for a particular role (there are more details on this in one of the debugging scenarios provided below).
Note: See the complete list of Mesos endpoints.
The queue endpoint returns a list of all tasks in the queue to be scheduled by Marathon. This endpoint is valuable when troubleshooting scaling or deployment problems.
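Both endpoints can be queried from a workstation where the DC/OS CLI is already authenticated. The sketch below rests on several assumptions that may not hold for your cluster: that the cluster URL and token are available via `dcos config show core.dcos_url` and `dcos config show core.dcos_acs_token`, and that the endpoints are reachable under the paths `/mesos/state-summary` and `/service/marathon/v2/queue`. The function names are hypothetical.

```shell
# Join a base URL and an endpoint path with exactly one slash between them.
dcos_endpoint_url() {
  local base="${1%/}" path="${2#/}"
  printf '%s/%s\n' "$base" "$path"
}

# GET an endpoint using the URL and token stored by the DC/OS CLI.
# Clusters with self-signed certificates may need extra curl options.
dcos_get() {
  local base token
  base="$(dcos config show core.dcos_url)"
  token="$(dcos config show core.dcos_acs_token)"
  curl -sS -H "Authorization: token=${token}" "$(dcos_endpoint_url "$base" "$1")"
}
```

With those assumptions, `dcos_get /mesos/state-summary` would return the cluster summary and `dcos_get /service/marathon/v2/queue` the Marathon launch queue.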
The DC/OS community is a great place to ask additional questions either via Slack or the mailing list. Also keep in mind that both Mesos and Marathon have their own communities in addition to the DC/OS community.
There are other debugging tools as well, both internal to DC/OS and external, such as Sysdig or Instana. These tools can be especially helpful in diagnosing issues that are not specific to DC/OS (e.g., Linux kernel or networking problems).