General Strategy: Debugging Application Deployment on DC/OS
Now that we have defined a set of tools for debugging applications on DC/OS, let us consider a step-by-step general troubleshooting strategy for applying those tools in an application debugging scenario. Once we have gone over this general strategy, we will apply it to a few concrete scenarios in the practice section.
Apart from any information specific to your scenario, a reasonable approach to debugging an application deployment issue is to apply our debugging tools in the following order:
- 1: Check GUIs
- 2: Check Task Logs
- 3: Check Scheduler Logs
- 4: Check Agent Logs
- 5: Test Task Interactively
- 6: Check Master Logs
- 7: Ask Community
Step 1: Check the GUIs
Step 2: Check the Task Logs
If the GUIs cannot provide sufficient information, next check the task logs using the DC/OS GUI or the CLI. The logs give a better understanding of what might have happened to the application. If the issue is that the app is not deploying (for example, the task status continues to wait indefinitely), try looking at the ‘Debug’ page, which can help in understanding the resources being offered by Mesos.
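As a minimal sketch of the CLI route, the helper below builds and runs `dcos task log` for both output streams of a task. It assumes the `dcos` CLI is installed and authenticated against the cluster; the function names and the task ID `my-app` are hypothetical and only serve as illustration.

```python
import subprocess


def task_log_cmd(task_id, file="stdout", lines=50):
    """Build the argv for `dcos task log`; `file` is usually stdout or stderr."""
    return ["dcos", "task", "log", f"--lines={lines}", task_id, file]


def fetch_task_logs(task_id, lines=50):
    """Return the tail of stdout and stderr for one task.

    Assumption: the dcos CLI is on PATH and already logged in.
    """
    logs = {}
    for name in ("stdout", "stderr"):
        result = subprocess.run(
            task_log_cmd(task_id, name, lines),
            capture_output=True,
            text=True,
        )
        logs[name] = result.stdout
    return logs
```

For example, `fetch_task_logs("my-app")["stderr"]` would hold the last lines the task wrote to stderr, which is often where a startup failure shows up first.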
Step 3: Check the Scheduler Logs
Next, when there is a deployment problem and the task logs do not provide enough information to fix the issue, it can be helpful to double-check the app definition. After confirming the app definition, check the Marathon log or GUI to understand how the app was scheduled, or why it was not.
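One way to double-check the app definition from a script is to fetch it with `dcos marathon app show` and pull out the fields that most often explain a stuck deployment. The sketch below assumes an authenticated `dcos` CLI; the helper names and the app ID `/my-app` are hypothetical.

```python
import json
import subprocess


def app_show_cmd(app_id):
    """Build the argv that prints a Marathon app definition as JSON."""
    return ["dcos", "marathon", "app", "show", app_id]


def marathon_log_cmd(lines=100):
    """Build the argv that tails the Marathon scheduler log."""
    return ["dcos", "service", "log", f"--lines={lines}", "marathon"]


def summarize_app(app):
    """Keep only the resource and placement fields of an app definition."""
    keys = ("id", "instances", "cpus", "mem", "disk", "constraints")
    return {key: app[key] for key in keys if key in app}


def load_app(app_id):
    """Fetch and parse the app definition (requires a reachable cluster)."""
    result = subprocess.run(
        app_show_cmd(app_id), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

Calling `summarize_app(load_app("/my-app"))` gives a compact view of what the app requests, which can then be compared against what the Marathon log says about the offers it received.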
Step 4: Check the Agent Logs
The Mesos Agent logs provide information regarding how the task and that task’s environment are being started. Recall that increasing the log level can be helpful in some cases to obtain more information with which to work.
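A small sketch of how the agent logs might be narrowed down to one task: build a `dcos node log` command for the agent in question, then keep only the lines that mention the task. The agent's Mesos ID is assumed to be known (for instance from the node listing of the `dcos` CLI); the helper names are hypothetical.

```python
def agent_log_cmd(mesos_id, lines=100):
    """Build the argv that tails the log of the agent with this Mesos ID."""
    return ["dcos", "node", "log", f"--mesos-id={mesos_id}", f"--lines={lines}"]


def lines_mentioning(log_text, needle):
    """Return the log lines that contain `needle`, e.g. a task ID."""
    return [line for line in log_text.splitlines() if needle in line]
```

The output of the command can be captured with `subprocess.run(..., capture_output=True, text=True)` and passed to `lines_mentioning` together with the task ID.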
Step 5: Test the Task Interactively
The next step is to interactively look at the task running inside the container. If the task is still running, dcos task exec or docker exec can be helpful to start an interactive debugging session. If the application is based on a Docker container image, manually starting it using docker run followed by docker exec can also get you started in the right direction.
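The two routes described above can be sketched as command builders. The first opens a shell in a running task through `dcos task exec`; the second starts a Docker image by hand and then attaches to it. The container name `debug-session`, the shell path `/bin/sh`, and the helper names are assumptions for illustration, and the image must actually ship that shell.

```python
import subprocess


def task_exec_cmd(task_id, shell="/bin/sh"):
    """Build the argv for an interactive shell inside a running task."""
    return ["dcos", "task", "exec", "--interactive", "--tty", task_id, shell]


def docker_debug_cmds(image, name="debug-session", shell="/bin/sh"):
    """Build the argv pair: start the image detached, then attach a shell."""
    return [
        ["docker", "run", "--detach", "--name", name, image],
        ["docker", "exec", "--interactive", "--tty", name, shell],
    ]


def run_interactive(cmd):
    """Run a command attached to the current terminal and return its exit code."""
    return subprocess.call(cmd)
```

`run_interactive(task_exec_cmd("my-app"))` would hand the terminal over to the shell inside the task; `subprocess.call` is used rather than capturing output so that stdin and stdout stay connected to the terminal.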
Step 6: Check the Master Logs
If you want to understand why a particular scheduler has received certain resources or a particular status, the master logs can be very helpful. Recall that the master forwards all status updates between the agents and the scheduler, so its logs might even help in cases where the agent node is not reachable (for example, because of a network partition or node failure).
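As a brief sketch, the leading master's log can be read through `dcos node log --leader`. The helper name is hypothetical, and an authenticated `dcos` CLI is assumed.

```python
def master_log_cmd(lines=100, follow=False):
    """Build the argv that reads the log of the leading Mesos master."""
    cmd = ["dcos", "node", "log", "--leader", f"--lines={lines}"]
    if follow:
        # Keep streaming new entries, useful while reproducing the problem.
        cmd.append("--follow")
    return cmd
```

Following the log while redeploying the app makes it easier to line up the status updates the master forwards with the moment the task changes state.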