Scenario 2: Out of Memory
Deploy the file:
$ dcos marathon app add https://raw.githubusercontent.com/dcos-labs/dcos-deb
Once deployed, when we take a look at the DC/OS GUI, we see some strange results under CPU Allocation:
How is it that CPU Allocation is continually oscillating between 0 and 8 percent? Let’s take a look at the application details in the GUI:
Based on this, the application runs for a few seconds and then fails.
To better understand this unexpected behavior, let us start by looking at the application logs, either in the GUI or via the CLI. You can find the application logs in the GUI by looking under ‘Output’ in the ‘Logs’ tab of the application:
The log output “Eating Memory” is a pretty generous hint that the issue might be related to memory. Even so, there is no direct failure message regarding memory allocation (keep in mind that most apps are not so friendly as to log that they are eating up memory).
As suspected, this might be an application-related issue, and this application is scheduled via Marathon. So let’s check the Marathon logs using the CLI:
$ dcos service log marathon
We see a log entry similar to:
Mar 27 00:46:37 ip-10-0-6-109.us-west-2.compute.internal marathon.sh: [2018-03-27 00:46:36,960] INFO Acknowledge status update for task app-oom.4af344fa-3158-11e8-b60b-a2f459e14528: TASK_FAILED (Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
Tip: A helpful time-saver is to grep for ‘TASK_FAILED’.
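As a minimal sketch of that tip, the snippet below filters for ‘TASK_FAILED’ and extracts the failure reason. To stay self-contained it runs against the sample log line shown above rather than a live cluster; the variable names are illustrative only.

```shell
# Sample line copied from the Marathon log output above.
sample_log='Mar 27 00:46:37 ip-10-0-6-109.us-west-2.compute.internal marathon.sh: [2018-03-27 00:46:36,960] INFO Acknowledge status update for task app-oom.4af344fa-3158-11e8-b60b-a2f459e14528: TASK_FAILED (Memory limit exceeded: Requested: 64MB Maximum Used: 64MB'

# Keep only the lines that report a failed task.
# Against a live cluster the input would instead come from:
#   dcos service log marathon | grep 'TASK_FAILED'
failed_lines=$(printf '%s\n' "$sample_log" | grep 'TASK_FAILED')

# Strip everything up to and including "TASK_FAILED (" to leave the reason.
reason=$(printf '%s\n' "$failed_lines" | sed -n 's/.*TASK_FAILED (\(.*\)/\1/p')

echo "$reason"
```

On the sample line this prints the reason text, “Memory limit exceeded: Requested: 64MB Maximum Used: 64MB”.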
Now we have confirmed that we exceeded the previously set container memory limit in
If you’ve been paying close attention, you might now shout “wait a sec”, because the memory limit we set in the app definition is 32 MB, but the error message mentions 64MB. The difference is executor overhead: DC/OS automatically reserves some memory for the executor, which in this case is 32 MB.
Please note that the OOM kill is performed by the Linux kernel itself, hence we can also check the kernel logs directly:
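The arithmetic behind the 64MB figure can be sketched as follows. Both values come from the text above; the variable names are made up for illustration, and the executor overhead applies to this example only, not as a general constant.

```shell
# Memory limit set in the app definition (from the text above).
app_mem_mb=32

# Overhead DC/OS reserves for the executor in this example (from the text above).
executor_overhead_mb=32

# The limit enforced on the container is the sum of both.
container_limit_mb=$((app_mem_mb + executor_overhead_mb))

echo "Container memory limit: ${container_limit_mb}MB"

# To see the configured value on a live cluster, one option is:
#   dcos marathon app show /app-oom | jq '.mem'
```

The sum matches the “Requested: 64MB” value in the Marathon log entry.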
$ dcos node ssh --master-proxy --mesos-id=$(dcos task app-oom --json | jq -r '. | .slave_id') journalctl -f _TRANSPORT=kernel
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel:  0 16846 30939 11021 62 3 0 0 mesos-container
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel:  0 16866 198538 12215 81 4 0 0 mesos-executor
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel:  0 16879 2463 596 11 3 0 0 sh
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel:  0 16883 1143916 14756 52 6 0 0 oomApp
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel: Memory cgroup out of memory: Kill process 16883 (oomApp) score 877 or sacrifice child
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel: Killed process 16883 (oomApp) total-vm:4575664kB, anon-rss:57784kB, file-rss:1240kB, shmem-rss:0kB
Mar 27 01:15:36 ip-10-0-1-103.us-west-2.compute.internal kernel: oom_reaper: reaped process 16883 (oomApp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
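To get a feel for how close the killed process was to the limit, the resident memory figures in the “Killed process” line can be added up. The following is a rough sketch that parses the sample line from the kernel log above; the variable names are illustrative, and the other processes in the same cgroup also count toward the limit, so the number is only an approximation of the container's usage.

```shell
# "Killed process" line copied from the kernel log above (host prefix omitted).
kill_line='kernel: Killed process 16883 (oomApp) total-vm:4575664kB, anon-rss:57784kB, file-rss:1240kB, shmem-rss:0kB'

# Extract the anonymous and file-backed resident set sizes in kB.
anon_kb=$(printf '%s\n' "$kill_line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')
file_kb=$(printf '%s\n' "$kill_line" | sed -n 's/.*file-rss:\([0-9]*\)kB.*/\1/p')

# Sum them and convert to whole megabytes (integer division rounds down).
total_kb=$((anon_kb + file_kb))
total_mb=$((total_kb / 1024))

echo "oomApp resident memory: ${total_kb} kB (~${total_mb} MB)"
```

For the sample line this gives 59024 kB, or roughly 57 MB, which is close to the 64MB container limit.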
The resolution in such cases is either to increase the resource limits for that container, if they were configured too low to begin with, or, as in this case, to fix the memory leak in the application itself.
As we are dealing with a failing task, it is good to check the application and scheduler logs (in this case our scheduler is Marathon). If doing this is insufficient, it can help to look at the Mesos Agent logs and/or to use dcos task exec when using UCR (or in a Docker containerizer, ssh into the node and use
Remove the application with:
$ dcos marathon app remove /app-oom