Scenario 2

Tutorial - Out of Memory

Scenario 2: Out of Memory


Deploy the file app-oom.json:

$ dcos marathon app add

Once deployed, when we take a look at the DC/OS GUI, we see some strange results under CPU Allocation:

Pic of CPU allocation

How is it that CPU Allocation is continuing oscillating between 0 and 8 percent? Let’s take a look at the application details in the GUI:

Pic of Task tab

Based on this, the application runs for a few seconds and then fails.


To get a better handle on understanding this unexpected behavior, let us start by looking at the application logs — either in the GUI or via the CLI. You can find the application logs in the GUI by looking under ‘Output’ in the ‘Logs’ tab of the application:

Pic of app logs

The log output “Eating Memory” is a pretty generous hint that the issue might be related to memory. Despite this, there is no direct failure message regarding memory allocation(keep in mind that most apps are not so friendly as to log that they are eating up memory).

As suspected, this might be an application-related issue, and this application is scheduled via Marathon. So let’s check the Marathon logs using the CLI:

$ dcos service log marathon

We see a log entry similar to:

Mar 27 00:46:37[5866]: [2018-03-27 00:46:36,960] INFO  Acknowledge status update for task app-oom.4af344fa-3158-11e8-b60b-a2f459e14528: TASK_FAILED (Memory limit exceeded: Requested: 64MB Maximum Used: 64MB

Tip One helpful time-saving tip can be to grep for ‘TASK_FAILED’.

Now we have confirmed that we exceeded the previously set container memory limit in app-oom.json

If you’ve been paying close attention you might shout now “wait a sec” because you noticed that the memory limit we set in the app definition is 32 MB, but the error message mentions 64MB. DC/OS automatically reserves some overhead memory for the executor which in this case is 32 MB.

Please note that OOM kill is performed by the Linux kernel itself, hence we can also check the kernel logs directly:

dcos node ssh --master-proxy --mesos-id=$(dcos task app-oom --json | jq -r '.[] | .slave_id')

journalctl -f _TRANSPORT=kernel

Mar 27 01:15:36 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

Mar 27 01:15:36 kernel: [16846]     0 16846    30939    11021      62       3        0             0 mesos-container

Mar 27 01:15:36 kernel: [16866]     0 16866   198538    12215      81       4        0             0 mesos-executor

Mar 27 01:15:36 kernel: [16879]     0 16879     2463      596      11       3        0             0 sh

Mar 27 01:15:36 kernel: [16883]     0 16883  1143916    14756      52       6        0             0 oomApp

Mar 27 01:15:36 kernel: Memory cgroup out of memory: Kill process 16883 (oomApp) score 877 or sacrifice child

Mar 27 01:15:36 kernel: Killed process 16883 (oomApp) total-vm:4575664kB, anon-rss:57784kB, file-rss:1240kB, shmem-rss:0kB

Mar 27 01:15:36 kernel: oom_reaper: reaped process 16883 (oomApp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The resolution in such cases is to either increase the resource limits for that container, in case it was configured too low to begin with. Or, as in this case, fix the memory leak in the application itself.

General Pattern

As we are dealing with a failing task it is good to check the application and scheduler logs (in this case our scheduler is Marathon). If doing this is insufficient, it can help to look at the Mesos Agent logs and/or to use dcos task exec when using UCR (or in a Docker containerizer, ssh into the node and use docker exec).


Remove the application with

$ dcos marathon app remove /app-oom