Tutorial - Resource Allocation

Scenario 1: Resource Allocation


For this first scenario, deploy this app definition as follows:

$ dcos marathon app add

Check the application status using the DC/OS GUI. You should see something like the following:

Pic of GUI

The application status is most likely “Waiting”, followed by a count out of one thousand, “x/1000”. “Waiting” refers to the overall application status, and “x” is the number of instances that have deployed successfully (6 in this example).

You can also check this status from the CLI:

$ dcos marathon app list

This produces output similar to the following:


/app-scaling-1  128    1    6/1000   ---      scale     True       mesos    sleep 10000

Or, if you want to see all ongoing deployments, enter:

$ dcos marathon deployment list

to see something like the following:


/app-scaling-1  -    scale     1/2     c51af187-dd74-4321-bb38-49e6d224f4c8
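
If you want to read the instance counts in a script rather than by eye, the app list output can be split with awk. This is only a minimal sketch: the function name app_task_counts is hypothetical, and it assumes that the running/target counts sit in the fourth whitespace-separated column, as they do in the sample row shown earlier.

```shell
# Reads `dcos marathon app list` output on stdin and prints
# "<running> <target>" for the app ID given as the first argument.
# Assumption: the counts (e.g. 6/1000) are the 4th whitespace-separated column.
app_task_counts() {
  awk -v id="$1" '$1 == id { split($4, t, "/"); print t[1], t[2] }'
}

# Example using the sample row from this tutorial instead of a live cluster:
sample='/app-scaling-1  128    1    6/1000   ---      scale     True       mesos    sleep 10000'
printf '%s\n' "$sample" | app_task_counts /app-scaling-1
```

Against a live cluster you would pipe the output of dcos marathon app list into the function instead of the sample row.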

So now we know that some (6/1000) instances of the application have successfully deployed, but the overall deployment status is “Waiting”. But what does this mean?


The “Waiting” state means that DC/OS (or more precisely Marathon) is waiting for a suitable resource offer. So it seems to be a deployment issue and we should start by checking the available resources.

If we look at the DC/OS dashboard, we should see a pretty high CPU allocation similar to the following (of course, the exact percentage depends on your cluster):

Pic of CPU Allocation

We are not yet at 100% allocation, yet we are still waiting to deploy, so something interesting is going on. Let’s look at the recent resource offers in the debug view of the DC/OS GUI.

Pic of relevant instance of GUI

We can see that there are no matching CPU resources, even though the overall CPU allocation is only at 75%. More puzzling still, the ‘Details’ section further below shows that the latest offers from different hosts match most of the resource requirements of our application. For example, the first offer, coming from one of the hosts, matched the role, constraint (not present in this app definition), memory, disk, and port resource requirements, but failed the CPU resource requirement. The offer before this also seemed like it should have met the resource requirements. So although it looks like we have enough CPU resources available, the deployment seems to be failing for exactly this reason.

Let’s look at the ‘Details’ more closely.

Pic of details

Interesting. According to this, some of the remaining CPU resources are allocated to a different Mesos resource role and so cannot be used by our application (it runs in role ‘*’, the default role).

To check the roles of different resources let us have a look at the state-summary endpoint, which you can access at https://<master-ip>/mesos/state-summary.

That endpoint returns a rather long JSON document, so it is helpful to use jq to make the output readable:

curl -skSL \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  -H "Content-Type: application/json" \
  "$(dcos config show core.dcos_url)/mesos/state-summary" |
  jq '.'

Looking at the agent information, we can see two different kinds of agents.

Pic of cluster information

The first kind has no free CPU resources and also no reserved resources. Of course, this might be different if you had other workloads running on your cluster prior to these exercises. Note that these unreserved resources correspond to the default role ‘*’, the role in which we are trying to deploy our tasks.

The second kind has unused CPU resources, but these resources are reserved in the role ‘slave_public’.
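
Rather than scrolling through the full JSON, a narrower jq filter can print one line per agent. The following is a sketch under assumptions, not a definitive recipe: the function name agent_cpu_summary is hypothetical, and the field names (slaves, hostname, resources, used_resources, unreserved_resources, reserved_resources) are assumed to match what your Mesos version returns from state-summary, so check them against your own output first.

```shell
# Reads state-summary JSON on stdin and prints one tab-separated line per agent:
#   hostname, total CPUs, used CPUs, unreserved CPUs, roles holding reservations
# Assumption: the field names below match your cluster's state-summary output.
agent_cpu_summary() {
  jq -r '.slaves[] | [
      .hostname,
      .resources.cpus,
      .used_resources.cpus,
      (.unreserved_resources.cpus // 0),
      ((.reserved_resources // {}) | keys | join(","))
    ] | @tsv'
}

# Usage against a live cluster (same curl call as above, piped into the function):
#   curl -skSL \
#     -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
#     "$(dcos config show core.dcos_url)/mesos/state-summary" | agent_cpu_summary
```

An agent of the first kind would show used CPUs equal to unreserved CPUs and an empty last column. An agent of the second kind would list ‘slave_public’ in the last column.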

We now know that the issue is that there are not enough resources in the desired resource role across the entire cluster. As a solution we could either scale down the application (1000 instances does seem a bit excessive), or we need to add more resources to the cluster.
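
To judge how far to scale down, it can help to work out how many instances fit into the CPU that is actually free in the role. The helper below is a hypothetical sketch (max_instances is not part of the DC/OS CLI), and it considers CPU only, ignoring memory, disk, ports, and how the free CPU is spread across agents.

```shell
# Prints how many instances fit into a given amount of free CPU.
# Arguments: free CPUs in the role, CPUs requested per instance.
# Assumption: CPU is the only limiting resource and can be pooled freely.
max_instances() {
  awk -v free="$1" -v per="$2" 'BEGIN {
    if (per <= 0) exit 1               # reject a zero or negative request
    print int(free / per + 1e-9)       # small epsilon guards against float rounding
  }'
}

# In this scenario each instance asks for 1 CPU. If about 6 CPUs are free
# in role "*", only 6 of the 1000 requested instances can run:
max_instances 6 1
```

With that number in hand, one option is to lower the instance count to match, for example with dcos marathon app update /app-scaling-1 instances=6 (check the exact syntax against your CLI version).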

General Pattern

When your application framework (e.g. Marathon) is not accepting resource offers, check whether there are sufficient resources available in the respective resource role.

This was a straightforward scenario with too few CPU resources. In practice, resource issues are more often caused by more complex factors, such as improperly configured port resources or placement constraints. Nonetheless, this general workflow pattern still applies.


Remove the application from the cluster with:

$ dcos marathon app remove /app-scaling-1