Metrics in DC/OS, version 1.12 or newer, use Telegraf to collect and process data. Telegraf provides metrics from DC/OS cluster hosts, containers running on those hosts, and from applications running on DC/OS using the statsd process. Telegraf is natively integrated with DC/OS. By default, it exposes metrics in Prometheus format from port 61091 on each node, and in JSON format through the DC/OS Metrics API.
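As a rough illustration of reading the Prometheus-format output, the sketch below fetches a node's metrics and parses them into tuples. Only port 61091 comes from the text above. The `/metrics` path, the function names, and the node address are assumptions for illustration. The parser is deliberately simplified and is not a full implementation of the Prometheus exposition format.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse simple Prometheus text-format samples into (name, labels, value).

    Simplified sketch: assumes no timestamps and no commas inside label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample value is the last whitespace-separated token.
        series, _, raw_value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, label_part = series.partition("{")
            for pair in label_part.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_node_metrics(host, port=61091, path="/metrics", timeout=5):
    """Fetch and parse metrics from one node.

    The "/metrics" path is an assumption; adjust it to match your cluster.
    """
    url = "http://{}:{}{}".format(host, port, path)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

For example, `fetch_node_metrics("10.0.0.5")` (a hypothetical node address) would return one tuple per sample exposed by that node.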
DC/OS collects four types of metrics as follows:
- System: Metrics about each node in the DC/OS cluster.
- Component: Metrics about the components which make up DC/OS.
- Container: Metrics about cgroup allocations from tasks running in the DC/OS Universal Container Runtime or Docker Engine runtime.
- Application: Metrics emitted from any application running on the Universal Container Runtime.
Telegraf is included in the DC/OS distribution and runs on every host in the cluster. Because Telegraf provides a plugin-driven architecture, custom DC/OS plugins provide metrics on the performance of DC/OS workloads and DC/OS itself.
Telegraf collects application and custom metrics through the dcos_statsd plugin. A dedicated StatsD server is started for each new task. Any metrics received by the StatsD server are tagged with the task name and its service name. The address of the server is provided to the task through environment variables (STATSD_UDP_HOST and STATSD_UDP_PORT). Note that when a task finishes, any metrics it has emitted that haven’t yet been gathered by Telegraf are discarded. The metrics collected by dcos_statsd are gathered every 30 seconds, so a task must run for at least 30 seconds to ensure its metrics are gathered.
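The following is a minimal sketch of how a task might emit a metric to its StatsD server and then stay alive long enough for the 30-second gathering interval. It assumes the server address is available in the STATSD_UDP_HOST and STATSD_UDP_PORT environment variables. The function names, the metric name, and the fallback address `127.0.0.1:8125` are illustrative assumptions, not part of DC/OS.

```python
import os
import socket
import time


def format_statsd(name, value, metric_type):
    """Build one StatsD line such as 'my_app.requests:1|c'."""
    return "{}:{}|{}".format(name, value, metric_type)


def send_statsd(line, host=None, port=None):
    """Send one StatsD line over UDP.

    Falls back to 127.0.0.1:8125 when the environment variables are unset;
    that fallback is an assumption for local experiments only.
    """
    host = host or os.environ.get("STATSD_UDP_HOST", "127.0.0.1")
    port = int(port or os.environ.get("STATSD_UDP_PORT", "8125"))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def linger_until(started, min_runtime=30.0):
    """Sleep until at least min_runtime seconds have passed since 'started'.

    Returns the number of seconds that still had to be waited.
    """
    remaining = min_runtime - (time.monotonic() - started)
    if remaining > 0:
        time.sleep(remaining)
    return max(remaining, 0.0)
```

A short-lived task could record `started = time.monotonic()` at startup, call `send_statsd(format_statsd("my_app.requests", 1, "c"))` as it works, and call `linger_until(started)` before exiting so that its last metrics are not discarded.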
For the list of metrics that DC/OS collects automatically, see the Metrics Reference documentation.
Upgrading from 1.11
DC/OS 1.12 includes an updated statsd server implementation for application metrics. The statsd update fixes an issue with the statsd server implementation in 1.11, which treated all application metrics as gauges, regardless of their type.
Dashboards and alerts that rely on counters, histograms, or sets behave differently in 1.12 than in 1.11 as follows:
- Gauges report the last received value. There is no change from 1.11 functionality.
- Counters report the sum of all received values. In 1.11, counters reported the last received value.
- Histograms and timers report _max metrics. In 1.11, histograms reported the last received value.
- Sets report the count of unique values received. In 1.11, sets reported the last received value.
Additionally, multi-packet metrics and sampling are now available. In 1.11, they were not implemented and resulted in missing metrics.
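To see why dashboards built on counters change between versions, the sketch below models the two behaviors for gauges and counters only. It is a simplified illustration, not the DC/OS or Telegraf implementation. The function names are hypothetical, and scaling a sampled counter by dividing by its sample rate follows the usual StatsD convention, which is assumed here.

```python
def parse_statsd_line(line):
    """Split 'name:value|type[|@rate]' into (name, value, type, rate)."""
    name, _, rest = line.partition(":")
    parts = rest.split("|")
    value = float(parts[0])
    metric_type = parts[1]
    rate = 1.0
    for extra in parts[2:]:
        if extra.startswith("@"):
            rate = float(extra[1:])
    return name, value, metric_type, rate


def aggregate(lines, treat_all_as_gauges=False):
    """Aggregate StatsD lines received within one gathering interval.

    treat_all_as_gauges=True models the 1.11 behavior described above,
    where every metric reports its last received value.
    Only gauges ('g') and counters ('c') are modeled in this sketch.
    """
    result = {}
    for line in lines:
        name, value, metric_type, rate = parse_statsd_line(line)
        if treat_all_as_gauges or metric_type == "g":
            # Gauge semantics: keep the last received value.
            result[name] = value
        elif metric_type == "c":
            # Counter semantics: sum the values, scaled by the sample rate.
            result[name] = result.get(name, 0.0) + value / rate
    return result
```

With the input `["hits:1|c", "hits:2|c", "hits:3|c"]`, the 1.12-style aggregation reports 6 for `hits`, while the 1.11-style aggregation reports 3, the last value received.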
Use the following troubleshooting guidelines to resolve errors:
- You can collect metrics about Telegraf’s own performance by enabling the Telegraf internal input plugin (inputs.internal).
- You can check the status of the Telegraf systemd unit by running systemctl status dcos-telegraf.
- Logs are available from journald via journalctl -u dcos-telegraf.