Debugging User Account Management
Admin Router receives all user account management operations via HTTP and hands them over to the IAM. To debug user management problems, make sure Admin Router on the masters receives the HTTP request for the intended operation using the following commands.
sudo journalctl -u dcos-adminrouter.service
Identity and Access Manager
To debug user account management problems, check the IAM (Bouncer) on the masters using the following commands.
sudo journalctl -u dcos-bouncer.service
The IAM stores user information in CockroachDB running on master nodes. If Admin Router and the IAM received the user management operation to perform according to their logs and the IAM could not fulfill the request in it likely that CockroachDB is in trouble. In this case check for errors or unusual patterns in the CockroachDB log on all master nodes.
sudo journalctl -u dcos-cockroach.service
The simplest way to determine whether a CockroachDB cluster is healthy is to query the CockroachDB node status information from a master node like this.
sudo /opt/mesosphere/bin/cockroach node status --ranges --host=$(/opt/mesosphere/bin/detect_ip) --insecure
The reported number of nodes should match the number of DC/OS master nodes in general. However, each time a master node is replaced the CockroachDB node of the old DC/OS master node is not automatically removed.
The number of
is_live CockroachDB nodes should always match the current number of master nodes.
DC/OS sets the number of replicas for CockroachDB ranges equal to the number of DC/OS master nodes. This implies one master node should hold one copy of the data. Therefore, the number of
ranges_underreplicated can go up temporarily if a master node goes out of business. After a new master node joins the cluster
ranges_underreplicated is expected to go down again to zero after about five minutes.
If the number of
ranges_unavailable is non-zero this data is at least temporarily unavailable for reading and writing. In this case the CockroachDB range’s raft group quorum is most likely impaired.
Getting the state of the majority of CockroachDB instances to
is_live again on the majority of DC/OS master nodes is expected to restore the quorum and thus to bring
ranges_unavailable down to zero.
This is done easiest by restarting the CockroachDB instances that report
is_live = false. Execute the following command to restart the CockroachDB node on the corresponding master node.
sudo systemctl restart dcos-cockroach.service
If the raft quorum for a particular range cannot be recovered and the
ranges_unavailable persist, the unavailable data is inevitably lost because it cannot be read nor written.