Ceph logging [WRN] evicting unresponsive client

This warning means that a client stopped responding to messages from the MDS. It is sometimes harmless (for example, the client disconnected uncleanly after a hard reboot), but it can also indicate that the client is overloaded or deadlocked on something else.

If the same client appears repeatedly, it may be useful to get in touch with the owner of the client machine (use ai-dump <hostname> on aiadm).
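To see whether the same client keeps getting evicted, the warnings can be counted per client. The following is a minimal sketch, not an established tool: the function name count_evicted_clients is hypothetical, and it assumes that the token immediately after "evicting unresponsive client" in each log line identifies the client.

```shell
# Count evictions per client from cluster log lines read on stdin.
# Assumption: the word right after "evicting unresponsive client"
# identifies the client (e.g. hostname:user).
count_evicted_clients() {
    grep -o 'evicting unresponsive client [^ ]*' \
        | awk '{ print $4 }' \
        | sort | uniq -c | sort -rn \
        | awk '{ printf "%d %s\n", $1, $2 }'
}

# Hypothetical usage; the log path is an assumption and depends on
# how the cluster is deployed:
#   count_evicted_clients < /var/log/ceph/ceph.log
```

Each output line is a count followed by the client, most frequently evicted first.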

Ceph logging [WRN] clients failing to respond to cache pressure

When the MDS cache is full, the MDS needs to evict inodes from its cache. This normally means that it also has to ask some clients to drop inodes from their own caches.

If a client fails to respond to this cache recall request, Ceph logs this warning.

Clients stuck in this state for an extended period can cause issues, so follow up with the machine owner to understand the problem.

Note: ceph-fuse v13.2.1 has a bug that triggers this issue; users should update to a newer client release.

Ceph logging [WRN] client session with invalid root denied

This means that a user is trying to mount a Manila share that either does not exist or for which they have not yet created a key. It is harmless, but if it happens repeatedly, get in touch with the user.

Procedure to unblock hung HPC writes

An HPC client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02 failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec

Indeed there was a hung write on hpc070.cern.ch:

# cat /sys/kernel/debug/ceph/*/osdc
245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100  e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001  0x400024        1 write

The hung request was waiting on osd.100 (the second column of the osdc output). I restarted osd.100 and the deadlocked request went away.
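The two lookups in this procedure can be scripted. This is a rough sketch under stated assumptions, not a supported tool: both function names are hypothetical, the first expects `ceph health detail` output on stdin, and the second expects the contents of the client's osdc debugfs file, where request lines carry the target OSD as "osdN" in the second field.

```shell
# Print the unique client IDs reported as late to release capabilities.
# Reads `ceph health detail` output on stdin.
late_release_client_ids() {
    sed -n 's/.*client_id: \([0-9][0-9]*\).*/\1/p' | sort -u
}

# Print the unique OSDs that the client still has outstanding requests on.
# Reads the osdc debugfs file on stdin. Any line whose second field looks
# like "osdN" is matched, so long-lived (linger) requests would be listed
# too if the client has any.
osds_with_inflight_requests() {
    awk '$2 ~ /^osd[0-9]+$/ { print $2 }' | sort -u
}

# Usage, on a monitor/admin node and on the stuck client respectively:
#   ceph health detail | late_release_client_ids
#   cat /sys/kernel/debug/ceph/*/osdc | osds_with_inflight_requests
```

An OSD that stays in the second list for a long time is the candidate for a restart. How to restart it depends on the deployment; on a systemd-managed, non-containerized host this is typically `systemctl restart ceph-osd@<id>`, but treat that as an assumption and check the local setup first.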
