A Ceph cluster (at least in the Mimic version) comes, by default, with an MDS cache memory limit (mds_cache_memory_limit) of 1G... and that is not enough if you are running some heavy-load CephFS clients; you will soon start to get warnings like client X is failing to respond to cache pressure.
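As a quick sanity check (assuming you have a working ceph CLI on a monitor or admin node), that warning shows up in the cluster health output:

ceph health detail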
How do I know that a Ceph cluster comes with mds_cache_memory_limit of 1G, you ask? Well, I ran the following command on a Ceph MDS server:

ceph daemon mds.<<your_ceph_mds_server_name>> config get mds_cache_memory_limit

... and you should get output along these lines (the value is in bytes, and 1073741824 bytes is the 1G default):
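{
    "mds_cache_memory_limit": "1073741824"
}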
Always perform modifications on a standby MDS server. Do not perform modifications on an active server, because (from my experience) the MDS will get stuck for some time or restart. At least, that is what happened to me on my Mimic 13.2.6 Ceph cluster.
The command to increase the MDS cache memory limit from 1G to 64G on your Ceph cluster is (if you want a different value, do some calculations: 1073741824 bytes is 1G, so 64 × 1073741824 = 68719476736 😛 ):
ceph daemon mds.<<your_ceph_mds_server_name>> config set mds_cache_memory_limit 68719476736
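If you want to make sure the new value actually took effect, just re-run the get command from earlier and it should now report 68719476736:

ceph daemon mds.<<your_ceph_mds_server_name>> config get mds_cache_memory_limit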
Do the above modification on all your standby MDS servers, and I truly hope you have more than one MDS server in your cluster, otherwise you are screwed. Or, you can read this and quickly deploy extra MDS servers and it's all good 😊.
Now stop the active MDS server with systemctl (I am running my cluster on Ubuntu 16.04) and watch how one of your standby MDS servers becomes active.
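For reference, on my Ubuntu nodes the MDS runs as a systemd unit named after the MDS id, so stopping it and watching the failover looks roughly like this (adjust the unit name to match your own MDS id):

sudo systemctl stop ceph-mds@<<your_ceph_mds_server_name>>
ceph mds stat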
Remember to perform the above modifications on the ex-active MDS server and that's it.
Oh, one more thing... The ceph daemon ... config set change is only a runtime change, so if you reboot the servers the default value (the 1G mds_cache_memory_limit) comes back and every modification you performed will be erased. To prevent that from happening, add the following config to the /etc/ceph/ceph.conf file.
mds_cache_memory_limit = 68719476736
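In my setup I keep it under the [mds] section of ceph.conf (it can also go under [global]), so the relevant bit of the file looks like this:

[mds]
mds_cache_memory_limit = 68719476736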