author | xie xingguo <xie.xingguo@zte.com.cn> | 2019-01-08 11:38:45 +0100 |
---|---|---|
committer | xie xingguo <xie.xingguo@zte.com.cn> | 2019-01-09 06:48:26 +0100 |
commit | 794a8f9cf51cf176636d114ccfbbf68fbc304083 (patch) | |
tree | b9cc655ee985cf4af4fff4c0d2dd2f652b7b7e0b /src/msg/async/ProtocolV1.h | |
parent | Merge PR #25750 into master | |
msg/async: do not force updating rotating keys inline
We found that quite a few OSDs were unable to rejoin the cluster
after the core switch was updated.
The symptoms were similar: all of these OSDs complained about not
being able to renew their rotating keys, which are necessary
for authorized entities to talk to each other.
The root cause is that an affected OSD keeps hunting for a reachable Mon,
and if none is available, the hunting process restarts every __timeout__ seconds,
causing the in-progress async-connection to be torn down and re-created.
However, the underlying thread in charge of the hunting process can be
blocked if there are hundreds of async-connections that are also waiting
for new rotating keys, e.g.:
```
2018-12-29 16:35:19.210884 7f416d6ee700 0 -- 172.18.35.6:6808/1036230 >> 172.18.35.4:6810/1037600 conn(0x7f41d9e3c000 :6808 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=293 cs=25 l=0).handle_connect_reply connect got BADAUTHORIZER
2018-12-29 16:35:19.210891 7f416d6ee700 10 monclient(hunting): wait_auth_rotating waiting (until 2018-12-29 16:35:29.210889)
2018-12-29 16:35:29.210947 7f416d6ee700 0 monclient(hunting): wait_auth_rotating timed out after 10
2018-12-29 16:35:29.211101 7f416d6ee700 0 -- 172.18.35.6:6808/1036230 >> 172.18.35.4:6824/1028882 conn(0x7f418195d000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1433 cs=8 l=0).handle_connect_reply connect got BADAUTHORIZER
2018-12-29 16:35:29.211108 7f416d6ee700 10 monclient(hunting): wait_auth_rotating waiting (until 2018-12-29 16:35:39.211108)
2018-12-29 16:35:39.211167 7f416d6ee700 0 monclient(hunting): wait_auth_rotating timed out after 10
```
As a result, the corresponding OSD is stuck hunting forever.
Fix this by no longer updating rotating keys at the messenger level and
letting monclient do it instead. On detecting a bad or outdated
rotating key, we can simply back off and restart the connection
procedure.
Signed-off-by: yanjun <yan.jun8@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
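The difference between waiting inline and backing off can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code changed by this commit: `Conn`, `KeyRenewer`, `handle_bad_auth_inline`, `handle_bad_auth_backoff` and `run_pass` are hypothetical names, time is simulated rather than measured, and the backoff interval is an arbitrary illustrative value. Only the 10-second wait comes from the log above.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this models the idea, not Ceph's real API.
using Millis = std::chrono::milliseconds;

struct Conn {
  int id = 0;
  bool connected = false;
  Millis retry_at{0};  // simulated time of the next connect attempt
};

// Stand-in for the monclient: rotating keys become valid at keys_ready_at.
struct KeyRenewer {
  Millis keys_ready_at;
  bool have_keys(Millis now) const { return now >= keys_ready_at; }
};

// Old behaviour (simplified): on BADAUTHORIZER the shared event thread waits
// inline for up to `timeout`, so the waits of all connections are serialized.
// Returns the simulated time the thread spends blocked.
Millis handle_bad_auth_inline(const std::vector<Conn>& conns, Millis timeout) {
  Millis blocked{0};
  for (std::size_t i = 0; i < conns.size(); ++i)
    blocked += timeout;  // one full wait per connection, on the same thread
  return blocked;
}

// New behaviour (simplified): never wait in the messenger. Mark the
// connection for a delayed retry and return immediately.
void handle_bad_auth_backoff(Conn& c, Millis now, Millis backoff) {
  c.connected = false;
  c.retry_at = now + backoff;
}

// One pass of the event loop at simulated time `now`: retry connections that
// are due. Returns how many connections are established after the pass.
std::size_t run_pass(std::vector<Conn>& conns, const KeyRenewer& keys,
                     Millis now, Millis backoff) {
  std::size_t up = 0;
  for (auto& c : conns) {
    if (!c.connected && now >= c.retry_at) {
      if (keys.have_keys(now))
        c.connected = true;  // keys were renewed elsewhere in the meantime
      else
        handle_bad_auth_backoff(c, now, backoff);
    }
    if (c.connected) ++up;
  }
  return up;
}
```

In this model, 300 connections that each wait 10 seconds inline keep the shared thread blocked for 3000 simulated seconds in the worst case, whereas the backoff variant returns from every pass immediately and all connections come up on the first pass after the keys are renewed.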
Diffstat (limited to 'src/msg/async/ProtocolV1.h')
-rw-r--r-- | src/msg/async/ProtocolV1.h | 3 |
1 files changed, 1 insertions, 2 deletions
diff --git a/src/msg/async/ProtocolV1.h b/src/msg/async/ProtocolV1.h
index 7973b07eecd..cf2370f1a94 100644
--- a/src/msg/async/ProtocolV1.h
+++ b/src/msg/async/ProtocolV1.h
@@ -226,7 +226,6 @@ public:
   // Client Protocol
 private:
   int global_seq;
-  bool got_bad_auth;
   AuthAuthorizer *authorizer;
 
   CONTINUATION_DECL(ProtocolV1, send_client_banner);
@@ -301,4 +300,4 @@ public:
   }
 };
 
-#endif /* _MSG_ASYNC_PROTOCOL_V1_ */
\ No newline at end of file
+#endif /* _MSG_ASYNC_PROTOCOL_V1_ */