author xie xingguo <xie.xingguo@zte.com.cn> 2019-01-08 11:38:45 +0100
committer xie xingguo <xie.xingguo@zte.com.cn> 2019-01-09 06:48:26 +0100
commit 794a8f9cf51cf176636d114ccfbbf68fbc304083
tree b9cc655ee985cf4af4fff4c0d2dd2f652b7b7e0b /src/msg/async/ProtocolV1.h
parent Merge PR #25750 into master
msg/async: do not force updating rotating keys inline
We found that quite a few OSDs were unable to rejoin the cluster after the core switch was updated. The symptoms were similar: all of these OSDs complained about being unable to renew their rotating keys, which are necessary for authorized entities to talk to each other.

The root cause is that an OSD keeps hunting for a reachable Mon, and if none is available, the hunting process restarts every __timeout__ seconds, causing the async-connection in progress to be torn down and re-created. However, the underlying thread in charge of the hunting process can be blocked if hundreds of async-connections are also waiting for new rotating keys, e.g.:

```
2018-12-29 16:35:19.210884 7f416d6ee700 0 -- 172.18.35.6:6808/1036230 >> 172.18.35.4:6810/1037600 conn(0x7f41d9e3c000 :6808 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=293 cs=25 l=0).handle_connect_reply connect got BADAUTHORIZER
2018-12-29 16:35:19.210891 7f416d6ee700 10 monclient(hunting): wait_auth_rotating waiting (until 2018-12-29 16:35:29.210889)
2018-12-29 16:35:29.210947 7f416d6ee700 0 monclient(hunting): wait_auth_rotating timed out after 10
2018-12-29 16:35:29.211101 7f416d6ee700 0 -- 172.18.35.6:6808/1036230 >> 172.18.35.4:6824/1028882 conn(0x7f418195d000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1433 cs=8 l=0).handle_connect_reply connect got BADAUTHORIZER
2018-12-29 16:35:29.211108 7f416d6ee700 10 monclient(hunting): wait_auth_rotating waiting (until 2018-12-29 16:35:39.211108)
2018-12-29 16:35:39.211167 7f416d6ee700 0 monclient(hunting): wait_auth_rotating timed out after 10
```

As a result, the corresponding OSD is stuck hunting forever.

Fix this by no longer updating rotating keys at the messenger level and making monclient do it instead. On detecting a bad or outdated rotating key, we simply back off and restart the connecting procedure.

Signed-off-by: yanjun <yan.jun8@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
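The back-off-and-reconnect idea described in the message can be illustrated with a minimal, standalone sketch. This is not the code from this commit or from Ceph: `ConnectReply`, `Action`, `Backoff`, and `Connection` are hypothetical names, and the 200 ms and 5000 ms delays are arbitrary assumptions. Only the BADAUTHORIZER reply comes from the log above. The point it tries to show is that the reply handler returns immediately with a retry delay instead of blocking the event thread while waiting for new rotating keys.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical reply codes; only BADAUTHORIZER appears in the log above.
enum class ConnectReply { READY, BADAUTHORIZER };

// Hypothetical outcome of handling one connect reply.
enum class Action { ESTABLISHED, BACKOFF_AND_RECONNECT };

// Exponential backoff with a cap. The concrete delays are assumptions.
struct Backoff {
  std::chrono::milliseconds initial{200};
  std::chrono::milliseconds max{5000};
  std::chrono::milliseconds current{0};

  // Start at `initial`, then double on each call, never exceeding `max`.
  std::chrono::milliseconds next() {
    assert(initial <= max);
    current = (current.count() == 0) ? initial : std::min(current * 2, max);
    return current;
  }

  void reset() { current = std::chrono::milliseconds{0}; }
};

struct Connection {
  Backoff backoff;
  std::chrono::milliseconds retry_delay{0};

  // Non-blocking handler: on a bad authorizer it only records how long to
  // wait before reconnecting. Refreshing the rotating keys is assumed to
  // happen elsewhere (in the monitor client), not on this thread.
  Action handle_connect_reply(ConnectReply reply) {
    if (reply == ConnectReply::BADAUTHORIZER) {
      retry_delay = backoff.next();
      return Action::BACKOFF_AND_RECONNECT;
    }
    backoff.reset();
    retry_delay = std::chrono::milliseconds{0};
    return Action::ESTABLISHED;
  }
};
```

In this sketch a caller would schedule a reconnect after `retry_delay` whenever `BACKOFF_AND_RECONNECT` is returned, so a thread shared by many connections is never parked in a timed wait.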
Diffstat (limited to 'src/msg/async/ProtocolV1.h')
-rw-r--r-- src/msg/async/ProtocolV1.h 3
1 files changed, 1 insertions, 2 deletions
diff --git a/src/msg/async/ProtocolV1.h b/src/msg/async/ProtocolV1.h
index 7973b07eecd..cf2370f1a94 100644
--- a/src/msg/async/ProtocolV1.h
+++ b/src/msg/async/ProtocolV1.h
@@ -226,7 +226,6 @@ public:
// Client Protocol
private:
int global_seq;
- bool got_bad_auth;
AuthAuthorizer *authorizer;
CONTINUATION_DECL(ProtocolV1, send_client_banner);
@@ -301,4 +300,4 @@ public:
}
};
-#endif /* _MSG_ASYNC_PROTOCOL_V1_ */
\ No newline at end of file
+#endif /* _MSG_ASYNC_PROTOCOL_V1_ */