Sign Up
Log In
Log In
or
Sign Up
Places
All Projects
Status Monitor
Collapse sidebar
SUSE:SLE-12-SP2:Update
drbd.24674
remove_rece_deadlock_conn_lost_in_resync.patch
Overview
Repositories
Revisions
Requests
Users
Attributes
Meta
File remove_rece_deadlock_conn_lost_in_resync.patch of Package drbd.24674
commit ccc813791bddae892c4c8708d39d21f522827923 Author: Philipp Reisner <philipp.reisner@linbit.com> Date: Wed Apr 4 10:14:59 2018 +0200 drbd: Remove a possible receiver deadlock when connection is lost while resyn We got two reports of incidents where DRBD was stuck in NetworkFailure state. It turns out that when a resync runs and application IO is ongoing and the connection is lost while the receiver thread is just in the progress of receiving a barrier: Feb 13 13:08:10 vm17 kernel: drbd t1 mqhavm16.hursley.ibm.com: PingAck did not arrive in time. Feb 13 13:08:10 vm17 kernel: drbd t1 mqhavm16.hursley.ibm.com: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown ) Feb 13 13:08:10 vm17 kernel: drbd t1/0 drbd1122 mqhavm16.hursley.ibm.com: pdsk( UpToDate -> DUnknown ) repl( SyncTarget -> Off ) Feb 13 13:08:10 vm17 kernel: drbd t1 mqhavm16.hursley.ibm.com: ack_receiver terminated Feb 13 13:08:10 vm17 kernel: drbd t1 mqhavm16.hursley.ibm.com: Terminating ack_recv thread The ack_receiver terminated itself, since the connection is dead, Now the receiver should terminate: Feb 13 13:09:39 vm17 kernel: INFO: task drbd_r_t1:4536 blocked for more than 120 seconds. Feb 13 13:09:39 vm17 kernel: drbd_r_t1 D ffff880217690000 0 4536 2 0x00000084 Feb 13 13:09:39 vm17 kernel: Call Trace: Feb 13 13:09:39 vm17 kernel: [<ffffffff816ab6d9>] schedule+0x29/0x70 Feb 13 13:09:39 vm17 kernel: [<ffffffffc0506d97>] __conn_wait_ee_empty+0x97/0xe0 [drbd] Feb 13 13:09:39 vm17 kernel: [<ffffffff810b34b0>] ? wake_up_atomic_t+0x30/0x30 Feb 13 13:09:39 vm17 kernel: [<ffffffffc0506e0d>] conn_wait_ee_empty+0x2d/0x50 [drbd] Feb 13 13:09:39 vm17 kernel: [<ffffffffc050fbb3>] receive_Barrier+0x143/0x1f0 [drbd] The receiver waits for active_ee to become empty. That should be done by the submitter. Tha submitter waits for a resync extent do go out of the way: Feb 13 13:11:39 vm17 kernel: Workqueue: drbd1122_submit do_submit [drbd] Feb 13 13:11:39 vm17 kernel: Call Trace: Feb 13 13:11:39 vm17 kernel: [<ffffffffc0518905>] ? prepare_al_transaction_nonblock+0x1f5/0x210 [drbd] Feb 13 13:11:39 vm17 kernel: [<ffffffff816ab6d9>] schedule+0x29/0x70 Feb 13 13:11:39 vm17 kernel: [<ffffffffc051cb63>] do_submit+0x363/0x620 [drbd] That would be removed by the ack receiver, when it receives the P_RS_CANCEL packet from the Primary/SyncSource node. But that thread has already ceased its activity. Fix that by waiting for ee_empty_or_disconnect. Credits go to Lars Ellenberg. He explained the deadlock, I just wrote the patch. diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c index 2762917a..4fe0f6ad 100644 --- a/drbd/drbd_receiver.c +++ b/drbd/drbd_receiver.c @@ -1416,6 +1416,12 @@ static void conn_wait_ee_empty(struct drbd_connection *connection, struct list_h wait_event(connection->ee_wait, conn_wait_ee_cond(connection, head)); } +static void conn_wait_ee_empty_or_disconnect(struct drbd_connection *connection, struct list_head *head) +{ + wait_event(connection->ee_wait, + conn_wait_ee_cond(connection, head) || connection->cstate[NOW] < C_CONNECTED); +} + /** * drbd_submit_peer_request() * @device: DRBD device. @@ -1701,7 +1707,7 @@ static int receive_Barrier(struct drbd_connection *connection, struct packet_inf case WO_DRAIN_IO: if (rv == FE_STILL_LIVE) { set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &connection->current_epoch->flags); - conn_wait_ee_empty(connection, &connection->active_ee); + conn_wait_ee_empty_or_disconnect(connection, &connection->active_ee); rv = drbd_flush_after_epoch(connection, connection->current_epoch); } if (rv == FE_RECYCLED) @@ -1719,7 +1725,7 @@ static int receive_Barrier(struct drbd_connection *connection, struct packet_inf if (!epoch) { drbd_warn(connection, "Allocation of an epoch failed, slowing down\n"); issue_flush = !test_and_set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &connection->current_epoch->flags); - conn_wait_ee_empty(connection, &connection->active_ee); + conn_wait_ee_empty_or_disconnect(connection, &connection->active_ee); if (issue_flush) { rv = drbd_flush_after_epoch(connection, connection->current_epoch); if (rv == FE_RECYCLED) diff --git a/drbd/drbd_state.c b/drbd/drbd_state.c index 611bed24..e3e9e0f8 100644 --- a/drbd/drbd_state.c +++ b/drbd/drbd_state.c @@ -629,6 +629,7 @@ static enum drbd_state_rv ___end_state_change(struct drbd_resource *resource, st connection->susp_fen[NOW] = connection->susp_fen[NEW]; wake_up(&connection->ping_wait); + wake_up(&connection->ee_wait); } idr_for_each_entry(&resource->devices, device, vnr) {
Locations
Projects
Search
Status Monitor
Help
OpenBuildService.org
Documentation
API Documentation
Code of Conduct
Contact
Support
@OBShq
Terms
openSUSE Build Service is sponsored by
The Open Build Service is an
openSUSE project
.
Sign Up
Log In
Places
Places
All Projects
Status Monitor