cchost crashes in case a corrupt ledger file is found on a node that is joining the network #6612

Open
gaurav137 opened this issue Nov 5, 2024 · 1 comment


gaurav137 commented Nov 5, 2024

If the path under

return;
gets hit, the malformed or corrupt ledger file is not ignored when a node starts from a later snapshot but still has this older, uncommitted ledger file in its ledger directory.

2024-11-05T05:12:11.384963Z        100 [fail ] ../src/host/ledger.h:312             | Malformed incomplete ledger file /mnt/storage/ledger/ledger_19 at seqno 32 (expecting entry of size 54978, remaining 49144)
2024-11-05T05:12:11.415505Z        100 [debug] ../src/host/ledger.h:1107            | Recovering file from main ledger directory: ledger_19
gaurav137 added the bug label Nov 5, 2024

gaurav137 (Author) commented

More generally, if a node starts in join mode with uncommitted ledger files in its ledger directory that are further behind than the committed snapshot files in its snapshot directory, those uncommitted ledger files should be ignored and should not interfere with node startup. The situation I eventually ended up in (after multiple scale up/down and recovery attempts) was the following:

2024-11-04T14:40:27.733657Z -0.012 0   [info ][gov] ode/gov/handlers/recovery.h:170 | 1/1 recovery shares successfully submitted
End of recovery procedure initiated - initiating recovery
2024-11-04T14:40:27.741599Z -0.020 0   [info ][gov] /gov/gov_endpoint_registry.h:58 | RequestCompletedEvent: POST /recovery/members/{memberId}:recover 200 0ms 1 attempt(s)
2024-11-04T14:40:28.702008Z -0.004 0   [info ] ../src/node/node_state.h:2167        | Initiating end of recovery (primary)
2024-11-04T14:40:28.705587Z -0.008 0   [info ] ../src/node/snapshot_serdes.h:111    | Deserialising snapshot (size: 457616, public only: false)
2024-11-04T14:40:28.705679Z -0.008 0   [info ] ../src/node/snapshot_serdes.h:123    | Snapshot successfully deserialised at seqno 117
2024-11-04T14:40:28.705692Z        100 [fail ] ../src/host/ledger.h:489             | Cannot find entries: 118 - 31 in ledger file ledger_19
2024-11-04T14:40:28.705702Z        100 [debug] ../src/host/ledger.h:1435            | Ledger commit: 150/150
2024-11-04T14:40:28.761173Z        100 [fail ] ../src/host/main.cpp:779             | Exception in ccf::run: std::exception
2024-11-04T14:40:28.761947Z -0.064 0   [fail ] ../src/ds/messaging.h:170            | Exception while processing message <::consensus::ledger_no_entry_range:1107064419> of size 17
libc++abi: terminating due to uncaught exception of type std::exception: std::exception

My understanding of what happened: the node started up with the ledger_19 file still present alongside a committed snapshot at seqno 117, and the presence of ledger_19 caused cchost to crash.
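
To make the expected behaviour concrete, here is a minimal sketch of the kind of filter I have in mind. It is not CCF's actual code: the function names are hypothetical, and it assumes that uncommitted files are named `ledger_<start seqno>` (as with `ledger_19` above) while committed files carry a `-<end seqno>.committed` suffix. It also simplifies "behind the snapshot" to "starts at or before the snapshot seqno", and it does not guard against overflow on very long digit strings.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not part of CCF): extract the start seqno from an
// uncommitted ledger file name of the assumed form "ledger_<start>".
// Returns nullopt for anything else, e.g. "ledger_1-18.committed".
std::optional<uint64_t> uncommitted_start_seqno(const std::string& name)
{
  static const std::string prefix = "ledger_";
  if (
    name.size() <= prefix.size() ||
    name.compare(0, prefix.size(), prefix) != 0)
  {
    return std::nullopt;
  }

  uint64_t value = 0;
  for (size_t i = prefix.size(); i < name.size(); ++i)
  {
    const char c = name[i];
    if (c < '0' || c > '9')
    {
      // A '-' or '.' here means a committed or unrelated file
      return std::nullopt;
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return value;
}

// Hypothetical filter: list the uncommitted files that start at or before
// the snapshot seqno, i.e. the ones a node joining from that snapshot
// would skip instead of trying to recover.
std::vector<std::string> files_to_ignore(
  const std::vector<std::string>& names, uint64_t snapshot_seqno)
{
  std::vector<std::string> ignored;
  for (const auto& name : names)
  {
    const auto start = uncommitted_start_seqno(name);
    if (start.has_value() && *start <= snapshot_seqno)
    {
      ignored.push_back(name);
    }
  }
  return ignored;
}
```

With a directory containing `ledger_19` and a snapshot at seqno 117, `files_to_ignore` would return only `ledger_19`, so the malformed file would never be opened during startup.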

achamayou removed the bug label Nov 12, 2024