Failed to find logical volume messages in SMlog and defunct processes #91
Did you check to see if the snapshots were correctly removed after the backup process completed? Unfortunately, I have not had a chance to test this thoroughly yet on an 8.0 installation. Did you see anything like this under XS 7.X? |
Hi
Checked snapshots in Xen Center - they were removed correctly. It didn't happen in XenServer 7.6; we upgraded to 8.0 at the beginning of this month. |
Not sure how such an LVM call would have been initiated via VmBackup to result in this. I will try to get a test instance running soon under 8.0 and see if I can replicate this. Sorry this is showing up. Is everything functional otherwise? |
Yes, everything else seems to be working fine. |
Hi
Today, after the number of zombie processes reached 10,000, I killed their parent and it worked as a workaround - all zombies disappeared. Here's the process I killed:
|
It seems that the parent process doesn't terminate normally, which exacerbated the issue. I would be curious whether the same thing happens if you run another backup process. I do not see anything like this under 7.X, hence it is clearly related to changes in 8.0. |
Hi
I reproduced the issue after getting rid of the zombie processes.
Killed the parent again and the zombie processes disappeared as well. |
Hi, please tell me, is there any news? Thank you |
I am having issues getting the NFS network connected to my test servers and need to work with our network operations group. Until that's rectified, I won't have a CH 8.0 system I can test with. Sorry it's taking this long. |
We are using our own backup scripts to export VM snapshots, but faced the same issue. |
Make sure you don't have too many snapshots already and there is enough room to be able to take an additional snapshot. If the SR is too full, the coalesce process will sometimes also fail to work. You can try manually rescanning the SR, or go and delete snapshots manually first to ensure there is enough free space. Worst case you may have to shuffle VMs around to free up some space and also potentially force the coalescing to kick in. |
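For reference, a minimal sketch of the manual rescan and space check described above, using standard xe CLI calls from dom0 (the SR UUID is a placeholder):
# Rescan the SR so the garbage collector / coalesce pass is re-triggered
xe sr-scan uuid=<sr-uuid>
# Check how much space is left on the SR
xe sr-param-list uuid=<sr-uuid> | grep -E 'physical-(size|utilisation)'
# List snapshot VDIs on that SR to see what could be removed first
xe vdi-list sr-uuid=<sr-uuid> is-a-snapshot=true params=uuid,name-label,virtual-size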
I know about such restrictions. In my case I have at least triple the VM size free on that SR.
systemd --switched-root --system --deserialize 21
Looking through our monitoring, it seems that the issue was caused by some update that I applied on XenServer 7.1 before July 28th. Since that time it has been triggering alerts from time to time, and the maximum number of processes I've seen in monitoring was 3k+ (and I haven't finished the monitoring check yet). I submitted a case to Citrix. Once I killed that parent process, all zombies disappeared. |
Thank you for the feedback, Bogdan! |
Bogdan, did you find a way to manually free space as a workaround? If yes, please share. Thank you |
There are two ways:
Finally, you can have a look at xe vdi-list and clean up orphaned VDIs. There is a lot of discussion about this on Google and especially on the Citrix forums. |
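For illustration, a minimal sketch of inspecting VDIs on an SR with xe vdi-list before cleaning anything up (the SR and VDI UUIDs are placeholders; deciding what is actually orphaned remains a manual judgement call):
# List every VDI on the SR with the fields needed to spot orphan candidates
xe vdi-list sr-uuid=<sr-uuid> params=uuid,name-label,is-a-snapshot,managed,virtual-size
# Inspect a single suspicious VDI in full before deciding anything
xe vdi-param-list uuid=<vdi-uuid>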
Actually, sr_scan also hung and caused zombie processes. |
Unfortunately, the coalesce_leaf command was removed from later versions of XenServer/Citrix Hypervisor, and hence the clean-up process for orphaned VDIs is now by and large a manual process. Moving VMs to other SRs and rescanning or even reinitializing the original SR is indeed one option. When manually deleting VDIs, make sure they're not a template or some other form of VDI that may not necessarily have an associated parent. |
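A hedged sketch of the kind of check meant above before destroying anything (the UUID is a placeholder; vdi-destroy is irreversible, so this is illustrative only):
# Confirm the VDI is not a snapshot and see what, if anything, references it
xe vdi-param-list uuid=<vdi-uuid> | grep -E 'is-a-snapshot|snapshot-of|vbd-uuids|name-label'
# Only once satisfied it is a genuine orphan:
xe vdi-destroy uuid=<vdi-uuid>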
I've found some correlation for my case: I also found that the issue frequently reproduces on our Elasticsearch virtual machines. Some other types of VMs are affected too, but ELK with its moderate but constant load is a good debug sample. I will re-image one host, apply patches up to XS71ECU2008, put the ELK VM on it and see if the zombies appear again. Then I will start applying patches one by one until the LVMSR issue shows up again. Regarding Citrix support, I was told to try some product from https://citrixready.citrix.com/category-results.html?search=backup |
Well, I was able to reproduce the issue using "Full backup" from XOA and even using Storage XenMotion from one server to another. I'm going to reinstall XenServer 7.1, then I will apply patches one by one. As a quick and dirty hack I pause all VMs on the host (much faster than suspend) and wait ~15-30 minutes until coalescing does its job. |
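A minimal sketch of that pause-and-wait workaround using the xe CLI (assumes it is run in dom0 and that pausing every guest is acceptable; watch /var/log/SMlog to judge when coalescing has finished):
# Pause every running guest so coalescing can catch up
for uuid in $(xe vm-list power-state=running is-control-domain=false --minimal | tr ',' ' '); do
  xe vm-pause uuid=$uuid
done
# ...wait and monitor /var/log/SMlog, then resume the guests
for uuid in $(xe vm-list power-state=paused is-control-domain=false --minimal | tr ',' ' '); do
  xe vm-unpause uuid=$uuid
done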
Note that with a vhd-util scan, just because no parent is found doesn't necessarily mean the VDI is an orphan; it might just be a template. Hence, one should be careful about deleting any instances. |
Thank you for the effort, Bogdan! I followed your advice to get rid of VDIs causing the issue.
[root@xenserver1 EBS demo (12.2.8)]# vhd-util scan -f -m "VHD-*" -l VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c -p
vhd=VHD-34060004-645d-4a4e-84dd-832f058e5adf capacity=107374182400 size=40647000064 hidden=1 parent=none
I'm not performing backups for now as I don't want the issue to be repeated. |
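As an aside, on an LVM-based SR the LV name VHD-<uuid> reported by vhd-util normally carries the VDI's UUID, so a hedged way to cross-check such an entry against the XAPI database before touching it (UUID taken from the scan output above):
# Does XAPI still know about this VDI, and is it a snapshot or still managed?
xe vdi-list uuid=34060004-645d-4a4e-84dd-832f058e5adf params=uuid,name-label,is-a-snapshot,managed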
Have you considered upgrading to 3.25 and trying with a very small subset of copied VMs, perhaps on a non-production SR? Some issues with snapshots can arise if the snapshot chain is too long or there are space issues on the SR on which the VMs reside. |
I was already on 3.25 and still having issues, so I will wait for further news from Bogdan. Also, it looks like VmBackup.py 3.25 has an older version number inside; can you please check: #NAUVmBackup/VmBackup.py V3.24 June 2018 |
@sniperkitten - fixed in-line version number and comments in source code. Thanks for pointing that out! |
Hi |
Sorry, I have not had any time to investigate this further. In general, orphans like this are a result of lack of space in which cleanup is possible, hence the suggestion to test with a small subset of VMs and plenty of storage space to work with. This is also possibly a fundamental XS/CH issue as orphaned VDIs have been a problem for years. |
Hello! I've checked the issue against patches up to XS71ECU2016 and will have another call with Citrix. I hope it will be a call with the next level of their support. I've reproduced this with Xen Orchestra as well (because Citrix wants to discuss only "CitrixReady" solutions). |
I've just finished a call with support regarding our support case. From this call I understand that they have a couple of clients with a similar problem and will escalate the cases further. I hope there will be a hotfix one day. |
Thank you so much for sharing this, Bogdan! |
I've been testing a private hotfix from Citrix since last week. |
Good news! Did it work for you? |
Please try this: https://support.citrix.com/article/CTX262019 I was given a private hotfix that fixed the zombies, but the coalescing process that runs for many days is still an issue. |
Thank you for the update, Bogdan! |
@sniperkitten I guess they will release a fix in CH8.1 |
We installed CH8.1 and it seems to fix the issue with zombie processes. |
Did it improve the coalescing process itself or just fix the zombies (like the update for XS 7.1)? |
I tested it and the 8.1 update only fixed the zombies. I created test snapshots and deleted them right away. For relatively small disks (~100G) coalescing happened successfully. However, for one 300G VDI I found the following in /var/log/SMlog:
SMGC: [11948] SR bac4 ('SDD 4TB') (14 VDIs in 12 VHD trees): showing only VHD trees that changed:
Used this command to find leaves:
vhd-util scan -f -m "VHD-*" -l VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c -p
vhd=VHD-bda8bdc2-d549-4b77-b460-63bc23219d61 capacity=10737418240 size=10368319488 hidden=1 parent=none
I tried your fix, Bogdan, from xcp-ng/xcp#298:
LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024
After rerunning sr_scan, coalescing successfully completed. |
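For context, a hedged sketch of where that tweak is commonly applied, assuming the constant lives in the storage manager's garbage collector at /opt/xensource/sm/cleanup.py as in the stock sm package (take a backup of the file first; changing it alters host behaviour):
# Show the current live leaf-coalesce size limit
grep -n 'LIVE_LEAF_COALESCE_MAX_SIZE' /opt/xensource/sm/cleanup.py
# After editing the value, re-trigger the garbage collector with a rescan
xe sr-scan uuid=<sr-uuid>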
I wonder if it's related to this issue: https://bugs.xenserver.org/browse/XSO-966 |
Hello!
Recently I noticed many defunct processes on XenServer after executing NAUBackup:
root 31292 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
root 31331 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
root 31351 807 0 07:12 ? 00:00:00 [LVMSR] <defunct>
root 31370 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
root 31401 807 0 02:40 ? 00:00:00 [LVMSR] <defunct>
root 31523 807 0 04:07 ? 00:00:00 [LVMSR] <defunct>
root 31574 807 0 02:52 ? 00:00:00 [LVMSR] <defunct>
root 31577 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
They all have the same parent:
root 807 1 0 Jun25 ? 00:04:45 /usr/bin/python /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><me
root 862 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
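As an aside, a small hedged sketch for quantifying the problem from dom0 (PID 807 is taken from the listing above; on another host the stuck parent's PID will differ):
# Count zombie children of the stuck LVMSR parent
ps -eo ppid=,stat= | awk '$1 == 807 && $2 ~ /^Z/' | wc -l
# Show the parent's full command line and how long it has been running
ps -p 807 -o pid,etime,cmd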
I found the following errors in SMlog:
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog"
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] '
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog !
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog"
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] '
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog !
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog"
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] '
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog !
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog"
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] '
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog !
The following messages, referencing the same VDI IDs, are taken from the NAUBackup log:
2019-06-25-(15:12:22) - The following items are about to be destroyed
2019-06-25-(15:12:22) - VM : 2a0bc086-1ef0-f404-3660-a58b83573c6d (RESTORE_EBS r1213a)
2019-06-25-(15:12:22) - VDI: cc4d186a-1846-47eb-9dcb-ea48bccd6d85 (EBS r1213a upgraded OEL 6 root "/" mountpoint)
2019-06-25-(15:12:22) - VDI: 71494603-83b2-403f-83cc-329cadce9d6f (EBS r1213a upgraded OEL 6 /d01 EBS mountpoint)
XenServer and Xen SDK version: 8.0
NAUBackup version: V3.24
Please help me with the investigation.
Thank you!