
Failed to find logical volume messages in SMlog and defunct processes #91

Open

sniperkitten opened this issue Jun 26, 2019 · 38 comments

@sniperkitten

Hello!

Recently I noticed many defunct processes on XenServer after executing NAUBackup:

root 31292 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
root 31331 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
root 31351 807 0 07:12 ? 00:00:00 [LVMSR] <defunct>
root 31370 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>
root 31401 807 0 02:40 ? 00:00:00 [LVMSR] <defunct>
root 31523 807 0 04:07 ? 00:00:00 [LVMSR] <defunct>
root 31574 807 0 02:52 ? 00:00:00 [LVMSR] <defunct>
root 31577 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>

They all have the same parent:

root 807 1 0 Jun25 ? 00:04:45 /usr/bin/python /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><me
root 862 807 0 Jun25 ? 00:00:00 [LVMSR] <defunct>

I found the following errors in SMlog:

Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog"
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] '
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog !
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog"
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] '
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog !
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog"
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] '
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog !
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog"
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] '
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog !

The following messages are taken from the NAUBackup log, with the same VDI IDs:

2019-06-25-(15:12:22) - The following items are about to be destroyed
2019-06-25-(15:12:22) - VM : 2a0bc086-1ef0-f404-3660-a58b83573c6d (RESTORE_EBS r1213a)
2019-06-25-(15:12:22) - VDI: cc4d186a-1846-47eb-9dcb-ea48bccd6d85 (EBS r1213a upgraded OEL 6 root "/" mountpoint)
2019-06-25-(15:12:22) - VDI: 71494603-83b2-403f-83cc-329cadce9d6f (EBS r1213a upgraded OEL 6 /d01 EBS mountpoint)

XenServer and Xen SDK version: 8.0
NAUBackup version: V3.24

Please help me with the investigation.

Thank you!

@NAUbackup
Owner

Did you check to see if the snapshots were correctly removed after the backup process completed?
Unfortunately, I have not had a chance to test this thoroughly yet on an 8.0 installation. Did you see anything like this under XS 7.X?
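
As a quick check (a rough sketch; the VM UUID is a placeholder), something like this should list any snapshots still attached to a VM after a backup run:

xe snapshot-list snapshot-of=<vm-uuid> params=uuid,name-label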

@sniperkitten
Author

sniperkitten commented Jun 27, 2019 via email

@NAUbackup
Owner

Not sure how such an LVM call would have been initiated via VmBackup to result in this. I will try to get a test instance running soon under 8.0 and see if I can replicate this. Sorry this is showing up. Is everything functional otherwise?

@sniperkitten
Author

Yes, everything else seems to be working fine.
Backups finish successfully regardless of this issue.
I restored one of the VMs from one such recent backup and it started normally.

@sniperkitten
Author

Hi

Today, after the number of zombie processes reached 10,000, I killed their parent and it worked as a workaround: all the zombies disappeared. Here's the process I killed:

root 807 1 0 Jun25 ? 00:44:32 /usr/bin/python /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:11a4f521-1bcb-4852-9218-444fadda3540</value></member><member><name>command</name><value>vdi_delete</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>device</name><value>/dev/disk/by-id/ata-Micron_5200_MTFDDAK3T8TDC_18531FF5401F</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:5aa061c5-eeda-485d-8786-820f56a6e90e</value></member><member><name>sr_ref</name><value>OpaqueRef:3c102b39-623d-4ee1-8065-4f5e4153d3af</value></member><member><name>sr_uuid</name><value>bac4ccac-5680-915a-6f84-b975379aaa1c</value></member><member><name>vdi_ref</name><value>OpaqueRef:7b5c22d8-6a75-4f55-b1db-387e967aaa3a</value></member><member><name>vdi_location</name><value>cc4d186a-1846-47eb-9dcb-ea48bccd6d85</value></member><member><name>vdi_uuid</name><value>cc4d186a-1846-47eb-9dcb-ea48bccd6d85</value></member><member><name>subtask_of</name><value>DummyRef:|7a421cce-426e-4034-a7b9-7d8f2b194932|VDI.destroy</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
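
For anyone else hitting this, a rough sketch of how the zombies and their hung parent can be located and cleaned up (PID 807 is just what it was on my host):

ps -ef | grep '[L]VMSR' | grep defunct    # list the defunct children
ps -ef | awk '$3 == 807'                  # everything whose parent is PID 807
kill 807                                  # once the hung vdi_delete parent exits, init reaps the zombies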

@NAUbackup
Owner

It seems that the parent process doesn't terminate normally, which exacerbates the issue. I would be curious whether the same thing happens if you run another backup. I do not see anything like this under 7.X, hence it is clearly related to changes in 8.0.

@sniperkitten
Author

Hi

Reproduced the issue after getting rid of the zombie processes.
The first backup completed normally, so I initiated a second one.
The snapshot got deleted successfully, but this process was left hanging along with its children:

root 18663 1 0 08:43 ? 00:00:04 /usr/bin/python /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:11a4f521-1bcb-4852-9218-444fadda3540</value></member><member><name>command</name><value>vdi_delete</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>device</name><value>/dev/disk/by-id/ata-Micron_5200_MTFDDAK3T8TDC_18531FF5401F</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:5e3e4027-3a05-4ff6-889d-4d1f9fb2b79b</value></member><member><name>sr_ref</name><value>OpaqueRef:3c102b39-623d-4ee1-8065-4f5e4153d3af</value></member><member><name>sr_uuid</name><value>bac4ccac-5680-915a-6f84-b975379aaa1c</value></member><member><name>vdi_ref</name><value>OpaqueRef:06fa189d-1500-4bae-a008-3d1ad0a11b57</value></member><member><name>vdi_location</name><value>03d9f8ee-43b9-4583-bf42-0194a86acbe3</value></member><member><name>vdi_uuid</name><value>03d9f8ee-43b9-4583-bf42-0194a86acbe3</value></member><member><name>subtask_of</name><value>DummyRef:|53937de2-c14b-4ed4-bb20-1a93802e371f|VDI.destroy</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
root 19245 18663 0 09:01 ? 00:00:00 [LVMSR] <defunct>
root 21032 18663 0 09:02 ? 00:00:00 [LVMSR] <defunct>

I killed it again and the zombie processes disappeared as well.

@sniperkitten
Author

Hi

Please tell me, is there any news?

Thank you

@NAUbackup
Owner

I am having issues getting the NFS network connected to my test servers and need to work with our network operations group. Until that's rectified, I won't have a CH 8.0 system I can test with. Sorry it's taking this long.

@BogdanRudas

We are using our own backup scripts to export VM snapshots, but we faced the same issue.
Snapshots look deleted in XenCenter, but the LVM SR space is not freed.
This applies to XenServer XS71ECU2015 and most probably to Hypervisor 8. We have no evidence of this in XCP-ng 7.6 (but we probably need to deploy XCP-ng 8, or just update our 7.6 hosts, which are not as big an installation as our licensed Citrix products).

@NAUbackup
Owner

Make sure you don't have too many snapshots already and there is enough room to be able to take an additional snapshot. If the SR is too full, the coalesce process will sometimes also fail to work. You can try manually rescanning the SR, or go and delete snapshots manually first to ensure there is enough free space. Worst case you may have to shuffle VMs around to free up some space and also potentially force the coalescing to kick in.
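
Roughly, something like this should show whether space is the problem and trigger a manual rescan (the SR UUID here is the one from this thread; substitute your own):

xe sr-list uuid=bac4ccac-5680-915a-6f84-b975379aaa1c params=name-label,physical-size,physical-utilisation
xe snapshot-list --minimal | tr ',' '\n' | wc -l    # rough count of snapshots in the pool
xe sr-scan uuid=bac4ccac-5680-915a-6f84-b975379aaa1c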

@BogdanRudas

I know about such restrictions. In my case I have at least triple the VM size free on that SR.
The process tree looks like this:

systemd --switched-root --system --deserialize 21
|-LVMSR /opt/xensource/sm/LVMSRvdi_deleteh
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)
| |-(LVMSR)

Looking through our monitoring, it seems that the issue was caused by some update that I applied on XenServer 7.1 before July 28th. Since then it has been triggering alerts from time to time, and the maximum number of processes I've seen in monitoring was 3k+ (and I haven't finished reviewing the monitoring yet). I submitted a case to Citrix. Once I killed that parent process, all the zombies disappeared.
I'm sorry for discussing this here, but I think it is a XenServer issue and not related to NAUbackup itself.

@sniperkitten
Author

Thank you for the feedback, Bogdan!
It really looks like a XenServer problem.
Can you please share the solution from Citrix if they manage to resolve it?

@sniperkitten
Author

Bogdan, did you find a way to manually free space as a workaround? If yes, please share. Thank you.

@BogdanRudas

There are two ways:

  1. Gracefully move all VMs to another host using Storage XenMotion, reboot, and do an sr_scan.
  2. Check pstree, kill the parent LVMSR vdi_delete process (the zombies then disappear), and finally do an sr_scan. I'm unsure if this way is safe, but it worked for me a while ago.

Finally, have a look at xe vdi-list and clean up the orphaned VDIs. There are many discussions about this on Google and especially on the Citrix forums.
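
A rough sketch of option 2 (use with care; the parent PID and SR UUID below are placeholders):

pstree -p | grep -i lvmsr                      # find the stuck LVMSR vdi_delete parent
kill <parent-pid>                              # the zombies are reaped once the parent exits
xe sr-scan uuid=<sr-uuid>                      # rescan the SR afterwards
xe vdi-list sr-uuid=<sr-uuid> managed=false    # then review leftover/orphaned VDIs before any manual cleanup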

@sniperkitten
Author

Actually, sr_scan also hung and caused zombie processes.
I will try to find the orphaned VDIs.

@NAUbackup
Owner

Unfortunately, the coalesce_leaf command was removed from later versions of XenServer/Citrix Hypervisor, and hence the clean-up of orphaned VDIs is now by and large a manual process. Moving VMs to other SRs and rescanning or even reinitializing the original SR is indeed one option. When manually deleting VDIs, make sure they're not a template or some other form of VDI that may not necessarily have an associated parent.

@BogdanRudas

I've found some correlation for my case:
The issue with the number of zombies started when I ran the backup for the first time after installing the listed patches (installed the same day):
XS71ECU2009
XS71ECU2010
XS71ECU2011

Also, I found that the issue frequently reproduces on our Elasticsearch virtual machines. Some other types of VMs are affected too, but ELK with its moderate but constant load is a good debug sample.

I will re-image one host, apply patches up to XS71ECU2008, put an ELK VM on it, and see if the zombies appear again. Then I will start applying patches one by one until LVMSR misbehaves again.

Regarding Citrix support, I was told to try some product from https://citrixready.citrix.com/category-results.html?search=backup
I will go with XOA against a fully patched XS71CU2 on HCL-approved hardware. I bet any other backup software will do nearly the same as mine, so I should be able to reproduce the issue on the ELK VM too.

@BogdanRudas

Well, I was able to reproduce the issue using "Full backup" from XOA and even using Storage XenMotion from one server to another. I'm going to reinstall XenServer 7.1, then apply patches one by one.

As a quick and dirty hack, I pause the VMs (much faster than suspend) on the affected hosts and wait ~15-30 minutes until coalescing does its job.
It is possible to narrow the list of VMs to be paused by reviewing the VHD tree for a particular VM; see the syntax here: https://support.citrix.com/article/CTX139224
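
Roughly what the pause hack looks like (this sketch pauses every running non-control-domain VM in the pool; narrow it down, e.g. with resident-on=<host-uuid>, if needed):

for uuid in $(xe vm-list is-control-domain=false power-state=running --minimal | tr ',' ' '); do
    xe vm-pause uuid=$uuid
done
# wait ~15-30 minutes for coalescing (watch the SMGC lines in /var/log/SMlog), then:
for uuid in $(xe vm-list is-control-domain=false power-state=paused --minimal | tr ',' ' '); do
    xe vm-unpause uuid=$uuid
done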

@NAUbackup
Owner

Note that with a vhd-util scan, just because no parent is found doesn't necessarily mean it is an orphan; it might just be a template. Hence, one should be careful about deleting any instances.

@sniperkitten
Author

Thank you for the effort, Bogdan!

I followed your advice to get rid of VDIs causing the issue.

  1. I identified VMs that had hidden VDIs:

[root@xenserver1 EBS demo (12.2.8)]# vhd-util scan -f -m "VHD-*" -l VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c -p

vhd=VHD-34060004-645d-4a4e-84dd-832f058e5adf capacity=107374182400 size=40647000064 hidden=1 parent=none
vhd=VHD-b39ca920-73fe-499a-83dd-d3de8a8d5b67 capacity=107374182400 size=107592286208 hidden=0 parent=VHD-34060004-645d-4a4e-84dd-832f058e5adf

  2. Then I stopped the VMs and copied them to the same SR, as it had enough space.

  3. After starting the copied VM and a health check, I removed the source VM. The hidden VDIs were gone.
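
For reference, a hedged sketch of how the non-hidden child from the scan in step 1 can be mapped back to its VM (the VDI UUID is the VHD name without the "VHD-" prefix):

xe vbd-list vdi-uuid=b39ca920-73fe-499a-83dd-d3de8a8d5b67 params=vm-name-label,vm-uuid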

I'm not performing backups for now as I don't want the issue to be repeated.

@NAUbackup
Owner

NAUbackup commented Aug 30, 2019

Have you considered upgrading to 3.25 and trying with a very small subset of copied VMs, perhaps on a non-production SR? Some issues with snapshots can arise if the snapshot chain is too long or there are space issues on the SR on which the VMs reside.

@sniperkitten
Author

I was already on 3.25 and am still having issues, so I will wait for further news from Bogdan.

Also, it looks like VmBackup.py 3.25 has an older version string inside; can you please check:

#NAUVmBackup/VmBackup.py

V3.24 June 2018

@NAUbackup
Owner

@sniperkitten - fixed in-line version number and comments in source code. Thanks for pointing that out!

@sniperkitten
Author

Hi
Is there any news regarding this issue?
Thank you

@NAUbackup
Owner

Sorry, I have not had any time to investigate this further. In general, orphans like this are the result of a lack of space in which cleanup is possible, hence the suggestion to test with a small subset of VMs and plenty of storage space to work with. This is also possibly a fundamental XS/CH issue, as orphaned VDIs have been a problem for years.

@BogdanRudas

Hello!

I've checked the issue against patches up to XS71ECU2016 and will have another call with Citrix. I hope it will be a call with the next level of their support.

I've reproduced this with Xen Orchestra as well (because Citrix wants to discuss only "CitrixReady" solutions).

@BogdanRudas

I've just finished a call with support regarding our case. From this call I understand that they have a couple of clients with a similar problem and will escalate the cases further. I hope there will be a hotfix one day.

@sniperkitten
Author

sniperkitten commented Sep 19, 2019 via email

@BogdanRudas

I've been testing a private hotfix from Citrix since last week.

@sniperkitten
Author

Good news! Did it work for you?

@BogdanRudas

Please try this: https://support.citrix.com/article/CTX262019

I was given a private hotfix that fixed the zombies, but the coalescing process that runs for many days is still an issue.

@sniperkitten
Author

Thank you for the update, Bogdan!
It looks like the update is not available for my company's account, so I have to wait for the fix to be made publicly available.
Also, it's for version 7.1. I hope Citrix will include it in some public update for Hypervisor 8.0.

@BogdanRudas

@sniperkitten I guess they will release a fix in CH8.1

@sniperkitten
Author

We installed CH8.1 and it seems to fix the issue with zombie processes.
Thanks everyone for help.

@BogdanRudas

Did it improve the coalescing process itself, or just fix the zombies (like the update for XS 7.1)?

@sniperkitten
Author

I tested it, and the 8.1 update only fixed the zombies. I created test snapshots and deleted them right away. For relatively small disks (~100G) coalescing happened successfully. However, for one 300G VDI I found the following error in /var/log/SMlog:

SMGC: [11948] SR bac4 ('SDD 4TB') (14 VDIs in 12 VHD trees): showing only VHD trees that changed:
SMGC: [11948] eea5e8a0VHD
SMGC: [11948] 0b1a110dVHD
SMGC: [11948]
SMGC: [11948] *~*~*~*~*~*~*~*~*~*~*~*
SMGC: [11948] ***********************
SMGC: [11948] * E X C E P T I O N *
SMGC: [11948] ***********************
SMGC: [11948] gc: EXCEPTION <class 'SR.SROSError'>, Gave up on leaf coalesce after leaf grew bigger than before snapshot taken [opterr=VDI=0b1a110dVHD]
SMGC: [11948] File "/opt/xensource/sm/cleanup.py", line 2769, in gc
SMGC: [11948] _gc(None, srUuid, dryRun)
SMGC: [11948] File "/opt/xensource/sm/cleanup.py", line 2654, in _gc
SMGC: [11948] _gcLoop(sr, dryRun)
SMGC: [11948] File "/opt/xensource/sm/cleanup.py", line 2636, in _gcLoop
SMGC: [11948] sr.coalesceLeaf(candidate, dryRun)
SMGC: [11948] File "/opt/xensource/sm/cleanup.py", line 1586, in coalesceLeaf
SMGC: [11948] self._coalesceLeaf(vdi)
SMGC: [11948] File "/opt/xensource/sm/cleanup.py", line 1789, in _coalesceLeaf
SMGC: [11948] raise xs_errors.XenError('LeafGCSkip', opterr='VDI=%s' % vdi)
SMGC: [11948]
SMGC: [11948] *~*~*~*~*~*~*~*~*~*~*~*
SMGC: [11948] * * * * * SR bac4ccac-5680-915a-6f84-b975379aaa1c: ERROR

I used this command to find the leaves:

vhd-util scan -f -m "VHD-*" -l VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c -p

vhd=VHD-bda8bdc2-d549-4b77-b460-63bc23219d61 capacity=10737418240 size=10368319488 hidden=1 parent=none
vhd=VHD-c987bb4d-3d47-4412-b0cc-fd5e1535d505 capacity=10737418240 size=10766778368 hidden=0 parent=VHD-bda8bdc2-d549-4b77-b460-63bc23219d61
vhd=VHD-eea5e8a0-32a4-4b07-9dfa-d12a35410f66 capacity=311385128960 size=304514859008 hidden=1 parent=none
vhd=VHD-0b1a110d-b3a8-47ec-b7ec-2b4ef37cda1a capacity=311385128960 size=312001691648 hidden=0 parent=VHD-eea5e8a0-32a4-4b07-9dfa-d12a35410f66

I tried your fix, Bogdan, from xcp-ng/xcp#298.
I set the following parameters in /opt/xensource/sm/cleanup.py:

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024
LIVE_LEAF_COALESCE_TIMEOUT = 300

After rerunning sr_scan, coalescing completed successfully.
Then I tried to create and delete a snapshot again; coalescing completed successfully this time as well.
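
For completeness, the rescan plus a way to watch the garbage collector while it runs (SR UUID from this thread):

xe sr-scan uuid=bac4ccac-5680-915a-6f84-b975379aaa1c
tail -f /var/log/SMlog | grep SMGC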

@NAUbackup
Owner

I wonder if it's related to this issue: https://bugs.xenserver.org/browse/XSO-966
