I am aware that a virtual environment on my notebook is completely different from production. Please let me know if this test does not resemble situations that could happen in production. The goal is to test OCFS2 node fencing after a device is detached.
My environment is an HP notebook with Windows 7, VirtualBox 4.2.16 and two 64-bit Oracle Linux virtual machines.
Let the test begin...
Shut down the virtual machines and create a shareable virtual disk.
C:\>cd "Program Files\Oracle\VirtualBox"
> VBoxManage.exe createhd --filename D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi ^
  --size 1024 --format VDI --variant Fixed
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Disk image created. UUID: 862b1d65-eb04-42b2-8a1e-eafafb5bbcd3
Connect the disk to the virtual machines Cluster1 and Cluster2.
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 ^
  --type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable
> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 ^
  --type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable
Start the virtual machines and partition the newly added disk on the Cluster1 node.
[root@cluster1 ~]# fdisk /dev/sdf
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-130, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-130, default 130):
Using default value 130
Command (m for help): p
Disk /dev/sdf: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1               1         130     1044193+  83  Linux
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
Inform the OS of partition table changes using partprobe.
[root@cluster1 ~]# /sbin/partprobe /dev/sdf
[root@cluster1 ~]# ssh -l root cluster2
root@cluster2's password:
Last login: Sat Jul 6 10:12:58 2013 from 192.168.56.101
[root@cluster2 ~]# /sbin/partprobe /dev/sdf
Since I am using OCFS2 as the shared-disk cluster file system, I must create an OCFS2 file system on the device. Execute this command on just one node.
[root@cluster1 ~]# mkfs.ocfs2 -b 4K -C 128K -N 4 -L disk3 /dev/sdf1
mkfs.ocfs2 1.6.3
Cluster stack: classic o2cb
Label: disk3
Features: sparse backup-super unwritten inline-data strict-journal-super
Block size: 4096 (12 bits)
Cluster size: 131072 (17 bits)
Volume size: 1069154304 (8157 clusters) (261024 blocks)
Cluster groups: 1 (tail covers 8157 clusters, rest cover 8157 clusters)
Extent allocator size: 4194304 (1 groups)
Journal size: 16777216
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful
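The sizes reported by mkfs.ocfs2 can be cross-checked with a little shell arithmetic. This is only a sanity-check sketch using the numbers from the output above; the helper name bytes_for is made up for illustration.

```shell
# Multiply a unit count by a unit size in bytes.
# Hypothetical helper, used only to cross-check the mkfs.ocfs2 output.
bytes_for() {
    echo $(( $1 * $2 ))
}

bytes_for 8157 131072    # 8157 clusters of 128K each
bytes_for 261024 4096    # 261024 blocks of 4K each
```

Both calls print 1069154304, which matches the reported volume size.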
Create the mount point directories and mount the virtual storage device on both nodes.
[root@cluster1 ~]# mkdir /disk3
[root@cluster1 ~]# mount -t ocfs2 -o datavolume,nointr,noatime -L "disk3" /disk3
[root@cluster1 ~]# ssh -l root cluster2
root@cluster2's password:
Last login: Sat Jul 6 10:13:14 2013 from 192.168.56.101
[root@cluster2 ~]# mkdir /disk3
[root@cluster2 ~]# mount -t ocfs2 -o datavolume,nointr,noatime -L "disk3" /disk3
Now I want to test what happens if I simply detach the storage device.
Using the "del" command to delete the file that backs the shared storage for the virtual machines won't work.
C:\>del d:\VirtualneMasine\ClusterSharedDisks\disk3.vdi
d:\VirtualneMasine\ClusterSharedDisks\disk3.vdi
The process cannot access the file because it is being used by another process.
So how do I detach the device while the virtual machines are running?
Detach the device using VBoxManage:
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 --medium none
> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 --medium none
Check /var/log/messages on Cluster2.
Jul 6 10:33:37 cluster2 kernel: ata6: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul 6 10:33:37 cluster2 kernel: ata6: irq_stat 0x80400000, PHY RDY changed
Jul 6 10:33:37 cluster2 kernel: ata6: SError: { PHYRdyChg }
Jul 6 10:33:37 cluster2 kernel: ata6: hard resetting link
Jul 6 10:33:38 cluster2 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul 6 10:33:38 cluster2 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul 6 10:33:43 cluster2 kernel: ata6: hard resetting link
Jul 6 10:33:43 cluster2 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul 6 10:33:43 cluster2 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul 6 10:33:48 cluster2 kernel: ata6: hard resetting link
Jul 6 10:33:49 cluster2 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul 6 10:33:49 cluster2 kernel: ata6.00: disabled
Jul 6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul 6 10:33:49 cluster2 kernel: sd 5:0:0:0: SCSI error: return code = 0x00010000
Jul 6 10:33:49 cluster2 kernel: end_request: I/O error, dev sdf, sector 2879
Jul 6 10:33:49 cluster2 kernel: (kjournald,415,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:49 cluster2 kernel: ata6: EH complete
Jul 6 10:33:49 cluster2 kernel: ata6.00: detaching (SCSI 5:0:0:0)
Jul 6 10:33:51 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul 6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:51 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul 6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:53 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul 6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul 6 10:33:53 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul 6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul 6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Notice the heartbeat errors caused by the missing device. The error repeats every 2 seconds until the heartbeat timeout is reached, and then the node fences itself.
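How long the node keeps retrying depends on the O2CB heartbeat threshold. As far as I know from the OCFS2 documentation, the timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, with the threshold configured in /etc/sysconfig/o2cb. Treat both the formula and the example value of 31 as assumptions and verify them against your own OCFS2 version. A minimal sketch (the function name is hypothetical):

```shell
# Estimate the self-fencing timeout in seconds for a given heartbeat threshold.
# Assumes the 2-second heartbeat interval seen in the log and the formula
# (threshold - 1) * 2; check both against your OCFS2 release.
fence_timeout_secs() {
    threshold=$1
    echo $(( (threshold - 1) * 2 ))
}

fence_timeout_secs 31    # prints 60
```

On the nodes themselves, the argument would be the O2CB_HEARTBEAT_THRESHOLD value from /etc/sysconfig/o2cb rather than a hard-coded number.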
In another test I will unmount the device before detaching it.
# Cluster1
[root@cluster1 ~]# umount -t ocfs2 /disk3
# Cluster2
[root@cluster2 ~]# umount -t ocfs2 /disk3
Detach using VBoxManage:
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 --medium none
> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 --medium none
Check /var/log/messages on Cluster1:
Jul 6 10:30:52 cluster1 kernel: ocfs2: Unmounting device (8,81) on (node 0)
Jul 6 10:31:38 cluster1 kernel: ata6: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul 6 10:31:38 cluster1 kernel: ata6: irq_stat 0x80400000, PHY RDY changed
Jul 6 10:31:38 cluster1 kernel: ata6: SError: { PHYRdyChg }
Jul 6 10:31:38 cluster1 kernel: ata6: hard resetting link
Jul 6 10:31:39 cluster1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul 6 10:31:39 cluster1 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul 6 10:31:44 cluster1 kernel: ata6: hard resetting link
Jul 6 10:31:44 cluster1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul 6 10:31:44 cluster1 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul 6 10:31:49 cluster1 kernel: ata6: hard resetting link
Jul 6 10:31:50 cluster1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul 6 10:31:50 cluster1 kernel: ata6.00: disabled
Jul 6 10:31:50 cluster1 kernel: ata6: EH complete
Jul 6 10:31:50 cluster1 kernel: ata6.00: detaching (SCSI 5:0:0:0)
Both nodes stayed up and running without heartbeat errors.
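One way to confirm the absence of heartbeat errors is to count the matching lines in the log instead of reading it by eye. The sketch below is only an illustration: the function name is made up, and the pattern is taken from the error lines in the first test.

```shell
# Count disk heartbeat error lines in syslog-style input.
# Reads the files given as arguments, or standard input if there are none.
# Note: grep exits non-zero when the count is 0, but it still prints 0.
count_o2hb_errors() {
    grep -c 'o2hb_do_disk_heartbeat.*ERROR' "$@"
}

# Two sample lines copied from the first test; only the first one matches.
printf '%s\n' \
    'Jul 6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5' \
    'Jul 6 10:33:49 cluster2 kernel: ata6: EH complete' | count_o2hb_errors
```

On a node, running count_o2hb_errors /var/log/messages should report 0 for the time window of the second test.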
I can conclude from this test that if you unmount the OCFS2 device from both nodes before detaching it, everything should continue to work without sudden reboots.
If you want to attach the virtual storage again, just shut down the virtual machines and connect the device using the commands from the beginning of the post.
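Since the outcome depends on the volume being unmounted first, it may be worth checking that on each node before detaching. Here is a minimal sketch, assuming the mount table is available in /proc/mounts; the function name is hypothetical and /disk3 is the mount point used in this post.

```shell
# Succeed if the given mount point is listed as an ocfs2 mount.
# $1: mount point, $2: mount table to read (defaults to /proc/mounts).
is_ocfs2_mounted() {
    awk -v mp="$1" '$2 == mp && $3 == "ocfs2" { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

if is_ocfs2_mounted /disk3; then
    echo "/disk3 is still mounted, unmount it before detaching"
else
    echo "/disk3 is not mounted as ocfs2"
fi
```

The optional second argument exists only so the check can be tried against a saved copy of the mount table.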
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 ^
  --type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable
> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 ^
  --type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable
If you have the opportunity to test on real hardware, that should always be your first choice. But if you cannot, testing in a virtual environment is better than nothing.