Saturday, July 6, 2013

How to detach a storage device from a running virtual machine in VirtualBox

In this post I want to show how to test what happens to your cluster when you simply detach a shared storage device from it. VirtualBox is a great tool for that purpose because you can test such a scenario without involving many people or causing any damage.

I am aware that a virtual environment on my notebook is a completely different environment than production. Please let me know if this test does not even resemble situations that could happen in production. The goal is to test OCFS2 node fencing after the device is detached.


My environment is an HP notebook with Windows 7, VirtualBox 4.2.16 and two Oracle Linux 64-bit virtual machines.

Let the test begin...

Shut down the virtual machines and create a shareable virtual disk.
C:\>cd "Program Files\Oracle\VirtualBox"

> VBoxManage.exe createhd --filename D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi ^
--size 1024 --format VDI --variant Fixed

0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Disk image created. UUID: 862b1d65-eb04-42b2-8a1e-eafafb5bbcd3
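
To double-check the newly created disk image, showhdinfo can print its UUID, format and size (just a sanity check, not strictly required):
> VBoxManage.exe showhdinfo D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi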

Connect the disk to the virtual machines Cluster1 and Cluster2.
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 ^
--type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable

> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 ^
--type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable
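
To confirm that the same disk is attached to both machines, I can filter the VM storage configuration for the disk name (findstr plays the role of grep on Windows):
> VBoxManage.exe showvminfo Cluster1 | findstr disk3
> VBoxManage.exe showvminfo Cluster2 | findstr disk3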

Start the virtual machines and partition the newly added disk on the Cluster1 node.
[root@cluster1 ~]# fdisk /dev/sdf
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-130, default 1): 
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-130, default 130): 
Using default value 130

Command (m for help): p

Disk /dev/sdf: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1               1         130     1044193+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

Inform the OS of the partition table change by running partprobe on both nodes.
[root@cluster1 ~]# /sbin/partprobe /dev/sdf

[root@cluster1 ~]# ssh -l root cluster2
root@cluster2's password: 
Last login: Sat Jul  6 10:12:58 2013 from 192.168.56.101

[root@cluster2 ~]# /sbin/partprobe /dev/sdf
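
A quick fdisk -l on Cluster2 should now show the same sdf1 partition, confirming that both nodes see the updated partition table:
[root@cluster2 ~]# fdisk -l /dev/sdf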

As I am using OCFS2 as the shared-disk cluster file system, I must create an OCFS2 file system on the device. Execute this command on just one node.

[root@cluster1 ~]# mkfs.ocfs2 -b 4K -C 128K -N 4 -L disk3 /dev/sdf1
mkfs.ocfs2 1.6.3
Cluster stack: classic o2cb
Label: disk3
Features: sparse backup-super unwritten inline-data strict-journal-super
Block size: 4096 (12 bits)
Cluster size: 131072 (17 bits)
Volume size: 1069154304 (8157 clusters) (261024 blocks)
Cluster groups: 1 (tail covers 8157 clusters, rest cover 8157 clusters)
Extent allocator size: 4194304 (1 groups)
Journal size: 16777216
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful
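
To double-check the new file system, the mounted.ocfs2 utility from ocfs2-tools (the same package mkfs.ocfs2 comes from) can print its label and UUID in quick detect mode:
[root@cluster1 ~]# mounted.ocfs2 -d /dev/sdf1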

Create mount point directories and mount the virtual storage device on both nodes.
[root@cluster1 ~]# mkdir /disk3
[root@cluster1 ~]# mount -t ocfs2 -o datavolume,nointr,noatime -L "disk3" /disk3

[root@cluster1 ~]# ssh -l root cluster2
root@cluster2's password: 
Last login: Sat Jul  6 10:13:14 2013 from 192.168.56.101

[root@cluster2 ~]# mkdir /disk3
[root@cluster2 ~]# mount -t ocfs2 -o datavolume,nointr,noatime -L "disk3" /disk3
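
With the volume mounted on both nodes, mounted.ocfs2 in full detect mode should report which cluster nodes currently have it mounted; a simple mount | grep ocfs2 works as well:
[root@cluster1 ~]# mounted.ocfs2 -f /dev/sdf1
[root@cluster1 ~]# mount | grep ocfs2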

Now I want to test what happens if I simply detach the shared storage device.

Using "del" command to delete file that represents shared storage for virtual machines won’t work.
C:\>del d:\VirtualneMasine\ClusterSharedDisks\disk3.vdi

d:\VirtualneMasine\ClusterSharedDisks\disk3.vdi
The process cannot access the file because it is being used by another process.


So, how do you detach the device while the virtual machines are running?

Detach the device using VBoxManage:
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 --medium none
> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 --medium none


Check /var/log/messages on Cluster2.

Jul  6 10:33:37 cluster2 kernel: ata6: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul  6 10:33:37 cluster2 kernel: ata6: irq_stat 0x80400000, PHY RDY changed
Jul  6 10:33:37 cluster2 kernel: ata6: SError: { PHYRdyChg }
Jul  6 10:33:37 cluster2 kernel: ata6: hard resetting link
Jul  6 10:33:38 cluster2 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul  6 10:33:38 cluster2 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul  6 10:33:43 cluster2 kernel: ata6: hard resetting link
Jul  6 10:33:43 cluster2 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul  6 10:33:43 cluster2 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul  6 10:33:48 cluster2 kernel: ata6: hard resetting link
Jul  6 10:33:49 cluster2 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul  6 10:33:49 cluster2 kernel: ata6.00: disabled
Jul  6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul  6 10:33:49 cluster2 kernel: sd 5:0:0:0: SCSI error: return code = 0x00010000
Jul  6 10:33:49 cluster2 kernel: end_request: I/O error, dev sdf, sector 2879
Jul  6 10:33:49 cluster2 kernel: (kjournald,415,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:49 cluster2 kernel: sd 5:0:0:0: rejecting I/O to offline device
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:49 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:49 cluster2 kernel: ata6: EH complete
Jul  6 10:33:49 cluster2 kernel: ata6.00: detaching (SCSI 5:0:0:0)
Jul  6 10:33:51 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul  6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:51 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul  6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:51 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:53 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul  6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Jul  6 10:33:53 cluster2 kernel: scsi 5:0:0:0: rejecting I/O to dead device
Jul  6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_bio_end_io:241 ERROR: IO Error -5
Jul  6 10:33:53 cluster2 kernel: (o2hb-28851B89F3,9129,0):o2hb_do_disk_heartbeat:772 ERROR: status = -5

Notice the heartbeat errors caused by the missing device. The error repeats every 2 seconds until the heartbeat timeout is reached; then it is time for self-fencing and the node reboots itself.
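
How long the node survives before self-fencing depends on the O2CB heartbeat dead threshold. On my Oracle Linux setup the threshold is configured in /etc/sysconfig/o2cb and reported by the o2cb service; with the default threshold of 31 and a 2 second heartbeat interval that works out to roughly a minute.
[root@cluster2 ~]# grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb
[root@cluster2 ~]# service o2cb status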


In another test I will unmount the device before detaching it.
# Cluster1
[root@cluster1 ~]# umount -t ocfs2 /disk3

# Cluster2
[root@cluster2 ~]# umount -t ocfs2 /disk3
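
Just to be sure, check that the volume is no longer mounted anywhere before detaching it:
[root@cluster1 ~]# mounted.ocfs2 -f /dev/sdf1
[root@cluster2 ~]# mounted.ocfs2 -f /dev/sdf1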

Detach using VBoxManage:
> VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 --medium none
> VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 --medium none

Check /var/log/messages on Cluster1:
Jul  6 10:30:52 cluster1 kernel: ocfs2: Unmounting device (8,81) on (node 0)
Jul  6 10:31:38 cluster1 kernel: ata6: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul  6 10:31:38 cluster1 kernel: ata6: irq_stat 0x80400000, PHY RDY changed
Jul  6 10:31:38 cluster1 kernel: ata6: SError: { PHYRdyChg }
Jul  6 10:31:38 cluster1 kernel: ata6: hard resetting link
Jul  6 10:31:39 cluster1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul  6 10:31:39 cluster1 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul  6 10:31:44 cluster1 kernel: ata6: hard resetting link
Jul  6 10:31:44 cluster1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul  6 10:31:44 cluster1 kernel: ata6: failed to recover some devices, retrying in 5 secs
Jul  6 10:31:49 cluster1 kernel: ata6: hard resetting link
Jul  6 10:31:50 cluster1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Jul  6 10:31:50 cluster1 kernel: ata6.00: disabled
Jul  6 10:31:50 cluster1 kernel: ata6: EH complete
Jul  6 10:31:50 cluster1 kernel: ata6.00: detaching (SCSI 5:0:0:0)

Both nodes stayed up and running without heartbeat errors.
I can conclude from this test that if you unmount the OCFS2 device on both nodes before detaching it, everything should continue to work without sudden reboots.


If you want to attach the virtual storage again, just shut down the virtual machines and connect the device using the commands from the beginning of the post.
>VBoxManage.exe storageattach Cluster1 --storagectl "SATA" --port 5 --device 0 ^
--type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable

>VBoxManage.exe storageattach Cluster2 --storagectl "SATA" --port 5 --device 0 ^
--type hdd --medium D:\VirtualneMasine\ClusterSharedDisks\disk3.vdi --mtype shareable
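
After the machines are started again, remount the file system on both nodes with the same mount command as before:
[root@cluster1 ~]# mount -t ocfs2 -o datavolume,nointr,noatime -L "disk3" /disk3
[root@cluster2 ~]# mount -t ocfs2 -o datavolume,nointr,noatime -L "disk3" /disk3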


If you have the opportunity to perform tests on real hardware, that should always be your first choice. But if you are unable to do that, performing tests in a virtual environment is better than nothing.

