Virtualization with proxmox

August 18th, 2011

This year I got to work with virtualization for real for the first time.

Hardware and purpose

We have bought a Dell PE R710 with 2 Intel Xeon E5640 (4 HT cores each) and 64 GB RAM.

It is to be used for a database server, a file server and some experiments/test setups.

The database server should access the database disk partition via iSCSI. The file server should access the locally attached Dell PV MD1220 for a /home and a data partition, both of which should be NFS exported. On top of that, it should also access and NFS export a large data partition via iSCSI.

/home should be reasonably fast.

For internal disks it has 2 x 600 GB 10kRPM SAS in RAID-1 for system disks and 6 x 600 GB 10kRPM SAS in RAID-10 for the virtualization file system. To access the MD1220 it has an H800 controller; the MD1220 holds 6 x 1 TB SAS-NL 7.2kRPM in RAID-10 for the file server's /home partition and 5 x 1 TB SAS-NL 7.2kRPM in RAID-5 for a data partition.


    Virt server
  ---------------------------------------------------------------------------
 |                                                                           |
 |    DB guest 1                         FS guest 2                          |
 |   --------------------------------    --------------------------------    |
 |  | iSCSI vol via eth1 = /dev/sda  |  | iSCSI vol via eth1 = /dev/sda  |   |
 |  | LAN via eth0                   |  | LAN via eth0                   |   |
 |  |    eth0         eth1           |  | /dev/vdb via /dev/sdc on host  |   |
 |   --------------------------------   |    eth0         eth1           |   |
 |        |            |                 --------------------------------    |
 |        |       iSCSI|                      |            |      ^          |
 |        |     -------|----------------------             |     kvm         |
 |        |    |       |    -------------------------------       |          |
 | eth0 eth1 eth2 .. eth4 eth5 ....  SAS----------------------/dev/sdc       |
  ---------------------------------------------------------------------------
   LAN  LAN  LAN       |   |          |
               --------    |          |
              |   ---------           |
              |  |                   -
   Equallogic box          MD1220   |
  ----------------       ------------
 |                |     |            |
 |                |     |            |
  ----------------       ------------

Virtualization engine

We decided on proxmox because of a recommendation, because it supports both openvz and qemu-kvm, and because it runs on Debian Lenny with a proper apt repository, so things seemed clean and simple to us.

The downside is IMHO that their documentation sucks. I am not saying this to annoy anyone, but I really find it hard to find any proper documentation. It seems random what is documented and where to find it. Of course, in Open Source lack of documentation is no excuse. Also, there is fine documentation for qemu and openvz so it is only the proxmox part where one is a bit on thin ice.

The reason I am writing this post is to document what I have found out. Maybe someone will find it useful, maybe someone has some additional info that will be useful to me. Also, if I got something wrong, perhaps it will be pointed out to me by a kind soul who knows better :)

Definitions

  • Proxmox Virtual Environment: A framework with a GUI and some configuration tools to control an environment of virtualized hosts. Contains both qemu-kvm and openvz. Maintains a lenny apt repository with all the relevant packages.
  • Pve: Proxmox Virtual Environment (see above)
  • openvz: Openvz is operating system-level virtualization where you can run multiple Linux "containers" on a Linux host system. All containers share the host's kernel, so strictly speaking this is container virtualization rather than para virtualization. This is an external project that has been included in proxmox.
  • Qemu: Qemu is a generic emulator which will run full virtualization so you can run any operating system. This is an external project that has been included in proxmox.
  • Kvm: KVM is a Linux kernel module that works with qemu to allow some access to the processor's virtualization capabilities, if the processor is supported.
  • Para virtualization: Does not emulate an entire machine but provides an API to access the hardware. Only guest operating systems that support this API can run.

Decisions to be made

  • Should we run openvz or qemu-kvm?
  • How do we access the iSCSI devices?
  • How do we access the network devices?
  • How do we access the disks?

These decisions are closely connected since we do not have the same options for hardware access in openvz and kvm.

Other decisions to make are how many CPUs and how much RAM and system disk each virtualized host should have. However, this is easy to experiment with in the GUI and your needs will be different from mine, so I shall not dwell upon it.

In our case, we started out with openvz because we expected better performance. We got some experience but ended up deciding that we couldn't do what we wanted with iSCSI, so we changed to kvm. Both the database server and the file server need to access our equallogic iSCSI SAN. If I want to do this in openvz, I have to iSCSI mount the disk on the host (the host is the machine running proxmox on the actual hardware), then export it as a virtual disk to the openvz container somehow. I cannot load an iscsi module on the openvz kernel.

I also wanted to access the network cards directly from the virtualized hosts instead of having one network card pretend to be the host and all the guests at the same time. This can be done in both openvz and kvm.

Finally I wanted to be able to access the MD1220 disks directly from within the guest. I never got to researching that in openvz; in KVM I have it working but not to my full satisfaction. More about that later.

Finally finally... kvm will behave like a proper Linux, since it is a real Linux, and we get rid of some unexpected behaviour we saw with openvz.

Installing proxmox

I am not going to write anything about installing proxmox, since my colleague did the installation :) There are instructions here: http://pve.proxmox.com/wiki/Installation
However, you end up with a Lenny with a line in /etc/apt/sources.list like this:

deb http://download.proxmox.com/debian lenny pve

and relevant openvz and qemu stuff installed.

Openvz lessons learned

This is what I learned about openvz before I had to give up on iSCSI and decided to use qemu-kvm instead.

Create virtual machine in the web interface on the host OS.

  • Network: Leave it on default and if dedicated network card is needed, configure it afterwards.
  • Template: debian-6.0-standard_6.0-4_amd64 (has to be downloaded, in principle from http://pve.proxmox.com/wiki/Debian_6.0_Standard, but I had to follow the 32 bit link to http://download.proxmox.com/appliances/system/ where there actually was a 64 bit version; it was just not mentioned in the wiki. Then it has to be added to templates in the proxmox GUI.)
  • CPUs cannot be set from the start but have to be changed afterwards.

Network

I want to assign eth1 on host to be eth1 and eth4 to be eth4 in my first container (the interface names match to avoid confusion). Eth1 is connected to my main network, eth4 to my iSCSI network.

On the openvz container: /etc/network/interfaces:

        auto lo eth1 eth4
        iface lo inet loopback

        iface eth1 inet static
                address 192.168.1.17
                netmask 255.255.255.0
                gateway 192.168.1.254

        iface eth4 inet static
                address 192.168.7.53
                netmask 255.255.255.0
                mtu 9000

where the 192.168.1.0 network is the main LAN and 192.168.7.0 is the iSCSI network. The MTU of 9000 is for iSCSI; it is also set on the switch and on the equallogic iSCSI SAN. An MTU of 9000 is also known as jumbo frames.
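
A quick way to check that jumbo frames really work end to end is to ping with fragmentation forbidden and the largest payload that fits in one 9000 byte packet. This is only a minimal sketch, assuming IPv4 and the Linux iputils ping; the function names are made up for illustration.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Ping a host with fragmentation forbidden (-M do in iputils ping).
# If any hop on the path has a smaller MTU, the ping fails instead of
# being silently fragmented.
check_jumbo_path() {
    # $1 = target IP, $2 = MTU to test (defaults to 9000)
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "${2:-9000}")" "$1"
}
```

From the container, check_jumbo_path 192.168.7.104 would then test the path to the iSCSI target used later in this post.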

Delete the

    auto venet0
    iface venet0 inet static
            address 127.0.0.1
            netmask 255.255.255.255
            broadcast 0.0.0.0
            up route add default dev venet0

part that was auto created.

On the proxmox host: Assign network cards to containers. Assuming container 106:

        vzctl set 106 --netdev_add eth1 --save

(To delete it again:

        vzctl set 106 --netdev_del eth1 --save

)
Delete the original one:

        vzctl set 106 --netif_del venet0 --save

Perhaps with the container stopped.

Useful commands

vzctl can also be used to stop and start a container. Configuration file: /etc/vz/conf/106.conf

Problems we found


Halt, reboot and shutdown do not work within a container running squeeze. They leave the system in some inaccessible in-between state, but it is not really stopped or restarted. Our "fix" was to replace the halt, shutdown and reboot commands with a script that does nothing but echo a recommendation to run vzctl restart 106 from the host instead. This might have been solved in newer versions.
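
A minimal sketch of what such a replacement can look like. The function name and the wording of the message are made up for illustration; only the vzctl restart recommendation and the container ID 106 come from the setup described here.

```shell
# Stand-in for halt, shutdown and reboot inside the container.
# It does nothing except tell the user what to run on the host instead.
# CTID is the container ID as seen from the host (106 in this post).
CTID="${CTID:-106}"

refuse_power_command() {
    # $1 = name of the command the user tried to run
    echo "$1 does not work inside this container." >&2
    echo "Run this on the proxmox host instead: vzctl restart $CTID" >&2
    return 1
}
```

Saved as a script that ends with a call like refuse_power_command "$(basename "$0")", one file could be installed under all three command names.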

KVM

Make network bridge vmbr1 and vmbr4 for eth1 and eth4 in the GUI, or do it manually as shown below.
Get the relevant installation iso image from a debian mirror (I selected a 64 bit squeeze). Add it in the GUI under ISO images.

Create your VM with these options:

        Type: Fully virtualized (KVM)
        ISO Storage: local (dir)
        Installation Media: debian-6.0.1a-amd64-CD-1.iso
        Disk Storage: local (dir)
        Disk space (GB): What you want
        Name: host name or whatever
        Memory (MB): What you want
        Start at boot: V
        Image Format: qcow2
        Disk type: VIRTIO
        Guest Type: Linux 2.6
        CPU Sockets: What you want
        Network Bridge: vmbr1
        Network Card: virtio
        MAC Address: (auto generated)

Click “Create”, and then start the machine from the web UI and select “Open VNC console” in the “Status” tab under the virtual machine.
Go through the squeeze installation normally. Install ssh server, to avoid having to use vnc console. Note to self: Also remember to configure the IP address and disable dhcp client...

Network and iSCSI

On proxmox host or in web interface, configure bridge for vmbr1/eth1 and vmbr4/eth4:

    auto vmbr1
    iface vmbr1 inet manual
            bridge_ports eth1
            bridge_stp off
            bridge_fd 0
            post-up brctl addif vmbr1 vmtab104i1d0

    auto vmbr4
    iface vmbr4 inet manual
            bridge_ports eth4
            bridge_stp off
            bridge_fd 0
            mtu 9000
            pre-up ifconfig eth4 mtu 9000
            post-up ifconfig vmbr4 mtu 9000
            post-up brctl addif vmbr4 vmtab104i4d0

I think the mtu commands have to be a real ifconfig command because the interfaces are set to manual (mtu 9000 is not enough).

/etc/network/interfaces on guest:

    auto lo eth0 eth1
    iface lo inet loopback

    iface eth0 inet static
            address 192.168.1.17
            netmask 255.255.255.0
            gateway 192.168.1.254

    iface eth1 inet static
            address 192.168.7.53
            netmask 255.255.255.0
            mtu 9000

Then, on the guest:

    killall dhclient
    /etc/init.d/networking restart

To get the MTU right, we need to modify how the tap interface is created, because if it is created with a lower MTU, the bridge drops to that lower MTU when the tap interface is added to it. We do that in /var/lib/qemu-server/bridge-vlan:

    diff bridge-vlan.orig bridge-vlan
    15a16
    > my $bridgeMTU = `/sbin/ifconfig $bridge | /usr/bin/tr " " "\n" | /bin/grep MTU | /bin/sed "s/MTU://"`;
    17c18
    < system ("/sbin/ifconfig $iface 0.0.0.0 promisc up") == 0 ||
    ---
    > system ("/sbin/ifconfig $iface 0.0.0.0 promisc up mtu $bridgeMTU") == 0 ||

- see http://forum.proxmox.com/threads/5606-Mtu-9000

    /etc/init.d/networking restart

Probably reboot guest a few times to see it working.
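
To see whether it is working, the MTU of each interface in the chain can be read from sysfs and compared. A minimal sketch, assuming the kernel exposes /sys/class/net/<interface>/mtu; the function names and the SYSFS_NET override (there only so the functions can be tried against a fake directory) are made up for illustration.

```shell
# Print the MTU of a network interface, read from sysfs.
# SYSFS_NET can be overridden for testing; it defaults to /sys/class/net.
iface_mtu() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/mtu"
}

# Succeed only if every listed interface has the expected MTU.
# $1 = expected MTU, remaining arguments = interface names.
all_have_mtu() {
    expected=$1
    shift
    for ifname in "$@"; do
        actual=$(iface_mtu "$ifname") || return 1
        if [ "$actual" != "$expected" ]; then
            echo "$ifname has MTU $actual, expected $expected" >&2
            return 1
        fi
    done
}
```

On the host, all_have_mtu 9000 eth4 vmbr4 vmtab104i4d0 should succeed once the guest is running; if the tap interface came up with 1500, the message names the offender.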

Install iSCSI software on guest. I usually compile from source, but in squeeze, I can actually work with the packaged version. Make iSCSI target ready, make sure cable is connected etc.

On guest:

    ifconfig eth1

to see the assigned MAC address.
On guest:

    iscsiadm -m iface -I iface0 --op=new
    iscsiadm -m iface -I iface0 --op=update -n iface.hwaddress -v 6e:db:d2:61:b4:9e
    /etc/init.d/open-iscsi restart
    iscsiadm -m discovery -t st -p 192.168.7.104 -I default -P 1
    iscsiadm -m node --targetname iqn.blabla.com.equallogic:blabla-db-vol  --portal 192.168.7.104
    iscsiadm -m node --targetname iqn.blabla.com.equallogic:blabla-db-vol  --portal 192.168.7.104 --interface default --login
    iscsiadm -m node --targetname iqn.blabla.com.equallogic:blabla-db-vol  --portal 192.168.7.104 --op update -n node.conn[0].startup -v automatic

Target: iqn.blabla.com.equallogic:blabla-db-vol
	Portal: 192.168.7.104:3260,1
		Iface Name: default

where iqn.blabla.com.equallogic:blabla-db-vol should be replaced by your target name and 192.168.7.104 is the IP address of the iSCSI target.
Remember to permit target to be accessed by this host if it has access control.
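
If the SAN announces several volumes, the target names can be picked out of the discovery output instead of being copied by hand. A minimal sketch that only parses the "-P 1" output format shown above; the function name is made up for illustration.

```shell
# Read "iscsiadm -m discovery ... -P 1" output on stdin and print one
# target name per line (the text after "Target: ").
targets_from_discovery() {
    sed -n 's/^Target: *//p'
}
```

Used as: iscsiadm -m discovery -t st -p 192.168.7.104 -I default -P 1 | targets_from_discovery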

Note that what I do here is not what they call "Use iSCSI LUN directly" in http://pve.proxmox.com/wiki/Storage_Model#Use_iSCSI_LUN_directly. What they do is connect the proxmox host to iSCSI. What I do is that my kvm has an ip address on the iSCSI network interface, and that ip address is the only one allowed to access the LUN. The Proxmox host does not know about the iSCSI LUN at all.

Disk access

  1. I want my /dev/sdc in the host OS to be available in my kvm 106 guest as /dev/something so I can mount it normally, and NFS export it to client machines.
  2. I want performance to be as close as possible to native (as if I NFS exported it directly from the host OS).
  3. I also want to protect it from being accessed from the host OS, or any other guest, so I want it somehow marked as "in use by kvm 106".

1 can be achieved easily. 2 and 3, not so easily, if at all.

Documentation I have found:

  1. qm set 106 --virtio1 /dev/sdc will access the disk directly. It will be presented at /dev/vdb in the guest, since I have a /dev/vda already.

    Result in /etc/qemu-server/106.conf:

    virtio1: /dev/sdc

    I'll elaborate more on some options later.

  2. Performance is far away from "native" performance (host os access to the same array). This seems to be expected, but I have not been able to find any documentation to tell me how much performance degradation to expect. I am not sure how much it matters over NFS. Optimizing NFS might be more important in the end.
  3. For now, I have given up on the protection. I will just need to not make any mistakes. If I give /dev/sdc to kvm 106 as a --disk option, as far as I can see, I can still mount it as I please on the host OS. I don't know if qemu-kvm will prevent me from giving it to two guests. Perhaps it shouldn't, in case I want it read only in one of them, but I would like some kind of "already mounted" warning.
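
Lacking a real lock, a home-made check before handing out a device is at least better than nothing. This is only a sketch of the idea, not something proxmox provides: it looks for the device in /proc/mounts on the host and in the config files of the other guests under /etc/qemu-server. The function names and the MOUNTS_FILE override are made up for illustration.

```shell
# Succeed if the block device is currently mounted on this machine.
# Exact device names only: a mounted /dev/sdc1 does not count as /dev/sdc.
# MOUNTS_FILE can be overridden for testing; default is /proc/mounts.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${MOUNTS_FILE:-/proc/mounts}"
}

# Print the config files of all guests, except the one allowed to use the
# device, that mention the device.
# $1 = device, $2 = VM ID allowed to use it, $3 = config dir (optional)
other_vms_using() {
    for conf in "${3:-/etc/qemu-server}"/*.conf; do
        [ -e "$conf" ] || continue
        [ "$(basename "$conf")" = "$2.conf" ] && continue
        grep -l -F -- "$1" "$conf" || true
    done
    return 0
}
```

Before running qm set 106 --virtio1 /dev/sdc, both is_mounted /dev/sdc and other_vms_using /dev/sdc 106 should come back empty-handed. It is a plain text match, so it warns but does not protect anything.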

Parameters
I ran some I/O tests with fio. It is difficult to emulate a production scenario, but at least I get an idea of the performance compared to access from the host. I also NFS exported it and tested from a client in different scenarios. I tried this:

  1. Direct access to /dev/sdc (not protected from other hosts)
  2. LVM on /dev/sdc, direct access to the LVM (not protected)
  3. LVM added via the proxmox GUI, with a virtual disk on it (protected but needs qemu-kvm installation to access it)

I tried with some parameters to kvm, like aio=native,cache=none.
(see http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaav/LPC/LPCKVMSSPV2.1.pdf)

Performance basically sucked on all scenarios compared to direct access from proxmox host.

The thing that had the best effect, especially on I/O with small block sizes, was to set the deadline scheduler on both host and guest, so I have done that in production.

Setting the deadline scheduler

echo deadline > /sys/block/vdb/queue/scheduler

on guest.

echo deadline > /sys/block/sdc/queue/scheduler

on host.
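
To check which scheduler is active, read the same file back: the kernel lists the available schedulers and puts brackets around the active one. A minimal sketch; the function name and the SYSFS_BLOCK override (for trying it against a fake directory) are made up for illustration.

```shell
# Print the active I/O scheduler of a block device. The scheduler file
# looks like "noop [deadline] cfq", with the active one in brackets.
# SYSFS_BLOCK can be overridden for testing; it defaults to /sys/block.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' \
        "${SYSFS_BLOCK:-/sys/block}/$1/queue/scheduler"
}
```

Note that a value written with echo does not survive a reboot, so the echo lines have to be repeated at boot, for example from /etc/rc.local.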

I ended up running LVM on /dev/sdc and adding the logical volume directly to the guest, with the parameter cache=writethrough, which no longer seems to be the default for direct access in the new proxmox release (pve-qemu-kvm 0.14.1-1).

virtio1: /dev/VGHOME/VOLHOME,cache=writethrough

Unsupported kvm parameters to qm
qm does not support all kvm parameters, e.g. aio=native.
This kind of line can be put in your /etc/qemu-server/106.conf file:

args: -drive file=/dev/VGTEST/VOLTEST,if=virtio,index=1,cache=none,aio=native

It took me some time to figure out, but this way I can still start and stop the guest from the GUI and get options that qm does not support passed to kvm.
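
I edited the file by hand, but if the line has to go into several guest configs, a small helper that appends it only when it is missing avoids duplicates. A minimal sketch; the function name is made up for illustration, and it assumes the guest is stopped while its config is changed.

```shell
# Append a line to a config file unless that exact line is already there.
# $1 = config file, $2 = line that must be present
ensure_conf_line() {
    grep -q -x -F -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

For example: ensure_conf_line /etc/qemu-server/106.conf "args: -drive file=/dev/VGTEST/VOLTEST,if=virtio,index=1,cache=none,aio=native"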

So, this is it, my DB server has been running production for months and my file server will run production from next week.

Neatx is the new black 2

April 1st, 2011

Neatx on squeeze, the cheat sheet:

Install dependencies:

aptitude install python-pexpect python-simplejson autoconf automake python-docutils xauth netcat python-gtk2 python-gobject gcc x11-xserver-utils make xutils-dev xserver-xorg-dev xorg-dev

Install nxagent:

Insert in /etc/apt/sources.list:

# Here I got the nxagent package
deb http://ppa.launchpad.net/freenx-team/ppa/ubuntu intrepid main

aptitude update; aptitude install nxagent

Install neatx:

cd /usr/src/
svn checkout http://neatx.googlecode.com/svn/trunk/ neatx-read-only
cd neatx-read-only/
cd neatx/
./autogen.sh
./configure
make
make install

Get neatx ready to run by making nx user and keys and a config file

useradd --system -m -d /usr/local/var/lib/neatx/home -s /usr/local/lib/neatx/nxserver-login-wrapper nx
install -D -m 600 -o nx /usr/local/share/neatx/authorized_keys.nomachine ~nx/.ssh/authorized_keys
cp /usr/local/share/doc/neatx/neatx.conf.example /usr/local/etc/neatx.conf

Clean up sessions after reboot:
Insert in /etc/rc.local:

# Delete all NX sessions after a reboot, they're dead anyway
rm -rf /usr/local/var/lib/neatx/sessions/*
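
A slightly more defensive variant of the same cleanup, as a sketch: it does nothing if the directory is missing and also removes dot files, which the "*" glob skips. The function name is made up for illustration, and it assumes a find with -mindepth and -maxdepth (GNU find has both).

```shell
# Remove everything inside the neatx sessions directory but keep the
# directory itself. $1 = sessions directory (optional).
clean_neatx_sessions() {
    dir=${1:-/usr/local/var/lib/neatx/sessions}
    [ -d "$dir" ] || return 0
    # -mindepth 1 excludes the directory itself from the matches.
    find "$dir" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
}
```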

Assp and postfix and saslauthd

November 9th, 2010

What I had:

  • Assp proxy in front on host fw.
  • Postfix on internal server called fs, assp talking to it on port 25 (listenPort:=25,smtpDestination:=10.10.1.1:25).
  • Assp server serving as plain smtp from inside and as incoming mail server, using postfix port 25 for the actual delivery.
  • Assp mail interface working when using plain smtp.
  • Postfix server working as smtps server with saslauth from inside, and from outside with port forwarding in iptables, both using port 465, not involving assp at all.
            
                   ________
                 _(        )_
                (_ Internet _)   smtp = smtps = fw
                  (________)
               _________________
              |    465     25   |             
   HOST fw    | iptables  assp  |
10.10.0.254   |_________________|
                    |    /
               _____|___/_______
              |   465  25       |  
   HOST fs    |  postfix        |  
10.10.1.1     | (saslauthd)     |
              |_________________|
                   ________
                 _(        )_      smtp=fw
                (_ Internal _)     smtps=fs
                  (   net  )
                   (______)

What I wanted: Assp mail interface and auto whitelisting also working from smtps connections, meaning these have to go through assp.

How I did it:

  • I decided to let postfix / saslauth continue to do the authentication but change to port 587, the "submission port" which seems to be what one does now.
  • Postfix should pass the mail sent to this port on to assp when authenticated.
  • ASSP should then reinsert it into postfix port 25 for delivery - after doing what assp does, including auto whitelisting and catching any assp mail interface mails.
            
                   ________
                 _(        )_
                (_ Internet _)   smtp = smtps = fw
                  (________)

               ___________/\____
              |    587   / 25   |             
   HOST fw    | iptables/ assp  |
              |________/________|
                    | /  /
               _____|/__/_______
              |   587  25       |  smtp=fw
   HOST fs    |  postfix        |  smtps=fs
              | (saslauthd)     |
              |_________________|
                   ________
                 _(        )_
                (_ Internal _)
                  (   net  )
                   (______)

master.cf:

smtp inet n - - - - smtpd
#smtps inet n - n - - smtpd -o smtpd_sasl_auth_enable=yes
submission inet n - n - - smtpd -o smtpd_sasl_auth_enable=yes -o smtpd_proxy_filter=[10.10.0.254]:25 -o receive_override_options=no_unknown_recipient_checks

Does anyone know if I ought to use the content_filter instead? smtpd_proxy_filter sounded right but I am not completely sure of the difference.

Also perhaps I should bother with getting the chrooting to work. But the smtpd_proxy_filter part does what it should and the auto whitelisting and assp-spam and assp-white addresses now work from smtps connections.


UPS #FAIL

June 25th, 2010

Last fall I bought my first UPS'es for the sequencing centre.

Two rack mountable 3000VA 2u for the servers, 2 tower 3000VA for the sequencing machine and one tower 2200VA for the cluster station. All from APC.

The 3 tower based ones were requirements from the vendor of the sequencing machines. Those for the servers were because we are in a lousy facility for the time being and we might as well protect ourselves against power glitches. All the UPS'es are much too small to run through a longer power failure.

On the servers I set up apcupsd to monitor the two rack based UPS'es.

Earlier this week my apcupsd sent me an email telling me to "Change battery NOW!" on one of those. The web status monitor said the same and there is a red lamp glowing on the UPS.

Since this is my first UPS, I took a probably very naive approach to this. I thought that since this equipment is still under warranty, our vendor would send us a replacement for the faulty battery. This turned out not to be the case.

Our vendor sent us to the manufacturer. When I expressed the thought that it would have been a lot more convenient for us if they could handle it themselves, they let us understand that APC has chosen this service model and the vendor was not allowed to do it. While I am not in a position to verify the truth of this statement, I have no reason to doubt it either.

I was able to open a support case through APC's web page. However, it did not get me a new battery.

Supposedly, to get a new battery, I have to do a test to find out if it is the battery or the UPS. Sounds reasonable, except the only way to find out is to run a so called manual calibration, which means running the UPS on battery with a load of at least 30% until the battery is empty. And the load crashes.
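
Since apcupsd is already monitoring the UPS, the current load can at least be read from apcaccess to see how far it is from those 30%. A minimal sketch that parses the LOADPCT line of the status output, assuming the usual "KEY : value" layout; the function names are made up for illustration.

```shell
# Read "apcaccess status" output on stdin and print the LOADPCT value
# (UPS load in percent of capacity) as a bare number.
ups_load_percent() {
    awk -F': *' '/^LOADPCT/ { split($2, v, " "); print v[1] }'
}

# Succeed if the load is at least the given percentage.
# $1 = load in percent, $2 = required minimum (defaults to 30)
load_is_enough() {
    awk -v load="$1" -v min="${2:-30}" 'BEGIN { exit !(load + 0 >= min + 0) }'
}
```

Used as: load_is_enough "$(apcaccess status | ups_load_percent)" && echo "enough load for a calibration"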

I can't really do that to my servers or storage. So, I'll have to magic some alternative load into being. Further, it is complicated by kit and science IT having attached their network equipment to my UPS, so I'll also have to find a place to plug in 4 or 5 switches meanwhile.

My first thought was electric kettles. But they stop after a few minutes. Smart people on my irc channel suggested I leave the lid open and just refill them if they run dry. But I don't like boiling water in my server room... The APC support guy whose week I've ruined later suggested a heater or halogen lights. I don't have either - I am not going to bring my private home appliances in on the commuter train from Slagelse in order to be able to do my job - and I would need to cut the power cord and attach a plug that would go into the UPS. I don't even know if that is legal to do, and I cannot understand that such a weird Mickey Mouse procedure should be required of every single UPS owner.

APC claims that I am supposed to do this anyway as maintenance every 6 months. Otherwise the UPS cannot know exactly how much time is left on the battery.

So, I thought that others would have had to run into this problem before, and asked around a bit. The IT guys I talked to certainly do not bother to do this kind of test. They just buy new batteries when they spot a UPS with a red lamp. They don't care if the UPS knows how much time is left on the battery, and neither do I. And we really don't have the manpower to fuss about every single UPS twice a year.

I also talked to some nice people from APC in Denmark who were much more understanding and very service minded, but the support guy told me afterwards that their suggestions would only work for larger systems, not our small UPS.

APC will not send me a new battery because they do not know whether it is the battery or the UPS that is faulty. I can respect that, and I would have done their troubleshooting if there had been any clean, well described, standard, non clown, non lamp/microwave/heater procedure I could follow - preferably without down time. But I cannot respect that they send me out on a quest like that.

In my view it is their responsibility to make their UPS in a way so it can send a clear error message in the form of a blinking lamp, an alarm sound, some proprietary software monitoring program, whatever, but something clean and well defined. It is not my responsibility as a UPS owner to have a standby non critical load I can crash for them.

So, the "solution" is that we buy a new battery despite the fact that the product is under warranty. That is probably what they wanted anyway, to get out of their warranty obligation. We will just have to budget for that in the future and not bother with APC support (or find a different UPS manufacturer). We are probably to them what PC owners are to the large hardware manufacturers: small and insignificant.

But I was wondering about other people's solutions to this.

Does anyone really do a UPS battery calibration every 6 months? Do people buy a couple of extra UPS'es to use in maintenance windows? Is it good for a UPS to stand by with little or no load most of the time, and will it then work when you need it?

We normally don't do down time at all if we can avoid it. We have people working at all hours and computers working 24/7, and we're not filthy rich. That combination seems to surprise the hardware vendors every time.

Æ Ø Å charset problem in Tivoli client

June 8th, 2010

Problem:

Tivoli client on Linux gives me this error message in the logs when it comes across a file with a Danish æ ø or å character generated on a non UTF-8 character set.

'file-name-with-æ-in-it' contains one or more unrecognised characters and is not valid.

Solution:

  1. Install the en_US.ISO-8859-1 locale - on Debian this is done by running dpkg-reconfigure locales.
  2. Insert in your /etc/init.d/dsmcad or whatever startup script for dsmcad you have: export LANG=en_US; export LC_ALL=en_US.
  3. Restart dsmcad.

Explanation:

The en_US.ISO-8859-1 locale uses a so-called single byte character set (SBCS) and will thus work independently of how many bytes per character were used in whatever charset the file name was made with. If you use a charset with more than a single byte per character, it goes wrong as soon as you have a file name not created with that same character set. It is possible to find this on IBM's web page but they seem to assume that you know what an SBCS is and which locales are SBCSs.
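
The difference is easy to see on the command line: æ is the single byte 0xE6 in ISO-8859-1 but the two bytes 0xC3 0xA6 in UTF-8, and a lone 0xE6 is not a valid UTF-8 sequence at all. A minimal sketch; the function names are made up for illustration, and the second one assumes an iconv that fails on illegal input (the glibc one does).

```shell
# Count the bytes in a string given in printf escape notation,
# e.g. '\303\246' for the UTF-8 encoding of "æ".
byte_len() {
    printf "$1" | wc -c | tr -d ' '
}

# Succeed if the bytes on stdin form valid UTF-8. iconv returns non-zero
# when it meets a byte sequence that is illegal in the source charset.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

So printf '\346' | is_valid_utf8 fails, which is roughly the situation the Tivoli client is in when it runs in a UTF-8 locale and meets a file name created in ISO-8859-1. In a single byte locale every byte is a valid character, so nothing can be rejected.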

Footnote

Yes, I can run Tivoli client on Debian by running alien on the RPM packages. Since I run on a 64 bit system I had to go via a 32 bit system to build the alien packages and then use some force options to install. I am not impressed. They could at least bother to build 64 bit RPMs for Redhat. But the point is: Yes, it works.