2009: The Year Without Money
The economy is in the toilet, and everyone from the gardener to the corporate CEO is feeling it. With that, many businesses are facing reduced IT budgets in 2009, forcing us to come up with new and creative ways to utilize our existing infrastructure. From workforce layoffs to spending cuts, the world of working in IT is changing shape right before our eyes.
I had big plans for 2009, with additional SAN storage, a new backup platform, and a better core switch for our network, all on our list of needs rather than our list of wants. Now I have to address those needs without buying new stuff, which will present new challenges in creativity and patience, I am sure. Although I’m a little disappointed, I welcome the challenge and plan to make the best of it.
VMWare HA and why I should have read the manual.
I ran into an issue this past week where VirtualCenter wouldn’t start. I was swamped with some other issues, so in my haste I made a few bad decisions that made everything worse before eventually discovering the root of the issue: the SQL log file was full. I changed it to unrestricted growth and voila, we were back in business.
Wait, no we weren’t! None of the cluster hosts would enable HA; every one of them errored out. I tried several things that were suggested on the VMware Communities forums without success, from removing the hosts from the cluster and re-adding them, to just disabling HA and enabling it again. Was this the result of updates I may have installed recently? Did something else change?
I’ll take you back to when I first implemented this virtual platform. It’s 4 Dell servers connected to a pre-Dell EqualLogic iSCSI SAN. Each host has two Service Consoles configured, per best practices documentation, although I didn’t understand why that was necessary at the time of implementation. Service Console #1 is on the production LAN using the default gateway as the isolation address (which is shown as the default gateway setting in the Service Console configuration), and Service Console #2 is on the iSCSI network using a nonexistent IP as the isolation address. If you understand how HA works and why a second Service Console is a good idea, then you’re probably cringing and calling me stupid right now. So am I.
Back to last week. We replaced our IPsec VPN to our EMEA network with a direct connection to their Colt MPLS network, and in the process made some changes to the firewall rules. One casualty was the ICMP rule, which was accidentally deleted. Now the ESX hosts couldn’t ping the default gateway, which shouldn’t have been a problem because of the second Service Console. But since Service Console #2 was pointed at an isolation address that didn’t exist, there was no redundancy and HA failed.
Solution: RTFM and understand what you’re doing before implementing an HA cluster. If you don’t, then do what I did and fix the ICMP rule on your default gateway for Service Console #1 and add a valid isolation address for Service Console #2 just in case you’re stupid again later.
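To make the failure mode concrete, here is a minimal sketch of the sanity check I should have run up front. It is not how VMware HA works internally; it just models each Service Console as a name plus an isolation address and takes a `probe` function that says whether an address answers. The function names and the IP addresses are made up for illustration, and in real life you would swap in an actual ping for the probe.

```python
from typing import Callable, Dict, List


def reachable_isolation_addresses(
    consoles: Dict[str, str], probe: Callable[[str], bool]
) -> List[str]:
    """Return the Service Consoles whose isolation address answers the probe."""
    return [name for name, address in consoles.items() if probe(address)]


def has_redundancy(consoles: Dict[str, str], probe: Callable[[str], bool]) -> bool:
    """True only if at least two isolation addresses answer."""
    return len(reachable_isolation_addresses(consoles, probe)) >= 2


# Hypothetical addresses standing in for my setup.
consoles = {
    "Service Console #1": "192.0.2.1",       # default gateway on the LAN
    "Service Console #2": "198.51.100.250",  # nothing lives at this address
}

# Before the firewall change: the gateway answers, the bogus address never does.
before = {"192.0.2.1"}
# After the ICMP rule was deleted: nothing answers at all.
after = set()

print(reachable_isolation_addresses(consoles, before.__contains__))
print(has_redundancy(consoles, before.__contains__))
print(reachable_isolation_addresses(consoles, after.__contains__))
```

Even before the firewall change the redundancy check comes back false, which is the whole point: the second Service Console was never doing its job.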
Virtualization: best practices for real life.
My whole career has been about having things thrown at me and learning on the fly. This works out pretty well most of the time, especially when it’s a new-to-us technology that’s been around for a while already like Active Directory (which we didn’t start using until Server 2003 was out). With all of the published best-practices and configuration guides that were available by the time we implemented AD, it was practically a walk in the park.
There are times, however, when you’re actually keeping up with the trends in technology and have to deploy something that’s fairly new. I first deployed VMware ESX Server a few years ago, and there weren’t as many configuration options as there are now. It was a standalone server with a lot of local storage, and then all I had to do was manage the resources. Easy! So easy that we got another one to run some automation systems for the Engineering Department. Another success! After successfully reducing our hardware inventory without lowering our availability or system performance index, I convinced the powers-that-be that a full-blown virtualization platform was clearly the next step in our technological evolution.
What a day I had when the new toys were delivered. 4 ESX3 host servers connected to a 7TB EqualLogic iSCSI array, unleashing 64GHz of multi-core processing power and 64 Gigabytes of RAM into our available resource pool. The EqualLogic promotional material (and sales reps) touted that it would take longer to unpack than it would take to configure, and they were right. Within 2 hours of UPS dropping it off I had the array installed and configured, ESX installed on all 4 hosts, VirtualCenter Server installed on the management server, and several guest systems migrated from one of the standalone ESX servers. It was perfect!
But was it really perfect? Had I implemented this solution in the smartest way possible? No. No I hadn’t. There weren’t any show-stopping mistakes but there were a few things, which I will address in a moment, that I ended up going back and changing well after the initial installation.
iSCSI Network
Your switch should support flow control and jumbo frames at the same time. 10GbE uplink ports are great for future-proofing your switching platform.
I had a spare Procurve 2810-48 gigabit switch, so I was good to go! Perhaps that’s true in a small deployment, but the 2810 doesn’t support jumbo frames AND flow control at the same time; it’s one or the other. You can’t adjust the MTU on the EQL interfaces, so there will be additional overhead on the switch as it negotiates the MTU size with the EQL interfaces. While VMware does not officially support jumbo frames on the software iSCSI initiator, several reputable sources recommended that I do it anyway. I did so, replacing the 2810 with a 2900, which also has 10G uplink capability, making it a perfect fit for the two-switch configuration recommended by Dell/EQL.
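For a rough feel of what jumbo frames buy you, here is a back-of-the-envelope sketch. It assumes plain Ethernet with no VLAN tag, IPv4 and TCP headers with no options, and full-size frames, and it ignores the iSCSI headers entirely, so treat the numbers as illustrative rather than a measurement of this SAN. The function names are my own.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet, no VLAN tag).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # preamble + header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20             # IPv4 + TCP, no options


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire bytes that carry TCP payload for full-size frames."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETHERNET_OVERHEAD
    return payload / on_wire


def frames_needed(total_bytes: int, mtu: int) -> int:
    """Number of full-size frames needed to move total_bytes of payload."""
    payload = mtu - IP_TCP_HEADERS
    return -(-total_bytes // payload)  # ceiling division


for mtu in (1500, 9000):
    print(mtu, round(payload_efficiency(mtu), 4), frames_needed(10**9, mtu))
```

Under these assumptions the wire efficiency only moves from roughly 95% to roughly 99%, but the number of frames needed to move a gigabyte drops by about a factor of six, which means that many fewer frames for the switch and the hosts to handle.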
NIC Teaming
You want load balancing and failover. Yes, you do.
Without knowing exactly how it would work, I used the default NIC teaming configuration in ESX for failover and load balancing. All that does is offer failover, with no load balancing. Eventually I discovered this and found documentation for the proper configuration for load balancing and failover. One setting in ESX was changed (Route based on IP hash) and then port trunking was enabled on the network switch. Load balancing and failover. Yay!
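Here is a simplified model of the idea behind the IP hash setting, as it is commonly described: XOR the source and destination addresses and take the result modulo the number of uplinks in the team. I am not claiming this is the exact calculation ESX performs; the function name and the addresses are hypothetical and only there to show the behavior.

```python
import ipaddress


def ip_hash_uplink(src: str, dst: str, uplinks: int) -> int:
    """Pick an uplink index from the XOR of the two IPv4 addresses."""
    src_int = int(ipaddress.IPv4Address(src))
    dst_int = int(ipaddress.IPv4Address(dst))
    return (src_int ^ dst_int) % uplinks


host = "10.0.0.10"
for target in ("10.0.0.20", "10.0.0.21", "10.0.0.22"):
    print(target, ip_hash_uplink(host, target, uplinks=2))
```

The takeaway is that conversations with different peers get spread across the team, but any single source and destination pair always lands on the same uplink, so one conversation never gets more than one link’s worth of bandwidth.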
iSCSI Volumes
Plan for volume portability and scalability using reasonably-sized volumes with a smart naming convention.
I had no idea what would be best when it came to the volumes on the iSCSI array. No idea whatsoever, and I couldn’t find any best-practices guides for it either. I came up with something that made sense at the time, but in the end wasn’t a good idea and needed to be changed. We have a few primary virtual server types here, Domino, File, and Application, so I thought that creating volumes for each was appropriate. I started with volumes like F01, and A01, and D01, but in the end I realized that this wasn’t good, as it restricted me from deployment of virtual disks based solely on available space. I’ve got 100G free on D01 and I need to put up a new app server, but I can’t put it on D01 because it’s not a Domino server. Yes, I’m anal and probably suffer from a bit of OCD.
What I have now is generic SAN volumes: Vol0-0, Vol0-1, Vol1-0, and so on. My naming convention is Vol (Volume, duh) 0 (disk 0) -0 (extent 0, used for adding extents to datastores in ESX, where required). This allows me to allocate space based solely on the availability of the space required, without triggering any of the many adverse effects of my OCD. I have allocated 512G of space for each volume on the SAN*, which I think is a good compromise between usable space and volume portability, which will likely be important in the future when we add more EQL arrays to this platform. Here’s a ‘map’, if you will, of the SAN volumes and the ESX datastores that they correspond to.
*three of the SAN volumes are 1TB each, but it was too late to fix that once it was in place. Vol4 is Vol4-0 with Vol4-1 and Vol4-2 added as datastore extents in ESX.
SAN volume -> ESX datastore (datastore size)
Vol0-0 -> Vol0 (512G)
Vol1-0 -> Vol1 (512G)
Vol2-0 -> Vol2 (512G)
Vol3-0 -> Vol3 (512G)
Vol4-0 -> Vol4 (3TB)
Vol4-1 -> Vol4
Vol4-2 -> Vol4
This approach provides service-agnostic provisioning of disk space and lets datastores scale via extents, while still allowing for volume portability at the SAN level. Volume portability is especially handy with the EQL arrays if you plan to run more than one array within a storage group. In that configuration, multiple arrays function as one and automatically load balance volumes between them.
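The naming convention is simple enough to express in a few lines. The sketch below groups SAN volumes into ESX datastores by disk number and adds up the sizes. I am assuming the three 1TB volumes are the Vol4 extents, since that is what adds up to the 3TB datastore in the map; the helper names and the free-space figures in the usage line are made up.

```python
import re
from collections import defaultdict

# SAN volume name -> size in GB. The 1TB sizes for Vol4-x are an assumption.
SAN_VOLUMES = {
    "Vol0-0": 512, "Vol1-0": 512, "Vol2-0": 512, "Vol3-0": 512,
    "Vol4-0": 1024, "Vol4-1": 1024, "Vol4-2": 1024,
}

NAME = re.compile(r"^Vol(\d+)-(\d+)$")  # Vol<disk>-<extent>


def datastores(volumes):
    """Group SAN volumes by disk number and total up each datastore in GB."""
    totals = defaultdict(int)
    for name, size_gb in volumes.items():
        match = NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected volume name: {name}")
        totals[f"Vol{match.group(1)}"] += size_gb
    return dict(totals)


def first_fit(free_gb, needed_gb):
    """Return the first datastore with enough free space, or None."""
    for name, free in free_gb.items():
        if free >= needed_gb:
            return name
    return None


print(datastores(SAN_VOLUMES))
print(first_fit({"Vol0": 40, "Vol1": 100}, needed_gb=60))  # hypothetical free space
```

Because the names say nothing about what kind of server lives on a volume, placement really can be decided on free space alone, which is exactly what the old F01/A01/D01 scheme got in the way of.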
pCPU vs vCPU
Multi-vCPU = more processing power, and host overhead.
Is it better to get a dualcore processor at 3GHz, or a quadcore at 2GHz? In the beginning, I thought that ESX could schedule a single vCPU virtual machine across multiple physical processors, or processor cores. Eventually I discovered that a single vCPU virtual machine running on a host with dual 3GHz dualcore processors will only have 3GHz available to it. The simple solution for adding more CPU is to add a second vCPU (and of course change the HAL to ACPI Multiprocessor PC). The downside of multiple vCPU virtual servers is the additional overhead on the host server. As a rule of thumb, all new VM deployments are done as a single vCPU, and if the needs change we add more processors.
In light of this, I’m tempted to say that core GHz is more important than the number of cores, but that comes with caveats aplenty. In the end, I think it’s best to assess your needs and then choose the right CPU platform. Now that 3GHz quadcore processors are around, the issue is almost moot.
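The arithmetic behind that is worth spelling out. This is a deliberately naive sketch: it treats each vCPU as getting at most one core’s worth of clock and ignores scheduling overhead and contention from other VMs. The function name is my own invention.

```python
def vm_cpu_ceiling_ghz(vcpus: int, core_ghz: float, host_cores: int) -> float:
    """Upper bound on the clock a VM can use: one core's worth per vCPU."""
    if vcpus < 1 or vcpus > host_cores:
        raise ValueError("vCPU count must be between 1 and the host core count")
    return vcpus * core_ghz


# Host with two dualcore 3GHz processors: 4 cores, 12GHz in the pool.
print(4 * 3.0, vm_cpu_ceiling_ghz(1, 3.0, 4), vm_cpu_ceiling_ghz(2, 3.0, 4))

# One dualcore 3GHz processor versus one quadcore 2GHz processor.
print(2 * 3.0, vm_cpu_ceiling_ghz(1, 3.0, 2))  # smaller pool, faster single vCPU
print(4 * 2.0, vm_cpu_ceiling_ghz(1, 2.0, 4))  # bigger pool, slower single vCPU
```

So the quadcore gives the host a bigger pool to share, while the dualcore gives any one single-vCPU machine a higher ceiling. Which one matters more depends on whether you have a few hungry VMs or a lot of small ones.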
That’s it for the big issues I’ve faced with virtualization, except for the issues I discuss in this blog, which I am still trying to solve and will write up any relevant tips if and when I get to the end of that ordeal.
Is that the news in your pocket?
These days it seems like everyone’s got a BlackBerry, iPhone or other data-enabled mobile device. The whole internet in your pocket! For those of us in IT these devices can provide real business value, enabling us to communicate via email, IM, text, and phone no matter where we are. The actual value though, is all relative to how you use it or as in many cases, don’t use it.
In a blog I posted yesterday, I wrote about how easy it is to fall behind in IT if you don’t put forth some effort to keep up. With the whole internet available on your phone, you now have the opportunity to ‘keep up’ even if you’re not at your computer. But how could I possibly manage navigating so many tech sites on my slow little phone? It’s tedious enough to check 10 news sites a day on my giant monitor at work, never mind a 2″ phone LCD, so to be honest I don’t think I’d use my phone all that much for reading tech news.
Enter RSS. RSS has been around for a while and is becoming more and more popular across more and more demographics because, let’s face it, it’s much easier to aggregate all your news from multiple sources into one location for viewing. If you’re reading this you’re probably familiar with RSS and maybe you’ve even used it, but do you use it on your phone? There’s a great little app available for BlackBerry and Windows Mobile called Viigo that is a fantastic RSS reader. I’ve been using Viigo for quite a while now and the latest version is 3 beta2. Version 2 was great but 3 is even better, with more customization options, more features, and a more modern appearance. [Screenshots below.] If you have a BB or WM device, go check it out!
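If you have never looked under the hood, aggregation is not complicated. Here is a minimal sketch of what any RSS reader does at its core: parse a few RSS 2.0 documents and merge the items into one list, newest first. This has nothing to do with how Viigo is built; the two sample feeds are invented, and a real reader would fetch them over HTTP and cope with missing fields.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two made-up feeds standing in for real tech news sites.
FEED_A = """<rss version="2.0"><channel><title>Site A</title>
<item><title>Older story</title><pubDate>Mon, 05 Jan 2009 08:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Site B</title>
<item><title>Newer story</title><pubDate>Tue, 06 Jan 2009 09:30:00 GMT</pubDate></item>
</channel></rss>"""


def aggregate(feeds):
    """Merge items from several RSS 2.0 documents, newest first."""
    items = []
    for xml_text in feeds:
        channel = ET.fromstring(xml_text).find("channel")
        source = channel.findtext("title")
        for item in channel.findall("item"):
            published = parsedate_to_datetime(item.findtext("pubDate"))
            items.append((published, source, item.findtext("title")))
    return sorted(items, reverse=True)


for published, source, title in aggregate([FEED_A, FEED_B]):
    print(published.date(), source, title)
```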
So yes, in fact that is the news in my pocket! I can catch up on the latest articles from multiple sources in just a few minutes, no matter where I am. I often read tech news while waiting at a red light, or while grilling dinner, or while I’m …yes… in the bathroom. Now ‘keeping up’ doesn’t have to interfere with the rest of my schedule, or yours.
VMWare VI and iSCSI, a match made in…wait, what?
About a year ago I was faced with the reality that we’ll keep needing more servers, and that after 3 years we have to start paying to renew the warranties. Combined with the fact that our resource utilization on most of our servers was lllllooooowwwww, it became pretty obvious that we needed to make a change (we can believe in! lol). The answer to our problems: Virtualization. Duh.
We already had two standalone VMWare ESX servers running production servers and I couldn’t be happier, so the decision to use VMWare as the platform for a large-scale (ok it’s just large-scale to us) virtualization initiative was a no-brainer. We ordered 4 Dell rack servers with ESX 3 Enterprise, and a 7TB EqualLogic PS400E iSCSI SAN, to be connected using Procurve 2810-48 switches on both the network and iSCSI side with separate switches for each. I set it all up and started migrating physical servers into the virtual environment. Abso-freaking-lutely awesome!
I’ve got the ESX hosts set up with two dual-port NICs (Broadcom onboard, Intel PCIe) in each, two teamed for Service Console, LAN, and DMZ, and two teamed for iSCSI and VMotion. On the iSCSI side I’ve got a Procurve 2810-48 with flow control turned on, but it doesn’t support flow control and jumbo frames at the same time, so jumbos are off. All is well.
That was then.
About a month ago I got a call from a guy in the Engineering Dept., saying that he was opening a large SolidWorks drawing and it was quite slow. I did some quick tests and sure enough, large file transfers from the file server were slow! 150Mbps if I’m lucky, and it’s on a GigE connection. I tested from some other systems and the results were the same. I called our sales rep with one of our vendors and asked if I could meet with their storage specialist to talk about this issue…and the following week we had our meeting. What he suggested was that, despite jumbo frames not being officially supported by VMware, we get a switch that would do flow control and jumbo frames simultaneously. A Procurve 2900 fit the bill, and as soon as it arrived I cut over to the new switch. No improvement, but of course not, because I still needed to enable jumbo frames on the vSwitches, which I did next. Still no improvement.
At this point I had posted about my little adventure on the VMWare user forums and received a ton of suggestions, but nothing had helped. That posting was noticed by a rep at Dell who was nice enough to forward me a best practices document for using VMWare with the EQL SAN, and to my delight I discovered that I had done everything in the proper manner. So…then what? More testing! I installed SQLio and IOmeter on the file server as suggested in the VMWare Communities thread, and ran it against one of the data volumes on the SAN. The results seemed optimal! If the server, that lives on the SAN, can get data to and from the SAN normally, why can’t I get data normally from that server via the network? Is it a problem on the network side? Mooore testing.
Next I installed the Microsoft iSCSI initiator client on a physical server, connected to a volume on the SAN, and ran the same file copy that I used in my initial test, a 1.8G .iso file, and it was fast! So this isn’t a problem with the EQL or the Procurve on the iSCSI side; it’s either the Procurve 2810-48 on the network side, the physical host server, or ESX! I moved a smaller VM onto the local storage of one of the hosts and tested the file copy again, and while it was juuust a tad faster it was still terrible. The more testing I did, the more I was sure that it was the Broadcom NICs on the host servers, so I ordered and installed an Intel GigE card to test with. No improvement. What?! Could it be the Procurve on the network side? I tested the file copy between physical servers, both connected to that switch, and it performed as expected. Nope, not the Procurve on the network side.
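To put the slowness in perspective, here is the simple arithmetic for that 1.8G test file at the roughly 150Mbps I was seeing versus an ideal gigabit link. It is a sketch that assumes decimal gigabytes and ignores protocol overhead, so the gigabit figure is a best case, not something you would see in practice.

```python
def transfer_seconds(size_gb: float, rate_mbps: float) -> float:
    """Seconds to move size_gb (decimal gigabytes) at rate_mbps (megabits/s)."""
    megabits = size_gb * 1000 * 8
    return megabits / rate_mbps


ISO_GB = 1.8
slow = transfer_seconds(ISO_GB, 150)    # the rate seen from the VM
ideal = transfer_seconds(ISO_GB, 1000)  # theoretical GigE line rate
print(round(slow), round(ideal))
```

That works out to about a minute and a half per copy instead of somewhere around fifteen seconds, which is the kind of difference an engineer opening big drawings all day is going to notice.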
At one point I was directed to a great blog post about USB drivers interfering with network performance on ESX hosts, so I tried disabling the usb-ohci drivers on the hosts, but sadly I didn’t experience the same return of performance that many others did.
I’m running out of ideas at this point, and the last thing I can try (that I can think of) is iSCSI HBAs instead of the software iSCSI in ESX. The performance degradation definitely seems to be happening on the host server, so maybe, just maybe, taking the software iSCSI initiator out of the mix will help, but I’m not confident that it will matter, since the test using the local storage of one of the hosts showed similarly bad performance moving data to another network server. I guess doing it right doesn’t always work.
For the record I’m still quite happy with this system, most other servers on this SAN are running fine and the performance issue is only a problem on large file transfers. This system really is a match made in heaven, despite whatever is going on with ESX to cause this performance issue. I will post a follow-up to this when I make some progress. I’ve spent a lot of time on this, with a lot of help from a lot of people, and gotten nowhere. I’m sure it’s an issue on the network side on the ESX host but beyond that I can’t figure out a fix. If you have any suggestions, feel free to post a comment.
Some of the resources I found helpful:
http://communities.vmware.com/thread/166113?tstart=0
http://blog.scottlowe.org/2008/04/22/esx-server-ip-storage-and-jumbo-frames/
http://blog.scottlowe.org/2006/12/04/esx-server-nic-teaming-and-vlan-trunking/
http://www.tuxyturvy.com/blog/index.php?/archives/37-Troubleshooting-VMware-ESX-network-performance.html
Be careful! “Out of touch” isn’t as far out as you might think!
Like most people in IT, I deal with a lot of people in this industry on a day to day basis. I see people that are really good at this, and people that are really bad. What interests me is that of those that are bad at this, the reasons for being so are often vastly different. I’ll use this example in the context of two people performing at similar admin/support levels in an IT department, with the assumption that they are both amply smart and capable of learning.
In this corner we have Jody, with a good handle on the fundamentals and a history of poor logical thinking. In that corner we have Dianne, with a good sense of logic and troubleshooting skills but very little knowledge of the fundamentals. Jody: Poor logic, in my opinion, is a career-limiting characteristic to have if you work in IT. I don’t think you can effectively teach logic, you’re either good at it or you’re not. Dianne: Lacking fundamental knowledge is relatively easy to overcome. With the internet and all the training programs available, there’s almost no limit to what Dianne can learn.
At this point it might be safe to assume that Dianne is the better candidate for advancement in her career, right? What if Dianne lacked the motivation to get her learn on and really shine? Here she is with the brain and natural ability to think logically, but she’s just showing up and doing what she has to do today. Can she really be the better candidate for advancement if she doesn’t show a thirst for knowledge? Is it possible to make the best decisions today, without at least having an idea of what tomorrow will bring? I say no, but surely this story can’t be over!? It’s not, and yes I promise I’m coming back around to the title. Even being a new manager I’ve already had to face this situation, and my biggest challenge was to figure out how to 1. tell Dianne that she wasn’t cutting it the way she should be cutting it (she’s got great potential!) and 2. to motivate her. Issue 1 turned out to be easy, we just sat down and talked about it. Issue 2 is up in the air. I did my part, I think, and now it’s up to her to step up and be great.
This scenario leaves me wondering, how does one come to be in IT without that thirst for knowledge? Don’t you want to know what’s coming out next week, and to consider how that will affect what you do this week? Don’t you want to learn everything you possibly can about what you do, so that you can do your job effectively? Aren’t you aware of the fact that with how fast technology is changing, without some effort to keep up you’re going to get left behind? This isn’t a trade, you can’t just learn it and be done. Your education will never stop, or you will become obsolete. Quickly.
There is a fine line between being consumed by your career and being diligent, but if you’re doing it right you’re still damned close to that line. Where are you?
I’m on the consumed side of the fence.
Resistance is futile. Seriously.
I’m very anti-trend for most things of a social nature, including blogging. I formerly viewed blogging as a means for people to listen to themselves talk, and I don’t really like talking. I have resisted blogging successfully until now, but recently I’ve become aware of the value of (some) blogs.
The backstory on how I came to this moment is short and sweet (yay!). I’m a Network Manager with a decent technical background and I’ve always relied partly on the experience of others to educate myself when embarking on projects that involve technologies that I’m unfamiliar with. I know, that was a long sentence. In the recent past, a lot of that information has found its way into my brain by way of blogs. People have shared their valuable experiences and I have learned from them. I have valuable experiences from time to time, and this will be where I publish them for the world to see. …or ignore. People may find my stories pointless, or completely unhelpful, but hopefully this won’t be entirely in vain.
I leave you now with a blanket disclaimer: I have but a high school education and English was never a good subject for me, and so I am prone to poor spelling and grammar. I am also rather opinionated and intolerant of others’ flaws, but I do my best to keep my outrage in check. Please excuse these shortcomings.



