Time for a change. No developer slang or pics of mobile phones: it’s time to take a look behind the scenes at the systems that let the people at These Days work and do their stuff.
At These Days, it’s pretty simple. I installed a NetApp SAN (FAS 2050c) about 2 years ago. It’s a clustered box which contains all fileserver data (the normal network drives which almost every digital company has and where employees save all their documents and work-related files) and it also contains all data from our VMware virtual infrastructure environment. The latter is a clustered farm of 25-30 virtual servers. These Days is completely virtualized and thus heavily dependent on the NetApp SAN and the virtual infrastructure.
So basically, if the NetApp box goes down, we’re in big trouble. No fileserver and no access to the servers anymore (development, finance, planning, …), and 80 people who are unable to work. You don’t even want to know how much that would cost an office. Imagine 80 people completely unable to work! Time is money, right?
But the NetApp SAN doesn’t go down. It’s made and designed for that. You boot it up once and you never turn it off or reboot it. It doesn’t require patches or updates. OK, you’ve got the occasional firmware updates but they can be done online with a 40sec Disk I/O interrupt. So yeah, that’s not bad. It’s even clustered, 2 controllers in an active/active mode, so IF something went wrong and one of the controllers died, the other one would take over and keep everything running.
It’s a box which stays online, 24×7.
Unless… you get a power outage like we did last week. It happened between 4 and 5 AM. The UPS (emergency battery power) kicked in and gave me 40 minutes to restore power or shut everything down. But the text alerts at 5 AM didn’t wake me up! So the UPS battery ran down while I was sleeping and dreaming of hot chicks… er, ICT Business Continuity. Eventually the whole server room lost power and everything went down.
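In hindsight, a 40-minute UPS window shouldn’t depend on me waking up. Here is a minimal Python sketch of the kind of rule a monitoring script could apply. It is not what we actually run: the `UpsStatus` fields, the function name and the 10 and 5 minute thresholds are all made up for illustration, and how you read the status from your UPS depends on the vendor.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what a UPS reports."""
    on_battery: bool          # True while mains power is out
    runtime_left_min: float   # minutes of battery the UPS says remain


def should_shut_down(status: UpsStatus,
                     shutdown_needs_min: float = 10.0,
                     safety_margin_min: float = 5.0) -> bool:
    """Return True once the remaining battery only just covers a clean shutdown.

    Both thresholds are assumptions; measure how long your own
    servers and storage really need to shut down cleanly.
    """
    if not status.on_battery:
        return False
    return status.runtime_left_min <= shutdown_needs_min + safety_margin_min
```

With these example thresholds, a fresh outage with 40 minutes left would not trigger anything yet, but the shutdown would start on its own once 15 minutes or less remain, whether or not anyone heard the text alerts.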
When I woke up, I saw the text messages and said (more like screamed) something along the lines of “EEK!” and “F*CK”. Phone calls from the early birds in the office were already coming in saying there was no power and after the fastest bike ride ever to the office, I booted up everything and got the whole thing up and running again in 15mins.
Now the cool part of this experience kicks in. The same day, in the afternoon, I noticed something was wrong with one of the NetApp controllers. LEDs weren’t on or flashing, no network activity, nothing, completely nothing. I checked the box and saw that a failover had happened: one controller was handling all traffic and had taken over all tasks from the other controller.
I was shocked and happy at the same time. Shocked as in: one of the controllers didn’t boot up and happy as in: cool, we’ve been running on one controller for a whole day and no one noticed a thing! All iSCSI and CIFS connections stayed up with no problems.
After some testing, I declared the faulty controller dead and called support to bring me a new one. It probably died from a voltage spike when restoring power.
The next day, I received a brand new controller and replaced the dead one. Another cool thing about NetApp controllers: all OS and config data is stored on the compact flash. So placing the flash card from the old controller in the new one restores all configs on the new controller. The only thing that needs to be done manually is to reassign the disks to the new system ID, which takes about… 3 seconds.
The point of this post is to show how important it is to have redundancy on your business-critical systems. What if our NetApp box wasn’t clustered? These Days would have been down for an entire day and lost large amounts of money. It’s the nightmare of every IT manager. Sometimes we need to invest in these complex solutions to guarantee business continuity, and in a situation like this, it pays off. Big time.
ROI on the NetApp? Ha, yeah, we reached it and even went completely over it.
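To put a rough number on "large amounts of money", here is a back-of-the-envelope sketch in Python. Only the 80 people come from our situation. The 8-hour working day and the hourly cost per person are placeholder assumptions, not real These Days figures, so plug in your own.

```python
def downtime_cost(people: int, hours: float, hourly_cost_per_person: float) -> float:
    """Rough cost of an outage: everyone idle for the whole duration."""
    return people * hours * hourly_cost_per_person


# 80 people is from the post; 8 hours and 50.0 per hour are assumed placeholders.
cost = downtime_cost(people=80, hours=8, hourly_cost_per_person=50.0)
print(cost)  # prints 32000.0
```

It ignores missed deadlines and unhappy clients, so the real figure would likely be higher.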














