Building Controller Resiliency at University of Cambridge

By Alexander Cox, Blog Contributor

As discussed in my last blog, here at the University of Cambridge we have a large and complicated network, and the wireless is one, albeit hefty, part of it. I previously mentioned that we have a large fibre-optic network connecting University buildings all over Cambridge. Within those buildings, all our access points connect via department or College local area networks (many not directly under our control). In this blog I would like to talk about how we manage all of that centrally, and about what we are changing in our Aruba controller setup.

If you had asked me about that setup five or six years ago, I would have answered that we had a couple of controllers and around 700 to 800 access points. At that point, we had just dipped our toe into Aruba's product range, having switched from another major vendor. The fact that we are still using Aruba all this time later tells you all you need to know about the product.

So, since then, what has changed? To answer that properly I need to take you back in time again, to about 12 to 18 months ago. By that stage, we had grown the network to about 4000 access points and had evolved to a conductor, a standby conductor, and five local controllers (2 x 7210 and 5 x 7220). However, and this is relevant to the title of this blog, one of those five local controllers was acting as our N+1 standby: if any one of the local controllers failed, it would take over that controller's role. Around the same time, we also began deploying AirWave to monitor our wireless estate, which, by the way, is a fantastic tool, especially for large networks.

So where are we now? Well, we have grown the network by another 1000 access points, taking us to just under 5000, but we now have fourteen controllers (2 x 7210 and 12 x 7220). "Fourteen controllers?!" I hear you say. So why do we need so many when each 7220 controller can handle up to 1024 APs? The answer is 2N resiliency.

2N resiliency is the logical upgrade to N+1 resiliency: simply put, it means having a complete second set of controllers in a separate location, ready to take over if the first set fails. In our case, we use backup LMS as the fail-over mechanism (we previously used VRRP). This matters to us because the wireless is critical to University business; that, together with the scale of our deployment, means the service must be resilient, and the N+1 model wasn't robust enough. Building this resiliency, especially on a network as complicated as ours, is far from easy. As mentioned, we have a pan-city fibre network with fifteen distribution routers and three core routers. Our Aruba equipment sits away from the centre of this network, geographically and logically, in our main Data Centre at the edge of the City.
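For readers unfamiliar with backup LMS: on AOS 6 the primary and backup controllers are defined in the AP system profile that each AP group references, so every AP knows where to fail over. A minimal sketch of that idea looks something like the following; the profile name and IP addresses here are hypothetical, not our production values.

```
! Hypothetical AP system profile defining the 2N fail-over pair
ap system-profile "dept-site-a"
   lms-ip 10.0.1.10        ! primary local controller (first data centre)
   bkup-lms-ip 10.0.2.10   ! 2N partner controller (second data centre)
   lms-preemption          ! APs return to the primary once it recovers
```

With a profile like this, an AP that loses its primary LMS re-establishes its tunnels to the backup address, which is what lets one bank of controllers stand in for the other.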

So, when we were working out a plan for all this we had to draw up aims for the upgrade. A high-level synopsis of this plan was:

Phase One

  • Provide 2N resiliency in two disparate, geographically diverse data centre locations
  • Rearrange the underlying network and addressing scheme
  • Prepare the network for AOS 8 (The importance of this is noted below)

Phase Two

  • Move the Aruba equipment to the centre of the University network (the distribution layer) for efficiency, as it is closer to the main user base and to the egress point to the internet (recall that the main Data Centre is at the geographic and logical edge of our network)
  • Place the whole Wireless Service on dedicated routers with 2 x 10 Gbit connections to each controller.
  • Make sure the phase two arrangements are prepared for AOS 8

Planning for Aruba AOS 8 was a large part of this project. We wanted to ensure that the major immediate changes we were making remained compatible with AOS 8, so the network design we came up with not only solved our short-term needs but also makes adopting AOS 8 much easier. Why do we care? At the moment we run AOS 6.5, which is great, but 8 is better. No, actually, it's a step change, and I'll name a couple of benefits that stand out for us. Firstly, AirMatch will make a big difference to our deployment, which, as mentioned, spans the city and is subject to all sorts of inconsiderate or badly deployed Wi-Fi networks. Having this improved system dynamically optimise the channel and power plan of the entire WLAN will be amazing. Secondly, controller clustering is another key benefit for us. The immediate advantage is that the entire system works as one, so no more logging into each local controller separately! More seriously, clustering also facilitates hitless failover (within the cluster) and, most crucially, provides seamless roaming. That is great because it stops our users having to re-authenticate when they cross into a zone served by a different controller (our controllers each serve distinct areas of the City).
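To give a flavour of the clustering mentioned above: in AOS 8 a cluster is defined as a group profile of member controllers on the Mobility Master. This is only a rough sketch based on Aruba's published CLI, not our configuration; the profile name and addresses are made up.

```
! Hypothetical AOS 8 cluster of two controllers, one per data centre
lc-cluster group-profile "city-cluster"
   controller 10.0.1.10
   controller 10.0.2.10

! Each managed controller then joins the cluster
lc-cluster group-membership "city-cluster"
```

Once the members form a cluster, client state is shared between them, which is what makes hitless failover and seamless roaming possible.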

The planning and initial work on this project took 12 months. We completed the migration in about two to three months, which sounds like a lot, but if we could have turned the system off, we reckon we could have done all the work in a few days. That simply was not an option: we had to make all the underlying changes while keeping the system running, with minimal downtime. So we worked on the system while it was "live" wherever we could, but crucially only where we were very confident the work would not be service affecting. On top of that, we had a set of planned out-of-hours system outages. These had to be kept to a minimum given that the Wi-Fi is used 24 hours a day, seven days a week. (I kid you not: we see up to 35K unique devices connected during the day and still have at least 10K devices connected overnight.) Consequently, we were limited overall to one to two hours of advertised downtime per week.

We completed the phase one work a couple of weeks ago, and last week we ran a full failover test by powering off one bank of controllers. It was a nervous moment, but thankfully it all worked. This was a noteworthy and complex piece of work, and it says much about the Networks Team here that no one felt we were being too ambitious in doing it on a working system rather than building a new system in parallel. That is not to say we were not worried; we all care deeply about the service we offer. But I have the privilege of working with some very clever and capable people, whose contributions make all the difference when undertaking this type of project.