The IT department reacts

Experts criticise UU following outage of systems and websites

Data centre fire. Photo: Shutterstock

Slinger Jansen, an Associate Professor of Computer Science specialising in cybersecurity, and Nishant Saurabh, a UU lecturer and expert in cloud storage, have raised serious concerns about how Utrecht University (UU) handled the outage and the process of restarting its websites and systems. They argue that when disasters happen, it is crucial to switch quickly to other servers containing backups. “You should be able to quickly fall back on a system that at least runs your essential services, such as the server that grants access to your buildings,” says Jansen. “The fact that this did not happen is highly worrying.”

On 7 May, a fire broke out at the NorthC data centre in Almere, which houses servers belonging to Utrecht University (UU). As a result, many of the university’s services did not work, or did not work properly, for a week. Staff were temporarily unable to enter their workspaces or access documents on the O drive, while students could not access course materials and homework. Several websites under the UU domain, including DUB's, were offline for a week.

Critical services
The university states that it is focusing on recovery and therefore prefers not to explain exactly what went wrong at this stage. However, DUB's conversations with Slinger Jansen, Nishant Saurabh, and IT staff who worked on the recovery during the outage indicate that, at least in theory, the university could have switched immediately to a server containing a copy of all services in the event of a power cut. In the past, though, UU chose not to set up a complete automatic switch to a backup.

Too expensive
According to the IT staff DUB spoke with, the university opted out of an automatic switch due to high costs. Technical complexities may also have played a role in this decision. They say that, in the past, UU conducted an inventory of its critical services and then decided to provide online backups for only two of them: the network on which the laptops, Wi-Fi, and cables in the buildings operate, and the authentication system that powers the Solis login system. This enabled UU to restart the system after a while, using the online backup.

Switching to backups
The Associate Professor of Computer Science, Slinger Jansen, believes that Utrecht University should have ensured that everything would automatically switch to backup servers in the event of a failure. Even though UU said no to the expensive automatic redirection from Almere, Jansen believes the university could still have "made sure that all data is automatically redirected to backup servers" in another data centre. "We could work via Italy for a day instead of the Netherlands, or via Groningen instead of Almere."

UU is lagging behind
“Other organisations do that too,” Jansen continues. “If there’s a fire in the data centre housing ING’s server or Instagram’s server, we don’t even notice. The services remain online. UU is lagging behind in that respect. All data can be replicated and stored in other places, but apparently the university doesn’t have the mindset to work that way yet.”

As a result, the university's systems were down for much longer than necessary, and the recovery was slow. According to Jansen, UU employs IT staff with “a great deal of expertise”. He argues that “the people who are now working hard to resolve everything also know how to deal with such scenarios from a technical perspective. At the same time, this situation shows that the university still has a significant way to go in terms of crisis preparedness and investment in digital continuity.”

Migration policy
Nishant Saurabh, a lecturer in Computer Science, also believes the university lacks a proper disaster recovery plan. After all, it took a long time for its systems to be back up and running. “A fire can break out in any data centre. That is why it is important to have backups for everything. Three backups are usually the standard, and these should also be available online. There should also be a migration plan in place to quickly migrate data if systems go offline. The university does not seem to have arranged this properly.”

I hope the disruption serves as a wake-up call

According to Saurabh, it is not always easy to automatically transfer a massive computer system to backup servers. “But at the very least, a semi-automated solution should be implemented,” argues the researcher. “You need to be able to anticipate an emergency situation and already determine what your migration policy looks like in order to minimise downtime.”
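
To make the idea of a pre-determined, semi-automated migration policy concrete, here is a minimal sketch in Python. It is an illustration only, not a description of UU's systems: the `Site` type, the `plan_migration` function and the operator-confirmation callback are all hypothetical names, and the site names are borrowed from the examples in this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Site:
    """A data centre location and its last known health (hypothetical type)."""
    name: str
    healthy: bool


def choose_failover(primary: Site, fallbacks: List[Site]) -> Optional[Site]:
    """Return the primary if it is healthy, else the first healthy fallback."""
    if primary.healthy:
        return primary
    for site in fallbacks:
        if site.healthy:
            return site
    return None


def plan_migration(primary: Site, fallbacks: List[Site],
                   confirm: Callable[[Site], bool]) -> str:
    """Semi-automated policy: the target is picked automatically from a
    pre-agreed ordered list, but a human must confirm before migrating."""
    target = choose_failover(primary, fallbacks)
    if target is None:
        return "no healthy site available"
    if target is primary:
        return "no migration needed"
    if confirm(target):
        return f"migrate to {target.name}"
    return f"migration to {target.name} awaiting approval"


# Example: the primary site is down and an operator approves the proposal.
almere = Site("Almere", healthy=False)
groningen = Site("Groningen", healthy=True)
decision = plan_migration(almere, [groningen], confirm=lambda site: True)
print(decision)
```

The ordered fallback list is the part that is decided in advance; only the final approval is left to a person, which is one way to read "semi-automated".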

Anticipating
According to the researcher, Utrecht University should actually have responded before the emergency situation occurred. “We have developed good software that helps anticipate problems by monitoring certain statistics. You can identify in time which situations may arise and what actions can be taken to limit the problems, for example by alerting the right person. That is a semi-automated action. I do not get the impression that Utrecht University has arranged any of these things. I hope this outage will serve as a wake-up call for them.”
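
The kind of monitoring Saurabh describes can be sketched roughly as follows. This is not the software he refers to; it is a minimal Python illustration in which the metric names, limits and contact roles are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Threshold:
    """A limit on one monitored statistic and who to alert (hypothetical)."""
    metric: str
    limit: float
    contact: str


def check_metrics(readings: Dict[str, float],
                  thresholds: List[Threshold]) -> List[str]:
    """Compare current readings with their limits and return one alert
    message per exceeded limit, addressed to the responsible contact."""
    alerts = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.contact}: {t.metric}={value} exceeds {t.limit}")
    return alerts


# Assumed example values; real limits would come from the operators.
thresholds = [
    Threshold("inlet_temp_c", 35.0, "facilities-on-call"),
    Threshold("disk_used_pct", 90.0, "storage-team"),
]
readings = {"inlet_temp_c": 41.0, "disk_used_pct": 60.0}
alerts = check_metrics(readings, thresholds)
for message in alerts:
    print(message)
```

The check itself is automatic, while acting on the alert remains a human decision, which matches the semi-automated approach described above.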

Storing data on different servers
Above all, Saurabh believes that IT facilities at UU have not been sufficiently modernised. In his view, the biggest problem is that the university’s data is stored locally, with many different services housed on servers in a single building. If something goes wrong in that building – as was the case with the data centre in Almere – all the services go down. Saurabh advocates decentralising systems so that the entire infrastructure is not managed by a single organisation. If this were done, services such as student ID cards, applications, and the O drive would no longer be hosted on servers in a single data centre, but across multiple locations.

For example, UU could consider joining a shared infrastructure with other Dutch universities. “If UU were to go down, the servers of the other universities would still be accessible. UU’s services would then still need to be accessible through the other servers, isolated by means of secure access control.”

The Executive Board reacts
In a written statement, the Executive Board (CvB) acknowledges that “the fire at the NorthC data centre has had a major impact on students and staff, and that they have questions about the extent to which the university was prepared for this. We understand that these questions exist. After the Whitsun weekend, we will begin an evaluation and examine the lessons to be learned. Until then, the Executive Board does not wish to comment in detail on the suggestions made in this article by Slinger Jansen and Nishant Saurabh. This also means that no assessment has yet been made as to whether the information outlined by DUB and the experts is factually correct.”

Comments

You know what would be great? A computer scientist on the University Council next year! On 1-3 June, upload no. 5 of VUUR, Maurits, into the council!

As a UU SAP administrator, I am a little disappointed that there is no mention of the fact that we brought our ERP online in a secondary data centre (with sweat and a fair amount of worry). This despite the necessary triage: network, identity and education/research services inevitably take priority over us. Many thanks to the network colleagues!

This coming weekend: the rollback to normal.

I can say from experience that choosing redundancy is not free. "Automatic" is a lie, by the way: a disaster recovery test is not without risk, and the required audit trail takes a lot of effort for which it is hard to find appreciation. Until things really go wrong :)

Get everyone around the table and set up business continuity carefully, I would say.

Let's be clear up front: accidents happen and networks are not 100% reliable. Everyone knows that. But can we please make sure there are no single points of failure that bring down processes that should be completely unrelated? For example, there is no logical reason whatsoever why something as simple as printing in the city centre should depend on an active server in Almere.

The theory behind failover data centres and disaster recovery (DR) for calamities was largely developed in the 1970s and 1980s. Every IT professional knows how it should work in theory: with an unlimited budget, a robust, well-tested failover solution is relatively "easy" to realise. Unfortunately, we do not live in that ideal world, so choices and trade-offs always have to be made based on cost, complexity and risk. After the recent outage of the data centre in Almere, it is therefore wise to critically review whether the choices made are still the right ones.

Thanks to the colleagues at ITS for their excellent work and skilled approach to getting everything operational again quickly.
