An independent report into the 2022 Rogers outage says the company lacked several protections and redundancies that could have either prevented the outage or ended it sooner.
The report delivered to the Canadian Radio-Television and Telecommunications Commission says that since the outage, the telecom company has implemented the changes needed to address the cause of the outage and improve network resiliency and reliability.
In a separate letter posted to its website Thursday, the CRTC confirmed that Rogers has also implemented all the report’s additional recommendations.
“We said we would fix this – we completed a full review of our networks, strengthened our network resiliency, implemented all the report recommendations, and today our networks are recognized as the most reliable by global benchmarking leaders,” said Rogers spokeswoman Sarah Schmidt in a statement.
The outage, which began in the early morning of July 8, 2022, lasted more than 24 hours and affected more than 12 million customers.
A configuration error during a network upgrade sent a flood of data to the core network routers, causing them to crash, according to the executive summary of the report by Xona Partners Inc. posted online Thursday.
The network failure could have been prevented if the core network routers had been configured with an overload limit, the report said.
Once the outage began, it was prolonged by several factors, the report says.
The Rogers network operation centre and other critical remote infrastructure sites did not have redundant connectivity from other service providers, the report said, limiting access to critical equipment during the outage. Staff had to be physically dispatched to remote sites in order to access the affected routers, delaying recovery efforts.
In addition, Rogers staff didn’t have backup connectivity from alternative service providers, so they couldn’t communicate with each other until the company sent SIM cards from other service providers to its remote sites.
The report said that staff also didn’t initially have access to information like the routers’ error logs and were unable to pinpoint the root cause of the outage for around 14 hours. There had also been multiple configuration changes made that day. These two factors contributed to the root cause being initially misdiagnosed, the report said.
The measures taken by Rogers since the outage include addressing the critical deficiencies exposed by the outage, separating the IP core for its wireless and wireline networks, and improving the processes for change management and incident management, the report said.
The report made seven recommendations of additional measures Rogers could take to improve its network resiliency.
Among the recommendations, which Rogers has since implemented, are that the company test emergency roaming with other mobile network operators, develop a detailed root cause analysis for future outages, and expand the scope of incident management drills.
Rogers sent a letter to the CRTC on Jan. 17 outlining how it responded to the report’s recommendations of additional measures.
In the CRTC’s letter confirming those additional measures were implemented, the commission said that by July 4 next year, Rogers must report on whether the measures continue to address reliability issues, and on progress made in separating wireline and wireless core networks.
Rogers is partnering with Cisco in its work to split and build a new dedicated IP core, separating the two networks, said Schmidt. The company has also introduced new change controls that will limit the effects of “customer-impacting events,” she said, as well as “AI-based predictive simulation capabilities to strengthen our testing and monitoring.”
The report also included recommendations for all telecom network operators based on the “important lessons learned” from the outage. These include implementing router overload protection in the IP core and distribution networks; providing backup connectivity for the network operation centre, critical remote sites and critical staff; and simulating network failure and outage scenarios to uncover deficiencies.