Service degradation on January 10th 2024

Summary

Between 9:45 and 11:45 on January 10th 2024, the PDM system experienced a degradation of service that affected most users.

The service degradation was caused by throttling of the connection between our web services and the databases.

Although there was no outage of the system, the throttled connection severely affected performance, causing the system to run very slowly. In many cases end-users will have experienced error messages because data could not be loaded within the time allowed by the application.

Our technical team diagnosed that there was an issue with our Azure infrastructure and contacted Microsoft support. A Microsoft technician investigated the cause and proposed a mitigation, which our team applied. This resolved the degradation of service.

Details

Our technicians were alerted to the issue approximately one minute after it began and started investigating the cause. They rolled back a bug fix that had been applied to PDM Web just before the issue began, in case this was the cause; however, the issue continued, so the fix proved to be unrelated. Next, they rolled back a change made earlier that morning to the number of CPU cores available to the databases. Again, this did not improve matters, so that change was also ruled out as the cause.

Our technicians observed that the load on the PDM databases was drastically reduced when the issue began. This indicated that the degradation of connectivity to the databases was not caused by exhaustion of resources on the database side. They also observed that connections to the databases were slow no matter where the connection was made from.

It was noted that the PDM Service Status page did not report any issue. This was because our monitors only detect complete outages of the system and not degradation of performance.

Having determined that the issue was within the Azure network infrastructure, our technicians contacted Microsoft technical support. A support agent began investigating the logs and recommended that we change a setting in our Azure infrastructure that might mitigate the issue. When we applied this change, the service degradation was resolved.

We will not revert the change that mitigated the issue, so we do not expect this issue to recur.

Could the issue have been avoided?

The Microsoft technician explained the cause of the issue as follows:

"By default, connections originating from the Azure network boundary will use the proxy method which will be a shared common endpoint for connecting to the database(s) in that region. If performance regressions are seen or heavy traffic from a client are seen, throttling can occur."

The Microsoft technician explained that this throttling can be avoided by using the "redirect" method for connections rather than the "proxy" method. According to this article, Microsoft recommends use of the "redirect" method.

Our databases were deployed using the default configuration, so, as explained in the linked article, our connections used the "proxy" method because they are treated as coming "from outside Azure." This is because, although the web services that connect to the databases are hosted within Azure, the connections to the databases are made through a "private end-point."
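
For reference, the connection policy is a single setting on the logical SQL server rather than a change to PDM itself. The sketch below, which uses the azure-mgmt-sql Python SDK, illustrates how the current policy could be read and switched to "redirect"; the resource names are placeholders and the operation names may differ between SDK versions (older versions expose create_or_update without a poller).

    # Sketch only: read the Azure SQL connection policy and switch it to
    # "Redirect". Resource names are placeholders, not our real resources.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient
    from azure.mgmt.sql.models import ServerConnectionPolicy

    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    SERVER_NAME = "<logical-sql-server-name>"

    client = SqlManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # "Default" behaves as described in the quoted explanation above:
    # proxy for connections treated as coming from outside Azure.
    current = client.server_connection_policies.get(
        RESOURCE_GROUP, SERVER_NAME, "default"
    )
    print("Current connection policy:", current.connection_type)

    # Switch the server to the "Redirect" policy recommended by Microsoft.
    poller = client.server_connection_policies.begin_create_or_update(
        RESOURCE_GROUP, SERVER_NAME, "default",
        ServerConnectionPolicy(connection_type="Redirect"),
    )
    print("Updated connection policy:", poller.result().connection_type)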

If we had been aware of this recommendation, we might have avoided this service degradation.

How we will improve

We will review key settings across our Azure infrastructure, comparing them with the recommendations provided in Microsoft's documentation.

We also plan to investigate whether the way we monitor our systems can be improved so that severe degradation of service is detected, making our service status page more meaningful to our end-users.
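
As an illustration of the kind of check we have in mind, the sketch below times a trivial query against a database and classifies the result as healthy, degraded, or down; the threshold and the use of pyodbc are assumptions for illustration, not our production monitoring configuration.

    # Sketch of a latency-aware health probe. The threshold and the pyodbc
    # connection string are illustrative assumptions only.
    import time

    import pyodbc  # assumes an ODBC driver for SQL Server is installed

    CONNECTION_STRING = "<database-connection-string>"  # placeholder
    DEGRADED_THRESHOLD_SECONDS = 5.0  # example threshold for "degraded"

    def probe_database() -> str:
        """Return "healthy", "degraded" or "down" based on a timed query."""
        started = time.monotonic()
        try:
            with pyodbc.connect(CONNECTION_STRING, timeout=30) as connection:
                connection.execute("SELECT 1").fetchone()
        except pyodbc.Error:
            return "down"  # what our current monitors already detect
        elapsed = time.monotonic() - started
        if elapsed > DEGRADED_THRESHOLD_SECONDS:
            return "degraded"  # slow but responding, as on January 10th
        return "healthy"

    if __name__ == "__main__":
        print(probe_database())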