GitHub’s Chief Security Officer and SVP of Engineering, Mike Hanley, shared more details today on a string of outages that hit the code hosting platform last week.

While these incidents had unrelated root causes, they affected most of GitHub’s primary services from May 9 to May 11, causing widespread database connection and authentication failures for up to ten hours.

“Last week, GitHub experienced several availability incidents, both long running and shorter duration. We have since mitigated these incidents and all systems are now operating normally,” Hanley said.

“The root causes for these incidents were unrelated but in aggregate, they negatively impacted the services that organizations and developers trust GitHub to deliver. This is not acceptable nor the standard we hold ourselves to.”

On May 9, eight main services were hit by a major outage caused by a configuration change to GitHub’s internal service serving Git data.

The second outage, on May 10, disrupted the issuance of authentication tokens for GitHub Apps and resulted from high load and an inefficient implementation of an API responsible for managing GitHub App permissions.

“On May 10, the database cluster serving GitHub App auth tokens saw a 7x increase in write latency for GitHub App permissions (status yellow),” Hanley explained.

“The failure rate of these auth token requests was 8-15% for the majority of this incident, but did peak at 76% for a short time.”

The third GitHub outage experienced by users last week, on May 11, was due to a loss of read replicas after a database cluster serving Git data crashed and triggered an automated failover mechanism.

Incident history (GitHub)

“We are addressing the Git database crash that has caused more than one incident at this point. This work was already in progress and we will continue to prioritize it,” Hanley said.

“We are addressing the database failover issues to ensure that failovers always recover fully without intervention.”

GitHub will share more detailed information on these outages and what it’s doing to address the issues that caused them in its May Availability Report.

“The May report will include these incidents and any further detail we have on them, along with a general update on progress towards increasing the availability of GitHub,” Hanley said.

GitHub was also affected by multiple outages within a week in March 2022, when the company revealed that the incidents were caused by resource contention issues in the platform’s primary database cluster.

Another major outage impacted GitHub in February 2022, when the platform was down worldwide, preventing access to the website and blocking commits, cloning, or pull request attempts.


