
The CrowdStrike “blue screen of death” event that began on July 19 affected at least 8.5 million computers worldwide and is being called the biggest IT outage in history. Early rough estimates put the cost to organizations and individuals in the billions of dollars.

At Carolina, the CrowdStrike incident affected roughly 5,000 machines. IT staff from across campus, including ITS employees, rallied to manually repair these machines and address the impact at the University.

Campus received 829 tickets

Help requests poured in from campus members. Of the 829 tickets submitted to the ITS help portal on July 19, the Service Desk touched 291 and directly resolved 58, said Calvin Groves, Director, ITS Customer Support & Outreach. Most of those 58 resolved requests came from campus members who walked into the Service Desk at the Student Union for assistance with laptops and workstations.

Paris, France – Jul 20, 2024: A close-up of an iPhone displaying a CNBC article headline Microsoft-CrowdStrike issue causes largest IT outage in history, highlighting the severity of the IT disruption

Many of the machines that IT staff worked on at the Service Desk, at South Building and in buildings across the University required 10 to 15 minutes of manual work each.

Over at South Building, Mark Wampole and Bill Vogt of the ITS University Administration Support team helped nearly 100 people or computers and talked to about 70 people. (The third member of the team, Alex Tocchi, happened to be on vacation and missed out on all the fun.) The team provides executive IT support for senior University administrators and associated support staff.

Mark Wampole

Wampole and Vogt seamlessly helped their clients — including folks in the Chancellor and Provost offices — get back up and running, being mindful that their clients needed their machines to function right away.

“I sent out instructions to our customers,” Wampole said. “Some were able to fix the problem themselves, some needed BitLocker codes, some needed to bring in their laptops for us to fix, and some just needed us to walk them through the process.”

Wampole and Vogt also fixed about 15 machines in conference rooms, about 10 machines for people who were out of the office, and about five student machines.

ITS response began in wee hours

Some ITS employees began responding to the CrowdStrike situation while most of us were still obliviously sound asleep.

The ITS Operations Center began noticing the effects of the CrowdStrike situation around 2 a.m. when machines began crashing one after another, said Jay Bernardoni of the Operations Center. He and co-worker Cathey Stansbury were on the overnight shift.

While the Operations Center staffers were diagnosing the situation, the Department of Public Safety informed the Operations Center that the department’s critical hardware was crashing as well.

A wide view of the Operations Center, showing banks of monitors in front of the Data Center
Neil McKeeman (not pictured) was working solo in the ITS Operations Center the morning of July 19
Neil McKeeman

At that time, staffers with the Operations Center and the Information Security Office alerted and activated ITS employees responsible for core critical services and other key people who needed to jump on the response.

“The scope of the situation was not one that we wanted people to discover when they first started work in the morning, so with the blessing of Ryan Turner, who oversees the Operations Center and ITS Networking, contact procedures began to wake up the experts,” said Neil McKeeman, who manages the Operations Center.

Broad collaboration

By 3 a.m., ITS experts from multiple divisions were collaborating online on several Microsoft Teams channels. Some of those ITS staffers who got an early start included Richard Hill and Matthew Rice of ITS Systems Administration; Brent Caison, Director for ITS Global Systems and Cloud Architect; Matthew Conley, ITS Manager, Storage, Server and Application Virtualization; Systems Specialist Frank Cuicchi; John McGarrigle, Associate Director of Database & PeopleSoft Admin; Matthew Mauzy, Emergency Response Technology Manager; and Systems Specialist Patrick Murphy. Kenneth Langley of the School of Medicine IT also helped ITS respond.

The first goal was to determine if critical servers were affected. The second goal was to try to establish the scope of the situation.

The crashing computers were generating error messages stating that CrowdStrike csagent.dll was the root cause. The initial thought was that CrowdStrike was working fine and that a new patch rolled out by Microsoft tried to do something that CrowdStrike didn’t like, and that had caused the operating system to crash.

“This was a red herring, but it was what we were all chasing down at the start,” Bernardoni said.

Once ITS discovered that computers that hadn’t received the Microsoft patch were crashing too, Alex Everett from the Information Security Office stepped in. Armed with the knowledge that csagent.dll was on the error screens, Everett checked with CrowdStrike technicians. He learned the problem was caused by a CrowdStrike update that the company had already rescinded by that hour. No new computers would be affected, but computers running Microsoft’s operating system that had already downloaded the update would have problems. CrowdStrike provided Everett with two instruction sets for bringing computers back online.

Generated list of every affected computer

By 3:20 a.m., Everett had explained to ITS incident responders why this was happening and how ITS could mitigate the issue for the University. Because this was a pushed update, the Information Security Office was able, with some effort, to generate a comprehensive list of every computer on campus that downloaded that patch. That gave ITS staff the scope of the problem and a means to track the issue.

Alex Everett

When ITS staffers ascertained that this CrowdStrike incident was affecting thousands of campus members, it became apparent this was going to be a long day and maybe a long week, said Bernardoni. “The most memorable comment at that time was, ‘At least this didn’t happen at 5 p.m.,’” he said.

Jack Smith and Thad Dodd of the ITS Managed Desktop Services unit were among the ITS staffers who were asked to join the effort in the wee hours. Their group provides IT support to 14 campus departments, including ITS itself.

Provided in-person and remote help

Jack Smith

Smith stayed on Microsoft Teams with the Operations Center until 3:30 a.m. After CrowdStrike released a fix for the blue screen problem, Smith headed into the office at about 9:30 a.m. At ITS Franklin, he set up in-person support while also helping customers remotely.

“Once we got caught up on the incoming tickets, we took the reports from the Information Security Office to proactively reach out to customers who had machines on the list,” said Jason Cross, who manages ITS Managed Desktop Services.

As mentioned, Rice of ITS Systems Administration was one of the ITS staff members who started at 3 a.m. He was instrumental in getting all the ITS infrastructure working early in the morning. Rice then helped departmental IT staff with their problems for the rest of the day. He finished helping with customer servers at 6:02 p.m.

Even then, Rice’s manager Richard Hill said he had to urge Rice “to stop and go rest.”

Tsunami of incidents

Over at the Operations Center, McKeeman took over the shift from Bernardoni and Stansbury at 7 a.m. Like the staff on the previous shift, McKeeman did not miss a beat monitoring the tsunami of incidents and notifying the necessary people, even though he worked solo because he was down two employees.

About the same time, Chief Information Security Officer Paul Rivers drafted a formal notice about the CrowdStrike incident, in consultation with J. Michael Barker, Vice Chancellor for Information Technology and Chief Information Officer, and pulled in ITS Communications to send the message to campus by 7:49 a.m.

Like staffers within other ITS groups, the Information Security Office team members worked the CrowdStrike situation all day — and through the week. Sam Garcia of the ISO assisted the Adams School of Dentistry while fellow ISO staffer Josh Jenkins provided clients with insights on what machines and systems were impacted.

While most ITS employees work remotely, ISO’s Everett further supported ITS’ response by going into the Service Desk to help with fixes in person.

Josh Jenkins
Sam Garcia
Larab Zaman

‘Above and beyond’ customer support

By evening, the Operations Center, a 24/7/365 operation, needed extra support. Service Desk colleagues Rob Wilson and Larab Zaman stepped up. They extended their shifts into the night to enable the Operations Center staffers to concentrate on critical tasks, ensuring the smooth running of operations, said Christina Artis, Service Desk Tier 1 manager.

“Everyone at the Service Desk and across ITS went above and beyond to help,” said Calvin Groves, Director, Customer Support & Outreach.

“Waking up to a mess and scrambling to help coordinate was fun, but it seemed like most of the systems work was already in a good place,” he added. “The phone wait times were longer – we saw more than double the number of calls in the first hour than we saw the full previous Friday — but everyone pitched in. Service Desk staff shifted around to help where they could, and we got into a good cadence helping folks out of the Union, and over the phone.”

Bernardoni of the Operations Center summed up his experience responding to the CrowdStrike outage: “It was both the longest and shortest night of my career at UNC.”

 
