I’d be interested to hear that story UG... while we’re comfortably OT, I have a couple of similar stories that still give me a twinge in my guts. One involves having a dev and a live SQL window open at once, and another one involves a misconfigured null check in a global replication routine... I hung my head.
Non tech people should probably stop reading here!
The short version is this:
I was supporting the Unix systems for SAP. There were a number of other Unix teams within the company.
Each team had its own "domain" (using NIS) for managing Unix users and groups. In Unix (and Linux) the user ID and group ID numbers are what the OS cares about - names are merely "labels" for humans. Each team was assigned specific ranges of UID and GID numbers, even though each had its own domain.
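For anyone who wants to see the "numbers, not names" point for themselves, here is a minimal sketch in Python (not what I used at the time). The helper name owner_of is made up for illustration; os.lstat and pwd.getpwuid are standard library calls on Unix.

```python
import os
import pwd
import tempfile


def owner_of(path):
    """Return (uid, name_or_None) for path.

    The uid is what the filesystem actually stores. The name is only a
    lookup in the user database and may not exist at all.
    """
    uid = os.lstat(path).st_uid
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        name = None  # no label for this number; the file is still owned by uid
    return uid, name


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(owner_of(tmp.name))
```

If two domains hand out the same number to different people, the files cannot tell them apart, which is why merging domains forces the renumbering described next.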
The company decided to implement Active Directory single sign-on for Unix by incorporating an AD extension to include Unix user and group attributes into the central Active Directory domain... But you need to maintain unique user and group IDs, and not all teams had stuck to their assigned ranges over time.
So, as part of this multi-team project, many of us had to resolve conflicting user or group IDs by assigning new values to the affected users and groups, and then updating the ownership of every file they owned across our team's entire system landscape.
I wrote a series of Perl scripts using the File::Find module, which (basically) emulates the find command but in a programmatic fashion. That let me match on very specific file metadata attributes and drive these changes across all our systems from a single point.
For every file or directory that was owned by a conflicting user or group ID, it would report that information and update the values.
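A rough sketch of that report-and-update idea, written in Python rather than the original Perl and File::Find. The function names and the uid_map/gid_map dictionaries (old ID to new ID) are hypothetical, not the real scripts. It splits planning from applying, so the report can be read before anything is touched.

```python
import os


def plan_ownership_changes(root, uid_map, gid_map):
    """Walk root and list the ownership changes that would be made.

    uid_map and gid_map map a conflicting old ID to its new ID.
    IDs that are not in a map are left exactly as they are.
    Returns tuples of (path, old_uid, new_uid, old_gid, new_gid).
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    changes = []
    for path in paths:
        st = os.lstat(path)  # lstat: look at symlinks, not their targets
        new_uid = uid_map.get(st.st_uid, st.st_uid)
        new_gid = gid_map.get(st.st_gid, st.st_gid)
        if (new_uid, new_gid) != (st.st_uid, st.st_gid):
            changes.append((path, st.st_uid, new_uid, st.st_gid, new_gid))
    return changes


def apply_changes(changes):
    """Apply a reviewed plan. Needs sufficient privileges (usually root)."""
    for path, _old_uid, new_uid, _old_gid, new_gid in changes:
        os.lchown(path, new_uid, new_gid)
```

The point of the split is that plan_ownership_changes is safe to run anywhere, and an unexpectedly long plan is a warning sign before apply_changes ever runs.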
I (naturally) did a large amount of testing before the real event. No issues encountered.
The morning of the event (which was scheduled to begin that evening) I thought of another corner case, added logic and did more testing. Everything was great.
That night, I kicked off the first phase and was closely monitoring and seeing no issues. After a while I triggered a second, parallel stage. Also going great.
After a while, I stopped monitoring because there were no issues.
About an hour later, I got a call from the monitoring team, as I was also the on-call support that weekend. They had noticed that a few of the big weekend backups had issues. We had a lot of backups, so that was pretty typical.
As I was investigating and not seeing the typical errors, I got another call: all active backups had failed! Uh oh, that's not good.
Since I wasn't the expert on the backups, I called the person who was. He was also my boss.
He wasn't home so I was "driving by cell phone" and describing what I was seeing... Nothing was standing out to him, either, but suddenly it occurred to me that the issues we were experiencing could be caused by incorrect file ownership.
I was "talking thru" this when I confirmed that this was in fact the reason and that it must be related to my (still running) change. I immediately cancelled all the processes.
I got light headed and my heart was in my throat... I thought I might throw up, or cry, or both!
My boss (who was always very cool and pragmatic) says nothing. Hello? Turns out he lost connection right before I revealed the smoking gun.
I had to call him back and explain the whole thing - again!
So, Mr calm and cool says: I'm going to get a bridge line set up and call these people and you call those people. Let's figure out what to do.
After the bridge call, we agree to meet at the office and start working to understand the scope and undo it.
That's when my coworker from the previous post told me he was coming to get me.
About 8-10 of us spent 14 hours, all through the night, using various methods to correct the ownership and get everything back up and running... Because it was ALL down!
So what happened?
All of the files for conflicting users and groups that should have been changed were changed exactly as they should have been.
However, every other file was changed to be owned by user "root" (the Unix super user).
Hence, the Got Root?(tm) incident.
The cause: the last-minute change I made that morning added a condition to a multi-level bit-wise Boolean expression, and I didn't put it inside the parenthesised grouping of tests where it belonged.
When I tested the change, I only tested positive cases (files that should be changed) and no negative cases (files that should be left alone)... A bad thing to overlook, as a negative case would have immediately shown my mistake.
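To show the class of mistake, here is a minimal sketch in Python. It is not the actual Perl expression: the condition names (owner_conflicts, is_file, is_dir, is_link) and both functions are invented for illustration. In Python, as in Perl, "and" binds tighter than "or", so a test added outside the parentheses changes what the whole expression means.

```python
from itertools import product


def should_change_fixed(owner_conflicts, is_file, is_dir, is_link):
    # Intended logic: only touch things whose owner conflicts.
    return owner_conflicts and (is_file or is_dir or is_link)


def should_change_buggy(owner_conflicts, is_file, is_dir, is_link):
    # Same tests without the grouping. This parses as
    #   (owner_conflicts and is_file) or is_dir or is_link
    # so any directory or link matches, conflict or not.
    return owner_conflicts and is_file or is_dir or is_link


def disagreements():
    """Return every input combination where the two versions differ."""
    return [
        combo
        for combo in product([False, True], repeat=4)
        if should_change_buggy(*combo) != should_change_fixed(*combo)
    ]


if __name__ == "__main__":
    for combo in disagreements():
        print(combo)
```

Every combination that disagreements() returns has owner_conflicts set to False. With positive test cases only, the two versions look identical; the difference shows up only on things that should have been left alone.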
I fully expected to be fired. I wasn't... Everyone was very understanding.
And on top of that, my boss bought me a Got Root? shirt that still hangs on my cubicle wall.
Teachable Moment!