In an earlier post I wrote about why we shouldn't be so transparent in our sysadmin duties that we end up facilitating a deluge of improperly diagnosed problems that we didn't cause. It was the whole Post hoc ergo propter hoc discussion. Well, now I want to call out the propensity to err on the other side by never accepting (let alone investigating) legitimate problems that we cause.
I have worked with a few sysadmins who get personally offended when help desk tickets come in, even when those tickets raise legitimate concerns about problems that we caused. Please don't be one of these sysadmins. You know the pattern: a knee-jerk reaction followed closely by rants of, "Their (the end users') troubleshooting sucks!" or, "They have no idea what the hell I do. I didn't cause this problem!"
I can’t help but think of the South Park where Kyle won’t swim in public pools because of so much pee. “We’ve just tested the PH balance in the pool...as we suspected it’s all P and no H.”
I remember one time a ticket came over to my department. It seemed that several brand new servers in a different department were "randomly" blue screening. A person I worked with (who is still a good friend, so I won't name names) shrugged this problem off immediately and with extreme prejudice. He said, "We haven't touched those computers!" I interjected and said, "Hey, there is a possibility that the TMA is peeing in their pool. I'm going to check this out."
You see, at this organization, we used Tivoli, and back then, the agent was known as the TMA, or Tivoli Management Agent. This agent had root access. I asked the sysadmin who had reported the problem to my department if he had a list of the dates and times these servers had crapped the bed. As a seasoned and respected sysadmin, he had this information. I noticed that each date and time reported coincided with a scheduled Tivoli Inventory scan on those machines. I'll cut to the chase. These brand new servers had no USB ports, and the Tivoli scanner was attempting to scan the USB bus. There was obviously very poor error handling in the Tivoli scanner, so when it tried to scan something that (it assumed) must always exist, it couldn't handle the exception, and a blue screen immediately followed.
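If you want to do the same kind of timestamp matching without eyeballing two lists, here is a minimal Python sketch. To be clear, this is not what I ran back then, and everything in it is an assumption for illustration: the function name `crashes_near_scans`, the 30-minute window, and the made-up dates. It simply pairs each crash with a scan that started shortly before it.

```python
from datetime import datetime, timedelta


def crashes_near_scans(crash_times, scan_times, window=timedelta(minutes=30)):
    """Pair each crash with the first scan that started at most `window` before it."""
    matches = []
    for crash in crash_times:
        for scan in scan_times:
            # Only count crashes that happen at or after the scan start.
            if timedelta(0) <= crash - scan <= window:
                matches.append((crash, scan))
                break
    return matches


# Hypothetical data, invented purely for this example.
scans = [datetime(2000, 3, 1, 2, 0), datetime(2000, 3, 8, 2, 0)]
crashes = [
    datetime(2000, 3, 1, 2, 7),
    datetime(2000, 3, 8, 2, 5),
    datetime(2000, 3, 10, 14, 30),
]

matched = crashes_near_scans(crashes, scans)
print(f"{len(matched)} of {len(crashes)} crashes fall inside a scan window")
```

A high match rate is still only correlation. It tells you where to dig, which is why the next step was reproducing the crash on a test box.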
I went to the other department and grabbed one of these USB-less servers to test my hypothesis. Sure enough, running the scan caused a blue screen. First, I removed these types of servers from future scans. Then, I created a new scan configuration that didn't attempt to scan USB and applied it to these servers. Next, I alerted IBM, who, to their collective credit, added some better error handling to their scanners so that the absence of a USB bus didn't cause the scanner to behave like a fainting goat.
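I never saw the scanner's source or IBM's fix, so the following is only a sketch of the general principle, not their implementation. The names `scan_inventory`, `probe_usb`, and `probe_pci` are hypothetical, and a real agent crashing at the driver level is a much nastier problem than a Python exception. The idea is just that a scanner should record a missing component and move on instead of assuming it exists.

```python
def scan_inventory(probes):
    """Run each hardware probe, noting missing components instead of crashing."""
    inventory = {}
    skipped = {}
    for name, probe in probes.items():
        try:
            inventory[name] = probe()
        except OSError as exc:
            # The component is absent or unreadable: log it and keep scanning.
            skipped[name] = str(exc)
    return inventory, skipped


def probe_usb():
    # Hypothetical probe that simulates a server with no USB bus.
    raise FileNotFoundError("no USB bus present")


def probe_pci():
    # Hypothetical probe that returns made-up device names.
    return ["nic0", "raid0"]


inventory, skipped = scan_inventory({"usb": probe_usb, "pci": probe_pci})
print("scanned:", inventory)
print("skipped:", skipped)
```

Here the scan still reports the PCI devices and lists USB as skipped, rather than taking the whole machine down with it.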
We must always remember that even though end users don't always understand causation vs. correlation, this doesn't mean that they are always wrong. We wield powerful tools that can do real damage when they aren’t used properly. Be respectful. Do some troubleshooting. Every sysadmin who is worth their salt enjoys the challenge that different problems present. Have some damn fun, do some digging, and quit pretending that you’ve never peed in someone else’s pool.