This post is participating in Adam Machanic’s new invention and exciting blog party tradition in SQL Server community known as T-SQL Tuesday. Here is my story for T-SQL Tuesday #002: A Puzzling Situation without any further delay.
I encountered my share of puzzling situations in my career but this story really stands out as it put me on the verge of loosing my job few years ago. It was emotional and at the end of the day it was really exciting and fulfilling as I got to fix the situation and get my name cleared from a messy situation.
Few years ago, I used to work for a finance company and was tasked with a major project that was crucial for smooth running of the legacy system until a brand new and a robust system was launched. Top tier management was very keen on getting the daily/weekly updates on this project as well as making sure it was deployed smoothly. Due to the nature of the change, a lot of data had to be manipulated and the same deployment has to be repeated for few cycles.
At that time we had one-way Transactional replication from SQL 2K (SP3) to SQL 2K (SP3) and during deployments, replication should be teared down and rebuilt as a company policy if it involves table(s) in replication. Lots of data in the tables had to be modified during this deployment and it seemed a wise choice to follow the company policy to tear down replication. And I have supported plenty of deployments in this company and teared down and rebuilt replication few times during the deployments. It was a piece of cake for me.
At some point, a decision was made to upgrade to SP4 and changes were rolled to all prod servers but replication wasn’t rebuilt after the upgrade. Couple of weeks later, my project (I was a database developer and a contractor) was due for deployment and I am at the data center supporting the deployment( usually deployments are done at midnight). Replication was brought down, project deployment was successful and while rebuilding, replication failed after 2 hours. The SCM person, re-started it without reading the complete error message at 3 am and it failed again after 2 hours and we are almost nearing the SLA. Replication complained about Unique key violation in one of the articles on which my deployment modified the data heavily. I knew I had to call my manager at 5 am and lot of calls were made to escalate the issue to all levels/groups.
Production Support DBA group took over the issue at 6 am and I was kept in dark from the troubleshooting steps and my manager who shall remain anonymous for this post, asked me 3 times in 2 hours to check if I my scripts are responsible for the screw-up. I knew my head was on the line and can sense the background discussions and felt that I will be made a scape-goat for this situation. 4 Production support DBA’s and 2 managers were asigned to fix the issue & a ticket was raised with MSFT for a speedy resolution and bring back the server online quickly. Even after 3 pm the issue was NOT resolved until I figured out what really happened.
I was confident that this issue was NOT related to my deployment but something else was happening. In one article (table), we had a unique index on a column but the data quality wasn’t good. Before the deployment there are no rows in the table that violate the index key and after the deployment there is nothing to indicate that the deployment had introduced data that violates the unique key. But we had ” and null value and the remaining millions of records were legitimate values although some legacy data was questionable. After the SP4 upgrade, SQL Server was trying to transform ” and null values to null on the subscriber side and it failed because of the unique key/index violation.
The bad data was removed after getting high level permissions from business group and I got to keep my job for another year and solve many business problems for that company. I wish I could have caught this during the deployment and saved the embarrasment for my manager.
Lesson learned: Test, Test & Test each and every program and component you have before moving with a upgrade.