I had an interesting issue the other day where a customer’s maintenance plan was failing. The maintenance plan had five steps. The first three each executed an Agent Job using the Execute SQL Server Agent Job Task. The last two were a History Cleanup task and a Maintenance Cleanup task. Each step was linked to the next by a connector set to move on once the previous step had completed.
Sometimes one or two of the Agent Job steps would fail, stating that they had been chosen as the deadlock victim. When I looked further into the jobs, they were pretty simple, just executing some backups and restores.
I enabled the deadlock trace flags to capture the relevant info and waited for it to fail again. Once it did, I saw that when a backup or restore tried to update the history tables in MSDB, it would sometimes deadlock with the queries executed by the History Cleanup task. At first this seemed odd: how could multiple steps possibly be executing at the same time?
It turns out that the Execute SQL Server Agent Job Task uses the stored procedure sp_start_job. This stored procedure starts the job and immediately reports whether the job started successfully. It does not wait for the job to finish or report whether it completed successfully. The maintenance plan therefore treated each of the first three steps as done as soon as its job had started, so all three jobs plus the History Cleanup were running at the same time.
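To make the timing concrete, here is a small Python simulation of that fire-and-forget behavior. It does not talk to SQL Server at all: start_job is a hypothetical stand-in for sp_start_job, and the job name and duration are made up for illustration. It only shows how a "start and return immediately" call lets the next step begin while the job is still running.

```python
import threading
import time

events = []  # records the order in which things happen


def start_job(name, duration):
    """Hypothetical stand-in for sp_start_job: launch the job and return at once."""
    def run():
        time.sleep(duration)  # pretend the backup/restore takes a while
        events.append(f"{name} finished")

    worker = threading.Thread(target=run)
    events.append(f"{name} started")  # "the job started successfully"
    worker.start()
    return worker  # returned only so the demo can wait before printing


def run_plan():
    """Simulate two plan steps: an Agent Job step followed by History Cleanup."""
    events.clear()
    worker = start_job("backup job", 0.2)
    # The plan moves on as soon as start_job returns, not when the job ends.
    events.append("history cleanup started")
    worker.join()
    return list(events)


if __name__ == "__main__":
    for event in run_plan():
        print(event)
```

In the simulation the cleanup step starts between the job's start and its finish, which is the same overlap that showed up in the deadlock output.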
Luckily the fix in this case was quite easy. I moved the History Cleanup task into its own maintenance plan and we scheduled it to run at a different time. In other situations you may need to write a loop that checks the status of the job you started, so that you don’t move on until the job has completed.
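A minimal sketch of such a wait loop, written in Python, might look like the following. The function names, the polling interval, and the timeout are all assumptions for illustration. The status check itself is passed in as a callable, so the sketch makes no claim about how you query SQL Server; one option would be to look in msdb.dbo.sysjobactivity for a row for the job that has a start_execution_date but no stop_execution_date.

```python
import time
from typing import Callable


def wait_for_job(
    is_running: Callable[[], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until is_running() returns False.

    Returns True if the job finished, False if the timeout was reached first.
    is_running is a hypothetical callable supplied by the caller.
    """
    deadline = time.monotonic() + timeout_seconds
    while is_running():
        if time.monotonic() >= deadline:
            return False  # gave up waiting; the job may still be running
        sleep(poll_seconds)
    return True


def make_fake_status(checks_until_done: int) -> Callable[[], bool]:
    """Fake status check for the demo: reports 'running' a fixed number of times."""
    remaining = [checks_until_done]

    def is_running() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return True
        return False

    return is_running


if __name__ == "__main__":
    finished = wait_for_job(make_fake_status(3), poll_seconds=0.1)
    print("job finished" if finished else "timed out")
```

The timeout is there so that a hung job cannot block the rest of the plan forever. Note that this only tells you the job has stopped, not whether it succeeded, so you would still need to check the job's outcome separately.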
The fact that the Execute SQL Server Agent Job Task starts its job asynchronously, so that several steps can end up running at once, does not seem to be well documented anywhere. I thought this would be a good reminder for myself and anyone else who runs into similar issues.