Quick removal for DB snapshots

This is probably one of those isolated things that nobody will ever run into, but it’s worth putting out there.  My organization uses CommVault for backups.  Machine, Fileshare, SharePoint, SQL…  if it can be backed up, it goes through CommVault.  One of our on-prem SharePoint servers is a bit older, and was set up… not according to best practice.  When we pointed the CommVault SharePoint backup service at this instance, everything would back up OK in SQL, but the service couldn’t clean up the DB snapshots that it created for the backups.  Over time, this resulted in a very full disk.

CommVault confirmed that there was no reason to keep the snapshots, so we began to delete them.  Through Management Studio, this is a one-by-one process.  When you have hundreds of snapshots, you begin looking for a different way.  Here’s what I came up with:

declare
@sql1 nvarchar(255),
@sql2 nvarchar(255);

--Build one backup-history cleanup statement and one DROP statement per snapshot
declare a cursor for
select
'EXEC msdb.dbo.sp_delete_database_backuphistory @database_name = N'''+name+'''',
'DROP DATABASE ['+name+']'
from sys.databases
where source_database_id is not null; --only snapshots have a source database

open a;
fetch next from a into @sql1, @sql2;

while @@FETCH_STATUS = 0
begin
--Remove the backup history first, then drop the snapshot itself
exec(@sql1);
exec(@sql2);

fetch next from a into @sql1, @sql2;
end

close a;
deallocate a;

There really isn’t any magic to the solution – it’s a pretty simple thing to loop through databases. The trick that caught me for a minute was how to identify only the snapshots. As you can see, this is accomplished by filtering on the source_database_id column in sys.databases. If there is a value in this column, then it is a snapshot; for a normal database, this is null.
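If you want to eyeball the list before dropping anything, the same filter makes for a harmless read-only check:

--Preview the snapshots (and their source databases) before deleting anything
select
name as snapshot_name,
DB_NAME(source_database_id) as source_database,
create_date
from sys.databases
where source_database_id is not null
order by create_date;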

With this script, what was shaping up to take an hour or two of monotonous clicking was reduced to about 60 seconds of wait time. Mission accomplished!

Automated fix for a high VLF Count

Quick recap on VLF’s:  A SQL Server log file is logically subdivided into smaller files called Virtual Log Files, or VLF’s.  Awesome background on the what’s, how’s and why’s of VLF’s can be found on Brent Ozar’s site and on SQLSkills.com in posts by Kimberly Tripp and Paul Randal.  The inspiration for this post is a great post by David Levy, where he provides a simple script to fix your high VLF count.  David’s script is great, but has a couple of gotcha’s that I have attempted to address.  More on that in a bit…

Suffice it to say that (as always) we’re looking for the Goldilocks solution – too many or too few VLF’s cause problems.  With too many, the database is constantly doing maintenance on the VLF’s as it cycles through them.  With too few, the maintenance (when it happens) takes a long time.  The Goldilocks solution is to find the balance between the number of VLF’s and their size.  What constitutes too many or too few VLF’s depends on many factors – physical size, workload, system performance, etc.

So…  you’re here because you’re at least pretty sure that your VLF count is too high.  You’ve hopefully read the info I’ve linked to, and know that to get rid of your excess VLF’s you must shrink your log file as much as possible.  Once the log is shrunk, you then manually grow it to the proper size to achieve a manageable VLF size and count.
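Before touching anything, it’s worth capturing the current count so you have a baseline.  On SQL Server 2016 SP2 and later, the sys.dm_db_log_info DMV returns one row per VLF; on older versions, DBCC LOGINFO does the same, so the row count is your VLF count.

--VLF count for the current database (SQL Server 2016 SP2 and later)
select count(*) as vlf_count
from sys.dm_db_log_info(db_id());

--Older versions: one row per VLF
--dbcc loginfo;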

I’m not gonna lie – doing this is a pain.  24×7 environments don’t like to be brought down or slowed down; this process WILL slow things down at a minimum, and should really be done during a scheduled maintenance outage.  That said, the steps aren’t difficult.  Many DBA’s like to do them manually so they have control over every step.  I have no problem with that, but I also like to automate known processes so I can move on as quickly as possible.  Enter the automation:

Let’s start with David’s script.  It really is great – it shrinks the log file, then re-grows it to your specified size without any issues.  One big gotcha I found is that it does the growth in one chunk.  If your target is a 50GB log file and you grow from 1GB to 50GB in one shot, you’re going to end up with 16 new VLF’s of ~3GB each (plus the remaining VLF’s from your 1GB).  3GB feels a bit big to me, even if you’re running all flash.  I could be wrong, but it doesn’t pass my gut-check.  The problem is only going to get worse as your target size gets bigger.

Of course, the obvious workaround is to run the growth portion of the script multiple times, stepping up the target size each time, right?  That definitely gets around the single-growth issue, but negates the gains we got from automation.

The other issue I saw is really more one of personal preference.  David’s script leaves the database in the full recovery model and only shrinks once.  If you’re currently in one of the last VLF’s, the shrink won’t do you much good.  Again, you’re back to running that portion multiple times by hand.

So – my take on the process…  I’ve added loops to control the shrink and growth processes.  Currently the shrink only goes until you get one shrink operation that doesn’t do anything (old size = new size), but I plan to refine that later.  You can specify the growth rate to control VLF size; that number combined with the max size number allows the growth loop to run until complete.  I set the recovery model to simple to aid in the shrink process, then set it back to full once complete.  I only took that step because this script will only be run during scheduled maintenance.

--Initialize variables provided by user (below)
declare
@dbname sysname, --Database Name
@shrinkdelay char(8), --Delay time between shrink rounds. Format: hh:mm:ss
@shrinksizeMB int, --Target min size in MB to shrink to. Shrink process may not reach this size
@growthsizeMB int, --Size in MB that each growth cycle should use.
@maxsizeMB int; --Max size that the script will grow the log file to.

--Set values for use
select
@dbname = '<database>',
@shrinksizeMB = 50,
@growthsizeMB = 4000,
@maxsizeMB = 50000,
@shrinkdelay = '00:00:05';

--Initialize variables handled by the script
declare
@filename sysname, --Log file name
@oldsizeMB int, --Stores the size after the last shrink. Used for loop checks.
@currsizeMB int, --Stores the current size of the log file
@cmd nvarchar(512); --Used to create and execute dynamic SQL

--Get initial values
select
@filename = name,
@oldsizeMB = 0,
@currsizeMB = (size*8)/1024
from sys.master_files
where DB_NAME(database_id) = @dbname
and type = 1;

print 'BEGIN PROCESS'
print 'Setting Recovery Model to SIMPLE'

--Set DB to simple recovery to prevent log entries
set @cmd = 'ALTER DATABASE ['+@dbname+'] SET RECOVERY SIMPLE;';
exec(@cmd);

--Shrink log file down toward @shrinksizeMB
print 'Beginning shrink of file: ' + @filename + ' from ' + cast(@currsizeMB as nvarchar) + ' MB.'
while((@currsizeMB >= @shrinksizeMB) and (@currsizeMB <> @oldsizeMB))
begin
set @cmd = 'use ['+@dbname+']; dbcc shrinkfile(N'''+@filename+''',0, TRUNCATEONLY);';
exec(@cmd);

set @oldsizeMB = @currsizeMB;

select @currsizeMB = (size*8)/1024
from sys.master_files
where DB_NAME(database_id) = @dbname
and type = 1;

print 'Current Size = ' + cast(@currsizeMB as nvarchar) + ' MB.'
waitfor delay @shrinkdelay;
end

print 'Shrink Process Complete.'

--Set File Growth Size
print 'Setting file AutoGrowth rate.'
set @cmd = 'ALTER DATABASE ['+@dbname+'] MODIFY FILE ( NAME = N'''+@filename+''', FILEGROWTH = '+cast(@growthsizeMB as nvarchar)+' );';
exec(@cmd);

--Set DB back to full recovery model
print 'Setting Recovery Model to FULL.'
set @cmd = 'ALTER DATABASE ['+@dbname+'] SET RECOVERY FULL;';
exec(@cmd);

select @currsizeMB = (size*8)/1024
from sys.master_files
where DB_NAME(database_id) = @dbname
and type = 1;

--Manually Grow File
print 'Beginning Manual Growth of file'
while(@currsizeMB < @maxsizeMB)
begin
--Grow in @growthsizeMB steps, but never past @maxsizeMB
set @cmd = 'ALTER DATABASE ['+@dbname+'] MODIFY FILE ( NAME = N'''+@filename+''', SIZE = '+cast(case when @currsizeMB+@growthsizeMB > @maxsizeMB then @maxsizeMB else @currsizeMB+@growthsizeMB end as nvarchar)+' );';
exec(@cmd);

select @currsizeMB = (size*8)/1024
from sys.master_files
where DB_NAME(database_id) = @dbname
and type = 1;

print 'Current Size = ' + cast(@currsizeMB as nvarchar) + ' MB.'
end

print 'PROCESS COMPLETE'

There’s definitely more work to be done to make this script bullet-proof…  I’m open to suggestions.  I’ll post updates as I make improvements.

Fix non-standard permissions on public role

I recently ran into a situation where a piece of software we were implementing had poorly implemented security.  That is to say – no security.  All permissions on all objects were granted to the public role in the database.  This may be acceptable for a small company without a DBA (not really, but ostrich security never hurt, right? /s); my employer cannot accept that level of risk, so I was tasked with fixing the problem.

At its core, it’s a simple fix, right?  Create a new role, push the permissions to that role, then revoke those permissions from public.  Unfortunately, the database in question has enough objects that doing a manual comparison and migration is impractical.  Time to dig in and write a script to fix it.  Here is what I came up with:

declare
@newrole sysname,
@sql nvarchar(255);

set @newrole = 'newrole'; -- give the new role a name

set @sql = 'create role ' + @newrole + ';';
--exec (@sql); -- uncomment to actually create the new role

--USE THIS SCRIPT TO GENERATE THE GRANT/DENY/REVOKE STATEMENTS TO FIX THE PERMISSIONS ON THE PUBLIC ROLE
with
pub as --get all permissions granted to the public role
(
select object_name(major_id) as majName, p.type, permission_name, state, state_desc, o.type_desc as otypedesc
from sys.database_permissions as p
inner join sys.objects as o on p.major_id = o.object_id
where grantee_principal_id = 0 and class=1 and o.schema_id = 1
--order by object_name(major_id)
),
def (majName, type, permission_name, state, state_desc, otypedesc) as --permissions granted to public by default
(
select 'fn_diagramobjects', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_SCALAR_FUNCTION'
union select 'sp_alterdiagram', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_STORED_PROCEDURE'
union select 'sp_creatediagram', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_STORED_PROCEDURE'
union select 'sp_dropdiagram', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_STORED_PROCEDURE'
union select 'sp_helpdiagramdefinition', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_STORED_PROCEDURE'
union select 'sp_helpdiagrams', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_STORED_PROCEDURE'
union select 'sp_renamediagram', 'EX', 'EXECUTE', 'G', 'GRANT', 'SQL_STORED_PROCEDURE'
),
fin as --extract all non-default permissions from the dataset
(
select * from pub
except
select * from def
)
select
*, --show object data
'revoke '+permission_name+' on '+majname+' to public' as revoke_stmt, --Revoke statements
state_desc+' '+permission_name+' on '+majname+' to ' + @newrole as grant_deny_stmt --Grant/Deny statements
from fin
order by majName;

Fortunately, the default permissions granted to the public role are few and easy to enumerate.  By selecting all permissions existing for public and removing the default permissions via the except operation, we are left with all non-standard permissions for that role.  From there, the final query returns dynamic SQL for the necessary revoke and grant/deny statements.
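As a quick illustration (the table name here is made up), a single SELECT grant to public on a table called Orders would come back as two strings along these lines:

revoke SELECT on Orders to public
GRANT SELECT on Orders to newrole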

There are a couple of holes in this script that I purposely left in.  First, the script stops at generating the scripts rather than automatically processing them.  I may at some point go back and make this a fully automated solution, but felt it was faster to copy/paste, and safer to force at least a cursory review of changes before running.

The second hole is that the script returns the grant/deny/revoke (GDR) scripts based solely upon what is in the security tables.  I didn’t have the time or inclination to do checks for allowed permissions during the migration.  This hole isn’t immediately obvious, but it is important.  At the time my database was created, the developer granted permissions using the ALL operator (since deprecated, but maintained for compatibility).  This meant that all of the functions they had developed had insert/update/delete permissions granted; you cannot perform insert/update/delete operations on scalar functions, and SQL kindly informs you of this when you try to revoke, grant, or deny those permissions directly on a function.  By running the scripts exactly as they are returned by the code above, I was actually cleaning up unnecessary/invalid permissions in the middle of the migration.  Invalid permissions could then be removed by replacing the permission name (select, insert, etc.) with the ALL operator after the valid grant/deny statements were run.
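For those leftover rows, the cleanup looked something like this – the function name is hypothetical:

--REVOKE ALL sweeps up the deprecated blanket grant, including permission rows
--that cannot be revoked individually on a scalar function
revoke ALL on [dbo].[fn_SomeScalarFunction] from public;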

The final step to the process is to move DB users into the new role.  This can be accomplished with the following snippet:

select 'alter role [newrole] add member ['+d.[name]+']'
from sys.database_principals as d
inner join <dbuser table> as t on d.[name] = t.<user id> --remove or comment this line if no users table exists in the database
where d.type = 'S' --SQL users; review the output and skip built-in users such as dbo and guest
order by d.[name]

Once again, I stopped at returning dynamic SQL so I would be forced to review the results before running them.
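Once the generated statements have been run, re-checking public is a quick way to confirm that only the default diagram permissions remain:

--Anything beyond the handful of default diagram permissions should be gone by now
select object_name(major_id) as objname, permission_name, state_desc
from sys.database_permissions
where grantee_principal_id = 0 --public
and class = 1; --object-level permissions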

The combination of these two scripts resulted in a successful migration of permissions to a new role, which allowed us to preserve proper security levels for DB users which are not related to the application, such as service accounts, non-privileged DBA accounts, etc.  This transition was transparent to the end users.

These scripts could also serve as a framework – a few simple modifications would allow you to transfer permissions (all or a subset) between custom roles.

Disallow results from triggers – this isn’t the solution I was looking for

TL;DR Version:

You can set the advanced option “Disallow results from triggers” to 1 to get past the error “Message: A trigger returned a resultset and/or was running with SET NOCOUNT OFF while another outstanding resultset was active.”

Before you do this, be sure that you know the root cause of the error, and the impact of that setting – the problem may not actually be with the trigger.
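For reference, the setting is an advanced option, so enabling it is a two-step sp_configure change:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'disallow results from triggers', 1; --server-wide: this affects every trigger on the instance
RECONFIGURE;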

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My employer’s ERP encrypts all of their stored procedures.  While this is good for them, it means that whenever something must be added to the workflow for say… saving customer records…  we must implement triggers to get the job done.  The ERP also has many popular 3rd party add-ins or modules which are forced into the same methodology.  The end result is that the customer table for instance has 11 (Eleven!!!) triggers on it.  Only one of those is in-house.

Let me just say for the record that as a general rule I really do not like triggers.  Yes, I know they have their place, and are often necessary.  That said, between the fragmented code, increased potential for blocking/deadlocks, etc. I’ve found they’re generally more trouble than they’re worth.

So – on to the issue at hand.  We are currently changing some applications in our stack, and moving to a hosted CRM solution.  As part of that move, we must build integrations for master data between the ERP and CRM systems.  One of the business requirements is that the master data key from CRM must exist in the ERP so that integrations to other systems can be created.

The CRM integration utilizes two triggers on the master data tables (one each for inserts and updates) which write to a queue table; the integration then reads from that table to pull master data into CRM and do its thing.  From there, the integration then pulls the ID from CRM and pushes it back to the ERP.

I’m sure all you DBA’s out there have spotted the first problem with this sequence – updates coming from CRM will fire the triggers, placing the record back in the queue; this obviously defeats the purpose of the queue, as nothing would ever fall out of it.  To get around this, we added code to the triggers to check the login making the update – if the update comes from the CRM service account, the trigger returns without executing any additional code.  Problem 1 solved.
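For what it’s worth, the login check boils down to a couple of lines at the top of each CRM trigger.  The sketch below shows the idea – the trigger, table, queue table, and service account names are all made up for illustration:

create trigger dbo.trg_Customer_CRM_Update
on dbo.Customer
after update
as
begin
set nocount on;

--Skip queueing when the write comes from the CRM integration itself;
--otherwise every update coming back from CRM would land in the queue again
if original_login() = N'DOMAIN\svc_crm_integration'
return;

insert into dbo.CRM_Queue (CustomerID)
select CustomerID from inserted;
end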

With the aforementioned loop handled, we proceeded to test the integration.  First with one record – success; then with multiple records – FAILURE!  I’m guessing this is why you’ve hit this entry – the error was as follows:
"Message: A trigger returned a resultset and/or was running with SET NOCOUNT OFF while another outstanding resultset was active."
Well crap.

I’ll freely admit that this was the first time I had encountered this error.  After a few minutes of googling, I had confirmed what made intuitive sense to me, but I had never really needed to think about: triggers that return results are bad (I’m sure there will be some that disagree with that blanket statement, but I’ll stick by it).  I also learned about the advanced setting “Disallow Results from Triggers”, and how it’s set to 0 by default (allowing results) despite best practice being to not allow results from triggers…  makes perfect sense.

In all of the searching I did, it seemed like the problem was a new trigger on a table that suddenly caused the error; there were only two solutions suggested in all the sites I looked at: either fix the trigger (set nocount on, remove any select statements returning results), or set that advanced setting to 1, which affects all triggers on the server.

Armed with those bits of knowledge, I began looking at the code on my server…  and found that all of the involved triggers followed best practice.  All had nocount set to on, and none returned results via a select.

Well that was unexpected.  I had expected that at least one of the triggers would have the decency to have bad code and give me an easy fix.  Not so.  Time to dig deeper…

The next step in troubleshooting was to disable all but one of the triggers.  Because this issue cropped up during the CRM integration, we disabled the CRM triggers to prove that they weren’t the issue.  Sure enough, success with one row, failure with multiple – even with only one trigger.  Things were getting weird.
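If you need to do the same kind of isolation testing, the syntax is a one-liner each way (same hypothetical names as the sketch above):

disable trigger dbo.trg_Customer_CRM_Update on dbo.Customer;
--run the test, then re-enable
enable trigger dbo.trg_Customer_CRM_Update on dbo.Customer;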

The next step was to do the obvious – set the “disallow results from triggers” option to 1.  As expected, everything worked, even with all triggers enabled and multiple rows pushed through the integration.  Success, right?  Not quite.

You see, we have a few other integrations that have been in place for a long time which would do mass updates on records with those triggers in place, and those NEVER FAILED.  If that setting were the solution, we should have expected to see that error long before now.  Further, our ERP has… lots…  of tables, and upwards of 10 triggers on a given table isn’t uncommon (did I mention how much I hate this model?).  So not only do we have LOTS of triggers which have never failed due to multiple resultsets, but this is spread over close to 25 databases, because each legal entity is contained in its own database for the ERP.  Because the setting really is a server-wide setting, I would have to go through all of the code for all of the triggers in all of the databases to ensure that it would not break something.  NOT GOING TO HAPPEN.

Where to look next?  Well, considering that everything was functional before the CRM integration, and that the CRM triggers follow best practice and had been proven not to be the issue, the only logical place left to look was the code for the integration itself.

The integration is built using a hosted 3rd party tool.  I set up another meeting with the vendor, and we dove in.  On my side, I set up a trace to watch activity coming from the integration.  On their side, they first ran the integration for one record.  Once again success.  In the trace, I noted that the integration ran a query to pull the record to be integrated, then ran another to push the update coming back from CRM.  I was using the default trace, so these showed up as:
Batch Starting: select top 1 * from <table>
Batch Completed: select top 1 * from <table>
Batch Starting: update <table> set...
Batch Completed: update <table> set...

Next was the run for multiple records.  This time the trace looked like this:

Batch Starting: select top 1 * from <table>
Batch Starting: update <table> set...
Batch Completed: update <table> set...
Batch Starting: update <table> set...
Batch Completed: update <table> set...

See the difference?  When running for one record, the integration closes its batch before proceeding to update the record.  When running for multiple, that initial resultset is held open and processed RBAR (row by agonizing row).  Remember our error?
"Message: A trigger returned a resultset and/or was running with SET NOCOUNT OFF while another outstanding resultset was active."
This error is complaining because there is an open resultset that prevents the triggers from executing.  Why?  We could get technical and talk about locking, transaction isolation, etc., but that shouldn’t be necessary.  Suffice it to say that the triggers can’t execute with an open resultset.

The end result in this situation was that the integration was re-written.  We tried a few different ways to force that initial query batch to terminate before processing the rows, but limitations in the development tool prevented this from working.

I’d wager that the majority of the time you see this error it is due to a problem with the triggers.  When it’s not, you always have that big hammer of changing server settings.  However, as my experience shows, it is a good idea to dig a little bit to determine if the error actually is in SQL before you do anything too drastic.