Auditing changes in Merge Replication

One trick I have learned from the folks on the Replication Support Team is how to proactively audit data changes in Merge Replication.  This is useful when troubleshooting issues such as identifying where data changes are coming from, who is causing them, and what the data is before and after the change.  I have found this valuable information to have on a few occasions and thought I would share.

Keep in mind this should only be used for troubleshooting purposes and should always be tested in pre-production prior to deploying.  Make sure it works first!

Auditing data changes for a Merge article can be done by creating insert, update, and delete triggers to capture data changes and record them into an audit table.  The audit table rows consist of GETDATE(), APP_NAME(), HOST_NAME(), SUSER_NAME(), column data, action type, command, and spid for each insert, update, and delete that occurs on the article to audit.  You will have to modify the script to adjust the name of the table being audited and the relevant columns that you think should be included in the audit data.  Usually just the primary key columns are enough, but other columns can be included as well.

Audit script

/************************************************************************
This is a script to implement audit triggers of insert, update, delete on 
a base table. It is based on a generic scenario for a table with four 
columns col1, col2, col3, col4.
 
It will insert into a table called source_audit.
 
Run this script on the database you would like to audit.
 
Test first to be sure it is working as expected!
 
After the problem occurs, export the contents of the source_audit table.
************************************************************************/
 
USE DB_to_audit
GO
 
-- Drop audit table if exists
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[source_audit]') AND type in (N'U'))
DROP TABLE [dbo].[source_audit]
 
-- Create audit table
CREATE TABLE [dbo].[source_audit]
    ([tstamp] datetime NULL, 
     [ProgramName] nvarchar(128) NULL, 
     [hostname] nvarchar(128) NULL, 
     [suser] nvarchar(128) NULL,
     [col1] int NULL,
     [col2] nchar(10) NULL,
     [col3] datetime NULL,
     [col4] binary NULL,
     [actiontype] char(2) NULL,
     [inputbuffer] nvarchar(255) NULL,
     [spid] int NULL)
GO
 
-------------------------------------------
-- INSERT trigger
-------------------------------------------
 
-- Delete trigger if exists
IF  EXISTS (SELECT * FROM sys.triggers WHERE object_id = OBJECT_ID(N'[dbo].[audit_source_INS]'))
DROP TRIGGER [dbo].[audit_source_INS]
GO
 
-- Create INSERT trigger
CREATE TRIGGER audit_source_INS
on Table_1
FOR INSERT
AS
DECLARE @command NVARCHAR(255)
 
if 0 = (select count(*) from inserted) return
 
CREATE TABLE #InputBuffer ([eventtype] NVARCHAR(30), [parameters] INT, [eventinfo] NVARCHAR(255))
INSERT INTO #InputBuffer exec('DBCC INPUTBUFFER(@@spid) WITH NO_INFOMSGS')
SELECT @command=eventinfo from #InputBuffer
 
INSERT INTO source_audit
SELECT getdate(), app_name(), host_name(), suser_name(), col1, col2, col3, col4, 'I', @command, @@SPID
FROM inserted
GO
 
-------------------------------------------
-- UPDATE trigger
-------------------------------------------
 
-- Delete trigger if exists
IF  EXISTS (SELECT * FROM sys.triggers WHERE object_id = OBJECT_ID(N'[dbo].[audit_source_UPD]'))
DROP TRIGGER [dbo].[audit_source_UPD]
GO
 
-- Create UPDATE trigger
CREATE TRIGGER audit_source_UPD
on Table_1
FOR UPDATE
AS
DECLARE @command NVARCHAR(255)
 
if 0 = (select count(*) from inserted) return
 
CREATE TABLE #InputBuffer ([eventtype] NVARCHAR(30), [parameters] INT, [eventinfo] NVARCHAR(255))
INSERT INTO #InputBuffer exec('DBCC INPUTBUFFER(@@spid) WITH NO_INFOMSGS')
SELECT @command=eventinfo from #InputBuffer
 
INSERT INTO source_audit
SELECT getdate(), app_name(), host_name(), suser_name(), col1, col2, col3, col4, 'UD', @command, @@SPID
FROM deleted
INSERT INTO source_audit
SELECT getdate(), app_name(), host_name(), suser_name(), col1, col2, col3, col4, 'UI', @command, @@SPID
FROM inserted
GO
 
-------------------------------------------
-- DELETE trigger
-------------------------------------------
 
-- Delete trigger if exists
IF  EXISTS (SELECT * FROM sys.triggers WHERE object_id = OBJECT_ID(N'[dbo].[audit_source_DEL]'))
DROP TRIGGER [dbo].[audit_source_DEL]
GO
 
-- Create DELETE trigger
CREATE TRIGGER audit_source_DEL
on Table_1
FOR DELETE
AS
DECLARE @command NVARCHAR(255)
 
if 0 = (select count(*) from deleted) return
 
CREATE TABLE #InputBuffer ([eventtype] NVARCHAR(30), [parameters] INT, [eventinfo] NVARCHAR(255))
INSERT INTO #InputBuffer exec('DBCC INPUTBUFFER(@@spid) WITH NO_INFOMSGS')
SELECT @command=eventinfo from #InputBuffer
 
INSERT INTO source_audit
SELECT getdate(), app_name(), host_name(), suser_name(), col1, col2, col3, col4, 'D', @command, @@SPID
FROM deleted
GO

Examining the results

Once the audit table and triggers are in place we can begin collecting audit data.  After reproducing the problem, the audit data can be exported from the audit table to be queried later, or it can be queried directly.  Here is a sample audit of a Merge publisher and the audit data after an update and sync from subscriber WS2008R2_1.

SELECT tstamp, ProgramName, hostname, suser,
col1, actiontype, inputbuffer, spid
FROM dbo.source_audit
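
To home in on changes that arrived via synchronization rather than from local applications, the audit can be filtered on ProgramName.  This is a minimal sketch, and the LIKE pattern is an assumption; inspect your own audit rows first to see what APP_NAME() actually reports for your merge agent sessions:

-- Hypothetical filter: adjust the pattern to match the APP_NAME()
-- your merge agent sessions actually report in the audit rows
SELECT tstamp, ProgramName, hostname, suser,
col1, actiontype, inputbuffer, spid
FROM dbo.source_audit
WHERE ProgramName LIKE N'%Merge%'
ORDER BY tstamp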

Using this approach, we can identify where data changes are coming from, who is causing them, and what the data is before and after the change.  This can be very useful information to have, especially when troubleshooting conflicts and determining where the conflicting changes are originating — but I will save that for a future post.  If you would like help implementing an auditing scheme in your Merge topology, feel free to drop me a line or leave a comment below.

-Brandon Williams (blog | linkedin | twitter)

Arithmetic Overflow in sp_MSarticle_validation

This is probably a rare occurrence — but I've noticed that data validation for an article with 10 billion or more rows will fail due to an arithmetic overflow error in sp_MSarticle_validation.

Data validation for a specific article in a transactional publication is done using sp_article_validation which in turn calls sp_MSarticle_validation.  In the definition of sp_MSarticle_validation, a local variable named @actual_rowcount is defined which is of the type bigint.  Later on in sp_MSarticle_validation, a command is built to execute sp_table_validation and the @expected_rowcount parameter passed in is derived by converting @actual_rowcount to a varchar(10).

The offending statement can be observed in the following code snippet from sp_MSarticle_validation:

select @command = 'exec dbo.sp_table_validation @table = ''' + replace(@destination_table, '''', '''''')  + ''', @expected_rowcount = ' +
	        convert(varchar(10), @actual_rowcount) + ', @expected_checksum = ' +
	        convert(varchar(100), @actual_checksum) + ', @rowcount_only = ' + convert(varchar(5),@rowcount_only) +
	        ', @full_or_fast = ' + convert(varchar(10), @full_or_fast) +
	        ', @shutdown_agent = ' + convert(varchar(10), @shutdown_agent)

Considering @actual_rowcount is a bigint, it should be converted to a varchar(19), rather than a varchar(10).  This is where an arithmetic overflow error occurs when validating an article that has 10 billion or more rows, causing validation to fail.
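
The overflow is easy to reproduce in isolation.  A minimal sketch, using a hypothetical row count of 10 billion:

DECLARE @actual_rowcount bigint
SET @actual_rowcount = 10000000000  -- 11 digits

-- Fails: Arithmetic overflow error converting expression to data type varchar
SELECT CONVERT(varchar(10), @actual_rowcount)

-- Succeeds: a bigint needs at most 19 digits
SELECT CONVERT(varchar(19), @actual_rowcount)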

If you find yourself needing to validate an article with 10 billion or more rows, please vote this item as important to expedite a fix.

Creating merge replication conflict alerts

One of the challenges in working with Merge Replication is handling conflicts since writes can occur at any node in a topology. In an earlier post I demonstrated how to handle conflicts using the business logic handler framework - now I would like to show how to create Merge Replication conflict alerts based on the Performance Monitor counter SQLServer:Replication Merge Conflicts/sec.

Ideally proper planning would be done in an application to minimize the chances of conflicts occurring, whether that is achieved through column-level tracking, partitioning the writes, filtering, or some combination.  However, sometimes application changes can introduce conflicts, and having alerts in place is a smart precaution to take.

To set up a conflict alert, first identify the instance name of the Performance Monitor counter SQLServer:Replication Merge Conflicts/sec to monitor.  In the Add Counters dialog in Performance Monitor, the instance name can be identified for the Merge conflicts counter for a publication:

Add Counters Dialog

For this example, the instance name that I will monitor is WS2008R2_1-AdventureWorksLT-TestMergePub1-WS2008R2_1-49.
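
If you prefer T-SQL, the available instance names can also be enumerated with the sys.dm_os_performance_counters DMV.  A sketch (the object_name prefix varies with the SQL Server instance name, and names in this DMV are padded, hence the LIKE patterns):

-- List the merge conflict counter instances and their current values
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Replication Merge%'
AND counter_name LIKE 'Conflicts/sec%'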

From here sp_add_alert can be used to specify a merge conflict performance condition using our instance name to alert us when a conflict arises.  This can be done with the following bit of T-SQL:

USE msdb
GO
 
EXEC sp_add_alert @name=N'Merge Conflict Alert',
        @message_id=0,
        @severity=0,
        @enabled=1,
        @delay_between_responses=0,
        @include_event_description_in=0,
        @category_name=N'[Uncategorized]',
        @performance_condition=N'SQLServer:Replication Merge|Conflicts/sec|WS2008R2_1-AdventureWorksLT-TestMergePub1-WS2008R2_1-49|>|0'

The performance condition can also be modified to raise alerts only when a certain threshold is met, to suit your needs.  An alert response can be set to send an email when the conflict performance condition is met, which gives a much-needed heads-up when things start to go awry (see the sketch below).  I hope this provides a technique to detect precisely when conflicts occur and a head start on tracking down the culprit.
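
To wire up the email response, an operator can be attached to the alert with sp_add_notification.  A minimal sketch, assuming Database Mail is configured; the operator name 'DBA Team' is a placeholder:

USE msdb
GO

-- Hypothetical operator; create one first with sp_add_operator if needed
EXEC sp_add_notification
        @alert_name = N'Merge Conflict Alert',
        @operator_name = N'DBA Team',
        @notification_method = 1  -- 1 = email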

Error when creating a subscription

When creating a subscription in SQL Server sometimes the following error will occur:

The remote server "%s" does not exist, or has not been designated as a valid Publisher, or you may not have permission to see available Publishers.

The error is pretty straightforward and indicates that the replication agent login used to connect to the Publisher is not a member of the publication access list (PAL).  That would be the Distribution Agent process account for Transactional Replication and the Merge Agent process account for Merge Replication.  Add the agent process account to the PAL and try to create the subscription again.

Publication Access List
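
Membership in the PAL can be granted in the Publication Properties dialog shown above, or with sp_grant_publication_access, run at the Publisher on the publication database.  A sketch; the publication and login names are placeholders:

-- Add the agent process account to the publication access list
EXEC sp_grant_publication_access
        @publication = N'TestMergePub1',
        @login = N'MYDOMAIN\repl_agent'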

Undocumented Gotcha in Merge Replication Tracking

In the spirit of Robert Davis's (blog | @SQLSoldier) blog post: Undocumented Gotchas of Transactional Replication, I decided to contribute an undocumented gotcha in Merge Replication to his list.  First, let me start off by saying that if you happen to have a difficult replication scenario that you're looking for advice on, Robert Davis is an excellent resource.  He was kind enough to let me pick his brain at SQL Saturday #92 in Portland, Oregon about a topic that would send most people running for the door.  Not only did he offer a solution to my problem, he offered three different solutions, which was very nice of him.

Gotcha

One gotcha to watch out for in Merge Replication is changing an existing article's tracking type from row-level to column-level tracking when using Server subscription types.  When an article is added to a Merge publication, by default it is added using row-level tracking.  Sometimes, in an effort to reduce conflicts, one might decide to change an article's tracking type from row-level to column-level tracking.  However, after doing so, one might find that even though column-level tracking has been enabled, conflicts still occur when updates are made to different columns of the same row at different nodes in the topology.  Further complicating the issue, the article tracking property will actually indicate that column-level tracking is enabled when checked via the Article Properties dialog or the stored procedure sp_helpmergearticle.  It can be a little confusing.

Example

For example, let's say we have a Merge publication P1 and subscriptions S1 and S2, which are Server subscription types.  If P1 has an article Customer that is set to row-level tracking, and we then change it to column-level tracking using the Article Properties dialog or the stored procedure sp_changemergearticle and reinitialize subscriptions, the following updates will still generate a conflict after syncing both subscriptions:

S1:

UPDATE Customer
SET FirstName = 'Robert'
WHERE CustomerID = 10

Sync

S2:

UPDATE Customer
SET LastName = 'Davis'
WHERE CustomerID = 10

Sync

Solution

The proper way to handle this is to set the article tracking property to column-level from the very start.  Adding articles and setting column-level tracking using the Publication Wizard or sp_addmergearticle when creating the publication will yield the correct results, avoiding conflicts when updates are made to different columns for the same row.  Unfortunately, if the article has already been added to the publication, it will need to be dropped and added back to the publication with column-level tracking enabled.
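
For reference, this is a sketch of adding the article with column-level tracking enabled from the outset via sp_addmergearticle, run at the Publisher on the publication database.  The names are from the example above, and other parameters are left at their defaults:

-- Add the Customer article with column-level tracking from the start
EXEC sp_addmergearticle
        @publication = N'P1',
        @article = N'Customer',
        @source_owner = N'dbo',
        @source_object = N'Customer',
        @column_tracking = N'true'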

Counting Merge Replication Article Changes

One of the cool things about Merge Replication is that it can reveal very interesting information about applications and the statements they run against published articles.  We can learn quite a bit about an application just by the sheer number of data changes that are replicated as a result of the statements it executes.  This can reveal not only which tables an application changes the most, but may even be useful in identifying flaws in application statements when the amount of data changes being replicated is more than anticipated.

Merge Replication tracks data changes through triggers and system tables in the publication and subscription databases.  The replication system tables contain metadata that indicates which data changes should be propagated.  One of the Merge Replication system tables, MSmerge_genhistory, contains one row for each generation (a collection of changes) that is delivered to a Publisher or Subscriber.  MSmerge_genhistory contains a column changecount, which can be used to determine which table articles contain the most data changes.  So to get an idea of which articles contain the most data changes within the retention period, we can do something like this:

USE MyPublicationDB
GO
 
SELECT
name AS [Object Name],
SUM(changecount) as [Change Count]
FROM dbo.MSmerge_genhistory WITH (NOLOCK)
INNER JOIN dbo.sysmergearticles  WITH (NOLOCK) ON art_nick = nickname
WHERE genstatus = 1 OR genstatus = 2
GROUP BY name
ORDER BY [Change Count] DESC

In this scenario, the articles in my test Merge topology that have changed within my retention period are Customer, SalesOrderDetail, Address, Product, and SalesOrderHeader, which have had 847, 542, 450, 295, and 32 data changes respectively.  If you find yourself wondering which articles in your Merge topology are exhibiting the most data changes, the changecount column in MSmerge_genhistory is what you're looking for.

Merge Article Count PivotTable And Chart

The process could not read file due to OS error 3

When configuring a subscription to a publication in SQL Server, sometimes the agent process cannot read from the snapshot:

The process could not read file X:\Program Files\Microsoft SQL Server\<instance>\MSSQL\repldata\<snapshot>\<snapshot_file> due to OS error 3. (Source: MSSQL_REPL, Error number: MSSQL_REPL20016)
Get help: http://help/MSSQL_REPL20016
The system cannot find the path specified.
(Source: MSSQL_REPL, Error number: MSSQL_REPL3)
Get help: http://help/MSSQL_REPL3

This error can be puzzling at first, especially when the path does in fact exist and the replication agent process accounts have sufficient privileges to access the snapshot folder.

The error occurs because, when using pull subscriptions and/or a remote Distributor, you must specify a UNC network share such as \\<computername>\snapshot rather than a local path for the snapshot folder.  To alleviate this issue, make the snapshot folder a UNC network share and update the Distributor Properties to reflect the change.
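
If you prefer to make the change with T-SQL, sp_changedistpublisher can update the Publisher's working directory at the Distributor.  A sketch, run at the Distributor; the server and share names are placeholders, and I believe the working_directory property is what backs the default snapshot folder shown in Distributor Properties:

-- Point the Publisher's snapshot folder at a UNC network share
EXEC sp_changedistpublisher
        @publisher = N'WS2008R2_1',
        @property = N'working_directory',
        @value = N'\\WS2008R2_1\repldata'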

More Information

Securing the Snapshot Folder

Replication Agent Logging

To troubleshoot replication agent failures, replication agent profiles can be configured to output verbose agent logs to a text file.  All replication agents accept the parameters -Output output_path_and_file_name and -OutputVerboseLevel [0|1|2], which, when specified, write the agent log to the file output_path_and_file_name with varying degrees of information depending on the verbose level passed.  In general, the information contained in the verbose agent logs consists of progress report messages, error messages, procedure calls, and various session statistics.

Depending on the verbose level and the agent, some or all of this information will be provided.  For the purpose of troubleshooting replication agent failures, a verbose level of 2 should be specified, which prints all error and progress report messages.  It might also be worth mentioning that some agents can be passed undocumented output verbose levels 3 and 4 which provide even more information.  The information provided in the verbose agent logs can be useful when trying to diagnose and identify exactly where replication is failing as well as get detailed information as to the root cause of the failure.  In addition, the verbose agent logs are likely the first piece of diagnostic data that Microsoft Support will ask for when diagnosing a replication support incident.

Reading the logs takes a little getting used to, but after using them to troubleshoot a few times, it becomes second nature.  I often use them when replication failures are occurring and the error given in Replication Monitor and the View Synchronization Status window isn’t sufficient to diagnose the issue.  An example error message with Merge Replication that occasionally crops up is: The process was successfully stopped. (Source: MSSQL_REPL, Error number: MSSQL_REPL-2147200963), which isn’t very verbose.  Knowing how to output the verbose agent logs is a smart skill to have under these circumstances in order to identify the root cause.

The parameters to output the verbose agent logs can be passed to the Run Agent job step in the Job Properties dialog, Agent Profile dialog in Replication Monitor, via RMO, or by way of invoking the agents from the command line.  After adding the parameters, the replication agent must be restarted for the profile changes to take effect.  Once restarted, the replication agent will begin outputting the verbose log to the path and file name specified.  After reproducing the failure, the log can then be examined for further information.
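
For example, the parameters appended to the end of a Merge Agent's Run Agent job step command might look like the following sketch; the server, database, and path values are placeholders:

-Publisher [WS2008R2_1] -PublisherDB [AdventureWorksLT] -Publication [TestMergePub1]
-Subscriber [WS2008R2_2] -SubscriberDB [AdventureWorksLT] -Distributor [WS2008R2_1]
-Output "C:\temp\mergeagent.log" -OutputVerboseLevel 2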

The verbose agent log is written as a text document and can be opened with your favorite text editor.  The -OutputVerboseLevel specified for the agent profile will dictate the contents of the log.  Typically, if a replication agent failure occurs, the process will stop, which means the failure is likely to be found at the end of the log, in the last 20 or so lines of the output file.

Be sure to remove the agent parameters after reproducing the failure and collecting the log.  Not doing so will result in the log being appended to on subsequent synchronizations, consuming disk space.  Outputting agent logs can also impact replication performance, so it isn’t recommended unless you are troubleshooting replication failures.

The Proxy Auto-configuration URL was not found

When configuring Merge Replication for Web Synchronization, it's common to encounter the error The Proxy Auto-configuration URL was not found and for replication to fail on synchronization.  This error occurs when the certificate is not installed properly and/or Internet Explorer proxy settings are set.

To alleviate this issue, verify the certificate is installed properly at the subscriber(s) for the merge agent process account.  Then, open Internet Explorer, go to Internet Options -> Connections tab -> LAN settings, and uncheck Automatically detect settings.

Accidentally Dropped the Distribution Database

Have you made the mistake of dropping your distribution database like this?

DropDistributionCapture

If so, remember not to do this ever again.  Disabling publishing and distribution involves more than just dropping the distribution database.  The distribution database in SQL Server requires special attention and dropping it in this fashion can wreak havoc on your distribution server.  Doing so will cause existing publications to error on synchronization and further attempts to disable publishing and distribution will fail.  This really puts a DBA between a rock and a hard place.  Setting the distribution database offline and dropping it like this is a great example of how to hose your distributor.

DistributionDoesNotExistCapture

CouldNotDisablePublishingCapture

Whatever you do, don’t panic.  Dropping the distribution database this way isn’t good, but we still have the distribution data and log files located in the folder we specified when configuring distribution.  The reason these files are still around is that when a database is offline and is dropped, the disk files are not deleted.  Knowing this, all we need to do is attach the distribution data and log files to recreate the distribution database.  This can be done via SSMS or with the following bit of T-SQL.

AttachDistributionTSQLCapture
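
In case the screenshot is hard to read, the attach boils down to something like this sketch; the file paths are placeholders and should be the paths chosen when distribution was configured:

USE master
GO

-- Reattach the orphaned distribution data and log files
CREATE DATABASE distribution
    ON (FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL\Data\distribution.MDF')
    LOG ON (FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL\Data\distribution.LDF')
FOR ATTACH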

If you encounter access denied errors when attaching the distribution data and log files, grant appropriate permissions to the distribution.mdf and distribution.ldf files on the Security tab in the file properties dialog.  This will circumvent the access denied errors.

DistributionFilePropertiesCapture

After attaching the distribution data and log files, the distribution database should now be in a workable state.  Existing publications should resume as normal and if you choose to do so, publishing and distribution can now be disabled correctly.

Disabling Publishing and Distribution

To avoid this problem altogether... use replication stored procedures, RMO (Replication Management Objects), or the Disable Publishing and Distribution Wizard to disable publishing and distribution.  The MSDN Library article How to: Disable Publishing and Distribution (Replication Transact-SQL Programming) shows that this can be done with T-SQL, which is nice.

I encourage you to become familiar with these procedures and know what they do.  Using the Wizard is convenient but it’s important to realize what happens when publishing and distribution is disabled.  Not only does the distribution database need to be dropped, but replication objects need to be removed as well.  Publishers have to be dropped from the Distribution server and the Distribution server needs to be uninstalled.

DisablePublishing&DistributionViaTSQLCapture
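
The T-SQL in the screenshot amounts to something like the following sketch, run at the Distributor; the Publisher name is a placeholder:

USE master
GO

-- Drop the Publisher from the Distributor
EXEC sp_dropdistpublisher @publisher = N'WS2008R2_1'

-- Drop the distribution database
EXEC sp_dropdistributiondb @database = N'distribution'

-- Uninstall the Distributor
EXEC sp_dropdistributor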

To disable publishing and distribution using the Wizard, right-click the Replication node in Object Explorer and click ‘Disable Publishing and Distribution…’

DisablePublishing&DistributionWizardCapture

I hope this sheds some light on how to disable publishing and distribution in SQL Server.  It’s wise to do this through the use of replication stored procedures, RMO, or the Disable Publishing and Distribution Wizard.  I’ve come across a couple of threads in the SQL Server Replication Forum with people getting stuck after accidentally dropping the distribution database.  If you find yourself in this situation, don’t do anything drastic.  Re-installing SQL Server is not the answer.  Just remain calm, attach the distribution data and log files, and all will be fine.
