Monday, March 19, 2012

Merge replication performance dropping. Opinions?

I've been tasked with improving our company's merge replication
performance. We are using SQL Server merge replication, and lately we have
been seeing a decline in performance: more errors, more retries, higher
latency, and even missing data. I hope someone can provide insight or
share experiences with replicating at this scale.
Our configuration is as follows:
- One master server replicates to three distribution servers. These
servers are on the same LAN. The replication jobs are scheduled to run
every 10 minutes.
- Each of the three distribution servers replicates to ~15 servers (45
subscribers in total). These subscribers are on connections ranging from
64 Kbps to 256 Kbps. The replication jobs are scheduled to run once an
hour (at staggered times). The subscriptions are push subscriptions
running at the distributor and are filtered per location (a rough sketch
of such a filter follows this list).
- The publication has 200+ articles, although data changes frequently in
only about 20 tables.
- The data being replicated is primarily either inserts at the
subscribers, or common data inserted/updated at the master. We have 18
tables with more than 1M rows and 3 tables with more than 10M rows.
Because of row filtering, not all of these rows are replicated to each
subscriber.
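For reference, here is a rough sketch of what one of our per-location
filters might look like; the publication, table, and column names here are
made up for illustration, and the exact clause depends on how locations are
identified in the schema:

    -- Hypothetical example: filter one article so each subscriber only
    -- receives rows for its own location. HOST_NAME() is one common way
    -- to drive a dynamic filter (assuming the merge agent's -Hostname
    -- parameter is overridden with the location id, and the publication
    -- was created with dynamic filtering enabled); a static per-
    -- subscription filter clause would work as well.
    EXEC sp_addmergearticle
        @publication = N'OurPublication',   -- placeholder name
        @article = N'Orders',               -- placeholder table
        @source_object = N'Orders',
        @subset_filterclause = N'LocationID = CONVERT(int, HOST_NAME())';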
The errors:
- No errors between master and distributors, but of course they are on
the LAN.
- "The process could not check the existence of generation at the
Subscriber"
- "The process could not connect to Subscriber xxx"
- "The process could not enumerate changes at the Subscriber"
- "The merge process could not apply the replication metadata" (This
error was returned on a "Completed" task)
- "The process could not query row data at the Subscriber"
Other notes/questions:
- I've tried changing all the subscriber profiles to Slow Link, without
much improvement.
- The replication tasks are launched with QueryTimeout=30000. I believe
this to be excessive, but can't be sure.
- Task durations have run as long as 8 hours.
- Since most of the errors are accompanied by "General network error", I
can accept that our WAN sucks. How can I mitigate this risk?
- Querying the MSMerge tables, I am seeing what I think are outrageous
row counts: 10M rows in MSMerge_contents at the master, 2M rows in
MSMerge_tombstone at the master, and 8M rows in MSMerge_errorlineage at
the distributor. Does this indicate anything seriously wrong? (The query
I've been using is sketched after this list.)
- Database indexes are defragmented once a week.
- I have just modified some triggers to reduce the amount of data being
replicated; a few days will show if it was helpful.
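For reference, this is roughly how I've been checking those metadata counts
(run in the publication database at the master, and in the subscription
databases); the cleanup procedures in the comments are only possibilities
and depend on the SQL Server version and service pack in use:

    -- Rough size check of merge replication metadata.
    SELECT COUNT(*) AS contents_rows   FROM dbo.MSmerge_contents;
    SELECT COUNT(*) AS tombstone_rows  FROM dbo.MSmerge_tombstone;
    SELECT COUNT(*) AS genhistory_rows FROM dbo.MSmerge_genhistory;

    -- If retention is lowered, metadata older than the retention period
    -- can be cleaned up. On SQL Server 2000 (SP1+) this is
    -- sp_mergecleanupmetadata; on SQL Server 2005 it is
    -- sp_mergemetadataretentioncleanup. Verify against Books Online for
    -- the version in use before running either.
    -- EXEC sp_mergemetadataretentioncleanup;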
I appreciate any feedback, comments, or suggestions. Thank you!
If your subscribers connect frequently you should try to drop the retention
period to something small, like 4 days.
Also, you might want to evaluate how you have created your publication to
see if there are any optimizations you can make. For example, do you need
filters? Is keep partition changes set to true?
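Something along these lines, run at the publisher on the publication
database, should lower the retention period; the publication name is just a
placeholder, and it's worth confirming in Books Online whether the change
has any side effects on existing subscriptions in your version:

    -- Lower the merge publication retention period to 4 days.
    EXEC sp_changemergepublication
        @publication = N'OurPublication',   -- placeholder name
        @property    = N'retention',
        @value       = N'4';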
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Anachostic" <anachostic@.remove.700cb.net> wrote in message
news:Xns9950690A3F584anachostic@.66.250.146.128...
> I've been tasked with improving our company's merge replication
> performance. We are using SQL merge replication and lately we are seeing
> a decrease in the performance. Specifically, we are seeing more errors,
> retries, latency, and even missing data. I hope someone can provide
> insight or share experiences on replicating at this scale.
> Our configuration is such:
> - One master server replicates to three distribution servers. These
> servers are on the same LAN. The replication jobs are scheduled to run
> every 10 minutes.
> - Each of the three distribution servers replicates to ~15 servers (total
> of 45 subscribers). These subscribers are on connections ranging from
> 64k to 256k. The replication jobs are scheduled to run once an hour
> (staggered times). The subscription is a push subscription running at
> the distributor. The subscriptions are filtered per location.
> - The publication has 200+ articles, although data changes frequently in
> only about 20 tables.
> - The data being replicated is primarily either inserts at the
> subscribers, or common data inserted/updated at the master. We have 18
> tables >1M rows and 3 tables >10M rows. Because of row filtering, not
> all these rows get replicated to each subscriber.
> The errors:
> - No errors between master and distributors, but of course they are on
> the LAN.
> - "The process could not check the existence of generation at the
> Subscriber"
> - "The process could not connect to Subscriber xxx"
> - "The process could not enumerate changes at the Subscriber"
> - "The merge process could not apply the replication metadata" (This
> error was returned on a "Completed" task)
> - "The process could not query row data at the Subscriber"
> Other notes/questions:
> - I've tried changing all the subscriber profiles to Slow Link, with not
> much improvement.
> - The replication taks are launched with QueryTimeout=30000. I believe
> this to be excessive, but can't be sure.
> - Task durations have run as long as 8 hours.
> - Since most of the errors are accompanied by "General network error", I
> can accept that our WAN sucks. How can I mitigate this risk?
> - Querying the MSMerge tables, I am seeing what I think to be outrageous
> row counts: 10M rows in MSMerge_contents at the master, 2M rows in
> MSMerge_tombstone at the master, 8M rows in MSMerge_errorlineage at the
> distributor. Does this indicate anything seriously wrong?
> - Database Indexes are defragged once a week.
> - I have just modified some triggers to reduce the amount of data being
> replicated; a few days will show if it was helpful.
> I appreciate any feedback, comments, or suggestions. Thank you!
>
|||Thanks for your response.
Regarding retention, is this the "Subscriptions expire and may be
dropped..." setting on the Publication > Properties > General tab? If so,
ours is set to 60 days. I figure reducing the retention period will reduce
the size of the MSMerge_contents table? And would this change require
dropping and recreating all the subscriptions? That's not an option at the
current time.
I could not find where the "keep partition changes" setting is. The only
reference in BOL is with SQL-DMO.
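For what it's worth, a T-SQL check like the following should show the
current setting without going through SQL-DMO; the publication name is a
placeholder:

    -- List merge publication properties; on SQL Server 2000/2005 the
    -- result set should include a keep_partition_changes column.
    EXEC sp_helpmergepublication @publication = N'OurPublication';

    -- To turn the option on (check Books Online for side effects first):
    -- EXEC sp_changemergepublication
    --     @publication = N'OurPublication',
    --     @property    = N'keep_partition_changes',
    --     @value       = N'true';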
"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in
news:#QmoFD4rHHA.1296@.TK2MSFTNGP06.phx.gbl:

> If your subscribers connect frequently you should try to drop the
> retention period to something small, like 4 days.
> Also you might want to evaluate how you have created your publication
> to see if there are any optimizations you can make. For example do you
> need filters? Is Keep partition changes is true?
