Common DFSR Configuration Mistakes and Oversights
- Too small of a Staging Area Quota
Are you seeing a lot of event IDs 4202 and 4204? If
so, your staging area is not sized correctly. The downside to an
improperly sized staging area is that replication performance will be
negatively affected as the service has to spend time cleaning up the
staging area instead of replicating files.
DFSR servers are more efficient with a full staging area for at least these two reasons:
- It is much more efficient to stage a file once and send it to all
downstream partners than to stage the file, replicate it, and purge it
separately for each downstream partner.
- If at least one member is running Enterprise Edition, the servers can take advantage of Cross File RDC.
An improperly sized staging area can also cause a replication “loop”
to occur. This condition happens when a file is replicated to the
downstream server and is present in the staging area, but the staging
area cleanup process purges it before it can be installed into the
Replicated Folder. Because the downstream server never installed the
file, the file is replicated to it again. This process keeps repeating
until the file is finally installed.
Don’t ignore staging area events.
See this blog post for the method to use to determine your minimum staging area size
See the section “Increase Staging Quota” here
See “Remote Differential Compression details” here for information on Cross File RDC
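As a rough illustration of the sizing method those posts describe: my understanding of the guidance (it is not spelled out in this post) is that the staging quota should be at least the combined size of the 32 largest files in the Replicated Folder on Windows Server 2008 and 2008 R2, and the 9 largest on Windows Server 2003 R2. The sketch below adds up the N largest files under a path. The function name and the `largest_n` parameter are made up for this example, so treat the result as a starting point and confirm it against the linked post.

```python
import heapq
import os


def minimum_staging_bytes(root, largest_n=32):
    """Return the combined size in bytes of the largest_n biggest files under root.

    largest_n=32 is an assumption taken from the commonly cited guidance for
    Windows Server 2008 / 2008 R2; use 9 for Windows Server 2003 R2.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                # Skip files that vanish or cannot be read during the scan.
                continue
    return sum(heapq.nlargest(largest_n, sizes))
```

Run it against the root of the Replicated Folder and compare the result with the configured quota. A busy server will usually want more headroom than this minimum.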
- Improper or Untested Pre-seeding procedure
Pre-seeding is the act of copying the data that will be
replicated to a new replica member before that member is added to the
Replicated Folder, with the goal of reducing the amount of time it takes
to complete initial replication. Most failed pre-seeding cases I have
worked on failed in one of three ways:
- ACL mismatch between source and target.
- Changes were made to the files after they were copied to the new member.
- No testing was done to verify that the pre-seeding process worked as expected.
In short, the files must be copied in a certain way, you cannot change
the files after they are copied to the new member, and you must test your process.
Click here to read Mr. Pyle’s blog on how to properly pre-seed your DFSR servers
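A quick way to test part of a pre-seeding process is to compare the source and target trees after the copy. The sketch below is only a partial check and not the method from the linked post: it compares file names and SHA-256 hashes of file contents, so it can catch missing files and files changed after the copy, but it does not look at ACLs at all. An ACL mismatch, the first failure mode above, will pass this check unnoticed. The function names are invented for this example.

```python
import hashlib
import os


def content_hashes(root):
    """Map each file's path relative to root to the SHA-256 of its contents."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            hashes[os.path.relpath(full, root)] = digest.hexdigest()
    return hashes


def preseed_differences(source_root, target_root):
    """Return (missing_on_target, extra_on_target, content_mismatch) as sorted lists."""
    src = content_hashes(source_root)
    dst = content_hashes(target_root)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    mismatch = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, extra, mismatch
```

Three empty lists mean the contents match. That is necessary but not sufficient, so still verify security descriptors with the procedure in the linked post.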
- High backlogs for extended periods of time
Besides the fact that having high backlogs for extended
periods of time means your data is out of sync, it can lead to unwanted
conflict resolution where a file with older content wins in a conflict
resolution scenario. The most common way I have seen this condition hit
is when rolling out new RFs. Instead of doing a phased rollout, some
admins will add 20 new RFs from 20 different branch offices at once,
overloading their hub server. Stagger your rollouts so that initial
replication is finished in a reasonable amount of time.
- DFSR used as a backup solution
Believe it or not, some admins run DFSR without backing up
the replicated content offline. DFSR was not designed as a backup
solution. One of DFSR’s design goals is to be part of an enterprise
backup strategy in that it gets your geographically distributed data to a
centralized site for backup, restore and archival. Multiple members do
offer protection from server failure; however, this does not protect you
from accidental deletions, because a deletion replicates to every member
just like any other change. To be fully protected you must back up your
data.
- One-way Replication – Using it, or Fixing One-way Replication the Wrong Way
In an attempt to prevent unwanted updates from occurring
on servers where they know the data will never be changed (or they don’t
want changes made there), many customers have configured one-way
replication by removing outbound connections from replica members.
One-way replication is not supported on any version of DFSR before
Windows Server 2008 R2. On Windows Server 2008 R2, one-way replication is
supported provided you configure Read-Only replicated folders. Using
Read-Only DFSR members allows you to accomplish the goal of one-way
replication, which is preventing unwanted changes to replicated content.
If you must use one-way replication with DFSR, then you must use Windows
Server 2008 R2 and mark the members as read-only where changes to content
should not occur.
Click here and here to learn about Read-Only DFSR replicas.
Another common problem occurs when an admin discovers that
one-way replication is not supported and then goes about fixing it the
wrong way. Simply enabling two-way replication again can have undesirable
results. See the blog post below on how to recover from one-way
replication.
Click here to learn about fixing one-way replication.
- Hub Server – Single Point of Failure or Overworked Hub Servers
I have seen many deployments with just one hub server.
When that hub server fails, the entire deployment is at risk. If you’re
using Windows Server 2003 or 2008, you should have at least 2 hub servers
so that if one fails, the other can handle the load while the failed
server is repaired, with little to no impact to the end users. Starting with
Windows Server 2008 R2, you can deploy DFSR on a Windows Failover Cluster,
which gives you high availability with half of the storage requirement.
Other times admins will have too many branch office servers
replicating with a single hub server. This can lead to delays in
replication. Knowing how many branch office servers a single hub server
can service is a matter of monitoring your backlogs. There is no magic
formula as each environment is unique and there are many variables.
Read the “Topology Tuning” section here for ideas on deploying hub servers.
Click here to learn how to set up DFSR on a Windows Server 2008 Failover Cluster.
- Too many Replicated Folders in a single Jet Database
DFSR maintains one Jet database per volume. As a result,
placing all of your RFs on the same volume puts them all in the same Jet
database. If that Jet database has a problem that requires repair or
recovery of the database, all of the RFs on that volume are affected. It
is better to spread the RFs out using as many volumes as possible to
provide maximum uptime for the data.
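To see how many RFs would be affected by a single database problem, you can group their local paths by volume. The sketch below is a simplification: it uses the drive letter as a stand-in for the volume, which is an assumption that breaks down if you use mount points. The function name and sample paths are hypothetical.

```python
import ntpath
from collections import defaultdict


def group_by_volume(rf_paths):
    """Group replicated folder paths by drive letter, a stand-in for the volume."""
    volumes = defaultdict(list)
    for path in rf_paths:
        # ntpath parses Windows-style paths regardless of the host OS.
        drive, _rest = ntpath.splitdrive(path)
        volumes[drive.upper()].append(path)
    return dict(volumes)


if __name__ == "__main__":
    sample = [r"D:\Data\Sales", r"D:\Data\HR", r"E:\Projects"]
    for drive, paths in sorted(group_by_volume(sample).items()):
        print(f"{drive} holds {len(paths)} replicated folder(s) sharing one database")
```

Any drive that holds most of your RFs is the one where a database repair would hurt the most.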
- Bargain Basement iSCSI deployments
I have seen more than one DFSR iSCSI deployment where the
cheapest hardware available was used. Usually if you are using DFSR, it
is for some mission-critical purpose such as data redundancy, backup
consolidation, or pushing out applications and OS upgrades on a schedule.
Depending on low-end hardware with little to no vendor support is not a
good plan. If the data is important to your business, then the hardware
that runs the OS and replication engine is important to your business.
- Did not maintain the DFSR service at the current patch level
DFSR is actively maintained by Microsoft and has updates
released for it as needed. You should update DFSR when a new release is
available during your normal patch cycle. Make sure your servers are up
to date per the KB articles listed below.
Windows 2003 R2 DFSR Hotfixes
Windows 2008 and Windows 2008 R2 DFSR Hotfixes
You will notice that updates are listed for NTFS.SYS and other files
besides DFSR.EXE/DFSRS.EXE. For replication always make sure DFSR and
NTFS are at least at the highest version listed. The other patches
listed mostly deal with UI issues that you will at minimum want
installed on the systems you perform DFSR configuration tasks on.
Proactively patching the DFSR servers is advisable even if everything
is running normally, as it will prevent you from being affected by a
known issue.
- Did not maintain NIC Drivers
DFSR will only work as well as the network you provide
for it. Running drivers that are 5 years old is not smart. Yes, I have
talked with more than a few customers with NIC drivers that old whose
DFSR replication issue was resolved by updating the NIC driver.
- Did not monitor DFSR
Despite the fact that the data DFSR is moving around is
usually mission critical, many admins have no idea what DFSR is doing
until they discover a problem. Savvy admins have created their own
backlog scripts to monitor their servers, but a lot of customers just
“hoped for the best”. The DFSR Management Pack has been out for almost a
year now (and other versions for much longer). Deploy it and use it so
you can detect problems and respond before they become a nightmare. If
you can’t use the DFSR Ops Manager Management Pack, at a minimum write a
script to track your backlogs on a daily basis so that you know whether
DFSR is replicating files.
Click here for information on the Ops Manager DFSR Management Pack.
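If you go the script route, one possible shape is a small wrapper around the `dfsrdiag backlog` command. The sketch below is an illustration and not a supported tool. It assumes `dfsrdiag` is on the path of the machine running the script and that its output contains either a line like `Backlog File Count: 12` or the phrase `No Backlog`; check both assumptions against the output on your own servers. The function names and the server names in the comments are placeholders.

```python
import re
import subprocess

# Assumed output format; verify against real dfsrdiag output before relying on it.
BACKLOG_COUNT = re.compile(r"Backlog File Count:\s*(\d+)", re.IGNORECASE)


def parse_backlog(output):
    """Return the backlog file count from dfsrdiag output, or None if unrecognized."""
    match = BACKLOG_COUNT.search(output)
    if match:
        return int(match.group(1))
    if "no backlog" in output.lower():
        return 0
    return None


def query_backlog(rg_name, rf_name, sending, receiving):
    """Run dfsrdiag for one sending/receiving pair and return the parsed count.

    Example (placeholder names): query_backlog("RG1", "RF1", "HUB01", "BRANCH01")
    """
    command = [
        "dfsrdiag", "backlog",
        f"/rgname:{rg_name}", f"/rfname:{rf_name}",
        f"/smem:{sending}", f"/rmem:{receiving}",
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return parse_backlog(result.stdout)
```

Schedule it daily for each connection in both directions and log the counts. A result of `None` means the output was not understood, which deserves a look just as much as a high number does.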
Updated Jan 19, 2011:
- Making changes to disk storage without first backing up the data
If you must replace or add hard drive space to your DFSR server, it is
critical that you have a current backup of the data in case something
goes wrong. Any number of things can go wrong, the most common being
conflict events created due to accidental changes to a parent folder, or
unintentional deletion of a parent folder that is replicated to all
partners. You must back up your data before starting the changes and
maintain the backup until the project is completed.
- Stopping the DFSR service to temporarily prevent replication
Sometimes there is a need to temporarily stop replication. The proper way to
accomplish this is to set the schedule to no replication for the
Replication Group in question. The DFSR service must be running to be
able to read updates to the USN Journal. Do not stop the DFSR service
for long periods of time (days, weeks) as doing so may cause a journal
wrap to occur (if many files are modified, added, or deleted in the
meantime). DFSR will recover from the journal wrap but in large
deployments this will take a long time and replication will not occur or
will happen very slowly during the journal wrap recovery. You may also
see very high backlogs until the journal wrap recovery completes.