Cloud PHD

Replication Broken. Again?

One might imagine that after fixing Active Directory replication issues so many times, for so many reasons, I would have a handle on things. How naive of me! Sure, when someone says “we have issues with AD,” I know it’s going to take either minutes or hours to get it fixed. Either way, I usually have a good direction. Well, this time around was something new for me.

In this case the client has 2 domain controllers, 1 at each site (production and datacenter). I will call them “PROD” and “DC.” All of the FSMO roles are held by DC. All users are located at the PROD site.

User problems: No access to any internal server resources

Today’s winning errors:
1) From PROD, accessing the shares on DC by UNC (\\DC\SYSVOL, \\DC\NETLOGON) results in “Access Denied”
2) From the server command line, “repadmin /showreps” spits out fun garbage like “The target principal name is incorrect.”
3) DCDIAG /test:DNS is equally unhappy and results in “LDAP bind failed with error 8341”
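If you save the output of repadmin and DCDIAG to text, the two tool errors quoted above are easy to flag with a few lines of script. This is only a minimal sketch in Python: the function and constant names are made up for illustration, and the list contains just the two messages from this post, not a complete catalog of replication errors.

```python
from typing import List

# The two tool errors quoted in this post, lowercased for matching.
# Hypothetical constant name; extend the tuple for your own environment.
KNOWN_SYMPTOMS = (
    "the target principal name is incorrect",
    "ldap bind failed with error 8341",
)


def matching_symptoms(tool_output: str) -> List[str]:
    """Return the known symptom strings that appear in the given output."""
    lowered = tool_output.lower()
    return [symptom for symptom in KNOWN_SYMPTOMS if symptom in lowered]


if __name__ == "__main__":
    # Sample text built only from the messages quoted in the post.
    sample = "The target principal name is incorrect."
    print(matching_symptoms(sample))
```

An empty list only means that neither of these two messages was found; it does not mean replication is healthy.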

Let me point out the following observations and tests from the server side:
1) NETLOGON, FRS, DNS and KDC services all running and no problems with restarting the services to troubleshoot
2) SYSVOL and NETLOGON shares are shared on both domain controllers
3) Local login and remote desktop connection to both servers is successful
4) All DNS entries, NS, A and PTR records are correct
5) Ping responses are fine in both directions
6) From PROD, accessing its own SYSVOL and NETLOGON shares by UNC (\\PROD\SYSVOL, \\PROD\NETLOGON) is no problem
7) From DC, accessing its own shares, also no issue
8) From DC, accessing the shares on PROD is also no problem
9) Confirmed with repadmin /replsum that neither server has gone past the tombstone lifetime, since it’s been only a few days since the last successful replication (the problem was reported on a Monday morning)
10) In case of random luck, server restarts don’t help

Our Solution: Reset the secure channel between the domain controllers

Unfortunately, here’s where I got tied up for a while. Many of the suggested fixes use only the “NETDOM RESET” command on DC to complete this task. At least by definition, that is what the “RESET” parameter of the NETDOM command is supposed to do.

In our case we also needed to purge the Kerberos ticket cache on the PROD domain controller.

The Complete Fix:
1. Stop the Key Distribution Center (KDC) service on PROD.
• You can use the services MMC snap-in or from the command prompt, run: NET STOP KDC

2. Clear the Kerberos ticket cache on PROD.
On Server 2003 you can use the Kerbtray.exe tool which is included in the Resource Kit.
• Load Kerbtray.exe. Click the Start menu, click Run, and then type c:\program files\resource kit\kerbtray.exe
• A green ticket icon should appear in your system tray in the lower right corner of your desktop.
• Right-click the green ticket icon in your system tray, and then click Purge Tickets.
• You should receive a confirmation that your ticket cache was purged. Click OK.
On Server 2008 and above, the KLIST command line utility is included and accomplishes the same thing.
• From the command line run: KLIST PURGE

3. Reset the domain controller machine account password on DC (the PDC emulator).
• From the command line on DC run: NETDOM RESETPWD /SERVER:PROD /userd:yourdomain.com\administrator /passwordd:yourpassword

4. Force Active Directory replication.
• From the command line run: REPADMIN /SYNCALL

5. Start the KDC service on PROD.
• You can use the services MMC snap-in or from the command prompt, run: NET START KDC

Your replication attempts should now succeed and users will be able to access the servers again.
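Since the order of the steps matters and they are split across two servers, it can help to generate the sequence as a checklist before touching anything. The sketch below, in Python, only prints the commands and executes nothing. The names `Step`, `build_fix_plan` and `format_plan` are hypothetical. It also makes two assumptions that are not in the steps above: it passes `/passwordd:*` so that NETDOM prompts for the password instead of taking it on the command line, and it lists REPADMIN /SYNCALL under DC because the post does not say which server to run it from.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    host: str      # domain controller on which to run the command
    command: str   # command line to type at that host's prompt


def build_fix_plan(kdc_host: str, pdc_host: str, admin: str) -> List[Step]:
    """Return the five steps from this post, in order."""
    return [
        Step(kdc_host, "net stop kdc"),
        # klist ships with Server 2008 and above; on 2003 use Kerbtray instead.
        Step(kdc_host, "klist purge"),
        # Assumption: "*" makes netdom prompt for the password.
        Step(pdc_host,
             f"netdom resetpwd /server:{kdc_host} /userd:{admin} /passwordd:*"),
        # Assumption: the post does not name the host for this step.
        Step(pdc_host, "repadmin /syncall"),
        Step(kdc_host, "net start kdc"),
    ]


def format_plan(plan: List[Step]) -> List[str]:
    """Render each step as 'N. [HOST] command'."""
    return [f"{i}. [{s.host}] {s.command}" for i, s in enumerate(plan, start=1)]


if __name__ == "__main__":
    plan = build_fix_plan("PROD", "DC", r"yourdomain.com\administrator")
    print("\n".join(format_plan(plan)))
```

Keeping the host next to each command makes it harder to stop the KDC or purge tickets on the wrong server when you are switching between two remote desktop sessions.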