We noticed today (and a few days ago, for that matter) that our CMS Replication state was “False” an awful lot of the time. So much so that we thought our CMS Replication was broken. We failed over our CMS role [1] the other day and, after coming back from lunch, all of our replicas were “True”. Well, we tried the same trick today and it didn’t fix the problem. We dug deep into the logs and it appeared that everything was actually working correctly. We even went so far as to make a simple change (New-csUserPolicy “Delete This Policy”) and verify a few minutes later that it showed up on a few of our other Lync servers [2]. So we turned our focus to why the replication status was never “True” [3].
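The replication check described above can be sketched roughly as follows. This is a minimal sketch, not the exact procedure we ran: it assumes the Lync management module is named `Lync`, that .NET 4.5 or later is available for the zip classes, and that `DocItemSet.xml` can be found somewhere inside the export. The scratch folder name is hypothetical.

```powershell
# Run on a Lync server other than the one where the test change was made.
Import-Module Lync   # assumption: module name of the Lync Server Management Shell

$marker  = 'Delete This Policy'              # name of the throwaway policy from the post
$workDir = Join-Path $env:TEMP 'cms-check'   # hypothetical scratch folder
$zipPath = Join-Path $workDir 'export.zip'

# Start from an empty scratch folder so the extraction cannot collide with old files.
if (Test-Path $workDir) { Remove-Item $workDir -Recurse -Force }
New-Item -ItemType Directory -Path $workDir | Out-Null

# -LocalStore exports this server's local replica rather than the central store.
Export-CsConfiguration -FileName $zipPath -LocalStore

# Unpack the export with the .NET zip classes.
Add-Type -AssemblyName System.IO.Compression.FileSystem
$extractDir = Join-Path $workDir 'extracted'
[System.IO.Compression.ZipFile]::ExtractToDirectory($zipPath, $extractDir)

# Look for the marker text in DocItemSet.xml.
$doc = Get-ChildItem -Path $extractDir -Recurse -Filter 'DocItemSet.xml' |
    Select-Object -First 1
if (-not $doc) {
    Write-Warning 'DocItemSet.xml was not found in the export.'
}
elseif (Select-String -Path $doc.FullName -Pattern $marker -SimpleMatch -Quiet) {
    Write-Output "'$marker' is present in the local replica."
}
else {
    Write-Output "'$marker' has not reached the local replica yet."
}
```

If the marker shows up on a server whose status still reads “False”, replication itself is working and the status flag is the thing to question.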
I’ll skip ahead a little here and get to the point where we made our little discovery. We exported the topology, then waited a random amount of time, say 5 minutes, and exported another copy of the topology. We took the DocItemSet.xml file from each export and did a text comparison between the two files. Lo and behold, there was a change. What was this topology change?
A user migration.
Yes, moving a user from one pool to another caused a topo refresh to our servers. What the???
Our production environment is pretty big. As such, there are almost constant changes in the environment, be it updating a dial plan or disabling a user. In other words, it’s essentially dumb luck if we ever see our replication status set to “True” on all of our servers at the same moment.
I was able to reproduce this in my lab, which has no automated systems enabling users and no other system admins editing dial plans or the like. I can control the environment very tightly.
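One way to put a number on that “dumb luck” is to sample the replication status repeatedly and count how often every replica reports up to date. The sketch below assumes the management module is named `Lync`. The sample count and interval are arbitrary values chosen for illustration, not something we measured.

```powershell
Import-Module Lync   # assumption: module name of the Lync Server Management Shell

$samples         = 12    # hypothetical: how many times to check
$intervalSeconds = 300   # hypothetical: seconds to wait between checks
$allUpToDate     = 0

for ($i = 1; $i -le $samples; $i++) {
    # Collect the replicas that currently report UpToDate = False.
    $stale = @(Get-CsManagementStoreReplicationStatus |
        Where-Object { -not $_.UpToDate })

    if ($stale.Count -eq 0) {
        $allUpToDate++
        Write-Host "Sample ${i}: every replica is up to date."
    }
    else {
        Write-Host "Sample ${i}: $($stale.Count) replica(s) not up to date:"
        $stale |
            Select-Object ReplicaFqdn, LastStatusReport, LastUpdateCreation |
            Format-Table -AutoSize |
            Out-Host
    }

    # Skip the wait after the final sample.
    if ($i -lt $samples) { Start-Sleep -Seconds $intervalSeconds }
}

Write-Host "$allUpToDate of $samples samples had every replica up to date."
```

In a busy environment a low count here does not by itself mean replication is broken. It may only mean that changes arrive faster than the status settles.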
I exported a copy of the topology. I then ran “Move-CsUser flinchbot -Target lync2013se.flinchbot.com”. I then waited 5 minutes and exported the topology a second time. Next I went to an online text-comparison site and pasted the first topology file into the left pane and the updated topology file into the right pane. It found 5 changes.
The first is (and I am guessing here) a hash of some sort that lets the recipient servers know there has been a change to the following section (XML node). It is found at a root node in the XML document (I think that’s the right term). The next change is similar: I think it’s a marker pointing out which specific entry within that root node has changed.
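The same lab experiment can be scripted so the comparison happens locally instead of on a web site. This is only a sketch under a few assumptions: the module name `Lync`, .NET 4.5 or later for the zip classes, and a hypothetical scratch folder. The user and target pool are the ones from my lab. If DocItemSet.xml turns out to be written as a single long line, the line-based comparison below will just report that one line as changed, and a proper XML-aware diff would be needed instead.

```powershell
Import-Module Lync   # assumption: module name of the Lync Server Management Shell
Add-Type -AssemblyName System.IO.Compression.FileSystem

$workDir = Join-Path $env:TEMP 'topo-diff'   # hypothetical scratch folder
if (Test-Path $workDir) { Remove-Item $workDir -Recurse -Force }
New-Item -ItemType Directory -Path $workDir | Out-Null

# Export the configuration, unpack it, and return the path to DocItemSet.xml.
function Export-DocItemSet([string]$Label) {
    $zip = Join-Path $workDir "$Label.zip"
    $dir = Join-Path $workDir $Label
    Export-CsConfiguration -FileName $zip | Out-Null
    [System.IO.Compression.ZipFile]::ExtractToDirectory($zip, $dir)
    $file = Get-ChildItem -Path $dir -Recurse -Filter 'DocItemSet.xml' |
        Select-Object -First 1
    return $file.FullName
}

$before = Export-DocItemSet 'before'

# The change under test: move one user to another pool.
Move-CsUser -Identity 'flinchbot' -Target 'lync2013se.flinchbot.com' -Confirm:$false

# Give the change time to be written, as in the manual test.
Start-Sleep -Seconds 300

$after = Export-DocItemSet 'after'

# "<=" marks lines only in the first export, "=>" lines only in the second.
Compare-Object (Get-Content $before) (Get-Content $after) |
    Format-Table SideIndicator, InputObject -Wrap
```

Running this with no change between the two exports is a useful control: if the files still differ, something other than the user move is touching the topology.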
I have no users on the destination pool (well maybe a random account or two). As such, you can see that the usercount going from 0 to 1 is completely expected if a new user is moved to this specific pool.
So... the question is: why is a topology update sent out for a user move?
All signs point to Windows Fabric and/or pool pairing being the reason. But why would you spam all of the Lync servers in your entire infrastructure with a change that is only relevant to a subset and then only if they are using Windows Fabric?
And then the change is only the number of users?
If the user count for a pool is set to 1501 in one of these files, is this the event that triggers Windows Fabric to create a new user routing group or to re-balance its groups? It seems an awfully brute-force way to do this.
Consider an environment with tens of thousands or hundreds of thousands of users. Users are being created, deleted and moved all the time. Now files are being blasted around the network constantly to inform all of your servers that a user was moved. Admittedly these files tend to be fairly small. In my lab they are 30 KB in size. In the production environment I help manage, these files are much larger.
As a fun side effect, all of these topo pushes account for additional writes to the SQL XDS database, which will fill up your SQL logs faster.
So I don’t know why Microsoft architected it this way. But if you see that your CMS state is False an awful lot then it may very well be normal for your environment.
[1] You can move the active CMS host(s) by stopping the Lync Server File Transfer Agent, Lync Server Master Replicator Agent, and Lync Server Replica Replicator Agent on the current active CMS host(s). This forces an election and one of the other Front End servers will pick up one or both of the roles.
[2] For reference, this was done by running Export-CsConfiguration -Filename export.zip -LocalStore. Looking in the returned export.zip file at the DocItemSet.xml file, we found that the change had indeed replicated.
[3] For the record, to check your replication status, run Get-CsManagementStoreReplicationStatus.