Lync SBA’s: The Good, The Bad, and The Annoying

If you’ve deployed Lync Enterprise Voice, you’ve at least had a discussion around Survivable Branch Appliances. You may have even deployed them. After having deployed (or assisted in deploying) and supported over 25 of them, I’ve learned quite a few lessons about SBA’s that I thought I would share.

First off: sizing. If you look at the official Microsoft documentation, they say that an SBA can support between 25 and 1,000 users. However, the SBA vendors do not offer just one SBA option. They offer options with 2GB RAM and 4GB RAM. They offer SBA’s with different CPU’s. So is there a disconnect between what Microsoft says on TechNet and what the vendors are providing? Can a 2GB SBA really support 500 users?

That really ends up being the wrong question, or at least it’s not the only question. I have some 2GB SBA’s that cannot support 25 users. The Front End service crashes every day or so. This behavior is not exhibited on the 4GB models.

It turns out that the supported user count isn’t the only metric you need to review when sizing an SBA. You also need to understand how large your entire Lync deployment is (or will be). If you will only have a few thousand users, then I think a 2GB SBA would work out fine. But if you have tens of thousands or hundreds of thousands of users, don’t even consider a low-powered SBA. As an example: we have an 8-person office on a 4GB SBA, and it experiences none of the issues that our larger offices see on 2GB SBA’s. All of those user accounts, all 10,000 or 50,000 or 200,000 of them, get at least partially replicated to the SQL store on the SBA’s (at least I think they do; I’ve never looked into the DB to see what is in there). Those large databases eat RAM. And without RAM, the Lync Front End service crashes.

There are other issues to consider before buying a low-powered SBA. How often will you need to monitor or troubleshoot? If Lync barely runs in 2GB of RAM, how well will your logging tools perform? We have crashed the Lync Front End Services on a 4GB RAM SBA just by trying to open a log file in Snooper that was too large for the SBA to handle. Lesson learned. So now when dealing with large log files we have to copy them off of the SBA to our local PC’s to open them in OCSlogger/Snooper. All of this adds delay while the users in that location can’t make or receive phone calls.

With a low-powered SBA, other things will take longer too. Your patching window will need to be longer just because installing a Lync Cumulative Update or Windows patches will take longer. You surely need to run antivirus. You may also run an SCCM/SCOM agent or two. Antimalware? IDS/IPS agent? Any other management stuff running and chewing up CPU and RAM? If you add up the price difference for the extra RAM and/or the faster CPU, will you make that money back in less downtime and quicker issue resolution?

Installing and configuring an SBA is completely different than bringing up any other Lync role. Traditionally, you add a device to Topology and then either run the setup off the Lync CD for a first time install or you run bootstrapper to add the new features. You do this via remote desktop and life is good. If something goes wrong, you can just uninstall Lync and start the install over again. This is pretty much how you install all software you’ve ever installed on a Windows Server.

But with an SBA, things are different before you even turn the thing on. First, you have to add the SBA to Active Directory and then manually add an SPN value to that computer object. I’m sure someone who’s good at AD can explain why this is needed on an SBA but not needed for any other Lync role.
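The SPN step is normally done by hand in ADSI Edit, but it can also be done from a command prompt. The sketch below assumes the value the deployment guide asks for is HOST/<SBA FQDN> (that is my recollection of the guide, so check it against your version) and uses the standard setspn tool. The computer name and FQDN are placeholders, and the setspn lines are commented out so nothing is changed until you have reviewed them.

```powershell
# Hypothetical names: replace with your SBA's computer account and FQDN.
$sbaName = 'sba01'
$sbaFqdn = 'sba01.contoso.com'

# Assumption: the SPN the SBA needs is HOST/<SBA FQDN>.
$spn = "HOST/$sbaFqdn"
$spn

# -S checks for a duplicate SPN before adding; -L lists what is registered.
# Uncomment to run (needs write access to the computer object in AD).
# setspn -S $spn $sbaName
# setspn -L $sbaName
```

Running setspn -L afterwards is a cheap way to confirm the value landed on the right computer object before you publish Topology.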

Next, after publishing Topology, you do not remote desktop to the machine and install Lync off the CD (or .iso image). Instead, you connect to a vendor-written website on the SBA to configure the server. These web-based installers handle all sorts of things such as renaming the server, adding it to a domain, and changing the password of the Administrator account. Of course they do all of this via HTTP by default, so if security is important to you, the first thing you do is waste 10 minutes installing a certificate on IIS on the SBA.

All that these web-based installers do is wrap PowerShell into a web GUI and invariably all of them have issues. For example, I have never successfully completed a certificate request through the Web installer. The other fun thing is that these SBA’s don’t have an uninstall option for Lync. So if things go wrong for whatever reason you can’t just uninstall Lync and start the install over again. You have to re-image the entire thing and set the whole thing back to scratch. Fortunately this doesn’t happen often.

But my core issue is figuring out what the point of this web-based installer is. Why not just ship a copy of the .iso with the SBA and install it just like you do every other Lync role?

In my imagination I see a bunch of Microsoft people sitting in a conference room:

Forward Thinker:  “Hey, how can a Lync administrator install an SBA when they only have limited connectivity to the device? Like, they only have a dial-up modem connection to the site or a firewall policy limits their access?”

Everyone else in the room: “WEB BASED INSTALL!!!!!”.

And so we get stuck with a web based installer but in reality the web based installer solves no issues. It only creates them. If you only have a 56K connection to a site, you probably shouldn’t be installing Lync in that site in the first place, at least not an SBA. Go with a Standard Edition. What if you only have HTTP(S) access to the site? Well, you can then install Lync but you can’t do any logging or troubleshooting with OCS Logger so you better never have an issue. In other words, this has always seemed to me to be a solution in need of a problem.

This is also one of the reasons why I greatly prefer to deploy an SBS over an SBA: I can install off an iso and I’m not limited to under-powered hardware. However, depending on how your organization is structured, you may want to limit the amount of hardware (and “ownership” of that hardware) at a remote location. So an appliance makes sense which is why we continue to push them out.


When shopping for an SBA, there are some key points to ask the vendors you are comparing:

1. How easy is it to upgrade the SBA? Based on the above diatribe, you can’t just uninstall Lync 2010 and install Lync 2013. You have to *completely* re-install the server with a brand new copy of Windows and run through the whole rotten Web-based installer again. Some of the vendors make the upgrade process generally painless by letting you download an image and then flipping a switch on the server to boot to a new partition. These are easy to upgrade remotely. Others require you to download an image and overwrite the existing installation and this has to be done via a USB key or some other transport. These are harder as you may need to do some of the upgrade steps via a serial/terminal connection. (How exactly do I do this over HTTP? I can’t. Another reason the web installer is pointless.)

2. Can your vendor provide some semblance of local support if you have offices scattered all over the globe? Some vendors are a bit more global than others and this could become an issue regarding sourcing equipment and supporting them. It becomes a bigger issue if a part fails on the gateway and a vendor who claims to be global can’t get you parts because those parts are caught up in customs.

3. How good is their support? I’ve dealt with three different SBA vendors. Two of them are great with support, one of them not so much. And to my surprise, things I heard “on the street” about the support at these vendors did not match my reality when I worked with them. So ask the vendor how easy it is to open tickets, how quickly tickets get a response, how easy or difficult it is to set up a voice call for support, etc. I don’t know if there is an easy way to get real information out of a vendor, so talk with peers about their experiences with a vendor. Alternately, if you are working with a Lync support organization, ask them how well they can support the gateway side of the product and about their experience with the support organizations of the gateway vendor. Note that I am not calling anyone out here, so don’t ask me in the comments which one of the three I’ve had the most difficulty with. I won’t say.

One other thing to keep in mind: The vendor is on the hook for supporting both Windows and Lync on the SBA. So if the Front End service crashes, don’t call Microsoft. Call the SBA vendor.

4. Manageability. Your network guys have tools that monitor their routers and switches and firewalls. Can they also monitor this device? Some of the vendors sell their own monitoring software. Check those out and compare them. Can Vendor A’s software also monitor and manage Vendor B’s gateway? Can I write custom scripts to manage or monitor the gateways myself? How easily can I extract reports from your solution and link them with my Lync monitoring reports? Can I push out firmware upgrades? Can I centrally back up my configurations? Do you have a SCOM Management Pack?

5. Completeness of Vision. This isn’t a hard and fast set of questions or requirements of a vendor. But you do want to make sure that the vendor is completely committed to Lync as one of the core facets of their business. You want to make sure that no matter what screwball telecommunications connection you need to use in whatever screwball location, the gateway will be able to handle the connection. As an example, we had to connect to a screwy SIP trunk provider, and in order to make the connection work the gateway had to manipulate the SIP headers being sent to the provider. I was impressed that this feature was available, but then this completeness of features is one of the reasons we use this vendor. I have full confidence that anything we ever need to connect to our gateways will be able to be handled by this vendor.


Make sure that your SBA’s can route to your Edge servers. As calls come in to an SBA from the gateway, Lync will go through its whole STUN/TURN/ICE game and that includes seeing if using the Edge is a good option. But if the SBA cannot reach the Edge servers then calls will fail. There are some workarounds to this issue but if you have a properly configured network you won’t need to use them. We have one office that is always messing up their DNS servers. We ended up having to add our Edge servers to the local Hosts file on the SBA so that the SBA could reliably resolve and connect to the Edge servers.
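For anyone who ends up needing the same Hosts file workaround, here is a minimal sketch of building the entry. The function name Format-HostsEntry, the IP address, and the Edge pool name are all made up for illustration; the Add-Content line is commented out because it edits a system file and needs an elevated prompt on the SBA.

```powershell
# Hypothetical helper: builds one hosts-file line (address, whitespace, name).
function Format-HostsEntry {
    param(
        [Parameter(Mandatory=$true)] [string] $IPAddress,
        [Parameter(Mandatory=$true)] [string] $HostName
    )
    "{0}`t{1}" -f $IPAddress, $HostName
}

# Placeholder address and name for the internal Edge interface.
$entry = Format-HostsEntry -IPAddress '192.0.2.10' -HostName 'edgepool.contoso.com'
$entry

# To apply on the SBA, uncomment and run from an elevated prompt:
# Add-Content -Path "$env:SystemRoot\System32\drivers\etc\hosts" -Value $entry
```

Keep in mind that a Hosts entry silently overrides DNS, so it has to be updated by hand if the Edge address ever changes.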

Don’t put in an SBA thinking it will solve all of your congested WAN problems. Sure, if you can keep calls off the WAN that will address a portion of your WAN congestion. But if your WAN fills up the SBA could start dropping calls (inability to reach Edge) and/or putting your SBA-homed users into limited functionality mode (inability to reach parent pool).

And no matter what, make sure you have QoS working across your WAN. Someone could be copying a large file across the WAN link and during that time Lync can’t deliver calls and/or your users go into limited functionality mode. QoS helps avert this.

Since I’m talking about congested WAN’s I may as well bring this up: configure the client policy for all of your remote users to use web based address book lookups in the Lync client instead of downloading the address book. Even if the bandwidth is negligible between the two, consider this problem:

We were migrating remote users from Lync 2010 to Lync 2013. One week later we got reports from the network group that Lync was crushing the WAN connection to one of our remote offices. After some work we figured out that everyone in the office was downloading the Lync address book at about the same time and there wasn’t enough WAN bandwidth to support this. We effectively knocked that office off the WAN due to address book downloads. We changed the client policy to Address Book Web Query and told everyone in the office to sign out and back in on their Lync client. Within an hour or so the traffic calmed down. We then changed our global policy to Address Book Web Query only.
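In the Lync Server Management Shell, the policy change described above comes down to the AddressBookAvailability setting on the client policy. A minimal sketch, assuming you want it on the Global policy as we did (use a site or per-user policy identity instead if you only want to target remote offices):

```powershell
# Requires the Lync Server Management Shell.
# Show the current address book setting on every client policy.
Get-CsClientPolicy | Select-Object Identity, AddressBookAvailability

# Switch the global policy to web query only (no local address book download).
Set-CsClientPolicy -Identity Global -AddressBookAvailability WebSearchOnly
```

Clients pick up the new setting at their next sign-in, which is why we had people sign out and back in.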


Conferencing. Installing an SBA does not change the way Lync dial-in conferencing works. An SBA/SBS cannot be a conferencing server. So if you publish a dial-in conferencing number that is hosted by the SBA, keep in mind that all traffic on that conference is still going across the WAN to your Front End servers. You may actually be increasing your WAN usage with people now calling the number at the remote office to join meetings. Also, know how many available lines or SIP trunks you have connecting your gateway to the phone system. If you only have 10 SIP channels, you can only have 10 callers dialing in to that dial-in conferencing number. The 11th caller gets a busy signal. This could also prevent customers from calling you because all 10 channels are being used for the conference.

Don’t blindly add a dial-in conferencing number to an SBA. Be sure that the local users know how the voice is routed and what the maximum number of invitees should be. Also make sure QoS is enabled on the WAN so people dialing in do not have a bad meeting experience.


We didn’t set up gateway failover initially, but we have gone back and fixed that. When we initially configured our gateways, we only configured a connection from the gateway to our SBA. So what happens if the SBA crashes or is getting upgraded or patched? All calling fails, as the gateway can’t reach the Mediation service on the SBA. Instead, set up a Mediation server in your parent pool to be a fallback route (both inbound and outbound) in case the SBA is unavailable. While calls will now be travelling over your WAN during an outage, calls can still be made and received.

And be sure you have QoS configured on your WAN so that these calls don’t sound terrible.


I used to think that SBA’s were neat little devices. Now I kind of hate them. Not because they perform poorly. A properly sized SBA can handle 800 or more users in the largest of environments and once deployed we kind of forget they even exist. But upgrading them, configuring them, troubleshooting them, and dealing with their quirks is just a giant pain. I would love it if Microsoft nuked the entire install process in Lync vNext and just made it the exact same process used to install every other piece of Lync. I’m a big fan of the SBS precisely because every complaint I have about the SBA’s doesn’t exist with an SBS. You install it the same way you install everything else. You aren’t limited by overpriced and under-powered hardware. Microsoft handles the support. If they could take this flexibility and put it into the SBA model then life would be just that little bit better.

Moving Immovable Users

This is probably the first of a few blog posts regarding a problem we are facing with our Lync 2013 environment. In short, we have 2 corrupt routing groups right now. Users assigned to those routing groups are unable to add a contact to their buddy list and they cannot change their status.

This tip isn't anything too special and a lot of you may already know this but I'm putting it out there in case someone else runs into this situation.

Our initial thought was to move the users to a different pool which will remove them from one of the bad routing groups. However, we cannot move the users to a different pool. When doing so, we get the errors seen below.

PS C:\Users\flinchbot> Move-CsUser "user@flinchbot.com" -Target pool.flinchbot.com
Confirm
Move-CsUser
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help
(default is "Y"):
Move-CsUser : Distributed Component Object Model (DCOM) operation begin move
away failed.
At line:1 char:1
+ Move-CsUser "user@flinchbot.com" -Target pool.flinchbot.com
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 + CategoryInfo : InvalidResult: (:) [Move-CsUser], MoveUserExcept
 ion
 + FullyQualifiedErrorId : FAILED::MoveRetry,Microsoft.Rtc.Management.AD.Cm
 dlets.MoveOcsUserCmdlet
Move-CsUser : Distributed Component Object Model (DCOM) operation
RollbackMoveAway failed "-1007781356".
At line:1 char:1
+ Move-CsUser "user@flinchbot.com" -Target pool.flinchbot.com
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 + CategoryInfo : InvalidResult: (:) [Move-CsUser], MoveUserExcept
 ion
 + FullyQualifiedErrorId : FAILED::MoveRetry,Microsoft.Rtc.Management.AD.Cm
 dlets.MoveOcsUserCmdlet
Move-CsUser : Distributed Component Object Model (DCOM) operation begin move
away failed.
At line:1 char:1
+ Move-CsUser "user@flinchbot.com" -Target pool.flinchbot.com
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 + CategoryInfo : InvalidOperation: (CN=Uk,lre poc..flinchbot,DC
 =com:OCSADUser) [Move-CsUser], MoveUserException
 + FullyQualifiedErrorId : MoveError,Microsoft.Rtc.Management.AD.Cmdlets.Mo
 veOcsUserCmdlet

So that wasn't going to work. So we decided to try a force-move of the users. In general a force-move is to be avoided as this process will move the user but it will throw away, among other things, any contact list entries.

So we did an Export-CsUserData of the user’s information first:

PS C:\Users\flinchbot> Export-CsUserData -UserFilter "user@flinchbot.com" -Poolfqdn pool.flinchbot.com -filename "e:\tempuser.zip"

We verified that the data was correct by extracting the .zip file and looking at the .xml file. In there we could see the contact list entries that the user already had.

Next we did the force-move.

PS C:\Users\flinchbot> Move-CsUser "user@flinchbot.com" -Target pool.flinchbot.com -force
Confirm
Move-CsUser [Using Force will cause data loss!]
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help
(default is "Y"):

This moved the user. Finally we restored the data using the Update-CsUserData cmdlet:

PS C:\Users\flinchbot> Update-CsUserData -UserFilter "user@flinchbot.com" -FileName "e:\tempuser.zip" -verbose
VERBOSE: Processing input file e:\tempuser.zip.
VERBOSE: Opening file
C:UsersflinchbotAppDataLocalImportUserDataTemp.Xml.
VERBOSE: Opening file e:\tempuser.zip.
VERBOSE: Processed 1 users so far.
VERBOSE: User user@flinchot.com specified in User Filter processed.
VERBOSE: Output file C:UsersflinchbotAppDataLocalImportUserDataTemp.Xml
 generated successfully.
VERBOSE: Processing user t-user@flinchbot.com.
VERBOSE: Processed 1 users so far.
Confirm
Are you sure you want to perform this action?
Performing operation "Update-CsUserData" on Target "user@flinchbot.com".
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help
(default is "Y"):

After signing out of the user account and signing back in we saw the contacts had been restored. We were also now able to add new users to the contact list as well as update the Lync status.

Moving the user back to their original pool gave the same errors as in the first example above. We need to figure that issue out but at least our users can have full Lync client functionality again even if they are now in the wrong pool.

Quick & Dirty – Gather Shutdown Tracker Events

Today I needed to see if my Front End servers had been shut down “dirty” and when. So I kicked out the following script.

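# Pull event 41 (logged when a server restarts without a clean shutdown) from each server in the pool.
$servers = Get-CsComputer -Pool lyncpool.flinchbot.com
foreach ($server in $servers)
{
    Write-Host $server.Fqdn
    # -Append (PowerShell 3.0+) collects every server's events into one CSV.
    Get-EventLog -ComputerName $server.Fqdn -LogName System -InstanceId 41 |
        Export-Csv -Path shutdowns.csv -Append -NoTypeInformation
}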

Port 5088 Missing from Lync 2013 Documentation

If they had the other Harry Caray, a whole lot of Budweiser would be missing too.

We had an issue where users were able to sign in with Lync mobility but were unable to send and receive IM’s. There are 2 things to note about this scenario:

1. The users are homed on an SBA

2. There are firewalls between the SBA and the parent pool.

So if you don’t have this scenario then you can quit reading now as you won’t ever have this problem.

In order to troubleshoot why our users were unable to successfully use Lync mobility, we jumped into the logs. We reviewed the log from the mobile phone and it showed nothing useful. We enabled the Lync Logging tool on the SBA and had a user log in and try to send an instant message.

Reviewing this log, we saw a request for port 5088 from the SBA to the parent pool. The request was to a specific server in the parent pool and it came from our Survivable Branch Appliance.

If you look at the image below you’ll see this in the Snooper view of the collected log file. The ms-diagnostics line pretty much spells this out as clearly as you could expect.

Look at the circle. It’s 5088!

Port 5088 does not currently exist on the Lync Ports and Protocols page on TechNet. Searching for this port turns up very little outside of one TechNet article. That article covers the Set-CsWebServer PowerShell cmdlet, which is used to define the web server settings in Lync. If you expand the Parameters section in the article and scroll down to UcwaSipExternalListeningPort, you will see that it is documented as using 5088/tcp by default. This is incorrect: 5088 is the port used by UcwaSipPrimaryListeningPort. The TechNet article has the two ports switched (the same error is seen when running Get-Help Set-CsWebServer -Detailed).

Run Get-CsService -WebServer and you will see the default ports. Note that they don’t match the documentation.
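To cut that output down to just the mobility-related ports, something like the following should work. It uses wildcards rather than exact property names, on the assumption that the properties on the returned objects are named like the Set-CsWebServer parameters (the Ucwa and Mcx prefixes); if they are named differently in your build, the columns will simply come back empty.

```powershell
# Requires the Lync Server Management Shell.
# List the UCWA and MCX SIP listening ports for each web service.
Get-CsService -WebServer | Format-List PoolFqdn, *Ucwa*, *Mcx*
```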

 

In other words, even when Microsoft has documented this port in TechNet, they got it wrong. We didn’t see port 5089 in any of our traces so we couldn’t figure out when this port gets used.

After we updated the firewalls in front of our parent pool Lync servers, the problem immediately disappeared and our SBA users were able to successfully IM via their mobile clients.
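A quick way to confirm a firewall change like this from the SBA is to attempt a TCP connection to 5088 on each Front End. The sketch below uses the .NET TcpClient class so it does not depend on newer cmdlets. The function name Test-TcpPort and the server names in the commented usage are made up for illustration.

```powershell
# Hypothetical helper: returns $true if a TCP connection to the port succeeds.
function Test-TcpPort {
    param(
        [Parameter(Mandatory=$true)] [string] $ComputerName,
        [Parameter(Mandatory=$true)] [int]    $Port,
        [int] $TimeoutMs = 3000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        # No answer within the timeout usually means a firewall is dropping packets.
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) { return $false }
        $client.EndConnect($async)   # throws if the connection was refused
        return $true
    }
    catch { return $false }
    finally { $client.Close() }
}

# Placeholder Front End names; run this from the SBA.
# 'fe01.contoso.com', 'fe02.contoso.com' | ForEach-Object {
#     '{0}:5088 reachable = {1}' -f $_, (Test-TcpPort -ComputerName $_ -Port 5088)
# }
```

A successful connection only proves the port is open end to end; it does not prove mobility works, so still test with a real mobile client.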


Our contact at Microsoft has forwarded this omission to the relevant teams so hopefully at some point this will be added to the Lync ports and protocols page.


Credit for figuring this out goes to Antwan, who is resurrecting his UC Playa blog. I’m just the one who wrote the article.

Lync 2013 and Useless(?) Topology Updates

We noticed today (and a few days ago, for that matter) that our CMS Replication state was “False” an awful lot of the time. So much so that we thought our CMS Replication was broken. We failed over our CMS role [1] the other day and, after coming back from lunch, all of our replicas were “True”. Well, we tried the same trick today and it didn’t fix the problem. We dug deep into the logs and it appeared that everything was actually working correctly. We even went so far as making a simple change (New-csUserPolicy “Delete This Policy”) and verifying after a few minutes that it showed up on a few of our other Lync servers [2]. So we turned our focus to why the replication status wasn’t ever “True” [3].

I’ll skip ahead a little here and get to the point where we made our little discovery. We exported a topology, then waited a random amount of time – say 5 minutes. Then we exported another copy of topology. We took the DocItemSet.xml file from each export and did a text comparison between the two files. Lo and behold there was a change. What was this Topology change?

A user migration.

Yes, moving a user from one pool to another caused a topo refresh to our servers. What the???

Our production environment is pretty big. As such, there are almost constant changes in the environment – be it updating a dial plan or disabling a user. In other words, it’s essentially dumb luck if we ever see our replication status set to “True” on all of our servers.
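If you want to see which replicas are behind at a given moment rather than scanning the whole list, a filter along these lines does the job:

```powershell
# Requires the Lync Server Management Shell.
# Show only the replicas that are not currently up to date.
Get-CsManagementStoreReplicationStatus |
    Where-Object { -not $_.UpToDate } |
    Select-Object ReplicaFqdn, LastStatusReport, LastUpdateCreation
```

Run it a few times a minute apart. In a busy environment the list of lagging servers keeps changing, which is a good hint that replication is working and just never gets a quiet moment.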


I was able to replicate this in my lab which has no automated systems enabling users or other system admins editing dial plans or the like. I can control the environment very tightly.

I exported a copy of the topology. I then ran “Move-CsUser flinchbot -Target lync2013se.flinchbot.com”. I then waited 5 minutes and exported the topology a second time. Next I went to an online text-comparison site and copied the first topology file into the left pane and the updated topology file into the right pane. It found 5 changes.

Look at the bottom right of this image.

The first is (and I am guessing here) a hash of some sort letting the recipient servers know that there has been a change to the following section (XML node). This is found at a root node in the XML document (I think that’s the right term).  The next change is similar. Like above, I think it’s a marker to point out that within the root node above, this is the specific entry that has changed.

Finally we get to the actual change. Notice that the usercount decrements from 320 to 319. This is the move of the user FROM the source pool.

The fourth change is similar to the second change above – I think it’s just pointing out that “here be changes”.

I have no users on the destination pool (well maybe a random account or two). As such, you can see that the usercount going from 0 to 1 is completely expected if a new user is moved to this specific pool.
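The export-and-compare procedure above can also be done without a website. Here is a rough sketch that compares two extracted DocItemSet.xml files line by line with Compare-Object. The function name Compare-TopologyExport and the paths in the commented usage are made up, and it assumes the XML is spread across multiple lines; if your export is one long line, pretty-print both files first or the whole file will show up as a single difference.

```powershell
# Hypothetical helper: lists lines that exist in only one of the two files.
function Compare-TopologyExport {
    param(
        [Parameter(Mandatory=$true)] [string] $Before,
        [Parameter(Mandatory=$true)] [string] $After
    )
    # SideIndicator '<=' means only in $Before, '=>' means only in $After.
    Compare-Object -ReferenceObject (Get-Content -Path $Before) `
                   -DifferenceObject (Get-Content -Path $After) |
        Select-Object SideIndicator,
            @{ Name = 'Line'; Expression = { $_.InputObject.Trim() } }
}

# Usage sketch (placeholder paths). Export, unzip, change something, repeat:
# Export-CsConfiguration -FileName C:\temp\before.zip
# ... move a user, wait a few minutes ...
# Export-CsConfiguration -FileName C:\temp\after.zip
# Compare-TopologyExport -Before C:\temp\before\DocItemSet.xml `
#                        -After  C:\temp\after\DocItemSet.xml
```

A changed line shows up twice, once for each side, so the 5 changes from the website would appear as 10 rows here.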


So….the question is why is there a topology update sent out for a user move?

All signs point to Windows Fabric and/or pool pairing being the reason. But why would you spam all of the Lync servers in your entire infrastructure with a change that is only relevant to a subset and then only if they are using Windows Fabric?

And then the change is only the number of users?

If the user count for a pool is set to 1501 in one of these files, is this the event that triggers Windows Fabric to create a new user routing group or to re-balance its groups? It seems an awful brute-force kind of way to do this.

Consider an environment with tens of thousands or hundreds of thousands of users. Users are being created/deleted/moved all the time. Now files are being blasted around the network constantly to inform all of your servers that a user was moved. Admittedly these files tend to be fairly small. In my lab they are 30K in size. In the production environment I help manage these files are much larger.

As a fun side effect, all of these topo pushes will account for additional writes to the SQL XDS Database, which will fill up your SQL logs faster.

So I don’t know why Microsoft architected it this way. But if you see that your CMS state is False an awful lot then it may very well be normal for your environment.


 

Footnotes:

[1] You can move the active CMS host(s) by stopping the Lync Server File Transfer Agent, Lync Server Master Replicator Agent, and Lync Server Replica Replicator Agent on the current active CMS host(s). This forces an election and one of the other Front End servers will pick up one or both of the roles.

[2] For reference, this was done by running Export-CsConfiguration -Filename export.zip -LocalStore. Looking in the returned export.zip file at the DocItemSet.xml file, we found that the change had indeed replicated.

[3] For the record, to check your replication status, run Get-CsManagementStoreReplicationStatus.


New Windows Phone UC App

About 2 years ago I released the Lync News app for Windows Phone. Today that app has been retired and replaced with “flinchböt on UC”, an app which covers Lync as well as Exchange and has a fairly terrible name (I was in a hurry and didn’t give the name any thought). The new app is streamlined compared to the previous one, partly because it was built with App Studio instead of native Visual C++ and partly because the older one was a bloated mess.

So if you have been using the Lync News app on Windows Phone, thanks – but it’s time to uninstall it! This version has way better load times for not only the app but for the Lync feed as well. The Exchange feed is a bit laggy but since I rarely have to deal with Exchange in my job I don’t care that it’s slow.

The app is fairly self-explanatory. The one thing to point out: to see the full, original post, click the URL link at the top of a given article. Otherwise you can read it in a slightly less readable format within the app. You can also pin an article to your start screen. If you have an article open, tap on the menu, then Share. Pick “Share Link” and then you can save to OneNote, which is hot. That would be a really cool way to save articles.

Here is the link to download the app to your Windows 8 phone.

As a reminder, there is also a similar app for Android that can be found here.

Below are some screenshots.

 

[Screenshots: wp_ss_20140505_0002 through wp_ss_20140505_0006]

 

Fun with KHI and Performance Monitor

A few weeks ago I wrote a post basically saying that the Lync Stress Tool was worthless. In it I said you should really monitor the progress of your Lync deployment using Performance Monitor. I also pointed to the Key Health Indicators that Microsoft recommends you use to monitor your Lync installs. Heck, they even have a script to easily install the KHI Data Collector Set into Performance Monitor for you.

As we built our Lync 2013 servers, we installed the KHI Data Collector Set on each server as part of our standard build process. As we have about 40 Lync servers it’s a pain to go back to 40 servers and update the KHI Data Collector Set configuration. For example, we want to change the logging directory off of our c: drive and to the e: drive. We’d also like to launch the performance monitor collection every so often, have it run for a week, and then stop. Manually starting Performance Monitor on 40 servers? This is where PowerShell comes in.

I cobbled together a script to change the settings of the KHI Data Collector Set in Performance Monitor. If the KHI Data Collector Set is not installed on the server, the script installs it. After updating (or installing) the KHI Data Collector Set, it starts it on all of the servers. This is a total time saver. I won’t share the entire script here because I copied the entire Microsoft-written KHI script and buried it into mine. Copyright, plagiarism, etc.

But I will give you enough information to build your own script.

At the top of the script is this:


$arrServers = Import-Csv e:\scripts\servers.csv

This reads in a simple list of all of the servers I want to manipulate. Set the Header in the file to “ServerName”.
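For illustration, here is a minimal sketch of what that file might contain and how the rows come back as objects. The server names are made up, and ConvertFrom-Csv stands in for Import-Csv only so the example runs without a file on disk.

```powershell
# Hypothetical contents of servers.csv: a "ServerName" header, then one server per line.
$csvLines = 'ServerName', 'lyncfe01.contoso.local', 'lyncfe02.contoso.local'

# ConvertFrom-Csv parses lines of text the way Import-Csv parses a file.
$arrServers = $csvLines | ConvertFrom-Csv

foreach ($server in $arrServers)
{
    # Each row is an object with a ServerName property, taken from the header.
    Write-Host "Would work on $($server.ServerName)"
}
```

Because the header is “ServerName”, every row is reachable as $server.ServerName, which is how the rest of the script refers to it.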

Next, I pasted in the two functions at the top of the Microsoft Script. I edited the CreateDataCollector function to look like this:


Function CreateDataCollector
{
    Write-Host -ForegroundColor Green "Creating Lync Server 2013 KHI Data Collector on $($server.ServerName)..."

    # Create the KHI counter set on the remote server (-s), logging to e:\Perflogs with the server name in the file name
    Invoke-Expression "logman.exe create counter KHI -o e:\Perflogs\KHI_$($server.ServerName) -f csv -si 15 -v mmddhhmm -cf .\LyncServer2013KHIs.config -s $($server.ServerName)"
    # Clean up the temporary counter config file
    Remove-Item .\LyncServer2013KHIs.config
}

I edited the Write-Host line to properly display the server name as it comes from the text file we are using. I then deleted a few lines and built my own Invoke-Expression command. Note that in this one I am slipping the server name into the name of the log file. I am also pointing the log file to an e:\Perflogs directory.

The CreateKHIsTextFile function is left unchanged.

And then after those 2 functions is the code I cobbled together.

Function StartKHI
{
    # Query throws if the KHI collector set does not exist on the server
    $datacollectorset.Query("KHI", $Server.ServerName)
    # Change already-installed KHI collector set to log to e: drive instead of default c: drive
    Invoke-Expression "logman.exe update KHI -o e:\Perflogs\KHI_$($Server.ServerName) -b 5/1/2014 17:00:00 -e 5/8/2014 17:00:00 -s $($Server.ServerName)"
    # Start the collector set
    $datacollectorset.Start($false)
}

foreach ($Server in $arrServers)
{
 Write-host "Working on" $Server.ServerName "..." -ForegroundColor Green

 $datacollectorset = New-Object -COM Pla.DataCollectorSet;
 try
 {
#If the collector set is not already installed, it errors. If no error, start the collector
 StartKHI
 }
 catch
 {
#Starting the collector crashed, so it's probably not installed. Install it, then start it.
 write-host ("KHI counters not installed on {0}" -f $Server.ServerName) -ForegroundColor Green
 write-host "Installing...." -ForegroundColor Green
 CreateKHIsTextFile
 CreateDataCollector
 StartKHI
 }
}
 

I’ll assume you are fairly well versed in PowerShell, so let me point out the one bit of creativity I had to use. The “$datacollectorset.Query("KHI", $Server.ServerName)” call returns nothing if it worked. If it fails, it blows up and scrawls PowerShell blood all over your screen. So the way to tell whether the KHI is already installed is to use a Try/Catch construct. If the try works, it starts the KHI Data Collector successfully. If it fails, then I assume that the KHI Data Collectors haven’t been installed. So I call the Microsoft-written (and slightly edited by me) functions to install it. Once those are done, I go ahead and start the Data Collector.

So using this script, I am able to either install the KHI Data Collectors or to update them with values I want. If you look at the Invoke-Expression line in the StartKHI function, I use the -b and -e parameters. This sets a begin and end time for the collector to run. In this case it is one week. You will probably have to edit this before running your copy.
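Rather than hand-editing the dates each time, the begin and end values could be computed. This is only a sketch, assuming logman on your servers accepts US-style M/d/yyyy dates as in the command above. The five-minute lead time and the server name are assumptions for illustration.

```powershell
# Sketch: build the logman -b / -e values for a one-week collection window.
# The invariant culture keeps "/" and ":" literal regardless of regional settings.
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$begin   = (Get-Date).AddMinutes(5)   # assumed lead time so the start is not already in the past
$end     = $begin.AddDays(7)

$beginText = $begin.ToString('M/d/yyyy HH:mm:ss', $culture)
$endText   = $end.ToString('M/d/yyyy HH:mm:ss', $culture)

# Hypothetical server name, for illustration only.
$serverName    = 'lyncfe01.contoso.local'
$logmanCommand = "logman.exe update KHI -b $beginText -e $endText -s $serverName"

# Print the command instead of running it, so it can be checked first.
Write-Host $logmanCommand
```

Once the printed command looks right, the same string could be passed to Invoke-Expression the way StartKHI does.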


Below is a short script to stop the Data Collector Set. It’s useful when testing.


$arrServers = Import-Csv e:\scripts\servers.csv

foreach ($Server in $arrServers)
{
 Write-host "Working on" $Server.ServerName "..." -ForegroundColor Green

 try
 {
 $datacollectorset = New-Object -COM Pla.DataCollectorSet;
 $datacollectorset.Query("KHI", $Server.ServerName);
 $datacollectorset.Stop($false);
 }
 catch
 {
 write-host "KHI counters already stopped on $($Server.ServerName)" -ForegroundColor Green
 }
}

In the above you don’t really have to use the Try/Catch. It’s just to make things prettier (i.e., less PowerShell blood).


So if you cobble the full script together, you can install the KHI Data Collector set, edit its settings, and start and stop the collector. Pretty useful, especially if you have a lot of servers. Now the next challenge: What do you do with 40 servers-worth of logs?

Find SIP Addresses with Illegal Characters

One of my peers had a Lync 2013 pool-failover scenario. Just about everything worked right except that apparently the Lync Backup Service had been getting hung up and not completing its replication cycles. They opened a case with Microsoft and one of the issues discovered was that Lync Backup Service was hanging on users whose SIP Address had illegal characters. Once they manually fixed these SIP Addresses, the Backup Service was able to complete successfully.

So what characters are illegal in a SIP address (at least so far as Lync is concerned)?

~ | { } [ ] < > ` # ^ & @

We can convert that to a Regular Expression:

^([^~|{}\[\]<>`#^'&@\\]+)$

Once that is done, a quick and dirty script can be written to compare every user against this Regular Expression. If the Regular Expression matches the SIP Address, then we can be notified of this.


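# These are the invalid characters ~|{}[]<>`#^&@
# (the pattern also rejects the apostrophe and the backslash)

$AddressToTest = Get-CsUser
# In a double-quoted string, `` is a literal backtick
$regex = "^([^~|{}\[\]<>``#^'&@\\]+)$"

Foreach ($user in $AddressToTest)
{
    # Test only the part left of the "@", minus the leading "sip:"
    If (($user.SipAddress -split "@")[0].Substring(4) -notmatch $regex)
    {
        Write-Host "Invalid username specified." $user.SipAddress
    }
}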


The only fancy part of this script is in the If statement. We can’t compare the entire SIP Address against the regex because the “@” will always be a match. So the Split is used to grab the left hand side of the SIP Address which is the portion that will (most likely) have illegal characters. You’ll also note the “substring” portion in the if statement. This means begin the comparison 4 characters in; skip the “sip:” portion of the returned SIP Address.
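To see the If test in isolation, here is a small sketch that runs the same split-and-substring logic against a few made-up SIP addresses instead of Get-CsUser output. The sample addresses and the Test-SipLocalPart helper name are assumptions for illustration. The pattern is one way of writing the illegal-character list (plus the apostrophe and backslash) as a negated character class.

```powershell
# Negated character class built from the illegal characters, plus apostrophe and backslash.
# In a double-quoted string, `` is a literal backtick.
$regex = "^([^~|{}\[\]<>``#^'&@\\]+)$"

# Hypothetical helper: returns $true when the part before the "@" contains no illegal characters.
function Test-SipLocalPart([string]$SipAddress)
{
    # Drop the domain, then skip the 4-character "sip:" prefix.
    $localPart = ($SipAddress -split '@')[0].Substring(4)
    return ($localPart -match $regex)
}

# Made-up addresses for illustration.
$samples = 'sip:jane.doe@contoso.com', 'sip:bad#user@contoso.com', 'sip:odd{name}@contoso.com'

foreach ($address in $samples)
{
    if (-not (Test-SipLocalPart $address))
    {
        Write-Host "Invalid username specified." $address
    }
}
```

Only the second and third sample addresses should be reported, since “jane.doe” contains nothing from the illegal list.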

Note that if you want to test out this script in a lab environment, you can force a user to have any illegal character if you edit their SIP Address via ADSIEdit. Also note that Set-CsUser will permit you to edit a SIP Address and insert a few of the above characters.

Here is sample output from the script:

[Screenshot: Invalid_URI_Capture]

 

Tom Arbuthnot points to a TechNet document specifically calling out the unsupported usage of the hyphen and apostrophe here: http://tomtalks.uk/2014/08/apostrophe-and-dash-not-supported-in-user-sip-addresses-in-lync-server-find-problem-sip-uris/

Disabling HTTP in OWAS/WAC

We built our OWAS farms and, like most Lync people, had no clue what we were doing. But they ended up working anyway so hooray for us.

Now that we are begrudgingly learning a little about it we have learned that we should disable HTTP on the pools and run with HTTPS only.

So we tried the obvious command to disable HTTP:

Set-OfficeWebAppsFarm -AllowHTTP $False

That gives this wonderful error:

Set-OfficeWebAppsFarm : A positional parameter cannot be found that accepts argument ‘False’.
At line:1 char:1
+ Set-OfficeWebAppsFarm -AllowHTTP $False
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo          : InvalidArgument: (:) [Set-OfficeWebAppsFarm], ParameterBindingException
+ FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.Office.Web.Apps.Administration.SetFarmCommand

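After asking around, we found that the secret to this command is to use a colon (:) instead of a space between the parameter and the value. AllowHTTP is a switch parameter, so PowerShell does not expect a value after it; with a space, $False is treated as a separate positional argument, hence the error. As such, this is the proper syntax: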

Set-OfficeWebAppsFarm -AllowHTTP:$False
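The same behavior can be reproduced without an OWAS farm. The sketch below uses a made-up function, Set-DemoFarm, with a switch parameter standing in for the real cmdlet; it is not how Set-OfficeWebAppsFarm is implemented, just an illustration of how PowerShell binds switches.

```powershell
# Hypothetical function with a switch parameter, standing in for Set-OfficeWebAppsFarm.
function Set-DemoFarm
{
    # CmdletBinding makes unbound positional arguments an error, as with real cmdlets.
    [CmdletBinding()]
    param([switch]$AllowHTTP)

    return "AllowHTTP is $($AllowHTTP.IsPresent)"
}

# Switch present with no value: it is on.
Set-DemoFarm -AllowHTTP

# The colon binds $False directly to the switch.
Set-DemoFarm -AllowHTTP:$False

# With a space, $False is a separate positional argument and binding fails.
try
{
    Set-DemoFarm -AllowHTTP $False
}
catch
{
    Write-Host "Binding failed: $($_.Exception.Message)"
}
```

The colon form works for any parameter, but it is only required for switches, which is why it rarely comes up.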

Note that if you have the SSLOffloaded parameter set to True, you cannot disable AllowHTTP. If you try, you get this warning:

WARNING: When offloading SSL, AllowHttp is automatically enabled.

To work around this, run the following command to set both to false.

Set-OfficeWebAppsFarm -SSLOffloaded:$False -AllowHTTP:$False

For more detail and tips on how to secure your Office Web Apps, see this blog.