At one of our remote locations, a coworker submitted a ticket saying he could not consistently complete voice calls via Lync. Here is the ticket as he wrote it:
When on the VPN, you can only make or receive calls over Lync at best 5% of the time. This needs to be set as a fairly high priority and escalated.
Alrighty then. Step 1 was to make sure he knew what he was doing. As I knew this coworker, I figured he wasn’t doing anything stupid but it never hurts to check. Too bad for both of us, he wasn’t doing anything stupid.
I had him turn on logging on the client and ship me the logs. I found nothing – no error or anything. He did send me the below screenshot of what he got after clicking the Call button within Lync.
Well, that is a nice and vague error. Further, Googling for error ID 16389 brings up frustratingly little. So it was time to take matters into my own hands! I got hooked up with access to their VPN and tried to reproduce the error myself. Sadly, I experienced the exact same issue. This at least let me know that it was not a problem with the user or their PC: I am a known-good user on a known-good laptop, and I got the same error.
I then figured something on the VPN must be blocking traffic. So I telnetted to the pool on port 5061 and got connected. I then telnetted to the internal side of our Lync Edge and did not succeed. Aha! For whatever reason, this VPN must have trouble connecting directly to the peer, so the client falls back to using the Edge as a proxy. It can’t reach the Edge, and thus we get the error. This should be easy enough to fix.
I got with our networking team and had them add in the correct routes between the VPN and the Edge. I then tested that I could connect to TCP/5061 and TCP/5062 on the Edge and I was successful. So I made a few test calls and they all still failed. Drat. There goes that idea.
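If you would rather script the telnet-style port checks than type them by hand, something like the following would do. This is only a minimal sketch: the function name can_connect and the 3-second default timeout are my own choices for illustration, and it only proves that a TCP handshake completes, not that Lync is happy on the other end.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    This is the scripted equivalent of "telnet host port": it only tests
    that the three-way handshake succeeds, nothing about TLS or SIP.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling can_connect with the internal FQDN of the Edge and ports 5061 and 5062, once before and once after the routing change, would have shown the same before/after difference I saw with telnet.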
Everything else seemed to check out. I verified that the VPN was resolving DNS correctly, that traceroutes were taking the correct path, and that I could connect to all relevant ports.
Time to break out the logging. I checked the logs on my client and never saw an error packet when I tested the call. Frustrating, especially since every so often a call would work.
So I kept logging. I even broke out Wireshark and started using that. At some point I got a successful call trace. I compared the Wireshark traces of a good call and a bad call. Both appeared to go through the same negotiation, yet one worked and the other did not. The one thing I did notice in the Wireshark traces is that Lync always seemed to want to use public IP addresses instead of its internal VPN-assigned address for negotiation. I also saw this in the Lync logs.
Finally I found an error in the Lync logs – not an error record mind you – but text buried in a standard I/O Out message as seen below.
So if you look in that mess, you’ll see my first actual clue – “Local endpoint allocation failed”. Sweet – I have a lead. Finally! Now – what does it mean? My guess is that as part of the connection negotiation it wasn’t able to open a port for connectivity to the peer. (And essentially that was the issue). Now the question is why?
To keep things confidential, I had to mask a bunch of stuff in that last screenshot. If the masking weren’t there, you’d notice a lot of public IP addresses, particularly on that very first line. (11/16/2011|15:04:18.053 1DDC:1DE0 INFO :: Sending Packet – [Public IP]:443 (From Local Address: [VPN Client IP]:54635) 1319 bytes:)
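Picking these lines out of a large client log by eye is tedious, so here is a rough sketch of how one might extract the destination and local address from each "Sending Packet" line. The pattern is inferred from the single masked line quoted above and assumes unmasked IPv4 addresses; other client builds may format the line differently. The names SentPacket and parse_sent_packet are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SentPacket(NamedTuple):
    dest_host: str
    dest_port: int
    local_host: str
    local_port: int
    size: int


# Inferred from one sample line; accepts either a hyphen or an en dash
# (\u2013) after "Sending Packet", and assumes IPv4 addresses.
SEND_RE = re.compile(
    r"Sending Packet\s*[-\u2013]\s*(?P<dest>[\d.]+):(?P<dport>\d+)\s*"
    r"\(From Local Address:\s*(?P<local>[\d.]+):(?P<lport>\d+)\)\s*"
    r"(?P<size>\d+) bytes"
)


def parse_sent_packet(line: str) -> Optional[SentPacket]:
    """Return the parsed packet details, or None if the line does not match."""
    m = SEND_RE.search(line)
    if m is None:
        return None
    return SentPacket(
        dest_host=m["dest"],
        dest_port=int(m["dport"]),
        local_host=m["local"],
        local_port=int(m["lport"]),
        size=int(m["size"]),
    )
```

Running every log line through parse_sent_packet and listing the distinct dest_host values makes it obvious at a glance whether the client is talking to public or internal addresses.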
That is essentially the problem. My VPN-connected client is trying to send packets to our public edge. But why? I verified with our networking team that we categorically do NOT permit split-tunneling. I could prove this myself by not being able to ping any device on my local network.
So if split tunneling is not allowed, why is my Lync client insisting on trying to reach the Edge via the public IP? If it really needed the edge, shouldn’t it use the internal IP of the Edge?
I ran out of ideas here and brought in our Lync consultant for some clarity. After about 60 minutes of testing and reviewing the logs he had an idea as to what was happening.
If you aren’t familiar with how Lync connects to an assigned pool and, more specifically, how Lync caches connection information then I recommend you read this article and come back.
The upshot is this: When my coworker took his laptop home and fired it up, Lync automatically connected to our public Edge, exactly as it should. My coworker then fired up his VPN and went on about his day until he needed to call someone. At this point, he got the error in the very first screenshot above.
At our headquarters where our Lync pool is situated, you literally cannot connect to the public IP address of our Edge via TCP/5061, TCP/5062, or any other manner. So when you connect via VPN to corporate, the same story carries forward – you cannot access the public Edge IP.
But in our remote office, you CAN reach the public Edge IP while VPN’ed in. So when you bring up the VPN, your connection to the Edge (more or less) stays up. The Lync client has no reason to re-negotiate its connection to the pool. In our corporate location, the connection to the Public Edge is severed when you bring up the VPN and eventually Lync re-negotiates its connection and becomes a proper “internal” client.
The crazy thing is that we have 2 different, independent SRV records for Lync – internally it is _sipinternaltls._tcp and on the public side it is _sip._tcp. So part of my confusion is wondering why Lync can connect to the public side without even being told HOW to get to the public side once you bring up the VPN. In other words – if my coworker closed Lync, then connected to the VPN, and then fired up Lync, his client would STILL connect over the edge.
The answer is that the Lync client caches the information for the last-known-good connection. In my coworker’s case, this was the Edge. Lync also (apparently) caches the IP address of that host. So it doesn’t even need to do a DNS lookup – it just tries right out of the gate to connect to the last-known-good IP address. And in the case of the remote VPN, he could connect – at least until he tried to use any AV features within Lync.
Very long story short – I ended up working with our networking group to add a route on the remote VPN server. They routed the 2 public IP addresses of our Edge hosts to 127.0.0.1. In other words, any traffic heading through the remote VPN server for those 2 hosts gets dropped. Eventually, Lync figures out it can’t connect to any servers and begins renegotiation. At this point, it finally decides to do a DNS lookup, gets the data for our internal hosts, and connects to them. Once this is done, AV on the Lync client works successfully via the remote VPN.
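To make the logic concrete, here is a toy model of the behavior as I understand it. To be clear, this is not Lync's actual code: it is a sketch of the "try the cached endpoint first, fall back to DNS only if that fails" pattern described above, and every name and address in it (choose_server, EDGE_PUBLIC, POOL_INTERNAL) is made up for illustration.

```python
from typing import Callable, Optional

# Hypothetical addresses, purely for illustration.
EDGE_PUBLIC = "203.0.113.10"   # stands in for the public Edge IP
POOL_INTERNAL = "10.1.1.20"    # stands in for the internal pool


def choose_server(
    cached_ip: Optional[str],
    dns_lookup: Callable[[], str],
    reachable: Callable[[str], bool],
) -> str:
    """Pick a server the way the client appears to: cache first, DNS second."""
    # If the last-known-good address still answers, no DNS lookup happens.
    if cached_ip is not None and reachable(cached_ip):
        return cached_ip
    # Only when the cached endpoint fails does the client rediscover via DNS.
    return dns_lookup()


def lookup_internal() -> str:
    """Stand-in for the internal SRV lookup made while on the VPN."""
    return POOL_INTERNAL


# Before the fix: the public Edge is reachable through the remote VPN,
# so the cached address wins and the client stays "external".
before_fix = choose_server(EDGE_PUBLIC, lookup_internal, lambda ip: True)

# After the fix: traffic to the public Edge is dropped, so the client
# falls through to DNS and lands on the internal pool.
after_fix = choose_server(
    EDGE_PUBLIC, lookup_internal, lambda ip: ip != EDGE_PUBLIC
)
```

In this model before_fix is the public Edge address and after_fix is the internal pool address, which matches what we saw once the route was in place.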
So the takeaway here is that, at least in our network, blocking access to the Public Edge addresses from our internal network was the fix we needed.
The one thing I can’t figure out is why the connectivity for AV fails when using the Edge via VPN when it works just fine when connected straight over the Internet. Our networking group says we don’t filter any ports outbound so it *should* have still worked and we should have never been notified of this error. Looking at the logs, the Lync client looks like it was trying to bring up a port on the public IP address of the outbound Internet router. Now this is similar to how it works for me at home. My PC/Laptop is never chosen as the best choice for STUN/TURN – rather the IP of my home gateway (router) is chosen. But I am using a cheap gateway that passes anything. Our routers at corporate may be smarter and are dropping traffic that isn’t “perfect” and so the AV negotiation fails.