NSX ALB Failed to Bypass an Unhealthy Horizon Connection Server: Root Cause, Symptoms, and Permanent Fix

After deploying NSX Advanced Load Balancer (Avi) in front of Horizon Connection Servers, one specific server became unstable. Its web console would not open, Horizon services stopped, and client access became unreliable. In theory, the load balancer should have removed that bad node and directed users to healthy Horizon Connection Servers. In practice, that did not happen consistently.

The immediate workaround was simple: remove the problematic Connection Server from the load balancer pool. That restored service. But that was only an operational bypass, not a permanent fix.

Problem Summary

The real issue was not just that one Horizon Connection Server failed. The bigger issue was that the load-balancing layer did not isolate the failure fast enough or accurately enough.

Critical point: if a failed Horizon Connection Server remains eligible in the NSX ALB pool, user sessions can still be sent to a server that is technically alive at the network layer but functionally dead at the application layer.

Main Symptoms

  • The web console on one Horizon Connection Server would not open
  • Horizon-related services on that server stopped or became unresponsive
  • Users experienced intermittent connection failures or inconsistent broker behavior
  • NSX ALB did not reliably steer all new sessions away from the affected server
  • Service stabilized only after the problematic server was excluded from the pool

Most Likely Root Causes

1. Health Monitor Was Too Shallow

This is the most likely explanation. If NSX ALB was only checking basic TCP connectivity or a superficial HTTPS response, the server could still appear healthy even though Horizon broker services or the admin interface were effectively down.

2. Persistence Kept Sending Sessions to the Same Bad Node

Horizon requires load balancer persistence so that a user's session keeps landing on the same Connection Server. If the unhealthy node was not removed from the pool quickly enough, that same persistence could keep some users pinned to the failed server.

3. Inline LB Design Reduced Failure Visibility

If the architecture places the load balancer inline between UAG and the Horizon Connection Servers, failure visibility is reduced. In that design, upstream components do not always detect an individual failed Connection Server directly.

4. Connection Server Itself Had an Internal Fault

The bad node may have had an OS-level or application-level problem such as:

  • Hung Horizon services
  • Certificate or trust issue
  • Resource exhaustion such as CPU, memory, or disk pressure
  • Windows service dependency failure
  • Broken communication with AD, LDAP, database, or other Horizon components

5. Health Monitor Path or Probe Logic Was Misaligned with Real User Traffic

A common design mistake is probing an endpoint that does not truly represent broker readiness. A server can answer a low-level probe while still failing real logon or brokering requests.

Why Excluding the Server Worked

Excluding the problematic Horizon Connection Server from the NSX ALB pool immediately removed the failed node from new traffic distribution. That is why the environment recovered.

However, this only proves that the node was bad. It does not prove that the load balancer logic was correctly identifying unhealthy Horizon service states.

Operational Impact

  • Failed or inconsistent user logins
  • Unexpected desktop launch failures
  • Broker instability that appears random from the user side
  • False confidence in high availability because the pool still contained a broken server
  • Longer MTTR because the problem looked like an intermittent Horizon issue instead of a detection failure

Troubleshooting Checklist

  1. Confirm the affected Horizon Connection Server cannot open its web console
  2. Check whether Horizon services are stopped, hung, or restarting repeatedly
  3. Verify NSX ALB pool member state and health monitor result details
  4. Check whether the pool member still shows up despite Horizon application failure
  5. Review persistence behavior and whether users remain pinned to the failed node
  6. Validate whether the health monitor is TCP-only or application-aware
  7. Review Windows Event Viewer and Horizon logs on the failed server
  8. Check CPU, memory, disk, TLS/certificate status, and connectivity to AD and other Horizon components
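Checklist items 4 and 6 can be spot-checked from any admin workstation. The sketch below is illustrative only, not NSX ALB's actual probe code; the host name in the usage example is hypothetical, and HTTPS on port 443 is an assumption. It separates "TCP port open" from "HTTPS application answers", which is exactly the gap a shallow monitor misses.

```python
import http.client
import socket
import ssl


def tcp_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Layer-4 check: does the port accept a connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def https_status(host: str, path: str = "/", port: int = 443, timeout: float = 5.0):
    """Layer-7 check: does the web tier return an HTTP response?
    Returns the status code, or None if the application never answers."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # lab sketch only; verify certificates in production
    ctx.verify_mode = ssl.CERT_NONE
    try:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status
    except (OSError, http.client.HTTPException):
        return None


def interpret(l4_ok: bool, l7_status):
    """Classify the node the way an application-aware monitor should."""
    if not l4_ok:
        return "down"          # hard failure: even a TCP-only monitor catches this
    if l7_status is None or l7_status >= 500:
        return "half-dead"     # port open, application failing: TCP-only monitors miss this
    return "healthy"
```

Usage would look like `interpret(tcp_open("cs01.corp.local"), https_status("cs01.corp.local"))`, where `cs01.corp.local` stands in for a real Connection Server.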

Permanent Fix Strategy

1. Redesign the Health Monitor

Replace shallow health checks with application-relevant monitoring. A basic TCP connect test is often insufficient for Horizon Connection Servers. The monitor should reflect real broker readiness, not just whether a port accepts a connection.
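When tuning the monitor, it also helps to reason about worst-case detection time. The simplified model below assumes the usual knobs of a probe interval, a consecutive-failure threshold, and a per-probe timeout (names here are illustrative, not exact NSX ALB API field names): a node can fail immediately after a passing probe, so detection can take up to one full interval per required failed check plus the final probe's timeout.

```python
def worst_case_detection_seconds(send_interval: float,
                                 failed_checks: int,
                                 probe_timeout: float) -> float:
    """Upper bound on how long an unhealthy member can keep receiving
    new sessions before the monitor marks it down.

    Assumes the member fails right after a successful probe, so the
    monitor needs `failed_checks` consecutive failures spaced
    `send_interval` apart, and the last probe still waits out its timeout.
    """
    return send_interval * failed_checks + probe_timeout


# Example: probe every 10 s, mark down after 2 failures, 4 s probe timeout.
# Worst case, users can be brokered to a dead server for ~24 s.
```

Tightening the interval and failure threshold shortens this window at the cost of more probe load; most load balancers also apply a mirror-image success threshold before re-admitting a member.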

2. Review Persistence Settings

Horizon requires persistence, but persistence must not keep users stuck to a bad node longer than necessary. Confirm that affinity settings match Horizon guidance and that failover behavior is acceptable when a node becomes unhealthy.
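Conceptually, the desired "fail-away" behavior looks like the sketch below (pure illustration; NSX ALB implements this internally, and the member names are hypothetical). The pin is honored only while the pinned member is healthy; otherwise the client is deterministically re-pinned to a surviving member.

```python
import zlib


def pick_server(client_id: str, pool_health: dict, sticky: dict) -> str:
    """Source-IP-style persistence with fail-away:
    honor an existing pin only while that pool member is healthy."""
    pinned = sticky.get(client_id)
    if pinned is not None and pool_health.get(pinned, False):
        return pinned

    # Pinned member is missing or unhealthy: re-pin to a healthy member,
    # chosen deterministically so repeated requests stay consistent.
    healthy = sorted(member for member, ok in pool_health.items() if ok)
    if not healthy:
        raise RuntimeError("no healthy Connection Servers in pool")
    chosen = healthy[zlib.crc32(client_id.encode()) % len(healthy)]
    sticky[client_id] = chosen
    return chosen
```

Note the deliberate design choice: once re-pinned, the user stays on the new member even after the old one recovers, which avoids flapping a session between brokers.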

3. Review the Traffic Path Design

If the load balancer is positioned inline between UAG and Connection Servers, review whether that design is making failure detection less reliable. Simplifying the path can improve observability and fault isolation.

4. Fix the Failed Horizon Server Itself

The root cause is not solved until the affected Connection Server is investigated directly. Review:

  • Horizon service state and restart history
  • Windows OS stability
  • Resource usage trends
  • Certificate validity and trust
  • Network communication to required Horizon dependencies
  • Patch level and known bugs

5. Add Monitoring for “Half-Dead” States

The dangerous condition is not fully down or fully healthy. It is the half-dead state where the server still answers at L4 or HTTPS but no longer functions correctly as a Horizon broker. That exact state must be monitored explicitly.
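One simple way to monitor that state explicitly is to track application-level failures separately from L4 reachability and alert only once the combination persists, so a single slow response does not eject a healthy broker. A minimal sketch (the failure threshold is an assumption to be tuned per environment):

```python
class HalfDeadDetector:
    """Flag the 'port open but application failing' state only after it
    persists for `threshold` consecutive observations."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.app_failures = 0

    def observe(self, l4_ok: bool, app_ok: bool) -> str:
        if not l4_ok:
            self.app_failures = 0
            return "down"          # hard-down: any monitor catches this
        if app_ok:
            self.app_failures = 0
            return "healthy"
        self.app_failures += 1     # alive at L4, failing at the application layer
        return "half-dead" if self.app_failures >= self.threshold else "suspect"
```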

Recommended Monitoring Items

| Monitoring Item | Why It Matters | Desired Action |
| --- | --- | --- |
| Pool member health detail in NSX ALB | Shows whether the member is considered healthy for the wrong reason | Inspect monitor type and response logic |
| Connection Server service state | Confirms whether Horizon components are running | Alert on service stop or repeated restart |
| Web console reachability | Indicates management-plane responsiveness | Detect admin UI failure separately from TCP open |
| CPU, memory, disk, and OS events | Identifies local server instability | Correlate with service failure timing |
| Persistence and session distribution | Shows whether clients remain pinned to bad nodes | Review stickiness during failover testing |

Final Assessment

This was not just a simple Horizon server outage. It was a combined availability design issue:

  1. One Horizon Connection Server became unhealthy
  2. NSX ALB did not remove it from traffic fast enough or accurately enough
  3. The workaround succeeded only because the bad node was manually excluded

In other words, the server failure was the trigger, but the real architectural weakness was incomplete failure detection and isolation.

Conclusion

In a properly designed Horizon environment, a failed Connection Server should become irrelevant to users because the load balancer should stop sending new sessions to it. If that does not happen, the issue is no longer just server failure. It becomes an HA design failure.

Excluding the bad server from the pool is the correct short-term recovery action. The permanent fix is to improve NSX ALB health monitoring, validate persistence, review path design, and repair the Horizon Connection Server that failed in the first place.

FAQ

Why did NSX ALB not bypass the failed Horizon Connection Server?

Usually because the health monitor did not reflect real Horizon broker readiness, or persistence kept traffic pinned to the bad node.

Why was removing the server from the pool effective?

Because it immediately stopped new traffic from reaching a node whose services or web console were already failing.

Is TCP health monitoring enough for Horizon Connection Servers?

Often no. A TCP port can be open while the actual Horizon service is degraded or unusable.

What is the long-term fix?

Use better health checks, validate persistence and traffic path design, and perform root-cause analysis on the failed Connection Server itself.
