NSX ALB Failed to Bypass an Unhealthy Horizon Connection Server: Root Cause, Symptoms, and Permanent Fix

After deploying NSX Advanced Load Balancer (Avi) in front of Horizon Connection Servers, one specific server became unstable. Its web console would not open, Horizon services stopped, and client access became unreliable. In theory, the load balancer should have removed that bad node and directed users to healthy Horizon Connection Servers. In practice, that did not happen consistently.

The immediate workaround was simple: remove the problematic Connection Server from the load balancer pool. That restored service. But that was only an operational bypass, not a permanent fix.

Problem Summary

The real issue was not just that one Horizon Connection Server failed. The bigger issue was that the load-balancing layer did not isolate the failure fast enough or accurately enough.

Critical point: if a failed Horizon Connection Server remains eligible in the NSX ALB pool, user sessions can still be sent to a server that is technically alive at the network layer but functionally dead at the application layer.

Main Symptoms

  • The web console on one Horizon Connection Server would not open
  • Horizon-related services on that server stopped or became unresponsive
  • Users experienced intermittent connection failures or inconsistent broker behavior
  • NSX ALB did not reliably steer all new sessions away from the affected server
  • Service stabilized only after the problematic server was excluded from the pool

Most Likely Root Causes

1. Health Monitor Was Too Shallow

This is the most likely explanation. If NSX ALB was only checking basic TCP connectivity or a superficial HTTPS response, the server could still appear healthy even though Horizon broker services or the admin interface were effectively down.

2. Persistence Kept Sending Sessions to the Same Bad Node

Horizon requires load balancer persistence so that a user's session keeps landing on the same Connection Server. If the unhealthy node was not removed from the pool quickly enough, that same persistence could keep some users pinned to the failed server.

3. Inline LB Design Reduced Failure Visibility

If the architecture places the load balancer inline between UAG and the Horizon Connection Servers, failure visibility is reduced. In that design, upstream components do not always detect an individual failed Connection Server directly.

4. Connection Server Itself Had an Internal Fault

The bad node may have had an OS-level or application-level problem such as:

  • Hung Horizon services
  • Certificate or trust issue
  • Resource exhaustion such as CPU, memory, or disk pressure
  • Windows service dependency failure
  • Broken communication with AD, LDAP, database, or other Horizon components

5. Health Monitor Path or Probe Logic Was Misaligned with Real User Traffic

A common design mistake is probing an endpoint that does not truly represent broker readiness. A server can answer a low-level probe while still failing real logon or brokering requests.

Why Excluding the Server Worked

Excluding the problematic Horizon Connection Server from the NSX ALB pool immediately removed the failed node from new traffic distribution. That is why the environment recovered.

However, this only proves that the node was bad. It does not prove that the load balancer logic was correctly identifying unhealthy Horizon service states.

Operational Impact

  • Failed or inconsistent user logins
  • Unexpected desktop launch failures
  • Broker instability that appears random from the user side
  • False confidence in high availability because the pool still contained a broken server
  • Longer MTTR because the problem looked like an intermittent Horizon issue instead of a detection failure

Troubleshooting Checklist

  1. Confirm the affected Horizon Connection Server cannot open its web console
  2. Check whether Horizon services are stopped, hung, or restarting repeatedly
  3. Verify NSX ALB pool member state and health monitor result details
  4. Check whether the pool member still shows up despite Horizon application failure
  5. Review persistence behavior and whether users remain pinned to the failed node
  6. Validate whether the health monitor is TCP-only or application-aware
  7. Review Windows Event Viewer and Horizon logs on the failed server
  8. Check CPU, memory, disk, TLS/certificate status, and connectivity to AD and other Horizon components
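Checklist items 4 and 6 can be spot-checked from any admin workstation. The sketch below is illustrative only, not NSX ALB's actual probe code; the host name in the usage example is hypothetical, and HTTPS on port 443 is an assumption. It separates "TCP port open" from "HTTPS application answers", which is exactly the gap a shallow monitor misses.

```python
import http.client
import socket
import ssl


def tcp_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Layer-4 check: does the port accept a connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def https_status(host: str, path: str = "/", port: int = 443, timeout: float = 5.0):
    """Layer-7 check: does the web tier return an HTTP response?
    Returns the status code, or None if the application never answers."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # lab sketch only; verify certificates in production
    ctx.verify_mode = ssl.CERT_NONE
    try:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status
    except (OSError, http.client.HTTPException):
        return None


def interpret(l4_ok: bool, l7_status):
    """Classify the node the way an application-aware monitor should."""
    if not l4_ok:
        return "down"          # hard failure: even a TCP-only monitor catches this
    if l7_status is None or l7_status >= 500:
        return "half-dead"     # port open, application failing: TCP-only monitors miss this
    return "healthy"
```

Usage would look like `interpret(tcp_open("cs01.corp.local"), https_status("cs01.corp.local"))`, where `cs01.corp.local` stands in for a real Connection Server.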

Permanent Fix Strategy

1. Redesign the Health Monitor

Replace shallow health checks with application-relevant monitoring. A basic TCP connect test is often insufficient for Horizon Connection Servers. The monitor should reflect real broker readiness, not just whether a port accepts a connection.
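When tuning the monitor, it also helps to reason about worst-case detection time. The simplified model below assumes the usual knobs of a probe interval, a consecutive-failure threshold, and a per-probe timeout (names here are illustrative, not exact NSX ALB API field names): a node can fail immediately after a passing probe, so detection can take up to one full interval per required failed check plus the final probe's timeout.

```python
def worst_case_detection_seconds(send_interval: float,
                                 failed_checks: int,
                                 probe_timeout: float) -> float:
    """Upper bound on how long an unhealthy member can keep receiving
    new sessions before the monitor marks it down.

    Assumes the member fails right after a successful probe, so the
    monitor needs `failed_checks` consecutive failures spaced
    `send_interval` apart, and the last probe still waits out its timeout.
    """
    return send_interval * failed_checks + probe_timeout


# Example: probe every 10 s, mark down after 2 failures, 4 s probe timeout.
# Worst case, users can be brokered to a dead server for ~24 s.
```

Tightening the interval and failure threshold shortens this window at the cost of more probe load; most load balancers also apply a mirror-image success threshold before re-admitting a member.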

2. Review Persistence Settings

Horizon requires persistence, but persistence must not keep users stuck to a bad node longer than necessary. Confirm that affinity settings match Horizon guidance and that failover behavior is acceptable when a node becomes unhealthy.
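Conceptually, the desired "fail-away" behavior looks like the sketch below (pure illustration; NSX ALB implements this internally, and the member names are hypothetical). The pin is honored only while the pinned member is healthy; otherwise the client is deterministically re-pinned to a surviving member.

```python
import zlib


def pick_server(client_id: str, pool_health: dict, sticky: dict) -> str:
    """Source-IP-style persistence with fail-away:
    honor an existing pin only while that pool member is healthy."""
    pinned = sticky.get(client_id)
    if pinned is not None and pool_health.get(pinned, False):
        return pinned

    # Pinned member is missing or unhealthy: re-pin to a healthy member,
    # chosen deterministically so repeated requests stay consistent.
    healthy = sorted(member for member, ok in pool_health.items() if ok)
    if not healthy:
        raise RuntimeError("no healthy Connection Servers in pool")
    chosen = healthy[zlib.crc32(client_id.encode()) % len(healthy)]
    sticky[client_id] = chosen
    return chosen
```

Note the deliberate design choice: once re-pinned, the user stays on the new member even after the old one recovers, which avoids flapping a session between brokers.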

3. Review the Traffic Path Design

If the load balancer is positioned inline between UAG and Connection Servers, review whether that design is making failure detection less reliable. Simplifying the path can improve observability and fault isolation.

4. Fix the Failed Horizon Server Itself

The root cause is not solved until the affected Connection Server is investigated directly. Review:

  • Horizon service state and restart history
  • Windows OS stability
  • Resource usage trends
  • Certificate validity and trust
  • Network communication to required Horizon dependencies
  • Patch level and known bugs

5. Add Monitoring for “Half-Dead” States

The dangerous condition is not fully down or fully healthy. It is the half-dead state where the server still answers at L4 or HTTPS but no longer functions correctly as a Horizon broker. That exact state must be monitored explicitly.
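One simple way to monitor that state explicitly is to track application-level failures separately from L4 reachability and alert only once the combination persists, so a single slow response does not eject a healthy broker. A minimal sketch (the failure threshold is an assumption to be tuned per environment):

```python
class HalfDeadDetector:
    """Flag the 'port open but application failing' state only after it
    persists for `threshold` consecutive observations."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.app_failures = 0

    def observe(self, l4_ok: bool, app_ok: bool) -> str:
        if not l4_ok:
            self.app_failures = 0
            return "down"          # hard-down: any monitor catches this
        if app_ok:
            self.app_failures = 0
            return "healthy"
        self.app_failures += 1     # alive at L4, failing at the application layer
        return "half-dead" if self.app_failures >= self.threshold else "suspect"
```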

Recommended Monitoring Items

| Monitoring Item | Why It Matters | Desired Action |
| --- | --- | --- |
| Pool member health detail in NSX ALB | Shows whether the member is considered healthy for the wrong reason | Inspect monitor type and response logic |
| Connection Server service state | Confirms whether Horizon components are running | Alert on service stop or repeated restart |
| Web console reachability | Indicates management-plane responsiveness | Detect admin UI failure separately from TCP open |
| CPU, memory, disk, and OS events | Identifies local server instability | Correlate with service failure timing |
| Persistence and session distribution | Shows whether clients remain pinned to bad nodes | Review stickiness during failover testing |

Final Assessment

This was not just a simple Horizon server outage. It was a combined availability design issue:

  1. One Horizon Connection Server became unhealthy
  2. NSX ALB did not remove it from traffic fast enough or accurately enough
  3. The workaround succeeded only because the bad node was manually excluded

In other words, the server failure was the trigger, but the real architectural weakness was incomplete failure detection and isolation.

Conclusion

In a properly designed Horizon environment, a failed Connection Server should become irrelevant to users because the load balancer should stop sending new sessions to it. If that does not happen, the issue is no longer just server failure. It becomes an HA design failure.

Excluding the bad server from the pool is the correct short-term recovery action. The permanent fix is to improve NSX ALB health monitoring, validate persistence, review path design, and repair the Horizon Connection Server that failed in the first place.

FAQ

Why did NSX ALB not bypass the failed Horizon Connection Server?

Usually because the health monitor did not reflect real Horizon broker readiness, or persistence kept traffic pinned to the bad node.

Why was removing the server from the pool effective?

Because it immediately stopped new traffic from reaching a node whose services or web console were already failing.

Is TCP health monitoring enough for Horizon Connection Servers?

Often no. A TCP port can be open while the actual Horizon service is degraded or unusable.

What is the long-term fix?

Use better health checks, validate persistence and traffic path design, and perform root-cause analysis on the failed Connection Server itself.
