Pure Storage Network Errors: CRC & Invalid TX Word Count — Root Cause, Symptoms, Resolution, and Monitoring

Pure Storage CRC Errors and Invalid TX Word Count: Causes, Symptoms, Fixes, and Monitoring

In enterprise storage environments, errors such as Increased Invalid CRC Count and Increased Invalid TX Word Count typically indicate a physical-layer network problem. These counters usually point to issues between the storage array and the switch, not to application or storage logic faults.

What Do These Errors Mean?

CRC Errors

CRC errors indicate that transmitted frames arrived with corrupted data and failed integrity validation. In practical terms, this means the receiving side detected that the frame contents were altered during transmission.
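
To make the integrity check concrete, here is a minimal sketch that uses Python's zlib.crc32 as a stand-in for the frame check sequence. It is illustrative only: real switch and array ports compute the CRC in hardware, and the helper names and payload below are hypothetical.

```python
import zlib

def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload, as a sender would."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def frame_is_valid(frame: bytes) -> bool:
    """Recompute the CRC on the receiving side and compare it to the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload) == int.from_bytes(trailer, "big")

def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Simulate signal corruption by inverting a single bit in the frame."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

frame = build_frame(b"example storage payload")
damaged = flip_bit(frame, 10)   # one bit altered in transit

print(frame_is_valid(frame))    # True
print(frame_is_valid(damaged))  # False
```

A single flipped bit is enough to fail validation, which is why a marginal optic or dirty connector shows up as a rising CRC counter long before the link goes down.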

Invalid TX Word Count

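Invalid TX Word Count generally indicates word-level or bit-level encoding errors on the link. In Fibre Channel terminology, "TX word" is short for transmission word, so the counter reflects invalid transmission words seen on the link and does not by itself prove that the local transmitter is at fault. It is often tied to encoding issues, signal degradation, port faults, or optical problems.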

Bottom line: when both counters increase together, the most likely problem is degraded signal integrity at the physical layer.

Root Causes of Increased Invalid CRC Count and Invalid TX Word Count

1. Optical Transceiver Problems

  • Defective or aging SFP/QSFP/GBIC module
  • Vendor compatibility mismatch
  • TX and RX optical power imbalance

2. Fiber Cable Issues

  • Damaged or bent fiber cable
  • Dirty fiber connector end faces
  • Incorrect fiber type or mismatch

3. Switch Port or Storage Port Fault

  • Faulty switch port
  • Faulty storage controller port
  • ASIC or SerDes-level instability

4. Speed or FEC Configuration Mismatch

  • Speed mismatch between the two ends of the link
  • FEC mismatch in 25G, 40G, or 100G environments
  • Flow control inconsistency

5. Environmental Factors

  • Overheating leading to signal degradation
  • Electromagnetic interference (EMI), which is rare but possible

Common Symptoms in Production

Network Symptoms

  • CRC counters continue increasing
  • Input errors or frame errors appear on interfaces
  • Packet discard counts increase

Storage Symptoms

  • Higher I/O latency
  • Intermittent timeout events
  • Path failover in multipath environments
  • Reduced throughput

Pure Storage Side Effects

  • Port error counters rise
  • Degraded path warnings may appear
  • Host-side instability can occur intermittently

How to Fix Increased Invalid CRC Count and Invalid TX Word Count

In this case, the transceiver and cable had already been replaced. That was the correct first action.

Step 1: Change the Switch Port or Storage Port

If optics and cable were replaced but errors continue, move the link to a different switch port and, if possible, a different storage port. This is the fastest way to isolate a port-level hardware issue.

Step 2: Check Optical Power Levels

Run the following command on the switch (Cisco NX-OS style syntax shown; other network operating systems use a similar command):

show interface transceiver details

Review the following values:

  • TX Power
  • RX Power
  • Bias current

Low RX power or abnormal TX/RX imbalance strongly suggests an optical path issue.
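
The check above can be sketched in a few lines of Python. This is an assumption-laden illustration, not vendor guidance: the threshold constants are hypothetical placeholders, and in practice you would use the warning and alarm thresholds that the transceiver itself reports in the command output.

```python
# Hypothetical limits for illustration only; take real values from the
# warning/alarm thresholds reported by your transceiver.
RX_LOW_WARN_DBM = -9.5
MAX_LINK_LOSS_DB = 3.0

def assess_optics(local_rx_dbm: float, remote_tx_dbm: float) -> list[str]:
    """Return a list of findings for one direction of an optical link.

    local_rx_dbm  : RX power measured on this end
    remote_tx_dbm : TX power measured on the far end of the same fiber
    """
    findings = []
    if local_rx_dbm < RX_LOW_WARN_DBM:
        findings.append(f"RX power {local_rx_dbm:.1f} dBm is below the warning level")
    # Loss along the path is what the far end sent minus what arrived here.
    loss_db = remote_tx_dbm - local_rx_dbm
    if loss_db > MAX_LINK_LOSS_DB:
        findings.append(f"Path loss {loss_db:.1f} dB exceeds the expected budget")
    return findings

for finding in assess_optics(local_rx_dbm=-11.2, remote_tx_dbm=-2.1):
    print(finding)
```

Comparing the far-end TX value with the near-end RX value matters because a healthy transmitter paired with a weak receive level points at the fiber or connectors rather than the optic.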

Step 3: Clear Counters and Monitor Again

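Clear the error counters on the affected switch interface (replace x/x with the port; exact syntax varies by network OS):

clear counters interface x/x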

After clearing the counters, monitor the interface for at least 5 to 10 minutes. If the counters rise again, the issue is persistent rather than historical.
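
As a rough sketch of that observation window, the loop below polls a counter after the clear and reports whether it rose. The read_counter callable is a hypothetical hook: it stands for whatever you use to fetch the value (SNMP, an API call, or parsed CLI output), and the default durations simply mirror the 10-minute upper bound mentioned above.

```python
import time
from typing import Callable

def watch_counter(
    read_counter: Callable[[], int],
    duration_s: float = 600.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll an error counter after it was cleared.

    Returns "persistent" as soon as the counter rises above its first
    reading, or "historical" if it stays flat for the whole window.
    """
    baseline = read_counter()
    elapsed = 0.0
    while elapsed < duration_s:
        sleep(interval_s)
        elapsed += interval_s
        if read_counter() > baseline:
            return "persistent"
    return "historical"

# Simulated usage: canned readings and a no-op sleep so the example runs instantly.
readings = iter([0, 0, 0, 4])
result = watch_counter(lambda: next(readings), duration_s=90, interval_s=30,
                       sleep=lambda _seconds: None)
print(result)  # persistent
```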

Step 4: Verify Speed and FEC Settings

In high-speed links, especially 25G and above, FEC mismatch can directly cause CRC-related errors. Both ends must use compatible speed and FEC settings.
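
One simple way to catch this is to record the settings from both ends and compare them side by side. The sketch below assumes you have already collected the values by hand or by script; the dictionary keys and the example values are hypothetical, and strict equality is a simplification, since "compatible" settings are not always spelled identically on both platforms.

```python
def find_mismatches(switch_side: dict, array_side: dict,
                    keys=("speed", "fec", "flow_control")) -> dict:
    """Return {setting: (switch_value, array_value)} for every setting that differs."""
    return {
        key: (switch_side.get(key), array_side.get(key))
        for key in keys
        if switch_side.get(key) != array_side.get(key)
    }

# Hypothetical values for illustration.
switch_port = {"speed": "25G", "fec": "rs-fec", "flow_control": "off"}
array_port = {"speed": "25G", "fec": "fc-fec", "flow_control": "off"}

print(find_mismatches(switch_port, array_port))
# {'fec': ('rs-fec', 'fc-fec')}
```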

Step 5: Validate Vendor Compatibility

Mixed or unsupported optics frequently cause silent link degradation. Always confirm that both the switch and the storage platform support the installed transceivers.

Step 6: Review Firmware and OS Versions

  • Switch firmware or NX-OS/EOS/other network OS version
  • Pure Storage Purity OS version

Check for known bugs affecting ports, transceivers, or error counters.

How to Monitor CRC and TX Word Errors

1. Real-Time Interface Monitoring

show interface counters errors

Track these counters continuously:

  • CRC errors
  • Input errors
  • Symbol errors

2. Optical Monitoring

Do not rely only on packet counters. Optical metrics such as RX power and TX power are essential for early detection.

3. Storage Monitoring

  • Port error counters
  • Path health status
  • Latency spikes

4. Monitoring Tools

Platforms such as Zabbix, Prometheus, or enterprise observability stacks can collect and alert on these metrics.

Metric                   Normal State                   Alert Condition
CRC Error Count          0                              Any increase
Invalid TX Word Count    0                              Any increase
RX Optical Power         Stable within expected range   Abnormal drop or fluctuation
Latency                  Stable baseline                Unexpected increase with link errors

Operational rule: CRC should remain zero in a healthy storage network. Any new increase deserves investigation.
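
The alert conditions above can be expressed as a small rule check over two consecutive samples. This is a sketch under stated assumptions: the field names, the RX floor, and the latency factor are hypothetical, and a real deployment would express the same logic in the alerting language of whichever monitoring platform is in use.

```python
def evaluate_alerts(previous: dict, current: dict,
                    rx_min_dbm: float, latency_factor: float = 1.5) -> list[str]:
    """Compare two samples of link metrics and return the alerts that fire."""
    alerts = []
    crc_up = current["crc_errors"] > previous["crc_errors"]
    tx_word_up = current["invalid_tx_words"] > previous["invalid_tx_words"]

    if crc_up:
        alerts.append("CRC error count increased")
    if tx_word_up:
        alerts.append("Invalid TX word count increased")
    if current["rx_power_dbm"] < rx_min_dbm:
        alerts.append("RX optical power below expected range")
    # Latency only alerts when it rises together with link errors.
    latency_up = current["latency_ms"] > previous["latency_ms"] * latency_factor
    if latency_up and (crc_up or tx_word_up):
        alerts.append("Latency increased alongside link errors")
    return alerts

# Hypothetical samples for illustration.
before = {"crc_errors": 0, "invalid_tx_words": 0, "rx_power_dbm": -3.1, "latency_ms": 0.5}
after = {"crc_errors": 12, "invalid_tx_words": 0, "rx_power_dbm": -3.2, "latency_ms": 1.4}

for alert in evaluate_alerts(before, after, rx_min_dbm=-9.5):
    print(alert)
```

Tying the latency alert to a simultaneous rise in error counters keeps ordinary workload spikes from being mistaken for link degradation.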

Final Assessment

Since the transceiver and cable were already replaced, the remaining likely causes are:

  1. Faulty switch port
  2. Faulty storage port
  3. Optical power imbalance
  4. FEC mismatch
  5. Unsupported or partially compatible optics

In most cases, this is not a Pure Storage software issue. It is a network physical-layer issue.

Conclusion

If you see Increased Invalid CRC Count and Increased Invalid TX Word Count in a Pure Storage environment, start with the physical link. Replace optics and cable first, then isolate ports, validate optical levels, confirm configuration alignment, and monitor error recurrence.

The fastest way to reduce MTTR is to treat these counters as early indicators of link degradation rather than as secondary noise. In enterprise environments, that distinction matters.

FAQ

What does Increased Invalid CRC Count mean in Pure Storage?

It usually means frame corruption caused by physical-layer issues such as faulty optics, damaged fiber, dirty connectors, or port faults.

What causes Increased Invalid TX Word Count?

It commonly points to transmission-side encoding or signal integrity problems involving optics, ports, or link configuration mismatch.

Is this a storage software issue?

Usually no. In most real-world cases, it is a network link quality issue.

What should I monitor first?

Monitor CRC counters, TX word error counters, RX/TX optical power, latency trends, and path health together.
