HPE Aruba Networking Blogs

Get rich data center telemetry with DPU-powered switches

By Scott Stevens, Field CTO, AMD Pensando

Network telemetry is a source of truth for network engineers and security operations teams. Telemetry takes a variety of forms, including SNMP, device memory and CPU utilization, port status, firewall syslogs, and flow records. Flow records are particularly valuable because they track the source and destination of communications, identify applications, and monitor bandwidth consumption by devices, protocols, and applications.

However, telemetry can be hard to collect, especially in the data center. The typical data center approach is to attach hardware probes to network devices, or to install software on the servers. While these probes and agents can gather flow records, they tend to be expensive and complicated to deploy and only provide visibility where deployed, which typically shows just a fraction of the overall data center traffic. To get full fidelity, you’d almost have to build a second network, which is cost-prohibitive. What’s more, devices or software agents also need to be monitored and maintained, which adds to the to-do lists of busy network engineers.

Given these constraints, many companies rely on the sampled telemetry they can gather from the data center switches. This approach means typical solutions can only provide insights based on a small sample of total network traffic; in some cases, as little as 1 in every 8,000 flows, or 0.0125% of all traffic.
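To put that ratio in perspective, the arithmetic can be sketched in a few lines of Python. The 1-in-8,000 ratio comes from the text above; the total flow count is a made-up figure for illustration, not a measurement from any real network.

```python
# Illustrative arithmetic only; the total flow count below is an assumed figure.
SAMPLING_RATIO = 1 / 8000  # "1 in every 8,000" from the text


def sampled_percent(ratio: float) -> float:
    """Return the share of traffic that is observed, as a percentage."""
    return ratio * 100


def expected_visible(total_flows: int, ratio: float) -> float:
    """Return the expected number of flows that appear in the sampled data."""
    return total_flows * ratio


if __name__ == "__main__":
    total = 2_000_000  # hypothetical number of flows in some time window
    print(f"Observed share: {sampled_percent(SAMPLING_RATIO):.4f}%")
    print(f"Flows seen: {expected_visible(total, SAMPLING_RATIO):.0f} of {total:,}")
```

Under these assumptions, the overwhelming majority of flows never appear in the data that analysis tools receive.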

I believe that this limited sampling is not acceptable. It restricts visibility and doesn’t provide a full picture of the data center. It also hampers the effectiveness of AIOps tools by only providing partial awareness of what is happening in the network. Using only sampled flows creates a “garbage in, garbage out” scenario that drastically restricts the insights that modern AI/ML tools can provide.

The value of rich telemetry

Rich telemetry indicates the state of the network as well as the health of individual devices in it. It provides insights into performance and is essential for troubleshooting. With access to the right telemetry, network engineers can reduce mean time to resolution (MTTR), or mean time to innocence (MTTI) when the network isn’t at fault.

Telemetry is also valuable for security operations. By tracking the east-west movement of traffic through a network fabric, security teams may be able to identify anomalies or patterns that indicate suspicious behavior, be it an intruder mapping out resources or an insider trying to access sensitive systems.
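As a rough illustration of how flow records can surface an intruder mapping out resources, the sketch below counts how many distinct destinations each source contacts and flags unusually high fan-out. The record fields (`src`, `dst`, `dst_port`) and the threshold are hypothetical choices for this example; real tools use far richer baselines.

```python
from collections import defaultdict


def find_scanners(flows, max_destinations=3):
    """Return sources that contacted more distinct (ip, port) pairs than the threshold.

    `flows` is an iterable of dicts with hypothetical keys: src, dst, dst_port.
    """
    seen = defaultdict(set)
    for flow in flows:
        seen[flow["src"]].add((flow["dst"], flow["dst_port"]))
    return {src: len(dsts) for src, dsts in seen.items() if len(dsts) > max_destinations}


if __name__ == "__main__":
    # Made-up east-west flows: one host probes five neighbors on port 22.
    sample = [{"src": "10.0.0.5", "dst": f"10.0.1.{i}", "dst_port": 22} for i in range(1, 6)]
    sample.append({"src": "10.0.0.9", "dst": "10.0.1.1", "dst_port": 443})
    print(find_scanners(sample))
```

A check like this only works when every flow is recorded; with heavy sampling, most of the probe connections would likely never show up.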

Lastly, telemetry is vital for network automation, including AIOps. AI and ML tools are fueled by telemetry; it is the raw data they analyze to generate context-based insights or take automated actions. Without telemetry, there would be no modern AIOps. Today, feeding non-sampled flows into AI/ML tools creates the conditions for the advanced automation that has been needed for decades in the data center.

DPUs put eyeballs in your switches

So how do you get better telemetry from your data center? A new option is to marry the computing power of data processing units (DPUs) with data center switches. The DPU is an evolution of the SmartNIC; it is a programmable processor designed to offload and accelerate networking, security, and other data center infrastructure services. DPUs can be deployed in servers and switches. By adding DPUs to Top of Rack (ToR) switches, network engineers can collect and export telemetry such as flows and logs via a compute platform that sits directly in the path of data center traffic to and from the servers hosted in the data center.

HPE Aruba Networking and AMD have partnered to develop the industry’s first DPU-enabled switch, the HPE Aruba Networking CX 10000 with AMD Pensando™ switch. The CX 10000 is a 1 RU device that offers 3.6 Tbps of standard line-rate stateless switching and supports 1, 10, and 25 GbE port options to servers with 40/100 GbE uplinks.

According to HPE, this CX 10000 Distributed Services Switch further delivers stateful services at 800 Gbps of throughput in each server rack. With its integrated programmable DPU, it can offer highly scalable east-west network firewall security, full non-sampled telemetry, IPsec encrypt/decrypt, and network address translation services. The form factor of the CX 10000 is designed to distribute these services to the edge of the data center fabric, directly connected to each server; by doing so, service resources automatically scale along with data center workloads. This is the same architecture leveraged by many of the world's largest hyperscalers.

The CX 10000 can export firewall logs as well as industry-standard non-sampled IPFIX flow records. Network engineers can set the interval at which flow records are exported based on their requirements, from as granular as every second to longer periods such as one or five minutes.
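For readers who have not worked with IPFIX on the wire, here is a minimal sketch of what a collector does first with each message: read the fixed 16-byte message header defined in RFC 7011. It does not decode templates or data records, and it assumes nothing about the specific fields the CX 10000 exports; the function name and returned keys are my own.

```python
import struct
from datetime import datetime, timezone

IPFIX_VERSION = 10  # version number carried by every IPFIX message (RFC 7011)
# Network byte order: version, length, export time, sequence number, observation domain ID
HEADER = struct.Struct("!HHIII")


def parse_ipfix_header(data: bytes) -> dict:
    """Parse the 16-byte IPFIX message header from the start of `data`."""
    if len(data) < HEADER.size:
        raise ValueError("message shorter than the 16-byte IPFIX header")
    version, length, export_time, sequence, domain_id = HEADER.unpack_from(data)
    if version != IPFIX_VERSION:
        raise ValueError(f"not an IPFIX message (version {version})")
    return {
        "length": length,  # total message length in bytes, header included
        "export_time": datetime.fromtimestamp(export_time, tz=timezone.utc),
        "sequence": sequence,
        "observation_domain_id": domain_id,
    }


if __name__ == "__main__":
    # Hand-built header with made-up values, standing in for a received datagram.
    message = HEADER.pack(IPFIX_VERSION, 16, 1_700_000_000, 42, 7)
    print(parse_ipfix_header(message))
```

In practice, most teams would hand this job to an existing flow collector rather than parse messages by hand.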

In the flow

For years, organizations have been bolting on telemetry solutions to the network. By embedding DPUs into the switch, telemetry capabilities are now woven into the network fabric itself. And because these capabilities are offloaded to DPUs, there is no impact on switch performance.

By monitoring flow records and logs, network engineers can quickly spot congestion, retransmissions, packet drops, and bandwidth-hogging applications. This can speed up troubleshooting, and even allow network engineers to head off issues before they impact application performance or service levels. Uniquely, since there is now telemetry for all flows in the network, network visibility is now mapped directly to each application instead of the legacy model of examining trunk usage.
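Spotting a bandwidth-hogging application from flow records comes down to a simple aggregation, sketched below. The record layout (`app` and `bytes` keys) and the sample values are hypothetical; a real flow analyzer would derive the application from ports, addresses, or its own classification.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per application and return the n largest as (app, bytes) pairs.

    `flows` is an iterable of dicts with hypothetical keys: app, bytes.
    """
    totals = Counter()
    for flow in flows:
        totals[flow["app"]] += flow["bytes"]
    return totals.most_common(n)


if __name__ == "__main__":
    # Made-up flow records for illustration.
    sample = [
        {"app": "backup", "bytes": 5000},
        {"app": "web", "bytes": 1500},
        {"app": "backup", "bytes": 4000},
        {"app": "dns", "bytes": 200},
    ]
    for app, total in top_talkers(sample, n=2):
        print(f"{app}: {total} bytes")
```

Because every flow is counted, the totals reflect per-application usage rather than an estimate extrapolated from a sample.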

Of course, it’s one thing to collect telemetry—it also needs to be analyzed. This analysis is best handled by dedicated systems such as flow analyzers, log collectors, and SIEMs. HPE Aruba Networking has developed a set of APIs to provide flow records and logs to a variety of third-party tools that are widely used in network operations centers (NOCs) and security operations centers (SOCs). These integrations include solutions from Splunk, Elastic, Guardicore, and Augtera Networks.

And as more AI- and ML-driven systems come to market, the DPU-powered CX 10000 switch will be ready to fuel these tools with the high-fidelity telemetry required for these systems to provide accurate, context-based insights or take automated actions.

I can see clearly now

Network engineers have lacked the ability to gather comprehensive telemetry in data center networks because of complex, cost-prohibitive collection architectures. That changes with the CX 10000, which now makes rich telemetry available for collection and analysis. HPE Aruba Networking and AMD have developed a unique approach that inserts telemetry collection directly into the data center fabric.

For more information: