When we think about big data, we associate it with the term big, but when it comes to building infrastructure, we should also be thinking distributed.
Big data applications do, in fact, deal with large volumes of information that are made even bigger as data is replicated across racks for resiliency. Yet the most meaningful attribute of big data is not its size, but its ability to break larger jobs into lots of smaller ones, distributing resources to work in parallel on a single task.
When you combine the big with a distributed architecture, you find you need a special set of requirements for big data networks. Here are six to consider:
1. Network resiliency and big data applications
When you have a set of distributed resources that must coordinate through an interconnection, availability is crucial. If the network is unavailable, the result is a disconnected collection of stranded compute resources and data sets.
Appropriately, the primary focus for most network architects and engineers is uptime. But the sources of downtime in networks are varied, ranging from device failures (both hardware and software) to maintenance windows to human error. Downtime is unavoidable, and while it is important to build a highly available network, designing for perfect availability is impossible.
Rather than making downtime avoidance the objective, network architects should design networks that are resilient to failures. Resilience in networks is determined by path diversity (having more than one way to get between resources) and failover (being able to identify issues quickly and fail over to other paths). The real design criteria for big data networks ought to explicitly include these characteristics alongside more traditional mean time between failures (MTBF) metrics.
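To see why path diversity is such a powerful design lever, consider a minimal sketch of how multiple independent paths compound availability. The 99.9 percent per-path figure is illustrative, not from the article, and the independence assumption is a simplification (real failures are often correlated):

```python
# Sketch: how path diversity compounds availability.
# Assumes each path is up with probability p and that paths fail
# independently -- a simplification, since real failures correlate.

def effective_availability(p: float, paths: int) -> float:
    """Probability that at least one of `paths` independent paths is up."""
    return 1 - (1 - p) ** paths

# One 99.9% path vs. the same availability spread across diverse paths:
for k in (1, 2, 4):
    print(f"{k} path(s): {effective_availability(0.999, k):.9f}")
```

Even two diverse paths push effective availability from three nines toward six, which is why diversity plus fast failover beats chasing a perfect single path.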
2. Solving network congestion for big data applications
Big data applications aren’t just big; they also tend to be what I call bursty. When a job is initiated, data begins to flow. During these periods of high traffic, congestion is a primary concern. However, congestion can lead to more than queuing delays and dropped packets. It can also trigger retransmissions, which can cripple already heavily loaded networks. Accordingly, networks need to be architected to mitigate congestion wherever possible. As with the design criteria for availability, mitigating congestion requires networks with high path diversity, which allows the network to fan traffic out across a large number of paths between resources.
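The fan-out mechanism most networks use for this is equal-cost multipath (ECMP) style flow hashing. The sketch below is illustrative, not any vendor's implementation: the path count and 5-tuple fields are assumptions, and real switches use hardware hash functions rather than MD5. The key properties it demonstrates are that load spreads across paths and that a given flow always lands on the same path, keeping its packets in order:

```python
# Sketch: ECMP-style flow hashing fans traffic across diverse paths.
# Illustrative only -- real switches hash in hardware, not with MD5.
import hashlib
from collections import Counter

def pick_path(flow: tuple, num_paths: int) -> int:
    """Deterministically hash a flow 5-tuple to one of num_paths paths."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# 1,000 flows between the same pair of racks, spread over 8 paths
# (src_ip, dst_ip, protocol, src_port, dst_port):
flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 50010) for i in range(1000)]
load = Counter(pick_path(f, 8) for f in flows)
```

Because the hash is deterministic, every packet of a flow follows one path, while the population of flows spreads roughly evenly across all eight.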
3. Network consistency more of a focus than latency in big data
Most big data applications actually are not particularly sensitive to network latency. When compute time is measured in seconds or minutes, even a meaningful latency contribution from the network, on the order of a few microseconds, is relatively meaningless. However, big data applications do tend to be highly synchronous. That means jobs are being executed in parallel, and large deviations in performance across jobs can trigger failures in the application. Therefore, it is important that networks provide not just efficient but also consistent performance across both space and time.
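Why consistency beats raw latency in a synchronous job can be sketched with a straggler example. All numbers here are illustrative assumptions, not measurements: a stage of 100 parallel tasks finishes only when its slowest task does, so a workload with a lower typical task time but a long tail still finishes later than a slightly slower but consistent one:

```python
# Sketch: in a synchronous stage, completion is gated by the slowest
# task, so variance hurts more than mean latency. Numbers illustrative.
import random

random.seed(42)

def job_time(task_times):
    """A parallel stage finishes only when its slowest task does."""
    return max(task_times)

tasks = 100
# Consistent: ~10s per task with low jitter.
consistent = [10.0 + random.uniform(-0.5, 0.5) for _ in range(tasks)]
# Inconsistent: lower typical time (~9s base) but a long exponential tail.
inconsistent = [9.0 + random.expovariate(1 / 2.0) for _ in range(tasks)]

print(f"consistent stage:   {job_time(consistent):.1f}s")
print(f"inconsistent stage: {job_time(inconsistent):.1f}s")
```

The inconsistent stage completes later despite faster typical tasks, which is exactly why jitter and outliers across the fabric matter more than shaving average latency.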
4. Prepare now for big data scalability later
It might be surprising to learn that most big data clusters are actually not that big. While many people know that Yahoo is running more than 42,000 nodes in its big data environment, in 2013 the average number of nodes in a big data cluster was just over 100, according to Hadoop Wizard. Put differently, even if you dual-homed each server, you could support the entire cluster with only four access switches (assuming 72 10 Gigabit Ethernet access ports per Broadcom-based switch).
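The back-of-the-envelope math above can be made explicit. The 72-port and ~100-node figures come from the text; the rounding-up to a redundant pair of switches is an assumed design convention (dual-homed access layers are commonly deployed in switch pairs), not something the article spells out:

```python
# Sketch of the access-switch port math: ~100 dual-homed nodes,
# 72 10 GbE ports per switch. Rounding to an even switch count is an
# assumed convention (redundant switch pairs), not from the article.
import math

def access_switches(nodes: int, ports_per_switch: int, homes: int = 2) -> int:
    """Switches needed to terminate every node's uplinks."""
    needed = math.ceil(nodes * homes / ports_per_switch)
    # Deploy dual-homed access switches in redundant pairs.
    return needed if needed % 2 == 0 or homes == 1 else needed + 1

print(access_switches(100, 72))  # ~100 dual-homed nodes -> 4 switches
```

Two hundred server ports fit in three 72-port switches on raw count alone; rounding up to a fourth gives the paired, dual-homed design the article's figure implies.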
The challenge with scalability is less about how large the clusters are now, and more about how to gracefully scale for future deployments. If the infrastructure is architected for small deployments, how will that architecture evolve as the number of nodes increases? Will it require a complete re-architecture at some point? Are there physical limitations? Does the architecture require some degree of proximity and data locality? The key point to remember is that scalability is less about the absolute scale and more about the path to a sufficiently scaled solution.
5. Network partitioning to handle big data
Network partitioning is crucial in setting up big data environments. In its simplest form, partitioning can mean separating big data traffic from the rest of the network's traffic so that bursty demands from applications do not impact other mission-critical workloads. Beyond that, there is a need to handle multiple tenants running multiple jobs for performance, compliance and/or auditing reasons. Doing this requires networks to keep workloads logically separate in some cases and physically separate in others. Architects need to plan for both, though initial requirements might favor just one.
6. Application awareness for big data networks
While the term big data is typically associated with Hadoop deployments, it’s becoming shorthand for clustered environments. Depending on the application, the requirements in these clustered environments will vary. Some might be particularly bandwidth-heavy while others might be latency-sensitive. Ultimately, a network that supports multiple applications and multiple tenants must be able to distinguish among their workloads and treat each appropriately.
The key to architecting networks for big data is understanding that the requirements go well beyond just providing sufficient east-west bandwidth. Ultimately, the application experience depends on a host of other factors, ranging from congestion to partitioning. Building a network that can deliver against all these demands requires forethought, in terms of not just how large the infrastructure must scale, but also how different types of applications will coexist in a common environment.
About the author: Michael Bushong is currently the vice president of marketing at Plexxi, where he focuses on using silicon photonics to deliver SDN-based data center options.