When we think about big data, we naturally focus on the big, but when it comes to building infrastructure, we should also be thinking distributed.
Big data applications do, in fact, deal with large volumes of information that are made even bigger as data is replicated across racks for resiliency. Yet the most meaningful attribute of big data is not its size, but the way applications break a large job into many smaller ones, distributing resources to work in parallel on a single task.
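To make that distribution concrete, here is a minimal sketch in plain Python (no particular big data framework is assumed; the chunking strategy and worker count are illustrative) of one large job, a word count over a corpus, split into smaller tasks that run in parallel and are then merged:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(words):
    """One small task: count a slice of the corpus."""
    return Counter(words)

def parallel_word_count(text, workers=4):
    """Break one large job into many smaller tasks, run them in
    parallel, then merge the partial results."""
    words = text.split()
    chunk_size = max(1, len(words) // workers)
    chunks = [words[i:i + chunk_size]
              for i in range(0, len(words), chunk_size)]
    totals = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, chunks):
            totals.update(partial)
    return totals

if __name__ == "__main__":
    corpus = "big data is distributed data " * 1000
    print(parallel_word_count(corpus).most_common(3))
```

The same divide, compute and merge pattern is what distributed frameworks apply across racks of servers; the network is what carries those intermediate results between workers, which is exactly why its design matters so much.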
When you combine the big with a distributed architecture, you find you need a special set of requirements for big data networks. Here are six to consider:
1. Network resiliency and big data applications
When you have a set of distributed resources that must coordinate through an interconnection, availability is crucial. If the network is unavailable, the result is a disjointed collection of stranded compute resources and data sets.
Appropriately, the primary focus for most network architects and engineers is uptime. But the sources of downtime in networks are varied: they include everything from device failures (both hardware and software) to maintenance windows to human error. Downtime is unavoidable, and while it is important to build a highly available network, designing for perfect availability is impossible.
Rather than making downtime avoidance the objective, network architects should design networks that are resilient to failures. Resilience in networks is determined by path diversity (having more than one way to get between resources) and failover (being able to identify issues quickly and fail over to other paths). The real design criteria for big data networks ought to explicitly include these characteristics alongside more traditional mean time between failures, or MTBF, methods.
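As a back-of-the-envelope illustration of why path diversity and fast failover matter more than chasing a perfect single link, consider steady-state availability, MTBF / (MTBF + MTTR), for one path versus several independent paths. The MTBF and repair-time figures below are made-up examples, and treating path failures as independent is an idealization:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single path: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def diverse_path_availability(per_path: float, paths: int) -> float:
    """With independent failures, the fabric is unreachable only when
    every path between two resources is down at the same time."""
    return 1.0 - (1.0 - per_path) ** paths

# Illustrative numbers only: 10,000-hour MTBF, 4-hour repair time per path.
single = availability(10_000, 4)
print(f"one path:    {single:.6f}")                               # ~0.999600
print(f"two paths:   {diverse_path_availability(single, 2):.8f}")  # ~0.99999984
print(f"three paths: {diverse_path_availability(single, 3):.10f}")
```

The point of the sketch is not the specific numbers but the shape of the curve: adding a second or third diverse path, combined with failover that actually uses it, improves effective availability far more than incremental gains in the MTBF of any single device or link.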