Cluster Computing and Why It Is Used in Big Data

Introduction
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
We will talk about big data on a fundamental level and also take a high-level look at some of the processes and technologies currently being used in the field.
What Is Big Data?
An exact definition of "big data" is difficult to nail down because different people use it quite differently. Generally speaking, big data is:
  • large datasets
  • the category of computing strategies and technologies that are used to handle large datasets

In this context, "large dataset" means a dataset too large to reasonably process or store with traditional tooling or on a single computer. This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
Clustered Computing
Because of these qualities, individual computers are often inadequate for handling big data. Computer clusters are a better fit for its high storage and computational demands.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
  • Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling is also extremely important. Processing large datasets requires large amounts of all three of these resources.
  • High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as we continue to emphasize the importance of real-time analytics.
  • Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
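The benefits above can be sketched on a single machine. The following is a minimal, hypothetical illustration (not a real cluster framework) that uses Python's `multiprocessing` pool to stand in for cluster workers: the dataset is partitioned into one chunk per "node", each node processes its share, and scaling horizontally just means adding another worker.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for per-node work: each "node" sums its share of the data.
    return sum(chunk)

def cluster_sum(data, n_workers):
    # Partition the dataset into one chunk per worker, mimicking how a
    # cluster framework spreads work across machines.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_sums = pool.map(process_chunk, chunks)
    # Adding a worker ("scaling horizontally") just adds a chunk; the
    # per-chunk processing logic itself does not change.
    return sum(partial_sums)

if __name__ == "__main__":
    print(cluster_sum(list(range(1_000_000)), n_workers=4))
```

Note how the result is independent of `n_workers`; that property is what lets a cluster grow or shrink without rewriting the job.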


Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes. Cluster membership and resource allocation can be handled by software like Hadoop's YARN (which stands for Yet Another Resource Negotiator) or Apache Mesos.
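To make the scheduling idea concrete, here is a toy resource negotiator, loosely inspired by what YARN or Mesos do at far greater scale. This is not their actual API; the node names, memory figures, and placement policy (greedy best-fit by free memory) are all invented for illustration.

```python
def schedule(tasks, nodes):
    """Assign each task to the node with the most free memory.

    tasks: list of (task_name, memory_needed) tuples.
    nodes: dict mapping node name -> free memory (mutated as tasks land).
    Returns a dict mapping task_name -> node name.
    """
    placement = {}
    for name, mem in tasks:
        # Greedy policy: pick the node with the most free memory.
        best = max(nodes, key=nodes.get)
        if nodes[best] < mem:
            raise RuntimeError(f"no node can fit task {name!r}")
        nodes[best] -= mem   # reserve the resources on that node
        placement[name] = best
    return placement

tasks = [("map-1", 4), ("map-2", 4), ("reduce-1", 8)]
nodes = {"node-a": 16, "node-b": 8}
print(schedule(tasks, nodes))
```

Real negotiators also track CPU, locality, queues, and failures, but the core loop is the same: match resource requests against what each node currently has free.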
The assembled computing cluster often acts as a foundation which other software interfaces with to process the data. The machines involved in the computing cluster are also typically involved with the management of a distributed storage system.
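One common way a distributed storage layer decides which machine holds each record is to hash the record's key and map it onto the node list. The sketch below assumes a made-up three-node cluster; real systems such as HDFS add replication and much more placement logic.

```python
import hashlib

# Hypothetical cluster membership for illustration only.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key, nodes=NODES):
    # Hash the key and take the digest modulo the node count, so every
    # machine in the cluster can compute the same placement independently.
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", node_for(key))
```

Because the mapping is deterministic, any node can locate any record without consulting a central index, which is part of why the same machines can serve both storage and computation.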
In short, clustered computing is the practice of pooling the resources of multiple machines and managing their collective capabilities to complete tasks.


