Introduction to Apache Spark

What is Apache Spark?


Reasons behind the invention of Apache Spark:

• Exploding data volumes

• The need for faster data manipulation


Several shortcomings of Hadoop are:

• Strict adherence to its MapReduce programming model

• Limited programming language API options

• Not a good fit for iterative algorithms such as machine learning algorithms

• Pipelining of tasks is not easy



What is Spark?


Apache Spark is an open-source data processing framework for performing big data analytics on a distributed computing cluster.



Spark Features


Spark has several advantages compared to other big data and MapReduce technologies such as Hadoop and Storm. Spark is faster than MapReduce and offers lower latency due to reduced disk input and output operations. Spark is capable of in-memory computation, which makes data processing much faster than disk-based MapReduce.


Unlike Hadoop, Spark keeps intermediate results in memory rather than writing every intermediate output to disk. This hugely cuts down the execution time of a job, making it run as much as 100 times faster than a standard MapReduce job. Apache Spark can also hold data on disk: when data crosses the threshold of available memory, it is spilled to disk. In this way Spark acts as an extension of MapReduce. Spark does not execute tasks immediately; instead it maintains the chain of operations as job metadata called a DAG (directed acyclic graph). The transformations recorded in the DAG are executed only when an action operation is called on them. This process is called lazy evaluation, and it allows optimized execution of queries on big data.
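
As an illustration of lazy evaluation, the minimal sketch below (the "local[*]" master and the log-file path are placeholder assumptions) builds a chain of transformations that Spark only records in the DAG; nothing is computed until the count action runs.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyEvalDemo {
      def main(args: Array[String]): Unit = {
        // "local[*]" master and the log path are placeholder assumptions for this sketch.
        val sc = new SparkContext(new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

        val lines  = sc.textFile("/tmp/server.log")        // transformation: nothing is read yet
        val errors = lines.filter(_.contains("ERROR"))     // transformation: only recorded in the DAG
        val tokens = errors.map(_.split(" ")(0))           // transformation: still no execution

        // Only this action triggers execution of the whole DAG as one optimized job.
        println(tokens.count())

        sc.stop()
      }
    }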


Apache Spark has other features, such as:


• Supports a wide variety of operations beyond the Map and Reduce functions.

• Provides concise and consistent APIs in Scala, Java, and Python.

• Leverages distributed cluster memory for computations, increasing speed and data-processing throughput.

• Enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster even when running on disk.

• Well suited for real-time decision making with big data.

• Runs on top of an existing Hadoop cluster and accesses the Hadoop data store (HDFS); it can also process data stored in HBase. It can also run without Hadoop, either with Apache Mesos or on its own in standalone mode.

• Integrates with various data sources such as SQL and NoSQL databases, S3, HDFS, the local file system, etc.

• Good fit for iterative tasks like Machine Learning (ML) algorithms.

• In addition to Map and Reduce operations, it supports SQL-like queries, streaming data, machine learning, and graph data processing (see the SQL sketch after this list).
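
As an illustration of the SQL-like query support mentioned above, here is a minimal sketch using the SparkSession entry point available since Spark 2.x; the CSV path, the column names, and the view name are assumptions made for the example.

    import org.apache.spark.sql.SparkSession

    object SqlQueryDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SqlQueryDemo")
          .master("local[*]")
          .getOrCreate()

        // The same read call works for local files, HDFS paths, or S3 URIs (path is a placeholder).
        val sales = spark.read.option("header", "true").csv("/tmp/sales.csv")
        sales.createOrReplaceTempView("sales")

        // SQL-like query executed by Spark on the registered view.
        spark.sql("SELECT product, COUNT(*) AS orders FROM sales GROUP BY product").show()

        spark.stop()
      }
    }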



Apache Spark Components and Architecture


Spark Context is an independent process through which a Spark application runs over a cluster. It gives a handle to the distributed mechanism/cluster so that you can use the resources of the distributed machines in your job. The application program that creates and uses the Spark Context object is known as the driver program. Specifically, to run on a cluster, the Spark Context connects to one of several types of cluster managers (such as Spark's own standalone cluster manager, Apache Mesos, or Hadoop's YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code to the executors through the Spark Context. Finally, tasks are sent to the executors to run and complete.
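
A minimal driver-program sketch is shown below; the application name, the standalone master URL, and the sample data are assumptions, and the same code would work with a YARN, Mesos, or local master.

    import org.apache.spark.{SparkConf, SparkContext}

    object MyDriverApp {
      def main(args: Array[String]): Unit = {
        // The master URL assumes a standalone cluster manager at master-host:7077;
        // "yarn", a Mesos URL, or "local[*]" would be used the same way.
        val conf = new SparkConf()
          .setAppName("MyDriverApp")
          .setMaster("spark://master-host:7077")

        // Connects to the cluster manager and acquires executors on the worker nodes.
        val sc = new SparkContext(conf)

        // Work submitted through this context is split into tasks and sent to the executors.
        val data   = sc.parallelize(1 to 1000)
        val result = data.map(_ * 2).reduce(_ + _)
        println(result)

        sc.stop()
      }
    }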



Cluster Overview




Following are the most important takeaways of the architecture:


• Each application gets its own executor processes, which stay alive for the duration of the whole application and run tasks in multiple threads. This means each application is isolated from the others, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).

• Spark is agnostic to the cluster manager, which means it can be coupled with any supported cluster manager and then leverage that cluster.

• Because the driver schedules tasks on the cluster, it should be run as close to the worker nodes as possible.



Spark Ecosystem Components




Spark Core


Spark Core is the base of the overall Spark project. It is responsible for distributed task dispatching, parallelism, scheduling, and basic I/O functionality. All the basic functionality of Spark Core is exposed through an API (for Java, Python, Scala, and R) centered on the RDD abstraction. A ‘driver’ program starts parallel operations such as map, filter, or reduce on an RDD by passing a function to Spark Core, which then schedules the function's execution in parallel on the cluster.
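
For example, a driver program might pass functions to Spark Core like this (a minimal word-count sketch; the master URL and the input path are placeholder assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDemo {
      def main(args: Array[String]): Unit = {
        // The master URL and input path are placeholder assumptions for this sketch.
        val sc = new SparkContext(new SparkConf().setAppName("WordCountDemo").setMaster("local[*]"))

        val counts = sc.textFile("/tmp/input.txt")
          .flatMap(_.split("\\s+"))              // function shipped to the executors
          .filter(_.nonEmpty)
          .map(word => (word.toLowerCase, 1))
          .reduceByKey(_ + _)                    // aggregation scheduled in parallel on the cluster

        counts.take(10).foreach(println)

        sc.stop()
      }
    }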


Besides the Spark Core API, there are additional libraries that are part of the Spark ecosystem and add powerful capabilities in the big data analytics and machine learning areas. These libraries include:



Spark Streaming



Spark Streaming is a useful addition to the core Spark API. It enables high-throughput, fault-tolerant processing of live data streams and is used for processing real-time streaming data. It is based on the micro-batch style of computing. The fundamental stream unit is the DStream (discretized stream), which is essentially a series of RDDs used to process the real-time data.
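
A minimal DStream sketch is shown below; the socket source at localhost:9999 and the 5-second micro-batch interval are assumptions made for illustration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // localhost:9999 and the 5-second micro-batch interval are assumptions for this sketch.
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))   // each micro-batch becomes one RDD

        // The DStream is a continuous series of RDDs, one per batch interval.
        val lines  = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()             // start receiving and processing the stream
        ssc.awaitTermination()
      }
    }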


