Apache Spark vs Apache Storm
Introduction to data streaming
The need for real-time data streaming is growing rapidly as the volume of real-time data increases. With streaming technologies leading the world of Big Data, it can be hard for users to choose the right real-time streaming platform. Two of the most popular real-time technologies to consider are Apache Spark and Apache Storm.
One key difference between the two frameworks is that Spark performs data-parallel computations, whereas Storm performs task-parallel computations.
Apache Spark
Apache Spark is a general-purpose, lightning-fast cluster-computing framework used for fast computation on large-scale data. It can handle both batch and real-time analytics and data processing workloads.
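To give a feel for the programming model, below is a minimal sketch of a Spark batch job written against the Java API (assuming Spark 2.x); the input path "input.txt" and the local master setting are placeholders for local testing, not part of any particular deployment.

```java
// A minimal sketch of a Spark batch word count (Java API, assuming Spark 2.x).
// "input.txt" and the local[*] master are placeholders for local testing.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a text file, split lines into words, and count each word.
        JavaRDD<String> lines = sc.textFile("input.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));
        sc.close();
    }
}
```

The same RDD-based code style carries over to Spark Streaming, which is why Spark can reuse one codebase for batch and stream workloads, as noted in the comparison below.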
Apache Storm
Apache Storm is an open-source, scalable, fault-tolerant, real-time stream processing computation system. It is a framework for distributed real-time data processing that focuses on stream (event) processing. It can be used with any programming language and can be integrated with any queuing or database technology.
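For comparison, here is a minimal sketch of a Storm topology using the Java API (assuming the Storm 1.x package layout); the WordSpout and CountBolt classes are illustrative stand-ins for a real stream source and processing logic, not components from any existing codebase.

```java
// A minimal sketch of a Storm topology (Java API, assuming Storm 1.x).
// WordSpout and CountBolt are illustrative stand-ins for real components.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordCountTopology {

    // Spout: the stream source primitive; emits one word per tuple.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final List<String> words = Arrays.asList("spark", "storm", "stream", "batch");
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(words.get(random.nextInt(words.size()))));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: consumes tuples and keeps a running count per word.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            counts.merge(word, 1, Integer::sum);
            System.out.println(word + ": " + counts.get(word));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // No downstream stream is emitted in this sketch.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-spout", new WordSpout(), 1);
        builder.setBolt("count-bolt", new CountBolt(), 2)
               .fieldsGrouping("word-spout", new Fields("word"));

        // Run in-process for local testing; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Utils.sleep(10_000);
        cluster.shutdown();
    }
}
```

Unlike the Spark job, each tuple here is processed individually as it arrives, which is where Storm's low-latency reputation comes from.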
Differences Between Spark and Storm
- Processing Model: Apache Spark provides batch processing (streams are handled as micro-batches), while Apache Storm provides micro-batch processing through Trident on top of its record-at-a-time core.
- Programming Language: Apache Spark supports fewer languages (Java, Scala, Python, R), while Apache Storm supports multiple languages, such as Java, Scala, and Clojure, and can be used with virtually any language.
- Stream Sources: Apache Spark reads streams from sources such as HDFS, while Apache Storm ingests data through spouts.
- Messaging: Apache Spark uses Akka and Netty, while Apache Storm uses ZeroMQ and Netty.
- Resource Management: For Apache Spark, YARN and Mesos handle resource management; the same applies to Apache Storm.
- Latency: Apache Spark has higher latency because of its micro-batch model, while Apache Storm provides lower latency with fewer restrictions.
- Stream Primitives: Apache Spark uses the DStream, while Apache Storm uses tuples and partitions.
- Development Cost: In Apache Spark, the same code can be used for batch and stream processing, while in Apache Storm the same code cannot be reused.
- Persistence: Apache Spark persists data per RDD, while Apache Storm uses MapState.
- Fault Tolerance: Apache Spark handles restarting workers through the resource manager, which can be YARN, Mesos, or its standalone manager, while in Apache Storm the supervisor automatically restarts a failed process, with state management handled through ZooKeeper.
- Provisioning: Apache Spark supports basic monitoring through Ganglia, while Apache Storm uses Apache Ambari.
- Throughput: Apache Spark serves around 100k records per node per second, while Apache Storm serves around 10k records per node per second.
- State Management: In Apache Spark, state can be changed and maintained via updateStateByKey (see the sketch after this list), but no pluggable strategy can be applied for implementing state in an external system. Storm does not provide a framework for storing intermediate bolt output as state, so each application has to manage its own state whenever required.
- Specialty: Apache Spark offers unified processing (batch, SQL, streaming, etc.), while Apache Storm offers distributed RPC.
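To make the State Management point concrete, the following is a minimal sketch of a stateful word count using Spark Streaming's updateStateByKey (Java API, assuming Spark 2.x); the socket source on localhost:9999 and the checkpoint directory are hypothetical placeholders chosen for this example.

```java
// A minimal sketch of stateful stream processing with Spark Streaming's
// updateStateByKey (Java API, assuming Spark 2.x). The socket source on
// localhost:9999 and the checkpoint directory are hypothetical placeholders.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StatefulWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        ssc.checkpoint("/tmp/spark-checkpoint"); // updateStateByKey requires checkpointing

        // Each micro-batch arrives as a DStream of text lines.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> pairs = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1));

        // Merge the counts from the current micro-batch into the running total
        // that Spark keeps for each key across batches.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFn =
                (newValues, runningTotal) -> {
                    int sum = runningTotal.orElse(0);
                    for (Integer v : newValues) {
                        sum += v;
                    }
                    return Optional.of(sum);
                };
        JavaPairDStream<String, Integer> totals = pairs.updateStateByKey(updateFn);

        totals.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```

In Storm, there is no equivalent built-in hook for carrying state across tuples, so an application would typically keep counts in a bolt's fields (as in the earlier sketch) or push them to an external store.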
Apache Storm and Apache Spark are both great solutions for streaming ingestion and transformation.