<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Passing Curiosity: Posts tagged distributed systems</title>
    <link href="https://passingcuriosity.com/tags/distributed-systems/distributed-systems.xml" rel="self" />
    <link href="https://passingcuriosity.com" />
    <id>https://passingcuriosity.com/tags/distributed-systems/distributed-systems.xml</id>
    <author>
        <name>Thomas Sutton</name>
        
        <email>me@thomas-sutton.id.au</email>
        
    </author>
    <updated>2018-11-15T00:00:00Z</updated>
    <entry>
    <title>Work allocation in Kafka Streams</title>
    <link href="https://passingcuriosity.com/2018/work-allocation-kafka-streams/" />
    <id>https://passingcuriosity.com/2018/work-allocation-kafka-streams/</id>
    <published>2018-11-15T00:00:00Z</published>
    <updated>2018-11-15T00:00:00Z</updated>
<summary type="html"><![CDATA[<p>This is mostly an exercise in writing things down to help me remember them.
You’re better off referring to <a href="https://docs.confluent.io/current/streams/introduction.html">Confluent’s Kafka Streams documentation</a>
or <a href="https://medium.com/@andy.bryant/kafka-streams-work-allocation-4f31c24753cc">this blog post by Andy Bryant</a>.</p>
<h2 id="kafka">Kafka</h2>
<p><a href="http://kafka.apache.org/">Apache Kafka</a> is a “distributed streaming platform”. <em>Messages</em> with keys
and values are written to <em>topics</em> (“queues”, if that helps to think about
them). Each topic is divided (when it’s created) into a number of <em>partitions</em>.
Topic partitions are the unit of work in a Kafka cluster: at any given
time, a single cluster node is responsible for processing the messages for a
given partition.</p>
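<p>The key-to-partition mapping can be sketched like this. This is a toy stand-in: Kafka’s real default partitioner hashes the serialised key with murmur2, but the principle (same key, same partition, same worker) is identical.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Partitioner {
    // Simplified stand-in for Kafka's default partitioner. Kafka itself
    // uses murmur2 over the serialised key; any deterministic hash shows
    // the same property: equal keys always land in the same partition.
    public static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        // Mask off the sign bit so the modulus is non-negative.
        return (Arrays.hashCode(bytes) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Every message with the same key lands in the same partition,
        // and is therefore processed by the same consumer.
        System.out.println(partitionFor("user-42", 3));
        System.out.println(partitionFor("user-42", 3)); // same value again
    }
}
```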
<p>Applications which read from a Kafka topic can also be distributed - each
partition can be consumed (“read”) by a different worker. The collection of
workers cooperating to process a topic form a <em>consumer group</em>. Kafka’s
consumer group API helps to assign the work (i.e. the partitions) to the
available workers.</p>
<h2 id="kafka-streams">Kafka Streams</h2>
<p><a href="http://kafka.apache.org/documentation/streams/">Kafka Streams</a> is a library for building streaming data processing
applications on top of Kafka. Streams applications are just normal Java
programs which can be deployed, monitored, and managed just like any other
Java program: however many instances you start, they will self-organise and
cooperate to share the available work between them. This makes scaling Streams
applications very straightforward: just start or kill some instances (assuming
there are work units that can be re-/allocated).</p>
<p>A Kafka Streams application is described by a <em>topology</em> – essentially a
directed acyclic graph with nodes representing each source, sink, and
processing step. Each topology can be split into <em>subtopologies</em> with nodes
which interact only with other nodes in the same subtopology. Because the
nodes in a subtopology only interact with each other, the subtopologies can
be executed in parallel without any coordination. The collection
of subtopologies, together with the partitions of each subtopology’s input
topics, defines the collection of <em>stream tasks</em>
that can be distributed across the workers in a Streams application.</p>
<p>The first phase in executing a topology analyses it and the Kafka cluster
and determines the units of work that must be scheduled:</p>
<ol type="1">
<li><p>Partition the topology into subtopologies.</p></li>
<li><p>For each subtopology, check that the input topics have the same key
configuration and the same number of partitions. This ensures that
corresponding records from each of the input topics will be processed
by the same stream task, allowing them to be joined, etc.</p></li>
<li><p>For each subtopology, generate one stream task to read from each set of
corresponding partitions in the input topics. If subtopology 1 reads
from topics A, B, and C and they are configured with 3 partitions then
this will result in three stream tasks “1_0”, “1_1”, and “1_2”.</p></li>
</ol>
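<p>The task-generation step above can be sketched as pure logic. The <code>taskIds</code> helper here is hypothetical, not the actual Kafka Streams internals; it just makes the co-partitioning check and the “1_0”, “1_1”, … naming concrete.</p>

```java
import java.util.*;

public class StreamTasks {
    // Given a subtopology's input topics (topic name -> partition count),
    // check the co-partitioning requirement and generate one task id
    // "<subtopology>_<partition>" per set of corresponding partitions.
    public static List<String> taskIds(int subtopologyId,
                                       Map<String, Integer> inputTopicPartitions) {
        Set<Integer> counts = new HashSet<>(inputTopicPartitions.values());
        if (counts.size() != 1) {
            // Input topics must have the same number of partitions, or
            // corresponding records could not meet in the same task.
            throw new IllegalStateException("co-partitioning violated: " + inputTopicPartitions);
        }
        int partitions = counts.iterator().next();
        List<String> tasks = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            tasks.add(subtopologyId + "_" + p);
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Subtopology 1 reading topics A, B, C, each with 3 partitions.
        System.out.println(taskIds(1, Map.of("A", 3, "B", 3, "C", 3)));
    }
}
```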
<p>Note that the collection of stream tasks generated from a topology is static:
both the graph in a topology and the number of partitions in a Kafka topic
are fixed at creation. The next phase allocates the stream tasks to be executed
by application instances.</p>
<ol start="4" type="1">
<li><p>Each instance executes a number of stream threads determined by its
configuration. Each stream thread is a more or less independent worker
able to process one or more stream tasks.</p></li>
<li><p>Each stream thread will connect to the Kafka cluster using the consumer
group API. The Kafka cluster and the Streams application instances will
cooperate to allocate the available work to the available workers. From
the application’s perspective this means allocating stream tasks to stream
threads and from the Kafka cluster’s perspective this is topic partitions
to consumers (and it just happens to be the case that we’ll co-allocate
partitions of certain topics to the same workers).</p></li>
</ol>
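<p>The allocation step can be sketched as a plain round-robin assignment of tasks to threads. This is a simplification: the real Streams assignor is sticky and also considers state locality, but the outcome is the same kind of mapping.</p>

```java
import java.util.*;

public class TaskAssignment {
    // Round-robin allocation of stream tasks to stream threads: a
    // simplified picture of what the consumer group rebalance achieves.
    public static Map<String, List<String>> assign(List<String> tasks,
                                                   List<String> threads) {
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String t : threads) assignment.put(t, new ArrayList<>());
        for (int i = 0; i < tasks.size(); i++) {
            // Task i goes to thread i mod (number of threads).
            assignment.get(threads.get(i % threads.size())).add(tasks.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("1_0", "1_1", "1_2", "2_0", "2_1", "2_2");
        System.out.println(assign(tasks, List.of("thread-a", "thread-b")));
    }
}
```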
<p>With all this done, the Kafka Streams application is able to start processing
messages.</p>]]></summary>
</entry>
<entry>
    <title>FP-Syd, June 2013</title>
    <link href="https://passingcuriosity.com/2013/fpsyd-june-2013/" />
    <id>https://passingcuriosity.com/2013/fpsyd-june-2013/</id>
    <published>2013-06-26T00:00:00Z</published>
    <updated>2013-06-26T00:00:00Z</updated>
<summary type="html"><![CDATA[<p>FP-Syd in June 2013 had talks about implementing cellular automata in Haskell
(comparing the Repa and Accelerate array processing libraries) and about
distributed data structures and systems. I get the feeling there was something
else, but I didn’t write notes on it.</p>
<h3 id="cellular-automata">Cellular Automata</h3>
<p>Tran gave an experience report on using array computation libraries in Haskell
(Repa and Accelerate) to implement cellular automata: a “falling sand” game that
simulates gravity and “alchemical” interactions between elements.</p>
<p>The first step is simulating gravity: dealing with falling blocks and
randomising them. This uses a “block CA”, with blocks defined by grids (2x2 cells) which
alternate between time steps (red grid, then blue grid); this allows you to
implement gravity using a single rule <code>[: ] -&gt; [..]</code>.</p>
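<p>A minimal sketch of such a block-CA gravity step, assuming <code>true</code> marks a sand cell. The actual falling-turnip rules (randomisation, element interactions) are richer; this only shows the 2x2-block structure and the alternating offset.</p>

```java
public class BlockGravity {
    // A toy block-CA gravity step: the grid is cut into 2x2 blocks, and
    // within each block sand (true) falls into an empty cell (false) below.
    // Alternating the block offset (0, then 1) between time steps lets
    // sand cross block boundaries, as with the red/blue grids above.
    public static boolean[][] step(boolean[][] grid, int offset) {
        int h = grid.length, w = grid[0].length;
        boolean[][] next = new boolean[h][w];
        for (int y = 0; y < h; y++) next[y] = grid[y].clone();
        for (int by = offset; by + 1 < h; by += 2) {
            for (int bx = offset; bx + 1 < w; bx += 2) {
                for (int dx = 0; dx < 2; dx++) {
                    int x = bx + dx;
                    if (next[by][x] && !next[by + 1][x]) { // sand above empty: fall
                        next[by][x] = false;
                        next[by + 1][x] = true;
                    }
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        boolean[][] grid = { { true, false }, { false, false } };
        boolean[][] after = step(grid, 0);
        System.out.println(after[1][0]); // the grain has fallen one row
    }
}
```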
<p>Repa and Accelerate both have the concept of stencil convolutions, which can
be used to implement CA rules. A stencil has a shape (the neighbourhood) and a
fold-ish function to process each cell.</p>
<ul>
<li><p>Phrase the problem in terms of array computation.</p></li>
<li><p>Repa: slap Repa functions onto standard Haskell code.</p></li>
<li><p>Accelerate: EDSL means you can’t do lots of Haskell stuff (little things like
pattern matching).</p></li>
</ul>
<p>Repa has Gloss integration. Hmm.</p>
<p>Code is on GitHub. Called falling-turnip?</p>
<h3 id="conflict-free-replicated-data-types-consensus-protocols-and-the-cloud">Conflict-free replicated data types, consensus protocols and the cloud</h3>
<p>Andrew Frederick Cowie</p>
<p>twitter.com/afcowie</p>
<p>AfC on #haskell</p>
<p>Cloudy stuff means there’s never only one of anything these days; everything is
distributed (or will, hopefully, soon need to be).</p>
<blockquote>
<p>Streaming I/O: iteratee, conduits, io-streams, pipes all provide <em>abstractions</em>
for processing data in a <strong>single</strong> thread in a <strong>single</strong> process.</p>
</blockquote>
<p>Pipes explicitly talks about clients and servers (ends of the pipelines) but
this is all just structuring computations within a single thread. This doesn’t
really help us do anything interesting to build distributed systems.</p>
<p>AWS regions and availability zones; no SLA unless your app spans availability
zones.</p>
<p>CAP theorem: consistency, availability and resilience to network partition.</p>
<p>Two generals: attack succeeds iff both attack at the same time. No solution can
guarantee coordination in the face of unreliable messaging.</p>
<p>Jepsen blog posts about testing distributed databases.</p>
<blockquote>
<p>Computers are slow.</p>
</blockquote>
<p>The speaker mentioned “CRDTs” a few times (difficult to Google; did you mean
“credit”?); the acronym originally meant something like Convergent and Commutative
Replicated Data Types, now Conflict-free Replicated Data Types. Paper in 2007 from INRIA.
Is it possible to arrange things so that distributed states will <em>always</em>
converge? Yes: join semi-lattices.</p>
<p>Paper proposes a protocol and implementations of various abstractions on top of
it: counters, registers, sets (grow-only - G-Set, add &amp; remove - 2P-Set,
observed-remove - OR-Set), graphs.</p>
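<p>A minimal G-Set sketch makes the join-semilattice idea concrete: the state is a set, merge is set union, and since union is commutative, associative, and idempotent, replicas converge to the same value regardless of merge order.</p>

```java
import java.util.*;

public class GSet {
    // A grow-only set (G-Set): the simplest CRDT. Elements can only be
    // added; merge is set union, which forms a join semi-lattice.
    private final Set<String> elements = new HashSet<>();

    public void add(String e) { elements.add(e); }
    public boolean contains(String e) { return elements.contains(e); }

    public GSet merge(GSet other) {
        GSet merged = new GSet();
        merged.elements.addAll(this.elements);
        merged.elements.addAll(other.elements);
        return merged;
    }

    public Set<String> value() { return Collections.unmodifiableSet(elements); }

    public static void main(String[] args) {
        GSet replicaA = new GSet(); replicaA.add("x");
        GSet replicaB = new GSet(); replicaB.add("y");
        // Merging in either order yields the same converged state.
        System.out.println(replicaA.merge(replicaB).value()
                .equals(replicaB.merge(replicaA).value()));
    }
}
```

<p>Removal is what makes this hard: a plain G-Set can never forget an element, which is exactly why the 2P-Set and OR-Set variants above exist.</p>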
<p>Global invariants can’t be enforced over the whole system without
synchronisation; eventual consistency allows you to accept changes which, after
merge, break the invariant.</p>
<h4 id="consensus-algorithms">Consensus algorithms</h4>
<p>Paxos algorithm (Lamport. 1998. The Part-Time Parliament), peer-to-peer
consensus algorithm; most people don’t bother trying to implement a full
peer-to-peer consensus system.</p>
<p>ZooKeeper elects a leader.</p>
<p>Raft. There is just one leader and the point of the algorithm is to ensure that
the leader’s log is the most correct. Leaders have terms? Consensus is
something like “3 nodes agree”.</p>
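<p>The quorum arithmetic behind “3 nodes agree” is simple: with <em>n</em> nodes, a majority is floor(n/2) + 1 votes, so a 3-node cluster needs 2 votes and tolerates 1 failure. A trivial sketch:</p>

```java
public class Quorum {
    // Majority quorum: with n nodes, agreement needs floor(n/2) + 1 votes.
    public static int majority(int n) { return n / 2 + 1; }

    public static boolean hasQuorum(int votes, int n) { return votes >= majority(n); }

    public static void main(String[] args) {
        System.out.println(majority(3));      // 2
        System.out.println(hasQuorum(2, 3));  // true
    }
}
```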
<p>Ceph - large distributed file system. Lots of nodes, three of them are special
“monitors” which maintain the cluster map. It’s pretty hard to build a file
system if your objects (disk blocks) change underneath you; Ceph has focussed
on the consistency needed to build a file system.</p>
<p>Rather than maintain an index (which doesn’t scale), Ceph places blocks
according to a layout algorithm. <code>CRUSH()</code>. Paxos to elect monitors.</p>
<p>All this stuff needs very fast interconnect, i.e. a single data centre.</p>
<p>Amazon’s SQS:</p>
<ul>
<li><p>Reliable delivery of all messages.</p></li>
<li><p>Delivery is not ordered. (Not much of a “queue” is it?)</p></li>
<li><p>Delivery is guaranteed <em>at least</em> once.</p></li>
</ul>
<p>We’re now back where we started: workers that receive messages need to be able
to figure out whether the message needs processing (at least once), etc.</p>
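<p>A sketch of the idempotent-worker idea: with at-least-once delivery, the same message can arrive twice, so the worker remembers which message ids it has already handled and drops repeats. The id scheme here is hypothetical; real systems would persist the seen-set or make the processing itself idempotent.</p>

```java
import java.util.*;

public class DedupingWorker {
    // Tracks which message ids have already been processed, so that a
    // redelivered message (at-least-once semantics) has no further effect.
    private final Set<String> seen = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    // Returns true if the message was processed, false if it was a duplicate.
    public boolean handle(String messageId, String payload) {
        if (!seen.add(messageId)) return false; // already done: drop it
        processed.add(payload);                 // the "real work" stand-in
        return true;
    }

    public List<String> processed() { return processed; }

    public static void main(String[] args) {
        DedupingWorker w = new DedupingWorker();
        System.out.println(w.handle("m1", "hello")); // true
        System.out.println(w.handle("m1", "hello")); // false: redelivery
    }
}
```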
<p>Queues like this are good for signalling, but solutions like CRDTs will help
manage the state we still need to track.</p>
<h4 id="conclusion">Conclusion</h4>
<p>Idempotence is everything, see FP for salvation.</p>
<h4 id="qa">Q&amp;A</h4>
<p>Make state an abelian group or commutative monoid and this stuff comes largely for
free.</p>
<p>Cloud Haskell duplicating Erlang model, but has tight coupling in the wrong
places, needs same ABI (version) on all nodes; are we on our way back to a
single data centre?</p>
<p>NoSQL vs RDBMS. Now settling down to small transactional world and larger
non-transactional world.</p>
<p>Taking functional approaches (git’s similarity to persistent data structures,
log-journaled databases, etc.).</p>]]></summary>
</entry>

</feed>
