EEP: XXX
Title: gen_stream behaviour
Version: $Revision: 14 $
Last-Modified: $Date: 2007-12-10 07:17:01 +0200 (Mon, 10 Dec 2007) $
Author: Jay Nelson
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 09-Dec-2007
Erlang-Version: R12B-2
Post-History: 09-Dec-2007


Abstract

    An optimized behaviour module is needed to simplify the handling
    of large streams of (typically binary) data for application
    developers.


Specification

    Module name: gen_stream

    Implementation:

        A gen_server which delivers "chunks" of the stream in an
        efficient manner so that line-oriented processing or the
        handling of streams much bigger than memory (possibly even
        infinite) may be absorbed by an application.

    Behaviour callbacks:

        start, start_link as in gen_server

        init(Args, Options) ->
            Same as gen_server plus list of Options:

            {stream, {file, path_to_file()} |
                     {binary, binary()} |
                     {behaviour, atom(), ExtraArgs}}
            {chunk_size, integer()}
                returned sub-binary size, default is ~8K
            {chunks_per_proc, integer()}
                number of internal chunks, default is 1
            {circular, false | true}
                whether the stream repeats, default is false
            {num_processes, integer()}
                number of processes used, default is 1

        next_chunk(Server::pid()) -> binary() | end_of_stream
        pct_complete(Server::pid()) -> integer() | atom()
        stream_size(Server::pid()) -> integer() | atom()
        stream_pos(Server::pid()) -> integer()
        stop(Server::pid()) -> ok

    Usage:

        The client starts the gen_stream by providing at least a
        stream option. The stream option indicates whether the
        source of the stream is a file, a binary, or a function
        (the behaviour option). When using a socket, port, or other
        source, the client needs to implement the behaviour to feed
        the buffers on demand.


Motivation

    There are many ways to get binary data into an Erlang node;
    however, historically it has been recommended that the data be
    converted to a list and processed.
    There are many situations where leaving the binary data in its
    original form is preferable for space or conversion efficiency
    reasons (e.g., when merely filtering data in a relaying router
    process or when performing statistics on raw stream data).
    Providing a gen_server idiom makes the default approach to
    processing a binary stream an abstraction that is closer to an
    application developer's view of the problem solution.

    The recent Wide Finder project [1] challenged the Erlang
    community by highlighting the slowness of standard I/O
    functions, forcing developers to use raw binary handling. This
    approach seems to be a common need in web service applications,
    yet it is quite easy to do in a very inefficient manner.
    Providing a reference implementation that exposes a simpler
    behaviour interface would increase the class of problems that
    Erlang can solve in the hands of beginning to intermediate
    developers. It would also push implementers in the direction of
    an OTP-compliant application without sacrificing efficiency.

    In addition, there has been a call on the email list for a
    string_stream implementation so that a buffer of data (e.g., an
    SMTP message, HTTP request, HTML page, multi-record socket
    protocol packet, raw text database, comma-delimited file, etc.)
    could be treated as a stream of binary elements rather than a
    single block of data.

    Finally, testing systems often need a generative source of data
    that can be replayed or repeated in a precise manner to trigger
    a fault or to test a patch for that fault. The circular binary
    stream allows infinite streams of generative data, and the
    behaviour stream allows a functionally generated stream of data
    to be emitted.
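
    The circular stream described above can be illustrated with a
    minimal sketch. It is not the reference implementation and does
    not use gen_stream; the module name circular_chunks and the
    functions next_chunk/3 and take/4 are hypothetical names chosen
    for this example. It assumes the source is a non-empty binary
    and that a chunk reaching the end of the binary simply wraps
    around to the start.

```erlang
-module(circular_chunks).
-export([next_chunk/3, take/4]).

%% Hypothetical helper, not part of gen_stream: return Size bytes of
%% Bin starting at offset Pos, wrapping to the start of the binary
%% when its end is reached, together with the offset of the next chunk.
-spec next_chunk(binary(), non_neg_integer(), pos_integer()) ->
          {binary(), non_neg_integer()}.
next_chunk(Bin, Pos, Size) when byte_size(Bin) > 0, Size > 0 ->
    Total = byte_size(Bin),
    Start = Pos rem Total,
    {collect(Bin, Start, Size, Total, []), (Start + Size) rem Total}.

%% Gather sub-binaries until Need bytes have been collected.
collect(_Bin, _Start, 0, _Total, Acc) ->
    iolist_to_binary(lists:reverse(Acc));
collect(Bin, Start, Need, Total, Acc) ->
    Len = min(Need, Total - Start),
    Part = binary:part(Bin, Start, Len),
    collect(Bin, (Start + Len) rem Total, Need - Len, Total, [Part | Acc]).

%% Pull N consecutive chunks, threading the offset through each call.
-spec take(binary(), non_neg_integer(), pos_integer(), non_neg_integer()) ->
          [binary()].
take(_Bin, _Pos, _Size, 0) ->
    [];
take(Bin, Pos, Size, N) when N > 0 ->
    {Chunk, Next} = next_chunk(Bin, Pos, Size),
    [Chunk | take(Bin, Next, Size, N - 1)].
```

    For example, circular_chunks:take(<<"abcde">>, 0, 2, 4) returns
    [<<"ab">>,<<"cd">>,<<"ea">>,<<"bc">>]: the third chunk wraps
    past the end of the five-byte source, so the same sequence can
    be replayed indefinitely from a known offset.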

Rationale

    There are a few common idioms that are used when efficiently
    handling a binary data source:

        1) "Chunking" the data to smaller sub-binaries
        2) Buffering the chunks for efficient I/O
        3) Few of the standard idioms are OTP-compliant

    A gen_server implementation seemed the most straightforward
    way to provide an OTP-compliant method for chunking a serial
    stream. A behaviour was created so that streams could be
    computed and generated rather than requiring a pre-constructed
    file or binary as a source.


Reference Implementation

    A working version is available at the DuoMark website [2].


References

    [1] Tim Bray's weblog
        http://www.tbray.org/ongoing/

    [2] http://www.duomark.com/erlang/proposals/gen_stream.html


Copyright

    This document is released to the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: