How are Enumerators for arbitrary objects implemented?

Some time ago someone asked on Stackoverflow how Ruby’s Enumerators do their job. Consider there is an implementation of an arbitrary Object that features an each method, then it is possible to get an Enumerator for that object via Kernel#to_enum:

o = Object.new
def o.each
  yield 1
  yield 2
  yield 3
end

enum = o.to_enum

Enumerators come in very handy, especially in situations where you want the basic functionality of each, but with a twist - let’s say you would like to iterate through the characters of a String but at the same time you want access to the index of the current character, without having to maintain that index manually. That’s what each_with_index does, something you probably use a dozen times a day. The importance here is that the each_with_index functionality is not something that is provided by the String class, but something that comes with Enumerator. In fact, String#each_char returns an instance of Enumerator.

alphabet = ('a'...'z').to_a.join("")
enum = alphabet.each_char

puts enum.class # => Enumerator

enum.each_with_index do |c, i|
  puts "#{i}: #{c}"
end

Besides String#each_char, each_byte, each_line etc. there are other classes typically returning Enumerators as well when each is called with no arguments, or they have special each_xxx methods similar to String#each_char.

puts [1, 2, 3].each.class         # => Enumerator

puts ({ a: "b" }).each_key.class # => Enumerator

...

So it turns out that we use Enumerators in quite a few places. It would be a real bummer if these weren’t implemented efficiently, right? Now with those Enumerators that are built-in we can certainly trust that they are implemented efficiently because we are in the context of a specific class and probably know how to do things clean and fast in that context.

But an interesting question is how to proceed in the case of arbitrary Objects that implement an each method, as was the case in our first example. A simple, straight-forward implementation of an Enumerator in that case immediately jumps to mind: iterate through all the elements in the object, store them in an Array and then hand them out one by one. This certainly works, an implementation could look like the following. Let’s assume that the object implements nothing but a bare each method that takes the usual block.


class EagerEnumerator
  def initialize(enumerable)
    @ary = [].tap do |a|
      enumerable.each { |item| a << item }
    end
    @i = 0
  end

  def next
    raise StopIteration.new("iteration reached an end") if @i == @ary.size
    val = @ary[@i]
    @i += 1
    val
  end
end

class MyEnumerable
  def each
    yield 1
    yield 2
    yield 3
  end
end

e = EagerEnumerator.new(MyEnumerable.new)
puts e.next # => 1
puts e.next # => 2
puts e.next # => 3
puts e.next # => StopIteration is raised

So it works, but it has one major drawback: All the references of the values in the enumerable are duplicated into an Array immediately when the EagerEnumerator is created. That’s not a biggie if the enumerable is small, but it’s certainly a waste of memory and could become a problem if the enumerable is quite large or the time to retrieve a single object in each is significant. That’s a problem that is probably well known to Pythonistas, and they solved this by introducing Generators. The goal is evident. Rather than doing the whole work at once, we would like to do the work only on request, only when it is necessary. As it turns out, Ruby has a very nice feature that allows us to do exactly that, and this is also how an Enumerator is implemented for Object#to_enum. Continuation is the tool that fulfills our requirements perfectly, and Ruby offers it in the form of Fibers. Let’s have a look at how Object#to_enum is implemented in CRuby:

static VALUE
obj_to_enum(int argc, VALUE *argv, VALUE obj)
{
    VALUE meth = sym_each;

    if (argc > 0) {
	      --argc;
	      meth = *argv++;
    }
    return rb_enumeratorize(obj, meth, argc, argv);
}

As we can see, to_enum only works for Objects that have an each method (the sym_each). So what rb_enumeratorize now does is it creates an Enumerator instance, and the really interesting part is what happens in calls to next:

static VALUE
enumerator_next(VALUE obj)
{
    VALUE vs = enumerator_next_values(obj);
    return ary2sv(vs, 0);
}

static VALUE
enumerator_next_values(VALUE obj)
{
    struct enumerator *e = enumerator_ptr(obj);
    VALUE vs;

    if (e->lookahead != Qundef) {
        vs = e->lookahead;
        e->lookahead = Qundef;
        return vs;
    }

    return get_next_values(obj, e);
}

static VALUE
get_next_values(VALUE obj, struct enumerator *e)
{
    VALUE curr, vs;

    if (e->stop_exc)
	      rb_exc_raise(e->stop_exc);

    curr = rb_fiber_current();

    if (!e->fib || !rb_fiber_alive_p(e->fib)) {
        /* sets up the fiber initially */
	      next_init(obj, e);
    }

    /* resume the Fiber if next value is needed */
    vs = rb_fiber_resume(e->fib, 1, &curr);
    if (e->stop_exc) {
	      e->fib = 0;
	      e->dst = Qnil;
	      e->lookahead = Qundef;
	      e->feedvalue = Qundef;
	      rb_exc_raise(e->stop_exc);
    }
    return vs;
}

The real interesting part is what happens in get_next_values. There, initially a Fiber is set up that pulls out the values from the each method one by one. It’s here where the beauty of Fibers and Continuations really shines. Without them, we would be stuck here, because calls to each are of a all-or-nothing nature, we either process all of the values immediately or we don’t. But with the additional functionality of a Fiber we can elegantly “pause” its execution immediately after we retrieved a single value from each, and return that to the caller. Later, if another value is requested, we can simply pick up in the Fiberfrom where we left off, get the next value, pause again, return the value to the caller and so on.

The benefit of this approach is that we don’t have to do the housekeeping for any excess values, we simply pass on the current value. Unlike before, where each has practically “flooded” us with all its values at once, we now have more control over the process and with this finer-grained control we are now able to tell each: “Stop. One is enough. I’ll ask for more later!”.

This pattern can be really useful in everyday tasks, and saving some memory or time is probably never a bad idea. Fortunately for us, it’s really easy to implement in Ruby itself. Let’s have a look at a lazy version of our Enumerator:


class LazyEnumerator
  def initialize(enumerable)
    @fiber = Fiber.new do
      enumerable.each { |item| Fiber.yield item }
    end
  end

  def next
    @fiber.resume || raise(StopIteration.new("iteration reached an end"))
  end
end

class MyEnumerable
  def each
    yield 1
    yield 2
    yield 3
  end
end

e = LazyEnumerator.new(MyEnumerable.new)
puts e.next # => 1
puts e.next # => 2
puts e.next # => 3
puts e.next # => StopIteration is raised

This is pretty much the equivalent Ruby code for how Enumerators are implemented internally in CRuby.

[Computing the Units Digit of a Power: a Constant-Time Algorithm Using a Little Number Theory] >