Beam has support for working with collections of data that conform to a schema, and you can use SQL transforms to query and reshape that data. This feels a bit more like working with data in Spark, although Beam doesn't have the same level of query optimization.
Here are the imports. I'll leave these in because Beam's docs are not great and I had to copy them out of a YouTube video when I was learning to do this:
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
import typing
from faker import Faker
Here is my schema class. Beam wants it to be a named tuple.
class Person(typing.NamedTuple):
    person_id: int
    name: str
    fave_color: str
And you have to register a coder for the class:
beam.coders.registry.register_coder(Person, beam.coders.RowCoder)
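Registering the coder is what lets Beam shuttle these rows between transforms as schema'd data. As a quick sanity check (my own sketch, not something from the Beam docs), you can pull the registered coder back out of the registry and round-trip a value through bytes:
# Sanity-check sketch: round-trip a Person through the registered coder.
coder = beam.coders.registry.get_coder(Person)
alice = Person(person_id=1, name="Alice", fave_color="Blue")
round_tripped = coder.decode(coder.encode(alice))
print(round_tripped)  # the same field values should come back as a schema'd row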
And here's an example that puts the schema and a SQL transformation to work. One thing the docs don't make obvious: SqlTransform is a cross-language transform, so the Python SDK expands it through a Java expansion service, and you'll need a Java runtime installed to run this:
with beam.Pipeline() as p:
    fake = Faker()
    people = (
        p
        # Generate 10,000 ids to fan out into fake people.
        | "get ids" >> beam.Create(list(range(10_000)))
        # Build a Person row for each id; with_output_types tells Beam
        # the schema of the elements.
        | "to people" >> beam.Map(
            lambda person_id: Person(
                person_id=person_id,
                name=fake.name(),
                fave_color=fake.color_name(),
            )
        ).with_output_types(Person)
        # Run SQL over the schema'd elements; PCOLLECTION is the implicit
        # name of the single input PCollection.
        | "count colors" >> SqlTransform("""
            SELECT fave_color, COUNT(*) AS `COUNT`
            FROM PCOLLECTION
            GROUP BY fave_color
        """)
        | "print" >> beam.Map(print)
    )
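Each output element is a row with fave_color and COUNT fields, so you should see one printed line per color name. If you'd rather skip the SQL (and the Java expansion service it requires), here's a hedged sketch of the same count done with plain Python transforms; the names and structure here are mine, not from the pipeline above:
with beam.Pipeline() as p:
    fake = Faker()
    (
        p
        | "get ids" >> beam.Create(list(range(10_000)))
        | "to colors" >> beam.Map(lambda _: fake.color_name())
        # Count.PerElement emits (color, count) pairs, like the GROUP BY above.
        | "count colors" >> beam.combiners.Count.PerElement()
        | "print" >> beam.Map(print)
    )
You lose the declarative SQL, but you also drop the cross-language dependency, which can matter in constrained environments.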