Bulk Operations with Traversals, Now Composed
- Recap
- Sequencing
- Nesting
- Zipping
- Scalability
- Abstraction
- Another Take at Nesting
- Another Take at Zipping
- Arrows
- Another Take at Scalability
- The
profunctors
Package - Appendix
Recap
As I’ve written previously, if you have an operation
process :: A -> IO B
and you need to do this operation in bulk, i.e. turn multiple A
’s into multiple B
’s, then it’s a good idea to represent it as
processBulk :: Traversable t => t A -> IO (t B)
This type describes an invariant, that the B
’s returned are in a one-to-one correspondence with the input A
’s, and forces us to eagerly check/enforce this invariant. As we’ll see, this actually also makes such bulk operations composable.
Sequencing
Throughout this post we’ll use a running example of orchestrating bulk select queries to a database with the following toy schema:
┌─────┐ ┌──────┐ ┌────────┐
│Users│──[1-to-many]──│Orders│──[1-to-many]──│Products│
└─────┘ └──────┘ └────────┘
│ ┌────────┐
└─[1-to-many]──│Payments│
└────────┘
If we have two bulk processes that we need to pipe one after another, then it’s no surprise that we can just compose them with >=>
:
orderIDByPaymentID
:: Traversable t => t PaymentID -> IO (t OrderID)
userIDByOrderID
:: Traversable t => t OrderID -> IO (t UserID)
userIDByPaymentID
:: Traversable t => t PaymentID -> IO (t UserID)
userIDByPaymentID
= orderIDByPaymentID >=> userIDByOrderID
Nesting
Suppose for some reason we’ve decided to separate the function that selects product ID’s that belong to an order, and the function that selects the actual data of the product by its ID.
productIDsByOrderID
:: Traversable t => t OrderID -> IO (t [ProductID])
-- ^ The nested [] is to reflect the multiplicity relationship:
-- One order ID maps to 0 or more product ID's
productDataByID
:: Traversable t => t ProductID -> IO (t ProductData)
productDatasByOrderID
:: Traversable t => t OrderID -> IO (t [ProductData])
productDatasByOrderID = ???
How do we compose the two functions? We can start with productIDsByOrderID >=> ???
but then the resulting ???
has expected type t [ProductID] -> IO (t [ProductData])
, which is not what we’ve got.
Note that the choice of t
we make when calling productDataByID
doesn’t have to match the t
we’re given as implementors of productDatasByOrderID
. Instead what we can do is use productDataByID
at type Compose t []
:
productDataByID @(Compose t [])
:: Compose t [] ProductID -> IO (Compose t [] ProductData)
With some newtype wrapping/unwrapping, we get:
fmap getCompose
. productDataByID
. Compose @t @[]
:: t [ProductID] -> IO (t [ProductData])
i.e. exactly what we want. Putting the puzzle pieces together, we can implement the function:
productDatasByOrderID
:: Traversable t => t OrderID -> IO (t [ProductData])
productDatasByOrderID orderIDs = do
(productIDs :: t [ProductID]) <- productIDsByOrderID orderIDs
(productDatas :: Compose t [] ProductData) <-
productDataByID (Compose productIDs)
pure $ getCompose productDatas
This possibility is due to there existing an instance (Traversable f, Traversable g) => Traversable (Compose f g)
.
Zipping
Suppose we want to select the data of an order by ID, and also the related products:
orderDataByID
:: Traversable t => t OrderID -> IO (t OrderData)
productDatasByOrderID -- implemented up above
:: Traversable t => t OrderID -> IO (t [ProductData])
orderWithProductsByID
:: Traversable t => t OrderID -> IO (t (OrderData, [ProductData]))
orderWithProductsByID = ???
We could start with taking the collection of order IDs and feeding them to the two functions like so:
orderWithProductsByID orderIDs = do
(orderDatas :: t OrderData) <- orderDataByID orderIDs
(productDatas :: t [ProductData]) <- productDatasByOrderID orderIDs
???
but then we’re left with a puzzle of how to zip together two t
structures. However the only thing we know about t
is that it is Traversable
, and that is not actually enough to zip arbitrary structures together.
We may note that they’re not actually arbitrary structures, as each operation here preserves the shape, and thus orderDatas
and productDatas
must have the same shape and thus we’re morally entitled to zipping them. You may be reaching for mapAccumL
to unsafely zip them, or for a Bazaar
to represent them as length-indexed vectors, but there’s a much more safe and direct alternative to this.
We’re going to get productDatasByOrderID
to do the zipping for us, and we’re going to do it by passing OrderData
through it, and we’re going to do it for free, without changing productDatasByOrderID
.
The trick is to tuck OrderData
into the t
. Just like we represented t [X]
as Compose t [] X
, we can represent t (X, Y)
as Compose t ((,) X) Y
:
fmap getCompose
. productDatasByOrderID
. Compose @t @((,) OrderData)
:: t (OrderData, OrderID)
-> IO (t (OrderData, [ProductData]))
Alright, but how do we get t (OrderData, OrderID)
if what we have is t OrderData
and t OrderID
? We can do the exact same trick again: we thread (a copy of) OrderID
through orderDataByID
:
fmap getCompose
. orderDataByID
. Compose @t @((,) OrderID)
:: t (OrderID, OrderID)
-> IO (t (OrderID, OrderData))
With a little glue we can put the puzzle pieces together:
orderWithProductsByID orderIDs = do
let
dupedIDs :: t (OrderID, OrderID)
dupedIDs = orderIDs <&> \i -> (i, i)
Compose (idsAndOrderDatas :: t (OrderID, OrderData)) <-
orderDataByID (Compose dupedIDs)
let
orderDatasAndIDs :: t (OrderData, OrderID)
orderDatasAndIDs = idsAndOrderDatas <&> swap
Compose (ordersAndProducts :: t (OrderData, [ProductData])) <-
productDatasByOrderID (Compose orderDatasAndIDs)
pure ordersAndProducts
Scalability
Equipped with these tricks we fear no query:
usersEverything
:: Traversable t
=> t UserID
-> IO (t (UserData, [(OrderData, [ProductData], [PaymentData])]))
usersEverything userIDs = do
let uiui = userIDs <&> \i -> (i, i)
Compose uiud <- userDataByID (Compose uiui)
let udui = uiud <&> swap
Compose udoi <- orderIDsByUserID (Compose udui)
let udoioi = Compose (Compose udoi) <&> \i -> (i, i)
Compose udoiod <- orderDataByID (Compose udoioi)
let udodoioi = Compose (udoiod <&> swap) <&> \i -> (i, i)
Compose udodoipi <- productIDsByOrderID (Compose udodoioi)
Compose (Compose udodoipd) <- productDataByID (Compose (Compose udodoipi))
let udodpdoi = Compose (udodoipd <&> swap)
udodpdyi <- paymentIDsByOrderID udodpdoi
udodpdyd <- paymentDataByID (Compose udodpdyi)
pure $ getCompose $ getCompose $ fmap (\(x, (y, z)) -> (x, y, z))
$ getCompose $ getCompose $ getCompose udodpdyd
Actually maybe I fear that query. Can you keep track of the types, cause I for sure cannot. What’s the type of udodpdyd
?
Compose
(Compose
(Compose
(Compose
(Compose
t
((,) UserData)
)
[]
)
((,) OrderData)
)
((,) [ProductData])
)
[]
PaymentData
There’s got to be a better way.
Abstraction
Up to this point we’ve been treating the traversable approach as a “design pattern”. Haskell’s rich type system actually allows us to precisely capture the nature of this pattern and abstract it as a type:
newtype BulkFunction a b
= BulkFunction (forall t. Traversable t => t a -> IO (t b))
productDataByID :: BulkFunction ProductID ProductData
Now we can also abstract the compositional glue we’ve used into well-defined functions. Sequencing gets the simple type:
BulkFunction x y -> BulkFunction y z -> BulkFunction x z
Actually the base library has a typeclass for this kind of operation, and having made a newtype, we can write an instance:
class Category (q :: Type -> Type -> Type) where
id :: q x x
(.) :: q y z -> q x y -> q x z
instance Category BulkFunction where
id = BulkFunction pure
BulkFunction f . BulkFunction g = BulkFunction (f <=< g)
Actually before we go too far, note that there is already another instance of Category
, whose id
is pure
and whose .
is <=<
:
newtype Kleisli m a b = Kleisli (a -> m b)
instance Monad m => Category (Kleisli m)
It makes sense to abstract out the _ -> IO _
part as a parameter, and delegate to its instance of Category
:
newtype Bulk p a b = Bulk
{ runBulk :: forall t. Traversable t => p (t a) (t b) }
instance Category p => Category (Bulk p) where
id = Bulk id
Bulk f . Bulk g = Bulk (f . g)
productDataByID :: Bulk (Kleisli IO) ProductID ProductData
We’re also going to need to be able to map over the a
and b
parameters, this is achieved by making it an instance of Profunctor
, from the profunctors library:
class Profunctor (p :: Type -> Type -> Type) where
dimap :: (x -> y) -> (z -> w) -> p y z -> p x w
instance Profunctor p => Profunctor (Bulk p) where
dimap f g (Bulk h) = Bulk (dimap (fmap f) (fmap g) h)
Another Take at Nesting
Next up we have a notion of nesting: we were able to “lift” productDataByID
like so:
productDataByID :: Bulk (Kleisli IO) ProductID ProductData
productDatasByIDs :: Bulk (Kleisli IO) [ProductID] [ProductData]
productDatasByIDs = Bulk
(dimap Compose getCompose $ runBulk productDataByID)
Generalizing this, we have:
liftList :: Profunctor p => Bulk p x y -> Bulk p [x] [y]
In fact, generalizing even further, we arrive at:
liftContainer
:: (Profunctor p, Traversable t)
=> Bulk p x y -> Bulk p (t x) (t y)
liftContainer (Bulk f) = Bulk (dimap Compose getCompose f)
The profunctors
library has a typeclass for profunctors that support this operation:
class Traversing q where
traverse' :: Traversable t => q x y -> q (t x) (t y)
instance Profunctor p => Traversing (Bulk p) where
traverse' = liftContainer
Another Take at Zipping
The trick of tucking/threading data into the traversable is an example of something called profunctor strength:
class Strong q where
first' :: q x y -> q (x, z) (y, z)
second' :: q x y -> q (z, x) (z, y)
instance Profunctor p => Strong (Bulk p) where
second' = traverse'
-- ^ slick definition also leveraging `Compose t ((,) c)`
first' f = dimap swap swap $ second' f
With these two functions we can define a “fan-out” operation:
fanout :: (Category q, Strong q) => q x y -> q x z -> q x (y, z)
fanout f g = dimap (\x -> (x, x)) id
(second' g . first' f)
which can in particular be specialized to q ~ Bulk p
, whence it supplies the same input to two processes, and “zips” their results.
Arrows
What we’ve just defined looks very similar to the &&&
operator from Control.Arrow, both in type signature and in implementation:
class Category q => Arrow q where
arr :: (x -> y) -> q x y
first :: q x y -> q (x, z) (y, z)
second :: q x y -> q (z, x) (z, y)
(***) :: q x y -> q z w -> q (x, z) (y, w)
f *** g = second g . first f
(&&&) :: q x y -> q x z -> q x (y, z)
f &&& g = (f *** g) . arr (\x -> (x, x))
And indeed it is. base
cannot depend on profunctors
, but morally Arrow
is equivalent to Strong
+Category
, because the typeclasses can be inter-defined like so:
instance Arrow q => Profunctor q where
dimap f g h = arr g . h . arr f
instance Arrow q => Strong q where
first' = first
second' = second
instance (Strong q, Category q) => Arrow q where
arr f = dimap f id id
first = first'
second = second'
Under this identification we see that fanout
is actually equal to &&&
, and the following is sufficient to define Arrow
:
instance (Profunctor p, Category p) => Arrow (Bulk p) where
arr f = dimap f id id
first = first'
second = second'
Note that if we instead defined arr f = Bulk (arr (fmap f))
, this would unnecessarily require Strong p
.
Another Take at Scalability
Equipped with Arrow
, Traversing
, etc typeclass methods as compositional glue combinators, we can take another stab at defining a large query out of smaller pieces:
userDataByID :: Bulk (Kleisli IO) UserID UserData
orderDataByID :: Bulk (Kleisli IO) OrderID OrderData
productDataByID :: Bulk (Kleisli IO) ProductID ProductData
paymentDataByID :: Bulk (Kleisli IO) PaymentID PaymentData
orderIDsByUserID :: Bulk (Kleisli IO) UserID [OrderID]
productIDsByOrderID :: Bulk (Kleisli IO) OrderID [ProductID]
paymentIDsByOrderID :: Bulk (Kleisli IO) OrderID [PaymentID]
usersEverything
:: Bulk (Kleisli IO)
UserID
(UserData, [(OrderData, [ProductData], [PaymentData])])
usersEverything
= userDataByID
&&& (traverse' orderEverything . orderIDsByUserID)
where
orderEverything = dimap id (\(x, (y, z)) -> (x, y, z))
$ orderDataByID
&&& (traverse' productDataByID . productIDsByOrderID)
&&& (traverse' paymentDataByID . paymentIDsByOrderID)
Much better. What’s the type of orderEverything
?
orderEverything
:: Bulk (Kleisli IO)
OrderID
(OrderData, [ProductData], [PaymentData])
Now you can compose these bulk processes pretty much the same way as you can compose ordinary functions, except you’re forced to use point-free style, because you can’t give names to intermediate values using lambda binders.
Or can you? With the Arrow
typeclass comes the Arrows
extension which normally gets very little use, but works perfectly here. Just like a do
-block lets you give names to results of monadic actions, a proc
block lets you give names to values flowing in a string diagram, which is then “compiled” down to Arrow
operations.
Here’s what the function looks like using arrow syntax:
usersEverything
:: Bulk (Kleisli IO)
UserID
(UserData, [(OrderData, [ProductData], [PaymentData])])
usersEverything = proc userID -> do
userData <- userDataByID -< userID
orderIDs <- orderIDsByUserID -< userID
orders <- traverse' orderEverything -< orderIDs
returnA -< (userData, orders)
where
orderEverything = proc orderID -> do
orderData <- orderDataByID -< orderID
productIDs <- productIDsByOrderID -< orderID
paymentIDs <- paymentIDsByOrderID -< orderID
products <- traverse' productDataByID -< productIDs
payments <- traverse' paymentDataByID -< paymentIDs
returnA -< (orderData, products, payments)
With all intermediate values named, the code is self-documenting, you can add type signatures to the intermediate values, do some (irrefutable) pattern matching, construct return values (normally you’d probably be using a record type instead of this (,,)
triple).
As an exercise to the reader, you can figure out the connection between Choice
and ArrowChoice
, how traverse'
gives you right'
for free, and what that lets you do in terms of arrow syntax and DB queries.
The profunctors
Package
The Bulk
type actually exists in the profunctors
package as CofreeTraversing
, and there is a WrappedArrow
newtype to evidence that an Arrow
is Strong
, and that an ArrowChoice
is Choice
, but it’s lacking a mechanism to observe the converse, i.e. that CofreeTraversing
can be used with arrow syntax. However it is simple enough to add a blanket newtype wrapper that would translate Strong
into Arrow
and so on.
Appendix
A note about query performance: if we’re talking to a real database, the implementation of, say, userDataByID
will have to build something like a Map
or HashMap
of values the DB has returned, in order to build the resulting traversable. usersEverything
is composed out of multiple such functions, and in essence what happens is we send several queries to the DB, and glue their results together using Map
or HashMap
. This is essentially like executing a JOIN
query using a Merge Join or a Hash Join strategy, except that the Merge/Hash Join part has been moved out of the DB’s query planner and into our application. This has the disadvantage of depriving the query planner of the possibility to make an optimal choice based on the exact query and the data. Taking full advantage of the query planner would require making an SQL query builder – definitely possible but a little outside the scope of this post.
Also, beware of long IN ('x', 'y', ...)
lists, instead consider a VALUES
list (which can be IN
‘d, or JOIN
‘ed with).