Skip to end of metadata
Go to start of metadata

Background

The driver generates a stream of operations (e.g. create user, create post, create comment, retrieve person's posts etc.) and then executes them using the provided database connector.
To be capable of generating heavier loads, it executes the operations from this stream in parallel.
If there were no dependencies between operations (e.g., reads that depend on the completion of writes) this would be trivial.
However, some operations within the stream do depend on others, others are depended on, some both depend on others and are depended on, and some neither depend on others nor are they depended on. 

 Consider, for example, a social network benchmark scenario, where the data generator outputs a sequence of events such as User A posted a pictureUser B left a comment on the picture of User A, etc.  The second event depends on the first one in a sense that there is a causal ordering between them: User B can only leave a comment on the picture once it has been posted. The generated events are already ordered by their time stamp, so in case of the single-threaded execution this ordering is observed by default: the driver issues a request to the SUT  with the first event (i.e., Users A posts a picture), after its completion it issues the second event (create a comment). However, if events are executed in parallel, these two events may end up in different parallel sequences of events. Therefore, a driver needs a mechanism to insure the dependency is observed even when the dependant events are in different parallel update streams.

All operations are time-stamped, giving a partial ordering (partial because some operations may have equal time stamps), and as long as the time stamps of dependent operations are never equal this partial ordering captures the dependencies between operations.

Even with a known ordering, executing dependent operations in a parallel (and distributed) context, in a way that is scalable (i.e., with more resources the driver can generate higher throughput) is not trivial.
Furthermore, it is equally important that 
no unwanted bursty behavior is introduced by the driver. To maintain a deterministic execution order in distributed environments requires communication and synchronization between remote processes. This synchronization takes time and, especially if the duration between start times of dependent operations is short, can lead to the driver processes behaving in a bursty manner, i.e., blocking for periods, executing for periods, blocking again, etc.

For example, social networks tend to have a time-varying activity of users: certain concepts or topics cause flashmob effects, when large amount of users posts (reposts, replies, shares) about a certain phenomena (e.g. an earthquake). In such case the dataset contains a lot of dependant events, so the synchronization overhead would significantly alter the driver (and SUT) behavior. From the benchmarking point of view, it means that the driver would alter the workload in an unpredictable (and non-repeatable) way. 

The following sections summarize the approaches used in the driver to deal with these challenges.

Definitions

Definitions for the various terms that will be introduced throughout the document are list below:

  • Driver: distributed, parallel workload generating software
    • Driver Coordinator: single, centralized, coordinating driver process. Responsible for coordinating the, potentially multiple, driver workers
    • Driver Worker: many distributed driver worker processes may exist. Responsible for executing the benchmark workload, i.e., sending queries to the System Under Test (SUT)
  • Simulation Time (ST): notion of time created by data generator. All time stamps in the generated data set are in simulation time
  • Real Time (RT): wall clock time
  • Time Compression Ratio: function that maps simulation time to real time, e.g., an offset in combination with a compression ratio. It is a static value, set in driver configuration
    Real Time Ratio is reported along with benchmark results, allowing others to recreate the same benchmark
  • Operation: read and/or write 
  • Dependencies: operations in this set introduce dependencies in the workload. That is, for every operation in this set there exists at least one other operation (in Dependents) that can not be executed until this operation has been processed
  • Dependentsoperations in this set are dependent on at least one other operation (in Dependencies) in the workload
  • Due Time (DueT): point in simulation time at which the execution of an operation should be initiated.
  • Dependent Time (DepT): in addition to Due Time, every operation in Dependents also has a Dependent Time, which corresponds to the Due Time of the operation that it depends on. Dependent Time is always before Due Time. For operations with multiple dependencies Dependent Time is the maximum Due Time of all the operations it depends on.
  • Tolerated Delay: the duration of time after Due Time that an operation can be late
  • Safe Time (SafeT): time duration.
    • when two operations have a necessary order in time (i.e., dependency) there is at least a SafeT interval between them
    • SafeT is the minimum duration between the Dependency Time and Due Time of any operations in Dependents
  • Operation Stream: sequence of operations ordered by Due Time (dependent operations must separated by at least SafeT)
  • Initiated Operations: operations that have started executing but not yet finished
  • Completed Operations: operations that have finished executing
  • Local Completion Time (per driver): point in simulation time behind which there are no uncompleted operations
  • Completion Time = min( min(Initiated Operations), max(Completed Operations) )
  • Global Completion Time (GCT): minimum completion time of all drivers. Once GCT has advanced to the Dependent Time of some operation that operation is safe to execute, i.e., the operations it depends on have all completed executing.
    Global Completion Time = min(Completion Time)
  • Execution Window (Window): a timespan within which all operations can be safely executed
    • All operations satisfying window.startTime <= operation.DueT <= window.endTime may be executed
    • Within a window no restrictions on operation ordering or operation execution time are enforced, scheduling is non-deterministic within a window (Andrey: maybe better: driver has a freedom of choosing an arbitrary scheduling strategy inside the window)
    • To ensure that execution order respects dependencies between operations, window size is bounded by SafeT, such that: 0 < window.duration <= SafeT
    • Window duration is fixed, per operation stream; this is to simplify scheduling and make benchmark runs repeatable
    • Before any operations within a window can start executing it is required that: GCT >= window.startTime - (SafeT - window.duration) 
    • All operations within a window must initiate and complete between window start and end times: 
      window.startTime <= operation.initiate < window.endTime && window.startTime <= operation.complete < window.endTime
  • Dependency Mode: defines dependencies, constraints on operation execution order
  • Execution Mode: defines how the runtime should execute operations of a given type

Tracking dependencies

Consider that every operation in a workload belongs to none, one, or both of the following sets: Dependencies and Dependents.

As mentioned, the driver uses operation time stamps (Due Times) to ensure that dependencies are maintained.
It keeps track of the latest point in time behind which every operation has completed.
That is, every operation (i.e., dependency) with a Due Time lower or equal to this time is guaranteed to have completed execution.
It does this by maintaining a monotonically increasing (distributed) variable called Global Completion Time (GCT).

Logically, every time the driver (via a database connector) begins execution of an operation from Dependencies that operation is added to Initiated Operations: the set of operations that have started executing but not yet finished.
Then, upon completion, the operation is removed from Initiated Operations and added to Completed Operations: the set of operations that have started and finished executing.
Using these sets, each driver process maintains its own view of GCT in the following way.
Local progress is monitored and managed using a variable called Local Completion Time (LCT): the point in time behind which there are no uncompleted operations, i.e., no operation in Initiated Operations has a lower Due Time and no operation in Completed Operations has a higher Due Time  (see Definitions for specifics). LCT is periodically sent to all other driver processes, which all then (locally) set their view of GCT to the minimum LCT of all driver processes.

At this point the driver has two, of the necessary three (third covered shortly), pieces of information required for knowing when to execute an operation:

  • Due Time: point in time at which an operation should be executed, assuming all preconditions (e.g., dependencies) have been fulfilled
  • GCT: every operation (from Dependencies) with a Due Time before this point in time has completed execution 

However, with only GCT to track dependencies the driver has no way of knowing when it is safe to execute any particular dependent operation.
What GCT communicates is that all dependencies up to some point in time have completed, but whether or not the dependencies for any particular operation are within these completed operations is unknown.
The driver would have to wait until GCT has passed the Due Time (because Dependency Time is always lower) of an operation before that operation could be safely executed, which would result in the undesirable outcome of every operation missing its Due Time.

The required information is which particular operation in Dependencies does any operation in Dependents depend on. More specifically, the Due Time of this operation. This is referred to as Dependent Timein addition to Due Time, every operation in Dependents also has (read: must have) a Dependent Time, which corresponds to the latest Due Time of all the operations it depends on. Once GCT has advanced beyond the Dependent Time of an operation that operation is safe to execute.

Using these three mechanisms (Due Time, GCT, and Dependent Time) the driver is able to execute operations, while ensuring their dependencies are satisfied beforehand. 

Scalability in the presence of dependencies

The mechanisms introduced in Tracking Dependencies guarantee that dependency constraints are not violated, but in doing so they unavoidably introduce communication/synchronization between driver threads/processes.
To minimize the negative effects that synchronization has on driver scalability an additional Execution Mode was introduced (for more on Execution Modes see Workload Execution - Putting it all together): Windowed Execution.

Windowed Execution has two design goals: (a) make the generated load less 'bursty' (b) allow the driver to 'scale', so when the driver is given more resources (CPUs, servers, etc.) it is able to generate more load.

In the context of Windowed Execution, operations are executed in groups (Windows), where operations are grouped according to their Due Time.
Every Window has a Start Time, a Duration, and an End Time (i.e., Window.startTime + Window.duration = window.endTime), and Windows contain only those operations that have a Due Time between Window.startTime and Window.endTime.
Logically, all operations within a Window are executed at the same time, some time within the Window. No guaranty is made regarding exactly when, or in what order, an operation will execute within its Window.

The reasons this approach is correct are as follows:

  • Operations belonging to the Dependencies set are never executed in this manner - the Due Times of Dependencies operations are never modified as this would affect how dependencies are tracked
  • The minimum duration between the Dependency Time and Due Time of any operation in Dependents is known (can be calculated by scanning through workload once), this duration is referred to as Safe Time (SafeT)
  • A window does not start executing until the dependencies of all its operations have been fulfilled. This is ensured by enforcing that window execution does not start until GCT >= window.startTime - (SafeT - window.duration) = window.endTime - SafeT; that is, the duration between GCT and the end of the window is no longer than SafeT

The way this translates into improved scalability is it can drastically reduce the amount of communication/synchronization necessary for tracking and maintaining GCT. 
Instead of reading GCT before the execution of every operation, it only needs to be read before the execution of every window.
Then, as GCT is read less frequently, it follows that GCT does not need to be communicated between driver processes as frequently. Communicating GCT at some fraction of SafeT (e.g., SafeT/3) is likely sufficient.

The advantages of such an execution mode are as follows:

  • As no guarantees are made regarding time or order of operation execution within a Window, GCT no longer needs to be read before the execution of every operation, only before the execution of every window
  • Then, as GCT is read less frequently, it follows that it does not need to be communicated between driver processes as frequently. There is no need or benefit to communicating GCT protocol message more frequently than approximately Window.duration, the side effect of which is reduced network traffic
  • Further, by making no guarantees regarding the order of execution the driver is free to reschedule operations (within Window bounds). The advantage being that operations can be rearranged in such a way as to reduce unwanted bursts of load during execution, which could otherwise occur while synchronizing GCT during demanding workloads. For example, a uniform scheduler may modify operation Due Times to be uniformly distributed across the Window timespan, to 'smoothen' the load within a Window.

As with any system, there are tradeoffs to this design, particularly regarding Window.durationThe main tradeoff is that between ' workload resolution' and scalability. Increasing Window.duration reduces communication/synchronization but also reduces the resolution at which the workload definition is followed. That is, the generated workload becomes less like the workload definition. However, as this is both bounded and configurable, it is not a major concern (see illustration).
This design also trades a small amount of repeatability for scalability, as there are no timing or ordering guarantees within a window two executions of the same window are not guaranteed to be equivalent - 'what happens in the window stays in the window'. Despite sacrificing this repeatability, the results of operations do not change. No dependency-altering operations occur during the execution of a Window, therefore results for all queries should be equivalent between two executions of the same workload, there is no effect on the expected result for any given operation.

 


Above: illustration of windowed execution, exemplifying tradeoff when selecting window size


Workload Execution - Putting it all together

Using the information and mechanisms introduced in previous sections, in addition to a little additional context where necessary, we can now explain precisely how operations are executed.
Based on the dependencies certain operations have, and on the granularity of parallelism we wish to achieve while executing them, we assign a Dependency Mode and an Execution Mode to every operation type.
Using these classifications the driver runtime then knows how each operation should be executed. These modes, as well as what they mean to the driver runtime, are described below.

Dependency Modes

While executing a workload the driver treats operations differently, depending on their Dependency Mode.
In the previous section operations were categorized by whether or not they are in the sets Dependencies and/or Dependents
Another way of communicating the same categorization is by assigning a Dependency Mode to operations - every operation type generated by a workload definition must be assigned to exactly one Dependency Mode.
Dependency modes define dependencies, constraints on operation execution order.
The driver supports a number of different Dependency ModesNoneRead Only, Write OnlyRead Write.
During workload execution, operations of each type are treated as follows.
 

  • None 
    Depended On (NO): operations do not introduce dependencies with other operations (i.e., the correct execution of no other operation depends on these operations to have completed executing) 


    • Prior Execution
      • do nothing
    • After Execution
      • do nothing
  • Read Only 
    Depended On (NO): operations do not introduce dependencies with other operations (i.e., the correct execution of no other operation depends on these operations to have completed executing)
    Dependent On (YES): operation execution does depend on GCT to have advanced sufficiently (i.e., correct execution of these operations requires that certain operations have completed execution)


    • Prior Execution
      • wait for GCT >= operation.DepT
    • After Execution
      • do nothing
  • Write Only 
    Depended On (YES): operations do introduce dependencies with other operations (i.e., the correct execution of certain other operations requires that these operations to have completed executing, i.e., to advance GCT)
    Dependent On (NO): operation execution does not depend on GCT to have advanced sufficiently (i.e., correct execution of these operations does not depend on any other operations to have completed execution)

    • Prior Execution
      • add operation to Initiated Operations
    • After Execution
      • remove operation from Initiated Operations
      • add operation to Completed Operations
  • Read Write
    Depended On (YES): operations do introduce dependencies with other operations (i.e., the correct execution of certain other operations requires that these operations to have completed executing, i.e., to advance GCT)
    Dependent On (YES): operation execution does depend on GCT to have advanced sufficiently (i.e., correct execution of these operations requires that certain operations have completed execution)


    • Prior Execution
      • add operation to Initiated Operations
      • wait for GCT < operation.DepT
    • After Execution
      • remove operation from Initiated Operations
      • add operation to Completed Operations
Execution Modes

Execution Modes relate to how operations are scheduled (if and how their Due Time is modified, see Scaling: Execution Windows), when they are executed, and what their failure conditions are.
Each operation type in a workload definition must be assigned to exactly one Execution Mode.
The driver supports a number of different Execution Modes: Asynchronous, Synchronous, Partially Synchronous.
It splits the single operation stream generated by a workload into multiple streams, one per Execution Mode.
During workload execution, operations from each of these streams are treated as follows.

  • Asynchronous: operations are executed individually, when their Due Time arrives

    Motivation:
     This is the default execution mode, it executes operations as true to the workload definition as possible.

    • Re-scheduling Before Execution
      • None: operation.DueT not modified by scheduler
    • Execute When
      • time >= operation.DueT (&& GCT >= operation.DepT)
    • Max Concurrent Executions
      • unbounded
    • Max Execution Time
      • unbounded
    • Failure
      • operation execution starts later than: operation.DueT + Tolerated Delay
  • Synchronous: operations are executed individually, sequentially, in blocking manner

    Motivation
    :
    Some dependencies are difficult to capture efficiently with SafeT and GCT alone. 
    For example, social applications often support conversations via posts and likes, where likes depend on the existence of posts
    Furthermore, posts and likes also depend on the existence of the users that make them.
    However, users are created at a lower frequency than posts and likes, and it can be assumed they do not immediately start creating content.
    As such, a reasonably long SafeT can be used between the creation of a user and the first time that user creates posts or likes.
    Conversely, posts are often replied to and/or liked soon after their creation, meaning a short SafeT would be necessary to maintain the ordering dependency.
    Consequently, maintaining the dependencies related to conversations would require a short SafeT, and hence a small window.
    This results in windows containing fewer operations, leading to less potential for parallelism within windows, less freedom in scheduling, more synchronization, and greater likelihood of bursty behavior - all negative things.

    The alternative offered by Synchronous Execution is that, when practical, operations of certain types can be partitioned (e.g. posts and likes could be partitioned by the foruml in which they appear), and partitions assigned to driver processes.
    Using the social application example from above, if all posts and likes were partitioned by forum the driver process that executes the operations from any partition could simply execute them sequentially. Then the only dependency to maintain would be on user operations, reducing synchronization dramatically, and parallelism could still be achieve as each partition would be executed independently, in parallel, by a different driver process.

    • Re-scheduling Before Execution
      • None: operation.DueT not modified by scheduler
    • Execute When
      • time >= operation.DueT && previousOperation.completed == true (&& GCT >= operation.DepT)
    • Max Concurrent Execution
      • 1
    • Max Execution Time
      • nextOperation.DueT - operation.DueT
    • Failure
      • operation execution starts later than: operation.DueT + Tolerated Delay
        E.g., if previousOperation did not complete in time, forcing current operation to wait for longer than toleratedDelay
  • Partially Synchronous: a.k.a "Windowed Execution" (see Scalability in the presence of dependencies for more details), groups of operations from the same time window are executed together

    Motivation:
    see Scalability in the presence of dependencies

    • Re-scheduling Before Execution
      • Yes, as long as the following still holds: window.startTime <= operation.DueT < window.startTime + window.duration 
      • Operations within a window may be scheduled in any way, as long as they remain in the window from which they originated: Due Times, and therefore ordering, may be modified
    • Execute When
      • time >= operation.DueT (&& GCT >= operation.DepT)
    • Max Concurrent Execution
      • number of operations within window
    • Max Execution Time
      • (window.startTime + window.duration) - operation.DueT 
    • Failure
      • operation execution starts later than: window.startTime + window.duration
      • operation execution does not finish by: window.startTime + window.duration
  • No labels