We assume that the data generator produces a stream of events (for example, creating a profile, adding friends, and posting in a group are all events). Some events depend on other events (e.g., a user cannot post anything before the profile is created), while other events have no dependencies.
The events from the data generator will be executed by distributed drivers, so we have to guarantee that the partial ordering of the events based on their dependencies (e.g., a user profile must be created before the user posts any comments or adds friends; a reply to a post can be created only after the initial post is created) is preserved in the distributed setup. Some of these dependencies can be satisfied simply by mapping the events onto drivers that execute them sequentially; for example, we can partition all the comments by the forum to which they belong and assign each entire forum to a single driver, which can execute it without communicating with other drivers. However, other types of dependencies still require inter-driver communication: the driver that adds comments to a forum needs to make sure that the corresponding users have already been created by other drivers.
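The forum-based partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the event layout `(due_time, forum_id, payload)`, the function names, and the use of a CRC32 hash to pick a driver are all hypothetical and not part of the generator or driver.

```python
import zlib
from collections import defaultdict


def assign_driver(forum_id, num_drivers):
    """Map a forum to a driver index (hypothetical scheme).

    CRC32 is used instead of the built-in hash() so that the mapping is
    the same in every process.
    """
    return zlib.crc32(str(forum_id).encode("utf-8")) % num_drivers


def partition_by_forum(events, num_drivers):
    """Group (due_time, forum_id, payload) tuples into per-driver lists.

    Input order is preserved within each driver, so a stream sorted by
    Due Time yields per-driver lists that are also sorted by Due Time.
    """
    per_driver = defaultdict(list)
    for event in events:
        _, forum_id, _ = event
        per_driver[assign_driver(forum_id, num_drivers)].append(event)
    return per_driver
```

Because every event of a forum lands on the same driver, that driver can replay the forum sequentially without consulting the others.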
We assume that every event has its own Due Time (in simulation time), i.e., the time by which the action should be executed. We also assume that all drivers are aware of the Global Completion Time (GCT) -- the minimum, across all drivers, of the Due Time of each driver's most recently completed event. Intuitively, GCT is the Due Time of the last completed event of the slowest driver, so all events with timestamps smaller than GCT are guaranteed to be finished.
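As a minimal sketch of this definition, GCT can be computed from each driver's latest completed Due Time. The dictionary layout and the function names are assumptions made for illustration; how drivers actually exchange these values is not specified here.

```python
def global_completion_time(last_completed_due_time):
    """Return the GCT.

    last_completed_due_time maps a driver id to the Due Time of the most
    recently completed event on that driver (hypothetical layout).
    """
    if not last_completed_due_time:
        raise ValueError("at least one driver is required")
    return min(last_completed_due_time.values())


def dependency_is_finished(dependency_due_time, gct):
    """An event is guaranteed finished only if its Due Time is below GCT."""
    return dependency_due_time < gct
```

For instance, with drivers at Due Times 120, 95, and 140, the GCT is 95, and only dependencies with a Due Time below 95 can be treated as complete.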
In order to differentiate between events, we introduce the following classes of events that the data generator needs to be aware of.
Here we describe two ways of classifying events (see also Driver design - Scalable execution of dependent operations).
A: From the point of view of scheduling inside the driver (intra-driver):
1. Events that have to be executed sequentially, so the scheduler cannot execute the next event of this type before the current event is completed.
2. Events that can be executed in batches, concurrently with other events of this type.
B: From the point of view of dependencies between drivers (inter-driver):
1. Events that do not modify GCT. Intuitively, these are the events that do not introduce dependencies with events of other drivers.
2. Events that modify GCT. Some other drivers may be waiting for the completion of an event of this type, so we have to notify everyone else once this event is completed (by modifying the GCT).
Examples of the SNB Benchmark events:
Create user: A.2 & B.2
Add a friend: A.2 & B.1
Add a post/comment/like: A.1 & B.1
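The two classifications and the examples above could be encoded along these lines. This is a sketch only: the enum and event names (`CreateUser`, `AddPost`, and so on) are hypothetical identifiers chosen for illustration, while the A/B assignments are taken from the list above.

```python
from enum import Enum
from typing import NamedTuple


class IntraDriver(Enum):
    SEQUENTIAL = "A.1"   # next event waits for the current one to complete
    CONCURRENT = "A.2"   # may run in batches with events of the same type


class InterDriver(Enum):
    KEEPS_GCT = "B.1"     # introduces no cross-driver dependency
    MODIFIES_GCT = "B.2"  # other drivers may be waiting on completion


class EventClass(NamedTuple):
    intra: IntraDriver
    inter: InterDriver


# Hypothetical event names; classes follow the SNB examples in the text.
SNB_EVENT_CLASSES = {
    "CreateUser": EventClass(IntraDriver.CONCURRENT, InterDriver.MODIFIES_GCT),
    "AddFriend": EventClass(IntraDriver.CONCURRENT, InterDriver.KEEPS_GCT),
    "AddPost": EventClass(IntraDriver.SEQUENTIAL, InterDriver.KEEPS_GCT),
    "AddComment": EventClass(IntraDriver.SEQUENTIAL, InterDriver.KEEPS_GCT),
    "AddLike": EventClass(IntraDriver.SEQUENTIAL, InterDriver.KEEPS_GCT),
}


def modifies_gct(event_name):
    """True if completing this event requires notifying other drivers."""
    return SNB_EVENT_CLASSES[event_name].inter is InterDriver.MODIFIES_GCT
```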
Naturally, updating GCT and propagating it among drivers incurs a high cost in inter-driver communication. For this reason, the simulated data should conform to the following ΔT-rule: between an event of type B.2 (one that modifies GCT) and any event that depends on it, there should be a time difference of at least ΔT (DeltaT). This allows us to update GCT only once per DeltaT time window.
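One way to check generated data against the ΔT-rule is sketched below. The input shapes are assumptions: `b2_due_times` maps an entity (for example, a user id) to the Due Time of its B.2 event, and each dependent event is a `(due_time, entity_id)` pair. None of these names come from the generator itself.

```python
def find_delta_t_violations(b2_due_times, dependent_events, delta_t):
    """Return the dependent events that break the DeltaT-rule.

    A dependent event violates the rule when it is due less than
    delta_t after the B.2 event it depends on.
    """
    violations = []
    for due_time, entity_id in dependent_events:
        created_at = b2_due_times[entity_id]
        if due_time - created_at < delta_t:
            violations.append((due_time, entity_id))
    return violations
```

With a user created at time 100 and DeltaT = 10, a post due at 105 is reported as a violation, while posts due at 110 or later pass.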
Input of Data Generator (in addition to the input that it takes now):
DeltaT -- the size of the time window for events of type B.2. For the SNB generator, this is the time that must pass between the creation of a user and that user adding posts, comments, or friends.
Output of Data Generator:
1. The event stream, sorted by timestamp (Due Time).
2. The stream may be split into 1..n CSV files; within each file, the events have to be sorted by timestamp.
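Because each file is sorted by timestamp, a consumer can rebuild the globally sorted stream with a k-way merge. The sketch below assumes, purely for illustration, that the first CSV column holds an integer Due Time; the actual column layout is not defined here.

```python
import csv
import heapq


def read_events(lines):
    """Yield (due_time, row) pairs; column 0 is assumed to be the Due Time."""
    for row in csv.reader(lines):
        yield int(row[0]), row


def merge_sorted_streams(streams):
    """Merge several Due-Time-sorted CSV streams into one sorted stream.

    Each element of streams is an iterable of CSV lines, such as an open
    file. heapq.merge reads lazily, so the files need not fit in memory.
    """
    readers = [read_events(stream) for stream in streams]
    for _, row in heapq.merge(*readers, key=lambda pair: pair[0]):
        yield row
```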
The output CSV file can look like this:
Additionally, the data driver needs a mapping from EventName to the Event type according to our classification (e.g., AddFriend => Type A.2&B.1). This can be provided as a parameter file.
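A parameter file for this mapping could be parsed as sketched below. The file format is an assumption: one `EventName => A.x&B.y` entry per line, with blank lines and `#` comments ignored. The real format is not fixed by this document.

```python
def parse_event_types(lines):
    """Parse lines such as 'AddFriend => A.2&B.1' into a dictionary.

    Returns {event_name: (intra_class, inter_class)}. The line format
    is hypothetical.
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, classes = line.partition("=>")
        intra, _, inter = classes.partition("&")
        intra, inter = intra.strip(), inter.strip()
        if intra not in {"A.1", "A.2"} or inter not in {"B.1", "B.2"}:
            raise ValueError(f"unrecognized event type in line: {raw!r}")
        mapping[name.strip()] = (intra, inter)
    return mapping
```

Rejecting unknown class labels up front keeps a typo in the parameter file from silently changing how the driver schedules an event.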