
Motivation for driver

  • Complex dependencies between operations; implementing these correctly is not trivial
  • Merging of multiple "streams" (one read & multiple writes) and compressing/stretching them
  • Consistent way of measuring and reporting metrics
  • Complements auditing: verifying the correctness of implementations will be slower and more expensive without a standard driver
  • Automated validation of the correctness of query implementations
  • Adoption: the barrier to entry is lower when a driver is provided. Some vendors may write their own, but most researchers who want to use our workload will not

Current Problems

  • Problem: insufficient load generated, due to: checking delay in hot path && single threaded updates
  • Details:
    • OGL gets a max of 600% CPU with the SNB driver
    • OGL skipped the driver and did the same in SQL procedures, reading from the parameter and update files after these are first partitioned. No problem: we get full platform utilization in a snap at 30G and 300G and can adjust the queries. Waiting for the driver to get fixed would have been unfeasible.
  • Response:
    • Agree that simple prototype based on SQL procedures was sufficient for this experiment. Waiting would have slowed down entire project.
    • It *is* simple experiment though, script does not: track GCT/maintain dependencies, record metrics, schedule on start time, etc.
    • From what I know this was 24 threads (vs the driver's current 1 thread for updates), which accounts for the 20x speedup. The driver should have 1 thread per update stream partition. I had not had time or feedback regarding this until now; will prioritize it. Once we have that we can run the tests again.
  • Questions:
    • How is the update stream partitioned? By Forum (as far as I know the Forum ID is not available for many updates), or round-robin? If the Forum ID is not used, the partitioning is incorrect.
    • Are queries sent at full speed or scheduled on start time?


  • Problem: insufficient load generated, due to: checking delay in hot path
    • Latency tolerance is the largest complication and likely overhead of the driver. For all the aforementioned exercises, we just need workload as fast as it will go. The thinking time and execution window part is nice in principle but if it makes the driver not produce full platform utilization we just optimize away the whole driver.
    • The matter of checking latency, i.e. did everything get done within the allotted time window is the last step in qualifying a run, no more.
  • Response:
    • Start time delay checks now moved out of the hot path, to metrics collection component: background thread that reads results from a queue and records metrics based on those results, meaning: (1) delay latency now measured AFTER execution (2) delay latency now measured in a different thread.
    • Also added a flag -ignore_start_times: max throughput mode. Still tracks GCT (dependencies are maintained) but other than that it goes as fast as it can, does not: busy wait for scheduled start times, check tolerated delay, or record tolerated delay.
    • Benchmarked the read workload with -ignore_start_times and "null handlers": 250,000 op/sec with 1 thread, 500,000 op/sec with 2 threads, 1,000,000 op/sec with 4 threads.
    • Let's find a way to track driver performance from now on.
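As a rough illustration of the change described above, here is a minimal Python sketch (not the actual driver code; the class name, queue protocol, and TOLERATED_DELAY value are all made up): worker threads only enqueue results, and the delay check happens afterwards in a background metrics thread.

```python
import queue
import threading

TOLERATED_DELAY = 0.010  # seconds; placeholder bound, not an agreed value


class MetricsCollector:
    """Background thread that reads completed-operation results from a
    queue and records delay metrics OFF the hot path (sketch only)."""

    def __init__(self):
        self.results = queue.Queue()
        self.total = 0
        self.late_count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def submit(self, scheduled_start, actual_start, runtime):
        # Called from worker threads: enqueue only, no delay check here.
        self.results.put((scheduled_start, actual_start, runtime))

    def _run(self):
        # Delay is measured AFTER execution, in this separate thread.
        while True:
            item = self.results.get()
            if item is None:  # shutdown sentinel
                break
            scheduled_start, actual_start, runtime = item
            self.total += 1
            if actual_start - scheduled_start > TOLERATED_DELAY:
                self.late_count += 1

    def shutdown(self):
        self.results.put(None)
        self._thread.join()
```

This keeps the hot path down to a single queue put per operation; everything else happens in the collector thread.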


  • Problem: read/write mix && time compression <-- not driver-caused
    • There is the other issue of the read/update mix. All the TPC benchmarks that are about online transaction processing (C and E) tie the throughput to scale. SNB cannot help but do the same: the data generator makes an update stream, and the update stream is tied to simulation time (the fictional timeline of the events in the database). In any real application, there is a fairly constant read/update ratio. In all the TPC workloads this is in fact precisely determined. So it might as well be so here. However, an effect of this policy is to make high numbers achievable only at high scales; in TPC-C, for example, these are scales which never occur in practice in this type of application. This undermines the credibility of the whole benchmark. It remains very good for OLTP-style monster truck races but is removed from the way real deployments of the sort work.
    • Driver should operate in simulation time. Relative frequencies of different queries should be expressed in relation to the due time of updates. If one acceleration applies to updates the same applies to queries. Constant, predictable read/write mix.
    • The interaction of latency and updates will become clear, as this is now as good as ignored: Updates do not impact query latency per se, as long as the database state is never written to disk or check-pointed. If either of the two happens, then all latency bounds are blown unless there is a TPC-C style IO system (hundreds of devices). Examples of latency bounds are the scheduled execution/latest time to answer thing in the present driver or the 90th percentile and average execution time per query type below a limit, as in TPC C and E. It could be the updates are so few there is anyhow no issue, which also interacts with the materialization of query results question. 
    • Driver should peg read/write mix to the update stream (so many reads of such and such sort per new post/comment). We need:
      • 1. reference implementation that has no obvious stupidities (right query plans) 
      • 2. See how fast this will go when running at full platform. We do not want a query mix that will not keep up with simulation time on a commodity server when the working set is in memory. Such a thing would not be credible for an interactive workload.
  • Response:
    • Frequencies of reads & writes are tied together already. 
    • time_compression_ratio applies to an entire stream already, both reads & writes
    • At present the "default" is: reads at the specified mix, and updates at exactly the rate the update stream gives. This is NOT good. 
    • We need to (don't yet) know HOW MUCH TO COMPRESS UPDATE STREAM[1] before merging it into our reads. After that point the same compression ratio should be applied to everything together, to maintain that ratio. [1] should come from OGL.
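To illustrate the intended policy, a minimal Python sketch (function and parameter names are invented): first compress the update stream by the factor [1] that should come from OGL, merge it with the reads, and then apply one global time_compression_ratio to the merged stream, so the read/write mix stays constant.

```python
import heapq


def merge_and_compress(reads, writes, update_compression, global_compression):
    """Sketch: `reads` and `writes` are lists of (due_time, op) in
    simulation time, each sorted by due time. First compress the update
    stream by a vendor-supplied factor (the '[1]' in the notes), then
    merge, then apply ONE global compression ratio to everything so the
    read/write ratio is preserved."""
    compressed_writes = [(t * update_compression, op) for t, op in writes]
    merged = list(heapq.merge(reads, compressed_writes, key=lambda x: x[0]))
    return [(t * global_compression, op) for t, op in merged]
```

The key property is that `global_compression` is applied after merging, so speeding up the workload speeds up reads and writes by exactly the same factor.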


  • Problem: scale-out
    • For driving millions of clicks per second and for realism in system configuration, the driver must be scale-out capable. TPC-C disclosures have about as much gear in the driver as in the SUT. SNB is not as extreme, since the individual operations are larger, but scale-out is still relevant, at least if one has short queries in the mix.
  • Response:
    • Agreed, but first let's see what performance we get from one many core machine and partitioned updates.
    • Having this as a requirement AND considering that tracking dependencies is required for LDBC SNB, is in fact an argument FOR having a driver provided by LDBC. 
    • Adding scale-out to the driver will not require a complete redesign. Not even close. It has been a consideration along the way. The configuration & GCT components will require updates. 
    • Propose keeping scale-out in mind while adding features and providing clear road-map/documentation as to what needs to be done to achieve this. This will limit development slowdowns when LDBC moves into a company structure, and development teams grows/shrinks/changes.


  • Problem: dynamic injection of operations (or, at least, substitution parameters) into workload, during execution
    • An operation should be able to insert a follow up operation in the stream of future scheduled operations with a due time in simulation time.
    • The last sizing things we did were with 2 extra short queries, person and profile and post lookup, effectively doing a random walk. This was added in order to generate high operation counts per second, i.e. a better looking metric and more realism when compared to actual systems and any other graph serving benchmarks.
    • These should probably be in the end product since these make for more topical realism. These do not make new query optimization choke points but the BI department is anyhow more for that. There are valid choke points in handling very high concurrency of trivial queries. An interactive workload should have that.
    • If such are in the end product (benchmark spec) then the scale out capability of the driver becomes relatively more important.
  • Response:
    • This request came from OGL a week ago. 
    • Agreed, we should add this, but let's work TOGETHER on such proposals and decide on the best way to realize this. Andrey and I have had 2 short calls, one midweek and another on Sunday. IMO, this functionality should be added, but it needs to be prioritized vs other functionality. That is the point of this call.


  • Problem: allow sponsors-provided drivers
    • I would say to formalize expected driver behaviors and outputs and let test sponsors make their own if they want to; in this way there is no critical path and Neo4j does not have to do a thing. The issue will not even come up in public. 
    • Neo4j will likely agree that a scale-out capable driver is out of the question within the bounds of the schedule, so for large scales test sponsors must be allowed their own anyhow. TPC C constitutes an unimpeachable precedent.
  • Response:
    • This is not a "Neo/Graph database" issue, it is about providing the least amount of friction for adopters, and for those that continue LDBC. 
    • As far as I know, TPC-C does not have the type of dependencies we do.
    • Auditing that vendor-supplied drivers adhere to these dependencies could be impossible in practice, and would almost certainly be more expensive to do correctly than if we TOGETHER developed tooling that automated this.
    • With the right auditing procedure we could allow sponsor-supplied drivers, at least in the interim, and maybe forever, that's not my call.
    • Depending on what we do from now on, a scale-out driver may or may not be feasible with the time remaining. 
    • However, not providing a scale-out driver does not mean it is not on our road-map. We can continue development of the current driver with scale-out in mind.

Missing Functionality

  • Scale-out
    • Description
      • Multiple driver processes coordinate to execute one workload, against one SUT
      • Driver tracks Global Completion Time (GCT) by communicating with one another
      • Progress is intermittently reported by all worker processes to one coordinator?
      • Metrics are gathered locally and gossiped intermittently to monitor for workload failure, e.g., over x% of operations were late
      • Configuration somehow distributed to all worker processes
    • Priority
      • ???
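A toy sketch (Python; invented names, ignoring the actual communication mechanism) of the GCT idea described above: each worker process reports a local completion time, GCT is the minimum of these, and only operations whose dependency time is covered by GCT may be scheduled.

```python
def global_completion_time(local_completion_times):
    """Sketch: GCT is the minimum of the local completion times reported
    by all driver processes -- every update with a due time at or below
    GCT is known to have completed everywhere."""
    return min(local_completion_times)


def runnable(operations, gct):
    """Operations are (op, dependency_time) pairs; an operation may run
    once GCT has advanced past its dependency time."""
    return [op for op, dep_time in operations if dep_time <= gct]
```

The hard part in a scale-out driver is of course keeping the reported local completion times fresh without flooding the network; this sketch only shows the invariant being maintained.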
  • Warm-up
    • Description
      • Before a benchmark run the driver executes a warmup routine
      • Warmup workload should probably be the same as benchmark workload
      • Warmup runs for: x operations, x duration, as long as some invariant holds (e.g., until steady state is reached for at least some duration)
      • Should metrics be reported for this phase?
      • What should happen once warmup is over?
        • Status update changes
        • New set of metrics start to be collected (with/without discarding old set)
        • Number of operations to execute as part of workload start now
      • The stream needs to be lazy, all the way through, as there is no way of knowing how long it will be if warmup is "until steady state" <-- number of operations to reach steady state is unknown
      • Should we have SteadyState OR MaxOperationCount for warmup?
    • Priority
      • ???
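As one way to picture the "SteadyState OR MaxOperationCount" question, a hypothetical Python sketch (the windowed-mean steady-state test and all names here are assumptions, not an agreed design):

```python
def run_warmup(execute_next, max_operations, window, tolerance):
    """Sketch of a 'SteadyState OR MaxOperationCount' warmup: execute
    operations until either max_operations is reached, or the mean runtime
    of the last `window` operations is within `tolerance` (relative) of
    the previous window's mean. `execute_next` runs one operation and
    returns its runtime. Returns (ops_executed, reached_steady_state)."""
    runtimes = []
    for i in range(max_operations):
        runtimes.append(execute_next())
        if len(runtimes) >= 2 * window:
            prev = sum(runtimes[-2 * window:-window]) / window
            curr = sum(runtimes[-window:]) / window
            if prev > 0 and abs(curr - prev) / prev <= tolerance:
                return i + 1, True
    return max_operations, False
```

Because the steady-state branch can stop at an unpredictable operation count, the operation stream feeding `execute_next` has to be lazy, as noted above.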

  • Thread per synchronous update stream
    • Description
      • Certain types of updates have dependencies on other updates that occurred a very small amount of time before them, for these operations it is more efficient to simply execute them sequentially than it is to maintain the dependencies using Global Completion Time
      • The operations come from the update stream, and at the moment ALL operations of that type are executed sequentially in ONE thread. This does not make sense
      • It is possible to partition the update stream by Forum, then execute all updates for that given forum in 1 thread. With multiple partitions you could run multiple threads, getting linear speedup
      • Though this requires some changes to the way a workload definition looks, it is not so complicated
      • The way configuration and time compression/expansion work will change a little bit, but not much else
      • We need a way to partition the update stream though, which will have to include the data generator because some events (e.g., add comment) do not include information that identifies which forum they belong to
    • Priority
      • ???
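A minimal Python sketch (invented names) of the partitioning step, assuming the data generator has already attached a forum id to every update, which, as noted above, it currently does not for events like add-comment:

```python
from collections import defaultdict


def partition_update_stream(updates, num_partitions):
    """Sketch: hash-partition updates by forum id so that all updates
    touching one forum land in the same partition. Each partition can
    then be executed sequentially in its own thread, giving roughly
    linear speedup with the number of partitions while preserving
    per-forum ordering."""
    partitions = defaultdict(list)
    for update in updates:
        partitions[hash(update["forum_id"]) % num_partitions].append(update)
    return partitions
```

Within a partition, ordering by due time is preserved because the input stream is already ordered; across partitions there are no intra-forum dependencies, so no GCT check is needed for them.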

  • More elaborate failure scenarios
    • Description
      • If an SUT runs too far behind the configured workload it should fail
      • If an SUT has too many crashing query handlers it should fail
      • What we have right now is too simplistic/harsh: fail after 1 late operation
      • We need to decide under what conditions a benchmark run should fail. Is x% of operations were late sufficient? If not, what do we want?
      • As the recording of delay, results, response time, and everything else all occur in one background thread it should be relatively easy to add slightly more elaborate failure scenarios, because all context is in one place
      • A little fiddling will need to be done, but it shouldn't be a major effort
    • Priority
      • ???
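For example, a possible rule (Python sketch; the thresholds are placeholders, not agreed-upon numbers) that is less harsh than failing after 1 late operation:

```python
def should_fail(total_ops, late_ops, crashed_ops,
                max_late_fraction=0.05, max_crashed_fraction=0.001):
    """Sketch of a failure rule: fail the run only if more than
    max_late_fraction of operations started late, or more than
    max_crashed_fraction of query handlers crashed. Both thresholds
    are illustrative placeholders."""
    if total_ops == 0:
        return False
    if late_ops / total_ops > max_late_fraction:
        return True
    if crashed_ops / total_ops > max_crashed_fraction:
        return True
    return False
```

Since the background metrics thread already sees every result, such a check is one extra comparison per recorded operation.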

  • Validation of update queries
    • Description
      • At the moment the driver has an option to validate the correctness of read queries. It must be run against a freshly imported database. The process executes queries sequentially, waiting for one to return before dispatching the next. It then serializes the results to a file, which can be used to validate another sponsor's implementation
        • FYI: this was already trialled when validating 3x Neo4j connector implementations against one another, then using that file Sparksee was validated. Sparksee did not pass all tests. We diagnosed those failures as errors in one Neo4j implementation (the one used to generate the file) and in the data generator. The bugs were fixed and the process repeated, at which stage all 4 connectors agreed on query responses. See for details.
      • In the future we may want to validate that updates are being performed correctly during execution. We could use Global Completion Time to know when they have been applied, and then issue a validation query.
      • This is lowest priority of the functionality mentioned so far.
    • Priority
      • ???
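The read-validation process might be sketched like this (Python; the file format and names are invented, the real driver's serialization format may differ):

```python
import json


def serialize_validation_set(operations, handler, path):
    """Sketch: run read queries sequentially against a freshly imported
    database and record (operation, result) pairs, one JSON object per
    line. `handler` stands in for the connector under test."""
    with open(path, "w") as f:
        for op in operations:
            json.dump({"operation": op, "result": handler(op)}, f)
            f.write("\n")


def validate_against(handler, path):
    """Replay the recorded operations against another implementation and
    return the list of operations whose results differ."""
    mismatches = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if handler(record["operation"]) != record["result"]:
                mismatches.append(record["operation"])
    return mismatches
```

Validating updates would add one step: wait until GCT shows the update has been applied, then issue the corresponding read and compare.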

Driver Performance Tests

  • I would like to have a small set of driver performance tests and a way to report the results in a useful way

  • For example, a configuration parameter (-rl/--results_log) tells driver to write certain metrics (type, scheduled_start, actual_start, runtime) for EVERY operation to a CSV file, as in table below (this is already implemented)

  • A plot could then be made of that file, to see how well the driver tracked its target workload (see example plot below)

  • Tests should be as simple as possible, we do not need a database for this. Query handlers could simply sleep for as long as, for example, Orri reports Virtuoso as taking
    We could take those numbers from here: 
  • However, it would be greatly appreciated if IN ADDITION to these simple tests, vendors also generated a few of these files using different configurations, while running the real workload
     They would then provide 3 files:
      - results.json (collected summary metrics like percentiles)
      - configuration file (automatically output, containing the full list of used parameters)
      - results_log.csv (the one discussed here and shown in table below) 
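A sketch (Python; the tolerated-delay bound is a placeholder) of how such a results_log.csv could be summarized to see how well the driver tracked its target workload:

```python
import csv


def summarize_results_log(path, tolerated_delay_ms=10.0):
    """Sketch: read a results_log.csv with columns
    type,scheduled_start,actual_start,runtime (times in ms, per the
    metrics listed above) and report the total operation count, how many
    started late, and the worst observed start delay."""
    total = late = 0
    max_delay = 0.0
    with open(path) as f:
        for row in csv.DictReader(f):
            delay = float(row["actual_start"]) - float(row["scheduled_start"])
            max_delay = max(max_delay, delay)
            total += 1
            if delay > tolerated_delay_ms:
                late += 1
    return {"total": total, "late": late, "max_delay_ms": max_delay}
```

The same per-operation file could feed the plot of scheduled vs actual start times mentioned above.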


Meeting Minutes

  • ...

Other comments from Orri

  • You can run at 300G 2x faster than simulation time, doing xxx clicks at 300GB with acceleration 2. This is a possible metric. The acceleration 2 is redundant since the clicks per second at 1:1 simulation to real time is in fact given by the scale factor. Each user does so much on the average per day, the more users the more gets done in a day.
  • So, now we find that with the mix, there is a 2x acceleration that can be had with 12 cores at 300G. This is valuable information. A benchmark should be usable, or at least adaptable, for making an actual sizing study. The would-be service operator runs the benchmark and gets a throughput. If the sustained throughput is around 3x the expected or actual throughput then the SUT is good.
  • A mock FDR for SNB interactive should be made. I will make one at 300G. I expect the minimum run length, the actual latency bounds and some of the recovery test things to slightly shift in the process. The composition of supporting files will also become clear. This will be done client-server but I will have no need of test driver updates for this.