Benchmark Guide

This guide introduces the detailed steps for executing the benchmark to validate the performance of various data lake formats.

By following the steps in this guide, you can evaluate the analytical performance of different data lake table formats. You can also flexibly adjust the test scenarios to obtain results that better match your actual use case.

Deploy the testing environment

Deploy with Docker

With Docker Compose, you can quickly set up an environment for running the benchmark. For the detailed steps, refer to Lakehouse-benchmark.
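In practice, the Docker-based setup typically reduces to cloning the repository and starting the bundled Compose file. The commands below are a minimal sketch; the repository URL placeholder and the assumption that a docker-compose.yml sits at the repository root should be checked against the Lakehouse-benchmark link above.

    # Minimal sketch; see the Lakehouse-benchmark repository for the authoritative steps
    git clone <Lakehouse-benchmark repository URL>
    cd lakehouse-benchmark
    docker-compose up -d    # start the benchmark environment in the background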

Deploy manually

Alternatively, you can manually deploy the following components to set up the test environment:

| Component | Version | Description | Installation Guide |
| --- | --- | --- | --- |
| MySQL | 5.7+ | MySQL is used to generate TPC-C data that is synchronized to the data lakes. | MySQL Installation Guide |
| Hadoop | 2.3.7+ | Hadoop provides the storage for the data lakes. | Ambari |
| Trino | 380 | Trino is used to execute TPC-H queries against Iceberg and Mixed-Iceberg format tables. | Trino Installation Guide |
| Amoro Trino Connector | 0.4.0 | To query Mixed-Iceberg format tables in Trino, you need to install and configure the Amoro connector in Trino. | Amoro Trino Connector |
| Iceberg Trino Connector | 0.13.0 | To query Iceberg format tables in Trino, you need to install and configure the Iceberg connector in Trino. | Iceberg Trino Connector |
| Presto | 274 | Presto is used to execute TPC-H queries against Hudi format tables. | Presto Installation Guide |
| Hudi Presto Connector | 0.11.1 | To query Hudi format tables in Presto, you need to install and configure the Hudi connector in Presto. | Hudi Presto Connector |
| AMS | 0.4.0 | Amoro Management Service, which supports self-optimizing on tables during the test. | AMS Installation Guide |
| data-lake-benchmark | 21 | The core benchmark program, responsible for generating test data, executing the testing process, and producing test results. | Data Lake Benchmark |
| lakehouse-benchmark-ingestion | 1.0 | A data synchronization tool based on Flink CDC that synchronizes data from a database to a data lake in real time. | Lakehouse Benchmark Ingestion |
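As a concrete example of the connector rows above: in Trino, a connector is enabled by placing a catalog properties file under etc/catalog/. The snippet below is a minimal sketch for the Iceberg connector backed by a Hive metastore; the metastore URI is a placeholder, and the Amoro Trino connector and Hudi Presto connector are configured analogously per their linked guides.

    # etc/catalog/iceberg.properties (minimal sketch; replace the metastore URI with your own)
    connector.name=iceberg
    hive.metastore.uri=thrift://metastore-host:9083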

Benchmark steps

  1. Edit the configuration file config/mysql/sample_chbenchmark_config.xml of the data-lake-benchmark program, filling in the MySQL connection information and the scalefactor parameter. scalefactor represents the number of warehouses, which controls the overall data volume; 10 or 100 is a typical choice. A hedged sketch of this file is given after this list.

  2. Generate the static data into MySQL with the command:

     java -jar lakehouse-benchmark.jar -b tpcc,chbenchmark -c config/mysql/sample_chbenchmark_config.xml --create=true --load=true

  3. Edit the configuration file config/ingestion-conf.yaml of the lakehouse-benchmark-ingestion program, filling in the MySQL connection information. An illustrative sketch also follows this list.

  4. Start the ingestion job to synchronize data from MySQL to the data lake tables with the command:

     java -cp lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar com.netease.arctic.benchmark.ingestion.MainRunner -confDir [confDir] -sinkType [arctic/iceberg/hudi] -sinkDatabase [dbName]

  5. Execute the TPC-H benchmark on the static data with the command:

     java -jar lakehouse-benchmark.jar -b chbenchmarkForTrino -c config/trino/trino_chbenchmark_config.xml --create=false --load=false --execute=true

  6. Execute the TPC-C program to continuously write data into MySQL with the command:

     java -jar lakehouse-benchmark.jar -b tpcc,chbenchmark -c config/mysql/sample_chbenchmark_config.xml --execute=true -s 5

  7. Execute the TPC-H benchmark on the dynamic data with the command:

     java -jar lakehouse-benchmark.jar -b chbenchmarkForTrino -c config/trino/trino_chbenchmark_config.xml --create=false --load=false --execute=true

  8. Obtain the benchmark results in the result directory of the data-lake-benchmark project.

  9. Repeat step 7 to obtain benchmark results for different points in time.
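For step 1, the benchmark configuration is an XML file. The following is a minimal sketch, assuming the BenchBase-style layout that data-lake-benchmark derives from; the element names and values shown here are assumptions for illustration, so treat the shipped sample_chbenchmark_config.xml as the source of truth.

     <!-- Minimal sketch of config/mysql/sample_chbenchmark_config.xml; assumed BenchBase-style layout -->
     <parameters>
         <type>MYSQL</type>
         <driver>com.mysql.cj.jdbc.Driver</driver>
         <url>jdbc:mysql://localhost:3306/chbenchmark</url>
         <username>root</username>
         <password>your-password</password>
         <!-- scalefactor = number of warehouses, controlling the overall data volume; commonly 10 or 100 -->
         <scalefactor>10</scalefactor>
     </parameters>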
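For step 3, config/ingestion-conf.yaml holds the MySQL connection information for the ingestion job. The keys below are purely illustrative, mirroring common Flink CDC MySQL source options, and may not match the real file; consult the sample ingestion-conf.yaml shipped with lakehouse-benchmark-ingestion for the actual keys.

     # Illustrative sketch of config/ingestion-conf.yaml; key names are assumptions
     source.type: mysql
     source.hostname: localhost
     source.port: 3306
     source.username: root
     source.password: your-password
     source.database.name: chbenchmark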