# Benchmark Guide
This guide introduces the detailed steps for executing the benchmark to validate the performance of various data lake formats.

By following the steps in this guide, you can learn about the analytical performance of different data lake table formats. At the same time, you can flexibly adjust the test scenarios to obtain results that better match your actual workload.
## Deploy testing environment
### Deploy by Docker
With Docker-Compose, you can quickly set up an environment for performing the benchmark. For the detailed steps, refer to: Lakehouse-benchmark.
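A typical Docker-Compose workflow looks like the following sketch (the repository URL and service layout are assumptions; follow the Lakehouse-benchmark project linked above for the authoritative steps):

```shell
# Fetch the benchmark project and start the preconfigured stack in the background.
# (Repository URL is an assumption; see the Lakehouse-benchmark link above.)
git clone https://github.com/NetEase/lakehouse-benchmark.git
cd lakehouse-benchmark
docker-compose up -d
```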
### Deploy manually
Alternatively, you can manually deploy the following components to set up the test environment:
| Component | Version | Description | Installation Guide |
|---|---|---|---|
| MySQL | 5.7+ | MySQL is used to generate TPC-C data for synchronization to the data lakes. | MySQL Installation Guide |
| Hadoop | 2.3.7+ | Hadoop provides the storage for the data lakes. | Ambari |
| Trino | 380 | Trino is used to execute TPC-H queries for Iceberg and Mixed-Iceberg format tables. | Trino Installation Guide |
| Amoro Trino Connector | 0.4.0 | To query Mixed-Iceberg format tables in Trino, you need to install and configure the Amoro connector in Trino. | Amoro Trino Connector |
| Iceberg Trino Connector | 0.13.0 | To query Iceberg format tables in Trino, you need to install and configure the Iceberg connector in Trino. | Iceberg Trino Connector |
| Presto | 274 | Presto is used to execute TPC-H queries for Hudi format tables. | Presto Installation Guide |
| Hudi Presto Connector | 0.11.1 | To query Hudi format tables in Presto, you need to install and configure the Hudi connector in Presto. | Hudi Presto Connector |
| AMS | 0.4.0 | Amoro Management Service, which supports self-optimizing on tables during the test. | AMS Installation Guide |
| data-lake-benchmark | 21 | The core benchmark program, responsible for generating test data, executing the testing process, and generating test results. | Data Lake Benchmark |
| lakehouse-benchmark-ingestion | 1.0 | A data synchronization tool based on Flink-CDC that can synchronize data from the database to the data lake in real time. | Lakehouse Benchmark Ingestion |
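As an illustration of the connector setup, a Trino catalog is registered with a properties file under `etc/catalog/`; a minimal sketch for the Iceberg connector, assuming a Hive Metastore at a placeholder address:

```properties
# etc/catalog/iceberg.properties — minimal Iceberg catalog for Trino
connector.name=iceberg
# Hive Metastore that tracks the Iceberg tables (placeholder host/port)
hive.metastore.uri=thrift://metastore-host:9083
```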
## Benchmark steps
1. Configure the file `config/mysql/sample_chbenchmark_config.xml` of the program `data-lake-benchmark`. Fill in the MySQL connection information and the parameter `scalefactor`. `scalefactor` represents the number of warehouses, which controls the overall data volume; generally, choose 10 or 100.
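   For reference, a minimal sketch of the relevant settings, assuming the file follows the BenchBase-style layout used by `lakehouse-benchmark` (element names and values here are illustrative, not the full sample file):

   ```xml
   <parameters>
       <!-- MySQL connection settings (placeholder values) -->
       <type>MYSQL</type>
       <driver>com.mysql.cj.jdbc.Driver</driver>
       <url>jdbc:mysql://localhost:3306/chbenchmark</url>
       <username>root</username>
       <password>password</password>
       <!-- Number of TPC-C warehouses; controls the overall data volume -->
       <scalefactor>10</scalefactor>
   </parameters>
   ```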
2. Generate static data into MySQL with the command:

   ```shell
   java -jar lakehouse-benchmark.jar -b tpcc,chbenchmark -c config/mysql/sample_chbenchmark_config.xml --create=true --load=true
   ```
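   Here `--create=true` builds the TPC-C and CH-benCHmark schema and `--load=true` populates it with the initial data set, following the usual BenchBase-style flag semantics.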
3. Configure the file `config/ingestion-conf.yaml` of the program `lakehouse-benchmark-ingestion`. Fill in the MySQL connection information.
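   A minimal sketch of the MySQL settings (the key names are assumptions; consult the project's sample `ingestion-conf.yaml` for the exact schema):

   ```yaml
   # MySQL source connection (placeholder values)
   source.type: mysql
   source.hostname: localhost
   source.port: 3306
   source.username: root
   source.password: password
   ```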
4. Start the ingestion job to synchronize data from MySQL to the data lake tables with the command:

   ```shell
   java -cp lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar com.netease.arctic.benchmark.ingestion.MainRunner -confDir [confDir] -sinkType [arctic/iceberg/hudi] -sinkDatabase [dbName]
   ```
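   For example, to synchronize into an Iceberg database named `chbenchmark` (the `confDir` and database name are placeholders):

   ```shell
   java -cp lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar \
     com.netease.arctic.benchmark.ingestion.MainRunner \
     -confDir ./config -sinkType iceberg -sinkDatabase chbenchmark
   ```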
5. Execute the TPC-H benchmark on the static data with the command:

   ```shell
   java -jar lakehouse-benchmark.jar -b chbenchmarkForTrino -c config/trino/trino_chbenchmark_config.xml --create=false --load=false --execute=true
   ```
6. Execute the TPC-C program to continuously write data into MySQL with the command (the `-s 5` flag controls the metrics sampling window, in seconds):

   ```shell
   java -jar lakehouse-benchmark.jar -b tpcc,chbenchmark -c config/mysql/sample_chbenchmark_config.xml --execute=true -s 5
   ```
7. While the TPC-C program from the previous step is still writing, execute the TPC-H benchmark on the dynamic data with the command:

   ```shell
   java -jar lakehouse-benchmark.jar -b chbenchmarkForTrino -c config/trino/trino_chbenchmark_config.xml --create=false --load=false --execute=true
   ```
8. Obtain the benchmark results in the `result` directory of the `data-lake-benchmark` project.
9. Repeat step 7 to obtain benchmark results for different points in time.
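Because the TPC-C load keeps writing while you repeat step 7, each run reflects the table state at a different point in time; comparing these runs shows how each format's read performance evolves as data accumulates.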