Bigstream Performance Numbers:

Benchmark Report 

Overview

Bigstream accelerates big data platforms such as Apache Spark™ in a variety of ways. The software-only solution runs Spark tasks as compiled native C++ code, which is faster than the bytecode Spark generates. Bigstream software can also connect Spark to advanced acceleration hardware such as field-programmable gate arrays (FPGAs). FPGAs can process many operators far faster than a central processing unit (CPU), the typical compute resource in a Spark cluster. Computational storage, specifically the Samsung SmartSSD, is another form of hardware acceleration that Bigstream enables.

 

Acceleration runs individual tasks faster while leaving the distribution of data and computation for Spark to manage. The Spark user keeps an unchanged programming model and gains performance.

 

Bigstream has rigorously tested the different solutions, including running each across the 99* distinct queries of the Transaction Processing Performance Council Benchmark™ Decision Support (TPC-DS). This report summarizes the benchmark testing, showing software-only acceleration up to 3x and hardware acceleration up to 8x.


Chapter 1

SmartSSD

The acceleration that the SmartSSD brings to Spark analytic workloads is significant and broad reaching. Among the key results:
 
  • 5.6x acceleration across the top 50 queries
  • 6.6x acceleration across the top 10 queries
  • 4.6x acceleration across all TPC-DS queries
  • 100% of queries completed faster with SmartSSD
  • 97% of queries at least 2x faster with SmartSSD
 
For the test, all 99 queries were first run using Spark alone and then run again on the identical configuration with Bigstream and the SmartSSD added. The raw data set is 3 terabytes (TB) in JavaScript Object Notation (JSON) format with gzip compression.**
 
TPC-DS is focused on SQL and OLAP and covers the widest array of queries. Many of these queries are representative of standard extract, transform, and load (ETL) processes as well as batch analytics. Others cover less common SQL operations. 
 
The performance of each TPC-DS query varies with the mix of operations in the query and with how much the Bigstream SmartSSD accelerates each of them. Even so, the chart shows that the majority of queries realize a large performance gain.

** Test configuration details: Xeon Platinum-class processor with 32 cores, 192 GB of memory, and 3 SmartSSDs of 4 TB each.

Chapter 2

Amazon F1 FPGA-based instances

The next series of benchmark tests introduces FPGA accelerators to existing Spark clusters. These tests were run on Amazon Elastic Compute Cloud (Amazon EC2), using the FPGA-based F1 instances.
 
All 99 queries were first run using Spark alone on the server CPU and then run using Spark with Bigstream. The input data set is 220 gigabytes (GB) of JavaScript Object Notation (JSON) data. 
 
The chart shows the end-to-end speedup of Spark with Bigstream and F1 acceleration over Spark alone. On average, queries ran 5.5x faster, and the largest improvement was 8.2x.
 

 

Chapter 3

Amazon EMR (with software-only acceleration)

This series of benchmarks compares the 99 TPC-DS queries run on Spark alone and then with software-only acceleration. The tests ran on Amazon EMR, with identical EC2 compute instances for the baseline and accelerated runs. The software-only solution delivers around 2x acceleration across all 99 queries: the native C++ operator-level acceleration cuts EMR run times roughly in half.
 
  • 67 of 99 queries run at least 2x faster
  • Average speedup 2.1x
  • Top 50 queries average 2.3x faster
 
This test ran on an 11-node cluster of r5d.2xlarge instances on Amazon EMR 5.29.
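
Summary figures like the ones in this report (average speedup, top-N average, share of queries at least 2x faster) can be derived from per-query run times. The sketch below is a minimal illustration, not Bigstream's method: the report does not say how its averages are computed, so this assumes a plain arithmetic mean of per-query speedups. The function name `speedup_summary` and the run times are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def speedup_summary(baseline_s, accelerated_s, top_n=10, threshold=2.0):
    """Summarize per-query speedups (baseline time / accelerated time).

    Assumption: averages are arithmetic means of per-query speedups.
    """
    # Per-query speedup; values above 1.0 mean the accelerated run was faster.
    speedups = {q: baseline_s[q] / accelerated_s[q] for q in baseline_s}
    ranked = sorted(speedups.values(), reverse=True)
    return {
        "average": mean(ranked),
        "top_n_average": mean(ranked[:top_n]),
        "share_faster": sum(s > 1.0 for s in ranked) / len(ranked),
        "share_at_least_threshold": sum(s >= threshold for s in ranked) / len(ranked),
    }


# Hypothetical run times in seconds, for illustration only (not report data).
baseline = {"q1": 120.0, "q2": 90.0, "q3": 200.0, "q4": 60.0}
accelerated = {"q1": 40.0, "q2": 60.0, "q3": 50.0, "q4": 30.0}

summary = speedup_summary(baseline, accelerated, top_n=2)
print(summary)
```

A ratio of total run times (sum of baseline times divided by sum of accelerated times) is another common definition of average speedup. It weights long-running queries more heavily and can give a different number from the mean of per-query ratios.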
 

Chapter 4

Summary

The report shares acceleration figures across the full array of TPC-DS queries. Both hardware- and software-based accelerators deliver consistent, significant speed gains on these standard benchmark tests for analytics. Bigstream’s acceleration library currently focuses on a significant subset of SQL specific operators, and that coverage continues to expand. 

 
 

Solution              Top 10    Avg Accel
EMR SW only           2.5x      2.1x
AWS FPGA (F1)         7.6x      5.5x
SmartSSD (on prem)    6.6x      4.6x

 

Bigstream has invested heavily in the ingest stage of Spark data pipelines because it is often the most time-intensive. This includes extract, transform, and load (ETL) and extract, load, and transform (ELT) workloads. Customers often deploy Bigstream acceleration on the workloads with the most challenging service level agreements (SLAs), and they have seen acceleration even higher than these benchmark results. TPC-DS, as the “Decision Support” name suggests, covers SQL and analytics. Even though those operations don’t see the largest acceleration, Bigstream still delivers impressive end-to-end acceleration of these queries.


* The TPC-DS benchmark includes 99 queries, though a few have multiple variants, so benchmarks report 104 queries. These results apply to the 96 queries that Spark runs successfully with its default settings, all of which also run successfully with Bigstream and the SmartSSD.



 
 