Spark Profiling with Bigstream xRay

Bigstream xRay is an easy-to-use profiling tool that provides detailed insights and visualizations of Apache Spark™ applications. While xRay’s power is in its simplicity and users can get started with it in minutes, this paper gives a deep dive on xRay’s background and capabilities.

Introducing xRay from Bigstream


In this whitepaper we cover:

  • Application profiling 
  • What xRay is and the questions it can answer
  • How Bigstream Hyperacceleration works in tandem with xRay
  • How to get started with xRay
  • xRay’s reporting and visualizations


Chapter 1

Performance Engineering and Application Profiling

Performance engineering teams have an array of tools that give views into their platforms and systems. Some are native to a given platform, some cover a range of applications, and others provide real-time information and send alerts the moment there is trouble. These tools have become a key asset for big data and analytics application development in particular. Data scientists, developers, and operations professionals use the tools and the data they produce to improve platform configuration, application structure, and application infrastructure.

While building and testing its acceleration software, Bigstream built up expertise in Apache Spark profiling and turned it into a tool that lets users profile their own Spark environments. xRay provides an in-depth view of Spark applications and is currently available free to all users.

xRay works seamlessly with Bigstream Hyperacceleration, analyzing users’ production application performance as well as the opportunities for acceleration. xRay is easy to install, provides fully web-based analysis, and covers both cloud and on-premises Spark applications.

What Is xRay? 

xRay is a web-based tool that gives users in-depth insights into the performance of their Apache Spark applications. xRay runs as a listener class on the user’s production application in their own environment, not as a benchmark or surrogate. It provides basic application run information, such as configurations and the runtimes of the application, its stages, and its tasks. It also analyzes log data signatures to give recommendations for performance improvement. Finally, it assesses the potential benefits of Bigstream Hyperacceleration for the application. Information about operators, user-defined functions (UDFs), stage structure, data formats, and more powers this acceleration assessment.

xRay’s standard reporting views answer the following questions, among others:

  • What configurations (Spark, YARN, and so forth) am I using? Can they be changed to improve performance?
  • Where is most of the time being spent in my application? Is it being spent on compute, I/O, idling, and so on?
  • What are the bottleneck stages of my application? What operators/data are they using?
  • Does my data have skew? 
  • Do certain stages have straggler tasks?
  • Which stages can be accelerated either via software or hardware acceleration? Are these the key stages in my application?
  • How much acceleration can I expect?
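As a rough illustration of the data-skew question in the list above, the sketch below compares the largest per-task input size in a stage with the mean. The function name, the sample sizes, and the threshold of 2.0 are assumptions made for this example; they do not describe how xRay itself detects skew.

```python
from statistics import mean


def skew_ratio(bytes_per_task):
    """Return max/mean of per-task input sizes (1.0 means perfectly balanced)."""
    if not bytes_per_task:
        raise ValueError("need at least one task")
    avg = mean(bytes_per_task)
    if avg == 0:
        return 1.0  # no input at all, so nothing can be skewed
    return max(bytes_per_task) / avg


# Hypothetical per-task input sizes (in MB) for one stage.
sizes_mb = [100, 100, 100, 100, 600]
ratio = skew_ratio(sizes_mb)
is_skewed = ratio > 2.0  # the 2.0 threshold is an arbitrary assumption
print(f"skew ratio = {ratio:.1f}, skewed = {is_skewed}")
```

A ratio close to 1.0 suggests evenly distributed input, while a large ratio means one task is reading far more data than its peers.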

Chapter 2

xRay and Bigstream Hyperacceleration

The xRay tool works in concert with Bigstream acceleration technology to optimize an organization’s Spark performance with zero code change.


Figure 1: xRay and Hyperacceleration

Figure 1 illustrates the synergy between xRay and acceleration. xRay provides information about the applicability of acceleration and estimates the expected improvement. For example, Bigstream accelerates the DataFrame/Dataset APIs (i.e., Spark SQL), and xRay analysis can confirm that these are the APIs the application uses. After reviewing the xRay report, the user applies acceleration and then reruns xRay to assess the performance gains.
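The before-and-after comparison described here comes down to simple arithmetic. The sketch below is only an illustration: the helper names and the two runtimes are hypothetical and are not taken from an actual xRay report.

```python
def speedup(baseline_s, accelerated_s):
    """Speedup factor: how many times faster the accelerated run is."""
    if baseline_s <= 0 or accelerated_s <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_s / accelerated_s


def time_saved_pct(baseline_s, accelerated_s):
    """Share of the baseline runtime that acceleration removed, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return 100.0 * (baseline_s - accelerated_s) / baseline_s


# Hypothetical runtimes (seconds) from two profiled runs of the same application.
baseline, accelerated = 1200.0, 400.0
print(f"speedup: {speedup(baseline, accelerated):.1f}x")
print(f"time saved: {time_saved_pct(baseline, accelerated):.1f}%")
```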



Chapter 3

xRay Setup: Running Your First Application with xRay

xRay Setup

The xRay setup is just a few short steps.

Step 1: Sign up at https://insight.bigstream.co/auth/sign-up, and arrive at the main xRay page.

Step 2: Under “Getting Started” on the left side of the page (see Figure 2), select “Download” and choose the xRay package that matches the OS on which you run Spark. Currently, packages are available for four Linux platforms: RedHat, Amazon EMR, Ubuntu, and CentOS. These packages work equally well on cloud and on-premises infrastructure.

 

Figure 2: User Start Page—Step 1 of Adding Application Log

 


Step 3: Run your Spark application as usual, with two changes to your Spark settings: add the xRay JAR file and a listener line. These enable the more detailed log information that powers xRay.
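In standard Spark, an extra JAR is commonly supplied with the spark-submit option --jars, and an extra listener is registered through the spark.extraListeners property. The sketch below assembles such a command line under the assumption that xRay is attached this way. The JAR path, the listener class name, and the application file are placeholders, not real xRay values, so use the exact settings shipped with the downloaded package.

```python
import shlex


def build_submit_command(app, xray_jar, listener_class, extra_args=()):
    """Assemble a spark-submit argument list that adds one JAR and one listener."""
    return [
        "spark-submit",
        "--jars", xray_jar,  # standard Spark option for extra JARs
        "--conf", f"spark.extraListeners={listener_class}",  # standard Spark property
        app,
        *extra_args,
    ]


cmd = build_submit_command(
    app="my_job.py",                            # placeholder application
    xray_jar="/path/to/xray.jar",               # placeholder path, not the real file name
    listener_class="com.example.XRayListener",  # placeholder class, not the real listener
)
print(shlex.join(cmd))
```

The same two settings can also be placed in spark-defaults.conf (as spark.jars and spark.extraListeners) if they should apply to every application on the cluster.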

Step 4: The final step is to upload the resulting logs to the xRay website via the “Upload New xRay” link.

 

Figure 3: User Start Page—Step 2 of Adding Application Log

 


After a few minutes of processing, you will have access to that application’s profile in your xRay dashboard.

Chapter 4

xRay Reporting and Visualization Examples

This section covers the xRay dashboard and some of the commonly used capabilities.

Figure 4: Dashboard View


Users navigate xRay through a basic web interface, starting with a simple homepage. The primary view on the left-hand navigation menu is Applications. Figure 4 shows the dashboard for a user with two application log outputs. Each application or application group has colored icons indicating xRay’s recommendations:

  • Blue = “Info”
  • Orange = “Warning”
  • Red = “Critical”

As users develop longer lists of applications, they can define “App Groups” to organize them. The round graphic in the upper left shows the aggregate runtime for all of this user’s applications.

Figure 5: Application Overview—CPU Utilization View


Users click on the application name to find the details of that specific Spark application. They start with an overview—for example, the overview in Figure 5—which includes runtime, cluster time used, the timeline, and more.

Figure 6: Application Overview—Disk I/O View


The App Overview provides timeline graphs for CPU Utilization (Figure 5), Memory Utilization, and Disk I/O (Figure 6).

Figure 7: Stage-by-Stage View


Further down the App Overview screen, xRay shows the stage-by-stage execution timeline (example in Figure 7). The dark parts of the individual stage timelines represent periods of time when the stage was runnable, but waiting for other inputs. Green portions show actual execution.
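To make the split between waiting and executing time concrete, the sketch below merges the task intervals of one stage and treats any part of the stage window with no running task as waiting. This definition is an assumption made for illustration, since the exact rule xRay applies is not stated here, and the function name and timestamps are hypothetical.

```python
def split_stage_time(stage_start, stage_end, task_intervals):
    """Return (executing, waiting) seconds for one stage.

    Overlapping task intervals are merged so that parallel tasks are not
    counted twice; whatever is left of the stage window counts as waiting.
    """
    executing = 0.0
    cur_start = cur_end = None
    for start, end in sorted(task_intervals):
        # Clip each task to the stage window.
        start, end = max(start, stage_start), min(end, stage_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged interval and open a new one.
            if cur_end is not None:
                executing += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        executing += cur_end - cur_start
    waiting = (stage_end - stage_start) - executing
    return executing, waiting


# Hypothetical stage running from t=0 to t=10 with three tasks.
executing, waiting = split_stage_time(0, 10, [(2, 5), (4, 7), (9, 10)])
print(f"executing: {executing}s, waiting: {waiting}s")
```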

The App Overview also provides the following:

  • DAG: Full interactive layout of the Directed Acyclic Graph (DAG) for the application, including the critical path
  • Job metrics: Including scan time, CPU utilization, and write times
  • Spark configuration
  • Cluster configuration
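The critical path mentioned in the DAG item is the longest chain of dependent stages by runtime, which bounds how fast the application can finish. The sketch below computes it for a small, made-up stage graph. The stage names, durations, and dependency map are hypothetical, and this is a generic longest-path calculation rather than xRay's own implementation.

```python
from functools import lru_cache


def critical_path(durations, parents):
    """Return (total_seconds, path) for the longest dependency chain.

    durations: {stage: seconds}; parents: {stage: [upstream stages]}.
    The graph must be acyclic.
    """
    @lru_cache(maxsize=None)
    def best(stage):
        # Longest (time, path) among all chains that end at this stage.
        options = [best(p) for p in parents.get(stage, [])]
        prev_time, prev_path = max(options, default=(0.0, ()))
        return prev_time + durations[stage], prev_path + (stage,)

    return max(best(stage) for stage in durations)


# Hypothetical application: s2 needs s0 and s1, and s3 needs s2.
durations = {"s0": 10, "s1": 30, "s2": 5, "s3": 20}
parents = {"s2": ["s0", "s1"], "s3": ["s2"]}
total, path = critical_path(durations, parents)
print(total, " -> ".join(path))
```

Shortening a stage that is not on this path (s0 in the example) leaves the end-to-end runtime unchanged, which is why the critical path is a useful guide to which stages to optimize first.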

The App Overview data and visualizations often provide clues that help troubleshoot an application, but users will usually need to probe a specific stage to find detailed insights. To view details at the stage level, click on a specific stage in the App Runtime stage-by-stage view.

Figure 8: Stage Analysis—Stage DAG


The initial view is the Stage DAG (Figure 8), highlighting the operations executed by the stage.

Figure 9: Stage Analysis—Executor Timeline


The Executor Timeline (Figure 9) presents task-level detail, visualizing delays, deserialization, shuffle read and write, and executor computing time.
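The same task-level components can be summarized numerically. The sketch below sums each timing component across the tasks of a stage and reports its share of the total. The field names and values are hypothetical stand-ins and do not reflect xRay's log schema.

```python
from collections import Counter


def time_breakdown(tasks):
    """Sum each timing component across tasks and return its share in percent."""
    totals = Counter()
    for task in tasks:
        totals.update(task)  # adds each component's seconds to the running total
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: 100.0 * value / grand_total for name, value in totals.items()}


# Hypothetical per-task timings in seconds.
tasks = [
    {"scheduler_delay": 1, "deserialization": 1, "shuffle_read": 2,
     "compute": 5, "shuffle_write": 1},
    {"scheduler_delay": 1, "deserialization": 1, "shuffle_read": 4,
     "compute": 3, "shuffle_write": 1},
]
breakdown = time_breakdown(tasks)
for name, pct in sorted(breakdown.items(), key=lambda item: -item[1]):
    print(f"{name:16s} {pct:5.1f}%")
```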

Figure 10: Stage Analysis—Task Charts


The Task Charts view (Figure 10) provides a scatter plot of the data, giving an alternative visual for identifying patterns and outliers. For example, a user who has identified a single long-running stage can use this plot to find “straggler” tasks.
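One simple way to express the idea of a straggler in code is to flag tasks whose runtime exceeds some multiple of the stage median. The sketch below does that. The factor of 1.5, the function name, and the sample durations are assumptions for illustration, not the rule xRay uses.

```python
from statistics import median


def find_stragglers(durations, factor=1.5):
    """Return (task_id, seconds) pairs slower than factor x median, slowest first."""
    cutoff = factor * median(durations.values())
    slow = [(task_id, secs) for task_id, secs in durations.items() if secs > cutoff]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical task runtimes (seconds) keyed by task ID.
durations = {0: 10, 1: 11, 2: 9, 3: 10, 4: 42}
stragglers = find_stragglers(durations)
print(stragglers)
```

Using the median rather than the mean keeps the cutoff stable even when one very slow task would otherwise pull the average up.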

The Task Metrics Table view gives users the data in tabular form, allowing them to sort information by dimensions such as task runtime, end time, result size, shuffle write/read time, and much more.

Figure 11: Stage Analysis—Task Histogram


Finally, the Task Histogram breaks out the number of tasks by execution time (Figure 11).
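A histogram of this kind can be sketched by bucketing task runtimes into fixed-width bins and counting the tasks in each. The bin width and the sample durations below are hypothetical, and the text bars stand in for the chart that xRay renders.

```python
def task_histogram(durations, bin_width):
    """Count tasks per execution-time bin, keyed by each bin's lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = {}
    for secs in durations:
        lower = int(secs // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical task runtimes in seconds, grouped into 5-second bins.
histogram = task_histogram([1.2, 3.4, 4.9, 5.0, 12.7], bin_width=5)
for lower, count in histogram.items():
    print(f"{lower:>3}-{lower + 5:<3}s {'#' * count}")
```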

Chapter 5

Summary

These metrics and visualizations provide users with actionable information regarding their application. Optimizations and improvements could include examining the application keying structure, adjusting memory allocation, employing Bigstream acceleration to shorten task times, or any combination of these approaches.

Click here to see an example xRay use case exploring a specific extract, load, transform (ELT) workflow.

In summary, xRay is an easy-to-apply tool that analyzes the performance of a huge spectrum of Spark applications. It can also point the way to utilizing seamless Bigstream Hyperacceleration to address a range of user performance needs. Get started with xRay today by signing up at insight.bigstream.co.
