
Spark

Stable. Updated: April 28, 2017

Apache Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. For more information, see the Apache Spark documentation.

DC/OS Apache Spark consists of Apache Spark with a few custom commits along with DC/OS-specific packaging.

DC/OS Spark offers the following benefits and features:

Benefits

  • Utilization: DC/OS Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
  • Improved efficiency
  • Simple management
  • Multi-team support
  • Interactive analytics through notebooks
  • UI integration
  • Security

Features

  • Multiversion support
  • Run multiple Spark dispatchers
  • Run against multiple HDFS clusters
  • Backports of scheduling improvements
  • Simple installation of all Spark components, including the dispatcher and the history server
  • Integration of the dispatcher and history server
  • Zeppelin integration
  • Kerberos and SSL support

Install and Customize

Spark is available in the Universe and can be installed by using either the web interface or the DC/OS CLI.

Prerequisites: Depending on your security mode in Enterprise DC/OS, you m...
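A minimal CLI install might look like the following; the package name spark is the Universe default, and options.json is a hypothetical file holding any non-default settings:

$ dcos package install spark

To install with a custom configuration, pass an options file:

$ dcos package install spark --options=options.json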

Configure Spark for HDFS

To configure Spark for a specific HDFS cluster, configure hdfs.config-url to be a URL that serves your hdfs-site.xml and core-site.xml. For example: { "hdfs": { "config-url": "http...
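Spelled out, an options file of that shape might look like the sketch below; the URL is a placeholder for any endpoint that serves both hdfs-site.xml and core-site.xml:

{
  "hdfs": {
    "config-url": "http://example.com/hdfs-config"
  }
}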

History Server

DC/OS Spark includes the Spark History Server. Because the history server requires HDFS, you must explicitly enable it. Install HDFS first:

$ dcos package install hdfs

Note: HDFS require...
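With HDFS in place, enabling the history server at install time might look like the following sketch; the history-server.enabled option name is an assumption about the package's options schema:

{
  "history-server": {
    "enabled": true
  }
}

$ dcos package install spark --options=options.json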

Security

Mesos Security. SSL: set `security.mesos.ssl.enabled` to true to enable SSL on Mesos communication (default: false). Authentication: when running in DC/OS strict security mode, both t...
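As an options file, the dotted property nests as shown in this sketch; the nesting is inferred from the dotted name and should be treated as an assumption:

{
  "security": {
    "mesos": {
      "ssl": {
        "enabled": true
      }
    }
  }
}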

Upgrade

Go to the Universe > Installed page of the DC/OS GUI. Hover over your Spark service to see the Uninstall button, then select it. Alternatively, enter the following from the DC/OS C...
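At this release, an upgrade amounts to uninstalling and then reinstalling the package; a CLI sketch, assuming your configuration is kept in options.json:

$ dcos package uninstall spark
$ dcos package install spark --options=options.json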

Uninstall

$ dcos package uninstall --app-id=<app-id> spark

The Spark dispatcher persists state in ZooKeeper, so to fully uninstall the Spark DC/OS package, you must go to http://<dc...
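For example, to remove an instance installed under the hypothetical service name spark-dev:

$ dcos package uninstall --app-id=spark-dev spark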

Runtime Configuration Change

You can customize DC/OS Spark in place while it is up and running. Go to the DC/OS GUI. Click the Services tab, then the name of the Spark framework to be updated. Within the Spark ...

Run a Spark Job

Before submitting your job, upload the artifact (e.g., a JAR file) to a location visible to the cluster (e.g., S3 or HDFS). Then run the job:

$ dcos spark run --submit-args=`-...
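A complete submission might look like the following sketch; SparkPi ships with Spark, but the JAR URL here is a placeholder for any artifact the cluster can reach:

$ dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi http://example.com/jars/spark-examples.jar 30"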

Interactive Spark Shell

You can run Spark commands interactively in the Spark shell. The Spark shell is available in Scala, Python, and R. Launch a long-running interactive bash se...
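One way to do this is to SSH to the cluster and start the shell inside the Spark Docker image; a sketch, where the image tag and the path to spark-shell inside the image are assumptions to check against your installed package version:

$ dcos node ssh --master-proxy --leader
$ docker run -it --net=host mesosphere/spark:<image-tag> ./bin/spark-shell --master mesos://leader.mesos:5050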

Custom Docker Images

Note: Custom Docker images are not supported by Mesosphere. You can customize the Docker image in which Spark runs by extending the standard Spark Docker image. In this way, you ca...
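Extending the image might look like this Dockerfile sketch; the base tag and the copied library directory are illustrative assumptions, not part of the package:

FROM mesosphere/spark:<image-tag>
# Layer site-specific JARs onto the stock image; the target path is an assumption
COPY extra-jars/ /opt/spark/dist/jars/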

Fault Tolerance

Failures such as host, network, JVM, or application failures can affect the behavior of three types of Spark components: the DC/OS Spark service, batch jobs, and streaming jobs. DC/OS Spark S...

Job Scheduling

This document is a brief overview of material described in greater detail in the Apache Spark documentation (see the Job Scheduling and Running Spark on Mesos pages).

Modes

Spark on Mesos supports two “modes” of...
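The mode is chosen per job through standard Spark configuration; for example, setting spark.mesos.coarse=false requests fine-grained mode (the class and JAR below are placeholders):

$ dcos spark run --submit-args="--conf spark.mesos.coarse=false --class com.example.MyJob http://example.com/jars/my-job.jar"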

Troubleshooting

Dispatcher

The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may arise if the dispatcher does not receive the resources...
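A first diagnostic step might be confirming that the dispatcher task is running; dcos task filters the task list by name, and spark is the default service name:

$ dcos task spark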

Version Policy

New releases are based on the latest version of the Apache Spark stable release train. We support HDFS version 2.6 by default, and versions 2.4 and 2.7 with customization.

Limitations

Mesosphere does not provide support for Spark app development, such as writing a Python app to process data from Kafka or writing Scala code to process data from HDFS. Spark jobs r...