
Spark

Stable. Updated: January 26, 2017

Apache Spark is a fast and general-purpose cluster computing system for big
data. It provides high-level APIs in Scala, Java, Python, and R, and
an optimized engine that supports general computation graphs for data
analysis. It also supports a rich set of higher-level tools including
Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX
for graph processing, and Spark Streaming for stream processing. For
more information, see the Apache Spark documentation.

DC/OS Spark consists of Apache Spark with a few custom commits, along with DC/OS-specific packaging.

DC/OS Spark provides the following benefits and features:

Benefits

  • Utilization: DC/OS Spark leverages Mesos to run Spark on the same
    cluster as other DC/OS services
  • Improved efficiency
  • Simple management
  • Multi-team support
  • Interactive analytics through notebooks
  • UI integration
  • Security

Features

  • Multiversion support
  • Run multiple Spark dispatchers
  • Run against multiple HDFS clusters
  • Backports of scheduling improvements
  • Simple installation of all Spark components, including the
    dispatcher and the history server
  • Integration of the dispatcher and history server
  • Zeppelin integration
  • Kerberos and SSL support

Quick Start

Install DC/OS Spark via the DC/OS CLI. Note: If you are using Enterprise DC/OS, you may need to follow additional instructions. See the Install and Customize section for more information.
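With the CLI configured, installing with package defaults is a single command:

    $ dcos package install spark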

Install

About installing Spark on Enterprise DC/OS: in Enterprise DC/OS strict security mode, Spark requires a service account. In permissive security mode, a service account is optional. Only someone wi...
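If you do need a service account, one way to create it is with the Enterprise DC/OS security CLI. A minimal sketch, assuming the dcos security subcommand is installed and spark-principal is an account name you choose:

    $ dcos security org service-accounts keypair private-key.pem public-key.pem
    $ dcos security org service-accounts create -p public-key.pem -d "Spark service account" spark-principal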

Configure Spark for HDFS

To configure Spark for a specific HDFS cluster, set hdfs.config-url to a URL that serves your hdfs-site.xml and core-site.xml. For example: { "hdfs": { "config-url": "http...
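For illustration, a complete options file might look like the following; the URL is hypothetical and must serve both hdfs-site.xml and core-site.xml:

    {
      "hdfs": {
        "config-url": "http://mydomain.example.com/hdfs-config"
      }
    }

Save it as options.json and pass it at install time:

    $ dcos package install --options=options.json spark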

History Server

DC/OS Spark includes the Spark history server. Because the history server requires HDFS, you must explicitly enable it. Install HDFS first:

    $ dcos package install hdfs

Note: HDFS r...
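Once HDFS is up, the history server is enabled through an options file at install time. A sketch, assuming the enablement flag lives under a history-server key (the exact key may vary by package version):

    {
      "history-server": {
        "enabled": true
      }
    }

    $ dcos package install spark --options=options.json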

Security

Mesos Security: set `security.mesos.ssl.enabled` to true to enable SSL on Mesos communication (default: false). Authentication: when running in DC/OS strict security mode, both t...
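As a sketch, the SSL flag above would be set in an options file at install time, assuming the dotted setting name mirrors the nested JSON structure:

    {
      "security": {
        "mesos": {
          "ssl": {
            "enabled": true
          }
        }
      }
    }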

Upgrade

In the Services section of the DC/OS web UI, destroy the Spark instance to be updated. Verify that you no longer see it in the DC/OS web UI. Reinstall Spark:

    $ dcos package install...
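A minimal CLI version of the same sequence, assuming the default app id spark and your customizations saved in options.json:

    $ dcos package uninstall --app-id=spark spark
    $ dcos package install spark --options=options.json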

Uninstall

    $ dcos package uninstall --app-id=<app-id> spark

The Spark dispatcher persists state in ZooKeeper, so to fully uninstall the Spark DC/OS package, you must go to http://<dc...
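For example, if the dispatcher was installed under the (hypothetical) app id spark-dev, the command would be:

    $ dcos package uninstall --app-id=spark-dev spark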

Runtime Configuration Change

You can customize DC/OS Spark in place while it is up and running. Go to the DC/OS web interface. Click the Services tab, then the name of the Spark framework to be updated. Within ...

Run a Spark Job

Before submitting your job, upload the artifact (e.g., a jar file) to a location visible to the cluster (e.g., S3 or HDFS). Run the job:

    $ dcos spark run --submit-args=`--...
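A complete submission might look like the following. The jar URL is a placeholder for wherever you uploaded your artifact; SparkPi is the stock example class that ships with Spark, and the trailing 30 is its partition-count argument:

    $ dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi https://example.com/spark-examples.jar 30"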

Interactive Spark Shell

You can run Spark commands interactively in the Spark shell. The Spark shell is available in either Scala or Python. SSH into a node in the DC/OS cluster. L...
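One way to launch a Scala shell from the leading master, as a sketch; the Docker image tag and the /opt/spark/dist layout are assumptions that may differ in your cluster:

    $ dcos node ssh --master-proxy --leader
    $ docker run -it --net=host mesosphere/spark:2.1.0-hadoop-2.6 /opt/spark/dist/bin/spark-shell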

Custom Docker Images

Note: Custom Docker images are not supported by Mesosphere. You can customize the Docker image in which Spark runs by extending the standard Spark Docker image. In this way, you ca...
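As an illustration, a custom image might extend the standard one; the base tag, repository name, and jar path below are all hypothetical:

    # extend the stock Spark image with extra dependencies
    FROM mesosphere/spark:2.1.0-hadoop-2.6
    COPY extra-jars/ /opt/spark/dist/jars/

After pushing the image to a registry, jobs can select it with Spark's spark.mesos.executor.docker.image property:

    $ dcos spark run --submit-args="--conf spark.mesos.executor.docker.image=myrepo/my-spark:latest --class MyApp https://example.com/my-app.jar"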

Fault Tolerance

Host, network, JVM, or application failures can affect the behavior of three types of Spark components: the DC/OS Spark service, batch jobs, and streaming jobs. DC/OS Spark S...

Job Scheduling

This page is a brief overview of material described in greater detail in the Apache Spark documentation. Modes: Spark on Mesos supports two “modes” of...
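For example, Spark's spark.mesos.coarse property selects between the two modes; coarse-grained scheduling is the default on recent Spark versions, and it can be set explicitly through the submit arguments (the class name and jar URL below are placeholders):

    $ dcos spark run --submit-args="--conf spark.mesos.coarse=true --class MyApp https://example.com/my-app.jar"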

Troubleshooting

Dispatcher: The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may arise if the dispatcher does not receive the resources...

Limitations

DC/OS Spark only supports submitting jars and Python scripts. It does not support R. Mesosphere does not provide support for Spark app development, such as writing a Python app to ...