Spark 2.5.0-2.2.1

Documentation for DC/OS Apache Spark 2.5.0-2.2.1

Welcome to the documentation for DC/OS Apache Spark. For more information about new and changed features, see the release notes.

Apache Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. For more information, see the Apache Spark documentation.

DC/OS Apache Spark consists of Apache Spark with a few custom commits along with DC/OS-specific packaging.

DC/OS Apache Spark offers the following benefits and features.

Benefits

  • Utilization: DC/OS Apache Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
  • Improved efficiency
  • Simple management
  • Multi-team support
  • Interactive analytics through notebooks
  • UI integration
  • Security, including file- and environment-based secrets

Features

  • Multiversion support
  • Run multiple Spark dispatchers
  • Run against multiple HDFS clusters
  • Backports of scheduling improvements
  • Simple installation of all Spark components, including the dispatcher and the history server
  • Integration of the dispatcher and history server
  • Zeppelin integration
  • Kerberos and SSL support

Related services