Apache Spark Beyond Shuffling - Why it isn't Magic - but also where there is some really cool Magic

Tuesday May 2

3:40 PM – 4:30 PM

Vevey 1-2

Apache Spark has been one of the most popular general-purpose distributed systems of the past few years. It has APIs in Scala, Java, and Python, and more recently there have been a few different attempts to provide support for R, C#, and Julia. This talk looks at Apache Spark from a performance and scaling point of view, and at the work we need to do to handle large datasets. In essence, parts of this talk could be considered "the impact of design decisions from years ago and how to work around them." It's not all doom and gloom, though: we will explore the new APIs and the exciting things we can do with them, with a brief detour into how to work around some of the trade-offs in the new APIs, but mostly focused on the new, exciting, shiny things we can play with. A basic background in Apache Spark will probably make the talk more exciting or depressing, depending on your point of view, but for those new to Apache Spark, just enough to understand what's going on will be covered at the start. The presenter would of course encourage you to buy and read her books on the topic ("Learning Spark" and "High Performance Spark"), because which presenter doesn't do that?

Holden Karau

Open Source Engineer at Netflix

Data

Tuesday May 2 @ 11:40 AM

Processing Data of Any Size with Apache Beam

Jesse Anderson

Tuesday May 2 @ 3:40 PM

Apache Spark Beyond Shuffling - Why it isn't Magic - but also where there is some really cool Magic

Holden Karau

Tuesday May 2 @ 1:30 PM

Apache Flink - The State of the Art in Streaming Computation

Jamie Grier

Tuesday May 2 @ 10:35 AM

Fast Data Architectures for Streaming Applications

Dean Wampler

Tuesday May 2 @ 2:35 PM

Cloud Native Data Pipelines

Sid Anand

Tuesday May 2 @ 4:45 PM

Stream All Things - Patterns of Modern Data Integration

Gwen Shapira