Real-Life Apache Spark: Tips and Tricks from the Trenches

Wolfram Willuhn · 5 minutes · 23.06.2016

We happily hosted the Zurich Apache Spark Meetup, where Noah Bieler, Principal Data Scientist at Wealthport, presented “Real-Life Apache Spark: Tips and Tricks from the Trenches.”

 

The Big Data revolution: Spark

Researchers at Google started the Big Data revolution when they published their papers on the Google File System (2003) and the MapReduce data processing framework for clusters of computers (2004). Everyone got excited about the order-of-magnitude cost reductions in data storage and speed increases in data processing that these technologies enable. People soon realized, however, that the MapReduce framework was designed too specifically for Google’s purposes and was not generic enough for general data analysis needs. In 2009, researchers at the University of California, Berkeley, therefore started work on Spark, a cluster computing framework built on Resilient Distributed Datasets. Spark opens up the advantages of Google’s original Big Data technologies to a much larger range of application areas.

 

Tobias Widmer and Wolfram Willuhn started the Zurich Apache Spark Meetup in the spring of 2015 to bring together the growing group of Spark practitioners and enthusiasts, and to establish a place to exchange experiences with this new technology and brainstorm about novel application areas. Previous meetups included a talk by 4Quant, an ETH spin-off that uses Spark to scale up the automated analysis of images and videos, and a visit to IBM’s Rüschlikon research lab to learn about their progress in text understanding.

 

Successful first meetup

The main focus of the Meetup in March was to share some of the lessons learned during a more complex Spark implementation. The talk presented a range of practical aspects and allowed the audience to get a glimpse of Wealthport’s actual code base. Noah shared his insights on:

  • how to overcome the most common pitfalls, such as deciding when to use RDDs, DataFrames or Datasets;
  • how you can enrich the functionality of some Spark classes or add your own types;
  • and finally how you can deploy Spark in the cloud together with Cassandra – a large-scale database for the cloud.

More than 50 people came to listen to the presentation, a surprising turnout given the very technical topic and how young the technology still is. The questions and discussions after the presentation showed that both interest in Spark and its practical application are growing. Noah was kind enough to put his presentation on SlideShare.


 

About Wealthport

Wealthport is a startup that aims to make preparing and integrating data much less painful than it is today. It does so by offering an automated service based on machine learning technology. Since customer data is often generated at scale, Wealthport uses Apache Spark to deal with such volumes efficiently. The use cases for this technology are matching and integrating data, and clustering and segmenting data. These problems occur in any industry, with e-commerce and financial services being particularly prominent areas.

 

The next meetup will take place on July 4 at Impact Hub Viadukt, on the topic “Spark Summit 2016: Trends & Insights.”

 

Wolfram is an experienced executive and business developer with experience in changing organizations and deploying advanced technologies in established and in growing businesses, and in creating innovative products and services in technology startups. He cares about innovative solutions, passionate technology competency, cross-cultural team leadership, truste...

