We were happy to host the Zurich Apache Spark Meetup, where Noah Bieler, Principal Data Scientist at Wealthport, presented “Real-Life Apache Spark: Tips and Tricks from the Trenches.”
The Big Data revolution: Spark
Researchers from Google started the Big Data revolution with their original publications on the Google File System in 2003 and the MapReduce data processing framework for clusters of computers in 2004. Everyone got excited about the orders-of-magnitude cost reductions for data storage and speed increases for data processing that these technologies enable. People soon realized, however, that the MapReduce framework was designed too specifically for Google’s purposes and was not generic enough for general data analysis needs. This is why, in 2009, researchers at the University of California, Berkeley, started their work on Spark, a cluster computing framework built on Resilient Distributed Datasets. Spark opens up the advantages of Google’s original Big Data technologies to a much larger range of application areas.
Tobias Widmer and Wolfram Willuhn started the Zurich Apache Spark Meetup in the spring of 2015 to bring together the growing group of Spark practitioners and enthusiasts, and to establish a place to exchange experiences with this new technology and brainstorm about novel application areas. Previous meetups featured 4Quant, an ETH spin-off that uses Spark to scale up the automated analysis of images and videos, and included a visit to IBM’s Rüschlikon research lab to learn about their progress in text understanding.
Successful first meetup
The main focus of the March Meetup was to share lessons learned from a complex Spark implementation. The talk covered a range of practical aspects and gave the audience a glimpse of Wealthport’s actual code base. Noah shared his insights on:
- how to overcome the most common pitfalls, e.g. when to use RDDs, DataFrames or Datasets;
- how you can enrich the functionality of some Spark classes or add your own types;
- and finally how you can deploy Spark in the cloud together with Cassandra – a large-scale database for the cloud.
More than 50 people came to listen to the presentation, which was surprising given the very technical topic and the young age of the technology. The questions and discussions after the presentation showed that both interest in Spark and its practical application are growing. Noah was kind enough to put his presentation on SlideShare.