Creating a Telegram Bot in 5 minutes, with Apache Camel

Recently, I started contributing to open source software at the Apache Software Foundation, and I developed camel-telegram, a component that allows Camel-based applications to exchange data over the Telegram messaging network. It will be released with Apache Camel 2.18. The full documentation is available in the Camel manual.

Spash – Organizing your Big Data with SSH and Bash

Spash is a command line tool for Big Data platforms that simulates a real Unix environment, providing most of the commands of a typical Bash shell on top of YARN, HDFS and Apache Spark.

Spash uses the HDFS APIs to execute simple file operations and Apache Spark to perform parallel computations on big datasets. With Spash, managing a Big Data cluster becomes as natural as writing bash commands.

Spash is an open source project based on an idea by Nicola Barbieri. The code is here:


Logging to a NoSQL DB from Spark

Logging effectively is often a hard task in standard applications. But when the application runs in a distributed environment, for instance a Spark job in a big YARN cluster, it becomes ten times harder. Jobs are split into thousands of tasks that run on multiple worker machines, so classic console logging is not a good option: the logs get written to the standard output of several remote machines, which makes useful information very hard to find.

One of the best options available in all modern data platforms is logging to a NoSQL database. Many data platforms support HBase and Phoenix as a NoSQL layer, so why not use a Phoenix table to store the logs?
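The idea of routing log records to a table instead of the console can be sketched with a custom `java.util.logging` handler. This is only an illustration under stated assumptions: `LogRow`, `BatchingHandler` and the in-memory sink are hypothetical names, and the sink stands in for whatever code would actually write a batch of rows to a Phoenix table. It is not necessarily the approach the full post takes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class TableLogDemo {

    /** One row of a hypothetical log table. */
    static class LogRow {
        final long timestamp;
        final String level;
        final String source;
        final String message;

        LogRow(long timestamp, String level, String source, String message) {
            this.timestamp = timestamp;
            this.level = level;
            this.source = source;
            this.message = message;
        }
    }

    /** Buffers log records and hands them to a sink in batches. */
    static class BatchingHandler extends Handler {
        private final int batchSize;
        private final Consumer<List<LogRow>> sink; // assumption: would write to the log table
        private final List<LogRow> buffer = new ArrayList<>();

        BatchingHandler(int batchSize, Consumer<List<LogRow>> sink) {
            this.batchSize = batchSize;
            this.sink = sink;
        }

        @Override
        public synchronized void publish(LogRecord record) {
            if (!isLoggable(record)) {
                return;
            }
            buffer.add(new LogRow(record.getMillis(), record.getLevel().getName(),
                    record.getLoggerName(), record.getMessage()));
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        @Override
        public synchronized void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(buffer)); // pass a copy, then reset the buffer
            buffer.clear();
        }

        @Override
        public synchronized void close() {
            flush(); // do not lose the last partial batch
        }
    }

    public static void main(String[] args) {
        List<LogRow> table = new ArrayList<>(); // in-memory stand-in for the table
        BatchingHandler handler = new BatchingHandler(2, table::addAll);

        Logger logger = Logger.getLogger("job");
        logger.setUseParentHandlers(false); // keep records off the console
        logger.addHandler(handler);

        logger.info("task started");
        logger.warning("dirty record skipped");
        logger.info("task finished");
        handler.close();

        for (LogRow row : table) {
            System.out.println(row.timestamp + " " + row.level + " " + row.message);
        }
    }
}
```

Batching is a design choice here: writing one row per log call to a remote store would be expensive when thousands of tasks log concurrently.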


A Brief History of Big Data

Why is everybody talking about Big Data? Where does Hadoop come from? Which steps led to the spread of Spark? What’s next?

This presentation will guide you through all the steps that led to this “big” change.


Spark-HBase-Connector 1.0.2 is out

The Spark-HBase-Connector project started as a three-day programming marathon I did last year. At home, with the flu. Now it is becoming one of the most popular drivers for reading and writing data to Apache HBase from Spark.

The project is hosted on GitHub:
The library is also available on Maven. You can find more information on the GitHub page.


Strategy Pattern using CDI

The strategy pattern is one of the most famous patterns by the GoF and, for sure, one of the most useful in the Java EE world. Implementing it using Contexts and Dependency Injection (CDI) may appear difficult, but it’s just a matter of following some simple steps.

[Figure: Strategy Pattern in UML]

The strategy pattern allows a strategy to be selected dynamically from a collection of possible strategies. The client is completely unaware of the selection, and the choice is made programmatically, on the basis of contextual information.
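The selection mechanism can be sketched in plain Java, without any container. The names below (`GreetingStrategy`, `Greeter`, the language code used as context) are hypothetical and chosen only for illustration. In a CDI application the list of strategies could, for example, be obtained from an injected `Instance<GreetingStrategy>` instead of being passed to the constructor by hand.

```java
import java.util.List;

class StrategyDemo {

    /** A strategy declares whether it applies to a context and how to act on it. */
    interface GreetingStrategy {
        boolean supports(String language); // contextual information drives the selection
        String greet(String name);
    }

    static class EnglishGreeting implements GreetingStrategy {
        public boolean supports(String language) { return "en".equals(language); }
        public String greet(String name) { return "Hello, " + name + "!"; }
    }

    static class ItalianGreeting implements GreetingStrategy {
        public boolean supports(String language) { return "it".equals(language); }
        public String greet(String name) { return "Ciao, " + name + "!"; }
    }

    /** Hides the choice of strategy: clients only call greet(). */
    static class Greeter {
        private final List<GreetingStrategy> strategies;

        Greeter(List<GreetingStrategy> strategies) {
            this.strategies = strategies;
        }

        String greet(String language, String name) {
            return strategies.stream()
                    .filter(s -> s.supports(language)) // first matching strategy wins
                    .findFirst()
                    .orElseThrow(() -> new IllegalArgumentException("No strategy for " + language))
                    .greet(name);
        }
    }

    public static void main(String[] args) {
        Greeter greeter = new Greeter(List.of(new EnglishGreeting(), new ItalianGreeting()));
        System.out.println(greeter.greet("it", "Ada"));
        System.out.println(greeter.greet("en", "Ada"));
    }
}
```

Adding a new strategy then means adding one class to the collection; the client code calling `greet` does not change.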


Using Non-Serializable Objects in Apache Spark

Anyone who starts writing applications for Apache Spark immediately encounters the following exception:

This happens whenever Spark tries to transmit the scheduled tasks to remote machines. Tasks are just pieces of application code that are sent from the driver to the workers.

Given the frequency of that exception, one may think that any piece of code that is executed by a worker node must be serializable. Fortunately, that is not true.
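One common way to keep a non-serializable object out of the serialized task is to mark the field `transient` and recreate it lazily on first use. The sketch below uses only standard Java serialization to mimic the trip from driver to worker; `Connection` and `Task` are hypothetical stand-ins, no Spark API is involved, and this is not necessarily the technique the full post describes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientDemo {

    /** Stand-in for a resource that cannot be serialized (deliberately not Serializable). */
    static class Connection {
        String send(String message) { return "sent:" + message; }
    }

    /** Serializable task: the connection is skipped on serialization and rebuilt on first use. */
    static class Task implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String prefix;
        private transient Connection connection; // not shipped with the task

        Task(String prefix) {
            this.prefix = prefix;
        }

        private Connection connection() {
            if (connection == null) {
                connection = new Connection(); // created lazily wherever the task runs
            }
            return connection;
        }

        String run(String record) {
            return connection().send(prefix + record);
        }
    }

    /** Serializes and deserializes a task, mimicking the trip from driver to worker. */
    static Task roundTrip(Task task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Task) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Task onDriver = new Task("log-");
        onDriver.run("warm-up");             // the connection now exists on the "driver"
        Task onWorker = roundTrip(onDriver); // still serializable thanks to 'transient'
        System.out.println(onWorker.run("42"));
    }
}
```

Without the `transient` modifier, `writeObject` would fail with a `java.io.NotSerializableException`, because `Connection` does not implement `Serializable`.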


Exception Handling in Apache Spark

Apache Spark is a fantastic framework for writing highly scalable applications. Data and execution code are spread from the driver to many worker machines for parallel processing. But debugging this kind of application is often a really hard task.

Exceptions need to be treated carefully, because a simple runtime exception caused by dirty source data can easily lead to the termination of the whole process. Let’s see an example.
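The failure mode can be illustrated outside Spark with plain Java collections. The sketch below is not the post's own example: `parseAll`, `parseValid` and the sample records are hypothetical, and a stream pipeline stands in for a distributed transformation. It shows how one malformed record aborts the naive version, while a defensive version isolates the failure to the offending record.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class DirtyDataDemo {

    /** Naive parsing: a single malformed record aborts the whole computation. */
    static List<Integer> parseAll(List<String> records) {
        return records.stream()
                .map(Integer::parseInt) // throws NumberFormatException on dirty input
                .collect(Collectors.toList());
    }

    /** Defensive parsing: malformed records yield an empty result and are skipped. */
    static List<Integer> parseValid(List<String> records) {
        return records.stream()
                .map(DirtyDataDemo::tryParse)
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }

    /** Wraps the risky call so that the exception never leaves this method. */
    static Optional<Integer> tryParse(String record) {
        try {
            return Optional.of(Integer.parseInt(record.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        List<String> records = List.of("1", "2", "oops", "4");
        try {
            parseAll(records);
        } catch (NumberFormatException e) {
            System.out.println("Whole computation failed: " + e.getMessage());
        }
        System.out.println("Valid records: " + parseValid(records));
    }
}
```

Silently dropping bad records is itself a trade-off; in practice one might also count or store the rejected records so that data quality problems remain visible.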


println("Hello World!")

Welcome to my new blog. Cool things are coming up…