Logging to a NoSQL DB from Spark

Effective logging is often hard even in standard applications. When the application runs in a distributed environment, for instance a Spark job on a large YARN cluster, it becomes much harder.

Jobs are split into thousands of tasks that run on multiple worker machines, so classic console logging is not a good option: the logs are written to the standard output of several remote machines, which makes useful information very hard to find.

One of the best options available in all modern data platforms is logging to a NoSQL database. Many data platforms support HBase and Phoenix as a NoSQL layer, so why not use a Phoenix table to store the logs?

First of all, the table must be created inside Phoenix, and it must be optimized for efficiently writing the log data. For example:

CREATE TABLE LOG 
(
  LOG_DATE  TIMESTAMP  NOT NULL,
  LOG_ID    BIGINT     NOT NULL,
  LOGGER    VARCHAR(150),
  LEVEL     VARCHAR(10),
  MESSAGE   VARCHAR(8192)
  CONSTRAINT LOG_PK PRIMARY KEY (LOG_DATE, LOG_ID)
) SALT_BUCKETS = 50;
 
CREATE SEQUENCE SEQ_LOG_ID MINVALUE 1;

The LOG table is defined as a salted table with 50 buckets (the number can be increased or decreased, depending on the size of the cluster). Salting is needed to spread the data across multiple region servers, in order to balance the load across all the machines. The first column of the primary key is a timestamp, which is a monotonically increasing field, so without salt buckets all new rows would land next to each other and only one machine at a time would be used to store the logs. The LOG_ID column is also part of the primary key: it prevents collisions among log messages that share the same timestamp.
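
To make the effect of salting more concrete, here is a minimal sketch in plain Java. It is not Phoenix code: the class SaltingSketch and its methods are hypothetical, the row key is approximated by the decimal text of the timestamp, and Arrays.hashCode is only a stand-in for the hash function that Phoenix applies internally. The point is just that a bucket index derived from a hash of the key scatters consecutive timestamps across buckets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SaltingSketch {

    // Hypothetical helper: derives a bucket index from the row key bytes.
    // Arrays.hashCode is a stand-in, not the hash function Phoenix uses.
    static int saltBucket(byte[] rowKey, int buckets) {
        return Math.floorMod(Arrays.hashCode(rowKey), buckets);
    }

    // Counts how many of `count` consecutive timestamps land in each bucket.
    static int[] distribute(long startMillis, int count, int buckets) {
        int[] perBucket = new int[buckets];
        for (int i = 0; i < count; i++) {
            // Assumption: the key is just the timestamp rendered as text.
            byte[] key = Long.toString(startMillis + i).getBytes(StandardCharsets.UTF_8);
            perBucket[saltBucket(key, buckets)]++;
        }
        return perBucket;
    }

    public static void main(String[] args) {
        // 10,000 consecutive milliseconds spread over 50 buckets.
        int[] perBucket = distribute(1_450_000_000_000L, 10_000, 50);
        System.out.println(Arrays.toString(perBucket));
    }
}
```

Without the hash step, every one of those keys would sort right after the previous one, which is exactly the hot-spot situation that the salted table avoids.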

Once the table is defined, we need to configure log4j to store log messages inside it. Phoenix is compliant with the JDBC API, which allows using the JDBCAppender. Unfortunately, the standard JDBCAppender does not work well with Phoenix, because it never commits. Of course, I'm not saying that Phoenix supports transactions (not yet). Phoenix simply requires that you commit the UPSERT statements, otherwise they remain stuck in the driver's cache and never reach the table. So, we need to write a custom extension of the JDBCAppender:

package it.nerdammer.log4j;
 
import org.apache.log4j.jdbc.JDBCAppender;
 
import java.sql.Connection;
import java.sql.SQLException;
 
public class PhoenixAppender extends JDBCAppender {
 
    @Override
    protected Connection getConnection() throws SQLException {
        Connection connection = super.getConnection();
        connection.setAutoCommit(true);
 
        return connection;
    }
 
}

The class above just sets the autocommit=true property on every connection created by the JDBCAppender.
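
Autocommit is one way to get the UPSERT statements flushed. For comparison, here is a minimal sketch of the explicit alternative in plain JDBC, where a row is written and then committed by hand. The class and method names (PhoenixCommitSketch, writeLog) are hypothetical and not part of the original project. The table and sequence are the ones defined earlier, and the connection is assumed to come from the Phoenix JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

final class PhoenixCommitSketch {

    // Same columns as the LOG table above; values are bound as parameters.
    static final String UPSERT_SQL =
        "UPSERT INTO LOG (LOG_DATE, LOG_ID, LOGGER, LEVEL, MESSAGE) "
        + "VALUES (?, NEXT VALUE FOR SEQ_LOG_ID, ?, ?, ?)";

    // Hypothetical helper: writes one log row and commits it explicitly.
    static void writeLog(Connection conn, String logger, String level, String message)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            ps.setString(2, logger);
            ps.setString(3, level);
            ps.setString(4, message);
            ps.executeUpdate();
        }
        // Without this call the row stays buffered on the client side.
        conn.commit();
    }
}
```

The appender cannot easily take this route because the JDBCAppender runs the statements itself, which is why switching the connection to autocommit is the smaller change.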

Now that we have an appender compatible with Phoenix, the next step is configuring it in the log4j.properties file:

# Root logger option
log4j.rootLogger=INFO, stdout
 
log4j.logger.com.enterprise=INFO, phoenix
log4j.additivity.com.enterprise=true
 
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
 
 
log4j.appender.phoenix=it.nerdammer.log4j.PhoenixAppender
log4j.appender.phoenix.URL=jdbc:phoenix:phoenixhost
log4j.appender.phoenix.user=anyuser
log4j.appender.phoenix.password=anypassword
log4j.appender.phoenix.driver=org.apache.phoenix.jdbc.PhoenixDriver
log4j.appender.phoenix.sql=UPSERT INTO LOG (LOG_DATE, LOG_ID, LOGGER, LEVEL, MESSAGE) VALUES ('%d', NEXT VALUE FOR SEQ_LOG_ID, '%C', '%p', '%m')
log4j.appender.phoenix.layout=org.apache.log4j.PatternLayout
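
One caveat about the sql pattern above: the log4j 1.x JDBCAppender substitutes the formatted values into the statement as plain text, so a message containing a single quote would break the '%m' literal. The sketch below shows the usual remedy, doubling the quote. The class SqlLiteralSketch and the method escapeSqlLiteral are hypothetical names, and where the helper gets applied (for example, before the message reaches the layout) is left open.

```java
final class SqlLiteralSketch {

    // Doubles every single quote so the text can sit inside a '...' SQL literal.
    // A null message is treated as an empty string.
    static String escapeSqlLiteral(String text) {
        return text == null ? "" : text.replace("'", "''");
    }

    public static void main(String[] args) {
        String message = "can't connect to the region server";
        String sql = "UPSERT INTO LOG (LOG_DATE, LOG_ID, LOGGER, LEVEL, MESSAGE) VALUES "
            + "(CURRENT_DATE(), NEXT VALUE FOR SEQ_LOG_ID, 'demo', 'ERROR', '"
            + escapeSqlLiteral(message) + "')";
        // Prints the statement with the quote doubled; nothing is executed here.
        System.out.println(sql);
    }
}
```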

An appender named phoenix has been created and associated with the logger named “com.enterprise”. You can change it to the base package of your application, but note that you cannot associate the phoenix appender with the root logger. The reason is that Phoenix itself uses log4j while initializing the connection to the database. If you allow the logger org.apache.phoenix to append logs to the Phoenix table, you will soon get a StackOverflowError. One way of breaking the loop is to limit the usage of the “phoenix” appender to just a few packages of your application (in the example, com.enterprise and all its sub-packages).
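
Restricting the appender to a few packages is the approach used here. As an illustration of how such a loop can also be cut in code, the sketch below uses a per-thread flag that drops any append attempted while another append is already running on the same thread. This is a different technique, not what the original project does: ReentrancyGuardSketch and guardedAppend are hypothetical names, and the guard would not help if the driver logged from another thread.

```java
import java.util.ArrayList;
import java.util.List;

final class ReentrancyGuardSketch {

    // True while the current thread is inside guardedAppend.
    private static final ThreadLocal<Boolean> APPENDING =
        ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Runs the action unless this thread is already appending.
    // Returns false when the nested call was skipped.
    static boolean guardedAppend(Runnable action) {
        if (APPENDING.get()) {
            return false;
        }
        APPENDING.set(Boolean.TRUE);
        try {
            action.run();
            return true;
        } finally {
            APPENDING.set(Boolean.FALSE);
        }
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        guardedAppend(() -> {
            written.add("application message");
            // Simulates the driver logging while the appender is writing.
            boolean nested = guardedAppend(() -> written.add("driver message"));
            System.out.println("nested call executed: " + nested);
        });
        System.out.println(written);
    }
}
```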

To test locally, you can run a single Docker container with everything preinstalled, for example my dockmob container for Phoenix (https://hub.docker.com/r/dockmob/phoenix/), using the following script:

#!/bin/bash
# 
# Before executing the script, add to /etc/hosts the following entry
# <docker-machine-ip> phoenixhost
#
# Where <docker-machine-ip> is the IP assigned to the docker machine on OSX (usually 192.168.99.100), 
# or 127.0.0.1 on Linux
 
MYPHOENIX_ID=$(docker run -d -p 2181:2181 -p 60000:60000 -p 60010:60010 -p 60020:60020 -p 60030:60030 -h phoenixhost dockmob/phoenix:4.5.2-1.0.1 -t pseudodistributed)
docker exec -it "$MYPHOENIX_ID" /usr/lib/phoenix/bin/sqlline.py localhost

The last command starts a SQL console on the Phoenix instance. You need to paste the previous SQL commands to create the LOG table.

Make sure that all the libraries required for writing to Phoenix are on the Spark application classpath. Here is a simple test to verify that it works:

package com.enterprise.test
 
import java.sql.DriverManager
 
import it.nerdammer.log4j.PhoenixAppender
import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
 
class SparkJobTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
  lazy val sc: SparkContext = {
    val conf = new SparkConf()
      .setAppName("Console Test")
      .setMaster("local")
 
    new SparkContext(conf)
  }
 
  override protected def beforeAll(): Unit = sc
 
  override protected def afterAll(): Unit = sc.stop()
 
 
  "the console logger " should "work" in {
 
    val databaseURL = Logger.getLogger("com.enterprise").getAppender("phoenix").asInstanceOf[PhoenixAppender].getURL
    val conn = DriverManager.getConnection(databaseURL, "any", "any")
    conn.setAutoCommit(true)
    val pstm = conn.prepareStatement("delete from log")
    pstm.execute()
    pstm.close()
 
    sc.parallelize(1 to 10)
      .map(e => {
        Logger.getLogger(classOf[SparkJobTest]).info("This is a log message")
        e
      })
      .count()
 
    val pstm2 = conn.prepareStatement("select count(*) from log")
    val rs = pstm2.executeQuery()
    rs.next()
    val count = rs.getInt(1)
    rs.close()
    pstm2.close()
    conn.close()
 
    assert(count == 10)
  }
}

Now the logs appear if you query the LOG table using the SQL console. You can find the full code here: https://github.com/nerdammer/spark-additions/tree/master/log4j-phoenix.

Nicola Ferraro
