New job assessment: Webcluster logs parsing

Published: January 15, 2012 in Misc, monitoring, script, sysadmin, Technical

This is one of the proposed solutions for the job assessment discussed in a previous post.

Provide a design which is able to parse Apache access-logs so we can generate an overview of which IP has visited a specific link. This design needs to be usable for a 500+ node webcluster. Please provide your configs/possible scripts and explain your choices and how scalable they are.

I will consider these requirements:

  • It is not critical to register every log entry; there is no need to guarantee that all web hits are recorded.
  • No control of duplicate log entries: there is no need
    to check whether a log entry has already been loaded.
  • A mechanism to gather the logs from the webservers must also be proposed.
  • It must be scalable.
  • Flexibility that allows further, different analyses is a plus.

The problems to be solved are log storage and log gathering, but the main bottleneck will be the storage.

A NoSQL database looks like the best option, given
the characteristics of the data to process (log entries):

  • Time-ordered entries
  • No duplicates
  • Need for fast insertion
  • Fixed fields
  • No data relations or integrity constraints
  • Need to be rotated (old entries removed)
  • etc…

So I propose using MongoDB [1], which fits the requirements:

  • It is fast, both at inserting and querying.
  • Scales horizontally without disruption (if properly configured from the start).
  • Supports replication and high availability.
  • Well-known solution, with commercial support if needed.
  • Python bindings (PyMongo).
[1] Note: I will not go into the details of a scalable, highly available MongoDB architecture.
See the quick start guide
to set up a single node and the documentation
for architecture examples.

To parse the logs and store them in MongoDB, I propose a
simple Python script that:

  • Sets up a direct MongoDB connection.
  • Reads the access log from standard input.
  • Parses the logs and stores all the fields, including: client_ip, url, referer, status code, timestamp, timezone…
  • Does not set any indexes in the NoSQL DB. Indexes could be
    created on the url or client_ip fields, but skipping them allows faster
    insertions, which is the objective. Reads are very uncommon and are performed
    in batch processes.
  • Note that it should be improved to be more reliable. For instance, it
    does not check for errors (DB failures, etc.). It could buffer entries in case of DB failure.
  • A second script queries the DB and prints the accesses. It takes one optional argument, the relative URL.
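
As a rough illustration of the two scripts described above, the following sketch parses Apache combined-format lines and counts hits per URL and client IP. It is not the original script: the function names (parse_line, load, visitors), the extra field names, and the database and collection names in the closing comment are assumptions made for this example. The collection argument only needs to behave like a PyMongo collection, that is, to offer insert_one and find.

```python
import re
from collections import Counter
from datetime import datetime

# Apache "combined" format:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# Quoted fields may contain backslash-escaped characters such as \".
QUOTED = r'(?:[^"\\]|\\.)*'
LOG_RE = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>' + QUOTED + r')" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>' + QUOTED + r')" "(?P<user_agent>' + QUOTED + r')"'
)


def parse_line(line):
    """Return a dict for one access-log line, or None if it cannot be parsed."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    try:
        # The parsed datetime keeps the timezone offset found in the log.
        entry["timestamp"] = datetime.strptime(
            entry.pop("time"), "%d/%b/%Y:%H:%M:%S %z"
        )
    except ValueError:
        return None
    # "GET /index.html HTTP/1.1" -> method, url, protocol
    parts = entry.pop("request").split()
    if len(parts) != 3:
        parts = [None, None, None]
    entry["method"], entry["url"], entry["protocol"] = parts
    entry["status"] = int(entry["status"])
    # Apache logs "-" as the size when no response body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def load(lines, collection):
    """Insert every parsable line into the collection; return how many."""
    count = 0
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            collection.insert_one(entry)
            count += 1
    return count


def visitors(collection, url=None):
    """Count hits per (url, client_ip), optionally for one relative URL."""
    query = {} if url is None else {"url": url}
    hits = Counter()
    for doc in collection.find(query, {"url": 1, "client_ip": 1}):
        hits[(doc["url"], doc["client_ip"])] += 1
    return hits


# Hypothetical wiring (needs the pymongo package and a running MongoDB;
# the "weblogs" and "access" names are made up for this example):
#   import sys, pymongo
#   collection = pymongo.MongoClient()["weblogs"]["access"]
#   load(sys.stdin, collection)
```

Counting on the client side keeps the sketch short and matches the "reads are rare and done in batch" assumption; with many entries, a server-side aggregation would probably be preferable.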

To feed the DB with the logs from the webservers, some solutions could be:

  • Copy the log files with a scheduled task via SSH or similar, then process them on a centralized server (or cluster of servers).

    • Pros: Logs are centralized. Only a set of servers accesses MongoDB.
      The system can be stopped as needed.
    • Cons: Needs extra programming to fetch the logs.
      No real-time data.
  • Use a centralized syslog service, like syslog-ng
    (which can be load-balanced and configured for HA),
    and set up all the webservers to send their logs via syslog
    (see this article).

    On the log server, we can process the resulting files with a batch process or send all the messages to the script. For instance, the configuration for syslog-ng:

    destination d_prog { program("/apath/"
                                  template-escape(no)); };
    • Pros: Centralized logs. No extra programming. Real-time data.
      Uses existing infrastructure (syslog). Only a set of servers accesses MongoDB.
    • Cons: Some log entries can be dropped. It cannot be stopped, otherwise log entries will be lost.
  • Pipe the webserver logs directly to the script.
    In the Apache configuration:

    CustomLog "|/apath/" combined
    • Pros: Easy to implement. No extra programming or infrastructure. Real-time data.
    • Cons: Some log entries can be dropped. It cannot be stopped, otherwise log entries will be lost.
      The script should be improved to make it more reliable.
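
One possible shape for the reliability improvement mentioned above (buffering entries while the DB is unavailable) is sketched below. This is an assumption about how it could be done, not part of the original script: the BufferedWriter class, its max_buffered parameter, and the catch-all exception handling are all hypothetical. The buffer is bounded on purpose, since losing some hits is acceptable under the stated requirements.

```python
from collections import deque


class BufferedWriter:
    """Queue entries in memory and flush them to the collection when possible."""

    def __init__(self, collection, max_buffered=10000):
        self.collection = collection
        # Bounded queue: once full, appending drops the oldest entry.
        self.pending = deque(maxlen=max_buffered)

    def write(self, entry):
        """Buffer one entry, then try to flush; return True if nothing is left."""
        self.pending.append(entry)
        return self.flush()

    def flush(self):
        """Store pending entries in order; stop at the first failure."""
        while self.pending:
            try:
                # Insert a copy so that a retried entry never carries an _id
                # assigned during an earlier, failed attempt. A retry may
                # therefore create a duplicate, which the requirements allow.
                self.collection.insert_one(dict(self.pending[0]))
            except Exception:
                # Hypothetical catch-all; with PyMongo one would more likely
                # catch pymongo.errors.PyMongoError.
                return False
            self.pending.popleft()
        return True
```

In the parsing loop, the direct insert would be replaced by a call to write on a single BufferedWriter instance, so that entries collected during an outage are sent as soon as the next insert succeeds.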
