Posts Tagged ‘kiokudb’

Riak, Perl and KiokuDB

Sunday, December 13th, 2009

As I was looking for a system to store documents at $work, one of my coworkers pointed me to Riak. I’m looking for a solution of this type to store various kinds of documents, from HTML pages to JSON. I need a system that is distributed, fault tolerant, and that works with Perl.

Riak is a document-oriented key-value store: it’s NoSQL, speaks REST, and is written in Erlang. You can read more about it here or watch an introduction here. Like CouchDB, Riak provides a REST interface, so you don’t have to write any Erlang code.

One of the nice things about Riak is that it lets you define the N, R and W values for each operation. These values are:

  • N: the number of replicas of each value to store
  • R: the number of replicas required to perform a read operation
  • W: the number of replicas needed for a write operation
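A rule of thumb for this kind of setting (my own addition, not taken from the Riak documentation): when R + W > N, the set of replicas answering a read always shares at least one replica with the set that acknowledged a write. A tiny sketch of that check, where quorum_overlaps is a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: true when any R replicas picked for a read must
# share at least one replica with any W replicas picked for a write,
# out of N replicas in total.
sub quorum_overlaps {
    my ( $n, $r, $w ) = @_;
    return $r + $w > $n ? 1 : 0;
}

printf "N=3 R=2 W=2: %s\n", quorum_overlaps( 3, 2, 2 ) ? 'overlap' : 'no overlap';
printf "N=3 R=1 W=1: %s\n", quorum_overlaps( 3, 1, 1 ) ? 'overlap' : 'no overlap';
```

Lower R and W values answer faster, but a read may then miss the most recent write.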

Riak comes with libraries for Python, Ruby, PHP and even JavaScript, but not for Perl. As all these libraries just communicate with Riak via the REST interface, I’ve started to write one using AnyEvent::HTTP, and also a backend for KiokuDB.

Installing and using Riak

If you are interested in Riak, you can install it easily. First, you will need the Erlang VM. On Debian, a simple

sudo aptitude install erlang

installs everything you need. The next step is to install Riak:

wget http://hg.basho.com/riak/get/riak-0.6.2.tar.gz
tar xzf riak-0.6.2.tar.gz
cd riak
make
export RIAK=`pwd`

Now, you can start to use it with

./start-fresh.sh config/riak-demo.erlenv

or if you want to test it in cluster mode, you can write a configuration like this:

{cluster_name, "default"}.
{ring_state_dir, "priv/ringstate"}.
{ring_creation_size, 16}.
{gossip_interval, 60000}.
{storage_backend, riak_fs_backend}.
{riak_fs_backend_root, "/opt/data/riak/"}.
{riak_cookie, riak_demo_cookie}.
{riak_heart_command, "(cd $RIAK; ./start-restart.sh $RIAK/config/riak-demo.erlenv)"}.
{riak_nodename, riakdemo}.
{riak_hostname, "192.168.0.11"}.
{riak_web_ip, "192.168.0.11"}.
{riak_web_port, 8098}.
{jiak_name, "jiak"}.
{riak_web_logdir, "/tmp/riak_log"}.

Copy this config to a second server and edit it to replace the riak_hostname and riak_nodename. On the first server, start Riak as shown previously, then on the second, start it with

./start-join.sh config/riak-demo.erlenv 192.168.0.11

where the IP address is the address of the first node in your cluster.

Let’s check if everything works:

curl -X PUT -H "Content-type: application/json" \
    http://192.168.0.11:8098/jiak/blog/lumberjaph/ \
    -d "{\"bucket\":\"blog\",\"key\":\"lumberjaph\",\"object\":{\"title\":\"I'm a lumberjaph, and I'm ok\"},\"links\":[]}"
 
curl -i http://192.168.0.11:8098/jiak/blog/lumberjaph/

will output (along with the HTTP headers)

{"object":{"title":"I'm a lumberjaph, and I'm ok"},"vclock":"a85hYGBgzGDKBVIsbGubKzKYEhnzWBlCTs08wpcFAA==","lastmod":"Sun, 13 Dec 2009 20:28:04 GMT","vtag":"5YSzQ7sEdI3lABkEUFcgXy","bucket":"blog","key":"lumberjaph","links":[]}
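Escaping the JSON by hand in the shell gets painful quickly. A minimal sketch of building the same request body from Perl instead, using the core JSON::PP module (the document is the one from the curl example; nothing here talks to Riak):

```perl
use strict;
use warnings;
use JSON::PP;

# Same document as in the curl example above
my %doc = (
    bucket => 'blog',
    key    => 'lumberjaph',
    object => { title => "I'm a lumberjaph, and I'm ok" },
    links  => [],
);

# canonical() sorts the keys, so the output is stable between runs
my $body = JSON::PP->new->canonical->encode( \%doc );
print "$body\n";
```

The resulting string can be sent as the body of the PUT request with any HTTP client.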

Using Riak with Perl and KiokuDB

I need to store various things in Riak: HTML pages, JSON data, and objects using KiokuDB. I’ve started to write a client for Riak with AnyEvent, so I can do simple operations at the moment (listing information about a bucket, defining a new bucket with a specific schema, storing, retrieving and deleting documents). To create a client, you need to do:

my $client = AnyEvent::Riak->new(
    host => 'http://127.0.0.1:8098',
    path => 'jiak',
);

As Riak exposes its N, R, and W values to you, you can also set them when creating the client:

my $client = AnyEvent::Riak->new(
    host => 'http://127.0.0.1:8098',
    path => 'jiak',            
    r    => 2,
    w    => 2,                 
    dw   => 2,
);

where:

  • the W and DW values mean that the request returns as soon as at least W nodes have received the request, and at least DW nodes have stored it in their storage backend.
  • with the R value, the request returns as soon as R nodes have responded with a value or an error. You can also set these values when calling fetch, store and delete. By default, the value is set to 2.

So, if you want to store a value, retrieve it, then delete it, you can do:

my $store = $client->store(                                           
    { bucket => 'foo', key => 'bar', object => { baz => 1 }, } )->recv;    
my $fetch  = $client->fetch( 'foo', 'bar' )->recv;
my $delete = $client->delete( 'foo', 'bar' )->recv;

If there is an error, the croak method from AnyEvent is used, so you may prefer to do this:

use feature 'say';
use JSON;
use Try::Tiny;
try {
    my $fetch = $client->fetch( 'foo', 'baz' )->recv;
}
catch {
    # the error is a JSON array: [ HTTP code, reason ]
    my $err = decode_json $_;
    say "error: code => " . $err->[0] . " reason => " . $err->[1];
};

The error is an array: the first value is the HTTP code, and the second is the reason for the error given by Riak.

At the moment, the KiokuDB backend is not complete, but if you want to start to play with it, all you need to do is:

my $dir = KiokuDB->new(
    backend => KiokuDB::Backend::Riak->new(
        db => AnyEvent::Riak->new(      
            host => 'http://localhost:8098',
            path => 'jiak',
        ),
        bucket => 'kiokudb',            
    ),
);
 
# KiokuDB needs a live object scope while objects are inserted
my $scope = $dir->new_scope;
$dir->txn_do(sub { $dir->insert($key => $object)});

A simple feed aggregator with modern Perl - part 4

Wednesday, May 13th, 2009

We have the model, the aggregator (and some tests), so now we can build a basic frontend to read our feeds. For this I will create a webapp using Catalyst.

Catalyst::Devel is required for developing Catalyst applications, so we will install it first:

    cpan Catalyst::Devel

Now we can create our Catalyst application using the helper:

    catalyst.pl MyFeedReader

This command initialises the framework for our application MyFeedReader. A number of files are created, like the structure of the MVC directory, some tests, helpers, …

We start by creating a view, using TTSite. TTSite generates some templates for us, and the configuration for these templates. We will also have a basic CSS, a header, footer, etc.

    cd MyFeedReader
    perl script/myfeedreader_create.pl view TT TTSite

TTSite files are under root/src and root/lib. A lib/MyFeedReader/View/TT.pm file is also created. We edit it to make it look like this:

    __PACKAGE__->config({
        INCLUDE_PATH => [
            MyFeedReader->path_to( 'root', 'src' ),
            MyFeedReader->path_to( 'root', 'lib' )
        ],
        PRE_PROCESS  => 'config/main',
        WRAPPER      => 'site/wrapper',
        ERROR        => 'error.tt2',
        TIMER        => 0,
        TEMPLATE_EXTENSION => '.tt2',
    });

Now we create our first template, in root/src/index.tt2:

    to <a href="/feed/">your feeds</a>

If you start the application (using perl script/myfeedreader_server.pl) and point your browser to http://localhost:3000/, this template will be rendered.

We need two models, one for KiokuDB and another one for MyModel:

lib/MyFeedReader/Model/KiokuDB.pm

    package MyFeedReader::Model::KiokuDB;
    use Moose;
    BEGIN { extends qw(Catalyst::Model::KiokuDB) }
    1;

We edit the configuration file (myfeedreader.conf), and set the dsn for our KiokuDB backend:

    <Model KiokuDB>
        dsn dbi:SQLite:../MyAggregator/foo.db
    </Model>

lib/MyFeedReader/Model/MyModel.pm

    package MyFeedReader::Model::MyModel;
    use base qw/Catalyst::Model::DBIC::Schema/;
    1;

and the configuration:

    <Model MyModel>
        connect_info dbi:SQLite:../MyModel/model.db
        schema_class MyModel
    </Model>

We have our view and our models, so we can write the code for the controllers. We need 2 controllers, one for the feeds, and one for the entries. The Feed controller will list the feeds and display the entry titles for a given feed. The Entry controller will just display an entry.

lib/MyFeedReader/Controller/Feed.pm

    package MyFeedReader::Controller::Feed;
    use strict;
    use warnings;
    use parent 'Catalyst::Controller';
 
    __PACKAGE__->config->{namespace} = 'feed';
 
    sub index : Path : Args(0) {
        my ( $self, $c ) = @_;
        $c->stash->{feeds}
            = [ $c->model('MyModel')->resultset('Feed')->search() ];
    }
 
    sub view : Chained('/') : PathPart('feed/view') : Args(1) {
        my ( $self, $c, $id ) = @_;
        $c->stash->{feed}
            = $c->model('MyModel')->resultset('Feed')->find($id);
    }
 
    1;

The function index lists the feeds, while the function view lists the entries for a given feed. We use the chained action mechanism to dispatch this URL, so we can have URLs like /feed/view/*

We create our 2 templates (for index and view):

root/src/feed/index.tt2

    <ul>
        [% FOREACH feed IN feeds %]
            <li><a href="/feed/view/[% feed.id %]">[% feed.url %]</a></li>
        [% END %]
    </ul>

root/src/feed/view.tt2

    <h1>[% feed.url %]</h1>
 
    <h3>entries</h3>
    <ul>
        [% FOREACH entry IN feed.entries %]
            <li><a href="/entry/[% entry.id %]">[% entry.permalink %]</a></li>
        [% END %]
    </ul>

If you point your browser to http://localhost:3000/feed/ you will see this:

[screenshot: list_feed]

Now the controller for displaying the entries:

    package MyFeedReader::Controller::Entry;
    use strict;
    use warnings;
    use MyAggregator::Entry;
    use parent 'Catalyst::Controller';
 
    __PACKAGE__->config->{namespace} = 'entry';
 
    sub view : Chained('/') : PathPart('entry') : Args(1) {
        my ( $self, $c, $id ) = @_;
        $c->stash->{entry} = $c->model('KiokuDB')->lookup($id);
    }
 
    1;

The function view fetches an entry from the KiokuDB backend, and stores it in the stash, so we can use it in our template.

root/src/entry/view.tt2

    <h1><a href="[% entry.permalink %]">[% entry.title %]</a></h1>
    <span>Posted [% entry.date %] by [% entry.author %]</span>
    <div id="content">
        [% entry.content %]
    </div>

If you point your browser to an entry (something like http://localhost:3000/entry/somesha256value), you will see an entry:

[screenshot: show_entry]

Et voilà, we are done with a really basic feed reader. You can add methods to add or delete feeds, mark an entry as read, …

The code is available on github.

A simple feed aggregator with modern Perl - part 2

Tuesday, April 28th, 2009

I’ve chosen to write about a feed aggregator because it’s one of the things I’m working on at RTGI (along with web crawlers, gluing data together with search engines, etc.)

For the feed aggregator, I will use Moose, KiokuDB and our DBIx::Class schema. Before we get started, I’d like to give a short introduction to Moose and KiokuDB.

Moose:
Moose is “a postmodern object system for Perl 5”. Moose brings to OO Perl some really nice concepts like roles, a better syntax, “free” constructor and destructor, … If you don’t already know Moose, check http://www.iinteractive.com/moose/ for more information.

KiokuDB:
KiokuDB is a Moose based frontend to various data stores [...] Its purpose is to provide persistence for “regular” objects with as little effort as possible, without sacrificing control over how persistence is actually done, especially for harder to serialize objects. [...] KiokuDB is meant to solve two related persistence problems:

  • Store arbitrary objects without changing their class definitions or worrying about schema details, and without needing to conform to the limitations of a relational model.
  • Persisting arbitrary objects in a way that is compatible with existing data/code (for example interoperating with another app using CouchDB with JSPON semantics).

I will store each feed entry in KiokuDB. I could have chosen to store them as plain text in JSON files, in my DBIx::Class model, etc. But as I want to show you new and modern stuff, I will store them in KiokuDB using the DBI backend.

And now for something completely different, code!

First, we will create a base module named MyAggregator.

module-setup MyAggregator

We will now edit lib/MyAggregator.pm and write the following code:

package MyAggregator;
use Moose;
1;

As you can see, there is no use strict; use warnings here: Moose automatically turns on these pragmas. We don’t have to write the new method either, as it’s provided by Moose.

For parsing feeds, we will use XML::Feed, and we will use it in a Role. If you don’t know what roles are:

Roles have two primary purposes: as interfaces, and as a means of code reuse. Usually, a role encapsulates some piece of behavior or state that can be shared between classes. It is important to understand that roles are not classes. You cannot inherit from a role, and a role cannot be instantiated.

So, we will write our first role, lib/MyAggregator/Roles/Feed.pm:

package MyAggregator::Roles::Feed;
use Moose::Role;
use XML::Feed;
use feature 'say';
sub feed_parser {
    my ( $self, $content ) = @_;
    # parse() may die or simply return undef, so check both cases
    my $feed = eval { XML::Feed->parse( $content ) };
    if ( !$feed ) {
        my $error = $@ || XML::Feed->errstr || 'unknown error';
        say "error while parsing feed : $error";
        return;
    }
    $feed;
}
1;

This one is pretty simple. It takes some content, tries to parse it, and returns an XML::Feed object. If it can’t parse the feed, the error is shown, and the result is undef.

Now, a second role will be used to fetch the feed, and do basic caching, lib/MyAggregator/Roles/UserAgent.pm:

package MyAggregator::Roles::UserAgent;
use Moose::Role;
use LWP::UserAgent;
use HTTP::Request;
use Cache::FileCache;
use URI;
 
has 'ua' => (
    is      => 'ro',
    isa     => 'Object',
    lazy    => 1,
    default => sub { LWP::UserAgent->new( agent => 'MyUberAgent' ); }
);
has 'cache' => (
    is   => 'rw',
    isa  => 'Cache::FileCache',
    lazy => 1,
    default => sub { Cache::FileCache->new( { namespace => 'myaggregator', } ); }
);
 
sub fetch_feed {
    my ( $self, $url ) = @_;
 
    my $req = HTTP::Request->new( GET => URI->new( $url ) );
    my $ref = $self->cache->get( $url );
    if ( defined $ref && $ref->{ LastModified } ne '' ) {
        $req->header( 'If-Modified-Since' => $ref->{ LastModified } );
    }
 
    my $res = $self->ua->request( $req );
    $self->cache->set(
        $url,
        {   ETag         => $res->header( 'Etag' )          || '',
            LastModified => $res->header( 'Last-Modified' ) || ''
        },
        '5 days',
    );
    $res;
}
1;

This role has 2 attributes: ua and cache. The ua attribute is our UserAgent. ‘lazy’ means that it will not be constructed until it is first accessed, here when I call $self->ua->request.
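Roughly, a lazy attribute behaves like the plain-Perl accessor below. This is a sketch without Moose, only to show when the default gets built; the LazyDemo package and its build counter are made up for the illustration:

```perl
use strict;
use warnings;

package LazyDemo;

my $builds = 0;    # counts how many times the default is computed

sub new { return bless {}, shift }

# The value is built on first access only, then reused
sub ua {
    my $self = shift;
    $self->{ua} //= do { $builds++; 'MyUberAgent' };
    return $self->{ua};
}

sub builds { return $builds }

package main;

my $demo = LazyDemo->new;
print "builds before access: ", LazyDemo::builds(), "\n";
$demo->ua for 1 .. 3;
print "builds after 3 accesses: ", LazyDemo::builds(), "\n";
```

Creating the object costs nothing, and the default is computed once no matter how many times the accessor is called.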

I use Cache::FileCache for basic caching so I don’t fetch or parse the feed when it’s unnecessary. I store the Etag and Last-Modified headers of each response, and send the Last-Modified value back as If-Modified-Since so the server can tell me when nothing has changed.

The only method of this role is fetch_feed. It fetches a URL, making the request conditional when the cache holds a Last-Modified value for it, and returns an HTTP::Response object.
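The role above only sends If-Modified-Since; the matching request header for a stored ETag is If-None-Match. Here is a small sketch of turning the cached values into request headers. conditional_headers is a hypothetical helper, not part of the role, and the sample values are invented:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the validators stored by fetch_feed into
# request headers for a conditional GET. Empty values are skipped.
sub conditional_headers {
    my ($cached) = @_;
    my %headers;
    return \%headers unless defined $cached;
    $headers{'If-Modified-Since'} = $cached->{LastModified}
        if defined $cached->{LastModified} && $cached->{LastModified} ne '';
    $headers{'If-None-Match'} = $cached->{ETag}
        if defined $cached->{ETag} && $cached->{ETag} ne '';
    return \%headers;
}

my $headers = conditional_headers(
    { ETag => '"abc123"', LastModified => 'Tue, 28 Apr 2009 10:00:00 GMT' } );
print "$_: $headers->{$_}\n" for sort keys %$headers;
```

A server that honours these headers answers 304 Not Modified when nothing has changed. Note that is_success is false for a 304, so the caller has to treat that status separately from a real failure.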

Now, I create an Entry class in lib/MyAggregator/Entry.pm:

package MyAggregator::Entry;
use Moose;
use Digest::SHA qw(sha256_hex);
has 'author'  => ( is => 'rw', isa => 'Str' );
has 'content' => ( is => 'rw', isa => 'Str' );
has 'title'   => ( is => 'rw', isa => 'Str' );
has 'id'      => ( is => 'rw', isa => 'Str' );
has 'date'      => ( is => 'rw', isa => 'Object' );
has 'permalink' => (
    is       => 'rw',
    isa      => 'Str',
    required => 1,
    trigger  => sub {
        my $self = shift;
        $self->id( sha256_hex $self->permalink );
    }
);
1;

Here the permalink attribute has a trigger: each entry has a unique ID, built from the SHA-256 hash of the permalink. So, when we set the permalink, the ID is set automatically.
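Outside of Moose, the ID computation boils down to a single call. A minimal sketch, with a made-up permalink:

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical permalink, for illustration only
my $permalink = 'http://example.com/posts/1';
my $id        = sha256_hex($permalink);

print length($id), " hex characters: $id\n";
```

The same permalink always gives the same ID, which is what makes it usable as a key. Two URLs that differ only by a trailing slash or a query string give two different IDs.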

We can now change our MyAggregator module like this:

package MyAggregator;
use feature ':5.10';
use MyModel;
use Moose;
use MyAggregator::Entry;
use KiokuDB;
use Digest::SHA qw(sha256_hex);
with 'MyAggregator::Roles::UserAgent', 'MyAggregator::Roles::Feed';
 
has 'context' => ( is => 'ro', isa => 'HashRef' );
has 'schema' => (
    is      => 'ro',
    isa     => 'Object',
    lazy    => 1,
    default => sub { MyModel->connect( $_[0]->context->{ dsn } ) },
);
has 'kioku' => (
    is      => 'rw',
    isa     => 'Object',
    lazy    => 1,
    default => sub {
        my $self = shift;
        KiokuDB->connect( $self->context->{ kioku_dir }, create => 1 );
    }
);
sub run {
    my $self = shift;
 
    my $feeds = $self->schema->resultset( 'Feed' )->search();
    while ( my $feed = $feeds->next ) {
        my $res = $self->fetch_feed( $feed->url );
        if ( !$res || !$res->is_success ) {
            say "can't fetch " . $feed->url;
        } else {
            $self->dedupe_feed( $res, $feed->id );
        }
    }
}
 
sub dedupe_feed {
    my ( $self, $res, $feed_id ) = @_;
 
    my $feed = $self->feed_parser( \$res->content );
    return if ( !$feed );
    foreach my $entry ( $feed->entries ) {
        next if $self->schema->resultset( 'Entry' )->find( sha256_hex $entry->link );
        my $meme = MyAggregator::Entry->new(
            permalink => $entry->link,
            title     => $entry->title,
            author    => $entry->author,
            date      => $entry->issued,
            content   => $entry->content->body,
        );
 
 
        $self->kioku->txn_do(
            scope => 1,
            body  => sub {
                $self->kioku->insert( $meme->id => $meme );
            }
        );
        $self->schema->txn_do(
            sub {
                $self->schema->resultset( 'Entry' )->create(
                    {   entryid   => $meme->id,
                        permalink => $meme->permalink,
                        feedid    => $feed_id,
                    }
                );
            }
        );
    }
}
1;
  • the with function composes roles into a class. So my MyAggregator class has the fetch_feed and feed_parser methods, and all the attributes of our roles.
  • context is a HashRef that contains the configuration
  • schema is our MyModel schema
  • kioku is a connection to our kiokudb backend

There are two methods in this object: run and dedupe_feed.

The run method gets the list of feeds (via the search call). For each feed returned by the search, we try to fetch it, and if that succeeds, we dedupe the entries. To dedupe the entries, we check if the permalink is already in the database (via the find call). If we already have this entry, we skip it and move on to the next one. If it’s a new entry, we create a MyAggregator::Entry object with the content, date, title, …, store this object in KiokuDB (the kioku txn_do call creates a transaction, and we do our insertion inside it), and create a new entry in the MyModel database (the schema txn_do call: we open a transaction there too, and insert the entry in the database).

And to run this, a little script:

#!/usr/bin/perl -w
use strict;
use MyAggregator;
use YAML::Syck;
my $agg = MyAggregator->new(context => LoadFile shift);
$agg->run;

So we can run our aggregator like this:

perl bin/aggregator.pl conf.yaml

And it’s done :) We now have a really basic aggregator. If you want to improve it, a good place to start is the dedupe process, using the permalink, the date and/or the title, as the current one is too basic. In the next article we will write some tests for this aggregator using Test::Class.
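As one possible direction for a better dedupe (a sketch, not what the aggregator does), the key could be built from the normalised title and the date, so that the same article reachable under two permalinks is stored once. dedupe_key is a hypothetical helper, and the date is assumed to be available as a string:

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: build a dedupe key from the title and the date.
sub dedupe_key {
    my (%entry) = @_;
    my $title = lc( $entry{title} // '' );
    $title =~ s/\s+/ /g;         # collapse runs of whitespace
    $title =~ s/^\s+|\s+$//g;    # trim both ends
    # sha256_hex works on bytes, so encode to UTF-8 first
    return sha256_hex( encode_utf8( join "\0", $title, $entry{date} // '' ) );
}

my $first  = dedupe_key( title => 'Riak, Perl and KiokuDB',    date => '2009-12-13' );
my $second = dedupe_key( title => ' riak,  perl and kiokudb ', date => '2009-12-13' );
print $first eq $second ? "same entry\n" : "different entries\n";
```

Whether the title and date are stable enough depends on the feeds; some publishers edit titles after publication, in which case the permalink remains the safer key.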

Big thanks to tea and blob for reviewing and fixing my broken English in the first 2 parts.

The code is available here.

Part 3 and 4 next week.

belgian perl workshop 09

Sunday, March 8th, 2009

Last weekend my co-workers and I went to the Belgian Perl Workshop 09. I attended the following presentations:

  • kiokudb, by nothingmuch. Slides are available here. We were able to talk with him during the afternoon, and we might use it at work.
  • Painless XSLT with Perl, by andrew shitov.
    Was interesting, even if I don’t do any XSLT anymore. Again, some ideas might be used for work.
  • What are you pretending to be?, by liz. That’s a hell of a hack. The module is available on the cpan.
  • Regular Expressions and Unicode Guru, by abigail. I feel better knowing that I’m not the only one suffering with Unicode in Perl ;). I learned some stuff, like how to create a custom character class, etc.
  • Catalyst, by matt trout. Ok, we’re using Catalyst at work for our webservices. So we already know about Catalyst, but we were curious. And as I was hoping, we learned some nice tweaks. We discovered Catalyst::Model::Adaptor, so we don’t have to do horrible stuff in our Controllers any more, and there was some other interesting stuff in this talk. And matt is a really good speaker, who manages to keep an audience amused and interested.
  • Catalyst & AWS, by matt trout. Once again, a really good talk by matt. Some good advice, and a lot of fun.

We didn’t stay long for the social event in the evening; we had booked a hotel in Brussels. But I’m glad that we were able to get to this Perl workshop: it was well organised, with good talks, and we met nice people and learned some stuff. All in all, a good day :)

Some photos are available on my flickr account.