Service Discovery with etcd and Node.js

In this blog post I'll cover some basics about etcd and show you how to use it for service discovery.

This is not part of my series on CoreOS, etcd, fleet and Docker, the first two parts of which can be found here and here, but it is related. Expect further blog posts to come in that series!

tl;dr

Herein I will:

  • Give an overview of the service discovery problem.
  • Discuss how etcd solves it.
  • Show you how to get etcd up and running.
  • Demonstrate a practical example in Node.js.
  • Give some pointers for more advanced use cases.

Service Discovery

Put simply, service discovery is the process of services finding the other services they need to connect to: for example, an API server needing to know which database to connect to, or a worker needing to know which message queue to subscribe to.

Why is this even a problem? Why not just tell the API server the database endpoint and credentials when it starts up, via a config file, command-line arguments or environment variables?

Service discovery gives you more flexibility and allows you to better handle services coming and going either through failure or through updates.

With a service discovery solution, an API server (for example) can simply ask a trusted registry "give me the details of the database that I should be using" and get them back. When a depended-upon service becomes unavailable and is replaced, its users can simply ask for the details again and start using the new one.

It sounds like a simple problem, but building a service discovery solution that is fault-tolerant and consistent is non-trivial. A naive centralised solution could become a single point of failure in your system; conversely, a distributed solution needs to be as consistent as possible so that all nodes give the same answer, and ideally it should also support more advanced features such as distributed locks for leader election.

Using etcd for Service Discovery

etcd is a distributed key-value store with a REST API that speaks JSON. It strives to be highly available and highly consistent. In CAP terms it is consistent and partition-tolerant, according to its creator Brandon Philips. etcd can be used for configuration management, leader election and distributed locks in addition to service discovery.

etcd's key-value store can be likened to a file system with a directory structure that can be accessed over HTTP. You can get and set values, list keys in a directory and watch for changes.
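
To make the "file system over HTTP" idea concrete, here's a minimal sketch that talks to etcd's v2 HTTP API directly from Node, with no client library. It assumes you already have etcd running locally (installation is covered below) and that its client API is on port 4001 (newer releases default to 2379, so adjust to suit):

var http = require('http');

// A sketch only: set a key with an HTTP PUT, then read it back with a GET.
// etcd's v2 keyspace lives under /v2/keys/ and both responses are JSON.
var put = http.request({
  host: '127.0.0.1',
  port: 4001, // or 2379 on newer releases
  path: '/v2/keys/greeting',
  method: 'PUT',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
}, function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    console.log('PUT response:', body);

    // Now read the key back with a plain GET.
    http.get('http://127.0.0.1:4001/v2/keys/greeting', function (res2) {
      var body2 = '';
      res2.on('data', function (chunk) { body2 += chunk; });
      res2.on('end', function () {
        console.log('GET response:', body2);
      });
    });
  });
});

put.end('value=hello'); // the v2 API expects a form-encoded body

etcdctl and node-etcd, which we'll use below, are essentially convenience wrappers around requests like these.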

We can use etcd for service discovery by having our services store their IP, port and other pertinent information in its key-value store. Other services can then do a simple look-up by service name to find out how to connect to a service. A service can also subscribe to be notified of changes.

I'm not going to compare etcd to other similar or competing solutions (such as ZooKeeper, Doozerd or Consul.io). There is ample material on this just a short Google search away.

Below I'll be talking about and demonstrating etcd, etcdctl and node-etcd. In case it's not clear: etcd is the distributed key-value server discussed above, etcdctl is a command-line etcd client that is useful for debugging, and node-etcd is a Node.js module from npm that can be used to interact with etcd.

Installing and Running etcd

I'm going to give instructions for OS X using Homebrew, but if you use something else then read the latest installation instructions here.

$ brew update
$ brew install etcd etcdctl
$ etcd

For the rest of the post I'm going to assume that you have etcd running, so fire up another terminal to follow along.

I'm not going to cover the basics of using etcd; they have been more than adequately covered here. If you're new to etcd and etcdctl, I suggest you read that now and then come back.

Let's get started.

Naive Service Discovery

You can find all the code from this post on GitHub here. You will of course need to run npm install before running each example for the first time.

To use etcd for service discovery we'll use a key that represents the service, and a value that is a JSON document containing various bits of useful information that a user of the service might need, such as hostname, port and pid.

Let's use etcdctl to demonstrate:

$ etcdctl set /services/myservice "{\"hostname\": \"127.0.0.1\", \"port\": 3000}"
{"hostname": "127.0.0.1", "port": 3000}
$ etcdctl ls /services
/services/myservice
$ etcdctl get /services/myservice
{"hostname": "127.0.0.1", "port": 3000}

This is how a basic service registration might look. Here's how you might do this in Node.js (register1/index.js):

var path = require('path'),  
    Etcd = require('node-etcd');

var etcd = new Etcd();

var p = path.join('/', 'services', 'myservice');  
etcd.set(p,  
  JSON.stringify({
    hostname: '127.0.0.1',
    port: '3000',
    pid: process.pid
  }));
console.log('Registered with etcd as ' + p);  

Ignore for a moment that nothing is actually bound on 127.0.0.1:3000; I'm just showing how we're going to register services.

Run this Node app and then interrogate etcd with etcdctl:

$ cd register1
$ node index
Registered with etcd as /services/myservice  
$ etcdctl ls /services
/services/myservice
$ etcdctl get /services/myservice
{"hostname": "127.0.0.1", "port": 3000}

Notice how our app has finished, yet the key remains in etcd's key store?

Less Naive Service Discovery

We can use what's called a time-to-live (TTL) to have keys automatically expire, and we can schedule a regular update to this key in our app to keep it alive. Again, let's demonstrate with etcdctl:

$ etcdctl set --ttl 10 /services/myservice "{\"hostname\": \"127.0.0.1\", \"port\": 3000}"
$ etcdctl get /services/myservice
{"hostname": "127.0.0.1", "port": 3000}

We've set a TTL of 10 seconds, so play with the commands above: call get before the 10 seconds are up, then set the key again and call get once the 10 seconds have passed. Observe what happens.

Let's apply this to our Node.js app (register2/index.js):

var pkgjson = require('./package.json'),  
    path = require('path'),
    Etcd = require('node-etcd');

var etcd = new Etcd();

function etcdRegister() {  
  var p = path.join('/', 'services', 'myservice');
  etcd.set(p,
    JSON.stringify({
        hostname: '127.0.0.1',
        port: '3000',
        pid: process.pid,
        name: pkgjson.name
      }),
    {
      ttl: 10 // seconds; the key expires unless we keep refreshing it
    });
  setTimeout(etcdRegister, 5000); // re-register every 5s, well within the 10s TTL
  return p;
}

console.log(pkgjson.name + ' registered with etcd as ' + etcdRegister());  

Run the app and let it run for 10s or so, then kill it:

$ cd ../register2
$ node index
register2 registered with etcd as /services/myservice  
^C
$ etcdctl get /services/myservice
{"hostname":"127.0.0.1","port":"3000","pid":69470, "name": "register2"}

... wait 10 seconds or so, and then:

$ etcdctl get /services/myservice
Error: 100: Key not found (/services/myservice) [52565]  

What happened? Our Node.js app registers with etcd by writing a key that expires in 10s, and then refreshes that key every 5s. This works well because the key is maintained while the app is running, but if the process dies the key expires and is removed within 10s. You can see this in the example above: we killed the app, waited a bit, and then queried etcd again, and the value was gone.

Okay, let's add another node that discovers the first one (naive-discovery/index.js):

var pkgjson = require('./package.json'),  
    path = require('path'),
    Etcd = require('node-etcd');

var etcd = new Etcd();

function etcdDiscover(name, options, callback) {
  etcd.get(path.join('/', 'services', name), options, function (err, res) {
    if (err) {
      return callback(err);
    }
    var value = JSON.parse(res.node.value);
    return callback(null, value);
  });
}

console.log(pkgjson.name + ' is looking for \'myservice\'...');  
etcdDiscover('myservice', {wait: true}, function (err, node) {  
  if (err) {
    console.log(err.message);
    process.exit(1);
  }
  console.log(pkgjson.name + ' discovered node: ', node);
  setInterval(function () {}, 10000); // keep the app alive
});

...and let's test it out:

$ cd ../naive-discovery
$ node ../register2/index & node index.js
[4] 69606
register2 registered with etcd as /services/myservice  
naive-discovery is looking for 'myservice'...  
naive-discovery discovered node:  { hostname: '127.0.0.1', port: '3000', pid: 984, name: 'register2' }  

Let's kill those, and then start them up in the other order:

$ jobs
[1]-  Running                 node index.js &
[2]+  Running                 node ../register2/index.js &
$ kill %1 %2
[1]-  Exit 143                node index.js
[2]+  Exit 143                node ../register2/index.js
$ node index & node ../register2/index
naive-discovery is looking for 'myservice'...  
register2 registered with etcd as /services/myservice  
naive-discovery discovered node:  { hostname: '127.0.0.1', port: '3000', pid: 995, name: 'register2' }  

Notice in the latter example how naive-discovery starts looking for myservice in etcd before register2 is even launched, yet it still works and doesn't error? This is because we passed {wait: true} to etcd.get(), which asks etcd to hold the request open and respond once the value becomes available.

Using TTL makes our service discovery a little more robust, but there are still problems, as I'm about to demonstrate. Kill the running services as above and we'll start them up again:

$ jobs
[1]-  Running                 node index.js &
[2]+  Running                 node ../register2/index.js &
$ kill %1 %2
[1]-  Exit 143                node index.js
[2]+  Exit 143                node ../register2/index.js
$ node index & node ../register2/index &
register2 registered with etcd as /services/myservice  
naive-discovery discovered node:  { hostname: '127.0.0.1',  
  port: '3000',
  pid: 3138,
  name: 'register2' }

Now kill register2 and start it up again:

$ jobs
[1]-  Running                 node index.js &
[2]+  Running                 node ../register2/index.js &
$ kill %2
[2]+  Exit 143                node ../register2/index.js
$ node ../register2/index.js &
[2] 1385
$ register2 registered with etcd as /services/myservice

As you might expect, naive-discovery doesn't know that register2 went away and came back again. If this were a database it was trying to communicate with, it would be getting errors and would have no way to do anything about them. Service discovery at startup is useful, but let's make it even more useful by having it respond to what's going on in real time.

More Robust Service Discovery

We can resolve this issue by watching a given key for changes and responding accordingly. Once again, let's explore this with etcdctl first. Run register2 again (or just leave it running if it already is; use jobs to check):

$ cd ../register2
$ node index
register2 registered with etcd as /services/myservice  

Now open another terminal window and run the following:

$ etcdctl watch --forever /services/myservice
{"hostname":"127.0.0.1","port":"3000","pid":1659,"name":"register2"}
{"hostname":"127.0.0.1","port":"3000","pid":1659,"name":"register2"}
...

This will watch for changes to the services/myservice key and print out the value whenever it is modified. As you can see, it will print a new line every 5 seconds because register2 is re-setting the key every 5 seconds.

Now go back to the first terminal and kill register2. Notice how, within 10 seconds or so, the watching terminal prints a blank line? That's the TTL expiring after register2 was killed; the key no longer has a value.

Now start register2 again. The value will appear in the etcdctl watch output once it has registered. You can kill etcdctl and close that terminal now.

Here's how you might do this in a node application (robust-discovery/index.js):

var pkgjson = require('./package.json'),  
    path = require('path'),
    Etcd = require('node-etcd');

var etcd = new Etcd();

function etcdDiscover(name, options, callback) {
  var key = path.join('/', 'services', name);
  etcd.get(key, options, function (err, res) {
    if (err) {
      return callback(err);
    }
    var value = JSON.parse(res.node.value);
    // hand back a watcher on the same key so the caller can react to changes
    return callback(null, value, etcd.watcher(key));
  });
}

console.log(pkgjson.name + ' is looking for \'myservice\'...');  
etcdDiscover('myservice', {wait: true}, function (err, node, watcher) {  
  if (err) {
    console.log(err.message);
    process.exit(1);
  }
  console.log(pkgjson.name + ' discovered node: ', node);
  watcher
    .on('change', function (data) {
      // data carries the etcd response for the change, so print the fresh value
      console.log('Value changed; new value: ', JSON.parse(data.node.value));
    })
    .on('expire', function (data) {
      console.log('Value expired.');
    })
    .on('delete', function (data) {
      console.log('Value deleted.');
    });
});

If register2 is still running, kill it, and then we'll start the robust-discovery service:

$ cd ../robust-discovery
$ node index
robust-discovery is looking for 'myservice'...  

In another terminal, start register2 and then watch what happens in the logs of robust-discovery:

robust-discovery is looking for 'myservice'...  
robust-discovery discovered node:  { hostname: '127.0.0.1',  
  port: '3000',
  pid: 2251,
  name: 'register2' }
Value changed; new value:  { hostname: '127.0.0.1',  
  port: '3000',
  pid: 2251,
  name: 'register2' }
...

As you can see, the watcher prints out the value each time register2 sets it (every 5 seconds).

Let's try deleting the key (in a new terminal):

$ etcdctl rm /services/myservice

And watch what happens in the robust-discovery output:

Value deleted.  
Value changed; new value:  { hostname: '127.0.0.1',  
  port: '3000',
  pid: 2251,
  name: 'register2' }

The key is set again very soon afterwards by register2, so we see a "deleted" event followed almost immediately by a "changed" event.

Now kill register2 and again look at the output of robust-discovery:

Value expired.  

After up to 10 seconds you will see this output and then nothing more.

Summary

I've shown you the basics of using etcd for service discovery, demonstrating the principles with etcdctl, and giving a practical example with Node.js.

For the sake of brevity I've not built a full-blown, usable service; I'm assuming you just want to know the principles so you can apply them to your own problems. It's easy to see how you could take the code in robust-discovery and use it to track the availability of a depended-upon service and maintain up-to-date configuration for how to access it: just put 'real' code in place of my "Value changed/deleted/expired" printouts.
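
As a rough sketch of what that 'real' code might look like, here's one way to keep a connection pointed at whichever endpoint is currently registered. connectToDatabase() and its disconnect() method are hypothetical stand-ins for whatever client library you actually use, and etcdDiscover() is the function from robust-discovery/index.js:

var currentNode = null; // details of the endpoint we're currently using
var client = null;      // hypothetical client connected to that endpoint

function useNode(node) {
  // Ignore heartbeat refreshes where nothing has actually changed.
  if (currentNode &&
      currentNode.hostname === node.hostname &&
      currentNode.port === node.port) {
    return;
  }
  if (client) { client.disconnect(); }
  currentNode = node;
  client = connectToDatabase(node.hostname, node.port); // hypothetical
}

function dropNode() {
  if (client) { client.disconnect(); }
  client = null;
  currentNode = null;
}

etcdDiscover('myservice', {wait: true}, function (err, node, watcher) {
  if (err) { throw err; }
  useNode(node);
  watcher
    .on('change', function (data) { useNode(JSON.parse(data.node.value)); })
    .on('expire', dropNode)   // the key's TTL ran out: the service has gone away
    .on('delete', dropNode);  // the key was explicitly removed
});

Note the check at the top of useNode(): because register2 re-sets the key every five seconds, the 'change' event fires on every heartbeat, not just when the endpoint actually moves.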

Instead of using a string of JSON data as the value in etcd, you could use a directory for the service and keys for each value such as host, port, etc. I wanted to keep it simple.
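
As a sketch of that layout using node-etcd (and assuming, as I believe it does, that node-etcd forwards options such as recursive straight through to etcd):

var Etcd = require('node-etcd');
var etcd = new Etcd();

// One key per field rather than a single JSON blob.
etcd.set('/services/myservice/hostname', '127.0.0.1', function () {
  etcd.set('/services/myservice/port', '3000', function () {

    // Read the whole service directory back in one request.
    etcd.get('/services/myservice', { recursive: true }, function (err, res) {
      if (err) { throw err; }
      res.node.nodes.forEach(function (n) {
        console.log(n.key, '=', n.value);
      });
    });
  });
});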

There are further considerations when you have multiple services of the same type (e.g. multiple web front-ends, or some sort of worker nodes). In that case you may want a directory for each type, and then an entry under it for each instance (e.g. /services/myservice/myservice-0{1,2,3}). Again, I'm trying to keep it simple here.
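
As a sketch of that layout (using the process pid to make each instance's key unique; any unique instance id would do):

var Etcd = require('node-etcd');
var etcd = new Etcd();

// --- In each service instance: register under your own key in the service's
// directory, with a TTL so instances that die disappear on their own.
var instanceKey = '/services/myservice/myservice-' + process.pid;
etcd.set(instanceKey,
  JSON.stringify({ hostname: '127.0.0.1', port: 3000, pid: process.pid }),
  { ttl: 10 });

// --- In a consumer: list the directory and pick an instance (here, at random).
etcd.get('/services/myservice', function (err, res) {
  if (err) { throw err; }
  var instances = (res.node.nodes || []).map(function (n) {
    return JSON.parse(n.value);
  });
  var chosen = instances[Math.floor(Math.random() * instances.length)];
  console.log('Using instance:', chosen);
});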

I hope you've found it useful. Let me know if you find any mistakes or if you have any feedback.