Sonic the search engine
For reasons that should be abundantly clear, I’ve been poking at alternatives to Elasticsearch. I’m living in a mostly Rust-based ecosystem right now working on Vector, so I started looking within that world. I found Sonic and decided to give it a whirl.
Sonic is a “fast, lightweight, and schema-less search backend.” It’s written in Rust, licensed under MPL 2.0. It’s maintained by Valerian Saliou, who is one of the founders of Crisp.
Sonic is not Elasticsearch: it’s a lot lighter weight and much less fully featured. Its focus is on normalizing natural language search queries and providing results. Also, unlike Elasticsearch, Sonic is an identifier index rather than a document index: queries return IDs, which you then use to look up the matching documents in an external database. Search terms are stored in collections and organized in buckets; you can use buckets to segregate your data into separate indexes, for example, a bucket per user.
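To make that concrete, a raw exchange over Sonic’s TCP protocol looks roughly like the following; the collection, bucket, and identifiers here are purely illustrative.

    QUERY messages user:0dcde3a6 "valerian saliou" LIMIT(10)
    PENDING Bt2m2gYa
    EVENT QUERY Bt2m2gYa conversation:71f3d63b conversation:6501e83a

The conversation IDs in the EVENT line are what you’d then resolve against your own datastore.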
Another difference worth mentioning is that Sonic indexes at the word level and not at the sentence level. This approach makes for fast and compact storage. It’s worth taking a look at Sonic’s benchmarks to see just how fast, and at Sonic’s limitations to understand the trade-offs you’re making to achieve those results.
It’s also important to note that Sonic runs on a single node and lacks fault-tolerance capabilities like clustering and replication. Although Sonic is lightweight, its single-node nature means it is likely to hit hardware scaling limits at some point.
Installing and Configuring Sonic
Let’s see Sonic in action. We’re going to run Sonic, add some data to it, and then query that data. The fastest way to do this is to run Sonic from its Docker image. All we need is Docker installed, some quick scaffolding, and a sample configuration file.
Let’s create a directory to hold our Sonic test instance and data and change into that directory.
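Something like this will do; the directory name is arbitrary.

    mkdir sonic && cd sonic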
Now we’re going to grab the sample configuration file.
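Assuming you want the config.cfg that ships in the Sonic repository, something like this will fetch it:

    curl -O https://raw.githubusercontent.com/valeriansaliou/sonic/master/config.cfg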
Inside the file, you’ll find a default configuration for Sonic. We’re going to change a few things to make it work for our demo. Firstly, by default, Sonic binds to localhost on port 1491. To work inside a Docker container, we need to bind it to all interfaces. To do this, find this line in the config.cfg file:
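In the sample config it looks something like this, under the [channel] section:

    inet = "[::1]:1491"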
And change it to:
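    inet = "0.0.0.0:1491"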
Next, we want to tell Sonic where to store its indexes. Let’s create some local directories for that now.
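The names and layout here are my choice; anything works as long as the config paths and Docker mounts below line up.

    mkdir -p store/kv store/fst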
The kv directory contains the Key-Value index, and the fst directory contains a word graph of the data inside Sonic. We’ll be mounting these directories as volumes inside our Docker container, and we need to update our configuration to reference them. Find the two path settings inside config.cfg and update them to point at the locations we’ll mount inside the container.
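In the sample config both default to paths under ./data/store/. The kv path becomes:

    path = "/var/lib/sonic/store/kv/"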
And:
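    # the fst path, under [store.fst]
    path = "/var/lib/sonic/store/fst/"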
Lastly, let’s up Sonic’s logging to get some more feedback from it. To do this, change the log_level option to:
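    # under [server]; the sample config defaults to "error"
    log_level = "debug"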
All other defaults can stay the same.
Now let’s run Sonic.
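Using the mount points from Sonic’s Docker instructions (the image tag here is only an example; pin whichever recent release you prefer):

    docker run -p 1491:1491 \
      -v $(pwd)/config.cfg:/etc/sonic.cfg \
      -v $(pwd)/store/:/var/lib/sonic/store/ \
      valeriansaliou/sonic:v1.3.0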
We’ve mapped port 1491 outside the container and mounted our configuration file and store directories into it. We should see the Sonic server start up.
And we can then telnet into port 1491 to see if the server responds.
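On connecting, Sonic should greet us with a CONNECTED banner (the exact version string depends on the image you pulled):

    telnet localhost 1491
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    CONNECTED <sonic-server v1.3.0>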
And hey presto, we’re up and running. It’s not very exciting without adding some data, so let’s generate some.
Testing Sonic
Sonic comes with a collection of official and community-submitted client libraries for various languages and frameworks. As it’s Sunday and I am feeling particularly lazy, I will write two quick Ruby scripts: one to send data to Sonic for ingestion and a second to search it. Both will use the Ruby client for Sonic.
Let’s create a new directory to hold our test scripts:
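Again, the directory name is arbitrary.

    mkdir sonic-test && cd sonic-test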
Now we’ll start our scripts with a Gemfile:
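    source 'https://rubygems.org'

    gem 'sonic-ruby'
    gem 'faker'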
And use Bundler to install the sonic-ruby gem and the faker gem we’ll be using to generate some sample data.
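    bundle install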
Ingesting data
Now let’s write a quick script to ingest some sample data. We’ll call it ingest.rb:
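    touch ingest.rb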
And populate it like so:
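This is a sketch that assumes the sonic-ruby client exposes Sonic::Client.new(host, port, password), channel(:ingest), and push(collection, bucket, object, text), and that we kept the default SecretPassword from the sample configuration; adjust it if the gem’s API differs.

    require 'sonic-ruby'
    require 'faker'

    # Connect to our local Sonic server using the default password
    # from the sample configuration.
    client = Sonic::Client.new('localhost', 1491, 'SecretPassword')

    # Open the ingest channel.
    ingest = client.channel(:ingest)

    # Generate 10,000 fake names and push each one into the users
    # collection and the all bucket, keyed by a simple numeric ID.
    10_000.times do |i|
      ingest.push('users', 'all', "user:#{i}", Faker::Name.name)
    end

Run it with bundle exec ruby ingest.rb.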
Here we’re using Faker to generate an array of 10,000 names and pushing them into a collection called users and a bucket called all. We’ll see a flurry of activity from the Sonic server as it indexes all incoming data.
Searching data
We can then write another script to query this data.
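Let’s call this one search.rb:

    touch search.rb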
And populate it like so:
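Again, this sketch assumes the sonic-ruby interface described above, with query and suggest methods on the search channel.

    require 'sonic-ruby'

    # The name to search for is passed as the first argument.
    name = ARGV[0]

    # Connect to our local Sonic server and open the search channel.
    client = Sonic::Client.new('localhost', 1491, 'SecretPassword')
    search = client.channel(:search)

    # A straight search of the users collection in the all bucket,
    # returning any matching IDs.
    puts "Results:     #{search.query('users', 'all', name)}"

    # Ask Sonic to suggest words from its dictionary based on our input.
    puts "Suggestions: #{search.suggest('users', 'all', name)}"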
Our script takes a single name as input and performs two operations. The first is a straight search of the users collection in the all bucket; if the name matches one or more indexed IDs, it’ll return them on the command line. The second is a suggest query that returns one or more suggested names. Let’s give it a try now:
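    bundle exec ruby search.rb kate
    bundle exec ruby search.rb jim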
We should see Sonic return some matching IDs for kate and jim, along with some suggested variants.
I think this example shows Sonic’s simplicity and power, and how easy it would be to wire it into a search box to get suggestions and corrections. I can see Sonic covering a useful middle ground: plenty of folks who would previously have defaulted to Elasticsearch have search needs that what Sonic provides would satisfy. Naturally, Sonic’s single-node nature, the lack of fault tolerance, and the potential scaling challenges may be an issue for many folks. However, I still think it’s a cool project and worth a look.