Riak 101
11/21/2013
by Gabe Koss
Riak is a distributed key-value store written in Erlang. Development is carried out primarily by a company called Basho Technologies.
I first became aware of Riak when searching for solutions to complex data storage issues we started to encounter while dealing with large volumes of JSON data. Because we were already researching the solution, Sam Stelfox and I volunteered to give a presentation on the database at a @btvwag meetup.
Most of this content is derived from the slides from the presentation.
NoSQL, huh?
The event where we gave this presentation was a meeting on various NoSQL technologies. We started with the following quotes from Andy Gross:
"NoSQL marketing is confusing... Everything does everything and at a small scale everything works."
"If you're evaluating Mongo vs. Riak or Couch vs. Cassandra you don't understand either your problem or the technologies."
Andy Gross, VP of Engineering, Basho Technologies
(approximate paraphrasing, hopefully not grossly misquoted)
NoSQL is often cited as a sort of 'magic bullet' by both marketers and engineers. These technologies are generally less mature than their conventional, relational cousins. As a result, tools with significant differences between them often get compared against each other for problems which neither one solves.
As developers it is crucial to understand our problems fully before we commit to technologies which may or may not suit our purposes.
So what is Riak then?
Riak offers quite a lot. It is a RESTful key-value store with some internal intelligence. As far as I have been able to decipher, its most significant features are as follows:
- HTTP access
- Simple Key-Value Database
- Masterless Clustering
- CAP Tuning
- Map/Reduce using JavaScript or Erlang
- Automatic Sharding / Consistent Hashing
Other features which I will not dig into in this article include full text search via Solr, link walking, commit hooks for data processing, and a secondary (custom) indexing system.
Key-Value Heaven
Since Riak is HTTP based, it uses a URL scheme to organize keys and values into data buckets. Buckets are not strictly defined like tables in relational databases.
The basic URL scheme looks like: http://riak-node-hostname:8098/riak/<bucket-name>/<key-name>
This scheme reflects Riak's internal data model, which is basically made up of the following:
- Bucket Name
- Key Name
- Binary Data
- Indexes
- Additional Metadata
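To make the model concrete, here is a minimal Ruby sketch of an object carrying those five pieces, plus the URL and headers it would map to. The RiakObject struct and its method names are hypothetical and are not part of any Riak client. The X-Riak-Index-* and X-Riak-Meta-* header prefixes are how Riak's HTTP interface carries indexes and metadata, and URL-encoding the bucket and key is my own assumption for keeping odd characters out of the path.

```ruby
require 'erb'

# Hypothetical container mirroring the data model listed above.
RiakObject = Struct.new(:bucket, :key, :data, :content_type, :indexes, :metadata,
                        keyword_init: true) do
  # URL for this object, following the /riak/<bucket>/<key> scheme.
  def url(host: 'localhost', port: 8098)
    encoded_bucket = ERB::Util.url_encode(bucket)
    encoded_key = ERB::Util.url_encode(key)
    "http://#{host}:#{port}/riak/#{encoded_bucket}/#{encoded_key}"
  end

  # HTTP headers that would accompany the binary data on a PUT.
  def headers
    result = { 'Content-Type' => content_type }
    (indexes || {}).each { |name, value| result["X-Riak-Index-#{name}"] = value.to_s }
    (metadata || {}).each { |name, value| result["X-Riak-Meta-#{name}"] = value.to_s }
    result
  end
end

horse = RiakObject.new(
  bucket: 'images',
  key: 'horse.jpg',
  data: 'raw image bytes would go here',
  content_type: 'image/jpeg',
  indexes: { 'format_bin' => 'jpeg' }, # index names end in _bin or _int
  metadata: { 'source' => 'camera' }
)
puts horse.url
puts horse.headers.inspect
```

Nothing here talks to a server; it only shows how little structure a stored value actually has.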
Simple Example
A basic PUT / GET operation to save some data into a bucket looks like this:
$ curl -XPUT "http://localhost:8098/riak/my-bucket/my-key" \
    --header "Content-Type: application/json" \
    --data '{"living_in":"the future!"}'
$ curl -XGET "http://localhost:8098/riak/my-bucket/my-key"
{"living_in":"the future!"}
In this example a JSON blob containing '{"living_in":"the future!"}' is stored in the my-bucket bucket with the key my-key. It is then retrieved using the same bucket/key combo.
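The same PUT and GET can be sketched from Ruby using only the standard library. The helper names (object_uri, build_put, build_get, perform) are made up for this illustration, and the host and port are assumed to match the curl example. The requests are built separately from being sent, so the script runs without a Riak node; perform is only needed once one is listening.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the URI for a bucket/key pair (host and port are assumptions).
def object_uri(bucket, key, host: 'localhost', port: 8098)
  URI("http://#{host}:#{port}/riak/#{bucket}/#{key}")
end

# Build, but do not send, a PUT that stores a Ruby hash as JSON.
def build_put(bucket, key, data)
  request = Net::HTTP::Put.new(object_uri(bucket, key))
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(data)
  request
end

# Build, but do not send, the matching GET.
def build_get(bucket, key)
  Net::HTTP::Get.new(object_uri(bucket, key))
end

# Send a request to the node named in its URI and return the response.
def perform(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

put = build_put('my-bucket', 'my-key', { 'living_in' => 'the future!' })
puts "#{put.method} #{put.path} #{put.body}"

# With a node running locally, the round trip would be:
#   perform(put)
#   puts perform(build_get('my-bucket', 'my-key')).body
```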
Binary Data
The simplicity of the data model at Riak's core allows for very simple saving of other file types. Here is an example of storing an image of a horse in the images bucket with the key horse.jpg:
$ curl -XPUT "http://localhost:8098/riak/images/horse.jpg" \
    --header "Content-Type: image/jpeg" \
    --data-binary @/home/user/horse_pic.jpg
$ curl -XGET "http://localhost:8098/riak/images/horse.jpg" > /home/user/new_horse.jpg
$ md5sum /home/user/horse_pic.jpg
3f9bdc9366aec1f98839f47717ada5bf  /home/user/horse_pic.jpg
$ md5sum /home/user/new_horse.jpg
3f9bdc9366aec1f98839f47717ada5bf  /home/user/new_horse.jpg
After storing the binary data and retrieving it, we see that the MD5 checksums match, which indicates that the image was stored and retrieved properly.
It's distributed, right?
The Ring
The primary way in which Riak handles distributed clustering is a mechanism called 'The Ring'. Riak splits its data into a series of partitions (the default is 64). Data is hashed using the Bucket/Key combination in a repeatable way. This hash effectively becomes the unique identifier for the data and determines which partition it lands in.
The partitions are applied evenly across all nodes of the cluster as a ring. For example, if there were four nodes, Node 1 would be assigned partitions 1, 5, 9 ... and so forth.
Data is also saved across redundant nodes.
image source: http://docs.basho.com/shared/1.4.2/images/riak-ring.png
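The ring idea can be sketched in a few lines of Ruby. This is a simplification under stated assumptions, not Riak's actual code: it hashes the plain string "bucket/key" with SHA-1 (Riak hashes its own internal encoding of the pair), it hands out partitions round-robin as described above, and it assumes replicas live on the partitions that follow the key's own partition around the ring. Partitions and nodes are numbered from 0 here, so the "Node 1 gets 1, 5, 9" example becomes "node 0 gets 0, 4, 8". All function names are hypothetical.

```ruby
require 'digest/sha1'

RING_TOP = 2**160 # SHA-1 produces 160-bit values

# Map a bucket/key pair to a partition index (0-based).
def partition_for(bucket, key, ring_size:)
  hash = Digest::SHA1.hexdigest("#{bucket}/#{key}").to_i(16)
  hash / (RING_TOP / ring_size)
end

# Round-robin ownership: partition p belongs to node p % node_count.
def owner_of(partition, node_count:)
  partition % node_count
end

# The n partitions holding copies of a key: its own partition plus the
# ones that follow it, wrapping around the end of the ring.
def replica_partitions(partition, n:, ring_size:)
  (0...n).map { |offset| (partition + offset) % ring_size }
end

# A hypothetical 32-partition ring spread over four nodes, three copies per key.
partition = partition_for('my-bucket', 'my-key', ring_size: 32)
replicas = replica_partitions(partition, n: 3, ring_size: 32)
nodes = replicas.map { |p| owner_of(p, node_count: 4) }
puts "partition #{partition}, replicas #{replicas.inspect}, nodes #{nodes.inspect}"
```

Because neighbouring partitions belong to different nodes under round-robin assignment, the copies of a key end up on distinct machines as long as n does not exceed the node count.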
Where does it fit on the CAP spectrum?
The CAP theorem is the idea that a distributed system can provide at most two of the following three guarantees at the same time:
- Consistency
- Availability
- Partition Tolerance
"If you have a system that can get a network partition you have a choice: do you want to be consistent or do you want to be available?"
Martin Fowler
Generally any sort of distributed database system will optimize for either CP or AP. Riak does not make this decision for you.
Tuning for CAP
There are several properties which can be applied to Riak system wide, bucket wide, or on a per-value basis. This allows the behavior of the Riak cluster to be tuned closely to the type of data being stored in it.
- n: Number of nodes on which a copy of a given key must be stored
- r: Number of nodes which must agree on a value for a read to be considered valid
- w: Number of nodes that must confirm receipt for a write to be considered valid
There are also two advanced options: dw and rw.
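A quick way to reason about these three numbers is the usual quorum overlap rule: if r + w is greater than n, every set of nodes that answers a read shares at least one node with every set that acknowledged a write. The Ruby sketch below only does that arithmetic; the function names are hypothetical, and passing the check does not by itself make a real cluster strictly consistent.

```ruby
# Smallest majority of n replicas.
def quorum(n)
  n / 2 + 1
end

# True when any r readers must overlap with any w writers out of n replicas.
def overlapping_quorums?(n:, r:, w:)
  unless r.between?(1, n) && w.between?(1, n)
    raise ArgumentError, 'r and w must be between 1 and n'
  end

  r + w > n
end

n = 3
puts overlapping_quorums?(n: n, r: quorum(n), w: quorum(n)) # reads and writes both use a majority
puts overlapping_quorums?(n: n, r: 1, w: 1)                 # fast, but a read may miss the latest write
```

Low r and w values trade that overlap away in exchange for answering even when most replicas are unreachable.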
Optimize for Consistency and Availability
{
n_val: "<all>",
r_val: "1",
w_val: "<all>"
}
Optimize for Consistency and Partition Tolerance
{
n_val: "<high value>",
r_val: "<n_val quorum+>",
w_val: "<n_val quorum+>"
}
Optimize for Availability and Partition Tolerance
{
n_val: "<high value>",
r_val: "<low value>",
w_val: "<low value>"
}
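The profiles above are shorthand rather than literal configuration. As a hedged sketch of applying one at the bucket level, the Ruby below builds a PUT to the bucket URL with a "props" JSON body, which is how Riak's HTTP interface accepts bucket properties. I am assuming that the r_val and w_val labels above correspond to the "r" and "w" properties, and the concrete numbers are made up for a bucket with three replicas. The helper name is hypothetical and the request is built without being sent.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build, but do not send, a request that sets n_val, r and w on a bucket.
def bucket_props_request(bucket, n_val:, r:, w:, host: 'localhost', port: 8098)
  request = Net::HTTP::Put.new(URI("http://#{host}:#{port}/riak/#{bucket}"))
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ 'props' => { 'n_val' => n_val, 'r' => r, 'w' => w } })
  request
end

# Hypothetical values: lean towards consistency with majority reads and writes...
consistent = bucket_props_request('my-bucket', n_val: 3, r: 2, w: 2)
# ...or towards availability with single-node reads and writes.
available = bucket_props_request('my-bucket', n_val: 3, r: 1, w: 1)

puts "#{consistent.method} #{consistent.path} #{consistent.body}"
puts "#{available.method} #{available.path} #{available.body}"
```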
A few more samples
Ruby Client
Since I use a lot of Ruby the first thing I checked out was a Ruby client.
There is a client gem which can be installed via gem install riak-client.
The project source is here. It does not seem as active as I'd like. Here is a sample anyway:
require 'rubygems'
require 'riak'
# Create a client for localhost
@client = Riak::Client.new
# Specify a bucket
@bucket = @client.bucket('sample-bucket')
# Build an object
@object = @bucket.get_or_new('sample-key')
@object.content_type = 'application/json'
@object.data = '{"sample":"data"}'
@object.store
# Access the object
@bucket.get('sample-key')
#=> #<Riak::RObject {sample-bucket,sample-key} [#<Riak::RContent [application/json]:"{\"sample\":\"data\"}">]>
Simple Map Reduce
The second feature I wanted to investigate was MapReduce.
Chains of functions (written in JavaScript or Erlang) are sent in and distributed amongst the nodes. The nodes run the query against their own keys and values, and the results are gathered and returned to the client.
For a general MapReduce overview take a look at the Wikipedia article.
Start by seeding some simple sample data as follows. Note that the bucket is users, the key is the user's email, and the data is simply some basic information.
$ curl -XPUT "http://localhost:8098/riak/users/btables@gmail.com" \
    --header "Content-Type: application/json" \
    --data '{
      "first_name":"Bobby",
      "last_name":"Tables"
    }'
# do this a few more times ...
A Map or Reduce query can pull this data out in a new format:
$ curl -XPOST "http://localhost:8098/mapred" \
    --header "Content-Type: application/json" \
    --data '{
      "inputs":"users",
      "query":[{
        "map":{
          "language":"javascript",
          "source":"function(record) { var record_data = JSON.parse(record.values[0].data); return [[record.key, record_data.first_name, record_data.last_name]]; }"
        }
      }]
    }'
# Returns:
#
# [ ["gabe@gabekoss.com", "Gabe", "Koss" ],
# ["sam@stelfox.net", "Sam", "Stelfox"],
# ["btables@gmail.com", "Bobby", "Tables" ] ]
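To see what the map phase is doing without a cluster, here is a small local simulation in Ruby. It is only a sketch: the record helper, the USERS constant, and run_map are hypothetical stand-ins, the records imitate the shape the JavaScript function receives (a key plus values[0].data holding the stored JSON), and everything runs in one process rather than being distributed across nodes.

```ruby
require 'json'

# Stand-in for a stored object, shaped like the record a map function receives.
def record(key, data)
  { 'key' => key, 'values' => [{ 'data' => JSON.generate(data) }] }
end

USERS = [
  record('gabe@gabekoss.com', { 'first_name' => 'Gabe', 'last_name' => 'Koss' }),
  record('sam@stelfox.net', { 'first_name' => 'Sam', 'last_name' => 'Stelfox' }),
  record('btables@gmail.com', { 'first_name' => 'Bobby', 'last_name' => 'Tables' })
].freeze

# Ruby equivalent of the JavaScript map function: one list of results per record.
EXTRACT_NAMES = lambda do |rec|
  data = JSON.parse(rec['values'][0]['data'])
  [[rec['key'], data['first_name'], data['last_name']]]
end

# Apply the map function to every record and concatenate the per-record lists.
def run_map(records, map_fn)
  records.flat_map { |rec| map_fn.call(rec) }
end

run_map(USERS, EXTRACT_NAMES).each { |row| puts row.inspect }
```

The map function returns a list for each record and those lists are concatenated, which is why the JavaScript version wraps its row in an extra pair of brackets.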
Summary
Overall we ended up not using Riak. At least not yet. In general I think it is a strong system when it meets your needs.
Use Riak for...
- Clustered operations
- Straight key-value gets & puts
- Sibling resolution / link walking
Don't use Riak for...
- Key listing / Bucket enumeration (MyRiakObject.all() == :bad)
- Mass deletion from certain backends
- Large map-reduce in real-time operations
Additional Resources
- Basho Technologies: Makers of Riak
- Sean Cribbs: Riak evangelist, overall smart dude
- Little Riak Book: Tells you way more than I just did in a very succinct fashion
- Learn You Some Erlang for Great Good
- Sam Stelfox: My friend and partner in crime for this presentation
- Slides from the Talk: PDF from our presentation
- Install Riak on Ubuntu