Critique My Serialization API

qMopey

Level 6

« on: May 13, 2019, 12:53:08 AM »

Hi all. I wrote a serialization API in C++ for myself and wanted to ask for a critique from anyone interested. I like the JSON file format and wanted to use it originally. JSON is a good way to support versioning for your serialized stuff, since you can build in mechanisms to handle missing fields, and extra fields can be ignored.

I audited a few JSON options and this one by sheredom was the best, but I still found it a little lacking. Specifically I couldn't figure out the API. It's all... Weird. And the examples are terrible.

Since I couldn't find exactly what I wanted, I wrote my own. Here is my specific list of requirements.

No external dependencies (other than the C runtime).
JSON-like text output.
Can implement the writer/reader with the same code (instead of two different functions).
No annoying linked-lists in the API.
Supports arrays.
Base64 encoding built-in.
Shouldn't do anything special for UTF-8 (such strings can just be Base64 encoded, or sent as-is).
Can inspect a field's type before attempting to read it.
Arrays have length prepended.

Code: (Example of a Serialized Object)

{
    x = 5,
    y = 10.300000,
    str = "Hello.",
    sub_thing = {
        num0 = 7,
        num1 = 3,
    },
    blob_data = "U29tZSBibG9iIGlucHV0LgA=",
    array_of_ints = [8] {
        0, 1, 2, 3, 4, 5, 6, 7,
    },
    array_of_array_of_ints = [2] {
        [3] {
            0, 1, 2
        },
        [3] {
            0, 1, 2
        },
    },
    array_of_objects = [2] {
        {
            some_integer = 13,
            some_string = "Good bye.",
        },
        {
            some_integer = 4,
            some_string = "Oi!",
        },
    },
},
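
For anyone curious about the blob_data field above: it is ordinary Base64. Here is a minimal sketch of an encoder, assuming the standard alphabet with '=' padding. It is not the code from kv.h, and base64_encode is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of kv.h: standard Base64 with '=' padding.
std::string base64_encode(const void* data, size_t size)
{
    assert(data != nullptr || size == 0);
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const uint8_t* bytes = static_cast<const uint8_t*>(data);
    std::string out;
    out.reserve((size + 2) / 3 * 4);

    size_t i = 0;
    // Full groups: 3 input bytes become 4 output characters.
    for (; i + 3 <= size; i += 3) {
        uint32_t n = (uint32_t(bytes[i]) << 16) |
                     (uint32_t(bytes[i + 1]) << 8) |
                     uint32_t(bytes[i + 2]);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += table[n & 63];
    }

    // Tail: 1 or 2 leftover bytes, padded with '=' to a multiple of 4.
    size_t rest = size - i;
    if (rest == 1) {
        uint32_t n = uint32_t(bytes[i]) << 16;
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += "==";
    } else if (rest == 2) {
        uint32_t n = (uint32_t(bytes[i]) << 16) | (uint32_t(bytes[i + 1]) << 8);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
```

Calling base64_encode("Some blob input.", 17) gives "U29tZSBibG9iIGlucHV0LgA=", so the blob in the example is that 16 character string plus its terminating zero.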

Code: (The Serialization API, named KV (Key-Value), kv.h)

struct kv_t;

#define CUTE_KV_MODE_WRITE 1
#define CUTE_KV_MODE_READ  0

kv_t* kv_make(void* user_allocator_context = NULL);
void kv_destroy(kv_t* kv);
error_t kv_reset(kv_t* kv, const void* data, int size, int mode);
int kv_size_written(kv_t* kv);

enum kv_type_t
{
    KV_TYPE_NULL   = 0,
    KV_TYPE_INT64  = 1,
    KV_TYPE_DOUBLE = 2,
    KV_TYPE_STRING = 3,
    KV_TYPE_ARRAY  = 4,
    KV_TYPE_BLOB   = 5,
    KV_TYPE_OBJECT = 6,
};

error_t kv_key(kv_t* kv, const char* key, kv_type_t* type = NULL);

error_t kv_val(kv_t* kv, uint8_t* val);
error_t kv_val(kv_t* kv, uint16_t* val);
error_t kv_val(kv_t* kv, uint32_t* val);
error_t kv_val(kv_t* kv, uint64_t* val);

error_t kv_val(kv_t* kv, int8_t* val);
error_t kv_val(kv_t* kv, int16_t* val);
error_t kv_val(kv_t* kv, int32_t* val);
error_t kv_val(kv_t* kv, int64_t* val);

error_t kv_val(kv_t* kv, float* val);
error_t kv_val(kv_t* kv, double* val);

error_t kv_val_string(kv_t* kv, char** str, int* size);
error_t kv_val_blob(kv_t* kv, void* data, int* size, int capacity);

error_t kv_object_begin(kv_t* kv);
error_t kv_object_end(kv_t* kv);

error_t kv_array_begin(kv_t* kv, int* count);
error_t kv_array_end(kv_t* kv);

void kv_print(kv_t* kv);

The implementation is 948 lines of code - pretty small!

Here's what it generally looks like to use:

Code: (Example Use Case)

kv_t* kv = kv_make();
char buffer[1024];
kv_reset(kv, buffer, sizeof(buffer), CUTE_KV_MODE_WRITE);

thing_t thing;
thing.a = 5;
thing.b = 10.3f;
thing.str = "Hello.";
thing.str_len = 7; // 6 characters plus the terminating zero

// In write mode each kv_val* call copies the value into the buffer.
kv_object_begin(kv);
kv_key(kv, "a");
kv_val(kv, &thing.a);
kv_key(kv, "b");
kv_val(kv, &thing.b);
kv_key(kv, "str");
kv_val_string(kv, &thing.str, &thing.str_len);
kv_object_end(kv);

printf("%s", buffer);

Which would output:

Code:

{
    a = 5,
    b = 10.300000,
    str = "Hello."
}

Depending on whether the mode is set to read or write, the kv_* functions will either write to the buffer or read (parse) from it. Because every kv_val overload takes a pointer, the same call can either consume the value or fill it in, so the serialization routine only needs to be written once (most of the time).
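
To make the one-routine idea concrete, here is a rough sketch with a toy stand-in for kv_t. Everything named toy_* (and point_t) is hypothetical and only handles integer fields in a flat "key = value," layout; it is not how kv.h is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

enum toy_mode_t { TOY_MODE_READ = 0, TOY_MODE_WRITE = 1 };

// Toy stand-in for kv_t (hypothetical): integers only, no nesting.
struct toy_kv_t
{
    toy_mode_t mode = TOY_MODE_WRITE;
    std::string out;                         // text produced in write mode
    std::map<std::string, long long> fields; // parsed values for read mode
    std::string current_key;
};

// Read mode parses everything up front, one "key = value," pair at a time.
void toy_parse(toy_kv_t* kv, const std::string& text)
{
    std::istringstream in(text);
    std::string key, eq, value;
    while (in >> key >> eq >> value) {
        kv->fields[key] = std::stoll(value); // stoll stops at the trailing comma
    }
}

void toy_key(toy_kv_t* kv, const char* key)
{
    kv->current_key = key;
    if (kv->mode == TOY_MODE_WRITE) kv->out += std::string(key) + " = ";
}

// Same call for both directions: write mode reads *val, read mode fills *val.
bool toy_val(toy_kv_t* kv, int32_t* val)
{
    assert(val != nullptr);
    if (kv->mode == TOY_MODE_WRITE) {
        kv->out += std::to_string(*val) + ",\n";
        return true;
    }
    auto it = kv->fields.find(kv->current_key);
    if (it == kv->fields.end()) return false; // missing field: keep the default
    *val = static_cast<int32_t>(it->second);
    return true;
}

struct point_t { int32_t x = 0; int32_t y = 0; };

// Written once, used for both saving and loading.
void serialize_point(toy_kv_t* kv, point_t* p)
{
    toy_key(kv, "x"); toy_val(kv, &p->x);
    toy_key(kv, "y"); toy_val(kv, &p->y);
}
```

Running serialize_point in write mode and then feeding the resulting text to toy_parse on a read-mode instance round-trips the struct. A field that is missing from older data simply leaves the default value in place, which is the versioning behaviour mentioned at the top of the post.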

If anyone was brave enough to read through all this info, allow me to say thanks! I really appreciate it.



ThemsAllTook

Administrator
Level 10

Re: Critique My Serialization API

« Reply #1 on: May 13, 2019, 02:33:41 PM »

Neat. I made an API kind of like this a while ago. I also wasn't satisfied with the JSON parsers I found online, and decided to write my own. It's a pretty reasonable format - writing a compliant parser and generator doesn't take all that much code.

My API allows me to swap different instances of serializer and deserializer to read/write different formats with the same code. I briefly had the thought that I could take this a step further and make reading and writing work with the same code, but never got any further than thinking about it. This has inspired me to give it another shot, because I'm currently in a situation where I have some super complicated deserialization code, and the matching serialize function has fallen into disrepair. Seems like there might be some problems with this, though - writing is easy because everything is already organized nicely in memory, but reading involves allocating buffers, consistency checks, error reporting, etc. Maybe it's not so unreasonable for the serialization API to take care of all that stuff itself, though.

I guess I don't really have a critique, but thanks for sharing!



qMopey

Level 6

Re: Critique My Serialization API

« Reply #2 on: May 13, 2019, 03:00:50 PM »

Hi ThemsAllTook, thanks for posting!

I think your comment on multiple formats is a great idea. I'm also considering adding an initialization setting to choose between binary and text formats. I think text is quite good enough, but I'm very curious whether binary would be much of a win in terms of space requirements and parsing speed. My intuition says no... Memory is fairly cheap these days, so personally I do not expect to need binary formats to keep on-disk size down. Also, when applying patches and whatnot, diffs get really ugly with binary and are much easier to deal with in text. And with the current implementation I don't see a way binary would be particularly faster, except perhaps for having less whitespace to churn through, but I imagine this will be negligible for my goals.

To/From disk with the same function is a really good feature. I'm glad to hear you were thinking about this as well.

You're right, I ended up having two very different implementations under the hood for to-disk compared to from-disk. To-disk was nearly trivial. From-disk involved parsing the entire file into a DOM. That way when the user is looking up keys they can do so out-of-order, or in nearly whatever order they like. The `kv_reset` function actually fully parses the input. Then the rest of the `kv_key` and `kv_val` functions just poke into the internal data structures to fetch pre-parsed values.
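
As a rough illustration of the parse-up-front approach, here is a small sketch of a pre-parsed field list that can be queried in any order, reporting each field's type before the value is read. All names (dom_t, dom_add, dom_find, the toy_* types) and the token classification rules are assumptions made for the example, not the real internals.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum toy_type_t { TOY_TYPE_NULL, TOY_TYPE_INT64, TOY_TYPE_DOUBLE, TOY_TYPE_STRING };

struct toy_field_t
{
    std::string key;
    toy_type_t type;
    std::string text; // raw token, converted only when the value is requested
};

struct dom_t { std::vector<toy_field_t> fields; };

// Assumed classification rules: a leading quote means string, a '.' means
// double, anything else is treated as an integer.
toy_type_t toy_classify(const std::string& token)
{
    if (token.empty()) return TOY_TYPE_NULL;
    if (token.front() == '"') return TOY_TYPE_STRING;
    if (token.find('.') != std::string::npos) return TOY_TYPE_DOUBLE;
    return TOY_TYPE_INT64;
}

// Called by the (omitted) parser once per "key = token" pair.
void dom_add(dom_t* dom, const std::string& key, const std::string& token)
{
    toy_field_t f;
    f.key = key;
    f.type = toy_classify(token);
    f.text = token;
    dom->fields.push_back(f);
}

// Lookup works in any order because everything is already in memory.
// The optional type output mirrors the idea of inspecting before reading.
const toy_field_t* dom_find(const dom_t* dom, const char* key, toy_type_t* type = nullptr)
{
    assert(dom != nullptr && key != nullptr);
    for (const toy_field_t& f : dom->fields) {
        if (f.key == key) {
            if (type) *type = f.type;
            return &f;
        }
    }
    if (type) *type = TOY_TYPE_NULL;
    return nullptr;
}
```

A linear search keeps the sketch short; a hash table would be the obvious swap if objects ever get large.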

It was annoyingly complicated in terms of recursion. I ended up with a little piece like this at one point.

Code:

struct kv_string_t
{
    uint8_t* str = NULL;
    int len = 0;
};

union kv_union_t
{
    kv_union_t() {}

    int64_t ival;
    double dval;
    kv_string_t sval;
    kv_string_t bval;
    int object_index;
};

struct kv_val_t
{
    kv_type_t type = KV_TYPE_NULL;
    kv_union_t u;
    array<kv_val_t> aval;
};

struct kv_field_t
{
    kv_string_t key;
    kv_val_t val;
};

struct kv_object_t
{
    int parent_index = ~0;
    int parsing_array = 0;

    kv_string_t key;
    array<kv_field_t> fields;
};

So a value is a union, or an array of values. The recursive definition is a little weird, and it took me a good number of hours just thinking about how to get this to even compile.

I hate recursion, so this was really painful. Eventually I got some short parsing functions down. It's mostly an LL(1) recursive descent style parser, so there's a function for parsing an object, a value, a key, etc.
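
For readers who have not written one, here is a minimal sketch of that style of parser, limited to integers and the length-prefixed arrays from the example in the first post. The grammar is inferred from that example, all names are hypothetical, and std::vector stands in for the array<T> container (std::vector may be declared with an incomplete element type since C++17, which is what makes the recursive struct legal).

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// A value is either an integer or an array of values (the recursive part).
struct toy_val_t
{
    bool is_array = false;
    int64_t ival = 0;
    std::vector<toy_val_t> aval;
};

struct toy_parser_t
{
    const char* at = nullptr; // current position in a zero-terminated string
    bool ok = true;           // cleared on the first syntax error
};

void skip_space(toy_parser_t* p)
{
    while (std::isspace(static_cast<unsigned char>(*p->at))) ++p->at;
}

// Consume one expected character, or flag an error.
bool expect(toy_parser_t* p, char c)
{
    skip_space(p);
    if (*p->at != c) { p->ok = false; return false; }
    ++p->at;
    return true;
}

int64_t parse_int(toy_parser_t* p)
{
    skip_space(p);
    bool negative = (*p->at == '-');
    if (negative) ++p->at;
    if (!std::isdigit(static_cast<unsigned char>(*p->at))) { p->ok = false; return 0; }
    int64_t n = 0;
    while (std::isdigit(static_cast<unsigned char>(*p->at))) {
        n = n * 10 + (*p->at - '0');
        ++p->at;
    }
    return negative ? -n : n;
}

toy_val_t parse_value(toy_parser_t* p); // forward declaration for the recursion

// array := '[' count ']' '{' value (',' value)* ','? '}'
toy_val_t parse_array(toy_parser_t* p)
{
    toy_val_t v;
    v.is_array = true;
    expect(p, '[');
    int64_t count = parse_int(p);
    expect(p, ']');
    expect(p, '{');
    for (int64_t i = 0; p->ok && i < count; ++i) {
        v.aval.push_back(parse_value(p));
        skip_space(p);
        if (*p->at == ',') ++p->at; // separator, also allowed after the last element
    }
    expect(p, '}');
    return v;
}

// LL(1): a single character of lookahead picks the rule.
toy_val_t parse_value(toy_parser_t* p)
{
    skip_space(p);
    if (*p->at == '[') return parse_array(p);
    toy_val_t v;
    v.ival = parse_int(p);
    return v;
}

bool toy_parse(const char* text, toy_val_t* out)
{
    assert(text != nullptr && out != nullptr);
    toy_parser_t p;
    p.at = text;
    *out = parse_value(&p);
    return p.ok;
}
```

One thing the prepended length buys here: the parser knows how many elements to expect before it sees any of them, so a short array is caught as an error instead of being silently accepted.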

