Compile-Time Strings

April 28, 2009 @ 23:19 | In CodeGems, Programming | 6 Comments

It would be nice if we had such a feature in the C language, wouldn’t it? The term ‘compile-time string’ refers here to strings that are converted to unique integer identifiers at compile time. At run-time those identifiers are plain integers that can be compared and hashed very quickly. Other languages offer a similar concept; Smalltalk, for example, has Symbol. This post describes a possible implementation of this feature in C/C++.

Imagine, for example, a generic object factory where object instances are created using unique identifiers. The classical solution is a header shared by all the code, where every identifier is declared in a C enumeration. This creates a serious physical dependency: adding a new identifier to the enumeration forces a recompilation of the whole project. It is also infeasible in modular architectures where modules are isolated from each other and a global header is not an option.

One viable solution may be using strings as identifiers. But strings are heavy objects, slow to compare and prone to typing errors, because a misspelled identifier is detected at run-time instead of compile-time. Other equally insufficient solutions include FourCC and esoteric template tricks for generating a hash at compile time (do not bother: this cannot be solved 100% with templates because string literals cannot be used as template arguments, and in any case hashing a string is not collision-free; more information in this usenet thread). Mick West proposes more solutions in his Practical Hash IDs article.

What follows is an implementation that has been working nicely for me and that satisfactorily fits the requirements for simulating compile-time strings. First let me show you two examples of the usage:

namespace
{
DECLARE_SYMBOL(CubeMesh)
DECLARE_SYMBOL(SphereMesh)
DECLARE_SYMBOL(DuckMesh)
}
 
void CollectNodes(Ptr<Node>& node)
{
    Ptr<Mesh> mesh0 = CreateObject(S(CubeMesh));
    node->Add(mesh0);
 
    Ptr<Mesh> mesh1 = CreateObject(S(SphereMesh));
    node->Add(mesh1);
 
    Ptr<Mesh> mesh2 = CreateObject(S(DuckMesh));
    node->Add(mesh2);
}

namespace
{
DECLARE_SYMBOL(FirstMessage)
DECLARE_SYMBOL(SecondMessage)
DECLARE_SYMBOL(ThirdMessage)
}
 
void ProcessMessage(const Message& msg)
{
    if (msg.id == S(FirstMessage))
    {
       /// ...
    }
    else if (msg.id == S(SecondMessage))
    {
       /// ...
    }
    else if (msg.id == S(ThirdMessage))
    {
       /// ...
    }
}

A symbol represents a compile-time string. Symbols must be declared before they are used. The macro that declares a symbol hides an inline function containing a static variable:

#define DECLARE_SYMBOL(id)\
    inline Symbol __GetSymbol##id() throw()\
    {\
        static size_t sym;\
        if (sym == 0)\
        {\
            sym = GetIdFromString(#id);\
        }\
        return Symbol(sym);\
    }

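The function GetIdFromString() hashes the string, stores it in an internal table and returns the table position for that string (the Symbol class is a simple wrapper around the identifier). This is done only the first time the symbol is requested; later requests return the cached static ID. This adds a little overhead compared with using plain integers as symbols. Beware of local static initialization with a run-time initializer: it was not guaranteed to be thread-safe before C++11. That is the reason for the manual comparison against 0, which relies on the static being zero-initialized and on 0 never being a valid ID (a symbol with ID 0 would be looked up again on every request). Two threads may still race into the lookup, so GetIdFromString() must be thread-safe, and must return the same ID for the same string, for this code to work.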
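GetIdFromString() itself is not shown in this post. What follows is only a minimal sketch of one possible shape, assuming C++11 (std::mutex and thread-safe function-local statics) and a hash map plus a vector as the internal table. SymbolTable, Table() and GetStringFromId() are hypothetical names introduced for illustration. IDs start at 1 so that 0 stays free for the "not yet resolved" check in DECLARE_SYMBOL.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

namespace
{
// Hypothetical storage for the symbol table (illustrative names only).
struct SymbolTable
{
    std::mutex lock;
    std::unordered_map<std::string, std::size_t> ids;  // string -> id
    std::vector<std::string> names;                    // id - 1 -> string
};

SymbolTable& Table()
{
    // Function-local static: its initialization is thread-safe since C++11.
    static SymbolTable table;
    return table;
}
}

// Returns the same non-zero id for equal strings, from any thread.
std::size_t GetIdFromString(const char* str)
{
    SymbolTable& t = Table();
    std::lock_guard<std::mutex> guard(t.lock);

    auto it = t.ids.find(str);
    if (it != t.ids.end())
    {
        return it->second;  // already registered
    }

    t.names.push_back(str);
    const std::size_t id = t.names.size();  // first id is 1, never 0
    t.ids.emplace(str, id);
    return id;
}

// Reverse lookup, useful when debugging. Returns a copy so the result
// stays valid even if the table grows afterwards.
std::string GetStringFromId(std::size_t id)
{
    SymbolTable& t = Table();
    std::lock_guard<std::mutex> guard(t.lock);
    assert(id >= 1 && id <= t.names.size());
    return t.names[id - 1];
}
```

The lock is only taken on the first request of each symbol per translation unit; after that DECLARE_SYMBOL returns its cached ID without calling into the table.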

The S macro simply invokes the local function previously generated:

#define S(id) __GetSymbol##id()
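To see the pieces working together, here is a small self-contained sketch. The Symbol class body, the Describe() function and the single-threaded GetIdFromString() stand-in are assumptions made for illustration, not the original code. The generated function is named GetSymbol_##id here because identifiers containing a double underscore are reserved in C++, and the throw() specification is dropped because it was removed in C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Stand-in for the real GetIdFromString(): NOT thread-safe, ids start at 1.
inline std::size_t GetIdFromString(const char* str)
{
    static std::map<std::string, std::size_t> ids;
    auto it = ids.find(str);
    if (it == ids.end())
    {
        it = ids.emplace(str, ids.size() + 1).first;
    }
    return it->second;
}

// Thin wrapper so that symbols cannot be mixed up with plain integers.
class Symbol
{
public:
    explicit Symbol(std::size_t id) : _id(id) {}
    std::size_t GetId() const { return _id; }
    bool operator==(const Symbol& other) const { return _id == other._id; }
    bool operator!=(const Symbol& other) const { return _id != other._id; }

private:
    std::size_t _id;
};

#define DECLARE_SYMBOL(id)\
    inline Symbol GetSymbol_##id()\
    {\
        static std::size_t sym;\
        if (sym == 0)\
        {\
            sym = GetIdFromString(#id);\
        }\
        return Symbol(sym);\
    }

#define S(id) GetSymbol_##id()

namespace
{
DECLARE_SYMBOL(FirstMessage)
DECLARE_SYMBOL(SecondMessage)
}

// Maps a message id to a label; each comparison is an integer comparison.
inline const char* Describe(Symbol id)
{
    if (id == S(FirstMessage)) return "first";
    if (id == S(SecondMessage)) return "second";
    return "unknown";
}
```

A misspelled symbol such as S(FristMessage) expands to a call to an undeclared function, so the typo is reported by the compiler instead of surfacing at run-time.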

And there you have it. Compile-time strings with negligible overhead (as long as you are doing anything more than simply comparing symbols). If you need 100% efficient code you could pre-generate a table with the symbols used by your project (searching for all DECLARE_SYMBOL blocks) and substitute each S() with a truly unique identifier generated at compile time. And that would be so easy if the preprocessor could be extended in a standard way…

Hope this makes sense. Thank you for reading.

 

  1. Practical Hash IDs
  2. FourCC
  3. Compile-time string hash generator



  1. Why didn’t you choose to do something like this?

    typedef size_t Symbol;

    #define DECLARE_SYMBOL(id)\
    static Symbol SYMBOL_##id = GetIdFromString(#id);

    #define S(id) SYMBOL_##id

    That way you don’t have to worry about making GetIdFromString thread-safe because it will only be called from the C runtime static initializer, which is guaranteed to be single threaded, and all the cost associated with it will be removed from the runtime. Is there any downside to this approach?



    Comment by hcpizzi
    April 29, 2009 @ 13:16 #

  2. Do you have a way to find the string associated with a Symbol (number), for instance when you are debugging?



    Comment by Gus
    May 2, 2009 @ 13:28 #

  3. Gus, I’ve implemented that using #defines for debug configurations in VS.

    My Symbol class has a string field and a constructor requiring the string, which are only visible when the _DEBUG symbol is defined:

    class Symbol
    {
    public:
    #if _DEBUG
        // Debug builds also keep the original string for inspection.
        // const char* is required because #id expands to a string literal.
        Symbol(size_t id, const char* str)
        {
            _id = id;
            _str = str;
        }

        std::string getStr()
        {
            return _str;
        }
    #else
        Symbol(size_t id) { _id = id; }
    #endif
        size_t getID() { return _id; }

    private:
    #if _DEBUG
        const char* _str;
    #endif
        size_t _id;
    };

    so you could use:

    return Symbol(sym, #id);

    I’ve just realized that a cleaner approach could be leaving that constructor and field available for all configurations, but saving the string only in debug, something like this:
    Symbol::Symbol(u32 sym, const char* str)
    {
        _sym = sym;
    #if _DEBUG
        _str = str;
    #else
        _str = 0;
    #endif
    }

    const char* Symbol::getStr()
    {
        if (!_str) return "";
        return _str;
    }



    Comment by Rickyah
    May 4, 2009 @ 9:57 #

  4. Hi pizzi,

    we try to avoid statics in our architecture. That way we keep the initialization order of the different subsystems under control (for example, the symbol system depends on the log system and the memory system).

    If you do not have that restriction, your solution seems to be right.



    Comment by ent
    May 4, 2009 @ 12:56 #

  5. Gus,

    yes, we have static functions that can be invoked from the watch window of the debugger:

    // These functions are to be used from the debugger Watch Window to 
    // convert symbols to strings and vice versa while debugging.
    // Use the following syntax in Microsoft Visual Studio:
    //
    //  {,,Debug.Core.Kernel.dll}NsSymbolToString(1146)
    //  {,,Debug.Core.Kernel.dll}NsStringToSymbol(L"DiskFileSystem")
    //@{
    extern "C" NS_CORE_KERNEL_API const NsChar* NsSymbolToString(NsSize id);
    extern "C" NS_CORE_KERNEL_API NsSize NsStringToSymbol(const NsChar* str);
    //@}


    Comment by ent
    May 4, 2009 @ 12:59 #

  6. Actually, the C standard specifies that “inline” does not force the function to be inlined. It is only a hint to the compiler that the function is a good candidate for inlining.



    Comment by Daniel "NeoStrider" Monteiro
    June 24, 2009 @ 4:32 #

