Created at:

Modified at:

Jim Tcl UTF-8

Introduction

Jim Tcl

The idea of this document is to investigate how Jim Tcl handles UTF-8 strings internally.

In order to make Jim Tcl work with UTF-8 support, one needs to compile it with --utf8 flag. When it happens, operations like string index and string length expect to receive UTF-8 strings as parameters. When compiling with UTF-8 support, one can use functions like string bytelength to return the number of bytes in the string.

Strings operations in Jim Tcl

Object representation in Jim Tcl

Let's introduce the Jim_Obj type. This is a large struct that holds the struct for Jim Tcl variables.

We are mostly interested in the bytes and length members:

File jim.h, starting at line 284:

    char *bytes; /* string representation buffer. NULL = no string repr. */

File jim.h, starting at line 287:

    int length; /* number of bytes in 'bytes', not including the null term. */

Besides bytes and length, a Jim_Obj have a internalRep union (to hold the value in the most appropriate format, like an integer or a double). This union have the strValue member:

File jim.h, starting at line 330:

        struct {
            int maxLength;
            int charLength;     /* utf-8 char length. -1 if unknown */
        } strValue;

Implementing the [string] command

Jim Tcl [string] command is implemented in the Jim_StringCoreCommand() function:

File jim.c, starting at line 14085:

/* [string] */
static int Jim_StringCoreCommand(Jim_Interp *interp, int argc, Jim_Obj *const *argv)
{
    int len;
    int opt_case = 1;
    int option;
    static const char * const options[] = {
        "bytelength", "length", "compare", "match", "equal", "is", "byterange", "range", "replace",
        "map", "repeat", "reverse", "index", "first", "last", "cat",
        "trim", "trimleft", "trimright", "tolower", "toupper", "totitle", NULL
    };
    enum
    {
        OPT_BYTELENGTH, OPT_LENGTH, OPT_COMPARE, OPT_MATCH, OPT_EQUAL, OPT_IS, OPT_BYTERANGE, OPT_RANGE, OPT_REPLACE,
        OPT_MAP, OPT_REPEAT, OPT_REVERSE, OPT_INDEX, OPT_FIRST, OPT_LAST, OPT_CAT,
        OPT_TRIM, OPT_TRIMLEFT, OPT_TRIMRIGHT, OPT_TOLOWER, OPT_TOUPPER, OPT_TOTITLE
    };

Note that options[] array holds possible subcommands like bytelength, length and index, that we'll study here.

Some lines later, we have a call to the Jim_GetEnum() function that process argv[1] and stores result in option variable. Later in this document we examine Jim_GetEnum() function.

Some lines later, if subcommand is length, it calls the Jim_Utf8Length() that we are mostly interested:

File jim.c, starting at line 14118:

        case OPT_LENGTH:
        case OPT_BYTELENGTH:
            if (argc != 3) {
                Jim_WrongNumArgs(interp, 2, argv, "string");
                return JIM_ERR;
            }
            if (option == OPT_LENGTH) {
                len = Jim_Utf8Length(interp, argv[2]);
            }
            else {
                len = Jim_Length(argv[2]);
            }

Implementing [string length]

So, [string length] is implemented using Jim_Utf8Length(). We are interested in the UTF-8 length calculation. The part of the function that does that follows: as follows:

File jim.c, starting at line 2428:

    SetStringFromAny(interp, objPtr);

    if (objPtr->internalRep.strValue.charLength < 0) {
        objPtr->internalRep.strValue.charLength = utf8_strlen(objPtr->bytes, objPtr->length);
    }
    return objPtr->internalRep.strValue.charLength;

An interesting thing: it seems it caches string length so it doesn't need to always calculate it (it is an "expensive" O(n) operation). This is stored in the charLength variable member. If charLength is less than zero, we call the utf8_strlen() function, passing both bytes and length parameter that are the string representation and its length.

So, let's take a look at the utf8_strlen() function:

File utf8.c, starting at line 63:

int utf8_strlen(const char *str, int bytelen)
{
    int charlen = 0;
    if (bytelen < 0) {
        bytelen = strlen(str);
    }
    while (bytelen > 0) {
        int c;
        int l = utf8_tounicode(str, &c);
        charlen++;
        str += l;
        bytelen -= l;
    }
    return charlen;
}

So, if bytelen is less than zero, we discover that calling a simple strlen(). We then traverse the string calling utf8_tounicode().

utf8_tounicode() is interesting: it returns the length, in bytes, for the first character in str. It also sets the second parameter to the Unicode code point representation. We are not going to look at the implementation of utf8_tounicode(), but it consists of code to convert UTF-8 character to Unicode code point.

Code point

So, after discovering the character length in bytes (that is stored in l), it then increments the charlen variable (because it is only one UTF-8 "character"), increments the str pointer so we can look at the next character and decrement bytelen decrementing l bytes from it. Repeat while bytelen > 0.

So, back to Jim_Utf8Length(), it either returns charLength or calculates it with utf8_strlen() that traverses UTF-8 characters with the help of the utf8_tounicode() function.

Implementing [string index]

We now know how Jim Tcl calculates the length of a UTF-8 string. What about how it retrieves a single character at index idx position? We cannot just retrieve byte at that position, since we are dealing with multi-byte UTF-8 characters.

So, let's see how Jim Tcl does that, by taking a look at the relevant lines in Jim_StringCoreCommand() function:

File jim.c, starting at line 14341:

  if (Jim_GetIndex(interp, argv[3], &idx) != JIM_OK) {
      return JIM_ERR;
  }

If I understand correctly, Jim_GetIndex() transforms an object stored in argv[3] and saves it as an integer in idx. Let's show an example: string index "café é bom" 7 returns "b". Internally, Jim Tcl pass this list to the Jim_StringCoreCommand() as an argv array, as is:

argv[0]
string
argv[1]
index
argv[2]
café é bom
argv[3]
7

But argv[3] is not an integer, but a Jim_Obj*, so we need to convert it to an integer with Jim_GetIndex(). At the end, we now have the integer 7 stored in our idx variable.

So, next lines are:

File jim.c, starting at line 14344:

  str = Jim_String(argv[2]);
  len = Jim_Utf8Length(interp, argv[2]);
  idx = JimRelToAbsIndex(len, idx);

The first line returns objPtr->bytes if it is defined or creates it from the internal representation if it is not defined. It seems useful specially if the object is not a direct string and is something else like an integer, a double or a list that don't have a string representation by default. Then, it calculates the string length. We are not going into detail about JimRelToAbsIndex() because, for our case, it doesn't change idx value, so let's pretend this line doesn't exist.

Now, we have some checks and, a little below, this:

File jim.c, starting at line 14356:

  int i = utf8_index(str, idx);

So, the heart of the implementation is the utf8_index() function. Let's take a look at its implementation:

File utf8.c, starting at line 92:

int utf8_index(const char *str, int index)
{
    const char *s = str;
    while (index--) {
        s += utf8_charlen(*s);
    }
    return s - str;
}

It is rather simple. It justs increments pointer s with the byte length of the current character, index times. At the end, pointer s points to the character at the desired position. Note that the function returns an integer, that is the offset between the start of the string and the position of the requested character. The algorithm used to retrieve a UTF-8 character at a given position, then, is O(n).

The utf8_charlen() function is simple: it just looks at the first byte of the character, that is enough information to know the its byte length:

File utf8.c, starting at line 45:

int utf8_charlen(int c)
{
    if ((c & 0x80) == 0) {
        return 1;
    }
    if ((c & 0xe0) == 0xc0) {
        return 2;
    }
    if ((c & 0xf0) == 0xe0) {
        return 3;
    }
    if ((c & 0xf8) == 0xf0) {
        return 4;
    }
    /* Invalid sequence, so treat it as a single byte */
    return 1;
}

The line editing library "linenoise" and UTF-8 handling

String buffer representation (stringbuf)

Before diving into the implementation, let's take a look at common data structures. linenoise.c defines a stringbuf data structure:

File linenoise.c, starting at line 21:

/**
 * The stringbuf structure should not be accessed directly.
 * Use the functions below.
 */
typedef struct {
	int remaining;	/**< Allocated, but unused space */
	int last;		/**< Index of the null terminator (and thus the length of the string) */
#ifdef USE_UTF8
	int chars;		/**< Count of characters */
#endif
	char *data;		/**< Allocated memory containing the string or NULL for empty */
} stringbuf;

Interestingly, it has a separated counter for the UTF-8 string (chars) that is different from the actual buffer site (last).

In other to create and handle stringbuf structures, one need to use sb_* functions. For instance:

File linenoise.c, starting at line 64:

static inline int sb_chars(stringbuf *sb) {
#ifdef USE_UTF8
	return sb->chars;
#else
	return sb->last;
#endif
}

Let's take a look on how sb_append_* functions work. Since we are now dealing with UTF-8, we need to see carefully how it calculates the number of total characters characters:

File linenoise.c, starting at line 176:

void sb_append(stringbuf *sb, const char *str)
{
	sb_append_len(sb, str, strlen(str));
}

void sb_append_len(stringbuf *sb, const char *str, int len)
{
	if (sb->remaining < len + 1) {
		sb_realloc(sb, sb->last + len + 1 + SB_INCREMENT);
	}
	memcpy(sb->data + sb->last, str, len);
	sb->data[sb->last + len] = 0;

	sb->last += len;
	sb->remaining -= len;
#ifdef USE_UTF8
	sb->chars += utf8_strlen(str, len);
#endif
}

sb_append() is straightforward: we just call sb_append_len(), passing same parameters and also the length calculated with traditional strlen(). sb_append_len() reallocs the buffer, if necessary, and copy data with memcpy(). It then sums the number of chars with the value returned by utf8_strlen(), which does a O(n) operation to find the number of chars in the string.

Current line representation (current)

linenoise.c defines a struct current to hold information for the line being edited. It is a large data struct and we are interested only in a few fields:

File linenoise.c, starting at line 475:

/* Structure to contain the status of the current (being edited) line */
struct current {
    stringbuf *buf; /* Current buffer. Always null terminated */
    int pos;    /* Cursor position, measured in chars */
    int cols;   /* Size of the window, in chars */

So, we have a stringbuf and other information about the line and the cursor.

How a UTF-8 character is read from the terminal

Line editing in Jim Tcl is implemented with the linenoise library that is shipped with Jim Tcl. It is a simple library that is also UTF-8 aware.

Linenoise library fork by steveb used in Jim Tcl

Original linenoise library from Antirez

First, Jim Tcl main() function is implemented in jimsh.c. It starts basic data structures (like the current interpreter) and calls Jim_InteractivePrompt() that is defined in jim-interactive.c, where other functions are also defined, like history callback functions. Jim_InteractivePrompt() calls Jim_HistoryGetLine() that returns a full allocated line. Jim_HistoryGetLine() calls linenoise() that calls linenoiseEdit(), that is where things get more interesting for us.

linenoiseEdit() has the following loop:

File linenoise.c, starting at line 1855:

    while(1) {
        int dir = -1;
        int c = fd_read(current);

It is just a main loop that calls fd_read() that returns one UTF-8 character at a time. Beforing continuing, let's peek at the fd_read():

File linenoise.c, starting at line 839:

/**
 * Reads a complete utf-8 character
 * and returns the unicode value, or -1 on error.
 */
static int fd_read(struct current *current)
{
#ifdef USE_UTF8
    char buf[MAX_UTF8_LEN];
    int n;
    int i;
    int c;

    if (read(current->fd, &buf[0], 1) != 1) {
        return -1;
    }
    n = utf8_charlen(buf[0]);
    if (n < 1) {
        return -1;
    }
    for (i = 1; i < n; i++) {
        if (read(current->fd, &buf[i], 1) != 1) {
            return -1;
        }
    }
    /* decode and return the character */
    utf8_tounicode(buf, &c);
    return c;
#else
    return fd_read_char(current->fd, -1);
#endif
}

It reads one byte with read(). The first character is enough to know the byte length of the UTF-8 character, which is done with utf8_charlen(). It then calls read() to complete the buffer with the remaining UTF-8 character bytes.

What is important here is: it actually doesn't return the buffer with actual UTF-8 bytes, but the Unicode codepoint! It calls utf8_tounicode() to find the Unicode code point. Later, it calls utf8_fromunicode() to get back the character in UTF-8 representation and store it on the buffer. For me, it seems unecessary to convert it to Unicode codepoints. Maybe someone can help me understanding it?

How the cursor is moved with arrow keys

When user press left or right arrow key on the keyboard, the cursor have to move left or right. It seems simple, but remember we are dealing with UTF-8! So we need to calculate the lenght of the character at the left or right side of the cursor in order to move it. Let's see how linenoise (and Jim Tcl) does that.

Function linenoiseEdit() has a big switch that hold possible keybindings that you can use from linenoise. Two cases are:

File linenoise.c, starting at line 1978:

        case ctrl('B'):
        case SPECIAL_LEFT:
            if (current->pos > 0) {
                current->pos--;
                refreshLine(current);
            }
            break;
        case ctrl('F'):
        case SPECIAL_RIGHT:
            if (current->pos < sb_chars(current->buf)) {
                current->pos++;
                refreshLine(current);
            }
            break;

We first check if we are in boundaries and then perform a simple decrement or increment operation. Remember the pos variable is measured in chars (i.e., UTF-8) chars. It then calls refreshLine() that calls refreshLineAlt(), that performs some complex operations.

The Jim_GetEnum() function

TODO