Created at:
Modified at:
Jim Tcl UTF-8
Introduction
The idea of this document is to investigate how Jim Tcl handles UTF-8 strings internally.
In order to make Jim Tcl work with UTF-8 support, one needs to compile it with
--utf8
flag. When it happens, operations like string index
and string
length
expect to receive UTF-8 strings as parameters. When compiling with UTF-8
support, one can use functions like string bytelength
to return the number of
bytes in the string.
Strings operations in Jim Tcl
Object representation in Jim Tcl
Let's introduce the Jim_Obj
type. This is a large struct that holds the struct
for Jim Tcl variables.
We are mostly interested in the bytes
and length
members:
File jim.h
, starting at line 284:
char *bytes; /* string representation buffer. NULL = no string repr. */
File jim.h
, starting at line 287:
int length; /* number of bytes in 'bytes', not including the null term. */
Besides bytes
and length
, a Jim_Obj
have a internalRep
union (to hold
the value in the most appropriate format, like an integer or a double). This
union have the strValue
member:
File jim.h
, starting at line 330:
struct {
int maxLength;
int charLength; /* utf-8 char length. -1 if unknown */
} strValue;
Implementing the [string] command
Jim Tcl [string]
command is implemented in the Jim_StringCoreCommand()
function:
File jim.c
, starting at line 14085:
/* [string] */
static int Jim_StringCoreCommand(Jim_Interp *interp, int argc, Jim_Obj *const *argv)
{
int len;
int opt_case = 1;
int option;
static const char * const options[] = {
"bytelength", "length", "compare", "match", "equal", "is", "byterange", "range", "replace",
"map", "repeat", "reverse", "index", "first", "last", "cat",
"trim", "trimleft", "trimright", "tolower", "toupper", "totitle", NULL
};
enum
{
OPT_BYTELENGTH, OPT_LENGTH, OPT_COMPARE, OPT_MATCH, OPT_EQUAL, OPT_IS, OPT_BYTERANGE, OPT_RANGE, OPT_REPLACE,
OPT_MAP, OPT_REPEAT, OPT_REVERSE, OPT_INDEX, OPT_FIRST, OPT_LAST, OPT_CAT,
OPT_TRIM, OPT_TRIMLEFT, OPT_TRIMRIGHT, OPT_TOLOWER, OPT_TOUPPER, OPT_TOTITLE
};
Note that options[]
array holds possible subcommands like bytelength
,
length
and index
, that we'll study here.
Some lines later, we have a call to the Jim_GetEnum()
function that process
argv[1]
and stores result in option
variable. Later in this document we
examine Jim_GetEnum()
function.
Some lines later, if subcommand is length
, it calls the Jim_Utf8Length()
that we are mostly interested:
File jim.c
, starting at line 14118:
case OPT_LENGTH:
case OPT_BYTELENGTH:
if (argc != 3) {
Jim_WrongNumArgs(interp, 2, argv, "string");
return JIM_ERR;
}
if (option == OPT_LENGTH) {
len = Jim_Utf8Length(interp, argv[2]);
}
else {
len = Jim_Length(argv[2]);
}
Implementing [string length]
So, [string length]
is implemented using Jim_Utf8Length()
. We are interested in the UTF-8 length calculation. The part of the function that does that follows:
as follows:
File jim.c
, starting at line 2428:
SetStringFromAny(interp, objPtr);
if (objPtr->internalRep.strValue.charLength < 0) {
objPtr->internalRep.strValue.charLength = utf8_strlen(objPtr->bytes, objPtr->length);
}
return objPtr->internalRep.strValue.charLength;
An interesting thing: it seems it caches string length so it doesn't need to
always calculate it (it is an "expensive" O(n) operation). This is stored in the
charLength
variable member. If charLength
is less than zero, we call the
utf8_strlen()
function, passing both bytes
and length
parameter that are
the string representation and its length.
So, let's take a look at the utf8_strlen()
function:
File utf8.c
, starting at line 63:
int utf8_strlen(const char *str, int bytelen)
{
int charlen = 0;
if (bytelen < 0) {
bytelen = strlen(str);
}
while (bytelen > 0) {
int c;
int l = utf8_tounicode(str, &c);
charlen++;
str += l;
bytelen -= l;
}
return charlen;
}
So, if bytelen
is less than zero, we discover that calling a simple
strlen()
. We then traverse the string calling utf8_tounicode()
.
utf8_tounicode()
is interesting: it returns the length, in bytes, for the
first character in str
. It also sets the second parameter to the Unicode code
point representation. We are not going to look at the implementation of
utf8_tounicode()
, but it consists of code to convert UTF-8 character to
Unicode code point.
So, after discovering the character length in bytes (that is stored in l
), it
then increments the charlen
variable (because it is only one UTF-8
"character"), increments the str
pointer so we can look at the next character
and decrement bytelen
decrementing l
bytes from it. Repeat while bytelen >
0
.
So, back to Jim_Utf8Length()
, it either returns charLength
or calculates it
with utf8_strlen()
that traverses UTF-8 characters with the help of the
utf8_tounicode()
function.
Implementing [string index]
We now know how Jim Tcl calculates the length of a UTF-8 string. What about how
it retrieves a single character at index idx
position? We cannot just retrieve
byte at that position, since we are dealing with multi-byte UTF-8 characters.
So, let's see how Jim Tcl does that, by taking a look at the relevant lines in
Jim_StringCoreCommand()
function:
File jim.c
, starting at line 14341:
if (Jim_GetIndex(interp, argv[3], &idx) != JIM_OK) {
return JIM_ERR;
}
If I understand correctly, Jim_GetIndex()
transforms an object stored in
argv[3]
and saves it as an integer in idx
. Let's show an example: string
index "café é bom" 7
returns "b". Internally, Jim Tcl pass this list to the
Jim_StringCoreCommand()
as an argv
array, as is:
- argv[0]
-
string
- argv[1]
-
index
- argv[2]
-
café é bom
- argv[3]
-
7
But argv[3]
is not an integer, but a Jim_Obj*
, so we need to convert it to
an integer with Jim_GetIndex()
. At the end, we now have the integer 7
stored
in our idx
variable.
So, next lines are:
File jim.c
, starting at line 14344:
str = Jim_String(argv[2]);
len = Jim_Utf8Length(interp, argv[2]);
idx = JimRelToAbsIndex(len, idx);
The first line returns objPtr->bytes
if it is defined or creates it from the
internal representation if it is not defined. It seems useful specially if the
object is not a direct string and is something else like an integer, a double or
a list that don't have a string representation by default. Then, it calculates
the string length. We are not going into detail about JimRelToAbsIndex()
because, for our case, it doesn't change idx
value, so let's pretend this line
doesn't exist.
Now, we have some checks and, a little below, this:
File jim.c
, starting at line 14356:
int i = utf8_index(str, idx);
So, the heart of the implementation is the utf8_index()
function. Let's take
a look at its implementation:
File utf8.c
, starting at line 92:
int utf8_index(const char *str, int index)
{
const char *s = str;
while (index--) {
s += utf8_charlen(*s);
}
return s - str;
}
It is rather simple. It justs increments pointer s
with the byte length of the
current character, index
times. At the end, pointer s
points to the
character at the desired position. Note that the function returns an integer,
that is the offset between the start of the string and the position of the
requested character. The algorithm used to retrieve a UTF-8 character at a given
position, then, is O(n).
The utf8_charlen()
function is simple: it just looks at the first byte of the character, that is enough information to know the its byte length:
File utf8.c
, starting at line 45:
int utf8_charlen(int c)
{
if ((c & 0x80) == 0) {
return 1;
}
if ((c & 0xe0) == 0xc0) {
return 2;
}
if ((c & 0xf0) == 0xe0) {
return 3;
}
if ((c & 0xf8) == 0xf0) {
return 4;
}
/* Invalid sequence, so treat it as a single byte */
return 1;
}
The line editing library "linenoise" and UTF-8 handling
String buffer representation (stringbuf)
Before diving into the implementation, let's take a look at common data
structures. linenoise.c
defines a stringbuf
data structure:
File linenoise.c
, starting at line 21:
/**
* The stringbuf structure should not be accessed directly.
* Use the functions below.
*/
typedef struct {
int remaining; /**< Allocated, but unused space */
int last; /**< Index of the null terminator (and thus the length of the string) */
#ifdef USE_UTF8
int chars; /**< Count of characters */
#endif
char *data; /**< Allocated memory containing the string or NULL for empty */
} stringbuf;
Interestingly, it has a separated counter for the UTF-8 string (chars
) that is
different from the actual buffer site (last
).
In other to create and handle stringbuf
structures, one need to use sb_*
functions. For instance:
File linenoise.c
, starting at line 64:
static inline int sb_chars(stringbuf *sb) {
#ifdef USE_UTF8
return sb->chars;
#else
return sb->last;
#endif
}
Let's take a look on how sb_append_*
functions work. Since we are now dealing
with UTF-8, we need to see carefully how it calculates the number of total
characters characters:
File linenoise.c
, starting at line 176:
void sb_append(stringbuf *sb, const char *str)
{
sb_append_len(sb, str, strlen(str));
}
void sb_append_len(stringbuf *sb, const char *str, int len)
{
if (sb->remaining < len + 1) {
sb_realloc(sb, sb->last + len + 1 + SB_INCREMENT);
}
memcpy(sb->data + sb->last, str, len);
sb->data[sb->last + len] = 0;
sb->last += len;
sb->remaining -= len;
#ifdef USE_UTF8
sb->chars += utf8_strlen(str, len);
#endif
}
sb_append()
is straightforward: we just call sb_append_len()
, passing same
parameters and also the length calculated with traditional strlen()
.
sb_append_len()
reallocs the buffer, if necessary, and copy data with
memcpy()
. It then sums the number of chars with the value returned by
utf8_strlen()
, which does a O(n) operation to find the number of chars in the
string.
Current line representation (current)
linenoise.c
defines a struct current
to hold information for the line being
edited. It is a large data struct and we are interested only in a few fields:
File linenoise.c
, starting at line 475:
/* Structure to contain the status of the current (being edited) line */
struct current {
stringbuf *buf; /* Current buffer. Always null terminated */
int pos; /* Cursor position, measured in chars */
int cols; /* Size of the window, in chars */
So, we have a stringbuf
and other information about the line and the cursor.
How a UTF-8 character is read from the terminal
Line editing in Jim Tcl is implemented with the linenoise library that is shipped with Jim Tcl. It is a simple library that is also UTF-8 aware.
Linenoise library fork by steveb used in Jim Tcl
Original linenoise library from Antirez
First, Jim Tcl main()
function is implemented in jimsh.c
. It starts basic
data structures (like the current interpreter) and calls
Jim_InteractivePrompt()
that is defined in jim-interactive.c
, where other
functions are also defined, like history callback functions.
Jim_InteractivePrompt()
calls Jim_HistoryGetLine()
that returns a full
allocated line. Jim_HistoryGetLine()
calls linenoise()
that calls
linenoiseEdit()
, that is where things get more interesting for us.
linenoiseEdit()
has the following loop:
File linenoise.c
, starting at line 1855:
while(1) {
int dir = -1;
int c = fd_read(current);
It is just a main loop that calls fd_read()
that returns one UTF-8 character
at a time. Beforing continuing, let's peek at the fd_read()
:
File linenoise.c
, starting at line 839:
/**
* Reads a complete utf-8 character
* and returns the unicode value, or -1 on error.
*/
static int fd_read(struct current *current)
{
#ifdef USE_UTF8
char buf[MAX_UTF8_LEN];
int n;
int i;
int c;
if (read(current->fd, &buf[0], 1) != 1) {
return -1;
}
n = utf8_charlen(buf[0]);
if (n < 1) {
return -1;
}
for (i = 1; i < n; i++) {
if (read(current->fd, &buf[i], 1) != 1) {
return -1;
}
}
/* decode and return the character */
utf8_tounicode(buf, &c);
return c;
#else
return fd_read_char(current->fd, -1);
#endif
}
It reads one byte with read()
. The first character is enough to know the byte
length of the UTF-8 character, which is done with utf8_charlen()
. It then
calls read()
to complete the buffer with the remaining UTF-8 character bytes.
What is important here is: it actually doesn't return the buffer with actual
UTF-8 bytes, but the Unicode codepoint! It calls utf8_tounicode()
to find the
Unicode code point. Later, it calls utf8_fromunicode()
to get back the
character in UTF-8 representation and store it on the buffer. For me, it seems
unecessary to convert it to Unicode codepoints. Maybe someone can help me
understanding it?
How the cursor is moved with arrow keys
When user press left or right arrow key on the keyboard, the cursor have to move left or right. It seems simple, but remember we are dealing with UTF-8! So we need to calculate the lenght of the character at the left or right side of the cursor in order to move it. Let's see how linenoise (and Jim Tcl) does that.
Function linenoiseEdit()
has a big switch
that hold possible keybindings
that you can use from linenoise. Two cases are:
File linenoise.c
, starting at line 1978:
case ctrl('B'):
case SPECIAL_LEFT:
if (current->pos > 0) {
current->pos--;
refreshLine(current);
}
break;
case ctrl('F'):
case SPECIAL_RIGHT:
if (current->pos < sb_chars(current->buf)) {
current->pos++;
refreshLine(current);
}
break;
We first check if we are in boundaries and then perform a simple decrement or
increment operation. Remember the pos
variable is measured in chars (i.e.,
UTF-8) chars. It then calls refreshLine()
that calls refreshLineAlt()
, that
performs some complex operations.
The Jim_GetEnum() function
TODO