Written in 2008 by Kragen Javier Sitaker. This paragraph added in 2016.
On IRC, Aristotle Pagaltzis was pondering how much performance variable-width encodings such as UTF-8 actually cost, because it's commonly suggested that fixed-width encodings such as ISO-8859-1 and UCS-4 are much faster.
He suggested:
Huh, it just occurs to me that strlen is not at all expensive on UTF-8-encoded strings. Not exactly as fast, but if you write it in asm, it only takes one extra instruction to count characters in UTF-8 vs those in an 8-bit encoding, per character. So, if you factor in cache misses, it should make no measurable difference. All you lose with a variable-width encoding is direct random access to arbitrary indices in the string, which is basically a non-use case.
It turned out that he was partly wrong, but mostly right.  And along
   the way, we discovered that GCC's standard implementation of strlen
   was quite pessimal.
I'm using Linux on a 700MHz Pentium III laptop with GCC 4.1.2, using
   just the -O flag unless otherwise specified.
First I thought about how to write strlen, and came up with this:
        .global my_strlen_s
my_strlen_s:
        push %esi                
        cld
        mov 8(%esp), %esi
        xor %ecx, %ecx
        ## repnz lodsb doesn't work because lodsb doesn't update ZF
loop:   lodsb
        test %al, %al
        loopnz loop
        mov %ecx, %eax
        not %eax
        pop %esi
        ret
For those who aren't well-versed in 80386 assembly language:
- lodsb reads a byte from memory at %esi, puts it in %al, and increments %esi;
- loopnz decrements %ecx each time through the loop, and jumps back to the label loop: if %ecx isn't zero and if the zero flag ZF isn't set (that's the "nz" part);
- test %al, %al sets the zero flag ZF if %al is zero (the C string terminator), among other things.
- after the loop body has run N times, %ecx is -N, so not %eax converts the negative number -N from %ecx into a positive number N-1.
- you have to push %esi because it's a callee-saves register. (To my surprise.)
But there's a scasb instruction which I should have used instead of lodsb; test.
The inner loop there is three instructions.  
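To make the counter arithmetic concrete, here is a rough C model of the assembly loop above. It is only a sketch: the name strlen_via_not is made up for illustration, and the variable ecx merely imitates the register as a 32-bit unsigned counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of my_strlen_s: the counter starts at 0 and is
   decremented once per byte read, including the terminating NUL. */
uint32_t strlen_via_not(const char *s) {
  assert(s != NULL);
  uint32_t ecx = 0;              /* xor %ecx, %ecx */
  unsigned char al;
  do {
    al = (unsigned char)*s++;    /* lodsb */
    ecx--;                       /* loopnz decrements before it tests */
  } while (al != 0 && ecx != 0); /* loop while ZF is clear and %ecx != 0 */
  return ~ecx;                   /* not %eax: ~(-N) == N - 1 */
}
```

For "hello, world" the loop reads 13 bytes (12 characters plus the NUL), so the counter ends at -13 and the bitwise NOT yields 12.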
 
Then I looked at how GCC does strlen.  It turns out it inlines it,
   even without any extra optimization flags.  Here's a sample call,
   albeit with optimization:
 80484d5:   bf 00 99 04 08          mov    $0x8049900,%edi
 80484da:   fc                      cld    
 80484db:   b9 ff ff ff ff          mov    $0xffffffff,%ecx
 80484e0:   b0 00                   mov    $0x0,%al
 80484e2:   f2 ae                   repnz scas %es:(%edi),%al
 80484e4:   f7 d1                   not    %ecx
 80484e6:   49                      dec    %ecx
 80484e7:   89 4c 24 04             mov    %ecx,0x4(%esp)
There the inner loop is just the repnz scas.  I'd forgotten about
   SCAS.
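The not/dec pair at the end of that inlined sequence can be modelled in C in the same way. This is a sketch under the assumption that repnz scasb checks %ecx, decrements it, and then compares one byte per step; strlen_via_scasb is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the inlined repnz scasb sequence. */
uint32_t strlen_via_scasb(const char *s) {
  assert(s != NULL);
  uint32_t ecx = 0xffffffffu;  /* mov $0xffffffff,%ecx */
  while (ecx != 0) {           /* repnz stops when %ecx reaches zero... */
    ecx--;
    if (*s++ == 0) break;      /* ...or when scasb finds the byte in %al (0) */
  }
  ecx = ~ecx;                  /* not %ecx: bytes scanned, NUL included */
  return ecx - 1;              /* dec %ecx: drop the NUL */
}
```

Starting the counter at -1 rather than 0 is why this version needs the extra dec: after scanning N bytes the counter is -1-N, NOT gives N, and the decrement removes the terminator from the count.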
Here's a reasonable strlen in C:
int my_strlen(char *s) {
  int i = 0;
  while (*s++) i++;
  return i;
}
This compiles to the following:
080483c4 <my_strlen>:
 80483c4:   55                      push   %ebp
 80483c5:   89 e5                   mov    %esp,%ebp
 80483c7:   8b 55 08                mov    0x8(%ebp),%edx
 80483ca:   b8 00 00 00 00          mov    $0x0,%eax
 80483cf:   80 3a 00                cmpb   $0x0,(%edx)
 80483d2:   74 0c                   je     80483e0 <my_strlen+0x1c>
 80483d4:   b8 00 00 00 00          mov    $0x0,%eax
 80483d9:   40                      inc    %eax
 80483da:   80 3c 10 00             cmpb   $0x0,(%eax,%edx,1)
 80483de:   75 f9                   jne    80483d9 <my_strlen+0x15>
 80483e0:   5d                      pop    %ebp
 80483e1:   c3                      ret
So here the inner loop is again three instructions: inc %eax; cmpb
$0, (%eax,%edx,1); jne.  It's been optimized down to while (s[i])
i++;.  The loop termination test is duplicated above the top of the
   loop for the empty-string case.
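Written back out as C, the shape of the compiled loop would look roughly like the following. This is a hand-written reconstruction from the disassembly, not compiler output, and my_strlen_rotated is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the loop structure in the disassembly. */
int my_strlen_rotated(const char *s) {
  assert(s != NULL);
  int i = 0;
  if (s[0] != 0) {        /* cmpb $0x0,(%edx); je: the empty-string check */
    do {
      i++;                /* inc %eax */
    } while (s[i] != 0);  /* cmpb $0x0,(%eax,%edx,1); jne */
  }
  return i;
}
```

Testing once before the loop lets the loop body end with its own conditional jump, so each iteration needs only one branch.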
So then I thought about how to do what Aristotle was suggesting. In UTF-8, bytes that start new characters begin either with binary 0 or binary 11; the second and subsequent bytes of multibyte characters have binary 10 as their high bits. So to count the characters, you just have to count the bytes that don't begin with binary 10.
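As a small illustration of that bit test, here is a sketch in C. The helper names is_utf8_continuation and utf8_continuation_bytes are hypothetical, and the code assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if the byte has binary 10 as its top two bits. */
int is_utf8_continuation(unsigned char byte) {
  return (byte & 0xc0) == 0x80;
}

/* Counts the bytes that do NOT start a new character. */
size_t utf8_continuation_bytes(const char *s) {
  assert(s != NULL);
  size_t n = 0;
  for (; *s != 0; s++) n += is_utf8_continuation((unsigned char)*s);
  return n;
}
```

In "naïve" the ï is encoded as the two bytes 0xc3 0xaf; only 0xaf is a continuation byte, so the string has 6 bytes but 5 characters.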
I tried this:
        .global my_strlen_utf8_s
my_strlen_utf8_s:
        push %esi
        cld
        mov 8(%esp), %esi
        xor %ecx, %ecx
loop2:  lodsb
        test $0x80, %al
        jz ascii                # format 0xxx xxxx
        test $0x40, %al
        jz loop2                # format 10xx xxxx: doesn't start new char
ascii:  test %al, %al
        loopnz loop2
        mov %ecx, %eax
        not %eax
        pop %esi
        ret
So here we jump back to loop2, rather than decrementing %ecx with
   loopnz, in the case where the byte starts with 10.  And we can skip
   the test %al, %al test, since 0000 0000 doesn't start with 10.
The inner loop of this version has 5 instructions, including two taken conditional branches, in the usual ASCII case, and 7 instructions for non-ASCII bytes, rather than 3 instructions per byte. That's only two extra instructions in the "usual" case, but if every instruction were one cycle, that would still be a 67% increase in run-time.
Counting instructions and adding up their cycle count isn't a very accurate way to measure performance in these superscalar days, though.
A C version looks like this:
int my_strlen_utf8_c(char *s) {
  int i = 0, j = 0;
  while (s[i]) {
    if ((s[i] & 0xc0) != 0x80) j++;
    i++;
  }
  return j;
}
GCC compiles it to this:
080483e2 <my_strlen_utf8_c>:
 80483e2:   55                      push   %ebp
 80483e3:   89 e5                   mov    %esp,%ebp
 80483e5:   8b 55 08                mov    0x8(%ebp),%edx
 80483e8:   0f b6 02                movzbl (%edx),%eax
 80483eb:   b9 00 00 00 00          mov    $0x0,%ecx
 80483f0:   84 c0                   test   %al,%al
 80483f2:   74 23                   je     8048417 <my_strlen_utf8_c+0x35>
 80483f4:   b9 00 00 00 00          mov    $0x0,%ecx
 80483f9:   0f be c0                movsbl %al,%eax
 80483fc:   25 c0 00 00 00          and    $0xc0,%eax
 8048401:   3d 80 00 00 00          cmp    $0x80,%eax
 8048406:   0f 95 c0                setne  %al
 8048409:   0f b6 c0                movzbl %al,%eax
 804840c:   01 c1                   add    %eax,%ecx
 804840e:   0f b6 42 01             movzbl 0x1(%edx),%eax
 8048412:   42                      inc    %edx
 8048413:   84 c0                   test   %al,%al
 8048415:   75 e2                   jne    80483f9 <my_strlen_utf8_c+0x17>
 8048417:   89 c8                   mov    %ecx,%eax
 8048419:   5d                      pop    %ebp
 804841a:   c3                      ret
An inner loop of 10 instructions --- but containing only a single
   conditional jump, the jne at the bottom.  It uses the and; cmp;
   setne; movzbl sequence to put either a 0 or a 1 into %eax,
   depending on whether the byte fetched began with 10, and adds the
   result into %ecx each time through the loop.
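The same idea can be written explicitly in C by adding the result of the comparison instead of branching on it. This is only a sketch with a made-up name, utf8_count_branchless; whether a given compiler actually emits branch-free code for it is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of my_strlen_utf8_c that adds the 0-or-1 result
   of the comparison, mirroring the setne; movzbl; add sequence. */
int utf8_count_branchless(const char *s) {
  assert(s != NULL);
  int count = 0;
  for (; *s != 0; s++) {
    /* In C, a comparison evaluates to the int 0 or 1. */
    count += ((unsigned char)*s & 0xc0) != 0x80;
  }
  return count;
}
```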
So after all this, I chatted with Aristotle some more, and it turned out he had a much cleverer trick up his sleeve than I had thought --- or, in fact, than he had thought. He wrote:
But wow, my code is much faster than any of the other variants. Unexpectedly.
Here's his version:
        .global ap_strlen_utf8_s
ap_strlen_utf8_s:
        push %esi
        cld
        mov 8(%esp), %esi
        xor %ecx, %ecx
loopa:  dec %ecx
loopb:  lodsb
        shl $1, %al
        js loopa
        jc loopb
        jnz loopa
        mov %ecx, %eax
        not %eax
        pop %esi
        ret
In this case, the inner loop is 6 instructions, but as few as 3 of
   them can execute.  I hadn't realized that you could get the top two
   bits of a byte into the carry and sign flags with a single shl
   instruction like that!  Aristotle explains:
Js catches all bytes of the form x1xxxxxx. Jc catches 1xxxxxxx, but because js came first, that can only have been 10xxxxxx; and jnz then catches all 00xxxxxx other than all-0. This runs about 3x as fast as your my_strlen_s --- most of the time, anyway.
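For readers who don't think in x86 flags, the three jumps can be imitated in C. This is a hypothetical model, not a replacement for the assembly: ap_strlen_utf8_model and the variables carry, sign and via_loopa are names invented here to stand in for CF, SF and the choice of jump target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of ap_strlen_utf8_s, one flag at a time. */
uint32_t ap_strlen_utf8_model(const char *s) {
  assert(s != NULL);
  uint32_t ecx = 0;                            /* xor %ecx, %ecx */
  int via_loopa = 1;                           /* execution first enters at loopa */
  for (;;) {
    if (via_loopa) ecx--;                      /* loopa: dec %ecx */
    unsigned char al = (unsigned char)*s++;    /* loopb: lodsb */
    int carry = (al & 0x80) != 0;              /* CF receives the old bit 7 */
    al = (unsigned char)(al << 1);             /* shl $1, %al */
    int sign = (al & 0x80) != 0;               /* SF is the old bit 6 */
    if (sign) { via_loopa = 1; continue; }     /* js loopa: x1xxxxxx */
    if (carry) { via_loopa = 0; continue; }    /* jc loopb: 10xxxxxx */
    if (al != 0) { via_loopa = 1; continue; }  /* jnz loopa: 00xxxxxx, not NUL */
    break;                                     /* the NUL terminator */
  }
  return ~ecx;                                 /* not %eax */
}
```

The counter is decremented once on entry and once more for every byte that starts a character, so it ends at -1 minus the character count, and the NOT recovers the count directly.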
So how do these different approaches fare?  I wrote a program that
   creates a 32MB string and timed the different functions on it, in
   seconds, using wall-clock time.  Here are the results from one run,
   sorted with sort -t: -k1 -k3 -ns.  The first few lines are various
   functions' return values on the given strings.
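A timing harness along those lines might look like the sketch below. It is not the program used for these measurements: make_filled_string, now_seconds and time_call are hypothetical names, gettimeofday is assumed as the wall-clock source, and the timed function is assumed to return size_t like the standard strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Allocates a NUL-terminated string holding size - 1 copies of fill. */
char *make_filled_string(size_t size, char fill) {
  assert(size > 0);
  char *s = malloc(size);
  if (s == NULL) return NULL;
  memset(s, fill, size - 1);
  s[size - 1] = '\0';
  return s;
}

/* Current wall-clock time in seconds. */
static double now_seconds(void) {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Calls fn(s) once, stores its return value, and returns elapsed seconds. */
double time_call(size_t (*fn)(const char *), const char *s, size_t *result) {
  double start = now_seconds();
  *result = fn(s);
  return now_seconds() - start;
}
```

Calling make_filled_string((size_t)32 << 20, 'a') gives a 32MB buffer whose string length is 33554431, which matches the return values in the results.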
"": 0 0 0 0 0 0
"hello, world": 12 12 12 12 12 12
"naïve": 6 6 6 5 5 5
"こんにちは": 15 15 15 5 5 5
1: all 'a':
1:                my_strlen(string) =   33554431: 0.227555
1:         ap_strlen_utf8_s(string) =   33554431: 0.299494
1:                   strlen(string) =   33554431: 0.314887
1:         my_strlen_utf8_c(string) =   33554431: 0.380355
1:              my_strlen_s(string) =   33554431: 0.432079
1:         my_strlen_utf8_s(string) =   33554431: 0.525443
2: all '\xe3':
2:                my_strlen(string) =   33554431: 0.224037
2:         ap_strlen_utf8_s(string) =   33554431: 0.299537
2:                   strlen(string) =   33554431: 0.311552
2:         my_strlen_utf8_c(string) =   33554431: 0.378162
2:              my_strlen_s(string) =   33554431: 0.436755
2:         my_strlen_utf8_s(string) =   33554431: 0.589165
3: all '\x81':
3:                my_strlen(string) =   33554431: 0.225011
3:         ap_strlen_utf8_s(string) =          0: 0.313525
3:                   strlen(string) =   33554431: 0.316182
3:         my_strlen_utf8_s(string) =          0: 0.322959
3:         my_strlen_utf8_c(string) =          0: 0.390958
3:              my_strlen_s(string) =   33554431: 0.432342
The 33554431 and 0 numbers are the return values; printing them ensures that
   GCC doesn't optimize out the strlen call.
So, on my CPU, the C version of strlen took about 28% less time than
   the built-in inlined one for this long string; it only uses two
   registers instead of the three used by the built-in inlined one (the
   one that uses repnz scasb); and they both seem to be about 12 bytes.
   I don't know why GCC inlines the worse one. Most likely it used to be
   faster than whatever GCC generated at the time and hasn't been
   revisited.
It's worth noting that while my C version of strlen was always
   faster than the built-in version, Aristotle's UTF-8 version was always
   in between.
On Aristotle's Core 2 Duo 1.8GHz (also with GCC 4.1.2 and -O), the
   difference was very much greater.  Here are his results:
"": 0 0 0 0 0 0
"hello, world": 12 12 12 12 12 12
"naïve": 6 6 6 5 5 5
"こんにちは": 15 15 15 5 5 5
1: all 'a':
1:                my_strlen(string) =   33554431: 0.025906
1:         ap_strlen_utf8_s(string) =   33554431: 0.039629
1:         my_strlen_utf8_c(string) =   33554431: 0.096041
1:                   strlen(string) =   33554431: 0.114821
1:              my_strlen_s(string) =   33554431: 0.116529
1:         my_strlen_utf8_s(string) =   33554431: 0.132648
2: all '\xe3':
2:                my_strlen(string) =   33554431: 0.025912
2:         ap_strlen_utf8_s(string) =   33554431: 0.039583
2:         my_strlen_utf8_c(string) =   33554431: 0.095699
2:                   strlen(string) =   33554431: 0.114452
2:              my_strlen_s(string) =   33554431: 0.114622
2:         my_strlen_utf8_s(string) =   33554431: 0.136109
3: all '\x81':
3:                my_strlen(string) =   33554431: 0.026112
3:         my_strlen_utf8_s(string) =          0: 0.039656
3:         ap_strlen_utf8_s(string) =          0: 0.039661
3:         my_strlen_utf8_c(string) =          0: 0.096416
3:              my_strlen_s(string) =   33554431: 0.115327
3:                   strlen(string) =   33554431: 0.116629
All of this code is online in two files:
GCC is better at writing x86 assembly than I am. No surprise there. Even when its inner loop is 10 instructions, it beats my three-instruction inner loops for speed.
Aristotle is better at writing x86 assembly than GCC is.
Aristotle was essentially correct: the penalty for counting UTF-8 characters, or indexing into or iterating over the characters of a UTF-8 string, is very small.
However, there is a speed penalty.  Although GCC's built-in
      strlen is much slower than Aristotle's function, a
      straightforward byte-counting C strlen compiled with optimization
      is faster still.
GCC should change to use the straightforward byte-counting C
      strlen instead of what it currently inlines.  The version of
      strlen that GCC inlines is worse than the one it compiled from C in
      every way: it's more instructions, more bytes of machine code,
      slower (more than four times slower on the Core 2 Duo, though only
      about 38% slower on the Pentium III), and uses more registers (one
      of which is a callee-saves register!).
People probably shouldn't worry about the efficiency of counting and iterating over characters in UTF-8 strings, at least not if they were using null-terminated C strings before.
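To illustrate the point about indexing, finding the nth character of a UTF-8 string is the same scan stopped early. The function below is a minimal sketch with a made-up name, utf8_index; it assumes valid UTF-8 and counts characters from zero.

```c
#include <assert.h>
#include <stddef.h>

/* Returns a pointer to the first byte of character number n (0-based),
   or NULL if the string has fewer than n + 1 characters. */
const char *utf8_index(const char *s, size_t n) {
  assert(s != NULL);
  for (; *s != 0; s++) {
    if (((unsigned char)*s & 0xc0) != 0x80) {  /* this byte starts a character */
      if (n == 0) return s;
      n--;
    }
  }
  return NULL;
}
```

This takes time proportional to the index rather than constant time, which is the random-access cost Aristotle mentioned, but each step is the same cheap test used for counting.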