
Compatibility of printf with utf-8 encoded strings

I'm trying to format some UTF-8 encoded strings (char *) in C code using the printf function, and I need to specify a length in the format. Everything works when the string argument contains no multibyte characters, but the result looks incorrect as soon as the data contains multibyte characters.

My glibc is rather old (2.17), so I also tried some online compilers, and the result is the same.

#include <stdio.h>   /* printf */
#include <stdlib.h>
#include <locale.h>  /* setlocale */

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );

    printf( "'%-4.4s'\n",   "elephant" );
    printf( "'%-4.4s'\n",   "éléphant" );
    printf( "'%-20.20s'\n", "éléphant" );

    return 0;
}

The output is:

'elep'
'él�'
'éléphant          '

The first line is correct (4 characters in the output).

The second line is obviously wrong, at least from a human point of view: only 2 complete characters are printed, followed by an invalid byte.

The last line is also wrong: only 18 Unicode characters are written instead of 20.

It seems that printf counts characters before UTF-8 decoding, that is, it counts bytes instead of Unicode characters.

Is that a bug in glibc or a well-documented limitation of printf?
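
If printf really does count bytes here, as the output above suggests, one possible workaround is to do the character counting by hand and pass a byte count to printf instead. The sketch below is only an illustration under a few assumptions: the input is valid UTF-8, one code point is treated as one character (which is not true for combining marks or double-width characters), and the helper names utf8_prefix_bytes and format_utf8_left are made up for this example, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: return the number of bytes occupied by the first
 * max_chars code points of s (assumed to be valid UTF-8), and store the
 * number of code points actually found in *chars_out. */
static size_t utf8_prefix_bytes(const char *s, size_t max_chars, size_t *chars_out)
{
    size_t bytes = 0, chars = 0;

    assert(s != NULL && chars_out != NULL);
    while (s[bytes] != '\0') {
        /* Any byte that is not of the form 10xxxxxx starts a new code point. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (chars == max_chars)
                break;          /* the next code point would exceed the limit */
            chars++;
        }
        bytes++;
    }
    *chars_out = chars;
    return bytes;
}

/* Hypothetical helper: emulate "%-W.Ws" with W counted in code points.
 * Writes into out (like snprintf) and returns the number of bytes that
 * the full result needs, excluding the terminating NUL. */
static int format_utf8_left(char *out, size_t out_size, const char *s, size_t width)
{
    size_t chars;
    size_t bytes = utf8_prefix_bytes(s, width, &chars);

    /* Precision is given in bytes; padding is added per missing code point. */
    return snprintf(out, out_size, "%.*s%*s",
                    (int)bytes, s, (int)(width - chars), "");
}
```

With these helpers, format_utf8_left(buf, sizeof buf, "éléphant", 4) would store "élép" (6 bytes, 4 characters) in buf, and a width of 20 would pad "éléphant" with 12 spaces rather than 10.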
