Home > DeveloperSection > Forums > Convert Unicode string to UTF-8, and then to JSON
Babe Zaharias
Babe Zaharias

Total Post:19

Posted on    June-20-2013 3:16 AM


 1 Reply(s)
 3912  View(s)
Rate this:
Hi Developers,

I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:

>>> unicode('©', 'utf-8').encode('utf-8')'\xc2\xa9'

Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and

U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)

Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:

>>> import json; json.dumps('\xc2\xa9')'"\\u00a9"'

Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.

TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?


Total Post:604

Posted on    June-20-2013 7:36 AM

Hi Babe,

If you really want "\u00c2\u00a9" as the output, give json an Unicode string as input.

>>> print json.dumps(u'\xc2\xa9')"\u00c2\u00a9"

You can generate this Unicode string from the raw bytes:

s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)

I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.

Don't want to miss updates? Please click the below button!

Follow MindStick