Python write Unicode string to file


Introduction


You can use the "codecs" module in Python to read or write text in various encodings. Unicode text has to be stored on disk in some concrete encoding, most commonly UTF-8, and the "codecs" module lets you name that encoding explicitly when you open a file.

The "codecs" module provides functions for encoding and decoding text in various encodings. The most common encoding is UTF-8, but you can also use others such as Latin-1 or Windows-1252.
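As a quick illustration, the module also exposes encode() and decode() functions that convert between text and bytes in any of these encodings. This is a minimal sketch; the sample character ß is arbitrary:

import codecs

s = u'\u00df'                          # the character ß
data = codecs.encode(s, 'latin-1')     # b'\xdf' (one byte in Latin-1)
print(codecs.decode(data, 'latin-1'))  # ß
print(codecs.encode(s, 'utf-8'))       # b'\xc3\x9f' (two bytes in UTF-8)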

To read or write Unicode text, you first need to open the file with the proper encoding. For example, to read UTF-8 text, you would do something like this:

import codecs

f = codecs.open('filename.txt', encoding='utf-8')
text = f.read()
f.close()
To write Unicode text to a file, you would do something like this:

f = codecs.open('filename.txt', mode='w', encoding='utf-8')
f.write(text)
f.close()
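Putting the two snippets together, a minimal round trip might look like this (the filename and sample text are just placeholders):

import codecs

text = u'caf\u00e9'  # sample Unicode text containing é

# write the string with an explicit UTF-8 encoding...
with codecs.open('filename.txt', mode='w', encoding='utf-8') as f:
    f.write(text)

# ...then read it back with the same encoding
with codecs.open('filename.txt', encoding='utf-8') as f:
    assert f.read() == text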

Basic steps


Unicode strings are stored internally as sequences of code points. When a Unicode string is written to a file, it must first be encoded into bytes using a particular encoding scheme. The most common encodings are UTF-8 and UTF-16.
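You can see the encoding step in isolation by calling encode() on a string: the same code point becomes different byte sequences under different schemes. A minimal sketch, with an arbitrary sample character:

s = u'\u00e9'                 # the single code point é
print(s.encode('utf-8'))      # b'\xc3\xa9' (two bytes)
print(s.encode('utf-16'))     # e.g. b'\xff\xfe\xe9\x00' (BOM plus two bytes; byte order depends on the platform)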

Python 3.x has two string types: str for text data and bytes for binary data (in Python 2.x the corresponding types were unicode and str). Text must be encoded into a bytes object before it can reach the disk, most commonly with UTF-8. In Python 2.x, writing a unicode string to a file opened with the plain built-in open() triggers an implicit encode with the default ASCII codec, which fails for non-ASCII characters. For example:

# Python 2.x
f = open('test.txt', 'w')
s = u'\u00e9'  # unicode string containing é
f.write(s)     # error! the string is implicitly encoded with the ASCII codec
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

To fix this, you need to encode the unicode string into a byte string first:

# Python 2.x
f = open('test.txt', 'w')
s = u'\u00e9'            # unicode string
b = s.encode('utf-8')    # byte string encoded with the UTF-8 encoding scheme
f.write(b)               # byte strings can be written to the file as-is
f.close()
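In Python 3 you usually do not need to encode by hand at all: open the file in text mode with an explicit encoding and write the str directly, and the encoding happens automatically. A minimal sketch, with a placeholder filename:

# Python 3.x
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('\u00e9')  # the str is encoded to UTF-8 automatically on write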

Code example

import codecs

def write_unicode_string(filename, ustring):
    with codecs.open(filename, 'w', 'utf-8') as f:
        f.write(ustring)
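You could then call the helper like this (the filename and sample text are arbitrary):

write_unicode_string('output.txt', u'na\u00efve caf\u00e9')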

Conclusion


Unicode is an industry standard for representing text in most of the world’s writing systems. Python 3 uses the codec system to deal with Unicode, and in this post we’ll look at some of the common problems you may encounter when trying to write Unicode strings to a file.

One common issue is the encoding of special characters such as é or ß. In Python 2, you can write these strings to a file using the correct codec:

import codecs

with codecs.open('output.txt', 'w', 'utf-8') as f:
    f.write(u'é')

In Python 3, however, codecs.open() is rarely needed, but relying on the plain built-in open() without an encoding argument can still fail: the default encoding comes from the locale, and on systems where that is ASCII, writing é will raise an error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

The reason for this error is that the locale's default encoding cannot represent all Unicode characters. To make the behaviour predictable, explicitly set the encoding when you open the file:

import io

with io.open('output.txt', 'w', encoding='utf-8') as f:  # note io.open() instead of codecs.open()!
    f.write(u'é')
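To confirm the file was written correctly, you can read it back with the same encoding (again using io.open, which behaves the same on Python 2 and 3):

import io

with io.open('output.txt', encoding='utf-8') as f:
    print(f.read())  # é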

You may also come across older advice recommending sys.setdefaultencoding('utf8') to change the default encoding globally. This only ever worked in Python 2 (and only after a reload(sys) hack), it does not exist at all in Python 3, and it can mask real encoding bugs, so it is best avoided. In Python 3 the reliable fix is simply to pass the encoding to the built-in open():

with open('output2.txt', 'w', encoding='utf-8') as f:
    f.write('é')

