Introduction
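You can use the “codecs” module in Python to read or write text in various encodings. Unicode is a character set, and “UTF-8” is one of several encodings that turn Unicode text into bytes. In Python 2, the “codecs” module is the usual way to read or write such text; in Python 3, the built-in open() function also accepts an encoding argument.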
The “codecs” module provides functions for encoding and decoding text in various encodings. The most common encoding is probably “UTF-8”. You can also use other encodings, such as “Latin-1”, “Windows-1252”, etc.
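As a rough illustration of the encoding and decoding functions mentioned above, the sketch below encodes the same one-character string with three encodings and then decodes it again. The variable names are made up for this example; codecs.encode() and codecs.decode() are the module-level helpers.

```python
import codecs

text = "\u00e9"  # the character é (U+00E9)

# The same character becomes different bytes under different encodings.
utf8_bytes = codecs.encode(text, "utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = codecs.encode(text, "latin-1")  # one byte: 0xE9
cp1252_bytes = codecs.encode(text, "cp1252")   # one byte: 0xE9

# Decoding with the encoding that produced the bytes recovers the text.
decoded = codecs.decode(utf8_bytes, "utf-8")

# Decoding with a different encoding raises no error here, but it
# silently produces the wrong characters ("Ã©" instead of "é").
mojibake = codecs.decode(utf8_bytes, "latin-1")
```

The last line is why the encoding used for reading has to match the one used for writing.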
To read or write Unicode text, you first need to open a file in the proper encoding. For example, to read UTF-8 text, you would do something like this:
import codecs
f = codecs.open("filename.txt", encoding="utf-8")
text = f.read()
f.close()
To write Unicode text to a file, you would do something like this:
f = codecs.open("filename.txt", mode="w", encoding="utf-8")
f.write(text)
f.close()
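In Python 3, the same write and read steps can be sketched with the built-in open(). The snippet below is only an illustration: the sample text and the temporary directory are assumptions made so that it runs without touching real files.

```python
import os
import tempfile

text = "caf\u00e9 stra\u00dfe"  # sample text with non-ASCII characters

# A temporary directory keeps the sketch away from real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.txt")

    # Write: the text is encoded to UTF-8 bytes on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read: the bytes are decoded back to text on the way in.
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

    # Binary mode shows the raw bytes that actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()
```

Using a with block also closes the file automatically, so no explicit f.close() is needed.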
Basic steps
Unicode strings are, conceptually, sequences of code points (the exact in-memory representation is an implementation detail; older CPython versions used Py_UNICODE arrays). When a Unicode string is written to a file, it is first encoded using a particular encoding scheme. The most common encodings are UTF-8 and UTF-16.
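To make the difference between code points and encoded bytes concrete, here is a small Python 3 sketch. The sample string is arbitrary and the variable names are only for illustration.

```python
s = "h\u00e9llo"  # "héllo": five code points

# ord() gives the code point number of each character.
code_points = [ord(ch) for ch in s]

# The number of bytes depends on the encoding scheme.
utf8_bytes = s.encode("utf-8")         # é takes two bytes, the rest one each
utf16_bytes = s.encode("utf-16-le")    # two bytes per character here
utf16_with_bom = s.encode("utf-16")    # same, plus a two-byte byte order mark

lengths = (len(s), len(utf8_bytes), len(utf16_bytes), len(utf16_with_bom))
```

For this string, lengths comes out as (5, 6, 10, 12): the string length counts code points, not bytes.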
Python 2 has two types of strings: str for binary data and unicode for text data (Python 3 calls these bytes and str). When you write a unicode string to a file opened with the built-in open() in Python 2, you first need to encode it into a byte string. The most common encoding is UTF-8. For example, in Python 2:
f = open('test.txt', 'w')
s = u'\u00e9' # unicode string
f.write(s) # error! Python 2 tries to encode the unicode string as ASCII
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
To fix this, you need to encode the unicode string into a byte string first:
f = open('test.txt', 'w')
s = u'\u00e9' # unicode string
b = s.encode('utf-8') # byte string encoded using utf-8 encoding scheme
f.write(b)
f.close()
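The same encode-first idea carries over to Python 3 when a file is opened in binary mode. The sketch below assumes Python 3 and uses a temporary directory; there, writing a str to a binary file fails with a TypeError rather than a UnicodeEncodeError.

```python
import os
import tempfile

s = "\u00e9"           # text string (str in Python 3)
b = s.encode("utf-8")  # byte string encoded using UTF-8

error_message = None

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    with open(path, "wb") as f:
        try:
            f.write(s)  # a binary file accepts bytes, not str
        except TypeError as exc:
            error_message = str(exc)
        f.write(b)      # the encoded bytes are accepted

    with open(path, "rb") as f:
        stored = f.read()
```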
Code example
import codecs

def write_unicode_string(filename, ustring):
    # Open the file for writing; codecs.open() encodes the text as UTF-8.
    with codecs.open(filename, 'w', 'utf-8') as f:
        f.write(ustring)
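A possible way to exercise this function is to pair it with a reader and check the round trip. In the sketch below, read_unicode_string is a hypothetical helper that is not part of the original code, and the function is restated with io.open() instead of codecs.open(), an assumption made here so that the same code runs on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

def write_unicode_string(filename, ustring):
    # Restated with io.open(); the text is encoded as UTF-8 on write.
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(ustring)

def read_unicode_string(filename):
    # Hypothetical counterpart: decode the file contents from UTF-8.
    with io.open(filename, "r", encoding="utf-8") as f:
        return f.read()

# Round trip through a temporary file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
original = u"\u00e9 and \u00df"
write_unicode_string(path, original)
result = read_unicode_string(path)

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```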
Conclusion
Unicode is an industry standard for representing text in most of the world’s writing systems. Python 3 uses the codec system to deal with Unicode, and in this post we’ll look at some of the common problems you may encounter when trying to write Unicode strings to a file.
One common issue is the encoding of special characters such as é or ß. In Python 2, you can write these strings to a file using the correct codec:
# -*- coding: utf-8 -*-
import codecs
with codecs.open('output.txt', 'w', 'utf-8') as f:
    f.write(u'é')
In Python 3, the same code works too. However, if you use the built-in open() without an encoding argument on a system whose locale encoding is ASCII, writing this character will raise an error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
The reason for this error is that, when no encoding is given, open() in Python 3 typically falls back to the locale's preferred encoding, which may be ASCII and so cannot represent all Unicode characters. To fix this, you need to explicitly set the encoding when you open the file:
import io
with io.open('output.txt', 'w', encoding='utf-8') as f: # note io.open() instead of codecs.open()!
    f.write(u'é')
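The following Python 3 sketch reproduces the failure deterministically by asking for ASCII explicitly, which stands in for a system whose locale encoding is ASCII, and then applies the fix. The temporary directory and variable names are assumptions for illustration only.

```python
import locale
import os
import tempfile

# What open() usually falls back to when no encoding is passed
# (the value depends on the system and on Python's settings).
fallback_encoding = locale.getpreferredencoding(False)

ascii_error = None

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.txt")

    # Simulate an ASCII default: é cannot be encoded, so this fails.
    try:
        with open(path, "w", encoding="ascii") as f:
            f.write("\u00e9")
    except UnicodeEncodeError as exc:
        ascii_error = str(exc)

    # The fix: name the encoding explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\u00e9")

    with open(path, "rb") as f:
        raw = f.read()
```

Passing the encoding explicitly makes the result independent of whatever fallback_encoding happens to be on a given machine.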
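Alternatively, some Python 2 code uses sys.setdefaultencoding() to set the default encoding globally. This is widely discouraged: the function is removed from the sys module at startup (hence the reload(sys) call below), it does not exist in Python 3, and the setting does not persist, so the call must run every time the script starts:
import sys
reload(sys) # Python 2 only: brings back setdefaultencoding(), which is deleted at startup
sys.setdefaultencoding('utf8') # affects the current process only
with open('output2.txt', 'w') as f: # no “encoding” kwarg; relies on the changed default
    f.write(u'\u00e9')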