Unicode and Foreign Language Support
in Lexbe Litigation Support Software
Unicode support is important in litigation document management
because document repositories may include documents written with non-English
characters. These documents are typically encoded in a form of Unicode, and a
litigation database that does not support Unicode may not be able to display,
index or search foreign words. This document briefly describes Unicode in a
litigation support context and then describes foreign-language support through
Unicode in the Lexbe Online application.
Unicode
Unicode is a specification that allows text in any language to be
encoded in a consistent way. Detailed information on the Unicode
specification is maintained by the Unicode Consortium. Computers initially
used an encoding scheme called ASCII to represent letters, but ASCII
is English-centric and does not have enough characters to represent many
non-English characters. Unicode was developed to remedy this deficiency.
Unicode provides room for more than a million code points, of which well over
100,000 are assigned to characters. Several encodings of Unicode are in use
today, including UTF-8, UTF-16 and UTF-32. UTF-8 and UTF-16 are the most
widely used.
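As a rough illustration of the encodings named above, the following Python sketch shows that ASCII cannot store non-English characters and that the same text occupies a different number of bytes in UTF-8, UTF-16 and UTF-32. It is not part of Lexbe; the helper names encoded_sizes and fits_in_ascii and the sample strings are assumptions made for this example only.

```python
# Illustrative sketch only: compare how the same text is stored
# under ASCII and under the three common Unicode encodings.

def encoded_sizes(text):
    """Return the number of bytes the text occupies in each Unicode encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants are little-endian and add no byte-order mark,
        # so the counts reflect the characters alone.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

def fits_in_ascii(text):
    """Return True if every character in the text can be stored as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# Sample strings (hypothetical): plain English, accented Latin, Japanese.
samples = ["apple", "na\u00efve", "\u65e5\u672c\u8a9e"]

for sample in samples:
    print(sample, fits_in_ascii(sample), encoded_sizes(sample))
```

In this sketch the English word needs 5 bytes in UTF-8 but 10 in UTF-16 and 20 in UTF-32, while the three Japanese characters need 9 bytes in UTF-8, 6 in UTF-16 and 12 in UTF-32. Only the English word can be stored as ASCII.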
Language Packs
Languages that use Latin-based alphabets can be viewed and created in nearly any
computer application without any additional fonts. However, to type
or display foreign languages that use non-Latin or certain extended Latin-based
alphabets, users may first need to download and install additional foreign
language fonts on their local computer. You can tell that you need to download a
font if characters appear as small rectangles. Microsoft Office includes a
useful "Arial Unicode MS" font with broad coverage of the characters and
languages included in the Unicode standard.
Unicode Support in Lexbe
Lexbe is primarily an English-language tool, but it does partially support
Unicode as described below:
-
Lexbe supports the 8-bit (UTF-8) and 16-bit (UTF-16)
encodings of Unicode, but not the 32-bit (UTF-32) encoding.
-
Unicode support in Lexbe means that it can index and
search documents containing Unicode-encoded data. Lexbe can also display
much Unicode text in the Lexbe document browser, provided the necessary
fonts are installed on the client system.
-
Lexbe can automatically recognize Unicode data in
Microsoft Word, Excel and PowerPoint files.
-
An HTML or XML file can include Unicode data if the
file uses the UTF-8 encoding. Lexbe can index and search Unicode
data in UTF-8 encoded HTML files and can also recognize many other HTML
encodings.
-
WordPerfect files use the WordPerfect Character Set
to express non-English text. Lexbe Online converts WordPerfect Character Set
data to Unicode for indexing, so non-English text in WordPerfect files
is supported.
-
Lexbe can index and search Unicode characters in
some, but not all, PDF files, depending on how the PDF file was created.
-
Lexbe's concept-search functionality is supported for the
English language only.
-
Text in Chinese, Japanese, and Korean can be stored
in, or converted to, Unicode, so Lexbe Online can search for words in these
languages just as it can search for words in other languages. However,
while Lexbe can search for literal word matches (or wildcard or fuzzy
matches), there are some limitations on the support in Lexbe Online for
Chinese, Japanese, and Korean text, described below.
-
Some documents store text in a way that does not
separate the words with spaces. Instead, all of the text in a document
is run together and a language-specific dictionary is needed to find
word breaks. Lexbe does not have the ability to identify word breaks in
these documents.
-
In some languages such as Arabic, the surrounding
context for a word (my, your, the, a, masculine/feminine, etc.) can be
expressed as characters added in front of or behind the word. For
example, "the apple" or "my apple" would not be two words but would be
"apple" with different prefixes or suffixes attached. To search for text in
these languages, add a * before and after the word to pick
up most of the variants, like this: *apple*.
-
The above discussion provides examples only, and there may be limitations other than the ones
described above.
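The word-break and prefix/suffix limitations listed above can be illustrated with a small Python sketch. It is a generic illustration, not Lexbe's actual indexing or search code: the functions whitespace_tokens and wildcard_search are hypothetical, and the sample text (the Arabic word for "book" and a short Chinese sentence) is chosen for this example only.

```python
from fnmatch import fnmatchcase

def whitespace_tokens(text):
    """Split text into words on whitespace only (no dictionary lookup)."""
    return text.split()

def wildcard_search(pattern, tokens):
    """Return the tokens matching a pattern in which * stands for any characters."""
    return [token for token in tokens if fnmatchcase(token, pattern)]

# Arabic (hypothetical sample): "book", with "the" as a prefix and "my" as a suffix.
book = "\u0643\u062a\u0627\u0628"      # kitab, "book"
the_book = "\u0627\u0644" + book       # al-kitab, "the book"
my_book = book + "\u064a"              # kitabi, "my book"
arabic_tokens = whitespace_tokens(" ".join([the_book, my_book, book]))

# A literal search finds only the bare word; *word* also finds the variants.
exact_hits = wildcard_search(book, arabic_tokens)
wildcard_hits = wildcard_search("*" + book + "*", arabic_tokens)

# Chinese (hypothetical sample): "I eat apples", written without spaces.
sentence = "\u6211\u5403\u82f9\u679c"
apple = "\u82f9\u679c"
chinese_tokens = whitespace_tokens(sentence)

# Without word breaks the whole sentence is a single token, so a literal
# search for "apple" finds nothing, while a wildcard search still matches.
literal_hits = wildcard_search(apple, chinese_tokens)
substring_hits = wildcard_search("*" + apple + "*", chinese_tokens)

print(len(exact_hits), len(wildcard_hits))
print(len(chinese_tokens), len(literal_hits), len(substring_hits))
```

In this sketch the literal Arabic search returns one of the three words and the wildcard search returns all three. The Chinese sentence is treated as one unbroken token, so the word inside it is found only when wildcards are placed on both sides.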
If you have other questions, please contact customer
support at LexbeSupport.com.