Nestle 1904 GNT - Character encoding

A computer cannot interpret text without first converting it into a binary format. Historically, this was done using ASCII, which mapped Latin characters to 7-bit codes. Various schemes existed for encoding characters in non-Latin languages, such as Greek. To standardize the encoding of these non-Latin characters, Unicode was introduced. This standardization aims to ensure full portability of text across computers with different operating systems. All Greek text in this Text-Fabric dataset has been encoded in Unicode.

In practice, however, this is not without complications, especially for languages such as Greek that use characters with polytonic accents. These complications can lead to inconsistent text representation and to errors in query results when working with the dataset.

The information on this page in particular pertains to the following base features:

lemma: Lexical lemma (cf. BDAG).
normalized: Normalized form of the surface text.
text: Word as it appears in the text.
unicode: Word in Unicode format.

The information is also relevant to the following add-on features:

bol_lemma: BibleOL (Bible Online Learner) based lexeme.
bol_lemma_dict: BibleOL based lexeme as it appears in the dictionary.
bol_surface: BibleOL based word as it appears in the text.
lemma_dict: Lexeme as it appears in the dictionary.

All used special characters

To view all the special characters used in this dataset, run the following command (e.g., in a Jupyter Notebook cell):

A.specialCharacters()

This command will return the following details:

Special characters in text-orig-full

· Α α ὰ ά ᾴ ἀ Ἀ ἂ ἄ Ἄ ᾄ ἆ Ἆ ἁ Ἁ ἃ Ἃ ἅ Ἅ ᾶ ᾷ ᾳ Β β Γ γ Δ δ Ε ε ὲ έ ἐ Ἐ ἔ Ἔ ἑ Ἑ ἓ Ἓ ἕ Ἕ Ζ ζ Η η ὴ
ή ῄ ἠ Ἠ ἢ Ἢ ἤ Ἤ ᾔ ἦ Ἦ ᾖ ᾐ ἡ ἣ ἥ Ἥ ἧ ᾗ ᾑ ῆ ῇ ῃ Θ θ Ι ι ὶ ί ϊ ῒ ΐ ἰ Ἰ ἴ Ἴ ἶ ἱ Ἱ ἳ ἵ Ἵ ἷ ῖ Κ κ Λ λ 
Μ μ Ν ν Ξ ξ Ο ο ὸ ό ὀ Ὀ ὂ ὄ Ὄ ὁ Ὁ ὃ Ὃ ὅ Ὅ Π π Ρ ρ ῥ Ῥ Σ ς σ Τ τ Υ υ ὺ ύ ϋ ῢ ΰ ὐ ὒ ὔ ὖ ὑ Ὑ ὓ ὕ Ὕ 
ὗ Ὗ ῦ Φ φ Χ χ Ψ ψ Ω ω ὼ ώ ῴ ὠ ὢ ὤ Ὤ ὦ Ὦ ᾠ ὡ Ὡ ὥ Ὥ ὧ Ὧ ᾧ ῶ ῷ ῳ — ’

These characters can be used directly to build queries without the need to look up Unicode codepoints.

There are some differences in Unicode encoding between the base features and certain add-on BibleOL features, particularly concerning homoglyphs: characters that look identical or very similar but have different Unicode values. These subtle differences can be difficult to spot on screen, but a small Python script can reveal them, as for the word θεὸς in Romans 1:19. The following code snippet prints the Unicode code point of each character in that word.

# item holds the Unicode-encoded string to inspect (example value shown here)
item = "θεός"
for char in item:
    print(f"Character: '{char}'\tUnicode Code Point: {ord(char)}")

The Python code prints the following for the base data:

Character: 'θ'	Unicode Code Point: 952
Character: 'ε'	Unicode Code Point: 949
Character: 'ό'	Unicode Code Point: 972
Character: 'ς'	Unicode Code Point: 962

The same code prints the following for the BibleOL-based feature; the first and third characters differ from the base data:

Character: 'Θ'	Unicode Code Point: 920
Character: 'ε'	Unicode Code Point: 949
Character: 'ό'	Unicode Code Point: 8057
Character: 'ς'	Unicode Code Point: 962
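To make such differences easier to spot, the two strings can be compared position by position and labelled with their official Unicode character names. The following is a minimal sketch using only Python's standard unicodedata module; the helper names describe and diff_chars are hypothetical, and the two strings are built from the code points listed above rather than read from the dataset.

```python
import unicodedata


def describe(word):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in word
    ]


def diff_chars(a, b):
    """Return (position, char_a, char_b) wherever two strings differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Strings rebuilt from the code points printed above (assumed example values).
base = "\u03b8\u03b5\u03cc\u03c2"     # 952, 949, 972, 962
bibleol = "\u0398\u03b5\u1f79\u03c2"  # 920, 949, 8057, 962

for position, x, y in diff_chars(base, bibleol):
    print(position, describe(x)[0], describe(y)[0])
```

The character names make the cause of the mismatch explicit: a capital versus a small theta, and an omicron with tonos versus an omicron with oxia.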

Background

The unicode, text, after, before, and punctuation features were encoded using polytonic accents over the vowels (oxia, varia, and perispomeni). For the vowel η, for instance, these are, respectively, ή (U+1F75), ὴ (U+1F74), and ῆ (U+1FC6). Since 1982, however, Modern Greek has replaced the polytonic accents with a single monotonic accent, the tonos ◌̍ (U+030D). Later, in 1986, the Greek government decreed that the tonos be represented as the acute accent ◌́ (U+0301). As a result, a character with tonos and the same character with an acute accent (oxia) cannot be told apart visually in ancient Greek text.

Unicode can represent an accented character in two ways: as a single precomposed code point, or as a decomposed sequence of a base character followed by a combining mark. For instance, ά (U+03AC, Greek small letter alpha with tonos) is canonically equivalent to the sequence α (U+03B1) followed by the acute accent ◌́ (U+0301), and also to ά (U+1F71, Greek small letter alpha with oxia). All of these should be rendered the same way.
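This equivalence can be checked with Python's standard unicodedata module. The sketch below is illustrative only and is not taken from the dataset's own tooling. It shows that the three spellings of ά differ as raw strings but collapse to a single form under NFC or NFD normalization.

```python
import unicodedata

tonos = "\u03ac"           # GREEK SMALL LETTER ALPHA WITH TONOS
oxia = "\u1f71"            # GREEK SMALL LETTER ALPHA WITH OXIA
sequence = "\u03b1\u0301"  # alpha followed by COMBINING ACUTE ACCENT

spellings = (tonos, oxia, sequence)

# As raw strings, no two of the three spellings are equal.
print(len(set(spellings)))

# After normalization, all three collapse to a single form.
nfc = {unicodedata.normalize("NFC", s) for s in spellings}
nfd = {unicodedata.normalize("NFD", s) for s in spellings}

print([f"U+{ord(ch):04X}" for form in nfc for ch in form])
print([f"U+{ord(ch):04X}" for form in nfd for ch in form])
```

Note that NFC yields the code point with tonos (U+03AC), not the one with oxia, so normalizing a search term changes which code points it contains.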

Transcription in this Text-Fabric dataset

Python, like most programming languages, compares strings code point by code point, so it distinguishes between characters based on their exact Unicode values, even when they look identical. This becomes particularly evident when executing queries in Text-Fabric: a user might copy a Greek word into a search, only to find no results because of subtle differences in character encoding. Therefore, we updated nine characters (ά, έ, ή, ί, ΐ, ό, ύ, ΰ, ώ) of the unicode, text, normalized, and lemma features using precomposed characters, as shown in the following table.

Character	Unicode character with tonos (acute accent)	Unicode character with oxia	Character name
ά	U+03AC	U+1F71	Small letter alpha with tonos
έ	U+03AD	U+1F73	Small letter epsilon with tonos
ή	U+03AE	U+1F75	Small letter eta with tonos
ί	U+03AF	U+1F77	Small letter iota with tonos
ΐ	U+0390	U+1FD3	Small letter iota with dialytika and tonos
ό	U+03CC	U+1F79	Small letter omicron with tonos
ύ	U+03CD	U+1F7B	Small letter upsilon with tonos
ΰ	U+03B0	U+1FE3	Small letter upsilon with dialytika and tonos
ώ	U+03CE	U+1F7D	Small letter omega with tonos
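The nine pairs can also be written as a translation table, so that a search term is converted to one encoding before it is used in a query. The following is a minimal sketch under stated assumptions: the helpers to_oxia and to_tonos are hypothetical and not part of Text-Fabric, and which direction is needed depends on how the feature being queried is actually encoded.

```python
import unicodedata

# Tonos code point -> oxia code point for the nine affected characters.
TONOS_TO_OXIA = {
    0x03AC: 0x1F71,  # alpha
    0x03AD: 0x1F73,  # epsilon
    0x03AE: 0x1F75,  # eta
    0x03AF: 0x1F77,  # iota
    0x0390: 0x1FD3,  # iota with dialytika
    0x03CC: 0x1F79,  # omicron
    0x03CD: 0x1F7B,  # upsilon
    0x03B0: 0x1FE3,  # upsilon with dialytika
    0x03CE: 0x1F7D,  # omega
}
OXIA_TO_TONOS = {oxia: tonos for tonos, oxia in TONOS_TO_OXIA.items()}


def to_oxia(text):
    """Replace each tonos code point with its oxia counterpart."""
    return text.translate(TONOS_TO_OXIA)


def to_tonos(text):
    """Replace each oxia code point with its tonos counterpart."""
    return text.translate(OXIA_TO_TONOS)


with_tonos = "\u03b8\u03b5\u03cc\u03c2"  # θεός, omicron with tonos
with_oxia = "\u03b8\u03b5\u1f79\u03c2"   # θεός, omicron with oxia

print(with_tonos == with_oxia)           # the raw strings differ
print(to_oxia(with_tonos) == with_oxia)  # equal after conversion
print(to_tonos(with_oxia) == with_tonos)

# List each pair with its official Unicode character names.
for tonos, oxia in TONOS_TO_OXIA.items():
    print(f"U+{tonos:04X} {unicodedata.name(chr(tonos))} -> "
          f"U+{oxia:04X} {unicodedata.name(chr(oxia))}")
```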
