N1904-TF

Text-Fabric dataset of the Greek New Testament, based on the Nestle 1904 (7th printing) edition.


Nestle 1904 GNT - Character encoding

A computer cannot interpret text without first converting it into a binary format. Historically, this was done using ASCII, which mapped Latin characters to 7-bit codes. Various schemes existed for encoding characters in non-Latin languages, such as Greek. To standardize the encoding of these non-Latin characters, Unicode was introduced. This standardization aims to ensure full portability of text across computers with different operating systems. All Greek text in this Text-Fabric dataset has been encoded in Unicode.

In practice, however, this is not without complications, especially for languages such as Greek that use characters with polytonic accents. These complications can cause inconsistencies in text representation and errors in query results when working with the dataset.

The information on this page in particular pertains to the following base features:

The information is also relevant to the following add-on features:

All used special characters

To view all the special characters used in this dataset, run the following command (e.g., in a Jupyter Notebook cell):

A.specialCharacters()

This command will return the following details:

Special characters in text-orig-full

· Α α ὰ ά ᾴ ἀ Ἀ ἂ ἄ Ἄ ᾄ ἆ Ἆ ἁ Ἁ ἃ Ἃ ἅ Ἅ ᾶ ᾷ ᾳ Β β Γ γ Δ δ Ε ε ὲ έ ἐ Ἐ ἔ Ἔ ἑ Ἑ ἓ Ἓ ἕ Ἕ Ζ ζ Η η ὴ
ή ῄ ἠ Ἠ ἢ Ἢ ἤ Ἤ ᾔ ἦ Ἦ ᾖ ᾐ ἡ ἣ ἥ Ἥ ἧ ᾗ ᾑ ῆ ῇ ῃ Θ θ Ι ι ὶ ί ϊ ῒ ΐ ἰ Ἰ ἴ Ἴ ἶ ἱ Ἱ ἳ ἵ Ἵ ἷ ῖ Κ κ Λ λ 
Μ μ Ν ν Ξ ξ Ο ο ὸ ό ὀ Ὀ ὂ ὄ Ὄ ὁ Ὁ ὃ Ὃ ὅ Ὅ Π π Ρ ρ ῥ Ῥ Σ ς σ Τ τ Υ υ ὺ ύ ϋ ῢ ΰ ὐ ὒ ὔ ὖ ὑ Ὑ ὓ ὕ Ὕ 
ὗ Ὗ ῦ Φ φ Χ χ Ψ ψ Ω ω ὼ ώ ῴ ὠ ὢ ὤ Ὤ ὦ Ὦ ᾠ ὡ Ὡ ὥ Ὥ ὧ Ὧ ᾧ ῶ ῷ ῳ — ’

These characters can be used directly to build queries without the need to look up Unicode codepoints.

Unicode related matters of concern

There are some differences in Unicode encoding between the base features and certain add-on BibleOL features, particularly concerning homoglyphs: characters that look identical or very similar but have different Unicode code points. These subtle differences can be difficult to spot on screen, but a small Python script can reveal them, for example in the word θεὸς in Romans 1:19. The following code snippet prints the Unicode code point of each character in that word.

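item = "\u03b8\u03b5\u03cc\u03c2"  # the Unicode string to inspect (here the base-data form listed below)
for char in item:
    # ord() returns the decimal Unicode code point of the character
    print(f"Character: '{char}'\tUnicode Code Point: {ord(char)}")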

The Python code prints the following for the base data:

Character: 'θ'	Unicode Code Point: 952
Character: 'ε'	Unicode Code Point: 949
Character: 'ό'	Unicode Code Point: 972
Character: 'ς'	Unicode Code Point: 962

The same code prints the following for the BibleOL-based feature. The first character (capital Θ, 920 instead of 952) and the third character (8057 instead of 972) differ from the base data:

Character: 'Θ'	Unicode Code Point: 920
Character: 'ε'	Unicode Code Point: 949
Character: 'ό'	Unicode Code Point: 8057
Character: 'ς'	Unicode Code Point: 962
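
The two outputs above can also be compared programmatically. The following is a minimal sketch, not part of the dataset tooling: the helper name diff_code_points is hypothetical, and both strings are written as explicit escape sequences that match the code points listed above. It uses Python's standard unicodedata module to report the official Unicode name at each differing position.

```python
import unicodedata

def diff_code_points(a, b):
    """Return (index, char_a, char_b) for each position where a and b differ.

    Hypothetical helper for illustration; assumes both strings have equal length.
    """
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Code points as printed above for the base data and the BibleOL-based feature
base = "\u03b8\u03b5\u03cc\u03c2"
bibleol = "\u0398\u03b5\u1f79\u03c2"

for i, x, y in diff_code_points(base, bibleol):
    print(f"{i}: U+{ord(x):04X} {unicodedata.name(x)} vs "
          f"U+{ord(y):04X} {unicodedata.name(y)}")
```

Printing the character names makes homoglyphs such as omicron with tonos and omicron with oxia easy to tell apart, even though they look the same on screen.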

See also the following notebook.

Background

The unicode, text, after, before, and punctuation features were encoded using polytonic accents over the vowels (oxia, varia, and perispomeni). For the vowel η, for instance, these are, respectively, ή (U+1F75), ὴ (U+1F74), and ῆ (U+1FC6). Since 1982, however, Modern Greek has used a single monotonic accent, the tonos ◌̍ (U+030D), in place of the polytonic accents. Later, in 1986, the Greek government decreed that the tonos be represented as the acute accent ◌́ (U+0301). As a result, a character with tonos and the same character with the acute accent (oxia) cannot be distinguished visually in Ancient Greek text.

Unicode has two ways of representing an accented character: decomposed and precomposed. For instance, ά can be represented in decomposed form as the base character α (U+03B1) followed by the combining acute accent ◌́ (U+0301), or as a single precomposed character: either U+03AC (Greek small letter alpha with tonos) or U+1F71 (Greek small letter alpha with oxia). All of these representations are canonically equivalent and should be rendered the same way.
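
The equivalence between these representations can be checked with Python's standard unicodedata module. The snippet below is a minimal sketch for illustration only; the variable names are arbitrary. It shows that the strings compare as unequal until they are brought into the same normalization form.

```python
import unicodedata

tonos = "\u03ac"             # GREEK SMALL LETTER ALPHA WITH TONOS
oxia = "\u1f71"              # GREEK SMALL LETTER ALPHA WITH OXIA
combining = "\u03b1\u0301"   # alpha followed by COMBINING ACUTE ACCENT

# A plain comparison treats these as three different strings
print(tonos == oxia, tonos == combining)

# After normalization to NFC, all three collapse to one code point
nfc = {unicodedata.normalize("NFC", s) for s in (tonos, oxia, combining)}
print([f"U+{ord(c):04X}" for s in nfc for c in s])

# NFD yields the fully decomposed sequence (base letter + combining accent)
nfd = unicodedata.normalize("NFD", oxia)
print([f"U+{ord(c):04X}" for c in nfd])
```

Note that NFC maps the oxia code point to the tonos code point, so normalizing a search term only helps if the text being searched is normalized in the same way.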

Transcription in this Text-Fabric dataset

Python, like any other programming language, compares strings by their exact Unicode code points, so two characters that look identical but have different code points are treated as different. This becomes particularly evident when executing queries in Text-Fabric. For example, a user might copy a Greek word into a search, only to find no results because of subtle differences in character encoding. Therefore, we updated nine characters (ά, έ, ή, ί, ΐ, ό, ύ, ΰ, ώ) of the unicode, text, normalized, and lemma features using precomposed characters, as shown in the following table.

| Character | Precomposed character with tonos (acute accent) | Precomposed character with oxia | Character name |
|---|---|---|---|
| ά | U+03AC | U+1F71 | Small letter alpha with tonos |
| έ | U+03AD | U+1F73 | Small letter epsilon with tonos |
| ή | U+03AE | U+1F75 | Small letter eta with tonos |
| ί | U+03AF | U+1F77 | Small letter iota with tonos |
| ΐ | U+0390 | U+1FD3 | Small letter iota with dialytika and tonos |
| ό | U+03CC | U+1F79 | Small letter omicron with tonos |
| ύ | U+03CD | U+1F7B | Small letter upsilon with tonos |
| ΰ | U+03B0 | U+1FE3 | Small letter upsilon with dialytika and tonos |
| ώ | U+03CE | U+1F7D | Small letter omega with tonos |
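
A correspondence of this kind can be applied to a search term with Python's built-in str.translate. The following is a minimal sketch under stated assumptions, not the procedure used to build the dataset: the names TONOS_TO_OXIA and to_oxia are hypothetical, the mapping direction (tonos to oxia) is an assumption, and the code points follow the official Unicode character names (for upsilon: U+03CD and U+1F7B; for upsilon with dialytika: U+03B0 and U+1FE3). Which form a given feature actually stores should be checked against the data itself.

```python
# Tonos code points (Greek and Coptic block) mapped to their oxia
# counterparts (Greek Extended block).
# Assumption: the mapping runs from the tonos form to the oxia form.
TONOS_TO_OXIA = str.maketrans({
    "\u03ac": "\u1f71",  # alpha
    "\u03ad": "\u1f73",  # epsilon
    "\u03ae": "\u1f75",  # eta
    "\u03af": "\u1f77",  # iota
    "\u0390": "\u1fd3",  # iota with dialytika
    "\u03cc": "\u1f79",  # omicron
    "\u03cd": "\u1f7b",  # upsilon
    "\u03b0": "\u1fe3",  # upsilon with dialytika
    "\u03ce": "\u1f7d",  # omega
})

def to_oxia(text):
    """Hypothetical helper: replace tonos vowels with their oxia equivalents."""
    return text.translate(TONOS_TO_OXIA)

search_term = "\u03b8\u03b5\u03cc\u03c2"  # word typed with omicron + tonos
stored_word = "\u03b8\u03b5\u1f79\u03c2"  # same word with omicron + oxia

print(search_term == stored_word)           # False: the code points differ
print(to_oxia(search_term) == stored_word)  # True once the term is mapped
```

Swapping the keys and values of the dictionary gives the reverse mapping, should the target feature use the tonos forms instead.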

More resources on Unicode