Text-Fabric dataset of the Greek New Testament, based on the Nestle 1904 (7th printing) edition.
About this dataset
A computer cannot interpret text without first converting it into a binary format. Historically, this was done using ASCII, which mapped Latin characters to 7-bit codes. Various schemes existed for encoding characters in non-Latin scripts, such as Greek. To standardize the encoding of these non-Latin characters, Unicode was introduced. This standardization aims to ensure full portability of text across computers with different operating systems. All Greek text in this Text-Fabric dataset has been encoded in Unicode.
In practice, however, this is not without complications, especially for languages such as Greek that use characters augmented with polytonic accents. These complexities can lead to inconsistencies in text representation and to errors in query results when working with the dataset.
The information on this page in particular pertains to the following base features:
The information is also relevant to the following add-on features:
To view all the special characters used in this dataset, run the following command (e.g., in a Jupyter Notebook cell):
A.specialCharacters()
This command will return the following details:
Special characters in text-orig-full · Α α ὰ ά ᾴ ἀ Ἀ ἂ ἄ Ἄ ᾄ ἆ Ἆ ἁ Ἁ ἃ Ἃ ἅ Ἅ ᾶ ᾷ ᾳ Β β Γ γ Δ δ Ε ε ὲ έ ἐ Ἐ ἔ Ἔ ἑ Ἑ ἓ Ἓ ἕ Ἕ Ζ ζ Η η ὴ ή ῄ ἠ Ἠ ἢ Ἢ ἤ Ἤ ᾔ ἦ Ἦ ᾖ ᾐ ἡ ἣ ἥ Ἥ ἧ ᾗ ᾑ ῆ ῇ ῃ Θ θ Ι ι ὶ ί ϊ ῒ ΐ ἰ Ἰ ἴ Ἴ ἶ ἱ Ἱ ἳ ἵ Ἵ ἷ ῖ Κ κ Λ λ Μ μ Ν ν Ξ ξ Ο ο ὸ ό ὀ Ὀ ὂ ὄ Ὄ ὁ Ὁ ὃ Ὃ ὅ Ὅ Π π Ρ ρ ῥ Ῥ Σ ς σ Τ τ Υ υ ὺ ύ ϋ ῢ ΰ ὐ ὒ ὔ ὖ ὑ Ὑ ὓ ὕ Ὕ ὗ Ὗ ῦ Φ φ Χ χ Ψ ψ Ω ω ὼ ώ ῴ ὠ ὢ ὤ Ὤ ὦ Ὦ ᾠ ὡ Ὡ ὥ Ὥ ὧ Ὧ ᾧ ῶ ῷ ῳ — ’
These characters can be used directly to build queries without the need to look up Unicode codepoints.
There are some differences in Unicode encoding between the base features and certain add-on BibleOL features, particularly concerning homoglyphs (characters that look identical or very similar but have different Unicode values). These subtle differences can be difficult to spot on screen, but a small Python script can reveal them, for example in the word θεὸς in Romans 1:19. The following code snippet prints the Unicode code point of each character in that word.
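item = "\u03b8\u03b5\u03cc\u03c2"  # example value (θεός); in practice, item stores the Unicode-encoded string read from a feature
for char in item:
    # ord() returns the decimal Unicode code point of the character
    print(f"Character: '{char}'\tUnicode Code Point: {ord(char)}")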
The Python code prints the following for the base data:
Character: 'θ'    Unicode Code Point: 952
Character: 'ε'    Unicode Code Point: 949
Character: 'ό'    Unicode Code Point: 972
Character: 'ς'    Unicode Code Point: 962
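The Python code prints the following for the BibleOL-based feature. Two code points differ from the base data: the first letter is a capital Θ (920 instead of 952), and the accented omicron is 8057 (U+1F79, with oxia) instead of 972 (U+03CC, with tonos):
Character: 'Θ'    Unicode Code Point: 920
Character: 'ε'    Unicode Code Point: 949
Character: 'ό'    Unicode Code Point: 8057
Character: 'ς'    Unicode Code Point: 962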
See also the following notebook.
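Decimal code points alone do not say what kind of character is involved. As a minimal sketch (not part of the dataset tooling), Python's standard `unicodedata` module can print the official Unicode name next to each code point, which makes homoglyphs easier to tell apart. The helper name `describe` and the two example strings are illustrative assumptions; in practice the strings would come from the features being compared.

```python
import unicodedata


def describe(word):
    """Return (character, 'U+XXXX', Unicode name) for each character in word."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in word
    ]


# Hypothetical example values: two spellings that look alike on screen
# but differ in the code point used for the accented omicron.
tonos_form = "\u03b8\u03b5\u03cc\u03c2"  # accented omicron is U+03CC (tonos)
oxia_form = "\u03b8\u03b5\u1f79\u03c2"   # accented omicron is U+1F79 (oxia)

# Print both spellings side by side and flag positions that differ.
for left, right in zip(describe(tonos_form), describe(oxia_form)):
    flag = "  <-- differs" if left[1] != right[1] else ""
    print(f"{left[1]} {left[2]} | {right[1]} {right[2]}{flag}")
```

Comparing names rather than glyphs avoids relying on how a particular font happens to draw the tonos and the oxia.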
The unicode, text, after, before, and punctuation features were encoded using polytonic accents over the vowels (oxia, varia, and perispomeni). For the vowel η, for instance, these are, respectively, ή (U+1F75), ὴ (U+1F74), and ῆ (U+1FC6). Since 1982, however, Modern Greek has replaced the polytonic accents with the single monotonic accent tonos ◌̍ (U+030D). Later, in 1986, the Greek government decreed that the tonos be represented as the acute accent ◌́ (U+0301). As a result, a character with tonos and a character with an acute accent (oxia) cannot be told apart visually in ancient Greek text.
Unicode can represent an accented character in two ways: as a decomposed sequence (a base letter followed by a combining mark) or as a single precomposed code point. For instance, ά can be encoded as the sequence α (U+03B1) followed by the combining acute accent ◌́ (U+0301), as the precomposed character U+03AC (Greek small letter alpha with tonos), or as the precomposed character U+1F71 (Greek small letter alpha with oxia). These encodings are canonically equivalent, so all of them should be rendered the same way.
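The equivalence between these encodings can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration: the variable names are arbitrary, and it shows that a plain string comparison treats the encodings as different, while Unicode normalization (NFC or NFD) maps them to a single form.

```python
import unicodedata

alpha_tonos = "\u03ac"           # single code point: alpha with tonos
alpha_oxia = "\u1f71"            # single code point: alpha with oxia
alpha_sequence = "\u03b1\u0301"  # alpha followed by a combining acute accent

variants = (alpha_tonos, alpha_oxia, alpha_sequence)

# A plain comparison works on code points, so the variants are all unequal.
print(alpha_tonos == alpha_oxia, alpha_tonos == alpha_sequence)

# NFC composes each variant to one code point; NFD decomposes each variant
# to base letter plus combining mark. Either way, one form remains.
nfc_forms = {unicodedata.normalize("NFC", s) for s in variants}
nfd_forms = {unicodedata.normalize("NFD", s) for s in variants}

print([f"U+{ord(c):04X}" for form in nfc_forms for c in form])
print([f"U+{ord(c):04X}" for form in nfd_forms for c in form])
```

Note that NFC yields the tonos code point (U+03AC), not the oxia one, because Unicode defines U+1F71 as canonically equivalent to U+03AC.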
However, Python compares strings code point by code point and does not apply Unicode normalization on its own, so it distinguishes between characters based on their exact Unicode values. This distinction becomes particularly evident when executing queries in Text-Fabric. For example, a user might copy a Greek word into a search, only to find no results because of subtle differences in character encoding. Therefore, we updated nine characters (ά, έ, ή, ί, ΐ, ό, ύ, ΰ, ώ) of the unicode, text, normalized, and lemma features using precomposed characters, as shown in the following table.
Character | Unicode code point (with tonos) | Unicode code point (with oxia) | Character name |
---|---|---|---|
ά | U+03AC | U+1F71 | Small letter alpha with tonos |
έ | U+03AD | U+1F73 | Small letter epsilon with tonos |
ή | U+03AE | U+1F75 | Small letter eta with tonos |
ί | U+03AF | U+1F77 | Small letter iota with tonos |
ΐ | U+0390 | U+1FD3 | Small letter iota with dialytika and tonos |
ό | U+03CC | U+1F79 | Small letter omicron with tonos |
ύ | U+03CD | U+1F7B | Small letter upsilon with tonos |
ΰ | U+03B0 | U+1FE3 | Small letter upsilon with dialytika and tonos |
ώ | U+03CE | U+1F7D | Small letter omega with tonos |
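When a copied search term uses the other variant of one of these nine characters, it can be converted before querying. The following is a minimal sketch, not the procedure used to build the dataset: it hard-codes the standard Unicode code points for the nine characters and relies on `str.translate`. The names `OXIA_TO_TONOS`, `TONOS_TO_OXIA`, `to_tonos`, and `to_oxia` are hypothetical, and which direction a given feature needs is an assumption to verify by inspecting its code points first.

```python
import unicodedata

# Oxia code point -> tonos code point for the nine affected characters.
OXIA_TO_TONOS = {
    0x1F71: 0x03AC,  # alpha
    0x1F73: 0x03AD,  # epsilon
    0x1F75: 0x03AE,  # eta
    0x1F77: 0x03AF,  # iota
    0x1FD3: 0x0390,  # iota with dialytika
    0x1F79: 0x03CC,  # omicron
    0x1F7B: 0x03CD,  # upsilon
    0x1FE3: 0x03B0,  # upsilon with dialytika
    0x1F7D: 0x03CE,  # omega
}
TONOS_TO_OXIA = {tonos: oxia for oxia, tonos in OXIA_TO_TONOS.items()}

# Sanity check: Unicode NFC normalization maps each oxia form to the
# tonos form listed above, so the table matches the standard.
for oxia, tonos in OXIA_TO_TONOS.items():
    assert unicodedata.normalize("NFC", chr(oxia)) == chr(tonos)


def to_tonos(text):
    """Replace oxia variants with their tonos counterparts."""
    return text.translate(OXIA_TO_TONOS)


def to_oxia(text):
    """Replace tonos variants with their oxia counterparts."""
    return text.translate(TONOS_TO_OXIA)


# Hypothetical search term copied from a source that uses the oxia variant.
copied = "\u03b8\u03b5\u1f79\u03c2"
print([f"U+{ord(c):04X}" for c in to_tonos(copied)])
```

An explicit mapping changes only these nine characters and leaves everything else, including varia and perispomeni forms, untouched, which a blanket normalization step may not guarantee for every character in a text.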