Unternehmensberatung Dieckmann

DITECT Interface




Calling DITECT and returning



Because DITECT uses some DIHYPH program functions, the calling program must set
the desired pathname in both arrays, "dtpath[100]" (for DITECT) and
"dhpath[100]" (for DIHYPH), before DITECT (or DIHYPH) is called.
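
A minimal C sketch of this setup step follows. The two arrays are local
stand-ins for the library's "dtpath" and "dhpath" (in a real program they
would presumably be declared by the DITECT/DIHYPH sources), and the helper
name set_dict_path is hypothetical, not part of the DITECT interface.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the 100-byte path arrays named in the text. In a real
   program they would come from the DITECT/DIHYPH library (assumption). */
char dtpath[100];
char dhpath[100];

/* Hypothetical helper: copies one pathname into both arrays.
   Returns 0 on success, -1 if the path does not fit into 100 bytes. */
int set_dict_path(const char *path)
{
    assert(path != NULL);
    if (strlen(path) >= sizeof dtpath)
        return -1;                 /* too long, arrays left unchanged */
    strcpy(dtpath, path);
    strcpy(dhpath, path);
    return 0;
}
```

Writing both arrays in one place keeps the two pathnames from drifting apart.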

The typesetting system defines the text area to be checked by DITECT as follows:

NT:                          /* Get next text area for spell-checking */
       :
   afc = int-index of first text-character to be checked.
   alc = int-index of last  text-character to be checked.
NP:
   rc  = DTECT (nn, text);   /* nn = int-language-no. (1 =German)     */
   if (rc   == -1) ... ;     /* Program error, missing files. Abort   */
   if (errm >   0) ... ;     /* Evaluate error markings.              */
   if (afc  < alc) goto NP;  /* Check remaining part of text.         */
   else            goto NT;  /* Now get next text-area for checking   */

END:                         /* At end of job, typesetting-system     */
   DHCLOSAL();               /* closes all open files and             */
   DHFREEAL();               /* then frees all RAM allocations.       */

'afc' and 'alc' define the text area to be spell-checked.
The size of this area is unlimited because it is checked sentence by sentence.

After DITECT returns with 'errm' > 0, the typesetting system has to
evaluate the character array 'charr[ ]' to find the marked errors and position
the text-editor cursor directly on the erroneous text position.

Correct words that DITECT falsely marks as not found in the dictionary may be
stored "short-term" (see: 'ftmp').
From then on, DITECT will 'know' them.

If possible, DITECT always ends checking at the end of a sentence, stores the
index of the following sentence in 'afc' and returns to the calling program.
After evaluating all marked errors, the calling program calls DITECT again, until
the defined text area has been checked.
When 'afc' > 'alc', the calling program defines the next text area, and so on.



Return-array 'charr'



After DITECT returns, the typesetting system has to evaluate the array 'charr'
to get the position and type of each spelling error.
Charr-field 0     = 2-byte error count.
Charr-field 1 - n = 4 bytes each, holding character information.
Characters unimportant for the spelling check are skipped.
Length of 'charr' is: 0 to cap-1 ('cap' = int-value).
Maximum length of the "charr" array is defined by the int-value 'charm'.


                          error         error
                            |             |
Example-sentence:           i t ' s  a  t y x t - l i n e .

                          | |        |  |       |       |   |
Hex. character-index:    00 01      06 08      0C      10  12
     _______________________|        |  |       |       |________________________________________
     |                           ____|  |       |____________________                           |
|    |                           |      |                           |                           |    |
| charr -field:                                                                                      |c
| 0|  1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |      |      |                  ...  n   |a
|  |      |      |      |      |      |      |      |      |      |      |      |      |      |      |p
|ee|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|ii|c|e|
|02|01|2|2|02|0|0|03|0|0|04|0|0|06|1|0|08|1|0|09|0|1|0A|0|0|0B|0|0|0C|0|0|0D|0|0|0E|0|0|0F|0|0|10|0|0|
|--|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-
|0 |2  4 5|6  8 9|10    |      |      |22     ...                                           = Byte-no.

  |  | | |                                         |
  |  | | |                                         |
  |  | | |____  Error-Byte set ! __________________|
  |  | |          (see:  "Error-type")
  |  | |
  |  | |______  Char.-byte: Char.-type is
  |  |            00      = Letter, hyphen (-), apostrophe (') or period (.)
  |  |            01      = Start of word
  |  |            02      = Start of sentence
  |  |            04      = Ending abbreviation dot  ( etc. )
  |  |
  |  |________  Two index-bytes  ( = position of text-character).
  |___________  No. of errors found  (or int-value  'errm' ).
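
The byte layout above can be decoded with a few lines of C. This is only a
sketch under stated assumptions: the byte order of the 2-byte values is not
given in this description, so high byte first is assumed and isolated in
read_u16, and all struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One decoded 4-byte charr field (fields 1..n in the layout above). */
struct charr_entry {
    unsigned index;      /* position of the text character       */
    unsigned char_type;  /* 0, 1, 2 or 4, see "Char.-byte"        */
    unsigned error;      /* 0 = no error, otherwise error type    */
};

/* ASSUMPTION: 2-byte values are stored high byte first. Swap the two
   operands here if the real library uses the opposite order. */
static unsigned read_u16(const unsigned char *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}

/* Error count held in charr field 0 (bytes 0-1). */
unsigned charr_error_count(const unsigned char *charr)
{
    assert(charr != NULL);
    return read_u16(charr);
}

/* Decodes field k (k >= 1); it starts at byte 2 + 4*(k-1). */
struct charr_entry charr_field(const unsigned char *charr, size_t k)
{
    struct charr_entry e;
    const unsigned char *p;
    assert(charr != NULL && k >= 1);
    p = charr + 2 + 4 * (k - 1);
    e.index = read_u16(p);   /* two index bytes */
    e.char_type = p[2];      /* char.-byte      */
    e.error = p[3];          /* error-byte      */
    return e;
}

/* Returns the number of the first field in start..nfields whose error
   byte is set, or 0 if there is none. */
size_t charr_next_error(const unsigned char *charr, size_t start,
                        size_t nfields)
{
    size_t k;
    for (k = (start < 1 ? 1 : start); k <= nfields; k++)
        if (charr_field(charr, k).error != 0)
            return k;
    return 0;
}
```

A caller would step through the fields with charr_next_error and place the
editor cursor at the returned entry's index.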


Error-type

DITECT sets 7 different error indices for different types of spelling errors.
In the array "errtp[]" a specific error code may be defined for every error index,
just as the text or publishing system needs it.

e.g.:
errtp[] = { 2, 4, 6, 8, 10, 12, 14, 0 }
or better:
errtp[] = { 1, 2, 3, 4, 5, 6, 7, 0 }
            |  |  |  |  |  |  |  | Index  Type of spelling error        
            |  |  |  |  |  |  |  |_  0    unused
            |  |  |  |  |  |  |____  7    automatic replacement
            |  |  |  |  |  |_______  6    word refused by user
            |  |  |  |  |__________  5    space is missing
            |  |  |  |_____________  4    double words
            |  |  |________________  3    wrong capital initial letter
            |  |___________________  2    wrong small   initial letter
            |______________________  1    incorrect spelling
In the case of:

    errtp[] = { 1, 2, 3, 1, 1, 1, 1 }
all errors are of type "incorrect spelling" (=1), except for
"wrong small" (=2) or "wrong capital" (=3) initial letter.


The error type defined in "errtp[]" is stored in the "error-byte" of array "charr"
whenever an error occurs.
When the user doesn't want words of a specific error type to be marked by DITECT,
that error type may be set to "0" in "errtp[]", e.g. in the case of:

    errtp[] = { 1, 2, 0, 1,  1,  1, 1 }
all errors resulting from a wrong capital initial letter are ignored by DITECT.


Two consecutive words

a) Both single words are correct (e.g.: Helmuth Kohl)
    but the dictionary may list them as an incorrect combination
    (must be: Helmut Kohl).
    
b) One or both single words are incorrect (e.g.: Barbra)
    but the dictionary may list them as a correct combination
    (Barbra Streisand).

Because checking all combinations decreases program performance, it is only done
when +8 is added to switch "mexsw".
When such an expression is incorrect (= refused), +50 is added to error type
6 or 7 (giving 56 or 57) to signal that both words together
- have to be rejected (error-type 56) or
- have to be automatically replaced (error-type 57).
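
A calling program may want to separate the +50 marker from the error type
itself. The following C sketch shows one way to do so; the struct and function
names are hypothetical, and it assumes that "errtp[]" holds the default codes
1 to 7, so that only 56 and 57 carry the marker.

```c
#include <assert.h>

/* Result of splitting an error type as read from the error byte. */
struct err_kind {
    int base;       /* error type without the +50 marker            */
    int two_words;  /* 1 if the error covers two consecutive words  */
};

/* ASSUMPTION: errtp[] holds the default codes 1..7, so that only the
   values 56 and 57 carry the +50 two-word marker. */
struct err_kind split_error_type(int type)
{
    struct err_kind k;
    assert(type >= 0);
    k.two_words = (type == 56 || type == 57);
    k.base = k.two_words ? type - 50 : type;
    return k;
}
```

With custom codes in "errtp[]" the comparison values would have to be adapted.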


Error-type 6 (or 56): Rejected expression

When DITECT marks an expression with error-type 6 (or 56 = two words), a list is
displayed showing one or more words, one per line. The user may select one of
these words to replace the incorrect text word.
When the replacement is at the start of a sentence, the initial letter of the
selected proposal must be capitalized. The calling program can do this with the
following function, where "ptr_prop" is a "char-pointer" to the selected proposal:
DTCAPIT (ptr_prop);


Error-type 7 (or 57): Automatic replacement

When DITECT marks an expression with error-type 7 (or 57 = two words), the calling
system will find the replacement expression in the first or second line of the
proposal list (percentage 101), but must not display this proposal list.
When the automatic replacement happens at the start of a sentence, the conversion
to a capital initial letter is done automatically by DITECT.



Proposal word list



When DITECT has found a spelling error, the array 'prbuf' holds up to 20 words
most similar to the erroneous word.

Every word in this proposal list is stored in 50 bytes: the first byte always
holds the binary percentage of similarity, followed by the word, terminated by a
binary zero. Unused 'prbuf' lines have a percentage of binary zero.

For example, when "errours" is an unknown or incorrect word, the proposal list
looks like:

82 e r r o r s
81 e r r o r
71 e r r o r f u l
66 E r r o l
 :
 :
|  |                        |
0  1 2 3 4 5       ...     49   = 'prbuf'-index  0-49
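
The proposal list can be read as sketched below in C. The sketch assumes that
'prbuf' is one contiguous block of 20 lines of 50 bytes each; the constant and
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PR_LINES 20   /* max. number of proposals, per the text */
#define PR_WIDTH 50   /* bytes per proposal line, per the text  */

/* Returns the word of proposal line i and stores its similarity
   percentage in *percent. Returns NULL for an unused line (percentage
   byte is binary zero) or for an out-of-range i.
   ASSUMPTION: prbuf is a contiguous block of PR_LINES * PR_WIDTH bytes. */
const char *proposal_at(const unsigned char *prbuf, size_t i,
                        unsigned *percent)
{
    const unsigned char *line;
    assert(prbuf != NULL && percent != NULL);
    if (i >= PR_LINES)
        return NULL;
    line = prbuf + i * PR_WIDTH;
    if (line[0] == 0)
        return NULL;                    /* unused line */
    *percent = line[0];
    return (const char *)(line + 1);    /* zero-terminated word */
}

/* Counts the used lines in the proposal list. */
size_t proposal_count(const unsigned char *prbuf)
{
    size_t i, n = 0;
    unsigned pct;
    for (i = 0; i < PR_LINES; i++)
        if (proposal_at(prbuf, i, &pct) != NULL)
            n++;
    return n;
}
```

A count of zero corresponds to the case where no proposal list was stored.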


Attention

When DITECT has to check not just one word but a text document with one or more
sentences, the calling system has to call DITECT as follows:

1. Set switch "prbs= 0;" before calling, so that DITECT finds all error words
within the text and stores error-index and error-type in array 'charr'.

2. Don't display all error-marked words at once, but one after the other.

3. Before displaying a word, evaluate its error type in array 'charr' and decide
whether a proposal list is useful for correcting it (normally only for error-types
1, 3 and 6). If so, call DITECT again to check only that erroneous word, but
with switch "prbs= 1;".
After DITECT has checked the word, display the proposal list, wait for user
action and look for the next error in array 'charr' (repeat step 3, and so on).
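
The decision in step 3 can be sketched in C as follows. The function name is
hypothetical, and the sketch assumes the default "errtp[]" codes 1 to 7, so
that the error byte read from 'charr' equals the error index listed under
"Error-type". Two-word codes (56, 57) are not covered here.

```c
#include <assert.h>

/* Returns 1 if a second DITECT call with prbs = 1 is worthwhile for
   this error type, otherwise 0.
   ASSUMPTION: errtp[] holds the default codes 1..7. */
int proposal_list_useful(int error_type)
{
    assert(error_type >= 0);
    switch (error_type) {
    case 1:   /* incorrect spelling           */
    case 3:   /* wrong capital initial letter */
    case 6:   /* word refused by user         */
        return 1;
    default:  /* e.g. double words, missing space, automatic replacement */
        return 0;
    }
}
```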


In the case of:
- Proposal list switch 'prbs' = 0 (see file 'DTDFLT.CFG'),
- Double words (... word word ...),
- Incorrect small initial letter at start of sentence,
- Missing space error
no proposal list is stored (all 20 percentages in 'prbuf' are binary zero).

When +1 is added to parameter "usuk" (see file: dtdflt.cfg), unwanted exception
words like Photo* are always displayed as the first proposal (followed by three
asterisks, ***) to show why this (perhaps correct-looking) word is marked by DITECT.

Program speed:
When DITECT searches for proposal words, it assumes that the first two letters
of the word are correct. For example, the incorrect word "widerholen" would yield
the correct proposal "wiederholen", but for "weiderholen" that proposal is not
found, as the second letter is incorrect. Here switch "usuk" +2 (= 2 or 3) can
help, but program performance drops because many more words have to be checked.


False error reports


Many text documents contain expressions, such as foreign names, that are unknown
to spelling check programs.
Such words are marked as errors and a replacement list is displayed for
correction.
But when such a word is not erroneous, it is very irritating to see it
marked again and again throughout the text document.
DITECT can prevent this:
   - whenever the publishing system recognizes that the marked word
      was not corrected by the user,   or
   - with an extra line at the start of the replacement list,
a message should be displayed such as:
"Word is correct. Don't mark it again. OK ?"
The user can click "OK" and DITECT will suppress future marking of
the word.

To do so, the following software statements must be inserted in the publishing
system and file DTSFWRD.c added to the linkage list.
       ftmp = 0;                           /* suppression switched off */
       if (...don't mark it again...) ftmp = 1;  /*  "    switched on  */
       DTSFWRD(line, 1);   
(where "ftmp" and "line" are global definitions. Array "line" holds
the word in 8-bit coding.)

Stored words should be erased at the end of the document with the following
statement, as a rapidly increasing number would affect program speed.

       DTSFWRD(drz, 0);                    /* Erase stored words.      */
However, at program start, on a change of language or at end-of-array,
the storage array is reset automatically.


File - description



File DTnn.BIN
is the highly compressed binary dictionary containing (nearly) all words
or expressions of language nn.
At file address plus 18 the file holds the version number (4 digits), e.g. 3.09.
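
A calling program could check the dictionary version roughly as sketched below
in C. The function name is hypothetical, and the sketch assumes that "file
address plus 18" means byte offset 18 from the start of the file and that the
four version bytes are printable text such as "3.09"; neither detail is
confirmed by this description.

```c
#include <assert.h>
#include <stdio.h>

/* Reads the 4 version bytes at file offset 18 into out[0..3] and
   terminates out with '\0'. Returns 0 on success, -1 on failure.
   ASSUMPTION: the bytes are printable text such as "3.09". */
int read_dict_version(const char *filename, char out[5])
{
    FILE *fp;
    int ok;
    assert(filename != NULL && out != NULL);
    fp = fopen(filename, "rb");
    if (fp == NULL)
        return -1;                       /* file missing or unreadable */
    ok = fseek(fp, 18L, SEEK_SET) == 0 && fread(out, 1, 4, fp) == 4;
    fclose(fp);
    if (!ok)
        return -1;                       /* file shorter than 22 bytes */
    out[4] = '\0';
    return 0;
}
```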

File DTEXnn.TXT
has to be considered an appendix to file DTnn.BIN.
When it grows very large, this file should be merged into DTnn.BIN, which
can only be done by U.B. Dieckmann.
This may happen perhaps once a year, perhaps never.
Afterwards, this file has to be erased from the user's disk.


Global values definable by user

The following values may be changed either in file DTDFLT.CFG or, if needed, within the software:

name       value Meaning                                      default
mexsw            multiple search:                                 6
             0   = switched off
             1   = on combined-words (e.g. Jo-Ann)
             2   = on combined-words and
                   on compound words (see: minkl)
            +4   = on double words  ".. word word .."
            +8   = on two correct neighbouring words

minkl        n   Minimum length of word compounds.                5

prbs             proposal-word-list:                              1
             0   = switched off
             1   = switched on

usuk         1   = refused words are displayed***                 1
            +0   = Standard proposal search (improved speed)
            +2   = Standard proposal search (lower speed)
            +4   = Strong   proposal search (slow speed)
            +8   = Limited  proposal search (high speed)

csch             Check capital/small initial letter:              6
             0   = switched off
             1   = within sentence
             2   = at start of and within sentence
            +4   = Don't check words with 1-4
                      capital letters, e.g.  UBD
            +8   = Don't check words following "

ftmp             Short-term storage of unknown words:             0
             0   = switched off
             1   = store words short-term

charm            Max. text-size (charm/4 = 2500 characters)   10000
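
The "+4" and "+8" entries in the table suggest that the switches combine a base
value with additive options. The following C sketch decodes "mexsw" on that
reading; the struct and function names are hypothetical, and treating the
options as bit flags is an assumption, although it agrees with the default
value 6 = 2 + 4.

```c
#include <assert.h>

struct mexsw_flags {
    int mode;             /* 0 = off, 1 = combined words,
                             2 = combined and compound words */
    int double_words;     /* +4: check ".. word word .."      */
    int neighbour_words;  /* +8: check two neighbouring words */
};

/* ASSUMPTION: "+4" and "+8" are additive bit flags on top of the base
   value 0..2, which matches the default 6 = 2 + 4. */
struct mexsw_flags decode_mexsw(int mexsw)
{
    struct mexsw_flags f;
    assert(mexsw >= 0);
    f.mode = mexsw & 3;
    f.double_words = (mexsw & 4) != 0;
    f.neighbour_words = (mexsw & 8) != 0;
    return f;
}
```

For the default value 6 this yields mode 2 with the double-word check on and
the neighbouring-word check off.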