synthesise_ocr_and_segmentation_errors_in_texts Archived

This software damages texts written in any natural language in two ways: by applying OCR degradation (phantom characters, character degradation, etc.) and by over-segmenting the text, that is, splitting it at regular intervals into parts of equal size.

This is useful for reproducing the errors commonly found in historical documents when real historical data is unavailable.
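
To make the two kinds of damage concrete, here is a minimal Python sketch. It is not the code of this software: the function names (`add_phantom_characters`, `degrade_characters`, `over_segment`), the `rate` and `part_length` parameters, the phantom alphabet and the confusion table are all hypothetical, chosen only for illustration. It also assumes that "equal parts" means chunks of a fixed number of characters, with a shorter final chunk when the text length is not a multiple of that number.

```python
import random

# Hypothetical table of OCR-style confusions; a real tool may use a different one.
DEFAULT_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "s": "f"}


def add_phantom_characters(text, rate, rng, alphabet=".,'`-"):
    """Insert a spurious character after each character with probability `rate`."""
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(alphabet))
    return "".join(out)


def degrade_characters(text, rate, rng, confusions=None):
    """Replace confusable characters by a look-alike with probability `rate`."""
    table = DEFAULT_CONFUSIONS if confusions is None else confusions
    return "".join(
        table[ch] if ch in table and rng.random() < rate else ch
        for ch in text
    )


def over_segment(text, part_length):
    """Split `text` into consecutive chunks of `part_length` characters."""
    if part_length < 1:
        raise ValueError("part_length must be at least 1")
    return [text[i:i + part_length] for i in range(0, len(text), part_length)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    sample = "The old mill stood by the river."
    noisy = degrade_characters(add_phantom_characters(sample, 0.05, rng), 0.2, rng)
    print(noisy)
    print(over_segment(noisy, 10))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which helps when the damaged texts serve as evaluation data. Joining the chunks returned by `over_segment` gives back the input unchanged, so the segmentation step on its own loses no characters.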