Explore projects
Thèse Guillaume Bernard / Jeux de données / dataset_manipulation_tools / compute_tf_idf_weights
GNU General Public License v3.0 or later
This software computes TF-IDF weights for texts stored in the document_tracking_resources format. Vectors and weights are computed from a resource file that contains a representation of the language used in the same context as the texts to weight (for example, news features to weight texts published in the news).
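As a rough illustration of weighting a text against a separate background resource, here is a minimal sketch. It is not the project's code: the function names, the whitespace tokenizer, and the smoothed IDF variant are assumptions made for the example, and the background documents stand in for the resource file mentioned above.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real tools use proper tokenization).
    return text.lower().split()


def idf_from_background(background_docs):
    """Compute smoothed IDF values from background documents.

    Returns the IDF table and the number of background documents.
    """
    n_docs = len(background_docs)
    doc_freq = Counter()
    for doc in background_docs:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return idf, n_docs


def tf_idf(text, idf, n_docs):
    """Weight one text with IDF values learned from the background."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    # Terms never seen in the background get the IDF of document frequency 0.
    unseen_idf = math.log(1 + n_docs) + 1.0
    return {
        term: (count / total) * idf.get(term, unseen_idf)
        for term, count in counts.items()
    }


# Hypothetical background corpus standing in for the resource file.
background = ["the news today", "the weather today", "the election"]
idf, n_docs = idf_from_background(background)
weights = tf_idf("the election results", idf, n_docs)
```

In this sketch a term that is frequent in the background ("the") receives a lower weight than a rarer one ("election"), which is the reason for computing IDF on a resource from the same context as the texts.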
Thèse Guillaume Bernard / Jeux de données / dataset_manipulation_tools / synthesise_ocr_and_segmentation_errors_in_texts
GNU General Public License v3.0 or later
This software damages texts written in any natural language by applying OCR degradation (phantom characters, character degradation, etc.) and by over-segmenting them (that is, splitting the texts at regular intervals into equal parts).
This is useful to reproduce common errors found in historical documents when historical data is missing.
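The two kinds of damage described above can be sketched as follows. This is an illustrative approximation, not the project's implementation: the confusion table, the phantom characters, the `rate` and `part_length` parameters, and both function names are hypothetical.

```python
import random

# Hypothetical table of OCR-like character confusions.
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "i": "l"}
# Hypothetical phantom characters that an OCR engine might hallucinate.
PHANTOMS = ".,'`"


def degrade_characters(text, rate, rng):
    """Damage each character with probability `rate`.

    A damaged character is either replaced by a confusable one or
    followed by a phantom character.
    """
    out = []
    for ch in text:
        if rng.random() < rate:
            if ch in CONFUSIONS and rng.random() < 0.5:
                out.append(CONFUSIONS[ch])
            else:
                out.append(ch)
                out.append(rng.choice(PHANTOMS))
        else:
            out.append(ch)
    return "".join(out)


def over_segment(text, part_length):
    """Split a text into consecutive parts of `part_length` characters.

    The last part may be shorter than the others.
    """
    if part_length < 1:
        raise ValueError("part_length must be at least 1")
    return [text[i:i + part_length] for i in range(0, len(text), part_length)]


rng = random.Random(0)  # fixed seed so the damage is reproducible
sample = "a simple model of historical noise"
noisy = degrade_characters(sample, 0.2, rng)
parts = over_segment(sample, 10)
```

Seeding the generator keeps a degraded dataset reproducible, which matters when the same damaged texts must be regenerated for later experiments.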
Methods to account for digit preference (heaping) in wildlife count data
Thèse Guillaume Bernard / Développement / from events to documents / database_infrastructure_text_mining
GNU General Public License v3.0 or later
Textual search engine infrastructure based on Elasticsearch (https://www.elastic.co/fr/elasticsearch/) and Lucene (https://lucene.apache.org/). Includes the import scripts to load datasets into the index.
This competition aims to improve (denoise) OCR-ed texts, using a testbed of more than 20 million characters from English, French, German, Finnish, Spanish, Dutch, Czech, Bulgarian, Slovak and Polish.