Guillaume Bernard authored
50d7825f

Compute dense representation of texts

This software computes dense vectorisations (sentence embeddings) of sequences of natural-language sentences. It can handle multilingual documents as long as the model used is itself multilingual. It relies on the S-BERT architecture, software and models (https://www.sbert.net/). It computes dense vector representations for the tokens, lemmas, entities, etc. of your datasets.

The idea of computing dense representations of documents is inspired by previous work:

Reimers, Nils, and Iryna Gurevych. 2019. ‘Sentence-BERT: Sentence Embeddings
Using Siamese BERT-Networks’. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92.
Hong Kong, China: Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1410.

Linger, Mathis, and Mhamed Hajaiej. 2020. ‘Batch Clustering for Multilingual
News Streaming’. In Proceedings of Text2Story - Third Workshop on Narrative
Extraction From Texts Co-Located with 42nd European Conference on
Information Retrieval, 2593:55-61. CEUR Workshop Proceedings. Lisbon,
Portugal. http://ceur-ws.org/Vol-2593/paper7.pdf.

Staykovski, Todor, Alberto Barron-Cedeno, Giovanni da San Martino, and Preslav
Nakov. 2019. ‘Dense vs. Sparse Representations for News Stream Clustering’.
In Proceedings of Text2Story - 2nd Workshop on Narrative Extraction From
Texts, Co-Located with the 41st European Conference on Information Retrieval,
2342:47-52. Cologne, Germany: CEUR-WS.org.
https://ceur-ws.org/Vol-2342/paper6.pdf.

Installation

pip install compute_dense_vectors

Pre-requisites

Dependencies

This project relies on two other packages: document_tracking_resources and sentence-transformers. The code needs access to document_tracking_resources to load the corpora, and it uses sentence-transformers to compute the dense representations of documents.

Transformers Models

To compute dense representations of documents, the sentence-transformers package is used with two multilingual models: paraphrase-multilingual-mpnet-base-v2 and distiluse-base-multilingual-cased-v1. According to the sentence-transformers documentation, and at the time of writing, these are the two models that give the best results for multilingual semantic similarity.
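
As an illustration of what encoding with one of these models looks like, here is a minimal sketch that calls sentence-transformers directly. It is not the code of this project: the helper names encode_texts and cosine_similarity, as well as the example sentences, are assumptions made for the example. Only SentenceTransformer and its encode method come from the sentence-transformers package.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def encode_texts(
    texts: List[str],
    model_name: str = "paraphrase-multilingual-mpnet-base-v2",
) -> List[List[float]]:
    """Hypothetical helper: one dense vector per input text."""
    # Imported lazily so that cosine_similarity stays usable without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    return model.encode(texts).tolist()


if __name__ == "__main__":
    # Hypothetical sentences: the same statement in English and in Spanish.
    texts = [
        "A new type of coronavirus has been identified.",
        "Se ha identificado un nuevo tipo de coronavirus.",
    ]
    vectors = encode_texts(texts)
    print(len(vectors), len(vectors[0]))
    print(cosine_similarity(vectors[0], vectors[1]))
```

With a multilingual model, two sentences with the same meaning in different languages are expected to get a high cosine similarity, which is what makes these vectors usable on multilingual corpora.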

The corpus to process

The script can process two types of corpus from document_tracking_resources: one for news (NewsCorpusWithSparseFeatures) and one for tweets (TwitterCorpusWithSparseFeatures). The data files must be loadable by document_tracking_resources for this project to work.

For instance, here is an example of a TwitterCorpusWithSparseFeatures:

                                         date lang                                text               source  cluster
1218234203361480704 2020-01-17 18:10:42+00:00  eng  Q: What is a novel #coronavirus...      Twitter Web App   100141
1218234642186297346 2020-01-17 18:12:27+00:00  eng  Q: What is a novel #coronavirus...                IFTTT   100141
1219635764536889344 2020-01-21 15:00:00+00:00  eng  A new type of #coronavirus     ...            TweetDeck   100141
...                                       ...  ...                                 ...                  ...      ...
1298960028897079297 2020-08-27 12:26:19+00:00  eng  So you come in here WITHOUT A M...   Twitter for iPhone   100338
1310823421014573056 2020-09-29 06:07:12+00:00  eng  Vitamin and mineral supplements...            TweetDeck   100338
1310862653749952512 2020-09-29 08:43:05+00:00  eng  FACT: Vitamin and mineral suppl...  Twitter for Android   100338

And an example of a NewsCorpusWithSparseFeatures:

                              date lang                     title               text                     source  cluster
24290965 2014-11-02 20:09:00+00:00  spa  Ponta gana la prim   ...  Las encuestas...                    Publico     1433
24289622 2014-11-02 20:24:00+00:00  spa  La cantante Katie Mel...  La cantante b...          La Voz de Galicia      962
24290606 2014-11-02 20:42:00+00:00  spa  Los sondeos dan ganad...  El Tribunal  ...                    RTVE.es     1433
...                            ...  ...                       ...               ...                        ...      ...
47374787 2015-08-27 12:32:00+00:00  deu  Microsoft-Betriebssys...  San Francisco...               Handelsblatt      170
47375011 2015-08-27 12:44:00+00:00  deu  Microsoft-Betriebssy ...  San Francisco...               WiWo Gründer      170
47394969 2015-08-27 20:35:00+00:00  deu  Windows 10: Mehr als ...  In zwei Tagn ...                  gamona.de      170

Command line arguments

Once installed, the compute_dense_vectors command can be run directly, since it is registered in your PATH.

usage: compute_dense_vectors [-h] --corpus CORPUS --dataset-type {twitter,news} [--model-name MODEL_NAME] --output-corpus OUTPUT_CORPUS

Take a document corpus (in pickle format) and compute dense vectors for every feature

optional arguments:
  -h, --help            show this help message and exit
  --corpus CORPUS       Path to the pickle file containing the corpus to process.
  --dataset-type {twitter,news}
                        The kind of dataset to process. 'twitter' will use the 'TwitterCorpus' class, the 'Corpus' class otherwise
  --model-name MODEL_NAME
                        The name of the model that can be used to encode sentences using the S-BERT library
  --output-corpus OUTPUT_CORPUS
                        Path where to export the new corpus with computed TF-IDF vectors.
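
To show how the arguments above fit together, here is a minimal sketch that builds the command line and runs it from Python. The flags are the ones listed in the usage message. The helper name build_command and the file names news.pickle and news_dense.pickle are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_command(
    corpus: str,
    dataset_type: str,
    output_corpus: str,
    model_name: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble the compute_dense_vectors argument list."""
    if dataset_type not in ("twitter", "news"):
        raise ValueError("dataset_type must be 'twitter' or 'news'")
    command = [
        "compute_dense_vectors",
        "--corpus", corpus,
        "--dataset-type", dataset_type,
    ]
    # --model-name is optional in the usage message, so only pass it when given.
    if model_name is not None:
        command += ["--model-name", model_name]
    command += ["--output-corpus", output_corpus]
    return command


if __name__ == "__main__":
    command = build_command(
        corpus="news.pickle",  # hypothetical input file
        dataset_type="news",
        output_corpus="news_dense.pickle",  # hypothetical output file
        model_name="distiluse-base-multilingual-cased-v1",
    )
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.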