# Unsupervised Crisis Information Extraction from Microblog Data

Python framework to identify and rank crisis-related tweets based on their informativeness.

Run the framework from `Pipeline.py`, which also includes detailed comments on each parameter.

The pipeline can be summarized as follows:

1 - Preprocess tweet texts:

- remove URLs, emoji and mentions
- check the language
- remove duplicates and retweets
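
The preprocessing step above could look roughly like the following sketch. It is not the code in `Pipeline.py`: the regular expressions, the function names and the `is_target_language` hook are all assumptions made for illustration, and the emoji pattern only covers the most common Unicode blocks.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage: common pictograph, symbol and flag blocks only.
EMOJI_RE = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]"
)
RETWEET_RE = re.compile(r"^\s*RT\b")


def clean_text(text):
    """Strip URLs, mentions and emoji, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def preprocess(tweets, is_target_language=lambda text: True):
    """Return cleaned, deduplicated tweets, skipping retweets.

    `is_target_language` is a hypothetical hook for the language check;
    by default it accepts every tweet.
    """
    seen = set()
    kept = []
    for raw in tweets:
        if RETWEET_RE.match(raw):
            continue
        cleaned = clean_text(raw)
        key = cleaned.lower()  # case-insensitive duplicate detection
        if not cleaned or key in seen or not is_target_language(cleaned):
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept
```

Duplicates are detected after cleaning, so two tweets that differ only by a link or a mention count as the same text.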

2 - Run tweet clustering. Options:

- LDA
- NMF
- K-Means
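
As a hedged illustration of the three clustering options, the sketch below uses scikit-learn (listed under Requirements). The function name `cluster_tweets`, its parameters and the choice of TF-IDF or count features are assumptions, not the settings used in `Pipeline.py`. For the two topic models, each tweet is assigned to its dominant topic.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def cluster_tweets(texts, method="kmeans", n_clusters=2, seed=0):
    """Return one cluster label per tweet."""
    if method == "kmeans":
        matrix = TfidfVectorizer().fit_transform(texts)
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        return model.fit_predict(matrix).tolist()
    if method == "nmf":
        matrix = TfidfVectorizer().fit_transform(texts)
        model = NMF(n_components=n_clusters, random_state=seed)
    elif method == "lda":
        # LDA models word counts, so it gets raw term frequencies.
        matrix = CountVectorizer().fit_transform(texts)
        model = LatentDirichletAllocation(
            n_components=n_clusters, random_state=seed
        )
    else:
        raise ValueError(f"unknown method: {method}")
    # Topic models return a tweet-by-topic weight matrix;
    # the label is the index of the heaviest topic.
    return model.fit_transform(matrix).argmax(axis=1).tolist()
```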


3 - Rank the clusters by their overall semantic similarity with Castillo's CrisisLex lexicon, treating all tweets in a cluster as a single text.

4 - Take the top-ranked clusters (i.e., those above the 90th percentile) and rank their tweets by individual similarity with the CrisisLex lexicon.
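
Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the framework's code: `token_overlap` is a deliberately simple placeholder for the semantic similarity measure, the lexicon is treated as a set of single-word terms, and `percentile` and `rank_tweets` are hypothetical helper names.

```python
def token_overlap(text, lexicon_terms):
    """Placeholder similarity: share of lexicon terms found in the text."""
    tokens = set(text.lower().split())
    return len(tokens & lexicon_terms) / len(lexicon_terms)


def percentile(values, q):
    """q-th percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def rank_tweets(clusters, lexicon_terms, similarity=token_overlap, q=90):
    """Rank tweets from clusters scoring at or above the q-th percentile.

    `clusters` maps a cluster id to its list of tweets.
    Returns (score, tweet) pairs, best first.
    """
    # Step 3: score each cluster as one concatenated text.
    cluster_scores = {
        cid: similarity(" ".join(tweets), lexicon_terms)
        for cid, tweets in clusters.items()
    }
    threshold = percentile(list(cluster_scores.values()), q)
    # `>=` is a choice of this sketch: it keeps the best cluster
    # even when there is a single cluster or scores are tied.
    top = [cid for cid, score in cluster_scores.items() if score >= threshold]
    # Step 4: score the tweets of the retained clusters individually.
    scored = [
        (similarity(tweet, lexicon_terms), tweet)
        for cid in top
        for tweet in clusters[cid]
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Passing the similarity as a parameter keeps the ranking logic independent of the measure, so an ESA-based or Word2Vec-based function can replace the placeholder without touching the rest.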

Two options are available for computing semantic similarity:

- ESA (Explicit Semantic Analysis)
- Word2Vec
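
For the Word2Vec option, one common way to compare a text with a lexicon is to average the word vectors on each side and take the cosine of the two averages. The sketch below shows that idea only; whether `Pipeline.py` aggregates vectors this way is an assumption. The `vectors` argument is a plain dictionary standing in for the vectors of a trained model, and all function names are hypothetical.

```python
import math


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_similarity(text, lexicon_terms, vectors):
    """Cosine between the mean vector of the text and that of the lexicon."""
    text_vec = mean_vector(text.lower().split(), vectors)
    lex_vec = mean_vector(lexicon_terms, vectors)
    if text_vec is None or lex_vec is None:
        return 0.0  # nothing to compare when no word has a vector
    return cosine(text_vec, lex_vec)
```

Words without a vector are ignored, so a tweet made up entirely of out-of-vocabulary words scores 0.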

The original CrisisLex lexicon (`CrisisLexRec.txt`) and its French translation (`CrisisLexRec_FrenchGT.txt`) are in the main folder.

To request the Eleanor and Ophelia Twitter datasets, please write to roberto.interdonato at cirad point fr


## Requirements
* gensim
* numpy
* json (part of the Python standard library)
* sklearn (installed as the `scikit-learn` package)