Most of our clients are exceptionally busy and work in fast-moving markets where time and money are at a premium. The analysts we work with spend a great deal of time sifting through lengthy documents from multiple sources to find valuable insights.
We’re currently exploring ways of condensing lengthy text into a coherent summary, to reduce the amount of time analysts spend reading.
There are various ways to do this, including extraction-based (or keyword-based) summarization and abstraction-based summarization. Both are forms of automatic summarization. Extraction-based summarization uses statistical methods to pick out key phrases or sentences from the original text. Abstraction-based summarization uses natural language processing techniques to generate new, readable summaries.
We used TextRank, an unsupervised, graph-based extraction algorithm, to shorten the text of articles. We chose TextRank because it can easily be customised with additional functions and features to suit our clients’ requirements.
The article is broken into paragraphs, paragraphs into sentences, and sentences into words. The algorithm excludes words of certain lengths and words that carry little content, such as ‘the’ or ‘is’. It then compares the remaining words across sentences and ranks each sentence by how many of its words co-occur in other sentences. The summary of the document is a compilation of the top-ranked sentences.
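The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production code: the stop-word list, the minimum word length, the naive sentence splitter and all function names are hypothetical, and the similarity measure is the length-normalised word overlap commonly used with TextRank.

```python
import math
import re

# Tiny illustrative stop-word list (an assumption, not the list we use).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in",
              "that", "it", "for", "on", "with", "as", "are", "was"}
MIN_WORD_LENGTH = 3  # hypothetical cut-off for very short words


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-case the sentence and keep only the words tied to content."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if len(w) >= MIN_WORD_LENGTH and w not in STOP_WORDS}


def similarity(a, b):
    """Number of shared words, normalised by the sizes of both word sets."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # skip near-empty sentences and avoid a zero denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    bags = [content_words(s) for s in sentences]
    n = len(sentences)
    weights = [[similarity(bags[i], bags[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, top_n=2):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


sample = (
    "Graph databases store nodes and relationships. "
    "Graph databases simplify analysis of messy data. "
    "Neo4j is a graph database queried with Cypher. "
    "Lunch was pizza."
)
print(summarize(sample, top_n=2))
```

Sentences that share many content words with the rest of the document reinforce one another and rise to the top, while an off-topic sentence that shares no words with the others keeps only the baseline score. Returning the selected sentences in document order, rather than score order, helps the summary read coherently.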
Here’s an example, in which we shortened this 508-word blog post into the following 151-word summary, reducing the text by about 70%. The post was an overview of a data science meet-up in London last year that focused on graphs and graph data.
‘The take home message for me was that graph databases are an exciting and distinctive piece of technology, firstly because they can simplify storage and analysis of messy data sets, and secondly because they provide a useful visual encoding facilitating communication, interaction, and exploration of data.
Ian made a presentation in which he described the fundamentals of the graph data model, the data structure at the core of a graph database.
Within Neo4j you can query the database looking for particular graph configurations or shortest paths, using Cypher, Neo4j’s query language, which doesn’t look that difficult to learn (there are a good number of drivers available for Neo4j here)
In the question and answer session, Ian also pointed out two solutions for handling ‘super nodes’, which are situations in which there are one or more nodes with a relatively large number of attached nodes, for instance Lady Gaga‘s followers’
We’re currently exploring how summarization technology can take into account variables such as the length of a document, writing style and syntax, to help our customers consume complex information. Natural language processing techniques can also help to process reports written in English by non-native speakers, which may have a slightly different sentence structure.
We’re excited about this because it could considerably cut down the amount of time that it takes for analysts to sift through large quantities of information.
If you want to know more, please contact us at firstname.lastname@example.org.