The automated analysis of textual data and its application in business analytics holds great promise for providing decision-makers with information from the seemingly endless stream of news available online. Recent advances in computing have led to exciting new tools in the areas of Natural Language Processing, Sentiment Analysis and Machine Learning that can be used to make sense of an ever-growing number of online sources. In this series of blog posts I will look at the basic concept of sentiment analysis, both in general and in the context of commodity trading, offer a behind-the-scenes view of DataGenic's own efforts in building a sentiment engine, introduce a number of case studies and provide a glimpse into the future of Machine Learning and Artificial Intelligence (AI).
News Analytics and Sentiment Analysis
It is widely recognised that new information plays a key role in financial markets, affecting trading volumes, stock returns and price volatility. Consequently, news has always been a key source of investment information [1-3].
While decision-makers have always considered a portfolio of varied news domains and sources, the amount of readily available information has grown exponentially with the growth of the Internet. As major news outlets bolster their online portfolios, newspaper articles are increasingly published online. Bloomberg alone adds an estimated 1 million news stories a day.
Apart from news produced by these reputable sources, an increasing number of opinionated documents of interest are published online asynchronously 24/7/365 on blogs, message boards and micro blogs (e.g. Twitter) by large and varied user communities.
Image 1: Online in 60 seconds [Infographic], 8 July 2014, http://blog.qmee.com/online-in-60-seconds-infographic-a-year-later/
Image 2: Statista 2017, Number of Twitter Users Worldwide: https://www.statista.com/statistics/303681/twitter-users-worldwide/
The sheer volume and variety of this data present an interesting opportunity to harness it in a form that allows for specific market predictions. In recent years this information has repeatedly been shown to influence markets dramatically [1], a phenomenon recently termed “collective intelligence” [2]. For a single person (or even a group of people), harnessing this information successfully is increasingly infeasible due to its volume and asynchronous character. The need for automated collection, extraction, processing and aggregation of this data has long been recognised [4], and advances in machine learning techniques have led to exciting new tools for its analysis.
It is widely accepted that the automated extraction of useful information from text is a complex challenge. Apart from technical constraints, word-sense disambiguation remains one of the major challenges in the computerised processing of unstructured textual data. Words frequently change their meaning depending on context, consequently changing the meaning of the surrounding text. In fact, these functional structures are so complex that mastering them is among the last stages of language acquisition in infants, with most of us needing over 2.5 years to master even the simplest applications [5]. For example, one would not consider the word “long” as either exceptionally positive or negative. However, most humans would rate “the laptop’s start-up time was long” as negative, while “the laptop’s battery life was long” would be considered positive.
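The “long” example can be made concrete with a small sketch. It assumes two hand-made lexicons (`WORD_LEXICON` and `CONTEXT_LEXICON`, both hypothetical and invented for illustration) and simply contrasts scoring words in isolation with scoring an adjective together with the aspect it describes. It is not a description of any production sentiment engine.

```python
import re

# Hypothetical word-level lexicon: "long" carries no polarity on its own.
WORD_LEXICON = {"long": 0, "great": 1, "poor": -1}

# Hypothetical context lexicon: polarity of an adjective given the aspect it describes.
CONTEXT_LEXICON = {
    ("start-up time", "long"): -1,
    ("battery life", "long"): 1,
}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens, allowing inner hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def word_level_score(text):
    """Sum the polarity of each word, ignoring context entirely."""
    return sum(WORD_LEXICON.get(token, 0) for token in tokenize(text))


def context_score(text):
    """Sum the polarity of every (aspect, adjective) pair found in the text."""
    lowered = text.lower()
    tokens = set(tokenize(text))
    return sum(
        polarity
        for (aspect, adjective), polarity in CONTEXT_LEXICON.items()
        if aspect in lowered and adjective in tokens
    )


slow_boot = "The laptop's start-up time was long"
good_battery = "The laptop's battery life was long"

for sentence in (slow_boot, good_battery):
    print(sentence, word_level_score(sentence), context_score(sentence))
```

The word-level score is identical for both sentences, while the context-aware score separates them. Real systems cannot enumerate such pairs by hand, which is why learned models are used instead.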
Sentiment Analysis and Natural Language Processing
Natural Language Processing (NLP), Sentiment Analysis and Machine Learning are widely recognised as the key tools in transforming the plethora of available text into meaningful information.
Sentiment Analysis (also known as Opinion Mining) seeks to identify and categorise opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic is positive, negative or neutral [6,7]. Sentiment Analysis builds on NLP, which was originally referred to as computational linguistics and, at its most basic, refers to the use of algorithms that allow computers to process and understand human languages.
Although the field has been researched since the late 1940s, it has grown rapidly over the last decade. This can be attributed to:
- the advent of the Internet and the increased availability of large amounts of electronic text; and
- the availability of computers with increased speed and memory, and the consequent improvement in Machine Learning. Machine Learning, also frequently referred to as Artificial Intelligence (AI) or narrow AI, describes the practice of using algorithms to parse data, learn from it, and then make a classification or prediction. Rather than hard-coding a specific set of instructions for a particular task, the machine is “trained” using large amounts of data and algorithms that give it the ability to learn how to perform the task [8].
Over the last decade, various NLP algorithms have been developed, combined with Machine Learning techniques and applied in numerous (commercial) applications. For example, many popular spam filters include Naïve Bayes classifiers trained on NLP-derived features to identify spam email. A similar approach can be used in Sentiment Analysis to classify text as positive, negative or neutral (sentiment polarity) [9]. Although these models tend to treat words as atomic units, with no notion of word similarity or text structure, researchers and practitioners argue that they frequently outperform more complex models, at higher computational efficiency, and consequently are more applicable to
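To illustrate the Naïve Bayes approach to sentiment polarity, here is a minimal sketch of a multinomial Naïve Bayes classifier over bag-of-words counts with add-one (Laplace) smoothing. The class name `NaiveBayesSentiment`, the tokenizer and the six training headlines are all assumptions made up for this example. A real model would need far more in-domain training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens (words are atomic units)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength
        self.class_docs = Counter()             # number of documents per label
        self.word_counts = defaultdict(Counter) # word frequencies per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.class_docs.values())
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training headlines, for illustration only.
train_texts = [
    "oil prices rally on strong demand",
    "gas output surges to record high",
    "crude prices slump on weak demand",
    "pipeline outage cuts gas supply",
    "opec meeting scheduled for thursday",
    "report on gas storage due today",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("strong demand lifts prices"))
print(model.predict("weak demand, prices slump"))
```

Note that the classifier sees each headline only as a bag of independent words, which is exactly the simplification described above.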
As Machine Learning classification techniques require large quantities of relevant in-domain data for training, the highly varied and specialised topics in market news present a unique challenge. A recent approach is Google's Word2Vec algorithm, which has the potential to allow practitioners to overcome these limitations. It takes a text corpus as input and produces word vectors as output, by constructing a vocabulary from a training data set and then learning vector representations of the words it contains. As a result, many syntactic and semantic regularities are captured in the vectors [10]. For example, ‘melancholy’ would be closest to ‘bittersweet’ (sentiment) as opposed to ‘thoughtful’ and ‘warm’ (semantic). When properly employed, this approach can help capture the sentiment of words not provided during training.
NLP, Sentiment Analysis and Machine Learning are heavily researched fields, and substantial breakthroughs are to be expected over the next few years. It should be noted that in academic settings, more complex algorithms, such as Recurrent Neural Network based language models, have already shown promising results. However, to date their computational complexity does not permit their use in robust applications that rely on near-real-time processing of information.
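The idea of using word vectors to cover words missing from the training data can be sketched as follows. The three-dimensional vectors in `VECTORS` are hand-made and purely hypothetical; in practice they would come from a model such as Word2Vec trained on a large corpus and would have hundreds of dimensions. The sketch only shows the final step: an unlabelled word inherits the sentiment of its nearest labelled neighbour by cosine similarity.

```python
import math

# Hypothetical 3-d embeddings, hand-made for illustration only.
VECTORS = {
    "rally": (0.9, 0.1, 0.2),
    "surge": (0.8, 0.2, 0.1),
    "slump": (-0.9, 0.1, 0.2),
    "plunge": (-0.8, 0.2, 0.1),
    "soar": (0.85, 0.15, 0.15),     # no sentiment label available
    "tumble": (-0.85, 0.15, 0.15),  # no sentiment label available
}

# Hypothetical sentiment labels known from training data.
LABELLED = {
    "rally": "positive",
    "surge": "positive",
    "slump": "negative",
    "plunge": "negative",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def infer_sentiment(word):
    """Return the known label, or the label of the most similar labelled word."""
    if word in LABELLED:
        return LABELLED[word]
    nearest = max(LABELLED, key=lambda seed: cosine(VECTORS[word], VECTORS[seed]))
    return LABELLED[nearest]


for word in ("soar", "tumble"):
    print(word, infer_sentiment(word))
```

Nearest-neighbour lookup is only one possible design choice. The same vectors could equally be fed as features into a classifier.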
1. Bollen, J., Mao, H. & Zeng, X. Twitter mood predicts the stock market. J. Comput. Sci. 2, 1–8 (2011).
2. Mitra, L. & Mitra, G. in The Handbook of News Analytics in Finance 1–39 (John Wiley & Sons, Ltd., 2012). doi:10.1002/9781118467411.ch1
3. Godbole, N. & Srinivasaiah, M. Large-scale sentiment analysis for news and blogs. Conf. Weblogs Soc. Media (ICWSM 2007) 219–222 (2007). doi:10.1177/01461079070370040501
4. Morrison, S. So Many, Many Words. The Wall Street Journal (2008).
5. Language acquisition. Wikipedia (2017). Available at: https://en.wikipedia.org/wiki/Language_acquisition.
6. Agarwal, A., Xie, B., Vovsha, I., Rambow, O. & Passonneau, R. Sentiment analysis of Twitter data. Assoc. Comput. Linguist. 30–38 (2011).
7. Sentiment Analysis. OxfordDictionaries.com (2016). Available at: https://en.oxforddictionaries.com/definition/sentiment_analysis.
8. Copeland, M. What’s the Difference Between Artificial Intelligence, Machine Learning, and Deep Learning? (2016).
9. Pang, B., Lee, L. & Vaithyanathan, S. Thumbs up?: sentiment classification using machine learning techniques. Proc. Conf. Empir. Methods Nat. Lang. Process. 79–86 (2002). doi:10.3115/1118693.1118704
10. Mikolov, T., Yih, S. W. & Zweig, G. Linguistic Regularities in Continuous Space Word Representations. in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013) (2013).