CryptoCaptain: The MINER of knowledge in the streams of news

News as we know it today has come a long way.

In the past, you were most likely to hear about something new happening by word of mouth. Today, news is all around you: on your smartphone, smartwatch, tablet, laptop, computer, TV, and home assistant, as well as in newspapers and magazines.

Illustration: CryptoCaptain as the text mining expert

Why SHOULD YOU read the news?

First, reading the news is a great way to stay informed about things happening around you and around the world. Whether you want to find out when your favorite band is coming to town or why the cryptocurrency market is plunging right now, the news is the place to start.

Second, reading the news helps you develop an open and critical mind. When you are well informed and can base your judgment on many viewpoints, you are more likely to distinguish true claims from false ones.

Say you decide to invest in the cryptocurrency market because you believe that in 20 years everything will run on the blockchain.

One morning you read in the news about a rumored ban of bitcoin in China, which causes panic in the cryptocurrency market.

However, this is not the first time that you have read about a possible ban of bitcoin in China, and in your experience such rumors are meritless.

Instead of selling your cryptocurrency, you buy more at a discount, and two weeks later you watch the cryptocurrency market rebound to its previous levels.

Third, reading the news helps you learn something new every day.

For one thing, this makes you more knowledgeable and, as a result, a better conversationalist.

For another, you are more likely to come up with bright, new ideas in the future.

What news streams are there?

The variety of news channels that we have today is astonishing.

Newspapers, specialized magazines, and dedicated websites publish in-depth articles on various topics. Such articles are informative, well researched, cross-checked and verified, and generally of high quality. However, they take a lot of time to write and are few and far between.

In contrast to the long and in-depth articles you find in newspapers, you find short yet still informative messages on Twitter.

An advantage of Twitter over traditional news websites is that, in a short time, you can get up to date with the events that matter to you. That is, as long as you are following the right people.

However, considering that all major news agencies have a presence on Twitter, finding reliable sources on Twitter doesn’t take long.

Other sources of news are social media sites such as Facebook and LinkedIn.

These social media sites allow you to share your opinion, as well as interesting news that you found elsewhere, with your contacts.

Social media sites are, however, notorious for fake news, which can spread rapidly as people read and share articles without fact-checking them. This usually happens because people assume that whoever shared the article has already done the fact-checking for them.

Online forums such as Bitcointalk are great for in-depth discussions.

The bigger or the more specialized a forum is, the higher the probability that specialists will participate in the discussion and share their expertise and knowledge. In this way, you can often get insider information on promising new research or technology.

Finally, personal blogs may offer in-depth analysis on various topics, or they may just be people talking about their day. In this sense, personal blogs can be a gold mine. However, you first have to go through tons of rock to extract a couple of gold nuggets.

Why mining news is DIFFICULT

Reading is an ability that we are not born with. Instead, it takes months and years to practice and perfect it.

Similarly, computers presented with a screenshot of a webpage cannot distinguish the individual webpage elements from one another. Instead, we have to teach them where to look and what is and is not important.

For example, an article consists of a headline, author, publishing date, body, and comments. However, it is common to also find references to other articles, advertisements, disclaimers, information about the author, etc. within the article body.

While humans would normally just ignore such information when reading an article, computers would interpret it as part of the article, which can lead to biases in the later analysis.

Another common problem that computers may face when presented with an article is that the article may cover multiple topics.

If two topics are similar to one another or if they are presented in a similar light, then the results would be consistent with those of a human reader.

However, if one of the topics is presented in a positive light and the other in a negative one, then a computer may mix the two together, leading to results that are the opposite of what is expected.

Further issues may arise, for example, if the author used irony or idioms in the article.

EXTRACTING KNOWLEDGE FROM TEXT

To extract useful and valuable knowledge for crypto and Bitcoin investors from crypto-related texts, we utilize various techniques of natural language processing, information extraction, machine learning, and text classification.

To start, entity extraction is a technique for identifying key elements in a text, such as crypto and other assets, people, companies, and locations. In a second step, one can infer the author’s opinion regarding the entities. For instance, the author could discuss their projection of where the crypto market or a specific crypto asset is heading. Entities can be extracted, for example, using a dictionary, rules, or machine learning.

In the first case, one would create a dictionary of common representations of, say, a crypto asset. For Bitcoin, one would expect “BTC”, “BTCUSD”, “Bitcoin”, and perhaps “digital gold”. The computer would then look for those words or phrases in a text to identify whether the entity is present in the text.

A disadvantage of this approach is that you can only identify entities that are already included in one of the dictionaries used. Also, neither disambiguation nor context is taken into consideration, which makes it hard to keep the number of false positives low.
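The dictionary approach can be sketched in a few lines of Python. This is only a minimal illustration: the dictionary contents are the Bitcoin examples mentioned above, and the names `ENTITY_DICTIONARY` and `find_entities` are hypothetical.

```python
import re

# Hypothetical dictionary: canonical entity -> surface forms named in the text above.
ENTITY_DICTIONARY = {
    "Bitcoin": ["BTC", "BTCUSD", "Bitcoin", "digital gold"],
}


def find_entities(text, dictionary=ENTITY_DICTIONARY):
    """Return a sorted list of canonical entities whose surface forms occur in text."""
    found = set()
    for entity, surface_forms in dictionary.items():
        for form in surface_forms:
            # \b restricts the match to whole words or phrases.
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(entity)
                break  # one matching surface form is enough for this entity
    return sorted(found)


print(find_entities("Analysts call BTC the new digital gold."))  # ['Bitcoin']
print(find_entities("Ethereum fees dropped this week."))         # []
```

The second call shows the limitation described above: Ethereum is not in the dictionary, so it is not found.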

Therefore, in the second case, one would define grammatical rules that identify entities in a text by also considering the context and making sure that nothing else was meant.

Such an approach can identify entities with high precision, but it often misses entities that do not fit any of the rules.
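As a rough illustration of the rule-based idea, the sketch below uses two simple context patterns in place of full grammatical rules. Both rules and the function name `find_tickers` are assumptions made for this example, not a description of a production system.

```python
import re

# Two hypothetical context rules for ticker symbols:
# 1. a cashtag such as "$ADA"
# 2. a symbol in parentheses right after a capitalized name, e.g. "Ethereum (ETH)"
CASHTAG_RULE = re.compile(r"\$([A-Z]{2,5})\b")
NAME_SYMBOL_RULE = re.compile(r"\b[A-Z][a-z]+ \(([A-Z]{2,5})\)")


def find_tickers(text):
    """Return ticker symbols matching at least one rule, in order of appearance."""
    matches = []
    for rule in (CASHTAG_RULE, NAME_SYMBOL_RULE):
        for match in rule.finditer(text):
            matches.append((match.start(), match.group(1)))
    tickers = []
    for _, symbol in sorted(matches):
        if symbol not in tickers:
            tickers.append(symbol)
    return tickers


print(find_tickers("Ethereum (ETH) rallied while $ADA fell."))  # ['ETH', 'ADA']
print(find_tickers("THE market fell, and ETH followed."))       # []
```

The second call shows both sides of the trade-off: the capitalized word "THE" is correctly ignored, but the bare mention of "ETH" is missed because it matches neither rule.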

In the third case, one would fit a computer model to a large set of annotated training data, so that the model can learn to identify, just like a human, what an entity is (and what an entity is not).

A disadvantage of this approach is that the preparation of an exhaustive training data set is very laborious.

Text classification and topic modeling are two techniques with which we can determine the topic of a text and also the author’s opinion or sentiment about a topic or entity.

The difference is that topic modeling is not tied to a set of pre-defined topics but is open to new topics that can arise. Possible topics in the crypto space might relate to crypto regulation in a country or the adoption of cryptos in the banking sector.

In contrast, text classification assigns a text to one of several pre-defined categories. For example, it can be used to determine whether an author’s opinion about a market is bullish or bearish.

Very basic text classification would look for words in the text that match words in a pre-defined dictionary. The more words of a pre-defined category are found in the text, the higher the probability that the text matches the category.
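A minimal sketch of such a dictionary-based classifier might look as follows. The word lists are invented for illustration and far too small for real use, and the "probability" here is simply each category's share of all dictionary hits.

```python
import re
from collections import Counter

# Hypothetical category dictionaries; a real word list would be far larger.
CATEGORY_WORDS = {
    "bullish": {"rally", "surge", "breakout", "buy", "gain", "uptrend"},
    "bearish": {"crash", "plunge", "selloff", "sell", "loss", "downtrend"},
}


def classify(text, category_words=CATEGORY_WORDS):
    """Return (best_category, scores); a score is that category's share of all hits."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for word in words:
        for category, vocabulary in category_words.items():
            if word in vocabulary:
                hits[category] += 1
    total = sum(hits.values())
    if total == 0:
        return None, {}  # no dictionary word found, so no decision is made
    scores = {category: hits[category] / total for category in category_words}
    best = max(scores, key=scores.get)
    return best, scores


label, scores = classify("Traders expect a rally and a breakout after the recent plunge.")
print(label, scores)
```

In this example two bullish words and one bearish word are found, so the text is labeled bullish with a score of two thirds.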

Alternatively, one would fit a model to a set of annotated training data using machine learning and then use the fitted model to determine the category of the text. A classic approach to machine learning-based text classification uses support vector machines. Recent approaches resort to deep learning and large language resources.
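To make the fit-then-predict workflow concrete, here is a small sketch that needs only the Python standard library. Note that it swaps in a multinomial naive Bayes classifier instead of a support vector machine, because naive Bayes fits in a few lines. The training texts are invented, and the class name `NaiveBayesClassifier` is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one smoothing (a stand-in for an SVM)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior of the category
            for word in tokenize(text):
                if word in self.vocabulary:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny, invented training set for illustration only.
training_texts = [
    "bitcoin breaks out and buyers push prices higher",
    "strong rally as institutions keep buying",
    "prices crash after exchange hack and panic selling",
    "market plunges as regulators announce a ban",
]
training_labels = ["bullish", "bullish", "bearish", "bearish"]

model = NaiveBayesClassifier().fit(training_texts, training_labels)
print(model.predict("panic selling after the ban"))  # bearish
print(model.predict("buyers fuel a strong rally"))   # bullish
```

With realistic amounts of annotated data, one would normally reach for an established library rather than a hand-written model like this one.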

Topic modeling would feed a news stream to a machine learning algorithm and let it identify topics on its own by looking for similarities among text pieces. In the next step, one would either manually review similar texts to decide on the exact topic or use text summarization techniques to do so.
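The idea of grouping texts by similarity can be sketched as follows. This is a deliberately simple greedy grouping based on cosine similarity of word counts, not a full topic model; the threshold, the stop word list, and the headlines are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "as", "for", "in", "is", "of", "on", "the", "to"}


def bag_of_words(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_similarity(texts, threshold=0.3):
    """Greedily put each text into the first group whose summed word counts are similar enough."""
    groups = []  # each group: {"centroid": Counter, "members": [index, ...]}
    for index, text in enumerate(texts):
        vector = bag_of_words(text)
        for group in groups:
            if cosine_similarity(vector, group["centroid"]) >= threshold:
                group["members"].append(index)
                group["centroid"].update(vector)
                break
        else:
            # No existing group is similar enough, so this text starts a new one.
            groups.append({"centroid": Counter(vector), "members": [index]})
    return groups


headlines = [
    "China considers ban on bitcoin trading",
    "Banks adopt blockchain for payments",
    "Regulators in China discuss bitcoin ban",
    "Major banks test blockchain payments",
]
for group in group_by_similarity(headlines):
    top_words = [word for word, _ in group["centroid"].most_common(3)]
    print(group["members"], top_words)
```

The two headlines about China end up in one group and the two about banks in another. The most frequent words of each group give a first hint of the topic, which a person or a summarization step would then name.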

Sentiment analysis is a technique with which we can evaluate opinions expressed in a given text.

For example, one can determine whether the author’s opinion about a person, a company or a (crypto) asset is positive, negative, or neutral.

We can also determine what emotions are expressed in the text, e.g., anger, joy, sadness, surprise, etc.

Similar to entity extraction and text classification, sentiment analysis can be performed using a dictionary of positive and negative words.

Alternatively, one can first fit a model to a set of annotated training data and then use the model to analyze the opinions in the text.
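The dictionary variant can be sketched per entity, which also helps with articles that present one topic positively and another negatively. The polarity words, the window size, and the simple negation handling below are all assumptions for illustration, and the sketch only handles single-word entity names.

```python
import re

# Hypothetical polarity dictionary; real lexicons contain thousands of entries.
POLARITY = {"strong": 1, "promising": 1, "gain": 1, "weak": -1, "risky": -1, "loss": -1}
NEGATIONS = {"not", "no", "never"}


def entity_sentiment(text, entity, window=3):
    """Sum the polarity of dictionary words within `window` tokens of each entity mention."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for position, token in enumerate(tokens):
        if token != entity.lower():
            continue
        start = max(0, position - window)
        end = min(len(tokens), position + window + 1)
        for i in range(start, end):
            polarity = POLARITY.get(tokens[i], 0)
            # Flip the polarity when the preceding token is a negation ("not promising").
            if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
                polarity = -polarity
            score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sentence = "Bitcoin looks strong this quarter, whereas the outlook for Ripple is not promising."
print(entity_sentiment(sentence, "Bitcoin"))  # positive
print(entity_sentiment(sentence, "Ripple"))   # negative
```

Scoring only the words near each mention keeps the positive statement about one asset from leaking into the result for the other.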

Some simple low-level techniques on which the natural language processing methods outlined above rely include:

1) Boilerplate removal to extract the body of an article without advertisements, disclaimers, information about the author, etc.

2) Tokenization to identify individual words, characters or sequences of words or characters in the text.

3) Stemming and lemmatization to determine the root form of inflected words in the text.

4) Part-of-Speech tagging to identify which words are nouns, adjectives, verbs, etc.

5) Stop word removal to remove words that do not contribute to the meaning of a sentence.
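Three of these steps (tokenization, stop word removal, and stemming) can be chained into a small preprocessing pipeline, sketched below. The stop word list is a tiny invented sample, and the suffix stripping is a crude stand-in for a real stemmer. Boilerplate removal and part-of-speech tagging are left out because they need more than a few lines.

```python
import re

STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Investors are buying the dips and holding coins"))
# ['investor', 'buy', 'dip', 'hold', 'coin']
```

The resulting list of normalized words is the kind of input that dictionary lookups and machine learning models typically work on.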

FURTHER READING

An excellent scientific introduction to natural language processing is given by the book “Foundations of Statistical Natural Language Processing” by Manning and Schütze, which is freely and fully available online here. A quick overview of NLP is available here.

A classic resource for sentiment analysis and opinion mining is Bing Liu’s book, which is available online.

TOOLS

An excellent state-of-the-art tool set for automatic text mining in general, including text classification, information extraction, and natural language processing, is spaCy, which uses the Python programming language.

Also in Python, NLTK is a classic set of tools for natural language processing, although it can be a bit slow because its primary focus is teaching.

scikit-learn is a good tool for basic machine learning in Python and for applying it to text classification and topic modeling.

For more advanced and possibly more accurate techniques using deep learning with large neural networks, Chollet’s book provides an introduction with Python examples.

About CryptoCaptain

CryptoCaptain helps investors answer the question “When should I invest in the cryptocurrency market?”.

To help answer this question, we’ve developed the Bull Market Compass.

The Bull Market Compass is an investment signal service that guides long-term investors through the volatile cryptocurrency market. At its core, it is a powerful custom-built A.I. with remarkable predictive analytics capabilities.

Furthermore, it protects investors’ capital by detecting bear markets early and warning investors ahead of time.

Finally, the Bull Market Compass saves investors’ time by sending out quality investment signals and by monitoring the market for them.

  • Invest early in emerging bull markets
  • Cash out ahead of market crashes
  • Count on years of experience
  • Spend only a couple of minutes / month
