Part 6: Text Analysis Demystified

How AI understands texts - or not


Uwe Weinreich, the author of this blog, usually coaches teams and managers on topics related to strategy, innovation and digital transformation. Now he is seeking a direct confrontation with Artificial Intelligence.

The outcome is uncertain.

Stay informed via Twitter or LinkedIn.

Already published:

1. AI and Me – Diary of an experiment

2. Maths, Technology, Embarrassment

3. Learning in the deep blue sea - Azure

4. Experimenting to the Bitter End

5. The difficult path towards a webservice

6. Text Analysis Demystified

7. Image Recognition and Surveillance

8. Bad Jokes and AI Psychos

9. Seven Management Initiatives

10. Interview with Dr. Zeplin (Otto Group)

Now it's getting exciting. Understanding texts and language is something very human. When machines make the leap, they come so close to us that it feels uncomfortable, at least until we get used to it.

Computer science has long hoped that machines would be able to understand natural language. Understanding text, i.e. analysing written and therefore often machine-readable text, is the easier exercise. Understanding spoken language is much more difficult because it must first be converted into machine-readable form. Anyone who has ever installed a speech recognition program such as Dragon NaturallySpeaking or a similar product on their computer will not only be pleased that dictation now works quite well, but will also know what monstrous software packages are needed to solve just one task: converting spoken into written text.

Language remains something very human and is essential for building and maintaining relationships. More than 50 years ago, the computer scientist Joseph Weizenbaum had an astonishing experience with this. In 1966 he developed Eliza, a small computer program that imitated techniques of person-centered psychotherapy, which was popular at the time. Given the very limited technical possibilities of the day, this was of course only possible in a very awkward and schematic way.



Weizenbaum was all the more surprised when, entering his office one day, he found his secretary at the computer. She asked him to wait outside for a while because she was in the middle of a very important personal conversation with Eliza. Of course, she knew how limited and schematic the program was; nevertheless, the interaction triggered the feeling of intimacy of a personal conversation.

Try it yourself here with the script that Norbert Landsteiner (2005) provided. The program runs in your browser only. No data is transferred or stored to the server.

Talk to Eliza in confidence

Decades later, a similar affection could be observed. During the Tamagotchi boom, the virtual growth and especially the death of chicks (actually small black pixel clouds) plunged entire families into emotional crises. Here, too, a close and emotional computer-human relationship emerged.

This inclination to establish relationships is not due to the capabilities of the machine (not much seems to be necessary) but to our own psychological makeup. Weizenbaum summed it up as follows: "Most men don't understand computers to even the slightest degree. So, unless they are capable of very great scepticism (…), they explain the computer's feat only by bringing to bear the single analogy available to them, that is, their model of their own capacity to think." This means that we humanize computers and, under certain circumstances, attribute more human qualities to them than they actually possess.

Understanding Artificial Intelligence today

From a technical point of view, conditions have changed dramatically. Computing and storage capacities have increased, and algorithms have become many times more powerful. Are we therefore right to attribute human or even superhuman abilities to computers today? Let's see what Azure has to offer.

This lesson provides an introduction to AI text and speech recognition. It begins with text analysis, which is applied to three texts: Kennedy's "Moon Speech", Lincoln's "Gettysburg Address" (a text that probably no American AI computer can ignore during text analysis) and an unemotional Microsoft text on cognitive services. To anticipate the result: Azure AI concludes that the three texts differ substantially. That's not surprising. The real questions are how the system arrives at this conclusion and what it really "understands".

Analysing texts

One of the first steps of machine text analysis is to clean the text (numbers and punctuation marks out). Better results can also be obtained by reducing words to their stems, i.e. "beautiful", "beauty" and "beautifying" all become "beauty". Then we can finally take the essential step: counting the words and sorting them by frequency. The result is a list of words ranked by how often they occur.


It is noticeable that common words such as "the", "and", "of" and others are used most frequently. These are of course not as decisive for the meaning of the text as words like "space" and "science". Therefore, such "stop words" are removed in a further step. After these steps, which essentially consist of reducing, excluding, counting and sorting, you get a collection of words that can give you a first impression of the text. For the Kennedy speech, the characteristic words are "new", "go", "space", "say", "one", "sea", "choose", "hostile", "moon". Unfortunately, such a sequence of words does not have any of the captivating qualities of a charismatic speech.
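The steps described so far (cleaning, excluding stop words, counting and sorting) can be sketched in a few lines of Python. This is only a minimal illustration, not what Azure runs internally: the stop-word list is a tiny invented one, the function names `clean` and `top_words` are my own, and stemming is left out because it would normally require a library such as NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "and", "of", "to", "a", "in", "we", "is", "that", "it",
    "this", "not", "but", "because", "they", "are",
}

def clean(text):
    """Lowercase the text and keep only alphabetic words (numbers and punctuation out)."""
    return re.findall(r"[a-z]+", text.lower())

def top_words(text, n=5):
    """Drop stop words, count the remaining words and return the n most frequent."""
    words = [w for w in clean(text) if w not in STOP_WORDS]
    return Counter(words).most_common(n)

sample = (
    "We choose to go to the moon. We choose to go to the moon in this decade "
    "and do the other things, not because they are easy, but because they are hard."
)
print(top_words(sample, 3))  # [('choose', 2), ('go', 2), ('moon', 2)]
```

Even on this single passage the effect is visible: once "we", "to" and "the" are gone, the counting surfaces "choose", "go" and "moon", but nothing of the rhetoric survives.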

The next step in the analysis is a Term Frequency - Inverse Document Frequency analysis (tf-idf analysis). Probably every computer science freshman knows it, but I was surprised and thrilled by this procedure. Years ago, when we first experimented with text analysis in my company, we used Bayes' theorem extensively. Its advantages are that it is quite easy to implement and classifies texts quite well if you train the system with enough data. The big disadvantage, which led us to abandon the experiments, was that we regularly ran into memory overflows, especially with longer and more complex texts. So many statistics had to be loaded into main memory, read from databases and written back to them that the programs became slower and slower and eventually froze.

tf-idf is the process we would have needed back then to significantly reduce the amount of data to be processed. It sorts out words that occur in all texts and gives more weight to those that are contained in only one. A Bayesian system somehow achieves that as well, but in a much more cumbersome way. The leaner algorithm wins in the long run. Here is a visualization of tf-idf. Admittedly, it probably only thrills math freaks.
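For those who prefer code to diagrams, the weighting can be sketched as follows. I am assuming the plain textbook variant here (term frequency multiplied by the logarithm of the number of documents divided by the number of documents containing the word); libraries and services often use smoothed variants, so the numbers will differ from theirs. The three miniature "documents" are invented stand-ins, not the real speeches, and `tf_idf` is my own function name.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return a {word: weight} dict for every document.

    docs maps a document name to its list of (already cleaned) words.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for words in docs.values() for word in set(words))
    weights = {}
    for name, words in docs.items():
        counts = Counter(words)
        weights[name] = {
            # term frequency * inverse document frequency
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return weights

# Invented miniature documents, for illustration only.
docs = {
    "moon": ["the", "nation", "goes", "to", "space"],
    "gettysburg": ["the", "nation", "honours", "the", "dead"],
    "services": ["the", "services", "analyse", "speech"],
}

weights = tf_idf(docs)
for word in ("the", "nation", "space"):
    print(word, round(weights["moon"][word], 3))
```

In this toy example "the" occurs in all three documents and therefore gets a weight of exactly zero, "nation" occurs in two and gets a small weight, and "space" occurs in only one and gets the highest weight. That is the whole trick.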


The results

If we now compare the three texts, they can be clearly classified based on the dominant words.

Kennedy's Moon Speech
  • space
  • go
  • one

Lincoln's Gettysburg Address
  • nation
  • dedicated
  • dead

Microsoft Cognitive Services
  • services
  • speech
  • cognitive

Yes, that's enough to distinguish the texts from each other, even with an IQ below 50 (see blog post 1 on this topic). But it has nothing to do with understanding. The AI can only tell that the texts are different, without any knowledge of what they are about.

Missing connotation

Human communication thrives on context and connotation. All the subtleties that make it clear whether something is meant seriously or ironically, lovingly or derogatorily, objectively or angrily help us to understand language. All this has been destroyed in the analysis steps above. For example, punctuation marks that identify a question and provide structure have completely disappeared, and words have been torn from their context and deprived of their connecting words.

AI text analysis offers a solution for this as well: sentiment analysis. It does nothing more than search texts for words that are known to occur frequently in positively or negatively connotated texts. Based on this, a probability can be calculated that a text is rather positive or rather negative. If neither probability is high, the text is classified as neutral. These algorithms always have difficulties with irony. And in the early days, tweets such as "I just arrived in Bad Doberan" (a German town) were classified as highly negative because they contain the negative signal word "bad". So, eyes open! The world is not (only) English.
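How little is behind such a signal-word approach becomes clear when you write it down. The following is a deliberately naive sketch under my own assumptions: the two word lists are invented and far too short, the function name `sentiment` and the threshold are hypothetical, and instead of a trained probability it computes a simple score between -1 and +1. Real services are more sophisticated, but the basic weakness is the same.

```python
import re

# Hypothetical miniature lexicons; real ones contain thousands of weighted entries.
POSITIVE = {"good", "great", "love", "happy", "wonderful"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def sentiment(text, threshold=0.5):
    """Classify a text as positive, negative or neutral by counting signal words."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return "neutral"  # no signal words at all
    score = (pos - neg) / (pos + neg)  # -1 (all negative) to +1 (all positive)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(sentiment("What a wonderful day"))           # positive
print(sentiment("The train leaves at noon"))       # neutral
print(sentiment("I just arrived in Bad Doberan"))  # negative
```

The last line reproduces the Bad Doberan problem: the sketch has no idea that "Bad" is part of a place name, it only sees a word from its negative list.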

We must not overestimate AI speech recognition

Sentiment scores are of course a very reduced and coarse measure. This will certainly improve over time, but it will remain far from how people understand texts. It is impressive that machines are able to process huge amounts of text, extract key figures and even generate recommendations from them. Still, we should not be too impressed. Most of what we recognize as intelligence here is still our own projection, as described by Weizenbaum 50 years ago. Whenever we see certain behaviour in others (computers, dogs, Tamagotchis), we assume that the behaviour is based on processes similar to our own. A critical attitude towards AI's textual insights will be very beneficial, at least for the next few years.

Less surprising but pleasing, however, are the recognition rates for spoken language, which is of course also supported by Azure. I don't need to go into this in depth here, because everyone can experience it for themselves with speech recognition software or a voice-controlled assistant.


Weizenbaum, J. (1976). Computer Power and Human Reason: From Judgment to Calculation. San Francisco: W. H. Freeman, pp. 9-10.


published: December 3, 2018, © Uwe Weinreich
