The historical background of text analysis
Text analysis has been around since the days of the Second World War. You read that right!
Alan Turing, the English mathematician, computer scientist, and cryptanalyst, was the brain behind it.
He cracked enemy codes sent via Morse code and telegram, work credited with shortening the war by two years. This was during a time when “computers” meant human beings who did calculations.
There was no Artificial Intelligence or Machine Learning. Bright-minded individuals like Alan Turing and several others would sit with pencils to mark alphabetical characters and words that had hidden meanings in them.
The data could come from varied sources: newspaper feature articles, Morse code transmissions, telegram messages, even promotional flyers.
If you think about it, this is all what modern computer scientists would call unstructured data: data with no common structure that would make it easy to analyze.
Today, we have unstructured data that comes in the form of social media content, emails, customer feedback, survey responses, and whatnot.
There is an information explosion underway: an estimated 2.5 quintillion bytes of data are created each day, or even more. In such a scenario, we need a powerful automated system to sift through textual data and make sense of it. It is here that text analytics steps in.
Text analytics leverages Artificial Intelligence and Machine Learning to pick insights out of heaps of unstructured data.
In fact, text analytics is a vast topic that cannot be summarized in a single page. (But we tried. You can read it at the link below.)
Related reading: Everything you need to know about text analytics.
Here is a quick overview of how text analysis works. The process can be broken down into four stages:
1st stage: Data gathering
Any analytic exercise requires data to begin with. The larger the dataset, the better the outcome or prediction. If you are planning to gauge customer loyalty from Net Promoter Score (NPS) survey responses, you will have to bring the responses collected through email, social media, and review websites together in a single location.
Now there are two sources from where you can collect the required data:
- Internal data
- External data
Internal data can be collected from sources that are owned, controlled and managed by the organization. They include data from:
- CRM software
- Chat conversations
- Call recordings
- Email exchanges
- NPS survey responses
- Mobile app usage data
- Other customer databases
While internal data can be accessed easily and without hindrance, populating external data about a business is harder. There are external tools that help with that.
They scrape the web and pick up data that relates to the business: mentions, product reviews, star ratings, blogs, comparison articles, etc.
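For illustration, here is a minimal sketch of the unification step, assuming each channel's export lands as a CSV file with a common response column (the file names and column are hypothetical):

```python
import pandas as pd

# Hypothetical export files; replace with your own channel exports
SOURCES = {
    "email": "nps_email_responses.csv",
    "social": "nps_social_responses.csv",
    "reviews": "nps_review_site_responses.csv",
}

frames = []
for channel, path in SOURCES.items():
    df = pd.read_csv(path)    # assumes each file has a "response" column
    df["channel"] = channel   # tag each row with where it came from
    frames.append(df)

# One unified table, ready for the data preparation stage
responses = pd.concat(frames, ignore_index=True)
print(responses.head())
```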
Once the internal and external data is populated, we proceed to prepare the data for analysis. The data preparation process is an extensive one.
In fact, the quality of data preparation is what determines the success of the analysis.
2nd stage: Data preparation
Data preparation is the crux of text analytics. It is in this stage that the data collected and unified is put to work. A variety of technologies like Machine Learning and Natural Language Processing are used at this stage.
These tools perform three processes that help the analytical system understand the meaning of the text. They are:
- Tokenization,
- Part-of-speech Tagging, and
- Parsing
Tokenization
The process of breaking a sentence or phrase into multiple parts, each with its own identity and meaning, is known as tokenization.
Let’s take an example sentence: “Text analytics gives insights.”
In tokenization, each word in the sentence would be broken down as:
[“Text”, “analytics”, “gives”, “insights”]
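A minimal sketch of this step in Python, using a simple regular expression (production systems typically rely on library tokenizers such as those in NLTK or spaCy):

```python
import re

def tokenize(sentence: str) -> list[str]:
    # \w+ captures runs of word characters; punctuation is dropped
    return re.findall(r"\w+", sentence)

print(tokenize("Text analytics gives insights."))
# ['Text', 'analytics', 'gives', 'insights']
```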
Once tokenization is completed, the system moves on to part-of-speech tagging.
The purpose of this process is to determine whether the words in use are adjectives, nouns, verbs, etc. This helps in identifying whether the tone and sentiment of the sentence is positive or negative.
So, in part-of-speech tagging, the sentence would be broken down as:
[“Text”: NOUN, “analytics”: NOUN, “gives”: VERB, “insights”: NOUN]
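As a hedged sketch, here is how this looks with spaCy, assuming spaCy and its small English model are installed (any comparable tagger would do):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Text analytics gives insights.")
for token in doc:
    print(token.text, token.pos_)
# Typical output: Text NOUN, analytics NOUN, gives VERB, insights NOUN, . PUNCT
```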
Tokenization and part-of-speech tagging prep the text for the third process: parsing.
The literal meaning of parsing is to resolve a sentence into its component parts and describe their syntactic roles. That is the same process that happens in text analytics as well.
Parsing can be done in two ways:
- Dependency parsing
- Constituency parsing
Dependency parsing
In dependency parsing, the system analyzes the sentence to find a “head” word and establishes its connection with the rest of the words.
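Continuing the spaCy sketch from above (same model assumption), the head-word relations can be inspected like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # same model assumption as above
doc = nlp("Text analytics gives insights.")

for token in doc:
    # each token points at its syntactic head via a labeled relation
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")
# "gives" emerges as the root; "analytics" and "insights" attach to it
```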
Constituency parsing
In constituency parsing, the system uses a constituency grammar, that is, a set of rules that breaks the sentence down into nested phrases (constituents), to process the sentence and derive meaning out of it.
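To make this concrete, here is a toy sketch using NLTK's chart parser with a hand-written grammar just large enough for our example sentence (real systems learn much broader grammars from data):

```python
import nltk

# A toy constituency grammar covering only the example sentence
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> N N | N
    VP -> V NP
    N  -> 'Text' | 'analytics' | 'insights'
    V  -> 'gives'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["Text", "analytics", "gives", "insights"]):
    # yields (S (NP Text analytics) (VP gives (NP insights)))
    tree.pretty_print()
```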
During the process of data preparation, stop words and fillers are also removed from the sentence to make analysis more refined.
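A minimal sketch of stop-word removal, using a tiny illustrative stop list (real pipelines use much larger, language-specific lists):

```python
# Tiny illustrative stop-word list; production lists are far larger
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["Text", "analytics", "is", "a", "vast", "topic"]))
# ['Text', 'analytics', 'vast', 'topic']
```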
3rd Stage: Data analysis
Once the data is prepared for analysis, the fun begins. Data analysis is the part of text analytics where the data is put to work. The system begins slicing and dicing the data to churn out insights. There are two ways in which this slicing and dicing can be done:
- Text classification
- Text extraction
Text classification
As the name suggests, text classification is the process of assigning tags to individual pieces of text.
The type of work involved has also given the process the alternative titles of text categorization and text tagging. It is in text classification that the sentiment behind the textual data is identified.
To identify the right sentiment behind the text, the system uses a variety of approaches: rule-based systems, where rules assign polarity to positive and negative words, and machine learning systems that learn from training data and dynamic datasets.
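As a hedged illustration of the rule-based flavor, here is a minimal polarity counter built on a hypothetical lexicon (real systems use far richer rules or trained models):

```python
# Hypothetical polarity lexicon for illustration only
POSITIVE = {"great", "love", "helpful", "excellent"}
NEGATIVE = {"bad", "slow", "broken", "disappointing"}

def classify_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("The support team was helpful and the product is excellent"))
# positive
```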
Text extraction
Text extraction proves beneficial when specific pieces of text are to be extracted from the web or from a given dataset. For example, a brand name to be extracted from web content.
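A minimal sketch of such an extraction, matching against a hypothetical list of brand names with a regular expression:

```python
import re

# Hypothetical brand list; in practice this might come from a product catalog
BRANDS = ["Acme", "Globex", "Initech"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, BRANDS)) + r")\b")

text = "I switched from Globex to Acme last month and never looked back."
print(pattern.findall(text))
# ['Globex', 'Acme']
```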
4th stage: Data visualization
All the data in the world would be of no use if it cannot be visualized properly. Data visualization is what helps present the findings in a way that humans can act upon.
Otherwise, all the workings of the text analytics system would look like gibberish: a jumble of alphanumeric characters.
To avoid such a situation, the text analytics system uses interactive or static data visualization forms. The most commonly used formats are a CSV sheet or a chart.
Alternatively, data visualization tools like Google Data Studio, Tableau, etc. can also depict the findings in a polished, shareable form.
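To round off the pipeline, here is a minimal sketch that writes sentiment counts to a CSV sheet and renders a quick bar chart (assuming matplotlib is installed; the labels are hypothetical output from the analysis stage):

```python
import csv
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical classified labels coming out of the analysis stage
labels = ["positive", "positive", "neutral", "negative", "positive"]
counts = Counter(labels)

# Export as a CSV sheet for spreadsheet tools...
with open("sentiment_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sentiment", "count"])
    writer.writerows(counts.items())

# ...or render a quick chart
plt.bar(list(counts), list(counts.values()))
plt.title("Sentiment of customer responses")
plt.savefig("sentiment_counts.png")
```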
Text analysis: Bringing it all together
Together, we create quintillions of bytes of data every day. When that data is subjected to text analytics, it can point a finger at what our customers feel about the business and the direction in which it is going.
To deliver such unique value, text analytics has to work in a specific way. It leverages Artificial Intelligence, Machine Learning, Natural Language Processing, and a host of other technologies to make that happen.
Now that you have read how text analysis works, you might also be interested in knowing the various types of text analytics. Read our blog.