Artha Solutions

Unleashing Talend Machine Learning Capabilities

Introduction

This article covers how Talend Real-time Big Data can be used to leverage Talend’s real-time data processing and machine learning capabilities. The use case is processing Twitter data in real time to classify whether the person tweeting has post-traumatic stress disorder (PTSD). The same solution can work for other major health conditions, for example cancer, which is discussed at the end.

What is PTSD?

PTSD is a mental disorder that can develop after a person is exposed to a traumatic event, such as sexual assault, warfare, traffic collisions, or other threats to a person’s life.

 

Statistics about PTSD

Source: Taking a look at PTSD statistics

Insights into the solution

Given the rapid growth in the number of social network users, a huge amount of data is written to social networks every day. Handling data at that scale calls for a Hadoop ecosystem. Since Twitter is our data source, this PTSD use case is classified as a Big Data use case.

 

Spark Framework
Apache Spark™ is a fast and general engine for large-scale data processing.
Random Forest Model
Random forest is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Hadoop Cluster (Cloudera)
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Hashing TF
As a text-processing algorithm, Hashing TF converts input data into fixed-length feature vectors that reflect the importance of a term (a word or a sequence of words) by calculating how frequently the term appears in the input data.
Talend Studio for Real Time Big Data
Talend Studio is used to design and run MapReduce, Spark, and real-time Big Data Jobs.
Inverse Document Frequency
As a text-processing algorithm, Inverse Document Frequency (IDF) is often used to process the output of the Hashing TF computation in order to downplay the importance of the terms that appear in too many documents.
Kafka Service
Apache Kafka is an open-source stream processing platform written in Scala and Java to provide a unified, high-throughput, low-latency platform for handling a real-time data feed.
Regex Tokenizer
Regex tokenizer performs advanced tokeni

 

Step 1: Retrieve data from Twitter using Talend

Talend Studio supports not only Talend’s own components but also custom-built components from third parties. These custom-built components are available from Talend Exchange, an online component store.

To retrieve tweets with these components, we first need access to the Twitter API.

 

Snapshots of Talend Job designs

Deciding which hashtags to use plays a vital role. We may use a single hashtag or a combination of several hashtags to pull exactly the data required. Choosing appropriate hashtags helps to filter the large volume of source data.
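As a rough illustration of hashtag-based filtering, here is a minimal Python sketch. It is not the Talend component logic; the hashtag set, the sample tweets, and the function names are all hypothetical, since the article does not list the hashtags it used.

```python
import re

# Hypothetical hashtag set; the article does not say which hashtags were used.
TRACKED_HASHTAGS = {"#ptsd", "#trauma"}

HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return the set of lowercase hashtags found in a tweet."""
    return {tag.lower() for tag in HASHTAG_PATTERN.findall(text)}


def matches(text, tracked=TRACKED_HASHTAGS, require_all=False):
    """Check whether a tweet carries any (or, optionally, all) tracked hashtags."""
    found = extract_hashtags(text)
    if require_all:
        return tracked <= found
    return bool(tracked & found)


# Made-up sample tweets for illustration only.
tweets = [
    "Nightmares again tonight #PTSD #trauma",
    "Great game last night! #football",
    "Reading about #PTSD research",
]

# Keep tweets that carry at least one tracked hashtag.
selected = [t for t in tweets if matches(t)]
# Keep only tweets that carry every tracked hashtag.
strict = [t for t in tweets if matches(t, require_all=True)]
```

Requiring every hashtag narrows the result set, while matching any hashtag widens it, which is the trade-off behind choosing a single hashtag or a combination.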

Step 2: Create and train the model using Talend

Supervised learning needs labeled examples, so this step requires human intervention. Once the data pulled from Twitter is in place, we need to manually classify the tweets as Having PTSD or Not Having PTSD.

Classification can be done by adding a new attribute to the data. Its value can be Yes or No (Yes – having PTSD, No – not having PTSD). Once the classification is done, this data becomes the training set used to create and train the model.

To achieve our use case, the training data needs to undergo some transformations before the model is created, such as:

  1. Regex Tokenizer
  2. Hashing TF
  3. Inverse Document Frequency
  4. Vector Conversion
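The text transformations listed above can be sketched in plain Python to show what each stage produces. This is a simplified illustration, not the Talend or Spark implementation: the vector length, the `crc32` hash, and the sample tweets are assumptions made for the sketch. Tokenization runs first because Hashing TF operates on tokens, and the IDF weight uses the smoothed form documented for Spark MLlib, log((m + 1) / (df + 1)).

```python
import math
import re
import zlib

NUM_FEATURES = 32  # hypothetical vector length; real jobs use far more buckets


def tokenize(text):
    """Regex tokenizer: lowercase the text and split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Hash each token into a bucket and count term frequencies per bucket."""
    vector = [0.0] * num_features
    for tok in tokens:
        # crc32 is a stable stand-in hash chosen for this sketch.
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features
        vector[bucket] += 1.0
    return vector


def fit_idf(tf_vectors):
    """Compute one IDF weight per bucket: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    weights = []
    for j in range(len(tf_vectors[0])):
        df = sum(1 for vec in tf_vectors if vec[j] > 0)  # documents hitting bucket j
        weights.append(math.log((m + 1) / (df + 1)))
    return weights


def tf_idf(tf_vector, idf_weights):
    """Scale term frequencies by IDF so ubiquitous terms are downweighted."""
    return [tf * w for tf, w in zip(tf_vector, idf_weights)]


# Made-up sample tweets for illustration only.
tweets = [
    "I can't sleep since the accident",
    "I love the new cafe",
    "Since the crash I avoid driving",
]

tf_vectors = [hashing_tf(tokenize(t)) for t in tweets]
idf_weights = fit_idf(tf_vectors)
features = [tf_idf(vec, idf_weights) for vec in tf_vectors]  # fixed-length vectors
```

A bucket hit by every document (here, the one holding "the") gets an IDF weight of zero, which is how IDF downplays terms that appear in too many documents.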

After passing through all the algorithms above, training data can be passed into the model to create and train it. The model that suits this prediction use case best is the Random Forest Model.

Talend Studio for Real-time Big Data has machine learning components that can perform regression, classification, and prediction using the Spark framework. Using these components, we create and train a Random Forest model on the training data. The model is now ready to classify tweets.
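To make the ensemble idea concrete, the sketch below shows only the voting step of a random forest classifier: each tree votes and the forest returns the most common class. The three "trees" are hand-written, hypothetical rules standing in for trained decision trees, and the feature names are invented for the example. In the actual job the trees are learned from the training data on Spark.

```python
from collections import Counter


# Hand-written stand-ins for trained decision trees (hypothetical rules).
def tree_a(features):
    return "Yes" if features["mentions_nightmares"] else "No"


def tree_b(features):
    return "Yes" if features["mentions_trauma"] else "No"


def tree_c(features):
    both = features["mentions_nightmares"] and features["mentions_trauma"]
    return "Yes" if both else "No"


def forest_predict(trees, features):
    """Classify by majority vote: return the mode of the trees' outputs."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [tree_a, tree_b, tree_c]
sample = {"mentions_nightmares": True, "mentions_trauma": False}
prediction = forest_predict(forest, sample)  # one "Yes" vote against two "No" votes
```

Using an odd number of trees avoids ties in a two-class problem such as Yes/No.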

Note: All the work is done on a Cloudera Hadoop cluster. Talend is connected to the cluster and handles the rest of the computation.

 

Snapshot of a Talend Spark Job design

 

Step 3: Prediction of tweets using Talend

Now we have the model ready on our Hadoop cluster. We can use the process in step 1 to pull data from Twitter again, and this new data acts as the test data. The test data has only one attribute: Tweet.

When the test data is passed to the model we created, the model adds a new attribute, Label, to the test data. Its value is Yes or No (Yes – having PTSD, No – not having PTSD). The predicted value depends entirely on how the model was trained in step 2. Again, all of this prediction can be done in Talend Studio for Real-time Big Data using the Spark framework.
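The shape of the data before and after prediction can be illustrated with a short Python sketch. The `keyword_model` function is a hypothetical stand-in for the trained model (a simple keyword rule, not a real classifier), and the sample tweets are made up.

```python
def add_labels(records, classify):
    """Return copies of the records with a predicted Label attribute added."""
    return [{**record, "Label": classify(record["Tweet"])} for record in records]


# Hypothetical stand-in for the trained model: a keyword rule, not a real classifier.
def keyword_model(tweet):
    return "Yes" if "flashback" in tweet.lower() else "No"


# Test data carries a single attribute, Tweet.
test_data = [
    {"Tweet": "Another flashback on the drive home"},
    {"Tweet": "Lovely weather today"},
]

# Each output record has both Tweet and the predicted Label.
predictions = add_labels(test_data, keyword_model)
```

The input records are left unchanged, so the same test data can be scored again after the model is retrained.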

 

Snapshot of a Talend Spark Job design for prediction

Evolution of the model

Once the model predicts the classification of the test data set, we find about 25% of the records to be misclassified (on average). We need to assign the right classification to those records, add them to the training set, and retrain the model, which should improve its predictions. Add more records to the training set and repeat the same procedure until the model becomes accurate. A model needs to evolve over time by being retrained on newly added training data, so some ongoing management is required.
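One way to picture the correction step is the following Python sketch, which folds the human-corrected records back into the training set. The record layout (Tweet and Label keys), the function name, and the sample data are assumptions made for illustration.

```python
def merge_corrections(training_set, predictions, reviewed_labels):
    """Append each misclassified record, with its corrected label, to the training set."""
    corrected = [
        {"Tweet": pred["Tweet"], "Label": truth}
        for pred, truth in zip(predictions, reviewed_labels)
        if pred["Label"] != truth
    ]
    return training_set + corrected


# Made-up data for illustration only.
training_set = [{"Tweet": "Flashbacks every night since the crash", "Label": "Yes"}]
predictions = [
    {"Tweet": "I keep reliving the accident", "Label": "No"},
    {"Tweet": "Best pizza in town", "Label": "No"},
]
reviewed_labels = ["Yes", "No"]  # labels assigned by a human reviewer

# The first prediction was wrong, so it joins the training set with the corrected label.
updated_training_set = merge_corrections(training_set, predictions, reviewed_labels)
```

The model would then be retrained on `updated_training_set`, and the cycle repeats with the next batch of reviewed predictions.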

Note: To boost the effectiveness of the model, we can add synonym-based variants of the training data to the training set and retrain the model. This develops the model synthetically rather than just organically.

A threshold of 90% accurate predictions is a must to classify the model as accurate. If the prediction accuracy level drops below 90%, then it is time to retrain the model.
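The 90% rule can be expressed as a small check, sketched here in Python. The function names and the sample labels are hypothetical; the threshold value is the one stated above.

```python
ACCURACY_THRESHOLD = 0.90  # threshold stated in the article


def accuracy(predicted, actual):
    """Fraction of predictions that match the reviewed labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def needs_retraining(predicted, actual, threshold=ACCURACY_THRESHOLD):
    """Flag the model for retraining when accuracy falls below the threshold."""
    return accuracy(predicted, actual) < threshold


# Made-up batch of 20 predictions, 5 of which disagree with the reviewed labels.
predicted = ["Yes"] * 10 + ["No"] * 10
actual = ["Yes"] * 10 + ["No"] * 5 + ["Yes"] * 5

score = accuracy(predicted, actual)           # 15 correct out of 20
retrain = needs_retraining(predicted, actual)
```

With 5 of 20 records wrong, the batch mirrors the 25% error rate mentioned earlier, so the check flags the model for retraining.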

Real-time applications from this use case

Note: Once the classification of data is done (Yes or No), it may lead to many more useful real-time applications.

Broader Scope

The use case solution designed can work for any of the major health situations. For example, if the use case is with cancer, using cancer-specific hashtags we can train the model in an equivalent way and start predicting if the person has cancer or not. The same real-time applications as discussed above can be achieved.

Authors: Madhav Nalla, Saikrishna Ala, and Kashyap Shah

This article was also published on the Talend Community Blog:
Source: https://community.talend.com/s/article/Unleashing-Talend-Machine-Learning-Capabilities
