Unleashing Talend Machine Learning Capabilities

Introduction

This article covers how Talend Real-time Big Data can be used to effectively leverage Talend's real-time data processing and machine learning capabilities. The use case handled here is processing Twitter data in real time to classify whether the person tweeting has post-traumatic stress disorder (PTSD). The same solution can work for any major health condition of a person, for example cancer, which is discussed at the end.

What is PTSD?

PTSD is a mental disorder that can develop after a person is exposed to a traumatic event, such as sexual assault, warfare, traffic collisions, or other threats on a person's life.

 

Statistics about PTSD

  • 70% of adults in the U.S. have experienced some traumatic event at least once in their lives, and up to 20% of these people go on to develop PTSD.
  • An estimated 8% of Americans, 24.4 million people, have PTSD at any given time.
  • An estimated one out of every nine women develops PTSD, making women about twice as likely as men to develop it.
  • Almost 50% of all outpatient mental health patients have PTSD.
  • Among people who are victims of a severe traumatic experience, 60 – 80% will develop PTSD.

Source: Taking a look at PTSD statistics

Insights into the solution

Considering the rapid growth in the number of social network users, a humongous amount of data is written to social networks every day. To handle such a huge amount of data, we need a Hadoop ecosystem. Hence, with Twitter as our data source, this PTSD use case is classified as a Big Data use case.

 

Spark Framework
Apache Spark™ is a fast and general engine for large-scale data processing.
Random Forest Model
Random forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Hadoop Cluster (Cloudera)
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Hashing TF
As a text-processing algorithm, Hashing TF converts input data into fixed-length feature vectors to reflect the importance of a term (a word or a sequence of words) by calculating the frequency with which these terms appear in the input data.
Talend Studio for Real Time Big Data
Talend Studio is used to design and run MapReduce, Spark, and real-time Big Data Jobs.
Inverse Document Frequency
As a text-processing algorithm, Inverse Document Frequency (IDF) is often used to process the output of the Hashing TF computation in order to downplay the importance of the terms that appear in too many documents.
Kafka Service
Apache Kafka is an open-source stream-processing platform written in Scala and Java that provides a unified, high-throughput, low-latency platform for handling real-time data feeds.
Regex Tokenizer
A regex tokenizer performs advanced tokenization based on regular-expression matching.
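To make the Regex Tokenizer and Hashing TF ideas above concrete, here is a minimal pure-Python sketch. It is an illustration only: Spark's actual HashingTF uses MurmurHash3 and many more buckets; the 16-bucket vector and md5 hash here are assumptions chosen for readability.

```python
import re
import hashlib

def regex_tokenize(text, pattern=r"\w+"):
    """Split text into lowercase tokens using a regular expression."""
    return re.findall(pattern, text.lower())

def hashing_tf(tokens, num_features=16):
    """Map each token to a bucket via a hash function and count term
    frequencies, producing a fixed-length feature vector."""
    vec = [0] * num_features
    for tok in tokens:
        # Use a stable hash; Python's built-in hash() is salted per process.
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % num_features
        vec[idx] += 1
    return vec

tokens = regex_tokenize("Nightmares again. Can't sleep, can't forget.")
vec = hashing_tf(tokens)
print(tokens)
print(sum(vec))  # total count equals the number of tokens
```

The key property is that the vector length is fixed regardless of vocabulary size, which is what lets downstream models consume arbitrary text.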

 

Step 1: Retrieve data from Twitter using Talend

Talend Studio not only supports Talend's own components, it also supports custom-built components from third parties. All these custom-built components can be accessed from Talend Exchange, an online component store.

  • Taking advantage of a custom Twitter component, we can get data from Twitter by accessing both REST and Stream APIs.
  • To take advantage of the Hadoop ecosystem for Big Data, we implemented a real-time Kafka service to read data from Twitter.
  • Talend Studio for Real-time Big Data has Kafka components that we can leverage to read the data that is being read by the Kafka service, and pass it on to the next stages of the design in real time.

To perform all of the above, we need to get access to the Twitter API.

 

Snapshots of Talend Job designs

Deciding which hashtags to use plays a vital role. We may use a single hashtag, or a combination of multiple hashtags to pull the accurate data required. Choosing appropriate hashtags helps to filter the large volume of source data.
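As an illustration of hashtag-based filtering, here is a small sketch of selecting tweets that carry any of the chosen hashtags; the tweets and hashtag names below are hypothetical.

```python
# Hypothetical tweets; the hashtags are illustrative only.
tweets = [
    "Another sleepless night #PTSD #insomnia",
    "Loving this sunny day #beach",
    "Flashbacks won't stop #trauma #PTSD",
]

def matches(tweet, any_of=(), all_of=()):
    """Keep a tweet if it contains any hashtag from `any_of`
    and every hashtag from `all_of` (case-insensitive)."""
    tags = {w.lower().strip(".,!?") for w in tweet.split() if w.startswith("#")}
    wanted_any = {t.lower() for t in any_of}
    wanted_all = {t.lower() for t in all_of}
    return (not wanted_any or bool(tags & wanted_any)) and wanted_all <= tags

selected = [t for t in tweets if matches(t, any_of=["#PTSD", "#trauma"])]
print(selected)  # keeps the first and third tweets
```

Combining `any_of` and `all_of` sets mimics using a single hashtag versus a combination of hashtags to narrow the source data.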

Step 2: Create and train the model using Talend

Supervised classification starts with human intervention. Once the data pulled from Twitter is in place, we need to manually classify each tweet as Having PTSD or Not Having PTSD.

Classification can be done by adding a new attribute to the data, with values Yes or No (Yes – having PTSD, No – not having PTSD). Once the classification is done, we can call this data a training set that can be used to create and train the model.
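A minimal sketch of building such a training set follows; the tweets and labels are hypothetical, and the CSV layout is an assumption for illustration — in the real Job this attribute is added within Talend.

```python
import csv
import io

# Hypothetical manually reviewed tweets: (text, reviewer decision).
reviewed = [
    ("I keep reliving the crash every night", "Yes"),
    ("Great coffee this morning", "No"),
]

# Add the classification as a new attribute to form the training set.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["tweet", "label"])
writer.writeheader()
for text, label in reviewed:
    writer.writerow({"tweet": text, "label": label})

print(buf.getvalue())
```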

To achieve our use case, before creating the model, training data needs to undergo some transformations such as:

  1. Hashing TF
  2. Regex Tokenizer
  3. Inverse Document Frequency
  4. Vector Conversion

After passing through all the algorithms above, training data can be passed into the model to create and train it. The model that suits this prediction use case best is the Random Forest Model.
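To show what the IDF step contributes, here is a small pure-Python sketch of IDF weighting applied to TF vectors. The smoothed formula log((1 + nDocs) / (1 + docFreq)) is the commonly used one (Spark MLlib uses the same shape), but the tiny vectors below are made up for illustration.

```python
import math

def idf_weights(tf_vectors):
    """Inverse document frequency per feature index, using the
    smoothed formula log((1 + n_docs) / (1 + doc_freq))."""
    n_docs = len(tf_vectors)
    n_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(n_features)]
    return [math.log((1 + n_docs) / (1 + df[i])) for i in range(n_features)]

def tf_idf(tf_vectors):
    """Scale each TF vector by the per-feature IDF weights."""
    idf = idf_weights(tf_vectors)
    return [[tf * w for tf, w in zip(vec, idf)] for vec in tf_vectors]

# Feature 0 appears in every document, so IDF drives its weight to
# zero; features 1 and 2 are rarer and keep a positive weight.
docs = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
print(tf_idf(docs)[0])
```

This is exactly the "downplay terms that appear in too many documents" behavior described for IDF above: ubiquitous terms contribute nothing to the final feature vectors.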

Talend Studio for Real-time Big Data has some very good machine learning components that can perform regression, classification, and prediction using the Spark framework. Leveraging Talend's ability to handle machine learning tasks, we created and trained a Random Forest model with the training data. Now we have the model ready to classify tweets.

Note: All the work is done on a Cloudera Hadoop Cluster, Talend is connected to the cluster, and the rest of the computation is achieved by Talend.

 

Snapshot of a Talend Spark Job design

 

Step 3: Prediction of tweets using Talend

Now we have the model ready on our Hadoop cluster. We can use the process from step 1 to pull data from Twitter again, which acts as the test data. The test data has only one attribute: Tweet.

When the test data is passed to the model we have created, the model adds a new attribute Label to the test data, and its value will be Yes or No (Yes – having PTSD, No – not having PTSD). The predicted value depends solely on the way the model is trained in step 2. Again, all this prediction can be done in Talend Studio for Real-time Big Data using the Spark framework.

 

Snapshot of a Talend Spark Job design for prediction

Evolution of the model

Once the model predicts the classification of the test data set, we find about 25% of the records to be erroneous on average. We need to assign the right classification to that 25% of the records, add them to the training set, and retrain the model, after which its predictions improve. Add more records to the training set and repeat the same procedure until the model reaches the required accuracy. A model needs to evolve over time, by training it with the new training data that arrives over time, so some ongoing management is required.

Note: To boost the effectiveness of the model, we can add synonyms of the training data to the training set and retrain the model, which leads to developing the model synthetically rather than just organically.

A threshold of 90% accurate predictions is a must to classify the model as accurate. If the prediction accuracy level drops below 90%, then it is time to retrain the model.
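The retraining decision can be sketched as a simple accuracy check against the 90% threshold; the predictions and verified labels below are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the manually verified labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def needs_retraining(predicted, actual, threshold=0.90):
    """Flag the model for retraining when accuracy drops below threshold."""
    return accuracy(predicted, actual) < threshold

predicted = ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes"]
actual    = ["Yes", "No", "Yes", "Yes", "No", "No", "No", "No"]

print(accuracy(predicted, actual))          # 6 of 8 correct
print(needs_retraining(predicted, actual))  # below 90%, so retrain
```

In practice the mismatched records would be relabeled, appended to the training set, and the model retrained, exactly as described above.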

Real-time applications from this use case

Note: Once the classification of data is done (Yes or No), it may lead to many more useful real-time applications.

Broader Scope

The use case solution designed can work for any of the major health situations. For example, if the use case is with cancer, using cancer-specific hashtags we can train the model in an equivalent way and start predicting if the person has cancer or not. The same real-time applications as discussed above can be achieved.

Authors: Madhav Nalla, Saikrishna Ala, and Kashyap Shah

This article was also published on the Talend Community Blog:
Source: https://community.talend.com/s/article/Unleashing-Talend-Machine-Learning-Capabilities

Achieve better performance with an efficient lookup input option in Talend Spark Streaming

Description

Talend provides two options for lookups in Spark streaming Jobs: a simple input component (for example, tMongoDBInput) or a lookup input component (tMongoDBLookupInput). Using a lookup input component provides a significant uplift in performance and code optimization for any Spark streaming Job.

Instead of loading the entire data set through the lookup component, Talend provides a unique option for streaming Jobs: querying only a small chunk of input data for the lookup, thereby saving an enormous amount of time and building highly performant Jobs.

By Definition

Lookup components like tMongoDBLookupInput, tJDBCLookupInput, and others provided by Talend execute a database query with a strictly defined order that must correspond to the schema definition.

The lookup component passes the extracted data to tMap in order to provide the lookup data to the main flow. It must be directly connected to a tMap component, and requires that tMap to use Reload at each row or Reload at each row (cache) for the lookup flow.

The tricky part here is to understand the usage of the Reload at each row functionality of the Talend tMap component, and how it can be integrated with the lookup component.

Example

Below is an example of how we have used a tJDBCLookupInput component with tMap in a Talend Spark Streaming Job.

 

  1. At the tMap level, make sure the tMap for the lookup is set up with Reload at each row, and an expression for globalMap Key is defined as well.
  2. At the lookup input component level, make sure the Query option is set up to use the globalMap key defined in tMap (for example, a WHERE condition on extract.consumer_id), as shown below. This is key to making sure the lookup component only fetches the data needed for processing at that point in time.
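The intent of Reload at each row can be sketched outside Talend with an in-memory SQLite table; the table and column names are assumptions mirroring the example's consumer_id key. Instead of loading the whole lookup table once, each main-flow record issues one narrow, parameterized query.

```python
import sqlite3

# Hypothetical lookup table standing in for the JDBC lookup source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumers (consumer_id TEXT, segment TEXT)")
conn.executemany("INSERT INTO consumers VALUES (?, ?)",
                 [("c1", "gold"), ("c2", "silver"), ("c3", "bronze")])

def lookup_per_row(consumer_id):
    """What 'Reload at each row' effectively does: fetch only the rows
    matching the key of the current main-flow record, not the whole table."""
    cur = conn.execute(
        "SELECT segment FROM consumers WHERE consumer_id = ?", (consumer_id,))
    row = cur.fetchone()
    return row[0] if row else None

# Main flow: each streaming record triggers one narrow lookup query.
stream = ["c2", "c1", "c9"]
enriched = [(cid, lookup_per_row(cid)) for cid in stream]
print(enriched)
```

The saving grows with the size of the lookup table: the per-row query touches only the matching keys, which is why this pattern outperforms pulling the full table into every micro-batch.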

Summary

As we have seen, these minute changes in our Streaming Jobs can make our ETL Jobs more effective and performant. As there will always be multiple implementations of a Talend ETL Job, the ability to understand the nuances in making them more efficient is an integral part of being a data engineer.

For more information, reach out to us at: solutions@thinkartha.com

Author: Siddartha Rao Chennur

This article was also published on the Talend Community:
Source: https://community.talend.com/s/article/Achieve-better-performance-with-an-efficient-lookup-input-option-in-Talend-Spark-Streaming

Quick Start Guide: Talend and Docker

Enterprise deployment work is notorious for being hidebound and slow to react to change. As many organizations adopt Docker and container services, it becomes easy to incorporate the Talend deployment life cycle into their existing Docker and container services, creating a more unified deployment platform to be shared across various applications within an organization.

This article is intended as a quick start guide on how to generate Talend Jobs as Docker images using a Docker service that is on a remote host.

Also, to provide a better understanding of handling Docker images, a few topics below draw comparisons between sh/bat scripts and Docker images.

Setting up your Docker for remote build

Talend Studio needs to connect to a Docker service to be able to generate a Docker image.

The Docker service can run on the same machine where Talend Studio is installed, or on a remote host. This setup step is needed only when Talend Studio and Docker run on different hosts.

Building a Docker Image from Talend Studio v7.1 or Greater

In v7.1, Talend introduced the Fabric8 Maven plugin to generate a Docker image directly from Talend Studio.

Using Talend Studio, we can build a Docker image and store it in a local Docker repository, or we can build and publish a Docker image to any registry of our choice.

Let us look at both options:

Build the Docker Image from Talend Studio

  1. Right-click on the Job and navigate to the Build Job option:
  2. Under build type, select Docker Image:

3. Choose the appropriate context and log4j level.

4. Under Docker Options, select Local if Docker and Studio are installed on the same host, or select Remote if your Docker service is running on a different host from the one where Talend Studio is installed. In our example, we enabled Docker for a remote build via TCP on port 2375:

tcp://dockerhostIP:2375

5. Once this is done, your Docker image is built and stored in the Docker repository, in our example on host 2.

6. Log in to the Docker host, in our example host 2, and execute the command docker images. You should be able to view the image we just built:

Build and Publish the Docker Image to the Registry from Talend Studio

Talend Studio can be used to build a Docker image, and the image can be published to any registry where the images can be picked up by Kubernetes or any container services. In our example, I have set up an AWS ECR registry.

  1. Right-click on the Job name and navigate to the Publish option.
Quick-Start-Guide-Talend-and-Docker-publish.png

2. Select the Export Type Docker Image:

3. Under Docker Options, provide the Docker host and port details as discussed in the previous topics. Give the necessary details of the registry and Docker image name:

Image Name = Repository Name
Image Tag=Jobname_Version
Username = AccessKeyId (AWS)
Password=Secret (AWS)

4. Once this is done, navigate to AWS ECR, and you should be able to search for and find the image.

Running Docker Images vs Shell or Bat scripts

With Talend, we are all accustomed to .sh or .bat scripts, so for a better understanding of how to run Docker images, let's cover various aspects in detail below, such as passing runtime parameters and volume mounting.

Passing Run Time Parameters to a Docker Image

To run a Docker image that is in your Docker repository (a Talend Job built as a Docker image):

  1. List all the Docker Images by running the command docker images:
  2. Now I want to run the image madhav_tmc/tlogrow, Tag latest, which uses a tWarn component to print a message. Part of the message will be from the context variable param.

3. Run the Docker image by passing a value to the context variable param at runtime:

docker run madhav_tmc/tlogrow:latest --context_param param="Hello TalendDocker"

Below in the log, we can see the value passed to the Docker image at runtime.

Talend Cloud & AMC Web UI: Hybrid approach

What is AMC?

Talend Activity Monitoring Console (AMC) is an add-on tool integrated into Talend Studio and Talend Administration Center for monitoring Talend Jobs and projects. It helps Talend product administrators or users achieve enhanced resource management and improved process performance through a convenient graphical interface and a supervising tool. It provides detailed monitoring capabilities that can be used to consolidate the collected activity monitoring information, understand the underlying component and Job interactions, prevent faults that could be unexpectedly generated, and support system management decisions.

In general, the functionalities are:

  • Batch process monitoring
  • Log information about each execution of a DI Job
  • Jobs can automatically write information to the AMC DB or File
  • TAC and Studio can access information within the AMC DB or File through the Activity Monitoring Console GUI

The Talend Activity Monitoring Console interface consists of the following views:

  • Jobs view
  • History and Detailed history views
  • Meter log view
  • Main chart view
  • Job Volume view
  • Logged Events view
  • Error report view
  • Threshold Charts view

This article is intended for Talend Cloud customers who want to leverage the AMC web UI to monitor Talend Jobs. Many existing Talend on-prem customers are used to the AMC web UI, and with more customers migrating to Talend Cloud, we can take a hybrid approach by using the amc.war file from the Talend on-prem version to host AMC as a standalone tool.

It is recommended to use custom dashboards on top of an AMC database if you are looking for more advanced or custom metrics than those offered by the AMC web UI.

Steps to host AMC as a Standalone Tool on Apache Tomcat

  1. Install the Apache Tomcat Service.
  2. Contact Talend for access, and download the amc.war file.
  3. Place the amc.war file under the tomcat Install Dir/webapps folder.
  4. Restart the Tomcat service.
  5. Once Tomcat is started, create a folder under the webapps directory named amc.
  6. Download the Database JAR file that we want to host to store AMC Data.
  7. Place that JAR under tomcat Install Dir/webapps/amc/WEB-INF/plugins/org.talend.amc.libraries_7.3.1.20190624_1017/lib/ext.
  8. Restart Tomcat.
  9. Navigate to the URL http://ip:port/amc/rap?startup=amc, for example http://localhost:8080/amc/rap?startup=amc, as shown in the screenshot below. This should take us to the AMC web page.

 

Conclusion

The AMC web UI from Talend is plug and play for monitoring Talend Jobs. Many on-premises customers already have access to the AMC web UI; hosting AMC as a standalone tool with Talend Cloud, as a kind of hybrid approach, gives cloud customers the same web UI as on-prem customers.

For more information, reach out to us at: solutions@thinkartha.com

Author: Madhav Nalla

This article was also published on the Talend Community:
Source: https://community.talend.com/s/article/Talend-Cloud-AMC-Web-UI-Hybrid-approach-jJdLO 

Talend Studio Best Practices – Increase Studio Performance and Settings

Let's discuss Talend Studio best practices: issues, fixes, and recommendations at the Studio level.

Increase Talend Studio memory:
Go to the Talend installation directory and change the -Xms and -Xmx values based on your system memory. The .ini file name should be Talend-Studio-win-x86_64.ini.
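For illustration, the JVM memory lines in the .ini file might look like the following; the exact values are assumptions, so size them to your machine.

```ini
-vmargs
-Xms1024m
-Xmx4096m
```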

If you are using Talend Cloud (#TalendCloud #TalendReferenceProject):
Set up reference projects from Talend Studio – this feature is no longer available in Talend TAC.

Notice: Reference projects are now managed in Studio. Read the Release Notes for how to migrate. If you want to remove a reference, use the DeleteProjectReference operation in MetaServlet.
Go to File -> Project Settings -> click on Reference Projects -> add a new reference -> select the project, add the corresponding branch, and click +.
Note : Don’t forget to click on + symbol.[/vc_column_text][/vc_column_inner][/vc_row_inner][vc_row_inner][vc_column_inner][vc_single_image image=”10303″ img_size=”full” alignment=”center”][/vc_column_inner][/vc_row_inner][vc_row_inner][vc_column_inner][vc_column_text]How to Change GroupId in Talend studio

• The default GroupId in Talend Studio is org.example.projectname.
• You need to change it to com.arthasolutions. To make this change in Talend, go to File -> Project Settings -> Build -> Deployment groupid.

Publish Snapshot or Releases
In order to deploy a snapshot, open Talend Studio and open the Talend Job.
• Go to the Job tab and navigate to the Deployment tab -> select the check box Use Snapshot.

How to make sure your Talend Studio points to the right snapshot or release

Go to Talend Studio -> Window -> Preferences -> Talend -> Artifact Repository -> Repository Settings.
• Now check your repository settings tab
• You should see all the repository settings pre-populated.
• In 7.2.1, all these settings come from TAC itself.
• If you ever want to change these default settings, click Use Customized Settings and enter the default release repository and default snapshot repository.

Install All Additional Packages in Talend Studio

The code of method (Map<String,Object>) is exceeding the 65535 bytes limit

• In Talend 7.2.1, some Jobs may fail in code compilation with an error of “65535 byte code”. This may happen in some specific Job designs where the code generation is already at the 65 KB limit.
• You can prevent this error by adding the following parameter to the config.ini file located in the configuration folder under your Talend Studio installation directory:
deactivate_extended_component_

This article was also published on the Talend Community:
https://community.talend.com/s/feed/0D53p00007vCnuMCAS