Talend ETL Versus Hand Coding
Talend ETL is open-source data integration software which, with its impressive array of components, lets you manipulate data from a wide variety of formats in a wide variety of ways. It is also compatible with a wide range of query languages and database types, and its built-in server connectivity can make some tasks very quick and simple, which makes Talend ETL both powerful and accessible.
But is Talend ETL difficult to use? Is it quick to learn? Is it really worth the effort when you could quite possibly hand-code the same query?
This is exactly what I wanted to find out, so using some basic techniques I tested Talend ETL to see what it was capable of. This blog article briefly goes over what I discovered and how.
Talend ETL Basics
Some of the more basic tasks in Talend can consist of simply accessing files and databases and knowing which component to use. But before we go into that, I will explain a little about the interface and how some tasks can be simplified from the very start.
When you first open up Talend ETL, you should see something that looks like the above image.
The most important part of this interface is the upper-left box, labelled 'Repository'. The Repository stores all the data that you will access, including any jobs that you create. Within the Repository is a section called Metadata, which can be very useful: it allows you to store a 'set-up' which your components (in your jobs) can then access multiple times, saving you from repeatedly editing details such as schemas and database servers.
In the centre is the designer. This is where you will create 'flow charts' for your jobs using components found in the Palette on the right-hand side. The Palette contains every accessible component, and the search bar at its top can be useful for finding a component you might otherwise not know about. When you look through the components you might notice that most have an input and an output option. This is generally how the basics of Talend ETL work; however, once you become accustomed to what's at hand, it's quite easy to modify your flow charts in various ways to shorten a job, or even to produce different results throughout the job entirely.
Talend ETL Testing
One of the most common jobs in the database world is simply loading data from a file into a database table and then keeping it up to date via insert and update – ELT. When attempting this in Talend ETL, it quickly becomes obvious that there are several different ways of doing most tasks. For example, you can 'update', 'insert', or 'update and/or insert' into a database. So here you could either have four components (an input and an output each for the insert and update tasks), or just two components, with the output automatically choosing to 'update and/or insert' where necessary.
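To make that last option concrete, here is a rough hand-coded sketch of 'update and/or insert' using Python and SQLite. The table, columns and values are invented for illustration (Talend itself generates Java, and other databases use different upsert syntax):

```python
import sqlite3

# Invented table and data, purely for illustration: the principle is the one
# Talend's output components expose as 'insert', 'update', and
# 'update and/or insert'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE two (id INTEGER PRIMARY KEY, alpha TEXT)")
conn.execute("INSERT INTO two VALUES (3, 'old')")

rows = [(3, 'updated'), (4, 'new')]   # one existing id, one new id

# A single 'update and/or insert' statement (SQLite 3.24+ upsert syntax),
# equivalent to choosing that option on one Talend output component.
conn.executemany(
    "INSERT INTO two (id, alpha) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET alpha = excluded.alpha",
    rows,
)

print(sorted(conn.execute("SELECT id, alpha FROM two")))
# [(3, 'updated'), (4, 'new')]
```

The alternative four-component layout would correspond to issuing the INSERT and UPDATE statements separately, one pair per task.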
The last point leads me nicely into my testing and what I was looking for. Initially, I wanted to know how difficult Talend ETL was to learn. Well, it can be a little confusing at first, that's for sure. I made the same mistakes over and over, but once you understand even the easiest of jobs, everything else starts to fall into place, and before long you'll find yourself modifying jobs at will. So my next task was to find out whether it was worth it. To do this, I decided to test a simple ELT design. I gathered some data, performed a load, insert and update on it against the old data, and then tested the results against a hand-coded version in two ways.
Firstly, and more obviously, I wanted to see how fast it ran in comparison to the hand-coded version. If it takes hours to run something that took me an hour to code and ten minutes to run, then it isn't worth it at this point. But it did run quite well. In particular, I found Talend rather good at dealing with general files: Talend concatenated a set of files a good deal quicker than my code did.
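The concatenation step itself is simple to hand-code; a minimal sketch in Python (with invented throwaway file names, not my actual test data) might look like this:

```python
import pathlib
import tempfile

def concatenate(paths, out_path):
    """Append the raw contents of each input file to one output file."""
    with open(out_path, 'wb') as out:
        for path in paths:
            with open(path, 'rb') as src:
                out.write(src.read())

# Demonstrate with two throwaway files in a temporary directory.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / 'a.csv').write_text('1,x\n')
(tmp / 'b.csv').write_text('2,y\n')
concatenate(sorted(tmp.glob('*.csv')), tmp / 'all.csv')
print((tmp / 'all.csv').read_text())
# 1,x
# 2,y
```

Talend's equivalent, using its file-input components over several directories, ran noticeably faster than my version over the test set.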
The second part of my testing was all about the code produced by Talend ETL, comparing it against my hand-written code to see how efficient it was at this simple task. What I found was that Talend ETL produces 'fail-safe' code. The following example should make clearer what I mean:
SELECT COUNT(1) FROM "two" WHERE "id" = $1
parameters: $1 = '3'

UPDATE "two" SET "alpha" = $1, "beta" = $2, "delta" = $3 WHERE "id" = $4
parameters: $1 = '7', $2 = '8', $3 = '9', $4 = '3'
This is all well and good; however, I did notice that larger inserts and/or updates sometimes took quite a bit longer than the hand-coded version, perhaps simply because everything is parameterised. But with everything said, Talend ETL still has one, possibly major, advantage: flexibility.
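To illustrate, the count-then-update logic from the log above can be reproduced by hand roughly as follows. This is a Python/SQLite sketch: the table and column names follow the logged queries, while the data values are invented:

```python
import sqlite3

# Set up a table matching the one in the logged queries, with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE two (id INTEGER PRIMARY KEY, alpha TEXT, beta TEXT, delta TEXT)")
conn.execute("INSERT INTO two VALUES (3, '1', '2', '3')")

def upsert_row(conn, row_id, alpha, beta, delta):
    # 'Fail-safe' pattern: count the matching row first, then update or
    # insert depending on the result, with every value parameterised.
    (count,) = conn.execute(
        "SELECT COUNT(1) FROM two WHERE id = ?", (row_id,)).fetchone()
    if count:
        conn.execute(
            "UPDATE two SET alpha = ?, beta = ?, delta = ? WHERE id = ?",
            (alpha, beta, delta, row_id))
    else:
        conn.execute(
            "INSERT INTO two (id, alpha, beta, delta) VALUES (?, ?, ?, ?)",
            (row_id, alpha, beta, delta))

upsert_row(conn, 3, '7', '8', '9')   # id 3 exists, so this updates
print(conn.execute("SELECT * FROM two WHERE id = 3").fetchone())
# (3, '7', '8', '9')
```

The extra COUNT round-trip per row is part of what makes the generated code safe but, on larger batches, slower than a hand-tuned equivalent.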
Talend ETL Flexibility
As I stated previously, Talend ETL gives us the ability to modify the flow chart in a variety of ways while still reaching the same end goal, but possibly with different results in the middle. Let's take the following two jobs (Job A and Job B respectively) and compare them to see the differences:
Both jobs perform the same task overall: concatenate a set of files (from three different directories in this case), load the data, and then insert and/or update the new data against the old data. But as you can see, Job B clearly has fewer components, and so is easier to set up. Furthermore, the code involved in Job B is much simpler, as you simply select all the data from the load and then let Talend ETL automatically update or insert where necessary.
Of course, Job A also has its benefits. Unlike Job B, it concatenates the files and saves the result as a new file before loading it into the database, which may be important in certain situations.
So overall, Talend ETL is a very useful piece of software, with its wide variety of tools and its support for a range of query languages and database types. And its all-important flexibility when designing jobs, allowing you to tailor a job to your specifications, makes it all the more accessible. However, it does sometimes over-complicate the little things, and since querying usually deals with databases rather than files, its downside may lie in concerns over efficiency and performance.