Tidal Workload Automation Adapters for Hadoop
Tidal Workload Automation Sqoop Adapter Overview
The Tidal Workload Automation Sqoop Adapter provides easy import and export of data from structured data stores such as relational databases and enterprise data warehouses. Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. The Sqoop Adapter allows users to automate the tasks carried out by Sqoop.
The Sqoop Adapter allows for the definition of the following job tasks:
- Code Generation – This task generates Java classes which encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. If the Java source is lost, it can be recreated using this task. New versions of a class can be created which use different delimiters between fields or a different package name.
- Export – The export task exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters. The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database.
- Import – The import tool imports structured data from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation such as Avro or SequenceFiles.
- Merge – The merge tool allows you to combine two datasets where entries in one dataset will overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key. This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same. The merge tool is typically run after an incremental import with last-modified mode.
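The merge semantics above (newest record wins per primary key) can be sketched with a small in-process simulation. This is a conceptual illustration only, not Sqoop's actual internal record format; the field names are hypothetical.

```python
def merge_datasets(older, newer, key="id"):
    """Flatten two datasets into one, keeping the newest record per
    primary key. Records in `newer` overwrite records in `older` that
    share the same key value, mirroring what the merge tool does after
    an incremental last-modified import. (Illustrative sketch only.)"""
    merged = {rec[key]: rec for rec in older}        # start from the older dataset
    merged.update({rec[key]: rec for rec in newer})  # newer entries win
    return list(merged.values())

# Example: an incremental import updated customer 2 and added customer 3.
older = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}]
newer = [{"id": 2, "city": "Chicago"}, {"id": 3, "city": "Denver"}]
result = merge_datasets(older, newer)
```

Note that because the newer dataset is applied last, customer 2's older record is discarded, which is exactly the "newest available record for each primary key" behavior described above.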
Tidal Workload Automation MapReduce Adapter Overview
Hadoop MapReduce is a software framework for writing applications that process large amounts of data (multi-terabyte data-sets) in parallel on large clusters (up to thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A Tidal Workload Automation MapReduce Adapter job divides the input data set into independent chunks that are processed by the map tasks in parallel. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and output of the job are stored in a file system. The framework schedules tasks, monitors them, and re-executes failed tasks. Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to YARN. The client then assumes the following responsibilities:
- Distributes the software/configuration to the slaves
- Schedules and monitors tasks
- Provides status and diagnostic information to the job client
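The split/map/sort/reduce flow described above can be illustrated with a minimal in-process simulation. This is a conceptual sketch of the data flow only, not the Hadoop API; real jobs are distributed across a cluster by YARN.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(chunks, map_fn, reduce_fn):
    """Simulate the MapReduce flow: map each input chunk independently,
    sort (shuffle) the intermediate (key, value) pairs, then reduce the
    values for each key. Conceptual sketch only."""
    intermediate = []
    for chunk in chunks:                  # map phase: independent chunks
        intermediate.extend(map_fn(chunk))
    intermediate.sort(key=itemgetter(0))  # the framework sorts map outputs
    return {k: reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))}

# The classic word-count example as the map/reduce function pair.
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return sum(counts)

counts = run_mapreduce(["to be or", "not to be"], map_fn, reduce_fn)
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

Each chunk is mapped in isolation, exactly as the framework processes splits in parallel, and the sort step stands in for the shuffle that groups a key's values onto one reducer.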
The MapReduce Adapter serves as the job client to automate the execution of MapReduce jobs as part of a Tidal Workload Automation managed process. The Adapter uses the Apache Hadoop API to submit and monitor MapReduce jobs with full scheduling capabilities and parameter support. As a platform independent solution, the Adapter can run on any platform where the Tidal master runs.
The MapReduce Adapter provides real-time information on the execution of a MapReduce job as it runs.
Figure 11: Job details for a MapReduce program.
Tidal Workload Automation Hive Adapter Overview
The Tidal Workload Automation Hive Adapter provides the automation of HiveQL commands as part of the cross-platform process organization between Tidal Workload Automation and the Tidal Hadoop Cluster. The Adapter is designed using the same user interface approach as other Tidal Workload Automation adapter jobs, seamlessly integrating Hadoop Hive data management into existing operation processes.
The Hive Adapter allows you to access and manage data stored in the Hadoop Distributed File System (HDFS™) using Hive's query language, HiveQL. HiveQL syntax is similar to standard SQL syntax. The Hive Adapter, in conjunction with Tidal Workload Automation, can be used to define, launch, control, and monitor HiveQL commands submitted to Hive via JDBC on a scheduled basis. The Adapter integrates seamlessly in an enterprise scheduling environment.
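One part of scheduled HiveQL submission is applying runtime values on top of the job definition's defaults before the command text is sent over JDBC. The sketch below shows that substitution step in plain Python; the parameter names (`run_date`, `min_total`) and the template syntax are hypothetical examples, not the Adapter's actual mechanism.

```python
from string import Template

def render_hiveql(template, overrides, defaults):
    """Apply runtime overrides on top of job-definition defaults, then
    substitute the resulting values into the HiveQL text. Illustrative
    sketch only; parameter names here are hypothetical."""
    params = {**defaults, **overrides}  # runtime values win over defaults
    return Template(template).substitute(params)

hiveql = render_hiveql(
    "SELECT customer, SUM(total) FROM orders "
    "WHERE order_date = '$run_date' GROUP BY customer "
    "HAVING SUM(total) > $min_total",
    overrides={"run_date": "2024-06-01"},   # supplied at launch time
    defaults={"run_date": "1970-01-01", "min_total": 100},
)
```

The rendered string is an ordinary HiveQL statement, ready to be executed through a JDBC connection to the Hive Server.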
The Hive adapter includes the following features:
- Connection management to monitor system status with a live connection to the Hive Server via JDBC
- Scheduling and monitoring of HiveQL commands from a centralized work console with Tidal Workload Automation
- Dynamic runtime overrides for parameters and values passed to the HiveQL command
- Output-formatting options to control the results, including table, XML, and CSV
- Defined dependencies and events with Tidal Workload Automation for scheduling control
- Runtime MapReduce parameter overrides if the HiveQL command results in a MapReduce job
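The output-formatting options listed above can be illustrated with a small formatter that renders a result set as CSV or XML. This is a sketch of the idea using Python's standard library, not the Adapter's actual output code; the column names are invented for the example.

```python
import csv
import io
from xml.etree.ElementTree import Element, SubElement, tostring

def format_results(columns, rows, style="csv"):
    """Render a query result set as CSV or XML, similar in spirit to the
    Adapter's output-formatting options. (Illustrative sketch only.)"""
    if style == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(columns)   # header row
        writer.writerows(rows)
        return buf.getvalue()
    if style == "xml":
        root = Element("results")
        for row in rows:
            rec = SubElement(root, "row")
            for col, val in zip(columns, row):
                SubElement(rec, col).text = str(val)
        return tostring(root, encoding="unicode")
    raise ValueError(f"unsupported style: {style}")

csv_out = format_results(["name", "total"], [("acme", 120)], "csv")
xml_out = format_results(["name", "total"], [("acme", 120)], "xml")
```

Keeping the formatter separate from query execution means the same result set can be delivered in whichever form a downstream job expects.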
Tidal Workload Automation HDFS Data Mover Linux Agent
The Tidal Workload Automation HDFS Data Mover Linux Agent helps to manage file transfers in and out of the Hadoop file system.