Abinitio advance
Why should you complete this course?
Records are the units that components use to process data. Data is read from a data source (a file, a database table, a queue, and so on) as a stream of bytes. Components use the record format on their in ports to parse these bytes into meaningful units of data -- that is, records.
A record is a collection of fields. Each field in the record can be a different data type and size. Some of the fields themselves might be records; these are called "nested records" or "subrecords".
When you design a graph, you write record formats for the ports of components in the graph. These record formats simply define record types that apply to the data on those ports. Here are the contents of the stores.dml file that you'll use in this course:
record
  decimal(4) store_no = 000;
  string("\0") store_manager = "";
  string("\0") address = "";
  string("\307") city = "";
  string(2) state = "";
  decimal(5) zipcode = "";
end("\n");
This record format file declares a record that contains six fields: store_no, store_manager, address, city, state, and zipcode.
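To make the idea of parsing bytes into records concrete, here is a minimal Python sketch of how a stream laid out like stores.dml could be split into fields. It is an illustration only, not Ab Initio code: the function name parse_store_records, the sample bytes, and the choice of Latin-1 (one byte per character) are assumptions made for this example.

```python
from typing import Iterator


def parse_store_records(data: bytes) -> Iterator[dict]:
    """Yield one dict per record from bytes laid out like stores.dml.

    Assumption: one byte per character (Latin-1). This mimics the record
    format in plain Python; it is not how Ab Initio itself is implemented.
    """
    pos = 0

    def fixed(width: int) -> str:
        # Fixed-width field, e.g. decimal(4) or string(2).
        nonlocal pos
        chunk = data[pos:pos + width]
        if len(chunk) != width:
            raise ValueError("truncated fixed-width field")
        pos += width
        return chunk.decode("latin-1")

    def delimited(delim: bytes) -> str:
        # Delimited field, e.g. string("\0") or string("\307").
        nonlocal pos
        end = data.index(delim, pos)  # raises ValueError if delimiter is missing
        chunk = data[pos:end]
        pos = end + len(delim)
        return chunk.decode("latin-1")

    while pos < len(data):
        # Dict entries are evaluated in order, so fields are read in order.
        record = {
            "store_no": int(fixed(4)),
            "store_manager": delimited(b"\x00"),
            "address": delimited(b"\x00"),
            "city": delimited(b"\xc7"),  # "\307" is octal for byte 0xC7
            "state": fixed(2),
            "zipcode": int(fixed(5)),
        }
        if data[pos:pos + 1] != b"\n":  # end("\n") closes the record
            raise ValueError("missing record terminator")
        pos += 1
        yield record


# Hypothetical sample data: a single store record.
sample = b"0042Ann Lee\x00" b"1 Main St\x00" b"Boston\xc7" b"MA" b"02134\n"
records = list(parse_store_records(sample))
print(records[0]["city"])  # prints "Boston"
```

The sketch shows why the record format matters: without it, the parser has no way to know where one field or one record ends and the next begins.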
Consider the FILTER BY EXPRESSION component. From your standpoint as a graph developer, you provide a record format for its in port and a select expression (select_expr). The select expression often references the value of an input field that is declared in the record format -- for example, zipcode > 20000.
When you run the graph, the FILTER BY EXPRESSION component parses some number of bytes into an input record. It knows how many bytes are in the record based on its description in the in port record format. It also uses the record format information to locate the bytes you are interested in -- in this case, the bytes that make up the value of zipcode. It checks if the value of those bytes is greater than 20000. If yes, all the bytes for that record are sent to the out port. If no, all the bytes for that record are sent to the deselect port. If the expression fails with an error, all the bytes for that record are sent to the reject port.
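The routing described above can be sketched in Python as follows. This is a simplified analogy, not the component's actual implementation: the function filter_by_expression and the sample records are hypothetical, and records are represented as already-parsed dicts rather than raw bytes.

```python
from typing import Callable, Dict, Iterable, List


def filter_by_expression(
    records: Iterable[dict],
    select_expr: Callable[[dict], bool],
) -> Dict[str, List[dict]]:
    """Route each record to exactly one of three ports.

    out      -> the select expression evaluated to true
    deselect -> the select expression evaluated to false
    reject   -> the select expression failed with an error
    """
    ports: Dict[str, List[dict]] = {"out": [], "deselect": [], "reject": []}
    for record in records:
        try:
            selected = bool(select_expr(record))
        except Exception:
            ports["reject"].append(record)
            continue
        ports["out" if selected else "deselect"].append(record)
    return ports


# Hypothetical input: the last record has no usable zipcode value,
# so comparing it with 20000 raises an error in Python.
stores = [
    {"store_no": 1, "zipcode": 2134},
    {"store_no": 2, "zipcode": 90210},
    {"store_no": 3, "zipcode": None},
]
ports = filter_by_expression(stores, lambda rec: rec["zipcode"] > 20000)
```

In this run, store 2 goes to out, store 1 goes to deselect, and store 3 goes to reject, so every input record lands on exactly one port.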
Thus, the record is the unit of data for the component: the select expression is applied to each record, each input record is sent to one of the output ports, and so on. For transform components, such as REFORMAT and ROLLUP, the relationship between records and components is even more apparent. When you edit a transform in the Transform Editor, you see a list of fields from the input record and a list of fields from the output record. You write rules to assign values to the fields in the output record. Behind the scenes, you are actually editing a function that takes a record as its input and returns a record as its result.
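The idea that a transform is a function from an input record to an output record can be illustrated with a short Python sketch. The names reformat_store and reformat, and the output field location, are hypothetical and chosen only for this example; real transforms are written in DML.

```python
from typing import Callable, Iterable, Iterator


def reformat_store(in_rec: dict) -> dict:
    """One 'rule' per output field, each assigning a value from the input."""
    return {
        "store_no": in_rec["store_no"],
        "location": f'{in_rec["city"]}, {in_rec["state"]}',
        "zipcode": in_rec["zipcode"],
    }


def reformat(
    records: Iterable[dict],
    transform: Callable[[dict], dict],
) -> Iterator[dict]:
    """Apply the transform function to each input record in turn."""
    for record in records:
        yield transform(record)


# Hypothetical input record using the fields from stores.dml.
in_records = [
    {
        "store_no": 42,
        "store_manager": "Ann Lee",
        "address": "1 Main St",
        "city": "Boston",
        "state": "MA",
        "zipcode": 2134,
    }
]
out_records = list(reformat(in_records, reformat_store))
```

Each key in the returned dict plays the role of one rule in the Transform Editor: it names an output field and says how to compute its value from the input record.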
Records are a fundamental part of graph design. Understanding how to efficiently manipulate and process records is an important skill for graph developers. This course is designed to guide you from the basics of DML syntax for manipulating records, through optimizing records for performance, all the way to writing your own data types and functions using records.
This course is one part of the Advanced Graph Design series, which introduces several important themes:
Effective techniques for editing and designing graphs.
In this course, you'll practice editing a graph that someone else has designed. You will add new features to the graph, refactor the graph design to implement reuse, and optimize the performance of the graph.
In other courses, you'll practice designing graphs from the beginning, using incremental and data-lineage-based approaches.
Advanced DML features.
In this course, you'll learn how to use record assignments and wildcard rules to assign values to records; how automatic type conversion works for record types; how to create and use local and global variables; the difference between the REDEFINE FORMAT and REFORMAT components; how to write packages for simple components such as FILTER BY EXPRESSION; and how to create your own user-defined data types and functions.
In other courses, you'll learn how to work with the vector type, how to use advanced aggregation functions, and how to write your own transform packages.
How to evaluate the performance of a graph.
In this course, you'll evaluate the performance of different data types within a record. You will also explore flow buffers: how they affect the performance of the graph and how to modify a graph design to avoid flow buffering. You'll see the effect that phasing has on component and pipeline parallelism, and you will learn how to use the CHECKPOINTED SORT component to improve component parallelism. You'll also design a "double ROLLUP" to improve the performance of a global ROLLUP that processes a large amount of input data.
In other courses, you'll evaluate the performance of sorting versus using components with unsorted input. You will also evaluate the performance of a graph that uses many pipelined components versus a graph that uses one component with a complicated transform, and you will evaluate the performance of a graph using different degrees of parallelism.
Reuse.
In this course, you'll learn how to write your own user-defined data types and functions. You will also learn how to reuse these data types and functions by saving them in external files, which are then included by your transforms and record formats.
In other courses, you'll write a linked subgraph and create qualified wildcard rules.
Dependency analysis.
In this course, you'll learn how to declare dependent parameters for analysis and how to write a documentation transform.
In other courses, you'll learn how to set the Enterprise Meta>Environment® (EME) dataset location for a data source.