introduction to ETL for Oracle20 Feb 2006
There are a variety of methods to extract, transform and load (ETL) data in and out of an Oracle database. This short article provides an overview of some of the basic options that are supplied with Oracle.
L is for Load
Firstly, we need a table to load data into. This SQL script creates a table called PERSON. create table person (personid number(12), forename varchar2(200), surname varchar2(200), title varchar2(10), email varchar2(100), gender varchar2(1), ninumber varchar(30), address1 varchar2(100), address2 varchar2(100), town varchar2(100), county varchar2(100), postcode varchar2(10), country varchar2(100), phone varchar2(100), mobile varchar2(100), salary number(12), dateofbirth date);
The following PL/SQL procedure may be used to populate this table with an arbitrary amount of data.
create or replace procedure load_person(p_records in pls_integer) is v_person_id person.person_id%type; begin for i in 1 .. p_records loop v_person_id := to_char(i); insert into person (person_id, forename, surname, title, email, gender, ni_number, address1, address2, town, county, postcode, country, phone, mobile, salary, date_of_birth) values (v_person_id, /* person_id */ 'Norman', /* forename */ 'Brightside', /* surname */ 'Mr', /* title */ 'firstname.lastname@example.org', /* email */ 'M', /* gender */ 'AA123456B', /* ni_number */ 'Leafy Cottage', /* address1 */ '1 New Street', /* address2 */ 'Anytown', /* town */ 'Anycounty', /* county */ 'NN12 9ZZ', /* postcode */ 'UK', /* country */ '0208444555', /* phone */ '07789123321', /* mobile */ 20000, /* salary */ '01-JAN-1970' /* DOB */ ); end loop; commit; end;
Loading 1,000,000 records took 1 min 54 secs running Oracle 10g (10.2.1) on my home PC.
One obvious deficiency in the PL/SQL is that all records (and the SQL statement) are identical. In practice, this will not be the case. Therefore, it may be beneficial to replace the literal SQL with bind variables using using native, dynamic SQL. create or replace procedure loadbind(precords in plsinteger) is sqlstmt varchar2(2000); vpersonid person.personid%type; vforename person.forename%type := 'Norman'; vsurname person.surname%type := 'Brightside'; vtitle person.title%type := 'Mr'; vemail person.email%type := 'email@example.com'; vgender person.gender%type := 'M'; vninumber person.ninumber%type := 'AA123456B'; vaddress1 person.address1%type := 'Leafy Cottage'; vaddress2 person.address2%type := '1 New Street'; vtown person.town%type := 'Anytown'; vcounty person.county%type := 'Anycounty'; vpostcode person.postcode%type := 'NN12 9ZZ'; vcountry person.country%type := 'UK'; vphone person.phone%type := '0208444555'; vmobile person.mobile%type := '07789123321'; vsalary person.salary%type := 20000; vdateofbirth person.dateofbirth%type := todate('01-JAN-1970', 'DD-MON-YYYY'); begin sqlstmt := 'insert into person (personid, forename, surname, title, email, gender, ninumber, address1, address2, town, county, postcode, country, phone, mobile, salary, dateofbirth) values (:vpersonid, :vforename, :vsurname, :vtitle, :vemail, :vgender, :vninumber, :vaddress1, :vaddress2, :vtown, :vcounty, :vpostcode, :vcountry, :vphone, :vmobile, :vsalary, :vdateofbirth)'; for i in 1 .. precords loop vpersonid := tochar(i); execute immediate sqlstmt using vpersonid, vforename, vsurname, vtitle, vemail, vgender, vninumber, vaddress1, vaddress2, vtown, vcounty, vpostcode, vcountry, vphone, vmobile, vsalary, vdateofbirth; end loop; commit; end; / show errors
The modified PL/SQL procedure took 2 mins 1 sec for 1 million records. This doesn't imply that bind variables are worse than literals. More likely, that parsing overhead is relatively small for insert statements.
E is for Extract
People often ask why Oracle do not provide an equivalent SQL*Unloader to dump tables to delimited files. Well, the standard SQL*Plus utility may be used for such a purpose. Tom Kyte created some generic wrapper scripts (for both Unix and Windows) to dump an Oracle table to a delimited text file.
The 'flat' utility was used to dump the contents of the PERSON table into a comma separated value (CSV) file containing 1 million (identical) records.
SQL*Loader was then used to immediately load the data back in to the PERSON table again using this control file. LOAD DATA TRUNCATE INTO TABLE person FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS (personid, forename, surname, title, email, gender, ninumber, address1, address2, town, county, postcode, country, phone, mobile, salary, dateofbirth )
The time taken was 1 min 25 secs for conventional mode and 28 seconds when specifying DIRECT=TRUE for a direct path load.
Another interesting and often overlooked method for loading data is the use of external tables. This allows the creation of a virtual read-only table which is mapped directly to the host file using a CREATE TABLE syntax (similar to that for SQL*Loader control file).
This SQL script creates an external table which can then be analysed using standard SQL query tools without having to actually load a database table.
create table person_ext (person_id number(12), forename varchar2(200), surname varchar2(200), title varchar2(10), email varchar2(100), gender varchar2(1), ni_number varchar(30), address1 varchar2(100), address2 varchar2(100), town varchar2(100), county varchar2(100), postcode varchar2(10), country varchar2(100), phone varchar2(100), mobile varchar2(100), salary number(12), date_of_birth date) ORGANIZATION EXTERNAL (TYPE ORACLE_LOADER DEFAULT DIRECTORY data_dir ACCESS PARAMETERS (FIELDS TERMINATED BY ',') LOCATION ('person.dat') );
It is then trivial to load data from the external table into a conventional Oracle table as follows: SQL> insert into person select * from person_ext;
This method of data loading took 39 secs.
T is for Transform
And deliberately left until last. Extracting and loading data is relatively easy. Often the time consuming and complicated task in ETL is to identify, transform and potentially clean the data items from the source system and map them to tables in the target database.