Thursday, March 1, 2012

Tuning query with database link using USE_NL hint

I’ve just realized that I’m never fully satisfied after tuning queries with database links. Yes, I manage to improve performance significantly but I’m always missing some answers.

Case explained in this blog post is not exception also.

Environment:
Source DB (Oracle 11.1.0.7 on Solaris 64bit)
Remote DB (Oracle 10.2.0.4 on Solaris 64bit)
(Table names and values are changed )



One of the developers reported me problem with this simple query:

To get better output click "View Source".
SELECT *
  FROM tab1 cc, 
             tab2@db2 cfp
 WHERE cc.c_id = cfp.b_id 
 AND cc.code = '1234567890';

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                     |     1 |   309 |  5160   (4)| 00:01:02 |        |      |
|*  1 |  HASH JOIN                   |                     |     1 |   309 |  5160   (4)| 00:01:02 |        |      |
|   2 |   TABLE ACCESS BY INDEX ROWID| TAB1                |     1 |   125 |     1   (0)| 00:00:01 |        |      |
|*  3 |    INDEX RANGE SCAN          | C_IDX3              |     1 |       |     1   (0)| 00:00:01 |        |      |
|   4 |   REMOTE                     | TAB2                |  2053K|   360M|  5137   (3)| 00:01:02 | CRM_B~ | R->S |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("CC"."C_ID"="CFP"."B_ID")
   3 - access("CC"."CODE"='1234567890')


I've executed query and it took over 10 minutes to get results.


Let's break the query:

To get better output click "View Source".
select c_id
from tab1 cc
where cc.code = '1234567890';

1 row selected.

<<<<<< Elapsed: 00:00:00.02 >>>>>>

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                     |     1 |   125 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TAB1                |     1 |   125 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | C_IDX3              |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CC"."CODE"='1234567890')



select *
from tab2@db2 cfp
where cfp.b_id = '32165420';

57 rows selected.

<<<<<< Elapsed: 00:00:00.03 >>>>>>

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop | Inst   |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT REMOTE            |                  |    12 |  2208 |    15   (0)| 00:00:01 |       |       |        |
|   1 |  TABLE ACCESS BY GLOBAL INDEX ROWID| TAB2             |    12 |  2208 |    15   (0)| 00:00:01 | ROWID | ROWID |    DB2 |
|*  2 |   INDEX RANGE SCAN                 | B_IDX            |    12 |       |     3   (0)| 00:00:01 |       |       |    DB2 |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("A1"."B_ID"='32165420')

Note
-----
   - fully remote statement

From both queries I've received results instantly.


To get more answers I've decided to gather more details with tracing.
Tkprof output showed why this query behaved poorly.

SELECT *
  FROM tab1 cc, tab2@db2 cfp
 WHERE cc.c_id = cfp.b_id AND cc.code = '1234567890'

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.03       0.16          0          0          1           0
Execute      1      0.00       0.02          0          0          0           0
Fetch        2    131.58     602.67          0          2          0          57
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4    131.61     602.87          0          2          1          57

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 41

Rows     Row Source Operation
-------  ---------------------------------------------------
     57  HASH JOIN  (cr=2 pr=0 pw=0 time=0 us cost=5052 size=309 card=1)
      1   TABLE ACCESS BY INDEX ROWID TAB1 (cr=2 pr=0 pw=0 time=0 us cost=1 size=125 card=1)
      1    INDEX RANGE SCAN C_IDX3 (cr=1 pr=0 pw=0 time=0 us cost=1 size=0 card=1)(object id 66342)
2053552   REMOTE  TAB2 (cr=0 pr=0 pw=0 time=732493 us cost=5029 size=374102912 card=2033168)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  single-task message                             1        0.14          0.14
  SQL*Net message to dblink                  937019        0.00          0.90
  SQL*Net message from dblink                937019        0.86        474.64
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2      452.84        452.85
  SQL*Net more data from dblink                8289        0.55         29.50
  SQL*Net more data to client                     1        0.00          0.00


Main waits were associated to network traffic between databases.
Why Oracle decided to read 2053552 rows (full scan instead index) from remote table?

Maybe there was some problem with stale statistics. I've gathered fresh statistics but that haven't changed anything.

As developer was waiting for me I had to find solution quickly (if I can, of course).

I've solved similar issue using USE_HASH hint (SQL Tuning - using USE_HASH hint - dblink issue) so I've tried with USE_NL hint in this case.
It helped! I've received result under a second.

SELECT /*+ use_nl(cc cfp) */ *
  FROM TAB1 cc, TAB2@DB2 cfp
 WHERE cc.C_ID = cfp.B_ID AND cc.CODE = '1234567890'

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.02       0.13          0          0          1           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.01       0.01          0          2          0          57
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.03       0.15          0          2          1          57

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 41

Rows     Row Source Operation
-------  ---------------------------------------------------
     57  NESTED LOOPS  (cr=2 pr=0 pw=0 time=0 us cost=5030 size=309 card=1)
      1   TABLE ACCESS BY INDEX ROWID TAB1 (cr=2 pr=0 pw=0 time=0 us cost=1 size=125 card=1)
      1    INDEX RANGE SCAN C_IDX3 (cr=1 pr=0 pw=0 time=0 us cost=1 size=0 card=1)(object id 66342)
     57   REMOTE  TAB2 (cr=0 pr=0 pw=0 time=0 us cost=5029 size=1840 card=10)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  single-task message                             1        0.11          0.11
  SQL*Net message to dblink                      10        0.00          0.00
  SQL*Net message from dblink                    10        0.00          0.02
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     1        0.01          0.01
  SQL*Net more data from dblink                   4        0.00          0.00
  SQL*Net more data to client                     1        0.00          0.00

Improvement from 602.87 secs to 0.15 secs.


Developer was very happy with this outcome.
To be honest I wasn't completely satisfied. Hint USE_NL was just a quick guess and
I don't like guesswork to solve problems.

Queries with db links are always little tricky. Oracle simply does not have enough information to take the right decision.

If someone has an idea why USE_NL fixed this problem please leave your comment.
Thanks!

6 comments:

  1. Marko, this nested loops have resolved the situation because of the new NL implementation in Oracle 11GR2. When a remote table is queried, oracle will bring the necessary data over the DB link into the local temporary tablespace and do the join locally. The new implementation of the NL join doesn't actually do the looping thing, it reads the records from the driving table that satisfy the conditions and then reads the inner table using index range scans. When you specified the NL algorithm, Oracle only needed to bring a subset of the remote table over the DB link and into the temporary tablespace. As the remote table seems to be sizeable, 2M rows and 350MB in size, bringing it whole and then do the hash join is likely to take time. If, on the other hand, Oracle can perform a remote query that will only bring 12 rows, as it did, it will be much, much faster. The new implementation is documented in the performance tuning manual and by Jonathan Lewis on Oracle Scratchpad.

    ReplyDelete
  2. Mladen, thank you for your answers on blog and private message. As I said in private message, optimizer probably chooses wrong plan due to the lack of information about remote table. Probably it thinks that remote table is small enough to perform full scan and use HASH JOIN.

    You wrote that this behavior is related to new NL implementation introduced in 11GR2 (not earlier). In this case I'm using 11GR1 as source database and 10GR2 as remote. So I cannot relate new NL changes to my issue. But still thank you about that information. I didn't know that.

    This part is great "NL join doesn't actually do the looping thing, it reads the records from the driving table that satisfy the conditions and then reads the inner table using index range scans."

    Regards,
    Marko

    ReplyDelete
  3. Amrit Paul :- Can you pls also additionally look into hint DRIVING_SITE(DB2). Your processing in this case will be done on the Remote DB side.

    ReplyDelete
  4. Hello Amrit,

    as I don't have tables used in this post I cannot test performance using DRIVING_SITE hint.

    But it is valid option to test as the main waits were related to network traffic between databases.

    Thanks for comment.

    Regards,
    Marko

    ReplyDelete
  5. I know this is an old thread, but I ran into an issue with driving_site. It will throw an ORA-0219 error if the link is owned by a different schema (which is what I have in my case). See MOS 362028.1. USE_NL works like a champ!

    ReplyDelete
  6. Really it's so helpful and performance like anything...

    ReplyDelete