mysql에서 join 이라고 하면 nested lopp join 방식 밖에 지원이 되지 않았습니다.
물론 MRR 을 활용하여 NL join에서 driven table 로의 random access 를 줄이는 BKA (Batched Key Access )
NL join 의 대상을 작은 block 으로 나누어 block 하나씩 join 하는 hash join을 어느정도 따라하려했던 BNL(Block Nested Loop) 방식이 있기는 하지만
어디까지나 nested loop join이 기반이며 hash join과 merge join 을 대체할 수는 없었습니다.
그러나 8.0.18 버전으로 올라오면서 드디어 hash join 을 지원하게 되었습니다.

Hash join 이란

( 출처 : http://www.gurubee.net/lecture/2388 )

  • 선행 테이블의 값, 후행 테이블의 값에 대해 해쉬 값이 동등한지를 확인하기 때문에 동등조건에 대해서만 join 가능
  • 후행 테이블에 인덱스가 없을 때도 사용 가능
  • 조인에 성공하면 join_buffer에 넣는데 메모리에 적재할 수 있는 크기보다 커지면 디스크에 저장함 (join_buufer_size 튜닝 중요)
  • left outer join , right outer join 불가

Hash join test

  • optimizer_switch 확인
mysql> show variables like 'optimizer_switch';
 index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,use_invisible_indexes=off,skip_scan=on,hash_join=on 
  • test data 준비
CREATE TABLE t1 (
id int(11) NOT NULL AUTO_INCREMENT ,
c1 int(11) NOT NULL DEFAULT '0',
c2 int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (id),
KEY idx_c1 (c1)
) ENGINE=InnoDB;

CREATE TABLE t2 (
id int(11) NOT NULL AUTO_INCREMENT ,
c1 int(11) NOT NULL DEFAULT '0',
c2 int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (id),
KEY idx_c1 (c1)
) ENGINE=InnoDB;

insert into test.t1 select null,round(rand()*100),round(rand()*1000)
from information_schema.columns a1, information_schema.columns b1  
limit 1000000;

hash join 확인

mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
    -> Inner hash join (t1.c2 = t2.c2)  (cost=99689553395.24 rows=99689442783)
        -> Table scan on t1  (cost=0.02 rows=998490)
        -> Hash
            -> Table scan on t2  (cost=100329.20 rows=998402)

1 row in set (0.00 sec)

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+-----------+
| count(*)  |
+-----------+
| 999567585 |
+-----------+
1 row in set (41.68 sec)

=> join 해서 10억건을 count 하는데 41초 소요

  • explain 으로는 hash join이 안보임
mysql> explain select count(*) from t1 join t2 on t1.c2 = t2.c2
    -> ;
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 131467 |   100.00 | NULL                                               |
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 131472 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

non-hash join 일 때

mysql> set optimizer_switch='hash_join=off';
Query OK, 0 rows affected (0.00 sec)

mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2\G
*************************** 1. row ***************************
EXPLAIN: <not executable by iterator executor>

1 row in set (0.00 sec)

mysql> explain select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 131467 |   100.00 | NULL                                               |
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 131472 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> select * from information_schema.processlist where id=15;
+----+------+-----------+------+---------+------+-----------+--------------------------------------------------+
| ID | USER | HOST      | DB   | COMMAND | TIME | STATE     | INFO                                             |
+----+------+-----------+------+---------+------+-----------+--------------------------------------------------+
| 15 | root | localhost | test | Query   | 1633 | executing | select count(*) from t1 join t2 on t1.c2 = t2.c2 |
+----+------+-----------+------+---------+------+-----------+--------------------------------------------------+
1 row in set (0.00 sec)

=> non-hash join 일 땐 driven table에 인덱스가 없기 때문에 이전과 같이 BNL 로 풀리며
41초 걸렸던 hash join 과는 달리 1600초 이상 기다려도 완료안됨. 작업 취소!

  • hash join on /off
mysql> set optimizer_switch='hash_join=off';
mysql> select /*+ HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2
mysql> select /*+ NO_HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2

=> session / global level에서 조정하거나 sql hint 사용