Transformer: Multi-head attention

Overview

Let's look at the shortcomings of single-head attention to understand why multi-head attention was devised, and then examine what advantages multi-head attention offers.

Single-head attention and its problems

The attention mechanism that was applied to RNN sequence-to-sequence models, as opposed to the multi-head attention mechanism found in the Transformer architecture, can be regarded as single-head attention. Attention was introduced to relieve the bottleneck problem of RNN sequence-to-sequence models: it reduced the burden on the encoder and eased the limit on the maximum sequence length that a sequence-to-sequence model could handle.

Because the attention mechanism had performed well, the authors set out to build the Transformer using attention alone. Their aim was to use self-attention so that the encoder and decoder input sequences each produce representation vectors (= attention vectors) of themselves, and to build a sequence-to-sequence model on top of those.

However, the single-head attention used for self-attention had a weakness: when performing self-attention, it is hard to attend to several time steps at once.

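To explain with an example, consider the following hypothetical situation.

A single sequence is performing self-attention. The current time step is the object of the sentence, and understanding this object requires both the subject and the verb.

In this situation, there are two reasons why it is hard for single-head attention to attend to the words at several time steps.

1. The attention vector is built from the value vectors weighted by the attention weights, and the attention weights are the output of a softmax. Because a softmax tends to put most of its mass on the largest score, in most cases the attention vector does not represent the value vectors of several time steps so much as the value vector of almost a single time step (e.g. the subject).

2. When the attention scores are computed, there is no way to explicitly specify that information should be retrieved from a particular handful of time steps.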
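The first reason can be illustrated with a small sketch. The scores and value vectors below are made-up numbers, not taken from any trained model; they only show how a softmax over moderately different scores ends up dominated by one time step, so the weighted sum is close to that step's value vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores of the "object" query against three
# time steps: subject, verb, object (illustrative values only).
scores = [6.0, 3.0, 1.0]
weights = softmax(scores)

# Toy 2-dimensional value vectors, one per time step (also made up).
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Attention vector = attention-weighted sum of the value vectors.
attention_vector = [
    sum(w * v[d] for w, v in zip(weights, values))
    for d in range(2)
]

print([round(w, 3) for w in weights])           # the first weight dominates
print([round(a, 3) for a in attention_vector])  # close to the subject's value vector
```

Even though the verb's score is only 3 lower than the subject's, it receives a small fraction of the weight, which is why one head struggles to carry the subject and the verb at the same time.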

* The equations for the attention vector, for the key, query, and value vectors used to build it, and for the attention scores and attention weights are as follows.
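One common way to write these quantities for self-attention is sketched here. The notation is an assumption made for illustration ($h_t$ for the input representation at time step $t$, $W_q$, $W_k$, $W_v$ for the learned projection matrices, $l$ for the current time step), and the score is shown as a plain dot product; the Transformer additionally divides it by $\sqrt{d_k}$.

$$
\begin{aligned}
q_t &= W_q h_t, \qquad k_t = W_k h_t, \qquad v_t = W_v h_t \\
e_{l,t} &= q_l \cdot k_t && \text{(attention score)} \\
\alpha_{l,t} &= \frac{\exp(e_{l,t})}{\sum_{t'} \exp(e_{l,t'})} && \text{(attention weight)} \\
a_l &= \sum_{t} \alpha_{l,t}\, v_t && \text{(attention vector)}
\end{aligned}
$$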

Multi-head attention was devised to improve on this limitation of single-head attention.

Multi-head attention and its mechanism

As the name suggests, multi-head attention is an attention structure in which several attention mechanisms (heads) are applied side by side to the same sequence.

Let's first compare single-head and multi-head attention using figures, and then see how multi-head attention manages to attend to information from several time steps.

Single-head attention

Multi-head attention

Unlike the single-head case, multi-head attention has several heads, and attention is performed in each head. In the figure the heads are distinguished by color: there are green, blue, and red heads, for a total of three heads. (Each head has its own weights.)

Because multi-head attention has several attention heads, it yields as many attention vectors as there are heads. This makes the following pattern of attention possible, so the model can attend to both the subject and the verb in the hypothetical situation above.

Green head: retrieves the value vector of time step 1. (Assume time step 1 is the position of the subject.)

Blue head: retrieves the value vector of time step 2. (Assume time step 2 is the position of the verb.)

Red head: retrieves the value vector of time step 3. (Assume time step 3 is the position of the object.)

The three heads each perform attention, producing three attention vectors, which are concatenated and used as a single attention vector.

Because multi-head attention builds its attention vector through this process, it can capture attention information about several time steps, unlike single-head attention.
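The process described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation from the paper: the sizes (`d_model = 6`, three heads), the random weights, and the helper names `matvec`, `single_head`, and `multi_head` are all hypothetical, and the score is scaled by the square root of the head dimension as in scaled dot-product attention.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head(x_t, xs, Wq, Wk, Wv):
    """Attention vector of one head for the current time step x_t."""
    q = matvec(Wq, x_t)
    keys = [matvec(Wk, x) for x in xs]
    vals = [matvec(Wv, x) for x in xs]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, vals))
            for d in range(len(vals[0]))]

def multi_head(x_t, xs, heads):
    """Run every head, then concatenate the per-head attention vectors."""
    concat = []
    for Wq, Wk, Wv in heads:
        concat.extend(single_head(x_t, xs, Wq, Wk, Wv))
    return concat

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, n_heads = 6, 3          # toy sizes chosen for illustration
d_head = d_model // n_heads      # each head works in 2 dimensions

# One (Wq, Wk, Wv) triple per head: every head has its own weights.
heads = [(rand_matrix(d_head, d_model),
          rand_matrix(d_head, d_model),
          rand_matrix(d_head, d_model)) for _ in range(n_heads)]

# Three time steps standing in for subject, verb, object.
xs = [rand_matrix(1, d_model)[0] for _ in range(3)]

concat = multi_head(xs[2], xs, heads)   # current time step: the object
print(len(concat))                      # n_heads * d_head = 6
```

Because each head has its own query and key projections, each head computes its own softmax, so one head can put its weight on the subject while another puts its weight on the verb.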

The equations for building the attention vector in multi-head attention are as follows.
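For reference, the formulation given in the Transformer paper listed under References can be written as below. Here $h$ is the number of heads, $W_i^Q$, $W_i^K$, $W_i^V$ are the projection matrices of head $i$, $d_k$ is the per-head key dimension, and $W^o$ is the output weight matrix; the original figure for this section may have used slightly different symbols.

$$
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^o
\end{aligned}
$$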

Additional advantages of multi-head attention

Unlike single-head attention, multi-head attention runs attention over a single sequence in several heads, so each head processes fewer dimensions. In addition, the Transformer has no recurrence, which makes parallel processing straightforward, so multi-head attention is computationally efficient.

The number of dimensions each head has to handle in single-head and multi-head attention is as follows.

(For convenience, $d_{model}$, the value that unifies the input and output dimensions, is set to 512, and the number of heads is set to 8.)

Single-head attention: 512 dimensions (there is only one head, so that one head processes all of them).

Multi-head attention: 512 / 8 = 64, so each head computes a 64-dimensional attention vector.

Multi-head attention produces eight 64-dimensional attention vectors and concatenates them. The concatenated attention vector is then multiplied by the output weight matrix $W^o$ to produce a 512-dimensional activation again.
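The dimension bookkeeping above can be checked with a few lines of arithmetic. The sketch assumes that each head uses three projection matrices (query, key, value) of shape `(d_head, d_model)` and ignores biases; under that assumption the eight small heads together hold as many projection weights as one full-width head.

```python
d_model = 512
n_heads = 8
assert d_model % n_heads == 0     # the heads must split d_model evenly
d_head = d_model // n_heads       # 64 dimensions per head

# Projection weights (W_q, W_k, W_v), biases ignored.
params_per_head = 3 * d_head * d_model
multi_head_params = n_heads * params_per_head
single_head_params = 3 * d_model * d_model

# Concatenating the head outputs restores the model width,
# so W^o maps a 512-dimensional vector back to 512 dimensions.
concat_dim = n_heads * d_head
w_o_shape = (d_model, concat_dim)

print(d_head, concat_dim, multi_head_params == single_head_params)
# prints: 64 512 True
```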

References

CS 182: Deep Learning, https://cs182sp21.github.io

https://arxiv.org/pdf/1706.03762.pdf
