Listen and Translate

GitHub Repo

Task Description

Preprocessing

Models

1. LSTM model
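The LSTM model's core update can be sketched with a single-step numpy implementation (a minimal sketch, not the repo's actual code; the weight layout and gate order are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step over an input frame x (e.g. one MFCC vector).
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias;
    assumed gate order: input, forget, output, candidate."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2*H])       # forget gate
    o = sigmoid(z[2*H:3*H])     # output gate
    g = np.tanh(z[3*H:])        # candidate cell state
    c = f * c_prev + i * g      # new cell state
    h = o * np.tanh(c)          # new hidden state
    return h, c
```

Running this step over all frames of an utterance yields the hidden sequence the later models build on.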

2. Bidirectional LSTM model

3. Simple baseline model: an RNN that predicts only the first word of the caption
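The first-word baseline can be sketched as a vanilla RNN over the MFCC frames followed by a softmax over the vocabulary (a hypothetical numpy sketch; the weight names and vocabulary are illustrative, not from the repo):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_first_word(frames, Wx, Wh, Wo, vocab):
    """frames: (T, D) MFCC sequence.
    Wx: (H, D) input weights, Wh: (H, H) recurrent weights,
    Wo: (V, H) output projection over a V-word vocabulary."""
    h = np.zeros(Wh.shape[0])
    for x in frames:                 # vanilla RNN over the utterance
        h = np.tanh(Wx @ x + Wh @ h)
    probs = softmax(Wo @ h)          # distribution over candidate first words
    return vocab[int(np.argmax(probs))]
```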

4. Strong baseline model: Sequence to sequence

Encoder: MFCC frames $\rightarrow$ bidirectional RNN
Decoder: RNN initialized with the encoder's final state, plus attention over the encoder outputs (LuongAttention)
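The attention step above can be sketched in numpy, assuming Luong-style dot scoring (the actual scoring variant used in the repo is not stated here):

```python
import numpy as np

def luong_attention(h_dec, enc_outputs):
    """Dot-score Luong attention.
    h_dec: (H,) decoder state; enc_outputs: (T, H) encoder states.
    Returns the context vector and the attention weights."""
    scores = enc_outputs @ h_dec       # (T,) alignment scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax over source positions
    context = w @ enc_outputs          # (H,) weighted sum of encoder states
    return context, w
```

At each decoding step the context vector is combined with the decoder state before predicting the next output word.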


5. Best model: Retrieval model

Encoder (MFCC): Bidirectional GRU
Encoder (caption): embedding layer + Bidirectional GRU
Similarity: inner product of the two encodings, passed through a sigmoid
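Given the two encodings, the retrieval scoring can be sketched as follows (a minimal numpy sketch assuming each encoder has already produced a fixed-size vector; function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rank_captions(audio_vec, caption_vecs):
    """Score each candidate caption encoding against the audio encoding
    with sigmoid(inner product); return indices sorted best-first plus scores.
    audio_vec: (H,), caption_vecs: (N, H)."""
    scores = sigmoid(caption_vecs @ audio_vec)   # (N,) similarity in (0, 1)
    return np.argsort(-scores), scores
```

At test time the caption whose encoding scores highest against the audio encoding is retrieved as the answer.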


Result