Video Caption Generation
Task Description
- Given a short video, predict the corresponding caption that depicts the video

Model

- encoder + decoder
- uni-directional LSTM
LuongAttention: Allow model to peek at different sections of inputs at each decoding time step

ScheduledEmbeddingTrainingHelper: To solve “exposure bias” problem, When training, we feed (groundtruth) or (last time step’s output) as input at odds

Evaluation
- BLEU@1:
$\text{BP=}\begin{cases} 1 & \text{if } c > r \newline e^{1-r/c} & \text{if } c\leq r \end{cases}$
$\text{Precision = correct words / candidate length}$
$\text{BLEU@1 = BP}\times \text{Precision}$
- Attention:
| without attention model | LuongAttention | BahdanauAttention | |
|---|---|---|---|
| $\text{BLEU@1 score}$ | 0.5994 | 0.6059 | 0.5867 |
- Schedule sampling:
| without schedule sampling | ScheduledEmbeddingTrainingHelper | |
|---|---|---|
| $\text{BLEU@1 score}$ | 0.5994 | 0.6478 |
- Attention + Schedule sampling: $\text{BLEU@1 score = 0.6510}$