Confidence-based Event-centric Online Video Question Answering on a Newly Constructed ATBS Dataset

ICASSP 2023

1 University of Nottingham Ningbo China, China; 2 Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute

Abstract

Deep neural networks facilitate video question answering (VideoQA), but real-world applications on video streams, such as CCTV and live broadcasts, place higher demands on the solver. To address the challenges of VideoQA on long videos of unknown length, we define a new set of problems called Online Open-ended Video Question Answering (O2VQA). It requires an online state-updating mechanism for the solver to decide whether the collected information is sufficient to conclude an answer. We then propose a Confidence-based Event-centric Online Video Question Answering (CEO-VQA) model to solve this problem. Furthermore, a dataset called Answer Target in Background Stream (ATBS) is constructed to evaluate this newly developed online VideoQA application. Experimental results show that the proposed method achieves a significant performance gain over the baseline VideoQA method that watches the whole video.

Illustration of the Online Open-ended Video Question Answering task

Illustration of the Online Open-ended Video Question Answering (O2VQA) task. For each frame, the solver derives a confidence score for each of the feasible candidate answers and decides whether the evidence is sufficient to give a confident answer.
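
As a concrete illustration of this online decision mechanism, the following minimal Python sketch runs the per-frame loop; the scorer candidate_confidence and the stopping threshold tau are hypothetical stand-ins for the actual solver, not the paper's implementation.

# Minimal sketch of the O2VQA per-frame decision loop. The scorer
# `candidate_confidence` and the threshold `tau` are illustrative stand-ins.

def answer_online(stream, question, candidates, candidate_confidence, tau=0.9):
    """Watch the stream frame by frame and stop as soon as one candidate
    answer can be given with sufficient confidence."""
    scores = {answer: 0.0 for answer in candidates}
    for frame in stream:
        for answer in candidates:
            # Update the running confidence of each feasible candidate answer.
            scores[answer] = max(scores[answer],
                                 candidate_confidence(frame, question, answer))
        best = max(scores, key=scores.get)
        if scores[best] >= tau:
            return best  # evidence is sufficient; answer without watching the rest
    return None  # stream ended without a confident answer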

CEO-VQA Algorithm

1) Confidence-based Target Event Locator. Given an online video stream and a question, the confidence $c$ of being able to answer is derived using a VideoTextEncoder, which evaluates the attentional information between the text features extracted from the question and the visual features extracted from the video stream. The Confidence-based Target Event Locator then filters out irrelevant events and locates the target event $\mathcal{T}$. To reduce the time complexity, the video is traversed bi-directionally in a Fibonacci manner; a sketch of this traversal is given below.
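
The sketch below shows one plausible reading of this locator; the Fibonacci step schedule, the threshold tau, and the frame-level confidence_fn are illustrative assumptions standing in for the VideoTextEncoder-based confidence described above.

# Minimal sketch of the Confidence-based Target Event Locator. The step
# schedule, threshold `tau`, and `confidence_fn` are illustrative assumptions.

from typing import Callable, List, Optional, Tuple


def fibonacci_steps(limit: int) -> List[int]:
    """Fibonacci step sizes (1, 2, 3, 5, 8, ...) up to `limit`."""
    steps, a, b = [], 1, 2
    while a <= limit:
        steps.append(a)
        a, b = b, a + b
    return steps


def locate_target_event(
    frames: List[object],
    question: str,
    confidence_fn: Callable[[object, str], float],
    tau: float = 0.8,
    window: int = 16,
) -> Optional[Tuple[int, int]]:
    """Scan the stream with Fibonacci-style strides around each anchor frame,
    keep only frames whose question-conditioned confidence exceeds `tau`,
    and return the boundaries of the located target event (or None)."""
    anchor = 0
    while anchor < len(frames):
        if confidence_fn(frames[anchor], question) >= tau:
            # Probe bi-directionally around the anchor with growing strides
            # to estimate how far the question-relevant event extends.
            start, end = anchor, anchor
            for step in fibonacci_steps(window):
                left, right = anchor - step, anchor + step
                if left >= 0 and confidence_fn(frames[left], question) >= tau:
                    start = left
                if right < len(frames) and confidence_fn(frames[right], question) >= tau:
                    end = right
            return start, end
        anchor += window  # the event is not here; skip ahead
    return None  # evidence still insufficient; keep watching the stream

In this sketch, the growing strides reduce the number of confidence evaluations compared with scoring every frame around the anchor.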

2) Event Question Answering. After locating the target event, a VideoEncoder extracts visual features from the video key frames, and a QuestionEncoder extracts the linguistic features. The visual and linguistic features are then concatenated and fed into a MultiModal Encoder for cross-modal learning. Finally, a reliable answer is generated by the answer decoder. A sketch of this stage is given below.
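
The following PyTorch sketch shows one way to wire this stage together; the feature dimensions, the transformer-based MultiModal Encoder, and the classification-style answer decoder over an answer vocabulary are assumptions made for illustration rather than the paper's exact architecture.

# Minimal sketch of the Event Question Answering stage. Dimensions, the
# transformer fusion, and the classifier-style answer decoder are assumptions.

import torch
import torch.nn as nn


class EventQA(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, hidden=768, num_answers=1000):
        super().__init__()
        # Project visual key-frame features and question features into a shared space.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Cross-modal learning over the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.multimodal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Open-ended answers are treated here as classification over an answer vocabulary.
        self.answer_decoder = nn.Linear(hidden, num_answers)

    def forward(self, key_frame_feats, question_feats):
        # key_frame_feats: (B, N_frames, vis_dim) from the VideoEncoder
        # question_feats:  (B, N_tokens, txt_dim) from the QuestionEncoder
        tokens = torch.cat(
            [self.vis_proj(key_frame_feats), self.txt_proj(question_feats)], dim=1
        )
        fused = self.multimodal_encoder(tokens)
        # Pool the fused sequence and predict an answer distribution.
        return self.answer_decoder(fused.mean(dim=1))


# Usage with random placeholder features:
model = EventQA()
logits = model(torch.randn(2, 8, 768), torch.randn(2, 16, 768))
answer_ids = logits.argmax(dim=-1)  # indices into the answer vocabulary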

Related Links

DiDeMo dataset, used as our background video source.

MSRVTT-QA dataset, used as our target video source. (We use the link provided by Frozen in Time.)

BibTeX

@inproceedings{kong2023confidence,
      author    = {Kong, Weikai and Ye, Shuhong and Yao, Chenglin and Ren, Jianfeng},
      title     = {Confidence-based Event-centric Online Video Question Answering on a Newly Constructed ATBS Dataset},
      booktitle = {{IEEE} International Conference on Acoustics, Speech and Signal Processing, {ICASSP} 2023},
      publisher = {{IEEE}},
      year      = {2023},
    }