Confidence-based Event-centric Online Video Question Answering on a Newly Constructed ATBS Dataset

Abstract

Deep neural networks facilitate video question answering (VideoQA), but the real-world applications on video streams such as CCTV and live cast place higher demands on the solver. To address the challenges of VideoQA on long videos of unknown length, we define a new set of problems called Online Open-ended Video Question Answering (O²VQA). It requires an online state-updating mechanism for the solver to decide if the collected information is sufficient to conclude an answer. We then propose a Confidence-based Event-centric Online Video Question Answering (CEO-VQA) model to solve this problem. Furthermore, a dataset called Answer Target in Background Stream (ATBS) is constructed to evaluate this newly developed online VideoQA application. Compared to the baseline VideoQA method that watches the whole video, the experimental results show that the proposed method achieves a significant performance gain.

CEO-VQA Algorithm

1) Confidence-based Target Event Locator. Given an online video stream and a question, the confidence to answer $c$ is derived by using a VideoTextEncoder to evaluate the attentional information between the text features extracted from the question and the visual features extracted from the video stream. The Confidence-based Target Event Locator then filters out irrelevant events and locates the target event $\mathcal{T}$. To reduce the time complexity, we bi-directionally traverse the video in a Fibonacci way.

2) Event Question Answering. After locating the target event, a VideoEncoder is utilized to extract the visual features from the video key frames, and a QuestionEncoder is utilized to extract the linguistic features. The visual and linguistic features are then concatenated and fed into a MultiModal Encoder to perform cross-modal learning. Finally, a reliable answer is generated by the answer decoder.

BibTeX

@inproceedings{kong2023confidence,
      author    = {Kong, Weikai and Ye, Shuhong and Chenglin, Yao and Jianfeng, Ren},
      title     = {Confidence-based Event-centric Online Video Question Answering on a Newly Constructed ATBS Dataset},
      booktitle = {{IEEE} International Conference on Acoustics, Speech and Signal Processing, {ICASSP} 2023},
      publisher = {{IEEE}},
      year      = {2023},
    }

Confidence-based Event-centric Online Video Question Answering on a Newly Constructed ATBS Dataset.

ICASSP 2023

Abstract

Illustration of the Online Open-ended Video Question Answering task

CEO-VQA Algorithm

Related Links

BibTeX