1) Confidence-based Target Event Locator. Given an online video stream and a question, the answering confidence $c$ is derived by a VideoTextEncoder, which evaluates the attentional interaction between the text features extracted from the question and the visual features extracted from the video stream. Based on this confidence, the Confidence-based Target Event Locator filters out irrelevant events and locates the target event $\mathcal{T}$. To reduce the time complexity, the video is traversed bi-directionally with Fibonacci-sized steps, as sketched below.
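To make the traversal concrete, the following is a minimal Python sketch of a confidence-driven, bi-directional search whose step sizes grow along the Fibonacci sequence. The `confidence` callback (standing in for the VideoTextEncoder score against the question), the `threshold`, and the exact step schedule are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable, List, Optional


def locate_target_event(
    segments: List,                          # ordered video segments (e.g., clips of key frames)
    confidence: Callable[[object], float],   # VideoTextEncoder-style score of a segment w.r.t. the question
    threshold: float = 0.5,                  # assumed confidence threshold for accepting a segment
) -> Optional[int]:
    """Return the index of the first segment whose confidence exceeds the threshold.

    The traversal alternates between the head and the tail of the stream and
    grows its step size along the Fibonacci sequence, so long irrelevant spans
    are skipped after only a few probes.
    """
    n = len(segments)
    visited = set()
    a, b = 1, 1                 # Fibonacci step sizes: 1, 1, 2, 3, 5, 8, ...
    left, right = 0, n - 1
    while left <= right:
        for idx in (left, right):           # probe both ends of the remaining span
            if idx in visited:
                continue
            visited.add(idx)
            if confidence(segments[idx]) >= threshold:
                return idx                  # target event located
        left += a                           # advance the forward pointer
        right -= a                          # advance the backward pointer
        a, b = b, a + b                     # next Fibonacci step
    return None                             # no segment reached the confidence threshold
```

Because consecutive step sizes grow geometrically (in the Fibonacci sense), only a small, roughly logarithmic fraction of segments needs to be scored in this sketch, which is the intended source of the time saving.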
2) Event Question Answering. After the target event is located, a VideoEncoder extracts visual features from the video key frames, and a QuestionEncoder extracts linguistic features from the question. The visual and linguistic features are then concatenated and fed into a MultiModal Encoder for cross-modal learning. Finally, a reliable answer is generated by the answer decoder, as illustrated in the sketch below.
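The following is a minimal PyTorch sketch of this concatenate-then-fuse pipeline. The Transformer-based fusion layers, the feature dimensions, and the classification-style answer head are illustrative assumptions; the actual VideoEncoder, QuestionEncoder, MultiModal Encoder, and answer decoder in the model may differ.

```python
import torch
import torch.nn as nn


class EventQA(nn.Module):
    """Toy event QA head: project frame and question features into a shared space,
    concatenate the two sequences, fuse them with a Transformer encoder, and
    decode an answer distribution from the fused representation."""

    def __init__(self, vis_dim=512, txt_dim=512, hidden=512, vocab_size=3000):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)   # stands in for the VideoEncoder output projection
        self.txt_proj = nn.Linear(txt_dim, hidden)   # stands in for the QuestionEncoder output projection
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.multimodal = nn.TransformerEncoder(layer, num_layers=2)  # cross-modal fusion (assumed)
        self.decoder = nn.Linear(hidden, vocab_size)  # simple answer head over an answer vocabulary (assumed)

    def forward(self, frame_feats, question_feats):
        # frame_feats: (B, n_frames, vis_dim); question_feats: (B, n_tokens, txt_dim)
        v = self.vis_proj(frame_feats)
        q = self.txt_proj(question_feats)
        fused = self.multimodal(torch.cat([v, q], dim=1))  # concatenate, then cross-modal learning
        pooled = fused.mean(dim=1)                         # pool the fused sequence
        return self.decoder(pooled)                        # logits over candidate answers


# toy usage with random tensors standing in for encoder outputs
model = EventQA()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 12, 512))
print(logits.shape)  # torch.Size([2, 3000])
```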
