A Summarize-then-Search Method for Long Video Question Answering: Conclusion

:::info
This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Jiwan Chung, MIR Lab Yonsei University (https://jiwanchung.github.io/);
(2) Youngjae Yu, MIR Lab Yonsei University (https://jiwanchung.github.io/).
:::
Table of Links
Abstract and Intro
Method
Experiments
Related Work
Conclusion
Limitations and References
A. Experiment Details
B. Prompt Samples
5. Conclusion
We introduced Long Story Short, a summarize-then-search method that captures both the global narrative and the relevant details for video narrative QA. The approach is effective when the QA context is vast and answering requires high-level interaction with that context, as is the case in long video QA. We also proposed CLIPCheck, which further enhances the visual grounding of the model-generated answer by post-checking its visual alignment. Our zero-shot method improves over supervised state-of-the-art approaches on the MovieQA and DramaQA benchmarks. We plan to release the code and the generated plot data to the public.
\
There are two possible research directions beyond this work. First, providing visual descriptions that are better aligned with the story, using character re-identification and co-reference resolution, would improve the quality of the input to GPT-3. Second, one can devise a more dynamic multi-hop search that combines global and local information in a hierarchical manner.