Sequential Vision to Language as Story: A Storytelling Dataset and Benchmarking
Blog Article
Storytelling is a remarkable human skill that plays a significant role in learning and experiencing everyday life. Developing narratives is central to human mental health development, simultaneously encapsulating broad details such as psychology, morality, and common sense. Contemporary deep-learning algorithms require similar skills to tell a story from a visual perspective. However, most algorithms function at a superficial or factual level, aligning descriptive text with images in a one-to-one manner without considering the temporal relations between them.
Stories are more expressive than such descriptions in style, language, and content, and they involve imaginary concepts that are not explicit in the images. An ideal deep learning system should learn to develop cohesive, meaningful, and causal stories. Unfortunately, most existing storytelling methods are trained and evaluated on a single dataset, i.e., the VIsual STorytelling (VIST) dataset. Multiple datasets are essential to test the generalization ability of algorithms. We bridge this gap and present a new dataset for expressive and coherent story creation: the Sequential Storytelling Image Dataset (SSID, org/documents/sequential-storytelling-image-dataset-ssid).
Moreover, our dataset achieves lower average scores across all metrics, indicating that its ground-truth stories are more diverse. Finally, we train and evaluate existing state-of-the-art storytelling methods on both datasets (VIST and SSID) and show that our dataset is more challenging and requires more sophisticated techniques to accurately detect a wide variety of events.
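To make the link between lower scores and diversity concrete, here is a minimal sketch in Python. It assumes a simple unigram-overlap F1 as a stand-in for the evaluation metrics, which the text above does not name, and it uses made-up toy stories rather than samples from SSID or VIST. The function names are hypothetical. The idea is that when the reference stories for one image sequence are scored against each other, lower agreement suggests the annotators wrote more varied stories.

```python
from collections import Counter
from itertools import combinations


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two stories (a stand-in, not the paper's metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_agreement(stories: list[str]) -> float:
    """Average overlap over all pairs of reference stories for one image sequence."""
    pairs = list(combinations(stories, 2))
    if not pairs:
        return 0.0
    return sum(unigram_f1(a, b) for a, b in pairs) / len(pairs)


# Toy references invented for illustration only.
similar = [
    "the dog runs in the park",
    "the dog runs in a park",
    "a dog runs in the park",
]
diverse = [
    "the dog runs in the park",
    "she finally felt free that morning",
    "nobody expected the rain to stop",
]

print(mean_pairwise_agreement(similar))
print(mean_pairwise_agreement(diverse))
```

Under these assumptions, the descriptive, near-identical references agree far more with each other than the story-like ones do, which is the pattern the lower scores on the new dataset are taken to reflect.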