Hello!
Continuing on the topic of the last post, visual animations for scientific purposes, I would like to extend this idea to creating cinematic clips. In my experience, such realistic digital animations can be extremely powerful intros and outros for presentations, leaving a lasting impression on the audience. When shown at the start, they set the tone for what is about to come, helping maximise attention as the presenter shares the content. Similarly, at the end, the animation is the last scene the audience sees and can help cement the main idea we would like them to take away from the talk. While producing such clips may still require some technical effort, the joy and excitement they generate during the talk make it worth it, especially as they can inspire non-scientific audiences to pay attention and genuinely try to grasp some of the ideas presented.
From a technical perspective, over the last few years, ground-breaking advances in generative artificial intelligence (Gen. AI) and large language models have materialised in many tools for text, image and, recently, video generation. While exceptional tools already exist for designing and producing cutting-edge, realistic animations (like Blender), these models dramatically lower the expertise and time required to create a short video that still looks very professional. Therefore, in this post, I would like to highlight some of the tools and pipelines based on generative AI models that I have used to produce such animations and successfully incorporate them into scientific presentations. Also, note that all the tools presented below are open-source or provide free versions and trials to test the software (which, in my experience, have been more than sufficient for all the trials I needed to reach a final product). Overall, the following steps span the ideation phase, movie clip generation with AI models, post-processing to improve image quality and add cinematic effects (e.g., slow motion), and a final production phase to assemble the complete movie.
From idea to initial sketches
The first step is to define the storyline. I believe in simplistic yet powerful representations, and looking back, I see that my presentations have been evolving over the years, showing a relatively consistent story but with progressively more attractive visualisations. In other words, early on, I had a view of what I wanted to share, and as I developed new skills and new tools became available, I could invest the time to make this original vision a reality.
From a practical standpoint, this approach translates to coming up with a rough idea of the scenes we would like to have, how they would flow together and what tools we can or should use to produce each one. For example, I have worked with mouse models for bone adaptation and regeneration in my research. Most of our experiments also involve some form of mechanical loading on bones to stimulate growth. However, intuitively, locomotion is one of the primary forms of loading on our skeleton, making it a suitable movement for any audience to visualise (and animating our actual experimental setup would likely be incredibly difficult). Therefore, a straightforward introduction scene could be a mouse running, where the presenter can verbally communicate any technical details of the topic while keeping a light and intuitive visualisation on screen. Besides, with the current generation of models, we still need to balance scene complexity and model capabilities.
While current generative AI tools are becoming more powerful, they cannot yet render all prompts submitted accurately; hence, such an example scene should be simple enough to remain within the abilities of current models but powerful enough to make an impact during a presentation.
Movie scene generation
Having established our storyboard, we can start experimenting with gen. AI models. At the time of writing, the following models were the most impressive during my attempts: HunyuanVideo from Tencent and KlingAI. While the former is available on HuggingFace and readily deployable on high-performance computing (HPC) platforms or any system with GPUs, I only used the latter through its browser platform, with the trial credits and limited functionality. Still, HunyuanVideo was more than capable of handling the majority of the clips I produced.
After following all the installation steps provided, we should be ready to generate videos with HunyuanVideo. For this purpose, we now need to translate each of our desired scenes into a clear text prompt to be given as input. And this is where we must remain flexible with our storyboard to accommodate the outputs of the model. While there are a few parameters we can tune to make the model follow the prompt more closely, these also affect the output quality, and I have not found a straightforward way to make it produce exactly the image I had in mind. Nonetheless, I was amazed at the quality of the outputs. For this reason, it is ideal if the model is deployed on a system we can use extensively, as calibrating the model parameters and the input prompt typically requires several iterations for each scene (e.g., I needed about 20-30 iterations before the model started producing clips that I found impressive and valuable for my presentation). In my trials, I first described an initial prompt and used the default model settings to generate the clips. An interesting (and, to me, unexpected) initial finding was that the output depends on the chosen resolution and number of frames, even with the random seed and input prompt fixed. Therefore, I would recommend setting these to the desired values first and optimising the prompt and remaining model parameters in subsequent trials (in the case of the resolution, this typically means the highest the model supports, e.g., 1280×720).
For the same reason, I also found it extremely helpful to keep a logbook for each trial to build an intuitive sense of the effect of each parameter on the output. While it is difficult to predict how many trials are needed to reach a successful outcome, this grid-search/trial-and-error approach should be sufficient to identify working parameters without requiring more complex optimisation techniques. Furthermore, as I started this process well in advance, I could let the model run for a few hours (or overnight) while working on other tasks, and then evaluate each output and adjust accordingly for subsequent trials.
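To make this concrete, below is a minimal sketch of the kind of logbook I mean, written in Python. The field names (scene, seed, number of frames, etc.) are my own choice rather than anything prescribed by HunyuanVideo, so adapt them to whichever parameters your model version actually exposes.

```python
# trial_logbook.py - minimal sketch of a per-trial logbook (field names are
# illustrative, not prescribed by any specific model).
import csv
from datetime import datetime
from pathlib import Path

LOGBOOK = Path("trials.csv")
FIELDS = ["timestamp", "scene", "prompt", "seed", "width", "height",
          "num_frames", "infer_steps", "output_file", "notes"]

def log_trial(**trial):
    """Append one trial (parameters plus a subjective note) to the CSV logbook."""
    new_file = not LOGBOOK.exists()
    with LOGBOOK.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"timestamp": datetime.now().isoformat(timespec="seconds"), **trial})

# Example entry after reviewing one generated clip:
log_trial(scene="mouse_running",
          prompt="A mouse running across a laboratory bench, cinematic lighting",
          seed=42, width=1280, height=720, num_frames=129, infer_steps=50,
          output_file="results/mouse_running_seed42.mp4",
          notes="Good camera motion, fur texture slightly blurry")
```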
Ultimately, the default settings are likely fine for most applications, with only minor changes required; most of the improvement comes from formulating the prompt. In other words, most of the time should be spent learning to describe key details (e.g., camera movement or desired colours) or finding the right order of sentences to produce a meaningful output. During this process, I would typically keep the random seed fixed, and after identifying a working set of inputs, I would generate additional clips with the same settings but different seeds to produce an abundance of clips to select from. Indeed, even though I kept the prompt fixed (including the camera movement descriptions), some outputs showed views from different angles that were consistently impressive and that I ended up combining into the final clip. In my case, I repeated this process for only the two key scenes I wanted to generate, and by the end, the variety of clips produced was sufficient to meet my needs. Still, I expect more ambitious projects with more scenes to require a larger time commitment to calibrate the inputs for each of them.
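As a small companion sketch, this is roughly how I would batch the seed variations once a prompt works. The script name and flag names below are placeholders, not the authoritative interface: substitute whatever sampling command your installed HunyuanVideo version documents in its README.

```python
# batch_seeds.py - sketch of generating several variants of a calibrated prompt
# with different random seeds. The script path and flag names are placeholders:
# substitute the sampling command documented by your installed model version.
import subprocess

PROMPT = ("A mouse running across a laboratory bench, shallow depth of field, "
          "soft cinematic lighting, slow tracking shot")  # the prompt that worked

for seed in [7, 21, 42, 101, 2024]:
    cmd = [
        "python", "sample_video.py",              # placeholder entry point
        "--prompt", PROMPT,
        "--seed", str(seed),
        "--save-path", f"results/mouse_seed{seed}",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)               # one clip per seed, same settings otherwise
```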
Creating a cinematic look
Most of the models I tested produced video clips at 24 or 30 frames per second (FPS). However, I am a fan of slow-motion videos to achieve a truly cinematic look, which typically requires 60 FPS or higher. Hence, after selecting the outputs from the previous step, I would consider using an additional model to perform frame interpolation and increase the FPS. In my attempts, SuperSlowMo was particularly effective, relatively simple to install on the same HPC platforms and straightforward to use. Typically, doubling the FPS produced excellent results, while tripling it would create some artificial blur depending on the input. Furthermore, this model outputs files in .mkv format, which had to be converted back to .mp4 for convenience. Nonetheless, it was extremely swift and required virtually no parameter configuration (apart from the target FPS), making it a strong choice.
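For the .mkv to .mp4 conversion, a quick option is to remux the files with ffmpeg (assuming it is installed on the system); since only the container changes, no re-encoding is needed. The folder name below is a placeholder for wherever the interpolated outputs are stored.

```python
# remux_mkv.py - convert interpolated .mkv outputs back to .mp4 by rewrapping the
# container with ffmpeg (no re-encoding). Assumes ffmpeg is on the PATH; the
# "interpolated" folder name is a placeholder.
import subprocess
from pathlib import Path

for mkv in Path("interpolated").glob("*.mkv"):
    mp4 = mkv.with_suffix(".mp4")
    # "-c copy" copies the streams as-is; drop it to re-encode if a player
    # struggles with the codec inside the new container.
    subprocess.run(["ffmpeg", "-y", "-i", str(mkv), "-c", "copy", str(mp4)], check=True)
    print(f"Converted {mkv.name} -> {mp4.name}")
```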
Additionally, after this step, given that the resolution of the videos is still HD (typically the highest resolution such gen. AI models support), I would also consider a post-processing step to improve the image quality and upscale the videos. In my experiments, I found the tools available in Vmake to be versatile and powerful. Conveniently, most of the generated clips have a maximum duration of 4 to 5 seconds, which is also the maximum duration allowed for free video processing, so all our clips can be processed without issue. Although the output is at 2K or 4K resolution and my edits were typically in full HD, the improvement in image quality was still noticeable in the final edit.
Editing the final montage
Finally, after producing all the pieces, we can move to the editing step. I genuinely enjoy DaVinci Resolve and firmly believe it is one of the most versatile video editing programs available; even the free version has been more than capable of handling every clip I have ever created. While the following may refer to functions available in this tool, similar capabilities are likely available in other software.
We begin by creating a new project and importing our clips into the Media tab. Depending on the storyboard, we may also need an image of the first title slide of the presentation to create a smooth transition into the start of the video. Another optional item is the soundtrack. While I have had cases where I added an inspirational, epic soundtrack to complement the scene, it is also completely fine to show the videos and simply have the presenter speak over them. On this note, I would also mention AI voice generation tools, particularly those from ElevenLabs, which can truly enhance the impact of such videos by adding realistic narration. They also provide a generous amount of trial credits, which should suffice for a seconds-long clip describing the movie scene or any other content of interest. Ultimately, we can create a 15-20 second clip, with or without music and with or without a narrator, that can dramatically improve the visual impact of our presentation.
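If you prefer to attach a generated narration to a clip before importing it into the editor (dropping both files onto the timeline works just as well), a simple ffmpeg mux is enough. The filenames below are placeholders, and ffmpeg is assumed to be installed.

```python
# add_narration.py - optional sketch: mux a voice-over track onto a clip with
# ffmpeg. Filenames are placeholders; you can equally combine the two directly
# on the DaVinci Resolve timeline.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "intro_clip.mp4",         # the selected video clip
    "-i", "narration.mp3",          # e.g., an exported AI-generated voice-over
    "-map", "0:v:0", "-map", "1:a:0",
    "-c:v", "copy", "-c:a", "aac",  # keep the video untouched, encode audio to AAC
    "-shortest",                    # stop at the end of the shorter stream
    "intro_clip_narrated.mp4",
], check=True)
```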
Additional tricks for added cinematic effect include freezing the first frame and only starting the movie clip as the narrator begins speaking or the soundtrack reaches a critical transition, as well as overlaying additional content on the clips. For the former idea, instead of freezing the first frame, we may also use other models that accept both text and image inputs to create clips that align perfectly with the subsequent clip. In my experiments, I would like to highlight KlingAI and Stable-Virtual-Camera. While these only allow providing the initial frame as input (note that KlingAI supports an end-frame input, but only on the paid tier), we can still generate, for example, a smooth zoom-out clip based on the first frame of our main clip such that, when played in reverse, it becomes a zoom-in clip ending precisely where the main clip continues. After editing with cross-fade or blur transitions, I believe this approach creates a more pleasant effect for the viewer than a still frame and eases into the main scenes, which typically have more camera and subject movement. On a different note, for the content-overlay approach, I also experimented with adding an image (e.g., a logo) on a clip with relatively high camera movement to create an outro visual effect. By placing a position tracker on a stable landmark in the scene (e.g., the horizon), we can apply its trajectory to the image of interest, effectively embedding it in the scene and creating the realistic impression that it was part of it all along. These are some ideas I have personally tried, but each user and application can hopefully explore what is most impactful for their montage.
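For the reverse-playback trick, the reversal itself can be done directly in the editor via the clip speed controls, but if you want a reversed file up front, ffmpeg's reverse filter is a quick alternative for short clips (it buffers the whole clip in memory, which is fine at a few seconds). The filenames below are placeholders.

```python
# reverse_clip.py - turn a generated zoom-out clip into a zoom-in clip by playing
# it backwards. Fine for short clips; the "reverse" filter holds all frames in memory.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "zoom_out_from_first_frame.mp4",  # placeholder input filename
    "-vf", "reverse",                        # reverse the order of the video frames
    "-an",                                   # drop audio (generated clips are usually silent)
    "zoom_in_to_first_frame.mp4",
], check=True)
```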
Lastly, if needed, the generated clips can also be edited along with other animations, such as the ParaView clips discussed in the previous post, effectively creating a smooth transition between the cinematic introduction and the beginning of the scientific content of the presentation.
Conclusion
Generative AI models are becoming widely available, and their capabilities are expected to keep improving rapidly. While many applications target artistic use cases, I sincerely believe we can harness some of this potential to enhance scientific communication and make science accessible to any audience. I am excited to follow the latest developments in these technologies and keep experimenting with new tools to understand their potential and applications.
Please feel free to share your thoughts about your approach to producing scientific animations!
Have a great day!