This is a new family of open-source video models designed to take on Gen-2. There is a 576x320 model that uses under 8 GB of VRAM and a 1024x576 model that uses under 16 GB of VRAM. The recommended workflow is to render with the 576x320 model, then upscale to 1024x576 using vid2vid via the 1111 text2video extension. This allows for better compositions overall and faster exploration of ideas before committing to a high-res render.
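To make the two-stage workflow a bit more concrete, here is a rough Python sketch of the numbers involved. The resolutions and VRAM limits come from the description above; the `VideoModel` class, the stage names, and the `stages_that_fit` helper are made up for illustration and are not part of any real tool or extension. Pixel count is only a loose proxy for render cost, so treat the ratio as a ballpark rather than a benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    name: str      # hypothetical label, not an official model ID
    width: int
    height: int
    vram_gb: int   # VRAM ceiling quoted above ("under N GB")


# Figures taken from the text; the names are illustrative only.
DRAFT = VideoModel("draft-576x320", 576, 320, 8)
FINAL = VideoModel("final-1024x576", 1024, 576, 16)


def pixels_per_frame(model: VideoModel) -> int:
    """Number of pixels the model has to produce for one frame."""
    return model.width * model.height


def stages_that_fit(available_vram_gb: float) -> list[VideoModel]:
    """Return the workflow stages whose quoted VRAM ceiling fits the card."""
    return [m for m in (DRAFT, FINAL) if m.vram_gb <= available_vram_gb]


# How much more work per frame the high-res pass is, by pixel count alone.
pixel_ratio = pixels_per_frame(FINAL) / pixels_per_frame(DRAFT)

if __name__ == "__main__":
    print(f"draft frame: {pixels_per_frame(DRAFT)} px")
    print(f"final frame: {pixels_per_frame(FINAL)} px")
    print(f"final / draft pixel ratio: {pixel_ratio:.1f}")
    for vram in (8, 12, 16):
        names = [m.name for m in stages_that_fit(vram)]
        print(f"{vram} GB card can run: {names}")
```

By pixel count the 1024x576 frame is about 3.2 times the size of the 576x320 one, which is one way to see why it pays to explore ideas at the lower resolution and only upscale the shots worth keeping.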
At the current rate of progress, fully generated films are probably possible next year. Audio and video generation is not yet good enough, but the quality will probably reach Midjourney 5 level within the next half year. Scripts for a full movie can be written by GPT-4; it currently still needs a lot of help to produce a good result, but with better fine-tuning that shouldn't be a problem. The audio and video parts can then be combined using the ChatGPT code interpreter, which already works quite well.