As the title says. This is about software that generates accurate time stamps for a subtitle file by recognizing when different people speak. It is not about transcribing what they are saying (though that would be a welcome bonus). The goal is to speed up the placement of time stamps for dialogue; the user then adds the dialogue text manually by ear. Placing hundreds to thousands of time stamps accurate to the millisecond is a time-consuming process for a human worker. A working (AI) software solution would be very useful for subtitling projects.

I am looking for free solutions first, before turning to paid ones and concluding that there is no other alternative.

I have tried finding a solution by testing services from Amazon Web Services, Amberscript, Konch, and more. They are fair transcription services, but they were probably made with meeting recordings or speeches in mind, not dialogue in a video.

Maybe I am just not knowledgeable enough to use them properly. But setting aside the question of whether their output has the right formatting for a subtitle file, the recurring problem is that the software can’t differentiate well enough between different speakers or between separate pieces of dialogue. These services tend to gather many separate pieces of dialogue into one big block with a single time stamp, skipping several places where there should have been time stamps. That a block contains dialogue from different speakers with clearly distinct voices also says something about their ability to differentiate speakers.
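To make concrete what kind of output I am after, here is a minimal Python sketch. It is not a solution and not how any of the services above work. It assumes the audio has already been reduced to a list of per-frame loudness values (a hypothetical input), and it only splits on pauses; it does not tell speakers apart. All function names, the threshold, and the gap length are illustrative assumptions. The cue layout follows the common SubRip (.srt) convention of an index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, and the text.

```python
def format_srt_time(ms):
    """Format a millisecond count as an SRT time stamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def find_speech_segments(frame_energies, frame_ms=10, threshold=0.1, min_gap_ms=300):
    """Return (start_ms, end_ms) pairs for runs of loud frames.

    A new segment starts only after at least min_gap_ms of quiet frames;
    shorter pauses are kept inside the current segment.
    """
    segments = []
    start = None      # index of the first loud frame in the current segment
    last_loud = None  # index of the most recent loud frame
    for i, energy in enumerate(frame_energies):
        if energy < threshold:
            continue
        if start is None:
            start = i
        elif (i - last_loud - 1) * frame_ms >= min_gap_ms:
            # The pause was long enough: close the segment, open a new one.
            segments.append((start * frame_ms, (last_loud + 1) * frame_ms))
            start = i
        last_loud = i
    if start is not None:
        segments.append((start * frame_ms, (last_loud + 1) * frame_ms))
    return segments


def to_srt(segments):
    """Build SRT text with placeholder lines to be replaced by ear."""
    cues = []
    for number, (start_ms, end_ms) in enumerate(segments, start=1):
        cues.append(
            f"{number}\n"
            f"{format_srt_time(start_ms)} --> {format_srt_time(end_ms)}\n"
            f"[dialogue {number}]\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    # Synthetic loudness values: speech, a 400 ms pause, then more speech.
    energies = [0.0] * 10 + [1.0] * 50 + [0.0] * 40 + [1.0] * 30 + [0.0] * 5
    print(to_srt(find_speech_segments(energies)))
```

Something like this would already give separate time stamps for pieces of dialogue separated by a pause, but it would still merge speakers who talk back to back, which is exactly the part I cannot get the tested services to handle.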

Does anyone have any additional insights?
