===================
== Nathan's Blog ==
===================
infrequent posts about things I am working on

Processing Audio Files with Amazon Transcribe

amazon aws interviews oral history transcribe transcription

I have been working on collecting a family’s oral history for the past few months. During the process I took notes with simple descriptions of what the speaker was describing or telling and a rough timestamp of when in the file the conversation took place. After collecting hours of stories, I realized that having a transcription would make things much easier to search and perhaps more useful to those interested in these particular histories. Why not get a transcription of the contents via one of the cloud offerings? Amazon offers a service called Transcribe that is available via the AWS suite of services. Since I have a small account and some credits to burn I figured why not kick the tires and see how Transcribe would perform on meandering oral history interviews. But before I jump into the how, let me describe my particular use case.

Over the course of a few months, I have collected several half hour to two hour long interviews via multiple recording set ups. Some files were captured with Open Broadcast Software (OBS) and include video. Others were captured using a TASCAM field recorder and are .wav files. In order to get the audio of each interview put together and normalized, I used a free application called ocenaudio. It allowed me to load .flac, .wav, and other audio formats in the same editing channels and add various effects to sections or the entire workspace. Ocenaudio’s interface allows for simple drag-and-drop editing of sound files. It is also worth noting that Ocenaudio is non-destructive meaning it leaves the original audio file alone. This can be a little confusing if you are not used to using software that performs non-destructive editing. When a project is saved, is exports the results to a new file. Keep that in mind if you plan on adding additional files later.

After I collected and normalized all the sound files, I decided to turn to AWS to get a transcription. Transcribe limits the processing size to files to 2GB. The first collection of interviews was about 3 hours long and 2.5GB in size. I split the collection up into two smaller sizes. AWS needs the file to be uploaded to S3 in order for Transcribe to access it. Here is how I set up the S3 bucket with the audio file.

  • Place it in a region near your house. Or don’t, its up to you.
  • Give the S3 bucket a unique name. You can do what ever you like as long as it conforms to the S3 naming standard.
  • Block all public access
  • Encrypt the bucket. I used AES-256 encryption
  • Give read and write access to your AWS account.
  • Set the S3 bucket to Intelligent Tiering storage class. It does not matter too much unless you forget to spin the bucket down in which case Intelligent Tiering will push the storage into lower cost, slow storage.
  • Upload the file. As long as the files are less than 50GB, you can use the browser upload tool. Otherwise, you will need to download, configure, and use the command line utility.

Once your file is uploaded to S3, select the checkbox next to it and copy the “Object URL.” Transcribe needs this in order to find and process the file. If these steps do not work, check the AWS tutorial on connecting an S3 bucket to a Transcribe instance.

Screenshot from the AWS tutorial

Once the files are available in S3, setting up a Transcribe job is straightforward. The following needs to be configured for Transcribe to create a job.

  • Name of the job
  • The language of the interviews. I processed English but would love to hear someone’s experience processing other languages.
  • The location of the input file. That object URL you copied earlier.
  • The location of where the output should go. I used Amazon’s default.

That is it! Jobs take a little while to run. Mine took about 20 minutes for audio files that are an hour long. Your mileage may vary. The results can be downloaded as a JSON file. A sample JSON object returned from Transcribe would be:

Parsing the JSON object is all that is left for doing something useful with the transcript. My Transcribe results were not stellar. I believe that was due to the fact that those people I interviewed have an accent and the vocabulary used would sometimes be in another language. The service did do two things well. It identified the various speakers on the tape correctly and when I spoke, as the interviewer, it was able to correctly catch most of what I said. So if your audio files have native English speakers, this service does a fairly decent job. You will likely need to do some significant post processing to transform the transcript into a useful document.