Aws
I recently started working on a workflow for picking up files from S3, processing them, and writing the results to another S3 location. This is a common pattern in data processing pipelines and our team wanted to see whether we could do it using AWS serverless services. We were able to get it running via Lambda functions and event triggers published to an AWS EventHub. The entire workflow was fairly easy to stand up once we grasped how the various services worked together.
Below are some notes on getting
<a href="https://github.com/edenhill/kafkacat#build">kafkacat</a>installed on an Amazon workspace with admin access.The commands listed on the GitHub page will not work without a little preparation. A Linux Amazon Workspace image is based on Amazon Linux. Attempts to use a package manager like
yumgo through a plugin,amzn_workspaces_filter_updates. This filter only has a handful of packages (30 at the time of this writing) that can be pulled. The first thing to do is add Extra Packages for Enterprise Linux, EPEL, to the instance’s package repository. Following the instructions on the Fedora FAQ run:I have been working on collecting a family’s oral history for the past few months. During the process I took notes with simple descriptions of what the speaker was describing or telling and a rough timestamp of when in the file the conversation took place. After collecting hours of stories, I realized that having a transcription would make things much easier to search and perhaps more useful to those interested in these particular histories. Why not get a transcription of the contents via one of the cloud offerings? Amazon offers a service called Transcribe that is available via the AWS suite of services. Since I have a small account and some credits to burn I figured why not kick the tires and see how Transcribe would perform on meandering oral history interviews. But before I jump into the how, let me describe my particular use case.