Normalizing data is a common data engineering task. It prepares information to be stored in a way that minimizes duplication and is digestible by machines. It also aims to solve other problems that are out of scope for this particular article but worth reading about if you find yourself struggling to understand jokes about E. F. Codd. This raises the question: why does normalization matter when entering information in a table or organizing a spreadsheet? To properly answer that question, we should explore a simple example.
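As a rough illustration of what minimizing duplication can look like, here is a minimal Python sketch. The order rows, field names, and the `normalize` helper are all hypothetical and chosen only for this illustration. It splits a flat list of orders, where customer details repeat on every row, into a customer table and an order table that refers to customers by id.

```python
def normalize(rows):
    """Split flat order rows into (customers, orders) without repeated customer data."""
    customers = {}  # email -> customer record, so each customer is stored once
    orders = []
    for row in rows:
        # setdefault only stores the new record when the email has not been seen yet
        customer = customers.setdefault(
            row["email"],
            {
                "customer_id": len(customers) + 1,
                "name": row["name"],
                "email": row["email"],
            },
        )
        orders.append(
            {
                "order_id": row["order_id"],
                "customer_id": customer["customer_id"],
                "item": row["item"],
            }
        )
    return list(customers.values()), orders


# Hypothetical spreadsheet-style input: the customer's name and email repeat per order.
flat_rows = [
    {"order_id": 1, "name": "Ada", "email": "ada@example.com", "item": "keyboard"},
    {"order_id": 2, "name": "Grace", "email": "grace@example.com", "item": "monitor"},
    {"order_id": 3, "name": "Ada", "email": "ada@example.com", "item": "mouse"},
]

customers, orders = normalize(flat_rows)
```

In this sketch a change to a customer's name would only need to happen in one place, which is the kind of duplication problem normalization is meant to address.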
Wondering whether your favorite tools, services, or products are on sale this week? Below is a list of Cyber Week deals to help you get started with Data Engineering, refresh your toolbox, or launch your side project. Feel free to add to the list over on GitHub.
This post is mostly for me, but I ran into a ton of conflicting information while troubleshooting my Windows Subsystem for Linux (WSL) and PyCharm integration and figured it may help someone else. First things first. Versions matter! Before wasting your time trying to get PyCharm and WSL to play nicely, make sure you are running PyCharm 2020.2 or greater and WSL 2. If you a) have no idea what those versions mean or b) are not sure what version you are using, allow me a chance to explain.
There comes a time when you just need to take a little off the top of a file, see what you are working with. That is where knowing how to use a utility like
<a href="http://man7.org/linux/man-pages/man1/head.1.html">head</a> can help. Just running:
$ head filename.txt
Will get you
Print the first 10 lines of each FILE to standard output.
But what if that file does not have nice lines? Large SQL dump files come to mind.
head has an answer. Use the -c flag to print the beginning bytes of a file instead of lines. Change the command above to:
Imagine a scenario where one party wants to check whether a name they have exists in a list of names kept by another party, without the other party learning which name is being searched. This problem may seem unrealistic, but imagine a data breach where tons of personal information is leaked. You want to check whether you were impacted in the breach but do not trust the party hosting the personal information to keep your query safe. This is possible with the help of homomorphic encryption, specifically Paillier encryption.
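To make the private name lookup idea more concrete, here is a minimal sketch of one way it could work with Paillier encryption. It is not necessarily the protocol the post goes on to describe. Everything here is an assumption made for illustration: the helper names (`keygen`, `server_respond`, `is_member`), hashing names to integers with SHA-256, and the toy key built from two small Mersenne primes, which is far too weak for real use. The client encrypts its name, the server homomorphically computes an encryption of s * (x - y) for each name y in its list using a random s, and the client decrypts each response: a zero means a match, and anything else looks like random noise.

```python
import hashlib
import math
import secrets

# Toy key material (assumption): two Mersenne primes. Too small and too
# structured for real use; a real deployment would use a vetted library.
P = 2**89 - 1
Q = 2**107 - 1


def keygen(p=P, q=Q):
    """Return (public modulus n, private key (lam, mu)) for generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)


def random_unit(n):
    """Random value in [1, n) that is coprime to n."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            return r


def encrypt(n, m):
    n2 = n * n
    # With g = n + 1, g^m mod n^2 equals 1 + m*n mod n^2.
    return (1 + (m % n) * n) * pow(random_unit(n), n, n2) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def name_to_int(name, n):
    """Map a name to an integer below n (case and surrounding spaces ignored)."""
    digest = hashlib.sha256(name.strip().lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


def server_respond(n, enc_query, names):
    """Server side: sees only the public modulus and the encrypted query."""
    n2 = n * n
    responses = []
    for name in names:
        # Enc(x) * Enc(-y) = Enc(x - y); raising to s gives Enc(s * (x - y)).
        diff = enc_query * encrypt(n, -name_to_int(name, n)) % n2
        responses.append(pow(diff, random_unit(n), n2))
    secrets.SystemRandom().shuffle(responses)  # hide which entry matched
    return responses


def is_member(name, names):
    """Client side: encrypt the query, then look for a response that decrypts to 0."""
    n, priv = keygen()
    enc_query = encrypt(n, name_to_int(name, n))
    responses = server_respond(n, enc_query, names)
    return any(decrypt(n, priv, c) == 0 for c in responses)
```

In this sketch both parties run in one process for brevity. In practice only the modulus, the encrypted query, and the responses would cross the wire, and the number of responses still reveals how long the server's list is.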
Collecting memories from people is an excellent way to celebrate the experience of others. I have found it helps me learn more about why people hold certain beliefs, how they overcame hardships, and the world we live in. Interviewing other people has helped me learn more about myself, which is why I wanted to write up a guide for collecting the stories of other people.
The most obvious aspect of collecting stories is interviewing. There are a ton of resources by people much more experienced than myself on how to conduct an oral history interview. It is important to come up with a sample outline and use that as a starting point. I continue to consult the following resources to help me prepare for interviews.
The Windows Subsystem for Linux (WSL) is one of the best features on Windows 10. It makes development so much easier than it used to be but still has a few hiccups. Kinda like Linux, some things don’t “just work.” One pesky thing that I recently dealt with was getting SSH to work with a keypair file from WSL. Here is how to get SSH working on WSL.
Goal
Given a keypair file, we want to invoke ssh from the command line and establish a tunnel to another server. This is a common task when connecting to remote servers. Think AWS, Azure, or Digital Ocean. It is a simple command:
Below are some notes on getting
<a href="https://github.com/edenhill/kafkacat#build">kafkacat</a> installed on an Amazon workspace with admin access. The commands listed on the GitHub page will not work without a little preparation. A Linux Amazon Workspace image is based on Amazon Linux. Attempts to use a package manager like yum go through a plugin, amzn_workspaces_filter_updates. This filter only has a handful of packages (30 at the time of this writing) that can be pulled. The first thing to do is add Extra Packages for Enterprise Linux, EPEL, to the instance’s package repository. Following the instructions on the Fedora FAQ run:
I have been working on collecting a family’s oral history for the past few months. During the process I took notes with simple descriptions of what the speaker was describing or telling and a rough timestamp of when in the file the conversation took place. After collecting hours of stories, I realized that having a transcription would make things much easier to search and perhaps more useful to those interested in these particular histories. Why not get a transcription of the contents via one of the cloud offerings? Amazon offers a service called Transcribe that is available via the AWS suite of services. Since I have a small account and some credits to burn, I figured why not kick the tires and see how Transcribe would perform on meandering oral history interviews. But before I jump into the how, let me describe my particular use case.
Once a year I need to free up space on my work machine. Once a year I find myself searching for an efficient way to identify the largest files that I no longer need without installing third-party software. Here is the solution I stumbled on this year, and it couldn’t be simpler.
- Open File Explorer
- Navigate to the drive you want to search. Normally, C:\
- Type “size:gigantic” in the search bar
- Wait for the search to complete
- Sort the results by size
This search will scan your computer’s file system for files that are larger than 128 MB. You can then identify files that are no longer needed.
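If Python already happens to be on the machine, the same search can be roughly approximated in a script. The sketch below is an alternative to the File Explorer steps, not part of them. The `largest_files` helper and its `min_bytes` parameter are hypothetical names, and the default threshold simply mirrors the 128 MB figure above.

```python
import os


def largest_files(root, min_bytes=128 * 1024 * 1024):
    """Return (size, path) pairs for files under root larger than min_bytes, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            if size > min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)
```

Calling `largest_files("C:\\")` would walk the whole drive, which can take a while; pointing it at a single folder first is a gentler way to try it.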
Recently, I have been working with the Requests library in Python. I wrote a simple function to pull down a file that took more than a minute to download. While waiting for the download to complete I realized it would be nice to have some insight into the download’s progress. A quick search on StackOverflow led to an excellent example. Below is a simple way to display a progress bar while downloading a file.
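What follows is a minimal sketch of the idea rather than the StackOverflow example mentioned above. It assumes the server sends a Content-Length header and falls back to a plain byte counter when it does not. The `render_bar` and `download` names, the bar style, and the chunk size are all choices made for this illustration; the Requests calls used are `requests.get` with `stream=True` and `Response.iter_content`.

```python
import sys


def render_bar(done, total, width=40):
    """Return a text progress bar, or a byte counter when the total size is unknown."""
    if total <= 0:
        return f"{done} bytes"
    fraction = min(done / total, 1.0)
    filled = int(width * fraction)
    return f"[{'#' * filled}{'-' * (width - filled)}] {fraction:.0%}"


def download(url, dest, chunk_size=8192):
    """Stream url to dest, redrawing a progress bar on one terminal line."""
    import requests  # third-party; imported here so render_bar works without it

    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        total = int(response.headers.get("Content-Length", 0))
        done = 0
        with open(dest, "wb") as fh:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                done += len(chunk)
                # "\r" returns to the start of the line so the bar updates in place
                sys.stdout.write("\r" + render_bar(done, total))
                sys.stdout.flush()
    sys.stdout.write("\n")
```

Streaming the response keeps memory use flat for large files, and keeping the bar rendering in its own function makes it easy to swap in a different display.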