UNIX commands for file handling
Recently, I was assigned a project to test an ETL application developed by one of my team members. I had to work with some of the largest files I had ever encountered. The file sizes ran into GBs, and the files wouldn't open easily in any mainstream editor. When I tried opening them, they slowed the whole system down to the point where I had to restart the computer.
I knew I had to figure out a better way to deal with this new problem. I searched for solutions online, and a lot of Stack Overflow answers recommended using UNIX commands. I started exploring UNIX commands and kept a list of the ones I found useful so that I could share it with the community later.
Without further ado, let's look at the commands.
First things first, you can open a terminal as follows:
Linux (Ubuntu): press Ctrl + Alt + T
Mac: click Launchpad -> type Terminal -> click Terminal
1. List command
Generally, files will be shared with you by developers over a cloud storage solution (AWS, Azure) or a NAS drive. Before you begin working on them, it's a good idea to list all the files and get basic details like file names, sizes, timestamps etc. You can use the list command, ls, for this.
To show the contents of a directory, run ls on its own.
To show hidden files and folders, add -a to the command (ls -a).
To show file sizes and timestamps, add -l to the command (ls -l).
To display a recursive listing (folders inside folders) of all files and folders, add -R to the command (ls -R).
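The listing options above can be tried safely in a scratch directory. This is only a sketch: the file and folder names (orders.csv, .config, archive) are made up for illustration, and the -h flag is an extra that the list above does not cover.

```shell
# Build a throwaway directory with hypothetical files so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir archive
touch orders.csv .config archive/old_orders.csv

ls        # visible entries only: archive and orders.csv
ls -a     # also lists hidden entries such as .config (plus . and ..)
ls -l     # long format: permissions, owner, size in bytes, modification time
ls -lh    # same as -l, with human-readable sizes (K, M, G)
ls -R     # recurses into archive/ and lists old_orders.csv as well
```

Flags can be combined, so ls -laR gives a long, recursive listing that includes hidden files.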
2. Move command (covers copy as well)
Chances are that you would like to move files from one location to another before the file processing starts. In some cases, you might want to rename them. The move command is very handy in such cases.
mv: the move command. It supports moving single files, multiple files and directories.
To move a file from one directory to another, pass the file and then the destination directory to mv.
To move multiple files into a directory, list all the files first and put the destination directory last.
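Here is a small sketch of the single-file move, the multi-file move and a rename. The directory names (incoming, processed) and file names are hypothetical.

```shell
# Scratch directory with made-up monthly files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir incoming processed
touch incoming/jan.csv incoming/feb.csv incoming/mar.csv

# Move a single file into another directory.
mv incoming/jan.csv processed/

# Move several files at once: the last argument is the destination directory.
mv incoming/feb.csv incoming/mar.csv processed/

# Rename a file by "moving" it to a new name in the same directory.
mv processed/jan.csv processed/january.csv

ls processed   # feb.csv, january.csv and mar.csv
```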
If you pass -i to the move command, it prompts you before overwriting a file. If you want to copy the files instead of moving them, replace mv with cp.
However, note that cp creates a full copy of the file, so with very large files it might take a while before the copy finishes (sometimes even hours). Use this command carefully.
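The difference between copying and a guarded move can be sketched as follows. The file names are hypothetical, and the answer to the -i prompt is piped in only so that the example runs unattended; in a real terminal you would type y or n yourself.

```shell
# Scratch directory with a new file and an older file of the same name in backup/.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir backup
echo "new data" > orders.csv
echo "old data" > backup/orders.csv

# cp leaves the original in place and writes a second copy.
cp orders.csv orders_copy.csv

# mv -i asks before overwriting backup/orders.csv; "y" confirms the overwrite.
echo y | mv -i orders.csv backup/orders.csv

cat backup/orders.csv   # prints: new data
```

cp accepts -i as well, which is worth considering whenever the destination might already hold a file you care about.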
3. Word count command
When you read the files and load them into a database, the first thing you want to do is make sure that all the records have been loaded successfully, that is, compare the record count between the files and the database. Because the files run into GBs, opening them in an editor can bring your computer to a halt. The good news is that you can get the record count of a file without opening it.
wc: word count. It gives you the number of lines (read: records), the number of words and the number of characters in a file.
To get the record count of a file, say demo.txt, pass -l to the command (wc -l demo.txt).
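As a quick sketch, the example below builds a hypothetical 12-record demo.txt (the contents are invented) and counts its lines.

```shell
# Scratch directory with a made-up file: one record per line, 12 lines in total.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'record %s\n' $(seq 1 12) > demo.txt

wc -l demo.txt     # prints the count followed by the file name: 12 demo.txt
wc -l < demo.txt   # reading from standard input prints only the number: 12
```

Some systems pad the number with leading spaces. Keep in mind that wc -l counts newline characters, so header and footer lines are included in the total, and a last line without a trailing newline is not counted.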
If you want to count the number of characters, use -m, and for words use -w.
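The other counters work the same way. This sketch uses the same invented 12-line file, where every line reads "record N".

```shell
# Scratch directory with a made-up 12-line file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'record %s\n' $(seq 1 12) > demo.txt

wc -w demo.txt   # 24 words: "record" plus a number on each of the 12 lines
wc -m demo.txt   # 111 characters, counting spaces and newlines
wc demo.txt      # no flag: lines, words and bytes in one go (12 24 111)
```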
Now let's create a copy of demo.txt and call it demo1.txt, with exactly the same contents (cp demo.txt demo1.txt does the job). So now we have 12 lines in demo.txt and 12 lines in demo1.txt.
You can multiply the power of the word count command by using the pipe character (|). For example, you can pipe the output of cat into wc -l to get the total number of records across all the files in a folder, in this case both demo.txt and demo1.txt.
The cat command concatenates the contents of all the files and sends the output to the word count command through the pipe.
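Put together, the pipe looks roughly like this. The file contents are invented; only the names demo.txt and demo1.txt and the 12-line length come from the text above.

```shell
# Scratch directory with a made-up 12-line file and an identical copy.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'record %s\n' $(seq 1 12) > demo.txt
cp demo.txt demo1.txt

cat demo.txt demo1.txt | wc -l   # 12 + 12 = 24
cat *.txt | wc -l                # same result, picking up every .txt file in the folder
wc -l demo.txt demo1.txt         # alternative: one count per file plus a "total" line
```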
4. Head command
When you receive a new batch of files, you will want to quickly verify that the files comply with the agreed contract. By contract, I mean the header and footer signatures and the file schema, among other things. You can use the head command to read the first 10 lines of a file and print them to the console.
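For instance, with a hypothetical batch file made of a header line, 20 data rows and a footer line (the layout is invented for this sketch), head shows the header and the first rows without loading the rest.

```shell
# Scratch directory with a made-up batch file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{
  echo "id,name,amount"
  for i in $(seq 1 20); do echo "$i,customer_$i,100"; done
  echo "TRAILER,20"
} > batch.csv

head batch.csv   # the header line plus the first 9 data rows
```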
For multiple files, list the file names separated by spaces. You can print a custom number of lines by adding -n to the command, followed by a number.
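Both variations are sketched below on two invented 12-line files.

```shell
# Scratch directory with two identical made-up files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'record %s\n' $(seq 1 12) > demo.txt
cp demo.txt demo1.txt

head demo.txt demo1.txt   # first 10 lines of each file, each under a "==> name <==" banner
head -n 3 demo.txt        # only the first 3 lines: record 1 to record 3
```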
5. Tail command
Like head, you can use the tail command to read the last 10 lines of a file.
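A short sketch on an invented 12-line file follows. The -n flag is not mentioned above for tail, but it works the same way as it does for head.

```shell
# Scratch directory with a made-up 12-line file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'record %s\n' $(seq 1 12) > demo.txt

tail demo.txt        # last 10 lines: record 3 through record 12
tail -n 1 demo.txt   # only the last line, handy for checking a footer: record 12
```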
6. Grep command
How about searching for some text in a file? We can use the grep command for that. In essence, this command prints all the lines that match a pattern. To understand what this means, let's say that the file contains employee records. If you search for an employee name using grep, you get all the details related to that particular employee, because the command prints the entire line.
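Consider a file of employee records, with one employee per line.
To print all the lines matching a text value, pass the text and then the file name to grep.
To print the number of matching lines instead of the lines themselves, add -c to the command. Note that -c counts lines, so a line that contains the text twice is still counted once.
To ignore case when searching, add -i to the command.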
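The three searches can be tried on a small stand-in file. The employee records below are invented for this sketch, and employees.txt is a hypothetical name.

```shell
# Scratch directory with a made-up employee file: id,name,department.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > employees.txt <<'EOF'
101,John Smith,Engineering
102,Jane Doe,Finance
103,john adams,Engineering
104,Priya Patel,Finance
EOF

grep 'John' employees.txt         # prints: 101,John Smith,Engineering
grep -c 'Finance' employees.txt   # prints: 2
grep -i 'john' employees.txt      # prints the lines for 101 and 103
```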