This post will demonstrate programs which each perform a single, very specific task but which can be chained together so that the output of one forms the input of another. Connecting programs like this, or piping, to use the correct terminology, enables more complex workflows or processes to be run.
This post includes just three short Python programs to carry out the following tasks:
- Generating data
- Filtering data from the data generating program
- Calculating totals from the filtering program
The three programs described above can really only be used together in the order listed (perhaps missing out the filtering stage). However, the main purpose of this article is to show that you can develop a suite of programs, each carrying out a single task, and use them in a "mix and match" fashion to carry out a large number of different tasks, provided their inputs and/or outputs are in a common format.
A few examples of additional programs to add to the suite might be:
- Extracting data from various different sources (databases, flat files, web services etc.) and transforming it into a common format for use by other programs
- Filtering and/or sorting data in a variety of ways
- Processing the data from previous programs in the pipeline in various ways, anything from simple totals to complex machine learning
- Saving data created earlier in the pipeline to a variety of formats - databases, XML, spreadsheet etc.
Project Requirements
For this project we will create a file of transaction amounts for a hypothetical business. Some will be positive representing invoices sent out to customers, and some will be negative representing invoices received from suppliers. For this demonstration the amounts will be randomly generated but of course in a real-world situation the data would be retrieved from a source such as a database. The data will then be written to a file.
The next stage is to filter the data. It is common accounting practice to write off amounts below a certain value as it is not cost effective to handle them. Our filtering program will therefore take a minimum amount as a command line parameter, read in amounts from the previous program, and only output amounts whose size (ignoring the sign) is at least the minimum.
The final program will read in data from the filtering program and calculate two totals. Negative amounts will be added to obtain the total creditors, while positive amounts will be added to obtain the total debtors. These amounts will be written to a further file.
stdin and stdout
In the previous section I mentioned files several times, implying that data will be written to and read from files saved to disc. While each of the three programs can be used in this way they don't need to be - the piping process mentioned above can be used to write the output from one program directly to the input of another without any intermediate stage of writing and reading disc files.
Console output with the print function in Python writes to a file called sys.stdout (standard output) which by default points to the screen rather than a file on disc. Similarly, console input reads from a file called sys.stdin (standard input) which by default points to the keyboard. However, these defaults can easily be changed. For example assume you have a program called getdata.py which uses print to output data. If you ran it like this:
python3.7 getdata.py
then not surprisingly the data would appear on the screen. However, you can easily change stdout like this:
python3.7 getdata.py > data.csv
which will make print write to data.csv. You can also change where stdin points to using the < operator, as we'll see later. We will also see later how to pipe several programs together, each redirecting stdout and stdin to pass data along the pipeline without any unnecessary disc writes and reads.
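To make the idea that print simply writes to "whatever sys.stdout currently is" a little more concrete, here is a minimal sketch which does the redirection from inside Python rather than from the shell. The emit_data function and the sample values are made up for illustration; contextlib.redirect_stdout and io.StringIO are standard library tools, and the in-memory buffer stands in for the data.csv file.

```python
import contextlib
import io
import sys

def emit_data():
    # print() writes to whatever sys.stdout currently refers to
    for value in (10, -20, 30):
        print(value)

# Swap sys.stdout for an in-memory text buffer, much as the shell's
# > operator swaps it for a file on disc
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    emit_data()

captured = buffer.getvalue()

# Report on stderr so the message is kept apart from the data
print("captured {} lines".format(len(captured.splitlines())), file=sys.stderr)
```

The emit_data function is unchanged whether its output ends up on screen, in a file or in a buffer, which is exactly the property the rest of this post relies on.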
Time to Start Coding
Create a new folder somewhere and within it create three empty files; you can also download the source code as a zip or clone/download from GitHub if you prefer.
- generatedata.py
- filterdata.py
- calculatetotals.py
Open generatedata.py and enter or paste this code.
generatedata.py
import random
import sys

def main():
    for i in range(0, 64):
        print(random.randrange(-100, 100))
    print("data generated", file=sys.stderr)

main()
The code for these three programs is deliberately simple so as to concentrate on redirecting sys.stdin and sys.stdout, and piping the outputs from one to the inputs of the next.
This first one simply generates 64 random integers from -100 to 99 (randrange excludes its upper bound, so +100 itself is never produced), and uses print to write them to sys.stdout (wherever that might be - neither we nor the print function know or care!)
Having done its stuff it then uses print to write a message to sys.stderr which, by default, also points to the screen but unlike sys.stdout is not redirected with the > operator. (Please feel free to disapprove of my hijacking sys.stderr for something which is not actually an error! However, it does demonstrate that you still have access to the screen should you need it for real errors. I have omitted any error handling for brevity but obviously that should not be done for production code.)
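If you would like to convince yourself that sys.stdout and sys.stderr really are separate streams, the following sketch may help. It runs a tiny stand-in for generatedata.py as a child process and captures the two streams separately. The inline child_code is an assumption made purely so the example is self-contained; subprocess.run with capture_output and text needs Python 3.7 or later.

```python
import subprocess
import sys

# A tiny stand-in for generatedata.py: data goes to stdout,
# a status message goes to stderr
child_code = (
    "import sys\n"
    "print(42)\n"
    "print('data generated', file=sys.stderr)\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # collect stdout and stderr separately
    text=True,            # decode the bytes to str
)

data = result.stdout      # only the number
message = result.stderr   # only the status message
```

Because the two streams never mix, a later program in a pipeline only ever sees the data, not the status messages.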
Now let's move on to filterdata.py.
filterdata.py
import sys

def main():
    if len(sys.argv) > 1:
        try:
            minimum = int(sys.argv[1])
        except ValueError:
            print("Argument must be valid integer", file=sys.stderr)
            # Without a valid minimum there is nothing to filter against
            return
        for amount in sys.stdin:
            # Compare the size of the amount so large negative values are kept too
            if abs(int(amount)) >= minimum:
                print(amount, end="")
    else:
        print("minimum must be specified", file=sys.stderr)
    print("data filtered", file=sys.stderr)

main()
This is equally simple. It first picks up the minimum from the command line arguments and then iterates over sys.stdin with a for/in loop to read in data line by line. Any amounts whose absolute value is at least the minimum are written to sys.stdout, the others being ignored.
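One way to check the filtering rule without touching the command line is to pull it out into a function which accepts any iterable of lines. This is only a sketch: filter_amounts is a hypothetical helper of my own, not part of filterdata.py, and io.StringIO stands in for sys.stdin.

```python
import io

def filter_amounts(lines, minimum):
    """Yield each line whose amount has an absolute value of at least minimum."""
    for line in lines:
        # int() ignores the trailing newline on each line
        if abs(int(line)) >= minimum:
            yield line

# An in-memory file behaves like sys.stdin when iterated line by line
sample = io.StringIO("5\n-50\n10\n-9\n")
kept = list(filter_amounts(sample, 10))
```

Here 5 and -9 are dropped, while -50 and 10 are kept because the comparison uses the absolute value and is inclusive of the minimum.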
As with print in the previous program, we don't know or care where sys.stdin actually comes from or where sys.stdout actually goes to.
Finally let's move on to calculatetotals.py.
calculatetotals.py
import sys

def main():
    totaldebtors = 0.0
    totalcreditors = 0.0
    for amount in sys.stdin:
        if(int(amount) > 0):
            totaldebtors += int(amount)
        else:
            totalcreditors += int(amount)
    print("Debtors: {:.2f}".format(totaldebtors))
    print("Creditors: {:.2f}".format(totalcreditors))
    print("totals created", file=sys.stderr)

main()
Another very simple little program. Firstly we declare variables for the two totals, initialized to 0.
We then read in data from sys.stdin in the same way as filterdata.py, adding each amount on to the relevant variable. (Note the data comes to us as strings so we need to convert it to ints.) After the loop terminates we print the two totals which go to sys.stdout, although of course we still don't know or care where sys.stdin comes from or where sys.stdout goes to.
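The same trick works for the totals. The sketch below wraps the logic in a hypothetical calculate_totals function (again my own name, not something in calculatetotals.py) and feeds it a handful of made-up amounts from an in-memory file instead of sys.stdin.

```python
import io

def calculate_totals(lines):
    """Return (total debtors, total creditors) for an iterable of amount lines."""
    totaldebtors = 0
    totalcreditors = 0
    for line in lines:
        amount = int(line)  # each line arrives as a string
        if amount > 0:
            totaldebtors += amount
        else:
            totalcreditors += amount
    return totaldebtors, totalcreditors

sample = io.StringIO("25\n-40\n15\n-10\n")
debtors, creditors = calculate_totals(sample)

print("Debtors: {:.2f}".format(debtors))
print("Creditors: {:.2f}".format(creditors))
```

With these four sample amounts the debtors total is 40 and the creditors total is -50.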
The coding is now finished so we can run the three programs. We can actually run these in different ways. The first way is individually, causing them to write data to or read data from files on disc. Let's do that to start with.
Run individually
python3.7 generatedata.py > transactions.csv
python3.7 filterdata.py 10 < transactions.csv > filteredtransactions.csv
python3.7 calculatetotals.py < filteredtransactions.csv > totals.csv
The first line runs generatedata.py, telling it to redirect sys.stdout to transactions.csv.
The second line runs filterdata.py with a command line parameter of 10, the minimum amount. The sys.stdin file is redirected to transactions.csv created by the previous program, and sys.stdout to filteredtransactions.csv.
The last line runs calculatetotals.py with sys.stdin redirected to filteredtransactions.csv and sys.stdout redirected to totals.csv.
When you run these all you will see is the stuff written to sys.stderr...
Program Output
data generated
data filtered
totals created
... but if you look in the folder where you have your source code you will see three csv files have been created.
Now let's run the programs again, this time all in one go with the output from one piped to the input of the next using the '|' character.
Run with piping to create totals.csv file
python3.7 generatedata.py | python3.7 filterdata.py 20 | python3.7 calculatetotals.py > totals.csv
There is only one file name here, totals.csv, which is the end result of the process. We no longer waste resources generating unnecessary intermediate files. When you run this you'll still see the messages printed to sys.stderr but only the final totals.csv file has been written.
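For the curious, here is a rough sketch of what the shell does when it sees the '|' character: it starts both programs and connects the stdout of the first directly to the stdin of the second. The two inline programs are simplified stand-ins for generatedata.py and filterdata.py, used only to keep the example self-contained, with a fixed minimum of 10.

```python
import subprocess
import sys

# Inline stand-ins for the generator and filter stages
generate_code = "for n in (5, -50, 10, -9): print(n)"
filter_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if abs(int(line)) >= 10:\n"
    "        print(line, end='')\n"
)

# First stage: its stdout becomes a pipe rather than the screen
producer = subprocess.Popen(
    [sys.executable, "-c", generate_code],
    stdout=subprocess.PIPE,
)

# Second stage: its stdin is the read end of that same pipe
consumer = subprocess.Popen(
    [sys.executable, "-c", filter_code],
    stdin=producer.stdout,
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

# Close our copy of the pipe so only the consumer holds the read end
producer.stdout.close()

piped_output, _ = consumer.communicate()
producer.wait()
```

No intermediate file is created at any point; the data passes from one process to the other through the pipe.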
If we just want to see the totals on screen we can do away with files on disc completely. Let's run the programs one last time, this time without redirecting the output of the last to a file.
Run with piping to print totals to screen
python3.7 generatedata.py | python3.7 filterdata.py 20 | python3.7 calculatetotals.py
The only difference here is we have missed off "> totals.csv" from the end. For the calculatetotals.py program sys.stdout points to the default screen so we actually get to see the totals without opening a file.
Program Output
data generated
data filtered
Debtors: 1125.00
Creditors: -1532.00
totals calculated