Friday, January 2, 2009

CSV Data Mix - Example Use

In CSV Data Mix 0.4, the command line options were moved out of the core class file and into a sample use file. The core class file contains all the classes needed to use CSV Data Mix, and it can be imported into other applications.

The sample use file follows the standard command line conventions of most Unix-style operating systems. To display all options, run it with the help option, -h or --help.
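For readers who want to build their own front end, the options described in this post could be declared roughly as follows. This is only a sketch using Python's standard argparse module; it is not the parser from the sample use file. The function name build_parser and the positional name files are hypothetical, and the short flag -i for iterations is the one used in Example 1.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options described in this post."""
    parser = argparse.ArgumentParser(
        description="Shuffle or map CSV data (illustrative sketch only)."
    )
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="do not write results to standard output")
    parser.add_argument("-t", "--title", action="store_true",
                        help="treat the first row as column headings")
    parser.add_argument("-d", "--delimiter", default=",",
                        help="field delimiter (default: comma)")
    parser.add_argument("-o", "--out", metavar="filename",
                        help="write shuffled output to this file")
    parser.add_argument("-i", "--iterations", type=int, default=1,
                        help="number of times to shuffle")
    parser.add_argument("-m", "--map", action="store_true",
                        help="create a map between two files")
    parser.add_argument("-p", "--progress", action="store_true",
                        help="show progress on standard error")
    parser.add_argument("files", nargs="+", help="input CSV file(s)")
    return parser


# Parse the same arguments used in Example 1 of this post.
args = build_parser().parse_args(
    ["-i", "5", "-pqt", "-o", "shuffled.csv", "rows.csv"]
)
```

argparse adds -h / --help automatically and accepts combined short flags such as -pqt, which matches the conventions described above.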

Quiet: -q / --quiet
Using this option will prevent standard output messages from being written to the prompt. This includes the output of the shuffling and mapping procedures. Error and progress messages (see the -p/--progress option) are sent to the standard error stream, so they will still be printed when this option is used.
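The split between the two streams can be illustrated with a small sketch. The function emit and its parameters are made up for this example and are not part of CSV Data Mix; the point is only that quiet silences the standard output stream while status messages still go to standard error.

```python
import io
import sys


def emit(rows, quiet=False, out=sys.stdout, err=sys.stderr):
    """Write rows to `out` unless quiet; always write a status line to `err`."""
    err.write("processed %d rows\n" % len(rows))
    if not quiet:
        for row in rows:
            out.write(",".join(row) + "\n")


# Capture both streams to show what a quiet run produces.
out, err = io.StringIO(), io.StringIO()
emit([["a", "1"]], quiet=True, out=out, err=err)
```

After this call, out is empty and err holds the status line.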

Title: -t / --title
Using this option will reserve the first row of the data as column headings. This matters because if the data contains headings and the option is not used, the heading row will be shuffled into the data like any other row.

This option has no effect on the mapping operation. Mapping ignores headings and simply maps every row, so if the two data sets being mapped have the same headings, the map will reflect that.
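Reserving the heading row during a shuffle might look something like the sketch below. The function shuffle_rows and its has_title and rng parameters are hypothetical names chosen for illustration, not the CSV Data Mix API.

```python
import random


def shuffle_rows(rows, has_title=False, rng=random):
    """Return a shuffled copy of rows; keep the first row in place if has_title."""
    rows = list(rows)
    if has_title:
        header, body = rows[:1], rows[1:]
    else:
        header, body = [], rows
    rng.shuffle(body)  # shuffles the data rows in place
    return header + body
```

Passing a seeded random.Random instance as rng makes a run repeatable, which is handy when testing.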

Delimiter: -d value / --delimiter=value
This option allows a custom delimiter to be used when processing data. For example, if the data is separated with semicolons rather than commas, then the value given with this option must be a semicolon for the data to be read properly. This value must be a string.
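As an illustration of what a custom delimiter does, here is a small sketch that reads semicolon-separated data with Python's standard csv module. I am assuming a csv-module-based reader for the example; read_rows is a hypothetical helper, not a CSV Data Mix function. Note that the csv module requires the delimiter to be a one-character string.

```python
import csv
import io


def read_rows(text_stream, delimiter=","):
    """Read every row from an open text stream using the given delimiter."""
    return list(csv.reader(text_stream, delimiter=delimiter))


# In-memory sample standing in for a semicolon-separated file.
sample = io.StringIO("name;age\nAda;36\n")
rows = read_rows(sample, delimiter=";")
```

Reading the same text with the default comma delimiter would instead yield a single field per row.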

Output File: -o filename / --out=filename
This option allows standard output to be written to a file. With large data sets, redirecting or piping the output is not always optimal, and using this option will give a better operation time for writing to a file. It applies to the shuffling or mixing operations. The mapping operation generates its own filename in this implementation. If the given filename exists, it will be overwritten without warning.

The mapping option always writes to a file in this sample implementation. This is because I had not thought of a reasonable way to provide two command line options for saving to a file.

Iterations: -i integer / --iterations=integer
This option allows a given data set to be shuffled more times than the default. By default, the data is shuffled once. The integer value given with this option is the number of times the data will be shuffled. The implementation limits the iterations value to 10. If zero is given, the data will not be shuffled or mixed. This option only applies to the shuffling procedures.
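A minimal sketch of repeated shuffling with an upper limit is shown below. How the real implementation enforces the limit of 10 is not covered here, so this example simply assumes that larger values are clamped. The names shuffle_n_times and MAX_ITERATIONS are hypothetical.

```python
import random

MAX_ITERATIONS = 10  # the limit mentioned in this post


def shuffle_n_times(rows, iterations=1, rng=random):
    """Shuffle a copy of rows `iterations` times, clamped to MAX_ITERATIONS.

    Clamping is an assumption of this sketch; an implementation could
    just as well reject values above the limit.
    """
    if iterations < 0:
        raise ValueError("iterations must not be negative")
    rows = list(rows)
    for _ in range(min(iterations, MAX_ITERATIONS)):
        rng.shuffle(rows)
    return rows
```

With iterations set to zero the loop body never runs, so the rows come back in their original order.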

Map: -m / --map
This option switches to a completely different set of operations. Instead of shuffling data, it will read two files and create a map between them. It is assumed that the first file was the input that produced the second file, and that the desired result is to know how the data got from one place to the other. The generated map is saved to a CSV file named in the fashion mixmap-file1_file2.csv, in the directory of the second file given.
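To make the naming scheme and the idea of a map concrete, here is a rough sketch. Both functions are hypothetical and are not taken from CSV Data Mix. In particular, the map format used here (a list of original index and shuffled index pairs) is my own assumption for illustration.

```python
import os


def map_filename(first_path, second_path):
    """Build mixmap-file1_file2.csv in the directory of the second file."""
    stem1 = os.path.splitext(os.path.basename(first_path))[0]
    stem2 = os.path.splitext(os.path.basename(second_path))[0]
    name = "mixmap-%s_%s.csv" % (stem1, stem2)
    return os.path.join(os.path.dirname(second_path), name)


def build_map(original_rows, shuffled_rows):
    """Pair each original row index with its index in shuffled_rows.

    Duplicate rows are matched in order of appearance; a row that is
    missing from shuffled_rows maps to None.
    """
    positions = {}
    for index, row in enumerate(shuffled_rows):
        positions.setdefault(tuple(row), []).append(index)
    mapping = []
    for index, row in enumerate(original_rows):
        candidates = positions.get(tuple(row))
        target = candidates.pop(0) if candidates else None
        mapping.append((index, target))
    return mapping
```

For rows.csv and shuffled.csv in the current directory, map_filename returns mixmap-rows_shuffled.csv.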

Progress: -p / --progress
This option gives the user some feedback when executing on large data sets. Progress bars and indicators are written to the standard error stream. Progress reporting applies to both the shuffling and mapping operations.
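A simple text progress bar written to standard error could be sketched like this. The function report_progress and the bar layout are invented for the example; the actual indicators in CSV Data Mix may look different.

```python
import sys


def report_progress(done, total, width=20, stream=sys.stderr):
    """Redraw a one-line text progress bar on `stream`."""
    filled = width * done // total if total else width
    bar = "#" * filled + "-" * (width - filled)
    # The carriage return moves back to the start of the line so the
    # bar is redrawn in place instead of scrolling.
    stream.write("\r[%s] %d/%d" % (bar, done, total))
    if done >= total:
        stream.write("\n")
    stream.flush()
```

Because the bar goes to standard error, it stays visible even when standard output is redirected to a file or silenced with -q.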

Example 1: Shuffling Data
In this example, data will be read from one file (rows.csv), shuffled and then written to an output file. 
  • We will want to use the -t option because our data has column headings. 
  • We will use the -i option to shuffle the data 5 times. 
  • We will use the -p option to show progress. 
  • We will use the -q option to hide the output of the shuffling procedure because we want the output saved to a file. 
  • Since we want the data saved to a file, we will use the -o option to specify the output filename (shuffled.csv). 
The execution at the command prompt will look like:

$ python -i 5 -pqt -o shuffled.csv rows.csv 

As a result, the shuffled data will be written to the file shuffled.csv, and nothing but the progress will be printed to the prompt.

Example 2: Creating a Data Map
In this example, we want to create a map between two files, rows.csv and shuffled.csv.
  • We will use the -m option to create a map.
  • We will use the -p option to show progress at the prompt.
  • We will use the -q option to hide the output generated from the mapping procedure. 
The execution at the command prompt will look like:

$ python -mqp rows.csv shuffled.csv 

As a result, the map between rows.csv and shuffled.csv will be written to the file mixmap-rows_shuffled.csv in the same directory as shuffled.csv.

