Friday, January 2, 2009

CSV Data Mix - Example Use

In CSV Data Mix 0.4, the command line options were moved out of the core class file (csvdatamix.py) into a sample use file (csvmain.py). The core class file contains all the classes needed to use CSV Data Mix, and those classes can be imported into other applications.

The sample use file follows the standard command line conventions used in most Unix style operating systems. To display all options, execute the help option with -h or --help.

Quiet: -q / --quiet
Using this option prevents standard output messages from being written to the prompt. This includes the output from the shuffling and mapping procedures. All error and progress messages (if progress is enabled, see the -p/--progress option) are sent to the standard error stream and will therefore still be printed when this option is used.

Title: -t / --title
Using this option reserves the first row of the data as column headings. If the data contains headings and this option is not used, the headings will be mixed into the data instead of staying in the first row.

This option has no effect on the mapping options. The mapping ignores headings and maps every row, so if the two data sets being mapped have the same headings, the map will reflect that.

Delimiter: -d value / --delimiter=value
This option allows a custom delimiter to be used when processing data. For example, if data is separated with semicolons rather than commas, then the value given with this option must be a semicolon in order for the data to be read properly. This value must be a string.
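
As a minimal sketch of what a custom delimiter means in practice, the standard Python csv module accepts a delimiter argument when reading. This uses only the standard library and is not taken from csvdatamix.py; the sample data and the read_rows helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample data separated by semicolons instead of commas.
sample = "name;city\nAda;London\nLinus;Helsinki\n"

def read_rows(text, delimiter=","):
    """Parse CSV text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# With the matching delimiter, each line is split into two fields.
rows = read_rows(sample, delimiter=";")

# With the default comma, each line is read as a single field.
wrong = read_rows(sample)
```

Reading with the wrong delimiter does not raise an error; it just produces one wide field per line, which is why the option matters.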

Output File: -o filename / --out=filename
This option allows standard output to be written to a file. Redirecting or piping output is sometimes not optimal for large data sets, and using this option gives a better operation time when writing to a file. It applies to the shuffling (mixing) operations only; the mapping operation generates its own filename in this implementation. If the given filename exists, it will be overwritten without warning.

The mapping option always writes to a file in the csvmain.py implementation. This is because I had not thought of a reasonable way to provide two command line options for saving to a file.

Iterations: -i integer / --iterations=integer
This option allows more shuffling to occur on a given data set than the default. By default, the data is shuffled once. With this option, the data is shuffled as many times as the given integer value. The csvmain.py implementation limits the iterations value to 10. If zero is given, the data will not be shuffled or mixed. This option only applies to the shuffling procedures.

Map: -m / --map
This option switches to a completely different set of operations. Instead of shuffling data, it reads two files and creates a map between them. It is assumed that the first file was the input that produced the second file, and that the desired result is to know how the data got from one place to the other. The generated map is saved to a CSV file named in the fashion mixmap-file1_file2.csv, in the directory of the second file given.

Progress: -p / --progress
This option will give the user some feedback if they are executing on large data sets. Progress bars and indicators will be written to the standard error stream. This progress will apply for shuffling and mapping operations. 
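
To make the option list above concrete, here is a rough sketch of how such options could be declared with Python's standard argparse module. This is not the parser from csvmain.py, which may be written differently; the build_parser name, the help texts, and the way the limit of 10 is enforced (rejecting larger values) are assumptions. The -i flag for iterations follows the usage in the examples.

```python
import argparse

def build_parser():
    """Declare the options described in this post (a sketch, not csvmain.py)."""
    parser = argparse.ArgumentParser(prog="csvmain.py")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="do not write data to standard output")
    parser.add_argument("-t", "--title", action="store_true",
                        help="reserve the first row as column headings")
    parser.add_argument("-d", "--delimiter", default=",",
                        help="field delimiter (default: comma)")
    parser.add_argument("-o", "--out", metavar="filename",
                        help="write shuffled output to this file")
    # Assumption: values outside 0..10 are rejected rather than clamped.
    parser.add_argument("-i", "--iterations", type=int, default=1,
                        choices=range(0, 11), metavar="integer",
                        help="number of times to shuffle (0 to 10)")
    parser.add_argument("-m", "--map", action="store_true",
                        help="create a map between two files")
    parser.add_argument("-p", "--progress", action="store_true",
                        help="show progress on standard error")
    parser.add_argument("files", nargs="+", help="input file(s)")
    return parser

# Short flags without values can be combined, as in "-pqt".
args = build_parser().parse_args(
    ["-i", "5", "-pqt", "-o", "shuffled.csv", "rows.csv"])
```

argparse adds the -h/--help option on its own, so it does not need to be declared.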

Example 1: Shuffling Data
In this example, data will be read from one file (rows.csv), shuffled and then written to an output file. 
  • We will use the -t option because our data has column headings. 
  • We will use the -i option to shuffle the data 5 times. 
  • We will use the -p option to show progress. 
  • We will use the -q option to hide the output of the shuffling procedure because we want the output saved to a file. 
  • Since we want the data saved to a file, we will use the -o option to specify the output filename (shuffled.csv). 
The execution at the command prompt will look like:

$ python csvmain.py -i 5 -pqt -o shuffled.csv rows.csv 

The result will be shuffled data being written to file shuffled.csv and nothing but the progress being printed to the prompt.

Example 2: Creating a Data Map
In this example, we want to create a map between two files, rows.csv and shuffled.csv.
  • We will use the -m option to create a map.
  • We will use the -p option to show progress at the prompt.
  • We will use the -q option to hide the output generated from the mapping procedure. 
The execution at the command prompt will look like:

$ python csvmain.py -mqp rows.csv shuffled.csv 

The result will be the map between rows.csv and shuffled.csv being written to file mixmap-rows_shuffled.csv in the directory of the file shuffled.csv.
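
The naming scheme for the map file can be sketched in a few lines. This is my reading of the pattern described above (file names without their extensions, saved next to the second file), not code copied from csvmain.py; the map_filename helper is a hypothetical name.

```python
import os

def map_filename(file1, file2):
    """Build mixmap-<file1>_<file2>.csv in the directory of file2."""
    # Strip directories and extensions: "data/rows.csv" -> "rows".
    name1 = os.path.splitext(os.path.basename(file1))[0]
    name2 = os.path.splitext(os.path.basename(file2))[0]
    filename = "mixmap-%s_%s.csv" % (name1, name2)
    # The map is placed in the directory of the second file.
    return os.path.join(os.path.dirname(file2), filename)

same_dir = map_filename("rows.csv", "shuffled.csv")
other_dir = map_filename("rows.csv", os.path.join("out", "shuffled.csv"))
```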

Links

CSV Data Mix: Readme

The comma-separated values (CSV) file format is a tabular data format that has fields separated by the comma character and quoted by the double quote character. If a field's value contains a double quote character it is escaped with a pair of double quote characters.
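
The quoting rules are easy to see with Python's standard csv module. The following is only an illustration of the format with made-up field values; to_csv_line is a hypothetical helper.

```python
import csv
import io

def to_csv_line(fields):
    """Format one row as a CSV line using the csv module defaults."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue().rstrip("\r\n")

# A field with a comma is quoted; a double quote inside a field is doubled.
line = to_csv_line(["plain", "has,comma", 'say "hi"'])
# line is: plain,"has,comma","say ""hi"""

# Reading the line back restores the original field values.
parsed = next(csv.reader([line]))
```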
 
In some cases, the values stored in CSV files could be considered sensitive or private information that should be concealed from the public. If the public wants similar information stored in CSV and that information is deemed sensitive, then the private information could be randomly disorganized and then presented to the public for use.

For example, if the public desires private customer information from an organization to test against an application that will be manipulating data in the same format, then that private customer data might be considered for concealment before being presented to the testing team of the application. In this case the public would be the application development team, or the testing team of the application.

Files
csvmain.py:
  • Sample use and command line handling for CSVDataMix and CSVMap classes.
  • Type csvmain.py -h for options
csvdatamix.py:
  • Class definitions for CSVDataMix and CSVMap
Sample Use

Shuffle data in rows.csv, which has headings, write to shuffled.csv, show progress, reshuffle 5 times, don't print actual data to prompt.

$ python csvmain.py -i 5 -pqt rows.csv -o shuffled.csv

Create a map between rows.csv and shuffled.csv, show progress, don't print actual data to prompt. Map will always be saved to a file in map mode.

$ python csvmain.py -mqp rows.csv shuffled.csv 

Links

CSV Data Mix & Map on Google App Engine

I just recently started to mess around with a deployment of the CSV Data Mix project of mine hosted on the Google App Engine. This is currently being tested, but is active as I test. You can try it out via the link below. 
If you want to give this hosted application a test to see how it works you can do the following.

Mixing Data
  1. Pick your dataset. You can use some of the sample data I've used for testing; for example, here is a dataset with 500 rows of fake user info.
  2. Determine the field delimiter of your data. If you are using this suggested file via the link in step one, then the field delimiter is a comma. 
  3. Decide if you want the mix to be aware of column headings. When using the sample data, there are headings so choose "Yes". 
  4. Click "Mix Data"
The result will be a response page of plain text containing the contents of the shuffled data. This page can then be saved as your mixed data set. 

Mapping Data
  1. Pick two datasets to compare. Basically, you should be interested in determining what the shuffle did to your data. You can use the following sample data: fake info (file 1) and shuffled fake info (file 2).
  2. Determine the field delimiter of the files. This should be the same in both files. 
  3. Click "Map Data"
The result will be a response page of plain text containing the mapping. This result will attempt to show where a cell in file one exists in file two in the following format: (x,y) -> (x,y). To learn more about this mapping structure you can read the CSV Data Mapping post. 

Disclaimer
This application demo on the Google App Engine has just recently been developed and has caused some changes to the way the previous code releases have worked. Please consider this a beta at this time. 

Thursday, January 1, 2009

CSV Data Mapping

Starting with Release 0.4 of csvdatamix, included in the file release is an additional csv mapping class. CSVMap will compare file-a to file-b and generate a map of where the data in file-a occurs in file-b. This assumes that the data has been shuffled or modified in some manner from its original state (file-a). The two input files should match in tabular format, such that their number of rows and columns are the same.

Warning: This implementation does the best it can, meaning the result does not show exactly how the data was converted. The mapping process finds the first occurrence of a value in the output and notes its location; when the same value is encountered again, the search resumes from where the last match was found. For example, if "awesome" appears twice in a column, then the first "awesome" in file one is mapped to its first occurrence in file two, the next "awesome" in file one is mapped to the next occurrence in file two, and so on.
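
The matching rule in the warning above can be sketched as follows. This is not the CSVMap class itself. It assumes that each value is searched for within the same column of the second file, and that a value with no remaining match maps to None; both are my assumptions, and build_map is a hypothetical name.

```python
def build_map(rows_a, rows_b):
    """Map each (x, y) cell of rows_a to a cell of rows_b holding the same value.

    Repeated values are matched in order: the first occurrence in rows_a goes
    to the first occurrence in rows_b, the second to the next one, and so on.
    """
    mapping = []
    # (column, value) -> row index in rows_b where the next search resumes.
    next_start = {}
    for y, row in enumerate(rows_a):
        for x, value in enumerate(row):
            column_b = [r[x] for r in rows_b]
            start = next_start.get((x, value), 0)
            try:
                found = column_b.index(value, start)
            except ValueError:
                # Assumption: unmatched cells are recorded with no target.
                mapping.append(((x, y), None))
                continue
            next_start[(x, value)] = found + 1
            mapping.append(((x, y), (x, found)))
    return mapping

file_one = [["awesome", "1"], ["cool", "2"], ["awesome", "3"]]
file_two = [["cool", "3"], ["awesome", "1"], ["awesome", "2"]]
result = dict(build_map(file_one, file_two))
```

Here the first "awesome" at (0,0) maps to (0,1) and the second "awesome" at (0,2) maps to (0,2), even if the shuffle actually moved them the other way around. That ambiguity is exactly what the warning is about.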

How it Works....
The following will provide some insight into how the mapping works.

Original Data
The following table demonstrates what the original data in a CSV tabular format would look like. The x,y coordinates indicate positions in the table. The data is represented as lists: each row is stored as a list whose entries are identified by the x,y coordinates. All of these lists are kept in a single container, which is also a list (a list of lists).

x0,y0    | x1,y0    | x2,y0    | ... | ... | x(n),y0
x0,y1    | x1,y1    | x2,y1    | ... | ... | x(n),y1
x0,y2    | x1,y2    | x2,y2    | ... | ... | x(n),y2
x0,y3    | x1,y3    | x2,y3    | ... | ... | x(n),y3
...      | ...      | ...      | ... | ... | ...
...      | ...      | ...      | ... | ... | ...
x0,y(n)  | x1,y(n)  | x2,y(n)  | ... | ... | x(n),y(n)


Reorganized Data
In order to mix the data properly, it must be reorganized so that values are only mixed with similar values. To do this, each column in the original data is restructured as a row, which means row one of the original data is now column one of the reorganized data. This structure is shown below:

x0,y0    | x0,y1    | x0,y2    | x0,y3    | ... | ... | ... | x0,y(n)
x1,y0    | x1,y1    | x1,y2    | x1,y3    | ... | ... | ... | x1,y(n)
x2,y0    | x2,y1    | x2,y2    | x2,y3    | ... | ... | ... | x2,y(n)
...      | ...      | ...      | ...      | ... | ... | ... | ...
...      | ...      | ...      | ...      | ... | ... | ... | ...
x(n),y0  | x(n),y1  | x(n),y2  | x(n),y3  | ... | ... | ... | x(n),y(n)
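
In Python, this kind of reorganization of a list of lists is a transpose, and can be sketched with the built-in zip function. The reorganize name is hypothetical and the real class may do this differently. Note that zip stops at the shortest row, so this sketch assumes every row has the same number of fields.

```python
def reorganize(rows):
    """Turn a list of rows into a list of columns (one list per column)."""
    return [list(column) for column in zip(*rows)]

original = [
    ["x0,y0", "x1,y0", "x2,y0"],
    ["x0,y1", "x1,y1", "x2,y1"],
]
reorganized = reorganize(original)
# Applying the same step again restores the original layout.
restored = reorganize(reorganized)
```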


Shuffling Process
The shuffling process is fairly simple. Each list, or row, in the reorganized data is passed to the shuffle function of Python's random module. The result has the same structure as the reorganized data, and the goal of concealing the original state of the data is accomplished: the position of each value in a list is shuffled among similar data, while the values themselves remain the same.
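
A minimal sketch of this step, assuming the reorganized data is a list of lists as described above. The shuffle_columns name, the iterations parameter, and the optional rng argument (which only makes the example repeatable) are hypothetical and not taken from csvdatamix.py.

```python
import random

def shuffle_columns(columns, iterations=1, rng=None):
    """Shuffle each list in place, the given number of times."""
    if rng is None:
        rng = random.Random()
    for column in columns:
        for _ in range(iterations):
            rng.shuffle(column)
    return columns

names = ["Ada", "Linus", "Grace", "Guido"]
ages = ["36", "28", "45", "52"]

# Each column is shuffled on its own, so names only trade places with names.
shuffled = shuffle_columns([list(names), list(ages)],
                           iterations=5, rng=random.Random(0))

# With zero iterations the data is left untouched.
untouched = shuffle_columns([list(names)], iterations=0)
```

Because every column is shuffled independently, the values in a row of the final output no longer belong together, which is what conceals the original records.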

Data Output
The output process is fairly simple as well. This process takes the reorganized data that has been shuffled and outputs the data in a comma separated fashion. 
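
One way to sketch the output step is to turn the shuffled lists back into rows and hand them to the csv module. Whether csvdatamix.py uses the csv module for writing is an assumption on my part, and the write_columns name and its header parameter are hypothetical.

```python
import csv
import io

def write_columns(columns, stream, delimiter=",", header=None):
    """Write column lists back out as delimited rows."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    if header is not None:
        # Reserved headings go first, untouched by the shuffle.
        writer.writerow(header)
    # zip(*columns) turns the list of columns back into rows.
    for row in zip(*columns):
        writer.writerow(row)

buffer = io.StringIO()
write_columns([["a", "b"], ["1", "2"]], buffer, header=["letter", "number"])
output = buffer.getvalue()
```

The stream can be sys.stdout or an open file, which matches the two output choices (standard output or the -o file).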

Links

CSV Data Mix Project Page

Welcome to the CSV Data Mix project page. This page used to be hosted on Sourceforge as a static page, but I don't want to write an informative project page as part of the project. So I decided to move the general "about" page to this blog. 

Overview

This project's goal is to provide a safe way to give more realistic testing data to application development teams. To do this, the project assumes that comma separated values are used as data input for test applications. Real data may be considered sensitive, so before it can be used in a testing environment it needs to be concealed in some way to protect its privacy. This is where the CSV Data Mix application is used.

Explanation

So how does the CSV Data Mix application accomplish its goal? Easy. Given the actual data that must be concealed prior to distribution for whatever reason (e.g. testing), the application accepts the data as input, randomizes the data, and then outputs it accordingly (standard output or file). The final state of the new data will have each original column of data shuffled around.

The process of doing this requires a bit more detail, and can be understood better by reading the source code, but this is the basic idea. This project is written in Python and has been published under the GNU General Public License and is open source software. Please reference the links below to learn more about Python and the GNU General Public License.

This approach to concealing data is not new. You may have heard of data masking or data transformation. Although these techniques differ in the way they present the new data, they share a similar goal: concealing sensitive data.

Links
