Thursday, January 1, 2009

CSV Data Mapping

Starting with release 0.4 of csvdatamix, the file release includes an additional CSV mapping class, CSVMap. CSVMap compares file-a to file-b and generates a map of where the data in file-a occurs in file-b. This assumes that file-b is a shuffled or otherwise modified version of the original data in file-a. The two input files should have the same tabular shape, that is, the same number of rows and columns.

Warning: This implementation is best-effort, meaning the resulting map does not necessarily show exactly how the data was converted. To build the map, the process finds the first occurrence of a value in the output and notes its location; when the same value is encountered again, the search resumes from where the last match was found. For example, if "awesome" appears twice in a column, the first occurrence of "awesome" in file one is mapped to its first occurrence in file two. The next occurrence of "awesome" in file one is mapped to the next occurrence of "awesome" in file two, and so on.
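
The matching rule in the warning above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the actual CSVMap code: the function name map_column and the choice to record None for a value that cannot be found are assumptions made for this example.

```python
def map_column(original, shuffled):
    """Map each position in `original` to a position in `shuffled`.

    Hypothetical helper, not part of csvdatamix. Repeated values are
    matched in order: each search resumes just after the previous match
    for the same value.
    """
    last_found = {}  # value -> index of its most recent match in `shuffled`
    mapping = []
    for value in original:
        start = last_found.get(value, -1) + 1
        try:
            position = shuffled.index(value, start)
        except ValueError:
            position = None  # assumption: unmatched values map to None
        else:
            last_found[value] = position
        mapping.append(position)
    return mapping


column_a = ["awesome", "fine", "awesome", "good"]
column_b = ["good", "awesome", "awesome", "fine"]
print(map_column(column_a, column_b))  # [1, 3, 2, 0]
```

Note that the two "awesome" entries are interchangeable, so the map shows one valid pairing rather than the one the shuffle actually produced.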

How it Works
The following will provide some insight into how the mapping works.

Original Data
The following table shows what the original data looks like in CSV tabular format. The x,y coordinates mark positions in the table, where x is the column and y is the row. The data is represented as lists: each row is stored as a list whose entries correspond to the x,y coordinates. These row lists are kept in a single container, which is also a list (a list of lists).

x0,y0      x1,y0      x2,y0      ...    ...    x(n),y0
x0,y1      x1,y1      x2,y1      ...    ...    x(n),y1
x0,y2      x1,y2      x2,y2      ...    ...    x(n),y2
x0,y3      x1,y3      x2,y3      ...    ...    x(n),y3
...        ...        ...        ...    ...    ...
...        ...        ...        ...    ...    ...
x0,y(n)    x1,y(n)    x2,y(n)    ...    ...    x(n),y(n)
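
As a rough illustration of the list-of-lists layout, the sketch below reads CSV text with Python's standard csv module. The helper name read_rows and the sample data are made up for this example, and csvdatamix may load its input differently.

```python
import csv
import io


def read_rows(handle):
    """Return the CSV content as a list of rows, each row a list of strings."""
    return [row for row in csv.reader(handle)]


# Sample data invented for this example.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
rows = read_rows(sample)

# The entry at coordinates x,y is rows[y][x]: y selects the row, x the column.
print(rows[1][0])  # 1
print(rows[2][2])  # 6
```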


Reorganized Data
To mix the data properly, it must be reorganized so that values are only mixed with similar values. To do this, each column in the original data is restructured as a row, so column one of the original data becomes row one of the reorganized data. This structure is shown below:

x0,y0      x0,y1      x0,y2      x0,y3      ...    ...    ...    x0,y(n)
x1,y0      x1,y1      x1,y2      x1,y3      ...    ...    ...    x1,y(n)
x2,y0      x2,y1      x2,y2      x2,y3      ...    ...    ...    x2,y(n)
...        ...        ...        ...        ...    ...    ...    ...
...        ...        ...        ...        ...    ...    ...    ...
x(n),y0    x(n),y1    x(n),y2    x(n),y3    ...    ...    ...    x(n),y(n)
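
The reorganization above is a transpose of the list of lists. One common way to write it in Python is with zip; the function name reorganize is hypothetical, and the sketch assumes every row has the same number of columns.

```python
def reorganize(rows):
    """Turn each column of `rows` into a row (transpose a list of lists).

    Assumes all rows have the same length; zip would silently drop the
    extra entries of longer rows otherwise.
    """
    return [list(column) for column in zip(*rows)]


original = [
    ["x0,y0", "x1,y0", "x2,y0"],
    ["x0,y1", "x1,y1", "x2,y1"],
]
for row in reorganize(original):
    print(row)
# ['x0,y0', 'x0,y1']
# ['x1,y0', 'x1,y1']
# ['x2,y0', 'x2,y1']
```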


Shuffling Process
The shuffling process is fairly simple. Each list, or row, in the reorganized data is passed to the shuffle function of Python's random module. The result has the same structure as the reorganized data, and the goal of concealing the original state of the data is accomplished: the position of each value within its list is shuffled among similar data, while the actual values remain the same.
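
A minimal sketch of this step, assuming the reorganized data is a list of lists. The function name shuffle_rows and the optional seed parameter are additions for this example (the seed only makes a run repeatable); they are not taken from csvdatamix.

```python
import random


def shuffle_rows(reorganized, seed=None):
    """Return a copy of `reorganized` with the values in each row shuffled.

    Values never move between rows, so each value stays with data from
    the same original column.
    """
    rng = random.Random(seed)
    shuffled = [list(row) for row in reorganized]  # leave the input untouched
    for row in shuffled:
        rng.shuffle(row)  # random.shuffle works in place
    return shuffled


reorganized = [["ann", "bob", "cat"], ["10", "20", "30"]]
mixed = shuffle_rows(reorganized, seed=2009)
print(mixed)
```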

Data Output
The output process is fairly simple as well. It takes the shuffled, reorganized data and writes it out as comma-separated values.
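
The output step could look something like the sketch below, using the standard csv module. The post does not say whether the data is turned back into its original row and column layout before being written; since the mapping class expects both files to have the same shape, this example assumes it is, and exposes that choice as a hypothetical restore_layout flag.

```python
import csv
import io


def write_rows(shuffled, handle, restore_layout=True):
    """Write the shuffled data to `handle` as comma-separated values.

    With restore_layout=True (an assumption, see above) the data is
    transposed back so each shuffled row becomes a column again.
    """
    rows = zip(*shuffled) if restore_layout else shuffled
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(rows)


buffer = io.StringIO()
write_rows([["a", "c"], ["b", "d"]], buffer)
print(buffer.getvalue())
# a,b
# c,d
```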
