Extract and sort data from the .mdb file using mdbtools in Python

advertisements

I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I was able to just take the .mdb file and dump all the tables into .csv. It creates huge mess since there are lots of files that need to be processed.

What I need is to extract particular sorted data from particular table. Like for example, I need the table called "Voltage". The table consists of numerous cycles and each cycle has several rows also. The cycles usually go in chronological order, but in some cases time stamp get recorded with delay. Like cycle's one first row can have later time than cycles 1 first row. I need to extract the latest row of the cycle based on time for the first or last five cycles. For example, in table below, I will need the second row.

Cycle#    Time        Data
  1      100.59        34
  1      101.34        54
  1      98.78         45
  2
  2
  2   ...........

Here is the script I use. I am using the command python extract.py table_files.mdb. But I would like the script to just be invoked with ./extract.py. The path to filenames should be in the script itself.

import sys, subprocess, os

DATABASE = sys.argv[1]

subprocess.call(["mdb-schema", DATABASE, "mysql"])

# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
                               stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()

print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()

# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ","_") + ".csv"
        file = open(filename, 'w')
        print("Dumping " + table)
        contents = subprocess.Popen(["mdb-export", DATABASE, table],
                                    stdout=subprocess.PIPE).communicate()[0]
        file.write(contents)
        file.close()


Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.

A few suggestions:

  1. Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.

  2. When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.

  3. At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.

Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:

SELECT
    v1.SourceFile,
    v1.Cycle,
    v1.Time,
    v1.Data
FROM
    Voltage v1
    INNER JOIN
    (
        SELECT
            SourceFile,
            Cycle,
            MAX([Time]) AS MaxTime
        FROM Voltage
        GROUP BY SourceFile, Cycle
    ) v2
        ON v1.SourceFile=v2.SourceFile
           AND v1.Cycle=v2.Cycle
           AND v1.Time=v2.MaxTime