A lot of my job involves importing data, and over the years I've found a few nice ways to do this. As Monday was a holiday (and I didn't want
to do any actual work) I figured I'd spend some time to put more of my brain on the web.
As always, this is only my personal experiences about what seems to work for me, and if you know of better ways to do it, please leave some comments.
There are couple of major cases I've come across the need to import data, and they should be treated a little differently:
- Initially importing a customer's old data into a new database schema.
- Habitual imports of data from other systems.
I'll talk about initial imports now, since the habitual ones follow a lot of the same guidelines.
These tend to be a hairy. The data you get from the customer is often is an inconvienent format, and in sorry shape. The whole
reason they need a new database is because their old one is god-awful, and doesn't meet their needs. Here we have to deal with schema changes,
invalid data, inconsistent relations, and usually a bunch of data the customer isn't really going to use, but needs in there anyway.
For these large imports you want to automate everything you can. You're obviously going to have some script or program to select from
one and insert into the other, and you want to make the import a one step operation. You want to push the button then go get a beer
and pray it worked.
There are two ways I've approached this, and they have their ups and downs.
Get it all in the DB, then munge.
Copy the source data, in it's original schema, into your database, and then run queries to convert the old schema to the new schema.
On SQL Server, DTS packages are a good way to automate this. An easy way to make one is to use Enterprise Manager:
- Browse to the table list for your database.
- Right click, choose All Tasks->Import data
- Follow the wizard, choosing data sources and data destinations. When you choose destination tables, make
sure you don't have any naming conflicts. To make it easy for me to tell what is imported, I often prefix
the imported table names with "imp_" or "import_".
- At the end of the wizard, you have the option to save the process. Do so however you choose, DTS packages are pretty
easy to deal with, so that's my preference.
After you have the source data/schema in your destination schema, you get to do the hard part: the SQL script to convert from one schema to another.
Some general tips:
- Be generous with PRINT statments and comments. I frequently use PRINT statements as comments, so when its actually running I can easily track the
progress and have a decent idea of where it breaks. I use Query Analyzer mostly, and that is notoriously bad about what line the script failed on.
-
Use set operations as much as you can. The major advantage of importing the old data/schema intact is to allow set operations, which is absurdly faster than looping over a dataset.
- Make a clean up script that wipes your destination database. This can be useful after a failed import, or when the customer gives you a newer copy of the source data.
- Wrap your import in a transaction. This allows you to recover automatically from failures. If you don't, then you'll need to manually run a
clean up script after each failure.
- Keep an eye on your transaction logs. They can get out of control if they aren't restricted, and an afternoon of debugging an import script
can balloon the thing to preposterous sizes. Clear the log
regularly, possibly as part of the cleaning script.
There are several common issues you'll run into converting from old schema to new:
- Changing column types. The old database may have been storing numeric data as text. Before getting the old data in, you'll need to
correct all the type conversion errors first. Manually search the old data for problems, using WHERE ISNUMERIC(column) = 0. Once you find the
problem rows, add some UPDATE statements into your import script. You'll end up with a lot of statements like this:
|
| UPDATE import_munchkin_leagues SET |
| members = '4' |
| WHERE members = '4r' |
Now that your data can be converted, I like using DDL statements to evolve the old schema, making it closer to the destination schema.
|
| ALTER TABLE import_munchkin_leagues ALTER COLUMN members int; |
| ALTER TABLE import_munchkin_leagues ALTER COLUMN dues money |
Now the actual data import is trivial:
|
| INSERT INTO MunchkinLeagues (Members, Name, Dues) |
| (SELECT members, league_name, dues |
| FROM import_munchkin_leagues) |
You could do the conversions in the SELECT statement, but I find it a lot easier to think about and debug as seperate steps.
-
Making lookup tables. A simple
|
| INSERT INTO SlipperTypes (Name) |
| (SELECT DISTINCT slipper_type FROM import_witches) |
works in a lot of cases, but when types are entered by users, you'll end up with a lot of near-duplicates ("Ruby" versus "ruby" versus "rubyt") that will
need to just be hashed out with the client. Making a simple tool to let the customer replace all uses of a bad type ("rubyt") with a proper type ("Ruby") is
a good way to easily fix these, and that task can often be put back on the client.
-
Maintaining relationships. If you have more than a flat table to import, then you'll need to maintain the relationships. I like
adding a column to an old table that links to it's counterpart in the new schema. After copying the rows from the old table
to the new table, update that column the old table to point at it's corresponding row in the new table. I usually need one of
those bridges for each major table in the old schema.
-
Normalizing. A lot of times the old schema will have columns like: "Color1", "Color2", "ColorN" to get a M-N relationship. Usually
the new schema will have a "Colors" table, with a helper table to model the many-to-many relation.
Use UNIONs to convert those columns to rows:
|
| INSERT INTO Colors (Name) |
| (SELECT Color1 FROM import_horses |
| UNION |
| SELECT Color2 FROM import_horses |
| UNION |
| SELECT Color3 FROM import_horses) |
In SQL Server, the UNION operator eliminates duplicates, but I'm not sure about other RDBMS, you might need a DISTINCT in there.
Now, assuming you've added a column linking to your destination table, you can seed your many-to-many table to import the relationship:
|
| INSERT INTO HorseColors (ColorId, HorseId) |
| ( |
| SELECT Colors.Id, HorseId FROM Colors |
| JOIN ( |
| SELECT Color1, HorseId FROM import_horses |
| UNION |
| SELECT Color2, HorseId FROM import_horses |
| UNION |
| SELECT Color3, HorseId FROM import_horses |
| ) ih ON ih.Color1 = Colors.Name |
| ) |
Munge in object space
O/R mappers are pretty common now, and sometimes its quicker to write a console application, and run the imports in object space. Odds are you
already have a data access layer of some kind for your new database. There are a few pros and cons.
Pros
- If the DALs are generated, this can be faster to write.
- Sometimes its easier to think in object space.
Cons
- Speed. All the data is having to go into your app, then to the database. You also can't easily do set operations in most OO languages,
so everything gets iterated over at some point. I had one import application run for about 4 days.
I'll put down my thoughts on habitual imports some other time. For now, I need to do some of that actual work.