Dealing with Databases - Inserting, Updating Etc.

imperator · 2 years ago

csm10495 · 2 years ago

I typically like using sqlalchemy’s ORM for my database operations.

For something simple, using it with sqlite3 can be more efficient than parsing through a JSON file. Combine this with a primary key to avoid the double-insertion problem (so you don’t have to iterate through the existing rows before inserting) and it can work out quite well.
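A minimal sketch of the primary-key idea, using the standard-library sqlite3 module directly rather than sqlalchemy to keep it short. The table name, column names, and sample rows are hypothetical, and `INSERT OR IGNORE` is SQLite-specific syntax:

```python
import sqlite3

# In-memory database for illustration; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE header (id INTEGER PRIMARY KEY, name TEXT)")

# id 1 appears twice to simulate importing the same record again.
rows = [(1, "first"), (2, "second"), (1, "first again")]

# INSERT OR IGNORE skips any row whose primary key already exists,
# so there is no need to SELECT before each insert.
conn.executemany("INSERT OR IGNORE INTO header (id, name) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT id, name FROM header ORDER BY id").fetchall()
print(stored)
```

The duplicate row is dropped silently here. If the second copy should overwrite the first instead, that is an update-or-insert problem rather than an ignore-duplicates one.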

I’ve never really used Pandas dataframes though.

Another fun option (if you’re willing to use a disk cache rather than a database) is https://github.com/grantjenks/python-diskcache. Behind the scenes it actually also uses a sqlite3 db.

imperator · 2 years ago

What function or class from that library would I use to do this? Or what keywords can I search for to learn more? I’m struggling to wrap my head around it.

kamenoko · 2 years ago

Here is a basic tutorial for how sqlalchemy works. If you already have a database in place you’ll have to port your schema to it, but it might be worth trying out and seeing if it’s more performant.

atzanteol · 2 years ago

Databases are more efficient with bulk queries.

Rather than querying each entry individually, batch your data and query for the existence of that batch (e.g. where key in (1,2,3,etc)). You could do this once per JSON document, once per 100 entries, or however it makes sense. You can then check the results for each key to determine whether to insert or update. Then commit on that batch set.
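Roughly, the batch approach could look like the sketch below. It uses the standard-library sqlite3 module; the `entry` table, its columns, and the `upsert_batch` helper are all hypothetical names for illustration, not anything from your schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (entry_key INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO entry VALUES (?, ?)", [(1, "old"), (2, "old")])
conn.commit()

def upsert_batch(conn, batch):
    """Insert or update one batch of (key, value) pairs using a single lookup."""
    if not batch:
        return (0, 0)
    keys = [k for k, _ in batch]
    # One "?" placeholder per key, e.g. "?,?,?" for three keys.
    placeholders = ",".join("?" * len(keys))
    existing = {
        row[0]
        for row in conn.execute(
            f"SELECT entry_key FROM entry WHERE entry_key IN ({placeholders})", keys
        )
    }
    to_update = [(v, k) for k, v in batch if k in existing]
    to_insert = [(k, v) for k, v in batch if k not in existing]
    conn.executemany("UPDATE entry SET value = ? WHERE entry_key = ?", to_update)
    conn.executemany("INSERT INTO entry (entry_key, value) VALUES (?, ?)", to_insert)
    conn.commit()  # one commit per batch, not per row
    return (len(to_insert), len(to_update))

counts = upsert_batch(conn, [(2, "new"), (3, "new")])
print(counts)  # (inserted, updated)
```

Databases cap the number of bound parameters per statement, so keep batches modest (the 100 entries mentioned above is comfortably within SQLite's limit).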

imperator · 2 years ago

Do you happen to have any examples? I’m just not sure how to convert the JSON example into a bulk query since I need to keep the reference and line detail associated with the header. There is no primary key across all 3 sections. It’s generated when I insert into the database.

atzanteol · edit-2 2 years ago

It’s a little hard to say without seeing your data structure. Is it something like this?

{ "header": {
    "id": 1,
    "items": [ {
        "name": "foo",
        "field2": "bar"
    } ]
} }

If you have something unique in the “header” you can create 2 tables with a dependency.

create table header ( id number );
create table item ( id number, header_id number, name varchar, field2 varchar);

You can generate IDs for each item on the fly, but you won’t be able to tie them back to the JSON. BUT if you can tie the header back to the JSON then you can do a “drop-and-replace” on the items with each run. That may not be the most efficient approach, but it will likely perform better than querying each row upon entry. e.g. (pseudocode)

for each header in headers {
   delete from item where item.header_id = header.id;
   for each item in header.items {
       insert into item values ( some_id, header.id, item.name, item.field2 );
   }
   commit;
}
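
For reference, a runnable sketch of the same drop-and-replace loop in Python with the standard-library sqlite3 module. It assumes the JSON shape guessed at above, lets SQLite generate the item IDs, and uses a hypothetical `replace_items` helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE header (id INTEGER PRIMARY KEY)")
conn.execute(
    """CREATE TABLE item (
           id INTEGER PRIMARY KEY,   -- generated by SQLite on insert
           header_id INTEGER REFERENCES header(id),
           name TEXT,
           field2 TEXT)"""
)

def replace_items(conn, headers):
    """Delete and re-insert the items of each header, one transaction per header."""
    for header in headers:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "INSERT OR IGNORE INTO header (id) VALUES (?)", (header["id"],)
            )
            conn.execute("DELETE FROM item WHERE header_id = ?", (header["id"],))
            conn.executemany(
                "INSERT INTO item (header_id, name, field2) VALUES (?, ?, ?)",
                [(header["id"], i["name"], i["field2"]) for i in header["items"]],
            )

headers = [{"id": 1, "items": [{"name": "foo", "field2": "bar"}]}]
replace_items(conn, headers)
replace_items(conn, headers)  # a second run replaces the items, no duplicates

item_count = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
print(item_count)
```

Note that the generated item IDs change on every run, so nothing else should hold on to them.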

But if you don’t want to drop/re-create, and there’s some combination of things in the “item” that is unique, then you can use that as a compound key. In the worst case you can just use all the columns. I once created a primary key that was an MD5 checksum of the string value of all the fields in the row. It gave me a calculable primary key, which was good, and I could query off it easily. But it does make expanding the table much harder…
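
A sketch of what such a checksum key could look like in Python. The `row_key` helper, the column list, and the separator character are assumptions for illustration, not the exact scheme I used back then:

```python
import hashlib

def row_key(row, columns):
    """Build a deterministic key from the string value of the given columns."""
    # Join with a separator that should not occur in the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# A fixed column order matters: the same fields must always hash the same way.
columns = ["header_id", "name", "field2"]

a = row_key({"header_id": 1, "name": "foo", "field2": "bar"}, columns)
b = row_key({"field2": "bar", "name": "foo", "header_id": 1}, columns)
c = row_key({"header_id": 1, "name": "foo", "field2": "baz"}, columns)

print(a == b)  # same values, so same key
print(a == c)  # one field differs, so a different key
```

This also shows why expanding the table is painful: adding a column to `columns` changes the key of every existing row.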

The advantage of drop-and-replace will be that removed items in the JSON will also be removed in the database. Otherwise you’ll need to do some additional cleanup to find database entries that don’t have an entry in your JSON file(s).
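
If you go the update-in-place route instead, the cleanup step could be sketched like this with the standard-library sqlite3 module. The `header` table and the `delete_missing` helper are hypothetical, and it assumes the set of IDs seen in the JSON fits in memory:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE header (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO header (id) VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

def delete_missing(conn, seen_ids):
    """Delete rows whose id no longer appears in the JSON; return the removed ids."""
    db_ids = {row[0] for row in conn.execute("SELECT id FROM header")}
    stale = sorted(db_ids - set(seen_ids))
    with conn:  # single transaction for the whole cleanup
        conn.executemany("DELETE FROM header WHERE id = ?", [(i,) for i in stale])
    return stale

# Pretend the latest JSON file only contains headers 1 and 3.
removed = delete_missing(conn, [1, 3])
remaining = [row[0] for row in conn.execute("SELECT id FROM header ORDER BY id")]
print(removed, remaining)
```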