|
This is a nightmare scenario: Quote: Of course it wasn't always like that, but the import table in Azure SQL has over 2 million records now and I need to validate for double records,
I've done this before, and it's slow no matter what you do, even though our DB is well optimized. What I wound up doing was pulling only the unique fields and storing them in an array. It's one pull (huge, but doable with only one or two values per record). Then, in the duplicate-record validation, I just check each new record against that one array before any DB insert/update work, sorting the new records into two arrays: one for inserts, one for updates. That way the inserts are quicker. If you are trashing the duplicates, you can skip the update part.
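Roughly the shape of it, as a Python/pyodbc sketch. The connection string, table and column names (dbo.ImportTable, UniqueKey) are placeholders, not the actual schema:
```python
import pyodbc

# Placeholder connection string and names -- one narrow pull of just the
# unique field, kept in memory as a set for cheap lookups.
conn = pyodbc.connect("DSN=MyAzureSql;UID=me;PWD=secret")
cur = conn.cursor()

existing_keys = {row.UniqueKey for row in
                 cur.execute("SELECT UniqueKey FROM dbo.ImportTable;")}

def partition(new_records):
    """Split incoming CSV records before any DB work: one list to insert, one to update."""
    to_insert, to_update = [], []
    for rec in new_records:                    # rec assumed to be a dict parsed from the CSV
        if rec["UniqueKey"] in existing_keys:
            to_update.append(rec)              # or drop it, if duplicates are simply trashed
        else:
            to_insert.append(rec)
            existing_keys.add(rec["UniqueKey"])   # duplicates inside the same file collapse too
    return to_insert, to_update
```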
|
|
|
|
|
So, you've isolated it to the DB load. In the ghosts of jobs past, I know inserting records into tables with indexes could bite you, since the DB has to do its thing. The application was a "real-time" database that gated prepaid phone calls. Although we didn't do batch operations, we might have anywhere from 512 to 4096 call-handler threads all hitting the database at roughly the same time (don't want to get pedantic). However, most of those transactions were either reads or updates; creation occurred only for new customers.
So, lots of index use, but no new index entries.
Please add a reminder to your task list to update this thread with what the actual problem was and how you solved it.
Oh, one other suggestion: it sounds to me like you could collect these files and set up an offline test system to play with, removing the web complexity. I know you've been doing this a long time, so you're probably doing that already.
Charlie Gilley
“They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” BF, 1759
Has never been more appropriate.
|
|
|
|
|
Interesting. We have issues with CSV upload too, but ours are the upload itself failing; we use CSV because of the lack of an API from an outside vendor. It's AWS to local. That's a topic in itself, with a long, dumb, winding road involving a large vendor. We haven't had issues with Azure connections, though.
My first question is about process: it sounds like you have control over both ends of this? If so, why a CSV rather than a direct upsert? Five seconds feels like an eternity (it really does feel that way today!), but CSV-to-DB with validation takes a while. Our environment is different, Postgres on *nix with a triggered background cron, and our line count is only in the hundreds per upload, so the numbers don't compare. What is an import "table" and why does it exist, or is that just wording for the real destination table?
This caught my eye: Quote: Not so much an issue in my development environment, but apparently a bottleneck in production. Do you mean the time just isn't an issue, or does it actually run faster? I'm not sure why 5 seconds would create a "bottleneck" in production with five uploads a day. The timeout is generated on your end, right? Bump it up, or calculate it from the estimated process time. If it ultimately works, let it run, no? Yeah, it's annoying looking at a spinner for a bit, but even if it's 30 seconds five times a day, no one is that busy.
Sorry, I haven't worked with SQL Server in years, but we do find that the more work you can push into a single, longer SQL query, the better the performance versus processing outside the database. Again, with the SQL Server caveat above, I'd also index the heck out of it with 2 million records.
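For what it's worth, this is the kind of thing I mean by pushing the work into one longer statement: a single set-based MERGE from a staging table instead of a round trip per CSV row. Only a sketch, with made-up names (dbo.ImportStaging, dbo.Customers, CustomerKey, Payload) and a made-up connection string:
```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=me;PWD=secret"
)

# One set-based statement instead of validating/inserting row by row from the client.
# Assumes dbo.ImportStaging already holds the parsed CSV and has no duplicate keys.
upsert_sql = """
MERGE dbo.Customers AS target
USING dbo.ImportStaging AS source
      ON target.CustomerKey = source.CustomerKey
WHEN MATCHED AND target.Payload <> source.Payload THEN
     UPDATE SET Payload = source.Payload
WHEN NOT MATCHED BY TARGET THEN
     INSERT (CustomerKey, Payload) VALUES (source.CustomerKey, source.Payload);
"""

cur = conn.cursor()
cur.execute(upsert_sql)
conn.commit()
```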
|
|
|
|
|
I would implement a temp table with a twist: a diff.
I would create a table, let's call it "ImportData", with 3 columns:
ID: I assume there is a field in the CSV that is unique
Data: the complete CSV line
Status: an import status depending on the operation
I like char instead of int with "magic numbers" for these kinds of statuses.
A single char will not hurt performance.
Then:
1: Set the status to '?' before importing
2: Import to this table row by row
3: Have a SQL trigger that compares the value of column 'Data'
If it changed, set Status to 'U' (Updated); if it did not change, set Status to 'N' (NOP, no operation)
4: At the end, set all rows still at '?' to 'D' (Deleted); these are the rows not present in the CSV
5: Process all rows with a status different from 'N' (the changed ones)
If 'U', insert or update in the main table; if 'D', delete from the main table
This will give you a differential update, with only the changed rows being touched in your main data table (rough sketch below).
It will also give you a 'snapshot' of all data received in the last CSV file. Very useful for debugging.
I have several imports done this way; usually the imports are very fast, since (in my case) most CSV rows do not change.
If the import data is XML or JSON, you can do the same by saving the XML node or JSON node in the 'Data' column.
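A minimal sketch of that scheme, assuming SQL Server / Azure SQL reached through pyodbc. The table, trigger, column names and connection string are invented, and brand-new IDs are handled by inserting them directly with status 'U', which the description above leaves open:
```python
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=me;PWD=secret")
cur = conn.cursor()

# One-time setup: the import table and the diff trigger.
cur.execute("""
CREATE TABLE dbo.ImportData (
    ID     varchar(64)   NOT NULL PRIMARY KEY,   -- the unique field from the CSV
    Data   nvarchar(max) NOT NULL,               -- the complete CSV line
    Status char(1)       NOT NULL DEFAULT 'U'    -- '?', 'U', 'N' or 'D'
);
""")
cur.execute("""
CREATE TRIGGER dbo.trg_ImportData_Diff ON dbo.ImportData
AFTER UPDATE AS
BEGIN
    SET NOCOUNT ON;
    -- Only react when Data itself was updated, so the "reset everything
    -- to '?'" step does not overwrite the status.
    IF UPDATE(Data)
        UPDATE t
           SET Status = CASE WHEN d.Data = i.Data THEN 'N' ELSE 'U' END
          FROM dbo.ImportData t
          JOIN inserted i ON i.ID = t.ID
          JOIN deleted  d ON d.ID = t.ID;
END
""")
conn.commit()

def run_import(rows):                # rows = [(csv_id, raw_csv_line), ...]
    cur.execute("UPDATE dbo.ImportData SET Status = '?';")              # step 1
    for rec_id, line in rows:                                           # steps 2-3 (trigger sets U/N)
        updated = cur.execute("UPDATE dbo.ImportData SET Data = ? WHERE ID = ?;",
                              line, rec_id).rowcount
        if updated == 0:
            # New ID: insert it; the default Status 'U' marks it for processing.
            cur.execute("INSERT INTO dbo.ImportData (ID, Data) VALUES (?, ?);",
                        rec_id, line)
    cur.execute("UPDATE dbo.ImportData SET Status = 'D' WHERE Status = '?';")  # step 4
    conn.commit()
    # Step 5: the changed rows ('U' and 'D') drive the insert/update/delete on the main table.
    return cur.execute("SELECT ID, Data, Status FROM dbo.ImportData WHERE Status <> 'N';").fetchall()
```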
|
|
|
|
|
This is a huge area with all sorts of possible ways of solving it.
I currently work on a rather huge codebase, part of which performs automated imports and exports in a batching system.
That would probably be overkill for what you are wanting.
There isn't one solution for everything in this sort of case; however, a staging table is always a good place to start: import the data into a staging table, then you can decide on a strategy from there.
The strategy might involve selectively picking records that are then fed into your main tables, or even allowing the user to choose the data.
The advantage of a staging table is that you can isolate any issues within the staging table, as well as query it and analyse the data from there without affecting your main tables (a rough sketch follows).
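Something like this, with invented table and column names (dbo.Staging_Import, CustomerKey, Name, Amount) and a made-up DSN; the point is just to land the raw file in staging first, inspect it, and only then decide what moves on:
```python
import csv
import pyodbc

conn = pyodbc.connect("DSN=MyAzureSql;UID=me;PWD=secret")
cur = conn.cursor()
cur.fast_executemany = True    # batch the parameterized inserts instead of one round trip per row

# Land the raw file in the staging table first...
with open("upload.csv", newline="", encoding="utf-8") as f:
    rows = [(r["CustomerKey"], r["Name"], r["Amount"]) for r in csv.DictReader(f)]

cur.executemany(
    "INSERT INTO dbo.Staging_Import (CustomerKey, Name, Amount) VALUES (?, ?, ?);",
    rows)
conn.commit()

# ...then analyse the batch in isolation, e.g. spot duplicates before touching the main tables.
dupes = cur.execute("""
    SELECT CustomerKey, COUNT(*) AS Occurrences
      FROM dbo.Staging_Import
     GROUP BY CustomerKey
    HAVING COUNT(*) > 1;
""").fetchall()
```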
As for timings: for something like an import there should really be no guarantee of how many seconds it takes, because there are so many variables out of your control (connection speed, number of records, record sizes, etc.).
What you can do is provide some sort of progress bar to let the user know what stage things are at (I actually deliberately slow the progress bar in some applications so that users get a sense of something happening, rather than something flashing on and off the screen).
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
|
|
|
|
|
This falls under the 'checking to make sure I'm not leaving my drink on the roof of my car before driving off' category of suggestions.
You may want to make sure the database connection is not closing and reconnecting between each record. This is something I ran into back in my old Access days.
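In other words, hoist the connect out of the per-record loop. A trivial pyodbc sketch with a made-up DSN and table, just to show the shape:
```python
import pyodbc

CONN_STR = "DSN=MyAzureSql;UID=me;PWD=secret"        # made-up connection details

# Anti-pattern: open and close a connection for every record.
def insert_slow(records):
    for rec_id, data in records:
        conn = pyodbc.connect(CONN_STR)
        conn.cursor().execute(
            "INSERT INTO dbo.ImportTarget (ID, Data) VALUES (?, ?);", rec_id, data)
        conn.commit()
        conn.close()

# Better: connect once, reuse the cursor, commit at the end.
def insert_fast(records):
    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()
    for rec_id, data in records:
        cur.execute("INSERT INTO dbo.ImportTarget (ID, Data) VALUES (?, ?);", rec_id, data)
    conn.commit()
    conn.close()
```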
|
|
|
|
|
I had a similar issue where I needed to import tens of thousands of records every day, most of them duplicates. We only inserted if the record didn't already exist. It was a serious performance bottleneck.
I generated hashes for all of the existing rows and added an index on the new hash column. Now I calculate the hash of the values in each new row to be inserted and compare it to the existing rows, skipping the insert if an identical row already exists. Using the index solved the insert performance problem for me.
I don't recommend MD5 or similar weak hashes, because duplicate hashes are too easy to come by.
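Something along these lines, with an invented table (dbo.Readings) and a RowHash column computed with SHA-256 on the client; the real schema and column list will differ:
```python
import hashlib
import pyodbc

# Invented schema: dbo.Readings(SensorId, ReadingTime, Value, RowHash),
# with a nonclustered index on RowHash so the existence check is an index seek.
conn = pyodbc.connect("DSN=MyAzureSql;UID=me;PWD=secret")
cur = conn.cursor()

def row_hash(*values):
    # SHA-256 over the normalized values, joined with a separator so fields don't bleed together.
    joined = "\x1f".join("" if v is None else str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def insert_if_new(sensor_id, reading_time, value):
    h = row_hash(sensor_id, reading_time, value)
    if cur.execute("SELECT 1 FROM dbo.Readings WHERE RowHash = ?;", h).fetchone() is None:
        cur.execute(
            "INSERT INTO dbo.Readings (SensorId, ReadingTime, Value, RowHash) "
            "VALUES (?, ?, ?, ?);",
            sensor_id, reading_time, value, h)

# ...call insert_if_new for each CSV row, then conn.commit()
```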
|
|
|
|
|
- Figure out the bottleneck.
Since it does not happen in dev but does in production, I suspect it is the duplicate checking. If you size dev the same as production, will it reproduce in dev?
Also, do you have more fault tolerance/active nodes in production? If you are doing a few thousand small transactions in a high availability setup, it will take longer.
Do you perform all of your duplicate checks in a single query? Or one import row at a time?
If you set up good unique indexes, you can skip the duplicate checking and let the DB do it for you. This would point you toward a commit per import row.
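For example (made-up table, column and index names; the unique index has to match whatever actually defines a duplicate): insert row by row, commit per row, and let the unique index reject the duplicates. IGNORE_DUP_KEY is another option if you'd rather SQL Server silently discard them.
```python
import pyodbc

conn = pyodbc.connect("DSN=MyAzureSql;UID=me;PWD=secret")   # hypothetical DSN
cur = conn.cursor()

# One-time setup: a unique index over whatever columns define "duplicate".
# (WITH (IGNORE_DUP_KEY = ON) would instead make SQL Server drop duplicates silently.)
cur.execute("""
    CREATE UNIQUE INDEX UX_ImportTarget_Natural
        ON dbo.ImportTarget (CustomerKey, ReadingTime);
""")
conn.commit()

def import_rows(rows):
    inserted = skipped = 0
    for customer_key, reading_time, value in rows:
        try:
            cur.execute(
                "INSERT INTO dbo.ImportTarget (CustomerKey, ReadingTime, Value) "
                "VALUES (?, ?, ?);",
                customer_key, reading_time, value)
            conn.commit()                       # commit per import row
            inserted += 1
        except pyodbc.IntegrityError:           # duplicate key: the DB did the check for us
            conn.rollback()
            skipped += 1
    return inserted, skipped
```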
We had an import feature that ended up with unnecessary exponential complexity. It was fine during testing but started getting really slow at only 150 rows.
|
|
|
|
|
#Worldle #267 1/6 (100%)
🟩🟩🟩🟩🟩🎉
https://worldle.teuteuf.fr
easy one
"A little time, a little trouble, your better day"
Badfinger
|
|
|
|
|
Wordle 484 6/6
🟨⬛⬛🟨🟩
⬛🟨🟨⬛🟩
🟩⬛🟩⬛🟩
🟩⬛🟩⬛🟩
🟩⬛🟩🟨🟩
🟩🟩🟩🟩🟩
|
|
|
|
|
Wordle 484 2/6
🟨🟨⬜🟩🟩
🟩🟩🟩🟩🟩
My first 2!
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
|
|
|
|
|
Wordle 484 3/6
⬜🟨⬜⬜⬜
⬜🟨🟩⬜🟩
🟩🟩🟩🟩🟩
Not my first 3 but still proud
"A little time, a little trouble, your better day"
Badfinger
|
|
|
|
|
Wordle 484 4/6
🟩⬛🟩⬛🟩
🟩⬛🟩⬛🟩
🟩🟩🟩⬛🟩
🟩🟩🟩🟩🟩
No comment! 
|
|
|
|
|
Wordle 484 5/6
🟨🟨⬜🟨⬜
🟨🟨🟩⬜⬜
⬜⬜🟩🟩🟩
⬜⬜🟩🟩🟩
🟩🟩🟩🟩🟩
|
|
|
|
|
Wordle 484 5/6
⬜🟨⬜⬜⬜
⬜⬜⬜🟨🟩
🟩⬜🟩⬜🟩
🟩🟩🟩⬜🟩
🟩🟩🟩🟩🟩
The order of my last two guesses was a coinflip
"If we don't change direction, we'll end up where we're going"
|
|
|
|
|
Wordle 484 4/6
🟨⬜⬜🟨🟩
🟩🟩🟩⬜🟩
🟩🟩🟩⬜🟩
🟩🟩🟩🟩🟩
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
Wordle 484 4/6
⬛⬛🟩⬛🟩
🟩🟩🟩⬛🟩
🟩🟩🟩⬛🟩
🟩🟩🟩🟩🟩
|
|
|
|
|
Wordle 484 5/6
⬜⬜⬜⬜🟨
🟨⬜⬜⬜🟩
🟩🟨⬜⬜🟩
🟩⬜🟩🟨🟩
🟩🟩🟩🟩🟩
|
|
|
|
|
Wordle 484 6/6
⬜⬜🟩⬜🟩
🟩⬜🟩⬜🟩
🟩⬜🟩⬜🟩
🟩⬜🟩⬜🟩
🟩⬜🟩⬜🟩
🟩🟩🟩🟩🟩
Phew
Life should not be a journey to the grave with the intention of arriving safely in a pretty and well-preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!" - Hunter S Thompson - RIP
|
|
|
|
|
|
But not so for iron, which is ironic.
|
|
|
|
|
|
I thought the adjective for iron was "feral".
|
|
|
|
|
So should the adjective for copper be capric?
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.
-- 6079 Smith W.
|
|
|
|
|
I think this thread is very, very thin.
"A little time, a little trouble, your better day"
Badfinger
|
|
|
|