Import wizard for huge files (5 million symbols,2.5GB+ into 230GB database)

@Tomasz , I made one test:

  1. i don't add new symbol or new quotes in database
  2. i just change field "Symbol" (or "Full name", no matter)
  3. field "Alias" is empty
  4. time of changing field in point 2 takes about 10 minutes for one symbol

It is unclear for me what exactly you are doing. Please send the data file and format definition that you are using. Also you need to be specific about the database you are importing to (number of symbols). Please send broker.master.

You need to remember that it does not matter what happens on your computer unless I receive ALL DATA required to REPRODUCE your setup and actually repeat steps that you do in my controlled environment. Leave no room for guess work.

@Tomasz , Thank you for your patience. I can upload the full database archive (50-60 GB), split into 2-5GB volumes. Or just send only broker.master?

I use Amibroker more then 10 years and it is great product:

  1. very powerful functionality
  2. very flexible due possibility to develop plugins
  3. very convenient to work with symbols and quotes

You don't need to send entire database, but please do send:

  1. broker.master file
  2. the files that you tried to import as mentioned in post #22 Import wizard for huge files (2.5GB+ into 230GB database) - #22 by axweye
  3. the import definition that you used for point 2.

Let me give you some reasonable (not hard coded) limits for AmiBroker database size.

While AmiBroker does not have any hard-coded limit, with current hardware and with "smooth" operation in mind the recommended maximums for single database are as follows:

  1. Upto 200 000 symbols per single database (which is enough to cover all major stock exchanges on the Earth)
  2. Upto 16 million quotes (bars) per single symbol (that is 640MB per symbol)

While AmiBroker will still work beyond those numbers, it won't be comfortable enough.

Really trying 5 million symbols in one database is not within design specifications and not advised.

1 Like

I can confirm some striking differences on imported data files. So when I process IEX Tops files into 1 minute, Day-Date files, they came out like so...
Capture1

Importing these files into a database already consisting of approx 9,200 symbols of 1 minute data back to December 2016. causes AB to jump to differing symbols each line of import. This is because the lines of data aren't sorted as Tomasz recommends. Performance wise, after first opening AB's database and doing the first import, with no cached symbol data (because it was the first import), the approx 700k lines of 1 minute data for that 1 day took my machine 34 minutes. Then I imported the next day (after most of symbol data was already cached into memory due to the previous import) the import only took 14.5 minutes, instead of 34 minutes. So that is an example of non-sorted data lines, without and with cached symbol data.

Now... I tested the performance again with new data, but with the file being imported sorted first like so...

Capture

Now the results are striking. Only 30 seconds without pre-cached symbol data and 17.2 seconds for import with symbol data pre-cached!

The result is that with the imported data file "Sorted" as apposed to unsorted (as viewed above) result in a 68x performance speed import increase (uncached) and 50x performance speed import increase (cached).

This is with the database stored on a Sabrent Rocket PCIE V4 NVME drive, so a spinning rust drive would make the cached vs non-cached even more pronounced; God forbid using a 5200 RPM laptop drive etc.

@Tomasz, the restrictions you point are fully understandable. On the other hand, this means serious limitations to test various option strategies in Amibroker. Ok, I'll think about non-direct way to test options in Ami.

@Sean - with regards to sorted / not sorted and cached / non-cached.
It is important to understand that while sorting by date is not that crucial, sorting by SYMBOL is because the database is organized so there are separate FILES for SYMBOLS, see
http://www.amibroker.com/guide/h_workspace.html

This means that adding unsorted quotes for the same SYMBOL is not a problem (because AmiBroker will just sort them easily as they are in just one array PER SYMBOL). So there is no issue if you have randomly ordered DATES within the symbol

It is also NOT a problem if symbols are NOT sorted AS LONG as quotes for ONE symbol appear in the imported file in BLOCK, i.e. ONE contiguous block of randomly ordered quotes PER SYMBOL in the file.

So this layout is good (as long as there are single contiguous blocks of same symbol data):

RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates

However IF you import data that jump from symbol to symbol multiple times, so quotes for ONE symbol are spread in entire file multiple times, and IF symbols can't fit in RAM cache it means that loading into cache, update, sort and flushing back to disk cycle is repeated multiple times for given symbol if the symbol appears in the imported file in random places multiple times. If in-RAM cache is large enough to accomodate all symbols, then it does not need to flush to disk and reads each symbol only once.

This is example of "BAD" file:

RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates

As you can see there are no contiguous blocks for same symbol data.

Whenever it is SSE or HDD matters less than you think because AmiBroker always accesses files in purely sequential way and in large blocks so HDD can benefit from sequential reads/writes without head movement (if file is properly defragmented). Of course SSD disk also get faster with larger sequential blocks.

1 Like

There is one performance problem that has been identified using your 5M symbol database that is caused by "TickerBox" (the ticker symbol field with combobox that allows you to select the symbol). This ticker box is using standard Windows ListBox.
ListBox is pretty old component in Windows (Microsoft did it in '90s) and it was never designed to handle more than 32K elements at the time they wrote it. While in modern Windows you can add more elements to it, it is pathetically slow and it takes few minutes to fill 5 million elements. Filling that listbox occurs when you load database or when you change symbol names or full names. Therefore this is yet another component of Windows that must be replaced by custom code to be able to handle millions of items quickly. So "rewriting Windows" saga continues.
Once I replace Windows component by custom code it should be able to handle millions of symbols within fraction of second (I hope).

3 Likes

Your cache setting is incorrect. You need to increase your cache settings (Tools->Preferences->Data) to be able hold ALL symbols in RAM (if possible). That way, even with unsorted data when importing to completely cached database you should be able to get time in SECONDS, not minutes.

Use Tools->Performance Monitor to see how many symbols are cached as compared to the number of symbols in the entire database.

image

1 Like

Tomasz, it would be great idea! Or another suggestion: don't fill "TickerBox" when database loads if "Symbols" window is closed in database

There are two symbol lists. "Symbols" and "TickerBox" are different animals.
"Symbols" window is FAST because it already features custom code (our own Listvew that is highly optimized and drives Analysis windows among other things and is capable of handling hundreds of millions of lines in no time).
"TickerBox" is slow because it uses Standard Windows ListBox component

image

Despite similar name ListVIEW and ListBOX are completely different animals.

You can close TickerBox yourself (View->Toolbars->Ticker menu) and observe how much it impacts performance for 5 million symbol database.

6 Likes

With closed TickerBox loading time is 30 sec vs 12 minute with opened TickerBox. Tomasz, thank you so much! Frankly say I never used TickerBox at all, so with closed is absolutely OK for me

1 Like

Run backtest time:
77 sec with closed TickerBox vs 980 sec with opened TickerBox

Amibroker is great software! ))

1 Like

That you Tom for all the great info and guidance!

In my development version I already have custom list box in TickerBox that is now INSTANT in refreshes. With everything open (Symbols list AND Ticker Box) and 2.5 million symbols database loading time is 5 seconds. Symbol rename (with full refreshes of both Symbols list and TickerBox) is 1 second.

1 Like

Tomasz, in that case, the loading /renaming_symbols time in my database is very strange for me. Maybe there is some kind of internal damage in it.

The folder with database was added to Microsoft Defender exclusions.

The key word is "in my development version". It means in the version I am CURRENTLY developing, I now have rewritten Windows list box to my own code that is many orders of magnitude faster.
You don't have that in-development version and you can't compare at the moment.

1 Like

Great result!
If possible, I am ready to become a tester of your development version ))

Thank you very much for the response. And I regret the delay in response on my part.
So much positive developments seems to have happened in between.
I will be sending the details by email to Support.

The forrthcoming versions seems to be getting more and more tempting.
Eagerly looking forward to the same.

With Regards

Sanjiv Bansal