Import wizard for huge files (5 million symbols,2.5GB+ into 230GB database)

Alias field is an alternative name for symbol. It should be left EMPTY, if possible, because it makes symbol lookups significantly slower, especially if your database has millions of symbols.

Thank you @Tomasz for the nice software. I do also use Native AmiBroker Database (Updated daily via ASCII Import). I take this opportunity again to express my happiness - for the Really Working ASCII Import feature - It really helps me maintain my database with high degree of confidence and convenience - and the forthcoming Stock.Split feature could be clearing the one little point of inconvenience I had faced.

And although the Imports are already fast - we do always want them to be faster and faster. One thing that came to my mind (might be a naive thought) - Can we have a "Database Setting" to enable/disable "Alias" field, will that make the Imports Faster in general?

With Regards

Sanjiv Bansal

Generally AmiBroker tries to figure out on its own whenever it needs to check Alias at all.
When no new symbols are added, it just needs one search to find out that you don't use aliases and it then internally marks "noaliasing" flag. Only when importing new symbols or importing Alias field or manipulating data via Symbol->Information and/or OLE it needs to verify that again.

Thank you very much for your guidance.

In my Database I do not use ALIAS - in the daily import there can be days when a few New Symbols are added, and there are days when No New Symbols are added.

In the Format definition file will it help to have "noaliasing" flag information at the top
or should I continue to be happy that the Application is already optimally taking care of the things?

With Regards

Sanjiv Bansal

Please tell me what import TIME you are getting now @Sumangalam and how many symbols?

@axweye 's database has 5 million symbols. It is rather unusual and as such it creates specific challenges. Typical database has less than 20-30K symbols.

@Tomasz , I made one test:

  1. i don't add new symbol or new quotes in database
  2. i just change field "Symbol" (or "Full name", no matter)
  3. field "Alias" is empty
  4. time of changing field in point 2 takes about 10 minutes for one symbol

It is unclear for me what exactly you are doing. Please send the data file and format definition that you are using. Also you need to be specific about the database you are importing to (number of symbols). Please send broker.master.

You need to remember that it does not matter what happens on your computer unless I receive ALL DATA required to REPRODUCE your setup and actually repeat steps that you do in my controlled environment. Leave no room for guess work.

@Tomasz , Thank you for your patience. I can upload the full database archive (50-60 GB), split into 2-5GB volumes. Or just send only broker.master?

I use Amibroker more then 10 years and it is great product:

  1. very powerful functionality
  2. very flexible due possibility to develop plugins
  3. very convenient to work with symbols and quotes

You don't need to send entire database, but please do send:

  1. broker.master file
  2. the files that you tried to import as mentioned in post #22 Import wizard for huge files (2.5GB+ into 230GB database) - #22 by axweye
  3. the import definition that you used for point 2.

Let me give you some reasonable (not hard coded) limits for AmiBroker database size.

While AmiBroker does not have any hard-coded limit, with current hardware and with "smooth" operation in mind the recommended maximums for single database are as follows:

  1. Upto 200 000 symbols per single database (which is enough to cover all major stock exchanges on the Earth)
  2. Upto 16 million quotes (bars) per single symbol (that is 640MB per symbol)

While AmiBroker will still work beyond those numbers, it won't be comfortable enough.

Really trying 5 million symbols in one database is not within design specifications and not advised.

1 Like

I can confirm some striking differences on imported data files. So when I process IEX Tops files into 1 minute, Day-Date files, they came out like so...
Capture1

Importing these files into a database already consisting of approx 9,200 symbols of 1 minute data back to December 2016. causes AB to jump to differing symbols each line of import. This is because the lines of data aren't sorted as Tomasz recommends. Performance wise, after first opening AB's database and doing the first import, with no cached symbol data (because it was the first import), the approx 700k lines of 1 minute data for that 1 day took my machine 34 minutes. Then I imported the next day (after most of symbol data was already cached into memory due to the previous import) the import only took 14.5 minutes, instead of 34 minutes. So that is an example of non-sorted data lines, without and with cached symbol data.

Now... I tested the performance again with new data, but with the file being imported sorted first like so...

Capture

Now the results are striking. Only 30 seconds without pre-cached symbol data and 17.2 seconds for import with symbol data pre-cached!

The result is that with the imported data file "Sorted" as apposed to unsorted (as viewed above) result in a 68x performance speed import increase (uncached) and 50x performance speed import increase (cached).

This is with the database stored on a Sabrent Rocket PCIE V4 NVME drive, so a spinning rust drive would make the cached vs non-cached even more pronounced; God forbid using a 5200 RPM laptop drive etc.

@Tomasz, the restrictions you point are fully understandable. On the other hand, this means serious limitations to test various option strategies in Amibroker. Ok, I'll think about non-direct way to test options in Ami.

@Sean - with regards to sorted / not sorted and cached / non-cached.
It is important to understand that while sorting by date is not that crucial, sorting by SYMBOL is because the database is organized so there are separate FILES for SYMBOLS, see
http://www.amibroker.com/guide/h_workspace.html

This means that adding unsorted quotes for the same SYMBOL is not a problem (because AmiBroker will just sort them easily as they are in just one array PER SYMBOL). So there is no issue if you have randomly ordered DATES within the symbol

It is also NOT a problem if symbols are NOT sorted AS LONG as quotes for ONE symbol appear in the imported file in BLOCK, i.e. ONE contiguous block of randomly ordered quotes PER SYMBOL in the file.

So this layout is good (as long as there are single contiguous blocks of same symbol data):

RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL3, random_dates

However IF you import data that jump from symbol to symbol multiple times, so quotes for ONE symbol are spread in entire file multiple times, and IF symbols can't fit in RAM cache it means that loading into cache, update, sort and flushing back to disk cycle is repeated multiple times for given symbol if the symbol appears in the imported file in random places multiple times. If in-RAM cache is large enough to accomodate all symbols, then it does not need to flush to disk and reads each symbol only once.

This is example of "BAD" file:

RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL3, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates
RANDOM_SYMBOL1, random_dates
RANDOM_SYMBOL2, random_dates

As you can see there are no contiguous blocks for same symbol data.

Whenever it is SSE or HDD matters less than you think because AmiBroker always accesses files in purely sequential way and in large blocks so HDD can benefit from sequential reads/writes without head movement (if file is properly defragmented). Of course SSD disk also get faster with larger sequential blocks.

1 Like

There is one performance problem that has been identified using your 5M symbol database that is caused by "TickerBox" (the ticker symbol field with combobox that allows you to select the symbol). This ticker box is using standard Windows ListBox.
ListBox is pretty old component in Windows (Microsoft did it in '90s) and it was never designed to handle more than 32K elements at the time they wrote it. While in modern Windows you can add more elements to it, it is pathetically slow and it takes few minutes to fill 5 million elements. Filling that listbox occurs when you load database or when you change symbol names or full names. Therefore this is yet another component of Windows that must be replaced by custom code to be able to handle millions of items quickly. So "rewriting Windows" saga continues.
Once I replace Windows component by custom code it should be able to handle millions of symbols within fraction of second (I hope).

3 Likes

Your cache setting is incorrect. You need to increase your cache settings (Tools->Preferences->Data) to be able hold ALL symbols in RAM (if possible). That way, even with unsorted data when importing to completely cached database you should be able to get time in SECONDS, not minutes.

Use Tools->Performance Monitor to see how many symbols are cached as compared to the number of symbols in the entire database.

image

1 Like

Tomasz, it would be great idea! Or another suggestion: don't fill "TickerBox" when database loads if "Symbols" window is closed in database

There are two symbol lists. "Symbols" and "TickerBox" are different animals.
"Symbols" window is FAST because it already features custom code (our own Listvew that is highly optimized and drives Analysis windows among other things and is capable of handling hundreds of millions of lines in no time).
"TickerBox" is slow because it uses Standard Windows ListBox component

image

Despite similar name ListVIEW and ListBOX are completely different animals.

You can close TickerBox yourself (View->Toolbars->Ticker menu) and observe how much it impacts performance for 5 million symbol database.

6 Likes

With closed TickerBox loading time is 30 sec vs 12 minute with opened TickerBox. Tomasz, thank you so much! Frankly say I never used TickerBox at all, so with closed is absolutely OK for me

1 Like

Run backtest time:
77 sec with closed TickerBox vs 980 sec with opened TickerBox

Amibroker is great software! ))

1 Like

That you Tom for all the great info and guidance!