Import wizard for huge files (5 million symbols, 2.5GB+ into 230GB database)

Hi all,
I loaded a txt file containing about 35 million quotes (1-minute time interval) with the Import Wizard. It took 7 hours and 30 minutes.

Example file:

Name,Date,Time,Open,High,Low,Close,Volume
ZM    210910P00355000-OPT,2021-09-07,15:56:00,57.5,57.5,57.5,57.5,0.0
ZM    210910P00355000-OPT,2021-09-07,15:57:00,57.5,57.5,57.5,57.5,0.0
ZM    210910P00355000-OPT,2021-09-07,15:58:00,57.5,57.5,57.5,57.5,0.0
ZM    210910P00355000-OPT,2021-09-07,15:59:00,57.5,57.5,57.5,57.5,0.0
ZNGA  140103C00004000-OPT,2013-12-26,10:51:00,0.15,0.15,0.15,0.15,11.0
ZNGA  140103C00004000-OPT,2013-12-26,14:23:00,0.13,0.13,0.11,0.11,23.0
ZNGA  140103C00004000-OPT,2013-12-26,14:46:00,0.12,0.12,0.1,0.1,30.0
ZNGA  140103C00004000-OPT,2013-12-27,09:36:00,0.1,0.1,0.1,0.1,30.0
ZNGA  140103C00004000-OPT,2013-12-27,09:38:00,0.09,0.09,0.09,0.09,30.0

Specifications of my computer:

  • Windows 10 Pro 64-bit
  • Dual CPU Xeon E5-2690 v3
  • 128 GB RAM

I wonder if there are any tips for improving the loading time.

How to import huge files quickly

If the entire file is sorted properly, and since it is a huge one, have you tried creating several file chunks in addition?

@fxshrat , thank you for your reply.

The file is sorted properly, according to the article you pointed to:

  1. in ascending symbol order (so “AAPL” before “INTC”), and within symbol
  2. in ascending chronological order (so oldest records first, newest records last)
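
For reference, a minimal Python sketch that produces exactly this ordering, assuming the whole file fits in RAM (the file names are hypothetical, adjust them to your own paths):

import csv

SRC = "quotes_unsorted.txt"   # hypothetical input file
DST = "quotes_sorted.txt"     # hypothetical output file

with open(SRC, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)     # Name,Date,Time,Open,High,Low,Close,Volume
    rows = list(reader)       # a 2.5 GB file fits comfortably in 128 GB RAM

# Sort by symbol first, then by date and time (oldest quotes first),
# which is the ordering described in the two points above.
rows.sort(key=lambda r: (r[0], r[1], r[2]))

with open(DST, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)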

Files with 10 million quotes are imported in 2 hours, so it seems that file chunks are not very useful.

@axweye , how long is your options history and where did you get it?

My experience is that these files are much slower than files associated with just one symbol. I'm not sure, though, because you would be importing from perhaps an approximate 2.5 million active US options contracts, plus inactive older options contract histories.

Make sure you are utilizing your RAM, rather than just a small portion with an overactive pagefile.sys. Also, do a quick explore with filter=true to pre-load existing symbol history into memory.

What is the size (in megabytes) of the files that you import? And how much data do you already have in the database (in megabytes)?
Some operations are slower than others. AmiBroker is optimized for adding quotes in chronological order, so imports of "last" quotes are fast. If you import data into already existing quotes and imported quotes are older than existing ones, then AmiBroker has to check for duplicates and insert in correct order and that is usually slower than appending newest.

Check your cache settings (Tools->Preferences, "Data").

@Sean , I bought a subscription to IQFeed - they provide intraday history for options since 2014. To download quotes, I use a Python script from github, customized for my tasks.
I have downloaded weekly options for 600 underlying assets and ES-mini and NQ-mini futures. In total, this is about 5 million symbols - so I decided to combine symbol quotes with the same first char into one file.
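
As an aside for anyone doing something similar, here is a rough Python sketch of that grouping step. The directory names and the one-CSV-per-symbol layout are assumptions for illustration, not the actual script from GitHub:

import csv
import glob
import os

SRC_DIR = "downloads"   # assumed: one CSV per option symbol
DST_DIR = "combined"    # output: one file per leading character
os.makedirs(DST_DIR, exist_ok=True)

writers = {}   # leading character -> (file handle, csv writer)

# Iterating file names in sorted order keeps ascending symbol order,
# assuming files are named after their symbols and rows inside each
# file are already in chronological order.
for path in sorted(glob.glob(os.path.join(SRC_DIR, "*.csv"))):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            if not row:
                continue
            first = row[0][0].upper()
            if first not in writers:
                out = open(os.path.join(DST_DIR, first + ".csv"), "w", newline="")
                w = csv.writer(out)
                w.writerow(header)
                writers[first] = (out, w)
            writers[first][1].writerow(row)

for out, _ in writers.values():
    out.close()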

Thanks for the advice on utilizing RAM - I'll check it out.
What do you mean by "quick explore with filter=true to pre-load existing symbol history into memory" ?

@Tomasz , thank you for your reply.

The import file size is 2.5 GB, the database size is 230 GB.
Before starting the import, I sorted the quotes in chronological order, oldest quotes first, e.g.:

YELP  140103C00061000-OPT,2013-11-27,11:04:00,4.88,4.88,4.88,4.88,3.0
YELP  140103C00061000-OPT,2013-11-27,11:25:00,5.2,5.2,5.2,5.2,3.0
YELP  140103C00061000-OPT,2013-12-02,10:20:00,3.4,3.4,3.4,3.4,2.0
YELP  140103C00061000-OPT,2013-12-02,10:21:00,3.4,3.4,3.4,3.4,8.0
YELP  140103C00061000-OPT,2013-12-03,09:51:00,3.88,3.88,3.88,3.88,2.0
YELP  140103C00061000-OPT,2013-12-04,09:50:00,3.8,3.8,3.8,3.8,2.0
YELP  140103C00061000-OPT,2013-12-10,09:57:00,4.3,4.3,4.3,4.3,11.0

"If you import data into already existing quotes" No, imported symbols are absolutely new in my database

Well, 230GB worth of data certainly does not fit in RAM. And this means lots of reads and writes from/to disk. The limiting factor is disk access. From the sample you provided it seems that there are few lines per symbol of new data. So you might have even a million symbols in your database. Reading and writing a million files is not going to be in the "seconds" or even "minutes" range. The "hours" range is to be expected with a huge number of files and 230GB worth of data.
The time you are getting with such huge amounts of data is reasonable given hardware limits. See Performance tuning tips for a reality check.

Update: I just saw that:

With 5 million files (each symbol is in separate file) and Windows things must be slow. Windows OS file system tends to slow down starting with 2000 files per folder. AmiBroker uses A-Z folders (see Understanding AmiBroker database concepts) but even with all those subfolders, 5 million files and 230GB worth of data is going to take time.
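
If you want to see how those files are spread across the database subfolders, a quick Python sketch along these lines can count them (the database path is a hypothetical example, point it at your own database folder):

import os
from collections import Counter

DB_PATH = r"C:\AmiBrokerDB"   # hypothetical path to the AmiBroker database

counts = Counter()
for root, dirs, files in os.walk(DB_PATH):
    # group counts by first-level subfolder (the single-letter symbol folders);
    # files sitting directly in the database root are counted under "."
    top = os.path.relpath(root, DB_PATH).split(os.sep)[0]
    counts[top] += len(files)

for folder, n in counts.most_common():
    print(f"{folder:>12}: {n} files")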

@Tomasz what is your opinion about using ODBC/SQL in this case?

ODBC would be 10-100 times slower.

If you prepare a sample file with the data that you are usually importing (but less than 0.5GB), upload it to Google Drive, then I can try it on my end and see if anything can be improved.
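
A quick way to cut such a sample on a symbol boundary, so no symbol is split at the size limit, might look like this in Python (the file names are hypothetical):

SRC = "quotes_sorted.txt"   # hypothetical full import file
DST = "sample.txt"          # hypothetical sample output
LIMIT = 500 * 1024 * 1024   # roughly 0.5 GB

written = 0
last_symbol = None
with open(SRC) as src, open(DST, "w") as dst:
    dst.write(src.readline())            # copy the header line
    for line in src:
        symbol = line.split(",", 1)[0]
        # once over the limit, stop at the next symbol change
        if written >= LIMIT and symbol != last_symbol:
            break
        dst.write(line)
        written += len(line)
        last_symbol = symbol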

Tom,

I'm sure axweye will give you a file. May I also give you a similar file (IEX aggregated File)? If so, do you have a Google Drive or should we upload it to our own G-Drive and then share a link via email?

Use your own and share the link.

Thank you @axweye for sending the test file.

Some initial observations:

  1. If the cache size is set smaller than the number of symbols, then about half of the required time is spent reading/writing the RAM cache from/to disk (this is in line with my previous comments)
  2. If new symbols are added during the import process and a large number of symbols are already in the database, then checking the ALIAS field may significantly increase import time. If you don't add new symbols during the process and there are NO aliases used at all for existing symbols, then the Alias check is not performed and there is no speed penalty.

@Tomasz thank you for the assistance. Could you clarify the "ALIAS field"? Do you mean that in the case "if new symbols are added during the import process and a large number of symbols are already in the database", the ALIAS field should be empty, or should it duplicate the Name field?

The Alias field is an alternative name for a symbol. It should be left EMPTY if possible, because using it makes symbol lookups significantly slower, especially if your database has millions of symbols.

Thank you @Tomasz for the nice software. I also use a native AmiBroker database (updated daily via ASCII import). I take this opportunity again to express my happiness with the really working ASCII Import feature - it really helps me maintain my database with a high degree of confidence and convenience - and the forthcoming Stock Split feature could clear up the one little point of inconvenience I had faced.

And although the imports are already fast, we always want them to be faster and faster. One thing that came to my mind (it might be a naive thought): could we have a "Database Setting" to enable/disable the "Alias" field, and would that make imports faster in general?

With Regards

Sanjiv Bansal

Generally AmiBroker tries to figure out on its own whether it needs to check the Alias at all.
When no new symbols are added, it just needs one search to find out that you don't use aliases, and it then internally sets a "noaliasing" flag. Only when importing new symbols, importing the Alias field, or manipulating data via Symbol->Information and/or OLE does it need to verify that again.

Thank you very much for your guidance.

In my Database I do not use ALIAS - in the daily import there can be days when a few New Symbols are added, and there are days when No New Symbols are added.

Would it help to have "noaliasing" flag information at the top of the format definition file, or should I continue to be happy that the application is already taking care of things optimally?

With Regards

Sanjiv Bansal

Please tell me what import TIME you are getting now, @Sumangalam, and how many symbols?

@axweye 's database has 5 million symbols. It is rather unusual and as such creates specific challenges. A typical database has fewer than 20-30K symbols.