Web scraping / data extraction problem

Hello,

As I have shown in this thread: AmiBroker - much more than just ordinary technical analysis software, for some time I have been using the new AFL Internet functions for web scraping / data extraction. Usually it all works properly and I am happy with this solution :+1: but for some time now I have been experiencing problems which (in the case of one web page) result in errors, exceptions or even crashes. For example:

[Screenshot: AmiBroker error message displayed in Polish]

It's the first time I've seen any Error in AmiBroker in Polish :wink: In English it means:

Error 47. Exception. The connection with the server has been reset.

It looks like AmiBroker had partially downloaded the required data when the connection was lost. My Internet connection is really stable - everything else is working properly - so that's not the problem. Because this message is in Polish, I suspect it comes from a source other than AmiBroker itself.

I don't know how to handle such errors. I always check whether InternetOpenURL() returns 1. For example:

ih = InternetOpenURL( "https://www.pb.pl/puls-dnia/" );

if( ih )
{
    PageStr = "";

    while( ( Str = InternetReadString( ih ) ) != "" )  PageStr = PageStr + Str;

    InternetClose( ih );

    // The rest of the code ...
}
else Say( "Puls Biznesu - Amibroker - No connection", 0 );

... if it returns 0, I know that something is wrong - e.g. there's no Internet connection - but what should I do if it returns 1 and the problem still sometimes occurs?

At first I thought that it might have something to do with the changes implemented in ver. 6.25: http://www.amibroker.com/devlog/2017/07/24/amibroker-6-25-0-beta-released/

CHANGES FOR VERSION 6.25.0 (as compared to 6.22.0)
..
25. Web Research: many HTML5 pages did not display properly because of the fact that web browser used old IE7 mode. Now browser uses IE11 mode at minimum for proper HTML5 rendering

... but (apart from Vista) I've been testing it on WIN7, which uses IE11 - so it should be compatible. Besides, I tried downgrading AB to 6.22, but the problem still (sometimes) exists.

This problem occurs only from time to time (every n tries) when I extract the data from this webpage: https://www.pb.pl/puls-dnia/

Maybe it is a kind of server limitation (implemented recently) to protect it from being overloaded?

Is there any way in which I can avoid such errors/exceptions or even crashes? If I were able to recognize such a situation, I could modify my code to handle it - for example, wait and try again after x seconds/minutes. Another example (this one appears very rarely):

[Screenshot: AmiBroker error message]

Error 47. Exception - Operation's time limit has been exceeded.

I would appreciate any suggestions :slight_smile:


p.s.

Because today I got notified that "Clickable links in Exploration output" and "The ability to control the speech rate of selected voices" - the features which I submitted to the Feedback Centre (and which are very useful in such applications) will be implemented in AB 6.26, I would like to say: @Tomasz Thank you very very much ! :smiley: :champagne: :smiley: :champagne:


These Internet exceptions are reported by the web page and/or the Windows Internet API functions. You cannot control or avoid them, as these are external connectivity and/or web site errors (occurring not in AmiBroker but outside it).
Typically these errors occur if the web site you are accessing BLOCKS you temporarily because you are accessing it too frequently. Web servers have DoS (denial of service) protection mechanisms that count the number of requests per second from a given IP address, and if you reach the threshold they cut you off, treating you as a DoS attacker. The only solution is to request data less frequently.
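
To make the mechanism described above concrete, here is a minimal Python sketch of a per-IP sliding-window request counter. It is only an illustration: the class name, the thresholds and the decision to refuse a request are assumptions, and nothing is known about how the site in question actually implements its protection.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client IP.

    Hypothetical illustration of server-side DoS protection, not the
    logic of any particular web server.
    """

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # a real server might reset the connection here
        hits.append(now)
        return True
```

With such a scheme a client is refused only while its recent request count is at the threshold, which would be consistent with a scraper being cut off occasionally and working again shortly afterwards.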

Tomasz, thank you for the information. The server's protection mechanism is probably responsible for this behaviour. I was hoping that there would be a way to handle / prepare for such situations to avoid exceptions (or crashes in some cases). In the best case scenario only one Analysis window (Exploration) is stopped and needs to be manually started again; in the worst case, the stability of the whole AmiBroker is compromised - because of problems with only one data source (web page):

[Screenshot: AmiBroker crash recovery system window]

The problem is that I cannot identify any specific pattern or logic in when this problem occurs. Sometimes it happens at the beginning - when the second query of the day is performed - and sometimes everything is OK for a long period of time.

I tried doing the queries every 3 minutes (instead of every 1 or 2 minutes), but even then the exception sometimes pops up. I will try doing them even less often.

If it's not possible to directly counteract such situations, I will find another solution. At present only one web page randomly causes these problems and before it used to work properly for many weeks.

Thank you once again.

Regards

Sending screenshots in this case is pointless, as a screenshot contains only a small, truncated part of the entire report. Instead you should use the “COPY TO CLIP” or, better, just the “Send report” button.

Anyway, most likely you have just run out of virtual memory space (2GB limit per 32-bit process). You can try using 64-bit instead.

On Win7 64-bit I don’t get this “AmiBroker crash recovery system” screen, only the Error 47 messages whose screenshots I posted above.

Anyway I will try experimenting with different solutions. I’ve started experiencing these problems recently.

Thank you

@Milosz,
As scraping takes time, why not just use another scripting language that is better equipped to handle HTTP requests? If the data source changes its format, the scraped data may corrupt the AmiBroker database. A two-step process helps here: scrape to a CSV file, then import that file into AmiBroker. If the scraped data is bad, you can stop it before it gets into AmiBroker.
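
A minimal Python sketch of the validation half of such a two-step process follows. The two-column layout (ISO date, numeric value) and the function names are assumptions made for illustration; the real columns depend on what is being scraped. Importing the resulting file into AmiBroker is left out.

```python
import csv
import io
import math
from datetime import date


def is_valid(row):
    """Return True if a (date, value) row looks sane enough to import."""
    if len(row) != 2:
        return False
    try:
        date.fromisoformat(row[0])  # expects YYYY-MM-DD
        value = float(row[1])
    except ValueError:
        return False
    return math.isfinite(value)  # rejects NaN and infinity


def rows_to_csv(rows):
    """Return CSV text for the rows, or None if any row fails validation."""
    rows = list(rows)
    if not rows or not all(is_valid(r) for r in rows):
        return None
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

If the site changes its layout and the scraper starts picking up HTML fragments instead of numbers, `rows_to_csv` returns `None` and nothing is written, so the bad data never reaches the import step.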


Peter, thanks for the reply!

Your suggestion is logical. I suppose that dividing this process into two steps might ensure the stability of AmiBroker, but on the other hand it would make this solution more complex. I’m not a programmer and my knowledge of other programming languages is very limited - I feel quite comfortable in AFL though :slight_smile:

If the whole process of extracting data from a web page took place outside AmiBroker, I wouldn’t need any of the recently introduced AB Internet functions. But that would be a pity, because I’ve been testing these Internet functions for a few months and I rarely have any problems with them. Moreover, I really enjoy using them, because extracting data directly from the HTML code (usually I don’t even need to download the whole HTML code) is blazingly fast compared to any browser-based solution. On a daily basis I use some specialized software to extract data from different sources - especially when these sources require logging in or authentication. The extraction is carried out by this software (which is a modified browser), but I control it (via API) using VBS/JavaScript scripts. All logic, speech synthesis etc. is done by my scripts. But this solution is much slower and more CPU-demanding. I really appreciate that now in most cases I can do everything (data extraction, logic, speech synthesis etc.) directly in AmiBroker - and as a bonus it gives me the possibility to easily merge technical analysis with fundamental data coming from different sources and (sort of) news or event trading. For this reason I will do my best to overcome this problem somehow. If I don’t succeed I may try your solution :slight_smile:

Actually it is only this one web page that causes such problems. All I would need is to be able to recognise when such a problematic situation (whatever the reason is) takes place, before the exception pops up and the exploration is stopped. If it’s not possible, I’m sure I will find another way around. In this case I might also try a chart-based solution instead of the Analysis window. Besides, I don’t know if it’s possible to recognize and avoid such problems if the data extraction is carried out in JavaScript, C# or C++. I suppose that if it were possible, Tomasz could implement some countermeasures in AmiBroker (for example InternetReadString() returning some specific values indicating problems), but (if I understood correctly) Tomasz wrote that it cannot be done…

Regards
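
On the question of whether an external script can recognise such a failure and try again later: in most general-purpose languages a reset or timed-out connection surfaces as a catchable error. The Python sketch below illustrates the "wait and try again after x seconds" idea. The function names, the number of attempts and the waiting times are assumptions, and retrying does not help if the server keeps refusing the requests.

```python
import time
import urllib.request


def fetch_page(url, timeout=30):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, attempts=3, wait_seconds=60,
                     fetch=fetch_page, sleep=time.sleep):
    """Return the page text, or None if every attempt failed.

    `fetch` and `sleep` are parameters only so that the logic can be
    exercised without a network connection or real waiting.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # URLError, ConnectionResetError and socket timeouts are all
            # OSError subclasses. Wait longer after each failure.
            if attempt < attempts - 1:
                sleep(wait_seconds * (attempt + 1))
    return None
```

A caller would check the result for `None` and simply skip that update, instead of having the whole run stopped by an exception.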

hello @Milosz

I bet you know this already, but just in case:
this is how you log in, if the site requires a login

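username = "mySecret…";
password = "mypass…";

// build the URL from the variables; a literal "username:password" would be sent as-is
Url1 = "http://" + username + ":" + password + "@www.pb.pl/puls-dnia/";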

ih = InternetOpenURL( Url1 );

@Milosz,
It’s a good plan. Wish you all the best!


I wrote already: they CUT YOU OFF because you download too fast and too frequently. The “Connection reset” message that you are getting CLEARLY says that the REMOTE HOST has purposely closed the connection.
It has nothing to do with AmiBroker. It is a GENERIC network error (that is why it is displayed in Polish on your machine: the error comes from the network stack in the OS, which is localized).

Such an error would occur REGARDLESS of whether you access the site with browser X, Y, an external tool or AmiBroker, because the error is generated OUTSIDE your computer. It is the REMOTE host that closed (reset) your connection.

There are dozens of Internet errors that are documented in the Windows API. Microsoft has the wininet.h header file that lists all possible Internet errors. These all come from OUTSIDE the program - from the OS network stack.
http://cpansearch.perl.org/src/JDB/Win32-Internet-0.087/WININET.H

Excerpt (NOT complete)

#define ERROR_INTERNET_OUT_OF_HANDLES           (INTERNET_ERROR_BASE + 1)
#define ERROR_INTERNET_TIMEOUT                  (INTERNET_ERROR_BASE + 2)
#define ERROR_INTERNET_EXTENDED_ERROR           (INTERNET_ERROR_BASE + 3)
#define ERROR_INTERNET_INTERNAL_ERROR           (INTERNET_ERROR_BASE + 4)
#define ERROR_INTERNET_INVALID_URL              (INTERNET_ERROR_BASE + 5)
#define ERROR_INTERNET_UNRECOGNIZED_SCHEME      (INTERNET_ERROR_BASE + 6)
#define ERROR_INTERNET_NAME_NOT_RESOLVED        (INTERNET_ERROR_BASE + 7)
#define ERROR_INTERNET_PROTOCOL_NOT_FOUND       (INTERNET_ERROR_BASE + 8)
#define ERROR_INTERNET_INVALID_OPTION           (INTERNET_ERROR_BASE + 9)
#define ERROR_INTERNET_BAD_OPTION_LENGTH        (INTERNET_ERROR_BASE + 10)
#define ERROR_INTERNET_OPTION_NOT_SETTABLE      (INTERNET_ERROR_BASE + 11)
#define ERROR_INTERNET_SHUTDOWN                 (INTERNET_ERROR_BASE + 12)
#define ERROR_INTERNET_INCORRECT_USER_NAME      (INTERNET_ERROR_BASE + 13)
#define ERROR_INTERNET_INCORRECT_PASSWORD       (INTERNET_ERROR_BASE + 14)
#define ERROR_INTERNET_LOGIN_FAILURE            (INTERNET_ERROR_BASE + 15)
#define ERROR_INTERNET_INVALID_OPERATION        (INTERNET_ERROR_BASE + 16)
#define ERROR_INTERNET_OPERATION_CANCELLED      (INTERNET_ERROR_BASE + 17)
#define ERROR_INTERNET_INCORRECT_HANDLE_TYPE    (INTERNET_ERROR_BASE + 18)
#define ERROR_INTERNET_INCORRECT_HANDLE_STATE   (INTERNET_ERROR_BASE + 19)
#define ERROR_INTERNET_NOT_PROXY_REQUEST        (INTERNET_ERROR_BASE + 20)
#define ERROR_INTERNET_REGISTRY_VALUE_NOT_FOUND (INTERNET_ERROR_BASE + 21)
#define ERROR_INTERNET_BAD_REGISTRY_PARAMETER   (INTERNET_ERROR_BASE + 22)
#define ERROR_INTERNET_NO_DIRECT_ACCESS         (INTERNET_ERROR_BASE + 23)
#define ERROR_INTERNET_NO_CONTEXT               (INTERNET_ERROR_BASE + 24)
#define ERROR_INTERNET_NO_CALLBACK              (INTERNET_ERROR_BASE + 25)
#define ERROR_INTERNET_REQUEST_PENDING          (INTERNET_ERROR_BASE + 26)
#define ERROR_INTERNET_INCORRECT_FORMAT         (INTERNET_ERROR_BASE + 27)
#define ERROR_INTERNET_ITEM_NOT_FOUND           (INTERNET_ERROR_BASE + 28)
#define ERROR_INTERNET_CANNOT_CONNECT           (INTERNET_ERROR_BASE + 29)
#define ERROR_INTERNET_CONNECTION_ABORTED       (INTERNET_ERROR_BASE + 30)
#define ERROR_INTERNET_CONNECTION_RESET         (INTERNET_ERROR_BASE + 31)
#define ERROR_INTERNET_FORCE_RETRY              (INTERNET_ERROR_BASE + 32)
#define ERROR_INTERNET_INVALID_PROXY_REQUEST    (INTERNET_ERROR_BASE + 33)
#define ERROR_INTERNET_NEED_UI                  (INTERNET_ERROR_BASE + 34)

#define ERROR_INTERNET_HANDLE_EXISTS            (INTERNET_ERROR_BASE + 36)
#define ERROR_INTERNET_SEC_CERT_DATE_INVALID    (INTERNET_ERROR_BASE + 37)
#define ERROR_INTERNET_SEC_CERT_CN_INVALID      (INTERNET_ERROR_BASE + 38)
#define ERROR_INTERNET_HTTP_TO_HTTPS_ON_REDIR   (INTERNET_ERROR_BASE + 39)
#define ERROR_INTERNET_HTTPS_TO_HTTP_ON_REDIR   (INTERNET_ERROR_BASE + 40)
#define ERROR_INTERNET_MIXED_SECURITY           (INTERNET_ERROR_BASE + 41)
#define ERROR_INTERNET_CHG_POST_IS_NON_SECURE   (INTERNET_ERROR_BASE + 42)
#define ERROR_INTERNET_POST_IS_NON_SECURE       (INTERNET_ERROR_BASE + 43)
#define ERROR_INTERNET_CLIENT_AUTH_CERT_NEEDED  (INTERNET_ERROR_BASE + 44)
#define ERROR_INTERNET_INVALID_CA               (INTERNET_ERROR_BASE + 45)
#define ERROR_INTERNET_CLIENT_AUTH_NOT_SETUP    (INTERNET_ERROR_BASE + 46)
#define ERROR_INTERNET_ASYNC_THREAD_FAILED      (INTERNET_ERROR_BASE + 47)
#define ERROR_INTERNET_REDIRECT_SCHEME_CHANGE   (INTERNET_ERROR_BASE + 48)
#define ERROR_INTERNET_DIALOG_PENDING           (INTERNET_ERROR_BASE + 49)
#define ERROR_INTERNET_RETRY_DIALOG             (INTERNET_ERROR_BASE + 50)
#define ERROR_INTERNET_HTTPS_HTTP_SUBMIT_REDIR  (INTERNET_ERROR_BASE + 52)
#define ERROR_INTERNET_INSERT_CDROM             (INTERNET_ERROR_BASE + 53)
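
For reference, the symbolic names in the excerpt above translate into plain numbers once the base is known. The Python sketch below shows the arithmetic for a few entries. The base value 12000 is taken from wininet.h and is not shown in the excerpt; the choice of entries and the helper name `describe` are just for illustration. The two messages quoted earlier in this thread (connection reset, operation timed out) plausibly correspond to the CONNECTION_RESET and TIMEOUT entries, but that mapping is an assumption.

```python
INTERNET_ERROR_BASE = 12000  # defined in wininet.h, not shown in the excerpt

# Offsets copied from the excerpt above (a small selection).
OFFSETS = {
    "ERROR_INTERNET_TIMEOUT": 2,
    "ERROR_INTERNET_NAME_NOT_RESOLVED": 7,
    "ERROR_INTERNET_CANNOT_CONNECT": 29,
    "ERROR_INTERNET_CONNECTION_ABORTED": 30,
    "ERROR_INTERNET_CONNECTION_RESET": 31,
}

# Numeric code -> symbolic name, e.g. 12031 -> ERROR_INTERNET_CONNECTION_RESET.
CODE_TO_NAME = {INTERNET_ERROR_BASE + off: name for name, off in OFFSETS.items()}


def describe(code):
    """Return the symbolic name for a WinINet error code, if it is in the table."""
    return CODE_TO_NAME.get(code, "unknown WinINet error")
```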

Hello Panagioti,

Thanks a lot - no, I didn’t know about it :slight_smile: and I will surely try that out. It might work in some cases, but I am afraid not in all, because some paid sources which I use (for example Reuters) require entering only randomly chosen characters from the password. Another one additionally requires entering a password sent by SMS etc. Usually the better the source, the more difficult it is to log in :wink: I’m sure it all can be managed, but for now I am not that good and in these cases I have to make use of the software which enables that. But I will do my best to improve my skills. The limiting factor here is lack of time and other responsibilities :wink:

Once again thanks for the useful information :+1:

Tomasz, I really appreciate your assistance :slight_smile:

As I wrote, in general I am really happy with AmiBroker Internet functions - I make use of them every day. Before you introduced them it hadn’t even crossed my mind, that such functionality might be implemented. I think that many people don’t realise what additional value it brings to AmiBroker. Of course I am not talking about those coders who can create any custom made solution in C++ or any other language…

Thank you for this confirmation. In that case only one thing is strange to me. I would expect that if I am purposely cut off, I won’t be able to access this web page for some period of time. But usually when I get this message/exception I am able to manually run the exploration (shortly afterwards) many times without problems. I can also open this web page in a browser. So it’s not a 1, 5, 10, 30, 60 etc. minute ban. At least I have not identified the pattern yet :wink: Maybe it cuts off any requests that arrive at regular intervals?

In this case I will try using chart based solution instead of analysis (which if stopped has to be run manually or via OLE). If I only get error 47 it should work properly in the next run (i.e. after one or two minutes). Additionally I will be able to make the requests in some random intervals (for example ranging from 1 to 3 minutes). Maybe that will help. It will look as if an average Joe (or even better Jan Kowalski) was browsing this web page :wink:

If it’s not possible to counteract Errors 47 directly, I will find another solution. The only thing that I am concerned about is those rare crashes. With regard to them I will try to implement the suggestions from one of your posts. I can also try some other measures, e.g. checking whether every variable which stores the extracted data is not Null.

I will study in detail the information/links that you have provided in your post above.

Thank you very much!
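
The idea of making requests at random intervals of 1 to 3 minutes, mentioned in the post above, can be sketched in a few lines of Python. The bounds come from the post; the function name is made up, and whether irregular timing actually avoids the cut-off is untested.

```python
import random


def next_delay(rng=random, min_seconds=60.0, max_seconds=180.0):
    """Pick the pause before the next request, uniformly between the bounds."""
    return rng.uniform(min_seconds, max_seconds)
```

A polling loop would call `time.sleep(next_delay())` between requests, so that consecutive queries are never spaced at exactly the same interval.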

Really, Google is your friend. Things are not as simple as you think. Google for “connection reset” and you will find zillions of reasons. You are assuming things, and assumptions are not facts. It does not need to be a “programmed ban” as you think. There are zillions of things that may go wrong on a remote server, like overload, database timeout, etc. This https://technet.microsoft.com/en-us/library/cc957018.aspx is just one of many possible reasons; it can also be a firewall rule, a router, and many more. But none of it matters, because you cannot do anything about a remote system resetting the connection. It is an unrecoverable network error. The remote system is unusable in the way you are trying to use it, so find a different one.

But don’t worry, 6.26 will offer a solution to your problem. No more details at this moment.


Great news - I really appreciate that! :champagne: :smiley: :champagne:

I can’t wait to try 6.26 (for many different reasons) --> but I will wait patiently :slight_smile:

Regarding the "too much too fast" cut-off by the host, would the iMacros "WAIT SECONDS=" command be helpful? It allows you to slow down the scraping commands.