This article presents a utility that lets you retrieve raw information from web servers using HTTP's GET and POST requests.
This utility is just a wrapper around reusable functions that allow programmatic access to the web through a sort of 'mini-browser' embedded inside your program.
There are many uses for such code. Programs that look at a series of web pages, much like a user surfing from one page to the next, are often called spiders, bots, or crawlers. Such programs are often used to catalog websites, import external data from the Web, or simply to send commands to a web server. You could extend the functionality of the classes presented here to retrieve information from the Internet in a variety of ways.
There are many third-party DLLs and solutions which retrieve data from websites. The functions presented in this article are totally self-contained. There is no reliance on WinInet, Internet Explorer, Netscape, or any requirement that similar software be installed, apart from WinSock. WinSock is an integral part of the Windows TCP/IP stack and is available on any computer capable of running a browser.
Every Internet protocol is documented in an RFC (Request For Comments) document. HTTP/1.0 is documented in RFC1945. Additionally, RFC1630, RFC1738 and RFC1808 document the format of a URL.
A complete set of RFCs can be found at http://www.rfc-editor.org.
The engine of the utility is in the Request class. The key function is SendHTTP(). This function accepts 5 parameters and returns an integer. The first parameter is the URL to POST to or GET from. The second parameter specifies any additional HTTP headers to be passed with the request. The third and fourth parameters specify the data to post and its length. The fifth parameter is a pointer to an HTTPRequest structure that will hold the headers and messages sent to and returned by the web server. The function returns 0 if the request was successful, otherwise 1 to indicate an error.
SendHTTP() begins by parsing the URL string. A URL is an address that specifies the exact location of a resource on the Internet. A URL has several parts, some of which are optional. In general, a URL has the form protocol://host:port/request.
The first part of the URL is the protocol which specifies how to receive the resource. Following the protocol is the host name. This can be either a domain name or an IP address. Following the host is a port number. Every protocol has a default port number to be used if no port is specified. The default HTTP port is port 80. Following the port is the request being made of the specified web server. If not specified, it defaults to just '/', which requests the root document of the web server.
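The parts described above can be separated with a small parser. The following is a minimal sketch rather than the code used inside SendHTTP(): the UrlParts structure and the ParseUrl() function are hypothetical names introduced for illustration, and the sketch assumes a plain host name or IPv4 address (no user names, no IPv6 literals).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical holder for the URL parts described in the text.
struct UrlParts {
    std::string protocol;        // e.g. "http"
    std::string host;            // domain name or IP address
    int port = 80;               // default HTTP port when none is given
    std::string request = "/";   // defaults to the root document
};

// Splits a URL of the form protocol://host[:port][/request].
// Returns false if the "://" separator or the host is missing,
// or if the port is not a number between 1 and 65535.
bool ParseUrl(const std::string& url, UrlParts& out) {
    const std::string sep = "://";
    const std::size_t protoEnd = url.find(sep);
    if (protoEnd == std::string::npos || protoEnd == 0) return false;

    UrlParts parts;
    parts.protocol = url.substr(0, protoEnd);

    // Everything between "://" and the first '/' is host[:port].
    const std::size_t hostStart = protoEnd + sep.size();
    const std::size_t pathStart = url.find('/', hostStart);
    std::string hostPort = (pathStart == std::string::npos)
        ? url.substr(hostStart)
        : url.substr(hostStart, pathStart - hostStart);
    if (pathStart != std::string::npos) parts.request = url.substr(pathStart);

    const std::size_t colon = hostPort.find(':');
    if (colon != std::string::npos) {
        const std::string portText = hostPort.substr(colon + 1);
        if (portText.empty() || portText.size() > 5) return false;
        int port = 0;
        for (char c : portText) {
            if (c < '0' || c > '9') return false;
            port = port * 10 + (c - '0');
        }
        if (port < 1 || port > 65535) return false;
        parts.port = port;
        hostPort.erase(colon);   // keep only the host name
    }
    if (hostPort.empty()) return false;

    parts.host = hostPort;
    out = parts;
    return true;
}
```

Calling ParseUrl() on a URL without a port or a request leaves the defaults in place, so the caller always gets a usable port number and request string.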
SendHTTP() initializes the WinSock library by calling WinSock's WSAStartup(). After establishing a socket connection, SendHTTP() transmits a request to the server. There are 2 forms of HTTP requests. The first, and simpler, form is the HTTP GET. An HTTP GET does not send any additional information to the web server other than the request headers and the URL. An HTTP GET often uses the URL itself to send additional information, typically as a query string of name=value pairs appended after a '?'.
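As an illustration of how a GET carries extra information in the URL, the sketch below appends name=value pairs to a request path. UrlEncode() and BuildQueryUrl() are hypothetical helpers, not part of the Request class, and the sketch assumes the common form-style encoding in which a space becomes '+' and other reserved characters become %XX.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Encodes a value for use in a query string. Letters, digits and
// "-_.~" are kept, a space becomes '+', anything else becomes %XX.
std::string UrlEncode(const std::string& value) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : value) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

using Params = std::vector<std::pair<std::string, std::string>>;

// Appends the parameters to the request path as ?name=value&name=value.
std::string BuildQueryUrl(const std::string& path, const Params& params) {
    std::string url = path;
    char separator = '?';
    for (const auto& kv : params) {
        url += separator;
        url += UrlEncode(kv.first);
        url += '=';
        url += UrlEncode(kv.second);
        separator = '&';
    }
    return url;
}
```

The resulting string can be used as the request part of the URL passed to a GET.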
The second form, an HTTP POST, sends data along with the request, separate from the URL.
Usually, an HTTP POST includes a header that tells the server how the posted data is encoded. Without this header, some web servers (particularly ASP running on IIS) will not recognize your parameters. An HTTP POST has 2 parts. The first is the HTTP headers, just as in the GET. The headers contain the actual request and additional pieces of information. Unlike a GET, a POST contains data after the headers (separated from them by a blank line).
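The two-part layout of a POST can be sketched as a function that assembles the request text. BuildPostRequest() is a hypothetical helper, not the code inside SendHTTP(). It assumes HTTP/1.0 and assumes the form-encoding header Content-Type: application/x-www-form-urlencoded, which is the usual choice when posting form fields; the Host header is also an addition that HTTP/1.0 does not strictly require.

```cpp
#include <cassert>
#include <string>

// Assembles a POST request: request line and headers first,
// then a blank line, then the posted data.
// extraHeaders may be empty; otherwise each header in it must
// already end with "\r\n".
std::string BuildPostRequest(const std::string& host,
                             const std::string& path,
                             const std::string& extraHeaders,
                             const std::string& body) {
    assert(!host.empty() && !path.empty());

    std::string request = "POST " + path + " HTTP/1.0\r\n";
    request += "Host: " + host + "\r\n";
    // Assumed content type for form data (see the lead-in).
    request += "Content-Type: application/x-www-form-urlencoded\r\n";
    // The server needs the length to know where the data ends.
    request += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    request += extraHeaders;
    request += "\r\n";          // blank line separating headers from data
    request += body;
    return request;
}
```

A GET built the same way would simply stop after the blank line, with no Content-Type, Content-Length or body.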
After the web server receives the request, it sends back a response. The response has 2 parts: headers followed by data (with a blank line separating the two).
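Separating the two parts of a response amounts to finding the first blank line. The sketch below shows one way to do it; HttpResponse and SplitResponse() are hypothetical names, and accepting a bare LF LF separator in addition to the standard CRLF CRLF is an assumption made to tolerate lax servers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical holder for the two parts of a raw HTTP response.
struct HttpResponse {
    std::string headers;   // status line plus header lines
    std::string body;      // data following the blank line
};

// Splits a raw response at the first blank line.
// Returns false if no blank line is present (incomplete response).
bool SplitResponse(const std::string& raw, HttpResponse& out) {
    std::size_t pos = raw.find("\r\n\r\n");
    std::size_t separatorLength = 4;

    // Use a bare "\n\n" instead if it occurs earlier or is the only one.
    const std::size_t lfPos = raw.find("\n\n");
    if (lfPos != std::string::npos &&
        (pos == std::string::npos || lfPos < pos)) {
        pos = lfPos;
        separatorLength = 2;
    }
    if (pos == std::string::npos) return false;

    out.headers = raw.substr(0, pos);
    out.body = raw.substr(pos + separatorLength);
    return true;
}
```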
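The first line of the HTTP headers specifies the status of the request. It consists of the HTTP version followed by a numeric status code and a short description. The status codes fall into the following ranges: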
- 100-199 is an informational message and is not generally used.
- 200-299 means a successful request.
- 300-399 indicates that the requested resource has been moved; web servers use this for redirection.
- 400-499 indicates client errors.
- 500-599 indicates server errors.
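The ranges above translate directly into code. The following sketch extracts the code from a status line such as "HTTP/1.0 200 OK" and maps it to its range; ParseStatusCode(), Classify() and the StatusClass enumeration are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class StatusClass {
    Informational,  // 100-199
    Success,        // 200-299
    Redirection,    // 300-399
    ClientError,    // 400-499
    ServerError,    // 500-599
    Unknown         // anything else, including a parse failure
};

// Returns the three-digit code from a status line, or -1 if the
// line does not look like "HTTP/<version> <3 digits> ...".
int ParseStatusCode(const std::string& statusLine) {
    if (statusLine.compare(0, 5, "HTTP/") != 0) return -1;
    const std::size_t space = statusLine.find(' ');
    if (space == std::string::npos || statusLine.size() < space + 4) return -1;

    int code = 0;
    for (std::size_t i = space + 1; i <= space + 3; ++i) {
        const char c = statusLine[i];
        if (c < '0' || c > '9') return -1;
        code = code * 10 + (c - '0');
    }
    return code;
}

// Maps a status code to the range it belongs to.
StatusClass Classify(int code) {
    if (code >= 100 && code <= 199) return StatusClass::Informational;
    if (code >= 200 && code <= 299) return StatusClass::Success;
    if (code >= 300 && code <= 399) return StatusClass::Redirection;
    if (code >= 400 && code <= 499) return StatusClass::ClientError;
    if (code >= 500 && code <= 599) return StatusClass::ServerError;
    return StatusClass::Unknown;
}
```

A caller would typically treat Redirection as a cue to read the Location header and repeat the request, and treat the two error classes as failures.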
After the headers comes the data returned by the request. This is usually what is seen on the browser screen.
Dialog Box Wrapper
The MFC dialog project serves as a wrapper around the Request class. An instance of the Microsoft Web Browser control is inserted into the dialog, which makes it easy to navigate the data and to issue commands such as POST. The control is used in 2 ways:
- When the user makes a request from the browser, the control fires the OnBeforeNavigate2 event, which is captured by the dialog program. The OnBeforeNavigate2Explorer1 handler is then used to discover whether the request is a POST, which headers were sent to the web server, and what data was posted.
- If the user wants to use the SendHTTP engine, they enter the required URL, fill in the 'SendHTTPrequest' and 'PostData' (for a POST) fields, check the POST radio button and click the 'Go' button. SendHTTP() stores the HTML-formatted data it receives in the m_HTTPbody string variable, and the IE control then loads it. The HTML loading is done as follows:
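// Get the document currently loaded in the browser control.
lpDispatch = m_Browser.GetDocument();
// Ask the document for its IHTMLDocument2 interface.
hr = lpDispatch->QueryInterface(IID_IHTMLDocument2, (void**)&pHTMLDocument2);
// Fetch the body element of the document.
hr = pHTMLDocument2->get_body(&pBody);
// Copy the HTML returned by SendHTTP() into a BSTR.
bstr = m_HTTPbody.AllocSysString();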
Enter the URL and click the 'Go' button. On the right, a mini-browser displays your page. As you follow links and push buttons on this page, the 'SendHTTPrequest' and 'ReceiveHTTPrequest' boxes receive the corresponding data. The GET/POST radio buttons are updated automatically: the IE instance knows whether you made a GET (you clicked a link) or a POST (you pushed a button).
You can also type your own headers into the 'SendHTTPrequest' edit box and your POST data into the 'PostData' edit box, and then push the 'Go' button. The browser will navigate to your address using the headers and data submitted from 'SendHTTPrequest' and 'PostData'.
Use the TestGet.asp and TestPost.asp files from the Web directory for testing.