In the first article of the series, we talked about the basic concepts of HTTP. Now that we have a foundation to build upon, we can talk about some of the architectural aspects of HTTP. There is more to HTTP than just sending and receiving data.
As an application protocol, HTTP cannot function by itself. It relies on infrastructure, both hardware and software, that provides supporting services and makes communication over the World Wide Web possible and efficient.
How Big Did You Say?

I am often contacted by prospective clients who want help crawling the web on a very large scale, and I regularly come across similar questions on Stack Overflow. What people want to achieve with web data varies greatly from one case to the next: some need to extract specific data from as many pages as possible, some want to build search engines, and others wish to test the accuracy of a machine learning model on real data. Luckily, there are resources available for large-scale web crawling, both on the platform side (e.g. Amazon Web Services) and on the software side (e.g. StormCrawler, Apache Nutch). However, large-scale crawling (think billions of pages and hundreds of servers) is costly, complex and time-consuming.