First, I have to say that this is an old-fashioned “technical” piece, a bit of “talk geek to me,” not a newscast, so if you’re used to my news podcasts you might find it odd to see me revert to my old genre. Consider this an addition to my regular work.
I’ve recently started a pilot project with my podcast to offer a more efficient way of getting large files. It would be odd for me not to explain what this technique is. I think if you bear with me, you will at least learn a new way of doing things that might be better, even if you end up thinking that it is not particularly appropriate to the podcast community.
What I am talking about is “segmented downloading.” “Segmented downloading” is a way of getting your file by getting pieces of your file from different webservers, which mirror each other with identical content. If “bittorrent” comes to mind, then you’re following me. It is essentially using full-fledged webservers as if they were bittorrent seeds. But in order to understand why you would want to do this, you need to understand some things about old-school downloads and some things about bittorrent, before you can understand the “why,” then the “how,” of segmented downloading.
The traditional way of getting a download completed on the Internet might not always be the best way, particularly for bigger files. We’re not talking about the picture file embedded in a blog post, nor the blog post’s text itself. Those are better served with a traditional download. We’re talking about files at least dozens of megabytes in size, and more usually anywhere from 100 megabytes up to CD and DVD ISO sizes. Think audio over half an hour long, movies, software CDs and DVDs. That is what we are talking about.
Let’s suppose something like a music podcast, with a 50 megabyte file for the sake of an example.
Now, a traditional download means putting the podcast on a well-connected webserver. People who want the file will find it either in a webpage or an RSS feed, right-click a link and choose “download file” in their web browser. The browser’s download manager will then connect to the webserver and begin copying the file onto their system, starting at the beginning and getting piece after piece of the file until it reaches the end.
You might ask yourself what is wrong with this. The answer is that if the file is new and desirable and being downloaded by many people at once, the one webserver might not be able to keep up with the load. All of a sudden your 3 mb/s downstream DSL connection to the Internet is being used at only a third of its capacity. Your one minute download might become a three minute download. Now, in this case you might not care about the odd two minutes you lose. But what if you like your files in the flac format? Now maybe your four minute flac music download becomes a sixteen minute download. Your favorite CD ISO of a Linux distribution? Maybe your 20-minute download becomes an hour-and-fifteen-minute download.
It is interesting to note that the bittorrent guys have this covered. For extremely popular files, there is nothing like bittorrent. This is because the file is divided into chunks and everybody who is a downloader is also an uploader. If people share as much as they download, there is no problem.
So, what are the basics of bittorrent? First, the file is broken into chunks; let’s say they are one megabyte chunks. Our example file therefore consists of 50 chunks. If you have hundreds of people sharing the file, you can grab a chunk here and a chunk there, and your file will load quickly and efficiently. The group of computers sharing the file is called “the swarm.” Each computer that is just donating upload bandwidth is called a “seed.” As long as people don’t close their clients as soon as their download is complete, they keep “seeding” the file, and everything goes smoothly.
What can go wrong? Well, a hit-and-run downloader may not share as much as he takes, and a file may not be popular enough to attract a big, sustained following. Swarms work great with hundreds of people, not with dozens of people.
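To make the chunk idea concrete, here is a minimal Python sketch that cuts the 50 megabyte example file into one megabyte pieces, each described by the byte offsets where it starts and ends. The function name and the fixed one megabyte chunk size are assumptions for illustration; real bittorrent clients choose their own piece sizes.

```python
MB = 1024 * 1024  # one megabyte, counted in bytes


def chunk_ranges(file_size: int, chunk_size: int = MB) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte offsets for each chunk of a file."""
    ranges = []
    for start in range(0, file_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, file_size) - 1
        ranges.append((start, end))
    return ranges


chunks = chunk_ranges(50 * MB)
print(len(chunks))  # 50
print(chunks[0])    # (0, 1048575)
```

Each pair can be fetched independently of the others, which is what lets the pieces come from different places.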
Enter the concept of using webservers as seeds. A webserver is connected in a way that it is designed to handle many people at once. But not hundreds of thousands of people asking for the same file at once.
The idea is to use several webservers together to serve more media downloaders at once than one webserver can handle at peak efficiency. It suits downloaders who need speed to some extent, and media files that are not popular enough for bittorrent to work efficiently.
Let’s return to our somewhat popular 50 MB music file, and its bigger 200 MB flac cousin.
If you have cheap shared hosting available to you on a couple of servers, you can upload the files to several servers at once. They will be identical files hosted on several mirrors. Let’s say you have server space on each coast of the USA as well as a server space in a European country.
Now, if you are close to a server, you can still do a traditional download at your nearest server. Nothing in this system stops that. So, if you are on the west coast of the USA, you can still download a copy from the west coast server with your firefox, and still get a somewhat good download.
But if you have a really big pipe to the Internet, you are not maxing out your connection unless you use segmented downloading. The way you do this is to use a segmented download manager like aria2, Axel, wxdownloadfast, or a Windows or Mac program that does the same thing. So you could, to give an example, open up a terminal window and type “aria2c” and a space, then copy the URL of the file from one of the mirrors and paste it in, type another space, and repeat for each mirror. You end up with the word “aria2c,” which is the command the aria2 program installs, followed by a space-separated list of the different locations of the same file. The “axel” command works exactly the same way, but I am most familiar with aria2 so I will stick to what I know.
Now, those of you who are tech-savvy know about “download managers.” They follow the UNIX philosophy of having one job, which in this case is downloading, and they do it very well. Most people get these programs when they grow concerned about a big download being interrupted, because a download manager is able to “talk” to the webserver and restart a download in the middle. Thus, in a traditional download, if the download were interrupted half-way through, a download manager would later re-connect to the server and say “start in the middle, I got the first half already.”
But a segmented downloader takes this ability as far as it will go. In the aria2 case, it first allocates the disk space needed for the whole file, you know, to get that pesky disk-space-allocation thing out of the way. Then aria2 looks at the 50 MB file and thinks, OK, this is really fifty 1 MB downloads. Then it contacts the first webserver and asks for the first megabyte. Simultaneously, it contacts the second webserver and asks for the second megabyte of the file. Simultaneously, it contacts the third webserver and asks for the third megabyte of the file. So far, it has acted exactly like its simpler cousin Axel.
Aria2 is more sophisticated than Axel. Axel will keep round-robining the file until done. Aria2 is more obsessive about its connection to the file. Since aria2 is also a bittorrent client, it uses its bittorrent smarts to max things out. So while these three downloads are going on, it’s rating the servers’ performance from its own perspective. Then it will use the less loaded servers more, auto-magically. This behavior will max out your connection to the Internet.
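If you would rather build that command line from a script than paste URLs by hand, a minimal Python sketch might look like the following. The mirror URLs are made up (example.com addresses standing in for the west coast, east coast and European servers), and the function name is an assumption for illustration. Note that the aria2 program is normally installed as the command aria2c.

```python
import shlex

# Hypothetical mirror URLs; substitute the real locations of your file.
MIRRORS = [
    "http://west.example.com/show.flac",
    "http://east.example.com/show.flac",
    "http://eu.example.com/show.flac",
]


def segmented_command(mirrors, tool="aria2c"):
    """Return the argument list: the tool name followed by every mirror URL."""
    return [tool, *mirrors]


cmd = segmented_command(MIRRORS)
# Show the command exactly as it would be typed in a terminal.
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd) to actually start the download, assuming the tool is installed.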
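That “start in the middle” conversation is carried by the HTTP Range header. Here is a minimal Python sketch of how a resuming client might phrase the request; the URL, file name and function name are made up for illustration, and the sketch only builds the request without touching the network.

```python
import os
import tempfile
import urllib.request


def resume_request(url: str, partial_path: str) -> urllib.request.Request:
    """Build a request that asks only for the bytes we do not have yet."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if have > 0:
        # "bytes=N-" means: send everything from offset N to the end.
        req.add_header("Range", f"bytes={have}-")
    return req


# Pretend an earlier download was cut off after 25 bytes.
with tempfile.TemporaryDirectory() as folder:
    partial = os.path.join(folder, "show.mp3.part")
    with open(partial, "wb") as fh:
        fh.write(b"\0" * 25)
    req = resume_request("http://west.example.com/show.mp3", partial)
    print(req.get_header("Range"))  # bytes=25-
```

A server that supports this answers with status 206 (Partial Content) and sends only the missing part, which the client appends to what it already has.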
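The hand-out of megabytes described above can be sketched in a few lines of Python. This is only an illustration of the round-robin idea, not code taken from aria2 or Axel; the mirror URLs and the function name are made up.

```python
from itertools import cycle

# Hypothetical mirrors holding identical copies of the 50 MB file.
MIRRORS = [
    "http://west.example.com/show.mp3",
    "http://east.example.com/show.mp3",
    "http://eu.example.com/show.mp3",
]


def assign_round_robin(num_chunks, mirrors):
    """Map each chunk number to a mirror, cycling through the mirrors in order."""
    return {chunk: mirror for chunk, mirror in zip(range(num_chunks), cycle(mirrors))}


plan = assign_round_robin(50, MIRRORS)
for chunk in range(4):
    # Chunk 3 wraps back around to the first mirror.
    print(chunk, plan[chunk])
```

In a real downloader the requests run at the same time; the sketch only shows which server would be asked for which megabyte.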
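To see why favoring the less loaded servers pays off, here is a small Python simulation. It is emphatically not aria2’s actual algorithm, just one simple policy assumed for illustration: whichever mirror finishes its current chunk first is handed the next one. The mirror names and speeds are invented.

```python
import heapq


def simulate(num_chunks, chunk_mb, speeds_mb_per_s):
    """Give each chunk to whichever mirror becomes free first.

    Returns (chunks fetched per mirror, total seconds until the file is done).
    """
    # Heap of (time the mirror becomes free, mirror name); all are free at t=0.
    free_at = [(0.0, name) for name in sorted(speeds_mb_per_s)]
    heapq.heapify(free_at)
    counts = {name: 0 for name in speeds_mb_per_s}
    finish = 0.0
    for _ in range(num_chunks):
        start, name = heapq.heappop(free_at)
        done = start + chunk_mb / speeds_mb_per_s[name]
        counts[name] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, name))
    return counts, finish


# Invented speeds: the west mirror is twice as fast as the other two.
counts, seconds = simulate(50, 1.0, {"west": 2.0, "east": 1.0, "eu": 1.0})
print(counts, seconds)
```

With these made-up numbers the fast mirror ends up serving roughly twice as many chunks as each slow one, and the whole file arrives sooner than the fastest single mirror could deliver it alone.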
This situation gets even better if you have a really fat connection to the Internet, like a fiber optic (FiOS) connection or a corporate office T3 connection. In that case, a single webserver may not be able, even under the best of conditions, to max out that connection. With segmented downloading, the best outputs of the three servers are added to each other. To give you an idea, when I set up the mirrors for my pilot project of making this available for my podcast, I drew on the first two webservers to fill the last one. Just the other night, each of the first two webservers I set up was functioning at about 3 mbps “up there in the Internet.” When I went to set up the third mirror, where I could use aria2 “up there in the cloud,” I achieved a whopping 6 mbps transfer. That flac file was moved in seconds, a speed not available to traditional tools like wget.
So I end this explanation of segmented downloading with the invitation to you to try it out on my news podcast to see if you like it. And if you do, I hope to hear from you.