Due to the recent “curl is C” post by curl’s author,
along with the (unsurprising) response by the Rust
Evangelism Strikeforce community,
I decided that writing something similar to curl would be fun.
Well, “similar” probably isn’t actually the right word. I personally only use curl for uploading and downloading files. curl is capable of doing much more than that.
Heck, let’s look at curl’s man page.
curl is a tool to transfer data from or to a server, using one of the supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and TFTP).
Oh. Wow. That’s a lot of protocols. “Similar” is definitely not the right word. Let’s go
with “I decided that writing something I could use to download files would be fun”
instead. But just downloading files is too easy—do a
GET request and save
the message body of the response to a file? B-o-r-i-n-g. So I decided to write
a download accelerator
to make things interesting.
Before I get into the details, here’s the link to the GitHub repository for what I wrote. It technically works*!
*For some definition of “works”
Okay, “download accelerator” might be a bit of a misnomer. Maybe. What this means is that instead
of downloading a file via a single HTTP
GET request, you download parts of files using
multiple connections. This provides two main advantages:
- Some servers limit the bandwidth of each connection. Using multiple connections obviously circumvents this.
- Resuming a stopped download is really simple when you know the size of the file and how much you’ve downloaded already.
I say it might be a misnomer because it probably won’t actually speed up your downloads unless you’re connected to a server that limits the rate per-connection (which I think is pretty rare these days).
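The core idea of splitting one download into several range requests can be sketched in a few lines. This is my own illustration, not code from the repository; the function name and the inclusive `(start, end)` pair convention (matching the `bytes=start-end` form of the `Range` header) are assumptions.

```rust
// Sketch: split a file of known size into roughly equal byte ranges,
// one per connection. Assumes `parts >= 1` and `total >= parts`.
fn split_ranges(total: u64, parts: u64) -> Vec<(u64, u64)> {
    let chunk = total / parts;
    (0..parts)
        .map(|i| {
            let start = i * chunk;
            // The last part absorbs any remainder from the division.
            let end = if i == parts - 1 { total - 1 } else { start + chunk - 1 };
            (start, end)
        })
        .collect()
}

fn main() {
    // A 1000-byte file downloaded over 3 connections:
    let ranges = split_ranges(1000, 3);
    assert_eq!(ranges, vec![(0, 332), (333, 665), (666, 999)]);
    println!("{:?}", ranges);
}
```

Each tuple would become one `Range: bytes=start-end` request; knowing the boundaries up front is also what makes resuming simple, since you can tell exactly which parts are incomplete.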
Okay, now for details! Maybe because it’ll be helpful to someone. Or maybe because journaling is useful for making my thoughts more organized. Probably both. Personally, if a programming project isn’t teaching me anything, then I have no interest in doing it. This project has definitely taught me some things (and is going to end up teaching me even more). Here’s an outline of the process I went through.
Choosing an HTTP Library
My original plan was to use hyper, a popular Rust HTTP library, but I wasn’t sure of what additional steps (if any) would be required for downloading over HTTPS. I asked about this on the Rust IRC channel, and was told that hyper would require some additional setup to establish an HTTPS connection. I was also told about reqwest, a high-level client wrapper built on hyper that would handle HTTPS for me. Manually handling SSL stuff didn’t sound fun, so I decided to try using reqwest.
Unfortunately, I had some problems with reqwest
(HEAD requests inexplicably caused a panic),
so I ended up going with hyper instead.
(And then I found that the initial TLS setup isn’t that bad anyway.)
I also considered using the Rust binding to libcurl, but I’m not a huge fan of bindings. In my opinion, idioms in C (such as error handling) don’t translate that well to Rust, and it makes bindings feel “unnatural” to me.
Microsoft Doesn’t Like HEAD Requests
I started out by writing a program to send a
HEAD request and display the status code
and headers of the reply. I tested it out on my favorite file upload site
(which uses Microsoft IIS, for reference), and it seemed to work.
For reasons I don’t recall, I decided to try it again with the same URL…and it failed.
It returned a 404 response instead of the 200 that it gave before. I double-checked by making another
HEAD request with
curl -I, and, sure enough, that failed too.
GET requests still worked
just fine, though. I assume this is some sort of security feature, but I don’t know what blocking
HEAD while allowing
GET will secure you against.
Regardless of the reasoning, this led to a design decision: using
GET instead of HEAD for getting file information,
because apparently some things treat them differently (even though they probably shouldn’t).
Okay, getting info from the server was easy. Not as easy as I wanted it to be, but still pretty
easy. Now…what do I do with that info? How do I request only a small part of the file?
Well, figuring that out required taking a crash course in HTTP headers and header fields.
Through some searching, I was able to figure out pretty quickly that I needed to make a request
that included the
Range header field. Unfortunately, that’s not the only requirement;
there are a few more header fields in the server’s response that have to be checked:
Content-Length. Fairly self-explanatory: it’s the size of the file, in bytes. This obviously needs to be known if you’re going to be making range requests (unless you plan on only making one range request, for the entire file…which would be pointless).
Accept-Ranges. Hopefully it’ll be
bytes. If you’re unlucky, it’ll be
none. And to make things confusing, I’m pretty sure that a server is allowed to omit this field even if it does accept range requests.
Content-Encoding and Transfer-Encoding. I honestly don’t know what the difference between them is. I do know they both have to do with the encoding and can be something like
gzip (meaning compressed with some compression algorithm) or
identity (meaning uncompressed). I decided to just request only
identity encodings so that I wouldn’t have to deal with compression yet. Also, a very important note: the Transfer-Encoding can be
chunked, meaning the server will just send data in chunks until it’s done. This means the
Content-Length is unknown, and range requests aren’t going to work.
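The decision those fields feed into can be condensed into one predicate. This is a sketch under my own naming (the struct and function don’t exist in the project); it takes the relevant header values as plain `Option`s rather than hyper’s typed headers, just to show the logic.

```rust
// The inputs the info-gathering request has to produce.
struct InfoHeaders<'a> {
    content_length: Option<u64>,
    accept_ranges: Option<&'a str>,     // e.g. Some("bytes") or Some("none")
    transfer_encoding: Option<&'a str>, // e.g. Some("chunked")
}

// Can we do a multipart (range-requested) download at all?
fn can_range_request(h: &InfoHeaders) -> bool {
    // An unknown length, or a chunked transfer, rules out range requests.
    h.content_length.is_some()
        && h.transfer_encoding != Some("chunked")
        // "none" explicitly refuses ranges. A missing Accept-Ranges is
        // ambiguous, so this sketch optimistically allows it and relies
        // on the response code to find out for sure.
        && h.accept_ranges != Some("none")
}

fn main() {
    let ok = InfoHeaders {
        content_length: Some(4096),
        accept_ranges: Some("bytes"),
        transfer_encoding: None,
    };
    assert!(can_range_request(&ok));

    let chunked = InfoHeaders {
        content_length: None,
        accept_ranges: None,
        transfer_encoding: Some("chunked"),
    };
    assert!(!can_range_request(&chunked));
}
```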
So, that means multiple requests per file. One to figure out if I can do a range request, and then one (or more) to actually get the file. And of course, there’s a list of header fields that I needed to set:
Range. Unless that’s not allowed, of course.
Accept-Encoding, to specify that I only wanted identity.
TE. Okay, I didn’t actually use that one because
hyper doesn’t have a built-in struct for it and I was too lazy to write the header field myself. But I probably should’ve used it! It’s basically
Accept-Encoding but for the
Transfer-Encoding rather than for the
Content-Encoding. Don’t ask me why its name doesn’t follow the
Accept-* format like the other fields. I wish I knew.
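Put together, the raw header lines for one part of a multipart download look like this. I’m using plain string formatting here rather than hyper’s typed header structs, purely to show the wire format; the function name is my own.

```rust
// Build the request header lines for one part of a multipart download.
// `start` and `end` are inclusive byte offsets.
fn part_headers(start: u64, end: u64) -> Vec<String> {
    vec![
        // Ask for just this slice of the file.
        format!("Range: bytes={}-{}", start, end),
        // Refuse compressed Content-Encodings for now.
        "Accept-Encoding: identity".to_string(),
        // The field I skipped: it constrains Transfer-Encoding the same
        // way Accept-Encoding constrains Content-Encoding.
        "TE: identity".to_string(),
    ]
}

fn main() {
    let h = part_headers(0, 999);
    assert_eq!(h[0], "Range: bytes=0-999");
    for line in &h {
        println!("{}", line);
    }
}
```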
When you normally request a file, you hopefully get a
200 OK response (as opposed to
404 Not Found). There are a few response codes related to range requests:
206 Partial Content is what you’re hoping for. It means that the server is fulfilling your range request.
416 Range Not Satisfiable, meaning that either the range you requested was invalid (such as requesting bytes 0-999 of a 500-byte file), or that the server doesn’t support ranges. In this case, you need to make another
GET request, either for a different range or without specifying a range at all.
200 OK, meaning that your range request isn’t being fulfilled, but the server is sending the complete file.
Errors in the 5xx range, which aren’t really “related to range requests” but are of course still a possibility.
Obviously, the code should handle all those cases accordingly.
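Handling those cases is a natural fit for a Rust enum and a `match`. The enum and names below are mine, for illustration; the real code would act on each variant (write the bytes, retry with a different range, back off, or give up).

```rust
// Possible outcomes of a range request, per the response codes above.
#[derive(Debug, PartialEq)]
enum RangeOutcome {
    Partial,     // 206: server honored the range
    FullBody,    // 200: range ignored, the complete file is coming
    BadRange,    // 416: retry with a different range, or none at all
    ServerError, // 5xx: server-side trouble; maybe retry later
    Other(u16),  // anything else (404, redirects, ...)
}

fn classify(status: u16) -> RangeOutcome {
    match status {
        206 => RangeOutcome::Partial,
        200 => RangeOutcome::FullBody,
        416 => RangeOutcome::BadRange,
        500..=599 => RangeOutcome::ServerError,
        s => RangeOutcome::Other(s),
    }
}

fn main() {
    assert_eq!(classify(206), RangeOutcome::Partial);
    assert_eq!(classify(200), RangeOutcome::FullBody);
    assert_eq!(classify(503), RangeOutcome::ServerError);
    assert_eq!(classify(404), RangeOutcome::Other(404));
}
```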
I’ll be honest, I’m not proud of the code I’ve written so far. I had a bunch of lifetime problems as a result of using multithreading, and as a result, I’m pretty sure I’m wasting memory by reallocating things that shouldn’t really need to be reallocated.
I also had trouble figuring out how to design custom data types for the downloads that would work how I wanted. I decided to use a threadpool to handle all the downloads in such a way that individual parts of multipart downloads were treated the same as “normal” (not range-requested) downloads. Doing this while still grouping the parts of multipart downloads together proved more difficult than I would’ve liked.
I didn’t even get around to writing the code to merge the parts of a multipart download. Nor did I get around to writing code that would allow resuming an interrupted download. Oops.
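For what it’s worth, the merge step I never wrote can be as simple as concatenating the part files in order. This is a hypothetical sketch (the file-naming convention and function are mine, not the project’s), using only the standard library.

```rust
use std::fs::{self, File};
use std::io;

// Concatenate the part files, in order, into the final output file.
fn merge_parts(part_paths: &[&str], out_path: &str) -> io::Result<()> {
    let mut out = File::create(out_path)?;
    for p in part_paths {
        let mut part = File::open(p)?;
        // io::copy streams the whole part into the output.
        io::copy(&mut part, &mut out)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Tiny demo with fake "parts":
    fs::write("demo.part0", b"hello ")?;
    fs::write("demo.part1", b"world")?;
    merge_parts(&["demo.part0", "demo.part1"], "demo.out")?;
    assert_eq!(fs::read("demo.out")?, b"hello world");
    Ok(())
}
```

Resuming is the same bookkeeping in reverse: compare each part file’s size on disk against its planned range, and re-request only the missing tail.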
So, what have I decided to do? A complete (better) rewrite from the ground up!
Specifically, I’m going to use the futures crate for asynchronous I/O.
I’ve kinda wanted to play around with Tokio/futures for a while now, and this seems like a good
excuse to do so.
Here’s something I read in futures’ documentation:
For example issuing an HTTP request may return a future for the HTTP response, as it probably hasn’t arrived yet.
Yeah, futures sounds like a good crate to use for this. I’m honestly a bit worried that the learning curve will be too steep for me, but the Rust community is ridiculously supportive, so I’m sure I’ll be fine.