Due to the recent “curl is C” post by curl’s author,
along with the (unsurprising) response by the Rust
Evangelism Strikeforce community,
I decided that writing something similar to curl would be fun.
Well, “similar” probably isn’t actually the right word. I personally only use curl for uploading and downloading files. curl is capable of doing much more than that.
Heck, let’s look at curl’s man page.
curl is a tool to transfer data from or to a server, using one of the supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and TFTP).
Oh. Wow. That’s a lot of protocols. “Similar” is definitely not the right word. Let’s go
with “I decided that writing something I could use to download files would be fun”
instead. But just downloading files is too easy—do a
GET request and save
the message body of the response to a file? B-o-r-i-n-g. So I decided to write
a download accelerator
to make things interesting.
Before I get into the details, here’s the link to the GitHub repository for what I wrote. It technically works*!
*For some definition of “works”
Okay, “download accelerator” might be a bit of a misnomer. Maybe. What this means is that instead
of downloading a file via a single HTTP
GET request, you download parts of files using
multiple connections. This provides two main advantages:
- Some servers limit the bandwidth of each connection. Using multiple connections obviously circumvents this.
- Resuming a stopped download is really simple when you know the size of the file and how much you’ve downloaded already.
I say it might be a misnomer because it probably won’t actually speed up your downloads unless you’re connected to a server that limits the rate per-connection (which I think is pretty rare these days).
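The core idea of splitting one download into several range requests can be sketched in a few lines. This is my own illustration, not code from the repository; the function name and the inclusive `(start, end)` pair convention (matching the `bytes=start-end` form of the `Range` header) are assumptions.

```rust
// Sketch: split a file of known size into roughly equal byte ranges,
// one per connection. Assumes `parts >= 1` and `total >= parts`.
fn split_ranges(total: u64, parts: u64) -> Vec<(u64, u64)> {
    let chunk = total / parts;
    (0..parts)
        .map(|i| {
            let start = i * chunk;
            // The last part absorbs any remainder from the division.
            let end = if i == parts - 1 { total - 1 } else { start + chunk - 1 };
            (start, end)
        })
        .collect()
}

fn main() {
    // A 1000-byte file downloaded over 3 connections:
    let ranges = split_ranges(1000, 3);
    assert_eq!(ranges, vec![(0, 332), (333, 665), (666, 999)]);
    println!("{:?}", ranges);
}
```

Each tuple would become one `Range: bytes=start-end` request; knowing the boundaries up front is also what makes resuming simple, since you can tell exactly which parts are incomplete.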
Okay, now for details! Maybe because it’ll be helpful to someone. Or maybe because journaling is useful for making my thoughts more organized. Probably both. Personally, if a programming project isn’t teaching me anything, then I have no interest in doing it. This project has definitely taught me some things (and is going to end up teaching me even more). Here’s an outline of the process I went through.
Choosing an HTTP Library
My original plan was to use hyper, a popular Rust HTTP library, but I wasn’t sure of what additional steps (if any) would be required for downloading over HTTPS. I asked about this on the Rust IRC channel, and was told that hyper would require some additional setup to establish an HTTPS connection. I was also told about reqwest, a high-level client wrapper built on hyper that would handle HTTPS for me. Manually handling SSL stuff didn’t sound fun, so I decided to try using reqwest.
Unfortunately, I had some problems with reqwest
(HEAD requests inexplicably caused a panic),
so I ended up going with hyper instead.
(And then I found that the initial TLS setup isn’t that bad anyway.)
I also considered using the Rust binding to libcurl, but I’m not a huge fan of bindings. In my opinion, idioms in C (such as error handling) don’t translate that well to Rust, and it makes bindings feel “unnatural” to me.
Microsoft Doesn’t Like HEAD Requests
I started out by writing a program to send a
HEAD request and display the status code
and headers of the reply. I tested it out on my favorite file upload site
(which uses Microsoft IIS, for reference), and it seemed to work.
For reasons I don’t recall, I decided to try it again with the same URL…and it failed.
It returned a 404 response instead of the 200 that it gave before. I double-checked by making another
HEAD request with
curl -I, and, sure enough, that failed too.
GET requests still worked
just fine, though. I assume this is some sort of security feature, but I don’t know what blocking
HEAD while allowing
GET will secure you against.
Regardless of the reasoning, this led to a design decision: using
GET instead of HEAD for getting file information,
because apparently some things treat them differently (even though they probably shouldn’t).
Okay, getting info from the server was easy. Not as easy as I wanted it to be, but still pretty
easy. Now…what do I do with that info? How do I request only a small part of the file?
Well, figuring that out required taking a crash course in HTTP headers and header fields.
Through some searching, I was able to figure out pretty quickly that I needed to make a request
that included the
Range header field. Unfortunately, that’s not the only requirement;
there are a few more header fields in the server’s response that have to be checked:
Content-Length. Fairly self-explanatory: it’s the size of the file, in bytes. This obviously needs to be known if you’re going to be making range requests (unless you plan on only making one range request, for the entire file…which would be pointless).
Accept-Ranges. Hopefully it’ll be
bytes. If you’re unlucky, it’ll be
none. And to make things confusing, I’m pretty sure that a server is allowed to omit this field even if it does accept range requests.
Content-Encoding and Transfer-Encoding. I honestly don’t know what the difference between them is. I do know they both have to do with the encoding and can be something like
gzip (meaning compressed with some compression algorithm) or
identity (meaning uncompressed). I decided to just request only
identity encodings so that I wouldn’t have to deal with compression yet. Also, a very important note: the Transfer-Encoding can be
chunked, meaning the server will just send data in chunks until it’s done. This means the
Content-Length is unknown, and range requests aren’t going to work.
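The decision those fields feed into can be condensed into one predicate. This is a sketch under my own naming (the struct and function don’t exist in the project); it takes the relevant header values as plain `Option`s rather than hyper’s typed headers, just to show the logic.

```rust
// The inputs the info-gathering request has to produce.
struct InfoHeaders<'a> {
    content_length: Option<u64>,
    accept_ranges: Option<&'a str>,     // e.g. Some("bytes") or Some("none")
    transfer_encoding: Option<&'a str>, // e.g. Some("chunked")
}

// Can we do a multipart (range-requested) download at all?
fn can_range_request(h: &InfoHeaders) -> bool {
    // An unknown length, or a chunked transfer, rules out range requests.
    h.content_length.is_some()
        && h.transfer_encoding != Some("chunked")
        // "none" explicitly refuses ranges. A missing Accept-Ranges is
        // ambiguous, so this sketch optimistically allows it and relies
        // on the response code to find out for sure.
        && h.accept_ranges != Some("none")
}

fn main() {
    let ok = InfoHeaders {
        content_length: Some(4096),
        accept_ranges: Some("bytes"),
        transfer_encoding: None,
    };
    assert!(can_range_request(&ok));

    let chunked = InfoHeaders {
        content_length: None,
        accept_ranges: None,
        transfer_encoding: Some("chunked"),
    };
    assert!(!can_range_request(&chunked));
}
```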
So, that means multiple requests per file. One to figure out if I can do a range request, and then one (or more) to actually get the file. And of course, there’s a list of header fields that I needed to set:
Range. Unless that’s not allowed, of course.
Accept-Encoding, to specify that I only wanted identity.
TE. Okay, I didn’t actually use that one because
hyper doesn’t have a built-in struct for it and I was too lazy to write the header field myself. But I probably should’ve used it! It’s basically
Accept-Encoding but for the
Transfer-Encoding rather than for the
Content-Encoding. Don’t ask me why its name doesn’t follow the
Accept-* format like the other fields. I wish I knew.
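Put together, the raw header lines for one part of a multipart download look like this. I’m using plain string formatting here rather than hyper’s typed header structs, purely to show the wire format; the function name is my own.

```rust
// Build the request header lines for one part of a multipart download.
// `start` and `end` are inclusive byte offsets.
fn part_headers(start: u64, end: u64) -> Vec<String> {
    vec![
        // Ask for just this slice of the file.
        format!("Range: bytes={}-{}", start, end),
        // Refuse compressed Content-Encodings for now.
        "Accept-Encoding: identity".to_string(),
        // The field I skipped: it constrains Transfer-Encoding the same
        // way Accept-Encoding constrains Content-Encoding.
        "TE: identity".to_string(),
    ]
}

fn main() {
    let h = part_headers(0, 999);
    assert_eq!(h[0], "Range: bytes=0-999");
    for line in &h {
        println!("{}", line);
    }
}
```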
When you normally request a file, you hopefully get a
200 OK response (as opposed to
404 Not Found). There are a few response codes related to range requests:
206 Partial Content is what you’re hoping for. It means that the server is fulfilling your range request.
416 Range Not Satisfiable, meaning that either the range you requested was invalid (such as requesting bytes 0-999 of a 500-byte file), or that the server doesn’t support ranges. In this case, you need to make another
GET request, either for a different range or without specifying a range at all.
200 OK, meaning that your range request isn’t being fulfilled, but the server is sending the complete file.
Errors in the 5xx range, which aren’t really “related to range requests” but are of course still a possibility.
Obviously, the code should handle all those cases accordingly.
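Handling those cases is a natural fit for a Rust enum and a `match`. The enum and names below are mine, for illustration; the real code would act on each variant (write the bytes, retry with a different range, back off, or give up).

```rust
// Possible outcomes of a range request, per the response codes above.
#[derive(Debug, PartialEq)]
enum RangeOutcome {
    Partial,     // 206: server honored the range
    FullBody,    // 200: range ignored, the complete file is coming
    BadRange,    // 416: retry with a different range, or none at all
    ServerError, // 5xx: server-side trouble; maybe retry later
    Other(u16),  // anything else (404, redirects, ...)
}

fn classify(status: u16) -> RangeOutcome {
    match status {
        206 => RangeOutcome::Partial,
        200 => RangeOutcome::FullBody,
        416 => RangeOutcome::BadRange,
        500..=599 => RangeOutcome::ServerError,
        s => RangeOutcome::Other(s),
    }
}

fn main() {
    assert_eq!(classify(206), RangeOutcome::Partial);
    assert_eq!(classify(200), RangeOutcome::FullBody);
    assert_eq!(classify(503), RangeOutcome::ServerError);
    assert_eq!(classify(404), RangeOutcome::Other(404));
}
```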
I’ll be honest, I’m not proud of the code I’ve written so far. I had a bunch of lifetime problems as a result of using multithreading, and as a result, I’m pretty sure I’m wasting memory by reallocating things that shouldn’t really need to be reallocated.
I also had trouble figuring out how to design custom data types for the downloads that would work how I wanted. I decided to use a threadpool to handle all the downloads in such a way that individual parts of multipart downloads were treated the same as “normal” (not range-requested) downloads. Doing this while still grouping the parts of multipart downloads together proved more difficult than I would’ve liked.
I didn’t even get around to writing the code to merge the parts of a multipart download. Nor did I get around to writing code that would allow resuming an interrupted download. Oops.
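For what it’s worth, the merge step I never wrote can be as simple as concatenating the part files in order. This is a hypothetical sketch (the file-naming convention and function are mine, not the project’s), using only the standard library.

```rust
use std::fs::{self, File};
use std::io;

// Concatenate the part files, in order, into the final output file.
fn merge_parts(part_paths: &[&str], out_path: &str) -> io::Result<()> {
    let mut out = File::create(out_path)?;
    for p in part_paths {
        let mut part = File::open(p)?;
        // io::copy streams the whole part into the output.
        io::copy(&mut part, &mut out)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Tiny demo with fake "parts":
    fs::write("demo.part0", b"hello ")?;
    fs::write("demo.part1", b"world")?;
    merge_parts(&["demo.part0", "demo.part1"], "demo.out")?;
    assert_eq!(fs::read("demo.out")?, b"hello world");
    Ok(())
}
```

Resuming is the same bookkeeping in reverse: compare each part file’s size on disk against its planned range, and re-request only the missing tail.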
So, what have I decided to do? A complete (better) rewrite from the ground up!
Specifically, I’m going to use the futures crate for asynchronous I/O.
I’ve kinda wanted to play around with Tokio/futures for a while now, and this seems like a good
excuse to do so.
Here’s something I read in futures’ documentation:
For example issuing an HTTP request may return a future for the HTTP response, as it probably hasn’t arrived yet.
Yeah, futures sounds like a good crate to use for this. I’m honestly a bit worried that the learning curve will be too steep for me, but the Rust community is ridiculously supportive, so I’m sure I’ll be fine.