Due to the recent “curl is C” post by curl’s author, along with the (unsurprising) response from the Rust Evangelism Strikeforce community, I decided that writing something similar to curl would be fun.

Well, “similar” probably isn’t actually the right word. I personally only use curl for uploading and downloading files. curl is capable of doing much more than that.

Heck, let’s look at curl’s man page.

curl is a tool to transfer data from or to a server, using one of the supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and TFTP).

Oh. Wow. That’s a lot of protocols. “Similar” is definitely not the right word. Let’s go with “I decided that writing something I could use to download files would be fun” instead. But just downloading files is too easy—do a GET request and save the message body of the response to a file? B-o-r-i-n-g. So I decided to write a download accelerator to make things interesting.

Before I get into the details, here’s the link to the GitHub repository for what I wrote. It technically works*!

*For some definition of “works”

…Download Accelerator?

Okay, “download accelerator” might be a bit of a misnomer. Maybe. What this means is that instead of downloading a file via a single HTTP GET request, you download parts of the file over multiple connections. This provides two main advantages:

  1. Some servers limit the bandwidth of each connection. Using multiple connections obviously circumvents this.
  2. Resuming a stopped download is really simple when you know the size of the file and how much you’ve downloaded already.

I say it might be a misnomer because it probably won’t actually speed up your downloads unless you’re connected to a server that limits the rate per connection (which I think is pretty rare these days).
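
To make the splitting concrete, here’s a minimal sketch of the idea in plain Rust (the function name and the four-connection example are mine, not anything from the actual code):

```rust
/// Split a file of `size` bytes into `n` (start, end) pairs suitable for
/// HTTP `Range: bytes=start-end` requests (both endpoints are inclusive).
/// Assumes `n >= 1` and `size >= n`.
fn byte_ranges(size: u64, n: u64) -> Vec<(u64, u64)> {
    let chunk = size / n;
    (0..n)
        .map(|i| {
            let start = i * chunk;
            // The last range absorbs the remainder of the division.
            let end = if i == n - 1 { size - 1 } else { start + chunk - 1 };
            (start, end)
        })
        .collect()
}

fn main() {
    // A 1000-byte file over 4 connections:
    // [(0, 249), (250, 499), (500, 749), (750, 999)]
    println!("{:?}", byte_ranges(1000, 4));
}
```

Resuming falls out of the same bookkeeping: if you already have the first k bytes on disk, you just request `bytes=k-` for the remainder.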

The Process

Okay, now for details! Maybe because it’ll be helpful to someone. Or maybe because journaling is useful for making my thoughts more organized. Probably both. Personally, if a programming project isn’t teaching me anything, then I have no interest in doing it. This project has definitely taught me some things (and is going to end up teaching me even more). Here’s an outline of the process I went through.

Choosing an HTTP Library

My original plan was to use hyper, a popular Rust HTTP library, but I wasn’t sure what additional steps (if any) would be required for downloading over HTTPS. I asked about this on the Rust IRC channel and was told that hyper would require some additional setup to establish an HTTPS connection. I was also told about reqwest, a high-level client wrapper built on hyper that would handle HTTPS for me. Manually handling TLS stuff didn’t sound fun, so I decided to try using reqwest.

Unfortunately, I had some problems with reqwest, like some HEAD requests inexplicably causing a panic, so I ended up going with hyper instead. (And then I found that the initial TLS setup isn’t that bad anyway.)
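
For the record, here’s roughly what that setup looks like. This is a sketch assuming the hyper 0.10-era synchronous API together with the hyper-native-tls crate, not code lifted from the repository:

```rust
extern crate hyper;
extern crate hyper_native_tls;

use hyper::Client;
use hyper::net::HttpsConnector;
use hyper_native_tls::NativeTlsClient;

fn main() {
    // Wire a TLS implementation into hyper's connector; this is the
    // "additional setup" hyper needs that reqwest would do for you.
    let ssl = NativeTlsClient::new().unwrap();
    let connector = HttpsConnector::new(ssl);
    let client = Client::with_connector(connector);

    let res = client.get("https://example.com/").send().unwrap();
    println!("{}", res.status);
}
```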

I also considered using the Rust binding to libcurl, but I’m not a huge fan of bindings. In my opinion, idioms in C (such as error handling) don’t translate that well to Rust, and it makes bindings feel “unnatural” to me.

Microsoft doesn’t like HEAD

I started out by writing a program to send a HEAD request and display the status code and headers of the reply. I tested it out on my favorite file upload site (which uses Microsoft IIS, for reference), and it seemed to work. For reasons I don’t recall, I decided to try it again with the same URL…and it failed: a 404 response instead of the 200 it gave before. I double-checked by making another HEAD request with curl -I, and, sure enough, that failed too. GET requests still worked just fine, though. I assume this is some sort of security feature, but I don’t know what blocking HEAD while allowing GET actually protects against.

Regardless of the reasoning, this led to a design decision: using GET instead of HEAD, because apparently some things treat them differently (even though they probably shouldn’t).
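
Concretely, “using GET instead of HEAD” just means sending a GET, looking at the status line and headers, and dropping the response without reading the body. A sketch, under the same hyper 0.10-era assumption as above (the function name is mine):

```rust
use hyper::Client;
use hyper::header::ContentLength;

/// Probe a URL with GET (not HEAD, since some servers treat them
/// differently) and report the file size, if the server tells us.
fn probe(client: &Client, url: &str) -> Option<u64> {
    let res = client.get(url).send().ok()?;
    println!("{}", res.status);
    // The headers arrive before the body, so we can inspect them and
    // then drop `res` without ever reading the payload.
    res.headers.get::<ContentLength>().map(|&ContentLength(len)| len)
}
```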

Multipart Downloading

Okay, getting info from the server was easy. Not as easy as I wanted it to be, but still pretty easy. Now…what do I do with that info? How do I request only a small part of the file? Well, figuring that out required taking a crash course in HTTP headers and header fields. Through some searching, I was able to figure out pretty quickly that I needed to make a request that included the Range header field. Unfortunately, that’s not the only requirement; there are a few more response header fields that need to be checked first (I’ll sketch the check in code right after this list):

  • Content-Length. Fairly self-explanatory: it’s the size of the file, in bytes. This obviously needs to be known if you’re going to be making range requests (unless you plan on only making one range request, for the entire file…which would be pointless).
  • Accept-Ranges. Hopefully it’ll be bytes. If you’re unlucky, it’ll be none. And to make things confusing, I’m pretty sure that a server is allowed to omit this field even if it does accept range requests.
  • Transfer-Encoding and Content-Encoding. I honestly don’t know what the difference is. I do know they both have to do with encoding and can be something like compress or gzip (meaning compressed with some compression algorithm) or identity (meaning uncompressed). I decided to request only identity encodings so that I wouldn’t have to deal with compression yet. Also, a very important note: Transfer-Encoding can be chunked, meaning the server will just send data in chunks until it’s done. This means the Content-Length is unknown, and range requests aren’t going to work.
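
Here’s the check described above, as a sketch against hyper’s typed headers (the function and its exact policy are my own illustration):

```rust
use hyper::client::Response;
use hyper::header::{AcceptRanges, ContentLength, Encoding, RangeUnit, TransferEncoding};

/// Decide whether a response advertises enough to make range requests:
/// a known length, `Accept-Ranges: bytes`, and no chunked transfer.
/// Returns the file size if ranges look usable.
fn range_info(res: &Response) -> Option<u64> {
    // Chunked transfer means the total length is unknown up front.
    if let Some(&TransferEncoding(ref encodings)) = res.headers.get() {
        if encodings.contains(&Encoding::Chunked) {
            return None;
        }
    }
    // Note: a server may support ranges without sending Accept-Ranges,
    // so treating its absence as "no" is the conservative choice.
    let bytes_ok = match res.headers.get::<AcceptRanges>() {
        Some(&AcceptRanges(ref units)) => units.contains(&RangeUnit::Bytes),
        None => false,
    };
    if !bytes_ok {
        return None;
    }
    res.headers.get::<ContentLength>().map(|&ContentLength(len)| len)
}
```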

So, that means multiple requests per file: one to figure out whether I can do a range request, and then one (or more) to actually get the file. And of course, there’s a list of header fields that I needed to set on those requests (sketched in code after this list):

  • Range. Unless that’s not allowed, of course.
  • Accept-Encoding, to specify that I only wanted Content-Encoding: identity.
  • TE. Okay, I didn’t actually use that one because hyper doesn’t have a built-in struct for it and I was too lazy to write the header field myself. But I probably should’ve used it! It’s basically Accept-Encoding, but for the Transfer-Encoding rather than the Content-Encoding. Don’t ask me why its name doesn’t follow the Accept-* format like the other fields. I wish I knew.
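
Putting the request side together, here’s a sketch of an actual part download with those header fields set (again assuming the hyper 0.10-era API; the function is mine):

```rust
use std::io::Read;

use hyper::Client;
use hyper::header::{qitem, AcceptEncoding, ByteRangeSpec, Encoding, Range};

/// Download the inclusive byte range [start, end] of `url`.
fn fetch_part(client: &Client, url: &str, start: u64, end: u64) -> hyper::Result<Vec<u8>> {
    let mut res = client
        .get(url)
        // e.g. Range: bytes=0-1023
        .header(Range::Bytes(vec![ByteRangeSpec::FromTo(start, end)]))
        // Ask for an uncompressed body so byte offsets mean what we think.
        .header(AcceptEncoding(vec![qitem(Encoding::Identity)]))
        // (A more thorough client would also send TE, as noted above.)
        .send()?;
    let mut body = Vec::new();
    res.read_to_end(&mut body)?;
    Ok(body)
}
```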

Status Codes

When you normally request a file, you hopefully get a 200 OK response (as opposed to something like 404 Not Found). There are a few response codes related to range requests:

  • 206 Partial Content is what you’re hoping for. It means that the server is fulfilling your range request.
  • 416 Range Not Satisfiable, meaning that either the range you requested was invalid (such as requesting bytes 0-999 of a 500-byte file), or that the server doesn’t support ranges. In this case, you need to make another GET request, either for a different range or without specifying a range at all.
  • 200 OK, meaning that your range request isn’t being fulfilled, but the server is sending the complete file.
  • Other 4xx or 5xx errors, which aren’t really “related to range requests” but are of course still a possibility.

Obviously, the code should handle all those cases accordingly.
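
Here’s roughly the shape that handling takes. I’m matching on numeric codes in this sketch to stay agnostic about which named status variants the library exposes:

```rust
use hyper::client::Response;

enum RangeOutcome {
    /// 206: the server honored the range; read just this part.
    Partial,
    /// 200: no range for you, but the whole file is on the way.
    Whole,
    /// 416: retry with a different range, or with no range at all.
    BadRange,
    /// Anything else (404, 500, ...): give up on this URL.
    Failed(u16),
}

fn classify(res: &Response) -> RangeOutcome {
    match res.status.to_u16() {
        206 => RangeOutcome::Partial,
        200 => RangeOutcome::Whole,
        416 => RangeOutcome::BadRange,
        other => RangeOutcome::Failed(other),
    }
}
```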

Hangups

I’ll be honest: I’m not proud of the code I’ve written so far. I had a bunch of lifetime problems as a result of using multithreading, and consequently I’m pretty sure I’m wasting memory by reallocating things that shouldn’t really need to be reallocated.

I also had trouble figuring out how to design custom data types for the downloads that would work the way I wanted. I decided to use a thread pool to handle all the downloads in such a way that individual parts of multipart downloads were treated the same as “normal” (not range-requested) downloads. Doing this while still grouping the parts of multipart downloads together proved more difficult than I would’ve liked.
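
To illustrate the shape I was aiming for rather than the code I actually wrote: each part is just one job on the pool, and a part index sent back over a channel is what keeps a multipart download’s pieces grouped together. This sketch assumes the threadpool crate:

```rust
extern crate threadpool;

use std::sync::mpsc::channel;
use threadpool::ThreadPool;

fn main() {
    let pool = ThreadPool::new(4);
    let (tx, rx) = channel();

    // One job per part; a single-part "normal" download is just n = 1.
    let ranges: Vec<(u64, u64)> = vec![(0, 249), (250, 499), (500, 749), (750, 999)];
    let n = ranges.len();
    for (index, (start, end)) in ranges.into_iter().enumerate() {
        let tx = tx.clone();
        pool.execute(move || {
            // A real job would perform the range request here and write
            // the body to a part file; we just report what we'd do.
            println!("part {}: bytes {}-{}", index, start, end);
            tx.send(index).expect("receiver is alive");
        });
    }

    // The part indices coming back over the channel are what tie a
    // multipart download's pieces back together.
    for _ in 0..n {
        let done = rx.recv().unwrap();
        println!("part {} finished", done);
    }
}
```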

I didn’t even get around to writing the code to merge the parts of a multipart download. Nor did I get around to writing code that would allow resuming an interrupted download. Oops.
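
For what it’s worth, the merge step is conceptually just concatenation. A sketch, assuming each part was saved as <name>.partN (a naming scheme I invented for this example):

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter};

/// Concatenate `dest.part0`, `dest.part1`, ... into `dest`, in order,
/// deleting each part file once it has been copied.
fn merge_parts(dest: &str, part_count: usize) -> io::Result<()> {
    let mut out = BufWriter::new(File::create(dest)?);
    for i in 0..part_count {
        let part_path = format!("{}.part{}", dest, i);
        let mut part = File::open(&part_path)?;
        io::copy(&mut part, &mut out)?;
        fs::remove_file(&part_path)?;
    }
    Ok(())
}
```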

Conclusion

So, what have I decided to do? A complete (better) rewrite from the ground up! Specifically, I’m going to use the futures crate for asynchronous I/O. I’ve kinda wanted to play around with Tokio/futures for a while now, and this seems like a good excuse to do so. Here’s something I read in futures’ documentation:

For example issuing an HTTP request may return a future for the HTTP response, as it probably hasn’t arrived yet.

Yeah, futures sounds like a good crate to use for this. I’m honestly a bit worried that the learning curve will be too steep for me, but the Rust community is ridiculously supportive, so I’m sure I’ll be fine.
