Lindsey Kuper (lindseykuper) wrote,
Lindsey Kuper

Request for comments: a thesis proposal proposal

When I came to Mozilla this summer, I told my co-workers that I was casting about for a thesis topic. For the last few weeks, I've been mulling over an idea that I think could lead to one.

Although my idea addresses a problem that I don't think is specific to Rust, I did arrive at it by working on Rust, so I'll start there. The Rust compiler is implemented with an LLVM backend; it translates Rust code to the LLVM IR. Before I started working on Rust this spring, I knew nothing about LLVM, but I've become a fan. By targeting LLVM, there are a whole set of machine-specific issues (register allocation, instruction selection, code generation) that our team (mostly) no longer has to think about, allowing us to (mostly) concentrate on higher-level language design issues. I like the modular design of LLVM; perhaps the best part is that by targeting LLVM we're able to take advantage of optimizations that other people contribute. If a compiler targets LLVM, then at least in theory, improvements to LLVM will make the compiler better over time without the compiler writers having to do anything. In the case of Rust, we've made a few contributions upstream to LLVM, too, so the relationship is, we hope, mutually beneficial.

Now, the LLVM folks have a way of making it sound like it's very easy to write an LLVM-backed compiler: all you have to do is write a straightforward front-end that massages source programs into LLVM IR, and you're done. This might not be too far-fetched of a thing to say if the source language you're compiling is C. If you're compiling a much higher-level language, though, translating to LLVM IR is nontrivial. In the Rust compiler, the translation from the fully-fleshed-out-and-annotated AST to LLVM IR happens via a single, monolithic, 9000-line pass known affectionately as "trans". It's almost impossible to try to reason about a compiler pass this monolithic. We can ameliorate some of the pain of dealing with trans by separating it into modules that handle, say, task and communication-related things, vector operations, and so on; we've started doing this, and it does help. We've also talked about abstracting low-level stuff out into what we call an "LLVM combinator library", which would present higher-level abstractions for common things we do with LLVM, and that will help, too. But what would really be nice would be if we had more intermediate languages. Even just one well-designed intermediate language would go a long way toward making this part of the compiler easier to reason about and less buggy, and at least one Rust developer besides me has advocated for such an intermediate language in discussions on our IRC channel.

But here's the thing: it's not just Rust! Everyone who compiles a high-level language to LLVM is going to have this problem and is going to have to deal with it one way or another. So, wouldn't it be great to have a language that sits on top of LLVM, compiles to LLVM in a semantics-preserving way, and serves as a target language for compiler writers who are trying to compile high-level languages but who want to take advantage of the goodness of LLVM? That's the language that I propose designing.

In order to do the proof of semantics preservation, I'd need a formal semantics for LLVM, and from what I can tell, nobody has one of those yet. (It looks like Steve Zdancewic and Jianzhou Zhao recently embarked on an LLVM-in-Coq project, but they haven't published anything yet.) So, the main contributions of this work would be as follows:

  • It would make compiler writers' lives easier by giving them a language to target that isn't as low-level as LLVM IR, but still lets them exploit the goodness of LLVM in their compiler -- which, after working on Rust, I believe is a real need.
  • It would give a formal semantics to a subset of LLVM, which I think would be useful in many contexts outside of this work.

Rather than having to give a semantics to all of LLVM, I can just give a semantics to the target language of compilation, which I think could be a rather smaller subset of LLVM IR. I think a proof of semantics preservation would be valuable even if it's only for a small subset of LLVM, because part of the point of the LLVM infrastructure is that it's okay to produce correct-but-inefficient IR and hand it off to the LLVM optimizer. If we don't have to worry too much about producing efficient LLVM IR, as long as we produce correct LLVM IR, then we can compile to some minimal subset of LLVM that's expressive enough for our needs and then let the optimizer go to town on code written in that subset.

What we're after here, by the way, is not end-to-end verified compilation; for that, there's CompCert and its ilk. Instead, we are working on the assumption that an LLVM backend is something some people are going to want anyway. Having said that, Greg Morrisett, Jean-Baptiste Tristan, and Paul Govereau have work underway on validating LLVM optimizations, so if someone did want an end-to-end verified compiler that used LLVM, this work and theirs could be complementary pieces of that larger puzzle.

In the last couple of weeks, I've gotten the chance to pick some brains about my idea. So far I've gotten some great questions and feedback:

  • Adam Foltzer asked how my proposed intermediate language would compare to C-- in terms of its level of abstraction. Adam may have been thinking of how, in the LLVM-backed version of GHC, an intermediate language that's based on C-- compiles to LLVM. However, C-- isn't really higher-level than LLVM; in fact, the level of abstraction they offer is quite similar. The C-- developers characterize it as a portable assembly language, and that sounds a lot like LLVM. (So why does GHC compile from C-- to LLVM, then? There's a great paper about the LLVM GHC backend from last year's Haskell Symposium that answers that question in detail.) So the answer is that I'm aiming for something higher-level than C--. How much higher-level is still an open question. The trans pass of Rust gives me an idea of the range of abstraction I'm aiming for, but between the level of abstraction of the input to Rust's trans pass (which is a big, hairy AST with a whole bunch of stuff hanging off of it) and the level of abstraction of LLVM, there's still a lot of room. This is something that I'd especially like more feedback from people about. I'd like to choose a set of abstractions that serve as a sane target for a variety of source languages -- say, a purely functional one, a super-OO-Kool-Aid one, and a super-concurrent-Kool-Aid one. I wouldn't, I hope, actually need to implement a compiler from each of these languages to my intermediate language. That seems like too much work for one grad student. But I'd like to be able to have a reasonably convincing story for how my language could be a target for any of them.
  • Dan Grossman pointed out that I'd need to consider how a compiler writer who wanted to target my language would deal with run-time system matters, particularly garbage collection. That reminded me that most of the complexity of the aforementioned enormous trans pass in Rust is code that handles reference counting. This is a stopgap measure that will be replaced by a real garbage collector one day, and I suppose that might cut down on the complexity, but in any case, I think I'd want my language to sit at a level of abstraction above garbage collection. On the other hand, I don't want it to tie the compiler writer to a particular set of GC choices. This, I think, is an example of the high-level-abstraction-versus-low-level-control tension that drives a lot of research in our field. Actually, C-- found a neat solution to the eat-cake-and-have-it-too dilemma by providing an interface for compiler writers to plug in their own garbage collector; this run-time interface thing is the most interesting and novel aspect of C--. Putting an interface like that in my language is a possibility, but I recently realized (from talking with my co-workers and from reading the aforementioned LLVM-backed GHC paper) that LLVM now offers such an interface, too. So maybe my story can be "not my problem; use the LLVM GC interface".
  • Caylee Hogg asked what the framework for semantics preservation would be. I think I'd want to mechanize a proof of semantics preservation in a proof assistant and then extract the compiler from the proof, like y' do. Of the various proof assistant options, Coq is the only one I've tried, and I'm not married to it; I'm willing to invest in learning something new. More generally, though, I need to pin down exactly what "semantics preservation" means in this setting. A lot of authors seem to use "type-preserving compilation" and "semantics-preserving compilation" interchangeably, so is type preservation what I want, or do I want something more? Or something less? I've always been fuzzy on this point, and I'd like to get very, very clear on it before embarking on a proof.

So now I'd like to pick your brain. Who should I be talking to? What related work should I be looking at? What potential pitfalls have I not considered? I'd love to know what you think!

Tags: mozilla, research

  • Post a new comment


    Anonymous comments are disabled in this journal

    default userpic

    Your reply will be screened

    Your IP address will be recorded