Request for comments: a thesis proposal proposal
Lindsey Kuper [website: composition.al]
Jul. 25th, 2011, 09:29 pm

When I came to Mozilla this summer, I told my co-workers that I was casting about for a thesis topic. For the last few weeks, I've been mulling over an idea that I think could lead to one.

Although my idea addresses a problem that I don't think is specific to Rust, I did arrive at it by working on Rust, so I'll start there. The Rust compiler is implemented with an LLVM backend; it translates Rust code to the LLVM IR. Before I started working on Rust this spring, I knew nothing about LLVM, but I've become a fan. By targeting LLVM, our team (mostly) no longer has to think about a whole set of machine-specific issues (register allocation, instruction selection, code generation), which lets us (mostly) concentrate on higher-level language design issues. I like the modular design of LLVM; perhaps the best part is that by targeting it, we're able to take advantage of optimizations that other people contribute. If a compiler targets LLVM, then at least in theory, improvements to LLVM will make the compiler better over time without the compiler writers having to do anything. In the case of Rust, we've made a few contributions upstream to LLVM, too, so the relationship is, we hope, mutually beneficial.

Now, the LLVM folks have a way of making it sound like it's very easy to write an LLVM-backed compiler: all you have to do is write a straightforward front end that massages source programs into LLVM IR, and you're done. That might not be too far-fetched a thing to say if the source language you're compiling is C. If you're compiling a much higher-level language, though, translating to LLVM IR is nontrivial. In the Rust compiler, the translation from the fully-fleshed-out-and-annotated AST to LLVM IR happens via a single, monolithic, 9000-line pass known affectionately as "trans". It's almost impossible to reason about a compiler pass that monolithic. We can ameliorate some of the pain of dealing with trans by separating it into modules that handle, say, task- and communication-related things, vector operations, and so on; we've started doing this, and it does help. We've also talked about abstracting low-level stuff out into what we call an "LLVM combinator library", which would present higher-level abstractions for common things we do with LLVM, and that will help, too. But what would really be nice is more intermediate languages. Even one well-designed intermediate language would go a long way toward making this part of the compiler easier to reason about and less buggy, and at least one Rust developer besides me has advocated for such a language in discussions on our IRC channel.
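To make the "one more intermediate language" idea concrete, here is a toy sketch in Rust of a flat, explicitly sequenced language sitting between an annotated AST and LLVM IR. Everything here (the types, the instruction set, the names) is invented for illustration and bears no relation to the actual trans pass:

```rust
// A fragment of a high-level, annotated AST...
#[derive(Debug)]
enum Ast {
    Num(i64),
    Add(Box<Ast>, Box<Ast>),
    Let(String, Box<Ast>, Box<Ast>),
    Var(String),
}

// ...and a hypothetical lower-level intermediate language: flat
// instructions over named temporaries, still well above LLVM IR
// (no types, calling conventions, or address arithmetic yet).
#[derive(Debug, PartialEq)]
enum Inst {
    Const { dst: String, val: i64 },
    Add { dst: String, lhs: String, rhs: String },
    Copy { dst: String, src: String },
}

struct Lowerer {
    insts: Vec<Inst>,
    next_tmp: u32,
}

impl Lowerer {
    fn fresh(&mut self) -> String {
        self.next_tmp += 1;
        format!("t{}", self.next_tmp)
    }

    // Lower an AST to instructions; return the name holding the result.
    fn lower(&mut self, ast: &Ast) -> String {
        match ast {
            Ast::Num(n) => {
                let dst = self.fresh();
                self.insts.push(Inst::Const { dst: dst.clone(), val: *n });
                dst
            }
            Ast::Add(a, b) => {
                let lhs = self.lower(a);
                let rhs = self.lower(b);
                let dst = self.fresh();
                self.insts.push(Inst::Add { dst: dst.clone(), lhs, rhs });
                dst
            }
            Ast::Let(name, rhs, body) => {
                let v = self.lower(rhs);
                self.insts.push(Inst::Copy { dst: name.clone(), src: v });
                self.lower(body)
            }
            Ast::Var(name) => name.clone(),
        }
    }
}

fn main() {
    // let x = 1 + 2 in x + x
    let ast = Ast::Let(
        "x".to_string(),
        Box::new(Ast::Add(Box::new(Ast::Num(1)), Box::new(Ast::Num(2)))),
        Box::new(Ast::Add(
            Box::new(Ast::Var("x".to_string())),
            Box::new(Ast::Var("x".to_string())),
        )),
    );
    let mut l = Lowerer { insts: vec![], next_tmp: 0 };
    let result = l.lower(&ast);
    // Five flat instructions, with the result in temporary t4.
    assert_eq!(result, "t4");
    assert_eq!(l.insts.len(), 5);
}
```

The point of the extra layer is that each translation (AST to this, this to LLVM IR) is small enough to reason about on its own.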

But here's the thing: it's not just Rust! Everyone who compiles a high-level language to LLVM is going to have this problem and is going to have to deal with it one way or another. So, wouldn't it be great to have a language that sits on top of LLVM, compiles to LLVM in a semantics-preserving way, and serves as a target language for compiler writers who are trying to compile high-level languages but who want to take advantage of the goodness of LLVM? That's the language that I propose designing.

In order to do the proof of semantics preservation, I'd need a formal semantics for LLVM, and from what I can tell, nobody has one of those yet. (It looks like Steve Zdancewic and Jianzhou Zhao recently embarked on an LLVM-in-Coq project, but they haven't published anything yet.) So, the main contributions of this work would be as follows:

  • It would make compiler writers' lives easier by giving them a language to target that isn't as low-level as LLVM IR, but still lets them exploit the goodness of LLVM in their compiler -- which, after working on Rust, I believe is a real need.
  • It would give a formal semantics to a subset of LLVM, which I think would be useful in many contexts outside of this work.

Rather than having to give a semantics to all of LLVM, I can just give a semantics to the target language of compilation, which I think could be a rather smaller subset of LLVM IR. I think a proof of semantics preservation would be valuable even if it's only for a small subset of LLVM, because part of the point of the LLVM infrastructure is that it's okay to produce correct-but-inefficient IR and hand it off to the LLVM optimizer. If we don't have to worry too much about producing efficient LLVM IR, as long as we produce correct LLVM IR, then we can compile to some minimal subset of LLVM that's expressive enough for our needs and then let the optimizer go to town on code written in that subset.
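The "produce correct-but-inefficient code, then let a generic optimizer clean it up" division of labor can be sketched in miniature. The toy IR and the pass below are invented for illustration (LLVM's real passes work on LLVM IR, not on anything like this), but the shape is the same: the code generator knows nothing about optimization, and the pass knows nothing about the front end:

```rust
use std::collections::HashMap;

// A tiny invented IR: t<dst> = <val>, and t<dst> = t<a> + t<b>.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Const(u32, i64),
    Add(u32, u32, u32),
}

// One generic optimization: constant folding. Whenever both operands
// of an add are known constants, replace the add with a constant.
fn fold_constants(insts: &[Inst]) -> Vec<Inst> {
    let mut known: HashMap<u32, i64> = HashMap::new();
    let mut out = Vec::new();
    for inst in insts {
        match *inst {
            Inst::Const(dst, v) => {
                known.insert(dst, v);
                out.push(inst.clone());
            }
            Inst::Add(dst, a, b) => {
                if let (Some(&x), Some(&y)) = (known.get(&a), known.get(&b)) {
                    known.insert(dst, x + y);
                    out.push(Inst::Const(dst, x + y));
                } else {
                    out.push(inst.clone());
                }
            }
        }
    }
    out
}

fn main() {
    // A naive code generator emitting `1 + 2` the dumbest correct way.
    let naive = vec![Inst::Const(1, 1), Inst::Const(2, 2), Inst::Add(3, 1, 2)];
    let folded = fold_constants(&naive);
    // The runtime add has been folded away into a constant.
    assert_eq!(folded[2], Inst::Const(3, 3));
}
```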

What we're after here, by the way, is not end-to-end verified compilation; for that, there's CompCert and its ilk. Instead, we are working on the assumption that an LLVM backend is something some people are going to want anyway. Having said that, Greg Morrisett, Jean-Baptiste Tristan, and Paul Govereau have work underway on validating LLVM optimizations, so if someone did want an end-to-end verified compiler that used LLVM, this work and theirs could be complementary pieces of that larger puzzle.

In the last couple of weeks, I've gotten the chance to pick some brains about my idea. So far I've gotten some great questions and feedback:

  • Adam Foltzer asked how my proposed intermediate language would compare to C-- in terms of its level of abstraction. Adam may have been thinking of how, in the LLVM-backed version of GHC, an intermediate language that's based on C-- compiles to LLVM. However, C-- isn't really higher-level than LLVM; in fact, the level of abstraction they offer is quite similar. The C-- developers characterize it as a portable assembly language, and that sounds a lot like LLVM. (So why does GHC compile from C-- to LLVM, then? There's a great paper about the LLVM GHC backend from last year's Haskell Symposium that answers that question in detail.) So the answer is that I'm aiming for something higher-level than C--. How much higher-level is still an open question. The trans pass of Rust gives me an idea of the range of abstraction I'm aiming for, but between the level of abstraction of the input to Rust's trans pass (which is a big, hairy AST with a whole bunch of stuff hanging off of it) and the level of abstraction of LLVM, there's still a lot of room. This is something that I'd especially like more feedback from people about. I'd like to choose a set of abstractions that serve as a sane target for a variety of source languages -- say, a purely functional one, a super-OO-Kool-Aid one, and a super-concurrent-Kool-Aid one. I wouldn't, I hope, actually need to implement a compiler from each of these languages to my intermediate language. That seems like too much work for one grad student. But I'd like to be able to have a reasonably convincing story for how my language could be a target for any of them.
  • Dan Grossman pointed out that I'd need to consider how a compiler writer who wanted to target my language would deal with run-time system matters, particularly garbage collection. That reminded me that most of the complexity of the aforementioned enormous trans pass in Rust is code that handles reference counting. This is a stopgap measure that will be replaced by a real garbage collector one day, and I suppose that might cut down on the complexity, but in any case, I think I'd want my language to sit at a level of abstraction above garbage collection. On the other hand, I don't want it to tie the compiler writer to a particular set of GC choices. This, I think, is an example of the high-level-abstraction-versus-low-level-control tension that drives a lot of research in our field. Actually, C-- found a neat solution to the eat-cake-and-have-it-too dilemma by providing an interface for compiler writers to plug in their own garbage collector; this run-time interface thing is the most interesting and novel aspect of C--. Putting an interface like that in my language is a possibility, but I recently realized (from talking with my co-workers and from reading the aforementioned LLVM-backed GHC paper) that LLVM now offers such an interface, too. So maybe my story can be "not my problem; use the LLVM GC interface".
  • Caylee Hogg asked what the framework for semantics preservation would be. I think I'd want to mechanize a proof of semantics preservation in a proof assistant and then extract the compiler from the proof, like y' do. Of the various proof assistant options, Coq is the only one I've tried, and I'm not married to it; I'm willing to invest in learning something new. More generally, though, I need to pin down exactly what "semantics preservation" means in this setting. A lot of authors seem to use "type-preserving compilation" and "semantics-preserving compilation" interchangeably, so is type preservation what I want, or do I want something more? Or something less? I've always been fuzzy on this point, and I'd like to get very, very clear on it before embarking on a proof.
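The plug-in collector idea from the garbage collection bullet above can be sketched roughly as a pair of interfaces: one the compiled code provides to the collector, one the collector provides to the compiled code. To be clear, everything here is invented for illustration; this is neither C--'s actual run-time interface nor LLVM's GC support:

```rust
/// An opaque handle to a heap object, as the mutator sees it.
type ObjPtr = usize;

/// What the compiler promises the collector: a way to enumerate roots.
trait RootEnumerator {
    fn roots(&self) -> Vec<ObjPtr>;
}

/// What the collector promises the compiler.
trait Collector {
    /// Allocate `size` bytes, collecting first if necessary.
    fn alloc(&mut self, size: usize, roots: &dyn RootEnumerator) -> ObjPtr;
    /// An explicit safepoint the compiler may emit at call sites.
    fn safepoint(&mut self, roots: &dyn RootEnumerator);
}

/// A trivial plug-in "collector" that bump-allocates and never frees.
struct BumpOnly {
    next: usize,
}

impl Collector for BumpOnly {
    fn alloc(&mut self, size: usize, _roots: &dyn RootEnumerator) -> ObjPtr {
        let p = self.next;
        self.next += size;
        p
    }
    fn safepoint(&mut self, _roots: &dyn RootEnumerator) {}
}

/// A mutator with no roots at all, just to exercise the interface.
struct NoRoots;

impl RootEnumerator for NoRoots {
    fn roots(&self) -> Vec<ObjPtr> {
        Vec::new()
    }
}

fn main() {
    let mut gc = BumpOnly { next: 0 };
    let a = gc.alloc(16, &NoRoots);
    let b = gc.alloc(8, &NoRoots);
    gc.safepoint(&NoRoots);
    assert_eq!((a, b), (0, 16));
}
```

The tension shows up in exactly which methods go in each trait: a generational or moving collector needs far more from the mutator (precise stack maps, write barriers) than this sketch admits.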

So now I'd like to pick your brain. Who should I be talking to? What related work should I be looking at? What potential pitfalls have I not considered? I'd love to know what you think!


Comments:
From: ssaiscps
2011-07-26 05:38 am (UTC)
Well, Tim and I have a paper on a high-level GC interface. ;) Though I guess you still have to pick things like object representation.
From: lindseykuper
2011-07-26 08:50 pm (UTC)
You know, one of the people I was talking to about this project actually referred me to that paper when I started talking about the run-time system stuff. It was kind of fun to say, "Yeah, uh, one of the authors is sitting next to me right now and the other one sits down the hall."
From: graydon
2011-07-26 06:13 am (UTC)
I hope this is not an inappropriate place to comment.

Generally the strength of LLVM lies in the letters "LL": it's low level, which is where a large class of compiler-writers take their semantics down to. It's used because it's one of the best-organized, well-typed low-level interfaces.

The reason trans is big is not that the interface to LLVM is clunky: it's that we're a medium-large language that exposes a fair bit of low-level control. We actually have to handle a ton of cases in trans. Real cases that a programmer can express in the language: differences between different word sizes and signedness, argument-passing modes, capture modes, ownership and sharing modes, direct and indirect calls, native calls with various FFIs, copying vs. moving, stacking calls vs. tail-calls, etc. etc.

Given the level of complexity in the language (it's really not that small), I would expect trans to never make it much below 5000 lines; though of course it can (and probably should) be split some, just for ease of editing. And I'm able to imagine half the complexity in there is just a result of poor factoring, clunky early use of the language (how much will nested patterns and or-patterns help?) and the general overgrowth of a heavily-worked file. We've been blasting ahead to get features done without a lot of time spent "going back" and refactoring. If you refactored heavily, you could likely squish it down quite a lot. Look at the work Patrick's doing on the DPS backend. Much tidier.

Beyond that, the language semantics themselves have a lot of current redundancy: the whole communication system will sink out of view into libraries, the ABI is going to change for the simpler in a lot of cases (glue is going interpreted, the implicit args are going away), several source constructs and types are slated to vanish, etc. etc. so I expect there's another 10-20% shaving to do as features are eliminated.

And beyond all *that* ... even 9000 lines is, sorry to say, not terribly huge for such a module. I'd invite you to research other production compilers for low-level languages and see what the line counts come in at. I don't think we're terribly large, at ~39kloc. Even spidermonkey has files larger than this in its JITs, and they're "just" compiling JS :)
From: lindseykuper
2011-07-26 04:01 pm (UTC)
You know, I hesitated about whether to bring up Rust to the extent I'm doing, or whether to even tag this post with "mozilla". I ended up doing so because credit is due to Rust for being what caused me to even think about any of this stuff in the first place. But, really, this idea isn't about Rust. In fact, it can't be about Rust, because if it's at all tightly coupled to Rust, I don't think it'll be a viable thesis topic. Whatever I do needs to be of independent interest. (I'm already a little concerned that it's too tightly coupled to LLVM, but I think I can just barely get away with that.) In any case, I'm not claiming that this idea makes sense for the specific case of Rust. And, even if that were my claim, what Rust is doing instead has the considerable advantage of actually existing. My goal in this endeavor isn't to change Rust; it's to do some research that I hope is interesting, and to eventually graduate!
From: ssaiscps
2011-07-27 03:21 am (UTC)
LLVM is pretty widely used, so convincing people it is useful shouldn't be too much of a problem. Apple is using it as part of its main compiler for OSX, replacing GCC. Really, I'd be more worried about it becoming a crowded space.
From: jes5199
2011-07-26 06:20 am (UTC)
wow!
From: jes5199
2011-07-26 07:14 pm (UTC)
I could have sworn I saw a presentation by someone who worked on Parrot who basically said "it is not possible to write a pluggable garbage collector, it's too fundamental to the design of your object system"
From: lindseykuper
2011-07-26 08:45 pm (UTC)
I'm probably just being pedantic, but did they really mean "not possible to write a pluggable garbage collector" or did they perhaps mean "not possible to write a socket that will accept any GC that you might want to plug into it"?
From: jes5199
2011-07-26 08:49 pm (UTC)
I think they meant something like: we do not believe it is possible to write a system that generalizes to the point where you can transparently choose between a perl5-style reference-counting GC and a JVM-style generational GC.

Do you have any feel for how much latitude the LLVM GC interface gives you?
From: graydon
2011-07-26 09:33 pm (UTC)
"not enough"
From: lindseykuper
2011-07-26 11:30 pm (UTC)
Well, anyone who denies that ref-counting is garbage collection at all is just going to object to the premises.
From: jes5199
2011-07-26 11:32 pm (UTC)
well, that's batshit.
From: jes5199
2011-07-26 11:51 pm (UTC)
wait, apparently that jargon has two definitions:
1) iphone-style static analysis which inserts mallocs and frees at compile time
2) perl-style increment and decrement a counter whenever you make a reference to an object at runtime

it's reasonable to say that the first one isn't garbage collection.
It's reasonable to say that the second one is garbage, but it is certainly collection.
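For what it's worth, definition (2), the runtime counter, is exactly what a shared-ownership smart pointer does. A minimal sketch using Rust's standard `Rc` type (which postdates this 2011 discussion):

```rust
use std::rc::Rc;

fn main() {
    let a = Rc::new(String::from("shared"));
    assert_eq!(Rc::strong_count(&a), 1);

    // "increment a counter whenever you make a reference to an object"
    let b = Rc::clone(&a);
    assert_eq!(Rc::strong_count(&a), 2);

    // ...and decrement when the reference goes away. At zero, the
    // allocation is freed immediately: no separate tracing phase.
    drop(b);
    assert_eq!(Rc::strong_count(&a), 1);
}
```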
From: jes5199
2011-07-26 11:57 pm (UTC)
unless what the iphone compiler does is insert code that mimics the perl-style. maybe that's what's going on, the internet seems extra confused about it.
From: ssaiscps
2011-07-27 03:26 am (UTC)
Yes, I think ARC is just automatic insertion of AddRefs and DecRefs.
From: aleffert
2011-07-27 05:56 am (UTC)
ARC is definitely static.
From: aleffert
2011-07-26 06:29 am (UTC)
As someone who is very very vaguely writing a high level language targeted at llvm I am highly interested in such a thing. What sort of choices would you be making about an FFI and tail calls? It seems like you'd either have to dictate that or provide some way of customizing those choices.
From: lindseykuper
2011-07-26 11:31 pm (UTC)
Those are good questions. Is your project a personal project? I.e., is it one you can talk to me about?
From: aleffert
2011-07-26 11:36 pm (UTC)
Yeah, this is a personal project. Like all good pet projects it even has a long inactive sourceforge page! http://avern.sourceforge.net/

I can send you the now out of date design doc if you're curious. I don't think it covers those issues at all though.
From: Aaron Turon
2011-07-26 09:54 pm (UTC)

Semantics preservation & some related work

Sounds like a neat idea! It would be fascinating to develop an IL that's at a midpoint between LLVM and high-level OO/FP/... languages. Doing this well enough to get buy-in from disparate compiler writers will be a challenge.

re: semantics vs type preservation: those are different ideas. Say you have a program P in your IL, and that C(P) is the LLVM compilation of it.

Semantics preservation says something like: any possible observation (e.g. final result) of C(P) using LLVM's semantics is also a possible observation of P using your source language semantics. The word "possible" here is meant to deal with nontermination and nondeterminism.

Type preservation says something like: given that P is a well-typed IL term, C(P) is a well-typed LLVM term. Or, you can go further and relate the types themselves: if P has type T in the IL, it has type C(T) in LLVM. We overload the compilation function to work on both types and terms.

Those aren't the only possible ways to set things up, but that at least gives the flavor and shows how the two ideas differ.
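The two properties can also be written compactly. Writing Obs_S(P) for the set of possible observations of P under semantics S, and overloading C over terms and types as above:

```latex
% Semantics preservation: every possible observation of the compiled
% program is a possible observation of the source program.
\forall P.\ \mathrm{Obs}_{\mathrm{LLVM}}(C(P)) \subseteq \mathrm{Obs}_{\mathrm{IL}}(P)

% Type preservation: compilation sends well-typed terms to well-typed
% terms, with the type translated by the same function C.
\forall P, T.\quad \vdash_{\mathrm{IL}} P : T \;\Longrightarrow\; \vdash_{\mathrm{LLVM}} C(P) : C(T)
```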

You'll definitely want to read "From System F to Typed Assembly Language":
http://www.cs.princeton.edu/~dpw/papers/tal-toplas.pdf

Finally, it might be worth looking at the FLINT project, http://flint.cs.yale.edu/flint/index.html

I don't know that work in detail, but I'm pretty sure they worked on (a series of?) typed intermediate languages, verified compilation, etc. FLINT was used in the SML/NJ compiler, and perhaps others as well.

Looking forward to hearing more!
From: lindseykuper
2011-07-26 11:26 pm (UTC)

Re: Semantics preservation & some related work

Aaron, I think you're right that buy-in from disparate compiler writers will be nigh-impossible. It might be reasonable to narrow the focus to a particular flavor of high-level language, rather than asking everybody to compromise.

Thanks for the type preservation vs. semantics preservation explanation. Now that I think about it, that has been explained to me before -- probably more than once.

I'm a big fan of the F-to-TAL paper -- it's part of my mental model for What A Compiler Paper Should Be. And it turns out that one of my colleagues at Mozilla is an alum of the FLINT project. It's a small, small PL world.
From: ssaiscps
2011-07-27 03:29 am (UTC)

Re: Semantics preservation & some related work

Xavier Leroy's CACM paper on Semantics Preservation is a pretty readable presentation.
From: ssaiscps
2011-07-27 03:28 am (UTC)
Sounds neat, but I warn you that mechanical formalization will consume insane amounts of time. Ask Tim if you want more tales of woe. We're both battle-hardened veterans of the Coq trenches.
From: lindseykuper
2011-07-29 06:28 am (UTC)
Have you ever tried doing a paper proof of the same size? I appreciate your point about insane amounts of time, but based on my experience with non-mechanical formalization, I think it would also consume insane amounts of time and be more error-prone.
From: neelk
2011-07-27 06:45 am (UTC)
"So the answer is that I'm aiming for something higher-level than C--. How much higher-level is still an open question."

This is maybe the most important question you need to answer before you write your proposal. Can you give three features that you want your IR to have, each of which 1) C-- doesn't have, 2) is complicated to implement well, and 3) is useful for some high-level feature?
From: lindseykuper
2011-07-29 06:31 am (UTC)
Thanks, Neel; this is great, specific advice. Working on it.
From: aml
2011-08-02 08:32 am (UTC)

rubinius

i don't have nearly the chops to understand, but this makes me wonder how rubinius's JIT deals with their bytecode to LLVM ir translation.