NOTE:This documentation is temporary and may have many errors. It will be continuously updated. A Chinese book on type inference with DSL is being written.
Many languages like Java and Typescript, have functioned a type system to prevent the occurrence of execution errors during the running of a program. These systems are usually Hindley-Milner Type System or System F. However, these systems are not omniscient and omnipotent.
As for scheme, an “untyped language”, for which many type annotation information won’t be trivially given in static code, an intuitive case is given as following:
(lambda (a) (+ 1 a))
It doesn’t have types or, equivalently, have a single universal type that contains all values and the inference mechanism. Because the parameter a
isn’t literally annotated with any type as Typescript.
function(a: number){return 1+a}
Many counterparts like Type-Racket adapt their syntax and demand type information by literally annotating as above Typescript code. But scheme-langserver has found another possible way that consumes r6rs-based code and digests their supposed type specification: recalling the above scheme code, apparently, the +
within r6rs context should have type of (number? <- (number? ...))
, in which the <-
indicates this is a type of function, left-hand side is the return’s type, and the (number? ...)
is a type list of parameter(s). Suppose the code always is executable, the a
can be easily fixed in this +
’s parameter with type number?
, and in this document, scheme-langserver is going to describe a type system using this implicit information from r6rs.
This document will introduce how scheme-langserver:
Let’s consider the type of literals and here 4 trivial cases:
#t
,3
, "string"
and 'a
, in which they all match corresponding basic type predictors boolean?
, number?
(also integer?
and fixnum?
), string?
and symbol?
. So that scheme-langserver will match such literals and using these predictors as their type annotation.'(1 2 3)
and vector literal #(1 2 3)
, in which scheme-langserver will match them with '(inner:list? number? number? number?)
and '(inner:vector? number? number? number?)
. Directly, another much more complicated example '(1 a "e")
will be assigned with '(inner:list? number? symbol? string?)
. The inner:list?
, inner:vector?
and inner:pair?
are primitives.+
function case, (number? <- (inner:list? number? ...))
, <-
indicate it’s a function or in lisp’s tongue, lambda calculus. It receives 0 parameters or something with number?
type, and returns a number?
-typed value, using some shortcut markers include **1
as +
in regular expression, that previous pattern occurs more than once time, and ...
as *
for zero or more times.something?
for universal type. For example, all predictors’ parameter has type of something?
.void?
for (void)
.Further, it’s roughly knows that types have its context. This makes differences because in different library may have same-named predicator. So scheme-langserver involves identifier-references in order to take place of such predictors in above cases. A function construct-type-expression-with-meta
can be used for this process. And something?
, void?
, **1
, ...
and many other primitives won’t take part in the transformation.
In scheme-langserver, an index is across all codes, scoped within document and denoted with index-node, so an intuitive idea is attaching types to index-node. Apparently, the hardest core of type inference is consuming codes matching predictor expressions, and after a bunch of cracks and crashes, splitting out codes match other simple or complex predictor expressions.
Previous paper work would hack this problem with lambda calculus’ tongue, $\alpha$-conversion, $\beta$-reduction and any other such things. But this section just mentions 2 examples, to show that nearly all codes can be digested in same way:
(let ((a 2)) (+ 1 a))
, can be simple transformed as ((lambda (a) (+ 1 a)) 2)
. And merely equal case in typescript is
function (){
var a = 2;
return a+1;
};
//=================>
//can be transformed to
function (){
return (function(a){return a+1;})(2);
};
This means literal 2
can introduce number?
type to variable a
by assignment and specific known information can infer unknown.
((lambda (a) (+ 1 a)) 1)
, the (lambda (a) (+ 1 a))
’s return type is specified by the bottom (+ 1 a)
. This bottom its return type is also specified by the r6rs-defined +
function’s bottom. Following this chain, as this document mentioned above, scheme-langserver has manually assigned 1108 forms with their should-be types, some more type assignment can be derived.Further, there’s still a long way to all types, because:
(let ((a 2)) (+ 1 a))
, its type is determined by several parts of code and scheme-langserver has implemented several rules in this folder;call/cc
, their types are not able predictable theoretically.Above we’ve mentioned “compound”, “complex”, “several parts” and such on many times. They all lead 3 following tips:
This is to say scheme-langserver has to introduce variable into type representation and take the part of identifier-reference. And this would be convenient for successive inference and logic calculation.
Although lambda calculus indicates how information flows from literal and r6rs-base functions to user-selves defined code, it’s like a crack and crash engine, in which fuel has been pumped, nothing know we how to ignite it. Luckily, we have Hindley-Milner (HM) type system, a classical type system for the lambda calculus. Among its more notable properties are its completeness and its ability to infer the most general types of a given program without programmer-supplied type annotations or other hints. HM has following detailed rules:
Variable Access \(\frac{(x:\sigma) \in \Gamma}{\Gamma \vdash (x:\sigma)}\) Intuitively, this whole rule means if a variable $x$ were given a type above this horizental line, a following-up process under it would be able to apply: spit out the type $\sigma$ of $x$. For detail, $\Gamma$ is a machine that can act all these deductions, $\in$ means extending the machine with assigning type $\sigma$ to variable $x$. $\vdash$ indicats the $\Gamma$ spliting something.
Application
\(\frac{\Gamma \vdash (e_1:(\tau_1 \to \tau_2)) \quad\quad \Gamma \vdash (e_2 : \tau_1) }{\Gamma \vdash ((e_1\ e_2) : \tau_2)}\)
This rule is usually used for applying function e1
to e2
like (e1 e2)
(like above (+ 1 a)
) and get this expression’s type $\tau_2$. You can find its implementation in application.sls. Attention, here the $\to$ is opposite to above <-
direction.
Abstract
\(\frac{(\Gamma, \;(x:\tau_1)) \vdash (e:\tau_2)}{\Gamma \vdash ((\lambda\ x\ .\ e) : (\tau_1 \to \tau_2))}\)
This rule is usually used for defining function’s type like (lambda (x) (+ 1 x))
. It means, if a function $e$, when its parameter $x$ had been given type $\tau_1$, its return type were $\tau_2$, then $e$’s whole type would be $\tau_1 \to \tau_2$. In scheme-langserver, this type expression may be (tau1 <- tau2)
.
Let \(\frac{\Gamma \vdash (e_1:\sigma) \quad\quad (\Gamma,\,(x:\sigma)) \vdash (e_2:\tau)}{\Gamma \vdash ((\mathtt{let}\ (x = e_1)\ \mathtt{in}\ e_2) : \tau)}\) Known: the expression $e_1$ has type $sigma$; for variable $x$ has type $\sigma$, function $e_2$ has $\tau$. If substituted $x$ with $e_1$ in $e_2$, the return type still would be $\tau$.
Subtype \(\frac{\Gamma \vdash (e: \sigma_1) \quad\quad \sigma_1 \sqsubseteq \sigma_2}{\Gamma \vdash (e : \sigma_2)}\) This rule is saying that if $e$ has type $\sigma_1$ and $\sigma_1$ is subset of $\sigma_2$, or in other words, $\sigma_1$ inherents $\sigma_2$, $e$ will also has type $\sigma_2$.
Type Generalize \(\frac{\Gamma \vdash (e: \sigma) \quad\quad \alpha \notin \mathtt{free}(\Gamma)}{\Gamma \vdash (e:(\forall\ \alpha\ .\ \sigma))}\) This rule is usually used for type’s generation especially for record type inherent in scheme language. It is saying that if there is some variable $\alpha$ which is “not free”, then it is safe to say that any expression whose type you know $e : \sigma$ will have that type for any value of $\alpha$. Further, “not free” means $alpha$ has been attached to some variables in above pair list.
((variable : type)... (type1 < type2))
can be facilitated in $\Gamma$;(append old-knowledge-list with-new-knowledge-list)
;(vairbale = type-expression-with-variable)
, in which variable
is referring variable.sls; With these equations, scheme-langserver can do following processes.Directly, $\Gamma$ is a machine who will run a routing process by finding a path from static code, usually represented with index-node to type expression. This algorithmic process of solving equations between symbolic expressions is named with unification in logic and computer science. And according to the mathematic tradition created by François Viète, by importing variables, the unification algorithm also denote expressions and their parts.
In scheme-langserver, the unification algorithm is in Robinson’s Unification Algorithm’s tongue, and acts $\Gamma$ machine like a spider walking on knowledge graph. After identifier catching, construct graph for single document with construct-substitution-list-for
function and the process path starts from index-node and finally ends with some type-known index-nodes, which are practically specified by r6rs functions or literals. Path-memory and cycle detection mechanism would avoid dead loop. Details are following:
PRIVATE-MAX-DEPTH
is used to limit the depth.NOTE: The detail code is in this file.
The interpreter involves 3 scopes of symbols:
inner:trivial?
will verify it. After verifying, the interpreter is mainly responsible for the “Application Rule” in above section. That means ((boolean? <- (inner:list? something?)) something?)
will be interpreted into boolean?
.car
and cdr
. Above representations can’t undertake the pressure. I just made a very simple macro expander so that users can write macro and make it writing type representations in interpreting process, and here’s an example.Recalling the example of (lambda (a) (+ 1 a))
, the parameter a
is not given type within Hindley-Milner’s system. Scheme-langserver solves this problem with Gradual Typing, it wants to make an implicit conversation: the +
only accept number?
input, why can’t assign the a
with number?
type?
The details are listed in the trivial.sls: When the a
symbol is legally presented in the +
form, a substitution will be generated like (variable0 = (with (a b c) ((with (d0 d1 d2) d1) c)))
with variable0
is attached to a
and variable1
is attached to +
. d0 d1 d2
is generated according to how many parameters the +
form has. Whatever, after some interpreting processes, the a
will be attached with number?
in +
’s (number? <- (inner:list? number? ...))
.
The function overloading and recursion raise the Halting Problem a practical barrier in my DSL interpreter. To solve this problem, the interpreter has to recall their function name many times and the loop-avoiding mechanism won’t allow this. I’ve just implemented a set of procedure to solve this problem:
type:interpret
is the basic version, and it involves a max-depth
in order to avoid loop. And here, a much more complicated case is that ((boolean? <- (inner:list? something?)) something?)
, the boolean?
part usually is taken place with a procedure type representation and like (((boolean? <- (inner:list? something?)) <- (inner:list? something?)) something?)
. If there were some variables in that part, the interpreter couldn’t recognize the newly generated representation is equal to origin one.type:interpret-result-list
just for taking the result from type:interpret.type:interpret
will fall into an exception and maybe return a list contain macros. Obviously, a macro should be interpreted before interpreting whole representation. This is why type:depature&interpret->result-list
is implemented.type:recursive-interpret-result-list
is for solving the halting problem: it will extend the substitution set and interpret those unsolved representations with these substitutions. With limited extending times, this procedure can solve many representations like this test.The interpreter now heavily depended on Cartesian Product, which is mainly used in generating type representations for procedures. Especially for those procedures have more than 4 parameters which may separately have 4 or more representations, and the results for this parameter list will account for more than 256. In other words, the raising speed of status number $n$ account for $\mathcal{O}(n!)$!
This is because that the Cartesian Product is a travesal operation. Maybe some features in my DSL could cut edges in this process, but now I don’t figure out any solution.