Here is the basic idea. Conceptually a binmap is a tree of bitmaps. In a leaf at the bottom of the tree each bit in the bitmap represents one bit. In a leaf one layer above the bottom each bit in the bitmap represents two bits. In a leaf two layers above the bottom each bit in the bitmap represents four bits etc.

```ocaml
type t =
  { layers : int
  ; tree : tree }

and tree =
  | Bitmap of int
  | Branch of tree * tree
```

Let’s pretend for simplicity our bitmaps are only 1 bit wide. Then the string 00000000 would be represented as:

```ocaml
{ layers = 3
; tree = Bitmap 0 }
```

And the string 00001100 would be:

```ocaml
{ layers = 3
; tree =
    Branch
      ( Bitmap 0
      , Branch (Bitmap 1, Bitmap 0) ) }
```
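To make the compaction rule concrete, here is a small self-contained sketch (with 1-bit leaves, as above) that builds a compacted tree from a list of bits. `of_bits` and `split_at` are illustrative helpers, not part of the binmap code:

```ocaml
type tree =
  | Bitmap of int
  | Branch of tree * tree

(* split a list into a prefix of length n and the rest *)
let rec split_at n l =
  if n = 0 then ([], l)
  else match l with
    | [] -> ([], [])
    | x :: xs -> let (a, b) = split_at (n - 1) xs in (x :: a, b)

(* Build a compacted tree from a list of 0/1 bits (length assumed to be a
   power of two). A branch whose children are identical leaves collapses
   back into a single leaf, which is what makes runs of equal bits cheap. *)
let rec of_bits bits =
  match bits with
  | [] -> invalid_arg "of_bits: empty"
  | [b] -> Bitmap b
  | _ ->
    let (l, r) = split_at (List.length bits / 2) bits in
    (match of_bits l, of_bits r with
     | Bitmap a, Bitmap b when a = b -> Bitmap a
     | left, right -> Branch (left, right))

let () =
  (* 00001100 compacts to the tree shown above *)
  assert (of_bits [0;0;0;0;1;1;0;0]
          = Branch (Bitmap 0, Branch (Bitmap 1, Bitmap 0)));
  (* 00000000 compacts to a single leaf *)
  assert (of_bits [0;0;0;0;0;0;0;0] = Bitmap 0)
```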

The worst case for this data structure is the string 0101010101… In this case we use about 6.5x as much memory as a plain bitmap (3 words for a Branch with two pointers, 4 words for a Bitmap with a pointer to a boxed Int32). The C++ version uses some simple tricks to reduce this overhead to just over 2x that of a plain bitmap. We can replicate these in OCaml by using a Bigarray to simulate raw memory access.

Our data structure looks like this:

```ocaml
module Array = struct
  include Bigarray.Array1
  let geti array i = Bitmap.to_int (Bigarray.Array1.get array i)
  let seti array i v = Bigarray.Array1.set array i (Bitmap.of_int v)
end

type t =
  { length : int
  ; layers : int
  ; mutable array : (Bitmap.t, Bitmap.bigarray_elt, Bigarray.c_layout) Array.t
  ; pointers : Widemap.t
  ; mutable free : int }

type node =
  | Bitmap of Bitmap.t
  | Pointer of int

let get_node binmap node_addr is_left =
  let index = node_addr + (if is_left then 0 else 1) in
  match Widemap.get binmap.pointers index with
  | false -> Bitmap (Array.get binmap.array index)
  | true -> Pointer (Array.geti binmap.array index)

let set_node binmap node_addr is_left node =
  let index = node_addr + (if is_left then 0 else 1) in
  match node with
  | Bitmap bitmap ->
    Widemap.set binmap.pointers index false;
    Array.set binmap.array index bitmap
  | Pointer int ->
    Widemap.set binmap.pointers index true;
    Array.seti binmap.array index int
```

Each pair of cells in the array represents a branch. Leaves are hoisted into their parent branch, replacing the pointer. Widemap.t is an extensible bitmap which we use here to track whether a given cell in the array is a pointer or a bitmap. The length field is the number of bits represented by the binmap. The free field will be explained later.

Our previous example string 00001100 would now be represented like this:

```ocaml
(*
0 -> Bitmap 0
1 -> Pointer 2
2 -> Bitmap 1
3 -> Bitmap 0
*)

{ length = 8
; layers = 3
; array = [| 0; 2; 1; 0 |]
; pointers = Widemap.of_string "0100"
; free = 0 }
```

When the bitmap is changed we may have to add or delete pairs, e.g. if the above example changed to 00001111 it would be represented as:

```ocaml
(*
0 -> Bitmap 0
1 -> Bitmap 1
2 -> ?
3 -> ?
*)
```

We can grow and shrink the array as necessary, but since deleted pairs won’t necessarily be at the end of the used space the bigarray will become fragmented. To avoid wasting space we can write a linked list into the empty pairs to keep track of free space. 0 is always the root of the tree so we can use it as a list terminator. The free field marks the start of the list.

```ocaml
let del_pair binmap node_addr =
  Array.seti binmap.array node_addr binmap.free;
  binmap.free <- node_addr

(* double the size of a full array and then initialise the freelist *)
let grow_array binmap =
  assert (binmap.free = 0);
  let old_len = Array.dim binmap.array in
  assert (old_len mod 2 = 0);
  assert (old_len <= max_int);
  let new_len = min max_int (2 * old_len) in
  assert (new_len mod 2 = 0);
  let array = create_array new_len in
  Array.blit binmap.array (Array.sub array 0 old_len);
  binmap.array <- array;
  binmap.free <- old_len;
  for i = old_len to new_len - 4 do
    if i mod 2 = 0 then Array.seti array i (i + 2)
  done;
  Array.seti array (new_len - 2) 0

let add_pair binmap node_left node_right =
  (if binmap.free = 0 then grow_array binmap);
  let node_addr = binmap.free in
  let free_next = Array.geti binmap.array binmap.free in
  binmap.free <- free_next;
  set_node binmap node_addr true node_left;
  set_node binmap node_addr false node_right;
  node_addr
```

I haven’t yet written any code to shrink the array but it should be fairly straightforward to recursively copy the tree into a new array and rewrite the pointers.
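That recursive copy might look something like the following self-contained sketch, which models the bigarray as a plain int array and the Widemap as a bool array. The names and the sequential allocation strategy are illustrative only; the real code would go through get_node/set_node and add_pair:

```ocaml
type node = Bitmap of int | Pointer of int

(* src is a (cells, pointer-flags) pair modelling the fragmented array *)
let get_node (cells, ptrs) addr is_left =
  let i = addr + (if is_left then 0 else 1) in
  if ptrs.(i) then Pointer cells.(i) else Bitmap cells.(i)

(* Copy the tree rooted at [addr] in [src] into [dst], returning the
   address of the copied pair. Reserving the pair before copying the
   children keeps the root at address 0, as the freelist scheme requires. *)
let rec copy src dst addr =
  let (cells, ptrs, next) = dst in
  let base = !next in
  next := base + 2;
  let copy_child is_left =
    let i = base + (if is_left then 0 else 1) in
    match get_node src addr is_left with
    | Bitmap b -> cells.(i) <- b; ptrs.(i) <- false
    | Pointer a -> cells.(i) <- copy src dst a; ptrs.(i) <- true
  in
  copy_child true;
  copy_child false;
  base

let () =
  (* the 00001100 tree, fragmented: the live right-hand pair sits at
     address 4 with an unused pair at address 2 *)
  let src = ([| 0; 4; 0; 0; 1; 0 |],
             [| false; true; false; false; false; false |]) in
  let dcells = Array.make 4 0 and dptrs = Array.make 4 false in
  let root = copy src (dcells, dptrs, ref 0) 0 in
  assert (root = 0);
  (* compacted back to the representation shown earlier *)
  assert (dcells = [| 0; 2; 1; 0 |]);
  assert (dptrs = [| false; true; false; false |])
```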

With the freelist our modified example now looks like this:

```ocaml
{ length = 8
; layers = 3
; array = [| 0; 1; 0; 0 |]
; pointers = Widemap.of_string "0000"
; free = 2 }
```

With the representation sorted the rest of the code more or less writes itself.

The only difficulty lies in choosing the width of the bitmaps used. Using smaller bitmaps increases the granularity of the binmap, allowing better compression by compacting more nodes. Using larger bitmaps increases the size of the pointers, allowing larger binmaps to be represented. I've written the binmap code to be width-agnostic; it can easily be made into a functor of the bitmap module.
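The functor version might look something like this sketch. The `BITMAP` signature is hypothetical, distilled from the operations the code above uses, and `Bitmap8` is a toy instantiation for illustration:

```ocaml
(* hypothetical signature for a fixed-width bitmap module *)
module type BITMAP = sig
  type t
  val width : int              (* bits represented per leaf *)
  val of_int : int -> t
  val to_int : t -> int
end

(* a toy 8-bit instantiation *)
module Bitmap8 : BITMAP = struct
  type t = int
  let width = 8
  let of_int i = i land 0xff
  let to_int t = t
end

(* the binmap code, parameterised over the bitmap width *)
module Binmap (B : BITMAP) = struct
  type node =
    | Leaf of B.t
    | Pointer of int

  let leaf i = Leaf (B.of_int i)
end

module M = Binmap (Bitmap8)

let () =
  match M.leaf 0xff with
  | M.Leaf b -> assert (Bitmap8.to_int b = 0xff)
  | M.Pointer _ -> assert false
```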

The paper linked below suggests using a layered address scheme to expand the effective pointer size, where the first bit of the pointer is a flag indicating which layer the address is in. I would suggest that rather than putting the flag in the pointer it would be simpler to use information implicit in the structure of the tree, e.g. whether the current layer mod 8 = 0. Either way, this hugely increases the size of the address space at the cost of a little extra complexity.

The original version is here and my version is here. This is just an experiment so far, I certainly wouldn’t suggest using it without some serious testing.

Overall I'm not sure how useful this particular data structure is, but this method of compacting tree-like types in OCaml is certainly interesting. I suspect it could be at least partially automated.

The main data structure looks like this:

```ocaml
type 'a t =
  { latexs : Latex.t DynArray.t
  ; opaques : 'a DynArray.t
  ; deleted : bool DynArray.t
  ; mutable next_id : id
  ; mutable array : (id * pos) array
  ; mutable unsorted : ('a * Latex.t) list }
```

The array field is responsible for the vast majority of the memory usage. Each cell in the array contains a pointer to a tuple containing two integers for a total of 4 words per suffix. The types id and pos are both small integers so if we pack them into a single unboxed integer we can reduce this to 1 word per suffix. We have a new module suffix.ml with some simple bit-munging:

```ocaml
type id = int
type pos = int

type t = int

let pack_size = (Sys.word_size / 2) - 1
let max_size = 1 lsl pack_size

exception Invalid_suffix of id * pos

let pack (id, pos) =
  if (id < 0) || (id >= max_size) || (pos < 0) || (pos >= max_size)
  then raise (Invalid_suffix (id, pos))
  else pos lor (id lsl pack_size)

let unpack suffix =
  let id = suffix lsr pack_size in
  let pos = suffix land (max_size - 1) in
  (id, pos)
```

Notice how confusing infix functions are in OCaml.
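A quick sanity check of the packing scheme, reproducing the definitions above so the snippet stands alone:

```ocaml
(* same packing scheme as above: pos in the low bits, id in the high bits *)
let pack_size = (Sys.word_size / 2) - 1
let max_size = 1 lsl pack_size

exception Invalid_suffix of int * int

let pack (id, pos) =
  if (id < 0) || (id >= max_size) || (pos < 0) || (pos >= max_size)
  then raise (Invalid_suffix (id, pos))
  else pos lor (id lsl pack_size)

let unpack suffix =
  (suffix lsr pack_size, suffix land (max_size - 1))

let () =
  (* the round trip preserves both components *)
  assert (unpack (pack (123, 456)) = (123, 456));
  (* even the largest packed suffix fits in an unboxed OCaml int *)
  assert (pack (max_size - 1, max_size - 1) <= max_int)
```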

The suffix array type becomes:

```ocaml
type 'a t =
  { latexs : Latex.t DynArray.t
  ; opaques : 'a DynArray.t
  ; deleted : bool DynArray.t
  ; mutable next_id : id
  ; mutable array : Suffix.t array
  ; mutable unsorted : ('a * Latex.t) list }
```

With this change the memory usage drops to 1.4GB. The mean search time also improves; it seems that having fewer cache misses makes up for the extra computation involved in unpacking the suffixes.

Now that the array field is a single block it is easy to move it out of the heap entirely so the gc never has to scan it.

```ocaml
let ancientify sa =
  sa.array <- Ancient.follow (Ancient.mark sa.array);
  Gc.full_major ()
```

This eliminates the annoyingly noticeable gc pauses.

Intuitively, when searching within LaTeX content we want results that represent the same formulae as the search term. Unfortunately LaTeX presents plenty of opportunities for obfuscating content with macros, presentation commands and just plain weird lexing.

Texsearch uses PlasTeX to parse LaTeX formulae and expand macros. The preprocessor then discards any LaTeX elements which relate to presentation rather than content (font, weight, colouring etc). The remaining LaTeX elements are each hashed into a 63 bit integer. This massively reduces memory consumption, allowing the entire corpus and search index to be held in RAM. Collisions should be rare given that there are far fewer than 2^{63} possible elements.
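One way to get such a hash in OCaml is to truncate a cryptographic digest so it fits in an unboxed native int. This is an illustrative sketch, not the actual texsearch hashing code:

```ocaml
(* Hash an arbitrary string into a non-negative OCaml int by truncating
   its 16-byte MD5 digest. Masking with max_int keeps the result within
   the unboxed int range. Illustrative only, not texsearch's hash. *)
let hash63 (s : string) : int =
  let d = Digest.string s in
  let v = ref 0 in
  for i = 0 to 7 do
    v := (!v lsl 8) lor Char.code d.[i]
  done;
  !v land max_int

let () =
  (* deterministic, and always in the unboxed int range *)
  assert (hash63 "\\alpha" = hash63 "\\alpha");
  assert (hash63 "\\alpha" >= 0 && hash63 "\\alpha" <= max_int)
```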

At the core of texsearch is a search algorithm which performs approximate searches over the search corpus. Specifically, given a search term S and a search radius R we want to return all corpus terms T such that the Levenshtein distance between S and some substring of T is less than R. This is a common problem in bioinformatics and NLP and there is a substantial amount of research on how to solve this efficiently. I have been through a range of different algorithms in previous iterations of texsearch and have only recently achieved reasonable performance (mean search time is now ~300ms for a corpus of 1.5m documents). The code is available here.

We define the distance from latexL to latexR as the minimum Levenshtein distance between latexL and any substring of latexR. With this definition we can specify the result of the search algorithm more simply: return all corpus terms within distance R of S.

```ocaml
let distance latexL latexR =
  let maxl, maxr = Array.length latexL, Array.length latexR in
  if maxl = 0 then 0 else
  if maxr = 0 then maxl else
  (* cache.(l).(r) is the distance between latexL[l to maxl] and latexR[r to maxr] *)
  let cache = Array.make_matrix (maxl + 1) (maxr + 1) 0 in
  (* Must match everything on the left *)
  for l = maxl - 1 downto 0 do
    cache.(l).(maxr) <- 1 + cache.(l+1).(maxr)
  done;
  (* General matching *)
  for l = maxl - 1 downto 1 do
    for r = maxr - 1 downto 0 do
      cache.(l).(r) <-
        minimum
          (1 + cache.(l).(r+1))
          (1 + cache.(l+1).(r))
          ((abs (compare latexL.(l) latexR.(r))) + cache.(l+1).(r+1))
    done
  done;
  (* Non-matches on the right don't count until the left starts matching *)
  for r = maxr - 1 downto 0 do
    cache.(0).(r) <-
      minimum
        (cache.(0).(r+1))
        (1 + cache.(1).(r))
        ((abs (compare latexL.(0) latexR.(r))) + cache.(1).(r+1))
  done;
  cache.(0).(0)
```

The search algorithm is built around a suffix array presenting the following interface:

```ocaml
type 'a t

val create : unit -> 'a t
val add : 'a t -> ('a * Latex.t) list -> unit
val prepare : 'a t -> unit

val delete : 'a t -> ('a -> bool) -> unit

val find_exact : 'a t -> Latex.t -> (int * 'a) list
val find_approx : 'a t -> float -> Latex.t -> (int * 'a) list
val find_query : 'a t -> float -> Query.t -> (int * 'a) list
```

The data structure is pretty straightforward. We store the LaTeX terms in a DynArray and represent suffixes by a pair of pointers: the first into the DynArray and the second into the LaTeX term itself. Each LaTeX term is matched to an opaque object which is used by the consumer of this module to identify the terms.

```ocaml
type id = int
type pos = int

type 'a t =
  { latexs : Latex.t DynArray.t
  ; opaques : 'a DynArray.t
  ; mutable next_id : id
  ; mutable array : (id * pos) array
  ; mutable unsorted : ('a * Latex.t) list }

let create () =
  { latexs = DynArray.create ()
  ; opaques = DynArray.create ()
  ; next_id = 0
  ; array = Array.make 0 (0, 0)
  ; unsorted = [] }
```

The suffix array is built in a completely naive way. We just throw all the suffixes into a list and sort it. There are much more efficient methods known but this is fast enough, especially since we do updates offline. The building is separated into two functions to make incremental updates easier.

```ocaml
let add sa latexs =
  sa.unsorted <- latexs @ sa.unsorted

let insert sa (opaque, latex) =
  let id = sa.next_id in
  sa.next_id <- id + 1;
  DynArray.add sa.opaques opaque;
  DynArray.add sa.latexs latex;
  id

let prepare sa =
  let ids = List.map (insert sa) sa.unsorted in
  let new_suffixes = Util.concat_map (suffixes sa) ids in
  let cmp = compare_suffix sa in
  let array =
    Array.of_list
      (List.merge cmp
        (List.fast_sort cmp new_suffixes)
        (Array.to_list sa.array)) in
  sa.unsorted <- [];
  sa.array <- array
```

So now we have a sorted array of suffixes of all our corpus terms. If we want to find all exact matches for a given search term we just do a binary search to find the first matching suffix and then scan through the array until the last matching suffix. For reasons that will make more sense later, we divide this into two stages. Most of the work is done in gather_exact (better name, anyone?), where we perform the search and dump the resulting LaTeX term ids into a HashSet. Then find_exact runs through the HashSet and looks up the matching opaques.

```ocaml
(* binary search *)
let gather_exact ids sa latex =
  (* find beginning of region *)
  (* lo < latex *)
  (* hi >= latex *)
  let rec narrow lo hi =
    let mid = lo + ((hi - lo) / 2) in
    if lo = mid then hi else
    if leq sa latex sa.array.(mid)
    then narrow lo mid
    else narrow mid hi in
  let n = Array.length sa.array in
  let rec traverse index =
    if index >= n then () else
    let (id, pos) = sa.array.(index) in
    if is_prefix sa latex (id, pos)
    then begin
      Hashset.add ids id;
      traverse (index + 1)
    end
    else () in
  traverse (narrow (-1) (n - 1))

let exact_match sa id =
  (0, DynArray.get sa.opaques id)

let find_exact sa latex =
  let ids = Hashset.create 0 in
  gather_exact ids sa latex;
  List.map (exact_match sa) (Hashset.to_list ids)
```

Now for the clever part – approximate search. First, convince yourself of the following. Suppose the distance from our search term S to some corpus term T is strictly less than the search radius R. Then if we split S into R pieces at least one of those pieces must match a substring of T exactly. So our approximate search algorithm is to perform exact searches for each of the R pieces and then calculate the distance to each of the results. Notice the similarity in structure to the previous algorithm. You can also see now that the exact search is split into two functions so that we can reuse gather_exact.
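The pigeonhole argument can be checked concretely on a small example, using plain strings in place of Latex.t. `split_into`, `levenshtein` and `is_substring` are illustrative helpers, not part of texsearch:

```ocaml
(* cut s into r roughly equal contiguous pieces *)
let split_into r s =
  let n = String.length s in
  List.init r (fun i ->
    let lo = i * n / r and hi = (i + 1) * n / r in
    String.sub s lo (hi - lo))

(* standard Levenshtein distance, for checking the claim *)
let levenshtein a b =
  let la = String.length a and lb = String.length b in
  let d = Array.make_matrix (la + 1) (lb + 1) 0 in
  for i = 0 to la do d.(i).(0) <- i done;
  for j = 0 to lb do d.(0).(j) <- j done;
  for i = 1 to la do
    for j = 1 to lb do
      let cost = if a.[i-1] = b.[j-1] then 0 else 1 in
      d.(i).(j) <- min (min (d.(i-1).(j) + 1) (d.(i).(j-1) + 1))
                       (d.(i-1).(j-1) + cost)
    done
  done;
  d.(la).(lb)

(* naive substring test *)
let is_substring t piece =
  let n = String.length t and m = String.length piece in
  let rec go i = i + m <= n && (String.sub t i m = piece || go (i + 1)) in
  m = 0 || go 0

let () =
  let s = "abcdef" and t = "abXdef" in
  assert (levenshtein s t = 1);
  (* distance 1 is below the radius 2, so at most one edit touches s:
     splitting s into 2 pieces must leave one piece intact in t *)
  assert (List.exists (is_substring t) (split_into 2 s))
```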

```ocaml
let gather_approx sa precision latex =
  let k = Latex.cutoff precision latex in
  let ids = Hashset.create 0 in
  List.iter (gather_exact ids sa) (Latex.fragments latex k);
  ids

let approx_match sa precision latexL id =
  let latexR = DynArray.get sa.latexs id in
  match Latex.similar precision latexL latexR with
  | Some dist ->
    let opaque = DynArray.get sa.opaques id in
    Some (dist, opaque)
  | None ->
    None

let find_approx sa precision latex =
  let ids = gather_approx sa precision latex in
  Util.filter_map (approx_match sa precision latex) (Hashset.to_list ids)
```

We can also extend this to allow boolean queries.

```ocaml
let rec gather_query sa precision query =
  match query with
  | Query.Latex (latex, _) -> gather_approx sa precision latex
  | Query.And (query1, query2) ->
    Hashset.inter
      (gather_query sa precision query1)
      (gather_query sa precision query2)
  | Query.Or (query1, query2) ->
    Hashset.union
      (gather_query sa precision query1)
      (gather_query sa precision query2)

let query_match sa precision query id =
  let latexR = DynArray.get sa.latexs id in
  match Query.similar precision query latexR with
  | Some dist ->
    let opaque = DynArray.get sa.opaques id in
    Some (dist, opaque)
  | None ->
    None

let find_query sa precision query =
  let ids = gather_query sa precision query in
  Util.filter_map (query_match sa precision query) (Hashset.to_list ids)
```

This is a lot simpler than my previous approach, which required some uncomfortable reasoning about overlapping regions in quasi-metric spaces.

It is instructive to compare texsearch with other math search engines. Texsearch is effectively a brute-force solution that gave us an ok search engine with minimal risk. It has minimal understanding of LaTeX and no understanding of the structure of the formulae it searches. Uniquation accepts only a small (but widely used) subset of LaTeX but understands the structure of the equation itself and so can infer scope and perform variable substitution when searching. I am not sure yet how much content they are indexing or how well they handle searching within full LaTeX content, but hopefully this approach can scale up to large corpora. Hoogle is a search engine for Haskell types which can handle even more sophisticated equivalences than Uniquation thanks to its specialised domain. ArXMLiv is developing tools for inferring semantic information from LaTeX content in order to convert it to Semantic MathML, which is much easier for search engines to handle.

So, in summary, LaTeX is a pain in the ass.
