Pattern matching magic.

I’m playing around with functional programming and decided to give Elixir a try. In order to get familiar with the language I decided to solve some exercises from a great source called Exercism. They support several languages including Ruby, C++, F#, and many others. This post will cover part of my submission to an exercise called Nucleotide Count.

Starting from the basics

1
2
3
4
defmodule NucleotideCount do
  @nucleotides [?A, ?C, ?G, ?T]
  ...
end

It’s funny when you first dive into a language and start asking yourself “whatheheck are these symbols for?”. I first assumed that @nucleotides was some kind of global variable from the module NucleotideCount and that was correct, there are also some reserved @ words like @docs, so far so good. What about ?A and the other question-mark prefixed letters? The ? operator (I’m not sure if it’s truly an operator but let’s move on) gives you the integer representation of the character right next to it:

1
2
3
4
iex(1)> ?A
65
iex(2)> to_string [65]
"A"

Something that still intrigues me is the fact that you have to pass 65 as an List ([65]) to to_string in order to get "A" back, if instead we simply pass 65, the function returns "65". Weird…

Now to the first part of the problem. Given a char list (DNA strand) like 'AAGCTA', I would like to count how many occurrences it has of a given char (nucleotide). Pretty straightforward, right? In ruby that could be done with something like "AAGCTA".scan(/A/).count, but we’re not programming in Ruby, are we?

Iterating in a list (without .each to help us out).

In Elixir [1,2,3] and 'ABC' are linked lists (the first contains integer and the later is a list of chars 'A','B','C') and they have a head (first element) and a tail (all the rest).

1
2
3
4
5
6
iex(1)> [head|tail] = 'ABC'
'ABC'
iex(2)> head
65 <-- remember this is the integer equivalent to 'A'
iex(3)> tail
'BC'

Cool, so what? If we add this and a bit of recursion (a function calling itself) we can navigate in a list.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def count(strand, nucleotide) do
  _count(strand, nucleotide, 0)
end

def _count([head|tail], nucleotide, accumulator) do
  if head == nucleotide do
    _count(tail, nucleotide, accumulator + 1)
  else
    _count(tail, nucleotide, accumulator)
  end
end

def _count([], nucleotide, accumulator) do
  accumulator
end

Let’s break this code into pieces. First we notice there are two functions with the same name _count and receiving both the same number of arguments, how is this possible? When you call a function, Elixir will pattern match the arguments into one of the functions defined. That is, if you call _count passing an empty list as the first parameter, the _count([], ...) will be executed, otherwise the other definition, _count([head|tail], ...), will be called.

Also notice that in the first signature, _count([head|tail], ...), we not only expect a list, we expect a list that has a head and a tail. And more, head and tail will be assigned exactly that, the first element of the list and the rest, respectively.

Now to the recursion part. As long as the list has a head, _count will keep calling itself passing as parameter the list minus the first element (aka tail). If the head of the list is equal to the nucleotide we’re looking for, we call _count again but this time we increment the accumulator by 1. So, are we going to call _count foreva?, you ask. We would, if we didn’t have what’s called a guard clause. In our case, our guard clause is the _count([]...) which simply returns the value of accumulator, thus breaking the chain of calling _count over and over.

Demo time \o/

I added some IO.puts around so we can inspect what’s inside the variables during run time.

Recursion in action Elixir.

One last thing ™

I wasn’t completely happy with that conditional if head == nucleotide. Then I had one of those What if… moments and tried the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def count(strand, nucleotide) do
  _count(strand, nucleotide, 0)
end

def _count([head|tail], nucleotide = head, acumulator) do
  _count(tail, nucleotide, acumulator + 1)
end

def _count([_|tail], nucleotide, acumulator) do
  _count(tail, nucleotide, acumulator)
end

def _count([], _, acumulator) do
  acumulator
end

Did you notice what’s happening here? We have introduced (yet) one more _count function, this time _count([head|tail], nucleotide = head, ...). Yes, this function will only be called if:

  1. first parameter is not an empty list
  2. the nucleotide is equal to the first element of the list (head)

Isn’t it awesome? We embedded the conditional checking right into the functions signatures!

Update 09 Nov 2016

@dcarral is a awesome colleague from Babbel and he was patient enough to review this post. And here are his suggestions (apart from typos):

  • Maybe the usage of the term guard clause is not correct in the context I’ve mentioned. Oftentimes they are represented by some kind of conditional inside the function itself rather than outside. I still believe this is just an implementation detail and the idea of guard clause is to “guard” you from some undesirable result (in case of recursion, never leaving the recursion chain).
  • A link to pattern matching from Elixir’s documentation. I don’t know if pattern matching is the exact term for what I meant (maybe function matching is a better definition).
  • A much shorter way to count occurrences in ruby would be "AAGCTA".count("A") (well done 👍)

Update 15 Nov 2016

@gabrielelana was really kind reviewing my submission on Exercism and I’ve made two changes:

  • use the same name for the “public” and “private” functions count (instead of _count);
  • instead of _count([head|tail], nucleotide = head, acumulator) we can define the function as _count([nucleotide|tail], nucleotide, acumulator). This is very interesting for me. The function will only be called if the head of the list (in this case named nucleotide) is equal to the second parameter (also called nucleotide).

Conclusion

Later on I’ve checked others’ solutions and they mostly use Enum.reduce, Enum.count or something similar. But since I don’t have any knowledge of the built in modules, I’m quite happy with that approach (using no more than functions). There’s the drawback of somewhat polluting the module’s implementation with several functions with apparently the same signature (_count) but I think this is a minor issue when we look on how simple they are to read and (even better) to test.

Are you banging your head against functional programming as well? Have you tried Elixir? Do you find this whole thing bullshit? Let me know what you think!