Making a JSON Parser in Haskell

By Noam Yadgar
#haskell #functional-programming #software-engineering

This article is different from my usual. I’m inviting you to join me in writing a JSON parser in Haskell. We’ll review the JSON RFC standards and create a Haskell data model for valid JSON. For parsing, I challenged myself to avoid consulting other implementations, believing such challenges enhance problem-solving skills.

Modeling JSON

To create a JSON parser, we must first model the available JSON values.

Literal Values

We begin with the basics. As per the RFC, three literal names are recognized as valid JSON values:

true / false / null

In Haskell, these values can be represented as type constructors:

data JsonValue
  = JsonNull
  | JsonBoolean Bool

The document also states that JSON values can be of the following types (I’ve condensed the original definitions):

String

Per the RFC, strings follow conventions similar to string literals in the C family of programming languages, and documents are encoded in UTF-8.

string = quotation-mark *char quotation-mark

quotation-mark = "
char = unescaped / escape (
        "    quotation mark  U+0022
        \    reverse solidus U+005C
        /    solidus         U+002F
        %x62 backspace       U+0008
        %x66 form feed       U+000C
        %x6E line feed       U+000A
        %x72 carriage return U+000D
        %x74 tab             U+0009
        %x75 4HEXDIG uXXXX
)
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
escape = \

Number

The definition of a JSON number is as follows:

number = [ minus ] int [ frac ] [ exp ]

decimal-point = .
digit1-9 = 1-9
e = e / E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = -
plus = +
zero = 0

To represent strings and numbers, I will include two additional constructors:

import qualified Data.Text as T 
import Data.Scientific (Scientific)

data JsonValue
  = JsonNull
  | JsonBoolean Bool
  | JsonString T.Text
  | JsonNumber Scientific

I could have used String for JsonString and Double for JsonNumber; however, the alternative types are better in terms of performance and accuracy.
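A quick way to see the accuracy point: with a huge exponent, Double overflows to Infinity, while Scientific (which stores a coefficient and a base-10 exponent) keeps the exact value. A small sketch:

```haskell
import Data.Scientific (Scientific)

-- The same literal, read at two different types
huge :: String
huge = "1.0e400"

-- Double's maximum is roughly 1.8e308, so this overflows to Infinity
asDouble :: Double
asDouble = read huge

-- Scientific keeps coefficient and exponent exactly
asScientific :: Scientific
asScientific = read huge
```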

Object

The object definition is particularly interesting:

object = begin-object [ member *( value-separator member ) ] end-object

begin-object = {
end-object = }
value-separator = ,
member = string name-separator value
name-separator = :

If we look closely at the definition of a member, we will see that it’s a pair of string and value, separated by a name-separator (:). According to the RFC, the definition of a value is:

value = false / null / true / object / array / number / string

An object is a value that contains a collection of 0 or more members, separated by value-separator (,), and enclosed with {}. Haskell is very good at recursive types and we can model an object as follows:

import qualified Data.Map as Map

data JsonValue
  = JsonNull -- Other constructors ...
  | JsonObject (Map.Map T.Text JsonValue)

Array

The last value we model is the array.

array = begin-array [ value *( value-separator value ) ] end-array

begin-array = [
end-array = ]

Like the object, the array type is also recursive, and we can represent it as a list of JsonValue.

Complete Model

Here is my complete JSON data model:

module Json where

import qualified Data.Text as T
import qualified Data.Map as Map
import Data.Scientific (Scientific)

data JsonValue
  = JsonNull
  | JsonBoolean Bool
  | JsonString T.Text
  | JsonNumber Scientific
  | JsonObject (Map.Map T.Text JsonValue)
  | JsonArray [JsonValue]
  deriving (Eq)

-- for pretty print
instance Show JsonValue where 
    -- omitted from this example

objectLookup :: T.Text -> JsonValue -> Maybe JsonValue
objectLookup -- omitted from this example ...

This data type can represent any JSON value.
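As a quick illustration (re-declaring the type locally with a derived Show for brevity), the document {"xs": [1, true]} maps onto the model like this:

```haskell
import qualified Data.Map as Map
import qualified Data.Text as T
import Data.Scientific (Scientific)

data JsonValue
  = JsonNull
  | JsonBoolean Bool
  | JsonString T.Text
  | JsonNumber Scientific
  | JsonObject (Map.Map T.Text JsonValue)
  | JsonArray [JsonValue]
  deriving (Eq, Show)

-- The document {"xs": [1, true]} as a JsonValue
sample :: JsonValue
sample =
  JsonObject
    (Map.fromList
       [(T.pack "xs", JsonArray [JsonNumber 1, JsonBoolean True])])
```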

Tokenization

A JSON document is simply a string of UTF-8 characters, and we need to write a program to interpret it. The RFC document specifies six structural characters (: , { } [ ]) that define token boundaries. We can use these to create a function that takes the raw text and returns a tuple with the next token and the remaining text.

I aim for a parser with linear time complexity \(O(n)\), where \(n\) is the number of characters in the raw content we analyze. My approach is to develop a parser that reads, parses, and combines tokens as it processes the content stream from a given JSON file.

If I pass a stream like { "foo": true } to my token-reading function and keep reading tokens until the end of the stream, I expect to collect the tokens into a list as follows:

>> :{
>> tokenize txt tokens = if TL.null txt then reverse tokens
>>   else let (token, rest) = readToken txt in
>>    tokenize rest (token : tokens)
>> :}

>> tokenize (TL.pack "{\"foo\":true}") []
["{","\"foo\"",":","true","}"]

Here’s my implementation:

{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isSpace)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

readToken :: TL.Text -> (T.Text, TL.Text)
readToken txt = 
  let stream = TL.dropWhile isSpace txt
   in case TL.uncons stream of
        Nothing -> (T.empty, TL.empty)
        Just (c, cs)
          | isStructToken c -> (T.singleton c, cs)
          | c == '"' -> 
              let (tokenBody, rest) = readStringBody cs
               in (T.concat ["\"", TL.toStrict tokenBody, "\""], rest)
          | otherwise -> 
              let (token, rest) = TL.span (\x -> not (isSpace x || isStructToken x)) stream
               in (TL.toStrict token, rest)

isStructToken :: Char -> Bool
isStructToken c = c `elem` ['{', '}', '[', ']', ',', ':']

readStringBody :: TL.Text -> (TL.Text, TL.Text)
readStringBody = f False
  where
    f escaped stream = 
      case TL.uncons stream of
        Nothing -> (stream, TL.empty)
        Just (c, cs) -> 
          if c == '"' && not escaped
            then (TL.empty, cs)
            else 
              let (body, rest) = f (c == '\\' && not escaped) cs
               in (TL.cons c body, rest)

There is a trade-off between memory and CPU. The readToken function reads a Data.Text.Lazy.Text stream, allowing file reading in chunks, which is more memory-efficient but requires more I/O, possibly slowing the process and increasing CPU use. I chose to prioritize memory efficiency, processing JSON data as read.
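Assembled into one self-contained module (together with the tokenize driver from the GHCi session above, here filtering out empty tokens), the tokenizer can be checked against input containing an escaped quote:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isSpace)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

readToken :: TL.Text -> (T.Text, TL.Text)
readToken txt =
  let stream = TL.dropWhile isSpace txt
   in case TL.uncons stream of
        Nothing -> (T.empty, TL.empty)
        Just (c, cs)
          | isStructToken c -> (T.singleton c, cs)
          | c == '"' ->
              let (tokenBody, rest) = readStringBody cs
               in (T.concat ["\"", TL.toStrict tokenBody, "\""], rest)
          | otherwise ->
              let (token, rest) = TL.span (\x -> not (isSpace x || isStructToken x)) stream
               in (TL.toStrict token, rest)

isStructToken :: Char -> Bool
isStructToken c = c `elem` ['{', '}', '[', ']', ',', ':']

readStringBody :: TL.Text -> (TL.Text, TL.Text)
readStringBody = f False
  where
    f escaped stream =
      case TL.uncons stream of
        Nothing -> (stream, TL.empty)
        Just (c, cs) ->
          if c == '"' && not escaped
            then (TL.empty, cs)
            else
              let (body, rest) = f (c == '\\' && not escaped) cs
               in (TL.cons c body, rest)

-- Drive readToken to the end of the stream, dropping empty tokens
tokenize :: TL.Text -> [T.Text]
tokenize txt
  | TL.null txt = []
  | otherwise =
      let (tok, rest) = readToken txt
       in if T.null tok then tokenize rest else tok : tokenize rest
```

Note that readStringBody keeps the backslash in place, so an escaped quote survives inside the string token.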

Associativity (Semigroup)

The associative property states that how items are grouped in a binary expression does not change the result, as long as the order of the operands stays the same.

\[(a \oplus b) \oplus c = a \oplus (b \oplus c)\]

Since I am interested in creating a process that parses tokens one by one, associativity is very useful. The difficulty of parsing tokens sequentially is that, for a current token, we don’t always have enough context to decide how two tokens should combine. If we could define associativity for our parser, we could read, parse, and merge tokens step by step. For example, here is a valid JSON object split into tokens, separated by ‘+’ to show token merging:

{ + "foo" + : + [ + 1 + , + 2 + ] + }

How and where do we begin? What is the result of { + "foo" or : + [? The challenge is to create a parser flexible enough to handle intermediate states, ensuring that regardless of the evaluation order, the outcome remains consistent.

Semigroup

In Haskell, the Semigroup class is used to specify associativity for a data type. It requires us to define the associative operation (<>). Let’s start by defining a JsonParser data type, with constructors like Parsed JsonValue, NameSeparator (for the ':' token), and member-related constructors to represent intermediate states.

data JsonParser
  = MemberKey T.Text
  | MemberValue JsonValue
  | Member T.Text JsonValue
  | NameSeparator
  | Parsed JsonValue

Now we can start applying the pattern-matching rules of associativity.

instance Semigroup JsonParser where
  (<>) (Parsed (JsonString k)) NameSeparator = MemberKey k
  (<>) (MemberKey k) (Parsed v) = Member k v

  (<>) NameSeparator (Parsed v) = MemberValue v
  (<>) (Parsed (JsonString k)) (MemberValue v) = Member k v

Note: these are incomplete definitions.

The Empty Member (Monoid)

To help us achieve true associativity, we introduce an empty member for our JsonParser data type. The empty member (similar to 0 in addition and 1 in multiplication) serves as an essential placeholder. It has a special property: it does not affect the result of the operation when combined with another member. Let’s add it to our parser and define its associative behavior.

data JsonParser
  = Void
  -- Other constructors ...

instance Semigroup JsonParser where
  (<>) Void p = p 
  (<>) p Void = p
  -- Other pattern ...

Why is it Important?

Imagine concatenating these three tokens from left to right:

"foo"  :  [

In our Haskell code, "foo" is a Parsed (JsonString T.Text) and ':' is a NameSeparator. One rule of our parser is:

(<>) (Parsed (JsonString k)) NameSeparator = MemberKey k

"foo" combined with ':' becomes a MemberKey T.Text. Eventually, we want this MemberKey T.Text to combine with a value to form a complete Member T.Text JsonValue. This is not possible by simply combining a MemberKey T.Text with '['. The solution is to carry another JsonParser value with the constructor. Let’s modify the type.

data JsonParser
  = Void
  | MemberKey T.Text JsonParser
  -- Other constructors ...

Now we can write a much more general rule for combining parsed tokens with a MemberKey T.Text JsonParser value.

instance Semigroup JsonParser where
  (<>) (MemberKey k p) q = case p <> q of 
        Parsed v -> Member k v
        s -> MemberKey k s
  -- Other patterns ...

We promote a MemberKey T.Text JsonParser to a Member T.Text JsonValue only when the next operation makes the carried JsonParser value a Parsed JsonValue.

Monoid

We use the Void value as the base placeholder for the carried JsonParser. For example, we should always read the ':' token as a NameSeparator Void. Formally, a Semigroup with an empty member is a Monoid, and in Haskell we can define a Monoid instance for our JsonParser:

instance Monoid JsonParser where
  mempty = Void

The mathematical definition of a monoid is a set equipped with an associative binary operation and an identity member.

The Parser

Here is the complete definition of my JsonParser. This data type represents all possible tokens and values, as well as intermediate states.

data JsonParser
  = Void
  | Error
  | BeginObject (Map.Map T.Text JsonValue) JsonParser
  | EndObject JsonParser
  | BeginArray [JsonValue] JsonParser
  | EndArray JsonParser
  | ValueSeperator JsonParser
  | NameSeparator JsonParser
  | MemberKey T.Text JsonParser
  | MemberValue JsonValue JsonParser
  | Member T.Text JsonValue JsonParser
  | Parsed JsonValue JsonParser
  deriving (Show, Eq)

Complete Semigroup

The most challenging part of this journey is to define all possible token patterns and achieve full associativity for the JsonParser type. This is essentially the core logic of our parser. Once completed, the final parsing process can be defined. Here is my complete definition of a Semigroup JsonParser:

instance Semigroup JsonParser where
  -- Void and Error
  (<>) Void p = p
  (<>) p Void = p
  (<>) Error _ = Error
  (<>) _ Error = Error
  -- Object
  (<>) (BeginObject ms p) q = case p <> q of
    Member k v (ValueSeperator c) -> BeginObject (Map.insert k v ms) c
    Member k v (EndObject c) -> Parsed (JsonObject (Map.insert k v ms)) c
    EndObject c -> Parsed (JsonObject ms) c
    s -> BeginObject ms s
  (<>) (Member k v p) q = Member k v (p <> q)
  (<>) (MemberValue v p) q = MemberValue v (p <> q)
  (<>) (Parsed (JsonString k) p) q = case p <> q of
    NameSeparator c -> MemberKey k c
    MemberValue v c -> Member k v c
    s -> Parsed (JsonString k) s
  (<>) (MemberKey k p) q = case p <> q of
    Parsed v c -> Member k v c
    s -> MemberKey k s
  (<>) (NameSeparator p) q = case p <> q of
    Parsed v c -> MemberValue v c
    s -> NameSeparator s
  -- Arrays
  (<>) (BeginArray elms p) q = case p <> q of
    Parsed elm (ValueSeperator c) -> BeginArray (elm : elms) c
    Parsed elm (EndArray c) -> Parsed (JsonArray $ reverse (elm : elms)) c
    EndArray c -> Parsed (JsonArray $ reverse elms) c
    s -> BeginArray elms s
  -- Values
  (<>) (ValueSeperator p) q = ValueSeperator (p <> q)
  (<>) (EndObject p) q = EndObject (p <> q)
  (<>) (EndArray p) q = EndArray (p <> q)
  (<>) (Parsed v p) q = Parsed v (p <> q)

Parsing Tokens

The last step before we write the final parser function is to define a function that maps text tokens to JsonParser values.

import Control.Applicative ((<|>))
import Text.Read (readMaybe)
import qualified Data.Scientific as S

parseToken :: T.Text -> JsonParser
parseToken t
  | T.null t = mempty
  | t == "{" = BeginObject Map.empty mempty
  | t == "," = ValueSeperator mempty
  | t == ":" = NameSeparator mempty
  | t == "[" = BeginArray [] mempty
  | t == "]" = EndArray mempty
  | t == "}" = EndObject mempty
  | t == "null" = Parsed JsonNull mempty
  | t == "true" = Parsed (JsonBoolean True) mempty
  | t == "false" = Parsed (JsonBoolean False) mempty
  | otherwise = stringOrNum
  where
    stringOrNum = case parseJsonString t <|> parseJsonNumber t of
      Just v -> Parsed v mempty
      _ -> Error

parseJsonString :: T.Text -> Maybe JsonValue
parseJsonString token =
  if T.length token >= 2 && T.head token == '"' && T.last token == '"'
    then Just (JsonString $ T.init (T.tail token))
    else Nothing

parseJsonNumber :: T.Text -> Maybe JsonValue
parseJsonNumber token =
  case readMaybe $ T.unpack token :: Maybe S.Scientific of 
    Just x -> Just (JsonNumber x)
    Nothing -> Nothing

A Working Parser

For this parser, I went with a simple, sequential process. My parser reads a token from the input stream (a file), parses it, and combines the token with the previous one (starting from Void), using the (<>) operator. Here’s the final function:

parse :: TL.Text -> Maybe JsonValue
parse stream = case f mempty stream of
  Parsed v Void -> Just v
  _ -> Nothing
  where
    f Error _ = Error
    f p s | TL.null s = p
    f p s =
      if T.null $ fst t
        then f p rest
        else f (p <> q) rest
      where
        t = readToken s
        q = parseToken $ fst t
        rest = snd t

To test it, I’ve created a simple application that reads a JSON file and lets us pass a flag to select values from a JSON object.

./jp test.json
{
  "foo": {
    "bar": {
      "arr": [
        true,
        false,
        1.0,
        null
      ],
      "baz": "biz"
    }
  }
}

Using an -l flag, I can perform simple object lookup:

./jp test.json -l foo.bar.baz
"biz"

What’s Next?

I took on this challenge to improve my Haskell knowledge as well as my algorithmic thinking. As mentioned, I really wanted to solve this puzzle on my own, so even though it may seem like I had the entire solution all along, it actually took a few iterations. To me, the most challenging part was finding a solution for combining tokens. I tried all sorts of approaches until I ended up taking the monoidal approach. Before I conclude, here are a couple of future paths I’d like to explore.

From JSON to Free Text

The process of tokenizing and processing tokens is far more generic; it is used for things like compilers, interpreters, and, more recently, LLMs. Having a good grasp of this type of process is a very nice addition to my knowledge arsenal. I chose to start with JSON because it is simple and very well-defined; however, other cases might call for a completely different set of solutions.

For example, processing free text (for things like LLMs and NLP) is computationally more complex. First, how do you even define a token? One approach might involve splitting chunks of text by whitespace and concurrently finding the indices of words that make up each element from a giant vocabulary.
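As a toy sketch of that idea (the three-word vocabulary here is entirely hypothetical; real systems use large subword vocabularies, e.g. BPE):

```haskell
import qualified Data.Map as Map

-- Hypothetical toy vocabulary mapping words to indices
vocab :: Map.Map String Int
vocab = Map.fromList (zip ["the", "cat", "sat"] [0 ..])

-- Split on whitespace and look up each word;
-- unknown words come back as Nothing
encode :: String -> [Maybe Int]
encode = map (`Map.lookup` vocab) . words
```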

It’s All Very Similar

It’s interesting to see that you can fit parsers, compilers, interpreters, and free-text processors under the same umbrella. Their inner process might be completely different from one another, but the basic principles are similar:

  • Defining the set of tokens
  • Defining the process that tokenizes text
  • Defining the process that maps and combines tokens to create valuable context

Concurrency

I took the extra effort to define my JsonParser as a Semigroup instance. With a simple sequential parser, many of the patterns I defined are unnecessary. For example, when parsing valid JSON from left to right, we would never need the NameSeparator p <> q patterns, because a Parsed (JsonString k) p always arrives before a NameSeparator p.

So Why Bother?

True associativity opens the door for concurrent or parallel processing. To dramatically improve performance, I can write an algorithm that pairs tokens together, processes each pair asynchronously, and then pairs the results with one another. This process concurrently reduces pairs of tokens down to a single JsonValue—associativity allows this!
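A minimal sketch of that pairing strategy, generic over any Monoid (each pass of pairUp could evaluate its pairs in parallel, e.g. with Control.Parallel, which I leave out here):

```haskell
-- Tree-shaped reduction: combine adjacent pairs, then recurse.
-- For a lawful Monoid this equals foldr (<>) mempty, precisely
-- because (<>) is associative.
reduceTree :: Monoid m => [m] -> m
reduceTree []  = mempty
reduceTree [x] = x
reduceTree xs  = reduceTree (pairUp xs)
  where
    pairUp (a : b : rest) = (a <> b) : pairUp rest
    pairUp rest           = rest
```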

flowchart TD
    ob(("{"))
    foo(("''foo''"))
    sep((:))
    true(("true"))
    eobj(("}"))
    e2(("}"))
    e3(("}"))
    v((void))
    v2((void))

    of(("{''foo''"))
    st((":true"))

    ofst(("{''foo'':true"))

    all(("{''foo'':true}"))

    ob --> of 
    foo --> of

    sep --> st 
    true --> st

    eobj --> e2 
    v --> e2

    e2 --> e3 
    v2 --> e3

    of --> ofst
    st --> ofst

    ofst --> all
    e3 --> all

Thank you for reading, I hope you learned something new. See you next time.