Making a JSON Parser in Haskell
By Noam Yadgar

This article is different from my usual. I’m inviting you to join me in writing a JSON parser in Haskell. We’ll review the JSON grammar from the RFC and create a Haskell data model for valid JSON. For the parsing itself, I challenged myself to avoid consulting other implementations, believing that such challenges sharpen problem-solving skills.
Modeling JSON
To create a JSON parser, we must first model the available JSON values.
Literal Values
We begin with the basics. As per the RFC, three literal names are recognized as valid JSON values:
true / false / null
In Haskell, these values can be represented as type constructors:
data JsonValue
= JsonNull
| JsonBoolean Bool
The document also states that JSON values can be of the following types (I’ve condensed the original definitions):
String
A JSON string is written much like a string literal in the C family of programming languages.
string = quotation-mark *char quotation-mark
quotation-mark = "
escape = \
char = unescaped / escape (
    "             quotation mark   U+0022
    \             reverse solidus  U+005C
    /             solidus          U+002F
    %x62          backspace        U+0008
    %x66          form feed        U+000C
    %x6E          line feed        U+000A
    %x72          carriage return  U+000D
    %x74          tab              U+0009
    %x75 4HEXDIG                   uXXXX
)
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
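As a quick sanity check of the unescaped rule, here is a direct transcription into a base-only predicate (isUnescaped is a name I’m introducing for illustration; it’s not part of the parser we build below):

```haskell
import Data.Char (ord)

-- unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
-- i.e. any character except the quotation mark (%x22), the
-- backslash (%x5C), and the control characters below %x20.
isUnescaped :: Char -> Bool
isUnescaped c =
  let n = ord c
  in  (n >= 0x20 && n <= 0x21)
   || (n >= 0x23 && n <= 0x5B)
   || (n >= 0x5D && n <= 0x10FFFF)

main :: IO ()
main = print (map isUnescaped ['a', '"', '\\', '\n'])
-- [True,False,False,False]
```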
Number
The definition of a JSON number is as follows:
number = [ minus ] int [ frac ] [ exp ]
decimal-point = .
digit1-9 = 1-9
e = e / E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = -
plus = +
zero = 0
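The grammar above translates almost mechanically into a validity check. This is only an illustrative sketch (isJsonNumber is my name, not part of the article’s parser, which delegates number reading to readMaybe later on):

```haskell
import Data.Char (isDigit)

-- isJsonNumber mirrors: number = [ minus ] int [ frac ] [ exp ]
-- Each helper consumes its part and returns the remaining input;
-- the whole string is valid when nothing is left over.
isJsonNumber :: String -> Bool
isJsonNumber s = maybe False null (pInt (pMinus s) >>= pFrac >>= pExp)
  where
    pMinus ('-':r) = r
    pMinus r       = r

    -- int = zero / ( digit1-9 *DIGIT )
    pInt ('0':r)            = Just r
    pInt (c:r) | isDigit c  = Just (dropWhile isDigit r)
    pInt _                  = Nothing

    -- frac = decimal-point 1*DIGIT
    pFrac ('.':c:r) | isDigit c = Just (dropWhile isDigit r)
    pFrac ('.':_)               = Nothing
    pFrac r                     = Just r

    -- exp = e [ minus / plus ] 1*DIGIT
    pExp (e:r) | e `elem` "eE" =
      case dropSign r of
        (c:r') | isDigit c -> Just (dropWhile isDigit r')
        _                  -> Nothing
    pExp r = Just r

    dropSign ('+':r) = r
    dropSign ('-':r) = r
    dropSign r       = r

main :: IO ()
main = print (map isJsonNumber ["-0.5e+3", "42", "01", "1."])
-- [True,True,False,False]
```

Note how "01" is rejected: pInt stops after the leading zero, so the second digit is left over and the maybe False null check fails.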
To represent strings and numbers, I will include two additional constructors:
import qualified Data.Text as T
import Data.Scientific (Scientific)
data JsonValue
= JsonNull
| JsonBoolean Bool
| JsonString T.Text
| JsonNumber Scientific
I could have used String for JsonString and Double for JsonNumber; however, Text is a packed representation that is more compact and faster to traverse than String (a linked list of Char), and Scientific stores decimal numbers exactly, where Double would round or overflow.
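To make the accuracy point concrete, here is what Double does with values a JSON document can legally contain (base only; Scientific avoids both problems):

```haskell
main :: IO ()
main = do
  -- Exponents beyond Double's range saturate to Infinity...
  print (read "1e400" :: Double)
  -- ...and integers above 2^53 lose precision silently:
  -- adding 1 produces the same Double.
  print ((2 ^ 53 + 1 :: Double) == 2 ^ 53)  -- True
```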
Object
The object definition is particularly interesting:
object = begin-object [ member *( value-separator member ) ] end-object
begin-object = {
end-object = }
value-separator = ,
member = string name-separator value
name-separator = :
If we look closely at the definition of a member, we will see that it’s a pair of string and value, separated by a name-separator (:). According to the RFC, the definition of a value is:
value = false / null / true / object / array / number / string
An object is a value that contains a collection of 0 or more members, separated by value-separator (,), and enclosed with {}. Haskell is very good at recursive types, and we can model an object as follows:
import qualified Data.Map as Map
data JsonValue
= -- Other constructors ...
| JsonObject (Map.Map T.Text JsonValue)
Array
The last value we model is the array.
array = begin-array [ value *( value-separator value ) ] end-array
begin-array = [
end-array = ]
Like the object, the array type is also recursive, and we can represent it as a list of JsonValue.
Complete Model
Here is my complete JSON data model:
module Json where
import qualified Data.Text as T
import qualified Data.Map as Map
import Data.Scientific (Scientific)
data JsonValue
= JsonNull
| JsonBoolean Bool
| JsonString T.Text
| JsonNumber Scientific
| JsonObject (Map.Map T.Text JsonValue)
| JsonArray [JsonValue]
deriving (Eq)
-- for pretty print
instance Show JsonValue where
-- omitted from this example
objectLookup :: T.Text -> JsonValue -> Maybe JsonValue
objectLookup -- omitted from this example ...
This data type can represent any JSON value.
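As a quick sanity check, here is how the document {"foo": [1, true, null]} maps onto the model. This snippet uses String and Double as simplified stand-ins for T.Text and Scientific so it stays dependency-free; the article’s real model is the one above:

```haskell
import qualified Data.Map as Map

-- Simplified stand-in for the model above: String instead of
-- T.Text, Double instead of Scientific.
data JsonValue
  = JsonNull
  | JsonBoolean Bool
  | JsonString String
  | JsonNumber Double
  | JsonObject (Map.Map String JsonValue)
  | JsonArray [JsonValue]
  deriving (Show, Eq)

-- {"foo": [1, true, null]}
doc :: JsonValue
doc = JsonObject (Map.fromList
        [("foo", JsonArray [JsonNumber 1, JsonBoolean True, JsonNull])])

main :: IO ()
main = case doc of
  JsonObject m -> print (Map.lookup "foo" m)
  _            -> pure ()
```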
Tokenization
A JSON document is simply a string of UTF-8 characters, and we need to write a program to interpret it. The RFC document specifies six structural characters (: , { } [ ]) that define token boundaries. We can use these to create a function that takes the raw text and returns a tuple with the next token and the remaining text.
I aim for a parser with linear time complexity \(O(n)\), where \(n\) is the number of characters in the raw content we analyze. My approach is to develop a parser that reads, parses, and combines tokens as it processes the content stream from a given JSON file.
If I pass a stream like { "foo": true } to my token-reading function and keep reading tokens until the end of the stream, I expect to collect the tokens into a list, as in this GHCi session:
>> :{
>> tokenize txt tokens = if TL.null txt then reverse tokens
>> else let (token, rest) = readToken txt in
>> tokenize rest (token : tokens)
>> :}
>> tokenize (TL.pack "{\"foo\":true}") []
["{","\"foo\"",":","true","}"]
Here’s my implementation:
import Data.Char (isSpace)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
readToken :: TL.Text -> (T.Text, TL.Text)
readToken txt =
let stream = TL.dropWhile isSpace txt
in case TL.uncons stream of
Nothing -> (T.empty, TL.empty)
Just (c, cs)
| isStructToken c -> (T.singleton c, cs)
| c == '"' ->
let (tokenBody, rest) = readStringBody cs
in (T.concat ["\"", TL.toStrict tokenBody, "\""], rest)
| otherwise ->
let (token, rest) = TL.span (\x -> not (isSpace x || isStructToken x)) stream
in (TL.toStrict token, rest)
isStructToken :: Char -> Bool
isStructToken c = c `elem` ['{', '}', '[', ']', ',', ':']
readStringBody :: TL.Text -> (TL.Text, TL.Text)
readStringBody = f False
where
f escaped stream =
case TL.uncons stream of
Nothing -> (stream, TL.empty)
Just (c, cs) ->
if c == '"' && not escaped
then (TL.empty, cs)
else
let (body, rest) = f (c == '\\' && not escaped) cs
in (TL.cons c body, rest)
There is a trade-off between memory and CPU here. The readToken function reads a Data.Text.Lazy.Text stream, which allows reading the file in chunks; this is more memory-efficient but requires more I/O, possibly slowing the process and increasing CPU use. I chose to prioritize memory efficiency, processing the JSON data as it is read.
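The tokenizer can also be exercised without the text package. Here is the same logic sketched over plain String (an illustrative port, not the article’s streaming version):

```haskell
import Data.Char (isSpace)

isStructToken :: Char -> Bool
isStructToken c = c `elem` "{}[],:"

-- String-based sketch of readToken: skip whitespace, then peel off
-- one structural character, one quoted string, or one bare word.
readToken :: String -> (String, String)
readToken txt = case dropWhile isSpace txt of
  []     -> ("", "")
  (c:cs)
    | isStructToken c -> ([c], cs)
    | c == '"'        -> let (body, rest) = readStringBody cs
                         in ('"' : body ++ "\"", rest)
    | otherwise       -> span (\x -> not (isSpace x || isStructToken x)) (c:cs)

-- Consume characters up to the closing quote, honouring backslash escapes.
readStringBody :: String -> (String, String)
readStringBody = go False
  where
    go _ [] = ([], [])
    go escaped (c:cs)
      | c == '"' && not escaped = ([], cs)
      | otherwise =
          let (body, rest) = go (c == '\\' && not escaped) cs
          in (c : body, rest)

tokenize :: String -> [String]
tokenize txt
  | null (dropWhile isSpace txt) = []
  | otherwise = let (t, rest) = readToken txt in t : tokenize rest

main :: IO ()
main = print (tokenize "{ \"foo\": true }")
-- ["{","\"foo\"",":","true","}"]
```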
Associativity (Semigroup)
The associative property states that how items are grouped in a binary expression does not change the result, as long as the order of the operands stays the same.
\[(a \oplus b) \oplus c = a \oplus (b \oplus c)\]Since I am interested in creating a process that parses tokens one by one, associativity is very useful. The difficulty of parsing tokens sequentially is that, for a current token, we don’t always have enough context to decide how two tokens should combine. If we could define associativity for our parser, we could read, parse, and merge tokens step by step. For example, here is a valid JSON object split into tokens, separated by ‘+’ to show token merging:
{ + "foo" + : + [ + 1 + , + 2 + ] + }
How and where do we begin? What is the result of { + "foo" or : + [? The challenge is to create a parser flexible enough to handle intermediate states, ensuring that regardless of the evaluation order, the outcome remains consistent.
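To see the law at work on something familiar: list concatenation is associative, so any grouping of three tokens yields the same string (a trivial base-only check):

```haskell
main :: IO ()
main = do
  let a = "{"
      b = "\"foo\""
      c = ":"
  -- Both groupings produce the same token sequence.
  print ((a ++ b) ++ c == a ++ (b ++ c))  -- True
```

Our goal is exactly this property, but for a parser state instead of a string.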
Semigroup
In Haskell, the Semigroup class is used to specify an associative operation for a data type; it requires us to define (<>). Let’s start by defining a JsonParser data type that represents things like Parsed JsonValue, NameSeparator (for the ':' token), and member-related intermediate states.
data JsonParser
= MemberKey T.Text
| MemberValue JsonValue
| Member T.Text JsonValue
| NameSeparator
| Parsed JsonValue
Now we can start applying the pattern-matching rules of associativity.
instance Semigroup JsonParser where
(<>) (Parsed (JsonString k)) NameSeparator = MemberKey k
(<>) (MemberKey k) (Parsed v) = Member k v
(<>) NameSeparator (Parsed v) = MemberValue v
(<>) (Parsed (JsonString k)) (MemberValue v) = Member k v
Note: these are incomplete definitions.
The Empty Member (Monoid)
To help us achieve true associativity, we introduce an empty member for our JsonParser data type. The empty member (similar to 0 in addition and 1 in multiplication) serves as an essential placeholder. It has a special property: it does not affect the result of the operation when combined with another member. Let’s add it to our parser and define its associative behavior.
data JsonParser
= Void
-- Other constructors ...
instance Semigroup JsonParser where
(<>) Void p = p
(<>) p Void = p
-- Other pattern ...
Why is it Important?
Imagine concatenating these three tokens from left to right:
"foo" : [
In our Haskell code, "foo" is a Parsed (JsonString T.Text) and ':' is a NameSeparator. One rule of our parser is:
(<>) (Parsed (JsonString k)) NameSeparator = MemberKey k
"foo" combined with ':' becomes a MemberKey T.Text. Eventually, we want this MemberKey T.Text to combine with a value to form a complete Member T.Text JsonValue. This is not possible by simply combining a MemberKey T.Text with '['. The solution is to carry another JsonParser value with the constructor. Let’s modify the type.
data JsonParser
= Void
| MemberKey T.Text JsonParser
-- Other constructors ...
Now we can significantly enhance the rules for combining parsed tokens with a MemberKey T.Text JsonParser value.
instance Semigroup JsonParser where
(<>) (MemberKey k p) q = case p <> q of
Parsed v -> Member k v
s -> MemberKey k s
-- Other patterns ...
We promote a MemberKey T.Text JsonParser to a Member T.Text JsonValue only when the next operation makes the carried JsonParser value a Parsed JsonValue.
Monoid
We use the Void value as the base placeholder for the carried JsonParser. For example, we should always read the ':' token as a NameSeparator Void. Formally, a Semigroup with an empty member is a Monoid and in Haskell we can define a Monoid instance for our JsonParser:
instance Monoid JsonParser where
mempty = Void
The mathematical definition of a monoid is a set equipped with an associative binary operation and an identity member.
The Parser
Here is the complete definition of my JsonParser. This data type represents all possible tokens and values, as well as intermediate states.
data JsonParser
= Void
| Error
| BeginObject (Map.Map T.Text JsonValue) JsonParser
| EndObject JsonParser
| BeginArray [JsonValue] JsonParser
| EndArray JsonParser
| ValueSeperator JsonParser
| NameSeparator JsonParser
| MemberKey T.Text JsonParser
| MemberValue JsonValue JsonParser
| Member T.Text JsonValue JsonParser
| Parsed JsonValue JsonParser
deriving (Show, Eq)
Complete Semigroup
The most challenging part of this journey is to define all possible token patterns and achieve full associativity for the JsonParser type. This is essentially the core logic of our parser. Once completed, the final parsing process can be defined. Here is my complete definition of a Semigroup JsonParser:
instance Semigroup JsonParser where
-- Void and Error
(<>) Void p = p
(<>) p Void = p
(<>) Error _ = Error
(<>) _ Error = Error
-- Object
(<>) (BeginObject ms p) q = case p <> q of
Member k v (ValueSeperator c) -> BeginObject (Map.insert k v ms) c
Member k v (EndObject c) -> Parsed (JsonObject (Map.insert k v ms)) c
EndObject c -> Parsed (JsonObject ms) c
s -> BeginObject ms s
(<>) (Member k v p) q = Member k v (p <> q)
(<>) (MemberValue v p) q = MemberValue v (p <> q)
(<>) (Parsed (JsonString k) p) q = case p <> q of
NameSeparator c -> MemberKey k c
MemberValue v c -> Member k v c
s -> Parsed (JsonString k) s
(<>) (MemberKey k p) q = case p <> q of
Parsed v c -> Member k v c
s -> MemberKey k s
(<>) (NameSeparator p) q = case p <> q of
Parsed v c -> MemberValue v c
s -> NameSeparator s
-- Arrays
(<>) (BeginArray elms p) q = case p <> q of
Parsed elm (ValueSeperator c) -> BeginArray (elm : elms) c
Parsed elm (EndArray c) -> Parsed (JsonArray $ reverse (elm : elms)) c
EndArray c -> Parsed (JsonArray $ reverse elms) c
s -> BeginArray elms s
-- Values
(<>) (ValueSeperator p) q = ValueSeperator (p <> q)
(<>) (EndObject p) q = EndObject (p <> q)
(<>) (EndArray p) q = EndArray (p <> q)
(<>) (Parsed v p) q = Parsed v (p <> q)
Parsing Tokens
The last step before we write the final parser function is to define a function that maps text tokens to JsonParser values.
import Control.Applicative ((<|>))
import Text.Read (readMaybe)

parseToken :: T.Text -> JsonParser
parseToken t
| T.null t = mempty
| t == "{" = BeginObject Map.empty mempty
| t == "," = ValueSeperator mempty
| t == ":" = NameSeparator mempty
| t == "[" = BeginArray [] mempty
| t == "]" = EndArray mempty
| t == "}" = EndObject mempty
| t == "null" = Parsed JsonNull mempty
| t == "true" = Parsed (JsonBoolean True) mempty
| t == "false" = Parsed (JsonBoolean False) mempty
| otherwise = stringOrNum
where
stringOrNum = case parseJsonString t <|> parseJsonNumber t of
Just v -> Parsed v mempty
_ -> Error
parseJsonString :: T.Text -> Maybe JsonValue
parseJsonString token =
  if T.length token >= 2 && T.head token == '"' && T.last token == '"'
    then Just (JsonString $ T.init (T.tail token))
    else Nothing
parseJsonNumber :: T.Text -> Maybe JsonValue
parseJsonNumber token =
  case readMaybe $ T.unpack token :: Maybe Scientific of
    Just x -> Just (JsonNumber x)
    Nothing -> Nothing
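A quick check of the numeric branch, sketched with Double instead of Scientific so it runs with base alone (parseNumber is my illustrative name). One caveat worth knowing: readMaybe for floating types is looser than the RFC grammar; it accepts tokens like "Infinity" that JSON forbids.

```haskell
import Text.Read (readMaybe)

-- Double stand-in for the Scientific-based parseJsonNumber above.
parseNumber :: String -> Maybe Double
parseNumber = readMaybe

main :: IO ()
main = do
  print (parseNumber "-12.5e2")  -- Just (-1250.0)
  print (parseNumber "true")     -- Nothing
```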
A Working Parser
For this parser, I went with a simple, sequential process. My parser reads a token from the input stream (a file), parses it, and combines the token with the previous one (starting from Void), using the (<>) operator. Here’s the final function:
parse :: TL.Text -> Maybe JsonValue
parse stream = case f mempty stream of
Parsed v Void -> Just v
_ -> Nothing
where
f Error _ = Error
f p s | TL.null s = p
f p s =
if T.null $ fst t
then f p rest
else f (p <> q) rest
where
t = readToken s
q = parseToken $ fst t
rest = snd t
To test it, I’ve created a simple application that reads a JSON file and lets us pass a flag to select values from a JSON object.
./jp test.json
{
"foo": {
"bar": {
"arr": [
true,
false,
1.0,
null
],
"baz": "biz"
}
}
}
Using an -l flag, I can perform simple object lookup:
./jp test.json -l foo.bar.baz
"biz"
What’s Next?
I took on this challenge to improve my Haskell knowledge as well as my algorithmic thinking. As mentioned, I really wanted to solve this puzzle on my own, so even though it may seem like I had the entire solution all along, it actually took a few iterations. To me, the most challenging part was finding a solution for combining tokens. I tried all sorts of approaches until I ended up taking the monoidal approach. Before I conclude, here are a couple of future paths I’d like to explore.
From JSON to Free Text
The process of tokenizing and combining tokens is far more general; it underlies compilers, interpreters, and, more recently, LLMs. Having a good grasp of this kind of process is a very nice addition to my knowledge arsenal. I chose to start with JSON because it is simple and very well-defined; however, other cases might call for a completely different set of solutions.
For example, processing free text (for things like LLMs and NLP) is computationally more complex. First, how do you even define a token? One approach might involve splitting chunks of text on whitespace and concurrently mapping each word to its index in a giant vocabulary.
It’s All Very Similar
It’s interesting to see that you can fit parsers, compilers, interpreters, and free-text processors under the same umbrella. Their inner process might be completely different from one another, but the basic principles are similar:
- Defining the set of tokens
- Defining the process that tokenizes text
- Defining the process that maps and combines tokens to create valuable context
Concurrency
I took the extra effort to define my JsonParser as a Semigroup instance. With a simple sequential parser, many of the patterns I defined are unnecessary. For example, when parsing valid JSON from left to right, a pattern like NameSeparator p <> q is never evaluated first, because a Parsed (JsonString k) p always comes before the NameSeparator and absorbs it.
So Why Bother?
True associativity opens the door for concurrent or parallel processing. To dramatically improve performance, I can write an algorithm that pairs tokens together, processes each pair asynchronously, and then pairs the results with one another. This process concurrently reduces pairs of tokens down to a single JsonValue—associativity allows this!
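The reduction scheme itself is easy to sketch generically. For any law-abiding Monoid (like our JsonParser), a balanced pairwise fold agrees with the sequential left-to-right one; in real code each round’s pairs could then be evaluated in parallel. This is a base-only sketch (pairwiseReduce is my illustrative name), demonstrated here on strings:

```haskell
-- Reduce a list pairwise, halving its length each round. For any
-- law-abiding Monoid this equals mconcat (left-to-right folding),
-- which is exactly what associativity guarantees.
pairwiseReduce :: Monoid a => [a] -> a
pairwiseReduce []  = mempty
pairwiseReduce [x] = x
pairwiseReduce xs  = pairwiseReduce (pairUp xs)
  where
    pairUp (a:b:rest) = (a <> b) : pairUp rest
    pairUp rest       = rest

main :: IO ()
main = do
  let chunks = ["{", "\"foo\"", ":", "true", "}"]
  print (pairwiseReduce chunks == mconcat chunks)  -- True
```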
flowchart TD
ob(("{"))
foo(("''foo''"))
sep((:))
true(("true"))
eobj(("}"))
e2(("}"))
e3(("}"))
v((void))
v2((void))
of(("{''foo''"))
st((":true"))
ofst(("{''foo'':true"))
all(("{''foo'':true}"))
ob --> of
foo --> of
sep --> st
true --> st
eobj --> e2
v --> e2
e2 --> e3
v2 --> e3
of --> ofst
st --> ofst
ofst --> all
e3 --> all
Thank you for reading, I hope you learned something new. See you next time.