分享

Easily Create Your Own Parser

 quasiceo 2013-12-12

Easily Create Your Own Parser

2011-7-6 11:44:51English
TokenicerSS.jpg

Introduction 

One of the more difficult tasks in computer science is building parsers and compilers. There are a lot of tools available that aid in the tedious task, most notably,  Flex and Yacc, both available on the Linux/UNIX platform. The program I present here in this article is called TokenIcer. It is similar to Flex, but TokenIcer provides a nice easy to use GUI that serves as an editor for your rules, as well as a test bed for testing your rules. In addition, once your parsing rules have been defined, TokenIcer can create a parser class, based on your rules, in either C# or VB.NET 

Background

To be able to use TokenIcer well, you should have a pretty good understanding of how Regular Expressions work.  Each rule you enter into TokenIcer will be based on a Regular Expression. Any regular Expression that the .NET Regex library can parse, will also be valid in TokenIcer.

The way a parser works, and also the way TokenIcer will work, is that you feed into the parser some kind of input string. For example, if we feed the following line into a parser:

3+2 * (6 + 1)  

We should expect our parser to provide us output like this:

{Integer}{Plus}{Integer}{Whitespace}{Asterisk}{Whitespace}{LeftParen}{Integer}{Whitespace}{Plus}{Whitespace}{Integer}{RightParen}{Newline} 

What we do with this parser output depends on exactly what we are trying to accomplish. Maybe you are building a language compiler, or perhaps a math parser. This is what TokenIcer does. It takes input like "3+2 * (6 + 1)" and converts it into a series of enumerated values.  

Using the code 

When you run TokenIcer, you will be presented with a screen with 3 text boxes and an TreeView. The first text box is where we enter our rules. Rules are simply regular expressions wrapped between quotes. Immediately following the rule is a space (or multiple spaces, if you prefer). Following the space is the identifier for that rule. As in the example above, an identifier can be something like Integer, Whitespace or whatever you want to use. Any valid VB.NET or C# identifier can be used. Go ahead and enter the following rule in the text box:

"[0-9]+" INTEGER 

This rule will correctly identify an integer number. In the middle text box, is our test bed. Anything you enter in here will be parsed according to the rules you have in the text box above it. Go ahead and enter any integer number you wish in the test box. Once you enter your number, click the "Test Grammar" button at the bottom left of the window. Once you do, you will see the third textbox, the output text box, did, indeed, correctly identify our number as an {INTEGER}. The output tree also shows INTEGER, but if you expand the tree, you will see the actual value the parser parsed. Now go ahead and enter the following 2 rules in the first text box, immediately following the first rule:

"[0-9]+\.+[0-9]+" FLOAT
"[ \t]+" WHITESPACE

In your test box, try testing the following:

3.2 15 4.932 

Go ahead and click "Test Grammar". You may be surprised to see a bunch of "UNDEFINED" tags and no FLOAT tags. The reason for this (gotcha #1) is because the parser will parse your input using the first rule it comes to that matches. When making your rules, you must place "higher priority" rules first. The parser came to the "3" and said "ok 3 is an integer". It didn't even look past the 3 to see if there was a decimal point. To fix this, simply put the FLOAT line above the INTEGER line. This way, the parser will try to match the FLOAT rule first and if there is no decimal point, move the INTEGER rule. Change your rules around in the rule text box so it looks like this:

"[0-9]+\.+[0-9]+" FLOAT
"[0-9]+" INTEGER
"[ \t]+" WHITESPACE 

Now if you click "Test Grammar", the output box and the output tree will look as expected. It is now properly parsing as you would expect. 

To wrap up this section, I will now show you how TokenIcer can create C# or VB.NET classes that you can include in your own projects. Since we now have our three rules and they are tested with out input, go ahead and click the button that says "Generate Class...". When you do that, a window pops up with a drop down list asking which language you prefer. You can select either C# or VB.NET. Also there is a checkbox for including comments. If you check this box, some simple comments will also be generated to help you understand the code. Once you've selected your language, click "Generate my Class" and another window will pop up with the generated C# or VB.NET code. You can hit <Ctrl>+A to select it all and <Ctrl>+C to copy the code into your own project. Its all that simple!  

The TokenParser class exposes the following property: 

InputString -- After instantiating a copy of TokenParser, you must set this string property to the value of the string you want parsed. 

The TokenParser class exposes the following methods:

GetToken() -- Call this method to retrieve the next token from the input string. The GetToken() method returns a Token object. A Token object contains the TokenName (which is an enum of tokens that you defined in your rules earlier, like INTEGER, WHITESPACE, FLOAT) and it contains the TokenValue. The TokenValue will be the value retrieved from the parser. For example, an INTEGER might return "53" for example. 

Peek() -- This method will return the next token that GetToken() will return. It allows you to look ahead in the token buffer without actually pulling anything off the queue. The return value of Peek() is a PeekToken. A PeekToken is a special object that contains a Token object and an index. By calling Peek() and passing a PeekToken as an argument to Peek(), Peek() will return the Token that is returned after the last Peek() call. In this way, you can peek ahead several tokens deep.  

ResetParser() -- This method resets the parser. After calling this, you must set the InputString property to a string again and then you can call GetToken() or PeekToken() as you normally would to parse a new string.  

Example Parser Rules

Here is an example of a very simple BASIC parser: 

"[Ll][Ee][Tt]" LET
"[Pp][Rr][Ii][Nn][Tt]" PRINT
"[Cc][Ll][Ss]" CLS
"[Rr][Ee][Mm].*" REM
"[Ee][Nn][Dd]" END
"[Gg][Oo][Tt][Oo]" GOTO
"[Ff][Oo][Rr]" FOR
"[Ss][Tt][Ee][Pp]" STEP
"[Nn][Ee][Xx][Tt]" NEXT
"[Tt][Oo]" TO
"\:" COLON
"=" EQUALS
"\".*?\"" STRING
"[a-zA-Z_][a-zA-Z0-9_]*"     IDENTIFIER
"[ \t]+" WHITESPACE
"[\r\n]+" NEWLINE
"[0-9]?\.+[0-9]+" FLOAT
"[0-9]+" INTEGER
"'.*" APOSTROPHE
"\(" LPAREN
"\)" RPAREN
"\*" ASTERISK
"\/" SLASH
"\+" PLUS
"\-" MINUS 

These rules can properly parse the following program: 

10 CLS
20 PRINT "Hello" : PRINT " World!"
30 X = 5
40 Y = X * 2 + 3.5
50 PRINT Y
60 FOR Z = 50 TO 1 STEP -1
70 PRINT "Z = " + Z
80 NEXT Z
90 END

Feel free to expand the parser rules and try to make your own BASIC compiler! 

History

7/3/2011 - Version 1.0 released

7/5/2011 - Version 1.1 released. TokenIcer can now save grammar and test input files! 

    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多