自己动手实现一个脚本语言第一篇-------词法分析

1,797 阅读8分钟

目前来说编译原理的学习对于工作的用处很小,更多的方面可能是装逼,如果你不从nlp方面的话,编译器的工作研究基本已经在16年已经做完。不过,我觉得学东西没必要过分强调工作的需要,很多时候处于兴趣和爱好,探索底层,也许会有新的乐趣。

教材方面 推荐两本一本就是龙书 第二本就是

这是我对实现一个脚本语言的一点心得和体会。
该脚本语言的实现是使用java作为实现语言

该语言的学习是根据http://www.craftinginterpreters.com教程实现
知识前置
学习该知识需要你有一定的java基础知识,了解oop特性,以及学过编译原理。



lox是一门脚本语言,这意味着他的代码直接运行,而不是转换成汇编代码。

让我们开始吧

这是编译器的启动器。

 public class Lox {                                           
     public static void main(String[] args) throws IOException {
       if (args.length > 1) {                                   
         System.out.println("Usage: jlox [script]");            
          System.exit(64); 
       } else if (args.length == 1) {                           
       runFile(args[0]);                                      
       } else {                                                 
       runPrompt();                                           
      }                                                        
  }                                                          
 }      

实际上,有两种方法可以运行某些代码。如果您从命令行启动jlox并为其提供文件路径,它将读取文件并执行该文件:

    private static void runFile(String path) throws IOException {
       byte[] bytes = Files.readAllBytes(Paths.get(path));        
       run(new String(bytes, Charset.defaultCharset()));          
      }     

如果你想不带任何参数去启动他,做到交互的话,

  private static void runPrompt() throws IOException {         
  InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader reader = new BufferedReader(input);
for (;;) { 
  System.out.print("> ");                                  
  run(reader.readLine());                                  
    }                                                           
  }  

上述两种运行方法的实现,其实都是依赖于对run方法的包装

   private static void run(String source) {    
       Scanner scanner = new Scanner(source);    
       List<Token> tokens = scanner.scanTokens();

// For now, just print the tokens.        
for (Token token : tokens) {              
  System.out.println(token);              
   }                                         
 }    

好了,到现在我们已经完成编译器工作的第一步,这只是开始,我们一起加油。

接下来是错误处理,当输入的tokens出现错误时,编译器如何识别,并报出一个错误。

     static void error(int line, String message) {                       
      |     report(line, "", message);                                        
     }

     private static void report(int line, String where, String message) {
       System.err.println(                                               
           "[line " + line + "] Error" + where + ": " + message);        
       hadError = true;                                                  
     }   

我们继续在lox类中创建一个静态error函数,静态调用report函数,report函数传入line 参数,error发生位置,错误信息。 这是一个最低要求的错误显示函数。当出现bug时候他会通知我们

  Error: Unexpected "," *somewhere* in your program. Good luck finding it!

但是这样的函数,还是无法让我们满意,我们希望一个错误显示函数,能告知我们错误在哪个地方

    Error: Unexpected "," in argument list.
15 | function(first, second,);
                 ^-- Here.

我们在low类中定义一个boolean变量hadError,

public class Lox {                
  static boolean hadError = false;

然后我们在runfile函数里

   private static void runFile(String path) throws IOException {
     byte[] bytes = Files.readAllBytes(Paths.get(path));        
     run(new String(bytes, Charset.defaultCharset()));          
     //增加一个判断,出现问题就退出
      if (hadError) System.exit(65);          
   }     

runPrompt函数

  private static void runPrompt() throws IOException {         
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader reader = new BufferedReader(input);
     for (;;) { 
      System.out.print("> ");                                  
     run(reader.readLine());        
     hadError = false;      
     }                                                           
   }  

接下来就是词素工作 关键字是语言文法分析中一件很重要的工作, 我们的词法分析使用的是词法比较方式,这种方式其实非常缓慢,但是为了我们学习起来足够简单,我们将采用这种方法。

  enum TokenType {                                   
    // Single-character tokens.     
    //单字符串

    LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
    COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR, 

  //>  ,< ,==,!=

    // One or two character tokens.          

    BANG, BANG_EQUAL,                                
    EQUAL, EQUAL_EQUAL,                              
    GREATER, GREATER_EQUAL,                          
    LESS, LESS_EQUAL,                                
   //string  number

    // Literals.      

    IDENTIFIER, STRING, NUMBER,                      
    // 关键字

    // Keywords.        

    AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,  
    PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,    

    EOF                                              
  }      

我们再创建一个token类,将上面那个枚举装进来

   class Token {                                                     
     final TokenType type;                                           
     final String lexeme;                                            
     final Object literal;                                           
     final int line; 

     Token(TokenType type, String lexeme, Object literal, int line) {
       this.type = type;                                             
       this.lexeme = lexeme;                                         
       this.literal = literal;                                       
       this.line = line;                                             
     }                                                               

     public String toString() {                                      
       return type + " " + lexeme + " " + literal;                   
     }     
     }

虽然在词法扫描中,正则表达式很重要,但为了入门考虑 该教程不尝试实现正则表达式功能

接下来,我们要实现一个字符串的scanner类

    import java.util.ArrayList;                                               
    import java.util.HashMap;                                                 
    import java.util.List;                                                    
    import java.util.Map;                                                     

    import static com.craftinginterpreters.lox.TokenType.*; 

    class Scanner {                                                           
      private final String source;                                            
      private final List<Token> tokens = new ArrayList<>();                   

      Scanner(String source) {                                                
        this.source = source;                                                 
      }     
       List<Token> scanTokens() {                        
         while (!isAtEnd()) {                            
        // We are at the beginning of the next lexeme.
        start = current;                              
         scanToken();                                  
         }

         tokens.add(new Token(EOF, "", null, line));     
         return tokens;                                  
       }             
             }      

这个类中我们实现scantokens函数,这个函数将不断的扫描token

然后,为了我们能够有效的定位字符的扫描,我们需要在类Scanner中添加

           private int start = 0;                               
           private int current = 0;                             
           private int line = 1;            

建立一个函数,帮助我们判断是否扫描完成

       private boolean isAtEnd() {         
          return current >= source.length();
        } 

接下来就是scanner中的核心函数,扫描功能的实现

    private void scanToken() {
        char c = advance();
        switch (c) {
        case '(': addToken(LEFT_PAREN); break;
        case ')': addToken(RIGHT_PAREN); break;
        case '{': addToken(LEFT_BRACE); break;
        case '}': addToken(RIGHT_BRACE); break;
        case ',': addToken(COMMA); break;
        case '.': addToken(DOT); break;
        case '-': addToken(MINUS); break;
        case '+': addToken(PLUS); break;
        case ';': addToken(SEMICOLON); break;
        case '*': addToken(STAR); break;
        case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
        case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
        case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
        case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;
        case '/':
            if (match('/')) {
                // A comment goes until the end of the line.
                while (peek() != '\n' && !isAtEnd()) advance();
            } else {
                addToken(SLASH);
            }
            break;
        case ' ':
        case '\r':
        case '\t':
            // Ignore whitespace.
            break;
            
        case '\n':
            line++;
            break;
         
        case '"': string(); break;
        case 'o':
            if (peek() == 'r') {
                addToken(OR);
            }
            break;
        default:
            if (isDigit(c)) {
                number();
            } else if (isAlpha(c)) {
                identifier();
            } else {
                Lox.error(line, "Unexpected character.");
            }
            break;
    }
}

然后我们需要实现几个辅助函数

           private char advance() {                               
             current++;                                           
             return source.charAt(current - 1);                   
           }

           private void addToken(TokenType type) {                
             addToken(type, null);                                
           }                                                      

           private void addToken(TokenType type, Object literal) {
             String text = source.substring(start, current);      
             tokens.add(new Token(type, text, literal, line));    
           }          

除了这些,我们还是需要考虑一些问题,如果用户传入一些计算机不能识别的字符,如#^@这些,我们的扫描器应该就要报错 为此,我们的tokenscan方法中,我们就需要

           default:
            if (isDigit(c)) {
                number();
            } else if (isAlpha(c)) {
                identifier();
            } else {
                Lox.error(line, "Unexpected character.");
            }
            break;

我们tokenscan函数中,不仅对字符进行了扫描,还对操作符进行了扫描

        case '-': addToken(MINUS); break;
        case '+': addToken(PLUS); break;
        case ';': addToken(SEMICOLON); break;
        case '*': addToken(STAR); break;
        case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
        case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
        case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
        case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;

我们在tokenscan函数后面添加一个 match函数

        private boolean match(char expected) {                 
           if (isAtEnd()) return false;                         
           if (source.charAt(current) != expected) return false;

           current++;                                           
           return true;                                         
         }      

还有

     private boolean isDigit(char c) {
       return c >= '0' && c <= '9';   
     } 

以及判断数字的函数

  private void number() {                                     
  while (isDigit(peek())) advance();
// Look for a fractional part.                            
  if (peek() == '.' && isDigit(peekNext())) {               
  // Consume the "."                                      
  advance();                                              
  while (isDigit(peek())) 
  advance();                      
     }                                                         
    addToken(NUMBER,                                          
    Double.parseDouble(source.substring(start, current)));
 }    

以及判断字符串的函数

  private void identifier() {                
      while (isAlphaNumeric(peek())) advance();

      addToken(IDENTIFIER);                    
    }   

添加一个辅助函数
  private boolean isAlpha(char c) {       
        return (c >= 'a' && c <= 'z') ||      
       (c >= 'A' && c <= 'Z') ||      
        c == '_';                     
       }
下面这个函数旁段字符串是数字还是字符
     private boolean isAlphaNumeric(char c) {
     return isAlpha(c) || isDigit(c);      
       }        

干完这些,我们就能把字符串进行扫描,那现在是不是大功告成了呢?还没有,我们还没有把关键字进行扫描 在scanner类中我们加入

     private static final Map<String, TokenType> keywords;

     static {                                             
       keywords = new HashMap<>();                        
        keywords.put("and",    AND);                       
        keywords.put("class",  CLASS);                     
        keywords.put("else",   ELSE);                      
        keywords.put("false",  FALSE);                     
        keywords.put("for",    FOR);                       
        keywords.put("fun",    FUN);                       
        keywords.put("if",     IF);                        
        keywords.put("nil",    NIL);                       
        keywords.put("or",     OR);                        
        keywords.put("print",  PRINT);                     
        keywords.put("return", RETURN);                    
        keywords.put("super",  SUPER);                     
        keywords.put("this",   THIS);                      
        keywords.put("true",   TRUE);                      
        keywords.put("var",    VAR);                       
        keywords.put("while",  WHILE);                     
              }                  

在identifed方法中

      private void identifier() {                
         // See if the identifier is a reserved word.   
         String text = source.substring(start, current);
         TokenType type = keywords.get(text);           
         if (type == null) type = IDENTIFIER;           
         addToken(type);                     
      }     

我们实现了扫描功能

最后,为了大家理解,将完整源码发出

启动器

     import java.io.BufferedReader;
     import java.io.IOException;
     import java.io.InputStreamReader;
     import java.nio.charset.Charset;
     import java.nio.file.Files;
     import java.nio.file.Paths;
     import java.util.List;

     public class Lox {
         static boolean hadError = false;

    public static void main(String[] args) throws IOException {
        if (args.length > 1) {
            System.out.println("Usage: jlox [script]");
            System.exit(64);
        } else if (args.length == 1) {
            runFile(args[0]);
        } else {
            runPrompt();
        }

}
private static void runFile(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    run(new String(bytes, Charset.defaultCharset()));

    if (hadError) System.exit(65);

}
private static void runPrompt() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader reader = new BufferedReader(input);

    for (;;) {
        System.out.print("> ");
        run(reader.readLine());
        hadError = false;

    }
}
private static void run(String source) {
    Scanner scanner = new Scanner(source);
    List<Token> tokens = scanner.scanTokens();

    // For now, just print the tokens.
    for (Token token : tokens) {
        System.out.println(token);
    }
}
static void error(int line, String message) {
    report(line, "", message);
}

private static void report(int line, String where, String message) {
    System.err.println(
            "[line " + line + "] Error" + where + ": " + message);
    hadError = true;
         }
 }

枚举类

      enum TokenType {
          LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
          COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,
// One or two character tokens.
          BANG, BANG_EQUAL,
          EQUAL, EQUAL_EQUAL,
          GREATER, GREATER_EQUAL,
          LESS, LESS_EQUAL,

// Literals.
          IDENTIFIER, STRING, NUMBER,

          // Keywords.
          AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
           PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,

          EOF
         }

扫描器

   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   
   import static com.craftinginterpreters.lox.TokenType.*;

   class Scanner {
       private static final Map<String, TokenType> keywords;

static {
    keywords = new HashMap<>();
    keywords.put("and",    AND);
    keywords.put("class",  CLASS);
    keywords.put("else",   ELSE);
    keywords.put("false",  FALSE);
    keywords.put("for",    FOR);
    keywords.put("fun",    FUN);
    keywords.put("if",     IF);
    keywords.put("nil",    NIL);
    keywords.put("or",     OR);
    keywords.put("print",  PRINT);
    keywords.put("return", RETURN);
    keywords.put("super",  SUPER);
    keywords.put("this",   THIS);
    keywords.put("true",   TRUE);
    keywords.put("var",    VAR);
    keywords.put("while",  WHILE);
}
private final String source;
private final List<Token> tokens = new ArrayList<>();
          //    private final List<Token> tokens = new ArrayList<>();

private int start = 0;
private int current = 0;
private int line = 1;
Scanner(String source) {
    this.source = source;
}
List<Token> scanTokens() {
    while (!isAtEnd()) {
        // We are at the beginning of the next lexeme.
        start = current;
        scanToken();
    }

    tokens.add(new Token(EOF, "", null, line));
    return tokens;
}
private void scanToken() {
    char c = advance();
    switch (c) {
        case '(': addToken(LEFT_PAREN); break;
        case ')': addToken(RIGHT_PAREN); break;
        case '{': addToken(LEFT_BRACE); break;
        case '}': addToken(RIGHT_BRACE); break;
        case ',': addToken(COMMA); break;
        case '.': addToken(DOT); break;
        case '-': addToken(MINUS); break;
        case '+': addToken(PLUS); break;
        case ';': addToken(SEMICOLON); break;
        case '*': addToken(STAR); break;
        case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
        case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
        case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
        case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;
        case '/':
            if (match('/')) {
                // A comment goes until the end of the line.
                while (peek() != '\n' && !isAtEnd()) advance();
            } else {
                addToken(SLASH);
            }
            break;

        case ' ':
        case '\r':
        case '\t':
            // Ignore whitespace.
            break;

        case '\n':
            line++;
            break;

        case '"': string(); break;
        case 'o':
            if (peek() == 'r') {
                addToken(OR);
            }
            break;
        default:
            if (isDigit(c)) {
                number();
            } else if (isAlpha(c)) {
                identifier();
            } else {
                Lox.error(line, "Unexpected character.");
            }
            break;
    }
}
private void identifier() {
    while (isAlphaNumeric(peek())) advance();
    // See if the identifier is a reserved word.
    String text = source.substring(start, current);

    TokenType type = keywords.get(text);
    if (type == null) type = IDENTIFIER;
    addToken(type);
}
private void number() {
    while (isDigit(peek())) advance();

    // Look for a fractional part.
    if (peek() == '.' && isDigit(peekNext())) {
        // Consume the "."
        advance();

        while (isDigit(peek())) advance();
    }

    addToken(NUMBER,
            Double.parseDouble(source.substring(start, current)));
}
private char peekNext() {
    if (current + 1 >= source.length()) return '\0';
    return source.charAt(current + 1);
}
private boolean isAlpha(char c) {
    return (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            c == '_';
}

private boolean isAlphaNumeric(char c) {
    return isAlpha(c) || isDigit(c);
}
private void string() {
    while (peek() != '"' && !isAtEnd()) {
        if (peek() == '\n') line++;
        advance();
    }

    // Unterminated string.
    if (isAtEnd()) {
        Lox.error(line, "Unterminated string.");
        return;
    }

    // The closing ".
    advance();

    // Trim the surrounding quotes.
    String value = source.substring(start + 1, current - 1);
    addToken(STRING, value);
}
private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (source.charAt(current) != expected) return false;

    current++;
    return true;
}
private char peek() {
    if (isAtEnd()) return '\0';
    return source.charAt(current);
}
private boolean isDigit(char c) {
    return c >= '0' && c <= '9';
}
private boolean isAtEnd() {
    return current >= source.length();
}
private char advance() {
    current++;
    return source.charAt(current - 1);
}

private void addToken(TokenType type) {
    addToken(type, null);
}

private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    tokens.add(new Token(type, text, literal, line));
}
 }

下面一张是文法分析,我们将构造词法分析树,用递归向下的方式生产语法树。