目前来说编译原理的学习对于工作的用处很小,更多的方面可能是装逼,如果你不从nlp方面的话,编译器的工作研究基本已经在16年已经做完。不过,我觉得学东西没必要过分强调工作的需要,很多时候处于兴趣和爱好,探索底层,也许会有新的乐趣。
教材方面 推荐两本一本就是龙书 第二本就是
这是我对实现一个脚本语言的一点心得和体会。
该脚本语言的实现是使用java作为实现语言
该语言的学习是根据http://www.craftinginterpreters.com教程实现
知识前置
学习该知识需要你有一定的java基础知识,了解oop特性,以及学过编译原理。
lox是一门脚本语言,这意味着他的代码直接运行,而不是转换成汇编代码。
让我们开始吧
这是编译器的启动器。
public class Lox {
public static void main(String[] args) throws IOException {
if (args.length > 1) {
System.out.println("Usage: jlox [script]");
System.exit(64);
} else if (args.length == 1) {
runFile(args[0]);
} else {
runPrompt();
}
}
}
实际上,有两种方法可以运行某些代码。如果您从命令行启动jlox并为其提供文件路径,它将读取文件并执行该文件:
private static void runFile(String path) throws IOException {
byte[] bytes = Files.readAllBytes(Paths.get(path));
run(new String(bytes, Charset.defaultCharset()));
}
如果你想不带任何参数去启动他,做到交互的话,
private static void runPrompt() throws IOException {
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader reader = new BufferedReader(input);
for (;;) {
System.out.print("> ");
run(reader.readLine());
}
}
上述两种运行方法的实现,其实都是依赖于对run方法的包装
private static void run(String source) {
Scanner scanner = new Scanner(source);
List<Token> tokens = scanner.scanTokens();
// For now, just print the tokens.
for (Token token : tokens) {
System.out.println(token);
}
}
好了,到现在我们已经完成编译器工作的第一步,这只是开始,我们一起加油。
接下来是错误处理,当输入的tokens出现错误时,编译器如何识别,并报出一个错误。
static void error(int line, String message) {
| report(line, "", message);
}
private static void report(int line, String where, String message) {
System.err.println(
"[line " + line + "] Error" + where + ": " + message);
hadError = true;
}
我们继续在lox类中创建一个静态error函数,静态调用report函数,report函数传入line 参数,error发生位置,错误信息。 这是一个最低要求的错误显示函数。当出现bug时候他会通知我们
Error: Unexpected "," *somewhere* in your program. Good luck finding it!
但是这样的函数,还是无法让我们满意,我们希望一个错误显示函数,能告知我们错误在哪个地方
Error: Unexpected "," in argument list.
15 | function(first, second,);
^-- Here.
我们在low类中定义一个boolean变量hadError,
public class Lox {
static boolean hadError = false;
然后我们在runfile函数里
private static void runFile(String path) throws IOException {
byte[] bytes = Files.readAllBytes(Paths.get(path));
run(new String(bytes, Charset.defaultCharset()));
//增加一个判断,出现问题就退出
if (hadError) System.exit(65);
}
runPrompt函数
private static void runPrompt() throws IOException {
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader reader = new BufferedReader(input);
for (;;) {
System.out.print("> ");
run(reader.readLine());
hadError = false;
}
}
接下来就是词素工作 关键字是语言文法分析中一件很重要的工作, 我们的词法分析使用的是词法比较方式,这种方式其实非常缓慢,但是为了我们学习起来足够简单,我们将采用这种方法。
enum TokenType {
// Single-character tokens.
//单字符串
LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,
//> ,< ,==,!=
// One or two character tokens.
BANG, BANG_EQUAL,
EQUAL, EQUAL_EQUAL,
GREATER, GREATER_EQUAL,
LESS, LESS_EQUAL,
//string number
// Literals.
IDENTIFIER, STRING, NUMBER,
// 关键字
// Keywords.
AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,
EOF
}
我们再创建一个token类,将上面那个枚举装进来
class Token {
final TokenType type;
final String lexeme;
final Object literal;
final int line;
Token(TokenType type, String lexeme, Object literal, int line) {
this.type = type;
this.lexeme = lexeme;
this.literal = literal;
this.line = line;
}
public String toString() {
return type + " " + lexeme + " " + literal;
}
}
虽然在词法扫描中,正则表达式很重要,但为了入门考虑 该教程不尝试实现正则表达式功能
接下来,我们要实现一个字符串的scanner类
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import static com.craftinginterpreters.lox.TokenType.*;
class Scanner {
private final String source;
private final List<Token> tokens = new ArrayList<>();
Scanner(String source) {
this.source = source;
}
List<Token> scanTokens() {
while (!isAtEnd()) {
// We are at the beginning of the next lexeme.
start = current;
scanToken();
}
tokens.add(new Token(EOF, "", null, line));
return tokens;
}
}
这个类中我们实现scantokens函数,这个函数将不断的扫描token
然后,为了我们能够有效的定位字符的扫描,我们需要在类Scanner中添加
private int start = 0;
private int current = 0;
private int line = 1;
建立一个函数,帮助我们判断是否扫描完成
private boolean isAtEnd() {
return current >= source.length();
}
接下来就是scanner中的核心函数,扫描功能的实现
private void scanToken() {
char c = advance();
switch (c) {
case '(': addToken(LEFT_PAREN); break;
case ')': addToken(RIGHT_PAREN); break;
case '{': addToken(LEFT_BRACE); break;
case '}': addToken(RIGHT_BRACE); break;
case ',': addToken(COMMA); break;
case '.': addToken(DOT); break;
case '-': addToken(MINUS); break;
case '+': addToken(PLUS); break;
case ';': addToken(SEMICOLON); break;
case '*': addToken(STAR); break;
case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;
case '/':
if (match('/')) {
// A comment goes until the end of the line.
while (peek() != '\n' && !isAtEnd()) advance();
} else {
addToken(SLASH);
}
break;
case ' ':
case '\r':
case '\t':
// Ignore whitespace.
break;
case '\n':
line++;
break;
case '"': string(); break;
case 'o':
if (peek() == 'r') {
addToken(OR);
}
break;
default:
if (isDigit(c)) {
number();
} else if (isAlpha(c)) {
identifier();
} else {
Lox.error(line, "Unexpected character.");
}
break;
}
}
然后我们需要实现几个辅助函数
private char advance() {
current++;
return source.charAt(current - 1);
}
private void addToken(TokenType type) {
addToken(type, null);
}
private void addToken(TokenType type, Object literal) {
String text = source.substring(start, current);
tokens.add(new Token(type, text, literal, line));
}
除了这些,我们还是需要考虑一些问题,如果用户传入一些计算机不能识别的字符,如#^@这些,我们的扫描器应该就要报错 为此,我们的tokenscan方法中,我们就需要
default:
if (isDigit(c)) {
number();
} else if (isAlpha(c)) {
identifier();
} else {
Lox.error(line, "Unexpected character.");
}
break;
我们tokenscan函数中,不仅对字符进行了扫描,还对操作符进行了扫描
case '-': addToken(MINUS); break;
case '+': addToken(PLUS); break;
case ';': addToken(SEMICOLON); break;
case '*': addToken(STAR); break;
case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;
我们在tokenscan函数后面添加一个 match函数
private boolean match(char expected) {
if (isAtEnd()) return false;
if (source.charAt(current) != expected) return false;
current++;
return true;
}
还有
private boolean isDigit(char c) {
return c >= '0' && c <= '9';
}
以及判断数字的函数
private void number() {
while (isDigit(peek())) advance();
// Look for a fractional part.
if (peek() == '.' && isDigit(peekNext())) {
// Consume the "."
advance();
while (isDigit(peek()))
advance();
}
addToken(NUMBER,
Double.parseDouble(source.substring(start, current)));
}
以及判断字符串的函数
private void identifier() {
while (isAlphaNumeric(peek())) advance();
addToken(IDENTIFIER);
}
添加一个辅助函数
private boolean isAlpha(char c) {
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_';
}
下面这个函数旁段字符串是数字还是字符
private boolean isAlphaNumeric(char c) {
return isAlpha(c) || isDigit(c);
}
干完这些,我们就能把字符串进行扫描,那现在是不是大功告成了呢?还没有,我们还没有把关键字进行扫描 在scanner类中我们加入
private static final Map<String, TokenType> keywords;
static {
keywords = new HashMap<>();
keywords.put("and", AND);
keywords.put("class", CLASS);
keywords.put("else", ELSE);
keywords.put("false", FALSE);
keywords.put("for", FOR);
keywords.put("fun", FUN);
keywords.put("if", IF);
keywords.put("nil", NIL);
keywords.put("or", OR);
keywords.put("print", PRINT);
keywords.put("return", RETURN);
keywords.put("super", SUPER);
keywords.put("this", THIS);
keywords.put("true", TRUE);
keywords.put("var", VAR);
keywords.put("while", WHILE);
}
在identifed方法中
private void identifier() {
// See if the identifier is a reserved word.
String text = source.substring(start, current);
TokenType type = keywords.get(text);
if (type == null) type = IDENTIFIER;
addToken(type);
}
我们实现了扫描功能
最后,为了大家理解,将完整源码发出
启动器
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
public class Lox {
static boolean hadError = false;
public static void main(String[] args) throws IOException {
if (args.length > 1) {
System.out.println("Usage: jlox [script]");
System.exit(64);
} else if (args.length == 1) {
runFile(args[0]);
} else {
runPrompt();
}
}
private static void runFile(String path) throws IOException {
byte[] bytes = Files.readAllBytes(Paths.get(path));
run(new String(bytes, Charset.defaultCharset()));
if (hadError) System.exit(65);
}
private static void runPrompt() throws IOException {
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader reader = new BufferedReader(input);
for (;;) {
System.out.print("> ");
run(reader.readLine());
hadError = false;
}
}
private static void run(String source) {
Scanner scanner = new Scanner(source);
List<Token> tokens = scanner.scanTokens();
// For now, just print the tokens.
for (Token token : tokens) {
System.out.println(token);
}
}
static void error(int line, String message) {
report(line, "", message);
}
private static void report(int line, String where, String message) {
System.err.println(
"[line " + line + "] Error" + where + ": " + message);
hadError = true;
}
}
枚举类
enum TokenType {
LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,
// One or two character tokens.
BANG, BANG_EQUAL,
EQUAL, EQUAL_EQUAL,
GREATER, GREATER_EQUAL,
LESS, LESS_EQUAL,
// Literals.
IDENTIFIER, STRING, NUMBER,
// Keywords.
AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,
EOF
}
扫描器
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import static com.craftinginterpreters.lox.TokenType.*;
class Scanner {
private static final Map<String, TokenType> keywords;
static {
keywords = new HashMap<>();
keywords.put("and", AND);
keywords.put("class", CLASS);
keywords.put("else", ELSE);
keywords.put("false", FALSE);
keywords.put("for", FOR);
keywords.put("fun", FUN);
keywords.put("if", IF);
keywords.put("nil", NIL);
keywords.put("or", OR);
keywords.put("print", PRINT);
keywords.put("return", RETURN);
keywords.put("super", SUPER);
keywords.put("this", THIS);
keywords.put("true", TRUE);
keywords.put("var", VAR);
keywords.put("while", WHILE);
}
private final String source;
private final List<Token> tokens = new ArrayList<>();
// private final List<Token> tokens = new ArrayList<>();
private int start = 0;
private int current = 0;
private int line = 1;
Scanner(String source) {
this.source = source;
}
List<Token> scanTokens() {
while (!isAtEnd()) {
// We are at the beginning of the next lexeme.
start = current;
scanToken();
}
tokens.add(new Token(EOF, "", null, line));
return tokens;
}
private void scanToken() {
char c = advance();
switch (c) {
case '(': addToken(LEFT_PAREN); break;
case ')': addToken(RIGHT_PAREN); break;
case '{': addToken(LEFT_BRACE); break;
case '}': addToken(RIGHT_BRACE); break;
case ',': addToken(COMMA); break;
case '.': addToken(DOT); break;
case '-': addToken(MINUS); break;
case '+': addToken(PLUS); break;
case ';': addToken(SEMICOLON); break;
case '*': addToken(STAR); break;
case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;
case '/':
if (match('/')) {
// A comment goes until the end of the line.
while (peek() != '\n' && !isAtEnd()) advance();
} else {
addToken(SLASH);
}
break;
case ' ':
case '\r':
case '\t':
// Ignore whitespace.
break;
case '\n':
line++;
break;
case '"': string(); break;
case 'o':
if (peek() == 'r') {
addToken(OR);
}
break;
default:
if (isDigit(c)) {
number();
} else if (isAlpha(c)) {
identifier();
} else {
Lox.error(line, "Unexpected character.");
}
break;
}
}
private void identifier() {
while (isAlphaNumeric(peek())) advance();
// See if the identifier is a reserved word.
String text = source.substring(start, current);
TokenType type = keywords.get(text);
if (type == null) type = IDENTIFIER;
addToken(type);
}
private void number() {
while (isDigit(peek())) advance();
// Look for a fractional part.
if (peek() == '.' && isDigit(peekNext())) {
// Consume the "."
advance();
while (isDigit(peek())) advance();
}
addToken(NUMBER,
Double.parseDouble(source.substring(start, current)));
}
private char peekNext() {
if (current + 1 >= source.length()) return '\0';
return source.charAt(current + 1);
}
private boolean isAlpha(char c) {
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_';
}
private boolean isAlphaNumeric(char c) {
return isAlpha(c) || isDigit(c);
}
private void string() {
while (peek() != '"' && !isAtEnd()) {
if (peek() == '\n') line++;
advance();
}
// Unterminated string.
if (isAtEnd()) {
Lox.error(line, "Unterminated string.");
return;
}
// The closing ".
advance();
// Trim the surrounding quotes.
String value = source.substring(start + 1, current - 1);
addToken(STRING, value);
}
private boolean match(char expected) {
if (isAtEnd()) return false;
if (source.charAt(current) != expected) return false;
current++;
return true;
}
private char peek() {
if (isAtEnd()) return '\0';
return source.charAt(current);
}
private boolean isDigit(char c) {
return c >= '0' && c <= '9';
}
private boolean isAtEnd() {
return current >= source.length();
}
private char advance() {
current++;
return source.charAt(current - 1);
}
private void addToken(TokenType type) {
addToken(type, null);
}
private void addToken(TokenType type, Object literal) {
String text = source.substring(start, current);
tokens.add(new Token(type, text, literal, line));
}
}
下面一张是文法分析,我们将构造词法分析树,用递归向下的方式生产语法树。