Dcspm develop and Compile Subset of pascal language to msil

© Copyright by Abdullah Sheneamer 2012

Yüklə 1,28 Mb.

səhifə	2/4
tarix	07.11.2018
ölçüsü	1,28 Mb.
	#78911

1 2 3 4

1. Introduction

© Copyright by Abdullah Sheneamer 2012

All Rights Reserved

This project for the Master of Science degree by

Abdullah Sheneamer
has been approved for the
Department of Computer Science
By

_______________________________________________________

Dr. Albert Glock, Advisor

_______________________________________________________

Dr. C. Edward Chow, Committee member

_______________________________________________________

Albert Brouillette, Committee member

_______________________________

Date

DCSPM

Develop and Compile Subset of

PASCAL Language to MSIL

Abstract
The focus of this project is designed the Intermediate language (IL or MSIL) for PASCAL Language. This project aims to design a compiler that can handle subset of PASCAL Language to compile to MSIL including, Assignment statement, Write line instructions, If statement, If/else statement, While statement, For statement, Switch statement, If logic statement, and One dimensional array. Also, it’s designed an algorithm that implements the lexical analysis and another algorithm for syntax analysis that the first phase of a compiler is called scanning too. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. The second phase of a compiler is called syntax analysis or parsing. The parser uses the first components of the tokens produced by lexical analysis. The compilation time is important so, there are some algorithms and evaluate these different algorithms for their speed performance in Lexical Analysis and Parser which can become bottleneck.

“ILAsm has the instruction set same as that the native assembly language has. You can write code for ILAsm in any text editor like notepad and then can use the command line compiler (ILAsm.exe) provided by the .NET framework to compile that. ILAsm.exe is a command line tool shipped with the .NET Framework and can be located at \Microsoft.NET\Framework\ folder. You can include this path in your path environment variable. When you have finished compiling your .IL file, then it will output the exe with the same name as that of .IL file. You can specify the output file name using /OutPut= switch like ILAsm Test.il /output=MyFile.exe. To run the outputted exe file, just type the name of the exe and hit enter” [11].

Acknowledgements

I would never have been able to finish my dissertation without the guidance of my committee members, and support from my family and wife.

I would like to express my deepest gratitude to my advisor, Dr. Albert Glock, for his excellent guidance, caring, patience, and providing me with an excellent atmosphere for doing research.
I offer my sincerest gratitude to Dr. Edward Chow, who let me experience the research of practical issues beyond the textbooks, patiently corrected my writing research and giving important questionable ideas to me through his comments on my proposal.
I would also like to thank Albert Brouillette for being interested in getting my proposal succeeded and giving important questionable ideas to me through his comments on my proposal.

List of Content

List of Figures
List of Tables

1. Introduction

In the computer world, techniques evolve rapidly from theories, algorithms, programming languages, software systems, and software engineering.

“Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers.” [6]

Fortunately, compilers allow programmers to write at a high level, and automated processing takes care of creating the machine-specific instructions. My project designs and creates a compiler that translates PASCAL source code into Microsoft Intermediate Language (MSIL). When compiling the source code to managed code in .Net environment, the compiler translates the source into Microsoft Intermediate Language (MSIL). MSIL includes instructions for loading, storing, initializing, and calling methods on objects, as well as instructions for arithmetic and logical operations. There is currently no PASCAL compiler which compiles to MSIL. The Just-in-time (JIT) compiler will convert the MSIL to CPU- Specific code [1].

The advantage in compiling to MSIL is that 1) legacy PASCAL can now be run on modern machines, 2) MSIL is platform independent and 3) JIT compilers can be optimized for specific machines and architectures. The JIT compiler can also do aggressive optimizations specifically for the machine where the code is running.

“Before you can run Microsoft intermediate language (MSIL), it must be converted by a .NET Framework just-in-time (JIT) compiler to native code, which is CPU-specific code that runs on the same computer architecture as the JIT compiler. Because the common language runtime supplies a JIT compiler for each supported CPU architecture, developers can write a set of MSIL that can be JIT-compiled and run on computers with different architectures. However, your managed code will run only on a specific operating system if it calls platform-specific native APIs, or a platform-specific class library.

JIT compilation takes into account the fact that some code might never get called during execution. Rather than using time and memory to convert all the MSIL in a portable executable (PE) file to native code, it converts the MSIL as needed during execution and stores the resulting native code so that it is accessible for subsequent calls. The loader creates and attaches a stub to each of a type's methods when the type is loaded. On the initial call to the method, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to direct execution to the location of the native code. Subsequent calls of the JIT-compiled method proceed directly to the native code that was previously generated, reducing the time it takes to JIT-compile and run the code.

The runtime supplies another mode of compilation called install-time code generation. The install-time code generation mode converts MSIL to native code just as the regular JIT compiler does, but it converts larger units of code at a time, storing the resulting native code for use when the assembly is subsequently loaded and run. When using install-time code generation, the entire assembly that is being installed is converted into native code, taking into account what is known about other assemblies that are already installed. The resulting file loads and starts more quickly than it would have if it were being converted to native code by the standard JIT option.

As part of compiling MSIL to native code, code must pass a verification process unless an administrator has established a security policy that allows code to bypass verification. Verification examines MSIL and metadata to find out whether the code is type safe, which means that it only accesses the memory locations it is authorized to access. Type safety helps isolate objects from each other and therefore helps protect them from inadvertent or malicious corruption. It also provides assurance that security restrictions on code can be reliably enforced.

The runtime relies on the fact that the following statements are true for code that is verifiably type safe:

A reference to a type is strictly compatible with the type being referenced.
Only appropriately defined operations are invoked on an object.
Identities are what they claim to be.

During the verification process, MSIL code is examined in an attempt to confirm that the code can access memory locations and call methods only through properly defined types. For example, code cannot allow an object's fields to be accessed in a manner that allows memory locations to be overrun. Additionally, verification inspects code to determine whether the MSIL has been correctly generated, because incorrect MSIL can lead to a violation of the type safety rules. The verification process passes a well-defined set of type-safe code, and it passes only code that is type safe. However, some type-safe code might not pass verification because of limitations of the verification process, and some languages, by design, do not produce verifiably type-safe code. If type-safe code is required by security policy and the code does not pass verification, an exception is thrown when the code is run.” [12]

Program HelloWorld; Begin

Writeln (‘ Hello World’); End .

Execution

Compilation

JIT Compiler

Native Code

MSIL

PASCAL

Compiler

.method public static void Main() cil managed

{

.entrypoint

.maxstack 1

IL_00: ldstr "Hello World"

IL_05: call void [mscorlib]System.Console::WriteLine(string)

IL_10: ret

} // end of method HelloWorld::Main

Figure 1: The compilation and execution process of PASCAL programs.

-Compilation process: takes PASCAL source code and produces MSIL. The PASCAL compiler includes lexical and syntax analysis, and the creation of the symbol table. MSIL is created when compiling to manage native code. MSIL is a CPU-independent set of instructions that can be efficiently converted to native code. Such as in Figure 2.

-Execution process: MSIL must be converted to CPU-specific code, usually by a just-in-time (JIT) compiler. Native code is computer programming (code) that is compiled to run with a particular processor (such as an Intel x86-class processor) and its set of instructions.

Lexical Analysis

Parser & MSIL

Symbol Table

Source code of PASCAL

Error

Handler

.method public static void Main() cil managed

{

.entrypoint

.maxstack 1

IL_00: ldstr "Hello World"

IL_05: call void [mscorlib]System.Console::WriteLine(string)

IL_10: ret

} // end of method HelloWorld::Main

Figure 2: Compilation process

1.1. Motivation:

“During compilation of MSIL, the source code is translated into MSIL code rather than platform or processor-specific object code. MSIL is a CPU- and platform-independent instruction set that can be executed in any environment supporting the Common Language Infrastructure, such as the .NET runtime on Windows, or the cross-platform Mono runtime. In theory, this eliminates the need to distribute different executable files for different platforms and CPU types. MSIL code is verified for safety during runtime, providing better security and reliability than natively compiled executable files. ” [13]

Since, there is currently no PASCAL compiler which compiles to MSIL so, I design MSIL of subset of PASCAL language which has the advantage in compiling to MSIL is that 1) legacy PASCAL can now be run on modern machines, 2) MSIL is platform independent and 3) JIT compilers can be optimized for specific machines and architectures.
2. Background
2.1. Overview of Compilation Process
A compiler is a program that can read a program in one language- the source language – and translate it into equivalent program in another language as Figure 2. An important role of the compiler is to report any errors in the source program that it detects during the translation process [6].

Target Program

Source Program

Compiler

Figure 2: A Compiler

“Microsoft Intermediate Language (MSIL) is a language used as the output of a number of compilers (C#, VB, .NET, and so forth). The ILDasm (Intermediate Language Disassembler) tool that ships with the .NET Framework SDK (FrameworkSDK\Bin\ildasm.exe) allows the user to see MSIL code in human-readable format. By using this utility, we can open any .NET executable file (EXE or DLL) and see MSIL code.

The ILAsm (Intermediate Language Assembler) tool generates an executable file from the MSIL language. We can find this program in the WINNT\Microsoft.NET\Framework\vn.nn.nn directory. Any PASCAL programmer starting with .NET development is interested in what happens in the low level of the .NET Framework. Learning MSIL gives a user the chance to understand some things that are hidden from a programmer working with PASCAL or another language. Knowing MSIL gives more power to a .NET programmer. We never need to write programs in MSIL directly, but in some difficult cases it is very useful to open the MSIL code in ILDasm and see how things are done” [14].
2.2. History

“Pascal is an influential imperative and procedural programming language, designed in 1968–1969 and published in 1970 by Niklaus Wirth a small and efficient language intended to encourage good programming practices using structured programming and data structuring. A derivative known as Object Pascal designed for object-oriented programming was developed in 1985.

Pascal, named in honor of the French mathematician and philosopher Blaise Pascal, was developed by Niklaus Wirth and based on the ALGOL programming language

Prior to his work on Pascal, Wirth had developed Euler and ALGOL W and later went on to develop the Pascal-like languages Modula-2 and Oberon.

Initially, Pascal was largely, but not exclusively, intended to teach students structured programming. A generation of students used Pascal as an introductory language in undergraduate courses. Variants of Pascal have also frequently been used for everything from research projects to PC games and embedded systems. . Newer Pascal compilers exist which are widely used” [15].

Grace Murray Hopper coined the term compiler in the early 1950s. Translation was viewed as the “compilation” of a sequence of machine language subprograms selected from a library. One of the ﬁrst real compilers was the FORTRAN compiler of the late 1950s. It allowed a programmer to use a problem-oriented source language. Ambitious “optimizations” were used to produce efficient machine code, which was vital for early computers with quite limited capabilities. Efficient use of machine resources is still an essential requirement for modern compilers [16].

3. Design
3.1: Introduction to Symbol Table and Lexical Analysis
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier (information about storage allocation, type,…, etc.). When the lexical analyzer detects an identifier in the source, the identifier is entered into the symbol table. However, its attributes will be entered in the following phases. These attributes are also used later phases.

The lexical Analysis is the first phase of a compiler is called lexical analysis or scanning. The lexical Analysis reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.

3.1.A. Symbol Table Design

Every key word is a token and has a unique integer code as shown in the table:

Token Code	Keyword
300	Begin
323	If
302	For
305	Switch
376	While

Table 1: Part of Symbol Table

So, The identifier token has a code 256, the number token has a code 257, and every special character is a token and has an integer token code equals its ASCII number. Tokens of two characters have unique to Codes as shown in the below table:

Token Code	Tow – Characters Tokens
406	!=
407	==
408	<=
409	>=

Table 2: Two Characters Tokens

A token in an instance of the class as shown in the below figure:

Figure 3: Class Token in Lexical Analysis

3.1.B. Lexical Analysis Design
after reading next character from input stream ;

State 0 : identify the current token and decide the next state ;

State 1 : Handle identifiers and keywords.

State 2: Handle Number .

State 3 : Handle one – character token or two –character token .

State 4,5 : Handle Comments “\\” or “\*”, skip the line start with “\\” or skip the data between “\*” and “*\”.

Begin -/ 1-lexbuf= “”

2-state=0;

Figure 4: State diagram for the lexical Analyzer (states 0,1,2)

Begin -/ 1-lexbuf= “”

2-state=0;

Figure 5: State diagram for the lexical Analyzer (states 3,4,5)

3.1. Parser and MSIL (Microsoft Intermediate Language) of PASCAL

Language Design
3.1.1: Introduction to Parser (Syntax Analysis)

The parser inputs the stream of tokens into a hierarchical structure represented by a syntax tree. A typical data structure for the syntax tree of this example “ position := initial + rate * 60 token stream is shown below:

=

+

Position

Initial

*

Rate

60

Figure 6: Syntax Tree

=: • •

Id1 1

+ • •

Id2 2

* • •

Id3 3

Num 60

Figure 7: Typical Data Structure for the given Syntax Tree

“Grammar is used throughout the parser to organize compiler front ends. A grammar naturally describes the hierarchical structure of most programming language constructs such as the Pascal Language. For example, an if- else statement in Pascal language can have the form

If (expression) statement else statement.

That is, an if-else statement is the concatenation of the keyword if , an opening parenthesis, an expression, a closing parenthesis, a statement, the keyword else, and another statement. Using the variable expr to denote an expression and variable stmt to denote a statement, this structuring rule can be expressed as:

stmt  if (expr) stmt else stmt

in which the arrow may be read as “ can have the form” Such a rule is called a production. In production, lexical elements like the keyword if and the parentheses are called terminals. Variables like expr and stmt represent sequence of terminals and called non-terminals” [6].

3.1.2. Parser (Syntax Analysis) Design
Parsing is the process of determining if a string of tokens can be generated by a grammar. To parse Pascal, it is sufficient to make a single left to right scan over the input, looking ahead one token at a time. Top-Down parsing constructs the nodes of a parse tree starting at the root and proceeding towards the leaves such as the simple example in Figure 8. To construct the parse tree, start at the root and repeatedly do the following two steps:

At the function “OneDimArray” construct children at “n”. For the symbols on the right side the production.
Find the next node at which a sub tree is to be constructed.

“Note: When starting with nonterminal OneDimArray at the root, we should use a production for OneDimArray that starts with lookahead symbol array. The lookahead symbol always contains the next token to be parsed in the input stream.” [6]

Figure 8: Steps in the top-down construction of Parse Tree

integer

of

]

num

dot dot

array

[

num

“When the node being considered in the parse tree is for a terminal, and the terminal matches the look ahead symbol, then we advance in both the parse tree and the input. The next token in the input becomes the new look ahead symbol, and the next child in the parse tree is considered. When a node labeled with a nonterminal is considered, we repeat the process of selecting a production for the nonterminal. In general, the selection of a production for a nonterminal may involve trial-and-error. However, a method called “predictive parsing” is simple and free from trial-and-error.” [6]

The statements and one dimensional array grammar that include my project:

Assignment statement that an arithmetic expression is an expression using additions +, subtractions -, multiplications *, and divisions div. A single mode arithmetic expression is an expression all of whose operands are of the same type (i.e. INTEGER, REAL or COMPLEX). However, only INTEGER and REAL will be covered in this project. Therefore, those values or variables in a single mode arithmetic expression are all integers or real numbers. such as a:=b+c div d-e OR an assignment statement gives a value to a variable such as x:=5; and compile that to Intermediate language.

<assignment statement> ::=

<variable> := <expression>

The PASCAL compiler is structured in such a way that a write, and writeln statements containing more than one argument is compiled into several write statement with only one argument. For writeln, these statements are followed by a statement that writes the end-of-line. So for example the writeln statement: “ Prgoram Write; Begin writeln('This writeln is compiled into MSIL '); End . ”

“if” Statement grammar:

<if statement> ::=

if <expression> then <statement>

“if/Else” Statement grammar:

<if statement> ::=

if <expression> then <statement> |
if <expression> then <statement> else <statement>

“ While” Statement grammar:

<while statement> ::=

while <expression> do <statement>

“For” Statement grammar:

<for statement> ::= for <variable identifier > ::= <expression> to

<expression> do < statement>
for statement syntax diagram

“Case” Statement grammar:

:= Case id Of End ‘;’ | empty

:= ‘’’ ‘:’ ’;’

| empty

< case_Label_list> := < Constant> ‘{‘ | ‘,’ |’{‘

:= ‘’’ | ’+’ | ’-‘ | id | num

“ Array” structure grammar:

:= array [ num .. num ] of
:= integer | real

If logic statement grammar:

:= Or < expression_list> | empty

:= < expression_list> And | empty

:= < expression> | ‘,’ < expression >

:=…….
3.1.3. Introduction to MSIL (Microsoft Intermediate Language)

MSIL is the Microsoft Intermediate Language. All .NET compatible languages will get converted to MSIL. MSIL also allows the .NET Framework to JIT compile the assembly on the installed computer. The main purpose of this Intermediate code formation is to have a platform independent code...that is once MSIL is available you can run on any platform provided appropriate run time environments are installed on the specific platform you wish to run such as CLR in case of .NET.

IL is what your Pascal code gets compiled into and is sent to the JIT compiler when .NET programs are run. MSIL is a very low level language that is very fast, and working with it gives you exceptional control over your programs.
“All operations in MSIL are executed on the stack. When a function is called, its parameters and local variables are allocated on the stack. Function code starting from this stack state may push some values onto the stack, make operations with these values, and pop values from the stack.

Execution of both MSIL commands and functions is done in three steps:

Push command operands or function parameters onto the stack.
Execute the MSIL command or call function. The command or function pops their operands (parameters) from the stack and pushes onto the stack result (return value).
Read result from the stack”[14].

The Pascal code of MSIL in our previous example looks like this simple code:

“ Program HelloWorld;

Begin

Writeln (‘ Hello World’);

End . “

The output MSIL:

// Metadata version: v4.0.30319

.assembly extern mscorlib

{

.publickeytoken = (B7 7A 5C 56 19 34 E0 89 ) // .z\V.4..

.ver 2:0:0:0

}

.assembly HelloWorld

{

.hash algorithm 0x00008004

.ver 0:0:0:0

}

.module expression.dll

.imagebase 0x00400000

.file alignment 0x00000200

.stackreserve 0x00100000

.subsystem 0x0003 // WINDOWS_CUI

.corflags 0x00000001 // ILONLY

// Image base: 0x00820000

// =============== CLASS MEMBERS DECLARATION ===================

.class public auto ansi HelloWorld

extends [mscorlib]System.Object

{

.method public static void Main() cil managed

{

.entrypoint

.maxstack 1

IL_00: ldstr "Hello World"

IL_05: call void [mscorlib]System.Console::WriteLine(string)

IL_10: ret

} // end of method HelloWorld::Main

.method public specialname rtspecialname

instance void .ctor() cil managed

{

.maxstack 2

IL_00: ldarg.0

IL_01: call instance void [mscorlib]System.Object::.ctor()

IL_06: ret

} // end of method HelloWorld::.ctor
} // end of class HelloWorld

What’s inside the Class Members Declaration :

.method : A method definition begins with the .method directive and can be defined at global scope or within a class. The application entry point must be static, meaning an instance is not required to call the method, and that is indicated by the static keyword. Declaring a global method static seems redundant but the ILASM compiler complains if you omit thestatic keyword in some cases. ‘void main()’ as the signature of the method which, as you would expect, indicates that it does not return a value and takes zero arguments.
.entrypoint : The .entrypoint directive signals to the runtime that this method is the entry point for the application. Only one method in the application can have this directive.

.maxstack : The .maxstack directive indicates how many stack slots the method expects to use. For example, adding two numbers together involves pushing both numbers onto the stack and then calling the add instruction which pops both numbers off the stack and pushes the result onto the stack. In that example you will need two stack slots.
Ldstr : The ldstr instruction pushes the string that is passed to the WriteLine method onto the stack.

Call : The call instruction invokes the static WriteLine method on the System.Console class from the mscorlib assembly. This is an example of a method declaration. It provides the full signature of the WriteLine method (including the string argument) so that the runtime can determine which overload of the WriteLine method to call.

Ret : The ret instruction returns execution to the caller. In the case of the entry point method, this would bring your application to an end.

Also, some programs have a .local directive that declares variables such as:

.local ( int32 a, int32 b,…..). In this MSIL method, variables are declared using the .locals directive.

3.1.4. Intermediate language Instructions
“When a method is executed, three categories of memory local to the method plus one category of external memory are involved. All these categories represent typed data slots, not simply an address interval as is the case in the unmanaged world. The external memory manipulated from the method is the community of the fields the method accesses (except the fields of value types belonging to the local categories). The local memory categories include an argument table, a local variable table, and an evaluation stack. Figure 9 describes data transitions between these categories. As you can see, all IL instructions resulting in data transfer have the evaluation stack as a source or a destination, or both.

image from book

Figure 9: Method memory categories

The argument and local variable tables have a static type which can be any of the types defined in the .NET Framework and the application. The evaluation stack table holds different types at different times during the course of the method execution. So, the same stack could be used for different variables.

IL instructions consist of an operation code (opcode), which for some instructions is followed by an instruction parameter. Opcodes are either 1 byte or 2 bytes long.

Some of the IL instructions that I used in my project such as:

3.1.4.1. Unconditional branching: instructions take nothing from the evaluation stack and put nothing on it.

br (0x38). Branch bytes from the current position.

By default, the IL assembler does not automatically choose between long-parameter and short-parameter forms. Thus, if you specify a short-parameter instruction and put the target label farther away than the short parameter permits, the calculated offset is truncated to 1 byte, and the IL assembler issues an error message.

br.s (0x2B). The short-parameter form of br.

Yüklə 1,28 Mb.

Dostları ilə paylaş:

1 2 3 4