Compilers
A compiler is a software tool that translates code written in a high-level programming language into machine language, enabling computers to execute instructions. By allowing programmers to use more intuitive languages rather than complex machine code, compilers enhance the usability and efficiency of computer programming. Compilation involves several stages: lexical analysis, syntactic analysis, and semantic analysis followed by code generation, each of which plays a significant role in ensuring that the code is correctly interpreted and made executable. Compilers can be categorized alongside other program translators, such as assemblers and interpreters, each serving distinct functions in code translation. The theoretical foundations of compiler design have been influenced by linguistic principles, particularly those articulated by Noam Chomsky, which structure the relationship between programming languages and formal grammars. Compilers not only facilitate traditional programming tasks but also support a variety of applications, including natural language processing and animation scripting. Overall, compilers remain integral to modern computing, driving the development of new programming languages and enhancing the performance of existing ones.
- Type of physical science: Computation
- Field of study: Computers
A compiler is a computer program that translates program code from the language used by the programmer into machine language, the language of the computer. Compilers have made computers much more useful than they would otherwise have been by allowing programmers to use languages that provide a more natural way to express the solution to a programming problem.

[Figure: Compiler design. The general design of a typical compiler, with a front end, a back end, and a code optimizer. By Pronesto (own work), CC-BY-SA-3.0, via Wikimedia Commons.]
Overview
Computers are driven by programs, and programs are usually written by people. A program is a set of instructions that tells a computer what to do. If a program is written in a language other than the machine language of the particular computer on which the program is to run, then the program must be translated into that machine's native language. The purpose of a compiler is to translate programs from one language into another.
Just as human language consists of words, phrases, and sentences, so computer languages consist of similar constructs. For a compiler to translate a program, it must be able to recognize every element in the program. Word-level elements constitute the lexical structure of a language, phrase- and sentence-level elements constitute its syntactic structure, and the meaning of sentences is called semantics. A compiler consists of individual components to process each of these types of language elements.
There are three basic types of program translators: assemblers, compilers, and interpreters. Assemblers were among the first translators. They convert programs that are written in assembly language into machine language. Since assembly language is not far removed from machine language, assemblers are fairly simple translators. Assembly language instruction codes are words (called mnemonics), each of which represents a machine language instruction. Some codes that one would expect to find in an assembly language are "ADD," "SUB," "MOVE," "LOAD," and "STORE." When an assembler finds one of these words in a program, it looks the word up in a table and substitutes the machine language code for it. The other primary task of the assembler is to assign memory locations to all the variables in a program. The programmer is allowed to give names to locations in the computer's memory and then to refer to these names (called variables) in the program. Computers refer to memory locations only by a number, which is called the "address" of the location. The "MOVE" instruction provides a means for moving or copying a value from one memory location to another; "MOVE X Y" is an example of an instruction showing that memory locations can be referred to by name rather than by number. If the assembler can successfully substitute machine code for all the mnemonics and assign numeric memory locations to all the variables, then it can generate a machine language program that carries exactly the same meaning as the assembly language program it was given.
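As a rough sketch of these two tasks, the following Python fragment assembles programs for a hypothetical machine; the mnemonics, opcode numbers, and starting address are invented for illustration and do not correspond to any real instruction set.

```python
# A minimal assembler sketch for a hypothetical machine.
# The opcode numbers and data-area start address are invented.

OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "MOVE": 0x04}

def assemble(lines):
    symbols = {}          # variable name -> assigned memory address
    next_address = 100    # arbitrary start of the data area
    program = []

    def address_of(name):
        nonlocal next_address
        if name not in symbols:      # first use: assign a new location
            symbols[name] = next_address
            next_address += 1
        return symbols[name]

    for line in lines:
        mnemonic, *operands = line.split()
        if mnemonic not in OPCODES:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        # substitute the numeric opcode and resolve each variable name
        program.append((OPCODES[mnemonic], [address_of(op) for op in operands]))
    return program, symbols

code, table = assemble(["LOAD X", "ADD Y", "STORE Z", "MOVE X Y"])
print(code)   # [(1, [100]), (2, [101]), (3, [102]), (4, [100, 101])]
print(table)  # {'X': 100, 'Y': 101, 'Z': 102}
```

The two dictionaries mirror the two tasks described above: the opcode table drives mnemonic substitution, and the symbol table records the numeric address assigned to each programmer-chosen name.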
Assembly language is a difficult language in which to program because it is so close to machine language. The programmer is often forced to think about the solutions to problems in unnatural ways. It is a common practice to "fit the tool to the task," and assembly language is a difficult tool to use for large and complex programming tasks.
To overcome the obstacles of programming in assembly language, many programming languages have been designed to meet diverse needs. There are various types of languages: imperative, applicative, functional, and object-oriented. These languages are generally referred to as high-level languages or high-order languages, and each provides a unique viewpoint for solving programming problems. In addition to languages for computer programming, there are also languages for various other computer applications, such as text processing, computer animation, and computer-assisted design. Every language requires a translator. Unless the programmer is dealing with assembly language, that translator usually takes the form of either a compiler or an interpreter. Before one can understand how a compiler works, one must understand the languages that compilers are expected to translate.
Many high-order computer languages are similar in structure to the natural languages that people use to communicate with one another. The parts of a natural language are generally divided into three main structural components: lexicon, syntax, and semantics.
The lexicon is the dictionary of the language. It consists of all the words in the language. The rules of syntax govern the manner in which those words may be grouped together to form sentences that are acceptable to the speakers of the language. Semantics concerns how to determine what a group of words means to the speakers of the language. Mathematical theories of natural language syntax have enabled computer scientists to build artificial languages with properties that are precise enough to be interpreted by computer programs. Since artificial languages are modeled on abstractions of natural languages, they contain many of the same structural properties as natural languages, but without the problems of ambiguity.
A computer language usually has a lexicon, rules of syntax, and semantic rules. The lexicon in a computer language is the set of all "words" that exist in the language. This set consists of two types of words: those literally specified by the language and those that may be invented by the programmer. In inventing new words, the programmer must follow certain rules.
Different types of words serve different purposes in the language, much as natural languages group words into nouns, verbs, adjectives, and so on, with each category serving a different purpose. The rules for syntax in a computer language are very precise and must be followed to the letter if the compiler is to understand the program. An entire computer program is often looked upon as a single "sentence," with various subcomponents viewed as individual "phrases."
Semantic rules govern the actions a compiler is to take for specific "phrases" found during the translation process. The translation process has three major steps: lexical analysis, syntactic analysis, and semantic analysis combined with code generation. These steps may be performed individually (in isolation) or in tandem. Usually, the steps are performed in tandem, with syntactic analysis guiding the process.
The lexical analyzer part of a compiler (also called the scanner) performs several important functions. It scans the program text (also called the source), ensuring that all the words (also called tokens) are legal members of the lexicon of the language. Each token is classified according to lexical category for subsequent use by the syntax analyzer. The scanner also makes note of any token that has been "invented" by the programmer by placing that token into a table.
Each programmer-defined token must eventually be assigned a semantic category. The purpose of the scanner is therefore threefold: First, it checks every token to ensure that it is a member of the lexicon of the language. Second, it classifies each token. Third, it sets every programmer-defined token aside for future reference.
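A minimal scanner might be sketched in Python as follows; the token categories, keyword list, and symbol-table scheme are simplifying assumptions for a toy language rather than features of any particular compiler.

```python
import re

# Keywords are the "literal" lexicon of the toy language;
# identifiers (NAME) are the words a programmer may invent.
KEYWORDS = {"if", "then", "else", "while"}
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    symbol_table = {}   # programmer-invented tokens, set aside for later phases
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if not m:   # the word is not a legal member of the lexicon
            raise SyntaxError(f"illegal character {source[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "NAME" and text not in KEYWORDS:
            symbol_table.setdefault(text, len(symbol_table))
        if kind != "SKIP":   # classify each token for the parser
            tokens.append(("KEYWORD" if text in KEYWORDS else kind, text))
        pos = m.end()
    return tokens, symbol_table

print(scan("if x then y = x + 42"))
```

All three duties appear here: illegal characters are rejected, every surviving token carries a lexical category, and programmer-defined names are recorded in the symbol table for the later phases.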
The syntax analyzer is also called the "parser." Parsing is the process of determining whether the syntax of a sentence conforms to the rules of the language. If a program's syntax is incorrect, then the compiler is unable to determine the meaning of the program and hence cannot generate machine code. If the syntax is correct, then the parser passes individual "phrases" of code to the semantic analyzer, which interprets their meaning and performs the final step of machine code generation.
There are two basic types of parsers: one is called a top-down parser; the other, a bottom-up parser. Both attempt to match the given "sentence" to some subset of rules in the grammar. The top-down parser starts with the rules themselves and attempts to derive the sentence directly from them. The bottom-up parser starts with the sentence and attempts to find the rules by working in the opposite direction. In general, the bottom-up parser is considered the more discriminating of the two, since it can handle a larger class of grammars. It is possible, however, and quite common, for a language designer to construct a language so that it is most easily parsed by one or the other of the two types of parsers.
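A top-down parser can be written almost directly from the grammar rules. The Python sketch below is a recursive-descent parser for a tiny invented arithmetic grammar; the grammar and the tuple-shaped parse tree are assumptions made for the example.

```python
def parse(tokens):
    """Recursive-descent (top-down) parser for a toy grammar:
         expr -> term (("+" | "-") term)*
         term -> NUMBER | "(" expr ")"
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {peek()!r}")
        pos += 1

    def expr():                      # one function per grammar rule
        node = term()
        while peek() in ("+", "-"):
            op = peek()
            expect(op)
            node = (op, node, term())
        return node

    def term():
        nonlocal pos
        tok = peek()
        if tok == "(":
            expect("(")
            node = expr()            # recursion handles nesting
            expect(")")
            return node
        if tok is not None and tok.isdigit():
            pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse(["1", "+", "(", "2", "-", "3", ")"]))  # ('+', 1, ('-', 2, 3))
```

Each grammar rule becomes one function, which is why this technique suits languages designed for top-down parsing: the program is a transcription of the grammar.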
The semantic analyzer deals with aspects of the program that more directly concern computing problems. One of these aspects is type compatibility. Integer numbers and real numbers are represented differently on a computer, so, strictly speaking, integers and reals are not type-compatible. To perform an arithmetic operation such as addition that involves both an integer and a real, one must be converted into the other before the operation can be performed (the integer is generally converted to a real). Sometimes the programmer makes a mistake and attempts an operation between types that is not legal. The semantic analyzer has its own set of rules for inspecting the program for such anomalies. If the semantic analyzer finds a breach of the rules, the error must be reported, and the final stage of compilation, code generation, cannot be performed. If all the semantic rules are satisfied, machine code is generated from the phrases of the program that have corresponding machine code interpretations.
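The kind of type checking described here might be sketched as follows; the two-type system, the coercion rule, and the tuple tree shape (matching the toy parser above) are deliberately minimal assumptions.

```python
# Minimal semantic-analysis sketch: computing the type of an expression
# tree, with the usual integer-to-real coercion. "string" is included
# only to show an illegal combination being rejected.

def type_of(node):
    if isinstance(node, int):
        return "integer"
    if isinstance(node, float):
        return "real"
    if isinstance(node, str):
        return "string"
    op, left, right = node
    lt, rt = type_of(left), type_of(right)
    if {lt, rt} <= {"integer", "real"}:
        # mixed arithmetic is legal: the integer is converted to a real
        return "real" if "real" in (lt, rt) else "integer"
    raise TypeError(f"operator {op!r} not defined for {lt} and {rt}")

print(type_of(("+", 1, 2)))     # integer
print(type_of(("+", 1, 2.5)))   # real: the integer operand is coerced

try:
    type_of(("+", 1, "two"))    # a breach of the semantic rules
except TypeError as err:
    print(err)                  # operator '+' not defined for integer and string
```

When the rule check fails, the analyzer reports the anomaly instead of producing a type, which models the point above that code generation cannot proceed past a semantic error.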
Applications
Translators in general and compilers in particular play an essential role in the continuing development of computers. The task of writing computer programs directly in machine code is so painstaking that it is seldom undertaken except as an educational experience.
Every programming language has at its heart a translator of some kind. So much is known about compiler construction that new languages are often developed with compiler theory in mind. This ensures that the language will be usable with current compiler tools. The programming language Pascal was designed so that it could be quickly compiled on a small computer using a technique known as "recursive-descent parsing." At the time Pascal was developed, a small computer seldom had more than sixteen thousand bytes of memory. Even after memory constraints lessened, the speed of compilation achieved by the best Pascal compilers set standards for compilers of similar languages.
While compilers are commonly used to translate high-order programming languages into machine language, translators are also often embedded within the user interfaces of commercial programs. Some applications are more easily handled by setting up a "script" that specifies, step by step, the actions to be taken. One example is computer animation. In hand-drawn "cel" animation, each scene must be drawn out in detail, with as many as thirty frames for each second of the animation. This same process can be done on a computer, with the artist drawing out every detail of every scene. An alternate approach is to provide the artist with a combination of drawing tools and a scripting language. The artist draws only select portions of the animation and then uses a script to specify how the animation is to unfold in time. The script is a powerful language for the artist. The program that reads, interprets, and carries out the commands of the script is a translator, either a compiler or an interpreter. Even though the output generated by the script interpreter is not machine code, all the components of the compiler are there: the scanner, the parser, and the semantic analyzer and code generator. In this case, the "code" is the screen image of each frame of the animation.
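A toy version of such a script translator might look like the following Python sketch; the command names ("place," "move ... over") and their semantics are invented for illustration, and the generated "code" is simply a list of per-frame snapshots rather than screen images.

```python
# Toy animation-script interpreter. Scanning, parsing, and "code
# generation" are collapsed into a few lines, but each phase is
# still recognizable in the structure.

def run_script(script, fps=30):
    positions = {}   # object name -> current (x, y)
    frames = []      # the generated "code": one snapshot per frame
    for line in script.strip().splitlines():
        words = line.split()                    # scanning
        if words[0] == "place":                 # place NAME x y
            _, name, x, y = words               # parsing
            positions[name] = (float(x), float(y))
        elif words[0] == "move":                # move NAME dx dy over SECONDS
            _, name, dx, dy, _, secs = words
            n = int(float(secs) * fps)
            x, y = positions[name]
            for i in range(1, n + 1):           # "code generation"
                frames.append((name, x + float(dx) * i / n,
                                     y + float(dy) * i / n))
            positions[name] = (x + float(dx), y + float(dy))
        else:
            raise SyntaxError(f"unknown command: {words[0]}")
    return frames

clip = run_script("""
place ball 0 0
move ball 10 0 over 1
""")
print(len(clip), clip[-1])   # 30 ('ball', 10.0, 0.0)
```

The artist writes two lines of script; the translator expands them into thirty frames, which is exactly the economy the scripting approach is meant to provide.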
Another use for compilers is that of compilers writing compilers. These "compiler compilers" are programs that take the syntax rules of a language as input and generate a program, often in a high-order language, as output. The output of a compiler compiler is itself a compiler. Compiler compilers are very useful for testing new languages. They can be used to generate a prototype compiler for a new language to ensure that the rules of the language have been properly specified and are acting in the manner desired and predicted by the language designers. Compilers generated by compiler compilers may also be used as a finished product, especially if the language is expected to have a short, limited life. Automatically generated compilers are not generally used extensively or for long periods, however, because they usually do not perform as well as compilers hand-coded by experienced compiler writers.
A spelling checker is a primitive version of a compiler's lexical analyzer. Spelling checkers examine a computer-based text document for misspelled words by first locating each word and then looking that word up in a dictionary. If the word is not found, alternative words that are close in spelling to the word in question are suggested.
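The core of such a checker reduces to a few lines of Python; the tiny word list and the use of the standard library's get_close_matches for suggestions are simplifications chosen for the sketch.

```python
# Minimal spelling-checker sketch: locate each word, look it up in a
# dictionary, and suggest near matches for anything not found.

import re
from difflib import get_close_matches

DICTIONARY = {"a", "compiler", "translates", "program",
              "into", "machine", "language"}

def check(text):
    for word in re.findall(r"[a-z]+", text.lower()):   # locate each word
        if word not in DICTIONARY:                     # dictionary lookup
            suggestions = get_close_matches(word, DICTIONARY, n=3)
            print(f"{word!r} not found; did you mean {suggestions}?")

check("A compiler transaltes a progrm into machine language.")
# 'transaltes' not found; did you mean ['translates']?
# 'progrm' not found; did you mean ['program']?
```

Like a compiler's scanner, the checker accepts or rejects individual words; unlike a scanner, it can only suggest, because there is no complete lexicon of English.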
Somewhat more sophisticated are programs that analyze sentence structure within documents. These programs also take a text file, such as one produced by a word processor, and attempt to confirm that the sentences in the document are syntactically correct. They may be based on the same principle as the compiler's parser: they work from a set of rules of English grammar, trying to find a way that each sentence could have been constructed from those rules. If the program finds such a derivation, it knows that the sentence is correct. If it does not, it knows only that the sentence is suspect; unlike the compiler, the program cannot know that the sentence is incorrect. The compiler knows that any sentence it cannot parse is incorrect because it has a complete set of rules for the language it is parsing. No complete set of rules exists for the English language, so the word processor's syntax checker can determine only whether a sentence is acceptable according to its own standards.
Compiler theory is also used extensively in the field of natural language processing (NLP), a major area of research within the broader field of artificial intelligence. A large portion of the work in this area is devoted to finding ways to make computers "understand" sentences presented to them in a human language such as English. Since almost all representations of natural language require some type of grammar rule, some type of parsing mechanism must be employed in order to begin to understand what has been said. The success of compilers in building parsers for different types of restricted grammars provides some assistance for researchers in NLP.
Context
Much of the theory behind modern compilers was developed by the linguist Noam Chomsky in the late 1950s and early 1960s. Chomsky defined what has come to be known as the "Chomsky hierarchy" of language types. This hierarchy consists of four nested sets of language types (regular, context-free, context-sensitive, and recursively enumerable), each described by a grammar form and an abstract machine for recognizing sentences within that language type. Computer languages are generally built using a combination of the two simplest types, regular languages and context-free languages, with more complex aspects of the computer language being handled by semantics.
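As a rough illustration of this division of labor, the Python sketch below uses a regular expression (equivalent to a finite automaton) for a token-level pattern and a counter (a minimal stack) for the context-free property of balanced parentheses; the patterns themselves are invented examples.

```python
import re

# Regular layer: token shapes such as identifiers can be recognized
# by a finite automaton, written here as a regular expression.
IDENT = re.compile(r"[A-Za-z_]\w*$")
print(bool(IDENT.match("total_2")))   # True

# Context-free layer: balanced parentheses cannot be matched by any
# regular expression; a counter (a one-symbol stack) is required.
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(balanced("((a + b) * c)"))   # True
print(balanced("(a + b))"))        # False
```

This split mirrors compiler structure: the scanner handles the regular layer of the language, while the parser, with its stack, handles the context-free layer.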
Computers and assembly language programming predate Chomsky's theories of language, and one of the first high-level languages, FORTRAN, was developed without their aid. Language development soon became a more serious pursuit, however, with the first attempts at more theoretical approaches taking shape in Algol in 1960. Pascal is a direct descendant of Algol; its designer, Niklaus Wirth, was one of the original committee members for Algol.
Approaches to language design that were more firmly grounded in Chomskian theory fostered more rigorous theoretical approaches to compiler design. Compilers form the link between the programming language and the actual machine; a compiler is expected to "know" a great deal about the machine for which it is translating. As computers became more powerful, the applications developed for them became more demanding, thus requiring even more power. Speed of computation is an important aspect of the power of a computer. To make computers faster, computer designers have turned toward massively parallel computers. Where computers of the past may have had one central processor, a massively parallel computer may have thousands.
It is generally considered impossible for a person to compose, by hand, a computer program that makes efficient use of a thousand processors. It therefore becomes the compiler's task to analyze the program and divide its subcomponents among the multiple processors of a massively parallel machine in a manner that makes efficient use of them. The theories of language design and compiler design that have worked so well in the past provide an essential foundation for the development of more complex languages and compilers for massively parallel computers.
Principal terms
Computer program: a sequence of instructions to a computer
High-level language: a computer programming language whose purpose is to bridge the gap between the solution to a problem and machine language
Lexicon: the words, or dictionary, of a language
Machine language: a set of instruction codes in the form of sequences of binary digits (1's and 0's); these codes correspond to specific actions within the computer
Syntax: the rules that govern how words may be put together in a sentence