CSC 330 Lecture Notes Week 1
Intro to the Course; Some Programming Language History;
The Syntax of Programming Languages



  1. Preliminaries
    1. Go over syllabus
    2. Pass around 1st-day roster

  2. What is a programming language (PL)?
    Standard textbook definition -- a systematic notation to describe a computational process.

  3. What kind of PLs will we be looking at in this class?

    1. For the most part, we will be examining what is called the von Neumann class of programming languages.
    2. Briefly, this is the class of languages that have memory variables and represent computation as a linear sequence of executable statements.
    3. Most of the main languages we will study -- Java, C, Perl, Lisp -- fall into this general category.
    4. We will also look at some principles of non-von Neumann languages by focusing on the functional subset of Lisp, and the declarative languages Yacc and Lex.
    5. There are two interesting subclasses of von Neumann languages that we will consider, by focusing on features of Java that you probably have not covered in your earlier classes:

      1. concurrent programming languages, as represented by Java threads
      2. reflective languages, by examining the reflection features of Java, and the programs-as-data features of Lisp.

  4. Comparing sequential, von Neumann, high-level programming languages (HLPLs) to other kinds of language:

    1. HLPLs compared to natural languages:

      1. HLPLs are formally less expressive; i.e., PLs belong to a different formal class than natural languages.
      2. HLPLs must be unambiguous.
      3. HLPLs must be drastically simpler than natural language in order to provide efficient translation.
      4. HLPLs are machine-oriented, more than they are human-oriented.

    2. HLPLs compared to mathematical (non-von Neumann) languages:

      1. HLPLs have variables that represent physical memory that retains its state through time; mathematical variables do not have these physical/temporal properties.

        1. E.g., compare
          x = y
          
          as a Java or C programming language statement versus
          x = y
          
          as a mathematical statement.
        2. The latter is a statement of fact whereas the former denotes the process of evaluating y and moving its value into x.


      2. A fundamental concept of HLPLs is that of time passing in a sequence of discrete steps; mathematical languages in general do not have this concept.

        1. E.g., compare
          x = y; x = z;
          
          as a program versus
          x = y and x = z
          
          as a mathematical statement.
        2. The latter means that x, y, and z are all equal; the former leaves x equal to z, since the second assignment overwrites the first (see the Java sketch below).

      3. It is noteworthy that there is much research work devoted to the development of programming languages that behave more like pure mathematics; as noted above, we will look at some of the ideas in such non-von Neumann languages during the course.
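      4. To make the time-step contrast concrete, here is a minimal Java sketch (the class and variable names are illustrative, not from the notes):

          // Assignment as discrete time steps, in contrast with the
          // simultaneous mathematical reading of x = y and x = z.
          public class TimeSteps {
            public static void main(String[] args) {
              int y = 1, z = 2;
              int x = y;                  // step 1: x now holds 1
              x = z;                      // step 2: x now holds 2; step 1 is overwritten
              System.out.println(x == z); // prints true
              System.out.println(x == y); // prints false -- unlike the mathematical reading
            }
          }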

    3. HLPLs compared to assembly and machine languages:

      1. Conceptually, HLPLs and assembly languages are really in the same class -- i.e., both are von Neumann class languages.
      2. What assembly languages lack is largely a matter of notational convenience.

    4. Sequential HLPLs compared to concurrent HLPLs.

      1. Sequential PLs can only express computations with a single thread of control -- executable statements can only run one at a time.
      2. Concurrent HLPLs have capabilities to express computations in which two or more statements may execute at the same time.

  5. Why are there so many different HLPLs?

    1. Socio-cultural-economic reasons.
    2. Different application domains are better served by some languages than others.

  6. Some history.
    1. Figure 1 depicts a generational history of HLPLs (cf. Figure 1.3 on Page 11 of the book); within each generation, the languages span academic, AI/symbolic, systems, commercial, and miscellaneous strands.


      0th Generation: machine and assembly languages

      1st Generation: FORTRAN

      2nd Generation: ALGOL 60, SNOBOL, LISP, COBOL, BASIC

      2.5th Generation: ALGOL 68, Pascal, specialized LISP derivatives,
        BCPL, BLISS, C, BASIC+, SIMULA 67, FORTH

      2.75th Generation: Euclid, Clu, Modula, Mesa, Alphard,
        Concurrent Pascal, Smalltalk

      2.9th Generation: Ada, Modula-2, C++, Perl, Icon, Scheme,
        PostScript, SR, Interlisp, MacLisp, ZetaLisp, etc.

      2.95th Generation: Java, Dylan, C#

      3rd Generation: Prolog, ML, LISPs, Prologs, SISAL, Lucid, OBJ3,
        Lex, Yacc

      4th Generation: Hypercard, REFINE, RN, SQL, Access

      Figure 1: Generational history of programming languages.



    2. Prominent 2nd, 3rd, and 4th generation developments:
      1. Modularity and object-oriented features (Simula, Smalltalk, Modula, C++, Java).
      2. Concurrency (Ada, SR, MultiLisp, ParLOG, Java).
      3. Functional programming (Lisp, SISAL, ML, OBJ).
      4. Logic programming (OBJ, Prolog).
      5. Very-high-level declarative programming (Prolog, OBJ, REFINE).
      6. GUI, event-based support (Smalltalk, Java).
      7. Integration of language and its environment (LISPs, Smalltalk, Java).
      8. Domain-specific languages (Access, SQL).

    3. One last note on history --
      1. Simula 67 is a serious historical omission from the object-oriented languages in the book's Figure 1.3.
      2. Simula is widely acknowledged to be the first object-oriented language, having introduced the keyword "class" into the PL lexicon, and having inheritance features very much like languages of today, including C++, Java, and C#.

  7. What is syntax?

    1. Expresses the form (or grammar) of a program in a particular language.
    2. It does not express at all what a program means.
    3. In other words, syntax expresses what a program looks like but not what it does.

  8. Elements of syntax for a language (including a programming language) are:

    1. An alphabet of symbols that comprise the basic elements of the grammar.
    2. The symbols are divided into two sets -- terminal symbols (also known as tokens) and non-terminal symbols (referred to as grammatical categories in the book).

      1. Terminals are those symbols that cannot be broken down into constituents.
      2. Non-terminals are those symbols that can be broken down further.

    3. A set of grammar rules that define how symbols are combined together to make legal sentences of the language.
    4. The rules are of the general form
      non-terminal symbol -> list of zero or more terminals or non-terminals
      This rule can be read as ``the non-terminal on the left-hand side (LHS) of the rule is composed of the list of constituents on the right-hand side (RHS) of the rule.''
    5. One uses the rules to recognize (a.k.a. parse) and/or generate legal sentences of the language.

  9. Equivalent forms in which to express syntax.

    1. Backus-Naur Form (BNF).
    2. Other notational variants, such as extended BNF (EBNF).
    3. Syntax graphs.
    4. All notations commonly used for programming language syntax have the same expressive power -- all define context-free syntax.
    5. Context-free grammars are a topic we cover at a practical level in 330, and which you will study more formally in CSC 445.

  10. As an initial example, let us look at a grammar for a very small subset of English:

    1. Consider the sentence ``Mary greets John.''
    2. A simple grammar that could be used to parse or generate this sentence is the following:
       sentence  ->  subject  predicate  .
       subject  ->  Mary
       predicate  ->  verb  object
       verb  ->  greets
       object  ->  John
      

    3. How do we use the rules of such a grammar to parse or generate a sentence like ``Mary greets John.''?

      1. We consider each rule to be a production that defines how some part of a sentence is recognized or formed.
      2. The trace of the application of the production rules can be shown in a tree form, as in the following:
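      3. For example, the parse tree for ``Mary greets John.'' under the above rules is:

                        sentence
                       /    |    \
                subject  predicate  .
                   |       /    \
                 Mary   verb   object
                          |       |
                        greets   John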

  11. There are a number of concepts that will enhance the expressibility of a grammar:

    1. The concept of alternation allows more than one right-hand side (RHS) for a grammar rule.

      1. In standard BNF, rule alternatives are separated by a vertical bar symbol.
      2. E.g., suppose we modify the last rule in the above sample grammar to be
         object  ->  John   |   Alfred
        

        This adds ``Mary greets Alfred.'' to the sentences generatable by the grammar.

      3. Another way to extend the expressibility of a grammar is to add additional non-terminal symbols.

        1. E.g., in the above grammar, only ``Mary'' may be the subject, and ``John'' the object; a grammar that removes this restriction would modify the rules for subject and object as follows:
           subject ->  Mary   |   John
           object  ->  Mary   |   John
          
        2. A notationally more convenient grammar to do the same thing adds a new non-terminal:
           sentence   ->   subject  predicate  .
           subject   ->  noun
           predicate  ->  verb  object
           verb   ->  greets
           object  ->  noun
           noun  ->  John   |   Mary
          


    2. One of the things that makes BNF particularly powerful is its ability to express languages that have an infinite number of sentences or sentences that are infinitely (or at least unboundedly) long.

      1. As a first step toward infinite sentences, consider the following modification to the object rule:
         object  ->  John    |    John again    |    John again and again    |    ...
        

      2. This provides the basic idea of a pattern for generating sentences such as ``Mary greets John.'', ``Mary greets John again.'', ``Mary greets John again and again.'', ``Mary greets John again and again and again.'', etc.
      3. Clearly it's infeasible to write an infinite-length grammar to generate such sentences. Instead, we notice the pattern that is developing, and use the following form of recursive notation:
         object  ->  John    |    John repeat_factor
         repeat_factor  ->  again    |    again and repeat_factor
        

      4. Here is a sample parse tree:
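        For ``Mary greets John again and again.'', the object subtree is:

                    object
                   /      \
                John   repeat_factor
                      /      |      \
                  again     and    repeat_factor
                                        |
                                      again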

    3. The use of such recursion is a very important part of context-free grammars.

  12. Let us now consider the syntax of programming languages.

    1. The following is the beginning of a simple expression grammar that fits most programming languages:
       expr  ->  expr  op  expr   |   var
       op  ->  +    |    -    |    *    |    /
       var  ->  a   |   b   |   c   |  ...
      

    2. From this grammar, we can generate the following two parse trees for the expression ``a+b*c'':
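      The first tree groups b*c first, i.e., a + (b*c):

                  expr
                /  |   \
            expr   op   expr
             |     |   /  |  \
            var    + expr op expr
             |        |   |   |
             a       var  *  var
                      |       |
                      b       c

      The second tree groups a+b first, i.e., (a+b) * c:

                    expr
                  /  |  \
              expr   op  expr
            /  |  \   |    |
        expr  op expr *   var
          |    |   |       |
         var   +  var      c
          |        |
          a        b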


    3. Is there anything wrong with the fact that we can get these two different trees?

      1. First, the syntactic ambiguity is not good; in general we would like a grammar to give us a unique parse for each sentence.
      2. Also of possible concern is that the two trees might intuitively be interpreted differently in terms of the precedence of operators.

  13. To address the concerns raised about the previous expression grammar, consider the following version:
    expr  ->  simple_expr   |   expr  rel_op  simple_expr
    simple_expr  ->  term   |   simple_expr  add_op  term
    term  ->  factor   |   term  mult_op  factor
    factor  ->  var   |   integer
    rel_op  ->  <   |   >   |   ==   |   <=   |   >=   |   !=
    add_op ->  +   |   -
    mult_op  ->  *   |   /
    var  ->  a    |    b    |    c    |   ...
    integer  ->  digit   |   integer digit
    digit  ->  0   |   ...   |   9
    

    1. Here, precedence is represented through a hierarchy of non-terminals -- expr, simple_expr, term, factor.
    2. This hierarchy ensures that the operator precedence is properly reflected in the parse tree for any expression, e.g., that multiplying operators have higher precedence than adding operators, which in turn have higher precedence than relational operators.
    3. Look at things closely here to be sure you understand how this works to define precedence; we will return to this subject in coming weeks.

  14. A hint at the parsing problem.

    1. Given the expression grammar above, how do we go about parsing a string such as ``a+b*c''?
    2. Basically, use the following procedure:

      1. Examine each of the symbols in the string from left to right, looking for grammar rules that we can apply. That is, we look in the grammar for the RHS of some rule that matches one or more of the symbols at the current place we're examining in the string.
      2. Continuing this process, if we match the topmost rule in the grammar, and there are no symbols left to examine in the string, then we're done -- we've successfully parsed the string.
      3. If we apply the topmost rule, but there are still symbols left in the string to examine, then we've gone too far. In this case, we must undo the most recent rule we matched, and go back to step 1.

    3. Given this procedure, consider how the following parse tree would be derived from the string along its frontier:
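      Under the precedence grammar above, ``a+b*c'' has a unique parse tree:

                          expr
                           |
                      simple_expr
                    /      |      \
            simple_expr  add_op   term
                 |         |     /   |      \
               term        +   term mult_op factor
                 |              |      |      |
              factor         factor    *    var
                 |              |             |
                var            var            c
                 |              |
                 a              b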


    4. Note that the simplistic procedure given above is really far too vague to be used formally for parsing. In upcoming lectures and assignments, we will delve much further into how parsing is performed.
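    5. As a complementary illustration, here is a minimal recursive-descent parser in Java for the simple_expr/term/factor subset of the grammar above. This is a top-down strategy, one of several approaches to the parsing problem; the class and method names are illustrative, not from the notes. Note how the left-recursive rules are rewritten as loops, which recursive descent requires.

        // Recursive-descent recognizer for:
        //   simple_expr -> term   | simple_expr add_op term
        //   term        -> factor | term mult_op factor
        //   factor      -> var
        public class ExprParser {
          private final String input;  // e.g., "a+b*c"
          private int pos = 0;         // current position in the input

          public ExprParser(String input) { this.input = input; }

          private char peek() {
            return pos < input.length() ? input.charAt(pos) : '\0';
          }

          // simple_expr -> term { add_op term }*  (left recursion as a loop)
          public void simpleExpr() {
            term();
            while (peek() == '+' || peek() == '-') { pos++; term(); }
          }

          // term -> factor { mult_op factor }*
          private void term() {
            factor();
            while (peek() == '*' || peek() == '/') { pos++; factor(); }
          }

          // factor -> var  (a single letter, per the var rule)
          private void factor() {
            if (Character.isLetter(peek())) pos++;
            else throw new RuntimeException("expected a variable at " + pos);
          }

          public static void main(String[] args) {
            ExprParser p = new ExprParser("a+b*c");
            p.simpleExpr();
            System.out.println(p.pos == p.input.length()
                ? "parsed successfully" : "trailing input at " + p.pos);
          }
        }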

  15. Alternate context-free syntax notations.

    1. Extended BNF (EBNF), with the following notational features:

      1. A specialized bracketing notation { ... }* is used to express repetition, avoiding the need for recursive rules in many cases.
        1. E.g.,
           number  ->  { digit }*
          
          means a number consists of zero or more digits.
        2. This can be expressed with the slightly more complicated recursion:
           number  ->     |   digit   |   number digit
          
          Note the empty alternative at the beginning of the RHS; this allows for the zero case.

      2. Another form of brackets `[' ... `]' expresses one-of selection. E.g.,
         signed_number  ->  [ +  |  - ]  number
        
        means that a signed_number is a number preceded with a plus or minus sign.
      3. One more EBNF construct is the opt subscript, used to indicate an optional construct. E.g.,
         signed_number  ->  [ +  |  - ]opt  number
        
        means that a signed_number is a number preceded with an optional plus or minus sign, i.e., by a plus sign, a minus sign, or neither.
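      4. As a combined example, the integer rule from the expression grammar above can be written in EBNF without recursion:
         integer  ->  digit  { digit }*
        
        i.e., an integer is one digit followed by zero or more further digits.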

    2. Syntax graphs.
      1. Syntax graphs are a graphical variant of BNF with the following notational conventions:
        1. The name of the graph (on the left) depicts the LHS of the rule.
        2. RHS non-terminals are enclosed in oval graph nodes.
        3. RHS terminals are unenclosed graph nodes.
        4. Graph edges depict rule structure:
          1. left-to-right flow depicts RHS ordering
          2. split edges depict alternatives
          3. looping edges depict repetition
      2. For example, Figure 2 is the syntax graph for the expr grammar given above.



    Figure 2: Syntax graph for the expression grammar above.



  16. More on syntactic ambiguity in programming languages.

    1. A classic case is that of the ``dangling else.''
    2. Consider the following excerpt from a Pascalese statement grammar:
       stmt  ->  assignment_stmt   |   procedure_stmt   |   if_stmt  |  ...
       if_stmt  ->  if  bool_expr  then  stmt  [ else  stmt  ]opt
      
    3. Now consider the statement ``if E1 then if E2 then S1 else S2''.
    4. Question -- does the grammar indicate with which of the if's the else should be associated?
    5. Answer: No, since the grammar allows either of the following parses:
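      if E1 then ( if E2 then S1 else S2 )        -- else paired with the inner if
      if E1 then ( if E2 then S1 ) else S2        -- else paired with the outer if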

    6. So, what should be done about this ambiguity?

      1. The grammar can be fixed to remove it (think about how this could be done; see Section 2.2.1 of the book).
      2. The syntactic ambiguity can be left, and disambiguated with a semantic rule; e.g., with a rule stated in English such as: ``The else ambiguity is resolved by connecting an else with the last encountered else-less if at the same block nesting level'' (C Language Reference Manual, pg. 223).
      3. Resolving syntactic ambiguity using semantic rules is a means that is sometimes used in the definition of real programming languages (e.g., C).
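    7. Java inherits C's resolution rule, so the effect can be seen directly. The following sketch (illustrative names, not from the notes) shows the nearest-if binding and how braces override it:

        public class DanglingElse {
          // Without braces: the else binds to the nearest else-less if.
          static void demo(boolean e1, boolean e2) {
            if (e1)
              if (e2) System.out.println("S1");
              else System.out.println("S2");   // belongs to if (e2)
          }

          // With braces: the else is forced onto the outer if.
          static void demoBraced(boolean e1, boolean e2) {
            if (e1) {
              if (e2) System.out.println("S1");
            } else System.out.println("S2");   // belongs to if (e1)
          }

          public static void main(String[] args) {
            demo(true, false);        // prints S2 via the inner else
            demo(false, false);       // prints nothing: the else is inner
            demoBraced(false, false); // prints S2 via the outer else
          }
        }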

  17. Some thoughts on the art of designing the syntax of programming languages.

    1. Throughout CSC 330, we'll be examining what's good and bad about the syntax of a variety of languages.
    2. We'll see how subtle differences can make a major difference in the usability of a language.
    3. A preview of some interesting advances (or regressions) beyond "Pascal-style" syntax includes:

      1. The C-ification of begin-end bracketing, as in, e.g.,
        if (t) { s1; s2 } else { s3; s4 }
        
        versus the Pascalese
        if t then s1; s2 else s3; s4 endif;
        

      2. Allowing semicolons to be optional in most normal cases, as in the Icon language, for example.
      3. The advent of syntax-directed editors that help programmers worry less about the picky syntactic details of a program. Such editors are now commonplace in IDEs such as JCreator and NetBeans.

  18. Summary of the focus of our study of syntax and syntactic notation in CSC 330:

    1. Here in 330 we will use BNF as a tool in which to express programming language syntax; i.e., BNF will be used as a communication tool between humans.
    2. We will also use BNF as a form of programming language in a couple of the assignments, where you will use a language called "Yacc" (Yet another compiler-compiler) to generate a program that does parsing.
    3. We will not address the formal theory of grammars -- this is studied in CSC 445.
    4. Neither will we address formally the theory and practice of parsing programs using context-free grammars -- this is studied in CSC 430.



