[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
This chapter describes some aspects of the design and implementation
of the g77
front end.
Much of the information below applies not to current
releases of g77
,
but to the 0.6 rewrite being designed and implemented
as of late May, 1999.
To find about things that are "To Be Determined" or "To Be Done", search for the string TBD. If you want to help by working on one or more of these items, email [email protected]. If you're planning to do more than just research issues and offer comments, see http://www.gnu.org/software/contribute.html for steps you might need to take first.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The current directory layout includes the following:
libg2c
configuration and g2c.h
file generation
libg2c
libg2c
libc
for libg2c
Components of note in g77
are described below.
`f/' as a whole contains the source for g77
,
while `libf2c/' contains a portion of the separate program
f2c
.
Note that the libf2c
code is not part of the program g77
,
just distributed with it.
`f/' contains text files that document the Fortran compiler, source
files for the GNU Fortran Front End (FFE), and some other stuff.
The g77
compiler code is placed in `f/' because it,
along with its contents,
is designed to be a subdirectory of a gcc
source directory,
`gcc/',
which is structured so that language-specific front ends can be "dropped
in" as subdirectories.
The C++ front end (g++
), is an example of this--it resides in
the `cp/' subdirectory.
Note that the C front end (also referred to as gcc
)
is an exception to this, as its source files reside
in the `gcc/' directory itself.
`libf2c/' contains the run-time libraries for the f2c
program,
also used by g77
.
These libraries normally referred to collectively as libf2c
.
When built as part of g77
,
libf2c
is installed under the name libg2c
to avoid
conflict with any existing version of libf2c
,
and thus is often referred to as libg2c
when the
g77
version is specifically being referred to.
The netlib
version of libf2c/
contains two distinct libraries,
libF77
and libI77
,
each in their own subdirectories.
In g77
, this distinction is not made,
beyond maintaining the subdirectory structure in the source-code tree.
`libf2c/' is not part of the program g77
,
just distributed with it.
It contains files not present
in the official (netlib
) version of libf2c
,
and also contains some minor changes made from libf2c
,
to fix some bugs,
and to facilitate automatic configuration, building, and installation of
libf2c
(as libg2c
) for use by g77
users.
See `libf2c/README' for more information,
including licensing conditions
governing distribution of programs containing code from libg2c
.
libg2c
, g77
's version of libf2c
,
adds Dave Love's implementation of libU77
,
in the `libf2c/libU77/' directory.
This library is distributed under the
GNU Library General Public License (LGPL)---see the
file `libf2c/libU77/COPYING.LIB'
for more information,
as this license
governs distribution conditions for programs containing code
from this portion of the library.
Files of note in `f/' and `libf2c/' are described below:
g77
documentation:
info -f f/g77.info -n "Actual Bugs" |
g77
internals.
libg2c
internals.
g77
documentation, plus internal
changes of import.
Or use:
info -f f/g77.info -n News |
g77
documentation, in Info format,
produced by building g77
.
All users of g77
(not just installers) should read this,
using the more
command if neither the info
command,
nor GNU Emacs (with its Info mode), are available, or if users
aren't yet accustomed to using these tools.
All of these files are readable as "plain text" files,
though they're easier to navigate using Info readers
such as info
and GNU Emacs Info mode.
If you want to explore the FFE code, which lives entirely in `f/',
here are a few clues.
The file `g77spec.c' contains the g77
-specific source code
for the g77
command only--this just forms a variant of the
gcc
command, so,
just as the gcc
command itself does not contain the C front end,
the g77
command does not contain the Fortran front end (FFE).
The FFE code ends up in an executable named `f771',
which does the actual compiling,
so it contains the FFE plus the gcc
back end (GBE),
the latter to do most of the optimization, and the code generation.
The file `parse.c' is the source file for yyparse()
,
which is invoked by the GBE to start the compilation process,
for `f771'.
The file `top.c' contains the top-level FFE function ffe_file
and it (along with top.h) define all `ffe_[a-z].*', `ffe[A-Z].*',
and `FFE_[A-Za-z].*' symbols.
The file `fini.c' is a main()
program that is used when building
the FFE to generate C header and source files for recognizing keywords.
The files `malloc.c' and `malloc.h' comprise a memory manager
that defines all `malloc_[a-z].*', `malloc[A-Z].*', and
`MALLOC_[A-Za-z].*' symbols.
All other modules named xyz are comprised of all files named `xyz*.ext' and define all `ffexyz_[a-z].*', `ffexyz[A-Z].*', and `FFEXYZ_[A-Za-z].*' symbols. If you understand all this, congratulations--it's easier for me to remember how it works than to type in these regular expressions. But it does make it easy to find where a symbol is defined. For example, the symbol `ffexyz_set_something' would be defined in `xyz.h' and implemented there (if it's a macro) or in `xyz.c'.
The "porting" files of note currently are:
ARRAY_SIZE
and such.
INTEGER*8
map, for example),
how to convert between them,
and so on.
Over time, versions of g77
rely less on this file
and more on run-time configuration based on GBE info
in `com.c'.
If you want to debug the `f771' executable,
for example if it crashes,
note that the global variables lineno
and input_filename
are usually set to reflect the current line being read by the lexer
during the first-pass analysis of a program unit and to reflect
the current line being processed during the second-pass compilation
of a program unit.
If an invocation of the function ffestd_exec_end
is on the stack,
the compiler is in the second pass, otherwise it is in the first.
(This information might help you reduce a test case and/or work around
a bug in g77
until a fix is available.)
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The order of phases translating source code to the form accepted by the GBE is:
To get a rough idea of how a particularly twisted Fortran statement gets treated by the passes, consider:
FORMAT(I2 4H)=(J/ & I3) |
The job of `lex.c' is to know enough about Fortran syntax rules to break the statement up into distinct lexemes without requiring any feedback from subsequent phases:
`FORMAT' `(' `I24H' `)' `=' `(' `J' `/' `I3' `)' |
The job of `sta.c' is to figure out the kind of statement, or, at least, statement form, that sequence of lexemes represent.
The sooner it can do this (in terms of using the smallest number of lexemes, starting with the first for each statement), the better, because that leaves diagnostics for problems beyond the recognition of the statement form to subsequent phases, which can usually better describe the nature of the problem.
In this case, the `=' at "level zero"
(not nested within parentheses)
tells `sta.c' that this is an assignment-form,
not FORMAT
, statement.
An assignment-form statement might be a statement-function definition or an executable assignment statement.
To make that determination, `sta.c' looks at the first two lexemes.
Since the second lexeme is `(', the first must represent an array for this to be an assignment statement, else it's a statement function.
Either way, `sta.c' hands off the statement to `stq.c' (via `sti.c', which expands INCLUDE files). `stq.c' figures out what a statement that is, on its own, ambiguous, must actually be based on the context established by previous statements.
So, `stq.c' watches the statement stream for executable statements, END statements, and so on, so it knows whether `A(B)=C' is (intended as) a statement-function definition or an assignment statement.
After establishing the context-aware statement info, `stq.c' passes the original sample statement on to `stb.c' (either its statement-function parser or its assignment-statement parser).
`stb.c' forms a statement-specific record containing the pertinent information. That information includes a source expression and, for an assignment statement, a destination expression. Expressions are parsed by `expr.c'.
This record is passed to `stc.c', which copes with the implications of the statement within the context established by previous statements.
For example, if it's the first statement in the file
or after an END
statement,
`stc.c' recognizes that, first of all,
a main program unit is now being lexed
(and tells that to `std.c'
before telling it about the current statement).
`stc.c' attaches whatever information it can, usually derived from the context established by the preceding statements, and passes the information to `std.c'.
`std.c' saves this information away, since the GBE cannot cope with information that might be incomplete at this stage.
For example, `I3' might later be determined
to be an argument to an alternate ENTRY
point.
When `std.c' is told about the end of an external (top-level) program unit, it passes all the information it has saved away on statements in that program unit to `ste.c'.
`ste.c' "expands" each statement, in sequence, by constructing the appropriate GBE information and calling the appropriate GBE routines.
Details on the transformational phases follow. Keep in mind that Fortran numbering is used, so the first character on a line is column 1, decimal numbering is used, and so on.
22.2.1 g77stripcard 22.2.2 lex.c 22.2.3 sta.c 22.2.4 sti.c 22.2.5 stq.c 22.2.6 stb.c 22.2.7 expr.c 22.2.8 stc.c 22.2.9 std.c 22.2.10 ste.c
22.2.11 Gotchas (Transforming) 22.2.12 TBD (Transforming)
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The g77stripcard
program handles removing content beyond
column 72 (adjustable via a command-line option),
optionally warning about that content being something other
than trailing whitespace or Fortran commentary.
This program is needed because lex.c
doesn't pay attention
to maximum line lengths at all, to make it easier to maintain,
as well as faster (for sources that don't depend on the maximum
column length vis-a-vis trailing non-blank non-commentary content).
Just how this program will be run--whether automatically for old source (perhaps as the default for `.f' files?)---is not yet determined.
In the meantime, it might as well be implemented as a typical UNIX pipe.
It should accept a `-fline-length-n' option, with the default line length set to 72.
When the text it strips off the end of a line is not blank
(not spaces and tabs),
it should insert an additional comment line
(beginning with `!',
so it works for both fixed-form and free-form files)
containing the text,
following the stripped line.
The inserted comment should have a prefix of some kind,
TBD, that distinguishes the comment as representing stripped text.
Users could use that to sed
out such lines, if they wished--it
seems silly to provide a command-line option to delete information
when it can be so easily filtered out by another program.
(This inserted comment should be designed to "fit in" well
with whatever the Fortran community is using these days for
preprocessor, translator, and other such products, like OpenMP.
What that's all about, and how g77
can elegantly fit its
special comment conventions into it all, is TBD as well.
We don't want to reinvent the wheel here, but if there turn out
to be too many conflicting conventions, we might have to invent
one that looks nothing like the others, but which offers their
host products a better infrastructure in which to fit and coexist
peacefully.)
g77stripcard
probably shouldn't do any tab expansion or other
fancy stuff.
People can use expand
or other pre-filtering if they like.
The idea here is to keep each stage quite simple, while providing
excellent performance for "normal" code.
(Code with junk beyond column 73 is not really "normal", as it comes from a card-punch heritage, and will be increasingly hard for tomorrow's Fortran programmers to read.)
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
To help make the lexer simple, fast, and easy to maintain,
while also having g77
generally encourage Fortran programmers
to write simple, maintainable, portable code by maximizing the
performance of compiling that kind of code:
Some other distinctions will be handled by subsequent phases, so at least one of them will have to know which form is involved.
For example, `I = 2 . 4' is acceptable in fixed form,
and works in free form as well given the implementation g77
presently uses.
But the standard requires a diagnostic for it in free form,
so the parser has to be able to recognize that
the lexemes aren't contiguous
(information the lexer does have to provide)
and that free-form source is being parsed,
so it can provide the diagnostic.
The g77
lexer doesn't try to gather `2 . 4' into a single lexeme.
Otherwise, it'd have to know a whole lot more about how to parse Fortran,
or subsequent phases (mainly parsing) would have two paths through
lots of critical code--one to handle the lexeme `2', `.',
and `4' in sequence, another to handle the lexeme `2.4'.
That is, once it starts parsing the "statement" part of a line (column 7 for fixed-form, column 1 for free-form), it'll keep going until it finds a newline, rather than ignoring everything past a particular column (72 or 132).
The implication here is that there shouldn't be anything past that last column, other than whitespace or commentary, because users using typical editors (or viewing output as typically printed) won't necessarily know just where the last column is.
Code that has "garbage" beyond the last column
(almost certainly only fixed-form code with a punched-card legacy,
such as code using columns 73-80 for "sequence numbers")
will have to be run through g77stripcard
first.
Also, keeping track of the maximum column position while also watching out for the end of a line and while reading from a file just makes things slower. Since a file must be read, and watching for the end of the line is necessary (unless the typical input file was preprocessed to include the necessary number of trailing spaces), dropping the tracking of the maximum column position is the only way to reduce the complexity of the pertinent code while maintaining high performance.
Code written in other character sets will have to be converted first.
Specifically, a tab is converted to between one and eight spaces as necessary to reach column n, where dividing `(n - 1)' by eight results in a remainder of zero.
That saves having to pass most source files through expand
.
Otherwise, it is rejected (with a diagnostic).
This includes backspaces, form feeds, and the like.
(It might make sense to allow a form feed in column 1 as long as that's the only character on a line. It certainly wouldn't seem to cost much in terms of performance.)
It will be up to subsequent phases to decide to fold case.
Current plans are to permit any casing for Fortran (reserved) keywords while preserving casing for user-defined names. (This might not be made the default for `.f' files, though.)
Preserving case seems necessary to provide more direct access
to facilities outside of g77
, such as to C or Pascal code.
Names of intrinsics will probably be matchable in any case,
However, there probably won't be any option to require
a particular mixed-case appearance of intrinsics
(as there was for g77
prior to version 0.6),
because that's painful to maintain,
and probably nobody uses it.
(How `external SiN; r = sin(x)' would be handled is TBD.
I think old g77
might already handle that pretty elegantly,
but whether we can cope with allowing the same fragment to reference
a different procedure, even with the same interface,
via `s = SiN(r)', needs to be determined.
If it can't, we need to make sure that when code introduces
a user-defined name, any intrinsic matching that name
using a case-insensitive comparison
is "turned off".)
CHARACTER
and Hollerith constants
are not allowed.
This avoids the confusion introduced by some Fortran compiler vendors providing C-like interpretation of backslashes, while others provide straight-through interpretation.
Some kind of lexical construct (TBD) will be provided to allow
flagging of a CHARACTER
(but probably not a Hollerith)
constant that permits backslashes.
It'll necessarily be a prefix, such as:
PRINT *, C'This line has a backspace \b here.' PRINT *, F'This line has a straight backslash \ here.' |
Further, command-line options might be provided to specify that
one prefix or the other is to be assumed as the default
for CHARACTER
constants.
However, it seems more helpful for g77
to provide a program
that converts prefix all constants
(or just those containing backslashes)
with the desired designation,
so printouts of code can be read
without knowing the compile-time options used when compiling it.
If such a program is provided
(let's name it g77slash
for now),
then a command-line option to g77
should not be provided.
(Though, given that it'll be easy to implement, it might be hard
to resist user requests for it "to compile faster than if we
have to invoke another filter".)
This program would take a command-line option to specify the default interpretation of slashes, affecting which prefix it uses for constants.
g77slash
probably should automatically convert Hollerith
constants that contain slashes
to the appropriate CHARACTER
constants.
Then g77
wouldn't have to define a prefix syntax for Hollerith
constants specifying whether they want C-style or straight-through
backslashes.
The above implements nearly exactly what is specified by 8.6.1 GNU Fortran Character Set, and 8.6.2 Lines, except it also provides automatic conversion of tabs and ignoring of newline-related carriage returns, as well as accommodating form-neutral INCLUDE files.
It also implements the "pure visual" model,
by which is meant that a user viewing his code
in a typical text editor
(assuming it's not preprocessed via g77stripcard
or similar)
doesn't need any special knowledge
of whether spaces on the screen are really tabs,
whether lines end immediately after the last visible non-space character
or after a number of spaces and tabs that follow it,
or whether the last line in the file is ended by a newline.
Most editors don't make these distinctions, the ANSI FORTRAN 77 standard doesn't require them to, and it permits a standard-conforming compiler to define a method for transforming source code to "standard form" however it wants.
So, GNU Fortran defines it such that users have the best chance of having the code be interpreted the way it looks on the screen of the typical editor.
(Fancy editors should never be required to correctly read code written in classic two-dimensional-plaintext form. By correct reading I mean ability to read it, book-like, without mistaking text ignored by the compiler for program code and vice versa, and without having to count beyond the first several columns. The vague meaning of ASCII TAB, among other things, complicates this somewhat, but as long as "everyone", including the editor, other tools, and printer, agrees about the every-eighth-column convention, the GNU Fortran "pure visual" model meets these requirements. Any language or user-visible source form requiring special tagging of tabs, the ends of lines after spaces/tabs, and so on, fails to meet this fairly straightforward specification. Fortunately, Fortran itself does not mandate such a failure, though most vendor-supplied defaults for their Fortran compilers do fail to meet this specification for readability.)
Further, this model provides a clean interface
to whatever preprocessors or code-generators are used
to produce input to this phase of g77
.
Mainly, they need not worry about long lines.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
This section is not about transforming "gotchas" into something else. It is about the weirder aspects of transforming Fortran, however that's defined, into a more modern, canonical form.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Each lexeme carries with it a pointer to where it appears in the source.
To provide the ability for diagnostics to point to column numbers, in addition to line numbers and names, lexemes that represent more than one (significant) character in the source code need, generally, to provide pointers to where each character appears in the source.
This provides the ability to properly identify the precise location of the problem in code like
SUBROUTINE X END BLOCK DATA X END |
which, in fixed-form source, would result in single lexemes consisting of the strings `SUBROUTINEX' and `BLOCKDATAX'. (The problem is that `X' is defined twice, so a pointer to the `X' in the second definition, as well as a follow-up pointer to the corresponding pointer in the first, would be preferable to pointing to the beginnings of the statements.)
This need also arises when parsing (and diagnosing) FORMAT
statements.
Further, it arises when diagnosing
FMT=
specifiers that contain constants
(or partial constants, or even propagated constants!)
in I/O statements, as in:
PRINT '(I2, 3HAB)', J |
(A pointer to the beginning of the prematurely-terminated Hollerith constant, and/or to the close parenthese, is preferable to a pointer to the open-parenthese or the apostrophe that precedes it.)
Multi-character lexemes, which would seem to naturally include
at least digit strings, alphanumeric strings, CHARACTER
constants, and Hollerith constants, therefore need to provide
location information on each character.
(Maybe Hollerith constants don't, but it's unnecessary to except them.)
The question then arises, what about other multi-character lexemes, such as `**' and `//', and Fortran 90's `(/', `/)', `::', and so on?
Turns out there's a need to identify the location of the second character of these two-character lexemes. For example, in `I(/J) = K', the slash needs to be diagnosed as the problem, not the open parenthese. Similarly, it is preferable to diagnose the second slash in `I = J // K' rather than the first, given the implicit typing rules, which would result in the compiler disallowing the attempted concatenation of two integers. (Though, since that's more of a semantic issue, it's not that much preferable.)
Even sequences that could be parsed as digit strings could use location info, for example, to diagnose the `9' in the octal constant `O'129''. (This probably will be parsed as a character string, to be consistent with the parsing of `Z'129A''.)
To avoid the hassle of recording the location of the second character, while also preserving the general rule that each significant character is distinctly pointed to by the lexeme that contains it, it's best to simply not have any fixed-size lexemes larger than one character.
This new design is expected to make checking for two `*' lexemes in a row much easier than the old design, so this is not much of a sacrifice. It probably makes the lexer much easier to implement than it makes the parser harder.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Certain lexemes need to be padded with virtual spaces when the end of the line (or file) is encountered.
This is necessary in fixed form, to handle lines that don't extend to column 72, assuming that's the line length in effect.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Last I checked, the Fortran 90 standard actually required the compiler to silently accept something like
FORMAT ( 1 2 Htwelve chars ) |
as a valid FORMAT
statement specifying a twelve-character
Hollerith constant.
The implication here is that, since the new lexer is a zero-feedback one,
it won't know that the special case of a FORMAT
statement being parsed
requires apparently distinct lexemes `1' and `2' to be treated as
a single lexeme.
(This is a horrible misfeature of the Fortran 90 language. It's one of many such misfeatures that almost make me want to not support them, and forge ahead with designing a new "GNU Fortran" language that has the features, but not the misfeatures, of Fortran 90, and provide utility programs to do the conversion automatically.)
So, the lexer must gather distinct chunks of decimal strings into a single lexeme in contexts where a single decimal lexeme might start a Hollerith constant.
(Which probably means it might as well do that all the time for all multi-character lexemes, even in free-form mode, leaving it to subsequent phases to pull them apart as they see fit.)
Compare the treatment of this to how
CHARACTER * 4 5 HEY |
and
CHARACTER * 12 HEY |
must be treated--the former must be diagnosed, due to the separation between lexemes, the latter must be accepted as a proper declaration.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Recognizing a Hollerith constant--specifically, that an `H' or `h' after a digit string begins such a constant--requires some knowledge of context.
Hollerith constants (such as `2HAB') can appear after:
Hollerith constants don't appear after:
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
While
REAL FUNCTION FOO () |
must be a FUNCTION
statement and
REAL FUNCTION FOO (5) |
must be a type-definition statement,
REAL FUNCTION FOO (names) |
where names is a comma-separated list of names, can be one or the other.
The only way to disambiguate that statement (short of mandating free-form source or a short maximum length for name for external procedures) is based on the context of the statement.
In particular, the statement is known to be within an
already-started program unit
(but not at the outer level of the CONTAINS
block),
it is a type-declaration statement.
Otherwise, the statement is a FUNCTION
statement,
in that it begins a function program unit
(external, or, within CONTAINS
, nested).
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The statement
READ (N) |
is equivalent to either
READ (UNIT=(N)) |
or
READ (FMT=(N)) |
depending on which would be valid in context.
Specifically, if `N' is type INTEGER
,
`READ (FMT=(N))' would not be valid,
because parentheses may not be used around `N',
whereas they may around it in `READ (UNIT=(N))'.
Further, if `N' is type CHARACTER
,
the opposite is true---`READ (UNIT=(N))' is not valid,
but `READ (FMT=(N))' is.
Strictly speaking, if anything follows
READ (N) |
in the statement, whether the first lexeme after the close parenthese is a comma could be used to disambiguate the two cases, without looking at the type of `N', because the comma is required for the `READ (FMT=(N))' interpretation and disallowed for the `READ (UNIT=(N))' interpretation.
However, in practice, many Fortran compilers allow the comma for the `READ (UNIT=(N))' interpretation anyway (in that they generally allow a leading comma before an I/O list in an I/O statement), and much code takes advantage of this allowance.
(This is quite a reasonable allowance, since the juxtaposition of a comma-separated list immediately after an I/O control-specification list, which is also comma-separated, without an intervening comma, looks sufficiently "wrong" to programmers that they can't resist the itch to insert the comma. `READ (I, J), K, L' simply looks cleaner than `READ (I, J) K, L'.)
So, type-based disambiguation is needed unless strict adherence to the standard is always assumed, and we're not going to assume that.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Continue researching gotchas, designing the transformational process, and implementing it.
Specific issues to resolve:
USE
processing take place?
This gets into the whole issue of how g77
should handle the concept
of modules.
I think GNAT already takes on this issue, but don't know more than that.
Jim Giles has written extensively on comp.lang.fortran
about his opinions on module handling, as have others.
Jim's views should be taken into account.
Actually, Richard M. Stallman (RMS) also has written up some guidelines for implementing such things, but I'm not sure where I read them. Perhaps the old [email protected] list.
If someone could dig references to these up and get them to me, that would be much appreciated! Even though modules are not on the short-term list for implementation, it'd be helpful to know now how to avoid making them harder to implement them later.
g77
command become just a script that invokes
all the various preprocessing that might be needed,
thus making it seem slower than necessary for legacy code
that people are unwilling to convert,
or should we provide a separate script for that,
thus encouraging people to convert their code once and for all?
At least, a separate script to behave as old g77
did,
perhaps named g77old
, might ease the transition,
as might a corresponding one that converts source codes
named g77oldnew
.
These scripts would take all the pertinent options g77
used
to take and run the appropriate filters,
passing the results to g77
or just making new sources out of them
(in a subdirectory, leaving the user to do the dirty deed of
moving or copying them over the old sources).
CHARACTER
(or Hollerith) constants?
Knowing what other compilers provide would help.
I've asked [email protected] for input on this. Not having to support these makes it easier to write the new front end, and might also avoid complicated its design.
The consensus to date (1999-11-17) has been to drop this support. Can't recall anybody saying they're using it, in fact.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Don't poke the bear.
The g77
front end generates code
via the gcc
back end.
The gcc
back end (GBE) is a large, complex
labyrinth of intricate code
written in a combination of the C language
and specialized languages internal to gcc
.
While the code that implements the GBE is written in a combination of languages, the GBE itself is, to the front end for a language like Fortran, best viewed as a compiler that compiles its own, unique, language.
The GBE's "source", then, is written in this language, which consists primarily of a combination of calls to GBE functions and tree nodes (which are, themselves, created by calling GBE functions).
So, the g77
generates code by, in effect,
translating the Fortran code it reads
into a form "written" in the "language"
of the gcc
back end.
This language will heretofore be referred to as GBEL, for GNU Back End Language.
GBEL is an evolving language,
not fully specified in any published form
as of this writing.
It offers many facilities,
but its "core" facilities
are those that corresponding most directly
to those needed to support gcc
(compiling code written in GNU C).
The g77
Fortran Front End (FFE)
is designed and implemented
to navigate the currents and eddies
of ongoing GBEL and gcc
development
while also delivering on the potential
of an integrated FFE
(as compared to using a converter like f2c
and feeding the output into gcc
).
Goals of the FFE's code-generation strategy include:
g77
-specific) constructs,
such as command-line options.
The strategies historically, and currently, used by the FFE to achieve these goals include:
"Don't poke the bear" somewhat summarizes the above strategies.
The GBE is the bear.
The FFE is designed and implemented to avoid poking it
in ways that are likely to just annoy it.
The FFE usually either tackles it head-on,
or avoids treating it in ways dissimilar to how
the gcc
front end treats it.
For example, the FFE uses the native array facility in the back end
instead of the lower-level pointer-arithmetic facility
used by gcc
when compiling f2c
output).
Theoretically, this presents more opportunities for optimization,
faster compile times,
and the production of more faithful debugging information.
These benefits were not, however, immediately realized,
mainly because gcc
itself makes little or no use
of the native array facility.
Complex arithmetic is a case study of the evolution of this strategy. When originally implemented, the GBEL had just evolved its own native complex-arithmetic facility, so the FFE took advantage of that.
When porting g77
to 64-bit systems,
it was discovered that the GBE didn't really
implement its native complex-arithmetic facility properly.
The short-term solution was to rewrite the FFE
to instead use the lower-level facilities
that'd be used by gcc
-compiled code
(assuming that code, itself, didn't use the native complex type
provided, as an extension, by gcc
),
since these were known to work,
and, in any case, if shown to not work,
would likely be rapidly fixed
(since they'd likely not work for vanilla C code in similar circumstances).
However, the rewrite accommodated the original, native approach as well
by offering a command-line option to select it over the emulated approach.
This allowed users, and especially GBE maintainers, to try out
fixes to complex-arithmetic support in the GBE
while g77
continued to default to compiling more code correctly,
albeit producing (typically) slower executables.
As of April 1999, it appeared that the last few bugs in the GBE's support of its native complex-arithmetic facility were worked out. The FFE was changed back to default to using that native facility, leaving emulation as an option.
Later during the release cycle (which was called EGCS 1.2, but soon became GCC 2.95), bugs in the native facility were found. Reactions among various people included "the last thing we should do is change the default back", "we must change the default back", and "let's figure out whether we can narrow down the bugs to few enough cases to allow the now-months-long-tested default to remain the same". The latter viewpoint won that particular time. The bugs exposed other concerns regarding ABI compliance when the ABI specified treatment of complex data as different from treatment of what Fortran and GNU C consider the equivalent aggregation (structure) of real (or float) pairs.
Other Fortran constructs--arrays, character strings,
complex division, COMMON
and EQUIVALENCE
aggregates,
and so on--involve issues similar to those pertaining to complex arithmetic.
So, it is possible that the history of how the FFE handled complex arithmetic will be repeated, probably in modified form (and hopefully over shorter timeframes), for some of these other facilities.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The FFE does not tell the GBE anything about a program unit until after the last statement in that unit has been parsed. (A program unit is a Fortran concept that corresponds, in the C world, mostly closely to functions definitions in ISO C. That is, a program unit in Fortran is like a top-level function in C. Nested functions, found among the extensions offered by GNU C, correspond roughly to Fortran's statement functions.)
So, while parsing the code in a program unit, the FFE saves up all the information on statements, expressions, names, and so on, until it has seen the last statement.
At that point, the FFE revisits the saved information (in what amounts to a second pass over the program unit) to perform the actual translation of the program unit into GBEL, ultimating in the generation of assembly code for it.
Some lookahead is performed during this second pass, so the FFE could be viewed as a "two-plus-pass" design.
22.4.1 Two-pass Code 22.4.2 Why Two Passes
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Most of the code that turns the first pass (parsing) into a second pass for code generation is in `gcc/gcc/f/std.c'.
It has external functions, called mainly by siblings in `gcc/gcc/f/stc.c', that record the information on statements and expressions in the order they are seen in the source code. These functions save that information.
It also has an external function that revisits that information, calling the siblings in `gcc/gcc/f/ste.c', which handles the actual code generation (by generating GBEL code, that is, by calling GBE routines to represent and specify expressions, statements, and so on).
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The need for two passes was not immediately evident
during the design and implementation of the code in the FFE
that was to produce GBEL.
Only after a few kludges,
to handle things like incorrectly-guessed ASSIGN
label nature,
had been implemented,
did enough evidence pile up to make it clear
that `std.c' had to be introduced to intercept,
save, then revisit as part of a second pass,
the digested contents of a program unit.
Other such missteps have occurred during the evolution of the FFE, because of the different goals of the FFE and the GBE.
Because the GBE's original, and still primary, goal
was to directly support the GNU C language,
the GBEL, and the GBE itself,
requires more complexity
on the part of most front ends
than it requires of gcc
's.
For example,
the GBEL offers an interface that permits the gcc
front end
to implement most, or all, of the language features it supports,
without the front end having to
make use of non-user-defined variables.
(It's almost certainly the case that all of K&R C,
and probably ANSI C as well,
is handled by the gcc
front end
without declaring such variables.)
The FFE, on the other hand, must resort to a variety of "tricks" to achieve its goals.
Consider the following C code:
int foo (int a, int b) { int c = 0; if ((c = bar (c)) == 0) goto done; quux (c << 1); done: return c; } |
Note what kinds of objects are declared, or defined, before their use, and before any actual code generation involving them would normally take place:
Whereas, the following items can, and do, suddenly appear "out of the blue" in C:
Not surprisingly, the GBE faithfully permits the latter set of items to be "discovered" partway through GBEL "programs", just as they are permitted to in C.
Yet, the GBE has tended, at least in the past, to be reticent to fully support similar "late" discovery of items in the former set.
This makes Fortran a poor fit for the "safe" subset of GBEL. Consider:
FUNCTION X (A, ARRAY, ID1) CHARACTER*(*) A DOUBLE PRECISION X, Y, Z, TMP, EE, PI REAL ARRAY(ID1*ID2) COMMON ID2 EXTERNAL FRED ASSIGN 100 TO J CALL FOO (I) IF (I .EQ. 0) PRINT *, A(0) GOTO 200 ENTRY Y (Z) ASSIGN 101 TO J 200 PRINT *, A(1) READ *, TMP GOTO J 100 X = TMP * EE RETURN 101 Y = TMP * PI CALL FRED DATA EE, PI /2.71D0, 3.14D0/ END |
Here are some observations about the above code, which, while somewhat contrived, conforms to the FORTRAN 77 and Fortran 90 standards:
ENTRY
statement is parsed.
DATA
statement is parsed.
REAL
or a subroutine
(which can be thought of as returning type void
or, to support alternate returns in a simple way,
type int
)
is not known
until the `CALL FRED' statement is parsed.
FORMAT
label
or the label of an executable statement
is not known
until the `X =' statement is parsed.
(These two types of labels get very different treatment,
especially when ASSIGN
'ed.)
ASSIGN
statement is parsed.
(This happens after executable code has been seen.)
Very few of these "discoveries" can be accommodated by the GBE as it has evolved over the years. The GBEL doesn't support several of them, and those it might appear to support don't always work properly, especially in combination with other GBEL and GBE features, as implemented in the GBE.
(Had the GBE and its GBEL originally evolved to support g77
,
the shoe would be on the other foot, so to speak--most, if not all,
of the above would be directly supported by the GBEL,
and a few C constructs would probably not, as they are in reality,
be supported.
Both this mythical, and today's real, GBE caters to its GBEL
by, sometimes, scrambling around, cleaning up after itself--after
discovering that assumptions it made earlier during code generation
are incorrect.
That's not a great design, since it indicates significant code
paths that might be rarely tested but used in some key production
environments.)
So, the FFE handles these discrepancies--between the order in which it discovers facts about the code it is compiling, and the order in which the GBEL and GBE support such discoveries--by performing what amounts to two passes over each program unit.
(A few ambiguities can remain at that point,
such as whether, given `EXTERNAL BAZ'
and no other reference to `BAZ' in the program unit,
it is a subroutine, a function, or a block-data--which, in C-speak,
governs its declared return type.
Fortunately, these distinctions are easily finessed
for the procedure, library, and object-file interfaces
supported by g77
.)
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Consider the following Fortran code, which uses various extensions (including some to Fortran 90):
SUBROUTINE X(A) CHARACTER*(*) A COMPLEX CFUNC INTEGER*2 CLOCKS(200) INTEGER IFUNC CALL SYSTEM_CLOCK (CLOCKS (IFUNC (CFUNC ('('//A//')')))) |
The above poses the following challenges to any Fortran compiler
that uses run-time interfaces, and a run-time library, roughly similar
to those used by g77
:
SYSTEM_CLOCK
expects to set an INTEGER*4
variable via its COUNT
argument,
the compiler must make available to it a temporary variable of that type.
SYSTEM_CLOCK
library routine returns,
the compiler must ensure that the temporary variable it wrote
is copied into the appropriate element of the `CLOCKS' array.
(This assumes the compiler doesn't just reject the code,
which it should if it is compiling under some kind of a "strict" option.)
SYSTEM_CLOCK
library routine returns),
the compiler must ensure that the IFUNC
function is called.
That requires evaluating its argument,
which requires, for g77
(assuming -ff2c
is in force),
reserving a temporary variable of type COMPLEX
for use as a repository for the return value
being computed by `CFUNC'.
CFUNC
should, ideally, be deallocated
(or, at least, left to the GBE to dispose of, as it sees fit)
as soon as CFUNC
returns,
which means before IFUNC
is called
(as it might need a lot of dynamically allocated memory).
g77
currently doesn't support all of the above,
but, so that it might someday, it has evolved to handle
at least some of the above requirements.
Meeting the above requirements is made more challenging by conforming to the requirements of the GBEL/GBE combination.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Most Fortran statements are given their own block, and, for temporary variables they might need, their own scope. (A block is what distinguishes `{ foo (); }' from just `foo ();' in C. A scope is included with every such block, providing a distinct name space for local variables.)
Label definitions for the statement precede this block, so `10 PRINT *, I' is handled more like `fl10: { ... }' than `{ fl10: ... }' (where `fl10' is just a notation meaning "Fortran Label 10" for the purposes of this document).
22.6.1 Statements Needing Temporaries 22.6.2 Transforming DO WHILE 22.6.3 Transforming Iterative DO 22.6.4 Transforming Block IF 22.6.5 Transforming SELECT CASE
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Any temporaries needed during, but not beyond, execution of a Fortran statement, are made local to the scope of that statement's block.
This allows the GBE to share storage for these temporaries among the various statements without the FFE having to manage that itself.
(The GBE could, of course, decide to optimize management of these temporaries. For example, it could, theoretically, schedule some of the computations involving these temporaries to occur in parallel. More practically, it might leave the storage for some temporaries "live" beyond their scopes, to reduce the number of manipulations of the stack pointer at run time.)
Temporaries needed across distinct statement boundaries usually
are associated with Fortran blocks (such as DO
/END DO
).
(Also, there might be temporaries not associated with blocks at all--these
would be in the scope of the entire program unit.)
Each Fortran block should get its own block/scope in the GBE.
This is best, because it allows temporaries to be more naturally handled.
However, it might pose problems when handling labels
(in particular, when they're the targets of GOTO
s outside the Fortran
block), and generally just hassling with replicating
parts of the gcc
front end
(because the FFE needs to support
an arbitrary number of nested back-end blocks
if each Fortran block gets one).
So, there might still be a need for top-level temporaries, whose "owning" scope is that of the containing procedure.
Also, there seems to be problems declaring new variables after generating code (within a block) in the back end, leading to, e.g., `label not defined before binding contour' or similar messages, when compiling with `-fstack-check' or when compiling for certain targets.
Because of that, and because sometimes these temporaries are not discovered until in the middle of of generating code for an expression statement (as in the case of the optimization for `X**I'), it seems best to always pre-scan all the expressions that'll be expanded for a block before generating any of the code for that block.
This pre-scan then handles discovering and declaring, to the back end, the temporaries needed for that block.
It's also important to treat distinct items in an I/O list as distinct statements deserving their own blocks. That's because there's a requirement that each I/O item be fully processed before the next one, which matters in cases like `READ (*,*), I, A(I)'---the element of `A' read in the second item must be determined from the value of `I' read in the first item.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
`DO WHILE(expr)' must be implemented so that temporaries needed to evaluate `expr' are generated just for the test, each time.
Consider how `DO WHILE (A//B .NE. 'END'); ...; END DO' is transformed:
for (;;) { int temp0; { char temp1[large]; libg77_catenate (temp1, a, b); temp0 = libg77_ne (temp1, 'END'); } if (! temp0) break; ... } |
In this case, it seems like a time/space tradeoff between allocating and deallocating `temp1' for each iteration and allocating it just once for the entire loop.
However, if `temp1' is allocated just once for the entire loop, it could be the wrong size for subsequent iterations of that loop in cases like `DO WHILE (A(I:J)//B .NE. 'END')', because the body of the loop might modify `I' or `J'.
So, the above implementation is used, though a more optimal one can be used in specific circumstances.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
An iterative DO
loop
(one that specifies an iteration variable)
is required by the Fortran standards
to be implemented as though an iteration count
is computed before entering the loop body,
and that iteration count used to determine
the number of times the loop body is to be performed
(assuming the loop isn't cut short via GOTO
or EXIT
).
The FFE handles this by allocating a temporary variable to contain the computed number of iterations. Since this variable must be in a scope that includes the entire loop, a GBEL block is created for that loop, and the variable declared as belonging to the scope of that block.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Consider:
SUBROUTINE X(A,B,C) CHARACTER*(*) A, B, C LOGICAL LFUNC IF (LFUNC (A//B)) THEN CALL SUBR1 ELSE IF (LFUNC (A//C)) THEN CALL SUBR2 ELSE CALL SUBR3 END |
The arguments to the two calls to `LFUNC'
require dynamic allocation (at run time),
but are not required during execution of the CALL
statements.
So, the scopes of those temporaries must be within blocks inside
the block corresponding to the Fortran IF
block.
This cannot be represented "naturally"
in vanilla C, nor in GBEL.
The if
, elseif
, else
,
and endif
constructs
provided by both languages must,
for a given if
block,
share the same C/GBE block.
Therefore, any temporaries needed during evaluation of `expr'
while executing `ELSE IF(expr)'
must either have been predeclared
at the top of the corresponding IF
block,
or declared within a new block for that ELSE IF
---a block that,
since it cannot contain the else
or else if
itself
(due to the above requirement),
actually implements the rest of the IF
block's
ELSE IF
and ELSE
statements
within an inner block.
The FFE takes the latter approach.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
SELECT CASE
poses a few interesting problems for code generation,
if efficiency and frugal stack management are important.
Consider `SELECT CASE (I('PREFIX'//A))',
where `A' is CHARACTER*(*)
.
In a case like this--basically,
in any case where largish temporaries are needed
to evaluate the expression--those temporaries should
not be "live" during execution of any of the CASE
blocks.
So, evaluation of the expression is best done within its own block,
which in turn is within the SELECT CASE
block itself
(which contains the code for the CASE blocks as well,
though each within their own block).
Otherwise, we'd have the rough equivalent of this pseudo-code:
{ char temp[large]; libg77_catenate (temp, 'prefix', a); switch (i (temp)) { case 0: ... } } |
And that would leave temp[large] in scope during the CASE blocks (although a clever back end *could* see that it isn't referenced in them, and thus free that temp before executing the blocks).
So this approach is used instead:
{ int temp0; { char temp1[large]; libg77_catenate (temp1, 'prefix', a); temp0 = i (temp1); } switch (temp0) { case 0: ... } } |
Note how `temp1' goes out of scope before starting the switch, thus making it easy for a back end to free it.
The problem that solution has, however, is with `SELECT CASE('prefix'//A)' (which is currently not supported).
Unless the GBEL is extended to support arbitrarily long character strings
in its case
facility,
the FFE has to implement SELECT CASE
on CHARACTER
(probably excepting CHARACTER*1
)
using a cascade of
if
, elseif
, else
, and endif
constructs
in GBEL.
To prevent the (potentially large) temporary,
needed to hold the selected expression itself (`'prefix'//A'),
from being in scope during execution of the CASE
blocks,
two approaches are available:
CASE
tests,
producing an integer ordinal that is used,
a la `temp0' in the earlier example,
as if `SELECT CASE(temp0)' had been written.
Each corresponding CASE
is replaced with `CASE(i)',
where i is the ordinal for that case,
determined while, or before,
generating the cascade of if
-related constructs
to cope with CHARACTER
selection.
CASE
string
that'll actually be compared against the expression
(in this case, `'prefix'//A').
Since that length must be constant
(because CASE
expressions are all constant),
it won't be so large,
and, further, `temp1' need not be dynamically allocated,
since normal CHARACTER
assignment can be used
into the fixed-length `temp0'.
Both of these solutions require SELECT CASE
implementation
to be changed so all the corresponding CASE
statements
are seen during the actual code generation for SELECT CASE
.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The interactions between statements, expressions, and subexpressions at program run time can be viewed as:
action(expr) |
Here, action is the series of steps performed to effect the statement, and expr is the expression whose value is used by action.
Expanding the above shows a typical order of events at run time:
Evaluate expr Perform action, using result of evaluation of expr Clean up after evaluating expr |
So, if evaluating expr requires allocating memory, that memory can be freed before performing action only if it is not needed to hold the result of evaluating expr. Otherwise, it must be freed no sooner than after action has been performed.
The above are recursive definitions, in the sense that they apply to subexpressions of expr.
That is, evaluating expr involves evaluating all of its subexpressions, performing the action that computes the result value of expr, then cleaning up after evaluating those subexpressions.
The recursive nature of this evaluation is implemented via recursive-descent transformation of the top-level statements, their expressions, their subexpressions, and so on.
However, that recursive-descent transformation is, due to the nature of the GBEL, focused primarily on generating a single stream of code to be executed at run time.
Yet, from the above, it's clear that multiple streams of code must effectively be simultaneously generated during the recursive-descent analysis of statements.
The primary stream implements the primary action items, while at least two other streams implement the evaluation and clean-up items.
Requirements imposed by expressions include:
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Names exported by FFE modules have the following (regular-expression) forms.
Note that all names beginning ffemod
or FFEmod
,
where mod is lowercase or uppercase alphanumerics, respectively,
are exported by the module ffemod
,
with the source code doing the exporting in `mod.h'.
(Usually, the source code for the implementation is in `mod.c'.)
Identifiers that don't fit the following forms are not considered exported, even if they are according to the C language. (For example, they might be made available to other modules solely for use within expansions of exported macros, not for use within any source code in those other modules.)
ffemod
FFEumod_[A-Z][A-Z0-9_]*
A #define
or enum
constant of the type ffemod
.
ffemod[A-Z][A-Z][a-z0-9]*
The portion of the identifier after ffemod
is
referred to as ctype
, a capitalized (mixed-case) form
of type
.
FFEumod_type[A-Z][A-Z0-9_]*[A-Z0-9]?
A #define
or enum
constant of the type
ffemodtype
,
where type is the lowercase form of ctype
in an exported typedef.
ffemod_value
ffemod_value_input
Below are names used for value and input, along with their definitions.
col
file
find
initialize
int
int
.
is
len
line
lookup
name
text
that points to a name of something.
new
find
without crashing.
pt
run
terminate
text
char *
that points to generic text.
[ << ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |