1. Background
WinATA Mark2 is an extension and development of the original ATA
(The Aston Text Analyser). With
the passage of time and technological change, the original ATA suffered increasingly
from a number of disadvantages. These included chiefly:
| * | It operated under MS-DOS. The indexer could not
operate with Windows running in the background, and increasingly users
were becoming unfamiliar with the most elementary operations at the DOS
prompt. | |
| * | It was designed as a research tool,
particularly for the needs of teachers engaged in an investigation of
natural language for pedagogic purposes, and did not lend itself so well
to wider research needs, or as a learner interface. This meant that
teachers could not easily explore the pedagogic possibilities offered by
a 'learner as researcher' approach to language learning. | |
There were also numerous requests from users for enhanced functionalities not
available under the original ATA. The new version is a completely
reprogrammed suite of tools operating under a 32-bit environment
(Windows 95 or later), designed with the
needs of the teacher researcher, the language researcher and the learner
researcher in mind. Each user simply ignores the built-in
functionalities not appropriate for his/her purposes.
2. Technical Requirements
WinATA will only run
under a 32-bit environment, e.g. Windows 95 or later. It will not run under Windows 3.x. Most
compatible machines will normally have sufficient power to run WinATA, but
as a lower bound a 100 MHz machine with 16 MB of RAM and 25 MB of free
disk space is recommended. WinATA is distributed on a CD-ROM.
3. Installing WinATA
The instructions for installing WinATA and your corpus are contained in a separate document.
4. Running ataIndex
Before you can use ataInsight, you must first create
the necessary database with ataIndex. This is because a program designed with so
many functions would be too slow handling a large corpus 'on the fly'.
Indexing should not take long on a reasonably fast machine,
typically five or six minutes for a quarter of a million words on a
300 MHz machine. Double-clicking on C:\Program Files\WinATA\ataIndex.exe will produce a
screen labelled "Corpus Administrator". At the top left corner click on
"Corpus" and select "New". This produces a dialogue box as shown in the
top half of Figure 1.
Figure 1. Dialogue box showing the selections made for ataIndex, and the
"Jobs to do" progress chart.
Now carry out all these steps:
| * | Under "Project description" (top
left) enter a name of your choice for this project. You may wish to
enter "Film Reviews". | |
| * | Make sure the language selected (top right) is
correct. (For most users only "English" will be available.) | |
| * | Select the
correct drive, in our example D: | |
| * | Select the correct directory in
which you have placed your corpus. In our case this will be
D:\WINATA\filmdir | |
| * | If you enter say "*.txt" in the small centre box
labelled "File filter", only files with the ".txt" extension will be
shown in the list underneath. | |
| * | The files in the directory you selected will now
be shown in the centre box. Highlight the one(s) you want (in our case
"filmrev.txt") and click on the arrow to move them to the box headed
"Selected corpus files". | |
| * | Finally, click on "Index". The full screen
shown in Figure 1 is now shown. | |
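The effect of the "File filter" box can be pictured with a short Python sketch. This is an illustration only, not WinATA's own code: the function name `apply_file_filter` and the directory listing are hypothetical, and the case-insensitive matching is an assumption based on the usual behaviour of Windows file names.

```python
from fnmatch import fnmatch

def apply_file_filter(filenames, pattern="*.txt"):
    """Return the names matching a shell-style pattern, ignoring case.

    Case-insensitive matching is an assumption, not documented
    WinATA behaviour.
    """
    return [name for name in filenames
            if fnmatch(name.lower(), pattern.lower())]

# Hypothetical directory listing, for illustration only.
listing = ["filmrev.txt", "notes.doc", "README.TXT"]
filtered = apply_file_filter(listing)
```

With the filter "*.txt", only `filmrev.txt` and `README.TXT` from the hypothetical listing would remain available for selection.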
5. The "Jobs to do" Progress Screen
The "Jobs to do" progress screen
(lower half of Figure 1) keeps you up to date with the various stages of
the indexing process. When this is finished you are shown the amount of
time taken (hours and minutes only). This is of interest only in the
case of large corpora, i.e. millions of words. Click OK and shut
down the indexer using the "Corpus/Exit" Menu option (top left corner).
6. Running ataInsight
Double-clicking on C:\Program Files\WinATA\ataInsight.exe will produce a screen
labelled "Open ATA project". The main box will show the project names of
all the corpora which you have indexed. You can highlight any of these
and click the delete button to remove it. To investigate a previously indexed corpus,
highlight its project name, in our example "Film Reviews", and click on
"OK". The main research screen appears as shown in Figure 2.
Figure 2. The main research screen.
7. Chief Features of the Research Screen
The illustration in Figure 2 is
of a corpus of articles headed "Money Markets" taken from a CD-ROM
kindly provided by The Financial Times. Chief features of the screen are
as follows:
| Feature | Description | |
| Title bar (top left): | The project name (in this case
"Money"), the total number of tokens in the corpus (in this case
152018), and the total number of types (in this case 7451). |
| The Word Frequency List box (upper left): | This shows the number of entries in the
list (in this case 7538, a higher number than the number of types, to take account
of potentially ambiguous types). The four columns in this box show: |
|
| => | The absolute (raw) frequency of the type in the corpus; |
| => | The relative (out of 10,000) frequency of the type in the corpus; |
| => | The relative (out of 10,000) frequency of the type in a user-defined reference corpus; |
| => | The type itself. | |
| The Concordance/profile box (upper right): | This is labelled
"Contexts" and shows a concordance for the type highlighted in the Word
Frequency List, "under" in the example. This may be converted to a
summary in the form of a synoptic profile (see Section 10 below). |
| The Sentence box (lower right): | This shows the full sentence containing the context
highlighted. It is possible to cut and paste from the Sentence box. |
| The 'Word Families' box (lower left). | This shows a group of types associated
by some syntactic or semantic criterion with the type currently
highlighted in the Word Frequency List. |
These boxes, and their associated functions, are each considered in more detail below.
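The relationship between tokens, types, raw frequency and relative frequency (out of 10,000) can be sketched in Python as follows. This is a simplified illustration, not WinATA's indexing code: the function name `frequency_list` and the tokenising rule (lower-cased runs of letters and apostrophes) are assumptions, and the sketch makes no attempt to handle the potentially ambiguous types mentioned above.

```python
import re
from collections import Counter

def frequency_list(text, per=10_000):
    """Return (token count, type count, rows) for a text.

    Each row is (raw frequency, relative frequency per `per` tokens, type),
    sorted by descending raw frequency and then alphabetically.
    The tokenising rule here is an assumption made for illustration.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    rows = [(raw, raw * per / total, word) for word, raw in counts.items()]
    rows.sort(key=lambda row: (-row[0], row[2]))
    return total, len(counts), rows

total, types, rows = frequency_list("the cat sat on the mat")
```

By the same arithmetic, if the 89 concordance lines for "under" in Figure 2 correspond to its raw frequency, then in a corpus of 152018 tokens its relative frequency would be 89 × 10,000 / 152018, or roughly 5.85 per 10,000.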
8. The Word Frequency List
A right mouse click in this box produces a
pull-down menu showing all available options. These operate as follows:
| Option | Description | |
| Alpha sort: | Present the Word Frequency List with types arranged in
alphabetical order. | |
| Numeric sort: | Present the Word Frequency List with
types arranged in numerical order. | |
| Find: | This produces a dialogue box in
which the user enters a string of letters to be located in the Word
Frequency List. This may be a whole word, or part of a word. The user
may specify that the string should either start or end the word.
Wildcards are allowed, that is "*" can be used to represent "any number (including zero)
of consecutive characters", and "?" any single character. The string
"a*e*i*o*u" with the "Containing" option selected (=anywhere in the
word) locates types containing the five vowels in the given order. (This
search on a corpus on hip replacement identified the word
"arteriovenous".) | |
| Filter: |
This functions in much the same way as Find,
except that instead of locating instances of the desired string one by
one, it collects them all together, e.g. all types ending in "ity" or,
as in the above example, all types containing the vowels in that order.
However, there are two further useful functions available under
"Filter". | |
|
| => | Frequency filter: this creates a list of all those types with a
specified raw frequency. For example, by entering "1" the researcher can
obtain a list of all the hapax legomena in the corpus. |
|
| |
| | => | Comparative filter:
Clicking on the Comparative option offers a choice of
comparisons between the relative frequencies in the corpus being studied and
those in the user-defined reference corpus. Possible comparisons are
greater than, less than, equal, or Not in reference. | |
|
| Collocations: |
This produces a full list of contexts (for the type currently highlighted) in
the upper right hand box. | |
| Synoptic Profile: |
This reduces the concordance
for the type highlighted to an ordered wordcount for each word position,
from four to the left to four to the right. | |
| Export: |
This causes the current Word Frequency List, in its present filtered or unfiltered
state, to be appended to a file called wfl.out in the same directory as
the corpus. | |
| Font: |
This enables the user to select a preferred typeface
and size for the display of the Word Frequency List. | |
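The Find and Filter options described above can be approximated in Python. The sketch below is an illustration under stated assumptions, not WinATA's implementation: all function names, the option names ("starting", "ending", "containing", "whole") and the sample frequencies are hypothetical. Note also that Python's `fnmatchcase` treats square brackets as character sets, which the WinATA wildcards are not documented to do.

```python
import operator
from fnmatch import fnmatchcase

def matches(word, pattern, position="containing"):
    """Test a type against a wildcard string ('*' and '?')."""
    if position == "starting":
        pattern = pattern + "*"
    elif position == "ending":
        pattern = "*" + pattern
    elif position == "containing":
        pattern = "*" + pattern + "*"
    # any other value (e.g. "whole") matches the pattern as it stands
    return fnmatchcase(word, pattern)

def string_filter(counts, pattern, position="containing"):
    """Collect all matching types together, as Filter does."""
    return sorted(w for w in counts if matches(w, pattern, position))

def frequency_filter(counts, raw):
    """All types with exactly the given raw frequency (1 = hapax legomena)."""
    return sorted(w for w, n in counts.items() if n == raw)

COMPARISONS = {"greater": operator.gt, "less": operator.lt, "equal": operator.eq}

def comparative_filter(corpus_rel, reference_rel, comparison):
    """Compare relative frequencies with those of a reference corpus."""
    if comparison == "not_in_reference":
        return sorted(w for w in corpus_rel if w not in reference_rel)
    compare = COMPARISONS[comparison]
    return sorted(w for w, f in corpus_rel.items()
                  if w in reference_rel and compare(f, reference_rel[w]))

# Hypothetical raw frequencies, for illustration only.
counts = {"the": 120, "arteriovenous": 1, "femoral": 3, "ability": 1}
vowels_in_order = string_filter(counts, "a*e*i*o*u")
hapaxes = frequency_filter(counts, 1)
```

Here `vowels_in_order` contains only "arteriovenous", while `hapaxes` contains both "ability" and "arteriovenous".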
9. The Contexts Box
This box (upper right) is filled with the
concordance, or synoptic concordance profile, requested as described
above. A right mouse click brings up a similar menu to that for the Word
Frequency List, with the following features:
| Option | Description | |
| Sort Left: | This
alphabetises the concordance by words to the left of the keyword.
| |
| Sort Right: |
This alphabetises the concordance by words to the right of the keyword.
| |
| Filter: |
This reduces the list to those lines containing a chosen substring in a
chosen position (right or left of keyword, or anywhere). Thus a filter
for "er " ('e' and 'r' followed by a space) applied to a concordance for
"than" would select all the comparatives in "-er" (plus a few oddments
like "water"). A left-hand filter using "more" would generate a list of
comparatives with "more". In this case oddments such as "moreover" can
be avoided by choosing the option of specifying that the string "more"
must in all cases represent a whole word. Once again, wildcards are
allowed, such that a search for "?n" (the letter 'n' preceded by any one
letter), with the further specification that this be a whole word, would
reduce the concordance to those lines which contain one of the words
'an', 'in' or 'on'. A further useful feature is the option of preserving
previous filtered lists in memory and adding new lists to them. Thus a
concordance for, e.g., "doing" could be successively filtered for the
various forms of the verb 'to be', in illustration of the present
continuous tense. In the illustration in Figure 2 the 89 lines of the
concordance for "under" have been filtered for the 35 lines containing
the type "pressure". | |
| Export: | Any list may be appended to a
file named "context.out" in the same directory as the corpus. The two
halves of each concordance line are separated by a tab, thus making it
easy to convert the text into a table using a standard word processor.
| |
| Font & Sentence to front: |
These two facilities are designed for use in
conjunction with a further facility. Next to the number of entries at
the top right corner of the box is a small window called the Maximising button. Clicking on this
expands the concordance/profile box to full screen size. If the font
size is too small, this can be changed by selecting 'Font'. Enlarging
the box, however, obscures the normal position of the full sentence. Selecting
'Sentence to front' corrects this. | |
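The Sort, Filter and Export options above can be illustrated with a small keyword-in-context sketch in Python. It is not WinATA's code, and several points are assumptions: the function names, the context width of four words, sorting on the word nearest the keyword first, and placing the keyword at the start of the right-hand half of an exported line. The filter shown handles only the whole-word case with wildcards, not arbitrary substrings.

```python
from fnmatch import fnmatchcase

def concordance(tokens, keyword, width=4):
    """Return (left words, keyword, right words) for each occurrence."""
    return [(tokens[max(0, i - width):i], tok, tokens[i + 1:i + 1 + width])
            for i, tok in enumerate(tokens) if tok == keyword]

def sort_right(lines):
    """Alphabetise by the words to the right of the keyword."""
    return sorted(lines, key=lambda line: line[2])

def sort_left(lines):
    """Alphabetise by the words to the left, nearest word first."""
    return sorted(lines, key=lambda line: line[0][::-1])

def filter_lines(lines, pattern, side="anywhere"):
    """Keep lines with a whole word matching the wildcard pattern."""
    def words(line):
        left, _, right = line
        return {"left": left, "right": right}.get(side, left + right)
    return [line for line in lines
            if any(fnmatchcase(w, pattern) for w in words(line))]

def export_line(line):
    """Join the two halves of a concordance line with a tab."""
    left, key, right = line
    return " ".join(left) + "\t" + " ".join([key] + right)

# Hypothetical text, for illustration only.
text = ("she is taller than him and more careful than me "
        "but water rather than wine")
lines = concordance(text.split(), "than")
with_more = filter_lines(lines, "more", side="left")
with_er = filter_lines(lines, "*er", side="left")
```

In this example the left-hand filter for "more" keeps one line, while the filter for "*er" keeps the line with "taller" and also one with the oddments "water" and "rather".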
10. The Synoptic Profile
This is a researcher's tool, and not intended
for language learners. It is intended to be richly informative rather
than user friendly. It is called a 'synoptic profile' because it shows
at a glance a summary of all types occurring in a concordance, together
with the number of times they occur in that position. Thus if we
have a synoptic profile for 'close' showing '60 to' in position -1
(immediately before the keyword) and '50 to' in position +1 (immediately
after the keyword), we know we are (almost certainly) dealing with 60
cases of the verb 'to close', and 50 of the adjective 'close to'. An
example of a full-screen version is shown in Figure 3, being a reduction
of 170 lines of a concordance for 'out'.
Figure 3. A synoptic profile for "out". For explanation see text.
Here the eight columns must be read as representing the eight
positions to right and left of the keyword, from -4 to +4. If for example one is
investigating the right and left associativity of 'out', it is
immediately clear (to the trained eye) that the particle bonds strongly
to form 'out+of' and 'due+out', and that a phrasal verb to 'take+out' is
widely used, all in the context of money market reports. The researcher
might next wish to return to the full-screen concordance
and reduce the full list for 'out' to instances where the keyword is
preceded by some form of 'take', and followed by 'shortage'. The result
is shown in Figure 4.
Figure 4. A concordance for 'out' successively and cumulatively filtered for
all parts of 'take' followed by 'shortage'.
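The idea behind the synoptic profile, a wordcount for each position from -4 to +4 around the keyword, can be sketched in Python. This is a minimal illustration, not WinATA's implementation: the function name `synoptic_profile` is hypothetical, and the sketch counts across the whole token list without regard to sentence boundaries, which may differ from what WinATA does.

```python
from collections import Counter

def synoptic_profile(tokens, keyword, span=4):
    """Count the types found at each position around the keyword.

    Returns a dict mapping each position (-span..-1 and +1..+span)
    to a Counter of the types occurring there.
    """
    profile = {pos: Counter() for pos in range(-span, span + 1) if pos != 0}
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        for pos in profile:
            j = i + pos
            if 0 <= j < len(tokens):
                profile[pos][tokens[j]] += 1
    return profile

# Hypothetical text, for illustration only.
tokens = "we want to close the door and stay close to home".split()
profile = synoptic_profile(tokens, "close")
before = profile[-1].most_common()  # ordered wordcount for position -1
after = profile[+1].most_common()   # ordered wordcount for position +1
```

In this toy text "to" occurs once at position -1 (the verb 'to close') and once at position +1 (the adjective 'close to'), which is the distinction the profile for 'close' described above makes visible on a larger scale.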
See also Notes for the advanced user contained in a separate document.