Instructions

How to download the entire ASJP database

This web application serves the latest released version of the ASJP database. The database in ASJP's txt format is included in the zip file available for download from Zenodo, at the path raw/lists.txt inside the archive.

You can now copy and paste any part of this txt file into a data file of your own. To find a particular language in the database, search for its ISO 639-3 code preceded by three spaces. The ISO 639-3 code sits at the far right of the second line of each wordlist. More than one list may carry the same ISO 639-3 code, representing different dialects or sources.
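The search described above can also be scripted. The following is a minimal Python sketch, not part of the ASJP software; the function name find_lists_by_iso is hypothetical, and it assumes the file has already been read into a list of lines and that the language name is everything before the first { on the line preceding the match.

```python
def find_lists_by_iso(lines, iso_code):
    """Return (line_index, language_name) for every list carrying iso_code.

    Assumption: `lines` holds the lines of lists.txt (or a subset of it),
    and the ISO 639-3 code is the last thing on the second line of a list.
    """
    needle = "   " + iso_code  # the code preceded by three spaces
    hits = []
    for i in range(1, len(lines)):
        # The matching line is the second line of a list, so the language
        # name is on the line just before it, in front of the "{".
        if lines[i].rstrip().endswith(needle) and "{" in lines[i - 1]:
            name = lines[i - 1].split("{", 1)[0]
            hits.append((i, name))
    return hits
```

Because several lists can share one code, the function returns all matches rather than stopping at the first.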

How to download a selection of (up to 1500) wordlists

Click on Wordlists. In the search fields below the column headers you can enter your search criteria. The ISO 639-3 field (but not the Glottocode or WALS fields) allows for entering multiple codes separated by spaces. The Latitude, Longitude, Number of speakers and Year of extinction fields allow for simple Boolean expressions. For instance, >1000000 in the Number of speakers field will produce all languages with more than a million speakers. In the Classification Ethnologue and the Classification Glottolog fields you can insert a family or a subgroup of a phylogeny. For instance, inserting Indo-European,Baltic into Classification Ethnologue will give you Latvian and Lithuanian. Clicking on the "ASJP text format" button produces a file for these languages which can either be copied and pasted or saved using the browser’s file save option.

General description of the ASJP database of wordlists

Whether you have downloaded the entire database or a subset, the general format is the same, and this is described here.

The first line is specific to ASJP software. For users of that software, the 2 in col. 6 is the maximum number of synonyms read for each item, the number in cols. 11-12 indicates that wordlists with at least that number of attested items are used, and the number in cols. 15-18 is a date that can specify which lists are used (if it’s 0, all lists with enough attested items are used).
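If you want to inspect these settings outside the ASJP software, the three fields just described can be read by column position. This is a small Python sketch under stated assumptions: the function name parse_header and the dictionary keys are hypothetical, and a blank field is treated as 0, which the description above does not specify.

```python
def parse_header(line):
    """Read the settings in cols. 6, 11-12 and 15-18 of the first line."""

    def field(first_col, last_col):
        # Columns in the description are 1-based and inclusive.
        text = line[first_col - 1:last_col].strip()
        # Assumption: an empty field is read as 0.
        return int(text) if text else 0

    return {
        "max_synonyms": field(6, 6),   # col. 6
        "min_items": field(11, 12),    # cols. 11-12
        "date": field(15, 18),         # cols. 15-18; 0 means no date filter
    }
```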

The next line gives the format for reading the immediately following list. The list itself consists of the 40 most stable items, as determined by Holman et al. (2008), in the 100-item list of Swadesh (1955). Most of the wordlists in the database contain these 40 items, or as many of them as are attested in the sources. About 300 wordlists contain as many items as are attested from the full 100-item Swadesh list. The English names of the items are less important than the preceding numbers, which are used to identify the items in the wordlists.

The next list consists of the ASJP code symbols that are used to transcribe the wordlists. These are described by Brown et al. (2008).

Then there is a wordlist for each language, on consecutive lines. In the full database, lists are ordered according to the classification in WALS (Dryer and Haspelmath 2013). Families are ordered geographically, genera are ordered alphabetically within families, and languages are ordered alphabetically within genera. The format of each list, including the first two lines consisting of metadata, is described below in the section "More detail on the software and input file format".

How to get a matrix of ASJP distances

Download the entire database and make a selection, or download a selection as described above. The file must contain everything that precedes the first list in the ASJP database, including the blank lines. It must also end with a line containing at least 5 blank characters (     ), and it must have no other blank lines between or within lists. You can now apply our software, which is downloaded and installed as follows.
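The requirement for the final line is easy to miss when a file is assembled by hand. The Python sketch below shows one way to enforce it; the function name ensure_terminator is hypothetical, and the sketch only handles the end of the file, not blank lines between or within lists.

```python
def ensure_terminator(text, width=5):
    """Return `text` ending in exactly one line of `width` blank characters."""
    lines = text.splitlines()
    # Remove any empty or whitespace-only lines already at the end, so that
    # the file does not finish with several blank lines in a row.
    while lines and lines[-1].strip() == "":
        lines.pop()
    lines.append(" " * width)
    return "\n".join(lines) + "\n"
```

Applying the function to a file that already ends correctly leaves it unchanged.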

  1. Click on Software.
  2. Click on Programs for calculating ASJP distance matrices (Holman 2011c).
  3. Unzip and save the .exe files in the same folder that your data file is in.

There is also an Instructions file, but all the instructions for the programs described here are also contained in the present document. There are three programs: asjp62, asjp62x, and asjp62e. To produce just a distance matrix, use asjp62. The other programs are described in the next section.

You run the program in the DOS Command Prompt as follows. Type:

        asjp62 < yourdata.txt > output

where ‘yourdata.txt’ is the name of your data file and ‘output’ is the name of your output file. Press Enter and wait a few seconds to a few hours, depending on the size of your data file.
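If you prefer to start the program from a script rather than typing the command, the same redirection can be reproduced with Python's standard subprocess module. This is only a sketch: the function name run_asjp is hypothetical, and it assumes the executable can be found from the working directory.

```python
import subprocess


def run_asjp(command, input_path, output_path):
    """Run `command` with its input read from one file and its output
    written to another, like `program < input > output`."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        completed = subprocess.run(command, stdin=src, stdout=dst)
    # A return code of 0 conventionally means the program ended normally.
    return completed.returncode
```

A call such as run_asjp(["asjp62"], "yourdata.txt", "output") would then correspond to the command shown above.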

The output should be a matrix of distances between the lists in the data file. With the default settings:

  • words identified in the database as loans aren’t used in calculating distances;
  • if a list has two synonymous words for a given item, both are used and the distance is the average of the distances based on each synonym;
  • if a list has words for fewer than 28 items (not counting loanwords), or if the list refers to a language that went extinct before 1700 CE, distances involving that list aren’t calculated.
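The third default can be restated as a small decision rule. The Python sketch below is an illustration rather than the program's own code; the function name list_is_used is hypothetical, and it relies on the coding of the number-of-speakers field described later in this document (-1 for recently extinct, -2 for long extinct, and a date preceded by a minus sign when the extinction date is known).

```python
def list_is_used(attested_items, speakers, min_items=28, date=1700):
    """Decide whether a list takes part in the distance calculation.

    attested_items: number of items with a word, not counting loanwords.
    speakers: value of the number-of-speakers field of the list.
    """
    if attested_items < min_items:
        return False              # too few attested items
    if date > 0:
        if speakers == -2:
            return False          # long extinct
        if speakers < -2 and -speakers < date:
            return False          # extinction date is earlier than `date`
    return True
```

With date set to 0 the extinction test is skipped, so only the minimum number of items matters.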

To change these options, consult the next section.

More detail on the software and input file format

The programs described here calculate LDND, as defined by Bakker et al. (2009), between pairs of languages. The programs all use the same input and produce slightly different output.

  • asjp62 produces a matrix of LDND with rows and columns labeled by the language names.
  • asjp62x produces the same matrix in a format appropriate for use as input to the MEGA6 phylogeny package (Tamura et al. 2013). It outputs distances as percentages with two decimal places but with the decimal point removed (equivalently, the percentages can be read as multiplied by 100). This saves space in the matrix which, with the current number of languages in the database, would otherwise be too large for MEGA6.
  • asjp62e produces 1-LDND for pairs of languages within taxonomic groups, with each pair on a separate line in a format appropriate for pasting into spreadsheet software like Excel.
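For orientation, the quantity these programs report can be sketched in a few lines of Python. The sketch follows the usual reading of the definition in Bakker et al. (2009): the Levenshtein distance between two words is divided by the length of the longer word (LDN), and the average LDN over words for the same item is divided by the average LDN over words for different items. It is not the ASJP implementation; the function names are hypothetical, each list is assumed to have exactly one word per item, and synonyms, loans and modifier symbols are not handled.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ldn(a, b):
    """Levenshtein distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(list_a, list_b):
    """list_a, list_b: dicts mapping item number -> one transcribed word.

    Assumption: at least two items are attested in both lists.
    """
    shared = sorted(set(list_a) & set(list_b))
    same = [ldn(list_a[i], list_b[i]) for i in shared]
    diff = [ldn(list_a[i], list_b[j]) for i in shared for j in shared if i != j]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))
```

Dividing by the average over different items corrects for chance resemblance, for example when two languages have similar sound inventories.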

To run a program, open the MS-DOS command prompt, type a command of the form

        program < input > output

and then press Enter. For example, the command

        asjp62x < input.txt > output62x.txt

will run asjp62x on input.txt to produce output62x.txt. The computer may add the line

Stop - Program terminated.

to the end of the output, but this can be deleted before the output is used further.
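If the output is processed further by a script, the extra line can be removed automatically. A minimal Python sketch follows; the function name strip_stop_line is hypothetical, and it also drops blank lines at the very end of the output, which is an assumption about what is convenient rather than a requirement.

```python
def strip_stop_line(text):
    """Remove a trailing 'Stop - Program terminated.' line, if present."""
    lines = text.splitlines()
    # Drop the message and any blank lines found at the end of the output.
    while lines and lines[-1].strip() in ("", "Stop - Program terminated."):
        lines.pop()
    return "\n".join(lines) + "\n"
```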

The input file must obey the following general rules.

  • The first line is in fixed format so the columns are important.
    Col. 6:
    maximum number of synonyms read for each item (1 or 2).
    Col. 11-12:
    minimum number of attested items in lists, up to 100; lists with fewer attested items are ignored.
    Col. 15-18:
    if this number is 0, all lists are read; if it’s a positive number, it’s interpreted as a date and lists from languages extinct before that date are ignored.
    Col. 24:
    if this is a number other than 0, transcribed words and phrases preceded by % are ignored, which allows loans to be excluded if they are identified by %.
    Col. 30:
    Taxonomic rank of groups within which similarities are calculated by asjp62e: 3 = families, 2 = genera. Only asjp62e uses this information; asjp62 and asjp62x ignore it. The ASJP database uses the families and genera defined in WALS (Dryer and Haspelmath 2013) but the computer will accept whatever definition is specified for the languages as described below for Col. 2 in the second line of metadata.
  • The next line gives the format for reading the item numbers below it. The programs described here ignore the item names so the format in the example could just as well be I4.
  • The next set of lines gives the item numbers that will be used. There is one line for each item in the list. The item numbers must be between 1 and 100 inclusive, but they don't have to be consecutive or listed in numerical order. For I4 format, the numbers must be in Cols. 1-4, right justified. Items with numbers other than those listed here aren’t used in calculating LDND. The item names in the example are just for convenience.
  • There must be a blank line after the item list. Press the space bar a few times to give the computer something to read. This line tells the computer that the item list is finished.
  • The next set of lines gives the ASJPcode symbols, one per line in Col. 1, in any order. As an alternative to ASJPcode, any ASCII symbols can be used; there can be up to 100 different symbols. Symbols not on this list aren’t used in calculating LDND (except for the four modifier symbols, which are described in Brown et al. 2008).
  • There must be two blank lines after the symbol list.
  • Then there is a wordlist for each language, on consecutive lines, consisting of two lines of metadata and then the wordlist proper. The metadata format is described below.
  • The first line for each list gives the name of the language followed within curly brackets by its position in three classifications, without any blank spaces. The name is taken from the source of the list; it never starts with a number or a blank. Between { and | is the classification of the language in WALS. It’s of the form Fam.GENUS, with the family name abbreviated and the genus name spelled out. Between | and @ is the classification of the language in Ethnologue (Lewis et al. 2016), and between @ and } is the classification in Glottolog (Hammarström et al. 2016). Names of taxonomic groups and subgroups are separated by commas and ordered from most inclusive to least inclusive. Languages not in a given classification are classified from information in the source for the list. If this information is insufficient for WALS, the family and genus are called Unknown. If it’s insufficient for Ethnologue or Glottolog, the sequence of subgroups is continued only as far as the information permits, including in some cases no groups at all.
  • The second line gives properties of the languages, again in fixed format so the columns are important.
    Col. 2:
    3 if the language is the first one in a new family, 2 if it’s the first language in a new genus, 1 otherwise.
    Col. 4-10:
    latitude in degrees and hundredths of a degree; minus means South. The programs described here don’t use this information.
    Col. 12-18:
    longitude in degrees and hundredths of a degree; minus means West. The programs described here don’t use this information.
    Col. 19-30:
    number of speakers, from Ethnologue (Lewis et al. 2016); 0 if the number of speakers is unknown; -1 if the language is recently extinct; -2 if the language is long extinct; or if the approximate date of extinction is known, the date is preceded by a minus sign. If there is a date in the first line of the entire file, lists with earlier extinction dates here are ignored, as are lists with -2; otherwise, all lists are used.
    Col. 34-36:
    three-letter WALS code, if any. The programs described here don’t use this information.
    Col. 40-42:
    three-letter ISO code from Ethnologue, if any. The programs described here don’t use this information.
  • Each of the next lines refers to an item in the list, until the next language begins. Items can be in any order. The line must begin with the item number, starting in Col. 1, left justified. The next column after the number can be anything except a tab. The program then ignores everything until it reaches a tab; this part of the line can be used for the name of the item. After the tab is the transcribed word or phrase; words in a phrase are separated by a space, which is ignored in the calculations; synonyms are separated by a comma. XXX here means that the item isn’t attested for the language; alternatively, unattested items can be omitted from the list. The end of the transcription is indicated by a space and then //. Two consecutive spaces also signal the end of the transcription.
  • There must be a blank line after the last list.
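The per-list format described above can be read with a few short functions. The following Python sketch is an illustration under stated assumptions, not the code of the ASJP programs: all function names and dictionary keys are hypothetical, blank latitude and longitude fields are returned as None, a blank number-of-speakers field is read as 0, and the loan marker % is left in place.

```python
def parse_name_line(line):
    """Split 'NAME{wals|ethnologue@glottolog}' into its parts."""
    name, rest = line.rstrip().split("{", 1)
    wals, rest = rest.rstrip("}").split("|", 1)
    ethnologue, glottolog = rest.split("@", 1)
    return {
        "name": name,
        "wals": wals,  # of the form Fam.GENUS
        # Groups run from most inclusive to least inclusive; may be empty.
        "ethnologue": ethnologue.split(",") if ethnologue else [],
        "glottolog": glottolog.split(",") if glottolog else [],
    }


def parse_metadata_line(line):
    """Read the fixed-format second line of a list."""

    def cols(first, last):
        # Columns are 1-based and inclusive.
        return line[first - 1:last].strip()

    def number(text, convert):
        return convert(text) if text else None

    return {
        "rank": int(cols(2, 2)),                    # 3, 2 or 1
        "latitude": number(cols(4, 10), float),     # minus means South
        "longitude": number(cols(12, 18), float),   # minus means West
        "speakers": int(cols(19, 30) or 0),
        "wals_code": cols(34, 36),
        "iso_code": cols(40, 42),
    }


def parse_item_line(line):
    """Return (item_number, synonyms); synonyms is [] for XXX."""
    head, _, tail = line.partition("\t")
    digits = ""
    for ch in head:  # the item number starts in Col. 1
        if not ch.isdigit():
            break
        digits += ch
    # The transcription ends at " //" or at two consecutive spaces.
    ends = [p for p in (tail.find(" //"), tail.find("  ")) if p != -1]
    text = tail[:min(ends)] if ends else tail.rstrip()
    if text.strip() == "XXX":
        return int(digits), []
    # Synonyms are separated by commas; spaces inside a phrase are dropped
    # because they are ignored in the calculations.
    words = [w.replace(" ", "") for w in text.split(",")]
    return int(digits), [w for w in words if w]
```

Slicing by column position rather than splitting on whitespace keeps the fields apart even when a value fills its field completely or a code is missing.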

References

Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant, and Eric W. Holman. 2009. Adding Typology to Lexicostatistics: A Combined Approach to Language Classification. Linguistic Typology 13:167-179.

Brown, Cecil H., Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. Automated classification of the world’s languages: a description of the method and preliminary results. STUF – Language Typology and Universals:285-308.

Hammarström, Harald, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2016. Glottolog 2.7. Jena: Max Planck Institute for the Science of Human History. (http://glottolog.org)

Dryer, Matthew S., and Martin Haspelmath (eds.). 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (http://wals.info/)

Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, Pamela Brown, and Dik Bakker. 2008. Explorations in automated language comparison. Folia Linguistica 42:331-354.

Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2016. Ethnologue: Languages of the World, Nineteenth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.

Tamura, K., G. Stecher, D. Peterson, A. Filipski, and S. Kumar. 2013. MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0. Molecular Biology and Evolution 30:2725-2729.

Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics: 121-137.