
Computers in Literary and Linguistic Research: A Comparative Study of Latin American and European E-



The European Association for Digital Humanities (EADH), formerly known as the Association for Literary and Linguistic Computing (ALLC), is a digital humanities organisation founded in London in 1973.[1] Its purpose is to promote the advancement of education in the digital humanities through the development and use of computational methods in research and teaching in the humanities and related disciplines, especially literary and linguistic computing.[2] In 2005, the Association joined the Alliance of Digital Humanities Organizations (ADHO).[3]


A precursor of the association's later annual conferences was a meeting on literary and linguistic computing organized by Roy Wisbey and Michael Farringdon at the University of Cambridge in March 1970. The year after the second conference, held in Edinburgh, Scotland, in 1972, the Association for Literary and Linguistic Computing was founded at a meeting at King's College, London (1973).[1] Together with the Association for Computers and the Humanities (ACH) and the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing sponsored and organized the Text Encoding Initiative (TEI) in 1987.[4]








David Holmes is a Principal Lecturer in Statistics at the University of the West of England, Bristol, with specific responsibility for co-ordinating the research programmes in the Department of Mathematical Sciences. He has taught literary style analysis to humanities students since 1983 and has published articles on the statistical analysis of literary style in the Journal of the Royal Statistical Society, History and Computing, and Literary and Linguistic Computing. He presented papers at the ACH/ALLC conferences in 1991 and 1993.


Computational Linguistics, or Natural Language Processing (NLP), is not a new field. As early as 1946, attempts were being made to use computers to process natural language. These attempts concentrated mainly on Machine Translation and, owing to the political situation of the time, almost exclusively on translation from Russian into English. Considerable resources were dedicated to this task, both in the U.S.A. and in Great Britain, during the fifties and sixties. Other countries, mainly in continental Europe, joined the enterprise, and the first systems (such as SYSTRAN) became operational at the end of this period. However, the limited performance of these systems made it clear that the underlying theoretical difficulties of the task had been grossly underestimated, and in the following years and decades much effort was spent on basic research in formal linguistics. Today, a number of Machine Translation systems are available commercially, although there is still no system that produces fully automatic high-quality translations (and probably there will not be for some time). Human intervention in the form of pre- and/or post-editing is still required in all cases.


Many people with a degree in Computational Linguistics work in research groups at universities, in governmental research labs, or in large enterprises. In Sweden, for example, computational linguists work in research groups at the various universities that offer courses in linguistics (such as Göteborg or Uppsala), at research labs like SICS (the Swedish Institute of Computer Science), or for companies like Telia or IBM.


Wmatrix is a software tool for corpus analysis and comparison. It provides a web interface to the English USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains. Wmatrix allows the user to run these tools via a web browser such as Chrome or Firefox, and so will run on any computer (Mac, Windows, Linux) with a web browser and a network connection.

Wmatrix was initially developed by Paul Rayson in the REVERE project, extended and applied to corpus linguistics during PhD work, and is still being updated regularly. Earlier versions were available for Unix via terminal-based command-line access (tmatrix) and for Unix via X Windows (Xmatrix), but these only offer retrieval of text pre-annotated with USAS and CLAWS.

Sections in this introduction to Wmatrix: screenshots, screencasts (short video introductions), acknowledgements and references for Wmatrix, and example applications and publications. A tutorial for Wmatrix gives step-by-step instructions using a case study on how to compare the Liberal Democrat and Labour Party manifestos for the 2005 UK General Election (updated May 2022). Further examples of the application to the 2010 general election manifestos can be seen on Paul's blog. The plain-text versions of the 2010 UK election manifestos can be downloaded for use in your favourite text analysis software (with thanks to Martin Wynne for editing two of the files). TEI-encoded versions of the 2010 election manifestos are now available (with thanks to Lou Burnard). A similar application has also been carried out on the 2015, 2017 and 2019 General Election manifestos, with downloadable versions of the documents from seven main parties.

Two versions of Wmatrix are now live:
- wmatrix5.lancaster.ac.uk
- wmatrix4.lancaster.ac.uk

Usernames for Wmatrix are free to members and alumni of Lancaster University for non-commercial research. Please apply on Wmatrix5 using your Lancaster email address, or, if you no longer have access to a Lancaster address as an alumnus, please contact Paul Rayson. Accounts on Wmatrix5 are freely available for UK government researchers and for academic researchers in countries on the OECD DAC list of ODA recipients, and these accounts will stay free beyond the current one-month trial period. Please apply on Wmatrix5 using your organisational email address. Usernames for non-commercial research and teaching (e.g. by non-Lancaster academics and students): a free one-month trial is available for individual academic users; please apply on Wmatrix5 using your organisational email address to set up a username and password. Once the one-month trial has expired, usernames are available for £50 per username per year from the online secure order page run by Lancaster University. Multiple usernames (or years) may be purchased at a reduced cost, e.g. for teaching purposes; please contact Paul for details. Further development, support, and external availability of Wmatrix currently depend on licensing its use.

Introduction to Wmatrix

Folders

Wmatrix users can upload their own corpus data to the system, so that it can be automatically annotated and viewed within the web browser. Each file is stored in a folder (equivalent to a folder in Windows or a directory on Unix).

Input format guidelines

The analysis may be improved with some pre-editing of the input text, although pre-editing is not normally required. There are guidelines provided for texts to be tagged by CLAWS. Most important is the replacement of less-than (<) and greater-than (>) characters by the corresponding SGML entity references (&lt; and &gt;). The text may contain well-formed HTML, SGML or XML tags. If the text contains less-than or greater-than symbols in formulae, for example, then CLAWS may mistake large quantities of the following text for SGML tags, or fail to POS tag the file. The guidelines mention start and end text markers, but these are not required since they are inserted for you by Wmatrix.
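As a minimal illustration of this pre-editing step, the sketch below escapes bare angle brackets in a plain-text file before upload. The file names are placeholders, and it assumes the text contains no markup you want to preserve; if your file already holds well-formed HTML, SGML or XML tags, leave those intact.

```python
# Minimal pre-editing sketch: replace bare < and > characters with the
# SGML entity references &lt; and &gt; so that CLAWS does not mistake
# the following text for SGML tags. File names are placeholders.

def escape_angle_brackets(text: str) -> str:
    return text.replace("<", "&lt;").replace(">", "&gt;")

with open("corpus.txt", encoding="utf-8") as infile:
    raw = infile.read()

with open("corpus_escaped.txt", "w", encoding="utf-8") as outfile:
    outfile.write(escape_angle_brackets(raw))
```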
Tag wizard

Wmatrix users can upload their file and complete the automatic tagging process by clicking on the tag wizard. Once the file has been uploaded to the web server, it is POS tagged by CLAWS and semantically tagged by USAS. This process can be carried out step by step, starting with the 'load file without tagging' option in the advanced interface. As a shortcut, you can simply upload frequency profiles if you have them. The format for a frequency list is a very simple two-column format with a total line at the head of the file. The column widths are not significant.
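As a rough sketch of building such a profile from a raw text, the snippet below writes a two-column list headed by a total line. The exact column order and the wording of the total line are assumptions based on the description above, so check them against Wmatrix's own example file before uploading.

```python
# Sketch: build a simple two-column frequency profile with a total line
# at the head of the file. Column order and the total-line wording are
# assumptions; compare with Wmatrix's example before relying on this.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(tokens)

with open("corpus.frq", "w", encoding="utf-8") as f:
    f.write(f"{sum(counts.values())} total\n")   # total line at the head
    for word, freq in counts.most_common():
        f.write(f"{freq} {word}\n")              # column widths not significant
```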
My Tag Wizard

My Tag Wizard is a variant of the tag wizard which allows you to override or extend the system dictionaries for your own data. There are two main uses. First, you can override the current most likely tag for any word or MWE. Second, you can extend the dictionaries in terms of coverage of vocabulary and tagset. For example, you can create a new tag by listing the words and MWEs that you wish to be tagged with it.

Viewing folders

By clicking on the folder name, the user can see its contents. Following the application of the tag wizard, the folder contains the original text, POS and semantically tagged versions of that text, and a set of frequency profiles.

Simple and advanced interfaces

The user can toggle between simple and advanced interfaces in Wmatrix. The advanced interface offers more options and more control over the data.

Frequency profiles

From the folder view, the user can click on a frequency list to see the most frequent items in their corpus. Frequency lists are available for words in the simple interface, and in the advanced interface for POS tags and semantic tags. The lists can be sorted alphabetically or by frequency.

Concordances

From the frequency list view, the user can click on 'concordance' and see standard concordances. These can show the usual word-based concordance as well as all occurrences for words in one POS or semantic category.

Key words, key POS and key domains: comparison of frequency lists

From the folder view, the user can click on 'compare frequency list' to perform a comparison of the frequency list for their corpus against another, larger normative corpus such as the BNC sampler, or against another of their own texts (once that text has been loaded into Wmatrix). This comparison can be carried out at the word level (to see keywords), at the POS level (in the advanced interface), or at the semantic level (to see key concepts or domains). The log-likelihood statistic is employed by Wmatrix; for more details, see the log-likelihood calculator. In the simple interface, word and tag clouds are shown which visualise the more significant differences in larger font sizes. In the advanced interface, more detailed frequency information is also displayed in table form. The key comparison then shows the most significant key items towards the top of the list, since the result is sorted on the LL (log-likelihood) field, which shows how significant the difference is. You should look only at items with a '+' code, since this shows overuse in your text as compared to the standard English corpora. To be statistically significant, you should look at items with an LL value over about 7, since 6.63 is the cut-off for 99% confidence of significance.
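The calculation behind this statistic can be sketched in a few lines. The snippet below follows the standard two-corpus contingency formulation used by log-likelihood calculators of this kind, with made-up frequencies purely for illustration.

```python
# Sketch of the log-likelihood keyness statistic for comparing frequency
# lists. a, b = an item's frequencies in your corpus and the reference
# corpus; c, d = the two corpus sizes in tokens. Figures are made up.
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

ll = log_likelihood(120, 40, 1_000_000, 1_000_000)
overused = 120 / 1_000_000 > 40 / 1_000_000   # the '+' code: overuse in your text
print(f"LL = {ll:.2f} (over 6.63, so significant at 99%), overused: {overused}")
```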
N-grams and c-grams

Recurrent sequences of words are called n-grams in Wmatrix. These are similar to clusters in WordSmith and lexical bundles in Biber's work. You can calculate n-grams of length 2 to 5 for each text. Collapsed-grams (or c-grams) are a merged version of these lists. They show you which 2-grams are subsets of 3-grams, which 3-grams are subsets of 4-grams, and so on. The resulting c-gram list is a tree structure with the longest n-grams on the left and the shortest n-grams on the right.
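A toy sketch of the underlying idea: collect sequences of each length and keep those that occur more than once. The c-gram merging step is only hinted at in a comment; Wmatrix's own implementation details are not described here.

```python
# Toy n-gram extraction for n = 2..5: keep only recurrent sequences
# (frequency > 1). Wmatrix's c-gram step would then merge these lists,
# noting e.g. that the 2-gram 'the cat' is a subset of 'the cat sat'.
from collections import Counter

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the cat sat on the mat and the cat sat down".split()
for n in range(2, 6):
    counts = Counter(" ".join(gram) for gram in ngrams(tokens, n))
    recurrent = {gram: f for gram, f in counts.items() if f > 1}
    if recurrent:
        print(f"{n}-grams:", recurrent)
```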
Collocations

Collocations in Wmatrix are pairs of words that occur together more often than would be expected by chance. There is a choice of 11 different statistics that can be used to calculate the strength of association between the two words. For further details about these statistics, see the following paper: Piao, S. (2002) Word alignment in English-Chinese parallel corpora. Literary and Linguistic Computing, 17 (2), 207-230. doi:10.1093/llc/17.2.207. The collocation feature was introduced in September 2009 and is currently in beta testing.
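The 11 statistics are not listed here, so the sketch below uses pointwise mutual information, one common association measure, purely to illustrate the general idea of comparing observed co-occurrence with what chance would predict.

```python
# Illustration of one association measure (pointwise mutual information)
# for adjacent word pairs. This is not necessarily one of Wmatrix's 11
# statistics; it just shows the observed-vs-expected comparison.
import math
from collections import Counter

tokens = "corpus linguistics relies on corpus data and corpus tools".split()
n = len(tokens)
word_freq = Counter(tokens)
pair_freq = Counter(zip(tokens, tokens[1:]))

def pmi(w1: str, w2: str) -> float:
    p_pair = pair_freq[(w1, w2)] / (n - 1)        # observed pair probability
    p_independent = (word_freq[w1] / n) * (word_freq[w2] / n)
    return math.log2(p_pair / p_independent)      # > 0: more often than chance

print(f"PMI(corpus, linguistics) = {pmi('corpus', 'linguistics'):.2f}")
```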
Screencasts

This section shows short video introductions to the Wmatrix software. Further videos will be appearing soon.

Acknowledgements and references

Wmatrix was initially developed within the REVERE project (REVerse Engineering of Requirements) funded by the EPSRC, project number GR/MO4846. Lancaster University proof-of-concept funding in July 2006 provided support for a new server and continued software development. In December 2006, further interface design using XHTML/CSS was carried out by Andrew Foote (InfoLab21 Knowledge Business Centre), funded under support from the European Regional Development Fund. Through a Lancaster University small grant (Towards an Online Conceptual Database of the Latin Vulgate Bible), a 'reader' interface is being developed for pre-tagged corpora.
