The Latin American and Caribbean Cultural Heritage Archives (LACCHA, established 2008) and the Society of American Archivists (SAA, founded 1936) will hold their Roundtable Meetings this August.
The guest speaker will be Kent Norsworthy, Content Director of the Latin American Network Information Center (LANIC). His talk is titled:
Archiving the Latin American Web: Challenges and Opportunities
The Latin American Government Documents Archive (LAGDA) seeks to preserve and facilitate access to a wide range of ministerial and presidential documents from 18 Latin American and Caribbean countries. The Archive contains copies of the Web sites of approximately 300 government ministries and presidencies. Site captures began on various dates in 2005 and 2006 and continue on a regular schedule.
Content in the Archive includes not only the full-text versions of official documents, but also original video and audio recordings of key regional leaders. Archive contents include thousands of annual and "state of the nation" reports; plans and programs; and speeches by presidents and government ministers. Content can be accessed via full-text search, or by browsing by country or by specialized sample collection, such as "Presidential Messages" or "Ministerial Documents".
LAGDA is a joint project of the University of Texas Libraries, The Nettie Lee Benson Latin American Collection, and the Latin American Network Information Center at The University of Texas at Austin. Web archiving services are provided by the Internet Archive's Archive-It service.
LAGDA Basics
• Collaborative effort
– Latin American Network Information Center - LANIC
– Benson Latin American Collection
– University of Texas Libraries
• 2003: CRL-Mellon grant, Web archiving political communications
• 2005: One of original five Archive-It Pilot Partners
Collection Focus
• Ministry and Presidency Web sites from Latin America & the Caribbean
• All major Spanish-speaking countries in the region, plus Brazil
• Sample of ministries: health, economy, defense, agriculture, etc.
Collecting Objectives #1
• Supplement Benson print collection
• Born digital, no longer printed at all
• Brief life on publishing entity’s site
Collecting Objectives #2
• Platform for research
• Numerous types of scholarly & applied research supported
• Important to recreate the “look and feel” of the original, ability to browse, etc.
• Historical record
LAGDA by the Numbers
• 280 Archive-It seeds in one collection
• 18 Latin American & Caribbean countries
• Quarterly crawls, six to date
• Linking to LAGDA: Over 100 sites, mostly libraries
More LAGDA Numbers
• Over 24.8 million files archived to date
• 2.4 million PDF documents archived
• Largest site archived (by file): Ecuador Ministry of Industry 120,000 URLs
• Largest site archived (by size): Colombian Presidency, 20GB
Why Web Archive?
• Governments come and go . . .
• Disk drives fill up . . .
• Ideologies evolve . . .
[Each point was illustrated on the slides with side-by-side "Live" and "Archived" screenshots; images not reproduced here.]
Challenges
• In Web archiving, Latin America is a “moving target”
• “Best Practices” in Web design = consistent Web archiving quality
• Overuse of JavaScript menus, IFrames, redirects, Flash, HTTPS, cookies, etc.
• How to make more researchers aware LAGDA exists
Quality Control
• Systematically separate crawl issues from playback issues
• Immediate corrective action on crawl issues (fix or eliminate seed)
• Address playback issues through user interface and documentation
Quality Control #2
Proxy Mode
• Eliminates many playback problems
• Confronts some provenance issues
Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research
History of archiving Latin America at UT Austin
• Benson Library collected gov docs in print since 1920s
• Latin America began moving to digital gov docs around 2000
• Download, print and curate
• Latin American Government Document Archive begins 2005
• Crawl entire websites, compress and curate data
• Provide access to digital content directly
Latin American Government Document Archive
LAGDA = 280 seeds: about 15 government ministries for each of 18 countries, crawled quarterly since 2005
• Files crawled and archived to date in LAGDA: 70 million
• Data archived: 5.9 TB
• Items added to collection per year: 9-10 million
• HTML pages archived per crawl: 1.6 million
• PDF documents archived per crawl: 260,000
• Monthly average pageviews on LAGDA: 2,918
Latin American Government Documents
LAGDA: challenges to data mining
• Heterogeneous corpus
• Various languages
• Data formats (HTML, Word, PDF, Other)
• Document characteristics
• Variety of sources (countries, governments, departments)
LAGDA: motivating problem
• Goal:
• Automatically attach labels to documents in a large collection based on training documents
• Challenges:
• Keyword search is ineffective due to lack of consistent words
• Training documents may cover broad subject areas
LAGDA: techniques for data mining
• Break documents into n-grams
• 1-gram {The, quick, brown, fox, jumps, over, the, lazy}
• 2-gram {The quick, quick brown, brown fox, fox jumps}
• 3-gram {The quick brown, quick brown fox…}
• Identify one or more subsets of n-grams with significantly high usage in the training documents
• Evaluate all documents in the corpus using these n-grams
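The three steps above can be sketched roughly as follows. This is an illustration only, not the presenters' implementation: the function names, the whitespace tokenizer, the "appears in at least a given fraction of training documents" selection rule, and the Spanish sample strings are all assumptions made for the example. The slides do not say how "significantly high usage" was measured.

```python
from collections import Counter


def ngrams(text, n):
    """Split text into lowercase word n-grams (naive whitespace tokenizer)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def select_ngrams(training_docs, n=2, min_doc_fraction=0.5):
    """Keep n-grams that occur in at least min_doc_fraction of training docs.

    Assumed stand-in for "significantly high usage"; a real system would
    likely compare against a background corpus as well.
    """
    doc_freq = Counter()
    for doc in training_docs:
        # set() so each n-gram counts once per document
        doc_freq.update(set(ngrams(doc, n)))
    threshold = min_doc_fraction * len(training_docs)
    return {gram for gram, count in doc_freq.items() if count >= threshold}


def score(doc, selected, n=2):
    """Fraction of the document's n-grams that are in the selected set."""
    grams = ngrams(doc, n)
    if not grams:
        return 0.0
    return sum(gram in selected for gram in grams) / len(grams)


# Hypothetical training documents for a "health ministry reports" label.
training = [
    "ministerio de salud informe anual",
    "informe anual del ministerio de salud",
]
selected = select_ngrams(training, n=2, min_doc_fraction=1.0)

print(score("plan del ministerio de salud", selected))  # 0.5
print(score("presupuesto de defensa", selected))        # 0.0
```

Scoring by the share of matching n-grams keeps long and short documents comparable, which matters in a corpus that mixes brief HTML pages with long PDF reports.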
LAGDA: techniques for data mining
• Use this score and others to create a composite score
• The company you keep - Examine the text and the links that point to our documents
• Natural language processing
• Named entities & Part-of-Speech tagging
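The slides do not say how the individual scores are combined. One simple possibility, shown here purely as an assumption, is a weighted average of per-signal scores that each lie between 0 and 1. The signal names (`ngram`, `inlink_text`, `named_entities`) and the weights are hypothetical.

```python
def composite_score(scores, weights):
    """Weighted average of per-signal scores (assumed combination rule).

    scores  -- dict mapping signal name to a value in [0, 1]
    weights -- dict mapping signal name to a non-negative weight
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical signals for one document.
scores = {"ngram": 0.5, "inlink_text": 1.0, "named_entities": 0.0}
weights = {"ngram": 2.0, "inlink_text": 1.0, "named_entities": 1.0}

print(composite_score(scores, weights))  # 0.5
```

Keeping the weights separate from the scores makes it easy to test whether a signal such as link text actually improves classification on the training set.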
LAGDA: technology for large-scale computing at TACC
• Corral data storage system (6 petabytes)
• Longhorn High Performance Cluster
• Paradigms for distributed computing (MPI and Hadoop)
• Nodes work in parallel and combine their results
• Allows us to divide and conquer the problem
• Open source libraries (Heritrix, Tika, Lucene, OpenNLP)
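The "divide and conquer" pattern described above can be shown in miniature. The sketch below is not MPI or Hadoop code: it uses Python's standard `concurrent.futures` thread pool as a stand-in for cluster nodes, and the function names and sample documents are assumptions. Each worker counts bigrams in its own share of the documents, and the partial counts are then merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_bigrams(docs):
    """'Map' step: count word bigrams in one chunk of documents."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def partition(items, parts):
    """Deal items round-robin into `parts` chunks."""
    return [items[i::parts] for i in range(parts)]


def parallel_bigram_counts(docs, workers=4):
    """Count bigrams per chunk in parallel, then combine the results."""
    chunks = partition(docs, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_bigrams, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # 'reduce' step: add partial counts together
    return total


docs = ["informe anual", "informe anual 2005", "plan nacional"]
totals = parallel_bigram_counts(docs, workers=2)
print(totals[("informe", "anual")])  # 2
```

Because counting is additive, the merged result is the same as a single sequential pass, which is what makes this kind of workload easy to spread across many nodes.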
LAGDA: initial results
• Traditional classification approaches are unsuccessful
• Our n-gram approach to classification, based on a training set, outperforms a traditional Bayesian inference classifier
• Results from our composite scores demonstrate additional improvement
“Big data” and libraries: going forward
• Challenges posed by web-archived data
• Size, heterogeneity and limited metadata
• Data access that is more dynamic and flexible
• How big data can create data-driven research
• Development of use cases and research examples
• Technology at the service of social sciences, humanities and other fields whose research could benefit
About LANIC - The Latin American Network Information Center (LANIC) is affiliated with the Lozano Long Institute of Latin American Studies (LLILAS) at the University of Texas at Austin. Live on the Internet since 1992, LANIC has a mission to facilitate access to Internet-based information to, from, or on Latin America. Its target audience includes people living in Latin America, as well as those around the world who have an interest in this region. While many of its resources are designed to facilitate research and academic endeavors, its site has also become an important gateway to Latin America for primary and secondary school teachers and students, private and public sector professionals, and just about anyone looking for information about this important region. LANIC's editorially reviewed directories contain over 12,000 unique URLs, making them one of the largest guides to Latin American content on the Internet.
One of LANIC’s initiatives is the Latin American Government Documents Archive (LAGDA). LAGDA seeks to preserve and facilitate access to a wide range of ministerial and presidential documents from 18 Latin American and Caribbean countries. The Archive contains copies of the Web sites of approximately 300 government ministries and presidencies. Content in the Archive includes not only the full-text versions of official documents, but also original video and audio recordings of key regional leaders. Archive contents include thousands of annual and "state of the nation" reports; plans and programs; and speeches by presidents and government ministers. LAGDA is a joint project of the University of Texas Libraries, The Nettie Lee Benson Latin American Collection, and the Latin American Network Information Center at The University of Texas at Austin.
nwoodward@mail.utexas.edu
http://lanic.utexas.edu/project/archives/lagda/
http://www.archive-it.org/public/collection.html?id=176