The Goals of the Library Data Archive
We librarians are not kind to our data. While we organize, store, and preserve human records, we are, sadly, not good about preserving or archiving data about libraries.
This archive has a number of types of data about libraries. It has collections of raw data in various digital formats and it has publications about libraries, increasingly in PDF format. The largest part of this archive, currently, is data collected and published by US government entities, although I will be publishing a great deal more from other organizations. I still hope to find all US government publications about libraries and digitize them. It is a daunting prospect.
There is a subset of these publications: those which published data collected systematically, over a number of years, from a defined population of libraries. The two big categories are public libraries and academic libraries. There are quite a number of such series, but longitudinal files of these data are typically not produced by the issuing agency. Characteristically, these data are issued one year at a time without an explicit infrastructure to bind the data from the various years together. The Association of Research Libraries (ARL) is the notable exception: it is the steward of the longest-running such longitudinal series, which now spans over 110 years.
This archive was started as a result of my experience in recompiling such longitudinal series of library data, work that grew out of my research interests. As mentioned, when systematically collected population library data are compiled and published, they are characteristically published one year at a time, normally without reference to previous or future data, though they often (not always) continue practices from previous years. The major use of such data seems to be comparing a given library with others for budget presentations and related activities. But data collected for one purpose can be used for others, such as examining trends in library practices and finances. To create a longitudinal file of library data, the various individual annual publications must be found and converted, these days, to digital formats. But what is a “longitudinal file of library data”?
These are data collected and arranged over time so that one can study trends. Given the facts of publication, it is necessary that the annual data be rearranged with a date field and other accommodations required to accurately present the data as published. The person recompiling these annual data will normally add dates of collection if they are not there and may well add other fields. For instance, the largest file in this Archive is PLDF3, the US Public Library Data File. In the original annual publications, the libraries have a key variable called the “FSCSKey” which, for reasons explained at the link, was not usable in the longitudinal file to do what key variables must do: provide a single variable to identify each entity over time. Libraries change names, addresses, etc., as the years pass, so a key variable is a critical item for a longitudinal file. There is an exceptionally long, intricate, and, regrettably, dull discussion of the solution I developed at the link that involved creating and adding a new key variable. It was a hard problem. Note, though, that the variables in the original are not changed but new ones are added, so the longitudinal file is a superset of the original with added infrastructure for analysts. The compiler must first do no harm. Happily, most series I have worked on have had less complicated key variable characteristics. They often make up for that strength in other ways, though.
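The general idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used for PLDF3: the field names (LIBNAME, VISITS, YEAR, STABLE_KEY), the sample values, and the crosswalk table are all hypothetical, and the way the original key varies between years here is invented for the example. The real difficulties and the real solution are in the linked discussion.

```python
from typing import Dict, List, Tuple

# Hypothetical annual files, one list of records per publication year.
# Field names and values are invented for illustration only.
annual_files: Dict[int, List[dict]] = {
    2001: [{"FSCSKEY": "AA0001", "LIBNAME": "Elm Public Library", "VISITS": 1000}],
    2002: [{"FSCSKEY": "AA0009", "LIBNAME": "Elm Public Library", "VISITS": 1100}],
}

# Hypothetical crosswalk kept by the compiler:
# (year, original key) -> one stable key per library across all years.
crosswalk: Dict[Tuple[int, str], str] = {
    (2001, "AA0001"): "LIB-000001",
    (2002, "AA0009"): "LIB-000001",
}


def build_longitudinal(files: Dict[int, List[dict]],
                       keys: Dict[Tuple[int, str], str]) -> List[dict]:
    """Stack the annual records, adding a year and a stable key to each."""
    rows = []
    for year in sorted(files):
        for record in files[year]:
            row = dict(record)  # copy, so the original record is never altered
            row["YEAR"] = year  # added date field
            # Added key variable; None if the crosswalk has no entry yet.
            row["STABLE_KEY"] = keys.get((year, record["FSCSKEY"]))
            rows.append(row)
    return rows


longitudinal = build_longitudinal(annual_files, crosswalk)
```

Every original variable survives unchanged in each row; the year and the new key are simply added alongside, which is the superset property described above.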
I have recompiled a number of such series and too often, finding all years is a challenge because they get lost by our library colleagues. After some experience with this fact, I resolved to keep copies of all such compilations I have done and to gather library data where I can find them. That ongoing work is documented here. Periodically, I look for library data on the Web. These digital files are but a small part of all library data available. There are enormous collections of data publications in paper that have not been converted to digital formats. There is so much that could be done with them but librarians are rarely numerate and the value in these archives is not clear to our colleagues nor is there a critical mass of people skilled in data in our field. I was in the University of Toronto's iSchool library as a collection of such paper copies was being boxed up and moved to storage. Box after box. All that potentially useful information about library trends off to cold storage.
These collections do take up space...and who uses them? Working librarians have many constraints; a historical one in archives is space, and librarians are rarely comfortable with numerical data. This Archive will be expanded to include either data I have found or data I have recompiled and I would welcome the opportunity to compile some of those data locked in paper. Let me add that some of the oldest numbers we have as a species are library numbers. Libraries are a key to what humans do because libraries are a part of the memory function. We are not the fastest species and there are those who will say we are not the smartest but we have institutions that organize our memories and what we have learned. Libraries are one of those institutions.
A review of longitudinal publications of library data
The oldest longitudinal recompilation I am aware of is College and University Library Statistics (Princeton, 1947.) It is referred to here as the “Princeton Compilation.” In the early 1980s, when I first ran across them, the data were referred to as the “Princeton” data/statistics and the received wisdom was that 1920 was the first year of the series because it was the first year of the data of this monograph. The publication itself is a bound typescript of a longitudinal rearrangement of the annual data sheets from the 1919/20-1943/44 academic years.
In recompiling these data and trying to understand what their history was, I talked to Haynes McMullen at the (then) School of Library Science at the University of North Carolina, who researched the early history of US libraries and used data in those discussions. At the time I asked Professor McMullen if he knew about these data, which I had thought began in 1920 when I walked into his office that day. But Professor McMullen reached in his files and pulled out a virtually complete run from the first data (that is, 1907/08) through at least 1920. It turned out he got them while working on his dissertation. But imagine my surprise to discover the series began before 1920. I have found the first publication of these data thought-provoking. This copy of the 1907/08 academic year shows 14 libraries and six variables that have grown into the astonishingly rich ARL Data we have today. This copy of that first issue was obtained by Nicola Duval at ARL from the archives at the University of Minnesota to include in the monograph The Gerould Statistics. What a wonderful surprise that was when she had obtained the copy and included it in the Gerould Statistics. She and I surprised Kendon with Gerould Statistics and Nicky surprised me.
Who was James Thayer Gerould? Gerould was the director of the University of Minnesota library in the early part of the 20th century and had an idea. He wrote an article that appeared in Library Journal in 1906: “A Plan for the Compilation of Comparative University and College Library Statistics.” A committee was appointed by ALA and as far as I can tell, the committee never made a report. Undaunted, Gerould went ahead and started collecting and reporting data. The first year had data from 14 state university libraries from the 1907/08 academic year (linked to in the preceding paragraph) and continued in the following years. He moved to Princeton in 1920, where he continued compiling these annual data. After he retired in 1938, the compilation continued at Princeton and these data were commonly referred to as the “Princeton Data” since the history had been lost. Chapter 2 of The Gerould Statistics discusses the details of this compilation with an assessment of it. This series is discussed here further.
This first longitudinal recompilation was published at Princeton with the data from 1919/20 through 1943/44 academic years of an expanding number of US academic libraries. While the data were published annually and could easily have been ordered by year—as all other such series in this Archive are—the compilers rearranged the data by institution. It was a bold undertaking with the available technology. This series is discussed here further with the data in various formats. Chapter 2 of The Gerould Statistics also discusses these data with an assessment of them.
The next one is the Purdue series. I believe I have a copy of these data but I am not sure for reasons discussed here. Again, Chapter 2 of the Gerould Statistics discusses these data in some detail.
The next longitudinal data compilation is the Cumulated ARL University Library Statistics, 1962-63 through 1978-79 by Kendon Stubbs and David Buxton (Washington, ARL: 1981.) This is the seminal work and is the foundation of the work I have tried to build on. Kendon is certainly the best data analyst the library field has produced. His data work was in addition to his being the Associate University Librarian at the University of Virginia (UVA). For this project, he took the printed annual university ARL Statistics and, with David Buxton, converted these data to a digital format. Originally, it was on a mainframe and available via computer tape. The digital publication of the ARL data has continued and grown since their work. Moreover, he adduced the principles to be followed in such work. His interest in textual integrity informed many of those principles. The Introduction is worth a quiet reading annually. A bit more on the organization of these data is outlined in the discussion of the ARL data structure. This original structure was adapted to a series of academic data compilations based on data collected using the ARL data collection instruments. The data themselves and their documentation are owned by various agencies and the discussion here of this structure is an overview to provide a comparison of the structure of the public library data series I have also compiled.
I found the Stubbs-Buxton publication when I was doing research on what turned out to be my dissertation and got intrigued. I thought I would update this publication with more recent data for what I had become interested in. I am skilled in data input so I keyed the data and checked the published data with the digital Stubbs-Buxton data. I found discrepancies. I wrote him and we talked and resolved those problems. It turns out that data can be corrected after publication and there were errors. What do you do then? You correct the digital copy and it becomes the master copy. Version control is always an issue.
The linked page is a work in progress. I have collected links to data sources I have worked with, so this is a sample I hope to build on. I have also collected these data. It seems that library data tend to disappear without an archiving effort and that fact was the proximate cause of my starting this exercise about 20 years ago or so. The library world would benefit from an agency that performs the ICPSR function for our data.
The BIX (Der Bibliotheksindex) has closed and its Web site was no longer responding as of September 5, 2022.
March 18, 2025