Welcome to the Library Data Archive

Robert E. Molyneux

This is an archive of data and reports about libraries. Before the broader introduction, I'd like to provide short links to the data available now, those planned, or documentation or related publications in digital form. After this section, there is a broader discussion of this project.

Today (February 19, 2025) is moving day. For a number of years, the Library Research Service, an office of the Colorado State Library, has graciously hosted this Archive as I worked on it. The Archive would not exist without their hosting it while I built it. Life goes on, things change, and they asked me to move it and waited patiently, or so it appeared to me. I thank the many people at the LRS who helped.

The last few years I have been busy and have not kept things up. I also lost access to data software that made life much easier, so things go slower now. I will be updating these files and adding data from more libraries, provinces, and countries.

A note on dates. Generally, library data are published one year at a time, and the longitudinal files that are necessary for orderly analysis of trends must be created separately. It also takes time to collect and vet the data before publication, so the dates of publication will often be some years after the data were generated. In effect, we are managing our libraries through the rear-view mirror. Other fields have this problem, too, but have analysts who know how to do projections. One thing one needs is longitudinal files. For the longitudinal files, there will also be a delay in my adding newly published data to the file, particularly with the software situation. I resolve to do better. One other aspect of dates: I dated these files with a date in the lower left-hand corner. That date is the last time I modified something on that file. That change might be a small one, particularly on the very large files.

I believe Kendon Stubbs and David Buxton did the first such longitudinal file of the data from the Association of Research Libraries. I used those data in my dissertation, was invited to do a post-doc working for Kendon, and have done quite a few such longitudinal recompilations since then. I have also collected digital library data because we librarians do not preserve our data. I think it was Frank Schick, who was at one time Assistant Director of the Library Services Branch at the Office of Education, Department of Health, Education, and Welfare, who started the train of thought that led me on this course. He was a lovely person and made the case that the US Government had to collect library data because librarians wouldn't. But, if one looks over the history of US library data, one will see that collecting library data is something the US government does sporadically. It will start up for a while and then, when whatever reasons existed for the effort change, the effort is shut down. Oddly, we are going through a similar change now.

In any case, I realized that the day would come when all these data and the publications about libraries based on analysis of the data collected by government agencies would be lost or at least imperiled. So, when I was at the U.S. National Commission on Libraries and Information Science (NCLIS), I started what has turned out to be this Archive. NCLIS was an agency that was started well after the Library Services Branch that Mr. Schick worked for was closed. NCLIS has been closed. And now the government is changing its mind about collecting library data. We have a problem in relying on the kindness of strangers. If our data are important to us, we should maintain and preserve our data.

Comments and suggestions are always welcome. I have done this without an editor and everyone needs an editor. For now, write me at drdata@librarydataarchive.com.

Links to library datasets, documentation, and reports

Data available now (all from US Government sources)
Source of data | Type of library | Link to publications | Years | Notes
NCES-IMLS | US Public Library Data | PLDF3 | FY 1987-FY2020 | This is a longitudinal recompilation of the annual Administrative Entity US public library data from NCES-IMLS. This is a large file. Open source.
NCES-IMLS | US State Summary Public Library Data | PUSUM | FY 1992-FY2020 | This is a longitudinal recompilation of the annual State Summary/State Characteristics public library data from NCES-IMLS. Open source.
NCES-IMLS | A collection of publications summarizing and reporting on characteristics of libraries from results of the various surveys published by NCES-IMLS about the US Public Library Data. There are also publications about the history of these data series. | Reports | FY 1983-FY2022 | This is a large collection of publications. Note that currently, the earliest publication was about the 1977-78 US public library data collection. There are many such early publications. Collecting them and converting them to a digital format will be a formidable task. Open source.
NCES-IMLS | US Public Library Data Survey Documentation | Documentation | FY 1987-FY2022 | The annual publication of these data comprises three series: the State Summary/State Characteristics file [State library data], the Administrative Entity file [public library data], and the Outlet file [branch libraries]. The documentation is for all three of those files. There is no longitudinal file for the outlet data. The documentation for PLDF3 and PUSUM is at those links above. These files used the NCES-IMLS data but rearranged them. Open source.
NCES-IMLS | US public library data--raw data files | Annual data | FY 1973-FY2022 | These data go back to FY1973, so they also predate the FSCS era. This earlier series was the LIBGIS (Library and General Information Survey). Open source.
NCES-IMLS | State Library Agencies Survey/State Library Administrative Agency Survey | Main page | 1994-2022 | Open source.
NCES | Academic Library Statistics | ALS | 1970-1971 through FY 2012 | Early years are from the Higher Education General Information Survey (HEGIS). Open source.
NCES | School Library/Media Center publications | SLMC | 1974 through 2013 | Early years are from the Library General Information Survey (LIBGIS). Open source.
NCES | Federal Libraries and Information Centers | FLIC | 1994 | Open source.

These data are library data or analysis collected and published by either the National Center for Education Statistics or the Institute of Museum and Library Services. For more on those two agencies, see: NCES-IMLS introduction



These two sets of data are from non-US government sources and are available at the links below.
Source of data | Type of library | Filename and link | Years | Notes
Princeton Compilation | Data from the Princeton Compilation | Princeton | [Academic years] 1919/20-1943/44 | I keyed these data when I was working on The Gerould Statistics. Open source.
Purdue Data | Academic library data series of 58 academic libraries | Purdue | from 1951 | Are these the actual Purdue data? I believe so but it is a tangled web discussed at the link. Open source.

The following are not open source because I did them for the agencies listed. They are “works for hire.” I would need permission of those agencies to distribute the data. These were based on the infrastructure of the Stubbs-Buxton Cumulated ARL University Library Statistics.


Works for hire
Source of data | Who owns the data? | Type of library | Filename and link to ARL Infrastructure | Years | Notes
Gerould Statistics | ARL | Academic library data begun in 1907/08 | Gerould background. [ARL infrastructure] | [Academic years] 1919/20-1943/44 | I keyed these data when I was working on The Gerould Statistics for the Association of Research Libraries. ARL has produced derivative products and owns these data.
Survey/Compilation | ACRL | Academic: Historically Black Colleges and Universities | HBCU [ARL] | [Academic year] 1988-89 | Not a longitudinal series but one of the collections following the ARL structure.
ACRL members | ACRL | ACRL libraries not in ARL | ACRL [ARL] | 1978/79-1987/88 | Also followed the ARL structure and used the ARL form. Essentially, ARL surveyed (roughly) the largest 100 academic libraries and ACRL surveyed the (roughly) second 100 libraries.
Gerould/ARL | ARL | Research Libraries | Research Library Statistics [ARL] | [Academic years] 1907/08-1987/88 | This was a compilation issued in digital formats, with a guide. It was the first time the Gerould and ARL data were joined in one series.

The Goal of the Library Data Archive

We librarians are not kind to our data. While we organize, store, and preserve human records, sadly, we are not good about preserving or archiving data about libraries.

This archive has a number of types of data about libraries. It has collections of raw data in various digital formats and it has publications about libraries, increasingly in PDF format. The largest part of this archive, currently, consists of data collected and published by US government entities, although I will be publishing a great deal more that I have collected over the years from both government and private organizations. I hope to find all US government publications about libraries and digitize them. It is a daunting prospect.

There is a subset of these publications: those which published data collected systematically, over a number of years, of a defined population of libraries. The two big categories are public libraries and academic libraries. There are quite a number of such series, but longitudinal files of these data are typically not done by the issuing agency. Characteristically, these data are issued one year at a time without an explicit infrastructure to bind the data from the various years together. The Association of Research Libraries (ARL) is the notable exception. It is the steward of the longest-running such longitudinal series, which now spans over 100 years.

This archive was started as a result of my experience in recompiling such longitudinal series of library data, work that grew out of my research interests. As mentioned, when systematically collected population library data are compiled and published, they are characteristically published one year at a time and normally without reference to previous or future data, though they often, but not always, continue practices from previous years. The major use of such data seems to be for comparisons between a given library and others for budget presentations and related purposes. But data collected for one purpose can be used for others, such as examining trends in library practices and finances. To create a longitudinal file of library data, the various individual annual publications must be found and converted (these days) to digital formats. But what is a “longitudinal file of library data”?

These are data collected and arranged over time so that one can study trends. Given the facts of publication, it is necessary that the annual data be rearranged with a date field and other accommodations required to accurately present the data as published. The person recompiling these annual data will normally add dates of collection if they are not there and may well add other fields. For instance, the largest file in this Archive is PLDF3, the US Public Library Data File. In the original annual publications, the libraries have a key variable called the “FSCSKey” which, for reasons explained at the link, was not usable in the longitudinal file to do what key variables must do: provide a single variable to identify each entity over time. Libraries change names, addresses, etc., as the years pass, so a key variable is a critical item for a longitudinal file. There is an exceptionally long, intricate, and, regrettably, dull discussion at the link of the solution I developed, which involved creating and adding a new key variable. It was a hard problem. Note, though, that the variables in the original are not changed but new ones are added, so the longitudinal file is a superset of the original with added infrastructure for analysts. The compiler must first do no harm. Happily, most series I have worked on have had less complicated key variable characteristics. They often make up that loss in other ways, though.
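As a rough illustration of the kind of recompilation described above, here is a minimal Python sketch. It is not the method used for PLDF3, which is documented at the link. The sample records, the column names, the crosswalk table, and the `STABLE_ID` values are all hypothetical, invented only to show the idea: each annual file gets a year field and a new stable key, and the original variables are carried over unchanged.

```python
import csv
import io

# Hypothetical annual files, one per year, published without a date field.
# The library names, keys, and counts are invented for illustration.
ANNUAL_FILES = {
    1995: "FSCSKEY,LIBNAME,VISITS\n"
          "AA0001,Smithville Public Library,12000\n"
          "AA0002,Jones County Library,30000\n",
    1996: "FSCSKEY,LIBNAME,VISITS\n"
          "AA0001,Smithville Public Library,12500\n"
          "AA0099,Jones County Regional Library,31000\n",
}

# Hypothetical crosswalk from (year, original key) to a stable identifier.
# In this made-up example one library changed its name and its original key.
CROSSWALK = {
    (1995, "AA0001"): "LIB-000001",
    (1996, "AA0001"): "LIB-000001",
    (1995, "AA0002"): "LIB-000002",
    (1996, "AA0099"): "LIB-000002",
}


def build_longitudinal(annual_files, crosswalk):
    """Stack the annual files, adding YEAR and STABLE_ID to every record."""
    rows = []
    for year in sorted(annual_files):
        reader = csv.DictReader(io.StringIO(annual_files[year]))
        for original in reader:
            row = dict(original)  # original variables are copied, not changed
            row["YEAR"] = year    # added date field
            row["STABLE_ID"] = crosswalk[(year, original["FSCSKEY"])]  # added key
            rows.append(row)
    return rows


def series(rows, stable_id, variable):
    """Return (year, value) pairs for one library across all years."""
    return [(r["YEAR"], r[variable]) for r in rows if r["STABLE_ID"] == stable_id]


longitudinal = build_longitudinal(ANNUAL_FILES, CROSSWALK)
print(series(longitudinal, "LIB-000002", "VISITS"))
```

Because the added fields sit alongside the originals, an analyst can follow one library across a name or key change while still seeing exactly what each annual publication reported.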

I have recompiled a number of such series and too often, finding all years is a challenge because they get lost by our library colleagues. After some experience with this fact, I resolved to keep copies of all such compilations I have done and to gather library data where I can find them. That work is to be documented here. Periodically, I look for library data on the Web. These digital files are but a small part of all library data available. There are enormous collections of data publications in paper that have not been converted to digital formats. There is so much that could be done with them but librarians are rarely numerate and the value in these archives is not clear to our colleagues nor is there a critical mass of people skilled in data in our field. I was in the University of Toronto's iSchool library as a collection of such paper copies was being boxed up and moved to storage. Box after box. All that potentially useful information about library trends off to cold storage.

These collections do take up space...and who uses them? Working librarians have many constraints and one historical one in archives is space and librarians rarely are comfortable with numerical data. This Archive will be expanded to include either data I have found or data I have recompiled and I would welcome the opportunity to compile some of those data locked in paper. Let me add that some of the oldest numbers we have as a species are library numbers. Libraries are a key to what humans do because libraries are a part of the memory function. We are not the fastest species and there are those who will say we are not the smartest but we have institutions that organize our memories and what we have learned. Libraries are one of those institutions.

A review of longitudinal publications of library data

The oldest longitudinal recompilation I am aware of is College and University Library Statistics (Princeton, 1947). It is referred to here as the “Princeton Compilation.” In the early 1980s, when I first ran across them, the data were referred to as the “Princeton” data/statistics and the received wisdom was that 1920 was the beginning date of this series. The publication itself is a bound typescript of a longitudinal rearrangement of the annual data sheets from the 1919/20-1943/44 academic years.

In recompiling these data and trying to understand their history, I talked to Haynes McMullen at the (then) School of Library Science at the University of North Carolina, who researched the early history of US libraries and used data in those discussions. When I walked into his office that day, I thought these data began in 1920, and I asked him if he knew about them. Professor McMullen reached into his files and pulled out a virtually complete run from the first data (that is, 1907/08) through at least 1920. It turned out he got them while working on his dissertation. But imagine my surprise to discover the series began before 1920. I have found the first publication of these data thought provoking. This copy of the 1907/08 academic year shows 14 libraries and six variables, a modest start that has grown into the astonishingly rich ARL Data we have today. This copy of that first issue was obtained by Nicola Duval at ARL from the archives at the University of Minnesota to include in the monograph The Gerould Statistics. What a wonderful surprise that was when she had obtained the copy and included it in The Gerould Statistics. She and I surprised Kendon with The Gerould Statistics and Nicky surprised me.

Who was James Thayer Gerould? Gerould was the director of the University of Minnesota library in the early part of the 20th century and had an idea. He wrote an article that appeared in Library Journal in 1906: “A Plan for the Compilation of Comparative University and College Library Statistics.” A committee was appointed by ALA and, as far as I can tell, the committee never made a report. Undaunted, Gerould went ahead and started collecting and reporting data. The first year had data from 14 state university libraries from the 1907/08 academic year (linked to in the preceding paragraph) and the series continued in the following years. He moved to Princeton in 1920, where he continued compiling these annual data. After he retired in 1938, the compilation continued at Princeton and these data were commonly referred to as the “Princeton Data” since the history had been lost. Chapter 2 of The Gerould Statistics discusses the details of this compilation with an assessment of it. This series is discussed here further.

This first longitudinal recompilation was published at Princeton with the data from 1919/20 through 1943/44 academic years of an expanding number of US academic libraries. While the data were published annually and could easily have been ordered by year—as all other such series in this Archive are—the compilers rearranged the data by institution. It was a bold undertaking with the available technology. This series is discussed here further with the data in various formats. Chapter 2 of The Gerould Statistics also discusses these data with an assessment of them.

The next one is the Purdue series. I believe I have a copy of these data but I am not sure for reasons discussed here. Again, Chapter 2 of the Gerould Statistics discusses these data in some detail.

The next longitudinal data compilation is the Cumulated ARL University Library Statistics, 1962-63 through 1978-79 by Kendon Stubbs and David Buxton (Washington, ARL: 1981). This is the seminal work and is the foundation of the work I have tried to build on. Kendon is certainly the best data analyst the library field has produced. His data work was in addition to his being the Associate University Librarian at the University of Virginia (UVA). For this project, he took the printed annual university ARL Statistics and, with David Buxton, converted these data to a digital format. Originally, it was on a mainframe and available via computer tape. The digital publication of the ARL data has continued and grown since their work. Moreover, he adduced the principles to be followed in such work. His interest in textual integrity informed many of those principles. The Introduction is worth a quiet reading annually. A bit more on the organization of these data is outlined in the discussion of the ARL data structure. This original structure was adapted to a series of academic data compilations based on data collected using the ARL data collection instruments. The data themselves and their documentation are owned by various agencies, and the discussion here of this structure is an overview to provide a comparison with the structure of the public library data series I have also compiled.

I found the Stubbs-Buxton publication when I was doing research on what turned out to be my dissertation and got intrigued. I thought I would update this publication with more recent data for what I had become interested in. I am skilled in data input so I keyed the data and checked the published data with the digital Stubbs-Buxton data. I found discrepancies. I wrote him and we talked and resolved those problems. It turns out that data can be corrected after publication and there were errors. What do you do then? You correct the digital copy and it becomes the master copy. Version control is always an issue.
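The check described above, comparing a freshly keyed copy against an existing digital copy, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the institutions, variable names, and values below are invented, and the function `find_discrepancies` is a hypothetical helper, not a tool used for the ARL work.

```python
def find_discrepancies(keyed, digital):
    """Compare two transcriptions of the same published data.

    Both arguments map a record id, here (institution, year), to a dict of
    variable -> value. Returns a list of (record_id, variable, keyed_value,
    digital_value) tuples; variable is None when a whole record is missing
    from one of the two copies.
    """
    problems = []
    for record_id in sorted(set(keyed) | set(digital)):
        a = keyed.get(record_id)
        b = digital.get(record_id)
        if a is None or b is None:
            problems.append((record_id, None, a, b))
            continue
        for variable in sorted(set(a) | set(b)):
            if a.get(variable) != b.get(variable):
                problems.append((record_id, variable, a.get(variable), b.get(variable)))
    return problems


# Invented sample data: two copies that should be identical but are not.
keyed = {
    ("Univ A", 1978): {"volumes": 2100000, "staff": 310},
    ("Univ B", 1978): {"volumes": 1500000, "staff": 205},
}
digital = {
    ("Univ A", 1978): {"volumes": 2100000, "staff": 301},  # digits transposed
    ("Univ B", 1978): {"volumes": 1500000, "staff": 205},
    ("Univ C", 1978): {"volumes": 900000, "staff": 120},   # not yet keyed
}

for problem in find_discrepancies(keyed, digital):
    print(problem)
```

A report like this only says that the two copies differ. Deciding which copy is right still means going back to the printed source and, where the publisher issued corrections, to the publisher.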

The Library Research Service has graciously offered to house this archive of digital data and reports on U.S. libraries by two agencies of the U.S. government: the U.S. National Center for Education Statistics (NCES) and the U.S. Institute of Museum and Library Services (IMLS). The NCES-sponsored program behind the collection and publication of the public library data was known as the Federal-State Cooperative System (FSCS). See this useful timeline of this program for more information. Having watched this effort from close up, I can say it was an impressive organization that functioned well. IMLS continues a similar program, the Public Libraries Survey, which continues the public library data series without interruption.


Sources of Library Data Outside the United States and a Look at Assessment of Libraries

The linked page is a work in progress. I have collected links to data sources I have worked with, so this is a sample I hope to build on. I have also collected these data. It seems that library data tend to disappear without an archiving effort, and that fact was the proximate cause of my starting this exercise about 20 years ago or so. The library world would benefit from an agency that performs the ICPSR function for our data.

While looking at BIX (Der Bibliotheksindex)—the BIX has closed and its Web site is no longer responding as of September 5, 2022,


March 10, 2025