Standards… the final frontier. These… are the voyages of the starship NADDI… its continuing mission…
… no, but seriously, thank you, Barry [Radler], and hi everybody, it’s great to meet you all. I’m Dorothea Salo, and I teach XML markup and research-data management, among other things, at the iSchool here at UW-Madison. So of course I’ve known about the Data Documentation Initiative (DDI) for a long time, and have been watching it progress and gain adoption with great interest and delight.
Designing a markup language is hard! I know this because I’ve done it. Getting anybody to use a markup language is even harder! So I hope you are all proud of what DDI’s designers and user community have accomplished. I am certainly proud to stand here before all of you.
The tagline for NADDI 2015 is “Enhancing discoverability with open metadata standards.” I have to say, this is probably not everybody’s cup of chai. But if you’re here in this room, you’ve drunk the chai already: you truly believe that the standard life is a good life.
So do I! I believe! I believe in standards! If I didn’t believe in standards I wouldn’t have gone to an information school. If I didn’t believe in standards I wouldn’t have been interested in librarianship. If I didn’t believe in standards I certainly wouldn’t teach in an information school!
I love standards. I know about lots of them, and like anyone, I have my favorites and my unfavorites. What with all the hullaballoo about linked data, you might think that RDF is one of my favorite standards, but actually it’s my chief unfavorite. I work with RDF, I teach and train on it, I even give talks about it now and then, but I’m not a gigantic fan of it. I work with it and teach it because I have to; people need to know about it. I am, however, a cautious fan of schema.org microdata. I’ll be mentioning it again later. And just so you know, schema.org microdata has nothing whatever to do with what the DDI community typically calls microdata—so, yes, just the vocabulary in the standards landscape is a mess.
And now some of you are giving me side-eye with a “what in the world is wrong with this woman?! Standards are all fine and good, but moderation in all things!” Hey, I said librarians loved standards. I wasn’t kidding! Don’t try this level of dedication to standards at home; go to the library instead, okay? The point is, there’s lots of standards out there. So many standards! You almost have to be a librarian to love the standards universe, right?
No, but seriously, as I thought about what I wanted to say to you all today, I decided it was important to point out the very crowded and confusing standards and markup-languages space, not to mention the even more crowded and confusing best-practices space opening up around research-data management. I decided to hack the conference tagline, a little bit. (I do this. I hack things. You want to see my latest Mad Information Science hacking efforts, come on up to the iSchool library on the fourth floor of Helen C. White Hall, and I’ll show you my media-archaeology machine.) Instead of enhancing discovery with open metadata standards—which is a fine and worthy and very librarianly goal, don’t get me wrong—I decided to talk about enhancing discoverability of open metadata standards.
Why did I do that? Because in my head, that is the final frontier, the discovery and exploration frontier for standards just like DDI. In a crowded, confusing, competitive standards landscape, how does anybody get a standard noticed? How do you get it adopted? How do people whose problems your standard can solve discover your standard? How do they decide to adopt it, and how do you explain to them why they should adopt it in the first place? How does your standard, your one tiny galaxy in the giant universe, fit into the rest of their universe? That turns out to be a crucially important question these days, as it happens, because there is no One Ring—I mean, One Standard—there is no One Standard to Rule Them All.
I’ll say this again, because it’s important. There is no one universal data or metadata standard. There never will be. There never should be. With all the million different things we create data from, and do with data, and need from data, there’s just no way to create a single comprehensive standard that makes sense for every imaginable kind of data and data use case. Sorry, but not even DDI.
That means that inevitably—seriously, there’s no getting around this—DDI has no choice but to do two things. DDI has to compete for mindspace and adoption against other standards, not to mention non-standard technologies like Microsoft Excel, which is of course one of the horrors that DDI was designed to prevent. This competition takes place in what I already showed you is a huge, complicated, and confusing space. Secondly, DDI also has to fit itself into a universe where people will be using other standards and non-standard technologies alongside it, and they’d ideally like that to be easy for them.
DDI isn’t alone. This community is not alone in facing these challenges! In libraries, we are struggling with exactly the same thing right now. We honestly thought back in the 1960s and 1970s that we’d created a One Ring, a One Standard that would rule them all—or at least describe them all, everything, everything a library might collect. We called it MARC, Machine Readable Cataloging, and a brilliant programmer and systems analyst named Henriette Avram designed it in the 1960s, about the same time as XML’s precursor SGML and relational databases were coming into existence.
MARC was designed so that computers could hold, share, and print out the kind of metadata you find on a card in a library card catalog: author, title, subject, call number, copyright date, physical item description, and so forth. It stored metadata for books and also for other things libraries collect: maps, music scores, journal titles, and so forth.
By the way, you can impress your friends at parties with how long libraries have been standardizing stuff: the card catalog was invented by Melvil Dewey in the mid-1800s, and card size and catalog size were standardized by the American Library Association in 1876. As a standardista, I love libraries, I really do—you have to love it when practically the first act of a brand-new professional organization is to set a standard!
This goes to show how durable a useful standard can be. The physical card catalog survived as a standardized technology well into the 1990s, after all, over a century of use. And for half a century now, MARC has been librarianship’s freight train, our rail gauge, our standards heavy hitter. I can’t begin to explain to you the importance of this standard in librarianship globally. So what’s the problem with MARC? As I just said, MARC was designed for printing catalog cards. That means that we librarians designed our computerized record structure around a human-readable data format.
DDI didn’t actually do this. DDI isn’t intended for humans to look at directly. And that’s good! That was the right design decision to make! I want you to understand why it’s good, though, because it’s something that you may well have to explain to potential DDI adopters who expect something more human-friendly than raw DDI is.
We couldn’t really have known this at the time in libraries, but it turned out that basing our data structure for computers on something meant to be human-readable was putting the cart before the horse. This perfectly understandable and reasonable decision, the decision to build a standard around human-readability, actually hurt libraries and librarians in the long run.
One reason is that designing around the card catalog, which was then totally the ultimate in human-readable data display, meant serious problems when the ultimate in data display changed on us by going digital! There are eighteen long stories here that I’m passing over in silence, but the practical upshot is that catalog cards didn’t translate well to web pages, never mind web search engines like Google. Worse, the human-readable catalog-card format has turned out to be ridiculously hard to program against for computerized indexing and search.
DDI didn’t do this. DDI’s design is based on the structures inherent in the data, without making assumptions about how humans would want to see or manipulate it. And that was exceptionally wise, because humans don’t always want to see or use data the same way.
Another problem I’ll mention with MARC has to do with what I said earlier about there not being a single standard that handles every single use with equal ease and effectiveness. MARC tried to be that standard, for libraries. It failed, and we’ve been dealing with that failure for decades. Talk to any music cataloger! MARC was designed for books, not sheet music, and there are some key differences that it just doesn’t respect. Or talk to anybody who deals with CDs or DVDs or other multimedia. Just forget it, MARC’s terrible for that stuff. Even at the time, MARC was a poor fit for some of the existing library environment—librarians were just so laser-focused on books that they didn’t take the rest of library collections seriously enough.
I encourage the DDI community to look seriously at its edge cases, maybe even publicize them. Where is DDI being used in unexpected contexts? Is it a good fit? If it isn’t, could it be, or is the problem truly out of scope? Where is DDI being “misused,” and are any of the so-called misuses interesting enough to become real use cases? Sometimes the way standards achieve broad adoption is by paying closer attention to problems that weren’t in the original scope. Just a suggestion.
Returning to MARC, the really, really big reason that modeling MARC on human-readable catalog cards was a huge mistake for libraries has to do with data consistency, or more properly, lack thereof. Humans can usually—not always, but usually—read past inconsistency, or ignore it when it’s not important. You or I might chuckle or frown if we wound up driving behind a car with mismatched taillights in traffic, but we probably wouldn’t crash our car into it, right? Because inconsistent taillight coverings don’t matter to us really. We look right past them.
True story, one day when I was a new librarian, I accidentally wore one black shoe and one navy-blue shoe in the same style to work, and I was completely mortified once I noticed, but absolutely nobody else even saw it. This is an amazing, brilliant human skill, this ability to cope with inconsistency. We’re also, as a species, absolutely top-notch at dealing with this in text—most of us handle abbreviations, misspellings, smartphone autocorrect errors, no problem!
But this amazing human skill of tolerating inconsistency makes a mess of data structures, especially when computers enter the picture. While MARC was being designed, nobody cared whether its standards and practices were completely consistent. Formal consistency wasn’t even considered worth shooting for, because who would even notice really if it wasn’t there? Just like nobody noticed my shoes that day. So there are lots of places in library cataloging standards where the instructions just shrug and say “meh, put whatever you want, as long as people can understand it.” I kid you not, the standards say “grab a fortune cookie and write down what it says, meh, whatever, it doesn’t matter.” And I see some of you cringing, because you know the kinds of data analysis and data reuse problems that leads to—it’s part of why DDI exists, right?—and so do I, it’s just that in the 1960s nobody knew that yet… except maybe E.F. Codd, but relational databases were still being invented at the time, so never mind.
The real-world consequence of that decision has been that libraries are completely dependent on expensive, lousy, backward computer systems to run our operations. We’re stuck! MARC locked us out of using off-the-shelf or open-source software for the most part, partly because none of it was designed to read or write MARC—seriously, who even knows about MARC except librarians?—and partly because writing code to handle the records that were inconsistent because the rules didn’t tell anybody to be consistent in the first place is a computer programmer’s purgatory! It’s not easy, it’s not fun—that’s an understatement—so the open-source community waves it on past, and libraries aren’t a big enough market to attract much for-profit programming effort. DDI isn’t a really big market either, though I do know about Colectica and I’m glad it exists, but seriously, the more fun you can make working with DDI data, the more software the community will have. Consistent data is easy and fun to work with. Inconsistent data is not.
It gets worse. After MARC was standardized, library catalogers wanted to make catalog cards better for the people who used libraries, and when online catalogs came around, they wanted to make those work better too, but the only tool they had to make changes with was how they built their MARC records. So this completely praiseworthy “users first!” ethos among library catalogers meant that they dinked around with the structure and content of MARC records in inconsistent and computer-unfriendly ways.
This led, as you’d expect, to all kinds of inconsistency across records even just in a single catalog in a single library! As for records across all the MARC-using libraries in the world, just forget it—there are heinous amounts of inconsistency there, all in the name of making life easier for people. I sure hope this isn’t happening with DDI. What we didn’t know in 1960 but know really well now is that computers just cannot read right past inconsistency the way humans do. Generally they break. When they don’t break, it takes absolutely heroic programming effort to get them past the inconsistency. This, of course, is a major reason humans invent and use standards like MARC and DDI to begin with! Standards help design and enforce a degree of consistency that an unaided human being is generally not capable of and certainly won’t produce spontaneously.
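To make the point about computers and inconsistency concrete, here is a minimal Python sketch. The date strings are invented for illustration: they imitate the kind of variation a human reader shrugs off, and they are not taken from real MARC records. The helper names parse_year_strict, parse_year_lenient, and count_strict_failures are hypothetical.

```python
import re
from typing import List, Optional

# Invented examples of one fact ("published in 1969") written the way
# different people might record it. These are NOT real catalog records.
RECORDED_DATES = ["1969", "c1969", "[1969?]", "1969.", "nineteen sixty-nine"]


def parse_year_strict(value: str) -> int:
    """Parse a year the way simple code would: exactly four digits or bust."""
    if not re.fullmatch(r"\d{4}", value):
        raise ValueError(f"not a four-digit year: {value!r}")
    return int(value)


def parse_year_lenient(value: str) -> Optional[int]:
    """Extra effort: dig the first four-digit run out of messy text."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None


def count_strict_failures(values: List[str]) -> int:
    """Count how many recorded values the strict parser rejects."""
    failures = 0
    for value in values:
        try:
            parse_year_strict(value)
        except ValueError:
            failures += 1
    return failures


if __name__ == "__main__":
    print(count_strict_failures(RECORDED_DATES), "values break the strict parser")
    print([parse_year_lenient(v) for v in RECORDED_DATES])
```

Even the lenient version is only a guess at what the cataloger meant, and it still gives up on the spelled-out year. Every new variation means another special case, which is the programming effort a consistent standard saves.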
Now, as we’ve seen with MARC, a standard is not an ironclad guarantee of consistency; HTML is another great example of this. People abuse standards, they don’t learn them well, sometimes they even insist on loosening standards up because they don’t want the validator yelling at them any more. By and large, though, the last best hope for consistency—anybody see what I did there? Babylon 5, getting all the geek jokes in today—the last best hope for consistency is still some kind of standard. But strictness, enforcement of consistency, comes at a cost. And in talking with researchers, and graduate students who are learning to become researchers, I’ve found it’s a cost that especially hurts at the standards-discovery stage. For a standard, the discovery stage is when people who don’t already use a standard on their data, but have that nagging uneasy sense that maybe they should, search the huge, complicated, confusing standards universe to try to discover the standard that they should be using.
The first question that someone in the middle of the standards-discovery process asks when they spot a likely standard is “Can I do this? Can I work with this?” Of course they ask other questions, but the first question, every single time, is a total gut-check: “can I do this?” Strict enforcement of consistency makes standards harder to use, harder to experiment with, easier to mess up. Strict enforcement of consistency makes it a lot more likely that a standards-discoverer’s answer to the gut-check “can I do this?” question will be “nope, this is way out of my league, moving on now!” What I’ve found in my standards-building and standards-using life is that if the answer to that gut-check is “no,” honestly the only way a standard ever grabs that potential user back is by making them use it, which means a journal requirement or a funder mandate or a repository mandate or whatever.
The DDI community knows about this; the Inter-university Consortium for Political and Social Research (ICPSR) is DDI’s current enforcer. ICPSR has done a great job in that role, but it’d be nice if DDI had carrots as well as sticks, right? Not every social scientist engages with ICPSR, either.
The other way to encourage standards use is by making the standard use invisible by baking it into a tool, sort of like Colectica has tried to do. The problem with that is that people are persnickety about their tools. Not everybody will use the same tool if they’re not forced to. So we’re left with people looking at a standard that’s new to them and saying “I can’t use my favorite tool with this standard?! Well, forget this standard then!”
From your point of view, you want people with social-science data from surveys and interviews and the like to choose DDI, right? And you want people who need to understand or reuse that data to see that it’s in DDI and cheer, because they know they can figure out how to do what they need to do with it, right? So that’s two audiences of standards discoverers that DDI has to court, people who make social science data and people who use social science data.
So this tension between a standard that makes consistent computer-friendly data, and a standard that human beings can figure out how to use, is really important for the DDI community, an important cost to mitigate if you can. You want standards discoverers to encounter DDI and say “yes, I can do this!”
In libraries, we’re trying to figure this one out too. We pretty much know it’s time for MARC to go out to pasture. And we know this partly because MARC, in addition to making it harder and more expensive to run library systems, has been a serious barrier to getting everyone else in the world, from library vendors to programmer hobbyists, to work comfortably with what libraries know about what libraries have! I mean, I went over to Wikipedia’s article on MARC for a quick check on something and had to stop to laugh at the top cleanup note! If you can’t read it from where you are, it says “This article may be too technical for most readers to understand, blah blah fix it.”
Now look. When Wikipedia says “most readers” it really means “most Wikipedians,” and Wikipedians tend heavily toward the computer-nerdy. If computer nerds can’t figure MARC out, MARC has a pretty serious comprehensibility problem. So for this reason, and for the horrific inconsistency across the universe of MARC records that makes dealing with them via computer so difficult and frustrating, MARC’s got to go.
It’s looking pretty likely that the successor standards to MARC will be based on a technology called “linked data.” You may or may not have heard of linked data—I know DDI is currently working on three linked-data vocabularies, but it looks to me like it’s still early days for those—but look, honestly, it doesn’t matter if you haven’t. The point is, librarians are hunting a way forward through standards discovery. A lot of us are looking at linked data for the very first time, and let’s just say it’s not going as well as it might.
A lot of librarians have looked at linked data, done the “can I do this” gut-check, and had the answer be “oh my gosh, get me out of here, what even is this? I can’t with this!” So far, linked data has totally failed the gut-check test among librarians. It ain’t pretty, let me tell you: bone folders at ten paces, people. So I’m going to ask you all this, and you don’t have to answer me except in your heart. How often has DDI failed the gut-check test among social scientists? How many of your colleagues have taken one look at DDI and said “oh heck naw, are you kidding me?” If the number is as high as I suspect it is, what can the DDI community do about that? Library linked data, speaking sociologically, is a total mess, I can’t even begin to tell you. I don’t want the same for DDI. You don’t want the same for DDI.
Because no lie, I am a DDI fan, because I’m a digital preservationist—that’s another thing I teach—and I know what’ll happen to a lot of social-science datasets that should be in DDI but aren’t. They’ll glitch, like this image on the screen, and then they’ll die. That information will be unrecoverably lost. Ain’t nobody want that. I have another dog in this hunt too, and that’s this: the social science community, by and large, is light-years ahead of the rest of research when it comes to taking proper care of data. I really want other disciplines to learn from you people, because that’ll make my life as a digital research-data preservationist easier! But that brings up a consistency thing again. If even social scientists can’t converge on a standard as useful as DDI, how useful are social scientists as a model? So I need DDI to pass the gut-check test.
So let me close by making some suggestions, as an outsider to the DDI community who is nonetheless invested in DDI’s success, about how DDI might pass more gut checks, become more discoverable, more adoptable, and more adaptable.
After I stopped laughing at MARC’s Wikipedia page, just for the heck of it I looked up DDI’s. It does have one, and that’s great, that’s totally step one. But it’s got a blah-blah-fix-it note up too, this time about uncited information. Like it or not, Wikipedia is a place a lot of people go for that “can I do this?” gut check. Blah-blah-fix-it notes do not inspire confidence in these people. I really recommend a community Wikipedia hackathon day or whatever to fix this. One thing you may well find is that some or all of the uncited information in the Wikipedia page here doesn’t actually have an available, citable online source. That’s a problem! That’s a documentation problem for DDI! If there’s information basic enough to be in the Wikipedia entry, you absolutely want to ensure it’s in DDI’s website and documentation also.
So here’s DDI’s home page, the other likely place for that gut-check question.
And I love y’all, I really do, but tough love here: this page is not good. This page seems absolutely designed to make standards discoverers run screaming in the opposite direction. Just as a minor example, the first information after the navigation bar is the last-updated date. Nobody’s coming to this page looking for that; put it in the page footer where it belongs.
And then there’s the research-data lifecycle diagram, and look, I know lifecycle models were trendy in 2009 or so, but my experience is that they’re terrible communication tools. Nobody understands these things; they’re too vague and abstract for people who do research to see themselves and their workflows in. This one specifically, it’s not clear why DDI is at the center of the picture, or why it’s in this weird gear thing, and the picture doesn’t make clear what DDI actually does or how it helps with all the things in the blue boxes. Ditch this thing. Seriously, just dump it. It’s not helping DDI’s adoptability among social scientists.
What I might do instead, and this is only a suggestion, is to explain clearly what kinds of research and research data DDI works with. This is a quick list off the top of my head; you could probably do better:
The point is, a researcher who doesn’t use DDI will come to this page, see that list, and if they make the kind of data that DDI is good for, they’ll immediately recognize that, which they can’t from this lifecycle diagram.
Then there’s DDI’s tagline, “a metadata specification for the social and behavioral sciences.” Two things about this. One, give me an estimate here, how many social and behavioral scientists have a sense of what “metadata” even means? I mean, it’s probably higher than some other disciplines, but in my experience, lots and lots of researchers bounce right off the word “metadata,” and its negative connotations due to our friends at the NSA probably don’t help much.
Two, take it from a librarian, DDI is not just a metadata specification! It contains metadata, sure, codebooks are metadata and instrument descriptions are metadata, but DDI is also a content and data specification! You don’t just describe your interview instruments or your survey methodology with DDI, you can also put the actual interview transcripts or survey results in DDI. This seems like a persnickety objection, and I won’t lie, it is! Putting my librarian hat on, though, librarians hold pretty strictly to the distinction between content and metadata. As some of you learned yesterday from my fellow librarians Brianna Marshall and Trisha Adamus and Kristin Briney, librarians are helping guide standards discoverers to standards these days, and if your home page misleads librarians about what the DDI standard actually does, I do think it’s a problem.
This is nitpicky, but exactly how many DDI specifications are there? To somebody trying for that gut-check, hearing that DDI is one specification from the tagline, and then seeing a couple inches down that it’s more than one specification is worrisome. It’s like a mini-bait-and-switch, like you’re trying to make DDI seem easier than it is.
Last thing: This page doesn’t even try to answer the gut-checker’s first question: can I do this? Heck if I can tell from this page. And no, nobody wants to start from the documentation, especially if it’s called that. Documentation is what you give your grad students so they can get on with it and you can ignore it, right? Y’all need a getting-started-with-DDI page here in the worst way.
Stepping back from specifics, there’s a thought-pattern that I want to encourage the DDI community to use: It’s not about what you can do with DDI, it’s about what I can do with DDI. No joke, I really mean this, DDI will not succeed or fail based on what you here in this room can do with DDI. You wouldn’t be here if you weren’t already knowledgeable, okay?! So DDI adoption is not about you. It’s about me, it’s about what I can do with DDI as a community outsider. Look, DDI wants me, because I train people who you hope will be community insiders someday! And I’m not the only outsider you care about, either. It’s students. It’s librarians helping people preserve data and find datasets. It’s journalists looking for stories, stories that might be lurking in your data. It’s web developers looking for interesting data to mash up. And bringing it back to the conference theme, it’s definitely about search engines. So much. So much about what web search engines can do with DDI!
As anyone who works toward web usability knows, the way you figure out what people think when they do that gut-check with your standard is to ask them. Hey, check out the DDI web page, do you get what DDI is now? No? Okay, what don’t you understand? And you revise your page from there. But that can be hard and time-consuming to do for every population of outsiders you’re interested in, and I actually think there’s a short-cut: ask educators, ask people who teach about DDI. Ask people who teach DDI—Jane Fry, are you here? Ask Jane Fry! We educators see people’s first encounters with new standards all the time. We can totally tell you what trips people up! It’s what I’ve just been doing, right?
Maybe you don’t believe me about outsiders, so I’m going to show you something. This is a question-and-answer website called Open Data StackExchange, which is where people who are interested in open data ask and answer each other’s questions. A ton of questions on this site revolve around social-science data, mostly where to find it. It’s kind of hilarious, the kinds of data people just assume somebody has; I really want to ask half the questioners on this site why on earth they think the dataset they want exists, but look, that’s not the point. The point is there are a lot of potential DDI users here, from both sides of the pipeline—data creators and data users. Do they know about DDI? Not from Stack Exchange presently; nothing comes up if you search on “DDI.” DDI is not part of this universe. As I keep repeating, even if somebody points them to it, when they ask themselves that gut-check question, “Can I do this? Can I do something with DDI? Is there something in DDI for me?” DDI really needs the answer to be “yes.” Right now it’s not. So what can DDI do to fit better into new users’ environments?
I think part of the answer for DDI, just as it is for libraries, is “fitting into the World Wide Web better.” And that’s why I bring up microdata, as I promised I would earlier. Again, sorry, terminology problem, your definition of microdata is the first one Wordnik gives, “data concerning individuals in a trial, survey, et cetera,” but that isn’t actually what I mean today. I mean the second definition, “data stored in a microformat.” This is a completely useless definition, of course, so let’s look up microformat: “a simple data format that can be embedded in a webpage.” Aha! Now we’re on to something, something that might help DDI fit better into the larger web.
Where you go to find out about web page microdata is a website called schema.org. They give an even better definition of microdata: “schemas webmasters can use to mark up HTML pages in ways recognized by major search providers.” And they go on to say that all the major search engines use microdata to improve how their search results look. Now, who wouldn’t kill to have Google actually understand what a DDI dataset is—just understanding that it’s a dataset would be a lot all by itself! If Google actually helped people find DDI datasets, and gave them an idea of what they’re looking at? Wouldn’t that be great? That is what microdata can do for DDI.
But does microdata understand what a dataset is, you ask? Why yes, yes it does! In a limited way, I grant you—you won’t be able to pack all your metadata into the web page for your project—but enough so that Google search results, before anybody even clicks on one, can say “this is a dataset about midlife, it’s called MIDUS, it’s by these researchers and it’s published by the UW Institute on Aging” and so on and so forth. Even better, microdata understands that datasets often come in catalogs, so if you have a project portal, you can totally tell Google that it’s a project portal with a whole bunch of datasets in it!
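Here is a rough sketch, in Python, of what that markup could look like for a project page. It uses the schema.org Dataset and DataCatalog types with standard HTML microdata attributes (itemscope, itemtype, itemprop). Everything else is an assumption for illustration: the function names dataset_item and catalog_item are hypothetical, "Example Researcher" and "Example project portal" are placeholders, and the HTML layout is just one plausible choice.

```python
from html import escape


def dataset_item(name, description, creators, publisher, itemprop=""):
    """Render one dataset as an HTML fragment with schema.org microdata."""
    # When nested inside a catalog, the dataset carries an itemprop of its own.
    prop = f' itemprop="{itemprop}"' if itemprop else ""
    lines = [
        f'<div{prop} itemscope itemtype="https://schema.org/Dataset">',
        f'  <h2 itemprop="name">{escape(name)}</h2>',
        f'  <p itemprop="description">{escape(description)}</p>',
    ]
    for creator in creators:
        lines.append(
            '  <p itemprop="creator" itemscope itemtype="https://schema.org/Person">'
            f'<span itemprop="name">{escape(creator)}</span></p>'
        )
    lines.append(
        '  <p itemprop="publisher" itemscope itemtype="https://schema.org/Organization">'
        f'<span itemprop="name">{escape(publisher)}</span></p>'
    )
    lines.append("</div>")
    return "\n".join(lines)


def catalog_item(name, datasets):
    """Render a portal page: a DataCatalog that contains several datasets."""
    lines = [
        '<div itemscope itemtype="https://schema.org/DataCatalog">',
        f'  <h1 itemprop="name">{escape(name)}</h1>',
    ]
    for ds in datasets:
        lines.append(dataset_item(itemprop="dataset", **ds))
    lines.append("</div>")
    return "\n".join(lines)


# Values echo the example in the talk; the creator name is a placeholder.
MIDUS = {
    "name": "MIDUS",
    "description": "A dataset about midlife.",
    "creators": ["Example Researcher"],
    "publisher": "UW Institute on Aging",
}

if __name__ == "__main__":
    print(catalog_item("Example project portal", [MIDUS]))
```

The fragment holds only a handful of fields, which matches the "limited way" caveat: the full metadata stays in DDI, and the web page carries just enough for a search engine to recognize a dataset.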
Coming full circle here, microdata is how I think DDI and DDI datasets should be leveraging their metadata to enhance their own discoverability. And even better, I think this will even help with the gut-check question from potential DDI users. If DDI makes it super-easy to create microdata for a project web page or portal, maybe through an XSLT stylesheet or an HTML-plus-microdata template or building it into existing DDI tools or whatever, I really think “it’ll be way easier for people to Google my dataset” is a pretty compelling statement of DDI’s worth.
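As a sketch of the "make it super-easy" idea, the transformation could be as small as the Python below, standing in for the XSLT stylesheet mentioned above. The sample XML is a simplified, un-namespaced fragment loosely modeled on DDI Codebook element names; real DDI files are namespaced and far richer, so treat the element paths, the sample values, and the function name ddi_to_microdata as assumptions rather than a working DDI converter.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified stand-in for a DDI Codebook file. Assumption: no namespaces,
# and only a title, one author, and an abstract. All values are placeholders.
SAMPLE_DDI = """\
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Survey of Midlife</titl></titlStmt>
      <rspStmt><AuthEnty>Example Researcher</AuthEnty></rspStmt>
    </citation>
    <stdyInfo><abstract>A placeholder abstract.</abstract></stdyInfo>
  </stdyDscr>
</codeBook>
"""


def ddi_to_microdata(xml_text):
    """Pull a few study-level fields out of the XML and emit Dataset microdata."""
    root = ET.fromstring(xml_text)
    title = (root.findtext("stdyDscr/citation/titlStmt/titl") or "").strip()
    abstract = (root.findtext("stdyDscr/stdyInfo/abstract") or "").strip()
    authors = [
        a.text.strip() for a in root.iter("AuthEnty") if a.text and a.text.strip()
    ]

    lines = [
        '<div itemscope itemtype="https://schema.org/Dataset">',
        f'  <h2 itemprop="name">{escape(title)}</h2>',
    ]
    if abstract:
        lines.append(f'  <p itemprop="description">{escape(abstract)}</p>')
    for author in authors:
        # Plain-text creator for brevity; a nested Person item would be richer.
        lines.append(f'  <p itemprop="creator">{escape(author)}</p>')
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(ddi_to_microdata(SAMPLE_DDI))
```

The design point is that the researcher never writes microdata by hand: the metadata already lives in the DDI file, and a template or tool produces the web-facing markup from it.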
So as DDI experiments with linked data and other possible serializations and representations for social-science data—and I know you’re doing that this very afternoon!—I encourage you to put microdata on the list. See what you can do with it. Let’s show standards discoverers what DDI is good for!
Thanks for sticking with me through all that; I hope some of it’s helpful. If you’d like to get in touch with me or you’re curious about what I do, my contact information’s on the slide there. Have a great day here in Madison, and long live DDI!