Thursday, April 3, 2008

A Digital Needle in the Haystack: Finding the Good Stuff Online

"Wikipedia does not publish original research (OR) or original thought. This includes unpublished facts, arguments, speculation, and ideas; and any unpublished analysis or synthesis of published material that serves to advance a position…. Citing sources and avoiding original research are inextricably linked: to demonstrate that you are not presenting original research, you must cite reliable sources that provide information directly related to the topic of the article, and that directly support the information as it is presented."
-- (From Wikipedia's NOR article)

Wikipedia, which a study published in Nature found to be as accurate as (or more accurate than) the Encyclopaedia Britannica, and which now dwarfs it in volume of entries, aims to be a repository for established (if not common) knowledge. As the above quote indicates, Wikipedia's chief weapon in this pursuit is its reliance on citations to reliable sources. But in these heady digital days, in which original ideas are promulgated at light speed alongside substandard copies and half-baked iterations (guilty as charged?), how do users identify the sources that are reliable? Taken in a museum context, this question could invite a book's worth of consideration, but for the sake of this blog entry I'll aim for a thumbnail sketch and touch on specific concerns of plagiarism, of museum authority, and of orphan works.

An Information Theory Approach to Plagiarism

To judge from the complaints I've heard, teachers, professors, and editors are being driven to distraction now more than ever by a generation that does not seem to understand the importance of properly attributing its source material. When not dealing with ethically challenged sloths who prefer to submit third-party-drafted essays as their own homework, the watchdogs of the new recognize a more insidious copy-and-paste dilemma fomented by the internet, one in which paraphrasing and proper sourcing are increasingly (and nonetheless erroneously) viewed as passé. However, plagiarism represents a bigger threat to scholarship than simple laziness or academic fraud would make it appear.

My reader(s) presumably will accept my argument that proper attribution is, like provenance, crucial to the credibility of content (to say nothing of the underlying author's ego and pocketbook, which surely are entitled to at least minor limning as a means of encouraging/making possible future contributions to the marketplace of ideas). However, the further down the road we get with digital publishing, the fewer the obstacles that remain to impede the proliferation of unintended plagiarism. Left unchecked, popularity and ease of indexing become the arbiters of influence and ready identification, rather than originality and authenticity. What makes this so insidious is that as the amount of content on the internet increases exponentially, so does the noise-to-signal ratio. The prevalence of citations to identical articles, or to parts or paraphrases of such articles erroneously attributed to different authors, is only going to increase in the infinite plane of hypertext. Therefore, museums and academic publishers must not only remain vigilant about properly attributing original authorship, but develop, identify, and take advantage of new, user-friendly means of assigning accurate credit.
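One concrete, if crude, starting point for such "user-friendly means of assigning accurate credit" is automated near-duplicate detection. The sketch below is purely illustrative (not any museum's or publisher's actual system); it flags likely unattributed copies by measuring how many word-level n-gram "shingles" two texts share, a standard near-duplicate test:

```python
# Illustrative sketch: flag likely unattributed copies by word-shingle
# (n-gram) overlap, measured as Jaccard similarity of the shingle sets.
def shingles(text, n=3):
    """Return the set of n-word shingles occurring in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity (0.0 to 1.0) of two texts' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original  = "proper attribution is crucial to the credibility of content"
copy      = "proper attribution is crucial to the credibility of content online"
unrelated = "the quick brown fox jumps over the lazy dog"

assert jaccard(original, copy) > 0.5        # near-duplicate: high overlap
assert jaccard(original, unrelated) == 0.0  # unrelated: no shared shingles
```

A production system would of course need normalization, stemming, and scale (hashing shingles rather than storing them), but the principle is the same: unattributed copies betray themselves statistically.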

Thanks to Claude Shannon, the father of information theory, we have a means of comparatively quantifying the new, and can therefore deal with the paradox of intellectual relativism posed by the Universal Library -- that fictional infinite repository which contains every volume from A to ZZZZZ… including not only the complete works of Shakespeare (and translations into as-yet-uninvented languages, and every binary-encoded video incarnation of performances of these), but, somewhat less helpfully, the complete works of Shakespeare less the second-to-last lowercase letter 'r.' Pull a "video" from the Universal Library's shelf and you are astronomically more likely to see snow than anything that passes for a performance of Macbeth. (For all you lay science readers out there, I highly recommend William Poundstone's books "The Recursive Universe: Cosmic Complexity and the Limits of Scientific Knowledge," "Labyrinths of Reason: Paradox, Puzzles, and the Frailty of Knowledge," and especially "Fortune's Formula: The Untold Story of the Scientific Betting System That Beat the Casinos and Wall Street," which provides the most insight on Shannon specifically.) Further, most source material lacks the authorial power of celebrity that Shakespeare's works enjoy. Apart even from issues of accurate attribution, for the ideas subsumed in these original works to retain their resonance and value, there must be a way to distinguish inaccurate copies beyond authorial brand recognition.
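Shannon's measure can even be put to work directly: redundancy is what separates meaningful text from the Universal Library's "snow." As a deliberately minimal sketch (character-level only, which is a gross simplification of real language models), one can compute the Shannon entropy of a string's character distribution and watch structured text score lower than random noise:

```python
# Minimal sketch: Shannon entropy (bits per character) of a string's
# character distribution. Structured English text is redundant and
# scores lower than random "snow."
import math
import random
from collections import Counter

def entropy_per_char(text):
    """H = -sum(p * log2(p)) over the character frequencies of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

english = "to be or not to be that is the question " * 10
random.seed(0)
snow = "".join(random.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(400))

# English text is redundant (lower entropy); random snow approaches
# the maximum of log2(27) ~ 4.75 bits per symbol.
assert entropy_per_char(english) < entropy_per_char(snow)
```

The asymmetry this exposes is exactly the Universal Library's problem: almost everything on its shelves is maximum-entropy noise, and distinguishing signal from it requires a measurable notion of structure.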

Thanks, But I Prefer Museum-Brand Filters

To combat this ever-rising tide of online ignorance, I think museums must serve at least two functions. First, they must establish themselves as a filter, an online brand that represents the "accurate," the "well-researched," and the true. Above and beyond accuracy, a museum should never publish or republish any content for which it cannot verify provenance and authorship. More than this, and in lieu of presenting themselves as an exclusive vehicle for "the good stuff," museums should dedicate at least part of their online outreach efforts to portal activity by linking to or otherwise calling attention to this "good stuff." As Jim points out in his post on federated authentication, there's a movement afoot to share or network login approvals among the respective staff of museums and cultural organizations (much the way that banks' ATMs acknowledge one another's customers' cards and account information). Where A trusts B and B trusts C, so should A be able to trust C. The public should remain confident that what it finds on or via museum websites will lead quickly and easily back to original and respected sources.
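That "A trusts B, B trusts C, therefore A trusts C" rule is just transitive reachability in a directed trust graph, and it can be sketched in a few lines. The consortium names below are invented for illustration; real federated schemes (and their revocation and scoping rules) are considerably more involved:

```python
# Hypothetical sketch of transitive trust among institutions: if A
# directly trusts B, and B directly trusts C, then A's users can be
# accepted by C. Implemented as reachability in a directed graph.
def trusts(direct_trust, a, b):
    """True if `a` reaches `b` through any chain of direct trust links."""
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(direct_trust.get(node, ()))
    return False

# Invented museum consortium -- names are illustrative only.
direct_trust = {"Museum A": ["Museum B"], "Museum B": ["Museum C"]}

assert trusts(direct_trust, "Museum A", "Museum C")      # A -> B -> C
assert not trusts(direct_trust, "Museum C", "Museum A")  # trust is directed
```

Note that trust here is deliberately one-way: the ATM analogy holds only as far as each institution's explicit agreements extend, which is why the graph is directed rather than symmetric.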

Second, museums must act as an authority on what people should take seriously with their explicit content (exhibits, articles, and research) and especially their implicit content (metadata, taxonomical standards, and search tools). This is something that makes museum involvement in the theoretical semantic web so important. Museums as much as other authoritative content providers owe it to their public to lead them to what they regard as "relevant" and "right." But can the data itself help users reach such conclusions?

Consider the attitude of a lay user with an interest in banjo music. At this moment, a search for "banjo" on Smithsonian Global Sound yields 316 pages and 3158 results. That's certainly better than starting the same search on Google, which produces 232,000 results for "banjo bluegrass" and 628,000 results for "banjo blues" (all results presumably dated as of the publication of this blog posting), but still off-putting to someone who just wants quick access to the "good stuff." What to make of any of this? Curated mediation by initial article/item selection (i.e., that which has been included in the SGS database), cross-referencing, context, and related articles will sometimes point confused users in the direction of "favorites" and icons of virtuosity. However, it's beginning to look as though the semantic web offers the possibility that a straightforward set of algorithms can sort this overwhelming offering of material by "relevance" and "rightness" -- for example, by telling users which results are both most distinctive (using unique attributes as a measure of originality) and most frequently referenced (as a proxy for influence on later work). In this semantic utopia, users should then be able to follow the trail of influence from an original "root" of authorship (say, such forebears as existed in 18th- and 19th-century broadsheet ballads or slave songs) to its further branches of influence (say, Brownie McGhee, Pete Seeger, and Bruce Springsteen). My personal ignorance of "true" banjo-based blues progenitors notwithstanding, my point is that the data here may be seen as containing the DNA of its own provenance.
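To make the two signals just named concrete (distinctiveness as a measure of originality, inbound references as a proxy for influence), here is a toy scoring sketch over an invented three-item catalog. Everything in it -- the items, the attributes, the citation links, and the additive scoring formula -- is made up for illustration; real semantic-web ranking would use far richer metadata and graph algorithms:

```python
# Toy "relevance and rightness" score combining the two signals from
# the text: distinctiveness (attributes no other item shares) plus
# inbound citations (a crude proxy for influence on later work).
# The catalog below is entirely invented for illustration.
catalog = {
    "broadsheet ballad": {"attrs": {"clawhammer", "modal tuning", "call-response"},
                          "cites": []},
    "revival recording": {"attrs": {"clawhammer", "modal tuning"},
                          "cites": ["broadsheet ballad"]},
    "modern cover":      {"attrs": {"clawhammer"},
                          "cites": ["revival recording", "broadsheet ballad"]},
}

def score(name):
    """Count of attributes unique to `name`, plus its inbound citations."""
    attrs = catalog[name]["attrs"]
    others = set().union(*(v["attrs"] for k, v in catalog.items() if k != name))
    distinct = len(attrs - others)
    inbound = sum(name in v["cites"] for v in catalog.values())
    return distinct + inbound

ranked = sorted(catalog, key=score, reverse=True)
assert ranked[0] == "broadsheet ballad"  # the "root" work ranks highest
```

Even this crude sum pushes the original "root" to the top, which is the point: the citation and attribute data themselves carry the DNA of provenance.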

The Fallacy of the Orphan Works Dilemma

For academic and edu-cultural organizations to fulfill their proper role as both filter and authority, they must be able to act on content (meaning exploit it, and further promulgate it as a component of research and diffusion) whose authorship and/or ownership may be a bit on the cloudy side. In some cases, this may be considered usage that exceeds "fair use" under the Copyright Act. Certainly, given that copyright law leaves the ultimate definition of "fair use" to the courts on a case-by-case basis, every museum use inherently involves some exposure to claims of infringement. This often shapes decisions about which images to include in exhibition catalogs, books, and websites -- decisions that straddle the traditional worlds of commerce (chiefly publishing) and of education, reportage, and critical commentary (traditionally considered well within the boundaries of copyright's "fair use" defense). Museum staffers enjoy opportunities to engage authors/artists in discussions about the potential use of their work (perhaps less so authors'/artists' estates), but spend lots of frustrating time stymied by so-called "orphan works." (I have a colleague in legal who has had to hold up museum use of a batch of sound recordings for over a year while chasing down ownership issues.)

According to the US Copyright Office, “orphan works” are those works still within the term of copyright protection whose owner(s) cannot be identified and located. Contrary to a common assumption, “orphans” are not public domain and are not available for unfettered exploitation by all; they remain under copyright even though there is no locatable owner to grant permission, which is precisely what makes using them risky. (See this white paper, published in 2006.) Thanks to the European Union and the Sonny Bono copyright extension law, the term of the majority of works under copyright now lasts for the life of the author plus 70 years. This isn’t the place to go into the intricacies of copyright law, but since the statutory duration exceeds that of any single human life, we can make a few simple assumptions. First, this is a heck of a long time: by definition, the term of copyright protection in any work will outlive its author. Therefore, at some point in a work’s copyright life, a would-be licensor will have to deal with the author’s estate, if such there be (the author being dead and gone). Further, if such readily identifiable estates there be not (and things can get pretty murky some 50 years after anyone’s death, notwithstanding the presence of estate lawyers), would-be licensors may well be considering use of an “orphan.”

This issue has been explored in better fora than this (in an April 24, 2007 DC Bar panel program, for example), but in a nutshell the debate centers on how we can assure that lawful copyright owners receive the compensation and protection to which the law entitles them without unnecessarily removing a large volume of relatively contemporary work from circulation just because a lawful owner has yet to be identified. Let’s remember as well that orphan works include not just those “abandoned” by an artist’s death, but those whose initial attribution may not have been well established to begin with (stolen or grey-market wallpaper designs, papers authored by a collective long since disbanded, traditional “folk” art, sound recordings of naïve performers, etc.). Though I summarize the problem in breezy fashion, so-called “orphan works” present a potentially serious dilemma for cultural organizations inasmuch as they set in opposition two mainstays of museum credibility as regards cultural and historical materials: use/publication/distribution on one hand, and sensitive treatment on the other. The flimsy solution floated to solve the problem requires would-be users to exercise due diligence before considering a work “orphaned,” and, upon notification by a legitimate owner, to promptly cease use or else pay up for continued use. It’s the niggling details of what level of effort should constitute “due diligence” and suffice to recant (or pay for) the sin of use that make the solution a rather flimsy one.

Why bother with a solution at all? Perhaps copyright use prohibitions ought to be struck in favor of a new regime that promotes clear attribution of original authorship while establishing statutory licensing fees across the board (as is already the case for cover artists re-recording yesterday’s new releases). Setting principle aside, the digital environment is not one which lends itself to authorial control. As I pointed out above in my observations about online plagiarism, the creator(s) foolish enough to publish a work today will see it self-replicate, mutate, and disseminate the moment a binary-source facsimile is produced. The only way to keep the virtual cat in the bag is for the cat not to exist at all, and I think most creators would find that somewhat self-defeating.

If the world wide web renders copyright enforcement difficult, if not impossible, perhaps the presence of a uniform, published billing structure could increase the likelihood of authors receiving compensation while assuring authorial recognition. Would "open-sourcing" works chill distribution and minimize compensation by depriving authors/owners of the commercial benefits afforded by monopolistic control? It’s doubtful. The success of online micropayment vehicles like iTunes and PayPal demonstrates pretty clearly that enough people prefer to pay for affordable, desirable content to support a valid business model. “Open-sourcing” works works. The argument that authors should not be required to relinquish commercial control of their work simply because the internet makes it easier to co-opt or copy works is, I think, beside the point. Reality is an amoral (as opposed to immoral) place; we must adapt our social structures to deal with what life throws at us.

Viewed from the Wikipedia perspective rather than that of current international law, the orphan works problem is misstated. The more the marketplace of ideas fills with noise, the more critical it becomes for us to be able to identify a good signal. It is therefore far more important that original works of authorship be recognizable and reliably recognized. Yes, authors of all stripes should be fairly compensated (and thereby, one hopes, incentivized) for their creative production, and we continue to need innovative, low-transaction-cost mechanisms for collecting and distributing money (and for fair enforcement of same). However, the focus on orphan works should prioritize the need for accurate source attribution, something which, as stated above, must be considered central to the museum’s “brand.” In an age of mass information consumption, it is imperative that our firehose deliver more than empty calories.

Endpaper - The Talking Points

Here, then, are a few things that museums should do to assure the continued purity and vitality of the marketplace of ideas in an increasingly polluted digital world:

1. Be an authority:
  • seek out authors and remain vigilant about properly attributing all sources;
  • keep primary source material alive and digital so that it can be referenced;
  • build semantic widgets to accurately and efficiently tag their "good stuff";*

2. Be a filter:
  • dedicate resources to portal activity to identify others' "good stuff"; and

3. Be a good citizen:
  • participate in discussions to create statutory royalty reservoirs.

* [The Powerhouse Museum, a lead participant in the project, may be among the first to take aggressive advantage of this, see this article.]


Bruce Falk said...

A friend pointed out the EU Provenance Project to me:

"The Provenance Architecture is defined as a computer system that deals with all issues pertaining to the recording, maintenance, visualisation, reasoning and analysis of the documentation of the process that underpins the notion of provenance."

However, it seems to me that this project is only partially apt and fails to address the human computation problems inherent in false attribution, plagiarism, and miscopying. As I understand the EU Provenance Architecture, it targets provenance of computer-generated documents (the site provides examples of tracking origins of aerospace simulation data and hospital logs of transfers of raw organs for potential transplants). I assume this is extensible to scans and OCRs of primary documents, as well, though it's not immediately clear from the site.

What is still required is a means of gleaning validity/credibility from human-generated documents (e.g., providing a reliability score by which to judge the likely validity of retyped manuscript copies, "cover" performances of music, nonsensical scene-by-scene remakes of classic movies like "Psycho," Lego homages, or paraphrases and attributions of original content). Anyone know of attempts to address this problem?
