The article “What Happens When Big Data Blunders” by Logan Kugler concentrates on the failure of Google researchers to predict flu outbreaks using search trends. He identifies these failures as the result of two factors: the researchers’ inability to isolate meaningful indicators of illness (searches about flu symptoms and remedies) from other trendy searches, and the difficulty of reconciling the dynamic nature of Google’s search algorithms with fixed assumptions about the search habits of the susceptible population. The article characterizes a common problem in social science research: statistical methods that once struggled to collect enough data can now be applied because digital resources faithfully aggregate copious amounts of information, but these methods often require stable sampling techniques that don’t align with the goals of the application or the consumer’s behavior. In a few words, messy data is as bad as no data. As Kugler notes, Google’s profit-driven business goals don’t align with those of social science researchers, and the data being collected is often skewed by the application’s desire (as with Google search) to improve customer experience rather than provide consistency. Finding clever ways to work with data that has been compromised in this way, allowing social scientists to piggyback their experimental data collection on modern applications, would give businesses a way to profit from selling consumer data and give social scientists a way to utilize the computational resources that have already revolutionized so many other fields.
In “What is Computable,” MacCormick gives a proof that a program capable of deciding whether any other program will crash cannot exist. He relates this to the halting problem and explains that, although the result matters less in practice than one might think, it raises important philosophical questions about what computers and people are capable of.
In “What Happens When Big Data Blunders,” Logan Kugler explains the reasons David Lazer and Ryan Kennedy discovered for the failure of Google Flu Trends to predict the 2013 flu outbreak, as well as the reasons the spread of Ebola was overpredicted. In both cases, the problem came from assumptions based only on big data that left out changing dynamics: in the Google case, the algorithm did not account for changes in the Google search algorithm itself, and in the Ebola case, the CDC and WHO did not account for the initial efforts of people working to contain the disease.
It is an interesting and challenging idea to combine the themes of these two articles. One exercise that comes to mind is coming up with our own theoretical questions about what is possible with big data and whether these questions can be answered. Some questions might be:
- Is it possible to determine whether a big data algorithm is sound by some definition of sound? If not, can we derive bounds on acceptable error?
- Is it possible to prove that a particular problem cannot be decided by any big data algorithm?
The first question is quite challenging. The goal of a “big data” algorithm is typically to make some prediction given a large quantity of data. To reason about this, we might imagine solving one of these problems without the aid of a computer. Suppose you were able to think fast enough, or live long enough, to process all of the data. What issues might arise? Is the data relevant? Is there enough non-overlapping information in the data to arrive at an answer? We would need to answer these questions. Any answer to our larger question involves the relationship between the question, the data itself, and the operations we can perform over the data.
For the second question, we must first decide what it means for a problem to be decidable. Clearly, if we supply no data and the problem requires data, it will be possible to prove it undecidable. On the other hand, if we supply all of the data about everything, will an algorithm then be able to solve it? This is somewhat philosophical, in fact: if we knew everything, could we predict the future?
There’s a dual comfort and dismay in the fact of incomputability. Philosophically, it’s humbling to admit that there are simply things we cannot, and may never be able to, solve. Bringing in Stephen Hawking and his 10 billion year time frame gives perspective on our humanity and our abilities. I recently bought a telescope, and having spent the last few weeks looking up at the night sky between the rains, I have to admit it’s given my research some helpful perspective.
This contrasts somewhat with the discussion of Turing’s “On Computable Numbers” paper in the same piece, and with the Church-Turing thesis. Whether human brain capacity can be equivalent to a computer running a deep neural network is, I think, more than a computable question. I listened to an interesting debate recently between Jaron Lanier and a singularity advocate whose name I now forget. The idea that the human mind could and will eventually be replicated by a computer seems to me like a bad ending to what had been an otherwise enjoyable sci-fi novel. I don’t know that that need be our end point, or that it is even possible. Jacques Ellul wrote about Technique, the ever-growing and ever more integral obsession with results, efficiency, and function, and I think there is healthy space for critique in this area: what’s possible, what’s not, what is lost, and what lies outside our horizon and paradigm.
Then there’s the example of Google and the flu. Here we see that big data can make mistakes and reach wildly wrong predictions. But in other areas, like self-driving cars, or areas where artificial general intelligence or artificial superintelligence may step into the algorithmic fabric of the imminent future, big data failures are both inevitable and more deadly. All new technologies fail; a failure at a certain level, however, might be difficult to come back from. There is an interesting discussion of this in Our Final Invention.
The readings for this week highlight three limitations for computational algorithms:
- Logical Contradiction
- Private Ownership of Data
- Mimetic Fidelity
In “What is Computable?,” MacCormick uses deductive reasoning to prove certain types of computer programs logically contradictory, particularly the creation of a computer program that identifies whether or not other programs will crash. Except for a brief mention of phenomenology and spirituality at the end, he focuses almost exclusively on the logical limits of algorithms, insisting that everything not logically contradictory is at least theoretically computable.
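MacCormick’s argument is the classic diagonal contradiction, and it can be sketched in a few lines of Python. (The function names below are purely illustrative, not from the reading: given any claimed crash-detector, we can construct a program that does the opposite of whatever the detector predicts, so no detector can be right about every program.)

```python
def make_contrary(crash_detector):
    """Given any claimed crash-detector (a function that takes a program
    and returns True if it would crash), build a program the detector
    must misjudge: it crashes exactly when the detector says it won't."""
    def contrary():
        if crash_detector(contrary):
            return "ran cleanly"        # detector predicted a crash
        raise RuntimeError("crashed")   # detector predicted a clean run
    return contrary

# Whatever a detector answers about its `contrary` program, it is wrong:
pessimist = lambda program: True        # claims every program crashes
optimist = lambda program: False        # claims no program ever crashes

make_contrary(pessimist)()              # runs cleanly, refuting the pessimist
try:
    make_contrary(optimist)()           # crashes, refuting the optimist
except RuntimeError:
    pass
```

The two toy detectors stand in for any candidate detector, however sophisticated: the construction guarantees a counterexample in every case, which is exactly the logical contradiction MacCormick describes.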
However, Kugler’s article, “What Happens When Big Data Blunders?” (interestingly, both articles are phrased as questions), uncovers two other issues through case studies: Google’s attempt to predict flu trends and the WHO/CDC attempts to predict Ebola trends. These two big data projects failed not from logical contradictions but from commercial bias and mimetic infidelity, respectively. In the first case, Google Flu Trends was based on a commercial search algorithm that changes with fluctuating business plans. This presents difficulties beyond the comparatively clear-cut deductive reasoning of MacCormick, raising the question of whether a commercial venture driven by profit and competition can provide reliable algorithms for scientific research. While perhaps practically difficult, there is no theoretical reason why such issues cannot be resolved by, say, performing such research outside of the commercial sphere.
In contrast, the WHO/CDC case study uncovers a much more difficult (and perhaps unanswerable) question: what are the limits of computational simulation? The WHO/CDC studies used simulations that extrapolated from initial conditions to approximate Ebola deaths, failing to keep up with “the ever-changing situation on the ground,” or what we might call “reality.” This opens up philosophical questions going back to Plato regarding the relation between representation and reality. Further, in the age of computer simulation, what conditions are necessary to render reality representable in a computational environment? Does reality itself have to function according to the principles of computation?
In the first article, Hamish Robertson and Joanne Travaglia discuss how the first explosion of data, in the 19th century, was used to categorize people and motivate “social change”. Whether that change was good became irrelevant once certain assumptions, including negative social categories, were built into the data collection process; often the data was used to oppress the groups it described. The authors express concern that this same situation will carry over into the “big data” revolution of the 21st century. Lev Manovich suggests that, much as in social science, computer scientists exploring social media use probabilistic models to analyze big data. Ted Striphas explores how common cultural words have changed meaning as computing has come to produce, store, and analyze cultural data.
The articles raise an interesting question: with so much data now flowing through social media and other online systems where people supply information about themselves, the criteria for categorizing people will be far richer in the new data era. Furthermore, as algorithms are applied to this data, errors, for example from an algorithm taking a string of information out of context, become extremely likely, and it is concerning what influence such “mistakes” could have on our understanding of people and society. To what extent will small decisions in the way algorithms are designed and used impact society in ways we don’t understand? Since most of the decisions made by algorithms are probabilistic, and our concepts of society are influenced by those decisions, how will we ensure that we are not causing societal damage by relying on them? These issues are especially pressing because the large scale of big data magnifies any small decision made early on.
We can start with some basic questions inspired by my reading of Robertson and Travaglia. Who wants to know? Who owns the data? Who is being cataloged? To what end? Looking back at the practice of social ordering in Robertson and Travaglia’s piece, we realize that the power and control endemic to this earlier “first information age” looks familiar today. Data is a raw material collected by corporate and government entities at alarming rates. If you follow the mainstream news, it seems intelligence agencies are collecting data in such bulk that it remains unexamined and unorganized. If this is even true, it can’t be true for long.
The authors write, “…much of the data collected about human beings by bureaucratic systems has a history not simply of description or even understanding but one of control”. This applies to the mass data collection exposed in recent years by US whistleblowers, but I would also add profit to the discussion. Data is a primary raw material collected and traded by top corporations, funneled as fuel into the refining mechanisms of a mature advertising industry. The old adage about the free lunch is true: our social media activity isn’t actually free, in that our habits, purchases, love lives, and friendships are being mined in order to more expertly sell them back to us. Political and technology theorist Jodi Dean calls this “communicative capitalism” and understands it to be a qualitatively new phase of capitalism, one in which capitalism has adapted to control workers and the surplus army of nonworkers through communications technology. Furthermore, in a Lacanian twist, our interactions on social media satisfy drive while always thwarting a deeper desire for equality and justice, keeping us in front of our screens instead of in the streets.
This leads into further questions about culture, in Manovich and Striphas’s texts. Is it time to just admit that 21st century culture is all online and cataloged? Probably. What does culture even mean outside of technology today? Given that, what does it mean when culture is mediated through corporations and listed according to algorithms to which we have zero access? Is culture a hood that has been welded shut? Both authors do their part to define the word “culture” and its content. Twitter and Instagram are our cultural mediators, but they are also owned by seemingly unstoppable and unknowable corporations.
A part of me still feels resistant to calling this “culture”. I watched an interview recently with filmmaker Abel Ferrara, and, when asked why he’d abandoned his home country to live and work in Italy, he said “There’s no culture where I come from. A bunch of fuck’n lunatics show up 300 years ago, shoot everybody that’s there. Kill every motherfucker that’s there…I’ve never met an Indian in my life… that’s my country. So where’s the culture?” On some days, I agree with him.
This week’s readings give us a great jumping-off point for why interdisciplinary classes like this one are necessary and should be embraced by academia. In my mind, the essays we read boil down to cross-disciplinary communication and issues of close versus distant approaches, or micro versus macro, or nuanced versus pragmatic if you prefer.
Dr. El Abbadi’s piece was great for a humanist like me to get insight into how a computer scientist views data or a problem, particularly in terms of wanting the most efficient (and most helpful to the user) solution, not merely a solution. This is a problem in much of the humanities, where we get lost in high-minded concepts that alienate our work from others, whether they are in a different discipline, completely outside academia, or even in the very classroom where we are trying to transmit information to the undergraduate student population. We’re good talkers, and can certainly elucidate a topic or raise important issues, but sometimes the talk leads to very little impact for the “user,” or student for that matter.
But both Manovich and Striphas provide a glimpse into a more complicated approach that marries pragmatic, efficient computational methods with the critical humanist approach. Rather than battling each other like it’s the good ol’ days of academics, building our reputations off esoteric assaults on other over-educated intellectuals in our privileged bubble, the idea underlying both pieces is that collaboration is the key. Tracing the history of data as Striphas and Robertson/Travaglia do provides vital insight into the consequences of data analytics, and into how we might learn from past mistakes in order to build a more complicated portrait from any given dataset.
I’ve read the Robertson/Travaglia and Manovich pieces previously, and there is certainly more to say about both (if I’m going to get all critical humanist), but again the most important takeaways from the readings this week were collaboration and complexity. It’s why Manovich proposes the “wide” approach to data, and why he stresses the overlap of so many different projects.
Whether you call them Digital Humanities, Cultural Analytics, or Social Computing, there will be people dissecting your methods to shreds. There will be attacks, superficial dismissals, and even blind support of these new paths forward, sometimes labeling them as trends or money-grabs. In my own program, whenever I mention the Digital Humanities I get one of two reactions: (1) utter contempt and indignant comments about distant reading, without any real understanding of the idea, or (2) “oh, that’s really big right now, it’ll help you get a job.” Maybe it will (crossing my fingers), maybe it will help me get a big fat grant like Manovich, but more importantly, as it relates to my research interests as well, I hope this interdisciplinary approach brings positive gains for everyone: people across a variety of disciplines in academia and (especially) those without the privilege to be involved in these discussions.