Book TV, C-SPAN2, March 30, 2013, 4:45pm-6:00pm EDT
chairman of the house ways and means committee. then john lott argues that thanks to president obama we're on the verge of economic and social collapse. at 9:15 eastern we'll hear from melanie warner, author of "pandora's lunchbox: how processed food took over the american meal," followed by our weekly "after words" program. this week gay rights advocate john corvino and maggie gallagher engage in a point/counterpoint discussion of gay marriage. we conclude at 11 p.m. eastern with eric deggans, whose book, "race baiter," will be the topic. visit booktv.org for more on this weekend's television schedule. >> next on booktv, viktor mayer-schonberger and kenneth cukier describe a new field of research that uses information
to predict human behavior and events. this is a little over an hour. >> good evening and welcome to today's program at the commonwealth club of california, the place where you're in the know. i'm dr. moira gunn, host of tech nation, which airs on npr.org and also on the npr channels on xm sirius radio. i'm your moderator for the program this evening. tonight's program is being held in association with the commonwealth club's science and technology forum. find us on the internet at commonwealthclub.org or download our iphone and android apps for program and schedule information and podcasts of past programs. and now it is my pleasure to introduce today's distinguished guests. viktor mayer-schonberger,
professor of internet governance and regulation at oxford university, and kenneth cukier, data editor for the economist. together they've written the book "big data: a revolution that will transform how we live, work and think." i had the distinct pleasure of interviewing professor mayer-schonberger and mr. cukier earlier today for a tech nation broadcast to be aired in the coming weeks, and i thought you should know a few things about these fellows. professor mayer-schonberger has more than one law degree, only one of which is from harvard. he's not just a lawyer, he's also a lawyer's lawyer, and he's earned a master's in economics from the london school of economics. with over 100 academic papers and seven books to his credit, i think my favorite title is "delete: the value of forgetting in the digital age." his co-author, kenneth cukier, you'd best know from his long career at the economist. prior to being the data
editor, he's held such positions as japan business and finance editor and global technology correspondent. you might also know him as the technology editor for the asian "wall street journal" in hong kong. all very important, because big data isn't just here in the united states, big data is global. so please welcome viktor mayer-schonberger and kenneth cukier. [applause] >> thank you very much. it's a pleasure to be here. welcome. big data is going to change how we live, work and think, and our journey begins with a story, and the story begins with the flu. every year the winter flu kills tens of thousands of people around the world, but in 2009 a new virus was discovered, and experts feared it might kill tens of millions.
there was no vaccine available. the best health authorities could do was to slow its spread. but to do that, they needed to know where it was. in the u.s. the centers for disease control have doctors report new flu cases, but collecting the data and analyzing it takes time. so the cdc's picture of the crisis was always a week or two behind. which is an eternity when a pandemic is underway. around the same time, engineers at google developed an alternative way to predict the spread of the flu. not just nationally, but down to regions in the united states. they used google searches. now, google handles more than three billion searches a day and saves them all. google took 50 million of the most common-searched terms that americans use and compared when
and where these terms were searched for with flu data going back five years. the idea was to predict the spread of the flu through web searches alone. they struck gold. what you're looking at right now is a graph, and the graph is showing that after crunching through almost half a billion mathematical models, google identified 45 search terms that predicted the spread of the flu with a high degree of accuracy. here you can see the official data of the cdc, and alongside are google's predicted data from its search queries. but where the cdc has a two-week reporting lag, google could spot the spread of the flu almost in realtime. strikingly, google's method does not involve distributing mouth swabs or contacting physicians' offices. instead, it's built on big data,
the ability to harness data to produce novel insights and valuable goods and services. let's look at another example: a company called farecast. in 2003 a computer science professor was taking an airplane, and he did what we all think we know to do, which is he bought his ticket well in advance of the day of departure. that made sense, but at 30,000 feet the devil got the better of him, and he couldn't help but ask a passenger next to him how much he paid. and sure enough, the person paid considerably less. he asked another passenger how much he paid; he also paid less, even though they had both bought the ticket much later than he had. he was upset. who wouldn't be? but he's a computer science professor, so not only does he get upset, he thinks about his research. so what he realized is he didn't
actually need to know the reasons for how to save money on airfare, whether you should buy in advance, whether there's something that might affect the price. instead he realized the answer was open for the taking, which is to say all you needed to know was the price that every other passenger paid on every single other airline for every single seat for every single route for all of american civil aviation for an entire year or longer. this is a big data problem. but it's possible. he scraped a little bit of data, and he found out that he could predict with a high degree of accuracy whether a price that you're presented online at a travel site is a good price and you should buy the ticket right away, or whether you should wait and buy it later because the price is likely to go down. he called his research project hamlet: to buy or not to buy, that is the question.
[laughter] but a little data got him a good prediction. a few years later he was crunching 75 billion flight records with which to make his prediction, almost every single flight in american civil aviation for an entire year. and now his predictions were very good, indeed. microsoft knocked on his door, and he sold his company for $100 million. the point here is the data was generated for one purpose, reused for another. information had become a raw material of business. it had become a new economic input. so it's tempting to think of big data in terms of size. it's true, our world is awash with data, and the amount of digital data that is being collected is doubling
almost every three years. the trend is obvious when we look at the sciences. when the sloan digital sky survey telescope began in 2000, it gathered more data in its first few weeks than had been amassed in the entire history of astronomy. over ten years the telescope collected astronomy data exceeding 140 terabytes of information. but the successor telescope due to come online in 2016 will acquire that amount of data every five days. internet companies, particularly, are drowning in data. youtube has more than 800 million monthly users who upload an hour of video every single second. on facebook over ten million photos are uploaded every hour.
google processes a petabyte of data a day, around 100 times the quantity of all printed material in the u.s. library of congress. the quantity of data in the world is estimated in 2013 to reach around 1.2 zettabytes, of which only a small percentage is nondigital. so it's tempting to follow the hype cycle of silicon valley and to see big data as a phenomenon characterized by the sheer size of digital information collected and used worldwide. but that would be like describing an elephant by the size of its footprints. in contrast, we suggest that big data is about more than just volume. we suggest that there are three defining and reinforcing qualities that characterize big data: more, messy and correlations.
first, more. today we can collect and analyze far more data about a particular problem or phenomenon than ever before, when we were limited to working with just a small sample. it's not the absolute size of data points, it's the relative size of data points, relative to the phenomenon we study. that gives us a remarkably clear view of the granular details that conventional sampling can't assess. we also can let the data speak, and that often reveals insights that we would never have thought of. the second quality of big data is its embrace of messiness. looking at vastly more data permits us to loosen up our desire for exactitude. when our ability to measure was limited, we had to treat what we did bother to quantify as precisely as possible. in contrast, big data is often
messy and varies in quality, but rather than going after exactitude and measuring and collecting small quantities of data at big cost, with big data we'll accept a little messiness. we'll often be satisfied with a sense of general direction rather than striving to know a phenomenon down to the inch, the penny, the atom. we don't give up exactitude entirely, we only give up our singular devotion to it. what we lose in accuracy at the micro level we gain in insight at the macro level. these two shifts, more and messy, lead to a third, and most important, change: a move away from the age-old search for causality. instead of asking why or looking for elusive causal relationships, in many instances we can simply ask what.
and often that is good enough. now, that's hard for us humans to comprehend, because as humans we are conditioned -- some might even say hard-wired -- to understand the world as a series of causes and effects. it makes the world comprehensible. it's comforting. it's reassuring. and oftentimes it's just plain wrong. if we fall sick after we ate at a new restaurant, our hunch will tell us that it was the food, even though it's far more likely that we got the stomach bug by shaking hands with a colleague. these quick causal hunches often lead us down the wrong path. with big data we now have an alternative available: instead of looking for the causes, we can go for correlations, for uncovering connections and associations between variables that we might not have known
otherwise. correlations help amazon and netflix make predictions and recommendations to customers. correlations are at the heart of google's translation service and spell checker. they do not tell us why, and they do not know why. but they tell us what, at a crucial moment and in time for us to act. these three features of big data -- more, messy and correlations -- are used to save lives. premature babies are prone to infections. it's very important to detect infections very early on. but how do you do that? in the analog, small data world, you would take vital signs every couple of hours: blood oxygenation level, heartbeat, heart rate, these types of things.
now, as part of a research project in canada, researchers collect 16 realtime data streams from premature babies, over 1,000 realtime data points a second from them. then they combine the data and look for patterns, look for correlations, and they were able to spot the onset of an infection 24 hours in advance, way before the first symptoms would manifest themselves. that's incredibly important for these preemies, because then they can receive medication well before the infection is so strong that it perhaps cannot be battled successfully. intriguingly, the best predictor in these vital signs is not that the vitals go haywire, but that they
actually stabilize. we don't know why, but we do know that in the small data age, a doctor would look at the stabilization of vital signs and say the baby's doing well, i can go home for the night. now we know that that means the baby might actually be in trouble and might need extra monitoring. it's also a wonderful example of the fundamental features of big data, of more, messy and correlations. the data was much more than we typically process. the data was so vast that it wasn't all in a clean form; it was messy. and the findings were correlations: they answered what was happening, but not why. not the biological mechanisms at work.
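a minimal sketch of the kind of signal being described, reduced to a single hypothetical vital-sign stream: flag the moments when short-term variability collapses relative to the recent baseline. the window sizes, the threshold and the simulated data are illustrative assumptions, not the canadian project's actual method.

    import random
    import statistics
    from collections import deque

    def stabilization_alerts(samples, window=60, baseline_window=600, drop_ratio=0.3):
        """yield indices where short-term variability falls well below the baseline."""
        short, base = deque(maxlen=window), deque(maxlen=baseline_window)
        for i, x in enumerate(samples):
            short.append(x)
            base.append(x)
            if len(short) == window and len(base) == baseline_window:
                s = statistics.pstdev(short)   # variability over the most recent readings
                b = statistics.pstdev(base)    # variability over a longer baseline
                if b > 0 and s < drop_ratio * b:
                    yield i                    # the signal has gone unusually quiet

    # toy demo: noisy heart-rate readings that suddenly stabilize
    random.seed(1)
    hr = ([140 + random.gauss(0, 6) for _ in range(800)]
          + [140 + random.gauss(0, 0.5) for _ in range(200)])
    print(list(stabilization_alerts(hr))[:3])  # first indices where variability collapsed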
>> now, often big data has been portrayed as a consequence of the digital age, but that misses the point. what really matters is that we are taking things that we never really thought of as informational and rendering them into data form. we're datafying them, to coin a term. once we do, we can process it, store it, analyze it and extract new value from it. think of location. people have always existed somewhere. nature has always existed somewhere. but it was only recently that we added on longitude and latitude, then gps, a mechanism to do this, and now it's the smartphones that we're probably all carrying in our pockets. but now our location has been datafied, our mobility is datafied all the time. think of books. think of words.
in the past we would look up to the temple at delphi and see two mottos etched in stone. later, we had books. and even more recently we scanned those books. google, for example, went to many libraries and scanned books. and the first thing that they had was a digital rendering of what was on the page. it was digitized, the book was digitized, and we had the words. now, we get the benefits of digitization: we can store it easily, and -- well, we can't process it per se, but we can certainly share it. what we can't do is analyze it, because it was just simply an image file. so what happens when we can take those words, extract them and treat these words as data? well, suddenly what researchers are doing is they're looking back at all the journal articles in the medical sciences going back a century. these are hundreds of thousands of articles. and they're looking for side
effects. a human being reading these journals for a century would not be able to spot some of the weird correlations of drug side effects. but a machine can. big data can, and that's what you get from a world that's been datafied. all of you in the audience right now are sitting. think of it in terms of something as fundamental as posture. it's the way that you are sitting, and you are sitting, and you, and you. it's all different. in fact, the way that you're sitting is really a function of your weight and the distribution of your weight and your leg length, and if we were to measure it and instrument the seat with maybe a hundred sensors, the way you sit is very personal; it would look like a fingerprint. one person sits differently than another. okay, so what could we do with this? researchers in tokyo right now are placing sensors into car seats. it's an antitheft device.
suddenly the car would know when someone else is driving it, and maybe you would set the controls so that if that was happening, then you'd conk out the engine. if you have a teenager, this might be a very useful thing, to say you're not allowed to drive the beemer after 10 p.m., and just like cinderella's coach that turns into a pumpkin, the car engine doesn't start. okay, that's great. now, imagine: what if 100 million vehicles had this on the road today? let's think of what we could do with it. perhaps, perhaps we'd be able to identify the tell-tale signals of the shift in body posture prior to an accident, 30 seconds prior to an accident. we would have datafied driver fatigue, and the service here would be for the car to alert the driver. maybe the steering wheel would vibrate. those are the sorts of things
that we can do when we datafy things. now, that's also the core by-product of social media platforms. think about this: facebook has datafied our friendships and the things that we like. twitter datafies our stray thoughts, our whispers. linkedin datafies our professional contacts. once things are in data form, they can be transformed into something else. so what's data's hidden value? well, traditionally data was processed for its primary purpose with little thought given to novel reuses. but this is changing. the core economic point of big data is that a myriad of reuses of the information are possible, that it can unleash new services or improve existing ones. so the value of data shifts from the reason it was collected and the immediate uses on the
surface to the subsequent uses that may not have been apparent initially, but are worth a lot. think of delivery vehicles. ups has 60,000 vans on the road, right? it needs to do maintenance on these. it's a problem. but it's a problem that can be fixed with information. when a car breaks down, it doesn't break down all at once; it sort of lets you know it. for example, you might be driving it, and it feels funny, right? or there's a strange sound that it normally doesn't have. well, if we place sensors in the engine, what we would be able to do is datafy some of this, so we would be able to measure the vibration or the heat. and we could compare that signature with what a normal engine sounds like and what the likely problem is. and suddenly now what we can do, and what ups does to save money, is predict a breakdown. it's called predictive maintenance. so what they're able to do is to identify, when the sensor reading tells it that the heat's going up or the vibration is out of bounds of normalcy, that you need to bring the van in to get a tune-up and probably replace a part. they're able to replace the part before it breaks.
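a toy version of the predictive-maintenance idea, assuming we already have a history of readings from healthy engines: learn what "normal" vibration and heat look like, then flag a van whose current readings sit far outside that signature. the field names, the numbers and the three-sigma rule are illustrative choices, not ups's actual system.

    import statistics

    def fit_normal_signature(history):
        """history: list of (vibration, temperature) readings from healthy engines."""
        vib = [v for v, _ in history]
        temp = [t for _, t in history]
        return {
            "vib_mean": statistics.mean(vib), "vib_sd": statistics.pstdev(vib),
            "temp_mean": statistics.mean(temp), "temp_sd": statistics.pstdev(temp),
        }

    def needs_service(reading, sig, k=3.0):
        """flag a reading whose vibration or temperature is more than k
        standard deviations away from the healthy-fleet signature."""
        v, t = reading
        return (abs(v - sig["vib_mean"]) > k * sig["vib_sd"]
                or abs(t - sig["temp_mean"]) > k * sig["temp_sd"])

    # example: a healthy fleet history, then one van running rough and hot
    healthy = [(0.9 + 0.05 * (i % 5), 82 + (i % 7)) for i in range(500)]
    sig = fit_normal_signature(healthy)
    print(needs_service((1.0, 84), sig))   # False: within the normal band
    print(needs_service((2.4, 109), sig))  # True: bring the van in for a tune-up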
a company called inrix uses data from 100 million cars to predict traffic flow in many cities around the world. by reusing its old data, it found a strong correlation between road traffic and the health of local economies. now, inrix's business model is simply to predict how long it's going to take for you to go from one place to another. it's a traffic prediction service. here what they're doing is they're reusing the data and turning it into a new form of economic value, because there's a correlation between the road traffic in a city and its economic health. but there's more. one investment fund uses the data from the weekend traffic around a large national retailer because it correlates very strongly with its sales. you can see where this is headed. so it can measure the road traffic in the proximity of those stores, and then it can trade that company's shares prior to its quarterly earnings announcement, because it has a lens into whether the sales are going to increase or decrease. >> so that's data's hidden value. and through the hidden value of data, big data offers us extraordinary benefits. unfortunately, it also has a dark side. as we have just heard, so much of data's value remains hidden, ready to be unearthed by secondary uses. this puts big data on a direct collision course with how we currently protect informational privacy.
through telling individuals at the point of collection, through notice and consent, why we are gathering that data and asking for their consent. but in the big data age, we simply do not know, when we collect the data, for what purposes we'll be using it in the future. so as we reap the benefits of big data, our core mechanisms of privacy protection are rendered ineffective. but there's another dark side, a new problem that emerges: algorithms predicting human behavior, what we are likely to do, how we will behave rather than how we have behaved, and penalizing us for it before we even have committed the infraction. and if you think of "minority report," that's exactly right. in a way, that provides value,
right? isn't prevention through probabilities better than punishment after the fact? and yet such a big data use would be terribly misguided. for starters, predictions are never perfect. they only reflect a statistical probability. so we would punish people without certainty, negating a fundamental tenet of justice. worse, by intervening before an action has taken place and punishing the individuals involved in it, we essentially deny them human volition, our ability to live our lives freely and to decide whether and when to act. in a world of predictive punishment, we never know whether or not somebody would have actually committed the crime. we would not let fate play out. we would be holding people responsible on the basis of a big data analysis that can never be disproven.
but let's be careful. let's be careful here. the culprit is not big data itself. the culprit is how we use it. the crux is that holding people responsible for actions they have yet to commit is using big data correlations, the what, to make causal decisions about individual responsibility, the why. as we have explained, big data correlations cannot tell us about the why, the causality behind things. often that's good enough. but it does make big data correlations singularly unfit to decide who to punish and who to hold responsible. the trouble is that we humans try to see the world through the lens of cause and effect; thus, big data is under
constant threat of being abused for causal purposes, and it threatens to imprison us, perhaps literally, in probabilities. so what can we do? to begin with, there's no denying big data's dark side. we can only safely reap the benefits of big data if we also expose its evils and discuss them openly. and we need to think innovatively about how to contain these evils and how to prevent the dark side from taking control. one suggestion is that information privacy in the big data age needs to have a modified foundation. in this new era, privacy control by the individual will have to be augmented by direct accountability of the data users. second, and perhaps more importantly, on the dangers of
punishing people based on predictions rather than actual behavior, we suggest that we have to expand our understanding of justice. justice is just different in the big data age than in the small data age. the big data age will require us to enact safeguards for human free will as much as we currently protect procedural fairness. government must never hold an individual responsible for what they're only predicted to do. third, most big data analysis today and going into the future is too complex for the individuals affected to comprehend. if we want to protect privacy and to protect individuality in the big data age, we need help. professional help. much like privacy officers aid in ensuring privacy measures are in place, we envision a new caste of experts, call them
algorithmists, if you want. they're experts in big data analysis and act as reviewers of big data predictions. we see them take a vow of impartiality, of confidentiality and of professionalism, like civil engineers do, or doctors. of course, big data requires more than these individual rights safeguards to fulfill its amazing potential. for instance, we may need to ensure the data isn't held by an ever smaller number of big data holders. much like previous generations rose to the challenge posed by the robber barons that dominated railways and steel manufacturing in the 19th century, we may need to constrain the reach of data barons and to ensure big data
markets stay competitive. >> we have seen the risks of big data and how to control them, but there is yet another challenge, one that is not unique to big data, but that in the big data age society needs to be extra vigilant to guard against, and that is what we call the dictatorship of data. it's the idea that we may fetishize the data and endow it with more meaning and importance than it deserves. as big data starts to play a part in all areas of life, this tendency to place trust in the data and cut off our common sense may only grow. placing one's trust in data without a deep appreciation of what the data means and an understanding of its limitations can lead to terrible consequences. in american
history, we have experienced a war fought on behalf of a data point. the war was vietnam, and the data point was the body count. it was used to measure progress when the situation was far, far more complex. so in the big data age, it will be critical that we do not follow blindly the path that big data seems to set. big data will help us. it is going to help us understand the world better, it will improve how we make decisions from what medical treatments work to how we educate our children to how a car can drive itself. but it also brings new challenges, new dangers. what is essential is that we harness this technology understanding that we remain its master. but just as there is a vital need to learn from data, we also
need to carve out a space for the human, for our reason, our imagination, for acting in defiance of what the data says because the data is always just a shadow of reality and, therefore, it is always imperfect, always incomplete. as we walk into the big data age, we need to do so with humility and humanity. thank you very much. [applause] >> wonderful. our thanks to viktor mayer-schonberger, professor of internet governance and regulation at oxford university, and kenneth cukier, data editor for the economist, co-authors of "big data: a revolution that will transform how we live, work and think." [applause]
and now it's time for our audience question period, and we have a number of questions, and i've looked at some of them already. if you have more, please turn them in. and if i could ask to have those trolleyed over on this side as well, we'll be able to do that. we want to get to everyone's questions. um, i noticed the subtitle before i met these guys, who i'm crazy about. everyone always writes these books where the subtitle is live, work and play. but not them. they live, work and think. so it's like they're all business. [laughter] on top of that, it suggests they may not think while they're at work, which is my kind of work. [laughter] so it's really, really interesting to see the kind of questions that i've already seen come up here. and the first one is -- we might as well start right there, because i've seen it go through a couple of these cards -- and that is the dark side. what is the worst, the negative,
the loss of privacy, you know? well, the list just goes on. who wants to take that? >> well, we talked a little bit about the dark side, and i mentioned the danger of propensity, ken mentioned the dictatorship of data and, of course, the privacy challenge. the privacy challenge is one that is severe, because the mechanisms by which we protect privacy become ineffective. but we, in writing the book, really thought that the propensity challenge is one that gets often overlooked but going forward is going to become incredibly important. and so what we really want to impress on the audience is not only that big data may challenge the informational privacy that we have, but that it really does challenge the
role of free will and human volition. and we need to be quite careful on how we guard this, this important human element. and we, in the book, suggest a number of possibilities to do that. but that's, that's really what keeps me awake at night. >> well, in a real sense we're all automatically collecting and disseminating all this data, much of it generated by ourselves. if you've ever been to a hospital, you realize you're suddenly giving all this data out, not just what you sign on a form. do we really have an expectation of privacy in the big data age? >> well, in some instances we have to ask the question, should we have an expectation of privacy? so let's take health care as an example. we have developed a very cumbersome legal regime that actively blocks the sharing of health care data. now, you can just imagine in a hundred years our children are
going to look back on us and be bewildered how we could ever let this priceless information that would certainly improve care just slip away and, indeed, how the federal government actively blocked it. not just here in america, but around the world. in fact, what we probably need to do is have a very healthy debate and change the narrative entirely and say, well, perhaps we should make it a condition of citizenship that all health care data of the individual gets shared. now, it is true there is a problem. there's a risk of inadvertent disclosure that could lead to bad consequences for individuals. let's look at mechanisms that try to prevent that and police that. but just learning from the data is certainly a social goal. >> well, you know, it's very interesting, because to say we'll maybe de-identify or encrypt that -- you know, i defy you to leave this room and not leave your dna behind. we know who you are. if you have dna data on somebody, we know who you are. so in a real sense, this is a conundrum. go ahead, viktor.
>> right. but what is happening right now is that a lot of the health care data gets collected, and then it is being used by the health care provider or the insurance company perhaps to discriminate, perhaps not to discriminate, depending on the regulatory regime in which you live. but what the data very rarely gets used for is research into what could actually help you if you have the condition or how that condition could be prevented. what we really need to do is to unleash the power of big data on the research side rather than to unleash the power of big data on the sort of cost effectiveness of the insurance side. and we've been pretty bad at that. >> well, you know, in science we always say we have, you know, 220 people in the study or 16 people in the study or even 500 people in the study, and very few long-term clinical studies with 15, 20,000 people. with big data we can actually change the face of science, right? >> absolutely. and it's almost laughable
because we are in silicon valley today. and speaking of facebook, you absolutely know facebook would never dare change a pixel on its web site unless it actually had tested it with millions of people. yet we're approving drugs with a few hundred. it's somewhat laughable. it just really underscores the degree to which we need a new mindset. >> now, i want to remind you that you're listening to the commonwealth club of california guest program. our guests are viktor mayer-schonberger and kenneth cukier. we're discussing the pluses and pitfalls of big data. you can find us online at fora tv. and, of course, everybody wants to know, ken, you know, data editor? what is -- you're not a data input clerk, right? >> i'm not a data input clerk. >> okay. now we've got a promotion from the universe. what is a data editor? >> it is a new title.
i am the first one. we understand at the economist that data is more and more important to our readers. we've been in the data journalism game for about 160 years, so it's nothing new there. but we recognize that there are new techniques in terms of how to visualize data, that we can use data as the basis of stories instead of, if you will, anecdote-based journalism, where you sort of do pattern recognition on the story through talking to people -- now we can interview a database. just as our sources might lie to us as journalists, the data might lie as well, so we have to keep our suspicions up. but we can crunch lots of numbers to either visualize it, or we can actually use the data as the basis of our reporting. and a data editor is a service provider to the rest of the organization to see that that happens. >> now, this is a question i forgot to ask you today, so whoever this is, is the npr producer for the day, whoever asked this
question. in january and february, we had a flu outbreak on the east coast, the midwest, we kept hearing about it in the paper every day, google flu was telling us where it was, and apparently google flu overestimated the outbreak. what happened? >> well, first of all, it's a prediction. the prediction tells you that 85% of the time you're right. that means 15% of the time you're wrong. and so being wrong is just part of being in the prediction game. then, of course, this is a dynamic world in which you need to rerun your models all the time, because if cnn reports on flu trends or reports on the flu season, people might google the flu even though they don't have it. and so there's a feedback mechanism in place. and, of course, google flu trends is being compared to the centers for disease control data. so maybe the flaw is in the centers for disease control data rather than the google data.
we don't know. so what we should not do is to immediately create a causal link and say, oh, this must be because google's model is wrong. that "because" part, that causality part, that's dangerous in a big data world, and we shouldn't do that. when we look at these biases and so forth, we should investigate with an open eye and an open mind. >> yeah. i would just underline what viktor said, which is to say the presumption in your question is that, actually, the cdc represents the true flu cases, right? and that google is just simply a shadow of that, right? but that may not be true. it may actually be the inverse. so, for example, we are in the middle of an economic crisis. when google first did its fitting of the model, we were not in a recession. perhaps many people are now not going to the doctor or not going to the emergency room when they have flu symptoms, because they feel they can't take a day off of work or because they can't
afford it. so, in fact, google flu trends might be more accurate in terms of the outbreak of the flu, and the cdc data maybe has more variability. >> well, i'll introduce you to the cdc. they'll be delighted to hear that, ken. [laughter] >> i think they will be, because -- >> i think, no, i'm actually going with you in all directions, and i'll tell you why. once you know something is being collected -- in science you know that; it's why there are double-blind studies. it's like, is this what you're doing? you can game the system. i mean, i can go out and search for flu symptoms. we don't know. once it becomes public, how people behave changes, and that shows up in the data collected. so that's a problem. as well, i think the massive point that you have is that if the cdc is only recording those people that go to doctors, that's changed dramatically even with the internet, you know? even with the internet.
so we have to really be good at this new big data role of analytics. and for some people the algorithmists -- i still can't say that. >> algorithmists. >> it's based on "algorithm," and that is, for those who don't know: some 50 years ago don knuth, a stanford professor in computer science -- who also listens to tech nation, so he's a wonderful guy, obviously -- [laughter] he wrote a book called "fundamental algorithms," which is a textbook, and all computer science majors have to read it. and the beauty of this book is that in the very beginning he quotes from the betty crocker cookbook about how these are recipes, and we're going to take you step by step, and we've checked absolutely everything we think we can check, and we're trying to bring you precisely how we are doing everything, and we've studied it, and we think
we really have it down, and please let us know if you don't. and that's all an algorithm is. so when you're having an algorithmist -- and i probably will give you a new name, since i can't pronounce this one -- coming up with the algorithms for how we look at this big data and account for what we're all just tossing around here, because it's a dynamic kind of a thing.
>> they need to use statistical packages. they need to use analysis tools, with a wide variety of tools and methods available, and they need some really good grounding in the latest statistics. a lot of the statistical methods that we use were designed for small data, so there might be a need to upgrade or improve them to an extent. and then they might also need some sense of visualizing the data as we go into the big data age. and in addition to all of that, we would like to imbue them with theoretical grounding, not just mathematics but a more general theory. oftentimes people who are doing very well as algorithmists are those that come from the natural
sciences, particularly the physicists, who are well trained to deal with huge amounts of data, either through astronomy telescopes and the data gathering there or -- so that is the kind of interdisciplinary mix that we need, and unfortunately relatively few universities around the world have programs yet to educate algorithmists. >> we all had traditional statistics, which i did not do very well in -- probably not alone there -- and we barely remember chi-square and confidence levels. do we have new statistical techniques for these data? >> yes, and in many ways we are looking for advancement.
right now, our classical statistical approaches look for linear regression, that is, linear relationships: if a increases then b will increase or decrease in the same way. but a lot of times that's not the case; it's much more complex, and the relationship may be more difficult than that. and so we need some advancements and we need some insights there. we need better ways to measure the fitness of a model to data. today statisticians talk about chi-squared and how well a particular model fits. in this "big data" world we need to upgrade a lot of these tools, these methods that we have available, but that doesn't mean these tools are bad. that just means there's room for improvement. >> is it possible -- well, what
can we regulate with respect to "big data" and what can't we? >> boy, you always throw these curveball questions. what can we regulate? i think what we need to do is to make sure that we are not stifling innovation. ken and i both agree very strongly that the benefits of "big data" outweigh the risks or the drawbacks. but that doesn't mean we need to take the risks lightly. we need to focus on the risks. we talked about the propensity challenge and the dictatorship of data challenge, and we need to find pragmatic solutions, pragmatic safeguards. in the book we go into quite a bit of detail in chapter 8 on how we can do that in innovative and market-oriented ways but still ensure that society and
the individuals are going to be protected. >> you asked what can we not regulate? i will just crystallize the issue. right now, if we go to a doctor and we are told that we have to have an operation, we can ask the doctor why, and the doctor can tell us: i learned this in medical school, and this is why you need the operation, and he can point to something in the textbook or the literature. you can imagine 30 years from now that the doctor is not making a decision blindly but uses algorithms, just like a commercial airline pilot would not ever dare to land the plane without the benefit of the instrumentation of an autopilot. so you ask the doctor, why do i need this operation? and the risk is that the doctor might not be able to say why. more generally, you might ask the bank that denies you a
loan, and they would say, because there are 15 factors and here is how you scored. but what if we looked at 1,000 variables, and what if among those variables there were 400 strong signals and weak signals, all of them in a complicated formula that was tailored to the individual and that was always changing over time? where would the reason why be? where would the procedural fairness be? where would the transparency be? hence the role of the algorithmist: to give the public the confidence that it will need for big data to go forward. >> i think of the whole idea of social responsibility -- i would think society and government, and i don't think you can break those apart. here we are sitting in the city of san francisco, roughly three-quarters of a million people in the center of the much larger bay area -- say 10 million people, if you go that far.
what should the priorities be -- one, two, three -- for "big data"? >> well, it's simple. >> i love simple. >> i feel a great confidence, because san francisco has taken a leadership position in the united states in terms of collecting data and opening this data up, so we should applaud -- >> in what way? i didn't know that. >> well, there was a gentleman, and i want to say chris, the cto of the city of san francisco, who now i believe works at the white house, who took a strong leadership position in getting the government to open up crime reports and public transport data so that developers can build apps alongside it, for developers to come in and build those and bring it together. he is actually doing very good things. but the real
gemstone in the united states is in new york city, and there they have a director of analytics, so you might want to look at that model. what was done is a fellow created a small little team to act as a service provider to the agencies of the city. so one of the problems the city faces is overcrowded buildings, tenements for example -- imagine you chop them up 10 times or 100 times -- and those buildings are at huge risk of fire. when those buildings catch on fire, the likelihood of the firemen being injured or perishing is extremely high. so what he has done is he has said, we don't know at the outset which buildings are the ones that are at the most risk for fire, which worst offenders are the ones that are a problem. we get 56,000 complaints a year to our helpline, but we only have 200 inspectors. how can data help us? you build a predictive model that looks at
ambulance visits, police visits, whether there has been a financial lien on the property, and whether the utilities have been cut. whether the exterior has been worked on. all of it is in the model, so now when an inspector goes out, they issue a vacate order to clear the building. now they are doing that 70% of the time, so it's a fivefold increase in efficiency, and the inspectors love it -- they are a lot more effective. the mayor, bloomberg, is a big data guy, because he is doing more for less, and for the fire department, it means less danger for the firemen.
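the new york model itself isn't spelled out here, but the logic of pointing 200 inspectors at 56,000 complaints can be sketched as a weighted risk score over the features just mentioned. the weights and field names below are invented for illustration; a real system would learn them from past inspection outcomes.

    # hypothetical feature weights, not the city's actual coefficients
    WEIGHTS = {
        "ambulance_visits": 2.0,
        "police_visits": 1.5,
        "tax_lien": 3.0,
        "utilities_cut": 4.0,
        "recent_exterior_work": -1.0,  # a sign the owner is maintaining the building
    }

    def risk_score(building):
        """building: dict of feature name -> count or 0/1 flag."""
        return sum(WEIGHTS[f] * building.get(f, 0) for f in WEIGHTS)

    def triage(complaints, inspectors_per_day=200):
        """order complaints by predicted risk and keep only what the inspectors can visit."""
        return sorted(complaints, key=risk_score, reverse=True)[:inspectors_per_day]

    complaints = [
        {"id": 1, "ambulance_visits": 3, "utilities_cut": 1},
        {"id": 2, "police_visits": 1, "recent_exterior_work": 1},
        {"id": 3, "tax_lien": 1, "ambulance_visits": 1, "police_visits": 2},
    ]
    for b in triage(complaints, inspectors_per_day=2):
        print(b["id"], risk_score(b))   # the two highest-risk buildings get visited first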
>> what we are saying is, if you get the right data, and the right data to the right people, we can make a big difference. now, i always get suspicious, and i'm sure other people do as well, because i know, specifically speaking, we only have seven to 12 people. this is the kind of thing -- if i don't believe you with 12 people, how am i going to believe you with all of these? what is the argument there? just that it gets worse? >> well, if i may take your question in a context -- in a way that gets at the heart of it. in the small data age, the way we approached problem-solving and decision-making, because we were starved for data, was we would come up with a theory about how the world works, and based on this theory we would then develop a hypothesis, and then based on those hypotheses we would go out and collect the small data that is necessary, a sample that is necessary. you could prove or disprove the hypothesis, and then when the
hypothesis is disproven, we would go back and change the hypothesis and try it over again, collect the data and analyze it and so forth. that was scientific trial and error, step by step. it worked reasonably well, but it was an artifact of the small data age. we can do more if we have not just a sample of data but close to all of it. we can look at details. we can look at subgroups and subcategories that we couldn't do before. but we can also let the data speak, in the sense that we can use the data to produce hypotheses and then to test them. take the example of google. when they tried to find out, of the 50 million search terms, which 45 terms were the best to predict the flu -- they had no clue which of the 50 million to take, and in the old-fashioned days they would have picked the first, tried it,
it doesn't work, pick the next one, try it, and it doesn't work. now, it's crazy to do that. and what is the exact combination? that is crazy to do that. what you want to do is to have a method by which you can create a way of producing hypotheses and then testing them. so in a way we are using "big data" analysis not just to tell us whether we are right or wrong about a hypothesis but to help us come up with a hypothesis. >> now, are they checking those terms just because computationally they can check them all at the same time, or do they have a new strategy for coming up with those? >> what they are doing is they take the 50 million most common search terms and they essentially try each one to see if it improves the model, and when they find one that is good for the model, they try another. >> does it say elephant or does it say sniffles? >> that is really essential. they are not making any pre-judgments of what the useful term is. so for example, in the top 100 terms was the term high school basketball, because high school basketball is played in the wintertime. so there is a fit there. there is a correlation -- keep in mind that term ranked somewhere in the 60s or the 70s. so they build a model, try the 44th term. it was good and improved the predictability of the model. they try the 45th term. it worked; it improved the predictability of the model. they tried the 46th term and the model deteriorated. so they cut it off at the 45th term.
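what cukier is describing is essentially greedy forward selection: rank candidate terms, add them one at a time, keep a term if it improves the fit against the historical flu data, and stop as soon as the model deteriorates. here is a small sketch of that loop on toy data, scoring a candidate set by the correlation of a crude composite index with the flu series; google's real pipeline and scoring were of course far more elaborate, and every name and number below is made up.

    import statistics

    def pearson(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

    def zscore(xs):
        m, s = statistics.mean(xs), statistics.pstdev(xs)
        return [(x - m) / s if s else 0.0 for x in xs]

    def composite(term_series, selected):
        """crude composite index: average of the z-scored selected term counts."""
        cols = [zscore(term_series[t]) for t in selected]
        return [sum(vals) / len(vals) for vals in zip(*cols)]

    def forward_select(term_series, flu_rates):
        """greedily add terms while the fit against historical flu data improves."""
        selected, best, remaining = [], -1.0, set(term_series)
        while remaining:
            score, term = max((abs(pearson(composite(term_series, selected + [t]), flu_rates)), t)
                              for t in remaining)
            if score <= best:          # the model just deteriorated: stop here
                break
            selected.append(term)
            remaining.remove(term)
            best = score
        return selected, best

    # toy data: weekly counts for three candidate queries vs. the observed flu rate
    flu = [2, 3, 5, 9, 12, 8, 4, 2]
    terms = {
        "flu symptoms":           [20, 31, 52, 90, 118, 79, 41, 22],
        "cough medicine":         [15, 18, 30, 55, 70, 50, 28, 16],
        "high school basketball": [60, 62, 70, 66, 73, 68, 61, 63],
    }
    print(forward_select(terms, flu))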
>> what are they comparing it to? >> keep in mind there's another wrinkle to it. you rerun the whole model, because it continuously learns from the past. >> and you check it against the past as well, so obviously it's tricky and it's progressive, but it gets better over time, as long as you don't take any one point in time and use that as the data and say there is the answer. i think that is part of it. we are moving ahead now, building a history of the data that we didn't have before. really and honestly, i did not come up with this question: given the field is growing and changing so fast, estimate the shelf life of your book. [laughter] it's going to be on the shelf for the rest of your life. how's that?
>> the book is timeless. [laughter] >> he would say that, would he not? >> and the reason is that it's the first book out of the gate to define this new trend, and this trend is not just going to affect business. it's not just going to affect government. it's going to affect everything. if you were asked years ago, where do you think computing is going to go, a person would have to honestly answer, it's not really the right question, because by the year 2013 computers will have wormed their way into everything until they are almost invisible. so too, society now is going to learn from data. it's going to be datafying things. we have to change the way that we approach things. the future is going to be based on information. the way that we are driving cars is not because we can program a
computer to drive a car. we tried that, and it failed. it's because we can pour in a lot of data and let the machine teach itself with statistics: the light is red, the light is green, accelerate. >> a member of the audience has pointed out that internet searches and social network participants are not representative, and this is just one example. within a really large dataset -- all the people in this room, all the dna generated by the complete genomes generated by the national -- etc. -- we are talking about a large data mix. you can't just look at it and get the data. so you are correct, and i think that asks one of the questions being answered here, and that is, if the data is about correlation and not cause, how can you judge its fairness
or completeness, and therefore the quality of the decision? >> i think it's incredibly important to understand the limitations of the "big data" that you collect, because otherwise you run the risk of repeating the problem of 1936, if i recall correctly, where the poll data erroneously predicted a republican landslide in the presidential election. they sampled with bias. it was a large sample, but it was biased. keep in mind the "big data" age works slightly differently. so if you have half a percent of the population that you sample, then that gives you a good first cut of what the population thinks. but if you then sample 3% or 5%
and there is a bias in that sample, that actually doesn't improve anything; it makes it worse. but in the big data age, if you collect 97 or 99% of the whole data, then even if that is slightly biased, the 1% that you are not collecting is not going to undo all of the analysis. again, it doesn't give an exact answer, but it gives us the right direction, and oftentimes that's enough.
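the 1936 point can be made with a quick simulation: a small unbiased sample lands near the truth, a much larger but biased sample misses badly, and near-complete coverage tolerates the same bias. the population, the bias mechanism and the numbers below are invented purely to show the effect.

    import random

    random.seed(0)

    # a population where true support for some proposition is exactly 50%
    population = [1] * 50_000 + [0] * 50_000
    truth = sum(population) / len(population)

    def biased_sample(pop, fraction, bias):
        """draw a sample in which supporters are 'bias' times more likely to be
        included (think: only telephone and automobile owners get polled)."""
        out = []
        for x in pop:
            p = fraction * (bias if x == 1 else 1.0)
            if random.random() < min(p, 1.0):
                out.append(x)
        return out

    for label, fraction, bias in [
        ("small, unbiased (0.5%)",         0.005, 1.0),
        ("large, biased (5%)",             0.05,  1.6),
        ("near-complete, same bias (97%)", 0.97,  1.6),
    ]:
        s = biased_sample(population, fraction, bias)
        est = sum(s) / len(s)
        print(f"{label:35s} estimate={est:.3f}  error={abs(est - truth):.3f}")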
>> it's like, yes, these are not right, but there is such a great body of data here that intuitively you know -- well, i can't give you an example. now can you discuss -- obviously this is the last question; if you don't have an answer there will be another one. we have actually come to the end here. can you express big data as it pertains to climate change? >> yes, obviously. it's going to be important for all of our global challenges, but the first step, what we need to do, is quantify the problem, and so the era of quantification that we saw in the 19th and 20th centuries continues -- we are datafying things. there's one company that has a clever app that allows you to take a photograph of an animal on a path, and it will tell you what it is. but the service is not really designed just to identify the species; it's that so many people are now doing this, they're able to identify, is spring coming early this year, or is this mushroom the kind that exists in one climate zone or another, which suggests climate zones are creeping up further.
as data becomes the bedrock of most of society, we are going to be able to put a measure and a quantity on things like climate change. we may not be able to avert it with big data -- probably can't -- but what we can certainly do is identify it and then take steps. >> thanks to viktor mayer-schonberger and kenneth cukier, data editor for the economist. [applause] they are co-authors of "big data: a revolution that will transform how we live, work and think." we thank our audiences on radio, television and the internet. this program was held in association with the commonwealth club science and technology forum, exploring visions of the future through science and technology. we also want to remind everyone here that copies of our guests' new book are on sale in the lobby, and they would be pleased to
sign them outside of this room immediately following the program, and we appreciate you letting them make their way to the signing table as quickly as possible. i am the host of tech nation on npr, and now the meeting of the commonwealth club of california, the place where you are in the know, is adjourned. [applause] [inaudible conversations] >> we have to take back the independent media that will save us. the media are the most powerful institutions on earth, more powerful than any bomb, more powerful than any missile. it's an idea that explodes onto the scene. but it doesn't happen when it is
>> we are at the annual conservative political action conference in washington, d.c., and we are here with marji ross, publisher of regnery publishing, based here in washington. booktv viewers may recognize ms. ross. she was here last year for a long conversation about publishing. how are you doing? >> i'm great. happy to be here and happy you are here. >> let's talk about a couple books you have coming up in the spring season. first off, former lieutenant governor -- >> this is a terrific book. it's our first best-seller of 2013, so we are very excited about that. we released this book, "beating obamacare," as a paperback
because we really wanted to make it an accessible handbook, sort of a consumer's guide to what people can expect. a lot of people talked about what was going to happen with obamacare as it was starting to come into effect. well, now it's here and we have to live with it and we have to deal with it, so she is an expert in this area. she's a former lieutenant governor of new york and one of the few people who has read the entire bill, and she goes through it in a very commonsense, easy-to-understand -- i was very impressed -- very easy to understand explanation of what actually is in the bill, what these different laws are and what the rules are and what you can expect, what these different exchanges are and how it's going to affect people in their paycheck and in their withholding and their insurance coverage at their job. it's just a very practical guide for consumers to find out what they are getting.
>> regardless of whether you're a conservative or liberal? >> actually, it is. she is not a fan of the law, but she walks you through in a very practical consumer way what you need to do to navigate this. >> and the next book here, david harsanyi. >> this is our newest book out, "obama's four horsemen." as you can see, it's rather apocalyptic. i think a lot of the books that are out, that have come out in the past few months, have talked about america at a crossroads, america at a point where we have a big decision to make. david harsanyi, who is a terrific writer and book person, basically says we have crossed that point. it is too late to avoid some of the disasters that we are facing. now we just have to buckle down and figure out how to get through. >> the last book you have -- you're holding a galley right here.
>> this book is not out, but this is our next book coming out in april. it's called "the ultimate obama survival guide." this is a terrific read. it's very fun. it's also very practical, so the first part of the book tells us all the terrible things we are facing under a second term of barack obama, and the second half of the book is a very practical survival guide, everything from how to buy gold coins to how to stock your house with food and water to how to buy a gun and how to pick one and what ammunition to stock up on. he has covered all the bases in a very entertaining way, and you will be amused and you will be prepared. >> couldn't help but notice all three of these books deal with obama's second term. is there kind of an understanding for conservatives that this is something they are going to have to live through?
so how did you go about acquiring these books in such a short period of time, because you weren't aware of how the election would turn out? >> that was something we struggled with and talked about a lot in the second half of last year. it's something that publishers have to deal with, but particularly for regnery, because we have focused on only conservative political books, and we know every four years it's going to be an interesting challenge to try to publish in the beginning of a new presidential term, especially when you don't know -- and you never do -- whether it's going to be the incumbent or someone new. in some cases it's going to be someone new no matter what. and what we did was we tried to find a book that was very practical, talking about what people needed to do to survive and thrive no matter who was in charge, and then we knew that once the election was over we would pivot one way or the
other in the positioning of the book and even in the titling of the book or the subtitling of the book, depending on who won. so if it would have been mitt romney, we would have had a book that said, well, we are in a mess and we have a chance of getting out of it, but we have a lot of work to do, and here's what we need to do. and the pivot for barack obama winning re-election is, we are in a mess and it's only going to get worse from here. >> publishing insight from marji ross, publisher and president of regnery publishing. thank you very much. >> thank you very much. nice to see you. >> now obo