tv Everybody Lies CSPAN June 17, 2017 9:00am-10:31am EDT
9:00 am
>> booktv is on twitter and facebook. we want to hear from you. tweet us or post a comment on our facebook page, facebook.com/booktv, and twitter, twitter.com/booktv. [inaudible conversations] >> welcome everybody! welcome to the american enterprise institute. thank you for joining us today. we are going to dive right in. we are going to talk about the new book everybody lies. seth will talk for about 15 to 20 minutes. after that, we will sit down and have a conversation with -
9:01 am
of columbia university and harry, who will viciously attack seth and the book he has produced. seth will get the opportunity to defend himself. then we will take some questions from the audience and follow that same procedure. all right. seth? >> thank you dan for the introduction, and for inviting me. this is the book, everybody lies. it is about five years of research i have been doing, so i will describe what it is. for the last 80 years, if you wanted to know what people want, why people did the things they do, what people are going to do, you basically had one main
9:02 am
approach. you asked them. you conduct a randomized survey. so you go out and ask people questions. and there is a main problem with this approach, which is that people tend to lie to surveys to make themselves look good. so if you ask people immediately before an election, are you planning to vote? the overwhelming majority of americans in the survey will say sure, i am going to vote in the election. they do not want to admit that they are not voting in the election. it is kind of considered socially unacceptable to not vote in the election. my favorite example is the general social survey, which asks men and women in the united states how frequently they have sex, whether it is heterosexual or homosexual, and whether they use a condom. so you can do the math on this. american women said that they have sex about once a week and use condoms 20 percent of
9:03 am
the time. that means 1.1 billion condoms used every year by american women in heterosexual encounters. you ask american men the same question and they say they use 1.6 billion condoms in heterosexual encounters. and if you think about it, these two numbers are by definition the same. so we know someone is not telling the truth. somebody is lying. so who is telling the truth, men or women? neither. according to data from nielsen, which tracks every condom sold in the united states, only 600 million condoms are sold in the united states every year. so apparently americans lie about the sex that they have. they are all lying. so we have only had surveys, but i think we now have a new tool to understand the human psyche, which is the searches people make on google. that is the stuff i have been
9:06 am
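The condom arithmetic in the talk can be checked in a few lines. This is only a sketch: the three totals are the figures quoted by the speaker, and treating annual sales as a rough ceiling on annual use is an assumption made here for illustration (it ignores free distribution and unsold stock).

```python
# Figures as quoted in the talk (condoms per year, heterosexual encounters).
REPORTED_BY_WOMEN = 1_100_000_000
REPORTED_BY_MEN = 1_600_000_000
SOLD_PER_YEAR = 600_000_000  # Nielsen sales figure as quoted

def overstatement(reported, sold):
    """Ratio of self-reported use to sales (assumption: sales roughly cap use)."""
    return reported / sold

women_ratio = overstatement(REPORTED_BY_WOMEN, SOLD_PER_YEAR)
men_ratio = overstatement(REPORTED_BY_MEN, SOLD_PER_YEAR)

# The two survey totals describe the same encounters, so they should match.
survey_gap = REPORTED_BY_MEN - REPORTED_BY_WOMEN

print(f"women report {women_ratio:.2f}x sales, men report {men_ratio:.2f}x sales")
print(f"men and women disagree by {survey_gap:,} condoms per year")
```

Under that assumption both groups report well above what is sold, which is the speaker's point that neither survey answer can be taken at face value.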
9:07 am
post-racial society back in the day. there was this idea that voters voted for obama and didn't care that obama was black. because people are so honest, and tell google things they might not tell anybody else about socially unacceptable attitudes, could you use google searches to get the real answer on the effect race played in people's voting decisions? i made a map of racist search volume, and this is the percentage of google searches that include a charged racist word, you can guess what it is. the first thing that struck me about this data is that in the time period i was using, people made these searches with the same frequency as they were searching for migraine and daily
9:08 am
show and economist. it wasn't by any stretch a fringe search. these are mostly searches for jokes mocking african-americans. that is a big theme. the other thing that struck me is it looked very different from the map i would have expected of racism. if you asked me where racism was highest against african-americans in the united states, i would have guessed racism is concentrated in the south, with a strong north and south divide. and it is high in the deep south, southern mississippi and louisiana, but you can see on the map, darker red meaning higher frequency of searches, that it is also high in western pennsylvania, industrial michigan or illinois.
9:09 am
the real divide the search data reveals is not north versus south but east versus west. it is much higher east of the mississippi river and drops substantially west of the mississippi river. because people are so honest, could you use this data to measure how much obama really lost from racism in the 2008 election? you can't just compare racist searches to the vote for obama, because it might be that places that have high racist searches would have opposed any democratic candidate in 2008, so that wouldn't be a fair comparison. i compared obama's vote total to that of the previous democratic candidate, who was ranked similarly liberal. you can read it in the book,
9:10 am
but there is a very strong, significant relationship: places that had the highest racist search volume, like this one in michigan, supported obama less than previous candidates. and you can control for education or demographics or political views or cultural views and nothing changes the relationship. that was a big factor. i concluded obama lost four percentage points from racism, which was higher than you get from other measures, where you get one to two percentage points. this paper languished in the academic world for a little bit, and then very recently the trump phenomenon started. trump was making a lot of racially charged
9:11 am
comments and people were questioning how he was doing so well even when saying things you are not supposed to say, and whether racism was driving this support. nate cohn of the new york times asked for data on the racist search volume. he had data on support for trump in the republican primary, and of all the variables he could test, whether it was age, education, or economics, the single highest correlation he could find with support for trump was the racist search volume. this doesn't mean everybody who supports trump is racist, but it does mean some of his supporters were, and it did drive his progress in the primary. there are all kinds of things you can do with this data.
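The 2008 comparison described above (change in vote share relative to the previous candidate, related to search volume while controlling for other factors) can be sketched as a small regression. Everything in this block is hypothetical: the data are synthetic, the single control (`college_share`) stands in for the several controls mentioned in the talk, and the built-in effect of -4 is chosen only so the sketch has something to recover. It is not the speaker's data or model.

```python
import random

def slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(x, y):
    """What is left of y after removing its linear relationship with x."""
    slope, intercept = slope_intercept(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def partial_slope(outcome, predictor, control):
    """Slope of outcome on predictor, holding the control fixed."""
    return slope_intercept(residuals(control, predictor),
                           residuals(control, outcome))[0]

# Synthetic media markets (hypothetical numbers, fixed seed for repeatability).
rng = random.Random(0)
TRUE_EFFECT = -4.0  # assumed points of vote change per unit of search index
search_rate, college_share, vote_change = [], [], []
for _ in range(200):
    college = rng.random()
    searches = 0.5 * rng.random() + 0.3 * (1.0 - college)  # correlated with control
    change = TRUE_EFFECT * searches + 2.0 * college + rng.gauss(0.0, 0.2)
    search_rate.append(searches)
    college_share.append(college)
    vote_change.append(change)

estimated_effect = partial_slope(vote_change, search_rate, college_share)
print(f"estimated effect of search index on vote change: {estimated_effect:.2f}")
```

Using the change from the previous candidate as the outcome, rather than the raw vote share, is what addresses the concern raised in the talk that some places would have opposed any Democratic candidate.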
9:12 am
measuring child abuse, abortion, and pretty much all very depressing topics. it is depressing and horrifying, but i put in a lot of jokes. there is a lot of value to knowing these parts of the human psyche that we don't usually talk about, so i will give one more example of the research i have done. if you remember the san bernardino terrorist attack in 2015, when two americans with muslim names shot up their coworkers at a party, as soon as the attack happened, almost within minutes, you saw a huge spike in nasty searches on muslims. the number one search for muslims after the san bernardino attack was kill
9:13 am
muslims, which was another one where it is not clear what people are looking for on google, but people do express thoughts on google. they search things like kill muslims or muslims are evil or i hate muslims, and these were getting out of control immediately after this attack. after the san bernardino attack, obama gave a speech trying to calm down this islamophobia; he kind of addressed these attitudes. it was on all the big news outlets. unlike a lot of people in this room i am an obama supporter, and i found the speech beautiful, spectacular, obama at his best, and he
9:14 am
gave a moving sermon about the responsibility of all americans to not give in to fear, to appeal more to freedom, about how it is our responsibility to treat everybody the same no matter their religion. it was a moving speech, and all the traditional sources really loved this speech and gave it great reviews. the new york times, the la times, and other news organizations said great job, he hit this one out of the park as far as explaining to people why they should not give in to islamophobia. you can break down google data minute by minute. i decided to see what happened to these searches for kill muslims and i hate muslims, these angry searches, as obama was speaking: how do they compare before the speech, during the speech, after the speech. i did the comparison and found
9:15 am
not only did these searches not drop as obama hoped, they didn't even stay the same. everything obama was saying totally backfired as far as calming this angry mob. this is surprising, but there was one line obama did give that had a different response, which is where obama says we have to remember that muslim americans are our friends and neighbors, our sports heroes, and the men and women who will die for our country. there was a huge spike in interest: the top descriptor of muslims on google was not muslim terrorists or muslim refugees but muslim athletes, followed by muslim soldiers, and these searches stayed up for a week afterwards.
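The before, during, and after comparison described above can be sketched as follows. The per-minute counts here are invented purely to show the mechanics, and the function name and window boundaries are assumptions for illustration; they are not the speaker's data or method.

```python
from statistics import mean

def window_means(counts, speech_start, speech_end):
    """Average searches per minute before, during, and after a speech.

    counts is a list of (minute, searches) pairs; the speech occupies
    minutes speech_start (inclusive) to speech_end (exclusive).
    """
    before = [c for m, c in counts if m < speech_start]
    during = [c for m, c in counts if speech_start <= m < speech_end]
    after = [c for m, c in counts if m >= speech_end]
    return {"before": mean(before), "during": mean(during), "after": mean(after)}

# Made-up per-minute counts for one search phrase (illustration only).
minute_counts = list(enumerate([10, 12, 11, 13, 15, 16, 17, 18, 19, 17]))
result = window_means(minute_counts, speech_start=4, speech_end=7)
print(result)
```

A real comparison would also need a baseline for the same time of day on other days, since search volume changes over an evening regardless of any speech.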
9:16 am
you can compare: most of the lines about responsibility were a lecture, a sermon, and didn't tell anybody anything they didn't already know. you compare that to the line about athletes and military heroes, which was provoking curiosity and giving new information. i wrote up this analysis in the new york times, and two weeks later obama gave a speech about islamophobia in a baltimore mosque and stopped with all the lecturing, didn't talk about what anybody had to do, and relied on the curiosity strategy. he talked about muslim americans, how they were farmers
9:17 am
and merchants, about a copy of the koran, and how muslim americans built skyscrapers of chicago, and this speech got a lot of attention. and in searches after this speech, there was a drop in searches for kill muslims and i hate muslims. those are just two speeches on islamophobia, but the data suggests that you can turn something seemingly unpredictable, calming an angry mob, into something like a science, though more research has to be done. these people are not necessarily
9:18 am
9:19 am
9:20 am
appreciate and deeply misguided -- >> some people will take lessons from seth's presentation and learn the science of stirring up an angry mob, so maybe it is too bad to have done that. too late to test that any further. i came here a little early and went to the museum of american history, where they had an old directory from the 1800s, an address directory from philadelphia, and for the men in the directory it had their profession: a captain, a shop, cooper, agent, a gentleman, gardener, attorney and alderman, gunsmith. other occupations and businesses were listed as shoemaker's
9:21 am
tool, baker, turpentine distiller and a few others. it made me realize there used to be a lot of available data, and everybody used to know everybody. in some sense, what seth is saying about this source of data and how much we can learn, part of that is the feeling that we need to learn from such data because people are harder to know about. 100 years ago or 200 years ago, you didn't need a lot of polling to know how many people you could get out to the polls next month. it is good to have a historical perspective about this. regarding everybody lies i had a couple thoughts. as a person who does use a lot of opinion polls
9:22 am
i am impressed at how honest people are. when you ask what people are doing, at best you won't get it completely right. 60% of americans vote, and about 70% say they plan to vote or, after the election, say they did; a few more say they did than actually do. that is not that far off. when you ask people who they plan to vote for, they are very accurate. the polls had hillary clinton at 52% of the two-party vote, and she got 51%. the polls were off in some states, but i do not think the evidence points to people lying to
9:23 am
pollsters; it points to differential nonresponse. the clue to why people are honest in polls is their motivations. why should you respond? it is ridiculous: people are trying to make money off of me and i spend 20 minutes answering somebody else's questions. but if you answer a poll you might as well be honest. the whole point of answering the political poll is to say i support her. it was a little different in the era of gallup polls, when not that many people were surveyed. if you were surveyed by gallup, you would be one of 1500 americans, and when the newspapers the next day said 50% of americans, you
9:24 am
would have a big impact and it was rational to respond. now not so much, but you might as well tell the truth. as for figuring out how many balloons you use when having sex or whatever, people might be misremembering the past year. and if people aren't sincere, what is the true self? is the true self the person who googles? is that who you really are? i don't know about that. googling racial jokes is not the truth either, just another aspect of who you are. the really interesting question is what is the incentive process
9:25 am
to tell the truth? when someone gives a talk that says everybody lies it raises a certain paradox, and seth does have an incentive to get things right. if he gets things wrong, people like me will point it out. he has set the goal of discovery and would like to learn things. i'm interested in the data journalism. it is important -- we have three data journalists on the panel right here. data journalism is playing a large role in the life of our society. we have had a lot of discussion of whether we can trust science: just because something is published in the
9:26 am
top journal, should you believe it? a journal like science or nature is like a brand name. we see this in data journalism. the new york times has to be careful. when they make mistakes they tend to correct themselves. other sites are just trying to get hits. malcolm gladwell makes a lot of mistakes. i don't know that he has an incentive to get things right. apparently he doesn't have such an incentive to get things right. i think he may have an incentive not to admit he got anything wrong. i have made enough mistakes in my career that i am already wet, so i don't mind being dunked in the tank one more time. it is good to make mistakes first. we had a saying about politicians in the clinton era: it would be great if they could first send a politician to prison and then let them out to run the country, because first, we wouldn't have the suspense of when they
9:27 am
are going to get caught, and they would have more sympathy for the ordinary person, having been in prison for a while. it should be a requirement for every data journalist to make some big mistakes to get it out of the system, so maybe seth will say those things in his second book. >> what do you think? the value of organically generated data, and -- how much did you lie when you were standing there? >> i am a compulsively honest person and the only reason -- i should call it "everybody lies: big data, new data, and what the internet reveals about who we really are" except for me. with the voting, they compare individuals who said
9:28 am
they voted to their actual voting behavior, and people exaggerate their voting. a huge problem with surveys is that people give weird answers; they answer randomly. some people who voted say they didn't vote, which is really bizarre. that played a role, and we talked about this before. random answering plays a not trivial role. and we have a definitional difference on what constitutes lying. andrew, i think you think you have to be consciously aware you are
9:29 am
deliberately misleading a survey, and if you ask people why they did things, they are frequently not consciously aware of the reasons. that is one of the areas they are not very good at. among other things, people are not good at predicting the future, they are overoptimistic, they are not good at explaining the reasons they did things, and people are not so honest when admitting their desires. so this one area of who you will vote for in an election has been studied over and over again by polls, and they are okay or are having new problems, but there are definitely a lot of areas where surveys will become a smaller
9:30 am
part of our understanding of the human psyche relative to the sources we talk about in the book. >> what do you think? tell us what you see. >> the book was very well written. this is material that can be difficult to grasp if you are not familiar with statistics, and this was very easy to understand. you open with a story about your grandmother, if i am not mistaken, and who you should meet, who would be a good person to meet. that was a very good opening. ..
9:31 am
i don't think we are at the point yet, at least on the political side or on anything that has to do with understanding the percentage of americans who believe x, y and z, where we can base that off of google searches. so i do not want anyone to think that is the case. i don't say we are making that case, but there is certainly something going on. one of the things i want to ask
9:32 am
you, you know, i love that 2008 map, the change from kerry to obama. then you have donald trump. it also seemed to me that the map was somewhat, i'm not familiar if you've done this yet, correlated with how the vote changed from obama to clinton. it seems to me there was a correlation where clinton basically did even worse in a number of areas where obama was already trending badly. it doesn't mean that it is not because people were racist. they could be racist and that is why they were changing their minds, but rather than whether they were racist against obama, it could simply be the fact that now racial views are increasingly correlated with how people vote, whether it is for the democratic or republican party. i think that is another thing we have to keep in mind. we always talk about correlation versus causation. i think that continues to be the case. in terms of using google as the truth serum, i google a lot of strange things, to be perfectly honest with you!
9:33 am
i mean, who was on the cast of who's the boss? i mean, you know, these are the strange things. does that mean i'm the biggest fan of who's the boss? no, maybe i was interested for a particular reason. so again, i think google on its own can give us a keen understanding of what is interesting to americans, but not necessarily a good understanding of, okay, why are they searching that? and then the final thing i will say, which i think i emailed you about, was that the book is not only about politics. there is a specific thing about sports and being a mets fan versus a yankees fan. with facebook data? yes, it was used for that. i think this is the type of thing where in fact you can use surveys that actually replicate the findings that you see on facebook. for instance, i had done my own research which showed that in new york there are obviously two baseball teams, the new york mets and the new york yankees. what surveys have indicated was
9:34 am
that with the mets, there will be more fans when they do better, and when they do worse there are fewer mets fans. we are finding what was put in the book along the same lines, that when people are, what is it, eight years old? when people are eight years old and a particular baseball team is doing well, all of a sudden there are a lot more fans of that particular franchise. i think that is the type of thing that can also be very interesting and cool to use, because although polls are getting cheaper with online surveys, people do not necessarily know which questions to ask or do not necessarily have the ability to go out and conduct a survey. using this type of data can help confirm poll data, or perhaps produce a finding on something more trivial that may in fact be interesting to people. >> do you agree with that? >> not totally. in that example, you can measure basically how much a team winning a championship in every year of your
9:35 am
childhood affects whether you are a fan as an adult. so you will see it in people born in 1978 and 1961: all of these men were eight-year-old boys when the mets won a championship. and you can see this across a whole bunch of teams. obviously it is not perfect, but when a team wins when a kid is eight years old, it has a lot more fans. but the point of the study is the value of the data. i don't think you can replicate this in a survey. in a survey you can see the broad change over time. but to really see that type of subtle pattern, you need data on how many mets fans are there
9:36 am
among men born in 1978 and how many mets fans are there among men born in 1979 and 1980. you need this for every single team. and you can never see that in a survey with 1000 or 2000 people. with these big data sources you can really zoom in on tiny populations and still have a big sample. so they have samples for any year and any team combination. they are going to have a sample for that where surveys will not. >> basically facebook covers everybody, as you were saying. in the book, you cleverly work in some papers based on the -- that is more like the census of 1810, in a way, than the internet, the sort of organically generated data. i can see how, if you have the full population of people you're interested in, you can do these things. it is very different from the google data. we have no way really to look
9:37 am
at subgroups except at some geographical level, and we have no real baselines. so we can compare things over time, but as we were saying, it is hard to find what percentage of people are actually going to vote for donald trump based on searches, because we don't have a way to know what absolute levels mean, as opposed to relative levels, which are more meaningful. those are two very separate types of data. >> i just wanted to insert that there are questions that we demand less or more precision from. it is actually not hard. you can actually use just crude search data to predict elections pretty closely. you can get it within five or 10 percent just by looking at who people are searching for. on the other hand, predicting an election within five or 10
9:38 am
percent is not going to be that helpful. if you want to know what proportion of americans are, you know, using balloons when they're having sex or whatever, then being off by 10 percent is not such a big deal. so a lot of the success of seth's work, i think, is that he's getting away from answering the question everyone else is asking. the searches are not a representative sample, but so what if this is off by 10 or 20 percent? a lot of questions are like that. so you have to ask questions where you are not demanding that level of precision. >> i totally agree with that. you might know one of the most famous examples of using google searches, google flu. they tried to predict the rate of flu in the united states in a given week based on people making searches for cough or
9:39 am
flu in a week. it kind of blew up a little bit. one of the problems with google flu was that our flu models are really really good. you can get really really close by just assuming that the flu is going to be the same as it was previously. so you already have .95 or so. google data is always going to be, as andrew says, and i totally agree, somewhat noisy. it will never be perfect. it will probably get better as we learn to weight it. it is really a question of whether it can beat a model that has been well-developed for many years. but i think there are lots of areas where it can help. i always say google constipation makes more sense than google flu, because you would have no
9:40 am
information about what's going on in the united states. then even with the noise of the google search data, your alternative would still be kind of nil or close to nil. >> are the cdc people doing that? do you know? are there people that have time on their hands who may be doing google constipation? i'm wondering if you might know. >> i think after the google flu thing came out there was a lot of excitement. they started doing things, but after google flu everyone kind of lost interest a little bit. i think it lost a lot of momentum. >> i think one of the things that will be interesting as we go forward, as i hinted in my last answer, is that the survey is becoming less and less expensive, at least on an entry ground level. everyone can conduct a survey. i'm going to be interested to see how google searches along
9:41 am
with surveymonkey surveys, how all of this, quote, big data or small data that is cheap can be used together to figure out things that are going on. right now most of the surveys we are conducting are a larger survey or telephone survey. that, as everyone knows, is very expensive. as costs get down and people form their own surveys, it is not just an expert like you who understands everything that is going on. that is why i think we can get interesting things going, because at the end of the day, what's most interesting is not just what we can find once we determine what we want to find, but coming up with the questions in the first place, which is exactly what you are saying: the questions are most important. what is it right now that we are not asking that this data can get at for us? i'm just interested, i do not know the answer. >> what is the answer to that?
9:42 am
do you know? where do you see a role for this kind of - >> i have done a lot of research on anxiety, and researched where anxiety is highest in the us. i thought it was going to be new york city and urban intellectuals. but that is not true at all. it is higher in maine and in kentucky and generally rural areas with low education. this is actually not specific to search data; you would see it in surveys as well. but then because you have this rich data you can actually see how anxiety changes over time. so for example, when do people search for panic attacks? like 3:00 a.m. but now we basically know through this data how many people are having panic attacks in new york city on any given tuesday
9:43 am
evening, and how many people are having panic attacks in washington on tuesday evening, and how many people in boston. and we can say, what happened in the day? what happened leading up to that that caused this? is it just random that every day some people have a panic attack when there's nothing to it? or was there a clear event a day or two before? -- i think there are certain situations where it is not true, but i think if you wake up with one you are probably likely to go to google. i think we have to break some of this data down, because with health conditions, some people have criticized my book because they said, i search all of these things because i'm a doctor, or i am a researcher. i search things but i do not have them. and that is probably true. i think in general researcher searches are a small percentage that
9:44 am
does not overwhelm the data. but then you can really dig down. i don't think too many researchers at 3:00 a.m. are searching for that. i think that's a pretty good indication that someone is searching for themselves. sometimes the data is blunt, but it could get a lot better. if we got better we could really take out the people who are searching for other reasons. >> i think that is fair. i just want to go back to this: you can predict the presidential election within 10 points, pretty inaccurately, based on search data. and there are things like google and facebook that have much more opportunity to link data within person and over time. what kind of use would you see for that kind of data? what would you associate with those databases existing? because it is a very different level of quality of data than a single search may be.
9:45 am
>> one of the studies i focus on is a study by microsoft. they studied pancreatic cancer. they used data that is anonymous but identified over time, so they know that the same users are making searches over many months. >> you are saying they know who is searching but they are not looking at the actual name. >> yes, there are links. then they said okay, someone probably has pancreatic cancer if they make a search like just diagnosed with pancreatic cancer. if you get a diagnosis like that you might go to a search engine. and then they said okay, so here are people, you know, anonymous people, that we know just got diagnosed with pancreatic cancer in this month. and then here are people who are similar to them and never got
9:46 am
pancreatic cancer. they have these users over time, and they say, what were they searching in the months leading up to the diagnosis? what symptoms were they searching? so what searches would predict pancreatic cancer? and they found very subtle things, like indigestion. if you are undergoing me will go home and - but that is a really really subtle pattern of symptoms. and i spoke to researchers. doctors do not know this; our system of diagnosing diseases is not as sophisticated as having a huge group of people and following them over a time period. so i think that is a revolutionary type of medicine. and with pancreatic cancer, if you are told that you have pancreatic cancer, the earlier you are told it can
9:47 am
dramatically improve your odds of survival. the ethical question is then what should happen if the search engine figures out that you may have pancreatic cancer, that you have, say, a 25 percent or 50 percent chance. and i think you can - if i have a pattern of symptoms that is a risk factor for a disease that can be cured, i want to know that. >> in your case it sounds like you're inflicting most of the symptoms on yourself by reading about them and - obviously. but this method that you talk about in a variety of, you know, settings, sort of matching people to the most similar cases, both on facebook and twitter and health records, why is that not - especially
9:48 am
with healthcare it seems like an important innovation. why do you think those methods are not coming through as well as they could? >> with the data, a lot of it is not linked together. it is kind of all over the place and there is not a huge incentive for people to put it together. a lot of people have been trying to go around the official healthcare organizations and stuff to try and collect their own data. >> i think this is one of the things people should keep in mind if they read about this. there are a lot of stories about things like baseball and politics, mostly because, i think, it is not that there are no implications outside of those. >> those are the three topics i am most interested in. >> i see! so you guys use search data when you try to make predictions. >> sort of.
9:49 am
>> is that the main source of this type of data that you use? where do you see these things going? where have you guys moved beyond survey data? >> sure, there are no trade secrets going on. you all have access to the same google, unless you are part of a secret club i'm not aware of. the way that we use google, at least one of the ways, i will give you an example: figuring out, for instance, what the effect of wikileaks was during the course of the campaign. did it in fact harm hillary clinton? what i found using the data was that in fact there does seem to be some correlation between when people decided that they were going to vote for donald trump and when searches for wikileaks in fact jumped up. if anything, if you look at this, it might be a stronger correlation than with when james comey sent his
9:50 am
letter over to congress. and i do not think that wikileaks really got asked about all that much after the election. was that the big effect? i think a lot of people concentrated more on james comey. the google data suggested otherwise. that goes back again to the question of, were americans being asked about wikileaks in the polling? perhaps not as much as you might have thought. and also, one of the things i think is sometimes more difficult to get out of survey data is why people are voting the way that they are. polls are very very good at predicting who's going to win and who will lose, regardless of what some people might argue. but what they are not as good at is assigning reasons for why it is occurring. people often give weird answers. oh, i was thinking that the entire time! i was certain. oh, that had no effect!
9:51 am
oh blah, blah, blah. so whether something did or didn't change was in direct contradiction of the reasons they gave. and again, i think this goes back to the question that both of you were sort of tangling with. i don't think it is necessarily that people are lying; it is that people honestly don't know what it is that pulls them in one direction or another. perhaps it was finally this one thing that put them over the top. i think that is one example of where we are using google data. again, i mentioned this before, the other place where we are using google data is tracking changing minds in real time. in general elections most people make up their minds a few months before the election. very few people are going to change. but in primaries people are much more likely to change their minds, and that is where google trends data is specifically useful. the other thing is that there are a lot of races in which we do not have polling data. we're talking about presidential elections right now, and there were how many polls, 15,000?
9:52 am
it was ridiculous! you could beneful is it anything want. i'm sure there was one that has edwin johnson or other people. but there were many fewer polls for races of the house, senate primaries, make a very good idea. i've not seen any large-scale studies on this. but i'm very interested to see, as more and more people are turning to the internet, whether or not these google trends can be applicable to understanding who has momentum in a senate or house primary. someone like myself might otherwise be like, i have no clue who is going to win. and maybe it was so-and-so who has a particular chance. for me, those are most of the applications in politics. again, as he said, it is just such a small part of the book. to me, there are many more applications and places where
9:53 am
perhaps polling isn't as vast, whether it be sports or health or maybe a topic that is interesting to me. i'm a huge fan of entertainment; i think we all are, even if we otherwise argue that we are not. there was this whole story about brad pitt appearing on a magazine cover this past weekend, and in the story, you know, brad pitt seems a little sad, but i would be interested in seeing whether or not people read that story the same way that the mainstream media, those members of the press, read it. i think in this particular case we can get outside, and i think one of the lessons of the presidential election is that we live in this bubble, where we are all in washington d.c. you cannot get any more elitist than -- i'm particularly interested in effectively using this google trends data to figure out directions where the public is thinking that we might not think of and would never ask a question about.
9:54 am
>> as opposed to the current situation where you only know how bill kristol is doing. >> bill kristol has a good sense of humor, i will give him that. >> understood was a literature which linus arose going to be my engines and just followed ted cruz's campaign chief. i said that i looked that he probably knows where the train will end up going even on their is no sign up. and it was exactly right. >> i am going to open this up to the audience. seth, do you want to say something before this gets guided in other directions? >> any and all questions are welcome. please wait for the microphone and introduce yourself. ideally, questions end with a question mark.
9:55 am
>> hello. my name is julia abraham. i just came out of general interest as a member of the public. i wanted to know, surely there are commercial companies who are developing methods for going through all of this data because they want to know what things to sell. you know, they have a strong motivation to make money by tracking people's behavior. are there methods that the commercial entities are developing that you all can use to deal with politics and psychology? and other things that may be commercial interests. >> i think your former employer is particularly interested in this. you can go ahead.
9:56 am
>> there are definitely a lot of uses in marketing. i think marketing is one area where, i've been talking to people, they're stopping using surveys to ask people what products they're going to buy. and they said they don't correlate very strongly with what products they actually buy. so i don't know why that is the case if they correlate with voting. >> i think this gets back to one of the competing sources of information. in an environment where you are blind to who is buying what, then maybe a survey will -- amazon now, of course, they know so much about who buys what. they know what i buy and so forth. so that creates less of a value. you can just look directly at what people are buying. you don't need to ask them. >> or what they're going to buy. >> it is always going to be harder to know what people are going to do.
9:57 am
i think there is good data from amazon and other companies. >> of course, most of the ads you see on the internet are based on all of your previous actions that google and other organizations are aware of. say you are in gmail and you see an ad; the ad you see will be determined partially by the content of your emails and things you have searched for, to some extent. and so that, i think, is the most obviously commercial application. >> i talk about this as well. one thing that they do is rapid experimentation. they basically, i think for now does more experiments than - does in the entire year. they can very quickly divide the users into a treatment and control group. and they can show different
9:58 am
groups different versions of this. so they can change the type or change the wording and see which one gets people to be on the site more or click more or do whatever they want more. i think the question of how we as researchers can utilize these tools is something that is underutilized, and in academia, you know, with experiments in academia, you recruit a bunch of people over a long period of time, and it takes a month to set up, and you do one or two small experiments with 30 people. and i think where academia can move is they can use some of these tools to run a thousand experiments and say this is what we found, and i think it would make - but i have not seen it done at all. >> let's go over here to the middle of the table. jonathan.
9:59 am
a little more skeptical. >> hi, i am jonathan, just here for my general edification. i'm curious. you talk about weighting google data. how is it biased? obviously there's going to be a class and age bias built in. when you're looking at a google trend for the u.s., are there any unusual ways we perhaps do not think of that it is biased? again, not representative of the general population. >> there are a lot of random situations that you just find more and more. one of the things i found is that i was doing all this research using trend data and they frequently go with trend data so who searches for what.
10:00 am
but if you do these over and over again, one thing i found is that d.c. tends to be an outlier. and i think one of the reasons is that the residents of d.c. are different from the people that make searches in d.c. they commute to d.c. and make searches, and it is one of the issues that come up in a data source like this that i would not have thought about beforehand. but it does come up, so there really are a lot of issues. i think it is unfortunate that initially there hasn't been too much methodological research on this, because i think a lot of survey people are skeptical of the data source, but it is starting to change. they just did a beautiful study on using google search data. they have these hard-core methodologists that are much more detail oriented than i am and have pages and pages of things with data. so i'm hoping more and more
10:01 am
people, methodologists, start studying this to find a lot of these issues and ways that you can weight it and make it better. >> i was going to say, as you are basically suggesting, on the democratic primary side you found the data to be pretty much useless because bernie sanders was like 9000 trillion searches ahead of hillary clinton. >> hi, bill - former employer and co-author with seth. i am biased. one comment and one question. i thought this was an incredible book, very well written and interesting, really funny, and really insightful. the question is, baseball and politics are interesting and fine, but i think the most interesting part of the book was the stuff on child abuse.
10:02 am
because kids cannot just talk about it. there is sort of a sense that they are going on google and saying, why does my daddy hit me? and it just seems like an incredibly powerful mechanism, something you obviously cannot get from a survey. you mentioned in the book that you are working or consulting with some governments on that. i would be interested to hear more about what is happening on that. >> yeah, there is a disturbing element in searches. and one is that kids, obviously not younger kids who don't speak or use the internet, but older kids, do make searches like my dad hit me or my mom beat me. like really frightening stuff. you know, it is obviously horrifying, but i think one of the things i also found using the data is that during the great
10:03 am
recession there was a big drop in officially reported cases of child abuse, which is surprising. you would think, everything you know about child abuse is that when people are out of work, that is a big risk factor for child abuse. but you did see at the same time a rise in searches, a disturbing rise in searches by kids and for child abuse. and i argue, with a lot of data, that i think it just became harder to report child abuse. everyone was overworked and child protective service people were on hold forever. so it is disturbing that you see kids making searches and the official data, it was really disturbing. i, yeah - i'm still kind of talking to them, and it moves a little slowly, and i'm not exactly sure how to use the data, but it is a continuing conversation with some of the
10:04 am
organizations and how they incorporate this data. i do not know exactly where it goes or will go. but i think it is really important, and it is an area where obviously i think the official data can be misleading. you will not get a survey asking kids this. so i do hope that we will incorporate some of this data. >> in a similar area, there are things where data comes in more slowly than we would like, such as drug addiction data. i think there should be searches that help figure out where opiate overdose - i think it could be pretty useful. let's go over here. i am trying to make you run around as much as possible. >> good afternoon, i am todd wiggins. i am enjoying the presentation. i would like to ask you hypothetically what you see
10:05 am
coming down the road in 10 years. i predict that mind reading is going to be an ever-growing industry. it seems to have developed in the last 20 years from couponing, and you talk about advertising, to modern couponing, which is essentially what we are doing with search engines. and intently as we want to know more about how people think nine aspects of advertising the finesse of security or immigration so where do you see yourself in 10 years, so i know what to invest in right now as far as stock is concerned. [laughter] >> i think what you're getting at is that there is a scary element to some of this data. we can talk about the fun things or important things like child abuse, but the other area is that companies can potentially use the data to take advantage of people, and one study i talk about in the book is a study by columbia professors of peer-to-peer sites, and you
10:06 am
apply for a loan and either get it or don't get it, and they have data on everybody who applied for loans, what they said on the loan application, and whether they paid back the loan. and they found you could predict whether you would pay back the loan based on the words that you use in the loan application,
10:07 am
and some of the relationships were just kind of weird, like god was one of the biggest predictors. using god is a big predictor
10:08 am
>> the same on the government side, i think. i think in the book you mention the movie "minority report." i don't know if you have seen it recently. they had immense processing capabilities in the movie and do calculations rapidly to predict murder, but they don't have the cloud, so they are constantly carrying these really large hard drives around. most of what -- i think tom cruise -- obviously governments will be tempted to gain access to this very detailed data on citizens and noncitizens and use it to make forecasts, and i don't think there's a of how to predict that. in the u.s. we long relied on systems not communicating with each other to keep the government from using large data sets. in europe there's much more leeway for governments and systems to communicate better. of course, conversely, europe has had much stronger restrictions on what corporations can do with data and much stronger privacy
10:09 am
protections. we'll see how those two will evolve. let's go to the left against the wall. >> following up on your comment about europe. you mentioned that -- of course americans tend to be very open on google in terms of expressing honest opinions. other countries, of course in europe, have a much greater sense of privacy; in asia, of course in japan, you have a certain sense of individual distance, and in china you have the great firewall. what kind of -- how would you characterize responses in countries like china, japan, korea and germany? >> i haven't done as much research in other countries, because i don't know any other language, so it becomes challenging. >> a real american.
10:10 am
>> it becomes challenging to do it, so i don't know as much. i think it is interesting. the kind of premise of the research in the book, that everyone is really honest on google, may not stay like that. someone did a paper comparing people's searches before and after snowden's revelations. they measured whether there was a lower embarrassing-search rate when snowden's revelations were made. they found there was a drop. it could be that in the future, people make fewer of these searches. that's part of the paper. the categorizing -- >> increase in the number of searches for snowden? >> probably. but that probably -- the category is embarrassing searches, and they started by asking people how embarrassing is this search, and most of them were
10:11 am
child porn or really, real -- like herpes, like you'd expect, and one on there that made the list is nickelback. and that also dropped apparently after snowden -- [laughter] >> let's go over here and -- >> i was just wondering -- i haven't read the book, so i might be exposing myself for not finishing it, but did you look into the -- along the lines of pancreatic cancer, what do terrorists search for right before they commit the crime and -- everybody who does a mass shooting searches this but not everybody who searches this does a mass shooting. >> the government may be doing that. i don't know. i don't have access to individual-level searches. i'm a little bit skeptical. just looking at the absolute
10:12 am
number, i think people make a lot of horrible searches, more than you expect. it may just be that the false positive -- like the false negative ratio is higher than we expect. people have bad thoughts more than we thought, and that, like, the government really shouldn't be intervening, not just because of legal and privacy reasons but just because of data science reasons, that a lot of people are in the same group and look horrible and never go through with it. >> i think this relates to the point of questioning whether what someone does that is most embarrassing is their truest self. you could imagine a super google of the future that won't even need you to search. it just tracks your thoughts at all times, and then you say, well, then what is your true self? whatever thought you have the
10:13 am
most frequently. might have to do with, like, going to the bathroom or whatever, because every once in a while you have to go. i don't question a lot of what you found, but i again -- getting back to the title of people lying, this idea there's a truth out there which is what people are really like. i think we have to watch out about sort of identifying truth with the latest technology. >> i mean, i agree there's different -- this actually came up with tim wu, a colleague of yours at columbia. he said that, like, he -- he also brought up this point. what is the real you? and he is like, one study is you compare searches for gay porn and people who say they're gay, and that in states where it's hard to be
10:14 am
gay, like mississippi or tennessee, there are a lot fewer people who say they are gay but almost as many gay porn searches. so it seems like a lot of men, in my reading, in places where it's hard to be gay, make gay porn searches. they might be married to a woman or tell facebook or surveys they're not gay. my assumption is, that means they're gay, and tim is like, well, does that mean they're gay? maybe they're not gay. i just thought it was obvious in that case that means they're gay, but i see tim's point. >> well, there's something called the tyranny of measurement or labeling. it's been pointed out that having the word gay or the expression gay sort of changes things, just the way people can think of themselves: it's not something i do, it's something i am. but also, we talked about this earlier, your example in the
10:15 am
book about the efficacy of different schools, and you compare them based on how effective they are in raising kids' test scores or getting them into top colleges, and those are measurable things, yet people go to school for other reasons than to get into top colleges and to get higher test scores. so sometimes it's super clear. some of the clearest examples are these medical examples, because the outcome is kind of clear. this is what you want. or an election: who is going to win the election. other cases, like obama's speech or the outcome of a school, are tougher because it's not always clear what you can measure. of course as a statistician i don't say don't measure things, but we have to be wary of falling in love with what we found. >> the thing about the data explosion is we can tell richer stories and more complex stories and get many more measurements. i don't think the availability of google searches or facebook or twitter will mean we now have fewer things to measure. we now have way more things to measure. i fully agree with you, you have to be careful, not just one measurement.
10:16 am
>> i guess also what you're saying is we have the ability to measure things that are perhaps a smaller part of the population. no matter what state you are in, most men are straight or prefer to have sex with women than with men, but we can now home in on that population, that small -- we can study that in a way that perhaps a larger survey couldn't, where you poll people and get maybe 10 or 15 or 20 people who match that description and there was no real way to tell x, y or z about them, but now we have a real way to get at these specific things. >> let's stay away from more specific levels in this area and go to heidi. >> hi, i'm a -- thank you for the presentation. i'm looking forward to reading the book. one observation. one is that the health
10:17 am
communication stuff is super tricky, because if we try to look at how many people are searching ebola, we might think a lot of people have ebola. and i have questions about how representative google searches are. thinking just about the number of americans who have access to the internet, 85% of americans have access to the internet, and in particular 40% of those over 65 don't use the internet. i wonder how that might skew your -- the second question is whether there's a change over time because of the increase in the amount of time people are using apps rather than google. how does that skew your data? >> i think the 15% not using the internet, that will get smaller and smaller over time. it just kind of gets to the point that the data is not perfect.
10:18 am
it's definitely not going to be 100% correlated with the population but -- >> let other people worry about that. >> yeah. i think the first thing that is striking when you first search google trends, and it is now public, is how powerful the patterns are and more -- it wouldn't have surprised anyone if google searches came out and it was all this noisy, crazy data, where the bible is searched most in, like, new york and least in mississippi. and so the data does tend to work, as best we can tell, pretty well, and will get better over time. i don't know too much about the app stuff. i haven't looked into it. but i mean -- i think changes over time are -- long-term changes can be tough to measure
10:19 am
with this search data. one of the -- one thing you see is searches for science went down over time, the percent of searches that include the word science, and some people use that as showing that americans are losing interest in science, but i think it is just that the earliest users of google were much more interested in science. long-term trends can be problematic. >> i wanted to pick up on your questions, which were kind of critical, and i think criticism is so essential to making all of this work. not criticizing seth's book -- a few months ago there was something on the internet, and somebody, who i guess maybe wrote a paper, and it was passed around, and said people are most religious in the
10:20 am
midwest, not the south, and then there was a lot of cud chewing about what that meant, and someone had used some religion data that was a mashup from several different churches, and it turned out some churches report attendance in a different way than other churches. the data were complete crap. would have been better to use google, i'm sure. what happened was, people said, these are hard numbers, let's start explaining them. and one thing that we sort of lack is a great way -- is a way to sort of engage with these claims. a way for people to take a claim and say this is kind of interesting, could be wrong, let's bounce it around. student can't die either. social science is terrible at doing this. i don't know if journalism is better or worse, but i think -- that is what i was getting at earlier, with the rise of data journalism or google science or
10:21 am
whatever you want to call it, i think this is a great opportunity for us to try to figure out how we can be skeptical without being nihilistic, and i wonder what you think about this. >> i definitely agree with you. a lot of -- most of the data in my book is public and kind of -- i think a lot of people are -- we talked -- andrew and i talked about peer review versus data journalism. i think from my experience, writing a data journalism article which -- which maybe 100,000 people read is a lot harder than writing an academic paper that five of your rivals/buddies read and attack. so i think i get e-mails from grad students and undergrads not infrequently critiquing my assumptions and
10:22 am
making sense of them. i think if we can do a data analysis that is very public, that may be a better way to move this thing forward and what they got a -- they put themselves on the line. they say they're judged afterwards by how good their predictions are. so that's kind of very powerful, and their audience is better than the other ones. >> i will say we get -- if i say something wrong i'm thrashed for it. my e-mail, my twitter feed, i don't even know what else. i don't think anyone has tried to call me yet or send me a snail mail, although a few people have asked me -- that was interesting. i agree with you. this type of -- anything that is open, that allows people to get a look underneath the hood, and where everyone has an equal
10:23 am
opportunity to do so ensures that the data process and what we claim from data is -- we're all better off for it. right? if i have my secret little data set i'm claiming i'm making x, y and z from, you have almost no way of checking it. even with the public polls, the underlying survey data is not released until six months plus, and there is no reason that anyone will give that polling data over, versus this data, which tends to be more open, and we can make the determination whether or not this correlation makes any sense, whether or not this data was looked at in the wrong way, and to me that's very important. there are a lot of people these days who have access to the internet and can write a lot of crazy things and make a lot of crazy claims, and people will believe them. and it is very important to me
10:24 am
that we're able to check those. in some cases people will believe no matter what, but with this kind of data we can check them right on the spot, and anybody who has knowledge, who wants to figure out what the truth is, can do so. >> i agree. the social sciences have severe weaknesses in speed, with review processes -- and the small numbers in basically every subarea, and i don't think the numbers are always small for a good reason. five people have written about a topic, and many more people can -- a little more transparent than it typically is. so i think that's helpful. i got phone calls a few times. i wrote about a case one time and -- all men -- >> doesn't surprise me. >> -- from massachusetts called to thank me and call me an american hero. let's go back to the left here. second table. yes. >> thank you. hi, joe schultz with scripps. one thing that came up earlier
10:25 am
was being sure to ask the right questions. another element is asking the right people, and one thing i think is true with polls, with google, is that it's not necessarily representative, and there's this almost calibration step that we need to make. i think someone mentioned the whole sanders vs. clinton google trend responses having a higher response from sanders. absolutely. but after a couple of come calm -- comcast you get results. some pollsters had pretty big misses with public polling in wisconsin and ohio. a shame. how do you go about making sure you're able to calibrate your data so you're able to kind of validate yourself and make sure you're right going forward? >> goes back to the sample.
10:26 am
>> it's a challenge with google data. with surveys you can ask people demographic questions and party identification, and it turned out, as we said during the campaign itself, that a lot of the fluctuations in the polls were clearly attributable to changes in relative nonresponse of different groups. with the google data you don't have people's demographics. >> i think -- the one thing i will say is that i think you're right, that the longer the data is around the more you can calibrate the models and know how to weight them and stuff. with the predicting election things, you can't predict an election from the volume of searches people make. they search for trump but it doesn't mean they like him. it might mean they hate him.
10:27 am
one thing i found that is reasonably correlated is that you can predict which way people are going to go based on the order in which they search the candidates. so, like, 22% of searches with the word "clinton" in them also include the word "trump," so people search for clinton trump polls or debates, but if they go clinton -- they go clinton trump polls -- compare that to state voting, they're much more likely to go clinton; if they go trump clinton, they're -- subtle indicators will be predictive, and as we do more and more elections we can weight them and calibrate them and figure out whether you can add information to the polls. >> i think we have time for one last question. a quick one. and then we'll wrap. >> hello.
10:28 am
with the possibility of artificial intelligence bots, do you think that some of these searches could be manipulated? >> that's one of the values of google search data relative to other sources, that google has the smartest people in the world figuring out who is a bot. i think a lot of searches are by bots, but google puts a lot of energy into eliminating those searches because advertisers don't want those to be included in the data. it's kind of an arms race between google and the hackers, but a bigger issue if you take a random search. >> google has the smartest bots in the world working on this. is that right? >> all right. with that i think we're done. there should be wine and cheese somewhere.
10:29 am
thank you again for coming and have a good evening. [applause] >> booktv tapes hundreds of author programs throughout the country all year long. tuesday, we are at politics and prose bookstore in washington d.c. for the recount of hitler's trial for high treason following his attempted involvement in a coup. saturday, back to hear sue mcken share her experience reporting on terrorist groups around the world. and thursday, creamer books where lawrence goldstein looks at the attack of the submarine. many of these events are open to the public and look for them to
10:30 am
air in the near future on booktv on c-span. you are watching booktv on c-span2. here is our prime time lineup. at 7:00 p.m. you will hear from former skinhead leader, author of romantic violence. then at 8:30, we sit down with author allen who offers thoughts on science and communication. and then rachel pierson describes her experience as a resident physician. at 10:00 p.m. on "after words," mike lee recalls the work of the forgotten men and women who fought against a large federal government during the founding. and david kaut mcculla presents ee