tv Everybody Lies CSPAN July 6, 2017 11:03pm-12:33am EDT
11:03 pm
don't i tell the story of my body today without apology and just explanation of this is my body and this is what it's like to be in this world in this body. [inaudible conversations] >> welcome everybody. welcome to the american enterprise institute. thank you for joining us today. we are going to dive right in. seth stephens-davidowitz is going to talk about his book "everybody lies" for 15 to 20 minutes and after that seth is going to sit down and they're going to have a conversation with andrew gelman of columbia
11:04 pm
university and harry enten of 538 and we will viciously attack seth and the book he has produced. seth will get the opportunity to defend himself and then we will take questions from the audience and follow the same procedure. seth, do you want to go first? >> thanks dan for the introduction and for inviting me to this panel. this is a book, "everybody lies", about five years of research i have been doing. i will describe what it is. for the last 80 years if you wanted to know what people want, why people did the things they did, what people are going to do
11:05 pm
you have basically one main approach. you ask them. you conduct a randomized survey, so gallup or pew or quinnipiac will go out and ask people questions, and there is a main problem with this approach which is that people tend to lie on surveys to make themselves look good. if you ask people immediately before an election are you planning to vote the overwhelming majority of americans in the survey will say sure i want to vote in the election. they don't want to admit that they are not voting in the election. it's kind of considered socially unacceptable to not vote in an election. my favorite example is the general social survey asking men and women in the united states how frequently they have sex, whether it's heterosexual sex, and whether they use a condom. you can do the math on this. american women say they have sex
11:06 pm
about once a week and use a condom 20% of the time and that means they use 1.1 billion condoms in heterosexual sexual encounters. american men, if you ask them the same question, say they use 1.6 billion condoms every year in heterosexual sexual encounters. think about it for a second: those by definition are the same so we already know that somebody is lying. who is telling the truth, men or women? neither. according to data from nielsen only 600 million condoms are sold in the united states so everyone is lying about how much sex they have. men are lying more than women. generally we have relied on surveys because that's the only thing we have had but i think we now have a new tool to understand the human psyche which is the searches people make on google. that's the stuff i've been
11:07 pm
researching for the last five years and the idea behind this is people are honest on google and tell it things that they might not tell anybody else. they will confess things to google that they wouldn't tell to friends or family members and would otherwise keep to themselves. so why are people so honest? there are a couple of reasons. one is that you have an incentive to tell the truth to google. if you are someone who doesn't usually vote in an election you don't have an incentive to tell a pollster the truth about whether you are planning to vote in the upcoming election but you do have an incentive with google, because you probably need information on voting. you may not know where your polling place is because you don't usually vote so you have to search in the weeks before the election where to vote, how to vote, polling places, and you can see clearly in the google search data that people that search and
11:08 pm
search volumes: in an area where searches for voting information are high in the weeks leading up to an election, turnout turns out to be high in that area. no matter what people say to pollsters you can see based on the searches they make whether they're actually going to vote. that is one reason that people are honest on google. it's the most intuitive and makes the most sense because you need information. the second way people tend to be more honest on google, which is one of the more surprising things i learned when i began doing this research, is that a lot of people just confess, like, full sentences to google with no obvious reason. you see searches on google for i hate my boss or i'm happy or i'm sad or i'm drunk and you kind of go why are people telling google this? i think it maybe relates to, in
11:09 pm
catholicism, the confessional. people seem to use google in big numbers to just say what is on their mind, which i was definitely not expecting at all when i started this research. it definitely caught me by surprise. so what can we learn when we look at these searches? if you remember in the 2008 presidential election all the way back then barack obama was elected president. there was a big question after this election: did race matter in the voting, did people care that obama was black when deciding whether to vote for him? this is a classic question that could be complicated by social desirability. the overwhelming majority of americans, 98 or 99% of americans say they didn't care that obama was black. that was why a lot of people
11:10 pm
concluded we lived in a post-racial society back in the day. there was an idea that voters voted for obama and they didn't care that obama was black. because people are so honest on google, because people tell google things that they might not tell anybody else about socially unacceptable attitudes, could we use google searches to get the real answer on whether race played a part in people's voting decisions? what i did is i made a map of search volume on google and this is the percent of google searches that include a charged racist word. i won't say it out loud but you can guess what it is and the first thing that struck me about this data was just how common the search was. in the time period i was using, people were making these racist searches with the same frequency as they were
11:11 pm
searching for migraines and daily show and economist. it wasn't by any stretch of the imagination a fringe search. these were mostly searches for jokes. the other thing that struck me about this is it looked very different from the map i would have expected of racism. if you had asked me where racism was highest against african-americans in the united states i would have guessed racism is predominantly concentrated in the south. if you think about the country's history, the civil war and slavery, we think of racism as having a strong north-south divide. definitely some of the places it is highest are in the deep south like southern mississippi and southern louisiana but you can see on the map, where darker red means a higher frequency of searches, it is also high in many places in the north: in western pennsylvania and eastern ohio, industrial michigan, upstate new york or l.a.
11:12 pm
i think the real divide the search data reveals is not north versus south, it's east versus west. you see that it is much higher east of the mississippi river and it drops substantially west of the mississippi river. once i had this map i wanted to see, because people are so honest here, could you use this data to measure how much obama really lost in the 2008 election. of course you can't just compare racist searches to the vote for obama because it might be that places with more racist searches would have voted the same way against any democratic candidate in 2008, so that wouldn't be a fair comparison. so i compared obama's vote total to those of previous democratic candidates such as john kerry. what you see when you do that, and you can read the paper or read it in the book, is a very
11:13 pm
strong significant relationship: places with the highest racist search volume, places like appalachia or industrial michigan, supported obama much less than previous candidates. you can start controlling for anything you would like. control for education or demographics or political views or cultural views and nothing changes the relationship. that was a big factor and overall i conclude obama lost about four percentage points from racism which is much higher than you get from any other measure. very recently when the trump phenomenon was starting, trump was saying a lot of racially
11:14 pm
charged comments and people were questioning how is he doing so well. he's saying things you are not supposed to say, and was racism driving some of this forward? nate cohn of "the new york times" asked me for the data on racist search volume. he had data on support for trump in the republican primary, and of all the variables, whether it's age or education or economics, the single highest correlation he could find with support for trump was the racist search volume. this of course does not mean that everybody who supports trump is racist but it does mean that some of his supporters were, and it tended to track some of his progress in the primary. i think there are all kinds of things you can do with this data and i talk about them in the book, whether it's predicting turnout or measuring child abuse, which you can measure with the data, or i've
11:15 pm
done data on do-it-yourself abortion, pretty much all depressing topics. this book is really depressing and horrifying but i put in a lot of jokes so you wouldn't notice. i think there is a lot of value to knowing these parts of the human psyche that aren't usually talked about so i will give one more example of the research i have done. if you remember the san bernardino terrorist attack in december of 2015 when two americans, muslim americans, shot up their co-workers at a party, right afterwards, as soon as this attack happened, almost within minutes we saw a huge spike in nasty searches about muslims. the number one search about muslims immediately after the
11:16 pm
san bernardino attack was kill muslims, which is another one where it's not clear why, but people do express these random thoughts on google so they search things like kill muslims or muslims are evil or i hate muslims and they were really getting out of control immediately after this attack. four days after the san bernardino attack barack obama gave a speech to the nation basically trying to calm down some of the islamophobia because he wanted to address these attitudes that were getting out of control. it was a nationally televised speech that was covered by all the big news outlets. unlike a lot of people in this room i'm an obama supporter, and i found the speech totally beautiful and spectacular and obama at his best.
11:17 pm
it was really moving. he talked about how it's the responsibility of all americans not to give in to fear and to appeal more to freedom and it's our responsibility to treat everybody the same no matter their religion. it was a moving speech and all the traditional sources really loved the speech and gave it great reviews, whether in "the new york times" or other organizations. great job obama, you really hit this one out of the park as far as explaining to people why they should not give in to islamophobia. you can break down google data minute by minute so i decided to look at the searches for kill muslims and i hate muslims and all these angry searches and how they compared before the speech, during the speech and after the speech.
11:18 pm
i did the comparison and found not only did these not drop as obama hoped, they skyrocketed. everything obama was saying backfired as far as calming the angry mob. so this was kind of surprising. there was one line that obama did give that seemed to have a different response, which is obama said we have to remember muslim americans are friends and neighbors. they are sports heroes and the men and women who will die for our country. as soon as obama said this you could see clearly on google this huge spike in response to this statement. for the first time the top search was muslim athletes followed by muslim soldiers. these searches stayed up for a week afterwards. i think you can kind of compare
11:19 pm
most of the lines in that speech about responsibility. they were a lecture, they were a sermon. he didn't tell anybody anything they didn't already know. and compare that to the line about athletes and military heroes. that was provoking curiosity and giving new information. so we wrote up our analysis of the speech in "the new york times" and i don't think it's crazy to think that when you write an article in "the new york times" some powerful people will see it, including people in the president's office, because two weeks later obama again gave a speech about islamophobia, this time in a baltimore mosque. he basically stopped with all the lecturing and the sermons. he didn't talk about how it was anybody's responsibility to do anything. he instead really doubled down or quadrupled down on the curiosity strategy so he talked about how muslim americans are athletes and soldiers but he also talked about farmers and
11:20 pm
merchants and he said thomas jefferson had a copy of the koran and muslim americans built the skyscrapers of chicago. the speech got a lot of attention and was also on national tv. you do see many of the searches immediately after this speech actually did go down. we saw a drop in searches for kill muslims and i hate muslims in the hours after his speech. that's just two speeches and i don't say that's the science of how you calm down islamophobia but i think it does show the power of data, that you could maybe turn something as seemingly unpredictable as how to calm an angry mob into something like a science. obviously more research has to be done but now we have minute by minute data on these people that are not necessarily, they are a small number of people.
11:21 pm
they might not be picked up by surveys and they're probably not going to come into princeton or harvard to participate in a laboratory experiment but they do make these crazy searches on google and we could use this data to potentially try to understand how to calm down an angry mob. a lot of the things we thought worked probably don't work and a lot of times we pat ourselves on the back for speeches we did which may just be backfiring. that's kind of the theme of this book. there's so much that we can now learn about people from all this data on the internet. i think now i will take the attacks from the other panelists. [applause]
11:22 pm
>> thank you, seth. we are going to have a conversation about the book. it's a relatively small part of it; a large part of the book is about the internet. so we are going to find a bunch of topics that were not necessarily in the opening. i very much enjoyed this book. i found the jokes to be a little bit corny. i wanted to first give him the opportunity to share his initial thoughts on the book and some of the things you appreciated and perhaps things that were deeply
11:23 pm
misguided. >> i think we will take some lessons from seth presentation and learn the science of stirring up an angry mob. maybe it's too bad that he happened to have done that but it's too late for that to discuss that any further. today, i came here a little bit early and went to the museum of american history where they had an old direct, telephone direct tree but it was from 1800 so i guess it was a directory from philadelphia and it had each person and for the man in the direct. have their perceptions. there was a captain, a shop lab, a gents i guess a gentleman a gardener milliner and haldeman accord winner of gunsmith to shoot some of his occupation or
11:24 pm
business is listed as shoemaker's tools, a baker and a turpentine distiller and a few others. it made me realize there used to be a lot of available data and everybody used to know everybody, so in some sense when seth is saying look, now we have this source of data that we didn't have before and look how much we can learn, maybe part of that is there's a feeling that we need to learn from such data because people are harder to know about. 100 years ago or 200 years ago maybe you didn't need a lot of polling to find out how people would vote. just ask the neighborhood precinct captain how many people they could get out to the polls next month. it's good for us to have a historical perspective about this. regarding "everybody lies" i had a couple of thoughts on this. as a person who does use a lot
11:25 pm
of opinion polls i'm actually impressed at how honest people are. i always tell people if you want to find out what people are doing the idea is to ask them. you won't get it completely right, so 60% of americans vote, maybe 70% will say they were planning to vote or that they voted after the election. certainly more people plan to vote than actually vote. it's not lying if you plan to vote and decide not to because something comes up. right after the election they ask people if they voted and a few percentage points more say they did than actually did but it's not that far off. when you ask people who they plan to vote for the polls are very accurate. hillary clinton had 52% of the two-party vote in the polls and she actually got 51%. the polls were off in some states but i do not think that the evidence
11:26 pm
points to people lying. the clue to why people are so honest in polls was given by seth when he talked about motivations. why you should respond to a survey in the first place, that i have no idea. i don't respond to surveys. someone is trying to make money off of me taking 45 minutes answering people's questions. that's silly, not going to do it, but if you are going to answer a poll you might as well be honest and in some sense the whole point of answering a political poll is to say yes i support him or i support her. in the 1950s it was all different. in the era of the gallup poll not that many people were surveyed. if you were surveyed by gallup you would be one of 1500 americans. your vote would be counted. it would be in the newspaper the next day. 51% of americans, whatever they
11:27 pm
talked about in 1951, like i think we should bomb china or whatever it was, whatever the attitude was it would have a big impact and it was very rational to respond to polls. if you're going to respond you might as well tell the truth, maybe not about how many balloons you use when you're having sex, and i think people might be misremembering too. sometimes you don't remember everything that happened in the past year. when it comes to voting i see no reason to think people aren't sincere. it's kind of interesting because there's this question of what is the true self? is the true self the person who googles, is that who you really are? i don't know about that. you don't spend 24 hours of the day googling either. who you are when you are googling racial jokes is not necessarily your true self either. it's just another aspect of who you are. the really interesting question is what is the incentive for us
11:28 pm
to tell the truth? when someone gives a talk that says everybody lies there is a certain paradoxical element to it. seth i think does have an incentive to get things right in the sense that if he gets things wrong then people like me will pick on him and of course also i think seth has the goal of discovery and would like to learn things. i'm very interested in this idea of data journalism. let's say we have three data journalists on this panel right here. data journalism is playing a large role in my personal life and you could say in the life of our society. we have had a lot of discussion recently in science about can we trust science? just because something is published in the top journals should we believe
11:29 pm
that? a journal like science or nature or lancet is like a brand name. we see this in data journalism with a site like 538 or "the new york times". when they make mistakes they tend to correct themselves. other sites are out there trying to get hits. journalists like malcolm gladwell, i don't think he lies but he makes a lot of mistakes. i don't know if he has an incentive to get things right. in fact i think he may have an incentive to never admit he got anything wrong. for me i've made enough mistakes in my career. i'm already wet so i don't mind being dunked in the tank one more time. we established this thing about politicians back in the clinton era that it would be great if they could send a politician to prison and then let them out and govern the country.
11:30 pm
11:31 pm
catalist and compared individuals who said they voted and individuals who said they didn't with their actual voting behavior. so people exaggerate their voting but now there's a huge growing problem with surveys which is people just give weird answers. they answer randomly. a lot of people who voted now say they didn't vote, which is really bizarre. i think that actually played a role, and we talked about this before, we might write a column about it, it played a not trivial role in why the polls were off in the previous election. random answering is becoming a bigger problem with polls. but andrew and i kind of have a definitional difference on what constitutes lying. so i think andrew thinks you have to be consciously aware that you are
11:32 pm
deliberately misleading a survey. i think people can lie to themselves and i think that's an important part of it. if you ask people why they did things then they will frequently not really be consciously aware of why they are doing things. there are certain areas where surveys are good. everybody wanted me to make it a book about why the polls were horrible because it would sell more books. that's one of those areas where the polls are good but there are some that they're not good at. when you ask other things people lie to themselves a lot and are over optimistic. people are not good about explaining the reasons they do things and people are not so honest with admitting some of their desires. there is one area, who you will vote for in an election, that has been studied over and over by polls.
11:33 pm
but there's definitely a lot of areas where surveys will become a smaller and smaller part of understanding the human psyche relative to some of the sources we talk about in the book. >> what do you think? do you want to comment on the book in general? what you see? >> well, a few things. i thought the book was very well written. i think oftentimes people will write about material that can be difficult to grasp for those who are not familiar with statistics. to me this book was very easy to understand. you open up with a story about your grandmother if i'm not mistaken, about who you should meet, who would be a good person to meet, with humor. i thought that was a good opening. in terms of the book and the idea of using google and figuring out our true selves, first off, as a data journalist, god help us all, we are using google more and more.
11:34 pm
we use google trends to figure out what is popular with the public, what is going on. in terms of the republican primary we had all these guys running and we were trying to figure out who was going to come up and be the main challenger to donald trump and we found that google trends had a good understanding of whether people are searching for that candidate in the final days. maybe that guy would be on his way out. it was certainly the case that we were not using google trends in isolation but within the context at large. we're not at the point, on the political side or anything that has to do necessarily with understanding the percentage of americans who believe x, y and z, where we can say that's the percentage based off of google searches. i don't want anyone to think that's the case. not that you are making that case, but certainly something is going on. one of the things i wanted to
11:35 pm
ask you is, i love that map, the 2008 map of the change from kerry to obama that you used. but it seemed to me that map was somewhat, although i'm not familiar with whether you have done this yet, correlated with how the vote changed from obama to clinton. it seemed to me there is certainly some correlation where clinton did worse in a number of areas where obama was trending badly. it doesn't mean it's not because people were racist; it could be they are racist and that's why they are changing their minds. it could be the fact that now racial views are increasingly correlated with how people vote, whether they vote for the democratic or republican party. so that's another thing we need to keep in mind. correlation is not causation. that continues to be the case. in terms of using google as a truth serum, i google a lot of strange things. who is on the cast of who's the
11:36 pm
boss. these are the strange things. does that mean all of a sudden i'm the biggest fan of who's the boss? no, maybe i'm just interested for a particular reason. i think google on its own can give us a keen understanding of what's interesting to americans but not necessarily assign a given reason why they are searching that. the final thing i will say, and i think i e-mailed you about this, is the book is not all about politics. there's a specific thing about sports and being a mets fan versus a yankees fan, and the facebook data that you use for that. this is the type of thing where you can use surveys to actually replicate the findings you see on facebook. for instance i had done my own research. in new york there are two baseball teams, the new york mets and the yankees. what surveys indicated is that
11:37 pm
when the mets do better there tend to be more mets fans and when the mets do worse there are fewer mets fans. it's something along the same lines as your finding, which is that among people who were eight years old when a particular baseball team was doing well, all of a sudden there are a lot more fans of that franchise. that's a thing that can be very interesting and cool to use this data for. conducting polls, although it's getting cheaper with online surveys, people don't necessarily know how to ask the questions or have the ability to go out and conduct a survey, and using this data can help confirm data or maybe make a finding on something that's more trivial. >> do you agree with that? >> not totally. in that example you can measure basically how much a team's winning the championship in every year
11:38 pm
of your childhood affects the probability that you like the team as an adult. for example the mets have a lot of fans who were born in 1978 and 1961 because all of these men were boys when the mets won the championship. and you can see that across teams: teams that win when a person is eight have a lot more fans among them. the point of the study is the value of the data. i don't think you can replicate this in a survey. in a survey you can maybe compare when the mets are doing better and see the broad changes over time. but to really see that subtle pattern you need data on how
11:39 pm
many mets fans there are among men born in 1978 and how many mets fans there are born in 1979 and '80. you can never see that on a survey with 1000 or 2000 people. the point is you can really zoom into tiny populations and you still have a big sample. because facebook has everybody there are samples for any year and every team. >> so that works because facebook covers everybody, is what you're saying. and in the book you refer to some papers based on the full population and obviously that's more like the census from 1810 in a way than the internet kind of organically generated data. i can see how if you have the full population of people you're interested in you can do these granular things.
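editor's sketch (not part of the broadcast): the sample-size point made here can be put in rough numbers with a few lines of python. the cohort share and fan rate below are hypothetical values chosen only for illustration, not figures from the book or from facebook, and the calculation assumes simple random sampling.

```python
import math

def cohort_standard_error(total_sample, share_in_cohort, fan_rate):
    """Standard error of the estimated fan rate within one birth-year cohort."""
    cohort_n = total_sample * share_in_cohort  # respondents born in that year
    return math.sqrt(fan_rate * (1 - fan_rate) / cohort_n)

# Hypothetical inputs, for illustration only (not from the book).
SHARE_PER_BIRTH_YEAR = 0.02  # assume one birth year is about 2% of respondents
FAN_RATE = 0.30              # assume about 30% of that cohort backs the team

survey_se = cohort_standard_error(2_000, SHARE_PER_BIRTH_YEAR, FAN_RATE)
platform_se = cohort_standard_error(2_000_000, SHARE_PER_BIRTH_YEAR, FAN_RATE)

# Approximate 95% margins of error for a single birth-year cohort.
print(f"2,000-person survey: +/- {1.96 * survey_se:.1%}")
print(f"2,000,000-user data: +/- {1.96 * platform_se:.1%}")
```

under these assumptions a 2,000-person survey leaves only about 40 respondents per birth year, so a difference of a few points between adjacent birth years would be lost in the noise, while the larger data set narrows the margin by a factor of about 30.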
11:40 pm
that's very different than the google data where we have no idea how representative searches are or how to look at subsets except at the geographical level. we have no real baseline. we can compare things over time but, as you're saying, we can't find out what percentage of people will vote for trump based on searches because we don't have a way to know what the levels mean as opposed to relative levels which are more meaningful. those are very separate types of data. >> i just want to insert, there are questions that we demand less or more precision from. it's actually not hard, you can actually use crude search data to predict elections pretty closely. you can get within five or 10% just by looking at who people are searching for. predicting the election within
11:41 pm
five or 10% is not likely to be that helpful. if you want to know what proportion of americans are using balloons when they're having sex or whatever, being off by five percent isn't a big deal. a lot of the success of seth's work is he's getting away from the trap of i want to answer the question everybody else is answering. the thing about the racist search, yes, it's not a representative sample. so what if it's off by ten or 20%? these are still huge differences. a lot of the questions are like that. it has to do with asking questions that are not demanding that level of precision sometimes. >> i agree with that. the one famous example of using google searches is google flu. they tried to predict the rate of the flu in a given week based
11:42 pm
on people making searches for runny nose or flu in that week. it kind of blew up a little bit. one of the problems of google flu was that our flu models are really good. you can get really close by just assuming that flu is going to be the same as it was in previous weeks. so you have an r-squared of .95. google data is always going to be, i totally agree, somewhat noisy. it'll probably get better as we learn how to weight it and more people study it but it's not a perfect data source for obvious reasons. so there's a question whether it can really beat a simple model that has been developed for many years. there are areas of health where i say google constipation makes
11:43 pm
more sense than google flu because no one has any information on what's going on in the united states. then even with the noise the google search data is still going to beat the information that we currently have. so i agree with andrew's point. >> about different health conditions? >> the people at the cdc, they might be doing google constipation. >> i think after the google flu thing came out there was a lot of excitement about google fever and then the flu thing had a little bit of a blowup and everybody lost a little bit. i think it lost momentum. >> one of the things that's interesting going forward is that surveys are becoming less and less expensive at least at the entry, ground level. so everybody can conduct a surveymonkey poll. i will be interested to see how
11:44 pm
google searches along with surveymonkey surveys, how all this big data that's cheap can be used together to figure out things that are going on. right now most of the surveys we are conducting are large telephone surveys and that to me is very expensive. as we get down to where people are starting to form their own surveys and it's not just messed up people like you who understand what's going on, that's how i see we could get interesting things happening. at the end of the day what's most interesting is not just what we can find once we determine what we want to find but coming up with the questions in the first place. the questions are most important. what is it right now we're not asking of this data? i don't know the answer to that. >> what is the answer?
11:45 pm
what questions are we not currently able to answer where you see a role for this kind of data? >> my obsession is anxiety. i've done a lot of research. i've been doing research on where anxiety is highest in the united states. i thought it was going to be new york city and urban intellectuals. but anxiety searches are highest in kentucky and generally in rural areas. that actually is not just in google search data but i just didn't know that. then because you have this rich data you can see how anxiety changes over time. so when do people search for panic attacks? not surprising, 3:00 a.m. now through this data we know to an approximate degree how many people are having
11:46 pm
panic attacks and when people are having panic attacks, so we can basically say what happens in that day? what happens leading up to that? is it random, that every day some people have a panic attack and there's nothing to it? >> you're assuming that people suffering a panic attack then google it. >> i think there are certain situations where that's not true. but if you wake up in a panic attack it's likely. >> that's another example, i think we have to break the data down. with health conditions a lot of people criticize my book because they say people search for this because they're a doctor or a researcher, they don't have those conditions. and that's probably true. in general the researchers are
11:47 pm
pretty small but then you can really dig down, i don't think too many researchers at 2 a.m. are researching that. so the status right now could get a lot better a lot of problems we have with it if we get better we could take out the people who are searching for other reasons. >> i just want to go back to your point, no one can predict the presidential election within ten points comfortably based on thursday. but there are places like google and facebook who have much more of an opportunity to link data within person and generally over time. what kind of uses would you see of that data? what would you associate with those databases existing? that's a very different level of the quality of the data than
11:48 pm
more aggregate. >> so one of the studies, it's not my study, is a study by microsoft and columbia researchers. they studied pancreatic cancer and they used anonymous data over time, the same users making searches over many months, they said. >> they know who's searching, they're just not looking at the actual name of the person? >> they link. then they said somebody probably has pancreatic cancer if they make a search like just diagnosed. if you get a diagnosis like that you probably turn to a search engine because it's such a big event in your life. then they said here are people who we know have pancreatic cancer this month and here are some people
11:49 pm
who never got diagnosed. and then this happens to users over time and they said what were they searching in the months leading up to the diagnosis? what symptoms were they searching? what symptoms predicted that you would have it? they found subtle patterns: indigestion followed by abdominal pain is a risk factor. indigestion by itself is not a risk factor. and if you're anything like me you're going to go home and wonder if you have indigestion. that's a really subtle pattern of symptoms. doctors don't know. like, our system of diagnosing diseases is not as sophisticated as having a huge group of people and finding which ones predict. so i think that's almost a revolutionary type of medicine. with pancreatic cancer, the earlier you're told, you can dramatically improve your odds
11:50 pm
of surviving. then the ethical question is what should happen if a search engine figures out that you have pancreatic cancer, even if it's a 20 or 50% chance, should you be told that? and i think yes. but if i had a pattern of symptoms that gave me a risk factor of a disease that could be cured i want to know about that. >> the problem in your case it sounds like you're inflicting most of the symptoms on yourself by reading about them. obviously this method you talk about in a variety of spots in the book about matching people to the most similar cases both on facebook and twitter for their health record, why is that not in the healthcare section that
11:51 pm
seems like an important innovation. why do you think those methods are not coming through as reputable as they could? >> a lot of health data is not linked together. it's all over the place and there are not huge incentives for people in health to put the data together. so people have been trying to go around the official healthcare organizations to try to collect their own data. >> one thing people should keep in mind is a lot of stories are about politics -- most have visibility. it's not that there is no application outside the realm. >> harry, so you can use google
11:52 pm
search what did you try in your predictions is that the main source of this type of data you use? where do you see these things going? when he will be on survey data? >> there are no trade secrets going on. we all have access to the same google unless you're part of a secret club that i'm not aware of. the way we use google is figuring out for instance what was the effect of wikileaks during the final month of the campaign? did that harm hillary clinton. what i have found using that data was that in fact there seems to be some correlation when people decided they were going to vote for donald trump and when searches for wikileaks jumped up. if anything, if you look at the national poll it might be a stronger correlation when james
11:53 pm
comey decided to send his letter to congress. i don't think wikileaks got past that much in terms of was that the biggest factor i think people were more concentrated on comey but the google data perhaps suggested otherwise. that goes back again to the question of were americans being asked about wikileaks in their polling, perhaps not as much as you might've thought. one of the things i think survey data is more difficult to get at is why are people voting the way they are? survey data and polls are very good at predicting who's going to win and who's going to lose regardless of what people might argue. what they're not as good at is assigning reasons why that's occurring. people give weird answers. i was thinking that the entire time. i was certain. and then you look at the polling
11:54 pm
data and it either didn't change or changed which was in direct contradiction of the reason they gave. this goes back to the question that both of you were tangling with. i don't necessarily think people are lying, people honestly don't know what pulls them in one direction or another so i think that is one example of where we are using google data. i mentioned this before. the other thing we're using google data for is tracking changing minds in real time. in general elections most people make up their mind a few months before the election. very few people will change. in fact in primaries people are much more likely to change their minds at the final second. and that's where trend data is useful. the other thing, there are many races for which we don't have polling data. we're talking about presidential
11:55 pm
elections now and how many polls came out of the presidential election, it was ridiculous. you can find a poll that said anything you wanted. i'm sure there's a poll having evan mcmullin winning the race. there are many fewer polls for lower down races. for races for senate primaries. we make it a very good idea, i've not seen large state scale studies on it. i'm interested to see, with more and more people turning to the internet, whether or not google trends can be applicable to who has momentum it a try momentum in the house senate. for somebody who might have no clue of who's going to win maybe this also was a chance. those are most of the applications in politics. it's such a small part of the book. to me there's many more applications and places where polling is not as bad.
11:56 pm
whether sports or health, whether it might be a topic i'm a huge fan of, entertainment. i think we all are. there is a story about brad pitt on a magazine cover this past week. it just seemed a little sad but i would be interested in seeing whether or not people read that story the same way that the mainstream media and the members of the press read it. that particular place we can get outside we all live in this bubble where we're all in washington, d.c. radio you cannot get anymore -- than that. i'm interested in can we in fact use google trend data to figure out certain directions the public is thinking, things we might not be thinking about.
11:57 pm
>> as opposed to the current situation where you only know how -- will do. >> has a terrific sense of humor. the fun little stories i wasn't sure which line was necessarily going to be my entrance and i followed with tyler who is ted cruz's campaign chief. that guy probably knew where the train was going to be going. >> i'm going to open up to the audience. you want to say something before you're going to get guided in another direction? all questions are welcome. please wait for the microphone. please introduce yourself. ideally questions end with a question mark.
11:58 pm
>> my name is julie. i just came out of general interest as a member of the public. i wanted to know, surely there are commercial companies who are developing methods for going through this data because they want to know things to sell, they have motivation to make money by tracking people's behavior. are there methods that commercial entities are developing that you could use to deal with politics and psychology? >> i think a former employer is heavily active in this industry.
11:59 pm
>> there is definitely a lot of use of this in marketing, i think marketing is one area where people are using surveys. they are asking people what products they're going to buy and what they say doesn't correlate strongly with what products they buy. i don't know why that's the case when they correlate with voting. >> i think this gets back to whether the competing sources of information. in an environment where you have little idea of who's buying what, maybe a survey will provide information because otherwise all you know is who's selling what, nothing else. amazon now knows so much about who buys what. so that creates lesser value. you can find out what regular people are buying, you don't need to test them. >> it's always going to be harder to know what people are going to do.
12:00 am
there's very good data from amazon and other companies. of course most of the ads you see on the internet are based on previous actions that google and organizations like that are aware of. if you're in gmail and you see an ad, the ad you see will be determined by the content of your e-mails and the things you have searched for. >> so that's the most obvious commercial application. >> one thing that tech firms do a lot is rapid experimentation. i think facebook now does more experiments in a day than the fda does in an entire year. they can very quickly divide their users into a treatment and control group.
12:01 am
they can put in different versions of the site and put in different fonts or typing or change the wording and see which one gets people to click more or do whatever they want more. i think that's how researchers are doing and it's all underutilized in academia. you recruit a bunch of people over time and it takes months to set up and you do a few small experiments over 30 or 60 people, where academia could move is they could use digital tools to run a thousand experiments and say this is what we found. i think that would make science move faster but i haven't seen it done. >> let's go over here.
12:02 am
>> hello. i'm jonathan. i'm curious, you talk about wasting google data. how is it biased? obviously there's going to be a bias built-in, but when you're looking at a google trends report, are there any unusual ways that we perhaps don't think of in which it is biased, not representative of the general population? >> there are a lot of random situations that you find when you do it more and more. one thing i found as i was doing the research and using google trend data and it goes with what people search.
12:03 am
african-americans are searched in more areas where there's high african-american populations. one thing i found is d.c. tends to be an outlier. the residents of d.c. tend to be a little different. there's issues that come up in a data source like this that i would not have thought about beforehand but it does come up. there really are a lot of issues. it's unfortunate, initially there has not been too much research on the topic because i think a lot of survey people were very skeptical of the data source. that's starting to change and there is a beautiful study done on google search data. there's hard-core methodologists who are much more detailed than i am on these types of things so i'm hoping more and more people start
12:04 am
studying the data to find these biases. >> i was just gonna say, as you're basically suggesting on the democratic primary site i found the data to be useless as bernie sanders was like 9000 searches ahead. >> i'm build it at the brookings institution and former employer and co-author with seth. i'm not lying, but i am biased. one comment to my question. i thought this was an incredible book, well-written and very interesting. very insightful. the question is, baseball, porn and politics are interesting and fun but i thought the most interesting part of the book was the stuff on child abuse.
12:05 am
kids can't just talk about it. there's a sense that they're going on google to say why does my daddy hit me and it just seems like an incredibly powerful mechanism, something you cannot get from a survey. you mentioned in the book you are working or consulting with governments on that, i'd be interested to hear more about what's happening. >> there is a disturbing element to some of the searches. then one is that obviously younger kids can't use the internet but older kids can. really depressing stuff kenny going to this point where people just type and send it to google. it's horrifying. one of the things that i think
12:06 am
you see in the data is during the great recession there was a big drop in officially reported cases of child abuse. it is surprising. you would think that everything you know about child abuse is when people are out of work it's a big risk factor. at the same time you saw the rise in these searches and i argue using data that it was harder to report when everybody's overworked and child protective service agencies are harder to get through. at the same time you see kids making searches on the official data and it was really disturbing. i'm still kinda talking to them and moving slowly. i'm not exactly sure how to use this data. but it's a continuing
12:07 am
conversation with some of the organizations about how they incorporate the data. i don't know where it will go but it is important. the initial data can be misleading. i hope we incorporate some of these. >> a somewhat similar area where data comes in more slowly than we would like is drug addiction data. there should be searches to allow you to figure out where opiate -- spike before the annual cdc data. i think for research purposes that would be pretty useful. let's go to the middle. >> i'm trying to make you run around as much as possible. >> good afternoon.
12:08 am
i really enjoyed the presentation. i wanted to ask you hypothetically what you see coming down the road in ten years, because i'm predicting that mind reading is going to be an ever-growing industry. seems to have developed to my last 20 years. talk about modern coupon he which is what we're doing with search engines. in ten more years we would want to know more about how people think not only for aspects of advertising but national security purposes or immigration. so where do you see yourself in ten years so i know what to invest in as far as stocks are concerned. >> what you're getting at is the scary elements. we can talk about the fun things and the important things, but companies use the data to take advantage of people. one study i talk about in the book is a study by columbia
12:09 am
professors of everybody who applied for loans on a loan application and also whether they paid back the loan. and you could predict whether you pay back a loan based on the words you used in the loan application. some of the relationships were eerie, like god was the biggest predictor. if you use god you're much less likely to pay back, more likely to default. like somebody giving a loan would be wise from a profit perspective to not give a loan to someone who said god bless you in their loan application which is eerie. that's not how we think of the world. basically anything that anybody does correlates to some degree
12:10 am
with everything else that they do. the clothing you wear, the things you like on facebook, the words you use, everything will predict what you might do in the future and companies can use that to make better predictions. you can have a legal framework based on this idea that there are three or four things companies will know about you and here are the things companies cannot use, they can't use religion or race, and then now there's companies that know a million things about you and a lot of them are put into machines and not paying attention to what they mean. just putting correlations in. i don't think there's a legal framework that is prepared for that. >> the same on the government side. in the book you mention the movie minority report, i don't know if you've seen it recently, they had immense
12:11 am
capabilities in the movie and rapidly predicts murders. but they don't have the cloud and they're carrying these really large hard drives around. most of what he does is he takes hard drives out of the wall. obviously governments will be sensitive to gain access to these details on citizens and use it to make forecasts. i don't think there is much of a sense of how that access should be restricted. in the u.s. we have long relied on systems not communicating with each other to keep the governments from using large data sets that way. in europe i think there is much more leeway for government
12:12 am
conversely europe has stronger restrictions on what corporations can do with data and privacy protections. let's go all the way to the left against the wall. >> i just want to follow on your comment about europe. you mentioned that america has to be very open on google in terms of expressing honest opinions. other countries of course in europe have a much greater sense of privacy. in asia and japan you have certain sense of individual distance. in china you have the great firewall. how would you characterize responses and places like china, japan and germany? >> i to find as much research and other countries because i
12:13 am
don't know any other language so it becomes challenging. so i don't know as much. i think it is interesting. the premise of the research and the book is everybody's honest on google but it's not necessarily going to stay like that. it may be that i was just starting this time where people were honest. somebody did a search about that before and after the snowden issue. and was there a change in the search rate and they found that there was indeed a drop. could be in the future people makes your searches. they had to categorize. >> they worked with the
12:14 am
mechanical turk and started asking people how embarrassing it is to search. most of them are like child porn or herpes or something that you would expect. one that made the list was nickelback. [laughter] and that also dropped. >> hello. i'm wondering if you i haven't read enough of the books. did you look at all into along the lines of the pancreatic cancer example, what do terrorists, mass shooters search before they commit the crime and then check that against what other people are searching. everybody does a mass shooting search does this and that sort of thing? >> the government may be doing
12:15 am
that, i don't have access to individual searches but i am a little bit skeptical. just looking at the numbers i think people make a lot of horrible searches more than you would expect. it may just be the false negative ratio is higher than we would expect. people have bad thoughts more than we thought and the government really shouldn't be intervening not just because of legal and privacy reasons but because of data science reasons. >> i think this relates to the point of questioning whether what someone does that's most embarrassing is their true self. i feel like we have to take you could imagine super google in the future, it won't even need you to search, it will just track your thoughts at all times.
12:16 am
in this they will what's your true self. whatever thought you had the most frequently might have to do with going to the bathroom or whatever. because every once in a while you have to go on it's on your mind. think we just have to be i don't question a lot of what you found, but this idea that there is this truth out there that what people are really like. i think we have to watch out about being identified truth with whatever the current latest technology is. >> i agree that there are different parts, this came up kim i think is a colleague of years at columbia. he said and he brought this point said was the real you
12:17 am
there's a lot of people who say that they're gay but there is almost as many gay porn searches. seems like a lot of men in my reading who are making gay porn searches but might be married to a woman or tell facebook and surveys that they're not gay. my assumption is that they're gay intends like maybe they're not gay or they don't think of themselves as gay. but i see his point. >> there's something called the tyranny of measurement or labeling. it's been pointed out just having the word gay or the expression gay changes the way people think of themselves. it's not something i do, it's something i am. but also we talked about this earlier. the example in the book about
12:18 am
the efficacy of different schools and you compare them based on how effective they are in raising kids' test scores or getting them into top colleges. those are very measurable things. people go to school for other reasons than to get into top colleges and get high test scores. sometimes it's super clear. sometimes there are medical examples because the outcome is clear. you're an election like who's going to win the election. other cases like obama's speech or the outcome of the school staff or because it's not always clear what you can measure. as a statistician i don't say don't measure things but we have to be aware of falling in love with whatever data we have. >> i agree but i think one of the good things about this data explosion is that now we can make richer stories and more complex stories and get multiple measurements that we used to have. i don't think the availability of google searches or facebook
12:19 am
or twitter -- we have fewer things to measure but i agree you have to be careful. >> also what you're saying is that we have the ability to measure things that are perhaps a small part of the population. no matter what state you're in, most men are straight or prefer to have sex with women or men. we can now get in on that specific population, and study that small group in a way that perhaps a larger survey can't, where you poll a thousand people and get ten or 15 people who match and there's no real way to tell much about them. but now we have a way to get in on. >> let's go to heidi now.
12:20 am
>> i'm heidi, fellow at the general fund. i'm looking forward to reading the book. i have one observation. one is that the health communication stuff is super tricky because if we try to look at how many people are googling ebola that talks more about their fears than their symptoms. but i have two questions about how representative google searches are. thinking about the number of americans who have access to the internet, only about 85% of americans have access and about 40% of those over 65 don't use the internet. so can you reflect on how your data on google searches might be skewed by the 15% of the population who don't use the internet? and will this change over time because of the increase of people using apps rather than google? >> i think the 15% will get
12:21 am
smaller and smaller over time, it gets to the point where the data is not perfect. it won't be a hundred% correlated with the population. >> i think the first thing that striking when you search google is how people are using it how powerful the patterns are, it would not surprise anyone if google searches came a and all this crazy data in the bible was searching most in new york so it's kind of like the data tends to work and it will get better over time. i don't know much about the app stuff, haven't looked into it.
12:22 am
i think it changes over time, long-term changes can be tough to measure. one thing you see is searches go down over time but other searches that include the word science and some people use that that americans are losing interest in science but there early -- and google were much more scientific. long-term trends can be problematic. >> i want to pick up on your questions because they were critical. i think criticism is essential to making all this work. not criticizing seth but -- a few months ago there was something on the internet, somebody wrote a paper and it got passed around
12:23 am
and said people are most religious in the midwest not the south. then there's a lot of code chewing about what that meant. it turned out someone had viewed religious data that was a mash up of several different churches. some churches had reported attendance in different ways. the data was complete crap. but what happened is people are saying this is hard numbers to start explaining them. it's one thing that we lack is a great way, a way to engage with these claims. a way for people to take a claim and say this is kind of interesting or could be wrong, let's bounce it around. science can't do it either. social science is terrible at doing that. i don't know if journalism is better worse, but with the rise of data journalism or google
12:24 am
science or whatever we want to call it, this is a great opportunity for us to figure out how we can be ethical. i wonder what you all think about this? i agree with you, think most of the data in my book is public and a lot of people are obsessed, from my experience writing a data journalism article which a lot of people read is a lot harder than writing an academic paper that five of your rivals/buddies read and attack. so i think i get e-mails from grad students and undergrads and i'm frequently critiquing my
12:25 am
assumption so if we could do a data analysis that is very public that might be a better way to move it forward. they really put themselves on the line, they say they are judged afterwards by how good their predictions are. so that's powerful. there are always better than the other ones. >> if i say something wrong i get thrashed for. my e-mail, my twitter feed, i don't even know what else. i don't think anybody is called to try me yet or send me snail mail. although i agree with you. i think this type -- anything that is open that allows people to get a look underneath the hood and everyone has an equal opportunity to do so ensures the
12:26 am
data process and what we claim from the data we are all better off for it. because if i have my secret little data set that i'm claiming in making x, y, and z from, you have almost no way of checking it even from the public polls the underlying survey data -- and there's no way of checking. versus this data which tends to be more open that we can get to quickly make the determination whether or not the correlation makes sense, whether the data was looked at the wrong way. to me that's important because there's a lot of people these days who have access to the internet and can write crazy things and claims. people believe them. it's important to me to be able to check those.
12:27 am
>> i disagree. i think they have severe weaknesses and speed in which they have a small number and basically every subarea. i don't think those areas are always small. the five people written even though there many more people to assess the quality of work certainly a little bit more transparent than it is. i think that's helpful. i got phone calls, i wrote about the flaky one time. relatively old men from massachusetts. >> let's go to the left,.
12:28 am
>> thank you. joe scholz with scripts. one thing that came out earlier in the presentation was asking the right questions. another element that i think is important is asking the right people. one thing i think is true with polls and google is that it's not necessarily representative. there's a calibration step that we need to make. i think someone mentioned the whole google trend responses happen at a much higher response from sanders. absolutely, but after a couple of contests and you get results you can calibrate that and get a good idea. with polls we had big misses with public polling in wisconsin and ohio. how do you go about making sure that you can calibrate your data to make sure you can validate yourself to go forward? >> this goes back to the sample
12:29 am
-- it is a challenge with google data where with surveys you can calibrate because you can ask people demographic questions and party identification. and it turned out as we said during the campaign itself that a lot of the fluctuations in the polls were attributable to changes in relative nonresponse of different groups. but i guess with the public google data it's more difficult because you don't have people's demographics. . .
12:30 am
12:31 am
intelligence do you think some of these searches can be manipulated is >> it is relative to others. i think a lot of researchers are but they put a lot of energy into eliminating those searches because they don't want those included in the data. it is. >> there should be wine and cheese. thank you everybody for coming. [applause]