How the NSA Converts Spoken Words Into Searchable Text
Most people realize that emails and other digital communications they once considered private can now become part of their permanent record.
But even as they increasingly use apps that understand what they say, most people don’t realize that the words they speak are not so private anymore, either.
Top-secret documents from the archive of former NSA contractor Edward Snowden show the National Security Agency can now automatically recognize the content within phone calls by creating rough transcripts and phonetic representations that can be easily searched and stored.
The documents show NSA analysts celebrating the development of what they called “Google for Voice” nearly a decade ago.
Though perfect transcription of natural conversation apparently remains the Intelligence Community’s “holy grail,” the Snowden documents describe extensive use of keyword searching as well as computer programs designed to analyze and “extract” the content of voice conversations, and even use sophisticated algorithms to flag conversations of interest.
The documents include vivid examples of the use of speech recognition in war zones like Iraq and Afghanistan, as well as in Latin America. But they leave unclear exactly how widely the spy agency uses this ability, particularly in programs that pick up considerable amounts of conversations that include people who live in or are citizens of the United States.
Spying on international telephone calls has always been a staple of NSA surveillance, but the requirement that an actual person do the listening meant it was effectively limited to a tiny percentage of the total traffic. By leveraging advances in automated speech recognition, the NSA has entered the era of bulk listening.
And this has happened with no apparent public oversight, hearings or legislative action. Congress hasn’t shown signs of even knowing that it’s going on.
The USA Freedom Act — the surveillance reform bill that Congress is currently debating — doesn’t address the topic at all. The bill would end an NSA program that does not collect voice content: the government’s bulk collection of domestic calling data, showing who called whom and for how long.
Even if it becomes law, the bill would leave in place a multitude of mechanisms exposed by Snowden that scoop up vast amounts of innocent people’s text and voice communications in the U.S. and across the globe.
Civil liberty experts contacted by The Intercept said the NSA’s speech-to-text capabilities are a disturbing example of the privacy invasions that are becoming possible as our analog world transitions to a digital one.
“I think people don’t understand that the economics of surveillance have totally changed,” Jennifer Granick, civil liberties director at the Stanford Center for Internet and Society, told The Intercept.
“Once you have this capability, then the question is: How will it be deployed? Can you temporarily cache all American phone calls, transcribe all the phone calls, and do text searching of the content of the calls?” she said. “It may not be what they are doing right now, but they’ll be able to do it.”
And, she asked: “How would we ever know if they change the policy?”
Indeed, NSA officials have been secretive about their ability to convert speech to text, and how widely they use it, leaving open any number of possibilities.
That secrecy is the key, Granick said. “We don’t have any idea how many innocent people are being affected, or how many of those innocent people are also Americans.”
I Can Search Against It
NSA whistleblower Thomas Drake, who was trained as a voice processing crypto-linguist and worked at the agency until 2008, told The Intercept that he saw a huge push after the September 11, 2001 terror attacks to turn the massive amounts of voice communications being collected into something more useful.
Human listening was clearly not going to be the solution. “There weren’t enough ears,” he said.
The transcripts that emerged from the new systems weren’t perfect, he said. “But even if it’s not 100 percent, I can still get a lot more information. It’s far more accessible. I can search against it.”
Converting speech to text makes it easier for the NSA to see what it has collected and stored, according to Drake. “The breakthrough was being able to do it on a vast scale,” he said.
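Even imperfect transcripts are useful because text, unlike audio, can be searched cheaply and tolerantly. As a rough illustration of what Drake describes, here is a minimal Python sketch of fuzzy keyword search over noisy machine transcripts; the call data, keywords and matching threshold are all invented for illustration, not drawn from any NSA system:

```python
import difflib

def search_transcripts(transcripts, keywords, cutoff=0.8):
    """Return (call_id, keyword) pairs for approximate keyword hits.

    Fuzzy matching tolerates recognition errors: a garbled token
    like 'detonater' still matches the search term 'detonator'.
    """
    hits = []
    for call_id, text in transcripts.items():
        tokens = text.lower().split()
        for keyword in keywords:
            if difflib.get_close_matches(keyword, tokens, n=1, cutoff=cutoff):
                hits.append((call_id, keyword))
    return hits

# Invented example: one noisy transcript, two search terms.
transcripts = {"call-001": "the detonater arrives in baghdad on tuesday"}
print(search_transcripts(transcripts, ["detonator", "baghdad"]))
# [('call-001', 'detonator'), ('call-001', 'baghdad')]
```

The point is not the specific algorithm but the economics: once voice becomes text, a single query can sweep months of traffic in seconds.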
More Data, More Power, Better Performance
The Defense Department, through its Defense Advanced Research Projects Agency (DARPA), started funding academic and commercial research into speech recognition in the early 1970s.
What emerged were several systems to turn speech into text, all of which steadily improved as they were able to work with more data and at faster speeds.
In a brief interview, Dan Kaufman, director of DARPA’s Information Innovation Office, indicated that the government’s ability to automate transcription is still limited.
Kaufman says that automated transcription of phone conversations is “super hard,” because “there’s a lot of noise on the signal” and “it’s informal as hell.”
“I would tell you we are not very good at that,” he said.
In an ideal environment like a news broadcast, he said, “we’re getting pretty good at being able to do these types of translations.”
A 2008 document from the Snowden archive shows that transcribing news broadcasts was already working well seven years ago, using a program called Enhanced Video Text and Audio Processing:
(U//FOUO) EViTAP is a fully-automated news monitoring tool. The key feature of this Intelink-SBU-hosted tool is that it analyzes news in six languages, including Arabic, Mandarin Chinese, Russian, Spanish, English, and Farsi/Persian. “How does it work?” you may ask. It integrates Automatic Speech Recognition (ASR) which provides transcripts of the spoken audio. Next, machine translation of the ASR transcript translates the native language transcript to English. Voila! Technology is amazing.
A version of the system the NSA uses is now even available commercially.
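The EViTAP description amounts to a two-stage pipeline: automatic speech recognition followed by machine translation. Here is a toy Python sketch of that flow, with trivial stand-ins for both stages; the function names and the one-word “glossary” are invented for illustration, and real systems use large trained models at each step:

```python
def asr_transcribe(audio, language):
    # Toy stand-in for an ASR engine; a real one runs acoustic
    # and language models over the audio signal.
    return audio["mock_transcript"]

def machine_translate(text, source_language, target_language="en"):
    # Toy stand-in for a machine translation engine.
    glossary = {"hola": "hello", "mundo": "world"}
    return " ".join(glossary.get(word, word) for word in text.split())

def monitor_broadcast(audio, language):
    native = asr_transcribe(audio, language)        # stage 1: speech -> native text
    english = machine_translate(native, language)   # stage 2: native text -> English
    return {"language": language, "native": native, "english": english}

print(monitor_broadcast({"mock_transcript": "hola mundo"}, "Spanish"))
# {'language': 'Spanish', 'native': 'hola mundo', 'english': 'hello world'}
```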
Experts in speech recognition say that in the last decade or so, the pace of technological improvement has been explosive. As information storage became cheaper and more efficient, technology companies were able to store massive amounts of voice data on their servers, allowing them to continually update and improve the models. Enormous processors, tuned as “deep neural networks” that detect patterns like human brains do, produce much cleaner transcripts.
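For a sense of what a “deep neural network” acoustic model does at its core, here is a toy Python/NumPy forward pass mapping one frame of acoustic features to phoneme probabilities. The dimensions are arbitrary and the weights are random rather than trained; real systems stack many more layers and learn their weights from enormous speech corpora:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dense layers: 40 acoustic features -> 64 hidden units -> 10 phoneme classes.
W1, b1 = rng.standard_normal((40, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 10)), np.zeros(10)

def phoneme_posteriors(frame):
    hidden = np.maximum(frame @ W1 + b1, 0.0)   # ReLU hidden layer
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()

frame = rng.standard_normal(40)                 # one frame of (fake) features
probs = phoneme_posteriors(frame)
print(probs.argmax())                           # index of the most likely phoneme
```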
And the Snowden documents show that the same kinds of leaps forward seen in commercial speech-to-text products have also been happening in secret at the NSA, fueled by the agency’s singular access to astronomical processing power and its own vast data archives.
In fact, the NSA has been repeatedly releasing new and improved speech recognition systems for more than a decade.
The first-generation tool, which made keyword-searching of vast amounts of voice content possible, was rolled out in 2004 and code-named RHINEHART.
“Voice word search technology allows analysts to find and prioritize intercept based on its intelligence content,” says an internal 2006 NSA memo entitled “For Media Mining, the Future Is Now!”
The memo says that intelligence analysts involved in counterterrorism were able to identify terms related to bomb-making materials, like “detonator” and “hydrogen peroxide,” as well as place names like “Baghdad” or people like “Musharaf.”
RHINEHART was “designed to support both real-time searches, in which incoming data is automatically searched by a designated set of dictionaries, and retrospective searches, in which analysts can repeatedly search over months of past traffic,” the memo explains (emphasis in original).
As of 2006, RHINEHART was operating “across a wide variety of missions and languages” and was “used throughout the NSA/CSS [Central Security Service] Enterprise.”
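The memo’s two modes map onto a simple pair of structures: dictionaries of tasked terms checked against traffic as it arrives, and an inverted index for queries over stored traffic. A hedged Python sketch follows; all the names, terms and cut IDs here are invented, not taken from the documents:

```python
from collections import defaultdict

dictionaries = {
    "ct-materials": {"detonator", "hydrogen peroxide"},
    "places": {"baghdad"},
}

# Inverted index: token -> IDs of stored "cuts" (intercepted voice segments).
index = defaultdict(set)

def ingest(cut_id, transcript):
    """Real-time mode: index the new cut and alert on dictionary hits."""
    text = transcript.lower()
    for token in text.split():
        index[token].add(cut_id)
    return [(name, phrase) for name, phrases in dictionaries.items()
            for phrase in phrases if phrase in text]

def retrospective_search(term):
    """Retrospective mode: query months of already-ingested traffic."""
    return sorted(index[term.lower()])

print(ingest("cut-42", "the detonator is in Baghdad"))   # real-time alerts
print(retrospective_search("Baghdad"))                    # ['cut-42']
```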
But even then, a newer, more sophisticated product was already being rolled out by the NSA’s Human Language Technology (HLT) program office. The new system, called VoiceRT, was first introduced in Baghdad, and “designed to index and tag 1 million cuts per day.”
The goal, according to another 2006 memo, was to use voice processing technology to be able to “index, tag and graph” all intercepted communications. “Using HLT services, a single analyst will be able to sort through millions of cuts per day and focus on only the small percentage that is relevant,” the memo states.
A 2009 memo from the NSA’s British partner, GCHQ, describes how “NSA have had the BBN speech-to-text system Byblos running at Fort Meade for at least 10 years. (Initially they also had Dragon.) During this period they have invested heavily in producing their own corpora of transcribed Sigint in both American English and an increasing range of other languages.” (GCHQ also noted that it had its own small corpora of transcribed voice communications, most of which happened to be “Northern Irish accented speech.”)
VoiceRT, in turn, was surpassed a few years after its launch. According to the intelligence community’s “Black Budget” for fiscal year 2013, VoiceRT was decommissioned and replaced in 2011 and 2012, so that by 2013, NSA could operationalize a new system. This system, apparently called SPIRITFIRE, could handle more data, faster. SPIRITFIRE would be “a more robust voice processing capability based on speech-to-text keyword search and paired dialogue transcription.”
Extensive Use Abroad
Voice communications can be collected by the NSA whether they are being sent by regular phone lines, over cellular networks, or through voice-over-internet services. Previously released documents from the Snowden archive describe enormous efforts by the NSA during the last decade to get access to voice-over-internet content like Skype calls, for instance. And other documents in the archive chronicle the agency’s adjustment to the fact that an increasingly large percentage of conversations, even those that start as landline or mobile calls, end up as digitized packets flying through the same fiber-optic cables that the NSA taps so effectively for other data and voice communications.
The Snowden archive, as searched and analyzed by The Intercept, documents extensive use of speech-to-text by the NSA to search through international voice intercepts — particularly in Iraq and Afghanistan, as well as Mexico and Latin America.
For example, speech-to-text was a key but previously unheralded element of the sophisticated analytical program known as the Real Time Regional Gateway (RTRG), which started in 2005 when newly appointed NSA chief Keith B. Alexander, according to the Washington Post, “wanted everything: Every Iraqi text message, phone call and e-mail that could be vacuumed up by the agency’s powerful computers.”
The Real Time Regional Gateway was credited with playing a role in “breaking up Iraqi insurgent networks and significantly reducing the monthly death toll from improvised explosive devices.” The indexing and searching of “voice cuts” was deployed to Iraq in 2006. By 2008, RTRG was operational in Afghanistan as well.
A slide from a June 2006 NSA PowerPoint presentation described the role of VoiceRT.
Keyword spotting extended to Iranian intercepts as well. A 2006 memo reported that RHINEHART had been used successfully by Persian-speaking analysts who “searched for the words ‘negotiations’ or ‘America’ in their traffic, and RHINEHART located a very important call that was transcribed verbatim providing information on an important Iranian target’s discussion of the formation of the new Iraqi government.”
According to a 2011 memo, “How is Human Language Technology (HLT) Progressing?”, the NSA that year deployed “HLT Labs” to Afghanistan, NSA facilities in Texas and Georgia, and listening posts in Latin America run by the Special Collection Service, a joint NSA/CIA unit that operates out of embassies and other locations.
“Spanish is the most mature of our speech-to-text analytics,” the memo says, noting that the NSA and its Special Collection Service sites in Latin America have had “great success searching for Spanish keywords.”
The memo offers an example from NSA Texas, where an analyst newly trained on the system used a keyword search to find previously unreported information on a target involved in drug-trafficking. In another case, an official at a Special Collection Service site in Latin America “was able to find foreign intelligence regarding a Cuban official in a fraction of the usual time.”
In a 2011 article, “Finding Nuggets — Quickly — in a Heap of Voice Collection, From Mexico to Afghanistan,” an intelligence analysis technical director from NSA Texas described the “rare life-changing instance” when he learned about human language technology, and its ability to “find the exact traffic of interest within a mass of collection.”
Analysts in Texas found the new technology a boon for spying. “From finding tunnels in Tijuana, identifying bomb threats in the streets of Mexico City, or shedding light on the shooting of US Customs officials in Potosi, Mexico, the technology did what it advertised: It accelerated the process of finding relevant intelligence when time was of the essence,” he wrote. (Emphasis in original.)
The author of the memo was also part of a team that introduced the technology to military leaders in Afghanistan. “From Kandahar to Kabul, we have traveled the country explaining NSA leaders’ vision and introducing SIGINT teams to what HLT analytics can do today and to what is still needed to make this technology a game-changing success,” the memo reads.
Extent of Domestic Use Remains Unknown
What’s less clear from the archive is how extensively this capability is used to transcribe or otherwise index and search voice conversations that primarily involve what the NSA terms “U.S. persons.”
The NSA did not answer a series of detailed questions about automated speech recognition, even though an NSA “classification guide” that is part of the Snowden archive explicitly states that “The fact that NSA/CSS has created HLT models” for speech-to-text processing as well as gender, language and voice recognition is “UNCLASSIFIED.”
Also unclassified: The fact that the processing can sort and prioritize audio files for human linguists, and that the statistical models are regularly being improved and updated based on actual intercepts. By contrast, because they’ve been tuned using actual intercepts, the specific parameters of the systems are highly classified.
“The National Security Agency employs a variety of technologies in the course of its authorized foreign-intelligence mission,” spokesperson Vanee’ Vines wrote in an email to The Intercept. “These capabilities, operated by NSA’s dedicated professionals and overseen by multiple internal and external authorities, help to deter threats from international terrorists, human traffickers, cyber criminals, and others who seek to harm our citizens and allies.”
Vines did not respond to the specific questions about privacy protections in place related to the processing of domestic or domestic-to-international voice communications. But she wrote that “NSA always applies rigorous protections designed to safeguard the privacy not only of U.S. persons, but also of foreigners abroad, as directed by the President in January 2014.”
The presidentially appointed but independent Privacy and Civil Liberties Oversight Board (PCLOB) didn’t mention speech-to-text technology in its public reports.
“I’m not going to get into whether any program does or does not have that capability,” PCLOB chairman David Medine told The Intercept.
His board’s reports, he said, contained only information that the intelligence community agreed could be declassified.
“We went to the intelligence community and asked them to declassify a significant amount of material,” he said. The “vast majority” of that material was declassified, he said. But not all — including “facts that we thought could be declassified without compromising national security.”
Hypothetically, Medine said, the ability to turn voice into text would raise significant privacy concerns. And it would also raise questions about how the intelligence agencies “minimize” the retention and dissemination of material — particularly involving U.S. persons — that doesn’t include information they’re explicitly allowed to keep.
“Obviously it increases the ability of the government to process information from more calls,” Medine said. “It would also allow the government to listen in on more calls, which would raise more of the kind of privacy issues that the board has raised in the past.”
“I’m not saying the government does or doesn’t do it,” he said, “just that these would be the consequences.”
A New Learning Curve
Speech recognition expert Bhiksha Raj likens the current era to the early days of the Internet, when people didn’t fully realize how the things they typed would last forever.
“When I started using the Internet in the 90s, I was just posting stuff,” said Raj, an associate professor at Carnegie Mellon University’s Language Technologies Institute. “It never struck me that 20 years later I could go Google myself and pull all this up. Imagine if I posted something on alt.binaries.pictures.erotica or something like that, and now that post is going to embarrass me forever.”
The same is increasingly becoming the case with voice communication, he said. And the stakes are even higher, given that the majority of the world’s communication has historically been conducted by voice, and it has traditionally been considered a private mode of communication.
“People still aren’t realizing quite the magnitude that the problem could get to,” Raj said. “And it’s not just surveillance,” he said. “People are using voice services all the time. And where does the voice go? It’s sitting somewhere. It’s going somewhere. You’re living on trust.” He added: “Right now I don’t think you can trust anybody.”
The Need for New Rules
Kim Taipale, executive director of the Stilwell Center for Advanced Studies in Science and Technology Policy, is one of several people who tried a decade ago to get policymakers to recognize that existing surveillance law doesn’t adequately deal with new global communication networks and advanced technologies including speech recognition.
“Things aren’t ephemeral anymore,” Taipale told The Intercept. “We’re living in a world where many things that were fleeting in the analog world are now on the permanent record. The question then becomes: what are the consequences of that and what are the rules going to be to deal with those consequences?”
Realistically, Taipale said, “the ability of the government to search voice communication in bulk is one of the things we may have to live with under some circumstances going forward.” But there at least need to be “clear public rules and effective oversight to make sure that the information is only used for appropriate law-enforcement or national security purposes consistent with Constitutional principles.”
Ultimately, Taipale said, a system where computers flag suspicious voice communications could be less invasive than one where people do the listening, given the potential for human abuse and misuse to lead to privacy violations. “Automated analysis has different privacy implications,” he said.
But to Jay Stanley, a senior policy analyst with the ACLU’s Speech, Privacy and Technology Project, the distinction between a human listening and a computer listening is irrelevant in terms of privacy, possible consequences, and a chilling effect on speech.
“What people care about in the end, and what creates chilling effects in the end, are consequences,” he said. “I think that over time, people would learn to fear computerized eavesdropping just as much as they fear eavesdropping by humans, because of the consequences that it could bring.”
Indeed, computer listening could raise new concerns. One of the internal NSA memos from 2006 says an “important enhancement under development is the ability for this HLT capability to predict what intercepted data might be of interest to analysts based on the analysts’ past behavior.”
Citing Amazon’s ability to not just track but predict buyer preferences, the memo says that an NSA system designed to flag interesting intercepts “offers the promise of presenting analysts with highly enriched sorting of their traffic.”
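In recommender-system terms, this is relevance scoring against a profile built from an analyst’s past selections. Here is a toy sketch of the idea; the transcripts and the scoring scheme are invented for illustration, since the memo gives no implementation details:

```python
from collections import Counter

def build_profile(flagged_transcripts):
    """Accumulate word counts from cuts the analyst previously flagged."""
    profile = Counter()
    for text in flagged_transcripts:
        profile.update(text.lower().split())
    return profile

def interest_score(profile, transcript):
    """Score a new cut by vocabulary overlap with the analyst's history."""
    return sum(profile[word] for word in transcript.lower().split())

profile = build_profile(["shipment leaves tijuana tonight",
                         "tunnel under the tijuana border"])
queue = ["weather in kabul tomorrow", "new tunnel near tijuana"]
print(sorted(queue, key=lambda t: interest_score(profile, t), reverse=True))
# ['new tunnel near tijuana', 'weather in kabul tomorrow']
```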
To Phillip Rogaway, a professor of computer science at the University of California, Davis, keyword-search is probably the “least of our problems.” In an email to The Intercept, Rogaway warned that “When the NSA identifies someone as ‘interesting’ based on contemporary NLP [Natural Language Processing] methods, it might be that there is no human-understandable explanation as to why beyond: ‘his corpus of discourse resembles those of others whom we thought interesting’; or the conceptual opposite: ‘his discourse looks or sounds different from most people’s.’”
If the algorithms NSA computers use to identify threats are too complex for humans to understand, Rogaway wrote, “it will be impossible to understand the contours of the surveillance apparatus by which one is judged. All that people will be able to do is to try your best to behave just like everyone else.”
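Rogaway’s scenario can be made concrete with the simplest possible version: representing each speaker’s “corpus of discourse” as a word-count vector and flagging high cosine similarity to prior targets. Actual NLP methods are far more elaborate, but just as hard to explain after the fact; everything below is illustrative:

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

target_corpus = vectorize("meet at the border crossing after dark")
candidate = vectorize("we cross the border after dark")
print(round(cosine(target_corpus, candidate), 2))
# 0.62 -- above some threshold, the speaker is flagged as "interesting"
```

The flag carries no human-readable reason beyond the score itself, which is exactly Rogaway’s point.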
Siri can understand what you say. Google can take dictation. Even your new smart TV is taking verbal orders.
So is there any doubt the National Security Agency has the ability to translate spoken words into text?
But precisely when the NSA does it, with which calls, and how often, is a well-guarded secret.
It’s not surprising that the NSA isn’t talking about it. But oddly enough, neither is anyone else: Over the years, there’s been almost no public discussion of the NSA’s use of automated speech recognition.
One minor exception was in 1999, when a young Australian cryptographer named Julian Assange stumbled across an NSA patent that mentioned “machine transcribed speech.”
Assange, who went on to found WikiLeaks, said at the time: “This patent should worry people. Everyone’s overseas phone calls are or may soon be tapped, transcribed and archived in the bowels of an unaccountable foreign spy agency.”
The most comprehensive post-Snowden descriptions of NSA’s surveillance programs are strangely silent when it comes to speech recognition. The report from the President’s Review Group on Intelligence and Communications Technologies doesn’t mention it, and neither does the October 2011 FISA Court ruling, or the detailed reports from the Privacy and Civil Liberties Oversight Board.
There is some mention of speech recognition in the “Black Budget” submitted to Congress each year. But there’s no clear sign that anybody on the Hill has ever really noticed.
As The Intercept reported on Tuesday, items from the Snowden archive document the widespread use of automated speech recognition by the NSA.
The strategic advantage, invasive potential and policy implications of being able to turn spoken words into text are not trivial: Suddenly, voice conversations, historically considered ephemeral and unsearchable, can be scanned, catalogued and archived — not perfectly, but well enough to dramatically increase the effective scope of eavesdropping.
Former senior NSA executive turned whistleblower Thomas Drake, who’s seen NSA’s automated speech recognition at work, says the silence is telling.
“You’re seeing a black hole,” Drake told The Intercept. “That means there’s something there that’s really significant. You’re seeing some of the fuzzy contours of this whole other program.”
Not Technically a Secret
The NSA’s ability to turn voice into text, interestingly enough, is not technically a secret.
And speech recognition technology has been heavily — and openly — funded by the Defense Advanced Research Projects Agency (DARPA) since the early 1970s.
The latest of DARPA’s many public research projects in that area is the Robust Automatic Transcription of Speech program, known as RATS, which focuses on “noisy or degraded speech signals that are important to military intelligence.”
Meanwhile, DARPA’s intelligence-world counterpart, IARPA, announced the Babel Program in 2011, with its goal of “developing agile and robust speech recognition technology that can be rapidly applied to any human language in order to provide effective search capability for analysts to efficiently process massive amounts of real-world recorded speech.”
Despite openly announcing its speech-to-text program, IARPA declined an interview request by The Intercept.
Robert Litt, who as general counsel for the Office of the Director of National Intelligence is the intelligence community’s chief lawyer, was asked about the NSA’s speech-to-text capabilities at a forum on transparency on Capitol Hill on Friday.
He took the opportunity to lash out at The Intercept’s reporting: “I think that story is a great example of what is wrong with a lot of media coverage of this,” he said. “That story made absolutely no distinction between technical capabilities and legal authorities. There are all sorts of technical capabilities that NSA has. I’m not commenting on the existence or nonexistence of any such authority. The question is when are they used and what are the legal authorities under which they are used. And I think that that’s something that a lot of the press reporting completely ignores, including that story you wrote.”
Asked to explain in what ways the use of speech-to-text is limited, Litt repeatedly refused to even acknowledge its existence.
“I’m not saying that the government isn’t using these techniques. I am not acknowledging that these techniques exist even.”
You won’t hear much about the use of speech recognition for surveillance in academe, either.
Researchers in the field are divided between those who don’t take NSA funding, and can only speculate about what goes on over there — and those who do take NSA funding, but won’t say what they know.
“There’s a lot of weird hush-hush that goes on,” said Bhiksha Raj, an associate professor at Carnegie Mellon University’s Language Technologies Institute, who said he does not receive NSA funding. “Academics who work for the NSA must go through various clearances. They sign several papers. They hold closed meetings that are only attended by people with clearances.”
Some non-NSA affiliated academics were once “quite keen” on seeing how the NSA was faring in the face of the technical challenges in the field, Steve Young, a professor of information engineering at the University of Cambridge, recalled. “But unless you actually work for the NSA and you’ve been vetted, you’re not going to get close to the real data.”
Ironically, even GCHQ, NSA’s intelligence partner in the U.K., has complained about DARPA and NSA’s secrecy. A 2009 GCHQ assessment of speech-to-text technology said that “The DARPA evaluation programme, with significant steer from NSA, has been the main driving force behind technology improvements in the field. Unfortunately, the results of the evaluations are not put in the public domain, making reference difficult.”
All the secrecy has an obvious advantage for the NSA: if the agency can keep its speech-recognition capabilities secret, nobody can tell it what to do. And if nobody knows what it is doing, nobody can tell it to stop.
Senator Ron Wyden, D-Ore., arguably the foremost congressional critic of NSA overreach, wouldn’t comment directly on the question of speech recognition. But, he said through a spokesperson: “After 14 years on the Intelligence Committee, I’ve learned that senators must be constantly on the lookout for secret interpretations of the law and advances in surveillance that Congress isn’t aware of.”
He added: “For centuries, individual privacy was protected in part by the limited resources of governments. It simply wasn’t possible for governments to secretly collect information on every single citizen without investing in massive networks of spies and informants. But in the 21st century mass surveillance is no longer difficult and expensive — it’s increasingly cheap and easy. The only privacy protections that will matter in the future are the ones that are written into law and defended by public demand for freedom and openness.”
When it comes to the National Security Agency’s recently disclosed use of automated speech recognition technology to search, index and transcribe voice communications, people in the United States may well be asking: But are they transcribing my phone calls?
The answer is maybe.
A clear-cut answer is elusive because documents in the Snowden archive describe the capability to turn speech into text, but not the extent of its use — and the U.S. intelligence community refuses to answer even the most basic questions on the topic.
Asked about the application of speech-to-text to conversations including Americans, Robert Litt, general counsel for the Office of the Director of National Intelligence, said at a Capitol Hill event in May that the NSA has “all sorts of technical capabilities” and that they are all used in a lawful manner.
“I’m not specifically acknowledging or denying the existence of any particular capability,” he said. “I’m only saying that the focus needs to be on what are the authorities the NSA is using, and what are the protections around the execution of those authorities?”
So what are those authorities? And what are the protections around their execution?
Litt wouldn’t say. But thanks to previous explorations of the Snowden archive and some documents released by the Obama administration, we know there are four major methods the NSA uses to get access to phone calls involving Americans — and only one of them technically precludes the use of speech recognition.
215 Bulk Collection of Metadata
The only surveillance program we know does not involve speech-to-text processing is the bulk collection of metadata of domestic phone calls, commonly known as 215 collection, after the section of the Patriot Act that the NSA says makes it legal.
U.S. officials have unequivocally denied that they get any access to any content through the 215 program — information about the calls, yes, but no actual calls. So no voice means no transcripts.
As it happens, that’s also the one program that Congress has decided to eliminate in its current form. But many other far more invasive programs, many of which sweep up American content including voice communications, remain untouched.
702 PRISM and Upstream
Voice communications have been and continue to be widely intercepted and collected under both programs the NSA considers authorized by section 702 of the Foreign Intelligence Surveillance Act.
The two 702 programs are called PRISM and Upstream; a slide published in an earlier story laid out both.
PRISM collects hundreds of millions of Internet communications (text and video, as well as voice) of “targeted individuals” from providers such as Facebook, Yahoo and Skype. But plenty of ordinary people speak to those “targets.” Those “targets” are not necessarily targeted for good reason. And the system picks up additional “incidental” communications it wasn’t technically looking for.
As a result, an unknown but not inconsiderable amount of voice communications involving ordinary Americans gets swept up by PRISM and dumped into massive, centralized databases that are widely accessible to U.S. law enforcement agencies.
There is nothing in the Snowden archive that indicates whether or not the NSA applies what it calls “human language technologies” to these huge troves of voice data, to index, tag, transcribe and/or search them by keyword. And the NSA will not say. But there’s no technological impediment, given the huge leaps forward in automatic speech recognition.
The 702 program called Upstream takes all sorts of communications, including voice, straight from the major U.S. Internet backbones run by telecommunication companies such as AT&T and Verizon.
Here, text is handled differently from telephone calls. Data in text form is searched in bulk using “selectors” such as email addresses, IP addresses and unique IDs. Any “transaction” (some are big, and some are small) that contains a “selector” is moved to an NSA database for further examination.
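Mechanically, that text-side filtering is just matching tasked selectors against the content of each transaction. A minimal sketch follows; the selectors and transactions below are invented for illustration:

```python
SELECTORS = {"target@example.org", "203.0.113.7"}

def matches(transaction):
    """Retain any transaction whose content contains a tasked selector."""
    return any(selector in transaction["content"] for selector in SELECTORS)

stream = [
    {"id": 1, "content": "From: alice@example.com To: target@example.org ..."},
    {"id": 2, "content": "GET /index.html HTTP/1.1 Host: example.net"},
]

retained = [t["id"] for t in stream if matches(t)]
print(retained)   # [1] -- transaction 1 is moved to a database for examination
```

Because a “transaction” can bundle many communications, a single selector hit can pull in unrelated traffic, a point that becomes important below.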
By contrast, according to the Privacy and Civil Liberties Oversight Board (PCLOB) report on the 702 program, traditional phone calls are collected solely if they are to or from targeted foreign individuals.
But even so: In 2013, the one year we have hard numbers for, the NSA used Upstream to collect phone calls to and from 89,000 targets — many of those calls inevitably involving U.S. persons. And after collection, these calls are dumped into the NSA’s databases, where — although we don’t know for sure — they could be transcribed, indexed, keyword-searched and stored.
An otherwise unalarmist PCLOB report on the 702 program issued this quite striking warning on that topic: “Even though U.S. persons and persons located in the United States are subject to having their telephone conversations collected only when they communicate with a targeted foreigner located abroad, the program nevertheless gains access to numerous personal conversations of U.S. persons that were carried on under an expectation of privacy.”
The PCLOB report made no reference to automated transcription. When asked about the topic, board chairman David Medine told The Intercept that the report contained only information that the intelligence community agreed could be declassified.
Most analyses of Upstream also assume that the NSA considers VoIP calls — voice communications that travel across the Internet — as entitled to the same legal protection as the ever-diminishing number of calls that travel over the old-fashioned telephone circuits. If that’s not the case — and an NSA spokesperson declined to comment on that question — then many more American voice conversations could be subject to collection and processing.
Executive Order 12333
Finally, there is the vast and essentially unconstrained collection of communications that the NSA intercepts abroad, citing its authority under Executive Order 12333. The scope and scale of those programs is massive; in some cases it involves NSA collecting voice communications of entire countries, hacking cell networks, breaking into private data links, and tapping phone and Internet backbones throughout the world.
All the specific examples of the application of speech-to-text processes described in the Snowden documents reviewed by The Intercept appear to have involved intercepts abroad. But surveillance anywhere in the world will inevitably pull in a great deal of voice conversations involving Americans who call, visit or work in the country being surveilled.
The NSA responded to our inquiries with boilerplate: “Regardless of the tool, analytic technique, or technology, NSA always applies rigorous protections designed to safeguard the privacy not only of U.S. persons, but also of foreigners abroad, as directed by the President in January 2014.”
Has the Court Really Approved?
In a now-public October 2011 opinion, former Foreign Intelligence Surveillance Court presiding Judge John Bates bitterly complained that for the third time in less than three years, the government had significantly misrepresented the scope of its collection to the secret court.
It wasn’t until 2011, for instance, that the court understood that the government wasn’t just looking at a small cross-section of Upstream data, but was searching through all of it for “selectors” — and was putting into its databases not just discrete, single “transactions” that contained a given selector, but potentially huge “multi-communication transactions.” As a result, the court finally realized, an incalculable number of purely domestic communications were ending up in the NSA’s databases.
Georgetown University Law Center professor Laura Donohue has written that either “the Court was particularly slow, the government had been lying, or the government had made a mistake.”
The Bates opinion explicitly recognized that there might be more epiphanies to come — “that further revelations concerning what NSA has actually acquired through its 702 collection, together with the constant evolution of the Internet, may alter the Court’s analysis at some point in the future.”
Asked at a public event by The Intercept if anyone had ever explicitly advised the court that the NSA was using speech-to-text processing on voice intercepts that were collected by 702 programs, Litt replied: “The FISA court orders specifically dictate what we can do and what we can’t do in conducting collection under 702. You have seen those orders. You know what they say.”
He continued: “The orders also provide what kinds of processing we can do on them. We do what those orders authorize. If the orders authorize it, we’re allowed to do it. If they don’t, we’re not. And it doesn’t matter whether we use these speech-to-text recognition tools or whether we use 800 monkeys sitting at typewriters.”
But none of the FISA court orders appear to say anything specific about processing. And the ability to turn massive amounts of voice into text raises intense privacy concerns because of the scale involved in the collection. Assigning an analyst (or even a single monkey) to listen in on every international phone call would be impossible. Automatically transcribing them and storing the text in a searchable database is not.
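The scale argument is easy to make concrete with back-of-envelope numbers. Everything below is invented or rough (the daily call volume is a round hypothetical; the per-minute sizes are coarse estimates), but the orders of magnitude are the point:

```python
calls_per_day = 10_000_000        # hypothetical daily intercept volume
minutes_per_call = 5
minutes_per_day = calls_per_day * minutes_per_call

audio_mb_per_min = 0.5            # roughly 8 kbit/s compressed telephone audio
text_kb_per_min = 1.0             # roughly 150 spoken words/minute as plain text

audio_tb_per_day = minutes_per_day * audio_mb_per_min / 1_000_000
text_gb_per_day = minutes_per_day * text_kb_per_min / 1_000_000

# An analyst listening 8 hours a day covers 480 minutes of audio.
analysts_needed = minutes_per_day / (8 * 60)

print(f"audio: {audio_tb_per_day:.0f} TB/day; text: {text_gb_per_day:.0f} GB/day")
print(f"human listening would need ~{analysts_needed:,.0f} full-time analysts")
# audio: 25 TB/day; text: 50 GB/day
# human listening would need ~104,167 full-time analysts
```

Text search over 50 gigabytes is a routine database job; hiring a hundred thousand cleared linguists is not.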
The Protections Involved
The NSA’s historic mission has been to spy on foreigners, not Americans. So all the surveillance methods mentioned above — with the notable exception of the domestic collection of phone records — come with a phalanx of rules intended to “minimize” the retention and dissemination of information related to U.S. persons.
The rules are explicit, and absolute, and the government’s argument is that they are sufficient to protect Americans’ constitutional rights.
And there is indeed evidence throughout the Snowden archive of how meticulously those minimization rules are supposed to be followed: how analysts are instructed to immediately throw out “U.S. persons” information the instant they recognize it shouldn’t be there; how careful they are supposed to be about providing access to unminimized intelligence gathered under FISA.
But even if you overlook the possibility of illicit searches — for LOVEINT (spying on romantic interests), to look for sexually explicit content, and so on — the NSA’s interpretations of the rules appear to be problematic, and their application inconsistent.
Georgetown Law’s Donohue, for instance, writes that the government “has created a presumption of non-U.S. person status” and “absent evidence to the contrary, assumes that the target is located outside the United States.”
The PCLOB report on 702 programs found that, as regards the requirement to purge such improperly collected data, “in practice, this requirement rarely results in actual purging of data.”
A recently declassified report from the Justice Department’s inspector general found that authorities had failed to comply with basic minimization requirements regarding the 215 program — for eight years.
Glenn Greenwald reported in September 2013 that the NSA routinely shares raw, unminimized signals intelligence with Israel, material that “includes, but is not limited to, unevaluated and unminimized transcripts, gists, facsimiles, telex, voice and Digital Network Intelligence metadata and content.”
And of course there’s no external oversight.
Bob Litt says we shouldn’t worry. But neither he nor anyone else in a position to know will provide the facts that might — or might not — reassure us.
Source: The Intercept, by Dan Froomkin