Who’s afraid of big data?

Nestled in a dusty corner of my office sits a safe, which requires both a key and a combination to open. Inside the safe sits an external hard drive containing my only copy of encrypted confidential data from the National Longitudinal Study of Adolescent Health (Add Health), a federally funded research study that collected longitudinal data on the health behaviors of adolescents, their friendship networks and siblings, their schools, school transcripts, academic performance, school and community context, and, in later waves, some biological data derived from the analysis of blood, saliva and urine samples. I don’t use these biomarkers in my own research, so I don’t have those data—but I do have some of the survey and transcript data.

If I wish to analyze the data on my desktop computer, I must first remove the computer from my college’s network, unplugging the Ethernet cable and disabling any WiFi connectivity. The statistical software must reside solely on my desktop computer; it cannot reside on a network server. Furthermore, I have to configure the software to store any temporary files on the external hard drive, not the desktop computer’s hard drive. And if the computer is inactive for three minutes, the screensaver is activated, and I must enter a password to log in.

Even though these data files do not include the names of study participants, the design of the Add Health study increases the risk of deductive disclosure, which occurs when an individual’s identity and information can be inferred from information in the study itself, or from those data combined with other publicly available information. For example, if you knew that an individual had participated in the Add Health study, and you knew five or more of his or her characteristics (such as age, sex, racial/ethnic background, grade level and region of the country), you might be able to deduce that person’s more sensitive data.

To protect the privacy of study participants, any researcher who uses the Add Health data must sign a security pledge to treat the data as confidential, and ensure that they are not accessible to non-project personnel; to not attempt to identify individuals, families, households, schools or institutions; and to notify appropriate authorities about any breach of confidentiality, whether intentional or inadvertent. There are also rules designed to ensure the data that are summarized in research reports don’t contain sufficient detail to identify individuals or their personal information. As a further safeguard, the identification numbers used in collecting the data aren’t distributed to researchers; instead, altered identification numbers are created by a security manager outside of the United States using a proprietary algorithm, and researchers receive only the altered IDs.

Many federal agencies, such as the National Center for Education Statistics (NCES), distribute two forms of data: a public-use file stripped of information that might identify individuals, and a restricted-use file that has more sensitive information. Although I hold no current licenses for restricted-use data other than Add Health, I have held them in the past. These licenses also require a detailed security plan. The confidentiality procedures of NCES are governed by the Education Sciences Reform Act of 2002. Violation of these confidentiality procedures is a Class E felony, punishable by a fine of up to $250,000 or five years’ imprisonment (or both). I don’t know of any instances of prosecution, but I can assure you that signing a document that spells this out is not a trivial matter.

I’m a researcher, and researchers love data. And the ability to link information from different sources allows me, and other researchers, to address more and different research questions than can be addressed via data from a single source. But linking information—e.g., a student’s high-school graduation status to the neighborhood in which he lives, or a teacher’s identity to the adult earnings of her students—requires identifiers, and both federal law and ethical research practice require careful attention to the conditions under which personally identifiable information is linked to other sources or shared with others. The Family Educational Rights and Privacy Act of 1974 (FERPA) generally precludes the disclosure of personally identifiable information in an educational record maintained by a school or college without the prior consent of a parent or eligible student, but there are exceptions, such as when a school or school system outsources institutional functions, such as research and evaluation, to outside contractors.

Outsourcing these functions to specialized organizations is common and appropriate. Few of us, for example, would think it desirable for every school district in the country to have its own in-house team of psychometricians available to design, validate and administer psychological and/or cognitive tests deemed necessary for teaching and learning. But information is typically compartmentalized, and shared on a need-to-know basis. For many applications, an outside contractor would not need to know the identities of particular students whose data are represented in a batch of test scores. On occasion, though, a student’s identity might be important, such as when a contractor is preparing customized curriculum materials or reports for particular schools, teachers and students.

Who should have access to personally identifiable information about students is a hot-button issue, with the news that inBloom, Inc.—a nonprofit organization with initial funding from the Bill & Melinda Gates Foundation and the Carnegie Corporation of New York—is partnering with a number of for-profit companies to offer technology applications relying on personally identifiable information to states and school districts. So far, nine states have participated in the development and pilot testing of inBloom applications. (Editor’s note: the Bill & Melinda Gates Foundation and the Carnegie Corporation of New York are among the various funders of The Hechinger Report.)

As the issue has heated up, inBloom has sought to reassure the public that it is committed to maintaining the privacy of sensitive student information, and that the organization is fully compliant with FERPA. These claims, however, rely on what sociologists John Meyer and Brian Rowan once termed “the logic of confidence”: states and local education agencies, not inBloom itself, are responsible for ensuring that their use of the technology and the data is compliant with applicable laws, and that the vendors they authorize to make use of the data are compliant. And vendors are responsible for ensuring compliance among their subcontractors. inBloom, therefore, simply needs to assert that it’s confident everybody will meet their legal obligations for securing personally identifiable data, but it is neither responsible nor accountable for doing so.

It’s a far cry from the procedures I’ve described that apply to individual researchers, where the accountability is more direct and rests with the individual rather than the organization, and there are fewer ways in which oversight could break down. Perhaps the risk of inappropriate sharing or disclosure of personally identifiable information on students remains low, but at this point it’s difficult to say.

Who’s afraid of big data?

I am.