In my first post for this blog I covered the splashy debut of InBloom at the SXSWEdu conference in Texas in March. I noted that it’s tough to explain exactly what the company does (essentially, they provide the infrastructure for a variety of smaller applications to harness the data generated by students to make their offerings more efficient and personalized). I also highlighted privacy concerns that are starting to surface about the collecting, repackaging and re-selling of student data for the benefit of for-profit companies.
Several months later, it seems that both the difficulty of explaining to the layperson what the company does and the panic over privacy and security (underlined by the recent and related upheaval over NSA data mining) are dogging InBloom and may doom it. The number of partner and pilot states, initially listed at seven, is now down to five according to the website, and in at least two of those states, New York and Colorado, the idea faces vociferous local opposition. The American Federation of Teachers has stepped in, issuing a statement citing a “growing lack of public trust” in the company.
This debate is important, and as the AFT notes, it doesn’t stop with InBloom. The promise of big data for schools is not going away and neither are the perils, so perhaps it’s time to have a more grounded conversation about both the issues and the remedies at hand.
Recently I spoke with entrepreneur Jose Ferreira of the adaptive learning platform Knewton, another kind of big data company in education. During our conversation he said that from the ed-tech point of view, there are several types of student data, each with a different value and different dangers. (Ferreira separated out five kinds of data in his typology, but to simplify I’ll designate just three.)
1) The first type is known as personally identifiable information, or PII: names, addresses, Social Security numbers. Exposure of this data is generally a security breach of the first order. It may be valuable to spammers, but it’s not all that useful to analyze for educational outcomes. Say you find out that girls named Alana from Phoenix do better in reading; that’s not generalizable. For this reason PII should always be well hidden within any software program.
2) The second type of data is the kind collected and tabulated by school, state and federal student information systems (let’s call it SIS). There is academic and behavioral information, like attendance, standardized test scores, suspension rates and class sizes. And there’s demographic data, like ethnicity, learning disability or IEP classification, and the percentage of students receiving free or reduced-price lunch. It’s very useful to correlate this kind of data with educational outcomes and interventions. It’s necessary for resource allocation. Because much of it pertains to groups, not individuals, it’s less sensitive than PII. But there’s still a chance for schools or groups to be stigmatized or stereotyped with the sharing of such information, so it needs to be released judiciously. No one is arguing that a particular student’s test score, for example, should be a state secret, but as with anything that appears on a transcript, its release should be controlled and limited to those who need to know.
3) The third, and newest, type of data is the user interaction information collected by learning software systems like Dreambox, Khan Academy or Knewton. These systems combine time on page and keystrokes with student responses to assessment questions to construct a picture of the engagement and proficiency of individual students and the efficacy of particular pieces of content. This is where you truly get into “big data.” Some of these systems claim to generate millions of data points per hour.
Let’s leave PII alone for a minute. The power of both SIS and “big data” to improve the practice of teaching and learning depends on aggregating and analyzing as much of it as possible, and making the relevant results available as quickly as possible to students, educators, parents, and the people who build these systems. The system we have today doesn’t do a very good job of this. Adequate Yearly Progress test results, for example, typically become available several months after a student takes the test. If big data is going to be useful at all, the privacy considerations attached to it have to be different because of the sheer volume of the data and the velocity at which it is generated. “Opting in” often becomes impractical.
I would suggest separate tests be applied to determine responsible privacy and security considerations for each type of student data. PII should always be separated out and kept hidden except when explicitly shared or agreed to by informed individuals. SIS and “big data” should be protected and their use disclosed, especially when they are being made available for the enrichment of private businesses. (For example, I’m not a huge fan of the startup Junyo, founded by a cofounder of the online game company Zynga, which has introduced a product that scrapes publicly available SIS data and sells the information to textbook and ed-tech companies for marketing purposes.) In all cases, we have to balance the potential harm to vulnerable young people against the potential gains to learning and teaching.
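For readers who want to see what "separating out" PII might look like in practice, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how InBloom, Knewton or any other company actually stores data: the field names, the `split_record` and `average_by_group` helpers, and the choice of a keyed hash as a pseudonym are all hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict
from statistics import mean

# Hypothetical field names; a real system would define its own list of PII.
PII_FIELDS = {"name", "address", "ssn"}


def pseudonym(student_key, secret):
    """Derive a stable pseudonymous ID from an identifier using a keyed hash."""
    digest = hmac.new(secret, student_key.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def split_record(record, secret):
    """Split one student record into a PII part and an analysis part.

    Both parts carry the same pseudonymous ID, so only someone who holds
    the PII part can link analysis rows back to a named student.
    """
    pid = pseudonym(record["ssn"], secret)
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    analysis = {k: v for k, v in record.items() if k not in PII_FIELDS}
    pii["pid"] = pid
    analysis["pid"] = pid
    return pii, analysis


def average_by_group(rows, group_field, value_field):
    """Aggregate a numeric field by group, as SIS-style reporting does."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {group: mean(values) for group, values in groups.items()}
```

The idea is that the PII part stays locked away with the school or district, while only the analysis part, keyed by pseudonym, is shared for research or product improvement. Note that removing names alone does not guarantee anonymity, since rare combinations of demographic fields can still point to an individual, which is one reason group-level releases still need to be handled judiciously.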