How much should you pay to keep your data up to date? How much does it cost you to use stale data?
Let’s look at a real example using the voter database (VRDB) provided by the secretary of state. The VRDB lists the voters in your district, and a campaign uses it to decide who to contact for voter outreach. It is perhaps the most critical piece of data for any campaign.
Suppose it costs you $1 to mail a postcard to a voter. If 1000 voters have moved out of your district and can no longer vote for you, you’d arguably be wasting $1000 to continue sending them postcards asking for their vote.
So not updating your copy of the voter database costs you money in wasted resources. But ingesting a new copy of the voter database costs you something too. So where’s the sweet spot? How frequently do you need to refresh your copy of the voter database?
Don’t guess! Let’s measure it …
We must establish a difference function to compare two tables (or CSV files). The choice is somewhat arbitrary, but we’ll measure the “decay” as the number of deltas needed to convert the first file into the second. The difference function should be symmetric: comparing A to B gives the same count as comparing B to A.
We’ll count the following as differences:
– If a voter is in one file but not the other. This may mean a voter has moved into the district or left the district.
– If a voter has changed (such as a different last name or different precinct number). This may mean the voter has moved within the district or changed their name.
If the two files have N1 and N2 rows respectively, then the maximum number of differences would be (N1+N2).
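To make the counting rules above concrete, here is a minimal Python sketch of such a difference function. It is not the CsvDiff.cs implementation used for this study. The column names (StateVoterID, LName, PrecinctCode) and the sample rows are made up for illustration, and the sketch assumes each row carries a unique voter key.

```python
import csv
import io

def load_voters(csv_text, key="StateVoterID"):
    """Index CSV rows by a unique voter key (the column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}

def count_differences(a, b):
    """Symmetric count: voters in only one table, plus shared voters whose rows differ."""
    only_a = a.keys() - b.keys()
    only_b = b.keys() - a.keys()
    changed = sum(1 for k in a.keys() & b.keys() if a[k] != b[k])
    return len(only_a) + len(only_b) + changed

# Two tiny hypothetical snapshots of the same district.
old = load_voters(
    "StateVoterID,LName,PrecinctCode\n"
    "1,Smith,101\n"
    "2,Jones,102\n"
    "3,Lee,103\n"
)
new = load_voters(
    "StateVoterID,LName,PrecinctCode\n"
    "1,Smith,101\n"
    "2,Jones,205\n"
    "4,Park,101\n"
)

# Voter 3 left, voter 4 arrived, voter 2 changed precinct.
print(count_differences(old, new))  # 3
```

If the two tables share no voters at all, every row counts as a difference, which gives the (N1+N2) maximum.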
For this study, we use an implementation from https://github.com/TechRoanoke/CsvCount/blob/master/CsvCount/CsvDiff.cs
2. Get the data
Here, we’ll look at VRDBs from Oct’12 through Feb’16. Voter Databases can be obtained from the Secretary of State at http://www.sos.wa.gov/elections/vrdb/
Here’s the result of applying the difference function. We start with Oct’12, use that as a baseline, and compare each later VRDB back to it.
Within 1.5 years, there were over a million differences. At $1 per contact, a statewide campaign operating with a VRDB that’s even 2 years out of date could potentially be wasting $1 million.
We’d expect the decay to slow down over time rather than grow linearly. Once a person moves, subsequent moves don’t count as additional differences against the baseline. For example, say person X starts in precinct p1 in Jan’14, moves to precinct p2 in June ’14, and then moves to precinct p3 in Dec ’14. That’s only 1 total difference from Jan’14 to Dec ’14 (moving from p1 to p3) even though there were 2 moves.
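The person X example can be checked with a small Python sketch. The snapshots below are hypothetical dictionaries mapping a voter ID to a precinct, and the difference function is a simplified stand-in for the one used in this study.

```python
def count_differences(a, b):
    """Count voters who are missing from one snapshot or whose precinct differs."""
    keys = a.keys() | b.keys()
    return sum(1 for k in keys if a.get(k) != b.get(k))

# Hypothetical snapshots: voter ID -> precinct. Voter Y never moves.
jan = {"X": "p1", "Y": "p7"}
jun = {"X": "p2", "Y": "p7"}
dec = {"X": "p3", "Y": "p7"}

# Summing the month-to-month changes counts both of X's moves.
step_by_step = count_differences(jan, jun) + count_differences(jun, dec)

# Comparing against the baseline counts X only once.
direct = count_differences(jan, dec)

print(step_by_step, direct)  # 2 1
```

Because each voter contributes at most one difference against the baseline, the measured decay can only grow more slowly than the total number of moves.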
5. Next steps?
Possible future explorations here:
- Refine the difference function. Is there an ideal difference function?
- Repeat with more data.
- This was for Washington state. Compare to other states.
- Compare the VRDB decay rate in urban vs. rural counties.
- Analyze the empirical data and correlate it with specific events. For example, why was there a decrease in voter registration records in Mar’14?
- Develop a theoretical model and match to the empirical data here.