Saturday, June 29, 2013

Unique objects with Core Data: Find and Create

I don't think it's a secret that I love Core Data. It's my opinion that the people who don't like it, or don't use it, are either very stupid or extremely smart.

There is one problem that has plagued me for years: how to ensure that when I'm performing imports I don't create duplicate objects. Conceptually the problem is fairly simple: you create a hash table, then hash the incoming objects to test for their membership in the set. Easy.

But if you believe in DRY, the problem becomes making this uniqueness-checking code modular and portable across projects and different Core Data models.

The first issue we need to address is identifying which property of an NSManagedObject should be unique relative to the rest of the set. We can achieve this by adding a class method to the NSManagedObject class:

//Subclasses override this to return the keypath whose value must be unique
+ (NSString *)uniqueKeypath {
  return nil;
}


However, in a more advanced model there might be multiple properties that need to be evaluated to determine whether an object is unique, so …

//Subclasses override this to return every keypath that contributes to uniqueness
+ (NSArray *)uniqueKeypaths {
  return nil;
}


It's important to note that since we are using keypaths, we can actually evaluate the values of related entities.
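For example, a hypothetical Book entity with an isbn attribute and an author relationship (all of these names are mine, purely for illustration) could declare its uniqueness like this:

@implementation Book

//A book is a duplicate only if both the ISBN and the related author's name already exist
+ (NSArray *)uniqueKeypaths {
  return @[@"isbn", @"author.name"];
}

@end
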

Finding a Needle in a Haystack

Now it's time to make the magic happen. We apply the concept from above: build a hash table, then look up each new object's unique value in the table. If we get a 'hit', we don't do anything; if we 'miss', we create a new object. The pseudocode is effectively:

FOR EACH object IN newObjects:
  IF NOT object IN allObjects:
    createObject(object)


The above is simplistic, but you get the point. However, you should have made the observation that if we do it exactly like this, it will not only take ages, but with an extremely large set we'll run out of memory long before we've finished the comparisons.

So, the solution is GCD and batched fetching. We can assume that only one import is happening at any one time, so once we have performed the initial fetch of all the existing objects, we can split the comparison operations for each of the new objects onto a concurrent GCD queue. We also need a place to store the objects that need to be created; rather than copying them into another collection, we can keep the current collection, gather the indexes of all the new objects, and then create a subset of the superset.
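
Before that comparison can run, a few supporting pieces need to exist. This is a minimal setup sketch; context, the Book entity, and the keypath are assumptions on my part, and the variable names just line up with the block below:

//Fetch all the existing objects once, up front
NSFetchRequest *fetch = [NSFetchRequest fetchRequestWithEntityName:@"Book"]; //Entity name assumed
NSError *error = nil;
NSArray *existingObjects = [context executeFetchRequest:fetch error:&error];

//Collect the existing unique values into a set for cheap membership tests
NSString *aKeypath = @"isbn"; //Assuming the single-keypath variant for simplicity
NSSet *hashes = [NSSet setWithArray:[existingObjects valueForKeyPath:aKeypath]];

//Shared state for the concurrent comparison pass
NSMutableIndexSet *uniqueIndexes = [NSMutableIndexSet indexSet];
NSLock *lock = [[NSLock alloc] init];
dispatch_queue_t concurrentQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);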

//Compare new hashes against all the known hashes on multiple threads
dispatch_apply([newObjects count], concurrentQueue, ^(size_t idx) {

  //Note that everything that happens here is on a concurrent queue
  if (![hashes member:[[newObjects objectAtIndex:idx] valueForKeyPath:aKeypath]]) {

    //We have to synchronize access to the mutable index set
    [lock lock]; //Lock the index set
    [uniqueIndexes addIndex:idx]; //Add the unique index
    [lock unlock]; //Unlock the index set
  }
});
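
And once dispatch_apply returns, the "subset of the superset" mentioned above is a one-liner:

//Pull out just the objects whose hashes were never seen before
NSArray *objectsToCreate = [newObjects objectsAtIndexes:uniqueIndexes];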


In this case I'm using dispatch_apply (which I personally think is awesome). It will spawn multiple instances of the block on a concurrent queue. Because of the concurrent nature of this method, it's important that we lock the NSMutableIndexSet to ensure it doesn't blow up when two indexes are added at the same time. The current implementation with a simple NSLock results in poor performance on the initial import, as every single block will attempt to acquire the lock so it can add an index. One possible solution is to use a serial dispatch queue to handle the adding of the indexes, and call it via dispatch_async.
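
Something like this is what I have in mind; the queue label is arbitrary and I haven't benchmarked it:

//A serial queue that owns all mutations of the index set
dispatch_queue_t indexQueue = dispatch_queue_create("com.example.unique-indexes", DISPATCH_QUEUE_SERIAL);

dispatch_apply([newObjects count], concurrentQueue, ^(size_t idx) {
  if (![hashes member:[[newObjects objectAtIndex:idx] valueForKeyPath:aKeypath]]) {
    //No lock contention: the serial queue serialises the writes for us
    dispatch_async(indexQueue, ^{
      [uniqueIndexes addIndex:idx];
    });
  }
});

//dispatch_apply waits for its blocks, but not for the async writes, so drain the queue
dispatch_sync(indexQueue, ^{});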

The next part is to split the work of fetching the objects into smaller batches, so we not only perform smaller units of work but also keep the peak memory usage lower. NSFetchRequest has support for batching, which is apparently handled transparently to the rest of the code via a special type of NSArray. However, I haven't tested this to ensure it behaves as I expect.
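
If you want to try it, it's a single property on the fetch request (the batch size here is an arbitrary pick):

//Core Data faults objects in from the store 100 at a time
fetch.fetchBatchSize = 100;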

I've posted an implementation as an abstract subclass of NSManagedObject on GitHub, fork away!

