
Correct Use Of A Fold Or Reduce Function To Long-to-wide Data In Python Or Javascript?

Trying to learn to think like a functional programmer a little more---I'd like to transform a data set with what I think is either a fold or a reduce operation. In R, I would thin

Solution 1:

I know this isn't the fold-style solution you were asking for, but I would do this with itertools, which is just as functional (unless you think Haskell is less functional than Lisp…), and also probably the most pythonic way to solve this.
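For comparison, here is a rough sketch of what a fold-style version could look like with functools.reduce. The function name fold_rows and the three sample rows are made up for illustration, and it assumes rows with the same query are already adjacent. It builds a new accumulator at every step instead of mutating one, which keeps it purely functional but makes it quadratic in the worst case.

```python
import functools

def fold_rows(acc, row):
    """Fold step: extend the last group if the query matches, else start a new one."""
    result = {'rank': row['rank'], 'url': row['url']}
    if acc and acc[-1]['query'] == row['query']:
        last = acc[-1]
        # Build fresh containers rather than mutating the accumulator in place.
        merged = {'query': last['query'], 'results': last['results'] + [result]}
        return acc[:-1] + [merged]
    return acc + [{'query': row['query'], 'results': [result]}]

# Hypothetical rows shaped like the sample data.
rows = [
    {'query': 'Q1', 'detail': 'cool', 'rank': 1, 'url': 'awesome1'},
    {'query': 'Q1', 'detail': 'cool', 'rank': 2, 'url': 'awesome2'},
    {'query': 'Q#2', 'detail': 'same', 'rank': 1, 'url': 'newurl1'},
]

wide = functools.reduce(fold_rows, rows, [])
print(wide)
```

The rest of this answer takes the itertools route instead.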

The idea is to think of your sequence as a lazy list, and apply a chain of lazy transformations to it until you get the list you want.

The key step here is groupby:

>>> import itertools, json, operator
>>> initial = json.loads(s)
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([(key, list(group)) for key, group in groups])
[('Q1',
  [{'detail': 'cool', 'query': 'Q1', 'rank': 1, 'url': 'awesome1'},
   {'detail': 'cool', 'query': 'Q1', 'rank': 2, 'url': 'awesome2'},
   {'detail': 'cool', 'query': 'Q1', 'rank': 3, 'url': 'awesome3'}]),
 ('Q#2',
  [{'detail': 'same', 'query': 'Q#2', 'rank': 1, 'url': 'newurl1'},
   {'detail': 'same', 'query': 'Q#2', 'rank': 2, 'url': 'newurl2'},
   {'detail': 'same', 'query': 'Q#2', 'rank': 3, 'url': 'newurl3'}])]

You can see how close we are already, in just one step.
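One caveat: groupby only merges consecutive items with the same key, so this works because the sample rows are already ordered by query. If your real data might be interleaved, sorting by the same key first should fix it. A small sketch with made-up rows (not your actual data):

```python
import itertools
import operator

# Hypothetical rows, deliberately interleaved so equal queries are not adjacent.
rows = [
    {'query': 'Q1', 'rank': 1},
    {'query': 'Q#2', 'rank': 1},
    {'query': 'Q1', 'rank': 2},
]

by_query = operator.itemgetter('query')

# Without sorting, groupby starts a new group every time the key changes.
unsorted_keys = [key for key, _ in itertools.groupby(rows, by_query)]

# sorted() is stable, so rows keep their relative order within each query.
sorted_rows = sorted(rows, key=by_query)
sorted_keys = [key for key, _ in itertools.groupby(sorted_rows, by_query)]

print(unsorted_keys)  # 'Q1' shows up twice
print(sorted_keys)    # each query shows up once
```

The price is that sorted has to read the whole input before groupby sees anything, so you lose some of the laziness.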

To restructure each (key, group) pair into the dict format you want:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([{"query": key, "results": list(group)} for key, group in groups])
[{'query': 'Q1',
  'results': [{'detail': 'cool',
               'query': 'Q1',
               'rank': 1,
               'url': 'awesome1'},
              {'detail': 'cool',
               'query': 'Q1',
               'rank': 2,
               'url': 'awesome2'},
              {'detail': 'cool',
               'query': 'Q1',
               'rank': 3,
               'url': 'awesome3'}]},
 {'query': 'Q#2',
  'results': [{'detail': 'same',
               'query': 'Q#2',
               'rank': 1,
               'url': 'newurl1'},
              {'detail': 'same',
               'query': 'Q#2',
               'rank': 2,
               'url': 'newurl2'},
              {'detail': 'same',
               'query': 'Q#2',
               'rank': 3,
               'url': 'newurl3'}]}]

But wait, there's still those extra fields you want to get rid of. Easy:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> def filterkeys(d):
...     return {k: v for k, v in d.items() if k in ('rank', 'url')}
...
>>> filtered = ((key, map(filterkeys, group)) for key, group in groups)
>>> print([{"query": key, "results": list(group)} for key, group in filtered])
[{'query': 'Q1',
  'results': [{'rank': 1, 'url': 'awesome1'},
              {'rank': 2, 'url': 'awesome2'},
              {'rank': 3, 'url': 'awesome3'}]},
 {'query': 'Q#2',
  'results': [{'rank': 1, 'url': 'newurl1'},
              {'rank': 2, 'url': 'newurl2'},
              {'rank': 3, 'url': 'newurl3'}]}]

The only thing left to do is to call json.dumps instead of print.
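Put together as a script rather than a REPL session, the whole thing might look roughly like this. The input string s here is a shortened stand-in for your data, and long_to_wide is just a name I picked for the wrapper.

```python
import itertools
import json
import operator

# Hypothetical input shaped like the sample rows; substitute the real JSON text.
s = '''[
  {"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
  {"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
  {"query": "Q#2", "detail": "same", "rank": 1, "url": "newurl1"}
]'''

def filterkeys(d):
    # Keep only the per-result fields.
    return {k: v for k, v in d.items() if k in ('rank', 'url')}

def long_to_wide(text):
    rows = json.loads(text)
    groups = itertools.groupby(rows, operator.itemgetter('query'))
    # Each group is consumed inside the comprehension, before groupby advances.
    wide = [{'query': key, 'results': [filterkeys(row) for row in group]}
            for key, group in groups]
    return json.dumps(wide, indent=2)

output = long_to_wide(s)
print(output)
```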


For your followup, you want to take all values that are identical across every row with the same query and group them into otherstuff, and then list whatever remains in the results.

So, for each group, first we want to get the common keys. We can do this by iterating the keys of any member of the group (anything that's not in the first member can't be in all members), so:

def common_fields(group):
    def in_all_members(key, value):
        return all(member[key] == value for member in group[1:])
    return {key: value for key, value in group[0].items() if in_all_members(key, value)}

Or, alternatively… if we turn each member into a set of key-value pairs, instead of a dict, we can then just intersect them all. And this means we finally get to use reduce, so let's try that:

import functools

def common_fields(group):
    return dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))

I think the conversion back and forth between dict and set may make this less readable, and it also means that your values have to be hashable (not a problem for your sample data, since the values are all strings and integers)… but it's certainly more concise.
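To make the hashability point concrete, here is a small sketch that runs both versions on made-up rows containing a list value. The two function names are mine, chosen so both versions can live side by side, and the comparison version adds a `key in member` check that the version above doesn't have, so rows with missing keys don't raise KeyError.

```python
import functools

def common_fields_by_comparison(group):
    first, rest = group[0], group[1:]
    # A field is common if every other member has the same value for it.
    return {key: value for key, value in first.items()
            if all(key in member and member[key] == value for member in rest)}

def common_fields_by_intersection(group):
    return dict(functools.reduce(set.intersection,
                                 (set(d.items()) for d in group)))

# Hypothetical rows with an unhashable (list) value in 'tags'.
rows = [
    {'query': 'Q1', 'tags': ['a', 'b'], 'rank': 1},
    {'query': 'Q1', 'tags': ['a', 'b'], 'rank': 2},
]

by_comparison = common_fields_by_comparison(rows)

try:
    common_fields_by_intersection(rows)
    intersection_failed = False
except TypeError:
    # set(d.items()) has to hash ('tags', ['a', 'b']), and lists are unhashable.
    intersection_failed = True

print(by_comparison)
print(intersection_failed)
```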

This will, of course, always include query as a common field, but we'll deal with that later. (Also, you wanted otherstuff to be a list with one dict, so we'll throw an extra pair of brackets around it).

Meanwhile, results is the same as above, except that filterkeys filters out all of the common fields, instead of filtering out everything but rank and url. Putting it together:

def process_group(group):
    # Copy the group into a list, since we need to walk it more than once.
    group = list(group)
    common = dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))
    def filterkeys(member):
        return {k: v for k, v in member.items() if k not in common}
    results = list(map(filterkeys, group))
    # query is always common; pull it out so it doesn't end up in otherstuff.
    query = common.pop('query')
    return {'query': query,
            'otherstuff': [common],
            'results': results}

So, now we just use that function:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([process_group(group) for key, group in groups])
[{'otherstuff': [{'detail': 'cool'}],
  'query': 'Q1',
  'results': [{'rank': 1, 'url': 'awesome1'},
              {'rank': 2, 'url': 'awesome2'},
              {'rank': 3, 'url': 'awesome3'}]},
 {'otherstuff': [{'detail': 'same'}],
  'query': 'Q#2',
  'results': [{'rank': 1, 'url': 'newurl1'},
              {'rank': 2, 'url': 'newurl2'},
              {'rank': 3, 'url': 'newurl3'}]}]

This obviously isn't as trivial as the original version, but hopefully it all still makes sense. There are only two new tricks. First, we have to iterate over groups multiple times (once to find the common keys, and then again to extract the remaining keys)
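The multiple-iteration point can be seen in isolation. Each group that groupby hands you is a one-shot iterator, so a second pass over it comes back empty unless you copy it into a list first, which is what the first line of process_group does. A sketch with made-up rows:

```python
import itertools
import operator

# Hypothetical rows, all in one group.
rows = [
    {'query': 'Q1', 'rank': 1},
    {'query': 'Q1', 'rank': 2},
]
by_query = operator.itemgetter('query')

# Iterating the raw group twice: the second pass finds nothing left.
for key, group in itertools.groupby(rows, by_query):
    first_pass = [row['rank'] for row in group]
    second_pass = [row['rank'] for row in group]

# Copying the group into a list first makes it safe to walk repeatedly.
for key, group in itertools.groupby(rows, by_query):
    members = list(group)
    first_pass_list = [row['rank'] for row in members]
    second_pass_list = [row['rank'] for row in members]

print(first_pass, second_pass)
print(first_pass_list, second_pass_list)
```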
