From time to time I come across code that triggers thoughts like: “oh, it would be much simpler in Haskell”, “why didn’t the author use a more functional style” or “f*ing mutability”. The last one is usually related to hard-to-track bugs…
I don’t want to repeat the numerous arguments for why functional programming is better, nor start a discussion about whether it really is the better approach. Personally and professionally I think it is (with some exceptions). I will just give one more example. Interestingly, programs written in functional programming languages, or in a functional style, are sometimes perceived as slower. In this example, though, sticking to a functional style would have prevented a super-slow program. Another solution would have been to assign a software developer with at least a little experience of working with larger data.
I was asked to improve the performance of a piece of Python code used to audit (reports of) stock trading operations. I did not know much about the problem domain, but the code involved quite large data frames.
A first glance at the profiling results showed that about 1/3 of the program’s execution time was spent in the pd.DataFrame append operation. Not surprising: each call allocates a new frame and copies the existing data into it, so with large data frames this is expected.
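To illustrate why growing a frame inside a loop gets expensive, here is a minimal sketch that counts how many rows end up in newly built frames under each approach. The function names, the `qty` column and the chunk sizes are made up for this illustration, and the row count is only a rough cost model, not a measurement of the audited code.

```python
import pandas as pd


def concat_in_loop(chunks):
    """Grow the frame one chunk at a time; return the frame and rows built."""
    df = chunks[0]
    rows_built = 0
    for chunk in chunks[1:]:
        # Each call builds a brand-new frame holding everything so far.
        df = pd.concat([df, chunk], ignore_index=True)
        rows_built += len(df)
    return df, rows_built


def concat_once(chunks):
    """Collect the chunks first, then build the result in a single call."""
    df = pd.concat(chunks, ignore_index=True)
    return df, len(df)


# Hypothetical data: 100 chunks of 10 rows each.
chunks = [pd.DataFrame({"qty": range(i * 10, (i + 1) * 10)}) for i in range(100)]

looped, looped_rows = concat_in_loop(chunks)
single, single_rows = concat_once(chunks)

print(looped.equals(single))     # both approaches produce the same frame
print(looped_rows, single_rows)  # rows built: loop vs. single concat
```

With these numbers the loop builds frames totalling 50,490 rows, against 1,000 for the single call, and the gap grows quadratically with the number of chunks.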
The first piece of code I encountered that might perform too many unnecessary appends looked more or less like this:
    def replay(df, ...):
        # ...
        for isin in isin_list:
            isin_movements = get_isins(m_current)
            if len(isin_movements.index) > 0:
                df_current = get_movements(df, isin_movements)
                df_current = replay_movements(df_current, isin_movements)
                df = pd.concat([df, df_current])
                df.reset_index(inplace=True, drop=True)
        # ...
The obvious improvement would be to concat once after the loop:
    dfs = [df]
    for isin in isin_list:
        isin_movements = get_isins(m_current)
        if len(isin_movements.index) > 0:
            df_current = get_movements(df, isin_movements)
            df_current = replay_movements(df_current, isin_movements)
            dfs.append(df_current)
    df = pd.concat(dfs)
    df.reset_index(inplace=True, drop=True)
However, note the line:
    df_current = get_movements(df, isin_movements)
Without examining get_movements, and quite likely without a deep understanding of the business problem being solved, I cannot say whether get_movements depends only on the df that was passed to the replay function, or also on the results of previous loop iterations. I suspect the former. But if the code had been written using a list comprehension in the first place:
    def replay(df, ...):
        # ...
        def _replay_movements(isin):
            isin_movements = get_isins(m_current)
            if len(isin_movements.index) > 0:
                df_current = get_movements(df, isin_movements)
                df_current = replay_movements(df_current, isin_movements)
                return df_current
            else:
                return pd.DataFrame()

        dfs = [_replay_movements(isin) for isin in isin_list]
        # keep the original rows too, as in the loop versions
        df = pd.concat([df] + dfs)
        df.reset_index(inplace=True, drop=True)
        # ...
not only would it avoid any unnecessary appends, it would also be perfectly clear that _replay_movements does not depend on previous iterations.
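To make that independence concrete, here is a small runnable sketch of the same shape. The helpers `movements_for` and `replay_movements`, the `isin`, `qty` and `replayed` columns, and the sample ISIN values are hypothetical stand-ins, since the real get_isins, get_movements and replay_movements are not shown here.

```python
import pandas as pd


def movements_for(df, isin):
    """Hypothetical stand-in for get_isins/get_movements: rows of one ISIN."""
    return df[df["isin"] == isin]


def replay_movements(movements):
    """Hypothetical stand-in for the real replay: marks rows as replayed."""
    return movements.assign(replayed=True)


def replay(df, isin_list):
    def _replay_one(isin):
        # Reads only the df passed to replay(), never an earlier result.
        return replay_movements(movements_for(df, isin))

    replayed = [_replay_one(isin) for isin in isin_list]
    # Skip ISINs without movements instead of concatenating empty frames.
    non_empty = [part for part in replayed if not part.empty]
    # Single concat after all per-ISIN frames have been built.
    return pd.concat([df, *non_empty], ignore_index=True)


# Made-up sample data; "CCC" has no movements at all.
trades = pd.DataFrame({"isin": ["AAA", "BBB", "AAA"], "qty": [10, 5, -3]})
result = replay(trades, ["AAA", "BBB", "CCC"])
print(result)
```

Because `_replay_one` only closes over the frame passed to `replay` and that frame is never reassigned, the per-ISIN calls could be reordered, or even run in parallel, without changing which rows are produced.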