Generate Unique Identifiers (UID) in U-SQL on Azure Data Lake Analytics with Python extension scripts

U-SQL has no built-in construct for generating a unique identifier for each row of a text file. The script below generates a unique identifier for every row in the input file.

The steps are:

1. Extract the data file with the EXTRACT statement.
2. Reducers are spun up based on the customer code (the REDUCE ... ON column). Too few or too many reducers can both cause performance issues, so pick a column that splits the data fairly evenly, and do not pick a unique column, since that would create a separate group for every row.
3. For every reduced data set, the Python script is invoked with the rows as a data frame. The script adds another column, "sguid", to the data frame and fills it with a newly generated, encoded UID.
4. The output of the reducer has the new column sguid.
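The steps above can be approximated locally in plain Python. This is only a sketch of the grouping behaviour, not the ADL-A runtime: the helpers `reduce_on` and `add_sguid` and the sample rows are hypothetical stand-ins for the REDUCE statement, the `usqlml_main` function, and the input file. It shows that the reducer function is called once per distinct key value, which is why a unique column is a poor choice of key.

```python
import base64
import uuid
from collections import defaultdict


def add_sguid(rows):
    """Stand-in for usqlml_main: add an 'sguid' field to every row of one group."""
    for row in rows:
        row["sguid"] = base64.urlsafe_b64encode(uuid.uuid1().bytes).decode("ascii")
    return rows


def reduce_on(rows, key, reducer):
    """Group rows by `key` and call `reducer` once per group (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(dict(row))  # copy so the input rows stay untouched
    calls = 0
    output = []
    for group_rows in groups.values():
        output.extend(reducer(group_rows))
        calls += 1
    return output, calls


# Made-up sample rows using two of the column names from the script.
rows = [
    {"OrderNo": "1001", "CustomerCode": "C01"},
    {"OrderNo": "1002", "CustomerCode": "C01"},
    {"OrderNo": "1003", "CustomerCode": "C02"},
    {"OrderNo": "1004", "CustomerCode": "C02"},
]

by_customer, customer_calls = reduce_on(rows, "CustomerCode", add_sguid)
by_order, order_calls = reduce_on(rows, "OrderNo", add_sguid)

print(customer_calls)  # one reducer call per distinct customer code
print(order_calls)     # one reducer call per row, because OrderNo is unique here
```

Every output row carries an sguid in both cases; only the number of reducer invocations differs.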

REFERENCE ASSEMBLY [ExtPython];

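DECLARE @ReduceScript = @"
import uuid
import base64

def usqlml_main(df):
    # Add the new column, then fill it with one encoded UID per row.
    df['sguid'] = ''
    # decode('ascii') returns plain text; str() on bytes would yield b'...' in Python 3.
    df['sguid'] = df.sguid.apply(lambda row: base64.urlsafe_b64encode(uuid.uuid1().bytes).decode('ascii'))
    return df
";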

@AllData =
    EXTRACT OrderNo string,
            Date string,
            CustomerCode string,
            ProductCode string,
            SalesArea string,
            OrderValue string
    FROM "/DataLoads/Input/TempFile.csv"
    USING Extractors.Text(delimiter: ',', skipFirstNRows: 1);

@ReducedData =
    REDUCE @AllData
    ON CustomerCode
    PRODUCE sguid string,
            OrderNo string,
            Date string,
            CustomerCode string,
            ProductCode string,
            SalesArea string,
            OrderValue string
    USING new Extension.Python.Reducer(pyScript:@ReduceScript);

OUTPUT @ReducedData
TO "/DataLoads/CSVOutputwithGUID.txt"
USING Outputters.Text();
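As a rough illustration of what the sguid expression in the reduce script produces, the following standalone Python sketch encodes a version-1 UUID as URL-safe Base64 text and decodes it back. The helper names `new_sguid` and `decode_sguid` are hypothetical and not part of the original script.

```python
import base64
import uuid


def new_sguid():
    """Return a version-1 UUID encoded as URL-safe Base64 text."""
    raw = uuid.uuid1().bytes  # a UUID is always 16 bytes
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_sguid(sguid):
    """Recover the uuid.UUID object from its Base64 text form."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(sguid))


sample = new_sguid()
restored = decode_sguid(sample)

print(sample)            # Base64 text of the 16 UUID bytes
print(len(sample))       # 16 bytes encode to 24 characters, including '==' padding
print(restored.version)  # uuid1() produces version-1 UUIDs
```

The encoded form is shorter than the usual 36-character hyphenated UUID string, and the URL-safe alphabet uses '-' and '_' rather than '+' and '/', so the delimiter used for the output file (',') never appears in the value.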

Note: Follow these instructions to enable U-SQL extensions on your ADL-A account.

Last updated on 2017-07-10
