Welcome to Data Quality Whistler’s documentation!
DQ Analyzer
- class dq_whistler.analyzer.DataQualityAnalyzer(data: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame], config: List[Dict[str, str]])[source]
Analyzer class responsible for taking a JSON dict and executing it on the columnar data.
- Parameters
  data (pyspark.sql.DataFrame | pandas.core.frame.DataFrame) – The dataframe containing the data
  config (List[Dict[str, str]]) – The array of dicts containing the config for each column
- class dq_whistler.analyzer.NpEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]
- default(obj)[source]
Implement this method in a subclass such that it returns a serializable object for obj, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
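The docstring example above can be exercised directly with the standard-library json module. NpEncoder itself presumably applies the same pattern to NumPy scalar types, though that is not shown in this documentation; the subclass below only demonstrates the iterator case:

```python
import json

class IterEncoder(json.JSONEncoder):
    """Minimal subclass following the docstring example above:
    serialize arbitrary iterators by converting them to lists."""
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, o)

print(json.dumps({"gen": range(3)}, cls=IterEncoder))  # → {"gen": [0, 1, 2]}
```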
Constraints
- class dq_whistler.constraints.constraint.Constraint(constraint: Dict[str, str], column_name: str)[source]
Defines the base Constraint class
- execute_check(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Dict[str, str] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dict containing the final output for one constraint. Example output:

  {
      "name": "eq",
      "values": 5,
      "constraint_status": "failed/success",
      "invalid_count": 21,
      "invalid_values": [4, 6, 7, 1]
  }
- Return type
  Dict[str, str]
- get_column_name()[source]
- Returns
The name of the column for which the Constraint instance was created
- Return type
str
- abstract get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe containing the failed cases for a constraint
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
Numeric Constraints
- class dq_whistler.constraints.number_type.Between(constraint: Dict[str, str], column_name: str)[source]
Between constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "between", "values": [3, 4] }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is between [2, 8], the dataframe will have rows where values are not between [2, 8] (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
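The failure semantics described above can be illustrated on a plain Python list (the actual method operates on a Spark dataframe or pandas series; this is only a sketch of the filtering logic, not the library's implementation):

```python
# Illustration of the "between" failure semantics: invalid rows are
# the values falling outside the [lo, hi] range.
values = [1, 3, 5, 9, 10]
lo, hi = 2, 8  # constraint: {"name": "between", "values": [2, 8]}

invalid = [v for v in values if not (lo <= v <= hi)]
print(invalid)  # → [1, 9, 10]
```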
- class dq_whistler.constraints.number_type.Equal(constraint: Dict[str, str], column_name: str)[source]
Equal constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "eq", "values": 5 }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is eq to 5, the dataframe will have rows where values are != 5 (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.GreaterThan(constraint: Dict[str, str], column_name: str)[source]
GreaterThan constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "gt", "values": 5 }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is gt 5, the dataframe will have rows where values are <= 5 (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.GreaterThanEqualTo(constraint: Dict[str, str], column_name: str)[source]
GreaterThanEqualTo constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "gt_eq", "values": 5 }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is gt_eq to 5, the dataframe will have rows where values are < 5 (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.IsIn(constraint: Dict[str, str], column_name: str)[source]
IsIn constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "is_in", "values": [1, 2, 3] }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is is_in [1, 2, 3], the dataframe will have rows where values are not in [1, 2, 3] (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
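As with the other constraints, the is_in failure semantics can be sketched on a plain Python list (an illustration of the filtering logic only, not the library's Spark/pandas implementation):

```python
# Illustration of the "is_in" failure semantics: invalid rows are
# the values NOT in the allowed set.
values = [1, 2, 3, 4, 7]
allowed = [1, 2, 3]  # constraint: {"name": "is_in", "values": [1, 2, 3]}

invalid = [v for v in values if v not in allowed]
print(invalid)  # → [4, 7]
```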
- class dq_whistler.constraints.number_type.LessThan(constraint: Dict[str, str], column_name: str)[source]
LessThan constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "lt", "values": 5 }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is lt 5, the dataframe will have rows where values are >= 5 (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.LessThanEqualTo(constraint: Dict[str, str], column_name: str)[source]
LessThanEqualTo constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "lt_eq", "values": 5 }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is lt_eq to 5, the dataframe will have rows where values are > 5 (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.NotBetween(constraint: Dict[str, str], column_name: str)[source]
NotBetween constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "not_between", "values": [3, 5] }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is not_between [2, 8], the dataframe will have rows where values are between [2, 8] (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.NotEqual(constraint: Dict[str, str], column_name: str)[source]
NotEqual constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "nt_eq", "values": 5 }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is nt_eq to 5, the dataframe will have rows where values are == 5 (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.number_type.NotIn(constraint: Dict[str, str], column_name: str)[source]
NotIn constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "not_in", "values": [1, 2, 3] }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is not_in [1, 2, 3], the dataframe will have rows where values are in [1, 2, 3] (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
String Constraints
- class dq_whistler.constraints.string_type.Contains(constraint: Dict[str, str], column_name: str)[source]
Contains constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "contains", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is contains "abc", the dataframe will have rows where values do not contain "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
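The contains failure semantics, sketched on plain Python strings (an illustration of the filtering logic only, not the library's Spark/pandas implementation):

```python
# Illustration of the "contains" failure semantics: invalid rows are
# the values that do not contain the configured substring.
values = ["abcdef", "xyz", "zabc", "def"]
needle = "abc"  # constraint: {"name": "contains", "values": "abc"}

invalid = [v for v in values if needle not in v]
print(invalid)  # → ['xyz', 'def']
```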
- class dq_whistler.constraints.string_type.EndsWith(constraint: Dict[str, str], column_name: str)[source]
EndsWith constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "ends_with", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is ends_with "abc", the dataframe will have rows where values do not end with "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.Equal(constraint: Dict[str, str], column_name: str)[source]
Equal constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "eq", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is eq to "abc", the dataframe will have rows where values are != "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.IsIn(constraint: Dict[str, str], column_name: str)[source]
IsIn constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "is_in", "values": ["abc", "xyz"] }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is is_in ["abc", "xyz"], the dataframe will have rows where values are not in ["abc", "xyz"] (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.NotContains(constraint: Dict[str, str], column_name: str)[source]
NotContains constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "not_contains", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is not_contains "abc", the dataframe will have rows where values contain "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.NotEndsWith(constraint: Dict[str, str], column_name: str)[source]
NotEndsWith constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "not_ends_with", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is not_ends_with "abc", the dataframe will have rows where values end with "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.NotEqual(constraint: Dict[str, str], column_name: str)[source]
NotEqual constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "nt_eq", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is nt_eq to "abc", the dataframe will have rows where values are == "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.NotIn(constraint: Dict[str, str], column_name: str)[source]
NotIn constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "not_in", "values": ["abc", "xyz"] }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is not_in ["abc", "xyz"], the dataframe will have rows where values are in ["abc", "xyz"] (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.NotStartsWith(constraint: Dict[str, str], column_name: str)[source]
NotStartsWith constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "not_starts_with", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is not_starts_with "abc", the dataframe will have rows where values start with "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
- class dq_whistler.constraints.string_type.Regex(constraint: Dict[str, str], column_name: str)[source]
Regex constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "regex", "values": "^[A-Za-z]$" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is regex ^[A-Za-z]$, the dataframe will have rows where values do not match the regex ^[A-Za-z]$ (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
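The regex failure semantics, sketched with the standard-library re module (an illustration of the filtering logic only, not the library's Spark/pandas implementation). Note that the documented example pattern ^[A-Za-z]$ matches exactly one letter:

```python
import re

# Illustration of the "regex" failure semantics: invalid rows are
# the values that do not match the configured pattern.
values = ["a", "Z", "ab", "1"]
pattern = re.compile(r"^[A-Za-z]$")  # constraint: {"name": "regex", "values": "^[A-Za-z]$"}

invalid = [v for v in values if not pattern.match(v)]
print(invalid)  # → ['ab', '1']
```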
- class dq_whistler.constraints.string_type.StartsWith(constraint: Dict[str, str], column_name: str)[source]
StartsWith constraint class that extends the base Constraint class
- Parameters
  constraint (Dict[str, str]) – The dict representing a constraint config:
  { "name": "starts_with", "values": "abc" }
  column_name (str) – The name of the column for the constraint check
- get_failure_df(data_frame: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series]) Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series] [source]
- Parameters
  data_frame (pyspark.sql.DataFrame | pandas.core.series.Series) – Column data
- Returns
  The dataframe with invalid cases as per the constraint. For example, if the constraint is starts_with "abc", the dataframe will have rows where values do not start with "abc" (i.e. only invalid cases).
- Return type
  pyspark.sql.DataFrame | pandas.core.series.Series
Profilers
- class dq_whistler.profiler.column_profiler.ColumnProfiler(column_data: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series], config: Dict[str, Any])[source]
Base class for column profiler
- add_constraint(constraint: dq_whistler.constraints.constraint.Constraint)[source]
Adds an instance of Constraint to the parent list of constraints for this profiler.
- Parameters
  constraint (dq_whistler.constraints.constraint.Constraint) – An instance of the Constraint class
- get_column_config() Dict[str, Any] [source]
- Returns
The data quality config for the column
- Return type
Dict[str, Any]
- get_column_info() str [source]
- Returns
The column info for which the instance has been created. Sample output:

  str({
      "fields": [
          {
              "metadata": {},
              "name": "col_name",
              "nullable": True,
              "type": "string"
          }
      ],
      "type": "struct"
  })
- Return type
str
- get_constraints_config() List[Dict[str, str]] [source]
- Returns
The array containing the constraints for the column
- Return type
List[Dict[str, str]]
- get_custom_constraint_check() List[Dict[str, str]] [source]
- Returns
An array containing the output of each constraint for a column. Sample output:

  [
      {
          "name": "eq",
          "values": 5,
          "constraint_status": "failed/success",
          "invalid_count": 21,
          "invalid_values": [4, 6, 7, 1]
      },
      ...
  ]
- Return type
List[Dict[str, str]]
Numeric Profiler
- class dq_whistler.profiler.number_profiler.NumberProfiler(column_data: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series], config: Dict[str, str])[source]
Class for Numeric datatype profiler
- get_stddev_value() float [source]
- Returns
The standard deviation of the column values
- Return type
float
- run() Dict[str, Any] [source]
- Returns
The final dict with all the metrics of a numeric column. Example output:

  {
      "total_count": 100,
      "null_count": 50,
      "unique_count": 20,
      "topn_values": {"1": 24, "2": 25},
      "min": 2.0,
      "max": 30.0,
      "mean": 18.0,
      "stddev": 5.0,
      "quality_score": 0,
      "constraints": [
          {
              "name": "eq",
              "values": 5,
              "constraint_status": "failed/success",
              "invalid_count": 21,
              "invalid_values": [4, 6, 7, 1]
          }
      ]
  }
- Return type
Dict[str, Any]
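A sketch of how the count and summary fields in the example output above could be computed over a plain Python column, with None marking a null. This only illustrates the meaning of the reported fields; the profiler's actual implementation runs on Spark/pandas, and whether it uses population or sample standard deviation is not stated here:

```python
from collections import Counter
from statistics import mean, pstdev

# Toy column; None plays the role of a null value.
column = [2.0, 30.0, 18.0, 18.0, None]

non_null = [v for v in column if v is not None]
metrics = {
    "total_count": len(column),                          # all rows, nulls included
    "null_count": column.count(None),
    "unique_count": len(set(non_null)),
    "topn_values": dict(Counter(non_null).most_common(2)),
    "min": min(non_null),
    "max": max(non_null),
    "mean": mean(non_null),
    "stddev": pstdev(non_null),                          # population stddev (assumption)
}
print(metrics["total_count"], metrics["null_count"], metrics["mean"])
```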
String Profiler
- class dq_whistler.profiler.string_profiler.StringProfiler(column_data: Union[pyspark.sql.dataframe.DataFrame, pandas.core.series.Series], config: Dict[str, str])[source]
Class for String datatype profiler
- run() Dict[str, Any] [source]
- Returns
The final dict with all the metrics of a string column. Example output:

  {
      "total_count": 100,
      "null_count": 50,
      "unique_count": 20,
      "topn_values": {"abc": 24, "xyz": 25},
      "quality_score": 0,
      "constraints": [
          {
              "name": "eq",
              "values": "abc",
              "constraint_status": "failed/success",
              "invalid_count": 21,
              "invalid_values": ["xy", "ab", "abcd"]
          }
      ]
  }
- Return type
Dict[str, Any]