BISG and BIFSG

BISG stands for Bayesian Improved Surname Geocoding. It is a statistical method that estimates the probability that an individual belongs to each racial or ethnic group, based on the individual's surname and the demographics of the area where they live. Aggregating these probabilities gives an estimate of the racial or ethnic composition of a population.

BIFSG stands for Bayesian Improved First Name Surname Geocoding. It extends BISG by also using the individual's first name, alongside the surname and geographic location, to estimate those probabilities.

Both techniques are often employed in demographic research, public health studies, and marketing analysis where accurate demographic data is required but may be incomplete or unavailable.

Why are BISG and BIFSG important?

In banking, fair lending testing is a crucial requirement for lenders, ensuring their lending practices comply with the law and do not discriminate against protected classes.

However, depending on the jurisdiction, collecting racial or ethnic information is prohibited for some applications, which poses a challenge for fair lending testing. BISG and BIFSG are then used to estimate the racial and ethnic composition of the population in the data.

Overall, BISG and BIFSG provide a valuable tool to estimate demographic characteristics in areas where traditional survey methods may be impractical or too costly.

How does FAIRLY perform BISG analysis?

FAIRLY performs BISG and BIFSG analyses using the Surgeo Library. This library provides a set of models that can impute race based on surname, first name and geolocation. The following models are available:

  • BIFSGModel, which takes a series of first names, a series of surnames, and a series of ZIP codes and gives the BIFSG results of those inputs (e.g. someone named HECTOR DIAZ in ZIP 79902 has an 80% probability of being Hispanic).

  • FirstNameModel, which takes a series of first names and gives the racial breakdown of each first name (e.g. 92% of people with the first name AARON are white).

  • GeocodeModel, which takes a series of ZIP codes and gives the racial makeup of each ZIP code (e.g. 81% of people in ZIP 65201 are white).

  • SurnameModel, which takes a series of surnames and gives the racial breakdown of each surname (e.g. 5% of people with the surname DIAZ are white).

  • SurgeoModel, which takes a series of surnames and a series of ZIP codes and gives the BISG results of those inputs (e.g. someone named DIAZ in ZIP 65201 has a 26% probability of being white).
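To illustrate why a combined model can return a figure that differs from both the surname-only and the ZIP-only figures, the sketch below applies the Bayesian update that BISG is generally described as using: the surname-based race probabilities are multiplied by the share of each group's population that lives in the ZIP code, and the result is renormalized. The function name bisg_posterior and every number in the example are hypothetical and chosen only for illustration. They are not Surgeo's internals and not real Census values.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    p_race_given_surname: dict mapping race -> P(race | surname)
    p_geo_given_race:     dict mapping race -> P(lives in this ZIP | race)
    Returns a dict mapping race -> P(race | surname, ZIP).
    """
    # Unnormalized posterior: surname-based prior times geographic likelihood.
    joint = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(joint.values())
    if total == 0:
        raise ValueError("no race group has a non-zero probability")
    # Renormalize so the probabilities sum to 1.
    return {race: value / total for race, value in joint.items()}


# Hypothetical inputs for illustration only (NOT real Census figures).
surname_prior = {"white": 0.05, "black": 0.01, "hispanic": 0.92, "other": 0.02}
zip_likelihood = {"white": 0.0004, "black": 0.0003, "hispanic": 0.0001, "other": 0.0002}

posterior = bisg_posterior(surname_prior, zip_likelihood)
for race, probability in posterior.items():
    print(f"{race}: {probability:.3f}")
```

With these made-up inputs, the probability of "white" rises above the surname-only value because the hypothetical ZIP code is home to a relatively larger share of that group, which is the same kind of shift the DIAZ example above describes.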

The base data for Surgeo models are sourced from publicly available data. Specifically:

  1. 2010 United States Census Summary File 1 data set;
  2. 2010 United States Census Frequently Occurring Surnames data set; and
  3. Demographic aspects of first names data set [1].

[1] Konstantinos Tzioumis, “Data for: Demographic aspects of first names,” Harvard Dataverse (2018), V1. https://doi.org/10.7910/DVN/TYJKEZ

The Fairly Platform Approach

Fairly uses all of the aforementioned models to impute race information. The most accurate model from Surgeo is the BIFSGModel, which combines first names, surnames and ZIP codes to produce its results.

In some cases, the dataset to be analyzed may not contain all of the required information to run the BIFSGModel.

To account for these situations, Fairly takes a best-effort approach: for each entry in a dataset, it imputes race with a model that the entry's available fields can support.

For example, if only first names are available, Fairly will use the FirstNameModel to infer race. If only ZIP codes are available, then Fairly will use the GeocodeModel.
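The best-effort selection described above can be sketched as a simple priority list. This is a minimal illustration and not Fairly's actual implementation: the field names first_name, surname and zip_code, the function choose_model, and the ordering for partially complete entries (for instance, preferring FirstNameModel over GeocodeModel when only a first name and a ZIP code are present) are all assumptions. Only the model names come from Surgeo.

```python
# Models listed from most to least informative, each with the fields it needs.
# The field names and the fallback order are assumptions made for this sketch.
FALLBACK_ORDER = [
    ("BIFSGModel", ("first_name", "surname", "zip_code")),
    ("SurgeoModel", ("surname", "zip_code")),
    ("SurnameModel", ("surname",)),
    ("FirstNameModel", ("first_name",)),
    ("GeocodeModel", ("zip_code",)),
]


def choose_model(record):
    """Return the name of the first model whose required fields are all present.

    record: dict of field name -> value for one dataset entry.
    A field counts as present when its value is not None and not blank.
    Returns None when no model can be applied.
    """
    present = {
        field
        for field, value in record.items()
        if value is not None and str(value).strip()
    }
    for model_name, required_fields in FALLBACK_ORDER:
        if all(field in present for field in required_fields):
            return model_name
    return None


# Hypothetical entries with different amounts of missing information.
entries = [
    {"first_name": "HECTOR", "surname": "DIAZ", "zip_code": "79902"},
    {"first_name": "AARON", "surname": "", "zip_code": None},
    {"first_name": None, "surname": None, "zip_code": "65201"},
]
chosen = [choose_model(entry) for entry in entries]
print(chosen)
```

Selecting a model per entry, rather than once per dataset, means a few incomplete rows do not force the whole dataset down to a less informative model.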

For more details, please visit:

https://surgeo.readthedocs.io/en/dev/

Limitations

BISG and BIFSG produce probabilistic estimates rather than observed values, and they are not without limitations. Their accuracy can vary depending on factors such as the quality of the name data and the diversity of the population being analyzed.