Title: | Who are You? Bayesian Prediction of Racial Category Using Surname, First Name, Middle Name, and Geolocation |
---|---|
Description: | Predicts individual race/ethnicity using surname, first name, middle name, geolocation, and other attributes, such as gender and age. The method utilizes Bayes' Rule (with optional measurement error correction) to compute the posterior probability of each racial category for any given individual. The package implements methods described in Imai and Khanna (2016) "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records" Political Analysis <DOI:10.1093/pan/mpw001> and Imai, Olivella, and Rosenman (2022) "Addressing census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements" <DOI:10.1126/sciadv.adc9824>. The package also incorporates the data described in Rosenman, Olivella, and Imai (2023) "Race and ethnicity data for first, middle, and surnames" <DOI:10.1038/s41597-023-02202-2>. |
Authors: | Kabir Khanna [aut], Brandon Bertelsen [aut, cre], Santiago Olivella [aut], Evan Rosenman [aut], Alexander Rossell Hayes [aut], Kosuke Imai [aut] |
Maintainer: | Brandon Bertelsen <[email protected]> |
License: | GPL (>= 3) |
Version: | 3.0.4 |
Built: | 2025-01-10 04:48:17 UTC |
Source: | https://github.com/kosukeimai/wru |
format_legacy_data formats legacy data from the U.S. census to allow for Bayesian name geocoding.
format_legacy_data(legacyFilePath, state, outFile = NULL)
legacyFilePath |
A character vector giving the location of a legacy census data folder, sourced from https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/. These file names should end in ".pl". |
state |
The two letter state postal code. |
outFile |
Optional character vector determining whether the formatted RData object should be saved. The filepath should end in ".RData". |
This function allows users to construct datasets for analysis using the census legacy data format. These data are available for the 2020 census at https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/. This function returns data structured analogously to data from the Census API, which is not yet available for the 2020 Census as of September 2021.
## Not run:
gaCensusData <- format_legacy_data(PL94171::pl_url('GA', 2020))
predict_race_new(ga.voter.file, namesToUse = 'last, first, mid', census.geo = 'block', census.data = gaCensusData)
## End(Not run)
get_census_data returns county-, tract-, and block-level Census data for specified state(s). Using this function to download Census data in advance can save considerable time when running predict_race and census_helper.
get_census_data(
  key = Sys.getenv("CENSUS_API_KEY"),
  states,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  retry = 3,
  county.list = NULL
)
key |
A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page. By default, attempts to find a census key stored in an
environment variable named |
states |
which states to extract Census data for, e.g., |
age |
A |
sex |
A |
year |
A character object specifying the year of U.S. Census data to be downloaded.
Use |
census.geo |
An optional character vector specifying what level of
geography to use to merge in U.S. Census 2010 geographic data. Currently
|
retry |
The number of retries at the census website if network interruption occurs. |
county.list |
A named list of character vectors of counties present in your voter.file, per state. |
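The county.list argument is described as a named list with one character vector of counties per state. A minimal base R sketch of building such a list from a voter file follows. The toy data frame and its column names (state, county) are assumptions made for illustration, not part of the package.

```r
# Hypothetical voter file; the column names `state` and `county` and all
# values are invented for this sketch.
voter_file <- data.frame(
  state  = c("NJ", "NJ", "NY", "NY", "NJ"),
  county = c("021", "021", "061", "047", "013"),
  stringsAsFactors = FALSE
)

# Split the county codes by state, then keep each state's unique codes.
# The result is a named list of character vectors, one element per state.
county_list <- lapply(split(voter_file$county, voter_file$state), unique)

str(county_list)
```

A list shaped like this could then be passed as county.list, presumably so that only counties present in the voter file are requested.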
Output will be an object of class list indexed by state. Output will contain a subset of the following elements: state, age, sex, county, tract, block_group, block, and place.
## Not run:
get_census_data(states = c("NJ", "NY"), age = TRUE, sex = FALSE)
## End(Not run)
## Not run:
get_census_data(states = "MN", age = FALSE, sex = FALSE, year = "2020")
## End(Not run)
predict_race makes probabilistic estimates of individual-level race/ethnicity.
predict_race(
  voter.file,
  census.surname = TRUE,
  surname.only = FALSE,
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  census.key = Sys.getenv("CENSUS_API_KEY"),
  census.data = NULL,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  party = NULL,
  retry = 3,
  impute.missing = TRUE,
  skip_bad_geos = FALSE,
  use.counties = FALSE,
  model = "BISG",
  race.init = NULL,
  name.dictionaries = NULL,
  names.to.use = "surname",
  control = NULL
)
voter.file |
An object of class |
census.surname |
A |
surname.only |
A |
census.geo |
An optional character vector specifying what level of
geography to use to merge in U.S. Census geographic data. Currently
|
census.key |
A character object specifying user's Census API key.
Required if If |
census.data |
A list indexed by two-letter state abbreviations,
which contains pre-saved Census geographic data.
Can be generated using |
age |
An optional |
sex |
optional |
year |
An optional character vector specifying the year of U.S. Census geographic
data to be downloaded. Use |
party |
An optional character object specifying party registration field
in |
retry |
The number of retries at the census website if network interruption occurs. |
impute.missing |
Logical, defaults to TRUE. Should missing values be imputed? |
skip_bad_geos |
Logical. Option to have the function skip any geolocations that are not present
in the census data, returning a partial data set. Default is set to |
use.counties |
A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data? |
model |
Character string, either "BISG" (default) or "fBISG" (for error-correction, fully-Bayesian model). |
race.init |
Vector of initial race for each observation in voter.file.
Must be an integer vector, with 1=white, 2=black, 3=hispanic, 4=asian, and
5=other. Defaults to values obtained using |
name.dictionaries |
Optional named list of |
names.to.use |
One of 'surname', 'surname, first', or 'surname, first, middle'. Defaults to 'surname'. |
control |
List of control arguments only used when
|
This function implements the Bayesian race prediction methods outlined in Imai and Khanna (2016). The function produces probabilistic estimates of individual-level race/ethnicity based on surname, geolocation, and party.
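The Bayes' rule calculation behind a surname-plus-geolocation prediction can be sketched in a few lines of base R. This is an illustration of the general BISG idea, Pr(race | surname, geo) proportional to Pr(race | surname) times Pr(geo | race), and not the package's internal code. All probabilities below are invented, and the variable names are hypothetical.

```r
# Hypothetical Pr(race | surname) for a single surname (sums to 1).
p_race_given_surname <- c(whi = 0.05, bla = 0.01, his = 0.90,
                          asi = 0.02, oth = 0.02)

# Hypothetical Pr(geolocation | race): the share of each group's
# population that lives in this individual's tract.
p_geo_given_race <- c(whi = 0.002, bla = 0.001, his = 0.0005,
                      asi = 0.001, oth = 0.001)

# Bayes' rule, assuming surname and geolocation are conditionally
# independent given race: multiply, then normalize to sum to 1.
unnormalized <- p_race_given_surname * p_geo_given_race
posterior <- unnormalized / sum(unnormalized)

round(posterior, 3)
```

In this toy case the geolocation term pulls the posterior away from the surname-only probabilities, which is the effect that adding census.geo is meant to capture.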
Output will be an object of class data.frame. It will consist of the original user-input voter.file with additional columns with predicted probabilities for each of the five major racial categories: pred.whi for White, pred.bla for Black, pred.his for Hispanic/Latino, pred.asi for Asian/Pacific Islander, and pred.oth for Other/Mixed.
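Downstream analyses sometimes need a single most likely category per row rather than five probabilities. A minimal base R sketch of that step is below. The data frame is a hand-made stand-in for predict_race output with invented probabilities, and the race_hat column name is hypothetical.

```r
# Hypothetical stand-in for predict_race() output; only the pred.* column
# names come from the documentation, the values are invented.
preds <- data.frame(
  id       = 1:3,
  pred.whi = c(0.05, 0.03, 0.10),
  pred.bla = c(0.02, 0.01, 0.03),
  pred.his = c(0.03, 0.02, 0.80),
  pred.asi = c(0.85, 0.90, 0.02),
  pred.oth = c(0.05, 0.04, 0.05)
)

prob_cols <- c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth")

# max.col() gives, for each row, the index of the column holding the
# largest value; ties are resolved in favor of the first such column.
best <- max.col(as.matrix(preds[prob_cols]), ties.method = "first")

# Strip the "pred." prefix to get a short category label.
preds$race_hat <- sub("^pred\\.", "", prob_cols[best])

preds[, c("id", "race_hat")]
```

Collapsing to one label discards the uncertainty in the posterior, so the probabilities themselves are usually the better input where an analysis can accept them.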
data(voters)
try(predict_race(voter.file = voters, surname.only = TRUE))
## Not run:
try(predict_race(voter.file = voters, census.geo = "tract"))
## End(Not run)
## Not run:
try(predict_race(voter.file = voters, census.geo = "place", year = "2020"))
## End(Not run)
## Not run:
CensusObj <- try(get_census_data(states = c("NY", "DC", "NJ")))
try(predict_race(voter.file = voters, census.geo = "tract", census.data = CensusObj, party = "PID"))
## End(Not run)
## Not run:
CensusObj2 <- try(get_census_data(states = c("NY", "DC", "NJ"), age = TRUE, sex = TRUE))
try(predict_race(voter.file = voters, census.geo = "tract", census.data = CensusObj2, age = TRUE, sex = TRUE))
## End(Not run)
## Not run:
CensusObj3 <- try(get_census_data(states = c("NY", "DC", "NJ"), census.geo = "place"))
try(predict_race(voter.file = voters, census.geo = "place", census.data = CensusObj3))
## End(Not run)
Dataset including FIPS codes and postal abbreviations for each U.S. state, district, and territory.
state_fips
A tibble with 57 rows and 3 columns:
state
Two-letter postal abbreviation
state_code
Two-digit FIPS code
state_name
English name
Derived from tidycensus::fips_codes()
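A table with these three columns is typically used to translate postal abbreviations into FIPS codes. The sketch below shows that lookup with base R match(). To stay self-contained it uses a three-row stand-in typed by hand rather than the packaged dataset, and the voter_states vector is hypothetical.

```r
# Hand-typed stand-in with the same column names as state_fips.
fips_subset <- data.frame(
  state      = c("DC", "NJ", "NY"),
  state_code = c("11", "34", "36"),
  state_name = c("District of Columbia", "New Jersey", "New York"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of state abbreviations from a voter file.
voter_states <- c("NY", "NJ", "NY", "DC")

# match() returns the row of the first matching abbreviation, which is
# then used to index the FIPS code column.
voter_codes <- fips_subset$state_code[match(voter_states, fips_subset$state)]

voter_codes
```

Keeping the codes as character strings preserves leading zeros for states whose FIPS code is below 10.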
Census Surname List from 2000 with race/ethnicity probabilities by surname.
surnames2000
A data frame with 157,728 rows and 6 variables:
Surname
Pr(White | Surname)
Pr(Black | Surname)
Pr(Hispanic/Latino | Surname)
Pr(Asian/Pacific Islander | Surname)
Pr(Other | Surname)
data(surnames2000)
Census Surname List from 2010 with race/ethnicity probabilities by surname.
surnames2010
A data frame with 167,613 rows and 6 variables:
Surname
Pr(White | Surname)
Pr(Black | Surname)
Pr(Hispanic/Latino | Surname)
Pr(Asian/Pacific Islander | Surname)
Pr(Other | Surname)
data(surnames2010)
An example dataset containing voter file information.
voters
A data frame with 10 rows and 15 variables:
Voter identifier (numeric)
Surname
State of residence
Congressional district
Census county (three-digit code)
First name
Last name or surname
Census tract (six-digit code)
Census block (four-digit code)
Voting precinct
Voting place
Age in years
0=male, 1=female
Party registration (character)
Party registration (numeric)
data(voters)
Checks whether the name data is available in the current working directory and, if it is not, downloads it from GitHub using piggyback. By default, wru downloads the data to a temporary directory that lasts only as long as your session. To keep the downloaded data beyond the session, set the wru_data_wd option so that it is saved to your current working directory instead.
wru_data_preflight()
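Setting the wru_data_wd option mentioned above might look like the following. The sketch assumes the option is a logical flag, with TRUE requesting the working directory. That value is an assumption, since the text names the option but not what to set it to.

```r
# Assumption: wru_data_wd is a logical flag; TRUE asks for downloads to be
# kept in the current working directory instead of a temporary directory.
old_opts <- options(wru_data_wd = TRUE)

# Read the option back, falling back to FALSE when it is unset.
save_to_wd <- getOption("wru_data_wd", default = FALSE)
save_to_wd

# Restore the previous setting when it is no longer needed.
options(old_opts)
```

Placing the options() call in a project-level .Rprofile would apply the setting to every session started in that project.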