Project: Credit Card Approval Analysis

Names: Jingzhi Yang

https://yship1002.github.io

  • Motivation
  • Project Goal
  • Background
  • Dataset
    1. Exploratory Data Analysis
    2. Dataset 1: Credit Card Data from book "Econometric Analysis"
    3. Dataset 2: Cleaned Credit approval dataset from UCI
    4. Summary of EDA
  • Building Prediction Model
    1. KNN
    2. Logistic Regression
    3. Summary of Model Building Section
  • Oaxaca-Blinder-Kitagawa Decomposition
  • Conclusion
    Motivation

    Credit cards are a great financial tool for everybody. They can earn lucrative rewards, build a credit history, and increase purchasing power, among other benefits. Unfortunately, many people have experienced discrimination in the credit card application process. Credit consumers are protected under the Equal Credit Opportunity Act (ECOA), and federal agencies have an obligation to make sure creditors adhere to it. More specifically, the ECOA makes it illegal for creditors to discriminate based on race, color, religion, national origin, sex, marital status, or age, as well as other factors defined in the ECOA. Even though the ECOA is designed to protect credit card consumers, it is often difficult for applicants to know whether they have been discriminated against. For example, under the ECOA credit card applicants are entitled to know the specific reasons why their applications are denied. In practice, when applicants ask why they were denied a credit card, the reasons given are usually vague, such as too many open accounts or income too low. Since creditors do not release their credit decision criteria, it is practically impossible for individual applicants to know whether they have been discriminated against. Therefore, it is necessary to compare application results across many applicants using data science. In this project I will focus on two types of discrimination: gender discrimination and racial discrimination.

    Project Goal

    This project is meant to answer the following questions:

    Question 1: Do gender and race play an important role in the credit card approval decision?

    Question 2: Is there gender or racial discrimination in the credit approval decision? If so, how much?

    To answer these two questions I will first build KNN and logistic regression models to see if gender and race are important in the credit card approval decision. Then I will use the Oaxaca-Blinder-Kitagawa decomposition, a regression analysis I learned in my Economics of Discrimination class, to check whether gender or racial discrimination exists and to quantify it as a percentage.

    Background

    Oaxaca-Blinder-Kitagawa Decomposition

    The Oaxaca-Blinder-Kitagawa decomposition is widely used by quantitative social scientists to quantify discrimination between two groups of people. The basic idea is to control for as many independent variables as possible to make the two groups 'identical' and see if there is still an unexplained difference in the dependent variable. This method was first proposed to study the gender wage gap. I will use this example to illustrate the method.

    Let us say that wage (W) depends on education (ED), experience (EXP) and whether the worker is unionized (UNION). We can fit two linear regression models to estimate wages, one for men and one for women:

    $W_m=a_m+b_m{ED}_m+c_mEXP_m+d_mUNION_m$

    $W_f=a_f+b_f{ED}_f+c_fEXP_f+d_fUNION_f$

    Here the subscript m denotes male and f denotes female.

    If we substitute the group averages $\overline{ED}_m, \overline{EXP}_m, \overline{UNION}_m,\overline{ED}_f, \overline{EXP}_f, \overline{UNION}_f$ into the male and female models, then the raw gender wage gap $W_m-W_f$ is the following:

    $W_m-W_f=a_m+b_m\overline{ED}_m+c_m\overline{EXP}_m+d_m\overline{UNION}_m-$
    $\hspace{3cm}(a_f+b_f\overline{ED}_f+c_f\overline{EXP}_f+d_f\overline{UNION}_f)$

    To isolate the explained and unexplained parts of the gender wage gap we use a simple math trick:
    add and subtract $b_m\overline{ED}_f,c_m\overline{EXP}_f, d_m\overline{UNION}_f$ on the right hand side of the equation

    $W_m-W_f=b_m(\overline{ED}_m-\overline{ED}_f)+c_m(\overline{EXP}_m-\overline{EXP}_f)+d_m(\overline{UNION}_m-\overline{UNION}_f)+$
    $\hspace{3cm}a_m-a_f+(b_m-b_f)\overline{ED}_f+(c_m-c_f)\overline{EXP}_f+(d_m-d_f)\overline{UNION}_f$

    The sum of the first three terms on the right hand side is the explained part of the wage gap, due to gender differences in education, experience and union membership. The sum of the last four terms is the unexplained part of the wage gap, which might be discrimination.
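
    As a quick numeric check of the identity above, the sketch below plugs made-up coefficients and group means into both sides. All numbers are hypothetical and chosen only for illustration; the variable names are mine, not part of the original analysis.

```python
# Hypothetical coefficients and group means; none of these numbers come from real data.
coef_m = {"const": 5.0, "ED": 1.2, "EXP": 0.5, "UNION": 2.0}
coef_f = {"const": 4.0, "ED": 1.0, "EXP": 0.4, "UNION": 1.5}
mean_m = {"ED": 14.0, "EXP": 10.0, "UNION": 0.30}
mean_f = {"ED": 13.0, "EXP": 8.0, "UNION": 0.20}


def predict(coef, means):
    """Average wage implied by a linear model evaluated at the group means."""
    return coef["const"] + sum(coef[k] * means[k] for k in means)


raw_gap = predict(coef_m, mean_m) - predict(coef_f, mean_f)

# Explained part: differences in average characteristics, valued at male coefficients.
explained = sum(coef_m[k] * (mean_m[k] - mean_f[k]) for k in mean_m)

# Unexplained part: differences in intercepts and coefficients, evaluated at female means.
unexplained = (coef_m["const"] - coef_f["const"]) + sum(
    (coef_m[k] - coef_f[k]) * mean_f[k] for k in mean_f
)

print(f"raw gap     = {raw_gap:.2f}")
print(f"explained   = {explained:.2f}")
print(f"unexplained = {unexplained:.2f}")
print(f"unexplained share = {unexplained / raw_gap:.1%}")
```

    The explained and unexplained parts always add up to the raw gap, whatever numbers are used; only the split between them depends on the data.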

    Datasets

    The following two datasets contain all the information I need to build prediction models and quantify discrimination in credit card applications. Below I describe each dataset and how I am going to use it.

    Dataset 1: Credit Card Data from book "Econometric Analysis"

    This dataset accompanies the book 'Econometric Analysis' by William Greene. I am using it because it is very clean.

    Dataset 2: Cleaned Credit approval dataset from UCI

    This Kaggle dataset is originally from the UCI Machine Learning Repository. Kaggle user SAMUEL CORTINHAS has cleaned the original UCI data by filling in missing values and inferring feature names from the raw dataset. This step is necessary to get more context and make the dataset easier to use. I will use this cleaned version as the starting point of my analysis.

    Collaboration Plan

    There is no collaboration plan because Jingzhi Yang is the only person working on this project.

    Exploratory Data Analysis

    Dataset 1: Credit Card Data from book 'Econometric Analysis'

    The first step is to load the datasets.

    First, the income here is expressed in units of 10000, so we need to rescale it. Second, although pandas infers the data types correctly, we are better off mapping the 'card' column to 1 or 0 to make model training easier.
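
    The two preprocessing steps can be sketched as follows. This is a minimal illustration in plain Python (the standard `csv` module rather than pandas, so it runs without dependencies). The sample rows are made up, and the `yes`/`no` coding of the `card` column is an assumption.

```python
import csv
import io

# Made-up sample rows. The column names `card`, `income`, `age` and the
# yes/no coding of `card` are assumptions for illustration.
raw = io.StringIO(
    "card,income,age\n"
    "yes,4.52,37.7\n"
    "no,2.42,33.2\n"
    "yes,9.79,40.8\n"
)

rows = []
for row in csv.DictReader(raw):
    rows.append({
        "card": 1 if row["card"] == "yes" else 0,  # approval as 1/0
        "income": float(row["income"]) * 10_000,   # undo the per-10000 scaling
        "age": float(row["age"]),
    })

for r in rows:
    print(r)
```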

    What is the distribution of the number of defaults in this dataset?

    It seems that the majority of people have two or fewer defaults in their history.

    How does default rate affect credit card applications?

    As expected, the more defaults you have, the less likely you are to get approved. In fact, if you have more than two defaults you have less than a 20% chance of getting approved.
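
    A grouped approval rate like the one discussed above can be computed as in the sketch below. The records are hypothetical (number of defaults, approved) pairs, not values from the dataset, and `approval_rate_by` is a helper name introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical (number of defaults, approved) pairs for illustration only.
records = [(0, 1), (0, 1), (0, 0), (1, 1), (1, 0), (3, 0), (3, 0), (4, 0)]


def approval_rate_by(records):
    """Return {group: share approved} for an iterable of (group, approved) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approved count, total count]
    for group, approved in records:
        counts[group][0] += approved
        counts[group][1] += 1
    return {g: a / n for g, (a, n) in sorted(counts.items())}


rates = approval_rate_by(records)
for defaults, rate in rates.items():
    print(f"{defaults} defaults: {rate:.0%} approved")
```

    The same helper applies to any categorical column, such as job type, ethnicity, or gender.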

    What is the income distribution of this dataset?

    The majority of people earned less than 60000 a year.

    How does income affect credit card applications?

    As expected, the more income you have, the more likely you will get approved

    What is the age distribution of this dataset?

    Most credit card applicants are in their mid 30s

    How does age affect credit card applications?

    Surprisingly, age doesn't affect applications much. People often assume that the older you are, the more financially stable you are, but that doesn't appear to be the case here.

    Dataset 2: Cleaned Credit approval dataset from UCI

    The first step is to load the dataset and check the data types.

    It looks like pandas infers the data types correctly.

    What is the job distribution of applicants?

    It looks like a lot of people work in the energy sector.

    How do job types affect credit card applications?

    Interestingly, people working in the utilities industry have the highest average approval rate, probably because utility jobs are financially stable.

    What is the ethnicity distribution of applicants?

    The majority of applicants are white, followed by black applicants.

    What are the approval rates for black and non-black applicants?

    Non-black applicants have a much higher approval rate than black applicants.

    What is the gender distribution of applicants?

    We can see that there are about twice as many male applicants as female applicants.

    What are the approval rates for male and female applicants?

    Male applicants have a higher approval rate than female applicants.

    Summary of EDA

    From the EDA we know that the approval rate for non-black applicants is a lot higher than that for black applicants, and the approval rate for male applicants is slightly higher than that for female applicants. These preliminary results don't necessarily show that discrimination exists, because we didn't control for other factors such as credit score. One way to explore whether race and gender matter in credit card decisions is to build a prediction model with and without race and gender and see whether accuracy changes, which is covered in the next section.

    Building Models

    In this section, I will build two prediction models, KNN and logistic regression, to see if gender and ethnicity play an important role in the credit card approval decision. I chose these two because KNN is a non-linear classifier and logistic regression is a linear classifier; comparing a linear and a non-linear classifier adds credibility to the results.

    Part1: k-Nearest Neighbors(KNN)

    We first use a KNN model with all the features to predict the application result, then fit one KNN model without gender and another without ethnicity. We want to see if gender and race play an important role in the credit card decision.

    Now let us remove Gender to see if it affects accuracy. It seems that removing gender doesn't affect the accuracy.

    Now let us remove Ethnicity to see if it affects accuracy. It seems that removing ethnicity decreases the accuracy.
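
    The feature-removal comparison can be illustrated with a small from-scratch KNN. This is a sketch under stated assumptions, not the code used for the results above: the toy data is made up so that the label depends only on the first column (`score`), the second column (`gender`) is uninformative, and accuracy is measured leave-one-out instead of on a held-out test set.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy: predict each row from all the other rows."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


def drop_column(X, j):
    """Return a copy of X without column j."""
    return [row[:j] + row[j + 1:] for row in X]


# Hypothetical toy data: columns are [score, gender]; approval depends on score only.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [6, 0], [7, 1], [8, 0], [9, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

acc_full = loo_accuracy(X, y)
acc_no_gender = loo_accuracy(drop_column(X, 1), y)
acc_no_score = loo_accuracy(drop_column(X, 0), y)

print(f"all features:   {acc_full:.2f}")
print(f"without gender: {acc_no_gender:.2f}")
print(f"without score:  {acc_no_score:.2f}")
```

    On this toy data, dropping the uninformative column leaves accuracy unchanged, while dropping the informative column lowers it. That is the pattern the comparison above looks for.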

    Part2: Logistic Regression

    Next we use a logistic regression model to predict the application result. As with the KNN model, we start with logistic regression on the full set of features, then fit one model without gender and another without race. We want to see if gender and race play an important role in the credit card decision.

    Now let us remove gender to see if it affects accuracy. It seems that removing gender doesn't affect the accuracy, which agrees with what we found with the KNN model.

    Now let us remove Ethnicity to see if it affects accuracy. It seems that removing ethnicity decreases the accuracy, which agrees with what we found with the KNN model.
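
    The same comparison can be sketched for logistic regression. The code below is a minimal from-scratch version trained with batch gradient descent, not the implementation used for the results above. The toy data is hypothetical and deliberately constructed so that the label equals the `group` column, and accuracy is measured on the training data.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Prediction error p - y for every row under the current parameters.
        errors = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        for j in range(d):
            w[j] -= lr * sum(e * row[j] for e, row in zip(errors, X)) / n
        b -= lr * sum(errors) / n
    return w, b


def accuracy(X, y, w, b):
    """Share of rows classified correctly with a 0.5 probability threshold."""
    preds = [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5 else 0
        for row in X
    ]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def drop_column(X, j):
    """Return a copy of X without column j."""
    return [row[:j] + row[j + 1:] for row in X]


# Hypothetical toy data: columns are [score, gender, group]; the label equals `group`.
X = [list(r) for r in product([-1.0, 1.0], [0, 1], [0, 1])]
y = [row[2] for row in X]


def train_accuracy(features):
    w, b = fit_logistic(features, y)
    return accuracy(features, y, w, b)


acc_full = train_accuracy(X)
acc_no_gender = train_accuracy(drop_column(X, 1))
acc_no_group = train_accuracy(drop_column(X, 2))

print(f"all features:   {acc_full:.2f}")
print(f"without gender: {acc_no_gender:.2f}")
print(f"without group:  {acc_no_group:.2f}")
```

    Removing the column the label depends on drops accuracy to chance level here, while removing the unrelated column changes nothing. With real data the differences are smaller and should be measured on held-out data.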

    Summary of Model Building Section:

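    The results from the KNN and logistic regression models agree with each other: removing gender leaves accuracy unchanged, while removing ethnicity lowers it. This suggests that gender doesn't play an important role in the credit card decision but ethnicity does. We can verify these findings using the Oaxaca-Blinder-Kitagawa decomposition in the following section.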

    Oaxaca-Blinder-Kitagawa Decomposition

    We will use the Oaxaca-Blinder implementation in statsmodels to perform the analysis. If you forgot what the Oaxaca-Blinder-Kitagawa decomposition is (I know you didn't finish reading all that boring math), you can go back to the Background section to review.

    It is important that you understand what the explained and unexplained parts of the gap are.
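
    To make the explained and unexplained parts concrete, here is a minimal sketch of a two-fold decomposition on a binary approval outcome with a single control variable. It follows the Background formula and uses group A's coefficients as the reference; library implementations may use a different reference (for example pooled coefficients), so their numbers can differ. The data and function names are hypothetical.

```python
def ols_line(x, y):
    """Closed-form simple OLS fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def two_fold(x_a, y_a, x_b, y_b):
    """Split mean(y_a) - mean(y_b) into explained and unexplained parts."""
    a_a, b_a = ols_line(x_a, y_a)
    a_b, b_b = ols_line(x_b, y_b)
    mean_xa = sum(x_a) / len(x_a)
    mean_xb = sum(x_b) / len(x_b)
    gap = sum(y_a) / len(y_a) - sum(y_b) / len(y_b)
    explained = b_a * (mean_xa - mean_xb)             # different characteristics
    unexplained = (a_a - a_b) + (b_a - b_b) * mean_xb  # different treatment
    return gap, explained, unexplained


# Hypothetical data: x is a credit-score-like control, y is 1 if approved.
x_a, y_a = [1, 2, 3, 4], [0, 1, 1, 1]
x_b, y_b = [0, 1, 2, 3], [0, 0, 0, 1]

gap, explained, unexplained = two_fold(x_a, y_a, x_b, y_b)
print(f"raw gap:           {gap:.3f}")
print(f"explained:         {explained:.3f}")
print(f"unexplained:       {unexplained:.3f}")
print(f"unexplained share: {unexplained / gap:.1%}")
```

    The unexplained share is the figure reported as a percentage of the raw gap in the results below.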

    Is there gender discrimination in credit card application?

    The raw gender gap is only 3.13%, which means that the approval rate for men is only 3.13 percentage points higher than for women, which is good news. Controlling for debt, income, ethnicity, employment status, previous bank relationship, credit score and default rate, we found that the unexplained part, which may reflect gender discrimination, accounts for only 19.59% of that 3.13% gap. This suggests that gender discrimination is minimal.

    Is there racial discrimination against black applicants in credit card application?

    The raw racial gap is 23.19%, which means that the approval rate for non-black applicants is 23.19 percentage points higher than for black applicants. Controlling for gender, debt, income, ethnicity, employment status, previous bank relationship, credit score and default rate, we found that the unexplained part, which may reflect racial discrimination, accounts for 31.68% of that 23.19% gap. This suggests that racial discrimination exists and contributes significantly to the racial gap in credit card approval decisions.

    Conclusion

    This project suggests that gender discrimination is minimal in credit card approval decisions, while racial discrimination is significant. Many other types of discrimination can be studied in a similar way. For example, in the EDA section we found that job type can affect the approval rate significantly. A possible extension of this project is to explore job type discrimination in credit card approval decisions.

    This study has its limitations. One is that the analysis is based on two small datasets (around 400 observations). A future improvement would be to find a bigger dataset with more observations.

    If you are interested in learning more about racial discrimination in the credit card industry, check out the paper 'Racism in the Credit Card Industry' by Andrea Freeman.