Machine Learning Project Walkthrough: Preparing the features


1: Recap

In the past mission, you removed all of the columns that contained redundant information, weren't useful for modeling, required too much processing to make useful, or leaked information from the future. We exported the Dataframe from the end of that mission to a CSV file named filtered_loans_2007.csv to differentiate it from the loans_2007.csv file we used previously. In this mission, we'll prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

This is because the mathematics underlying most machine learning models assumes the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will raise an error if you try to train a model such as linear regression or logistic regression on data containing missing or non-numeric values.
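For instance, here's a minimal sketch (using a toy array, not the loans Dataframe) of scikit-learn rejecting input that contains a missing value:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix with one missing value, for illustration only.
X = np.array([[1.0, 2.0], [np.nan, 3.0]])
y = np.array([0, 1])

try:
    LogisticRegression().fit(X, y)
except ValueError as err:
    print(err)  # scikit-learn refuses to fit because X contains NaN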

Let's start by computing the number of missing values and come up with a strategy for handling them. Then, we'll focus on the categorical columns.

We can return the number of missing values across the Dataframe by:

  • first using the Pandas Dataframe method isnull to return a Dataframe containing Boolean values:
    • True if the original value is null,
    • False if the original value isn't null.
  • then using the Pandas Dataframe method sum to calculate the number of null values in each column.

null_counts = df.isnull().sum()

Instructions

  • Read in filtered_loans_2007.csv as a Dataframe and assign it to loans.
  • Use the isnull and sum methods to return the number of null values in each column. Assign the resulting Series object to null_counts.
  • Use the print function to display null_counts.

import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)

2: Handling Missing Values

While most of the columns have 0 missing values, 2 columns have 50 or fewer rows with missing values, and 1 column, pub_rec_bankruptcies, contains 697 rows with missing values. Let's remove entirely any column where more than 1% of its rows contain a null value. In addition, we'll remove the remaining rows containing null values.

This means that we'll keep the following columns and just remove rows containing missing values for them:

  • title
  • revol_util
  • last_credit_pull_d

and drop the pub_rec_bankruptcies column entirely since more than 1% of the rows have a missing value for this column.

Let's use the strategy of removing the pub_rec_bankruptcies column first, then removing all rows containing any missing values at all, to cover both of these cases. This way, we only remove rows containing missing values in the title, revol_util, and last_credit_pull_d columns, and don't lose rows just because pub_rec_bankruptcies was missing.
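As a quick sketch (assuming loans is already read in), you can confirm which columns cross the 1% threshold before committing to this strategy:

# Fraction of missing values per column; only pub_rec_bankruptcies
# should exceed the 1% cutoff described above.
missing_frac = loans.isnull().sum() / len(loans)
print(missing_frac[missing_frac > 0.01])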

Instructions

  • Use the drop method to remove the pub_rec_bankruptcies column from loans.
  • Use the dropna method to remove all rows from loans containing any missing values.
  • Use the dtypes attribute followed by the value_counts() method to return the counts for each column data type. Use the print function to display these counts.


loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())

3: Text Columns

While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new Dataframe containing just the object columns so we can explore them in more depth. You can use the Dataframe method select_dtypes to select only the columns of a certain data type:

float_df = df.select_dtypes(include=['float'])

Let's select just the object columns, then display a sample row to get a better sense of how the values in each column are formatted.

Instructions

  • Use the Dataframe method select_dtypes to select only the columns of object type from loans and assign the resulting Dataframe to object_columns_df.
  • Display the first row in object_columns_df using the print function.


object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

4: Converting Text Columns

Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

  • home_ownership: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
  • verification_status: indicates if income was verified by Lending Club,
  • emp_length: number of years the borrower was employed upon time of application,
  • term: number of payments on the loan, either 36 or 60,
  • addr_state: borrower's state of residence,
  • purpose: a category provided by the borrower for the loan request,
  • title: loan title provided by the borrower,

There are also some columns that represent numeric values and need to be converted:

  • int_rate: interest rate of the loan in %,
  • revol_util: revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit.

Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

  • earliest_cr_line: The month the borrower's earliest reported credit line was opened,
  • last_credit_pull_d: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.
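For reference, here's a hypothetical sketch of the kind of feature engineering these columns would require; the date format and the reference date below are assumptions for illustration, not part of this mission:

# Hypothetical: parse earliest_cr_line (assuming values like "Jan-1985")
# and derive the length of credit history in months.
earliest = pd.to_datetime(loans["earliest_cr_line"], format="%b-%Y")
reference_date = pd.Timestamp("2011-12-01")  # assumed cutoff date
credit_history_months = (reference_date - earliest).dt.days / 30.44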

5: First 5 Categorical Columns

Let's explore the unique value counts of the columns that seem like they contain categorical values.

Instructions

  • Display the unique value counts for the home_ownership, verification_status, emp_length, term, and addr_state columns:
    • Store these column names in a list named cols.
    • Use a for loop to iterate over cols:
      • Use the print function combined with the Series method value_counts to display each column's unique value counts.

cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())

6: The Reason For The Loan

The home_ownership, verification_status, emp_length, term, and addr_state columns all contain multiple discrete values. We should clean the emp_length column and treat it as a numerical one, since the values have ordering (2 years of employment is less than 8 years).

First, let's look at the unique value counts for the purpose and title columns to understand which column we want to keep.

Instructions

  • Use the value_counts method and the print function to display the unique values in the following columns:
    • purpose
    • title


print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

7: Categorical Columns

The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We'll keep all of them: emp_length will be cleaned into a numerical column (using the mapping below), and the others will be encoded as dummy variables.

It seems like the purpose and title columns do contain overlapping information, but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues, since many of the values are repeated with slight modifications (e.g. Debt Consolidation, Debt Consolidation Loan, and debt consolidation).

We can use the following mapping to clean the emp_length column:

  • "10+ years": 10
  • "9 years": 9
  • "8 years": 8
  • "7 years": 7
  • "6 years": 6
  • "5 years": 5
  • "4 years": 4
  • "3 years": 3
  • "2 years": 2
  • "1 year": 1
  • "< 1 year": 0
  • "n/a": 0

We erred on the side of being conservative with the 10+ years, < 1 year, and n/a mappings. We assume that people who may have been working more than 10 years have only really worked for 10 years. We also assume that people who've worked less than a year, or whose employment information is not available, have worked for 0 years. This is a general heuristic, and it's not perfect.

Lastly, the addr_state column contains many discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.
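A one-line sketch to verify that claim before dropping the column:

print(loans["addr_state"].nunique())  # each unique state would become its own dummy column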

Instructions

  • Remove the last_credit_pull_d, addr_state, title, and earliest_cr_line columns from loans.
  • Convert the int_rate and revol_util columns to float columns by:
    • Using the str accessor followed by the rstrip string method to strip the trailing percent sign (%):
      • loans['int_rate'].str.rstrip('%') returns a new Series with % stripped from the right side of each value.
    • On the resulting Series object, use the astype method to convert to the float type.
    • Assign the new Series of float values back to the respective columns in the Dataframe.
  • Use the replace method to clean the emp_length column.

mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
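As a quick sanity check (a sketch, not part of the instructions), confirm that the converted columns are now numeric:

# int_rate and revol_util should be float64; emp_length should be integer.
print(loans[["int_rate", "revol_util", "emp_length"]].dtypes)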


8: Dummy Variables

Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model. We first need to use the Pandas get_dummies function to return a new Dataframe containing a new column for each dummy variable:

# Returns a new Dataframe containing 1 column for each dummy variable.
dummy_df = pd.get_dummies(loans[["term", "verification_status"]])

We can then use the concat function to add these dummy columns back to the original Dataframe:

loans = pd.concat([loans, dummy_df], axis=1)

and then drop the original columns entirely using the drop method:

loans = loans.drop(["verification_status", "term"], axis=1)

Instructions

  • Encode the home_ownership, verification_status, purpose, and term columns as dummy variables:
    • Use the get_dummies function to return a Dataframe containing the dummy columns.
    • Use the concat function to add these dummy columns back to loans.
    • Remove the original, non-dummy columns (home_ownership, verification_status, purpose, and term) from loans.


cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
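One caveat, depending on your pandas version: newer releases of get_dummies return boolean dummy columns instead of 0/1 integers. scikit-learn accepts booleans, but if you prefer integer dummies, a cast like this sketch works:

# Optional: convert boolean dummy columns to 0/1 integers.
dummy_cols = dummy_df.columns
loans[dummy_cols] = loans[dummy_cols].astype(int)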

9: Next Steps

In this mission, we performed the last of the data preparation necessary to start training machine learning models. We converted all of the columns to numerical values because those are the only type of value scikit-learn can work with. In the next mission, we'll experiment with training models and evaluating their accuracy using cross-validation.
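A final sanity check (a sketch) before moving on: every remaining column should be numeric, with no nulls left:

print(loans.dtypes.value_counts())  # no object columns should remain
print(loans.isnull().sum().sum())   # should print 0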
