I hope you enjoyed reading my last post on The Foundation of Machine Learning.
Today, I would like to walk you through the data preprocessing aspect of machine learning, which sits at the core of any ML workflow.
By the way, what is Data Preprocessing?
Data scientists across the world have endeavored to give meaning to data preprocessing. Simply put, data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
How is this done? Just as medical professionals prep a patient for surgery, data preprocessing prepares raw data for further processing. Below are the steps taken in data preprocessing:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: using multiple databases, data cubes, or files.
- Data transformation: normalization and aggregation.
- Data reduction: reducing the volume but producing the same or similar analytical results.
- Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
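To make these steps concrete, here is a minimal sketch using a tiny, made-up DataFrame (the column names and values are purely illustrative, not from the post's dataset). It shows cleaning (filling a missing value), transformation (min-max normalization), and discretization (binning a numeric column into nominal groups):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing value
df = pd.DataFrame({"age": [25.0, np.nan, 30.0, 28.0],
                   "salary": [50000, 52000, 51000, 90000]})

# Data cleaning: fill the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Data transformation: min-max normalization of salary into [0, 1]
df["salary_norm"] = (df["salary"] - df["salary"].min()) / \
                    (df["salary"].max() - df["salary"].min())

# Data discretization: replace the numeric age with nominal bins
df["age_group"] = pd.cut(df["age"], bins=[0, 27, 100],
                         labels=["young", "senior"])

print(df)
```

Each of these operations will come up again in later nuggets; for now the point is simply that every step in the list above maps to a small, mechanical transformation of the raw table.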
It’s time we take some practical steps towards understanding how Data Preprocessing is done.
Step 1. Data Collection
Here we have a dataset that contains information about IT professionals, such as their country, age, salary, and gender, as displayed below:
Feel free to create a replica of this dataset, or you can download the exact dataset here
You may have observed that the dataset above contains some empty cells; this is deliberate. You will see how this plays out soon, as our subsequent nuggets will focus on working with missing data.
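If you want to build a replica yourself, a sketch like the following will do. Note that the specific countries, ages, and salaries below are invented stand-ins (the original dataset is not reproduced in this post), but the column layout and the deliberately missing cells match the description:

```python
import numpy as np
import pandas as pd

# Hypothetical replica of the "Ruby" dataset: country, age, salary, gender,
# with a couple of deliberately missing cells (np.nan).
data = {
    "Country": ["Nigeria", "Ghana", "Kenya", "Nigeria", "Ghana"],
    "Age":     [34.0, 27.0, np.nan, 45.0, 30.0],
    "Salary":  [72000.0, np.nan, 54000.0, 98000.0, 61000.0],
    "Gender":  ["Male", "Female", "Female", "Male", "Female"],
}

# Write it out as Ruby.csv so it can be loaded the same way as in the post
pd.DataFrame(data).to_csv("Ruby.csv", index=False)
```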
Step 2. Importing the Libraries
Now let us import our libraries, i.e. the precompiled routines and resources a programmer needs to get certain jobs done, and commence preprocessing.
There are three main libraries we will explore in this module: numpy, matplotlib.pyplot, and pandas. The numpy library contains mathematical tools, so we can use it to include any kind of mathematics in our code; matplotlib.pyplot is used to plot intuitive graphs; and pandas is used to import and manage datasets.
Here’s how we import libraries:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
PS: Note the alias given to each library; it will prove useful later in your code for ease of reference.
Step 3. Importing the Dataset
Now that we have imported the libraries, we need to get our dataset. On my local PC, I have named my dataset "Ruby" and it is in .csv format. So we do:
Ruby = pd.read_csv("Ruby.csv")
Once the dataset has been imported, our variable explorer environment looks like this:
Congratulations!! We have successfully loaded our dataset into our test environment.
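If you are not using an IDE with a variable explorer, you can confirm the import from the console instead. The snippet below writes a tiny stand-in version of Ruby.csv first (with made-up values) purely so it runs anywhere; in practice you would just call the two inspection lines on your own loaded dataset:

```python
import numpy as np
import pandas as pd

# Tiny stand-in file so this snippet is self-contained (values are invented)
pd.DataFrame({"Country": ["Nigeria", "Ghana"],
              "Age": [34.0, np.nan],
              "Salary": [72000.0, 61000.0],
              "Gender": ["Male", "Female"]}).to_csv("Ruby.csv", index=False)

Ruby = pd.read_csv("Ruby.csv")
print(Ruby.head())        # first rows of the dataset
print(Ruby.isna().sum())  # count of missing cells in each column
```

`head()` shows the first few rows, and `isna().sum()` immediately reveals which columns contain the empty cells we flagged earlier.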
Step 4. Setting the Datasets into Dependent and Independent Variables
Okay, so the next task we need to accomplish is to determine what our dependent (y) and independent (x) variables should be. Let's go back to high school, where we were taught that independent variables are the variables that are tweaked or manipulated to produce the outcome/value of a dependent variable.
So from our dataset above, we can conclude that the nationality, age, and salary variables are our independent variables, while our dependent variable is the gender variable, because our aim is to determine the gender of, say, IT professionals in Silicon Valley based on their salary, nationality, and age. Therefore, we set our variables thus:
#setting the dependent and independent variable
x = Ruby.iloc[: , :-1].values
y = Ruby.iloc[: , 3].values
PS: The column indices are significant: in Python, indexing starts from 0.
For the x variable, the first ":" means we select all the rows, while the second ":" refers to the columns, and the -1 after it means we take every column except the last one in the dataset. For the y variable, the 3 means we pick only the column at index 3, which is the fourth column (gender).
Our IPYTHON console environment should look like this:
We can see from the image above how python has set out our variables into x and y respectively.
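You can verify the slicing behavior yourself on a small stand-in DataFrame with the same four-column layout (the values below are invented for illustration):

```python
import pandas as pd

# Tiny stand-in with the same column layout: country, age, salary, gender
Ruby = pd.DataFrame({"Country": ["Nigeria", "Ghana"],
                     "Age": [34, 27],
                     "Salary": [72000, 61000],
                     "Gender": ["Male", "Female"]})

x = Ruby.iloc[:, :-1].values  # all rows, every column except the last
y = Ruby.iloc[:, 3].values    # all rows, the column at index 3 (Gender)

print(x.shape)  # (2, 3): two rows, three feature columns
print(y)        # ['Male' 'Female']
```

Note that `.values` converts the pandas selection into a plain numpy array, which is the format most machine learning libraries expect.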
With that, we have successfully reached the end of this week's nugget of machine learning. We have learnt how to import libraries (which contain our routines and resources), load our dataset from a local source, and set our dependent and independent variables.
Don't forget we still have some missing data in our dataset, and it needs to be resolved. Watch out for the next nugget, where we will learn how to take care of the missing values in our dataset. Until then, continue to breathe and personify learning!
Writer: Raji Adam Bifola (MCP,MSCA ). Data Scientist/BI Analyst at Techspecialist Consulting Limited.